"sql" entries

Dynamic columns in MariaDB

Add columns to a table on the fly without altering its schema.

MariaDB and similar SQL database systems offer a variety of data types for storing data in columns within tables. When creating or altering a table’s schema, it’s good to know what kind of data will be stored in each column. If you know that a column will contain numbers, use a numeric data type like INT, not VARCHAR. Using the appropriate data type for a column generally gives you better control of the data and possibly better performance.

But sometimes you can’t predict what type of data might be entered into a column. In such a situation, you might use a VARCHAR column set to 255 characters wide, or maybe TEXT if plenty of data might be entered. There is, however, a very cool and fairly new alternative: you can create a table to which you add columns on the fly, without altering the table’s schema. That may sound absurd, but it’s possible in MariaDB with dynamic columns.

Dynamic columns are basically columns within a column. If you know programming well, they’re like a hash within an array. That may sound confusing, but it will make more sense when you see it in action. To illustrate this, I’ll pull some ideas from my new book, Learning MySQL and MariaDB (O’Reilly 2015). All of the examples in my book and this article are based on a database for bird-watchers.
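
To give a feel for the mechanics before the full walkthrough, here is a minimal sketch using MariaDB’s dynamic-column functions (COLUMN_CREATE, COLUMN_ADD, and COLUMN_GET). The table and column names are hypothetical, not the schema from the book:

-- Dynamic columns live inside a single BLOB column
CREATE TABLE bird_notes (
  bird_id INT PRIMARY KEY,
  details BLOB
);

INSERT INTO bird_notes (bird_id, details)
VALUES (1, COLUMN_CREATE('wingspan_cm', 58, 'habitat', 'wetlands'));

-- Add another "column" later, no ALTER TABLE needed
UPDATE bird_notes
   SET details = COLUMN_ADD(details, 'migratory', 'yes')
 WHERE bird_id = 1;

-- Pull an individual dynamic column back out, with an explicit type
SELECT bird_id,
       COLUMN_GET(details, 'habitat' AS CHAR) AS habitat
  FROM bird_notes;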

Read more…

More than enough Arel

When ActiveRecord just isn’t enough

In Just Enough Arel, we explored how the Arel library transforms our Ruby code into SQL to be executed by the database. Along the way, we discovered that Arel abstracts database tables and the fields therein as objects, which in turn receive messages not normally available in ActiveRecord queries. Wrapping up the article, we also looked at arguments for using Arel over falling back to SQL.

As alluded to at the end of the previous article, Arel can do much more than merely provide a handful of comparison operators. In this post, we’ll look at how we can call native database functions and construct unions and intersects, and we’ll wrap things up by explicitly building joins with Arel.

Read more…

Just Enough Arel

Replacing hand-coded SQL with object oriented programming

If you were a web developer prior to ActiveRecord, you probably remember rolling your own SQL and being specific about which fields you retrieved, writing multiple queries to handle “upserts” (update or insert), and getting frustrated with how difficult it was to generate SQL dynamically.

Then Ruby on Rails came along with its ActiveRecord API, and all of that pain seemed to melt away into distant memory. Now we no longer have to worry about which fields to retrieve, “upserts” can be handled with a single method, and scopes allow us to be as dynamic as we want.

Unfortunately, ActiveRecord introduced new pains: the “WHERE” clause (i.e., the predicate) only allows connecting conditions with “AND”s, and the only comparisons allowed are “=” and “in”. To get around this, we often concede to ActiveRecord’s shortcomings and revert to raw SQL. But in Rails 3.0, we were introduced to another way.

Arel

The Arel relational algebra library is an abstract syntax tree (AST) manager that has the goals of “1) Simplif[ying] the generation of complex SQL queries; and 2) Adapt[ing] to various RDBMSes”. It’s the “engine” ActiveRecord uses to change method calls into actual SQL. To be clear, Arel only generates SQL; it doesn’t access the database or retrieve data.

With Arel, we are now capable of fully utilizing the power of SQL, without the mess of hand-coding it. ActiveRecord uses Arel to transform queries like this:

Post.joins(:comments).
  where(:published => true).
  where(:comments => {:flagged => true})

Into SQL like this:

SELECT "posts".*
  FROM "posts" INNER JOIN "comments" ON "posts"."id" = "comments"."post_id"
 WHERE "posts"."published" = true
   AND "comments"."flagged" = true

Read more…

Four short links: 30 August 2013

Flexible Layouts, Web Components, Distributed SQL Database, and Reverse-Engineering Dropbox Client

  1. intention.js — manipulates the DOM via HTML attributes. The methods for manipulation are placed with the elements themselves, so flexible layouts don’t seem so abstract and messy.
  2. Introducing Brick: Minimal-markup Web Components for Faster App Development (Mozilla) — a cross-browser library that provides new custom HTML tags to abstract away common user interface patterns into easy-to-use, flexible, and semantic Web Components. Built on Mozilla’s x-tags library, Brick allows you to plug simple HTML tags into your markup to implement widgets like sliders or datepickers, speeding up development by saving you from having to initially think about the under-the-hood HTML/CSS/JavaScript.
  3. F1: A Distributed SQL Database That Scales — a distributed relational database system built at Google to support the AdWords business. F1 is a hybrid database that combines high availability, the scalability of NoSQL systems like Bigtable, and the consistency and usability of traditional SQL databases. F1 is built on Spanner, which provides synchronous cross-datacenter replication and strong consistency. Synchronous replication implies higher commit latency, but we mitigate that latency by using a hierarchical schema model with structured data types and through smart application design. F1 also includes a fully functional distributed SQL query engine and automatic change tracking and publishing.
  4. Looking Inside The (Drop)Box (PDF) — This paper presents new and generic techniques, to reverse engineer frozen Python applications, which are not limited to just the Dropbox world. We describe a method to bypass Dropbox’s two factor authentication and hijack Dropbox accounts. Additionally, generic techniques to intercept SSL data using code injection techniques and monkey patching are presented. (via Tech Republic)

Interactive Big Data analysis using approximate answers

As data sizes continue to grow, interactive query systems may start adopting the sampling approach central to BlinkDB

Interactive query analysis for Hadoop-scale data has recently attracted the attention of many companies and open source developers. Some examples include Cloudera’s Impala, Shark, Pivotal’s HAWQ, Hadapt, CitusDB, Phoenix, Sqrrl, Redshift, and BigQuery. These solutions use distributed computing and a combination of other techniques, including data co-partitioning, caching (into main memory), runtime code generation, and columnar storage.

One approach that hasn’t been exploited as much is sampling. By this I mean employing samples to generate approximate answers, and speed up execution. Database researchers have written papers on approximate answers, but few working (downloadable) systems are actually built on this approach.

Approximate query engine from UC Berkeley’s AMPLab
An interesting open source database released yesterday uses sampling to scale to big data. BlinkDB is a massively parallel, approximate query system from UC Berkeley’s AMPLab. It uses a series of data samples to generate approximate answers. Users compose queries by specifying either error bounds or time constraints, and BlinkDB uses sufficiently large random samples to produce answers. Because random samples are stored in memory, BlinkDB is able to provide interactive response times:

BlinkDB
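
To make that concrete, approximate queries in BlinkDB attach either an error bound or a time bound to otherwise ordinary SQL. The sketch below is adapted from the examples in the BlinkDB paper; treat the exact keywords as approximate rather than the definitive grammar, and the sessions table as hypothetical:

SELECT COUNT(*)
  FROM sessions
 WHERE genre = 'western'
 GROUP BY os
 ERROR WITHIN 10% AT CONFIDENCE 95%

SELECT COUNT(*)
  FROM sessions
 WHERE genre = 'western'
 GROUP BY os
 WITHIN 5 SECONDS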

Read more…

Why Choose a Graph Database

Collaborative filtering with Neo4j

By this time, chances are that you’ve heard of NoSQL, and of graph databases like Neo4j.

NoSQL databases address important challenges that we face today, in terms of data size and data complexity. They offer a valuable solution by providing particular data models to address these dimensions.

On one side of the spectrum, these databases address scaling out and large data volumes by using compound aggregate values; on the other side is a relationship-based data model that allows us to model real-world information with high fidelity and complexity.

Neo4j, like many other graph databases, builds upon the property graph model; labeled nodes (for informational entities) are connected via directed, typed relationships. Both nodes and relationships hold arbitrary properties (key-value pairs). There is no rigid schema, but with node-labels and relationship-types we can have as much meta-information as we like. When importing data into a graph database, the relationships are treated with as much value as the database records themselves. This allows the engine to navigate your connections between nodes in constant time. That compares favorably to the exponential slowdown of many-JOIN SQL-queries in a relational database.

property-graph
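
To see why, consider a hypothetical friends-of-friends lookup written in SQL: every additional hop requires another self-join on the relationship table, which is what degrades as the data grows, whereas a graph database simply follows the stored relationships from node to node.

-- two hops out from person 42; a third hop would mean yet another join
SELECT DISTINCT fof.friend_id
  FROM friendships AS f
  JOIN friendships AS fof
    ON fof.person_id = f.friend_id
 WHERE f.person_id = 42
   AND fof.friend_id <> 42;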

How can you use a graph database?

Graph databases are well suited to model rich domains. Both object models and ER-diagrams are already graphs and provide a hint at the whiteboard-friendliness of the data model and the low-friction mapping of objects into graphs.

Instead of de-normalizing for performance, you would normalize interesting attributes into their own nodes, making it much easier to move, filter, and aggregate along these lines. Content and asset management, job-finding, and recommendations based on weighted relationships to relevant attribute nodes are some use cases that fit this model very well.

Many people use graph databases because of their high-performance online query capabilities. They process large volumes of raw data with Map/Reduce in Hadoop or event processing (Storm, Esper, etc.) and project the computation results into a graph. We’ve seen examples of this across many domains, from finance (fraud detection in money-flow graphs) and biotech (protein analysis on genome sequencing data) to telco (mobile network optimization on signal-strength measurements).

Graph databases shine when you can express your queries as a local search using a few starting points (e.g., people, products, places, orders). From there, you can follow relevant relationships to accumulate interesting information, or project visited nodes and relationships into a suitable result.

Read more…

A Hands-on Introduction to R

OSCON 2013 Speaker Series

R is an open-source statistical computing environment, similar to SAS and SPSS, that allows for the analysis of data using various techniques like subsetting, manipulation, visualization, and modeling. There are versions that run on Windows, Mac OS X, Linux, and other Unix-compatible operating systems.

To follow along with the examples below, download and install R from your local CRAN mirror found at r-project.org. You’ll also want to place the example CSV into your Documents folder (Windows) or home directory (Mac/Linux).

After installation, open the R application. The R Console will pop-up automatically. This is where R code is processed. To begin writing code, open an editor window (File -> New Script on Windows or File -> New Document on a Mac) and type the following code into your editor:

1+1

Place your cursor anywhere on the “1+1” code line, then hit Control-R (in Windows) or Command-Return (in Mac). You’ll notice that your “1+1” code is automatically executed in the R Console. This is the easiest way to run code in R. You can also run R code by typing the code directly into your R Console, but using the editor is much easier.

If you want to refresh your R Console, click anywhere inside of it and hit Control-L (in Windows) or Command-Option-L (in Mac).

Now let’s create a Vector, the simplest possible data structure in R. A Vector is similar to a column of data inside a spreadsheet. We use the combine function to do so:

raysVector <- c(2, 5, 1, 9, 4)

To view the contents of raysVector, first run the line of code above to create it. Then double-click on raysVector in the editor and run the code that is automatically highlighted. You will now see the contents of raysVector in your R Console.

The object we just created is now stored in memory and we can see this by running the following code:

ls()

R is an interpreted language with support for procedural and object-oriented programming. Here we use the mean function to calculate the statistical mean of raysVector:

mean(raysVector)

Getting help on the mean function is easy using:

?mean

We can create a simple plot of raysVector using:

barplot(raysVector, col = "red")

Importing CSV files is simple too:

data <- read.csv("raysData3.csv", na.strings = "")

We can subset the CSV data in many different ways. Here are two different methods that do the same thing:

data[ 1:2, 2:4 ]
data[ 1:2, c("age", "weight", "height") ]

There are many ways to transform your data in R. Here’s a method that doubles everyone’s age:

dataT <- transform( data, age = age * 2 )

The apply function allows us to apply a standard or custom function without loops. Here we apply the mean function column-wise to the first three rows of the data in order to analyze the age and height columns, ignoring missing values during the calculation:

apply( data[1:3, c("age", "height")], 2, mean, na.rm = T )

Here we build a linear regression model that predicts a person’s weight based on their age and height:

raysModel <- lm(weight ~ age + height, data = data)

We can plot our residuals like this:

plot(raysModel$residuals, pch = 15, col = "red")

We can install the Predictive Model Markup Language (PMML) package to quickly deploy our predictive model in a Business Intelligence system without custom SQL.

Read more…

Scaling People, Process, and Technology with Python

OSCON 2013 Speaker Series

NOTE: If you are interested in attending OSCON to check out Dave’s talk or the many other cool sessions, click over to the OSCON website where you can use the discount code OS13PROG to get 20% off your registration fee.

Since 2009, I’ve been leading the optimization team at AppNexus, a real-time advertising exchange. On this exchange, advertisers participate in real-time auctions to bid on individual ad impressions. The highest bid wins the auction, and that advertiser gets to show an ad. This allows advertisers to carefully target where they advertise—maximizing the effectiveness of their advertising budget—and lets websites maximize their ad revenue.

We do these auctions often (~50 billion a day) and fast (<100 milliseconds). Not surprisingly, this creates a lot of technical challenges. One of those challenges is how to automatically maximize the value advertisers get for their marketing budgets—systematically driving consumer engagement through ad placements on particular websites, times of day, etc.—and we call this process “optimization.” The volume of data is large, and the algorithms and strategies aren’t trivial.

In order to win clients and build our business to the scale we have today, it was crucial that we build a world-class optimization system. But when I started, we didn’t have a scalable tech stack to process the terabytes of data flowing through our systems every day, and we didn't have the team to do any of the required data modeling.

People

So, we needed to hire great people fast. However, there aren’t many veterans in the advertising optimization space, and because of that, we couldn’t afford to narrow our search to only experts in Java or R or Matlab. In order to give us the largest talent pool possible to recruit from, we had to choose a tech stack that is both powerful and accessible to people with diverse experience and backgrounds. So we chose Python.

Python is easy to learn. We found that people coding in R, Matlab, Java, and PHP, and even those who had never programmed before, could quickly learn Python and get up to speed. This opened us up to hiring from a tremendous pool of talent whom we could train in Python once they joined AppNexus. To top it off, there’s a great community of Python engineers to hire from, and the PyData community is full of programmers who specialize in modeling and automation.

Additionally, Python has great libraries for data modeling. It offers powerful analytical tools for analysts and quants; combined, Pandas, IPython, and Matplotlib give you a lot of the functionality of Matlab or R. This made it easy to hire and onboard quants and analysts who were familiar with those technologies. Even better, analysts and quants can share their analysis through the browser with IPython.

Process

Now that we had all of these wonderful employees, we needed a way to cut down the time to get them ramped up and pushing code to production.

First, we wanted to get our analysts and quants looking at and modeling data as soon as possible. We didn’t want them worrying about writing database connector code, or figuring out how to turn a cursor into a data frame. To tackle this, we built a project called Link.

Imagine you have a MySQL database. You don’t want to hardcode all of your connection information, because you want to have a different config for different users or for different environments. Link allows you to define your “environment” in a JSON config file and then reference it in code as if it were a Python object.

 { "dbs":{
  "my_db": {
   "wrapper": "MysqlDB",
   "host": "mysql-master.123fakestreet.net",
   "password": "",
   "user": "",
   "database": ""
  }
 }}

Now, with only three lines of code, you have a database connection and a data frame straight from your MySQL database. The same methodology works for Vertica, Netezza, Postgres, SQLite, etc. New “wrappers” can be added to accommodate new technologies, allowing team members to focus on modeling the data, not on how to connect to all these weird data sources.

In [1]: from link import lnk
 
In [2]: my_db = lnk.dbs.my_db
 
In [3]: df = my_db.select('select * from my_table').as_dataframe()
 

Int64Index: 325 entries, 0 to 324
Data columns:
id    325 non-null values
user_id   323 non-null values
app_id   325 non-null values
name    325 non-null values
body    325 non-null values
created   324 non-null values

By having the flexibility to easily connect to new data sources and APIs, our quants were able to adapt to the evolving architectures around us, and stay focused on modeling data and creating algorithms.

Second, we wanted to minimize the amount of work it took to take an algorithm from research/prototype phase to full production scale. Luckily, with everyone working in Python, our quants, analysts, and engineers are using the same language and data processing libraries. There was no need to re-implement an R script in Java to get it out across the platform.
Read more…

Four short links: 30 April 2013

China Threat, China Opportunity, Open Source Sustainability, and SQL for Cohort Analysis

  1. China = 41% of World’s Internet Attack Traffic (Bloomberg) — numbers are from Akamai’s research. Verizon Communications said in a separate report that China accounted for 96 percent of all global espionage cases it investigated. One interpretation is that China is a rogue Internet state, but another is that we need to harden up our systems. (via ZD Net)
  2. Open Source Cannot Live on Donations Alone — excellent summary of some of the sustainability questions facing open source projects.
  3. China Startups: The Gold Rush (Steve Blank) — dense fact- and insight-filled piece. Not only is the Chinese ecosystem completely different but also the consumer demographics and user expectations are equally unique. 70% of Chinese Internet users are under 30. Instead of email, they’ve grown up with QQ instant messages. They’re used to using the web and increasingly the mobile web for everything, commerce, communication, games, etc. (They also probably haven’t seen a phone that isn’t mobile.) By the end of 2012, there were 85 million iOS and 160 million Android devices in China. And they were increasing at an aggregate 33 million iOS and Android activations per month.
  4. Calculating Rolling Cohort Retention with SQL — just what it says; a generic sketch of the idea appears after this list. (via Max Lynch)
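
As a rough sketch of the idea behind that last link (not the linked post’s actual query; the users and activity tables and their columns are hypothetical), rolling cohort retention boils down to bucketing users by signup month and counting how many are still active N months later:

-- one row per (signup cohort, month offset);
-- divide each count by the cohort's month-0 count to get retention
SELECT DATE_FORMAT(u.signup_date, '%Y-%m')                  AS cohort_month,
       TIMESTAMPDIFF(MONTH, u.signup_date, a.activity_date) AS months_since_signup,
       COUNT(DISTINCT u.user_id)                            AS active_users
  FROM users u
  JOIN activity a ON a.user_id = u.user_id
 GROUP BY cohort_month, months_since_signup
 ORDER BY cohort_month, months_since_signup;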

Four data themes to watch from Strata + Hadoop World 2012

In-memory data storage, SQL, data preparation and asking the right questions all emerged as key trends at Strata + Hadoop World.

At our successful Strata + Hadoop World conference (including successfully avoiding Sandy), a few themes emerged that resonated with my interests and experience as a hands-on data analyst and as a researcher who tracks technology adoption trends. Keep in mind that these themes reflect my personal biases. Others will have a different take on their own key takeaways from the conference.

1. In-memory data storage for faster queries and visualization

Interactive or real-time querying of large datasets is seen as a key to analyst productivity (real-time meaning query times fast enough to keep the user in the flow of analysis, from sub-second to less than a few minutes). Existing large-scale data management schemes aren’t fast enough, and they reduce analytical effectiveness when users can’t explore the data by quickly iterating through various queries. We see companies with large data stores building out their own in-memory tools, e.g., Dremel at Google, Druid at Metamarkets, and Sting at Netflix, as well as new tools like Cloudera’s Impala (announced at the conference), UC Berkeley AMPLab’s Spark, SAP HANA, and Platfora.

We saw this coming a few years ago when analysts we pay attention to started building their own in-memory data store sandboxes, often in key/value data management tools like Redis, when trying to make sense of new, large-scale data stores. I know from my own work that there’s no better way to explore a new or unstructured data set than to be able to quickly run off a series of iterative queries, each informed by the last. Read more…