Scaling People, Process, and Technology with Python

OSCON 2013 Speaker Series

NOTE: If you are interested in attending OSCON to check out Dave’s talk or the many other cool sessions, click over to the OSCON website where you can use the discount code OS13PROG to get 20% off your registration fee.

Since 2009, I’ve been leading the optimization team at AppNexus, a real-time advertising exchange. On this exchange, advertisers participate in real-time auctions to bid on individual ad impressions. The highest bid wins the auction, and that advertiser gets to show an ad. This allows advertisers to carefully target where they advertise—maximizing the effectiveness of their advertising budget—and lets websites maximize their ad revenue.

We do these auctions often (~50 billion a day) and fast (<100 milliseconds). Not surprisingly, this creates a lot of technical challenges. One of those challenges is how to automatically maximize the value advertisers get for their marketing budgets—systematically driving consumer engagement through ad placements on particular websites, times of day, etc.—and we call this process “optimization.” The volume of data is large, and the algorithms and strategies aren’t trivial.

In order to win clients and build our business to the scale we have today, it was crucial that we build a world-class optimization system. But when I started, we didn’t have a scalable tech stack to process the terabytes of data flowing through our systems every day, and we didn't have the team to do any of the required data modeling.

People

So, we needed to hire great people fast. However, there aren’t many veterans in the advertising optimization space, and because of that, we couldn’t afford to narrow our search to only experts in Java or R or Matlab. To give ourselves the largest possible talent pool to recruit from, we had to choose a tech stack that was both powerful and accessible to people with diverse experience and backgrounds. So we chose Python.

Python is easy to learn. We found that people coding in R, Matlab, Java, and PHP, and even those who had never programmed before, could quickly get up to speed with Python. This opened us up to hiring from a tremendous pool of talent that we could train in Python once they joined AppNexus. To top it off, there’s a great Python community to hire engineers from, and the PyData community is full of programmers who specialize in modeling and automation.

Additionally, Python has great libraries for data modeling. It offers great analytical tools for analysts and quants: combined, Pandas, IPython, and Matplotlib give you much of the functionality of Matlab or R. This made it easy to hire and onboard quants and analysts who were familiar with those technologies. Even better, analysts and quants can share their analysis through the browser with IPython.
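
As a rough illustration of that workflow (a toy sketch with made-up data, not one of our production analyses), a few lines in an IPython session take you from raw rows to an aggregated chart:

import pandas as pd
import matplotlib.pyplot as plt

# toy impression counts by day (illustrative data only)
df = pd.DataFrame({'ymd': ['2013-06-11', '2013-06-12', '2013-06-13'],
                   'imps': [100, 150, 125]})

daily = df.groupby('ymd')['imps'].sum()  # SQL-style aggregation
daily.plot(kind='bar')                   # quick chart, Matlab/R style
plt.show()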

Process

Now that we had all of these wonderful employees, we needed a way to cut down the time to get them ramped up and pushing code to production.

First, we wanted to get our analysts and quants looking at and modeling data as soon as possible. We didn’t want them worrying about writing database connector code, or figuring out how to turn a cursor into a data frame. To tackle this, we built a project called Link.

Imagine you have a MySQL database. You don’t want to hardcode all of your connection information, because you want a different config for different users or for different environments. Link allows you to define your “environment” in a JSON config file, and then reference it in code as if it were a Python object.

 { "dbs":{
  "my_db": {
   "wrapper": "MysqlDB",
   "host": "mysql-master.123fakestreet.net",
   "password": "",
   "user": "",
   "database": ""
  }
 }}

Now, with only three lines of code, you have a database connection and a data frame straight from your MySQL database. The same methodology works for Vertica, Netezza, Postgres, SQLite, etc. New “wrappers” can be added to accommodate new technologies, allowing team members to focus on modeling the data, not on how to connect to all these weird data sources.

In [1]: from link import lnk
 
In [2]: my_db = lnk.dbs.my_db
 
In [3]: df = my_db.select('select * from my_table').as_dataframe()
 

In [4]: df
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 325 entries, 0 to 324
Data columns:
id    325 non-null values
user_id   323 non-null values
app_id   325 non-null values
name    325 non-null values
body    325 non-null values
created   324 non-null values

By having the flexibility to easily connect to new data sources and APIs, our quants were able to adapt to the evolving architectures around us, and stay focused on modeling data and creating algorithms.
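
To give a flavor of the wrapper pattern, here is a hypothetical sketch of what a minimal wrapper for a new data source might look like. This is illustrative only and is not Link’s actual API; the class and method names are assumptions that simply mirror the select(...).as_dataframe() usage shown earlier.

import sqlite3
import pandas as pd

class SelectResult(object):
    """Hypothetical result object mirroring the .as_dataframe() call above."""
    def __init__(self, cursor):
        self.cursor = cursor

    def as_dataframe(self):
        rows = self.cursor.fetchall()
        columns = [col[0] for col in self.cursor.description]
        return pd.DataFrame(rows, columns=columns)

class SqliteWrapper(object):
    """Hypothetical wrapper: hides connection details behind a common interface,
    so adding a new data source is a small class plus a JSON config entry."""
    def __init__(self, database, **kwargs):
        self.conn = sqlite3.connect(database)

    def select(self, query):
        cursor = self.conn.cursor()
        cursor.execute(query)
        return SelectResult(cursor)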

Second, we wanted to minimize the amount of work it took to take an algorithm from the research/prototype phase to full production scale. Luckily, with everyone working in Python, our quants, analysts, and engineers were using the same language and data processing libraries. There was no need to re-implement an R script in Java to get it out across the platform.

Technology

Lastly, Python is a versatile programming language with libraries that help us manage the complexity of our system. It all started with 1 machine, 1 core, and 2 databases. Today, we are processing jobs across 100+ cores, reading data from over 6 data sources.

However, prior to adopting Python, things were not so clean. We actually had a PHP script that dumped an SQL query to a file, ran an R script over it, then read in the results and pushed them to our API. Not only was the back-and-forth to disk inefficient, but changing and maintaining code written in two languages was cumbersome.

Today, all of that can be done in Python because of its great libraries. NumPy and Pandas give us the speed of typed languages for our data crunching, with the simplicity of Python. The team can write SQL-like aggregations and filters on in-memory data to explore large data sets.

# counting the number of records in this DataFrame takes .01 seconds
In [12]: %time data['imps'].count()
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.01 s
Out[12]: 1000000

# grouping by day and summing impressions (ad requests) takes .14 seconds
In [13]: %time data.groupby('ymd')['imps'].sum()
CPU times: user 0.14 s, sys: 0.00 s, total: 0.14 s
Wall time: 0.14 s
Out[13]:
ymd
2013-06-11 37511241
2013-06-12 29381549
2013-06-13 12243933
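
The same in-memory style works for filters. As a hedged sketch using the same data DataFrame and columns as above, pulling out a single day looks like this (its sum should match the 2013-06-11 row of the groupby output):

# SQL-like filter: keep only one day's rows, then sum its impressions
In [14]: data[data['ymd'] == '2013-06-11']['imps'].sum()
Out[14]: 37511241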

When it comes to developing the models, there are a lot of great algorithms and tools we can use in SciPy and scikit-learn.
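
As a minimal, hypothetical example of the kind of thing scikit-learn makes easy (the toy features and model choice below are illustrative assumptions, not our production algorithm), fitting a simple classifier takes only a few lines:

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy training data: hour of day and a site-category id for past impressions,
# labeled 1 if the impression led to a click (illustrative only)
X = np.array([[0, 1], [3, 2], [12, 1], [18, 3], [21, 2], [23, 3]])
y = np.array([0, 0, 1, 1, 1, 0])

model = LogisticRegression()
model.fit(X, y)

# estimated click probability for a new impression at noon on site category 2
print(model.predict_proba([[12, 2]]))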

Most importantly for our engineering team, libraries like Pika allow us to leverage a message queue system to schedule jobs based on events, such as updates in the data pipeline or the completion of a dependent job. Around this message queue system we were able to build a work queue that allows us to shard our work and farm it out across those 100+ cores.
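
A minimal sketch of that pattern, assuming a RabbitMQ broker on localhost and the current Pika API (the queue name and job payload here are made up, and this is not our actual work queue code):

import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()
channel.queue_declare(queue='optimization_jobs', durable=True)

# producer: publish a job when an upstream event fires (e.g. new data landed)
channel.basic_publish(exchange='',
                      routing_key='optimization_jobs',
                      body=json.dumps({'campaign_id': 123, 'task': 'refresh_model'}))

# worker: each box in the compute cluster runs consumers that pull jobs off
# the queue, so adding cores just means starting more workers
def handle_job(ch, method, properties, body):
    job = json.loads(body)
    # ... run the sharded modeling work for this job ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue='optimization_jobs', on_message_callback=handle_job)
channel.start_consuming()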

[Figure: scaling_people_w_python]

This lets developers concentrate on developing the algorithm without worrying about how they are going to scale out their compute resources. If we need more cores, we simply add a new box to the compute cluster and we are good to go.

Evolving with the organization

Python has become a critical tool that has allowed the optimization team to grow and mature our people, processes, and technology. Luckily, adopting Python has also set us up to evolve with the larger organization in the future.

For example, as AppNexus scaled, the Data Team replaced most of our large-scale data processing infrastructure with Hadoop. We obviously want to continue to use our Python tools, and leverage them on terabytes of log-level data. Again, Python’s many libraries have given us a lot of options to consider…almost too many.

In our OSCON presentation, Python in an Evolving Enterprise System: Integration Solutions with Hadoop, we will explore these options. If you want to hear more about our research… I guess we will see you at OSCON.

