ENTRIES TAGGED "data"
Unlocking Scientific Data with Python
Most people working on complex software systems have had That Moment, when you throw up your hands and say “If only we could start from scratch!” Generally, it’s not possible. But every now and then, the chance comes along to build a really exciting project from the ground up.
In 2011, I had the chance to participate in just such a project: the acquisition, archiving and database systems which power a brand-new hypervelocity dust accelerator at the University of Colorado.
Hadoop, Sqoop, and ZooKeeper
Kathleen Ting (@kate_ting), a Technical Account Manager at Cloudera, sat down with our own Andy Oram (@praxagora) to discuss how to work with structured and unstructured data, as well as how to keep the systems crunching that data up and running.
Key highlights include:
- Misconfigurations account for almost half of the support issues that the team at Cloudera is seeing [Discussed at 0:22]
- ZooKeeper, the canary in the Hadoop coal mine [Discussed at 1:10]
- Leaky clients are often a problem ZooKeeper detects [Discussed at 2:10]
- Sqoop is a bulk data transfer tool [Discussed at 2:47]
- Sqoop helps to bring together structured and unstructured data [Discussed at 3:50]
- ZooKeeper is not for storage, but for coordination, reliability, and availability [Discussed at 4:44] (a short sketch of this idea follows the list)
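For a concrete picture of what "coordination, not storage" looks like in practice, here is a minimal sketch using the kazoo Python client for ZooKeeper; the host address and znode paths are made-up examples, not anything taken from the interview:

from kazoo.client import KazooClient

# Assumes a ZooKeeper ensemble reachable at this placeholder address.
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# An ephemeral znode disappears automatically when this client's session
# ends, which is how ZooKeeper notices crashed or "leaky" clients.
zk.create("/app/workers/worker-1", b"alive", ephemeral=True, makepath=True)

# A distributed lock: a small piece of coordination state, not bulk storage.
lock = zk.Lock("/app/locks/reindex", "worker-1")
with lock:
    pass  # only one worker at a time runs this critical section

zk.stop()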
You can view the full interview here:
Computing Twitter Influence, Part 2
In the previous post of this series, we aspired to compute the influence of a Twitter account and explored some variables relevant to arriving at a base metric. This post continues the conversation by presenting some sample code for making "reliable" requests to Twitter's API to facilitate the data collection process.
Given a Twitter screen name, it's (theoretically) quite simple to get all of the account profiles that follow it. Perhaps the most economical route is to use the GET /followers/ids API to request all of the follower IDs in batches of 5,000 per response, followed by the GET /users/lookup API to retrieve full account profiles for those IDs in batches of 100 per response. Thus, if an account has X followers, you'd need to anticipate making ceiling(X/5000) API calls to GET /followers/ids and ceiling(X/100) API calls to GET /users/lookup. Although most Twitter accounts don't have enough followers for the total number of requests to either API resource to present rate-limiting problems, you can rest assured that the most popular accounts will trigger rate-limiting enforcement, which manifests as an HTTP error in RESTful APIs.
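As a rough sketch of the two ideas in that paragraph, estimating the number of API calls an account requires and retrying when rate limiting kicks in, here is a minimal Python illustration; the function names and the is_rate_limit_error hook are placeholders rather than any particular Twitter client's API:

import math
import sys
import time

def estimate_api_calls(num_followers):
    # /followers/ids returns up to 5,000 IDs per response;
    # /users/lookup resolves up to 100 profiles per response.
    ids_calls = int(math.ceil(num_followers / 5000.0))
    lookup_calls = int(math.ceil(num_followers / 100.0))
    return ids_calls, lookup_calls

def make_reliable_request(request, is_rate_limit_error, *args, **kwargs):
    # Retry the call whenever the client signals that the rate limit was
    # hit (HTTP 429), sleeping out Twitter's 15-minute rate-limit window.
    while True:
        try:
            return request(*args, **kwargs)
        except Exception as e:
            if is_rate_limit_error(e):
                sys.stderr.write("Rate limited; sleeping 15 minutes\n")
                time.sleep(15 * 60)
            else:
                raise

# For example, an account with 2,000,000 followers needs about 400 calls
# to /followers/ids and 20,000 calls to /users/lookup.
print(estimate_api_calls(2000000))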
An interview with Allen Downey, the author of Think Bayes
When Mike first discussed Allen Downey's Think Bayes book project with me, I remember nodding a lot. As the data editor, I spend a lot of time thinking about the different people within our Strata audience and how we can provide what I refer to as "bridge resources." We need to know and understand the environments our users are most comfortable in, and provide them with the appropriate bridges to learn a new technique, language, tool, or even math. I've also been very clear that almost everyone will need to improve their math skills should they decide to pursue a career in data science. So when Mike mentioned that Allen's approach was to teach math not using math, but using Python, I immediately indicated my support for the project. Once the book was written, I contacted Allen about an interview, and he graciously took some time away from the start of the semester to answer a few questions about his approach, teaching, and writing.
How did the “Think” series come about? What led you to start the series?
Allen Downey: A lot of it comes from my experience teaching at Olin College. All of our students take a basic programming class in the first semester, and I discovered that I could use their programming skills as a pedagogic wedge. What I mean is if you know how to program, you can use that skill to learn everything else.
I started with Think Stats because statistics is an area that has really suffered from the mathematical approach. At a lot of colleges, students take a mathematical statistics class that really doesn’t prepare them to work with real data. By taking a computational approach I was able to explain things more clearly (at least I think so). And more importantly, the computational approach lets students dive in and work with real data right away.
At this point there are four books in the series and I’m working on the fifth. Think Python covers Python programming–it’s the prerequisite for all the other books. But once you’ve got basic Python skills, you can read the others in any order.
Working with big data and open source software
I recently sat down with Mark Grover (@mark_grover), a Software Engineer at Cloudera, to talk about the Hadoop ecosystem. He is a committer on Apache Bigtop and a contributor to Apache Hadoop, Hive, Sqoop, and Flume. He also contributed to O’Reilly Media’s Programming Hive title.
Key highlights include:
FtanML looks for the best of both
Today’s Balisage conference got off to a great start. After years of discussing the pros and cons of XML, HTML, JSON, SGML, and more, it was great to see Michael Kay (creator of the SAXON processor for XSLT and XQuery) take a fresh look at what a markup language should be.
Simplifying IT automation
IT infrastructure should be simpler to automate. A new method of describing IT configurations and policy as data formats can help us get there. To understand this conclusion, it helps to understand how the existing tool chains of automation software came to be.
In the early days of IT infrastructure, administrators seeking to avoid redundant typing wrote scripts to help them manage their growing computer hordes. The development of these in-house automation systems was not without cost: each organization built its own redundant tools, and as scripting gurus left an organization, their scripts were often very difficult for new employees to maintain.
As the huge number of books written on the topic attests, doing software development right can require a large investment of time. Systems management software is especially complex, given all the variables and corner cases that have to be managed, and these in-house scripting systems often grew fragile.
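To make the idea of describing configurations as data concrete, here is a small illustrative sketch in Python; the package, service, and file names below are invented examples, not any real tool's schema:

# Desired state expressed as plain data rather than as an imperative,
# site-specific script.
desired_state = {
    "packages": ["ntp", "nginx"],          # ensure these are installed
    "services": {"nginx": "running"},      # ensure these are in this state
    "files": {"/etc/motd": {"mode": "0644", "content": "Managed centrally\n"}},
}

# A generic engine, shared rather than rewritten at each organization,
# reads the data and converges each host toward it; the data itself stays
# readable for whoever inherits the system.
for pkg in desired_state["packages"]:
    print("ensure package installed: " + pkg)
for svc, state in desired_state["services"].items():
    print("ensure service " + svc + " is " + state)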
OSCON 2013 Speaker Series
R is an open-source statistical computing environment, similar to SAS and SPSS, that allows data to be analyzed using techniques such as subsetting, manipulation, visualization, and modeling. There are versions that run on Windows, Mac OS X, Linux, and other Unix-compatible operating systems.
To follow along with the examples below, download and install R from your local CRAN mirror found at r-project.org. You’ll also want to place the example CSV into your Documents folder (Windows) or home directory (Mac/Linux).
After installation, open the R application. The R Console will pop up automatically; this is where R code is processed. To begin writing code, open an editor window (File -> New Script on Windows or File -> New Document on a Mac) and type the following code into your editor:

1 + 1
Place your cursor anywhere on the "1+1" code line, then hit Control-R (on Windows) or Command-Return (on a Mac). You'll notice that your "1+1" code is automatically executed in the R Console. This is the easiest way to run code in R. You can also run R code by typing it directly into your R Console, but using the editor is much easier.

If you want to refresh your R Console, click anywhere inside of it and hit Control-L (on Windows) or Command-Option-L (on a Mac).
Now let's create a Vector, the simplest possible data structure in R. A Vector is similar to a column of data inside a spreadsheet. We use the combine function to do so:
raysVector <- c(2, 5, 1, 9, 4)
To view the contents of raysVector, just run the line of code above. After running it, double-click on raysVector in the editor and then run the code that is automatically highlighted. You will now see the contents of raysVector in your R Console.
The object we just created is now stored in memory. We can confirm this by listing the objects in the current workspace:

ls()
R is an interpreted language with support for procedural and object-oriented programming. Here we use the mean statistical function to calculate the statistical mean of raysVector:

mean(raysVector)
Getting help on the mean function is easy using the built-in help facility:

help(mean)
We can create a simple plot of raysVector using the barplot function:
barplot(raysVector, col = "red")
Importing CSV files is simple too:
data <- read.csv("raysData3.csv", na.strings = "")
We can subset the CSV data in many different ways. Here are two different methods that do the same thing:
data[ 1:2, 2:4 ]
data[ 1:2, c("age", "weight", "height") ]
There are many ways to transform your data in R. Here’s a method that doubles everyone’s age:
dataT <- transform( data, age = age * 2 )
The apply function allows us to apply a standard or custom function to our data without writing loops. Here we apply the mean function column-wise to the first three rows of the dataset in order to analyze the age and height columns, ignoring missing values during the calculation:
apply( data[1:3, c("age", "height")], 2, mean, na.rm = T )
Here we build a linear regression model that predicts a person’s weight based on their age and height:
raysModel <- lm(weight ~ age + height, data = data)
We can plot our residuals like this:
plot(raysModel$residuals, pch = 15, col = "red")