ENTRIES TAGGED "data"
New report covers areas of innovation and their difficulties
O’Reilly recently released a report I wrote called The Information Technology Fix for Health: Barriers and Pathways to the Use of Information Technology for Better Health Care. I hope this report, along with our book Hacking Healthcare, helps programmers who are curious about Health IT see what they need to learn and what they in turn can contribute to the field.
Computers in health are a potentially lucrative domain, to be sure, given a health care system through which $2.8 trillion, or $8,915 per person, passes each year in the US alone. Interest by venture capitalists ebbs and flows, but the impetus to creative technological hacking is strong, as shown by the large number of challenges run by governments, pharmaceutical companies, insurers, and others.
Some things you should consider doing include:
- Join open source projects
  Numerous projects to collect and process health data are being conducted as free software; find one that raises your heartbeat and contribute. For instance, the most respected health care system in the country, VistA from the Department of Veterans Affairs, has new leadership in OSEHRA, which is trying to create a community of vendors and volunteers. You don’t need to understand the oddities of the MUMPS language on which VistA is based to contribute, although I believe some knowledge of the underlying database would be useful. But there are plenty of other projects too, such as the OpenMRS electronic record system and the projects that cooperate under the aegis of Open Health Tools.
Unlocking Scientific Data with Python
Most people working on complex software systems have had That Moment, when you throw up your hands and say “If only we could start from scratch!” Generally, it’s not possible. But every now and then, the chance comes along to build a really exciting project from the ground up.
In 2011, I had the chance to participate in just such a project: the acquisition, archiving and database systems which power a brand-new hypervelocity dust accelerator at the University of Colorado.
Hadoop, Sqoop, and ZooKeeper
Kathleen Ting (@kate_ting), Technical Account Manager at Cloudera, and our own Andy Oram (@praxagora) sat down to discuss how to work with structured and unstructured data as well as how to keep a system up and running that is crunching that data.
Key highlights include:
- Misconfigurations account for almost half of the support issues that the team at Cloudera is seeing [Discussed at 0:22]
- ZooKeeper, the canary in the Hadoop coal mine [Discussed at 1:10]
- Leaky clients are often a problem ZooKeeper detects [Discussed at 2:10]
- Sqoop is a bulk data transfer tool [Discussed at 2:47]
- Sqoop helps to bring together structured and unstructured data [Discussed at 3:50]
- ZooKeeper is not for storage, but for coordination, reliability, and availability [Discussed at 4:44]
You can view the full interview here:
Computing Twitter Influence, Part 2
In the previous post of this series, we aspired to compute the influence of a Twitter account and explored some variables relevant to arriving at a base metric. This post continues the conversation by presenting some sample code for making “reliable” requests to Twitter’s API to facilitate the data collection process.
Given a Twitter screen name, it’s (theoretically) quite simple to get all of the account profiles that follow the screen name. Perhaps the most economical route is to use the GET /followers/ids API to request all of the follower IDs in batches of 5,000 per response, followed by the GET /users/lookup API to retrieve full account profiles for up to Y of those IDs in batches of 100 per response. Thus, if an account has X followers, you’d need to anticipate making ceiling(X/5000) API calls to GET /followers/ids and ceiling(X/100) API calls to GET /users/lookup. Although most Twitter accounts may not have enough followers for the total number of requests to each API resource to present rate-limiting problems, you can rest assured that the most popular accounts will trigger rate-limiting enforcements that manifest as an HTTP error in RESTful APIs.
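The request arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the post itself: the batch sizes come from the text, while the names estimate_requests, chunks, RateLimited, and call_with_retry are hypothetical helpers introduced here. No real Twitter client is called; the retry helper simply wraps whatever request function you supply and assumes that function raises RateLimited when the API answers with a rate-limiting HTTP error.

```python
import math
import time

FOLLOWER_IDS_PER_CALL = 5000  # batch size for GET /followers/ids, per the text
PROFILES_PER_CALL = 100       # batch size for GET /users/lookup, per the text


def estimate_requests(num_followers):
    """Return (ids_calls, lookup_calls) for an account with num_followers."""
    ids_calls = math.ceil(num_followers / FOLLOWER_IDS_PER_CALL)
    lookup_calls = math.ceil(num_followers / PROFILES_PER_CALL)
    return ids_calls, lookup_calls


def chunks(ids, size=PROFILES_PER_CALL):
    """Yield successive slices of ids, each holding at most size items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


class RateLimited(Exception):
    """Hypothetical error a request function raises when it is rate limited."""


def call_with_retry(request, max_attempts=3, wait_seconds=1.0, sleep=time.sleep):
    """Call request(); on RateLimited, wait a little longer each time and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            sleep(wait_seconds * attempt)


if __name__ == "__main__":
    ids_calls, lookup_calls = estimate_requests(12345)
    print(ids_calls, "calls to GET /followers/ids")
    print(lookup_calls, "calls to GET /users/lookup")
```

For an account with 12,345 followers, this works out to 3 calls for the IDs and 124 calls for the profiles, which makes it easy to see how quickly a popular account can exhaust a rate-limit window. The wait time in call_with_retry is a placeholder; a real client would more likely wait until the window resets.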
An interview with Allen Downey, the author of Think Bayes
When Mike first discussed Allen Downey’s Think Bayes book project with me, I remember nodding a lot. As the data editor, I spend a lot of time thinking about the different people within our Strata audience and how we can provide what I refer to as “bridge resources”. We need to know and understand the environments that our users are the most comfortable in and provide them with the appropriate bridges in order to learn a new technique, language, tool, or …even math. I’ve also been very clear that almost everyone will need to improve their math skills should they decide to pursue a career in data science. So when Mike mentioned that Allen’s approach was to teach math not using math…but using Python, I immediately indicated my support for the project. Once the book was written, I contacted Allen about an interview and he graciously took some time away from the start of the semester to answer a few questions about his approach, teaching, and writing.
How did the “Think” series come about? What led you to start the series?
Allen Downey: A lot of it comes from my experience teaching at Olin College. All of our students take a basic programming class in the first semester, and I discovered that I could use their programming skills as a pedagogic wedge. What I mean is if you know how to program, you can use that skill to learn everything else.
I started with Think Stats because statistics is an area that has really suffered from the mathematical approach. At a lot of colleges, students take a mathematical statistics class that really doesn’t prepare them to work with real data. By taking a computational approach I was able to explain things more clearly (at least I think so). And more importantly, the computational approach lets students dive in and work with real data right away.
At this point there are four books in the series and I’m working on the fifth. Think Python covers Python programming–it’s the prerequisite for all the other books. But once you’ve got basic Python skills, you can read the others in any order.
Working with big data and open source software
I recently sat down with Mark Grover (@mark_grover), a Software Engineer at Cloudera, to talk about the Hadoop ecosystem. He is a committer on Apache Bigtop and a contributor to Apache Hadoop, Hive, Sqoop, and Flume. He also contributed to O’Reilly Media’s Programming Hive title.
Key highlights include:
FtanML looks for the best of both
Today’s Balisage conference got off to a great start. After years of discussing the pros and cons of XML, HTML, JSON, SGML, and more, it was great to see Michael Kay (creator of the SAXON processor for XSLT and XQuery) take a fresh look at what a markup language should be.
Simplifying IT automation
IT infrastructure should be simpler to automate. A new method of describing IT configurations and policy as data formats can help us get there. To understand this conclusion, it helps to understand how the existing tool chains of automation software came to be.
In the early days of IT infrastructure, administrators seeking to avoid redundant typing wrote scripts to help them manage their growing computer hordes. The development of these in-house automation systems was not without cost; each organization built its own redundant tools. As scripting gurus left an organization, these scripts were often very difficult for new employees to maintain.
As the huge number of books written on the topic attests, doing software development right can require a large investment of time. Systems management software is especially complex, due to all the possible variables and corner cases to be managed. These in-house scripting systems often grew to be fragile.