"MapReduce" entries

Moving from Batch to Continuous Computing at Yahoo!

Spark, Storm, HBase, and YARN power large-scale, real-time models.

My favorite session at the recent Hadoop Summit was a keynote by Bruno Fernandez-Ruiz, Senior Fellow & VP Platforms at Yahoo! He gave a nice overview of their analytic and data processing stack, and shared some interesting factoids about the scale of their big data systems. Notably, many of their production systems now run on MapReduce 2.0 (MRv2), or YARN – a resource manager that lets multiple frameworks share the same cluster.

Yahoo! was the first company to embrace Hadoop in a big way, and it remains a trendsetter within the Hadoop ecosystem. In the early days the company used Hadoop for large-scale batch processing (the key example being the computation of its web index for search). More recently, many of its big data models require low-latency alternatives to Hadoop MapReduce. In particular, Yahoo! leverages user and event data to power its targeting, personalization, and other “real-time” analytic systems. Continuous Computing is a term Yahoo! uses to refer to systems that perform computations over small batches of data (over short time windows), in between traditional batch computations that still use Hadoop MapReduce. The goal is to be able to move quickly from raw data, to information, to knowledge.
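
As a rough illustration of computing over small batches of data in short time windows, here is a minimal Python sketch. It is not Yahoo!'s implementation: the function name `window_counts`, the `(timestamp, key)` event shape, and the 60-second window are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def window_counts(events, window_seconds=60):
    """Count keys per fixed-size time window.

    `events` is an iterable of (timestamp_in_seconds, key) pairs.
    This event shape is hypothetical, chosen only for illustration.
    """
    windows = defaultdict(Counter)
    for timestamp, key in events:
        # Integer division maps each event to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start][key] += 1
    # Return plain dicts, ordered by window start time.
    return {start: dict(counts) for start, counts in sorted(windows.items())}


# Hypothetical click events: (seconds since start, content category).
events = [(0, "sports"), (15, "news"), (59, "sports"), (61, "news"), (130, "sports")]
result = window_counts(events, window_seconds=60)
```

Each window can be processed as soon as it closes, so results arrive within about a minute instead of waiting for the next large batch job. A production system would also need to handle late or out-of-order events, which this sketch ignores.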

On a side note: many organizations are beginning to use cluster managers that let multiple frameworks share the same cluster. In particular, I’m seeing many companies – notably Twitter – use Mesos (instead of YARN) to run similar services (Storm, Spark, Hadoop MapReduce, HBase) on the same cluster.

Going back to Bruno’s presentation, here are some interesting bits – current big data systems at Yahoo! by the numbers:

Read more…

Four short links: 18 January 2012

Nondeterministic Multicore, Cloning UI, jQuery Secrets, and MapReduce Alternative

  1. Many Core Processors — not the first time I’ve heard nondeterministic computing discussed as a solution to some of our parallel-programming travails. Can’t imagine what a pleasure it is to debug.
  2. Pinterest Cloned — it’s not the pilfering of the idea that offends my sensibilities, it’s the blatant clone of every aspect of the UI. I never thought much of the old Apple look’n’feel lawsuit but this really rubs me the wrong way.
  3. What You May Not Know About jQuery — far more than DOM and AJAX calls. (via JavaScript Weekly)
  4. Spark — a framework implemented in Scala that offers an alternative to MapReduce’s model of parallelism. (via Pete Warden)

Four short links: 6 December 2011

Dispel Your Illusions, Simple Mac OS X Apps, Assisted Translation, and AutoTagging

  1. How to Dispel Your Illusions (NY Review of Books) — Freeman Dyson writing about Daniel Kahneman’s latest book. Only by understanding our cognitive illusions can we hope to transcend them.
  2. Appify-UI (github) — Create the simplest possible Mac OS X apps. Uses HTML5 for the UI. Supports scripting with anything and everything. (via Hacker News)
  3. Translation Memory (Etsy) — using Lucene/SOLR to help automate the translation of their UI. (via Twitter)
  4. Automatically Tagging Entities with Descriptive Phrases (PDF) — Microsoft Research paper on automated tagging. Under the hood it uses Map/Reduce and the Microsoft Dryad framework. (via Ben Lorica)

Strata Week: Simplifying MapReduce through Java

MapReduce gets easier, a new search engine for data, and now you can monitor the universe's forces on your phone.

Cloudera's Crunch hopes to make MapReduce easier, Datafiniti launches a search engine for data, and the University of Oxford releases an Android app for monitoring CERN data.

Four short links: 16 September 2011

Gamification Critique, Google+ API, Time Series Visualization, and SQL on Map-Reduce

  1. A Quick Buck by Copy and Paste — scorching review of O’Reilly’s Gamification by Design title. tl;dr: reviewer, he does not love. Tim responded on Google Plus. Also on the gamification wtfront, Mozilla Open Badges. It talks about establishing a part of online identity, but to me it feels a little like a Mozilla Open Gradients project would: cargocult-confusing the surface for the substance.
  2. Google+ API Launched — first piece of a Google+ API is released. It provides read-only programmatic access to people, posts, checkins, and shares. Activities are retrieved as triples of (subject, verb, object), which is semweb cute and ticks the social object box, but is unlikely in present form to reverse declining numbers of users.
  3. Cube — open source time-series visualization software from Square, built on MongoDB, Node, and Redis. As Artur Bergman noted, the bigger news might be that Square is using MongoDB (known meh).
  4. Tenzing — an SQL implementation on top of Map/Reduce. Tenzing supports a mostly complete SQL implementation (with several extensions) combined with several key characteristics such as heterogeneity, high performance, scalability, reliability, metadata awareness, low latency, support for columnar storage and structured data, and easy extensibility. Tenzing is currently used internally at Google by 1000+ employees and serves 10000+ queries per day over 1.5 petabytes of compressed data. In this paper, we describe the architecture and implementation of Tenzing, and present benchmarks of typical analytical queries. (via Raphaël Valyi)

Strata Week: MapReduce gets its arms around a million songs

MapReduce crunches a million-song dataset, GPS and accident reconstruction, and WWI crowdsourcing.

This week's data stories include a guide to using MapReduce to process the Million Song Dataset, a story about how GPS data can help reconstruct lost memories (and accidents), and evidence that emergency crowdsourcing goes back further than many realize.

Four short links: 25 July 2011

Minecraft Emergent Behaviour, Algorithmic 3D Printing, Automated MapReduce Optimization, and Multi-Device Preview

  1. Anonymity in Bitcoin — TL;DR: Bitcoin is not inherently anonymous. It may be possible to conduct transactions in such a way so as to obscure your identity, but, in many cases, users and their transactions can be identified. We have performed an analysis of anonymity in the Bitcoin system and published our results in a preprint on arXiv. (via Hacker News)
  2. 3D Printing + Algorithmic Generation — clever designers use algorithms based on leaf vein generation to create patterns for lamps, which are then 3D-printed. (via Imran Ali)
  3. Manimal: Relational Optimization for Data-Intensive Programs (PDF) — static code analysis to detect MapReduce program semantics and thereby enable wholly-automatic optimization of MapReduce programs. (via BigData)
  4. Screenfly — preview your site in different devices’ screen sizes and resolutions. (via Smashing Magazine)

Four short links: 23 June 2011

Communities, Statistics, News, and Doubting Data

  1. The Wisdom of Communities — Luke Wroblewski’s notes from Derek Powazek’s talk at Event Apart. Wisdom of Crowds theory shows that, in aggregate, crowds are smarter than any single individual in the crowd. See this online in most-emailed features, BitTorrent, etc. Wise crowds are built on a few key characteristics: diversity (of opinion), independence (of other ideas), decentralization, and aggregation.
  2. How to Fit an Elephant (John D. Cook) — for the stats geeks out there. Someone took von Neumann’s famous line “with four parameters I can fit an elephant, and with five I can make him wiggle his trunk”, and found the four complex parameters that do, indeed, fit an elephant.
  3. How to Run a News Site and Newspaper Using WordPress and Google Docs — clever workflow that’s digital first but integrated with print. (via Sacha Judd)
  4. All Watched Over: On Foo, Cybernetics, and Big Data — I’m glad someone preserved Matt Jones’s marvelous line, “the map-reduce is not the territory”. (via Tom Armitage)

Hadoop: What it is, how it works, and what it can do

Cloudera CEO Mike Olson on Hadoop's architecture and its data applications.

Hadoop gets a lot of buzz in database circles, but some folks are still hazy about what it is and how it works. In this interview, Cloudera CEO and Strata speaker Mike Olson discusses Hadoop's background and its current utility.

Big data faster: A conversation with Bradford Stephens

The founder of Drawn to Scale explains how his database platform does simple things quickly.

Bradford Stephens, founder of Drawn to Scale, discusses big data systems that work in "user time."