"cloudera" entries

The human side of Hadoop

Doug Cutting on applications of Hadoop, where "Hadoop" comes from, and the new partnership between Cloudera and O'Reilly.

Roger Magoulas, director of market research at O’Reilly and Strata co-chair, recently sat down with Doug Cutting, chief architect at Cloudera, to talk about the new partnership between Cloudera and O’Reilly, and the state of the Hadoop landscape.

Cutting shares interesting applications of Hadoop, several of which have touching human elements. For instance, he tells a story about visiting Children’s Healthcare of Atlanta and discovering the staff using Hadoop to reduce stress in babies. Read more…

Get Hadoop, Hive, and HBase Up and Running in Less Than 15 Minutes

OSCON 2013 Speaker Series

If you have delved into Apache Hadoop and related projects, you know that installing and configuring Hadoop is hard. Often, a minor mistake made while installing or configuring from messy tarballs will lurk for a long time until some otherwise innocuous change to the system or workload causes difficulties. Moreover, there is little to no integration testing among the different projects in the ecosystem (e.g., Hadoop, Hive, HBase, and ZooKeeper). Apache Bigtop is an open source project aimed at bridging exactly those gaps by:

1. Making it easier for users to deploy and configure Hadoop and related projects on their bare metal or virtualized clusters.

2. Performing integration testing among various components in the Hadoop ecosystem.

More about Apache Bigtop

The primary goal of Apache Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc.) developed by a community with a focus on the system as a whole, rather than individual projects.

The latest released version of Apache Bigtop is Bigtop 0.5, which integrates the latest versions of various projects including Hadoop, Hive, HBase, Flume, Sqoop, Oozie and many more! The supported platforms include CentOS/RHEL 5 and 6, Fedora 16 and 17, SuSE Linux Enterprise 11, OpenSuSE 12.2, Ubuntu LTS Lucid and Precise, and Ubuntu Quantal.

Who uses Bigtop?

Folks who use Bigtop fall into two major categories: those who leverage Bigtop to power their own Hadoop distributions, and those who use Bigtop for deployment purposes.

In alphabetical order, they are:
Read more…

Strata Week: Real-time Hadoop

Cloudera ventures into real-time queries with Impala, data centers are the new landfill, and Jesper Andersen looks at the relationship between art and data.

Here are a few stories from the data space that caught my attention this week.

Cloudera’s Impala takes Hadoop queries into real-time

Cloudera ventured into real-time Hadoop querying this week, opening up its Impala software platform. As Derrick Harris reports at GigaOm, Impala, an SQL query engine, doesn’t rely on MapReduce, making it faster than tools such as Hive. Cloudera estimates its queries run 10 times faster than Hive, and Charles Zedlewski, Cloudera’s VP of products, told Harris that “small queries can run in less than a second.”

Harris notes that Zedlewski pointed out that Impala wasn’t designed to replace business intelligence (BI) tools, and that “Cloudera isn’t interested in selling BI or other analytic applications.” Rather, Impala serves as the execution engine, still relying on software from Cloudera partners — Zedlewski told Harris, “We’re sticking to our knitting as a platform vendor.”

Joab Jackson at PC World reports that “[e]ventually, Impala will be the basis of a Cloudera commercial offering, called the Cloudera Enterprise RTQ (Real-Time Query), though the company has not specified a release date.”

Impala has plenty of competition on this playing field, which Harris also covers, and he notes the significance of all the recent Hadoop innovation:

“I can’t underscore enough how critical all of this innovation is for Hadoop, which in order to add substance to its unparalleled hype needed to become far more useful to far more users. But the sudden shift from Hadoop as a batch-processing engine built on MapReduce into an ad hoc SQL querying engine might leave industry analysts and even Hadoop users scratching their heads.”

You can read more from Harris’ piece here and Jackson’s piece here. Wired also has an interesting piece on Impala, covering the Google F1 database upon which it is based and the Googler Cloudera hired away to help build it.

(Cloudera CEO Mike Olson discussed Impala, Hadoop and the importance of real-time at this week’s Strata Conference + Hadoop World.)

Read more…

Four short links: 25 October 2012

Big Data's Big Picture, Real-Time Queries, Real-Time Queries, Single-Process Real-Time Queries

  1. Big Data: the Big Picture (Vimeo) — Jim Stogdill’s excellent talk: although Big Data is presented as part of the Gartner Hype Cycle, it’s an epoch of the Information Age which will have significant effects on the structure of corporations and the economy.
  2. Impala (github) — Cloudera’s open source (Apache) implementation of Google’s F1 (PDF), for realtime queries across clusters. Impala is different from Hive and Pig because it uses its own daemons that are spread across the cluster for queries. Furthermore, Impala does not leverage MapReduce, allowing Impala to return results in real time. (via Wired)
  3. druid (github) — an open source (GPLv2) distributed, column-oriented analytical datastore. It was originally created to resolve query latency issues seen with trying to use Hadoop to power an interactive service. See also the announcement of its open-sourcing.
  4. Supersonic (Google Code) — an ultra-fast, column oriented query engine library written in C++. It provides a set of data transformation primitives which make heavy use of cache-aware algorithms, SIMD instructions and vectorised execution, allowing it to exploit the capabilities and resources of modern, hyper pipelined CPUs. It is designed to work in a single process. Apache-licensed.

Strata Week: Data prospecting with Kaggle

Kaggle now accepting data before a contest, HP's Autonomy purchase comes into focus, Cloudera's new Hadoop distribution.

In this week's data news, Kaggle launches Prospect, HP unveils its big data plans, and Cloudera releases CDH4 (the latest version of its Hadoop distribution).

Strata Week: A .data TLD?

A proposal for a .data TLD, flavors of Hadoop, and a vote for pseudonymous commenters.

In this week's data news, Stephen Wolfram calls for a .data top-level domain and Cloudera responds to Hadoop version 1.0.

Strata Week: Simplifying MapReduce through Java

MapReduce gets easier, a new search engine for data, and now you can monitor the universe's forces on your phone.

Cloudera's Crunch hopes to make MapReduce easier, Datafiniti launches a search engine for data, and the University of Oxford releases an Android app for monitoring CERN data.

Strata Week: Oracle’s big data play

Oracle unveils its big data appliance, the Hadoop community gauges contributions.

In this week's data news, Oracle unveils its big data strategy, and Cloudera looks at the contributions to the Hadoop core and community.

Strata Gems: Whirr makes Hadoop and Cassandra a snap

Get control over cloud resources

The cloud makes clusters easy, but for rapid prototyping purposes, bringing up clusters still involves quite a bit of effort. The Whirr project makes cloud control simple.