Working in the Hadoop Ecosystem

Working with big data and open source software

I recently sat down with Mark Grover (@mark_grover), a Software Engineer at Cloudera, to talk about the Hadoop ecosystem. He is a committer on Apache Bigtop and a contributor to Apache Hadoop, Hive, Sqoop, and Flume. He also contributed to O’Reilly Media’s Programming Hive title.

Key highlights include:

  • Mark spends a lot of time in and around the Hadoop ecosystem, so I asked him to provide an overview of the environment and explain why someone would want to use these tools. He tells us how Hadoop has applications in the finance, marketing, advertising, and healthcare industries, and how it has completely changed the way data is mined and used. [Discussed at 0:24]
  • Hadoop is a step in the right direction to handle big data regardless of whether it’s structured or unstructured. [Discussed at 1:32]
  • While Hadoop offers a lower cost per terabyte, its flexibility in handling today’s increasing amounts of unstructured data makes it a worthwhile big data environment regardless of cost. [Discussed at 2:39]
  • Mark gives examples of Hadoop’s use in cancer research and suicide prevention. [Discussed at 4:10]
  • How to get support from the community, including vendors, enthusiasts, and developers. [Discussed at 5:26]
  • Finding and choosing a vendor. [Discussed at 6:46]
  • Creating a test environment. [Discussed at 7:40]
  • How to get involved in an open source project. [Discussed at 8:44]
You can view the full interview here:



  • Arthur

    Nicely said: “Hadoop is an ecosystem”. Totally agree. By the way, Hortonworks’ VM-based Hadoop Sandbox is an awesome starting point for getting to know Hadoop and Big Data.

    • romaintech

      Check out the ecosystem piece where the Sandbox comes from: ;) It also comes with tutorials!