ENTRIES TAGGED "apache"
Hadoop, Sqoop, and ZooKeeper
Kathleen Ting (@kate_ting), Technical Account Manager at Cloudera, and our own Andy Oram (@praxagora) sat down to discuss how to work with structured and unstructured data, as well as how to keep a system that is crunching that data up and running.
Key highlights include:
- Misconfigurations account for almost half of the support issues that the team at Cloudera is seeing [Discussed at 0:22]
- ZooKeeper, the canary in the Hadoop coal mine [Discussed at 1:10]
- Leaky clients are often a problem ZooKeeper detects [Discussed at 2:10]
- Sqoop is a bulk data transfer tool [Discussed at 2:47]
- Sqoop helps to bring together structured and unstructured data [Discussed at 3:50]
- ZooKeeper is not for storage, but for coordination, reliability, and availability [Discussed at 4:44]
You can view the full interview here:
OSCON 2013 Speaker Series
If you have delved into Apache Hadoop and related projects, you know that installing and configuring Hadoop is hard. Often, a minor mistake during installation or configuration with messy tarballs will lurk for a long time until some otherwise innocuous change to the system or workload causes difficulties. Moreover, there is little to no integration testing among different projects (e.g., Hadoop, Hive, HBase, ZooKeeper, etc.) in the ecosystem. Apache Bigtop is an open source project aimed at bridging exactly those gaps by:
1. Making it easier for users to deploy and configure Hadoop and related projects on their bare metal or virtualized clusters.
2. Performing integration testing among various components in the Hadoop ecosystem.
More about Apache Bigtop
The primary goal of Apache Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc.) developed by a community with a focus on the system as a whole, rather than individual projects.
The latest released version of Apache Bigtop is Bigtop 0.5, which integrates the latest versions of various projects including Hadoop, Hive, HBase, Flume, Sqoop, Oozie and many more! The supported platforms include CentOS/RHEL 5 and 6, Fedora 16 and 17, SuSE Linux Enterprise 11, OpenSuSE 12.2, Ubuntu LTS Lucid and Precise, and Ubuntu Quantal.
Who uses Bigtop?
Folks who use Bigtop can be divided into two major categories. The first category consists of those who leverage Bigtop to power their own Hadoop distributions. The second consists of those who use Bigtop for deployment purposes.
In alphabetical order, they are:
OSCON 2013 Speaker Series
Automating the configuration management of your operating systems and the rollout of your applications is one of the most important things an administrator or developer can do to avoid surprises when updating services, scaling up, or recovering from failures. However, it’s often not enough. Some of the most common operations that happen in your datacenter (or cloud environment) involve large numbers of machines working together and humans to mediate those processes. While we have been able to remove a lot of human effort from configuration, there has been a lack of software able to handle these higher-level operations.
I used to work for a hosted web application company where the IT process for executing an application update involved locking six people in a room for sometimes 3-4 hours, each person pressing the right buttons at the right time. This process almost always had a glitch: someone forgot to run the right command, or something wasn’t well tested beforehand. While some technical solutions were applied to handle configuration automation, none of the tools that could perform configuration could also handle the high-level choreography on top. This is why I wrote Ansible.
Ansible is a configuration management, application deployment, and IT orchestration system. One of Ansible’s strong points is its very simple, human-readable language, which gives users fine, precise control over what happens on which machines at what times.
To get started, create an inventory file, for instance ~/ansible_hosts, that defines which machines you are managing. Machines are frequently organized into groups. Ansible can also pull inventory from multiple cloud sources, but an inventory file is a quick way to get started:
# add more webservers here
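A minimal sketch of such an inventory file in Ansible's INI format might look like the following. The hostnames and the group names (webservers, lbservers) are hypothetical placeholders, not taken from the original post:

```ini
# ~/ansible_hosts (hypothetical hosts; replace with your own)

# Machines that serve the web application
[webservers]
web1.example.com
web2.example.com

# Load balancers that sit in front of the web servers
[lbservers]
lb1.example.com
```

Assuming Ansible is installed, the file can be passed with the `-i` option, for example `ansible -i ~/ansible_hosts webservers -m ping` to check connectivity to every host in the webservers group.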
Now that you have defined what machines you are managing, you have to define what you are going to do on the remote machines.
Ansible calls this description of processes a “playbook.” You don’t have to have just one; you could have different playbooks for different kinds of tasks.
Let’s look at an example for describing a rolling update process. This example is somewhat involved because it’s using haproxy, but haproxy is freely available. Ansible also includes modules for dealing with Netscalers and F5 load balancers, so this is just an example — ordinarily you would start more simply and work up to an example like this:
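As an illustration of the rolling update pattern, here is a minimal sketch rather than the author's original playbook. It assumes inventory groups named webservers and lbservers, a hypothetical package called myapp, a hypothetical haproxy backend called app, and the `community.general.haproxy` module from the community.general collection. It also assumes the server names in the haproxy configuration match the inventory hostnames:

```yaml
---
# Rolling update sketch. Group names, the package name, the backend name,
# and the stats socket path are all assumptions for illustration.
- hosts: webservers
  serial: 1              # work through the web servers one at a time
  become: true

  pre_tasks:
    # Run on each load balancer, on behalf of the web server being updated.
    - name: Take this web server out of the haproxy pool
      community.general.haproxy:
        state: disabled
        host: "{{ inventory_hostname }}"
        backend: app                        # hypothetical backend name
        socket: /var/lib/haproxy/stats      # assumed stats socket path
      delegate_to: "{{ item }}"
      loop: "{{ groups['lbservers'] }}"

  tasks:
    - name: Update the application package
      ansible.builtin.package:
        name: myapp                         # hypothetical package name
        state: latest

  post_tasks:
    - name: Put this web server back into the haproxy pool
      community.general.haproxy:
        state: enabled
        host: "{{ inventory_hostname }}"
        backend: app
        socket: /var/lib/haproxy/stats
      delegate_to: "{{ item }}"
      loop: "{{ groups['lbservers'] }}"
```

Setting `serial: 1` is what makes the update rolling: each web server is drained, updated, and restored before the next one is touched, so the rest of the pool keeps serving traffic. If saved as, say, rolling_update.yml (a hypothetical filename), it could be run with `ansible-playbook -i ~/ansible_hosts rolling_update.yml`.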
Key open source considerations for businesses, communities and developers.
OSCON’s theme last year was “from disruption to default.” Over the last decade, we’ve seen open source shift from the shadows to the limelight. Today, more businesses than ever are considering the role of open source in their strategies. I’ve had the chance to watch and participate in the transitions of numerous businesses and business units to using open source for the first time, as well as observing how open source strategies evolve for software businesses, both old and new.
In the view of many, open source is the pragmatic expression of the ethical idea of “software freedom,” articulated in various ways for several decades by communities around both Richard Stallman’s GNU Project and the BSD project. The elements of open source and free software are simple to grasp; software freedom delivers the rights to use, study, modify and distribute software for any purpose, and the Open Source Definition clarifies one area of that ethical construct with pragmatic rules that help identify copyright licenses that promote software freedom. But just as simple LEGO bricks unlock an infinite world of creativity, so these open source building blocks offer a wide range of usage models, which are still evolving.
This paper offers some thinking tools for those involved in the consideration and implementation of open source strategies, both in software-consuming organizations and among software creators. It aims to equip you with transferable explanations for some of the concepts your business leaders will need to consider. It includes:
- A model for understanding the different layers of community that can form around an open source code “commons” and how you should (and should not) approach them.
- An exploration of the symbiotic relationship of transparency and privacy in open source communities.
- An explanation of where customer value comes from in enterprise open source, which illuminates the problems with “open core” strategies for communities and customers.
- A reflection on the principle that can be seen at work across all these examples: “trade control for influence”
Apache adds to their donated portfolio and your travel-patent guide to East Texas.
In the latest Developer Week in Review: Apache gets a gift of code from IBM, and a handy patent / travel guide for your next trip to East Texas.
iPhone devs may need lawyers, Apache gets a new project, and Java programmers abuse a pattern
If you were an iOS developer, you may have gotten to meet a process server in person this week, as Lodsys doles out the first batch of lawsuits. Oracle gave Apache the keys to OpenOffice, and told them to take it out for a spin, and your faithful editor vents about a commonly overused Java pattern.
Tomcat purrs, Amazon dictates, and HTML5 brands
In this edition of Developer Week in Review: there's a new Tomcat in town; Amazon sets app prices; and HTML5 may be a work in progress, but now it's got a logo.
Intel opens an app store, Apache fumes over Java, old software Microsoft should open source, Apple updates on the way
In this edition of Developer Week in Review: Intel opens an app store, Apache is peeved at Oracle, Microsoft open sources a language you've probably never heard of, and Radar detects an incoming salvo of point-releases from Apple.
Apache co-founder Brian Behlendorf discusses the CONNECT health data project.
In this podcast interview, Apache co-founder Brian Behlendorf discusses the CONNECT project and the role data can play in improving patient care and the medical system.