Google Summer of Code opportunities in data science and machine learning with Ganglia


As mentioned in my blog post on Monday, the Ganglia Project is proud to be part of Google Summer of Code in 2014.

The Ganglia team is offering a variety of projects, and the different parts of Ganglia welcome students with different skills. For example:

Component                                          Skills
gmond agent                                        C
gmond modules                                      C or Python
JMXetric                                           Java
gmetad and rrdtool (time-series storage)           C, R
Ganglia web interface                              JavaScript and jQuery
Ganglia integration (e.g. ganglia-nagios-bridge)   Python

Big data right under your nose

I have had many queries from students about how to get into data science.

Very few students will be lucky enough to get an internship where they can study time series from the financial markets and experiment with trading algorithms of their own.

On the other hand, network performance data is everywhere. It is real-time. It is surprisingly similar in some ways to processing financial data and it provides excellent opportunities for students to practice data science skills and make a meaningful contribution to solving real problems.

Finding public Ganglia data with Google

Many large organizations, including universities, governments and corporations, use Ganglia to gather metrics from all the hosts in their networks. Some of them even expose this data to the public. Here are two Google searches you can use to find them:


[Image: network load graph, courtesy of Université Montpellier 2, France]

Some sites may even expose their data as an XML feed. You can try to extract it by connecting to the Ganglia server on one of these ports:

Port   Comments
8649   gmond: sends an XML snapshot to anybody who connects
8651   gmetad: sends an XML snapshot to anybody who connects
8652   gmetad: interactive port; works a little like HTTP, returning a subset of the XML snapshot in response to a request
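To illustrate, here is a minimal Python sketch of reading a snapshot from one of these ports. Both gmond (8649) and gmetad (8651) simply dump the XML to any client that connects and then close the connection, so a plain TCP read is enough; the hostname in the usage comment is hypothetical.

```python
import socket

def fetch_snapshot(host, port=8649, timeout=10):
    """Connect and read until the server closes the connection.

    gmond (8649) and gmetad (8651) both send the whole XML snapshot
    to any client that connects, then close the socket."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the socket: snapshot is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

# Example usage (hypothetical host):
# xml = fetch_snapshot("ganglia-reports.example.edu", 8649)
```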

You can discover a Ganglia environment on your campus by looking for a gmond process on your machine and the gmond.conf file, often found at /etc/gmond.conf or /etc/ganglia/gmond.conf. That file may contain a clue about the name of the host where Ganglia data is aggregated:

udp_send_channel {
  host = ganglia-reports.example.edu
  port = 8649
  ttl = 1
}

This tells you that the host ganglia-reports.example.edu is collecting the data. You could try the URL http://ganglia-reports.example.edu/ganglia/ in a web browser, or try connecting to one of the TCP ports 8649, 8651 or 8652 on that host. Here is an example with netcat:

$ nc ganglia-reports.example.edu 8649 | grep '<HOST'

This returns one line for every host that Ganglia knows about.
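The same extraction can be done in Python by parsing the XML snapshot. The sketch below runs on a small hand-written sample document: the element and attribute names (HOST, NAME, and so on) follow the gmond XML schema, but the sample hostnames and values are invented.

```python
import xml.etree.ElementTree as ET

# A trimmed, hand-written sample of a gmond XML snapshot.
SAMPLE = """<GANGLIA_XML VERSION="3.6.0" SOURCE="gmond">
<CLUSTER NAME="campus" LOCALTIME="1394000000" OWNER="unspecified" LATLONG="" URL="">
<HOST NAME="web01.example.edu" IP="10.0.0.1" REPORTED="1394000000" TN="10" TMAX="20" DMAX="0" LOCATION="" GMOND_STARTED="0">
<METRIC NAME="load_one" VAL="0.42" TYPE="float" UNITS="" TN="10" TMAX="70" DMAX="0" SLOPE="both"/>
</HOST>
<HOST NAME="db01.example.edu" IP="10.0.0.2" REPORTED="1394000000" TN="12" TMAX="20" DMAX="0" LOCATION="" GMOND_STARTED="0"/>
</CLUSTER>
</GANGLIA_XML>"""

root = ET.fromstring(SAMPLE)
hosts = [host.get("NAME") for host in root.iter("HOST")]
print(hosts)  # ['web01.example.edu', 'db01.example.edu']
```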

Once you have a data feed, you can then configure a gmetad process on your own system to poll the remote system and generate local RRDs for you to study.
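As a sketch, a data_source line like the following in your local gmetad.conf (often /etc/ganglia/gmetad.conf) would poll the remote host every 30 seconds; the hostname here is the hypothetical one from the example above, and the cluster label is arbitrary:

```
# gmetad.conf: poll a remote gmond every 30 seconds
data_source "remote campus" 30 ganglia-reports.example.edu:8649
```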

Install your own Ganglia

It is very easy to get your own Ganglia setup.

On a Debian or Ubuntu system, just do:

# apt-get update
# apt-get install ganglia-monitor ganglia-webfrontend

On Fedora and RPM-based systems (such as CentOS or RHEL with EPEL) you can do:

# yum install ganglia-gmond ganglia-web

Everything should be autoconfigured. You can then browse to http://localhost/ganglia to see the charts.

If you have several hosts running the gmond agent (just the ganglia-monitor .deb or ganglia-gmond .rpm) on the same LAN, they will automatically find each other using multicast, and you will see an aggregated report on the machine running the web server.
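This auto-discovery works because the stock gmond.conf ships with matching multicast send and receive channels along these lines (the group address 239.2.11.71 is the upstream default, but your distribution's packaging may differ):

```
udp_send_channel {
  mcast_join = 239.2.11.71
  port = 8649
  ttl = 1
}

udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
}
```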

The data is real-time

It is important to keep in mind that the data is real-time. This means you can often detect problems in real-time. If this blog appears on Slashdot, for example, then the image from Université Montpellier 2 will be hit many times. The image actually shows the network load on the web server producing it, so you will see the Slashdot effect graphically in the image itself.

Processing real-time data is often the most advanced step in any data science exercise. Initially, you may simply log a few days of data to RRD files and start studying a static data set with your tool of choice, whether that is the R project, Weka or Hadoop.

Once you have a hypothesis (for example, an algorithm that models the normal characteristics of each metric), you can take each new real-time value from the gmetad XML and test it against the algorithm. The algorithm would then raise an alert whenever a metric on any host deviates from its normal behavior.
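One way to sketch such an algorithm, assuming the metric values have already been extracted from the XML, is a rolling z-score per metric. The window size, warm-up length and threshold below are arbitrary starting points, not tuned values:

```python
import math
from collections import deque

class MetricWatch:
    """Flag values that deviate sharply from a metric's recent history."""

    def __init__(self, window=60, threshold=3.0, warmup=10):
        self.history = deque(maxlen=window)  # recent values for this metric
        self.threshold = threshold           # z-score that counts as abnormal
        self.warmup = warmup                 # don't alert until we have history

    def check(self, value):
        """Return True if value is an outlier, otherwise record it."""
        alert = False
        if len(self.history) >= self.warmup:
            mean = sum(self.history) / len(self.history)
            var = sum((x - mean) ** 2 for x in self.history) / len(self.history)
            std = math.sqrt(var)
            if std > 0 and abs(value - mean) / std > self.threshold:
                alert = True
        if not alert:
            # Only fold normal readings into the baseline, so a single
            # spike does not immediately redefine "normal".
            self.history.append(value)
        return alert

watch = MetricWatch()
for v in [0.4, 0.6] * 10:   # a stable load_one baseline
    watch.check(v)
print(watch.check(50.0))    # a sudden spike -> True
print(watch.check(0.5))     # back to normal -> False
```

In a real deployment you would keep one MetricWatch per (host, metric) pair and feed it each new value as the XML feed is polled.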

Mixing in other sources of data

Depending upon the computing environment in your campus or organization, you may also be able to get other data sources, such as a list of people logged in to different machines at different times and the processes that each user starts and stops.

This might help to make more accurate predictions about when network or computing resources will be under stress. For example, if users bob, alice and eve all appear on the same host, your algorithm might conclude that the load average will reach an excessive level within 15 minutes and send those three users a suggestion to each try other machines.

Making a successful application for GSoC 2014

Here are some tips: