Google Summer of Code opportunities in data science and machine learning with Ganglia

21:19 Fri, 28 Feb 2014

As mentioned in my blog on Monday, the Ganglia Project is proud to be part of Google Summer of Code in 2014.

The Ganglia team are offering several types of project, and different parts of Ganglia would welcome students with different skills. For example:

Component	Skills
gmond agent	C
gmond modules	C or Python
JMXetric	Java
gmetad and rrdtool for storing time series data	C, R
Ganglia web interface	JavaScript and jQuery
Ganglia integration (e.g. ganglia-nagios-bridge)	Python

Big data right under your nose

I have had many queries from students about how to get into data science.

Very few students will be lucky enough to get an internship where they can study time-series from the financial markets and experiment making their own trading algorithms.

On the other hand, network performance data is everywhere. It is real-time. It is surprisingly similar in some ways to processing financial data and it provides excellent opportunities for students to practice data science skills and make a meaningful contribution to solving real problems.

Finding public Ganglia data with Google

Many large organizations, including universities, governments and corporations, are using Ganglia to gather metrics from all the hosts in their networks. Some of them even expose this data to the public. Here are two Google searches you can use to find them:

Courtesy of Université Montpellier 2, France

Some sites may even expose their data as an XML feed. You can try to extract it by connecting to the Ganglia server on one of these ports:

Port	Comments
8649	gmond: sends an XML snapshot to anybody who connects
8651	gmetad: sends an XML snapshot to anybody who connects
8652	gmetad: works a little bit like HTTP, returns a subset of the XML snapshot when you make a GET request

You can discover a Ganglia environment on your campus by looking for a gmond process on your machine and for the gmond.conf file, often found at /etc/gmond.conf or /etc/ganglia/gmond.conf. That file may contain a clue about the name of the host where Ganglia data is aggregated:

udp_send_channel {
  host = ganglia-reports.example.edu
  port = 8649
  ttl = 1
}

This tells you that the host ganglia-reports.example.edu is collecting the data - you could try the URL http://ganglia-reports.example.edu/ganglia/ in a web browser or try connecting to one of the TCP ports 8649, 8651 or 8652 on that host. Here is an example with netcat:

$ nc ganglia-reports.example.edu 8649 | grep ^.H

It will return a list of all hosts that Ganglia knows about.
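If you prefer to script this, here is a minimal Python sketch of the same lookup. It assumes the port sends the XML snapshot and then closes the connection, and that each host appears as a HOST element with a NAME attribute (which is what the grep above relies on). The function names fetch_snapshot and host_names are my own, not part of Ganglia.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_snapshot(host, port=8649, timeout=10.0):
    """Read everything the server sends until it closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # an empty read means the peer closed the socket
                break
            chunks.append(data)
    return b"".join(chunks)


def host_names(xml_bytes):
    """Return the sorted NAME attributes of all HOST elements."""
    root = ET.fromstring(xml_bytes)
    return sorted(h.get("NAME") for h in root.iter("HOST"))
```

Calling host_names(fetch_snapshot("ganglia-reports.example.edu")) should give roughly the same list as the netcat command, using the placeholder host name from the example above.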

Once you have a data feed, you can then configure a gmetad process on your own system to poll the remote system and generate local RRDs for you to study.

Install your own Ganglia

It is very easy to get your own Ganglia setup.

On a Debian or Ubuntu system, just do:

# apt-get update
# apt-get install ganglia-monitor ganglia-webfrontend

On Fedora and RPM-based systems (such as CentOS or RHEL with EPEL) you can do:

# yum install ganglia-gmond ganglia-web

Everything should be autoconfigured. You can then browse to http://localhost/ganglia to see the charts.

If you have several hosts with the gmond agent (just the ganglia-monitor.deb or ganglia-gmond.rpm) on the same LAN, they will automatically find each other using multicast and you will see an aggregated report on the machine with the web server.

The data is real-time

It is important to keep in mind that the data is real-time. This means you can often detect problems in real-time. If this blog appears on Slashdot, for example, then that image from Université Montpellier 2 will be hit many times. The image actually shows the network load on the web server producing the image, so you will see the Slashdot effect graphically in the image itself.

Processing real-time data is often the most advanced step in any data science exercise. Initially, you may simply log a few days of data to RRD files to start studying a static data set with your tool of choice, whether it is the R project, Weka or Hadoop.
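One way to get a static data set out of an RRD file is to dump it as text with a command such as "rrdtool fetch FILE.rrd AVERAGE" and parse the result. The Python sketch below assumes the usual text layout of that command: a first line with the data source names, then one line per step in the form "timestamp: value value ...", with unknown samples printed as nan. The function names are hypothetical, and the data source name "sum" in the sample is an assumption about how your RRDs are laid out.

```python
import math


def parse_rrdtool_fetch(text):
    """Parse `rrdtool fetch` text output into (names, rows).

    rows is a list of (timestamp, [values...]) tuples.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    names = lines[0].split()  # header line: data source names
    rows = []
    for line in lines[1:]:
        stamp, _, values = line.partition(":")
        rows.append((int(stamp), [float(v) for v in values.split()]))
    return names, rows


def drop_unknown(rows, column=0):
    """Keep (timestamp, value) pairs whose value in `column` is not NaN."""
    return [(t, vals[column]) for t, vals in rows
            if not math.isnan(vals[column])]
```

The resulting list of (timestamp, value) pairs can then be written to CSV or loaded into whichever tool you prefer.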

Once you have a hypothesis (for example, an algorithm that understands the normal characteristics of each metric) you may then take each new real-time value from the gmetad XML and test it with the algorithm. The algorithm would then raise an alert if any metric on any host deviates from its normal behavior.
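As a starting point, here is a deliberately simple Python sketch of such an algorithm. It treats "normal" as the mean and standard deviation of the most recent values for each host/metric pair and flags anything more than a few standard deviations away. This is only one possible baseline, not something Ganglia provides; the class name, window size, threshold and minimum sample count are all assumptions you would want to tune.

```python
import math
from collections import defaultdict, deque


class BaselineDetector:
    """Flag values far from the recent mean of the same host/metric pair."""

    def __init__(self, window=240, threshold=3.0, min_samples=30):
        self.threshold = threshold
        self.min_samples = min_samples
        # One bounded history per (host, metric) key.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, host, metric, value):
        """Return True if value deviates from the baseline, then record it."""
        past = self.history[(host, metric)]
        alert = False
        if len(past) >= self.min_samples:
            mean = sum(past) / len(past)
            std = math.sqrt(sum((x - mean) ** 2 for x in past) / len(past))
            # A perfectly flat history has std == 0; this sketch never
            # alerts in that case rather than dividing by zero.
            alert = std > 0 and abs(value - mean) > self.threshold * std
        past.append(value)
        return alert
```

You would call observe(host, metric, value) for every metric value read from the XML on each polling cycle. A real detector would probably also need to cope with daily and weekly cycles, which a plain rolling mean ignores.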

Mixing in other sources of data

Depending upon the computing environment on your campus or in your organization, you may also be able to get other data sources, such as a list of people logged in to different machines at different times and the processes that each user starts and stops.

This might help to make more accurate predictions about when network or computing resources will be under stress. For example, if users bob, alice and eve all appear on the same host, your algorithm might conclude that the load average will reach an excessive level within 15 minutes and send those three users a suggestion that each of them try another machine.
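To make the idea concrete, here is a very naive Python sketch. It assumes you have already built a history of pairs (users logged in at some moment, load average 15 minutes later) and it predicts the load for a group of users as the mean of the past samples in which all of them were present. The function names, the data layout and the load limit are hypothetical, and a serious model would need far more than a conditional mean.

```python
from statistics import mean


def predict_load(history, users):
    """Mean later load over past samples where all of `users` were logged in.

    history: iterable of (set_of_users, load_15_minutes_later) pairs.
    Returns None when that combination of users has never been seen.
    """
    wanted = set(users)
    matches = [load for present, load in history if wanted <= set(present)]
    return mean(matches) if matches else None


def users_to_warn(history, users, limit):
    """Return the sorted users to notify if the predicted load exceeds limit."""
    predicted = predict_load(history, users)
    if predicted is not None and predicted > limit:
        return sorted(users)
    return []
```

With such a history, users_to_warn(history, {"bob", "alice", "eve"}, limit=8.0) returns the three names only when past samples with all three logged in averaged above the limit.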

Making a successful application for GSoC 2014

Here are some tips:

For all organizations/projects
- Make sure you have a profile on sites like GitHub (here is mine) and include the link to such profiles in your emails to mentors. If you have nothing else to publish, then upload some of your assignments/projects from class. Find open bugs in other free software projects on GitHub and try contributing pull requests with fixes.
- Find any other links that show a history of your activity in the free software community, for example, this link shows all the feedback I have submitted to the Debian bug tracker - find similar links to demonstrate your own activities. Send these links to your desired mentor.
- Make a blog - write about your experiences with free software. Send us a link.
- Look for free software events in your area. Attend a Linux User Group meeting.
- Create a PGP key and find developers to sign it for you.
- If you are female, look for some of the dedicated groups for women in computing, such as the Debian women initiative. Find out about their events (such as the Debian women's mini-DebConf in Barcelona). Ask if anybody in the group can provide feedback about your GSoC application.
For Ganglia and the data science project in particular
- Join the ganglia-general mailing list and send an email to introduce yourself
- Try Ganglia on your own Linux system. Use the packages; it is really easy. Send an email to the list with any questions.
- Explore the source code on GitHub - ask us questions about it. For the data science project, you may also need to look at the RRDtool source code and the documentation about making a plugin for R (using C).
- You don't have to do it our way: if you prefer to work with another tool instead of R, please tell us your idea
- Write some skeleton code or make a diagram to explain what you want to do. While you do this, do you think of any new questions or problems? Make a list of them.
- We want to give every student a small coding task as a test. Please tell us which language you prefer (e.g. C, Java, Python) so we can give you a suitable test. If you are really keen, follow the link to bugs I reported in Debian, look for one that is easy and try to write a small patch for it - helping fix bugs that annoy your mentor is likely to be a good way to get on the short-list for selection.