Automatically building Java projects


Another of the GSoC project areas I have offered to mentor involves automatically and recursively building Java projects.

Why is this important?

Recently, I decided to start using travis-ci to automatically build some of my Java projects.

One of the projects, JMXetric, depends on another, gmetric4j. When travis-ci builds JMXetric, it needs to look in Maven repositories to find gmetric4j (and also remotetea / oncrpc.jar and junit). The alternative to fetching dependencies from repositories is to stop using Maven and ship their binary JARs in the JMXetric repository itself, which is undesirable for several reasons: it bloats the repository, obscures the provenance of the binaries, and turns dependency updates into a manual chore.
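For context, this kind of dependency is declared in JMXetric's pom.xml roughly as follows (the coordinates and version shown are illustrative; check the actual POM for the exact values):

```xml
<dependencies>
  <!-- Maven resolves this from a repository at build time,
       instead of a binary JAR checked into the source tree -->
  <dependency>
    <groupId>info.ganglia.gmetric4j</groupId>
    <artifactId>gmetric4j</artifactId>
    <version>1.0.10</version>
  </dependency>
</dependencies>
```

With the dependency expressed this way, travis-ci only needs `mvn package` to build the project, provided the artifact actually exists in a repository it can reach.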

Therefore, I submitted gmetric4j to the Maven Central repository using the Sonatype Nexus service. One discovery I made in this process disturbed me: Sonatype lets me sign and upload my binary JAR, but they don't build it from source themselves, so users of the JAR have no way to know that it really is built from the source I publish on Github.
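The upload workflow amounts to GPG-signing the artifacts and deploying them. A minimal sketch of the relevant pom.xml fragment, assuming the standard maven-gpg-plugin (the plugin version shown is illustrative):

```xml
<build>
  <plugins>
    <!-- Sign every attached artifact (JAR, sources, javadoc, POM)
         during the verify phase, before deployment -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-gpg-plugin</artifactId>
      <version>1.6</version>
      <executions>
        <execution>
          <id>sign-artifacts</id>
          <phase>verify</phase>
          <goals>
            <goal>sign</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

Note what the signature proves: only that the uploader holds the key. Sonatype verifies the signature, not that the binary corresponds to any published source.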

In fact, as anybody who works with Java knows, there is no shortage of Java projects out there that have a mix of binary dependencies in their source tree, sometimes without any easy way to find the dependency sources. HermesJMS is one popular example that is crippled by the inclusion of some JARs that are binary-only.

No silver bullet, but there is hope

Although there are now tens of thousands of JAR libraries out there in repositories and projects that depend on them (and their transitive dependencies and build tools and such), there is some hope:

With that in mind, I'm hopeful that a system could be developed to scrape some of these data sources to find some source code and properly build some subset of the thousands of JARs available in the Maven Central Repository.
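As a very rough sketch of the scraping idea: the standard Maven repository layout makes it possible to compute where an artifact's `-sources.jar` would live, so a crawler could probe those URLs to discover which artifacts publish source at all. The class and method names below are mine, invented for illustration:

```java
// Sketch: build the Maven Central URL where an artifact's sources JAR
// would live under the standard repository layout, so a crawler can
// probe it. Coordinates used in main() are illustrative.
public class SourceProbe {

    static final String CENTRAL = "https://repo1.maven.org/maven2/";

    // Layout: group/as/dirs/artifactId/version/artifactId-version-sources.jar
    static String sourcesUrl(String groupId, String artifactId, String version) {
        return CENTRAL
                + groupId.replace('.', '/') + "/"
                + artifactId + "/"
                + version + "/"
                + artifactId + "-" + version + "-sources.jar";
    }

    public static void main(String[] args) {
        System.out.println(sourcesUrl("info.ganglia.gmetric4j", "gmetric4j", "1.0.10"));
    }
}
```

A crawler would issue an HTTP HEAD request against each such URL; a missing sources JAR is an immediate candidate for the "cannot rebuild from source" pile.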

But why bother if you can't completely succeed?

One recent post on maven-users suggested that because nobody would ever be able to build 100% of JARs from source, the project is already doomed to failure.

Personally, I feel it is quite the opposite: failing to build some JARs from source is itself useful data. It pinpoints the hierarchies of JARs that are not really free software, increases pressure on their publishers to meet the standards people reasonably expect for source distribution, and raises a red flag that helps dependent projects stop using them.

On top of that, the confirmation of true free-software status for many other JARs will make it safer for people to rely on them, package them and distribute them in various ways.

Dumping a gazillion new Java packages into Debian

Just to clear up one point: being able to automatically build JARs from source (or a chain of dependencies involving hundreds of them) doesn't mean they will be automatically uploaded to the official Debian archive by a Debian Developer (DD).

Having this view of the source will make it easier for a DD to look at a set of JARs and decide if they are suitable for packaging, but there would still be some manual oversight involved. The tool would simply take out some of the tedious manual steps (where possible) and give the DD the ability to spot traps (JARs without real source hidden in the dependency hierarchy) much more quickly.
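One way such a tool could help a DD spot traps quickly is to rebuild an artifact from its claimed source and diff the entry lists of the published and rebuilt JARs: entries present only in the published JAR (for example, classes with no corresponding source) are exactly the red flags worth surfacing. A minimal sketch, using only the JDK's zip support; the class and method names are mine, not an existing tool:

```java
import java.io.*;
import java.util.*;
import java.util.zip.*;

public class JarDiff {

    // Entry names present in the published JAR but absent from the rebuilt one.
    static Set<String> extraEntries(File published, File rebuilt) throws IOException {
        Set<String> extra = entryNames(published);
        extra.removeAll(entryNames(rebuilt));
        return extra;
    }

    static Set<String> entryNames(File jar) throws IOException {
        Set<String> names = new TreeSet<>();
        try (ZipFile zf = new ZipFile(jar)) {
            for (Enumeration<? extends ZipEntry> e = zf.entries(); e.hasMoreElements();) {
                names.add(e.nextElement().getName());
            }
        }
        return names;
    }

    // Helper for the demo below: write a zip containing the given empty entries.
    static File makeJar(String... entries) throws IOException {
        File f = File.createTempFile("jardiff", ".jar");
        f.deleteOnExit();
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(f))) {
            for (String name : entries) {
                zos.putNextEntry(new ZipEntry(name));
                zos.closeEntry();
            }
        }
        return f;
    }

    public static void main(String[] args) throws IOException {
        File published = makeJar("com/example/A.class", "com/example/Mystery.class");
        File rebuilt = makeJar("com/example/A.class");
        System.out.println(extraEntries(published, rebuilt)); // prints [com/example/Mystery.class]
    }
}
```

A real comparison would also have to tolerate benign differences (timestamps, MANIFEST.MF details), which is the same problem the reproducible-builds effort deals with, but even a crude entry-list diff surfaces bundled binaries that have no source counterpart.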

How would it work?

The project - whether completed under GSoC or by other means - would probably be broken down into a few discrete components. Furthermore, it would utilize some existing tools where possible. All of this makes it easier for a student to focus on and complete some subset of the work even if the whole thing is quite daunting.

Here are some of the ideas for the architecture of the solution and the different components to be used or developed: