Introduction

Note: this page is dedicated entirely to the GSoC 2020 Big Data Infrastructure By Gentoo project.

Many Java packages, for example Hadoop and Spark, depend heavily on the Maven build system. Integrating Maven into portage would bring these packages into the portage ecosystem and make Gentoo a more attractive choice for Java users. However, Maven fetches the dependencies of any artifact from Maven Central during the build phase, which violates the policy of portage, so a direct integration is hard. An alternative way to compile and install Maven artifacts with portage is to translate the Maven build procedure into command lines and make those commands part of the ebuild file.

Java-ebuilder is an initial effort to implement the mentioned alternative method. In a nutshell, it will read pom.xml, parse some key attributes, translate them into bash variables, and generate an ebuild. After that, portage will take over the build process and install the compiled jar files. Besides, java-ebuilder also provides add-on scripts that will generate ebuilds for any dependencies that are not packaged yet.

Unfortunately, java-ebuilder could not extract all the required information from pom.xml, and the ebuilds it generated were not fully portage-compliant. The add-on script, known as movl, ran slowly and could not resume after a failure, which made the solution unattractive.

This project, the GSoC 2020 Big Data Infrastructure By Gentoo project, aims at:

  • improving the functionality of java-ebuilder;
  • rewriting movl using GNU Make, which helps run tasks in parallel and resume the job from failure;
  • parsing the metadata of Maven packages and presenting an up-to-date Gentoo Overlay containing full dependencies of Java software, such as Spark.

Objectives

For your further reference: Repositories Produced By This Project

This project calls for several improvements to java-ebuilder (my fork), including:

  • improving the procedure of generating the reverse dependencies of a certain Maven artifact (e.g. org.apache.spark:spark-core_2.12:3.0.0-preview2);
  • improving the style of ebuild files generated by java-ebuilder;
  • mapping Gentoo packages to Maven artifacts, so that java-ebuilder can make use of the packages that already exist on a Gentoo system.

It also requires enhancements to java-pkg-simple.eclass (my fork), including:

  • detecting the need for tools.jar;
  • enabling src_test() function for Java Unit Testing;
  • explicitly supporting multiple directories containing sources via Bash Array;
  • supporting packaging Java Resources with portage;
  • supporting testing framework: JUnit3;
  • supporting testing framework: JUnit4;
  • supporting testing framework: JUnit5;
  • supporting testing framework: TestNG;
  • generating classpath for USE-Conditional dependencies.

Besides, portage also needs some more features (java-pkg-simple-plugins.eclass):

  • compiling Scala sources;
  • compiling Kotlin sources;
  • dealing with shaded jars and uber jars.

Eventually, an overlay will serve as the deliverable of this project. The contents of the overlay should satisfy these requirements:

  • ebuild files for Spark and its dependency graph;
  • a demo that runs Monte Carlo Integration with Spark;
  • building all the packages from source.

System Design

java-ebuilder

According to the ebuild file of java-ebuilder, java-ebuilder consists of four groups of files:

  1. the {jar,source code,Java doc} file of java-ebuilder;
  2. scripts, i.e. movl and java-ebuilder, that are installed to /usr/bin;
  3. the resources and extra scripts needed by movl, located in /usr/lib/java-ebuilder/;
  4. the skeleton of Maven overlay, which is located in /var/lib/java-ebuilder/.

The first part is the core of java-ebuilder, which contains all the things that we need to run java-ebuilder.

The second part is the interface of java-ebuilder. java-ebuilder is a helper that sets environment variables and executes java-ebuilder.jar. movl is a script that starts from a root Maven artifact, utilizes java-ebuilder to resolve the dependency graph of the artifact, and generates an overlay for installing the artifact with portage.

The third part contains the backend of movl.

The fourth part is the directory where movl will generate the overlay.

Here is a graph that outlines how java-ebuilder interacts with Maven and portage.

java-ebuilder overview

The detailed description is attached below.

cache

The first part of java-ebuilder is its cache. The cache file maps portage packages to Maven artifacts using their unique identifiers. For portage, the identifier is category:package name:version:slot. For Maven, it is groupId:artifactId:version. Every ebuild that has equivalent Maven artifacts is expected to define a variable called MAVEN_ID or MAVEN_PROVIDES that contains the identifiers of those artifacts. java-ebuilder parses the identifier of the portage package and reads the variables mentioned above to generate the cache. The pipeline for creating a cache file is shown in the graph below.

java-ebuilder creates cache
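
The mapping described above can be sketched in shell. This is only an illustration under stated assumptions: the function name cache_line and the one-line output layout are hypothetical and do not reflect the real cache file format of java-ebuilder, the ebuild is assumed to assign SLOT and MAVEN_ID as plain quoted strings, and revision suffixes such as -r1 are ignored.

```shell
# Print "category:name:version:slot maven-id" for one ebuild file.
# Hypothetical helper; the real cache format of java-ebuilder may differ.
cache_line() {
    ebuild="$1"
    # The category is the directory two levels above the ebuild file.
    category=$(basename "$(dirname "$(dirname "$ebuild")")")
    # File name without the .ebuild suffix, e.g. commons-io-2.6
    pv=$(basename "$ebuild" .ebuild)
    # Split at the last "-" that is followed by a digit (no -rN handling).
    name=$(printf '%s\n' "$pv" | sed 's/-[0-9][^-]*$//')
    version=$(printf '%s\n' "$pv" | sed 's/^.*-\([0-9][^-]*\)$/\1/')
    # Read simple quoted assignments from the ebuild.
    slot=$(sed -n 's/^SLOT="\(.*\)"$/\1/p' "$ebuild")
    maven_id=$(sed -n 's/^MAVEN_ID="\(.*\)"$/\1/p' "$ebuild")
    # Ebuilds without MAVEN_ID have no equivalent artifact to record.
    [ -n "$maven_id" ] || return 1
    printf '%s:%s:%s:%s %s\n' "$category" "$name" "$version" "$slot" "$maven_id"
}
```

Called on dev-java/commons-io/commons-io-2.6.ebuild with SLOT="1" and MAVEN_ID="commons-io:commons-io:2.6", the sketch prints both identifiers on one line, portage first and Maven second.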

The cache file is consulted when the pom.xml indicates that the artifact being parsed has dependencies. java-ebuilder looks up each dependency in the cache and translates it into a form that portage understands.
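
A lookup of this kind might look like the sketch below. It assumes a hypothetical text cache with one "category:name:version:slot groupId:artifactId:version" pair per line, which is not necessarily what java-ebuilder writes, and the function name maven_to_atom is invented for the illustration.

```shell
# Translate a Maven identifier into a portage dependency atom
# (category/name:slot) using a hypothetical two-column cache file.
maven_to_atom() {
    awk -v id="$1" '
        $2 == id {
            # Split the portage identifier into its four fields.
            split($1, p, ":")
            print p[1] "/" p[2] ":" p[4]
            found = 1
            exit
        }
        # Report failure when the artifact is not in the cache.
        END { exit !found }
    ' "$2"
}
```

With a cache line "dev-java:commons-io:2.6:1 commons-io:commons-io:2.6", looking up commons-io:commons-io:2.6 yields dev-java/commons-io:1. An artifact that is missing from the cache makes the function return a non-zero status, which is the case where a new ebuild has to be generated.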

ebuild writer

The ebuild writer part of java-ebuilder translates elements of pom.xml into variables in package-version.ebuild. The key elements and their equivalent variables are shown below.

XML element                           | Ebuild variable
/project/{artifact,group}Id           | MAVEN_ID
/project/build/sourceDirectory        | JAVA_SRC_DIR
/project/build/resources              | JAVA_RESOURCE_DIRS
/project/build/testSourceDirectory    | JAVA_TEST_SRC_DIR
/project/build/testResources          | JAVA_TEST_RESOURCE_DIRS
/project/dependencies/*               | DEPEND, RDEPEND
/project/dependencies/*               | JAVA_GENTOO_CLASSPATH, JAVA_CLASSPATH_EXTRA, JAVA_TEST_GENTOO_CLASSPATH
/project/dependencies/*               | JAVA_TESTING_FRAMEWORKS
/project/description                  | DESCRIPTION
/project/licenses/license/name        | LICENSE
/project/plugins/plugin/configuration/maven-jar-plugin/configuration/manifest/mainClass | JAVA_MAIN_CLASS
/project/project.build.sourceEncoding | JAVA_ENCODING
/project/url                          | HOMEPAGE
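
To make the table more concrete, here is what the translated variables could look like for an invented artifact org.example:foo:1.0. Every value below is hypothetical and chosen only for illustration (the directories follow the conventional Maven layout); a real generated ebuild contains more than these assignments.

```shell
# Hypothetical ebuild fragment; the artifact and all values are invented.
MAVEN_ID="org.example:foo:1.0"            # from groupId, artifactId, version
DESCRIPTION="An example library"          # from /project/description
HOMEPAGE="https://example.org/foo"        # from /project/url
LICENSE="Apache-2.0"                      # from /project/licenses/license/name

JAVA_ENCODING="UTF-8"                     # from project.build.sourceEncoding
JAVA_SRC_DIR="src/main/java"              # from /project/build/sourceDirectory
JAVA_RESOURCE_DIRS="src/main/resources"   # from /project/build/resources
JAVA_TEST_SRC_DIR="src/test/java"         # from /project/build/testSourceDirectory
JAVA_TEST_RESOURCE_DIRS="src/test/resources"  # from /project/build/testResources
JAVA_TESTING_FRAMEWORKS="junit-4"         # derived from the test dependencies
```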

movl

movl is a wrapper of java-ebuilder.

Assume that we are going to make an overlay for an arbitrary Maven artifact Foo. movl will execute java-ebuilder to output the dependencies of Foo. For every dependency that does not have an ebuild in portage, movl will run java-ebuilder for it and output the dependencies of the dependency. By recursively grabbing the dependencies, movl will finally walk through the dependency graph of Foo and generate a DAG to describe it. With the DAG, we will be able to create an overlay that helps us install Foo.
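
The recursive walk can be sketched as a small worklist loop. This is not the code of movl: deps_of and has_ebuild are hypothetical stubs that stand in for "run java-ebuilder and list the dependencies" and "an ebuild already exists", and the artifact names foo, bar and baz are invented.

```shell
# Hypothetical stub: dependencies that java-ebuilder would report.
deps_of() {
    case "$1" in
        foo) echo "bar baz" ;;
        bar) echo "baz" ;;
    esac
}

# Hypothetical stub: pretend that only baz is already packaged.
has_ebuild() {
    [ "$1" = "baz" ]
}

# Print every edge "artifact -> dependency" reachable from the root.
walk_deps() {
    todo="$1"
    seen=""
    while [ -n "$todo" ]; do
        # Pop the first artifact from the worklist.
        set -- $todo
        current="$1"
        shift
        todo="$*"
        # Skip artifacts that were already expanded.
        case " $seen " in
            *" $current "*) continue ;;
        esac
        seen="$seen $current"
        for dep in $(deps_of "$current"); do
            printf '%s -> %s\n' "$current" "$dep"
            # Only dependencies without an ebuild need further resolution.
            has_ebuild "$dep" || todo="$todo $dep"
        done
    done
}

walk_deps foo
```

For the stubbed data the sketch prints the three edges foo -> bar, foo -> baz and bar -> baz, which together form the DAG of Foo.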

The old version of movl was written in Bash, which made its pipeline hard to follow. It also ran all tasks serially, so it took a long time to finish. If a user wanted to update a single ebuild in the overlay, they had to run movl again and wait a long time for it to refresh the whole overlay.

During this summer, I rewrote movl with GNU Make, which overcomes the flaws above. Since it is driven by a Makefile, it is easy to explain how it works. Below are a table that explains the different targets of movl and a graph that describes the dependencies between the targets.

Target                     | Description
all / build                | alias for stage2 and post-stage2
stage2                     | generate ebuilds of the overlay in parallel
post-stage2                | generate Manifest files for stage2 ebuilds
clean-stage2               | remove *.ebuild and Manifest in the overlay
stage1                     | alias for /path/to/stage2.mk
/path/to/stage2.mk         | resolve the dep graph of MAVEN_ARTS, and generate a makefile (stage2.mk) that defines the DAG and the commands to generate the final ebuilds
/path/to/pre-stage1-cache  | java-ebuilder cache containing pkgs of the system
/path/to/post-stage1-cache | pre-stage1-cache and java-ebuilder cache containing pkgs generated in stage1

DAG of movl
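
As a rough illustration of what stage1 produces, the sketch below turns dependency edges into make rules. The rule layout and the function name edges_to_rules are assumptions made for this example, and the artifact names are simplified; the real stage2.mk written by movl may look quite different.

```shell
# Read "artifact -> dependency" lines on stdin and print one make rule
# per edge, so that each ebuild depends on the ebuilds of its dependencies.
edges_to_rules() {
    while read -r target arrow dep; do
        # Ignore lines that are not edges.
        [ "$arrow" = "->" ] || continue
        printf '%s.ebuild: %s.ebuild\n' "$target" "$dep"
    done
}

printf 'foo -> bar\nfoo -> baz\nbar -> baz\n' | edges_to_rules
```

Once the DAG is expressed as make rules, GNU Make can build independent targets in parallel with its -j option, and a rerun skips targets whose files are already up to date, which is what lets the job resume after a failure.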

java-pkg-simple.eclass

Generally speaking, java-pkg-simple.eclass reads the bash variables defined in the ebuild and chooses the proper actions to compile Java classes from source, generate Javadoc, add resources, and install and test the jar.

Variable                   | Action
JAVA_GENTOO_CLASSPATH      | add items of the variable to ${CLASSPATH}, and record them as dependencies of the package
JAVA_CLASSPATH_EXTRA       | add items of the variable to ${CLASSPATH}, but do not record them as dependencies
JAVA_SRC_DIR               | find and compile *.java files from the directories defined in the variable
JAVA_RESOURCE_DIRS         | treat the contents of the directories as Java resources and package them later
JAVA_ENCODING              | character encoding used by source files
JAVA_TESTING_FRAMEWORKS    | run the tests with the frameworks listed in the variable:
                           |   junit: testing with dev-java/junit:0
                           |   junit-4: testing with dev-java/junit:4
                           |   testng: testing with dev-java/testng:0
                           |   pkgdiff: make sure compiled jar and binary jar are compatible
JAVA_TEST_GENTOO_CLASSPATH | while testing the package, add items of the variable to ${CLASSPATH}, but do not record them as dependencies
JAVA_TEST_RESOURCE_DIRS    | while testing the package, treat the contents of the directories as Java resources
JAVA_TEST_EXCLUDES         | while testing the package, exclude the classes defined in the variable from testing
JAVA_MAIN_CLASS            | set the proper value in MANIFEST.MF to indicate the main class
JAVA_LAUNCHER_FILENAME     | the name of the launcher script that will be installed to /usr/bin
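
The way JAVA_TESTING_FRAMEWORKS selects a test framework can be pictured as a simple dispatch. The function below is a hypothetical sketch, not the code of the eclass: it only maps each framework name from the table to the package the table names for it, and it assumes that pkgdiff refers to dev-util/pkgdiff.

```shell
# Map each entry of a JAVA_TESTING_FRAMEWORKS-style list to a package.
# Hypothetical helper for illustration; the eclass itself does more.
testing_deps() {
    for framework in $1; do
        case "$framework" in
            junit)   echo "dev-java/junit:0" ;;
            junit-4) echo "dev-java/junit:4" ;;
            testng)  echo "dev-java/testng:0" ;;
            pkgdiff) echo "dev-util/pkgdiff" ;;
            *)
                echo "unknown framework: $framework" >&2
                return 1
                ;;
        esac
    done
}

testing_deps "junit-4 testng"
```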

Quick Start

java-ebuilder

core

To make java-ebuilder generate a cache file, run:

java-ebuilder --refresh-cache --portage-tree /var/db/repos/gentoo\
	[--portage-tree /path/to/your/overlay]\
	[--cache-file /path/to/cache]

To make java-ebuilder generate an ebuild file for a Maven artifact (e.g. commons-io:commons-io:2.6), run:

# Work in a scratch directory.
mkdir -p /tmp/workdir && cd /tmp/workdir
# Fetch and unpack the upstream source tarball.
SRC_URI="https://archive.apache.org/dist/commons/io/source/commons-io-2.6-src.tar.gz"
wget "${SRC_URI}"
tar xvf commons-io-2.6-src.tar.gz
# Generate the ebuild from the unpacked pom.xml.
java-ebuilder --generate-ebuild --workdir . -u "${SRC_URI}" \
		-k "~amd64" -k "~arm64" -k "~ppc64 ~x86" \
		--pom commons-io-2.6-src/pom.xml --slot 1 \
		--ebuild commons-io-2.6.ebuild \
		[--cache-file /path/to/your/cache]

After executing the commands, a commons-io-2.6.ebuild will appear in the current directory.

movl

To generate an overlay with movl, follow the instructions below.

  1. define MAVEN_ARTS in /etc/java-ebuilder.conf
echo MAVEN_ARTS=\"io.netty:netty-transport-udt:4.1.42.Final\" >> /etc/java-ebuilder.conf
  2. run movl build and wait
  3. movl may encounter issues, and it will print an error message like
[!] java-ebuilder Returns 1
[!] The problematic artifact is com.barchart.udt:barchart-udt-bundle:2.3.0,
[!] please write it (or its parent) a functional ebuild,
[!] make it a part of your overlay (the overlay does not need to be /var/lib/java-ebuilder/maven),
[!] and run `movl build` afterwards
[!]
[!] P.S. DO NOT forget to assign a MAVEN_ID to the ebuild
[!] P.P.S. To make `movl build` deal with the dependency of com.barchart.udt:barchart-udt-bundle:2.3.0,
[!] you need to add MAVEN_IDs of the dependencies to
[!] your MAVEN_ARTS variable in //etc/java-ebuilder.conf
     You will need to write an ebuild for the problematic package and place it in an overlay that is registered in /etc/portage/repos.conf.
  4. run movl build again and repeat step 3 until it finishes its work
  5. emerge the Maven artifact

emerge -1av netty-transport-udt
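
Since movl relies on MAVEN_ID to recognise hand-written ebuilds, a quick check before rerunning movl build can save a round trip. The helper below is a hypothetical convenience, not part of java-ebuilder; it assumes that MAVEN_ID is assigned at the start of a line and only looks for that assignment.

```shell
# Print every given ebuild that does not assign MAVEN_ID.
# Files that cannot be read are reported as well.
missing_maven_id() {
    for ebuild in "$@"; do
        grep -q '^MAVEN_ID=' "$ebuild" 2>/dev/null || printf '%s\n' "$ebuild"
    done
}
```

Running it over the ebuilds of a custom overlay, for example missing_maven_id /path/to/your/overlay/*/*/*.ebuild, lists the files that still need an identifier; empty output means every ebuild has one.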

Spark-overlay

Please follow the README of spark-overlay.

Work Left To Be Done

It turns out that I underestimated the amount of work this project needs. While wrestling with Maven artifacts and Gentoo ebuilds, I ran into some issues that seem fairly interesting. Here are the ideas that still need further discussion and development.

JUnit-5 Testing Platform

Currently, there are no JUnit-5 packages in portage. Since many packages (e.g. commons-lang3) now use JUnit-5 as their unit test framework, it is important to make it part of the portage ecosystem and integrate it into the Java build system of portage.

Classpath for USE-Conditional Dependencies

The concept of USE-Conditional dependencies is an exquisite and essential part of portage. It makes enabling and disabling optional features of a package much easier. But USE-Conditional dependencies are barely functional and not easy to use for Java packages in portage. The widely used java-pkg_gen-cp() function cannot deal with USE-Conditional dependencies, and it is hard to write the *CLASSPATH variables in this situation. Proper methods for handling this case would be an interesting topic.

Scala and Kotlin Sources

Projects that are written in Scala and Kotlin are becoming an important part of the ecosystem of Java, but there is no integration with portage for them. The issue also blocked me from compiling some artifacts from the spark-overlay. I wrote a preliminary eclass, but it is still immature and needs further development.

Uber Jars and Shaded Jars

Let’s say we are going to compile a Maven artifact Foo.jar. Foo.jar depends on Bar.jar, and the Java package of Bar.jar is org.example.

In this context, if Foo.jar contains Java classes from Bar.jar, Foo.jar is an uber jar. Furthermore, if we use the package string com.foo instead of org.example while packaging Java classes from Bar.jar, Foo.jar becomes a shaded jar.
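
The difference can be pictured by looking at the entry names inside each jar, as a listing tool such as jar tf would show them. The sketch below is purely illustrative: the class names Main and Util are invented, and real shading also rewrites the references inside the class files, which a path rename does not capture.

```shell
# Invented class entries for Foo.jar and Bar.jar.
foo_classes="com/foo/Main.class"
bar_classes="org/example/Util.class"

# An uber jar carries the classes of Bar.jar next to its own.
uber_jar_entries() {
    printf '%s\n' $foo_classes $bar_classes
}

# A shaded jar carries them too, but relocated into the com.foo package.
shaded_jar_entries() {
    uber_jar_entries | sed 's|^org/example/|com/foo/|'
}

uber_jar_entries
shaded_jar_entries
```

In the uber jar the bundled class is still visible as org/example/Util.class, whereas in the shaded jar the same class shows up as com/foo/Util.class and can no longer be matched to Bar.jar by its path alone.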

Currently, there seems to be no discussion about how to deal with those situations in java-pkg-simple.eclass. I believe this is a valuable problem that I uncovered.

Build Maven with java-ebuilder

As a final deliverable, I provided an overlay that resolves all the dependencies of spark-core. In my opinion, resolving the dependencies of Maven itself, so that Maven can be installed from source, would also be an attractive goal.

Reference

Repositories Produced By This Project

  1. a fork of java-ebuilder, https://github.com/6-6-6/java-ebuilder
  2. an overlay for testing java-pkg-simple.eclass, https://github.com/6-6-6/test-java-pkg-simple
  3. the deliverable of the project, A.K.A spark-overlay, https://github.com/6-6-6/spark-overlay
  1. [gentoo-dev] [PATCH 0/2] eclass/java-{utils-2,pkg-simple}.eclass: features and enhancements, https://archives.gentoo.org/gentoo-dev/message/da4435309a1585fbc07fce705558ad06
  2. make dev-util/pkgdiff compatible with Gentoo Prefix, https://bugs.gentoo.org/723124
  3. bump the version of dev-java/netty-tcnative, https://bugs.gentoo.org/733630

GSoC Weekly Reports

  1. Reports for Week 1, https://archives.gentoo.org/gentoo-soc/message/360caaf690c1b8f45cc7f0767a8b6b3f
  2. Reports for Week 2, https://archives.gentoo.org/gentoo-soc/message/d663e2813f52d237c7a117b200f4d32c
  3. Reports for Week 3, https://archives.gentoo.org/gentoo-soc/message/9398ec0bd71b9a1c5191bf6c0cc358fa
  4. Reports for Week 4, https://archives.gentoo.org/gentoo-soc/message/0550776ef2e1ece9d1c2905df06e839f
  5. Reports for Week 5, https://archives.gentoo.org/gentoo-soc/message/557babc402ffeea84cf6d08c758e0837
  6. Reports for Week 6, https://archives.gentoo.org/gentoo-soc/message/3978f7900b0673c85ce3c64b47f7d44a
  7. Reports for Week 7, https://archives.gentoo.org/gentoo-soc/message/0cef85b2045f82b319e2fada043523f2
  8. Reports for Week 8, https://archives.gentoo.org/gentoo-soc/message/0cef85b2045f82b319e2fada043523f2
  9. Reports for Week 9 and 10, https://archives.gentoo.org/gentoo-soc/message/7008c78bcc2650420987567158856c6c
  10. Reports for Week 11 and 12, https://archives.gentoo.org/gentoo-soc/message/709519229888158095036c97095bddd1