Big Data Infrastructure by Gentoo: Gentoo GSoC 2020 Final Report
Introduction
Note: this page is fully dedicated to the GSoC 2020 Big Data Infrastructure by Gentoo project.
Many Java packages, for example Hadoop and Spark, depend heavily on the Maven build system. Integrating Maven into portage would make these packages part of the portage ecosystem and make Gentoo a more attractive choice for Java users. However, Maven fetches the dependencies of an arbitrary artifact from Maven Central during the build phase, which violates portage policy, so a direct integration is hard. An alternative way to compile and install Maven artifacts with portage is to translate the Maven build procedure into command lines and make those commands part of the ebuild file.
Java-ebuilder is an initial effort to implement this alternative method. In a nutshell, it reads pom.xml, parses some key attributes, translates them into bash variables, and generates an ebuild. After that, portage takes over the build process and installs the compiled jar files. java-ebuilder also provides add-on scripts that generate ebuilds for any dependencies that are not packaged yet.
Unfortunately, java-ebuilder could not extract all the required information from pom.xml, and the ebuilds it generated were not completely portage-compliant. The add-on script, also known as movl, ran slowly and could not resume after a failure, which made the solution unattractive.
This project, namely GSoC 2020 Big Data Infrastructure By Gentoo project, aims at
- improving the functionality of java-ebuilder;
- rewriting `movl` using GNU Make, which helps run tasks in parallel and resume the job after a failure;
- parsing the metadata of Maven packages and presenting an up-to-date Gentoo Overlay containing full dependencies of Java software, such as Spark.
Objectives
For your further reference: Repositories Produced By This Project
This project calls for some improvements to java-ebuilder (my fork), including:
- improving the procedure of generating the reverse dependencies of a certain Maven artifact (e.g. org.apache.spark:spark-core_2.12:3.0.0-preview2);
- improving the style of ebuild files generated by java-ebuilder;
- mapping Gentoo packages to Maven artifacts, so that java-ebuilder can make use of the packages that already exist on a Gentoo system.
It also requires enhancements to `java-pkg-simple.eclass` (my fork), including:
- detecting the need for tools.jar;
- enabling the `src_test()` function for Java unit testing;
- explicitly supporting multiple directories containing sources via a Bash array;
- supporting packaging Java Resources with portage;
- supporting testing framework: JUnit3;
- supporting testing framework: JUnit4;
- supporting testing framework: JUnit5;
- supporting testing framework: TestNG;
- generating classpath for USE-Conditional dependencies.
Besides, portage also needs some more features (java-pkg-simple-plugins.eclass):
- compiling Scala sources;
- compiling Kotlin sources;
- dealing with shaded jars and uber jars.
Finally, the deliverable of this project is an overlay whose contents should satisfy these requirements:
- ebuild files for Spark and its dependency graph;
- a demo that runs Monte Carlo Integration with Spark;
- building all the packages from source.
System Design
java-ebuilder
According to the ebuild file of java-ebuilder, java-ebuilder consists of four groups of files:
- the {jar, source code, Javadoc} files of java-ebuilder;
- scripts, i.e. `movl` and `java-ebuilder`, that are installed to `/usr/bin`;
- the resources and extra scripts needed by `movl`, located in `/usr/lib/java-ebuilder/`;
- the skeleton of the Maven overlay, which is located in `/var/lib/java-ebuilder/`.
The first part is the core of java-ebuilder, which contains everything that we need to run java-ebuilder.
The second part is the interface of java-ebuilder.
`java-ebuilder` is a helper that sets environment variables and executes `java-ebuilder.jar`.
`movl` is a script that starts from a root Maven artifact, uses java-ebuilder to resolve the dependency graph of the artifact, and generates an overlay for installing the artifact with portage.
The third part contains the backend of `movl`.
The fourth part is the directory where `movl` will generate the overlay.
Here is a graph that outlines how java-ebuilder interacts with Maven and portage.
The detailed description is attached below.
cache
The first part of java-ebuilder is its cache.
The cache file maps portage packages to Maven artifacts using their unique identifiers.
For portage, the identifier is `category:package name:version:slot`.
For Maven, it is `groupId:artifactId:version`.
Every ebuild file that has equivalent Maven artifacts is expected to define a variable called `MAVEN_ID` or `MAVEN_PROVIDES` that contains the identifiers of those artifacts.
java-ebuilder parses the identifier of the portage package and reads the variables mentioned above to generate the cache.
The pipeline for creating a cache file is shown in the graph below.
The cache file is used when the `pom.xml` of the artifact being parsed declares dependencies.
java-ebuilder looks up each dependency in the cache and translates it into a form that portage understands.
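The mapping between the two identifiers can be sketched in a few lines of shell. This is only an illustration under stated assumptions: the function name `cache_line`, the temporary overlay layout, and the `->` output format are made up here and do not reflect the real cache format written by java-ebuilder.

```shell
#!/bin/sh
# Sketch only: the "->" line format is hypothetical, not the real cache format.

# cache_line <overlay>/<category>/<package>/<package>-<version>.ebuild
# Derives category, package name and version from the path, and reads
# SLOT and MAVEN_ID from the ebuild text.
cache_line() {
    ebuild=$1
    pkgdir=$(dirname "$ebuild")
    pn=$(basename "$pkgdir")
    category=$(basename "$(dirname "$pkgdir")")
    pv=$(basename "$ebuild" .ebuild)
    version=${pv#"$pn"-}
    slot=$(sed -n 's/^SLOT="\(.*\)"$/\1/p' "$ebuild")
    maven_id=$(sed -n 's/^MAVEN_ID="\(.*\)"$/\1/p' "$ebuild")
    # Without MAVEN_ID there is nothing to map.
    [ -n "$maven_id" ] || return 1
    printf '%s:%s:%s:%s -> %s\n' "$category" "$pn" "$version" "${slot:-0}" "$maven_id"
}

# Demo on a throwaway overlay containing one minimal ebuild.
overlay=$(mktemp -d)
mkdir -p "$overlay/dev-java/commons-io"
cat > "$overlay/dev-java/commons-io/commons-io-2.6.ebuild" <<'EOF'
SLOT="1"
MAVEN_ID="commons-io:commons-io:2.6"
EOF
line=$(cache_line "$overlay/dev-java/commons-io/commons-io-2.6.ebuild")
echo "$line"
rm -rf "$overlay"
```

The demo prints `dev-java:commons-io:2.6:1 -> commons-io:commons-io:2.6`, that is, the portage identifier on the left and the Maven identifier on the right.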
ebuild writer
The ebuild writer part of java-ebuilder translates elements from `pom.xml` into `package-version.ebuild`.
The key elements and the equivalent variables are shown below.
XML element | Ebuild Variable |
---|---|
/project/{artifact,group}Id | MAVEN_ID |
/project/build/sourceDirectory | JAVA_SRC_DIR |
/project/build/resources | JAVA_RESOURCE_DIRS |
/project/build/testSourceDirectory | JAVA_TEST_SRC_DIR |
/project/build/testResources | JAVA_TEST_RESOURCE_DIRS |
/project/dependencies/* | DEPEND, RDEPEND |
/project/dependencies/* | JAVA_GENTOO_CLASSPATH, JAVA_CLASSPATH_EXTRA, JAVA_TEST_GENTOO_CLASSPATH |
/project/dependencies/* | JAVA_TESTING_FRAMEWORKS |
/project/description | DESCRIPTION |
/project/licenses/license/name | LICENSE |
/project/plugins/plugin/configuration/maven-jar-plugin/configuration/manifest/mainClass | JAVA_MAIN_CLASS |
/project/project.build.sourceEncoding | JAVA_ENCODING |
/project/url | HOMEPAGE |
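To make the table more concrete, the fragment below shows how the variables might look inside a generated ebuild. It is a hand-written illustration, not actual java-ebuilder output: the artifact `org.example:demo:1.0` and every value are placeholders, and a real ebuild would also carry the usual header, `inherit` line, and dependency variables.

```shell
# Illustrative values only; the artifact and all values are hypothetical.
MAVEN_ID="org.example:demo:1.0"              # from groupId, artifactId, version
DESCRIPTION="Demo library"                   # from /project/description
HOMEPAGE="https://example.org/demo"          # from /project/url
LICENSE="Apache-2.0"                         # from /project/licenses/license/name
JAVA_ENCODING="UTF-8"                        # from project.build.sourceEncoding
JAVA_SRC_DIR="src/main/java"                 # from /project/build/sourceDirectory
JAVA_RESOURCE_DIRS="src/main/resources"      # from /project/build/resources
JAVA_TEST_SRC_DIR="src/test/java"            # from /project/build/testSourceDirectory
JAVA_TESTING_FRAMEWORKS="junit-4"            # derived from /project/dependencies/*
JAVA_MAIN_CLASS="org.example.demo.Main"      # from the manifest mainClass setting

echo "$MAVEN_ID"
```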
movl
`movl` is a wrapper around java-ebuilder.
Assume that we are going to make an overlay for an arbitrary Maven artifact `Foo`.
`movl` executes `java-ebuilder` to output the dependencies of `Foo`.
For every dependency that does not have an ebuild in portage, `movl` runs `java-ebuilder` on it and outputs the dependencies of that dependency.
By recursively collecting the dependencies, `movl` eventually walks through the dependency graph of `Foo` and generates a DAG that describes it.
With the DAG, we are able to create an overlay that helps us install `Foo`.
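The recursive walk described above can be sketched as a depth-first traversal. This is a simplified illustration, not the code of `movl`: `deps_of` and `is_packaged` are hypothetical stand-ins for running java-ebuilder and for looking up an existing ebuild, and the artifacts `Bar`, `Baz` and `Qux` are invented.

```shell
#!/bin/sh
# deps_of: hypothetical stand-in for "run java-ebuilder on an artifact
# and list its direct dependencies". The graph below is invented.
deps_of() {
    case $1 in
        Foo) echo "Bar Baz" ;;
        Bar|Baz) echo "Qux" ;;
        *) echo "" ;;
    esac
}

# is_packaged: hypothetical stand-in for "an ebuild already exists".
is_packaged() { [ "$1" = "Qux" ]; }

visited=" "   # space-delimited set of artifacts already seen
order=""      # artifacts that need an ebuild, dependencies first

walk() {
    # Skip artifacts that were already visited.
    case $visited in *" $1 "*) return 0 ;; esac
    visited="$visited$1 "
    # Already packaged: nothing to generate, do not descend further.
    if is_packaged "$1"; then return 0; fi
    for dep in $(deps_of "$1"); do
        walk "$dep"
    done
    # Post-order append, so every dependency precedes its dependents.
    order="$order${order:+ }$1"
}

walk Foo
echo "$order"
```

The sketch prints `Bar Baz Foo`: `Qux` is skipped because it is treated as already packaged, and the remaining artifacts come out in an order in which each ebuild can be generated after its dependencies.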
The old version of `movl` was written in Bash, which made its pipeline hard to follow.
Besides, it ran all tasks serially, so it took a long time to finish its work.
If a user wanted to update any single ebuild of the overlay, they had to run `movl` again and wait a long time for it to refresh the overlay.
During this summer, I rewrote `movl` with GNU Make, which overcomes the flaws above.
Since it is driven by a `Makefile`, it is easy to explain how it works.
Here are a table that explains the different targets of `movl` and a graph that describes the dependencies between the targets.
Target | Description |
---|---|
all / build | alias for stage2 and post-stage2 |
stage2 | generate ebuilds of the overlay in parallel |
post-stage2 | generate Manifest files for stage2 ebuilds |
clean-stage2 | remove *.ebuild and Manifest in the overlay |
stage1 | alias for /path/to/stage2.mk |
/path/to/stage2.mk | resolve the dep graph of MAVEN_ARTS, and generate a makefile (stage2.mk) that defines the DAG and the commands to generate the final ebuilds |
/path/to/pre-stage1-cache | java-ebuilder cache containing pkgs of the system |
/path/to/post-stage1-cache | pre-stage1-cache and java-ebuilder cache containing pkgs generated in stage1 |
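Why file-based targets make it possible to resume after a failure can be illustrated with a small shell sketch. It only imitates the idea that a task is skipped once its output file exists (Make additionally compares timestamps and can run independent targets in parallel); the task names `a`, `b`, `c`, the `generate` function and the `FAIL_ON` switch are all hypothetical.

```shell
#!/bin/sh
workdir=$(mktemp -d)
FAIL_ON=""

# generate: hypothetical stand-in for one ebuild-generation task.
# It fails on purpose when the artifact equals $FAIL_ON.
generate() {
    if [ "$1" = "$FAIL_ON" ]; then return 1; fi
    echo "ebuild for $1" > "$workdir/$1.ebuild"
}

run_all() {
    for art in a b c; do
        # The output already exists: the task is up to date, skip it.
        if [ -e "$workdir/$art.ebuild" ]; then continue; fi
        if generate "$art"; then
            echo "generated $art"
        else
            echo "failed at $art"
            return 1
        fi
    done
}

FAIL_ON=b
first=$(run_all) || true    # first run stops at b
FAIL_ON=""
second=$(run_all)           # second run resumes: a is not redone
printf 'first run:\n%s\nsecond run:\n%s\n' "$first" "$second"
rm -rf "$workdir"
```

The first run generates `a` and fails at `b`; the second run skips `a` and generates only `b` and `c`.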
java-pkg-simple.eclass
Generally speaking, `java-pkg-simple.eclass` reads the bash variables defined in the ebuild and chooses the proper actions to compile Java classes from source, generate Javadoc, add resources, and install and test the jar.
Variable | Action |
---|---|
JAVA_GENTOO_CLASSPATH | add items of the variable to ${CLASSPATH}, and record them as dependencies of the package |
JAVA_CLASSPATH_EXTRA | add items of the variable to ${CLASSPATH}, but do not record them as dependencies |
JAVA_SRC_DIR | find and compile the \*.java files in the directories defined in the variable |
JAVA_RESOURCE_DIRS | recognize things in the directories as Java Resources and package them later |
JAVA_ENCODING | character encoding used by source files |
JAVA_TESTING_FRAMEWORKS | run the tests with the frameworks listed in the variable: junit (testing with dev-java/junit:0), junit-4 (testing with dev-java/junit:4), testng (testing with dev-java/testng:0), pkgdiff (make sure the compiled jar and the binary jar are compatible) |
JAVA_TEST_GENTOO_CLASSPATH | while testing the package, add items of the variable to ${CLASSPATH}, but do not record them as dependencies |
JAVA_TEST_RESOURCE_DIRS | while testing the package, recognize things in the directories as Java Resources |
JAVA_TEST_EXCLUDES | while testing the package, exclude classes defined in the variable from testing |
JAVA_MAIN_CLASS | set the proper value in MANIFEST.MF to indicate the Main class |
JAVA_LAUNCHER_FILENAME | the name of the launcher script that will be installed to /usr/bin |
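The way one variable selects an action can be sketched as a simple dispatch on the entries of `JAVA_TESTING_FRAMEWORKS`. The function `framework_pkg` is a hypothetical helper written for this illustration and is not part of the eclass; the framework names and packages are the ones listed in the table, with `dev-util/pkgdiff` assumed as the package behind `pkgdiff`.

```shell
#!/bin/sh
# framework_pkg: map one JAVA_TESTING_FRAMEWORKS entry to the package
# that provides it. Hypothetical helper, not a function of the eclass.
framework_pkg() {
    case $1 in
        junit)   echo "dev-java/junit:0" ;;
        junit-4) echo "dev-java/junit:4" ;;
        testng)  echo "dev-java/testng:0" ;;
        pkgdiff) echo "dev-util/pkgdiff" ;;
        *)       echo "unknown framework: $1" >&2; return 1 ;;
    esac
}

# A package may list more than one framework, separated by spaces.
JAVA_TESTING_FRAMEWORKS="junit-4 testng"
for fw in $JAVA_TESTING_FRAMEWORKS; do
    printf '%s -> %s\n' "$fw" "$(framework_pkg "$fw")"
done
```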
Quick Start
java-ebuilder
core
To let `java-ebuilder` generate the cache file, a user needs to execute
java-ebuilder --refresh-cache --portage-tree /var/db/repos/gentoo \
    [--portage-tree /path/to/your/overlay] \
    [--cache-file /path/to/cache]
To make `java-ebuilder` generate an ebuild file for a Maven artifact (e.g. commons-io:commons-io:2.6), a user can run
mkdir -p /tmp/workdir && cd /tmp/workdir
SRC_URI="https://archive.apache.org/dist/commons/io/source/commons-io-2.6-src.tar.gz"
wget "${SRC_URI}"
tar xvf commons-io-2.6-src.tar.gz
java-ebuilder --generate-ebuild --workdir . -u "${SRC_URI}" \
    -k "~amd64" -k "~arm64" -k "~ppc64 ~x86" \
    --pom commons-io-2.6-src/pom.xml --slot 1 \
    --ebuild commons-io-2.6.ebuild \
    [--cache-file /path/to/your/cache]
After executing the commands, `commons-io-2.6.ebuild` will appear in the current directory.
movl
To use `movl` to generate an overlay, follow the steps below.
1. define MAVEN_ARTS in `/etc/java-ebuilder.conf`
echo MAVEN_ARTS=\"io.netty:netty-transport-udt:4.1.42.Final\" >> /etc/java-ebuilder.conf
2. run `movl build` and wait
3. `movl` may encounter issues, and it will print an error message like
[!] java-ebuilder Returns 1
[!] The problematic artifact is com.barchart.udt:barchart-udt-bundle:2.3.0,
[!] please write it (or its parent) a functional ebuild,
[!] make it a part of your overlay (the overlay does not need to be /var/lib/java-ebuilder/maven),
[!] and run `movl build` afterwards
[!]
[!] P.S. DO NOT forget to assign a MAVEN_ID to the ebuild
[!] P.P.S. To make `movl build` deal with the dependency of com.barchart.udt:barchart-udt-bundle:2.3.0,
[!] you need to add MAVEN_IDs of the dependencies to
[!] your MAVEN_ARTS variable in //etc/java-ebuilder.conf
You will need to write an ebuild for the problematic package and place it in an overlay that is registered in `/etc/portage/repos.conf`.
4. run `movl build` and repeat step 3 until it finishes its work.
5. emerge the Maven artifact
emerge -1av netty-transport-udt
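Step 3 only helps if the hand-written ebuild declares `MAVEN_ID` (or `MAVEN_PROVIDES`), as the error message reminds us. A quick check along the following lines could catch a forgotten variable before rerunning `movl build`. The function `missing_maven_id` and the throwaway overlay with the packages `foo` and `bar` are hypothetical and exist only for this sketch.

```shell
#!/bin/sh
# missing_maven_id <overlay-dir>
# Lists ebuilds that define neither MAVEN_ID nor MAVEN_PROVIDES.
missing_maven_id() {
    find "$1" -type f -name '*.ebuild' | sort | while IFS= read -r f; do
        grep -Eq '^(MAVEN_ID|MAVEN_PROVIDES)=' "$f" || printf '%s\n' "$f"
    done
}

# Demo on a temporary overlay: foo declares MAVEN_ID, bar does not.
overlay=$(mktemp -d)
mkdir -p "$overlay/dev-java/foo" "$overlay/dev-java/bar"
printf 'MAVEN_ID="org.example:foo:1.0"\n' > "$overlay/dev-java/foo/foo-1.0.ebuild"
printf 'SLOT="0"\n' > "$overlay/dev-java/bar/bar-1.0.ebuild"

missing=$(missing_maven_id "$overlay")
echo "${missing#"$overlay"/}"
rm -rf "$overlay"
```

In the demo only `dev-java/bar/bar-1.0.ebuild` is reported.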
Spark-overlay
Please follow the README of spark-overlay.
Works Left To Be Done
It turns out that I underestimated the amount of work this project requires. While fighting with Maven artifacts and Gentoo ebuilds, I ran into some issues that seem fairly interesting. Here are the ideas that still need further discussion and development.
JUnit-5 Testing Platform
Currently, there are no JUnit-5 packages in portage. Since many packages (e.g. commons-lang3) now use JUnit-5 as their unit test framework, it is important to make it part of the portage ecosystem and integrate it into the Java build system of portage.
Classpath for USE-Conditional Dependencies
The concept of USE-Conditional dependencies is an elegant and essential part of portage.
It makes enabling and disabling optional features of a package much easier.
But USE-Conditional dependencies are barely functional and not easy to use for Java packages in portage.
The widely used `java-pkg_gen-cp()` function cannot deal with USE-Conditional dependencies, and it is hard to write the `*CLASSPATH` variables in this situation.
Finding proper methods for this case should be an interesting topic.
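One conceivable direction, shown here only as a sketch and not as something the eclass supports or this project implemented, is to extend the classpath variable depending on the enabled USE flags. The `use` function below is a stub that imitates portage's helper so the snippet runs outside an ebuild; the flags `foo` and `bar` and the package names are hypothetical.

```shell
#!/bin/sh
# Stub of portage's `use` helper so that the sketch runs outside an ebuild.
USE="foo"
use() {
    case " $USE " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Unconditional dependency (hypothetical package-slot name).
JAVA_GENTOO_CLASSPATH="libbase-0"

# Append optional dependencies only when the matching flag is enabled.
if use foo; then
    JAVA_GENTOO_CLASSPATH="$JAVA_GENTOO_CLASSPATH,libfoo-0"
fi
if use bar; then
    JAVA_GENTOO_CLASSPATH="$JAVA_GENTOO_CLASSPATH,libbar-0"
fi

echo "$JAVA_GENTOO_CLASSPATH"
```

With `USE="foo"` the sketch prints `libbase-0,libfoo-0`. Writing this by hand for every optional dependency is exactly the kind of burden that better tooling would remove.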
Scala and Kotlin Sources
Projects written in Scala and Kotlin are becoming an important part of the Java ecosystem, but there is no integration with portage for them. This issue also blocked me from compiling some artifacts in the spark-overlay. I wrote a preliminary eclass, but it is still immature and needs further development.
Uber Jars and Shaded Jars
Let’s say we are going to compile a Maven artifact `Foo.jar`.
`Foo.jar` depends on `Bar.jar`, and the Java package of `Bar.jar` is `org.example`.
In this context, if `Foo.jar` contains Java classes from `Bar.jar`, `Foo.jar` is an uber jar.
Furthermore, if we use the package string `com.foo` instead of `org.example` while packaging Java classes from `Bar.jar`, `Foo.jar` becomes a shaded jar.
Currently, there seems to be no discussion about how to deal with these situations in `java-pkg-simple.eclass`.
I believe this is a valuable problem that I came across during the project.
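The difference between the two kinds of jar can be illustrated on the list of entries inside `Foo.jar`. The sketch below only renames paths; real shading tools also rewrite the references inside the class files, which is not attempted here. The class names `Main` and `Util` and the function `relocate` are hypothetical.

```shell
#!/bin/sh
# relocate: rewrite entry paths from org/example/ to com/foo/,
# mimicking the package renaming that turns an uber jar into a shaded jar.
relocate() {
    sed 's|^org/example/|com/foo/|'
}

# Entries of an uber Foo.jar: its own class plus one bundled from Bar.jar.
uber="com/foo/Main.class
org/example/Util.class"

shaded=$(printf '%s\n' "$uber" | relocate)
printf 'uber jar entries:\n%s\nshaded jar entries:\n%s\n' "$uber" "$shaded"
```

In the shaded listing the bundled class appears as `com/foo/Util.class`, so it can no longer be traced back to `Bar.jar` by its package name alone, which is part of what makes such jars hard to map to portage dependencies.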
Build Maven with java-ebuilder
As a final deliverable, I provided an overlay that resolves all the dependencies of spark-core. But in my opinion, resolving the dependencies of Maven itself, so that Maven can be installed from source, would be an attractive next step.
Reference
Repositories Produced By This Project
- a fork of java-ebuilder, https://github.com/6-6-6/java-ebuilder
- an overlay for testing `java-pkg-simple.eclass`, https://github.com/6-6-6/test-java-pkg-simple
- the deliverable of the project, a.k.a. spark-overlay, https://github.com/6-6-6/spark-overlay
Related Bugs and Discussions
- [gentoo-dev] [PATCH 0/2] eclass/java-{utils-2,pkg-simple}.eclass: features and enhancements, https://archives.gentoo.org/gentoo-dev/message/da4435309a1585fbc07fce705558ad06
- make dev-util/pkgdiff compatible with Gentoo Prefix, https://bugs.gentoo.org/723124
- bump the version of dev-java/netty-tcnative, https://bugs.gentoo.org/733630
GSoC Weekly Reports
- Reports for Week 1, https://archives.gentoo.org/gentoo-soc/message/360caaf690c1b8f45cc7f0767a8b6b3f
- Reports for Week 2, https://archives.gentoo.org/gentoo-soc/message/d663e2813f52d237c7a117b200f4d32c
- Reports for Week 3, https://archives.gentoo.org/gentoo-soc/message/9398ec0bd71b9a1c5191bf6c0cc358fa
- Reports for Week 4, https://archives.gentoo.org/gentoo-soc/message/0550776ef2e1ece9d1c2905df06e839f
- Reports for Week 5, https://archives.gentoo.org/gentoo-soc/message/557babc402ffeea84cf6d08c758e0837
- Reports for Week 6, https://archives.gentoo.org/gentoo-soc/message/3978f7900b0673c85ce3c64b47f7d44a
- Reports for Week 7, https://archives.gentoo.org/gentoo-soc/message/0cef85b2045f82b319e2fada043523f2
- Reports for Week 8, https://archives.gentoo.org/gentoo-soc/message/0cef85b2045f82b319e2fada043523f2
- Reports for Week 9 and 10, https://archives.gentoo.org/gentoo-soc/message/7008c78bcc2650420987567158856c6c
- Reports for Week 11 and 12, https://archives.gentoo.org/gentoo-soc/message/709519229888158095036c97095bddd1