Classpath issues when using Spark's Hive integration

written by Lars Francke on 2018-03-22

We were recently investigating a weird Spark exception. It happened in Apache Spark jobs that had been running fine until then. The only difference we could find was an upgrade from IBM BigReplicate 4.1.1 to 4.1.2 (based on WANdisco Fusion 2.11, I believe).

This is the stack trace we saw:

Caused by: java.lang.IllegalArgumentException: interface com.wandisco.fs.common.ApiCompatibility is not visible from class loader
            at java.lang.reflect.Proxy$ProxyClassFactory.apply(
            at java.lang.reflect.Proxy$ProxyClassFactory.apply(
            at java.lang.reflect.WeakCache$Factory.get(
            at java.lang.reflect.WeakCache.get(
            at java.lang.reflect.Proxy.getProxyClass0(
            at java.lang.reflect.Proxy.newProxyInstance(
            at com.wandisco.fs.common.CompatibilityAdaptor.createAdaptor(
            at com.wandisco.fs.client.FusionCommon.<init>(
            at com.wandisco.fs.client.ReplicatedFC.<init>(
            at com.wandisco.fs.client.ReplicatedFC.<init>(
            at com.wandisco.fs.client.ReplicatedFC.get(
            at com.wandisco.fs.client.FusionCommonFactory.newInstance(
            at com.wandisco.fs.client.FusionCommonFactory.get(
            ... 48 more

It took us a while to investigate, and we learned a few things along the way. Namely:

  • There is a JVM option -verbose:class that shows every class the JVM loads and which file it is loaded from
  • Both OpenJDK and IBM's JDK support this option but the output is slightly different (so helpful...)
  • There is a command-line debugger in Java called jdb which is fabulous when you have no way to attach an IDE for remote debugging (firewalls etc.)
  • There are multiple different transports to communicate between the debugger and a running virtual machine.

    Don't do what we did and blindly follow the first link Google gives you: in our case that was the documentation for jdb on Windows.

    Windows has a transport called dt_shmem (Shared Memory) which is not available on Linux. On Linux you want to use dt_socket instead.
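For reference, this is roughly what that looks like on Linux. The jar name and the port 8000 are made up for illustration; adjust them to your setup:

```shell
# Show every class the JVM loads and where it was loaded from
# (the output format differs slightly between OpenJDK and IBM's JDK):
java -verbose:class -jar myapp.jar

# Start the target JVM with a JDWP agent listening on a socket; for Spark
# this can be passed via the driver's JVM options:
spark-shell --driver-java-options \
  "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000"

# Attach jdb from another shell using the socket transport:
jdb -attach localhost:8000
```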

The issue

Back to the problem at hand: we developed a minimal example for our problem. Starting the Spark shell worked just fine, but as soon as we ran a command that "triggered" the Hive integration we'd get the above stack trace:

spark.sql("SHOW DATABASES").show

First we looked at the OpenJDK source code to see what triggers the error message is not visible from class loader. It is thrown in the Proxy class when it asks a given classloader for a class and compares the result with a reference Class instance. As soon as those two are different (they point to the same logical class but are not the same instance, i.e. equals() returns false) the above exception is thrown. There is also this SO question which hits on the same issue and helped a little in understanding this.
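The visibility check is easy to reproduce in plain Java: hand Proxy a classloader that cannot see the interface and you get the same error as in our stack trace. Here MyApi is a made-up stand-in for com.wandisco.fs.common.ApiCompatibility, and passing null as the loader means the bootstrap classloader, which cannot see application classes:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

public class ProxyVisibilityDemo {

    // Hypothetical stand-in for com.wandisco.fs.common.ApiCompatibility.
    public interface MyApi {}

    public static void main(String[] args) {
        InvocationHandler handler = (proxy, method, methodArgs) -> null;

        // Works: the classloader that defined MyApi can obviously see it.
        Object ok = Proxy.newProxyInstance(
                MyApi.class.getClassLoader(), new Class<?>[]{MyApi.class}, handler);
        System.out.println(ok instanceof MyApi); // true

        // Fails: the bootstrap loader (null) cannot see MyApi, so Proxy
        // throws the same IllegalArgumentException as in our stack trace.
        try {
            Proxy.newProxyInstance(null, new Class<?>[]{MyApi.class}, handler);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```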

We (finally) found the issue by attaching jdb to a running Spark driver and setting a breakpoint at the proxy creation.

There we saw that the two class instances were indeed different: they came from two different classloaders. One came from the default system classloader, the other from an IsolatedClientLoader from the Spark project.

Once we found this we were quick to find the solution to our problem. Here are the relevant JIRAs:

  • SPARK-6906: "Improve Hive integration support"
  • SPARK-6907: "Create an isolated classloader for the Hive Client."
  • SPARK-7491: "Handle drivers for Metastore JDBC"

The last one held the key. It introduced a parameter called spark.sql.hive.metastore.sharedPrefixes (see the Spark documentation, section Interacting with Different Versions of Hive Metastore) that lets you specify prefixes of class names that are shared between the main system classloader and the isolated classloader for Hive. We simply had to add com.wandisco here and everything worked again:

spark-shell --conf spark.sql.hive.metastore.sharedPrefixes=com.wandisco.
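Instead of passing it on the command line every time, the same setting can of course also go into spark-defaults.conf:

```
spark.sql.hive.metastore.sharedPrefixes  com.wandisco.
```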

As of Spark 2.3 there are a few hardcoded default prefixes that are applied in addition to the ones you specify:

  • org.apache.hadoop. but not org.apache.hadoop.hive.
  • slf4j
  • log4j
  • org.apache.spark.
  • scala.
  • com.google but not com.google.cloud
  • java.lang.
  • java.net

While this is a very obscure error that probably won't affect many people, if you are stuck with it and find this blog post I hope it saves you a few hours of debugging!

If you enjoy working on new technologies, traveling, consulting for clients, and writing software or documentation, please reach out to us. We're always looking for new colleagues!