AWS Glue - Arango Oasis Spark Connection unclassified error on read_collection load "An error occurred while calling o120.load. org/apache/spark/sql/arangodb/commons/ArangoDBConf$" #64

@am0eba-byte

Setup:

ArangoGraph Oasis 3.11 (oneshard model, 3 x 4GB)
AWS Glue 4.0 - Spark 3.3, Scala 2, Python 3
ArangoDB Spark Connector [version 1.7.0](https://mvnrepository.com/artifact/com.arangodb/arangodb-spark-datasource-3.3_2.13-1.7.0.jar)

Description:

We are trying to set up an ETL pipeline that reads a collection from our Arango Oasis instance, and we keep running into the same error. The ETL job runs in AWS Glue and connects to the database through a NAT Gateway. We based our Glue job script on the Python demo arangodb-spark-datasource/demo/python-demo/demo.py, and we provide the DB credentials via AWS Secrets Manager. Here's what our Python code looks like:

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

def read_collection(spark: SparkSession, collection_name: str, base_opts: dict[str, str], schema: StructType) -> pyspark.sql.DataFrame:
    # combine_dicts is a helper from the job script (not shown here)
    arangodb_datasource_options = combine_dicts([base_opts, {"table": collection_name}])

    return spark.read \
        .format("com.arangodb.spark") \
        .options(**arangodb_datasource_options) \
        .schema(schema) \
        .load()  # fails here
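
`combine_dicts` is not defined in the snippet above. A minimal sketch of such a helper, assuming it simply merges the dictionaries left to right with later keys overriding earlier ones (this behavior and the `example_db` value are assumptions for illustration, not the actual job code):

```python
from typing import Iterable, Mapping


def combine_dicts(dicts: Iterable[Mapping[str, str]]) -> dict[str, str]:
    """Merge option dicts left to right; later dicts override earlier keys.

    Hypothetical stand-in for the helper used in read_collection.
    """
    merged: dict[str, str] = {}
    for d in dicts:
        merged.update(d)
    return merged


# Example: add the per-collection "table" option to shared base options.
# "example_db" is a placeholder value, not taken from the real job.
base_opts = {"database": "example_db"}
options = combine_dicts([base_opts, {"table": "competency_transitive"}])
```

With a helper like this, `read_collection` passes the shared connection options plus the collection name to `.options(...)` as one flat dictionary.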

We believe the error occurs on .load(). Has anyone else run into the same error when setting up a connection from AWS Glue, or does anyone have tips for us to try? This is the error output:

ExceptionErrorMessage failureReason: An error occurred while calling o121.load. org/apache/spark/sql/arangodb/commons/ArangoDBConf$
24/09/30 16:27:46 INFO AmazonHttpClient: Configuring Proxy. Proxy Host: 169.254.76.0 Proxy Port: 8888
24/09/30 16:27:46 INFO ProcessLauncher: Enhance failure reason and emit cloudwatch error metrics.
24/09/30 16:27:46 INFO ProcessLauncher: postprocessing
24/09/30 16:27:46 WARN OOMExceptionHandler: Failed to extract executor id from error message.
24/09/30 16:27:46 ERROR ProcessLauncher: Error from Python:Traceback (most recent call last):
  File "/tmp/ArangoDB_CompGraph_ETL.py", line 99, in <module>
    collection = read_collection(spark, "competency_transitive", arango_options, edges_schema)
  File "/tmp/ArangoDB_CompGraph_ETL.py", line 97, in read_collection
    .load()
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 184, in load
    return self._df(self._jreader.load())
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o121.load.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/arangodb/commons/ArangoDBConf$
	at com.arangodb.spark.DefaultSource.extractOptions(DefaultSource.scala:16)
	at com.arangodb.spark.DefaultSource.getTable(DefaultSource.scala:38)
	at com.arangodb.spark.DefaultSource.getTable(DefaultSource.scala:31)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:83)
	at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:132)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209)
	at scala.Option.flatMap(Option.scala:271)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.arangodb.commons.ArangoDBConf$
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	... 21 more
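
The `ClassNotFoundException` above means the JVM could not find `org.apache.spark.sql.arangodb.commons.ArangoDBConf$` on its classpath. One way to narrow this down is to check whether any of the jars supplied to the job actually contains that class. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and the jar paths are whatever files were uploaded for the job:

```python
import zipfile
from typing import Iterable


def jar_contains_class(jar_path: str, class_name: str) -> bool:
    """Return True if the jar at jar_path has a .class entry for class_name."""
    # A jar is a zip archive; class a.b.C is stored as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


def jars_providing(jar_paths: Iterable[str], class_name: str) -> list[str]:
    """Return the subset of jar_paths that contain class_name."""
    return [p for p in jar_paths if jar_contains_class(p, class_name)]


# Usage (paths are placeholders for locally downloaded copies of the job's jars):
# jars_providing(
#     ["/path/to/first.jar", "/path/to/second.jar"],
#     "org.apache.spark.sql.arangodb.commons.ArangoDBConf$",
# )
```

If the returned list is empty, none of the checked jars ships the missing class, which would be consistent with the error in the stack trace.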
