Description
Setup:
ArangoGraph Oasis 3.11 (oneshard model, 3 x 4GB)
AWS Glue 4.0 - Spark 3.3, Scala 2, Python 3
ArangoDB Spark Connector [version 1.7.0](https://mvnrepository.com/artifact/com.arangodb/arangodb-spark-datasource-3.3_2.13-1.7.0.jar)
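The connector's artifact name encodes the Spark and Scala versions it was built for, so one quick sanity check is to compare those against the Glue runtime. The sketch below is illustrative only: `parse_connector_artifact` is a hypothetical helper, and it assumes the artifact id follows the `arangodb-spark-datasource-<spark>_<scala>` pattern seen in the link above.

```python
import re
from typing import NamedTuple


class ConnectorBuild(NamedTuple):
    spark_version: str
    scala_version: str


def parse_connector_artifact(artifact_id: str) -> ConnectorBuild:
    """Split an artifact id such as 'arangodb-spark-datasource-3.3_2.13'."""
    match = re.fullmatch(
        r"arangodb-spark-datasource-(\d+\.\d+)_(\d+\.\d+)", artifact_id
    )
    if match is None:
        raise ValueError(f"unexpected artifact id: {artifact_id!r}")
    return ConnectorBuild(match.group(1), match.group(2))


# Artifact id taken from the connector link in the setup list.
build = parse_connector_artifact("arangodb-spark-datasource-3.3_2.13")
print(build)
```

Inside the Glue job, the runtime side of the comparison could be read from `spark.version` and, through py4j, from `spark.sparkContext._jvm.scala.util.Properties.versionNumberString()`. The exact Scala version of the Glue 4.0 runtime is not stated here, so treat this as a check rather than a diagnosis.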
Description:
We are trying to set up an ETL pipeline that reads a collection from our Arango Oasis instance, and we keep running into the same error. The ETL job runs in AWS Glue and connects to the database through a NAT Gateway. We based our Glue job script on the Python demo (arangodb-spark-datasource/demo/python-demo/demo.py), and we provide the DB credentials via AWS Secrets Manager. Here's what our Python code looks like:
    from pyspark.sql import DataFrame, SparkSession
    from pyspark.sql.types import StructType

    # combine_dicts is a helper defined elsewhere in the job script.
    def read_collection(spark: SparkSession, collection_name: str, base_opts: dict[str, str], schema: StructType) -> DataFrame:
        arangodb_datasource_options = combine_dicts([base_opts, {"table": collection_name}])
        return spark.read \
            .format("com.arangodb.spark") \
            .options(**arangodb_datasource_options) \
            .schema(schema) \
            .load()  # fails here
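`combine_dicts` and `base_opts` are not shown in the snippet. As a rough sketch, assuming `combine_dicts` merges dictionaries left to right with later keys winning, and assuming option names like those in the connector's Python demo (`endpoints`, `database`, `user`, `password`, `ssl.enabled`), the inputs might look like this. All values are placeholders.

```python
from typing import Dict, Iterable


def combine_dicts(dicts: Iterable[Dict[str, str]]) -> Dict[str, str]:
    """Merge dictionaries left to right; later entries override earlier ones."""
    combined: Dict[str, str] = {}
    for d in dicts:
        combined.update(d)
    return combined


# Placeholder connection options (hypothetical values, not real credentials).
base_opts = {
    "endpoints": "example.arangodb.cloud:8529",
    "database": "example_db",
    "user": "example_user",
    "password": "example_password",
    "ssl.enabled": "true",
}

# Per-collection options are layered on top of the shared connection options.
options = combine_dicts([base_opts, {"table": "competency_transitive"}])
print(sorted(options))
```

Building a fresh dictionary, rather than updating `base_opts` in place, keeps the shared options reusable across several `read_collection` calls.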
We believe the error occurs at .load(), and we're wondering whether anyone else has run into the same error when setting up a connection from AWS Glue, or has any tips for us to try:
ExceptionErrorMessage failureReason: An error occurred while calling o121.load. org/apache/spark/sql/arangodb/commons/ArangoDBConf$
24/09/30 16:27:46 INFO AmazonHttpClient: Configuring Proxy. Proxy Host: 169.254.76.0 Proxy Port: 8888
24/09/30 16:27:46 INFO ProcessLauncher: Enhance failure reason and emit cloudwatch error metrics.
24/09/30 16:27:46 INFO ProcessLauncher: postprocessing
24/09/30 16:27:46 WARN OOMExceptionHandler: Failed to extract executor id from error message.
24/09/30 16:27:46 ERROR ProcessLauncher: Error from Python:Traceback (most recent call last):
File "/tmp/ArangoDB_CompGraph_ETL.py", line 99, in <module>
collection = read_collection(spark, "competency_transitive", arango_options, edges_schema)
File "/tmp/ArangoDB_CompGraph_ETL.py", line 97, in read_collection
.load()
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 184, in load
return self._df(self._jreader.load())
File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
return_value = get_return_value(
File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
return f(*a, **kw)
File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o121.load.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/arangodb/commons/ArangoDBConf$
at com.arangodb.spark.DefaultSource.extractOptions(DefaultSource.scala:16)
at com.arangodb.spark.DefaultSource.getTable(DefaultSource.scala:38)
at com.arangodb.spark.DefaultSource.getTable(DefaultSource.scala:31)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.getTableFromProvider(DataSourceV2Utils.scala:83)
at org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils$.loadV2Source(DataSourceV2Utils.scala:132)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:209)
at scala.Option.flatMap(Option.scala:271)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:207)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.arangodb.commons.ArangoDBConf$
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 21 more
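When triaging a trace like the one above, it can help to pull the missing class name out of the error text so it can be searched for in the JARs attached to the job. A minimal sketch, where `missing_class_from_error` is a hypothetical helper and the only input is the error line already shown:

```python
import re
from typing import Optional


def missing_class_from_error(message: str) -> Optional[str]:
    """Return the dotted class name from a NoClassDefFoundError or
    ClassNotFoundException line, or None if the line does not match."""
    match = re.search(
        r"(?:NoClassDefFoundError|ClassNotFoundException):\s*([\w$./]+)", message
    )
    if match is None:
        return None
    # NoClassDefFoundError reports the JVM internal form with '/' separators.
    return match.group(1).replace("/", ".")


line = "java.lang.NoClassDefFoundError: org/apache/spark/sql/arangodb/commons/ArangoDBConf$"
missing = missing_class_from_error(line)
print(missing)

# Path of the .class entry one would look for inside a JAR.
entry = missing.replace(".", "/") + ".class"
print(entry)
```

The entry path can be compared against the listing from `jar tf` (or Python's `zipfile` module) for each JAR supplied to the job. Whether the class is expected to live in the connector JAR itself or in a separate dependency is not something the trace alone shows.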