Author: Zizhun Guo
import findspark
findspark.init('/home/XXX/spark-3.1.2-bin-hadoop3.2')

from pyspark.sql import SparkSession

# Put the hadoop-aws connector and the bundled AWS SDK on both the driver
# and executor classpaths so the s3a:// filesystem implementation is found.
spark = SparkSession.builder.appName("SparkByExamples.com") \
    .config("spark.driver.extraClassPath", "./spark-3.1.2-bin-hadoop3.2/jars/hadoop-aws-3.2.0.jar:./spark-3.1.2-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.11.375.jar") \
    .config("spark.executor.extraClassPath", "./spark-3.1.2-bin-hadoop3.2/jars/hadoop-aws-3.2.0.jar:./spark-3.1.2-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.11.375.jar") \
    .getOrCreate()
The two classpath properties can also be set once and for all, e.g. in conf/spark-defaults.conf:

spark.executor.extraClassPath /jars/aws-java-sdk-bundle-1.11.375.jar:/lib/hadoop-aws-3.2.0.jar
spark.driver.extraClassPath /jars/aws-java-sdk-bundle-1.11.375.jar:/lib/hadoop-aws-3.2.0.jar

Once the configuration is done, there is no need to configure it again unless the extra jar files are deleted.
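To double-check that the settings actually reached the running session, the live SparkConf can be inspected; a minimal sketch, assuming the SparkSession spark built above:

conf = spark.sparkContext.getConf()
# Both properties should echo back the jar paths configured earlier.
print(conf.get("spark.driver.extraClassPath", "not set"))
print(conf.get("spark.executor.extraClassPath", "not set"))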
# The Hadoop configuration lives on the SparkContext, so grab it first.
sc = spark.sparkContext

# Hand the S3 credentials and endpoint to the s3a filesystem.
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "XXX")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "XXX")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com")

# Read a CSV object directly from the bucket and preview it.
df = spark.read.csv('s3a://bucket-name/object-name')
df.show()
Unfortunately this does not work in my case and keeps erroring out at me. As an alternative, I tried having spark-submit pull the connector through its --packages flag:
ubuntu@xxx:~/spark-3.1.2-bin-hadoop3.2$ bin/spark-submit --packages org.apache.hadoop:hadoop-aws:3.2.0
Error: Missing application resource.
Help:
Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
...
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
...
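The error is expected here: spark-submit needs an application to run, and --packages alone is not one. For an interactive session, the same Maven coordinate can instead be handed to the builder through spark.jars.packages, which lets Spark resolve hadoop-aws (and its aws-java-sdk-bundle dependency) from the Maven repositories at startup rather than wiring extraClassPath by hand. A minimal sketch, using the same session setup as above:

import findspark
findspark.init('/home/XXX/spark-3.1.2-bin-hadoop3.2')

from pyspark.sql import SparkSession

# "spark.jars.packages" is the in-code counterpart of --packages: Spark
# downloads the coordinate (plus transitive dependencies) at session start.
spark = SparkSession.builder \
    .appName("SparkByExamples.com") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0") \
    .getOrCreate()

The fs.s3a.* credentials are then set exactly as in the snippet above.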
Copyright © 2021 Zizhun Guo. All Rights Reserved.