AWS S3 and PySpark Data Loading Pipeline: Working with the s3a Filesystem

Author: Zizhun Guo




Manual Configuration

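Step 1: Get Packages
Download hadoop-aws-3.2.0.jar and aws-java-sdk-bundle-1.11.375.jar and place them in the jars directory of the Spark installation. These are the two jars that the configuration step refers to.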
Step 2: Configuration
import findspark
# Point findspark at the local Spark installation before importing pyspark.
findspark.init('/home/XXX/spark-3.1.2-bin-hadoop3.2')
from pyspark.sql import SparkSession

# Put the hadoop-aws and AWS SDK jars on both the driver and executor class paths.
spark = SparkSession.builder.appName("SparkByExamples.com")\
    .config("spark.driver.extraClassPath", "./spark-3.1.2-bin-hadoop3.2/jars/hadoop-aws-3.2.0.jar:./spark-3.1.2-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.11.375.jar")\
    .config("spark.executor.extraClassPath", "./spark-3.1.2-bin-hadoop3.2/jars/hadoop-aws-3.2.0.jar:./spark-3.1.2-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.11.375.jar")\
    .getOrCreate()
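Alternatively, the same settings can be placed in $SPARK_HOME/conf/spark-defaults.conf (adjust the paths to where the jars are stored):
spark.executor.extraClassPath /jars/aws-java-sdk-bundle-1.11.375.jar:/lib/hadoop-aws-3.2.0.jar
spark.driver.extraClassPath /jars/aws-java-sdk-bundle-1.11.375.jar:/lib/hadoop-aws-3.2.0.jar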

Once the configuration is done, there is no need to configure it again unless the extra jar files are deleted.
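
Since the setup only breaks when the jars go missing, it can help to check for them before building the session. The following is a minimal sketch, not part of the original setup: the helper names (`jar_paths`, `missing_jars`, `extra_classpath`) are hypothetical, and it assumes the jars live in the `jars` directory of the Spark installation and that the class path separator is `:` (Linux/macOS).

```python
from pathlib import Path

# Jar file names taken from the configuration above.
REQUIRED_JARS = ["hadoop-aws-3.2.0.jar", "aws-java-sdk-bundle-1.11.375.jar"]


def jar_paths(spark_home, jar_names=REQUIRED_JARS):
    """Return the absolute path of each jar under <spark_home>/jars."""
    jars_dir = Path(spark_home).expanduser().resolve() / "jars"
    return [jars_dir / name for name in jar_names]


def missing_jars(spark_home, jar_names=REQUIRED_JARS):
    """Return the names of the required jars that are not present on disk."""
    return [p.name for p in jar_paths(spark_home, jar_names) if not p.is_file()]


def extra_classpath(spark_home, jar_names=REQUIRED_JARS):
    """Join the jar paths with ':' (assumes a Linux/macOS class path separator)."""
    return ":".join(str(p) for p in jar_paths(spark_home, jar_names))
```

The string returned by `extra_classpath` could be passed to both `spark.driver.extraClassPath` and `spark.executor.extraClassPath`, and a non-empty result from `missing_jars` signals that Step 1 needs to be repeated.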

Step 3: Connect to S3 with Access Key and Secret Key using s3a://
# Get the SparkContext from the session created in Step 2.
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "XXX")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "XXX")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com")
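
To avoid hardcoding the keys in the script, the same three settings could be read from the environment. This is only a sketch under stated assumptions: `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the conventional AWS variable names, while `S3_ENDPOINT` and both helper functions are hypothetical names introduced here for illustration.

```python
import os


def s3a_settings_from_env(env=None):
    """Build the fs.s3a.* settings from environment variables.

    S3_ENDPOINT is a hypothetical variable name; when it is unset, the
    endpoint from the step above ("s3.amazonaws.com") is used.
    """
    env = os.environ if env is None else env
    try:
        access_key = env["AWS_ACCESS_KEY_ID"]
        secret_key = env["AWS_SECRET_ACCESS_KEY"]
    except KeyError as exc:
        raise RuntimeError(f"Missing environment variable: {exc.args[0]}") from exc
    return {
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        "fs.s3a.endpoint": env.get("S3_ENDPOINT", "s3.amazonaws.com"),
    }


def apply_s3a_settings(spark_context, settings):
    """Copy each setting into the Hadoop configuration of a SparkContext."""
    hadoop_conf = spark_context._jsc.hadoopConfiguration()
    for key, value in settings.items():
        hadoop_conf.set(key, value)
```

With a running session, this would be used as `apply_s3a_settings(spark.sparkContext, s3a_settings_from_env())`.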
Step 4: Read remote CSV file
df = spark.read.csv('s3a://bucket-name/object-name')
df.show()
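
If the CSV file has a header row, the read call can be told so. The sketch below wraps the call above in two hypothetical helpers (`s3a_uri`, `read_csv_from_s3`); whether `header` and `inferSchema` are appropriate depends on the actual file, so the defaults chosen here are assumptions.

```python
def s3a_uri(bucket, key):
    """Build an s3a:// URI from a bucket name and an object key."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def read_csv_from_s3(spark, bucket, key, header=True, infer_schema=True):
    """Read one CSV object into a DataFrame.

    header=True treats the first row as column names and inferSchema=True
    asks Spark to guess column types; both are assumptions about the file.
    """
    return spark.read.csv(s3a_uri(bucket, key), header=header, inferSchema=infer_schema)
```

For example, `read_csv_from_s3(spark, "bucket-name", "object-name").show()` reads the same object as the call above.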

Using $SPARK_HOME/bin/spark-submit --packages

Unfortunately, this did not work in my case; the command kept failing with the error below.

ubuntu@xxx:~/spark-3.1.2-bin-hadoop3.2$ bin/spark-submit --packages org.apache.hadoop:hadoop-aws:3.2.0
Error: Missing application resource.

Help:

Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
...
--packages      Comma-separated list of maven coordinates of jars to include
                on the driver and executor classpaths. Will search the local
                maven repo, then maven central and any additional remote
                repositories given by --repositories. The format for the
                coordinates should be groupId:artifactId:version.
...
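
The "Missing application resource" message and the usage line suggest that spark-submit expects an application file after the options, which the command above does not provide. As a rough sketch of what a complete invocation might look like, the helper below assembles the argument list in Python; `build_submit_command`, `submit`, and the script name `my_app.py` are hypothetical, and this form was not tested in the original setup.

```python
import os
import subprocess


def build_submit_command(spark_home, app_file, packages, app_args=()):
    """Assemble a spark-submit argument list: options first, then the application file."""
    submit = os.path.join(spark_home, "bin", "spark-submit")
    return [submit, "--packages", ",".join(packages), app_file, *app_args]


def submit(spark_home, app_file, packages, app_args=()):
    """Run spark-submit and raise CalledProcessError if it exits with an error."""
    cmd = build_submit_command(spark_home, app_file, packages, app_args)
    return subprocess.run(cmd, check=True)


# Example (my_app.py is a hypothetical PySpark script):
# submit(os.environ["SPARK_HOME"], "my_app.py", ["org.apache.hadoop:hadoop-aws:3.2.0"])
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths or arguments contain spaces.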


Copyright @ 2021 Zizhun Guo. All Rights Reserved.
