AWS S3 and PySpark Data Loading Pipeline: Working with AWS S3 and PySpark Using the s3a Filesystem

Author: Zizhun Guo

Written on: 05 Sep 2021

Manual Configuration

Step 1: Get Packages

The two jars used in the configuration below are hadoop-aws-3.2.0.jar and aws-java-sdk-bundle-1.11.375.jar. The hadoop-aws version matches the Hadoop version of the Spark build (spark-3.1.2-bin-hadoop3.2).

Step 2: Configuration

Option 1:
- Load the jars from within the Python script:

import findspark
findspark.init('/home/XXX/spark-3.1.2-bin-hadoop3.2')  # point findspark at the local Spark install
from pyspark.sql import SparkSession

# Colon-separated paths to the hadoop-aws jar and the AWS SDK bundle jar
extra_jars = "./spark-3.1.2-bin-hadoop3.2/jars/hadoop-aws-3.2.0.jar:./spark-3.1.2-bin-hadoop3.2/jars/aws-java-sdk-bundle-1.11.375.jar"

# Put both jars on the driver and executor classpaths
spark = SparkSession.builder.appName("SparkByExamples.com")\
    .config("spark.driver.extraClassPath", extra_jars)\
    .config("spark.executor.extraClassPath", extra_jars)\
    .getOrCreate()
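
The classpath value above is long and easy to mistype. A minimal sketch of building it from a directory and a list of jar names follows. The helper name `build_classpath` is hypothetical (it is not part of PySpark), and it assumes both jars sit in the same directory and that the platform uses `:` as the classpath separator.

```python
def build_classpath(jars_dir, jar_names):
    """Join jar file names under jars_dir into a colon-separated classpath string."""
    base = jars_dir.rstrip("/")
    return ":".join(f"{base}/{name}" for name in jar_names)


# Jar names are the ones used in the configuration above.
JAR_NAMES = ["hadoop-aws-3.2.0.jar", "aws-java-sdk-bundle-1.11.375.jar"]

classpath = build_classpath("./spark-3.1.2-bin-hadoop3.2/jars", JAR_NAMES)
print(classpath)
```

The resulting string can be passed to both `spark.driver.extraClassPath` and `spark.executor.extraClassPath`, so the two settings cannot drift apart.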

Option 2:
- Set up the local configuration file spark-defaults.conf in $SPARK_HOME/conf/:

spark.executor.extraClassPath /jars/aws-java-sdk-bundle-1.11.375.jar:/lib/hadoop-aws-3.2.0.jar
spark.driver.extraClassPath /jars/aws-java-sdk-bundle-1.11.375.jar:/lib/hadoop-aws-3.2.0.jar

Once this configuration is in place, there is no need to configure again unless the extra jar files are deleted.

Step 3: Connect to S3 with an access key and secret key using s3a://

# Get the SparkContext from the session created in Step 2
sc = spark.sparkContext
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "XXX")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "XXX")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.amazonaws.com")
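
Hard-coding the keys in a script makes them easy to leak. One possible alternative, sketched below, reads them from environment variables. `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard AWS variable names. `S3_ENDPOINT` and both helper functions are hypothetical names introduced here for illustration.

```python
import os


def s3a_settings_from_env(env=None):
    """Build the fs.s3a.* settings from environment variables.

    Raises KeyError if either credential variable is missing.
    """
    env = os.environ if env is None else env
    return {
        "fs.s3a.access.key": env["AWS_ACCESS_KEY_ID"],
        "fs.s3a.secret.key": env["AWS_SECRET_ACCESS_KEY"],
        # S3_ENDPOINT is an assumed, optional override; the default is the
        # endpoint used in the step above.
        "fs.s3a.endpoint": env.get("S3_ENDPOINT", "s3.amazonaws.com"),
    }


def apply_s3a_settings(spark, settings):
    """Copy each setting into the Hadoop configuration of an existing SparkSession."""
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    for key, value in settings.items():
        hadoop_conf.set(key, value)
```

With a session already created, `apply_s3a_settings(spark, s3a_settings_from_env())` would set the same three properties as the lines above.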

Step 4: Read remote CSV file

df = spark.read.csv('s3a://bucket-name/object-name')
df.show()
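
By default `spark.read.csv` treats every row as data and every column as a string. If the file has a header row, a sketch like the following may be more convenient. The helper names `s3a_uri` and `read_csv_from_s3` are hypothetical, while `header` and `inferSchema` are existing options of `spark.read.csv`.

```python
def s3a_uri(bucket, key):
    """Build an s3a:// URI from a bucket name and an object key."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def read_csv_from_s3(spark, bucket, key, header=True, infer_schema=True):
    """Read a CSV object from S3 into a DataFrame.

    header=True uses the first row as column names; inferSchema=True makes
    Spark scan the data to guess column types instead of using strings.
    """
    return spark.read.csv(s3a_uri(bucket, key), header=header, inferSchema=infer_schema)
```

For example, `read_csv_from_s3(spark, "bucket-name", "object-name").printSchema()` would show the inferred column types, assuming the session and credentials from the earlier steps.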

Using $SPARK_HOME/bin/spark-submit --packages

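Unfortunately, this did not work in my case and kept failing with the error below. The message indicates that spark-submit expects an application jar or Python file after the options, and the command passes none.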

ubuntu@xxx:~/spark-3.1.2-bin-hadoop3.2$ bin/spark-submit --packages org.apache.hadoop:hadoop-aws:3.2.0

Error: Missing application resource.

Help:

Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
...
--packages      Comma-separated list of maven coordinates of jars to include
                on the driver and executor classpaths. Will search the local
                maven repo, then maven central and any additional remote
                repositories given by --repositories. The format for the
                coordinates should be groupId:artifactId:version.
...
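
The same Maven coordinates can also be supplied from Python through the `spark.jars.packages` property, which is the configuration counterpart of `--packages`. The sketch below is untested against the setup in this post. The helper names are hypothetical, and the property only takes effect if it is set before the first SparkSession is created.

```python
def packages_value(coordinates):
    """Validate groupId:artifactId:version coordinates and join them with commas."""
    for coord in coordinates:
        parts = coord.split(":")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected groupId:artifactId:version, got {coord!r}")
    return ",".join(coordinates)


def build_session_with_packages(coordinates, app_name="s3a-example"):
    """Create a SparkSession that asks Spark to resolve the given packages."""
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder.appName(app_name)
        .config("spark.jars.packages", packages_value(coordinates))
        .getOrCreate()
    )


# The coordinate used in the spark-submit command above.
PACKAGES = ["org.apache.hadoop:hadoop-aws:3.2.0"]
print(packages_value(PACKAGES))
```

Calling `build_session_with_packages(PACKAGES)` would let Spark download hadoop-aws and its dependencies at startup rather than relying on jars copied by hand.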