Build Up a Personal Cloud Workstation Using AWS Technologies

Author: Zizhun Guo





Goal

Set up a low-cost personal cloud workstation on AWS for occasional data analysis work: pay only while the machine runs, and keep storage costs down while it is stopped.

EC2 Instance Selection

Link: Amazon EC2 On-Demand Pricing

Rather than running the instance 24/7, On-Demand pricing suits quick data analysis jobs well: the machine stays stopped when not in use, which saves money. An even cheaper strategy is the Spot Instance, which bids against the On-Demand price each time the instance launches but gets shut down whenever the market price rises above the bid; a snapshot image can then be used to restore the system and data.
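
Stopping and starting the instance around work sessions is easy to script with the AWS CLI (the instance ID below is a placeholder for your own):

aws ec2 stop-instances --instance-ids i-0123456789abcdef0    # end of session: stop hourly compute charges
aws ec2 start-instances --instance-ids i-0123456789abcdef0   # next session: bring the workstation back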

Instance name | On-Demand hourly rate | vCPU | Memory | Storage  | Network performance
c5.2xlarge    | $0.34                 | 8    | 16 GiB | EBS Only | Up to 10 Gigabit
Amazon EC2 Instance Types: https://aws.amazon.com/ec2/instance-types/
Storage Selection Option: EBS

Link: Amazon EBS Volumes Pricing

The default gp2 type works fine for personal use; it lets users freely attach, detach, replace, and delete volumes on an instance on demand. See: Amazon Elastic Block Store (Amazon EBS).

gp2 costs $0.10/GB per month, so maintaining a large volume is not a good deal: AWS charges from the moment the EBS volume is created, even while the instance it is attached to is stopped. One way to lower the cost is to snapshot the volume to S3 and then delete it once the instance is off, since snapshot storage costs only $0.023/GB per month. See: Amazon EBS snapshots.
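
A minimal sketch of this snapshot-then-delete workflow with the AWS CLI (all IDs and the Availability Zone are placeholders):

aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "workstation backup"
# after the snapshot completes, delete the volume to stop EBS charges
aws ec2 delete-volume --volume-id vol-0123456789abcdef0
# later, restore a new volume from the snapshot in the instance's Availability Zone
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone us-east-1a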

Storage Selection Option: S3

Link: Amazon S3 pricing

Standard S3 provides durable storage at only $0.023 per GB per month (for the first 50 TB/month), $0.01 per 1,000 requests, and no charge for data transferred from an S3 bucket to any AWS service within the same AWS Region. What a deal!
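
For moving data in and out, aws s3 sync keeps a local directory and a bucket prefix in step (the bucket name is a placeholder):

aws s3 sync ~/results s3://my-workstation-bucket/results    # upload before shutting down
aws s3 sync s3://my-workstation-bucket/results ~/results    # restore onto a fresh instance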

Remote Development on .ipynb Files Using VS Code + Jupyter Notebook

For VS Code, install the Remote-SSH extension and the Jupyter extension.

Whether using VS Code + Jupyter or the plain Jupyter Notebook, the client end needs the .pem key for authorization in order to connect to the cloud Jupyter server.
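
For the Remote-SSH connection itself, a typical ~/.ssh/config entry on the client looks like the following (host alias, public DNS, and key path are placeholders for your own values):

Host aws-workstation
    HostName ec2-xx-xx-xx-xx.compute-1.amazonaws.com
    User ubuntu
    IdentityFile ~/.ssh/xxx.pem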

Create a self-signed certificate on the instance for the Jupyter server:

ubuntu@xxx:~$ sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout xxx.pem -out xxx.pem

Insert these lines into the Jupyter configuration:

ubuntu@xxx:~$ vi ~/.jupyter/jupyter_notebook_config.py
c = get_config()
# path to the self-signed certificate generated above
c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'
# listen on all interfaces so the server is reachable remotely
c.NotebookApp.ip = '*'
# headless instance: do not try to open a browser
c.NotebookApp.open_browser = False
# serve on a fixed port
c.NotebookApp.port = 8888
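
With the configuration in place, start the server on the instance and open it from the local browser (this assumes port 8888 is open in the instance's security group; the DNS name is a placeholder):

ubuntu@xxx:~$ jupyter notebook

Then browse to https://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:8888 (note the explicit https://, as discussed below).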

Some troubles you might encounter during the configuration:

Trouble with Jupyter certifications: Permission Denied

Change the ownership of the Jupyter data directory to the current user (replace user:user with your own username):

sudo chown -R user:user ~/.local/share/jupyter

SSLError: [SSL: EE_KEY_TOO_SMALL] ee key too small (_ssl.c:4022) on Ubuntu when starting jupyter notebook

Generate a larger (2048-bit) key:

sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout mycert.pem -out mycert.pem

Jupyter on EC2: SSL Error

Explicitly use https:// in the address bar, since Chrome defaults to http://.

How to install xfs and create an xfs file system on Debian/Ubuntu Linux

sudo apt install xfsprogs
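
Then, assuming the extra EBS volume shows up as /dev/xvdf (device names vary by instance and virtualization type), creating and mounting the file system might look like:

sudo mkfs.xfs /dev/xvdf        # create the xfs file system on the empty volume
sudo mkdir -p /data
sudo mount /dev/xvdf /data     # mount it
df -h /data                    # verify the new mount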

Resources

Two Methods for Remotely Accessing a Server's Jupyter Notebook


Spark Configuration
  1. Activate spark-defaults.conf (this file is loaded when the Spark server launches).
  2. Add these lines to the file (a quick s3a smoke test follows the list):
    spark.executor.extraClassPath /jars/aws-java-sdk-bundle-1.11.375.jar:/lib/hadoop-aws-3.2.0.jar
    spark.driver.extraClassPath /jars/aws-java-sdk-bundle-1.11.375.jar:/lib/hadoop-aws-3.2.0.jar
    spark.driver.memory 8g
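
To sanity-check the classpath setup, a minimal PySpark smoke test like the one below should be able to read from a bucket over s3a:// (the bucket and key are placeholders, and this assumes AWS credentials are available, e.g. from an attached IAM instance role):

from pyspark.sql import SparkSession

# local-mode session; the extraClassPath jars above provide the s3a file system
spark = (SparkSession.builder
         .master("local[*]")
         .appName("s3a-smoke-test")
         .getOrCreate())

# placeholder bucket/key; needs valid AWS credentials to succeed
df = spark.read.csv("s3a://my-workstation-bucket/data/sample.csv", header=True)
df.show(5)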
    

Some suggest 5g of driver memory for maximum throughput, but my tests show 8g works best for this instance. Since I use only local mode, both the driver and the executor run in a single JVM, so I cannot set the number of executors or the number of cores per executor; the driver, together with its executor, uses all 8 cores for its threads.
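
For reference, the same settings can also be passed per job at submit time (the script name is a placeholder):

spark-submit --master "local[*]" --driver-memory 8g my_analysis.py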

I may try running my programs on a cluster of instances, but for this scale of data processing a single machine seems workable. I am also migrating my old deep learning project onto the instance and hope the TensorFlow program will run there, though this plan may be delayed while my job hunt is still ongoing.

Configuration resources:

How to set Apache Spark Executor memory?

Spark executors and shuffle in local mode

Spark standalone configuration having multiple executors

Number of Executors in Spark Local Mode

How to change number of executors in local mode?

Not able to setup spark.driver.cores

spark.driver.cores setting in spark standalone cluster mode

Configuring memory and CPU options

Apache Spark Memory Management

Understanding the Spark Driver and Executor

A Detailed Explanation of Spark's Basic Architecture

A Summary of Some Spark Configuration Options

The executor-memory Parameter in Spark Explained

How to Set the Size of Spark Executor Memory

How to Configure Spark Resources

Temporary file cleanup for Spark:

Cleaning Up Temporary Files in Spark 1.6.2

Apache Spark in Practice (1): Temporary File Cleanup and Log Level Modification in Standalone Deployment Mode

Checking the Size of a File or Directory on Linux


Copyright @ 2021 Zizhun Guo. All Rights Reserved.
