Debugging Spark applications on EMR

29 Dec 2018

Some notes on debugging and monitoring issues with Spark on EMR.

Sample PySpark job submission

spark-submit \
--deploy-mode cluster \
--master yarn \
--num-executors 4 \
--driver-memory 4G \
--executor-memory 16G \
my_pyspark_script.py

This returns an application ID among other things.
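The application ID can be pulled out of the client output for use in the yarn commands below. A minimal sketch, assuming the client prints the usual "Submitted application application_..." line (the LOG variable here stands in for real spark-submit output):

```shell
# LOG stands in for the client output of a spark-submit run;
# in practice you would capture `spark-submit ... 2>&1` instead.
LOG="18/12/30 04:45:01 INFO Client: Submitted application application_1546029821006_0029"
APP_ID=$(echo "$LOG" | grep -oE 'application_[0-9]+_[0-9]+' | head -1)
echo "$APP_ID"   # application_1546029821006_0029
```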

Get the application's final status and more details with yarn application -status:

[hadoop@ip-xxx-xx-xx-xxx ~]$ yarn application -status application_1546029821006_0029
18/12/30 05:05:29 INFO client.RMProxy: Connecting to ResourceManager at ip-xxx
Application Report :
	Application-Id : application_1546029821006_0029
	Application-Name : my_pyspark_script.py
	Application-Type : SPARK
	User : hadoop
	Queue : default
	Application Priority : 0
	Start-Time : 1546145137699
	Finish-Time : 1546145173658
	Progress : 100%
	State : FINISHED
	Final-State : FAILED <<<<<<<<<<<
	Tracking-URL : ip-xxx.ec2.internal:18080/history/application_1546029821006_0029/2
	RPC Port : 0
	Aggregate Resource Allocation : 2050010 MB-seconds, 136 vcore-seconds
	Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
	Log Aggregation Status : SUCCEEDED
	Diagnostics : User application exited with status 1
	Unmanaged Application : false
	Application Node Label Expression : <Not set>
	AM container Node Label Expression : <DEFAULT_PARTITION>
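When checking many runs, it helps to pull Final-State out of the report programmatically. A minimal sketch that parses the key/value lines above (the " : " separator is taken from the sample report; parse_application_report is a hypothetical helper, not a yarn feature):

```python
def parse_application_report(text):
    """Parse the 'Key : Value' lines of a yarn application -status report."""
    report = {}
    for line in text.splitlines():
        # partition on the first " : " so values containing ":" survive intact
        key, sep, value = line.strip().partition(" : ")
        if sep:
            report[key] = value
    return report

sample = """\
State : FINISHED
Final-State : FAILED
Diagnostics : User application exited with status 1
"""
print(parse_application_report(sample)["Final-State"])  # FAILED
```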

In the UI, individual tasks might show as completed under the Jobs section, especially if the error happened on the driver side.

http://ec2-xxx.compute-1.amazonaws.com:18080/history/application_1546029821006_0029/2/executors/

To see the error, check the stdout and stderr logs for the driver.

Sample output from one of the failed jobs:

Log Type: stdout
Log Upload Time: Sun Dec 30 04:46:15 +0000 2018
Log Length: 472

Writing rows from xyz to s3://xyz...
Traceback (most recent call last):
  File "my_pyspark_script.py", line 35, in <module>
    write_rows('my_df')
  File "my_pyspark_script.py", line 23, in write_rows
    spark.sql("select * from {} limit 100".format(view_name)).write.mode("overwrite").path(output_path)
AttributeError: 'DataFrameWriter' object has no attribute 'path'
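The traceback shows the actual bug: DataFrameWriter has no path() method. A hedged sketch of the fix, assuming the intent was to write the rows out as Parquet (the format and the explicit spark argument are assumptions; save(output_path) would use the session's default data source instead):

```python
# Corrected write_rows: DataFrameWriter exposes mode()/save()/parquet(),
# not path(). spark is passed in explicitly so the sketch is self-contained.
def write_rows(spark, view_name, output_path):
    (spark.sql("select * from {} limit 100".format(view_name))
          .write
          .mode("overwrite")
          .parquet(output_path))  # or .save(output_path) for the default format
```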

You can also get the same output by fetching the aggregated logs from YARN:

yarn logs -applicationId application_1546029821006_0029
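The aggregated dump covers every container on the cluster, so it can get long; grep can narrow it down to the Python traceback. A small sketch (find_traceback is a hypothetical helper, and the context line counts are arbitrary):

```shell
# Print any Python traceback with a few lines of surrounding context.
find_traceback() {
  grep -B1 -A5 "Traceback"
}

# Typical use against the cluster:
#   yarn logs -applicationId application_1546029821006_0029 | find_traceback
```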
