Debugging Spark applications on EMR
29 Dec 2018
Some notes on debugging and monitoring issues with Spark on EMR.
A sample PySpark job submission:
spark-submit \
--deploy-mode cluster \
--master yarn \
--num-executors 4 \
--driver-memory 4G \
--executor-memory 16G \
my_pyspark_script.py
This returns an application ID, among other things, which you can use to get the application's final status and more details:
[hadoop@ip-xxx-xx-xx-xxx ~]$ yarn application -status application_1546029821006_0029
18/12/30 05:05:29 INFO client.RMProxy: Connecting to ResourceManager at ip-xxx
Application Report :
Application-Id : application_1546029821006_0029
Application-Name : my_pyspark_script.py
Application-Type : SPARK
User : hadoop
Queue : default
Application Priority : 0
Start-Time : 1546145137699
Finish-Time : 1546145173658
Progress : 100%
State : FINISHED
Final-State : FAILED <<<<<<<<<<<
Tracking-URL : ip-xxx.ec2.internal:18080/history/application_1546029821006_0029/2
RPC Port : 0
Aggregate Resource Allocation : 2050010 MB-seconds, 136 vcore-seconds
Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
Log Aggregation Status : SUCCEEDED
Diagnostics : User application exited with status 1
Unmanaged Application : false
Application Node Label Expression : <Not set>
AM container Node Label Expression : <DEFAULT_PARTITION>
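If you check this status often, the Final-State line can also be pulled out of the report programmatically. A minimal sketch, assuming the yarn CLI is on PATH (the helper names here are my own, not part of any Spark or YARN API):

```python
import subprocess


def parse_final_state(report: str) -> str:
    """Extract the Final-State value from a `yarn application -status` report."""
    for line in report.splitlines():
        # Report lines look like "\tFinal-State : FAILED"; split on the
        # first colon so URLs with ports in other lines are not mangled.
        key, _, value = line.partition(":")
        if key.strip() == "Final-State":
            return value.strip()
    raise ValueError("no Final-State line found in report")


def final_state(application_id: str) -> str:
    # Shells out to the yarn CLI; assumes it is installed and on PATH.
    report = subprocess.run(
        ["yarn", "application", "-status", application_id],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_final_state(report)
```

This is handy in a retry loop or a CI job that submits work and then needs to know whether the application actually succeeded, since `State : FINISHED` alone does not mean success.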
In the UI, individual tasks might show as completed under the Jobs section, especially if the error happened on the driver side.
http://ec2-xxx.compute-1.amazonaws.com:18080/history/application_1546029821006_0029/2/executors/
To see the error, check out the stdout and stderr of the driver. Sample output from one of the failed jobs:
Log Type: stdout
Log Upload Time: Sun Dec 30 04:46:15 +0000 2018
Log Length: 472
Writing rows from xyz to s3://xyz...
Traceback (most recent call last):
File "my_pyspark_script.py", line 35, in <module>
write_rows('my_df')
File "my_pyspark_script.py", line 23, in write_rows
spark.sql("select * from {} limit 100".format(view_name)).write.mode("overwrite").path(output_path)
AttributeError: 'DataFrameWriter' object has no attribute 'path'
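The bug in this particular traceback is that DataFrameWriter has no path method; the output path goes to a format-specific method such as parquet, or to the generic save. A corrected write_rows sketch (the function signature and names are reconstructed from the traceback, so treat them as assumptions):

```python
def write_rows(spark, view_name, output_path):
    """Write the first 100 rows of a registered view to output_path."""
    # DataFrameWriter has no .path() method; use .parquet(path), or
    # .format(...).save(path) for other output formats.
    (spark.sql("select * from {} limit 100".format(view_name))
        .write
        .mode("overwrite")
        .parquet(output_path))
```

In the original script `spark` was presumably a module-level SparkSession; passing it in explicitly just keeps the sketch self-contained.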
You can also get the same logs from YARN:
yarn logs -applicationId application_1546029821006_0029
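The aggregated YARN logs are long and mix many containers together, so a small helper that scans them for Python tracebacks can save some scrolling. A sketch, again assuming the yarn CLI is available (the extraction is a plain string scan, not anything YARN provides):

```python
import subprocess


def extract_tracebacks(log_text: str):
    """Return each Python traceback found in aggregated YARN logs as a string."""
    tracebacks, current = [], None
    for line in log_text.splitlines():
        if line.startswith("Traceback (most recent call last):"):
            current = [line]
        elif current is not None:
            # Indented lines belong to the traceback; the first non-indented
            # line is the final exception message, which ends the block.
            current.append(line)
            if line and not line.startswith((" ", "\t")):
                tracebacks.append("\n".join(current))
                current = None
    return tracebacks


def tracebacks_for(application_id: str):
    # Assumes log aggregation has finished (Log Aggregation Status: SUCCEEDED
    # in the report above); otherwise `yarn logs` has nothing to return.
    logs = subprocess.run(
        ["yarn", "logs", "-applicationId", application_id],
        capture_output=True, text=True, check=True,
    ).stdout
    return extract_tracebacks(logs)
```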