I remember many discussions at one of my earlier companies about HDFS versus EMR+S3. We were piloting a small project on Hadoop and quickly came to the realization that EMR was probably the safest way to go, though it was very clear that the early adopters were mostly running in-house clusters. There was a lot of back and forth with AWS's EMR team, who helped us a great deal in getting everything set up.
Contrast this with Spark, and it is clear it was more readily integrated with cloud providers (in my experience, AWS) from the very beginning. Databricks' Community Edition runs on AWS, the MLlib course on Coursera makes good use of spot instance availability, and a fair number of examples use S3 as the storage subsystem, which is increasingly becoming the storage of choice for companies going the cloud route.
Most batch-processing workloads are time-window-based, and being able to shrink cluster sizes in a matter of minutes in an automated fashion is a fantastic way to keep your costs low while also making pipelines as efficient and fast as they can be.
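To make the idea concrete, here is a minimal sketch of a time-window scaling policy: a function a scheduler (cron, Airflow, and so on) might call every few minutes to decide how many worker nodes the cluster should have, sizing up for the nightly batch window and down to a skeleton crew otherwise. The function name, window, and node counts are all hypothetical, not from any real deployment; the actual resize call would go through the cloud provider's API.

```python
def target_cluster_size(hour: int,
                        batch_window=(0, 6),   # hypothetical nightly batch window, 00:00-06:00
                        batch_nodes: int = 40,  # illustrative full-size cluster
                        idle_nodes: int = 2) -> int:
    """Return the desired number of worker nodes for a given hour (0-23)."""
    start, end = batch_window
    in_window = start <= hour < end
    # Run large only while the batch window is open; pay for a minimal
    # cluster the rest of the day.
    return batch_nodes if in_window else idle_nodes

print(target_cluster_size(2))   # inside the batch window
print(target_cluster_size(14))  # outside it
```

The point is less the policy itself than that it is trivially automatable: once the decision is a pure function of time (or queue depth), resizing stops being a manual operations task.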
This comes as no surprise, as the team at AMPLab had been thinking about cloud computing and big-data processing from the very first days, but it is very useful to see the examples use the s3:// URI instead of having to google "hdfs ls equivalent in EMRFS" :)
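In practice this is exactly why the googling is unnecessary: on an EMR cluster, EMRFS lets the ordinary Hadoop filesystem shell take s3:// URIs directly, so the familiar commands just work against S3. The bucket name below is a hypothetical placeholder.

```shell
# List an S3 prefix from an EMR node exactly as you would an HDFS directory:
hadoop fs -ls s3://my-example-bucket/input/

# The same path works from outside the cluster with the AWS CLI:
aws s3 ls s3://my-example-bucket/input/
```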