Apache Spark
Recently I’ve had the opportunity to dig into Apache Spark, thanks to some training from Brian Bloechle of Cloudera. What is Spark? Fast, flexible, and developer-friendly, Apache Spark is a leading platform for large-scale SQL, batch processing, stream processing, and machine learning. Java, Scala, Python, and R are first-class citizens when it comes to consuming the various Spark APIs. I’ll cover PySpark, the Python API, in more detail. Spark is a cluster-manager-agnostic processing engine that can target a number of cluster managers, including Spark Standalone, Hadoop’s YARN, Apache Mesos, and Kubernetes. In the context of Spark, some useful parts of the surrounding ecosystem to be aware of: ...