
Processing Data: Introducing Apache Spark

Apache Spark is a powerful distributed data processing engine that can handle petabytes of data by chunking it and dividing the work across a cluster of resources. In this course, explore Spark's Structured Streaming engine and tools such as the PySpark shell. Begin by downloading and installing Apache Spark. Then create a Spark cluster, run a job from the PySpark shell, and monitor applications and job runs from the Spark web user interface. Next, set up a streaming environment, reading and manipulating the contents of files that are added to a folder in real time. Finally, run apps in both Spark standalone and local modes. The sketches below illustrate each of these steps.
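To make the "run a job from the PySpark shell" step concrete, here is a minimal sketch of the kind of job the course has in mind. In the PySpark shell a SparkSession named spark is created for you automatically; building one explicitly, as below, also lets the snippet run as a standalone script. The app name and the sum-of-squares workload are illustrative choices, not taken from the course.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pyspark-shell-demo")  # illustrative name
         .getOrCreate())

# Distribute a small range across the cluster and run a simple aggregation.
rdd = spark.sparkContext.parallelize(range(1, 1001), numSlices=8)
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(f"Sum of squares 1..1000: {total}")

spark.stop()

While such a job runs, its stages and tasks can be monitored from the Spark web user interface, which the application serves at port 4040 on the driver host by default.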
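The streaming step, reading files added to a folder in real time, maps directly onto Structured Streaming's file source. The sketch below watches a folder for new CSV files and filters each micro-batch; the path /tmp/stream_input and the name/age columns are assumptions for illustration, since file sources require an explicit schema.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("folder-stream-demo").getOrCreate()

# File sources require a schema up front; these columns are illustrative.
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
])

# Each new CSV dropped into the folder is picked up as a micro-batch.
stream_df = (spark.readStream
             .schema(schema)
             .csv("/tmp/stream_input"))  # assumed watch folder

# A streaming DataFrame can be manipulated like a static one.
adults = stream_df.where("age >= 18")

# Print each micro-batch to the console; awaitTermination blocks until stopped.
query = (adults.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()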
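Finally, the difference between local and standalone modes comes down to the master URL the application connects to, as this sketch shows. In local mode everything runs in a single JVM, with [*] using all available cores; in standalone mode the application attaches to a running Spark master, whose default port is 7077. The localhost address is an assumption here, standing in for wherever the master was started.

from pyspark.sql import SparkSession

# Local mode: driver and executors share one JVM on this machine.
local_spark = (SparkSession.builder
               .appName("local-mode-demo")
               .master("local[*]")
               .getOrCreate())
local_spark.stop()

# Standalone mode: connect to a running Spark master (host is assumed).
cluster_spark = (SparkSession.builder
                 .appName("standalone-mode-demo")
                 .master("spark://localhost:7077")
                 .getOrCreate())
cluster_spark.stop()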