This notebook requires the Apache Toree kernel to be installed in Jupyter and Spark to be installed on the machine. Install instructions are at the Apache Toree site. Be sure to read both the quick start and the installation guide.
Download the file SparkFirstExample.zip and unpack it to access the notebook that you can run.
You can verify the version of Spark by executing:
sc.version
The following returns the master URL that Spark is using. local[*]
means that Spark runs locally with one worker thread per logical core on your machine.
sc.getConf.getOption("spark.master")
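To see how much parallelism that master setting gives you in practice, a minimal sketch follows. It assumes `sc` is the SparkContext that the Toree kernel provides, as in the cells above; the names `master` and `parallelism` are only illustrative.

```scala
// Assumes `sc` is the SparkContext created by the Toree kernel for this notebook.
val master = sc.getConf.getOption("spark.master").getOrElse("(not set)")

// defaultParallelism is the default number of partitions Spark uses for
// operations such as parallelize. In local mode it normally equals the number
// of worker threads, unless spark.default.parallelism overrides it.
val parallelism = sc.defaultParallelism

println(s"master = $master, default parallelism = $parallelism")
```

With local[*] the printed parallelism should match the number of logical cores on the machine.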
First, an example that creates an RDD.
sc.parallelize(1 to 100)
count is an action, so it runs the computation and brings the result back to the driver.
sc.parallelize(1 to 100).count
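The difference between transformations and actions is easier to see with a slightly longer chain. This is a small sketch that assumes the same `sc` provided by the kernel; the names `numbers`, `doubledEvens`, and `evenCount` are made up for the illustration.

```scala
// Transformations: nothing is computed yet, Spark only records what to do.
val numbers = sc.parallelize(1 to 100)
val doubledEvens = numbers.filter(_ % 2 == 0).map(_ * 2)

// Action: count() triggers the job and returns a Long to the driver.
val evenCount = doubledEvens.count()
println(evenCount)  // 50, since there are 50 even numbers between 1 and 100
```

Only the last cell line starts a Spark job; filter and map return immediately.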
Now for some examples using DataFrames. First, import the Spark SQL package and get a SparkSession.
import org.apache.spark.sql._
val spark = SparkSession.builder().appName("Sample").getOrCreate()
The file listed below is included in the .zip file. Change the path below to point to where you unpacked it.
val jsonFlightFile = "/Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summary.json"
val flightData2015 = spark.read.json(jsonFlightFile)
flightData2015.take(2)
flightData2015.explain()
flightData2015.explain(true)
val sortedFlightData2015 = flightData2015.sort("count")
sortedFlightData2015.show
sortedFlightData2015.show(5)
sortedFlightData2015.explain(true)
val grouped = sortedFlightData2015.groupBy("DEST_COUNTRY_NAME")
grouped.mean("count").show
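The same kind of grouped average can also be written with agg, which lets you name the result column and sort on it afterwards. Sorting after the aggregation is the useful order, because groupBy does not preserve the ordering of its input. The sketch below uses a tiny hand-made DataFrame so it runs without the flight file; the rows are hypothetical, only the column names match the data above, and `spark` is assumed to be the SparkSession created earlier.

```scala
import org.apache.spark.sql.functions.{avg, desc}
import spark.implicits._  // assumes `spark` is the SparkSession defined above

// Hypothetical sample rows, not taken from 2015-summary.json.
val sample = Seq(
  ("United States", 10L),
  ("United States", 30L),
  ("Canada", 5L)
).toDF("DEST_COUNTRY_NAME", "count")

// Average count per destination, largest average first.
val avgByDest = sample
  .groupBy("DEST_COUNTRY_NAME")
  .agg(avg("count").alias("avg_count"))
  .orderBy(desc("avg_count"))

avgByDest.show()
```

Replacing `sample` with flightData2015 should give the per-destination averages for the real file, sorted from largest to smallest.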