Chapter 8
Data Analytics in the Cloud
“Science is what we understand well enough to explain to a computer.
Art is everything else we do.”
—Donald Knuth
What we know today as public clouds were originally created as internal data
centers to support services such as e-commerce, e-mail, and web search, each of
which involved the acquisition of large data collections. To optimize these services,
companies started performing massive amounts of analysis on those data. After
these data centers morphed into public clouds, the methods developed for these
analysis tasks were increasingly made available as services and open source software.
Universities and other companies have also contributed to a growing collection
of excellent open source tools. Together this collection represents an enormous
ecosystem of software supported by a large community of contributors.
The topic of data analytics in the cloud is huge and rapidly evolving, and could
easily fill an entire volume itself. Furthermore, as we observed in chapter 7, science
and engineering are themselves rapidly evolving towards a new data-driven fourth
paradigm of data-intensive discovery and design [ ]. We survey some of the most
significant approaches to cloud data analytics and, as we have done throughout
this book, leave you with examples of experiments that you can try on your own.
We begin with the first major cloud data analysis tool, Hadoop, and describe
its evolution to incorporate Apache YARN. Rather than describe the traditional
Hadoop MapReduce tool, we focus on the more modern and flexible Spark system.
Both Amazon and Microsoft have integrated versions of YARN into their standard
service offerings. We describe how to use Spark with the Amazon version of YARN,
Amazon Elastic MapReduce, which we use to analyze Wikipedia data. We also
present an example of the use of the Azure version of YARN, HDInsight.
We then turn to the topic of analytics on truly massive data collections. We
introduce Azure Data Lake and illustrate the use of both Azure Data Lake Analytics
and Amazon’s similar Athena analytics platform. Finally, we describe a tool from
Google called Cloud Datalab, which we use to explore National Oceanographic
and Atmospheric Administration (NOAA) data.
8.1 Hadoop and YARN
We have already introduced Hadoop and the MapReduce concept in chapter 7.
Now is the time to go beyond an introduction. When Hadoop was introduced, it
was widely seen as the tool to use to solve many large data analysis problems. It
was not necessarily efficient, but it worked well for extremely large data collections
distributed over large clusters of servers.
The Hadoop Distributed File System (HDFS) is a key Hadoop building
block. Written in Java, HDFS is completely portable and based on standard
network TCP sockets. When deployed, it has a single NameNode, used to track
data location, and a cluster of DataNodes, used to hold the distributed data
structures. Individual files are broken into 64 MB blocks, which are distributed
across the DataNodes and also replicated to make the system more fault tolerant.
As illustrated in figure 8.1 on the following page, the NameNode keeps track of
the location of each file block and the replicas.
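The arithmetic behind this blocking is simple: a file occupies ceil(size / 64 MB) blocks, and each block is stored once per replica (three copies is the usual HDFS default). The following sketch illustrates the bookkeeping in plain Python; `hdfs_footprint` is an illustrative helper, not an HDFS API:

```python
import math

BLOCK_SIZE = 64 * 2**20   # the classic 64 MB HDFS block size
REPLICATION = 3           # common default replication factor (an assumption here)

def hdfs_footprint(file_size: int) -> tuple:
    """Return (number of blocks, total bytes stored across all DataNodes)."""
    blocks = math.ceil(file_size / BLOCK_SIZE)
    return blocks, file_size * REPLICATION

# The 59 GB Wikipedia dump used later in this section splits into 882 blocks:
blocks, stored = hdfs_footprint(59189269956)
```

With three-way replication, the NameNode must track the locations of all 882 blocks and their replicas, roughly 2,600 block-to-DataNode mappings for this one file.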
HDFS is not a POSIX file system: it is write-once, read-many, and only
eventually consistent. However, command line tools make it usable in a manner
similar to a standard Unix file system. For example, the following commands create
a “directory” in HDFS, pull a copy of Wikipedia from a website, push those data
to HDFS (where they are blocked, replicated and stored), and list the directory.
$ hadoop fs -mkdir /user/wiki
$ curl -s -L http://dumps.wikimedia.org/enwiki/...multistream.xml.bz2 \
    | bzip2 -d | hadoop fs -put - /user/wiki/wikidump-en.xml
$ hadoop fs -ls /user/wiki
Found 1 items
-rw-r--r--  hadoop 59189269956 21:29 /user/wiki/wikidump-en.xml
Hadoop and HDFS were originally created to support only Hadoop MapReduce
tasks. However, the ecosystem rapidly grew to include other tools. In addition, the
original Hadoop MapReduce tool could not support important application classes
Figure 8.1: Hadoop Distributed File System with four DataNodes and two files broken
into blocks and distributed. The NameNode keeps track of the blocks and replicas.
such as those requiring an iterative application of MapReduce [ ] or the reuse
of distributed data structures.
YARN (Yet Another Resource Negotiator) represents the evolution of
the Hadoop ecosystem into a full distributed job management system. It has a
resource manager and scheduler that communicates with node manager processes
in each worker node. Applications connect to the resource manager, which then
spins up an application manager for that application instance. As illustrated in
figure 8.2 on the next page, the application manager interacts with the resource
manager to obtain “containers” for its worker nodes on the cluster of servers. This
model allows multiple applications to run on the system concurrently.
YARN is similar in many respects to the Mesos system described in chapter 7.
The primary difference is that YARN is designed to schedule MapReduce-style
jobs, whereas Mesos is designed to support a more general class of computations,
including containers and microservices. Both systems are widely used.
8.2 Spark
Spark’s design addresses limitations in the original Hadoop MapReduce computing
paradigm. In Hadoop’s linear dataflow structure, programs read input data from
disk, map a function across the data, reduce the results of the map, and store
reduction results on disk. Spark supports a more general graph execution model
Figure 8.2: YARN distributed resource manager architecture.
that allows for iterative MapReduce as well as more efficient data reuse. Spark
is also interactive and much faster than pure Hadoop. It runs on both YARN
and Mesos, as well as on a laptop and in a Docker container. In the following
paragraphs we provide a gentle introduction to Spark and present some data
analytics examples that involve its use.
A central Spark construct is the Resilient Distributed Dataset (RDD), a
data collection that is distributed across servers and mapped to disk or memory,
providing a restricted form of distributed shared memory. Spark is implemented
in Scala, an interpreted, statically typed object-functional language. Spark has
a library of Scala parallel operators, similar to the Map and Reduce operations
used in Hadoop, that perform transformations on RDDs. (The library also has
a nice Python binding.) More precisely, Spark has two types of operations:
transformations, which map RDDs into new RDDs, and actions, which return values
to the main program: usually the read-eval-print loop, such as Jupyter.
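The transformation/action distinction is easiest to see in code. The sketch below emulates the semantics with plain Python lists standing in for RDD partitions, so no Spark installation is needed; `partitioned_map` and `partitioned_reduce` are illustrative names, not Spark API:

```python
from functools import reduce

# Two "partitions" of a dataset, as Spark would distribute them to workers.
partitions = [[1, 2, 3, 4], [5, 6, 7, 8]]

# A transformation maps each partition to a new partition.
# (In Spark this is lazy: nothing runs until an action is invoked.)
def partitioned_map(f, parts):
    return [[f(x) for x in part] for part in parts]

# An action does most of its work within each partition,
# then combines the per-partition results across partitions.
def partitioned_reduce(f, parts):
    per_partition = [reduce(f, part) for part in parts]  # on each worker
    return reduce(f, per_partition)                      # across partitions

squared = partitioned_map(lambda x: x * x, partitions)   # transformation
total = partitioned_reduce(lambda a, b: a + b, squared)  # action: sum of squares
```

In real Spark the equivalent would be `sc.parallelize(data, 2).map(lambda x: x * x).reduce(lambda a, b: a + b)`, with the per-partition work executed on the workers rather than locally.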
8.2.1 A Simple Spark Program
We introduce the use of Spark by using it to implement a trivial program that
computes an approximation to π via this identity:

    π/4 = Σ_{i=0}^{∞} (−1)^i / (2i + 1)

Our program, shown in figure 8.3 on the following page and in notebook 11,
uses a map operation to compute (−1)^i/(2i + 1) for each of n values of i, and a reduce operation
to sum the results of those computations. The program creates a one-dimensional
array of integers that we then convert to an RDD partitioned into two pieces. (The
array is not big, but we ran this example on a dual-core Mac Mini.)
Figure 8.3: Computing π with Spark.
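Because figure 8.3 appears only as an image, the following plain-Python sketch mirrors the structure of the notebook code, assuming the Leibniz series π/4 = Σ (−1)^i/(2i + 1); the names `terms`, `mapped`, and `pi_approx` are illustrative, and in real Spark the first line would be `rdd = sc.parallelize(range(n), 2)` to create the two-partition RDD:

```python
import math

n = 100000  # number of series terms; more terms give a better approximation

# In Spark: rdd = sc.parallelize(range(n), 2) creates an RDD in two partitions.
terms = range(n)

# Map step: compute one series term per element.
mapped = [(-1) ** i / (2 * i + 1) for i in terms]

# Reduce step: sum the terms; four times the sum approximates pi.
pi_approx = 4 * sum(mapped)
```

With Spark, the map and reduce would be RDD methods (`rdd.map(...).reduce(...)`) executed in parallel on the two partitions.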
In Spark, the partitions are distributed to the workers. Parallelism is achieved
by applying the computational parts of Spark operators on each partition in parallel,
using multiple threads per worker. For actions, such as a reduce,
most of the work is done on each partition and then across partitions as needed.
The Spark Python library exploits Python’s ability to create anonymous functions
using the lambda operator. Code is generated for these functions, and they can
then be shipped by Spark’s work scheduler to the workers for execution on each
RDD partition. In this case we use a simple MapReduce computation.
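This partition-parallel execution pattern can be imitated on a single machine with a thread pool, which is roughly what a multi-threaded Spark worker does with its local partitions. The sketch below is a local analogy, not Spark code; `work` stands in for the computational part of an operator:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Two partitions of the input, as the scheduler would hand them to workers.
partitions = [range(0, 50), range(50, 100)]

def work(part):
    # The per-partition piece of a map + reduce: sum of squares of the partition.
    return reduce(lambda a, b: a + b, (x * x for x in part))

# Run the per-partition work in parallel threads...
with ThreadPoolExecutor(max_workers=2) as pool:
    per_partition = list(pool.map(work, partitions))

# ...then combine across partitions, as an action does.
total = reduce(lambda a, b: a + b, per_partition)
```

The lambdas here play the same role as those shipped to Spark workers: small anonymous functions applied independently to each partition.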
8.2.2 A More Interesting Spark Program: K-means Clustering
We now consider the more interesting example of k-means clustering [ ]. Suppose
you have 10,000 points on a plane, and you want to find the k new points that are
the centroids of the k clusters that partition the set. In other words, each point is
to be assigned to the set corresponding to the centroid to which it is closest. Our Spark
solution is in notebook 12.
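The assignment step, finding the centroid nearest to a given point, is the heart of the algorithm. Here is a minimal plain-Python sketch; `closest` is an illustrative helper name, and the notebook's actual code may differ:

```python
def closest(point, centroids):
    """Return the index of the centroid nearest to a 2-D point.

    Uses squared Euclidean distance, which preserves the ordering of
    distances and avoids an unnecessary square root.
    """
    def dist2(c):
        return (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2
    return min(range(len(centroids)), key=lambda i: dist2(centroids[i]))

# Three example centroids; each input point is assigned to the nearest one.
centroids = [(0.0, 0.0), (5.0, 5.0), (0.0, 10.0)]
```

In the Spark version this function is the body of a map over the RDD of points, producing (cluster index, point) pairs that a reduce step averages into new centroids.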
We use an array to hold the k centroids. We initialize this array with
random values and then apply an iterative MapReduce algorithm, repeating these
two steps until the centroids have not mo