## Spark Demo 1 - A simple Euler sum computation

This is a very simple demo of Spark. If you download and install Spark from the spark-2.0.2-bin-hadoop2.7 distribution, cd to that directory and do the following:

$ export PYSPARK_DRIVER_PYTHON=jupyter
$ export PYSPARK_DRIVER_PYTHON_OPTS=notebook
$ bin/pyspark

Jupyter will come up in that directory with the Spark context "sc" already loaded. The usual way to get the Spark context yourself is

sc = pyspark.SparkContext('local[*]')

which tells Spark to use all available cores on the local machine. For other clusters you may need something else.

In :
import numpy as np
import pyspark
import time

In :
n = 10000000
ar = np.arange(n)

In :
ar

Out: array([ 0, 1, 2, ..., 9999997, 9999998, 9999999])

Next we grab the Spark context and convert ar into a Spark RDD.

In :
sc

Out: <pyspark.context.SparkContext at 0x100578710>

This is where we create our first RDD, called 'dat', from our array. In this simple case the RDD will have 2 partitions.

In :
numpartitions = 2
dat = sc.parallelize(ar, numpartitions)

For fun let's compute $\sum_{i=1}^{n}\frac{1}{i^2}$, whose limit Euler proved in 1735:

$$\lim_{n\to\infty}\sum_{i=1}^{n}\frac{1}{i^2} = \frac{\pi^2}{6}$$

Now we will apply a simple map operation to compute $\frac{1}{i^2}$ for each i (the lambda below uses 1/(i+1)**2 because ar starts at 0). Note that we do not actually execute this operation until we resolve the data back to the shell. That will come with the reduce step below.
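
Before the map step, a quick optional check (not in the original notebook) is to ask the RDD how its rows were split up. glom() gathers each partition into a single list, so mapping len over it shows the size of each of the two partitions:

```python
# Optional check: how the 10,000,000 rows were split across partitions.
print(dat.getNumPartitions())            # expect 2
print(dat.glom().map(len).collect())     # expect [5000000, 5000000]
```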

In :
sqrs = dat.map(lambda i: 1.0/(i+1)**2)
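
Because map is only a transformation, the cell above returns immediately without touching the data. A small illustrative check, not in the original notebook, is to time the map call and then force a little evaluation with the take() action:

```python
t0 = time.time()
lazy = dat.map(lambda i: 1.0/(i+1)**2)   # transformation only: builds a plan, does no work yet
print("map returned in %f s" % (time.time() - t0))

# take() is an action, so it actually computes the first few elements
print(lazy.take(3))                      # roughly [1.0, 0.25, 0.1111]
```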

In :
t0 = time.time()
x = sqrs.reduce(lambda a,b: a+b)
t1 = time.time()

In :
print("x = %f"%x)
print("time=%f"%(t1-t0))

x = 1.644934
time=4.631260
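
As an aside, not from the original notebook, the explicit reduce can be replaced by the RDD's built-in sum() action, which returns the same value:

```python
# same result as the map/reduce pair above
x_alt = dat.map(lambda i: 1.0/(i+1)**2).sum()
print("x_alt = %f" % x_alt)   # 1.644934
```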

In :
print("pi**2/6=%f"%((np.pi**2)/6))

pi**2/6=1.644934
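
The agreement to six decimal places is expected: the tail of the series after $n$ terms is bounded by an integral comparison,

$$\sum_{i=n+1}^{\infty}\frac{1}{i^2} < \int_{n}^{\infty}\frac{dx}{x^2} = \frac{1}{n},$$

so with $n = 10^7$ the truncation error is below $10^{-7}$, smaller than the six digits printed.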


We ran this on 8 cores; the timings for different numbers of partitions were:

partitions = 16: time = 6.939963
partitions = 64: time = 6.214874
partitions = 128: time = 6.213802
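
A sketch of the kind of loop that produces these timings (the exact code and machine are not given above, so treat this as illustrative only):

```python
# re-run the same map/reduce with different partition counts and time each one
for p in (2, 16, 64, 128):
    rdd = sc.parallelize(ar, p)
    t0 = time.time()
    s = rdd.map(lambda i: 1.0/(i+1)**2).reduce(lambda a, b: a + b)
    print("partitions=%d  sum=%f  time=%f" % (p, s, time.time() - t0))
```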

In :
# Sequential pure-Python version of the same sum, for comparison.
# (range(1, n) covers 1..n-1, one term fewer than the Spark version;
#  the missing term 1/n**2 = 1e-14 does not affect the printed digits.)
x = 0.0
t0 = time.time()
for i in range(1,n):
    x += 1.0/(i**2)
t1 = time.time()
print("x = %f"%x)
print("time=%f"%(t1-t0))

x = 1.644934
time=38.883733
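
For reference (an addition, not part of the original notebook), the same sum can be written as a single vectorized NumPy expression on the driver, which avoids the Python-level loop entirely:

```python
# vectorized sum of 1/i^2 for i = 1..n, reusing ar = np.arange(n)
t0 = time.time()
x_np = (1.0 / (ar + 1.0)**2).sum()
print("x = %f  time=%f" % (x_np, time.time() - t0))
```

On a single machine this is typically much faster than both the pure-Python loop and the small local Spark job, a reminder that Spark's payoff comes from data too large for one node rather than from speeding up tiny computations.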
