## Spark Demo 1 - A simple Euler sum computation

This is a very simple demo of Spark. If you download and install Spark from the spark-2.0.2-bin-hadoop2.7 distribution, cd to that directory and do the following:

$ export PYSPARK_DRIVER_PYTHON=jupyter
$ export PYSPARK_DRIVER_PYTHON_OPTS=notebook
$ bin/pyspark

Jupyter will come up in that directory with the Spark context "sc" already loaded. The usual way to get the Spark context yourself is

sc = pyspark.SparkContext('local[*]')

which tells Spark to use all available cores on the local machine. For other clusters you may need something else.

In :
import numpy as np
import pyspark
import time

In :
n = 10000000
ar = np.arange(n)

In :
ar

Out: array([ 0, 1, 2, ..., 9999997, 9999998, 9999999])

Next we grab the Spark context and convert ar into a Spark RDD.

In :
sc

Out: <pyspark.context.SparkContext at 0x100578710>

This is where we create our first RDD, called 'dat', from our array. In this simple case the RDD will have 2 partitions.

In :
numpartitions = 2
dat = sc.parallelize(ar, numpartitions)

For fun let's compute $\sum_{i=1}^{n}\frac{1}{i^2}$, whose limit Euler proved in 1735:

$$\lim_{n\to\infty}\sum_{i=1}^{n}\frac{1}{i^2} = \frac{\pi^2}{6}$$

Now we will apply a simple map operation to compute $\frac{1}{i^2}$ for each i (the lambda below uses 1/(i+1)**2 because ar starts at 0). Note that we do not actually execute this operation until we resolve the data back to the shell. That will come with the reduce step below.
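
Before the map step, a quick optional check (not in the original notebook) is to ask the RDD how its rows were split up. glom() gathers each partition into a single list, so mapping len over it shows the size of each of the two partitions:

```python
# Optional check: how the 10,000,000 rows were split across partitions.
print(dat.getNumPartitions())            # expect 2
print(dat.glom().map(len).collect())     # expect [5000000, 5000000]
```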

In :
sqrs = dat.map(lambda i: 1.0/(i+1)**2)
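
Because map is only a transformation, the cell above returns immediately without touching the data. A small illustrative check, not in the original notebook, is to time the map call and then force a little evaluation with the take() action:

```python
t0 = time.time()
lazy = dat.map(lambda i: 1.0/(i+1)**2)   # transformation only: builds a plan, does no work yet
print("map returned in %f s" % (time.time() - t0))

# take() is an action, so it actually computes the first few elements
print(lazy.take(3))                      # roughly [1.0, 0.25, 0.1111]
```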

In :
t0 = time.time()
x = sqrs.reduce(lambda a,b: a+b)
t1 = time.time()

In :
print("x = %f"%x)
print("time=%f"%(t1-t0))

x = 1.644934
time=4.631260
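
As an aside, not from the original notebook, the explicit reduce can be replaced by the RDD's built-in sum() action, which returns the same value:

```python
# same result as the map/reduce pair above
x_alt = dat.map(lambda i: 1.0/(i+1)**2).sum()
print("x_alt = %f" % x_alt)   # 1.644934
```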

In :
print("pi**2/6=%f"%((np.pi**2)/6))

pi**2/6=1.644934
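
The agreement to six decimal places is expected: the tail of the series after $n$ terms is bounded by an integral comparison,

$$\sum_{i=n+1}^{\infty}\frac{1}{i^2} < \int_{n}^{\infty}\frac{dx}{x^2} = \frac{1}{n},$$

so with $n = 10^7$ the truncation error is below $10^{-7}$, smaller than the six digits printed.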


We ran this on 8 cores; the timings for different numbers of partitions were:

partitions = 16: time = 6.939963
partitions = 64: time = 6.214874
partitions = 128: time = 6.213802
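
A sketch of the kind of loop that produces these timings (the exact code and machine are not given above, so treat this as illustrative only):

```python
# re-run the same map/reduce with different partition counts and time each one
for p in (2, 16, 64, 128):
    rdd = sc.parallelize(ar, p)
    t0 = time.time()
    s = rdd.map(lambda i: 1.0/(i+1)**2).reduce(lambda a, b: a + b)
    print("partitions=%d  sum=%f  time=%f" % (p, s, time.time() - t0))
```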

In :
# Sequential pure-Python version of the same sum, for comparison.
# (range(1, n) covers 1..n-1, one term fewer than the Spark version;
#  the missing term 1/n**2 = 1e-14 does not affect the printed digits.)
x = 0.0
t0 = time.time()
for i in range(1,n):
    x += 1.0/(i**2)
t1 = time.time()
print("x = %f"%x)
print("time=%f"%(t1-t0))

x = 1.644934
time=38.883733
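
For reference (an addition, not part of the original notebook), the same sum can be written as a single vectorized NumPy expression on the driver, which avoids the Python-level loop entirely:

```python
# vectorized sum of 1/i^2 for i = 1..n, reusing ar = np.arange(n)
t0 = time.time()
x_np = (1.0 / (ar + 1.0)**2).sum()
print("x = %f  time=%f" % (x_np, time.time() - t0))
```

On a single machine this is typically much faster than both the pure-Python loop and the small local Spark job, a reminder that Spark's payoff comes from data too large for one node rather than from speeding up tiny computations.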
