## Spark Demo 1 - A simple Euler sum computation

This is a very simple demo of Spark. If you download and install Spark from the spark-2.0.2-bin-hadoop2.7 distribution, cd to that directory and do the following:

$ export PYSPARK_DRIVER_PYTHON=jupyter
$ export PYSPARK_DRIVER_PYTHON_OPTS=notebook
$ bin/pyspark

Jupyter will come up in that directory with the Spark context "sc" already loaded. The usual way to get the Spark context yourself is

sc = pyspark.SparkContext('local[*]')

which tells Spark to use all available cores on the local machine. For other clusters you may need something else.

In :
import numpy as np
import pyspark
import time

In :
n = 10000000
ar = np.arange(n)

In :
ar

Out: array([ 0, 1, 2, ..., 9999997, 9999998, 9999999])

Next we grab the Spark context and convert ar into a Spark RDD.

In :
sc

Out: <pyspark.context.SparkContext at 0x100578710>

This is where we create our first RDD, called 'dat', from our array. In this simple case the RDD will have 2 partitions.

In :
numpartitions = 2
dat = sc.parallelize(ar, numpartitions)

For fun let's compute $\sum_{i=1}^{n}\frac{1}{i^2}$, whose limit Euler proved in 1735:

$$\lim_{n\to\infty}\sum_{i=1}^{n}\frac{1}{i^2} = \frac{\pi^2}{6}$$

Now we will apply a simple map operation to compute $\frac{1}{i^2}$ for each i (the lambda below uses 1/(i+1)**2 because ar starts at 0). Note that we do not actually execute this operation until we resolve the data back to the shell. That will come with the reduce step below.
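
Before the map step, a quick optional check (not in the original notebook) is to ask the RDD how its rows were split up. glom() gathers each partition into a single list, so mapping len over it shows the size of each of the two partitions:

```python
# Optional check: how the 10,000,000 rows were split across partitions.
print(dat.getNumPartitions())            # expect 2
print(dat.glom().map(len).collect())     # expect [5000000, 5000000]
```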

In :
sqrs = dat.map(lambda i: 1.0/(i+1)**2)
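
Because map is only a transformation, the cell above returns immediately without touching the data. A small illustrative check, not in the original notebook, is to time the map call and then force a little evaluation with the take() action:

```python
t0 = time.time()
lazy = dat.map(lambda i: 1.0/(i+1)**2)   # transformation only: builds a plan, does no work yet
print("map returned in %f s" % (time.time() - t0))

# take() is an action, so it actually computes the first few elements
print(lazy.take(3))                      # roughly [1.0, 0.25, 0.1111]
```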

In :
t0 = time.time()
x = sqrs.reduce(lambda a,b: a+b)
t1 = time.time()

In :
print("x = %f"%x)
print("time=%f"%(t1-t0))

x = 1.644934
time=4.631260
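
As an aside, not from the original notebook, the explicit reduce can be replaced by the RDD's built-in sum() action, which returns the same value:

```python
# same result as the map/reduce pair above
x_alt = dat.map(lambda i: 1.0/(i+1)**2).sum()
print("x_alt = %f" % x_alt)   # 1.644934
```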

In :
print("pi**2/6=%f"%((np.pi**2)/6))

pi**2/6=1.644934
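
The agreement to six decimal places is expected: the tail of the series after $n$ terms is bounded by an integral comparison,

$$\sum_{i=n+1}^{\infty}\frac{1}{i^2} < \int_{n}^{\infty}\frac{dx}{x^2} = \frac{1}{n},$$

so with $n = 10^7$ the truncation error is below $10^{-7}$, smaller than the six digits printed.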


We ran this on 8 cores; the timings for different numbers of partitions were:

partitions = 16: time = 6.939963
partitions = 64: time = 6.214874
partitions = 128: time = 6.213802
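
A sketch of the kind of loop that produces these timings (the exact code and machine are not given above, so treat this as illustrative only):

```python
# re-run the same map/reduce with different partition counts and time each one
for p in (2, 16, 64, 128):
    rdd = sc.parallelize(ar, p)
    t0 = time.time()
    s = rdd.map(lambda i: 1.0/(i+1)**2).reduce(lambda a, b: a + b)
    print("partitions=%d  sum=%f  time=%f" % (p, s, time.time() - t0))
```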

In :
# Sequential pure-Python version of the same sum, for comparison.
# (range(1, n) covers 1..n-1, one term fewer than the Spark version;
#  the missing term 1/n**2 = 1e-14 does not affect the printed digits.)
x = 0.0
t0 = time.time()
for i in range(1,n):
    x += 1.0/(i**2)
t1 = time.time()
print("x = %f"%x)
print("time=%f"%(t1-t0))

x = 1.644934
time=38.883733
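
For reference (an addition, not part of the original notebook), the same sum can be written as a single vectorized NumPy expression on the driver, which avoids the Python-level loop entirely:

```python
# vectorized sum of 1/i^2 for i = 1..n, reusing ar = np.arange(n)
t0 = time.time()
x_np = (1.0 / (ar + 1.0)**2).sum()
print("x = %f  time=%f" % (x_np, time.time() - t0))
```

On a single machine this is typically much faster than both the pure-Python loop and the small local Spark job, a reminder that Spark's payoff comes from data too large for one node rather than from speeding up tiny computations.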
