Spark ML using logistic Regression

The problem here is to predict the grade (Passing or Failing) of City of Chicago restaurant inspections based on the notes made by the inspector.

This uses logistic regression. Suppose you have a set of feature vectors $x_i \in R^n$ for $i$ in in $[0,m]$. Associated with each feature vector we have a binary result $y_i$. We are interested in the probability $P(y =1 | x)$ which we write as the function $p(x)$. However because $p(x)$ is between 0 and 1 it is not expressable as a linear function of x so we can't use regular linear regression, so we look at the odds expression $p(x) / (1-p(x))$ and make the guess that its log is linear. In other words

$$ ln( \frac{p(x)}{1-p(x)}) = b_0 + b \cdot x$$

where the offset $b_0$ and the vector $b = [b_1, b_2, ... b_n]$ define a hyperplane for linear regression. solving this for $p(x)$ we get

$$p(x) = \frac {1}{1+e^{-(b_0 + b \cdot x)}} $$

And we say $y=1$ if $p(x)>0$ otherwise it is zero. Unfortunately finding the best $b_0$ and $b$ is not as easy as straight linear regression, but simple Newton like iterations will converge to good solutions.

We note that the logistic function $\sigma (t)$ is defined as follows:

$$\sigma (t)= \frac {e^t}{e^{t}+1} =\frac {1}{1+e^{-t}}$$

It is used frequently in machine learning to map a real number into a probabilty range $[0,1]$ .

In [1]:
sc
Out[1]:
<pyspark.context.SparkContext at 0x10012f710>
In [2]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import Row
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import *
In [55]:
def csvParse(s):
    import csv
    from StringIO import StringIO
    sio = StringIO(s)
    value = csv.reader(sio).next()
    sio.close()
    return value

The version in this notebook uses a slightly different input file from the one in the Azure HDInsight demo.

This notebook will run on spark on your laptop.

In [225]:
#inspections = spark.sparkContext.textFile('wasb:///HdiSamples/HdiSamples/FoodInspectionData/Food_Inspections1.csv')\
#                .map(csvParse)

inspections = spark.sparkContext.textFile('/users/dennisgannon/OneDrive/Docs7/Food_Inspections1.csv')\
                .map(csvParse)
In [226]:
inspections.count()
Out[226]:
103994
In [296]:
inspections.take(10)
Out[296]:
[['1978294',
  'KFC',
  'KENTUCKY FRIED CHICKEN',
  'Pass',
  '32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED, CONSTRUCTED AND MAINTAINED - Comments: OBSERVED THE FLOUR BIN BROKEN/CRACKED AT CHICKEN FRIED PREP AREA, INSTRUCTED TO REPLACE. | 33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS - Comments: OBSERVED THE EXTERIOR OF THE FRYER, OVEN NOT CLEAN, INSTRUCTED TO CLEAN. ALSO CLEAN AND SANITIZE REAR CHICKEN PREP TABLE. | 35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTRUCTED PER CODE: GOOD REPAIR, SURFACES CLEAN AND DUST-LESS CLEANING METHODS - Comments: OBSERVED THE CEILING IN THE CHICKEN COOLER, NOT CLEAN, DUSTY. INSTRUCTED TO CLEAN. | 36. LIGHTING: REQUIRED MINIMUM FOOT-CANDLES OF LIGHT PROVIDED, FIXTURES SHIELDED - Comments: OBSERVED BROKEN LIGHT SHIELD IN LADIES RESTROOM, INSTRUCTED TO REPLACE. | 21. * CERTIFIED FOOD MANAGER ON SITE WHEN POTENTIALLY HAZARDOUS FOODS ARE  PREPARED AND SERVED - Comments: VIOLATION CORRECTION'],
 ['1978279', 'M & K FOOD MARKET', 'SOLO FOODS', 'Out of Business', ''],
 ['1978275',
  'SHARKS FISH & CHICKEN',
  'SHARKS FISH & CHICKEN',
  'Pass',
  '34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED - Comments: OBSERVED THE BASEBOARDS INSIDE OF THE WALK IN COOLER DETACHED FROM THE WALL AND FLOOR. INSTRUCTED TO REPAIR AND SEAL BACK TO THE WALL AND FLOOR. | 37. TOILET ROOM DOORS SELF CLOSING: DRESSING ROOMS WITH LOCKERS PROVIDED: COMPLETE SEPARATION FROM LIVING/SLEEPING QUARTERS - Comments: OBSERVED THE RESTROOM DOOR NOT SELF CLOSING. PROVIDE SELF CLOSURE DEVICE. | 38. VENTILATION: ROOMS AND EQUIPMENT VENTED AS REQUIRED: PLUMBING: INSTALLED AND MAINTAINED - Comments: OBSERVED THE 3 COMPARTMENT SINK RIGHT FAUCET BASE LEAKING. INSTRUCTED TO REPAIR AND MAINTAIN ALL PLUMBING. | 35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTRUCTED PER CODE: GOOD REPAIR, SURFACES CLEAN AND DUST-LESS CLEANING METHODS - Comments: OBSERVED THE FRP WALL IN THE REAR AROUND THE OFFICE WITH GAPS AND RAW WOOD ABOVE THE FRONT COUNTER. INSTRUCTED TO REPAIR WALLS.'],
 ['1978268',
  'CARNITAS Y SUPERMERCADO EL GIGANTE INC',
  'CARNITAS Y SUPERMERCADO EL GIGANTE INC',
  'Pass',
  '33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS - Comments: OBSERVED THE INTERIOR OF THE MICROWAVE NOT CLEAN, WALK-IN COOLER, INSTRUCTED TO CLEAN. | 35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTRUCTED PER CODE: GOOD REPAIR, SURFACES CLEAN AND DUST-LESS CLEANING METHODS - Comments: OBSERVED WATER DAMAGED CEILING TILES, INSTRUCTED TO REPLACE. ALSO CLEAN CEILING IN WALK-IN COOLER. | 41. PREMISES MAINTAINED FREE OF LITTER, UNNECESSARY ARTICLES, CLEANING  EQUIPMENT PROPERLY STORED - Comments: OBSERVED SOME CLUTTER IN REAR, INSTRUCTED TO ORGANIZE AND REMOVE ITEMS NOT BEING USED.'],
 ['1978261', 'WINGS OVER CANAL', 'WINGSTOP', 'Pass', ''],
 ['1978248',
  'REACH HIGHER LEARNING CENTER 2',
  'REACH HIGHER LEARNING CENTER 2',
  'Fail',
  '2. FACILITIES TO MAINTAIN PROPER TEMPERATURE - Comments: A COMMERCIAL SIZE REFRIGERATOR NOW PROVIDED ON PREMISES FOR FOOD STORAGE. | 16. FOOD PROTECTED DURING STORAGE, PREPARATION, DISPLAY, SERVICE AND TRANSPORTATION - Comments: CORRECTED.  PREP AREA HAS BEEN ENCLOSED WITH A BARRIER AS PREVIOUSLY INSTRUCTED. | 19. OUTSIDE GARBAGE WASTE GREASE AND STORAGE AREA; CLEAN, RODENT PROOF, ALL CONTAINERS COVERED - Comments: CORRECTED.  HAS A SIGNED CONTRACT WITH WASTE MANAGEMENT. | 21. * CERTIFIED FOOD MANAGER ON SITE WHEN POTENTIALLY HAZARDOUS FOODS ARE  PREPARED AND SERVED - Comments: CORRECTED.  CITY OF CHICAGO CERTIFICATE PROVIDED. | 34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED - Comments: CORRECTED. HOLE IN FLOOR SEALED. | 35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTRUCTED PER CODE: GOOD REPAIR, SURFACES CLEAN AND DUST-LESS CLEANING METHODS - Comments: CORRECTED.  HOLE AROUND PIPE SEALED. | 38. VENTILATION: ROOMS AND EQUIPMENT VENTED AS REQUIRED: PLUMBING: INSTALLED AND MAINTAINED - Comments: NO BACKFLOW PREVENTOR DEVICE FOUND ON THE UTILITY SINK IN THE BASEMENT; MUST INSTALL A BACKFLOW PREVENTOR DEVICE ON UTILITY SINK. | 24. DISH WASHING FACILITIES: PROPERLY DESIGNED, CONSTRUCTED, MAINTAINED, INSTALLED, LOCATED AND OPERATED - Comments: NO THREE COMP SINK ON PREMISES FOR WASH RINSE AND SANITIZING OF UTENSILS OR DISHES AS REQUIRED. MUST INSTALL WITH HOT AND COLD RUNNING WATER AND A GREASE TRAP CONNECTED TO THE SINK. SERIOUS VIOLATION 7-38-030. | 12. HAND WASHING FACILITIES: WITH SOAP AND SANITARY HAND DRYING DEVICES, CONVENIENT AND ACCESSIBLE TO FOOD PREP AREA - Comments: NO EXPOSED HANDWASHING SINK WITH HOT AND COLD RUNNING WATER SOAP AND HANDTOWELS FOR WASHING HANDS AS REQUIRED; MUST INSTALL.'],
 ['1978242',
  'SYS HAPPY LAND CHILD DAYCARE CENTER, INC.',
  'SYS HAPPY LAND CHILD DAYCARE CENTER, INC.',
  'Out of Business',
  ''],
 ['1978214',
  "NANDO'S PERI PERI",
  "NANDO'S PERI PERI",
  'Pass',
  '40. REFRIGERATION AND METAL STEM THERMOMETERS PROVIDED AND CONSPICUOUS - Comments: MUST PROVIDE REFRIGERATION THERMOMETERS WHERE MISSING IN REFRIGERATION UNITS. | 41. PREMISES MAINTAINED FREE OF LITTER, UNNECESSARY ARTICLES, CLEANING  EQUIPMENT PROPERLY STORED - Comments: PROVIDE WASTE RECEPTACLES IN WASHROOM AD WHERE NECESSARY.'],
 ['1977206',
  'THE GUNDIS KURDISH KITCHEN',
  'THE GUNDIS KURDISH KITCHEN',
  'Fail',
  '9. WATER SOURCE: SAFE, HOT & COLD UNDER CITY PRESSURE - Comments:  | 11. ADEQUATE NUMBER, CONVENIENT, ACCESSIBLE, DESIGNED, AND MAINTAINED - Comments:  | 12. HAND WASHING FACILITIES: WITH SOAP AND SANITARY HAND DRYING DEVICES, CONVENIENT AND ACCESSIBLE TO FOOD PREP AREA - Comments:  | 18. NO EVIDENCE OF RODENT OR INSECT OUTER OPENINGS PROTECTED/RODENT PROOFED, A WRITTEN LOG SHALL BE MAINTAINED AVAILABLE TO THE INSPECTORS - Comments:  | 35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTRUCTED PER CODE: GOOD REPAIR, SURFACES CLEAN AND DUST-LESS CLEANING METHODS - Comments:  | 36. LIGHTING: REQUIRED MINIMUM FOOT-CANDLES OF LIGHT PROVIDED, FIXTURES SHIELDED - Comments:  | 40. REFRIGERATION AND METAL STEM THERMOMETERS PROVIDED AND CONSPICUOUS - Comments:  | 41. PREMISES MAINTAINED FREE OF LITTER, UNNECESSARY ARTICLES, CLEANING  EQUIPMENT PROPERLY STORED - Comments:  | 70. NO SMOKING REGULATIONS - Comments: '],
 ['1977202', 'CHIU QUON INC', 'CHIU QUON', 'Out of Business', '']]
In [231]:
schema = StructType([StructField("id", IntegerType(), False), 
                     StructField("name", StringType(), False), 
                     StructField("results", StringType(), False), 
                     StructField("violations", StringType(), True)])

df = spark.createDataFrame(inspections.map(lambda l: (int(l[0]), l[2], l[3], l[4])) , schema)
df.registerTempTable('CountResults')
In [275]:
df.show(5)
+-------+--------------------+---------------+--------------------+
|     id|                name|        results|          violations|
+-------+--------------------+---------------+--------------------+
|1978294|KENTUCKY FRIED CH...|           Pass|32. FOOD AND NON-...|
|1978279|          SOLO FOODS|Out of Business|                    |
|1978275|SHARKS FISH & CHI...|           Pass|34. FLOORS: CONST...|
|1978268|CARNITAS Y SUPERM...|           Pass|33. FOOD AND NON-...|
|1978261|            WINGSTOP|           Pass|                    |
+-------+--------------------+---------------+--------------------+
only showing top 5 rows

In [280]:
print("passing = %d"%df[df.results == 'Pass'].count())
print("failing = %d"%df[df.results == 'Fail'].count())
passing = 61204
failing = 20225
In [281]:
df.count()
Out[281]:
103994
In [ ]:
 
In [ ]:
 
In [233]:
df.select('results').distinct().show()
+--------------------+
|             results|
+--------------------+
|           Not Ready|
|                Fail|
|            No Entry|
|Business Not Located|
|  Pass w/ Conditions|
|     Out of Business|
|                Pass|
+--------------------+

In [234]:
#%%sql -o count_results_df
count_results_df = spark.sql("SELECT results, COUNT(results) AS cnt FROM \
       CountResults GROUP BY results").toPandas()
In [235]:
count_results_df
Out[235]:
results cnt
0 Not Ready 429
1 Fail 20225
2 No Entry 2678
3 Business Not Located 44
4 Pass w/ Conditions 9725
5 Out of Business 9689
6 Pass 61204
In [236]:
%matplotlib inline
import matplotlib.pyplot as plt


labels = count_results_df['results']
sizes = count_results_df['cnt']
colors = ['turquoise', 'seagreen', 'mediumslateblue', 'palegreen', 'coral']
plt.pie(sizes, labels=labels, autopct='%1.1f%%', colors=colors)
plt.axis('equal')
Out[236]:
(-1.0101138270539849, 1.0, -1.0114322486525658, 1.0040925495551682)
In [237]:
def labelForResults(s):
    if s == 'Fail':
        return 0.0
    elif s == 'Pass w/ Conditions' or s == 'Pass':
        return 1.0
    else:
        return -1.0
label = UserDefinedFunction(labelForResults, DoubleType())
labeledData = df.select(label(df.results).alias('label'), df.violations).where('label >= 0')
In [238]:
labeledData.take(1)
Out[238]:
[Row(label=1.0, violations=u'32. FOOD AND NON-FOOD CONTACT SURFACES PROPERLY DESIGNED, CONSTRUCTED AND MAINTAINED - Comments: OBSERVED THE FLOUR BIN BROKEN/CRACKED AT CHICKEN FRIED PREP AREA, INSTRUCTED TO REPLACE. | 33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS - Comments: OBSERVED THE EXTERIOR OF THE FRYER, OVEN NOT CLEAN, INSTRUCTED TO CLEAN. ALSO CLEAN AND SANITIZE REAR CHICKEN PREP TABLE. | 35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTRUCTED PER CODE: GOOD REPAIR, SURFACES CLEAN AND DUST-LESS CLEANING METHODS - Comments: OBSERVED THE CEILING IN THE CHICKEN COOLER, NOT CLEAN, DUSTY. INSTRUCTED TO CLEAN. | 36. LIGHTING: REQUIRED MINIMUM FOOT-CANDLES OF LIGHT PROVIDED, FIXTURES SHIELDED - Comments: OBSERVED BROKEN LIGHT SHIELD IN LADIES RESTROOM, INSTRUCTED TO REPLACE. | 21. * CERTIFIED FOOD MANAGER ON SITE WHEN POTENTIALLY HAZARDOUS FOODS ARE  PREPARED AND SERVED - Comments: VIOLATION CORRECTION')]
In [ ]:
 
In [284]:
tokenizer = Tokenizer(inputCol="violations", outputCol="words")
#hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

model = pipeline.fit(labeledData)
model
Out[284]:
PipelineModel_4b838c0e8fdc2048a099
In [285]:
hashingTF.getNumFeatures()
Out[285]:
262144
In [ ]:
 
In [295]:
testData = spark.sparkContext.textFile('/users/dennisgannon/OneDrive/Docs7/Food_Inspections2.csv')\
             .map(csvParse) \
             .map(lambda l: (int(l[0]), l[2], l[3], l[4]))
testDf = spark.createDataFrame(testData, schema).where("results = 'Fail' OR results = 'Pass' OR results = 'Pass w/ Conditions'")
predictionsDf = model.transform(testDf)
predictionsDf.registerTempTable('Predictions')
predictionsDf.columns
Out[295]:
['id',
 'name',
 'results',
 'violations',
 'words',
 'features',
 'rawPrediction',
 'probability',
 'prediction']
In [292]:
predictionsDf.take(1)
Out[292]:
[Row(id=1978295, name=u'GREEN LINE FOOD MART, INC', results=u'Fail', violations=u'9. WATER SOURCE: SAFE, HOT & COLD UNDER CITY PRESSURE - Comments: OBSERVED NO RUNNING WATER AT ALL SINKS IN THE FACILITY. INSTRUCTED TO PROVIDE HOT AND COLD RUNNING WATER TO ALL SINKS (WASHBOWL, EXPOSED HAND SINK). CRITICAL VIOLATION 7-38-030 | 11. ADEQUATE NUMBER, CONVENIENT, ACCESSIBLE, DESIGNED, AND MAINTAINED - Comments: OBSERVED THE EMPLOYEE TOILET UNABLE TO FLUSH. INSTRUCTED TO REPAIR AND MAINTAIN THE TOILET. CRITICAL VIOLATION 7-38-030 | 27. TOILET ROOMS ENCLOSED CLEAN, PROVIDED WITH HAND CLEANSER, SANITARY HAND DRYING DEVICES AND PROPER WASTE RECEPTACLES - Comments: VIOLATION CORRECTED | 18. NO EVIDENCE OF RODENT OR INSECT OUTER OPENINGS PROTECTED/RODENT PROOFED, A WRITTEN LOG SHALL BE MAINTAINED AVAILABLE TO THE INSPECTORS - Comments: VIOLATION CORRECTED | 21. * CERTIFIED FOOD MANAGER ON SITE WHEN POTENTIALLY HAZARDOUS FOODS ARE  PREPARED AND SERVED - Comments: VIOLATION CORRECTED | 34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED - Comments: VIOLATION CORRECTED', words=[u'9.', u'water', u'source:', u'safe,', u'hot', u'&', u'cold', u'under', u'city', u'pressure', u'-', u'comments:', u'observed', u'no', u'running', u'water', u'at', u'all', u'sinks', u'in', u'the', u'facility.', u'instructed', u'to', u'provide', u'hot', u'and', u'cold', u'running', u'water', u'to', u'all', u'sinks', u'(washbowl,', u'exposed', u'hand', u'sink).', u'critical', u'violation', u'7-38-030', u'|', u'11.', u'adequate', u'number,', u'convenient,', u'accessible,', u'designed,', u'and', u'maintained', u'-', u'comments:', u'observed', u'the', u'employee', u'toilet', u'unable', u'to', u'flush.', u'instructed', u'to', u'repair', u'and', u'maintain', u'the', u'toilet.', u'critical', u'violation', u'7-38-030', u'|', u'27.', u'toilet', u'rooms', u'enclosed', u'clean,', u'provided', u'with', u'hand', u'cleanser,', u'sanitary', u'hand', u'drying', u'devices', u'and', u'proper', u'waste', u'receptacles', u'-', u'comments:', u'violation', u'corrected', u'|', u'18.', u'no', u'evidence', u'of', u'rodent', u'or', u'insect', u'outer', u'openings', u'protected/rodent', u'proofed,', u'a', u'written', u'log', u'shall', u'be', u'maintained', u'available', u'to', u'the', u'inspectors', u'-', u'comments:', u'violation', u'corrected', u'|', u'21.', u'*', u'certified', u'food', u'manager', u'on', u'site', u'when', u'potentially', u'hazardous', u'foods', u'are', u'', u'prepared', u'and', u'served', u'-', u'comments:', u'violation', u'corrected', u'|', u'34.', u'floors:', u'constructed', u'per', u'code,', u'cleaned,', u'good', u'repair,', u'coving', u'installed,', u'dust-less', u'cleaning', u'methods', u'used', u'-', u'comments:', u'violation', u'corrected'], features=SparseVector(262144, {4653: 1.0, 5220: 1.0, 8500: 1.0, 8630: 1.0, 9639: 1.0, 11251: 6.0, 14114: 1.0, 14311: 6.0, 18805: 1.0, 20326: 1.0, 22500: 1.0, 24016: 2.0, 24980: 1.0, 26379: 1.0, 29403: 1.0, 30816: 1.0, 33151: 1.0, 35329: 1.0, 42337: 1.0, 42343: 2.0, 45433: 2.0, 45531: 6.0, 50223: 1.0, 58524: 1.0, 59760: 1.0, 63091: 2.0, 73700: 1.0, 79763: 1.0, 80107: 2.0, 80458: 1.0, 81631: 1.0, 84053: 1.0, 84144: 1.0, 84504: 1.0, 86643: 2.0, 87885: 1.0, 90859: 1.0, 91677: 5.0, 92175: 1.0, 94851: 1.0, 97215: 1.0, 100258: 1.0, 103382: 5.0, 103409: 1.0, 103838: 4.0, 104448: 1.0, 111192: 1.0, 113159: 1.0, 113432: 1.0, 113458: 1.0, 114381: 2.0, 118121: 1.0, 121133: 1.0, 123581: 1.0, 126466: 1.0, 130136: 1.0, 135513: 1.0, 135560: 2.0, 142631: 1.0, 144122: 1.0, 145542: 2.0, 151196: 1.0, 154554: 1.0, 154701: 1.0, 156250: 2.0, 161061: 1.0, 163140: 1.0, 163353: 1.0, 163504: 1.0, 167122: 1.0, 167152: 1.0, 168139: 1.0, 168495: 1.0, 169654: 1.0, 176964: 1.0, 178003: 1.0, 179666: 1.0, 183858: 1.0, 185659: 1.0, 187408: 1.0, 187621: 3.0, 189193: 1.0, 193224: 1.0, 195766: 1.0, 203500: 1.0, 205044: 5.0, 207872: 1.0, 217836: 2.0, 218380: 1.0, 222453: 1.0, 223946: 1.0, 227410: 1.0, 227642: 2.0, 229543: 1.0, 232367: 3.0, 238163: 1.0, 244185: 1.0, 244298: 1.0, 247107: 1.0, 248091: 1.0, 248824: 1.0, 249180: 1.0, 252378: 1.0, 254003: 1.0, 256918: 1.0, 259400: 1.0, 259928: 4.0}), rawPrediction=DenseVector([3.1427, -3.1427]), probability=DenseVector([0.9586, 0.0414]), prediction=0.0)]
In [293]:
numSuccesses = predictionsDf.where("""(prediction = 0 AND results = 'Fail') OR 
                                      (prediction = 1 AND (results = 'Pass' OR 
                                                           results = 'Pass w/ Conditions'))""").count()
numInspections = predictionsDf.count()

print "There were", numInspections, "inspections and there were", numSuccesses, "successful predictions"
print "This is a", str((float(numSuccesses) / float(numInspections)) * 100) + "%", "success rate"
There were 30694 inspections and there were 27774 successful predictions
This is a 90.4867400795% success rate
In [243]:
from pyspark.sql.types import *
from IPython.core.magic import register_line_cell_magic
In [244]:
# Configuration parameters
max_show_lines = 50         # Limit on the number of lines to show with %sql_show and %sql_display
detailed_explain = True    
In [245]:
@register_line_cell_magic
def sql(line, cell=None):
    "Return a Spark DataFrame for lazy evaluation of the sql. Use: %sql or %%sql"
    val = cell if cell is not None else line 
    print val
    return sqlContext.sql(val)

@register_line_cell_magic
def sql_show(line, cell=None):
    "Execute sql and show the first max_show_lines lines. Use: %sql_show or %%sql_show"
    val = cell if cell is not None else line 
    return sqlContext.sql(val).show(max_show_lines) 

@register_line_cell_magic
def sql_display(line, cell=None):
    """Execute sql and convert results to Pandas DataFrame for pretty display or further processing.
    Use: %sql_display or %%sql_display"""
    val = cell if cell is not None else line 
    return sqlContext.sql(val).limit(max_show_lines).toPandas() 
In [246]:
#%%sql -q -o true_positive
#SELECT count(*) AS cnt FROM Predictions WHERE prediction = 0 AND results = 'Fail'
true_negative = spark.sql("SELECT count(*) AS cnt FROM Predictions WHERE \
        (prediction = 0 AND results = 'Fail')").toPandas()
In [247]:
#%%sql -q -o false_positive
false_negative = spark.sql("SELECT count(*) AS cnt FROM Predictions \
WHERE prediction = 0 AND (results = 'Pass' OR results = 'Pass w/ Conditions')").toPandas()
In [248]:
#%%sql -q -o true_negative
false_positive = spark.sql("SELECT count(*) AS cnt FROM Predictions WHERE \
       prediction = 1 AND results = 'Fail' ").toPandas()
In [249]:
#%%sql -q -o false_negative
true_positive = spark.sql("SELECT count(*) AS cnt FROM Predictions WHERE \
      prediction = 1 AND (results = 'Pass' OR results = 'Pass w/ Conditions')").toPandas()
In [250]:
false_negative['cnt']
Out[250]:
0    677
Name: cnt, dtype: int64
In [251]:
%matplotlib inline
import matplotlib.pyplot as plt

labels = ['True positive', 'False positive', 'True negative', 'False negative']
sizes = [true_positive['cnt'], false_positive['cnt'], false_negative['cnt'], true_negative['cnt']]
colors = ['turquoise', 'seagreen', 'mediumslateblue', 'palegreen', 'coral']
plt.pie(sizes, labels=labels, autopct='%10.1f%%', colors=colors)
plt.axis('equal')
Out[251]:
(-1.0045326948165894, 1.0, -1.0032166242599487, 1.0048178434371948)

Precision and recall are then defined as:

$$Precision=\frac {tp}{tp+fp}$$$$ Recall = \frac {tp}{tp+fn} $$

Precision is the probability that a (randomly selected) positive prediction is correct.

Recall is the probability that a (randomly selected) resturant with a passing grade is predicted to be passing.

In [252]:
print('so precision = %f'% \
      (float(true_positive['cnt'])/(float(true_positive['cnt'])+float(false_positive['cnt']))))
so precision = 0.911911
In [253]:
print('and recall = %f'% \
      (float(true_positive['cnt'])/(float(true_positive['cnt'])+float(false_negative['cnt']))))
and recall = 0.971670

If we do this another way, we can ask how accurate are we in finding the failing resturants? This is a bit harder because there are far fewer of them. In this case we are interested in true-negatives, so

Precision is the probability that a (randomly selected) negative prediction is correct.

Recall is the probability that a (randomly selected) resturant with a failing grade is predicted to be failing.

$$Precision=\frac {tn}{tn+fn}$$$$ Recall = \frac {tn}{tn+fp} $$
In [254]:
print('so the precision of failure prediction = %f'% \
      (float(true_negative['cnt'])/(float(true_negative['cnt'])+float(false_negative['cnt']))))
so the precision of failure prediction = 0.870579
In [255]:
print('and recall is = %f'% \
      (float(true_negative['cnt'])/(float(true_negative['cnt'])+float(false_positive['cnt']))))
and recall is = 0.670001
In [ ]: