sciml_data_arxiv.p is the set of original arXiv RSS items; a subset of these was used for training. The URL is https://1drv.ms/u/s!AkRG9Zk_IOUagrFxuO3yxvluvwQ1uA This is a OneDrive link, so you will need to open it in your browser, go to the "Download" tab at the top, and save the file somewhere convenient, perhaps in "/tmp".
sciml_data_arxiv_new_9_28_15.p contains "newly" arrived arXiv items (not in the set above). It is at https://1drv.ms/u/s!AkRG9Zk_IOUagrFvwrNNB8Jgc1ARcA You can also use this one; it does not contain anything that was used for training.
import pickle
import json
import boto3
The following function loads the pickled data file and extracts the titles, sitenames (arXiv designations) and abstracts.
def load_docs(path, name):
    # Load the pickled list and split it into parallel lists of
    # titles, sitenames (arXiv designations) and abstracts.
    filename = path + name + ".p"
    with open(filename, "rb") as fileobj:
        lst = pickle.load(fileobj)
    titles = [item[0] for item in lst]
    sitenames = [item[1] for item in lst]
    abstracts = [item[2] for item in lst]
    print("done loading " + filename)
    return abstracts, sitenames, titles
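If you want to sanity-check the expected file layout without the full download, the pickle is assumed (based on the indexing in the loader above) to be a list of (title, sitename, abstract) triples. Here is a minimal round-trip sketch using made-up records; the sample titles and filename are hypothetical:

```python
import os
import pickle
import tempfile

# Hypothetical sample records in the assumed (title, sitename, abstract) layout.
sample = [
    ("A Toy Paper", "arXiv:1234.5678", "We study a toy problem."),
    ("Another Toy Paper", "arXiv:2345.6789", "A second abstract."),
]

path = os.path.join(tempfile.mkdtemp(), "sciml_sample.p")
with open(path, "wb") as f:
    pickle.dump(sample, f)

# Read it back the same way load_docs does and unpack the columns.
with open(path, "rb") as f:
    lst = pickle.load(f)
titles = [t for t, _, _ in lst]
sitenames = [s for _, s, _ in lst]
abstracts = [a for _, _, a in lst]
```

A real data file should unpack the same way, just with many more rows.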
This call loads three arrays: abstracts (the paper abstracts), sites (the arXiv designations), and titles.
abstracts, sites, titles = load_docs("/tmp/", "sciml_data_arxiv")
Let's take a peek at one of them, say number 1301.
print(abstracts[1301])
print(sites[1301])
print(titles[1301])
sqs = boto3.resource('sqs', region_name='us-west-2')
The queue "bookque" was previously created from the AWS console.
queuename = 'bookque'
queue = sqs.get_queue_by_name(QueueName=queuename)
print(queue.url)
print(queue.attributes.get('DelaySeconds'))
import time
This loop grabs 100 papers (numbers 1350 through 1449) and pushes them to the queue. These 100 were chosen as a random sample; there is nothing special about them.
t0 = time.time()
for i in range(1350, 1450):
    abstract = abstracts[i]
    source = sites[i]
    title = titles[i]
    # print(abstract)
    print(title)
    queue.send_message(
        MessageBody='boto3',
        MessageAttributes={
            'Title':    {'StringValue': title,    'DataType': 'String'},
            'Source':   {'StringValue': source,   'DataType': 'String'},
            'Abstract': {'StringValue': abstract, 'DataType': 'String'}
        })
t1 = time.time()
print('elapsed time = ' + str(t1 - t0))
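Sending one message per call incurs one API round trip per paper. SQS also supports batches of up to 10 messages per call via `Queue.send_messages`, which should cut the fill time. A sketch of a pure helper that builds batch entries mirroring the attribute layout used above; `make_entries` and `chunks` are hypothetical helpers introduced here, not part of boto3:

```python
def make_entries(papers, start_id=0):
    """Build SQS batch entries (max 10 per send_messages call) from
    (title, source, abstract) triples, mirroring the loop above."""
    entries = []
    for offset, (title, source, abstract) in enumerate(papers):
        entries.append({
            'Id': str(start_id + offset),  # must be unique within the batch
            'MessageBody': 'boto3',
            'MessageAttributes': {
                'Title':    {'StringValue': title,    'DataType': 'String'},
                'Source':   {'StringValue': source,   'DataType': 'String'},
                'Abstract': {'StringValue': abstract, 'DataType': 'String'},
            },
        })
    return entries

def chunks(seq, size=10):
    # Yield successive batches of at most `size` items.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Usage sketch (assumes `queue`, `titles`, `sites`, `abstracts` as above):
# papers = list(zip(titles, sites, abstracts))[1350:1450]
# for batch in chunks(papers):
#     queue.send_messages(Entries=make_entries(batch))
```

The 10-message limit is an SQS batch constraint, so the chunking is required; each `Id` only needs to be unique within its own batch.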
To see the output that appeared in the DynamoDB table, look at BookTable-dynammodb-output.csv
Note: it took 3 seconds to fill the queue, and if we look at the CSV file and sort by timestamp we see it took 26 seconds to do all the processing using 8 predictors.