This set of functions can be used to load the arXiv data file and send it to an AWS SQS queue.

sciml_data_arxiv.p is the set of original arXiv RSS items; a subset of these was used for training. It is available at https://1drv.ms/u/s!AkRG9Zk_IOUagrFxuO3yxvluvwQ1uA (a OneDrive link): open it in your browser, go to the "download" tab at the top, and download the file, perhaps to "/tmp".

sciml_data_arxiv_new_9_28_15.p holds "newly" arrived arXiv items (not in the file above). It is at https://1drv.ms/u/s!AkRG9Zk_IOUagrFvwrNNB8Jgc1ARcA and you can use it instead. It does not contain anything that was used for training.

In [1]:
import pickle
import json
import boto3

The following function loads the pickled data file and extracts the titles, sitenames (arXiv designations), and abstracts.

In [2]:
def load_docs(path, name):
    # path must end with a separator, e.g. "/tmp/"; ".p" is appended to name
    filename = path + name + ".p"
    # The with-block closes the file once the pickle has been read
    with open(filename, "rb") as fileobj:
        lst = pickle.load(fileobj)
    # Each item is indexed as [0] title, [1] sitename, [2] abstract
    titles = [item[0] for item in lst]
    sitenames = [item[1] for item in lst]
    abstracts = [item[2] for item in lst]

    print("done loading " + filename)
    return abstracts, sitenames, titles

This call loads three lists: abstracts (the paper abstracts), sites (the arXiv designations), and titles (the paper titles).

In [3]:
abstracts, sites, titles = load_docs("/tmp/", "sciml_data_arxiv")
done loading /tmp/sciml_data_arxiv.p
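
To try the loader without downloading the OneDrive file, a small stand-in pickle can be written first. This is only a sketch: the two records, the helper names `write_toy_pickle` and `split_columns`, and the file name `toy_arxiv` are made up for illustration, and the (title, sitename, abstract) layout is inferred from the indices used in `load_docs`.

```python
import os
import pickle
import tempfile

def write_toy_pickle(directory, name):
    """Write a tiny pickle in the layout load_docs appears to expect."""
    # Hypothetical records: [title, sitename, abstract]
    records = [
        ["Toy title A [q-bio.NC]", "q-bio.NC", "Toy abstract A."],
        ["Toy title B [astro-ph.GA]", "astro-ph.GA", "Toy abstract B."],
    ]
    filename = os.path.join(directory, name + ".p")
    with open(filename, "wb") as fileobj:
        pickle.dump(records, fileobj)
    return filename

def split_columns(records):
    """Split records into (abstracts, sitenames, titles), the order load_docs returns."""
    titles = [r[0] for r in records]
    sitenames = [r[1] for r in records]
    abstracts = [r[2] for r in records]
    return abstracts, sitenames, titles

# Round trip: write the toy file, read it back, split it into columns
toy_dir = tempfile.mkdtemp()
toy_file = write_toy_pickle(toy_dir, "toy_arxiv")
with open(toy_file, "rb") as fileobj:
    toy_abstracts, toy_sites, toy_titles = split_columns(pickle.load(fileobj))
```

Because `load_docs` joins `path` and `name` by plain concatenation, calling it on this file would need a trailing separator, for example `load_docs(toy_dir + "/", "toy_arxiv")`.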

Let's take a peek at one of them. How about number 1301?

In [4]:
print(abstracts[1301])
print(sites[1301])
print(titles[1301])
Cancer cells utilize large amounts of ATP to sustain growth, relying primarily on non-oxidative, fermentative pathways for its production. In many types of cancers this leads, even in the presence of oxygen, to the secretion of carbon equivalents (usually in the form of lactate) in the cell's surroundings, a feature known as the Warburg effect. While the molecular basis of this phenomenon are still to be elucidated, it is clear that the spilling of energy resources contributes to creating a peculiar microenvironment for tumors, possibly characterized by a degree of toxicity. This suggests that mechanisms for recycling the fermentation products (e.g. a lactate shuttle) may be active, effectively inducing a mutually beneficial metabolic coupling between aberrant and non-aberrant cells. Here we analyze this scenario through a large-scale in silico metabolic model of interacting human cells. By going beyond the cell-autonomous description, we show that elementary physico-chemical constraints indeed favor the establishment of such a coupling under very broad conditions. The characterization we obtained by tuning the aberrant cell's demand for ATP, amino-acids and fatty acids and/or the imbalance in nutrient partitioning provides quantitative support to the idea that synergistic multi-cell effects play a central role in cancer sustainment.
q-bio.MN
Quantitative constraint-based computational model of tumor-to-stroma   coupling via lactate shuttle [q-bio.MN]
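
In the item printed above, the title ends with the arXiv designation in square brackets and contains a run of extra spaces. A minimal sketch for separating the two, assuming every title follows that pattern (the helper name `split_title` is hypothetical and not part of the original notebook):

```python
import re

def split_title(raw_title):
    """Return (clean_title, designation) from a title like 'Some   title [q-bio.MN]'."""
    # Collapse runs of whitespace into single spaces
    collapsed = " ".join(raw_title.split())
    # Greedy match so that only the last bracketed group is treated as the designation
    match = re.match(r"^(.*)\s\[([^\[\]]+)\]$", collapsed)
    if match is None:
        return collapsed, None  # no trailing [designation] found
    return match.group(1), match.group(2)
```

For example, `split_title(titles[1301])` would give the cleaned title and `q-bio.MN`, which matches `sites[1301]` for this item.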
In [5]:
# Connect to SQS in us-west-2; credentials come from boto3's default credential chain
sqs = boto3.resource('sqs', region_name='us-west-2')

Now let's grab the queue.

The queue "bookque" was previously created from the AWS portal.

In [6]:
queuename = 'bookque'
queue = sqs.get_queue_by_name(QueueName=queuename)
print(queue.url)
print(queue.attributes.get('DelaySeconds'))
https://us-west-2.queue.amazonaws.com/066301190734/bookque
5
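
The queue here was created from the portal, but it could presumably also be created from code. A minimal sketch, assuming the caller passes in a boto3 SQS resource like `sqs` above; the helper name `create_book_queue` is hypothetical, and the 5-second delay is taken from the `DelaySeconds` value printed above (any other settings of the original queue are unknown):

```python
def create_book_queue(sqs, name="bookque", delay_seconds=5):
    """Create an SQS queue through a boto3 SQS resource supplied by the caller."""
    # SQS attribute values are passed as strings, even when they are numbers
    return sqs.create_queue(
        QueueName=name,
        Attributes={"DelaySeconds": str(delay_seconds)},
    )

# Usage (needs AWS credentials, so it is left as a comment):
#   queue = create_book_queue(sqs)
#   print(queue.url)
```

As far as the SQS documentation describes, creating a queue whose name and attributes match an existing queue returns the existing one, so this should be safe to re-run with unchanged settings.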
In [7]:
import time

Send data to the queue.

This loop grabs 100 papers (1350 through 1449) and pushes them to the queue. These 100 are an arbitrary sample; there is nothing special about them.

In [14]:
t0 = time.time()
for i in range(1350, 1450):
    abstract = abstracts[i]
    source = sites[i]
    title = titles[i]
    print(title)
    # The paper's fields travel as message attributes; the body is a fixed placeholder
    queue.send_message(MessageBody='boto3', MessageAttributes={
        'Title': {'StringValue': title, 'DataType': 'String'},
        'Source': {'StringValue': source, 'DataType': 'String'},
        'Abstract': {'StringValue': abstract, 'DataType': 'String'},
    })
t1 = time.time()
Coping with Space Neophobia in Drosophila melanogaster: The Asymmetric   Dynamics of Crossing a Doorway to the Untrodden [q-bio.NC]
New Scaling Relation for Information Transfer in Biological Networks [q-bio.MN]
The Composition Of A Disrupted Extrasolar Planetesimal At SDSS   J0845+2257 (Ton 345) [astro-ph.EP]
Tip cell overtaking occurs as a side effect of sprouting in   computational models of angiogenesis [q-bio.CB]
Modeling the ballistic-to-diffusive transition in nematode motility   reveals low-dimensional behavioral variation across species [q-bio.NC]
Superconducting dark energy [gr-qc]
Equipartitions and a Distribution for Numbers: A Statistical Model for   Benford's Law [physics.data-an]
Uncertainty analysis and composite hypothesis under the likelihood   paradigm [q-bio.QM]
The long-tail distribution function of mutations in bacteria [q-bio.PE]
Simultaneous regulation of cell size and chromosome replication in   bacteria [q-bio.CB]
A Jump Distance Distribution-based Bayesian model selection procedure   reliably extracts molecular motion features from single molecule tracking   data [q-bio.QM]
A new measure of multisensory integration in a single neuron based on   dependent probability summation [q-bio.NC]
Directional out-coupling of light from a plasmonic nanowire-nanoparticle   junction [physics.optics]
Evaluating the importance of environmental factors on the spatial   distribution of livestock settlements in the Monte desert with a Monte Carlo   based model: Settlement Dynamics in Drylands (SeDD) [q-bio.PE]
Molecular geometry of alkaloids present in seeds of mexican prickly   poppy [q-bio.OT]
Computational principles of biological memory [q-bio.NC]
Critical population and error threshold on the sharp peak landscape for   the Wright-Fisher model [math.PR]
Computational neuroanatomy: mapping cell-type densities in the mouse   brain, simulations from the Allen Brain Atlas [q-bio.NC]
A model of sensory neural responses in the presence of unknown   modulatory inputs [q-bio.NC]
Possible Mechanisms for Neural Reconfigurability and their Implications [cs.NE]
Asymptotic Green's function for the stochastic reproduction of competing   variants via Fisher's angular transformation [q-bio.PE]
Design of Pressure Actuated Cellular Structures [q-bio.QM]
A model of sensory neural responses in the presence of unknown   modulatory inputs [q-bio.NC]
Sequential Monte Carlo with Adaptive Weights for Approximate Bayesian   Computation [stat.CO]
The shifted Wald distribution represents a non-censored Wiener diffusion   model of choice response times: Evidence from simulations and a Go/No-go task [q-bio.NC]
What is Learning? A primary discussion about information and   Representation [cs.AI]
Tissue fibrosis: a principal evidence for the central role of Misrepairs   in aging [q-bio.TO]
Species Trees from Gene Trees Despite a High Rate of Lateral Genetic   Transfer: A Tight Bound [math.PR]
BioNetGen 2.2: Advances in Rule-Based Modeling [q-bio.QM]
What can ecosystems learn? Expanding evolutionary ecology with learning   theory [q-bio.PE]
Were there two forms of Stegosaurus? [q-bio.PE]
Detecting somatic mutations in genomic sequences by means of   Kolmogorov-Arnold analysis [q-bio.GN]
Efficient maximum likelihood parameterization of continuous-time Markov   processes [physics.data-an]
The Automatic Neuroscientist: automated experimental design with   real-time fMRI [q-bio.NC]
Elastic network models for RNA: a comparative assessment with molecular   dynamics and SHAPE experiments [q-bio.BM]
A guide through a family of phylogenetic dissimilarity measures among   sites [q-bio.PE]
Superchords: the atoms of thought [q-bio.NC]
On the Sensitivity of Protein Data Bank Normal Mode Analysis: An   Application to GH10 Xylanases [q-bio.BM]
pyBioSig: optimizing group discrimination using genetic algorithms for   biosignature discovery [q-bio.QM]
The Zigzag Hochschild Complex [math.DG]
A model of discrete Kolmogorov-type competitive interaction in a   two-species ecosystem [math.DS]
Magnetic fields facilitate DNA-mediated charge transport [physics.bio-ph]
Fronts and fluctuations at a critical surface [nlin.PS]
Fluctuating fitness shapes the clone size distribution of immune   repertoires [q-bio.PE]
Three-layer Regulation Leads to Monoallelic Olfactory Receptor   Expression [q-bio.MN]
Tests of cosmic ray radiography for power industry applications [physics.ins-det]
Alignment of protein-coding sequences with frameshift extension   penalties [cs.DS]
Sustainability of Transient Kinetic Regimes and Origins of Death [q-bio.MN]
An Introduction to Multilevel Monte Carlo for Option Valuation [math.NA]
Determining physical properties of the cell cortex [physics.bio-ph]
Bayesian co-estimation of selfing rate and locus-specific mutation rates   for a partially selfing population [q-bio.PE]
Neural mechanism to simulate a scale-invariant future timeline [q-bio.NC]
Effective diffusion rates and cross-correlation analysis of "acid   growth" data [q-bio.CB]
Linear models of activation cascades: analytical solutions and   coarse-graining of delayed signal transduction [q-bio.MN]
Error Threshold of Fully Random Eigen Model [q-bio.PE]
Known Structure, Unknown Function: An Inquiry-based Undergraduate   Biochemistry Laboratory Course [physics.ed-ph]
The free energy cost of accurate biochemical oscillations [physics.bio-ph]
Predicting the extinction of Ebola spreading in Liberia due to   mitigation strategies [q-bio.PE]
NGC4370: a case study for testing our ability to infer dust distribution   and mass in nearby galaxies [astro-ph.GA]
Evaluation of the Number of Different Genomes on Medium and   Identification of Known Genomes Using Composition Spectra Approach [q-bio.GN]
Demography-based adaptive network model reproduces the spatial   organization of human linguistic groups [q-bio.PE]
Coupling all-atom molecular dynamics simulations of ions in water with   Brownian dynamics [physics.comp-ph]
Exact simulation of the Wright-Fisher diffusion [stat.ME]
Computational Performance and Statistical Accuracy of *BEAST and   Comparisons with Other Methods [q-bio.PE]
DNA cyclization and looping in the wormlike limit: normal modes and the   validity of the harmonic approximation [q-bio.BM]
Hybrid approaches for multiple-species stochastic reaction-diffusion   models [physics.comp-ph]
The effect of disorder in the contact probability of elongated   conformations of biopolymers [q-bio.BM]
Chaotic Neuronal Oscillations in Spontaneous Cortical-Subcortical   Networks [q-bio.NC]
Head-related Impulse Response Cues for Spatial Auditory Brain-computer   Interface [q-bio.NC]
Cell bystander effect induced by radiofrequency electromagnetic fields   and magnetic nanoparticles [q-bio.SC]
Coupling-induced oscillations in two intrinsically quiescent populations [q-bio.PE]
What makes a neural code convex? [q-bio.NC]
Cluster Mergers and the Origin of the ARCADE-2 Excess [astro-ph.HE]
An enhanced merger fraction within the galaxy population of the SSA22   protocluster at z ~ 3.1 [astro-ph.GA]
A System Structure for Adaptive Mobile Applications [cs.SE]
A rise in the ionizing photons in star-forming galaxies over the past 5   billion years [astro-ph.GA]
Self-similar ultra-relativistic jetted blast wave [astro-ph.HE]
Simulator of Galaxy Millimeter/Submillimeter Emission (SIGAME): The   [CII]-SFR Relationship of Massive z=2 Main Sequence Galaxies [astro-ph.GA]
A New Model for Mixing By Double-Diffusive Convection (Semi-Convection).   III. Thermal and Compositional Transport Through Non-Layered ODDC [astro-ph.SR]
On the [CII]-SFR relation in high redshift galaxies [astro-ph.GA]
Weak lensing by galaxy troughs in DES Science Verification data [astro-ph.CO]
Improved determination of sterile neutrino dark matter spectrum [hep-ph]
The GHOSTS survey. II. The diversity of Halo Color and Metallicity   Profiles of Massive Disk Galaxies [astro-ph.GA]
Gaia FGK benchmark stars: abundances of alpha and iron-peak elements [astro-ph.SR]
Early Science with the Large Millimeter Telescope: Observations of dust   continuum and CO emission lines of cluster-lensed submillimetre galaxies at   z=2.0-4.7 [astro-ph.GA]
Nonsingular Cosmology from an Unstable Higgs Field [hep-th]
Near-Infrared Polarimetric Adaptive Optics Observations of NGC 1068: A   torus created by a hydromagnetic outflow wind [astro-ph.GA]
Electron and Ion Acceleration in Relativistic Shocks with Applications   to GRB Afterglows [astro-ph.HE]
A Deep XMM-Newton Study of the Hot Gaseous Halo Around NGC 1961 [astro-ph.GA]
The Diversity of Transients from Magnetar Birth [astro-ph.HE]
Wilson Loop Invariants from $W_N$ Conformal Blocks [hep-th]
Disk-stability constraints on the number of arms in spiral galaxies [astro-ph.GA]
Relativistic Shocks: Particle Acceleration and Magnetization [astro-ph.HE]
Building a Better Understanding of the High Redshift BOSS Galaxies as   Tools for Cosmology [astro-ph.CO]
Constraints on $\mu$-distortion fluctuations and primordial   non-Gaussianity from Planck data [astro-ph.CO]
An independent test of the photometric selection of white dwarf   candidates using LAMOST DR3 [astro-ph.SR]
What is the optimal way to measure the galaxy power spectrum? [astro-ph.CO]
Construction of the CHIPS-M prototype and simulations of a 10 kiloton   module [physics.ins-det]
The stellar wind velocity field of HD 77581 [astro-ph.HE]
Five groups of red giants with distinct chemical composition in the   globular cluster NGC 2808 [astro-ph.SR]
In [15]:
print('elapsed time =' + str(t1 - t0))
elapsed time =3.36903595924
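
The loop above makes one `send_message` call per paper. SQS also accepts up to 10 messages per request through `queue.send_messages(Entries=...)`, which may cut the number of round trips. The following is a sketch under that assumption, not something the original notebook ran; `make_entry` and `send_in_batches` are hypothetical helper names, and the sketch ignores any per-message failures reported in the response.

```python
def make_entry(entry_id, title, source, abstract):
    """Build one batch entry carrying the same three attributes as the loop above."""
    return {
        "Id": str(entry_id),  # must be unique within a single batch
        "MessageBody": "boto3",
        "MessageAttributes": {
            "Title": {"StringValue": title, "DataType": "String"},
            "Source": {"StringValue": source, "DataType": "String"},
            "Abstract": {"StringValue": abstract, "DataType": "String"},
        },
    }

def send_in_batches(queue, entries, batch_size=10):
    """Send entries with queue.send_messages, at most batch_size per call.

    Returns the number of send_messages calls made.
    """
    calls = 0
    for start in range(0, len(entries), batch_size):
        queue.send_messages(Entries=entries[start:start + batch_size])
        calls += 1
    return calls

# Usage with the lists loaded earlier (needs AWS access, so left as a comment):
#   entries = [make_entry(i, titles[i], sites[i], abstracts[i])
#              for i in range(1350, 1450)]
#   send_in_batches(queue, entries)
```

With 100 papers this makes 10 requests instead of 100. Whether that actually shortens the elapsed time was not measured here.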

What was in the DynamoDB table after the process ended?

The output that appeared in the DynamoDB table is in the file BookTable-dynammodb-output.csv.

Note: it took about 3.4 seconds to fill the queue, and if we sort the CSV file by timestamp we see that it took 26 seconds to do all the processing using 8 predictors.
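
The two timings can be turned into rough rates. This is back-of-the-envelope arithmetic only: the per-predictor figure assumes the work was split evenly and that all 8 predictors were busy for the whole 26 seconds, neither of which the notebook verifies.

```python
n_papers = 100                    # papers 1350 through 1449
enqueue_seconds = 3.36903595924   # elapsed time printed by the notebook
process_seconds = 26.0            # read off the sorted timestamps in the CSV
n_predictors = 8

# Messages per second pushed into the queue
enqueue_rate = n_papers / enqueue_seconds
# Papers per second processed across all predictors
process_rate = n_papers / process_seconds
# Seconds one predictor spends per paper, under the even-split assumption
per_predictor_seconds = process_seconds * n_predictors / n_papers

print("enqueue rate: %.1f msg/s" % enqueue_rate)
print("processing rate: %.1f papers/s" % process_rate)
print("per predictor: %.2f s/paper" % per_predictor_seconds)
```

This works out to roughly 29.7 messages per second into the queue, about 3.8 papers per second out of it, and about 2.08 seconds per paper for each predictor.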
