Wolfram Data Repository
Immediate Computable Access to Curated Contributed Data
Nucleotide sequences of the SARS-CoV-2 virus (the virus associated with the COVID-19 disease, formerly known as 2019-nCoV) including location, collection time and similar supporting data
| Element | Description |
| --- | --- |
| "LatestData" | a Dataset containing the most recently collected data |
| "CollectionHistogram" | a DateHistogram of when the sequences were collected |
| "ReleaseHistogram" | a DateHistogram of when the sequences were released to the public |
| "AffectedLocations" | a world map showing where these sequences were collected |
| "SubmissionAuthors" | a Dataset containing the accessions for each author list |
| "AlignmentDifferences" | a Dataset containing alignment differences with the reference sequence |
| "ReferenceBioSequence" | a BioSequence representing the reference SARS-CoV-2 genome |
Get a Dataset containing rows for the most recent sequences:
In[1]:= | ![]() |
Out[1]= | ![]() |
Get a Dataset containing rows for all sequences (this can take considerable time to download and expand):
In[2]:= | ![]() |
Out[2]= | ![]() |
Return the latest date a sequence was released:
In[3]:= | ![]() |
Out[3]= | ![]() |
Count the distinct lengths of the provided sequences; a sequence's length corresponds well to the part of the virus that was sequenced:
In[4]:= | ![]() |
Out[4]= | ![]() |
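The length tally described above can be illustrated outside the notebook. The following is a minimal Python sketch, not the Wolfram Language input shown in the cell; the function name `length_counts` and the toy sequences are hypothetical and stand in for the real Dataset rows.

```python
from collections import Counter


def length_counts(sequences):
    """Tally how many sequences share each length."""
    return Counter(len(seq) for seq in sequences)


# Hypothetical toy sequences; real genomes are roughly 30,000 bases long.
toy_sequences = ["ACGTACGT", "ACGTACGA", "ACG"]
counts = length_counts(toy_sequences)
for length, n in sorted(counts.items()):
    print(length, n)
```

Lengths that occur many times typically point to a commonly sequenced region, while isolated lengths tend to be partial or unusual submissions.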
Most of these SARS-CoV-2 samples were collected from humans, but not all:
In[5]:= | ![]() |
Out[5]= | ![]() |
Some of these genetic sequences correspond to named variants of interest as designated by the World Health Organization (WHO):
In[6]:= | ![]() |
Out[6]= | ![]() |
Get a date histogram of collection dates:
In[7]:= | ![]() |
Out[7]= | ![]() |
See a date histogram of release dates:
In[8]:= | ![]() |
Out[8]= | ![]() |
Show the locations where the sequences were gathered:
In[9]:= | ![]() |
Out[9]= | ![]() |
Obtain the available alignment differences with the reference sequence:
In[10]:= | ![]() |
Out[10]= | ![]() |
Show the authors with the accessions of the sequences they submitted:
In[11]:= | ![]() |
Out[11]= | ![]() |
Obtain the reference sequence as a biomolecular sequence:
In[12]:= | ![]() |
Out[12]= | ![]() |
A phylogenetic tree comparison of the most common complete genomes by location shows clusters that are broadly distributed. Dropping the trailing runs of adenine bases avoids arbitrary differences from varying poly(A) RNA tail lengths, which may be sequencing artifacts and shouldn’t affect viral adaptivity:
In[13]:= | ![]() |
Out[15]= | ![]() |
In[16]:= | ![]() |
Out[16]= | ![]() |
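The poly(A) trimming step mentioned above is simple to sketch. This Python illustration is an assumption about how such trimming might be done, not the code from the notebook cells; the helper names `strip_poly_a_tail` and `most_common_trimmed` are hypothetical, and sequences are assumed to be uppercase strings.

```python
from collections import Counter


def strip_poly_a_tail(sequence):
    """Remove the trailing run of 'A' characters (assumes uppercase input)."""
    return sequence.rstrip("A")


def most_common_trimmed(sequences):
    """Return (sequence, count) for the most frequent tail-trimmed sequence."""
    counts = Counter(strip_poly_a_tail(seq) for seq in sequences)
    return counts.most_common(1)[0]


# Hypothetical toy data: the first two differ only in poly(A) tail length.
samples = ["ACGTAA", "ACGTAAAA", "ACGC"]
print(most_common_trimmed(samples))
```

Trimming before counting means two submissions that differ only in tail length are treated as the same genome, which is the effect the paragraph above describes.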
A similar visualization can be created for samples where more detailed geographic information is supplied. In this visualization of the most common sequences reported for US states, we see the emergence of clusters containing interesting regional blocks, as shown in the map below:
In[17]:= | ![]() |
Out[17]= | ![]() |
In[18]:= | ![]() |
Out[18]= | ![]() |
Visualizing the similarity of the most common sequences by month of collection reveals recurring overlaps (most significantly between December 2019 and February 2020), illustrating that the virus has shown not only evolution but also significant continuity. Since then, wider spread has led to further divergence:
In[19]:= | ![]() |
Out[20]= | ![]() |
Using the provided alignment differences, we can see where along the viral genome changes have been detected over time. Although mutations are distributed relatively uniformly along the genome, some changes are observed far more often than others:
In[21]:= | ![]() |
Out[22]= | ![]() |
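One way to see which regions of the genome change most often is to bin the difference positions. The Python sketch below covers only the positional part of the analysis (it ignores collection dates) and is not the notebook's code. It assumes, hypothetically, that each alignment difference is available as a `(position, reference_base, observed_base)` tuple keyed by accession; the actual layout of the "AlignmentDifferences" element may differ.

```python
from collections import Counter


def position_counts(differences_by_accession, bin_size=1000):
    """Count differences per genome bin; keys are the bin start positions."""
    counts = Counter()
    for differences in differences_by_accession.values():
        for position, _reference_base, _observed_base in differences:
            counts[(position // bin_size) * bin_size] += 1
    return counts


# Hypothetical toy differences for two made-up accessions.
toy_differences = {
    "acc1": [(241, "C", "T"), (3037, "C", "T")],
    "acc2": [(241, "C", "T")],
}
binned = position_counts(toy_differences)
for start in sorted(binned):
    print(start, binned[start])
```

A roughly flat profile across bins would match the uniform spread noted above, while a few tall bins would mark the changes that are observed most often.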
It is also possible to treat these genetic differences as lists of features:
In[23]:= | ![]() |
Out[24]= | ![]() |
Doing so enables a fairly wide variety of analyses. Here, we determine all of the genetic differences that always occur together in the sampled sequences, taking advantage of the fact that differences that always occur together must occur in the same number of sequences:
In[25]:= | ![]() |
Out[26]= | ![]() |
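The co-occurrence idea above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the notebook's implementation: each sequence is assumed to carry a list of difference labels (the labels and accessions below are made up), and the function name `cooccurring_groups` is hypothetical. The sketch first buckets differences by how many sequences carry them, which is the necessary condition the text mentions, and then confirms that the carrying sequences are identical.

```python
from collections import defaultdict


def cooccurring_groups(features_by_accession):
    """Return groups of differences that appear in exactly the same sequences."""
    # Map each difference to the set of accessions that carry it.
    carriers = defaultdict(set)
    for accession, features in features_by_accession.items():
        for feature in features:
            carriers[feature].add(accession)

    # Necessary condition: differences that co-occur share a carrier count.
    by_count = defaultdict(list)
    for feature, accessions in carriers.items():
        by_count[len(accessions)].append(feature)

    # Within each count bucket, group by the exact carrier set.
    groups = []
    for features in by_count.values():
        by_set = defaultdict(list)
        for feature in features:
            by_set[frozenset(carriers[feature])].append(feature)
        groups.extend(sorted(g) for g in by_set.values() if len(g) > 1)
    return sorted(groups)


# Hypothetical toy feature lists for three made-up accessions.
toy_features = {
    "s1": ["C241T", "C3037T", "A23403G"],
    "s2": ["C241T", "C3037T"],
    "s3": ["A23403G", "G28881A"],
}
print(cooccurring_groups(toy_features))
```

Bucketing by count first keeps the exact set comparison confined to small candidate groups, since two differences with different carrier counts can never always occur together.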
Wolfram Research, "Genetic Sequences for the SARS-CoV-2 Coronavirus" from the Wolfram Data Repository (2021) https://doi.org/10.24097/wolfram.03304.data
Public Domain