Function Resource

ToLongForm

Converts an association or a dataset into a long form dataset

ResourceFunction["ToLongForm"][dataset]

converts the argument dataset into a long form dataset with colums “AutomaticKey”, “Variable”, and “Value”.

ResourceFunction["ToLongForm"][dataset, idcols, valcols]

converts the argument dataset into a long form dataset with colums idcols, and columns “Variable” and “Value” derived from valcols.

ResourceFunction["ToLongForm"][dataset,opts]

converts the argument dataset into a long form dataset using the option specifactions opts.

Details and Options

The so called "long form" and "wide form" conversions of tabular data are fundamental "data wrangling" operations.
In long form conversion the columns of the original dataset are seen in three groups: identifier columns, variable columns, and ignored columns.
The conversion into long form converts the set of specified variable columns into two columns: one with variable names, the other variable values.
The conversion of a dataset into a long form allows column names (of variables) to be treated as data.
If no identifier columns are given ResourceFunction["ToLongForm"] uses the order indexes of the rows as identifiers.
ToLongForm takes the following options:
"IdentifierColumns"Automaticidentifier columns to make the long form dataset with
"VariableColumns"Automaticvariable columms that are spread into a variable column and a value column
"AutomaticKeysTo""AutomaticKey"the column name if automatic keys is used
"VariablesTo""Variable"the column name of the column that has the variable names as values
"ValuesTo""Value"the column name of the column that has the variable values as values

Examples

Basic Examples

Here is a simple dataset:

In[1]:=
dataset = Dataset[{
    <|"a" -> "x", "b" -> 5|>,
    <|"a" -> "y", "b" -> 6|>,
    <|"a" -> "x", "b" -> 10|>,
    <|"a" -> "y", "b" -> 100|>,
    <|"a" -> "z", "b" -> Missing[]|>}];
k = 1;
dataset = dataset[All, Prepend[#, "ID" -> k++] &]
Out[3]=

Here is a long form conversion with specified identifier column and variable columns:

In[4]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset, "ID", {"a", "b"}]
Out[4]=

Here is another long form derived with automatically determined identifier column and variable columns:

In[5]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset]
Out[5]=

Here is a conversion to long form with identifier columns "a" and "b":

In[6]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset, {"a", "b"}]
Out[6]=

Scope

The first argument can be a matrix:

In[7]:=
mat = RandomReal[{100, 200}, {3, 6}];
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][mat]
Out[8]=

The first argument can be a list of associations:

In[9]:=
alist = AssociationThread[Range[Length[#]], #] & /@ mat;
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][alist]
Out[10]=

The second and third arguments can be Automatic:

In[11]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset, Automatic, {"a", "b"}]
Out[11]=
In[12]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset, "a", Automatic]
Out[12]=

The options "IdentifierColumns" and "VariableColumns" can be used instead of second argument and third argument respectively:

In[13]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset, "IdentifierColumns" -> "ID", "VariableColumns" -> {"a", "b"}]
Out[13]=

The second and third arguments can be column indexes:

In[14]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset, 2, {1, 3}]
Out[14]=

Automatic keys are derived if the dataset argument has row keys. Here is dataset with row keys:

In[15]:=
SeedRandom[12];
dataset2 = Dataset[AssociationThread[RandomWord[Length[#]], Normal[#]] &@
   dataset]
Out[12]=

Here is automatic conversion to long form:

In[16]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset2]
Out[16]=

The row keys are ignored if the identifier column is specified:

In[17]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset2, "ID"]
Out[17]=

Options

AutomaticKeysTo

The option "AutomaticKeysTo" can be used to specify the name of the column that corresponds to the automatically determined identifier:

In[18]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset, "AutomaticKeysTo" -> "SpecialID"]
Out[18]=

IdentifierColumns

It might be more convenient to specify the identifier columns with an option setting:

In[19]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset, "IdentifierColumns" -> "ID"]
Out[19]=

VariableColumns

It might be more convenient to specify the variable columns with an option setting:

In[20]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset, "VariableColumns" -> "ID"]
Out[20]=

VariablesTo

The option "VariablesTo" specifies the name of long form's column that has as values the names of the variable columns:

In[21]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset, "VariablesTo" -> "VAR"]
Out[21]=

ValuesTo

The option "ValuesTo" specifies the name of long form's column that has as values the values in the variable columns:

In[22]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset, "ValuesTo" -> "VAL"]
Out[22]=

Applications

Column names as data

The main advantage of the long form conversion is that variable column names become data. (I.e. values in a certain column.)

This advantage can be demonstrated by making an association of time series objects for the rows of multiple time series data given in wide form. (I.e. a table in which most or all of the column names correspond to time axis values.)

Here is a randomly generated table (dataset) with multiple time series data:

In[23]:=
dsTSWide = Dataset[
Association[
  2018 -> Association[
    "Apr" -> 83.49, "Aug" -> 0, "Dec" -> 84.59, "Feb" -> 0, "Jan" -> 0, "Jul" -> 0, "Jun" -> 93.32000000000001, "Mar" -> 215.34, "May" -> 89.65, "Nov" -> 168.51999999999998`, "Oct" -> 0, "Sep" -> 116.71000000000001`], 2019 -> Association[
    "Apr" -> 23.61, "Aug" -> 0, "Dec" -> 133.88, "Feb" -> 12.74, "Jan" -> 152.55, "Jul" -> 0, "Jun" -> 93.34, "Mar" -> 59.69, "May" -> 126.05000000000001`, "Nov" -> 94.46000000000001, "Oct" -> 69.95, "Sep" -> 37.27], 2020 -> Association[
    "Apr" -> 94.48, "Aug" -> 38.480000000000004`, "Dec" -> 77.44, "Feb" -> 12.64, "Jan" -> 33.86, "Jul" -> 141.04000000000002`, "Jun" -> 50.36, "Mar" -> 0, "May" -> 0, "Nov" -> 12.65, "Oct" -> 0, "Sep" -> 26.13]]]
Out[23]=

Each row of that random time series dataset corresponds to a yearly sequence of monthly values.

The interpretation of the dataset reveals heterogeneous semantics of its values: (1) each row corresponds to a year which is specified with row’s key, and (2) the columns have short month column names.

Convert the dataset into long form using the row keys as identifiers:

Add the column “ObservationTime” derived from the columns “Year” and “Month”:

In[24]:=
dsLong = dsLong[All, Join[#, <|"ObservationTime" -> AbsoluteTime[ToString[#Year] <> "-" <> #Month]|>] &]
Out[24]=

Split the long form dataset by year and make time series with the columns "ObservationTime" and "Value":

In[25]:=
timeSeriesCollection = GroupBy[Normal@dsLong, #Year &, TimeSeries[Values /@ #[[All, {"ObservationTime", "Value"}]]] &]
Out[25]=

Plot the obtained time series:

In[26]:=
DateListPlot /@ timeSeriesCollection
Out[26]=

Combinations of the heterogenous data

Using long forms makes easier the programmatic manipulation of heterogenous data.

Get datasets with different number of rows and columns that correspond to items and variables of different kinds:

In[27]:=
datasets = Association@
   Map[# -> ResourceFunction[
        "ExampleDataSpecToDataset"][{"Statistics", #}] &, \
{"AnimalWeights", "EmployeeAttitude", "OrangeTreeGrowth"}];
Magnify[#, 0.6] & /@ datasets
Out[12]=

Convert all datasets into long form datasets:

In[28]:=
datasets = ResourceFunction[
   "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"] /@ datasets;
Magnify[#, 0.6] & /@ datasets
Out[25]=

For each long form dataset change the automatic key column to include dataset's name:

In[29]:=
datasets = KeyValueMap[
   Function[{k, v}, v[All, Prepend[#, "AutomaticKey" -> k <> "-" <> ToString[#AutomaticKey]] &]], datasets];
Magnify[#, 0.6] & /@ datasets
Out[6]=

Join all long form datasets into one dataset and show a sample:

In[30]:=
SeedRandom[321];
Sort@RandomSample[Join @@ datasets, 8]
Out[31]=

Properties and Relations

Sparse matrices (sparse arrays) have representation very similar to that of long form. Here is a random sparse matrix:

In[32]:=
SeedRandom[12];
smat = SparseArray[
  RandomReal[{0, 1}, {4, 5}] /. (x_?NumberQ /; x < 0.5) -> 0]
Out[33]=

Compare the array rules of that sparse matrix with its long form representation.

Here are the array rules:

In[34]:=
ArrayRules[smat]
Out[34]=

Here is the long form representation:

In[35]:=
ResourceFunction[
  "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][Normal@smat][Select[#Value > 0 &]]
Out[35]=

Note that since using Normal on the sparse matrix makes a dense matrix with zeroes, those zeroes are filtered out from the displayed long form.


Cross tabulation can be seen under certain conditions as conversion to wide form. Therefore, certain long form conversions can be "reverted" into the original forms using cross tabulation.

Here is an example using the resource function CrossTabulate over a subset of the Lake Mead elevation levels data:

In[36]:=
dsLakeData = ResourceFunction["ExampleDataSpecToDataset"][{"Statistics", "LakeMeadLevels"}][[1 ;; 4, 1 ;; 4]];
dsLong = ResourceFunction[
   "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dsLakeData, "Year", "VariablesTo" -> "Month", "ValuesTo" -> "Elevation"];
dsCrossTbl = ResourceFunction["CrossTabulate"][dsLong];
In[37]:=
AssociationThread[{"Orignal", "Long form", "Cross tabulation"}, Magnify[#, 0.8] & /@ {dsLakeData, dsLong, dsCrossTbl}]
Out[37]=

Possible Issues

If the first argument is a dataset without column names the second argument and third argument are expected to consist of column indexes.

Here is a correct specification:

In[38]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset[Values], 2, {1, 3}]
Out[38]=

Here is a couple of incorrect specifications:

In[39]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset[Values], 2, {1, 5}]
Out[39]=
In[40]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset[Values], "a"]
Out[40]=

If the identifier columns produce multiple corresponding rows for a given combination of identifier values then only the last row is put in the long form result.

For this dataset:

In[41]:=
dataset
Out[41]=

the row <|"ID"1, "a" x, "b" 5|> is "lost":

In[42]:=
ResourceFunction[
 "https://www.wolframcloud.com/obj/antononcube/DeployedResources/\
Function/ToLongForm"][dataset, "a"]
Out[42]=