Lab 10: Working with Data
NetID: <Please fill in>
Note: Please do not rush to evaluate all code by hitting shift+enter. If you do not understand the logical progression of how we are attempting to solve the task, chances are you will run into more frustrating errors.
A Data Science Workflow
A data science project needs a flexible, modular, iterative and multiparadigm workflow.
Setting up Questions
The first stage of the workflow is where you frame questions. To get some useful conclusions from the data, you need to start out with the right questions.
What Can You Learn from the Data?
That is a pretty broad question. It makes sense to break it down into a few specific questions that can guide your analysis.
Topic-specific questions like:
◼ How many…?
◼ Who…?
◼ Where…?
◼ What happened, together with…?
Generic questions like:
◼ Who is this analysis for?
◼ What is the action that will be driven by the insight from this analysis?
◼ How will they access the results? At what frequency?
The questions can be fuzzy as you start out, and they can change later. In fact, more interesting questions may surface as you sift through the data. However, it is important to set up questions at the beginning with the audience in mind. Otherwise, with the sheer variety of things that you can try with the data, you might end up wasting a lot of time trying unnecessary things.
Data Wrangling
◼ Process of importing raw data and converting it into a suitable format for downstream analysis
◼ Sometimes requires “hacking skills” to organize and clean messy data into an informative, manageable dataset
◼ Goal is to create code for semi-automated tools that would make the process easier the next time the workflow is used
Exploratory Data Analysis
Exploratory data analysis (EDA) can help:
◼ Gain an intuitive understanding of the underlying nature of the dataset
◼ Identify relationships between variables
◼ Formulate good questions for the actual analysis (as the explorations proceed, those questions can change)
◼ Evaluate the quality of the data (Data QA)
Tools used in EDA can be categorized as:
◼ Graphical or non-graphical
◼ Univariate (exploring one feature/variable at a time) or multivariate (exploring combined behavior of more than one variable)
Analyzing Data: Machine Learning can Help
Questions that can be answered by supervised machine learning:
◼ Classification: Is this A or B? (Is this A or B or C or D…?)
◼ Regression: How many or how much?
Questions that can be answered by unsupervised machine learning:
◼ Clustering: How is the data organized? Does the data have some inherent structure? Do the samples sort themselves out into different groups and subgroups?
◼ Anomaly detection: Is this unusual? Are there outliers in the data?
◼ Sequence prediction and time series forecasting: What comes next?
Communicating Results
◼ Visualizations and infographics
◼ Computational essays or reports
◼ Web-deployed apps or microsites
Part 1: EDA or Exploratory Data Analysis
In this lab, we will have a look at a couple of simple examples of working with data.
Background
What is an Abalone?
We’ll perform exploratory data analysis of the Abalone dataset from the UCI Machine Learning Repository.
The Abalone dataset contains information about physical measurements of abalone (a type of marine snail).
What does the dataset contain?
The dataset includes the following attributes:
◼ Sex: Categorical variable with three categories: M (male), F (female), and I (infant).
◼ Length: Continuous variable representing the longest shell measurement (in mm).
◼ Diameter: Continuous variable representing the measurement perpendicular to length (in mm).
◼ Height: Continuous variable representing the height of the shell (in mm).
◼ Whole weight: Continuous variable representing the whole abalone weight (in grams).
◼ Shucked weight: Continuous variable representing the weight of the meat (in grams).
◼ Viscera weight: Continuous variable representing the gut weight (after bleeding) (in grams).
◼ Shell weight: Continuous variable representing the weight of the shell (after being dried) (in grams).
◼ Rings: Integer variable representing the number of rings (which can be used to estimate the age of the abalone).
How can you go about doing Exploratory Data Analysis (EDA)?
The EDA process typically involves the following steps:
1. Load the dataset.
2. Check for missing values and figure out a way to deal with them consistently.
3. Examine the summary statistics of the dataset.
4. Visualize the distribution of each variable.
5. Analyze the relationships between variables using scatter plots, correlation matrices, etc.
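Step 2 is the one that is easiest to skip. As a language-neutral illustration (a minimal Python sketch, not the Wolfram Language code used in this notebook), the following counts empty cells per column and then applies one consistent policy: drop every incomplete row. The sample rows and the helper names `count_missing` and `drop_incomplete` are made up for this example.

```python
import csv
import io

# Hypothetical sample rows in an abalone-like column layout (not real data).
# The empty Height field in the third row simulates a missing value.
RAW = """Sex,Length,Diameter,Height,Rings
M,0.45,0.36,0.10,15
F,0.53,0.42,0.14,9
I,0.33,0.26,,7
"""


def is_empty(value):
    """A cell counts as missing when it is absent or blank."""
    return value is None or value.strip() == ""


def count_missing(rows):
    """Return {column name: number of missing cells}."""
    counts = {name: 0 for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            if is_empty(value):
                counts[name] += 1
    return counts


def drop_incomplete(rows):
    """One consistent policy: discard any row that has a missing cell."""
    return [row for row in rows if not any(is_empty(v) for v in row.values())]


rows = list(csv.DictReader(io.StringIO(RAW)))
missing = count_missing(rows)
clean = drop_incomplete(rows)

print(missing)
print(len(rows), "rows loaded,", len(clean), "rows kept")
```

Dropping rows is only one possible policy; filling in a default or a column average is another. What matters is that the same rule is applied to every column.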
Code for Exploratory Data Analysis
Load the Data
Load the dataset into a Wolfram Language Dataset object:
Problem 1: How many rows and columns are there?
List some basic information about the dataset, such as the number of rows and columns, column names, and data types (String, Integer, or Real)
This information can easily be computed using code.
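To make the idea concrete outside the notebook, here is a minimal Python sketch (not the Wolfram Language cell for this problem) that reports the number of rows and columns, the column names, and a guessed type for each column. The sample rows and the `infer_type` helper are hypothetical.

```python
import csv
import io

# Hypothetical sample rows (not real data).
RAW = """Sex,Length,Height,Rings
M,0.45,0.10,15
F,0.53,0.14,9
I,0.33,0.08,7
"""


def infer_type(values):
    """Classify a column as Integer, Real, or String from its text values."""
    try:
        for v in values:
            int(v)
        return "Integer"
    except ValueError:
        pass
    try:
        for v in values:
            float(v)
        return "Real"
    except ValueError:
        return "String"


rows = list(csv.DictReader(io.StringIO(RAW)))
columns = list(rows[0].keys())
n_rows, n_cols = len(rows), len(columns)
types = {c: infer_type([row[c] for row in rows]) for c in columns}

print(n_rows, "rows x", n_cols, "columns")
for name in columns:
    print(name, "->", types[name])
```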
Problem 2: Calculate descriptive statistics
Calculate summary statistics (Mean, Median and StandardDeviation) for each numerical column.
The following code shows how to calculate the Mean for the “Length” column:
The following code shows how to calculate the Median for the “Length” column:
The following code shows how to calculate the Standard Deviation for the “Length” column:
What are the Mean, Median, and StandardDeviation of the “Height” of all the samples?
The functions Max and Min can be used in the same way as Mean to find the maximum or minimum value of a column.
Use them to find the maximum and minimum number of “Rings”.
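For comparison, the same five summaries can be sketched in Python with the standard `statistics` module. This is an illustration only, not the notebook's Wolfram Language code, and the values in `lengths` are made up. Note that `statistics.stdev` computes the sample standard deviation (dividing by n - 1), which is also what the Wolfram Language function StandardDeviation computes.

```python
import statistics

# Hypothetical "Length" values (not taken from the abalone dataset).
lengths = [0.2, 0.4, 0.4, 0.6, 0.9]

summary = {
    "Mean": statistics.mean(lengths),
    "Median": statistics.median(lengths),
    # Sample standard deviation: divides by n - 1, not n.
    "StandardDeviation": statistics.stdev(lengths),
    "Max": max(lengths),
    "Min": min(lengths),
}

for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```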
Problem 3: Visual exploration using scatter plots
The following code shows how you can create a scatter plot of the Diameter vs. WholeWeight of the samples:
Create a scatter plot of the features “Length” vs. “Height”. Explain the relationship you see.
Problem 4: Visual exploration using histograms
The following code selects the male samples and plots a histogram of their number of “Rings”:
Modify the code above to create a histogram of the number of rings of the female samples. What is the most frequently occurring value for the number of rings and how many samples do you see for this value?
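A histogram of an integer feature is just a count of how often each value occurs, so the tallest bar answers the "most frequent value" question directly. The Python sketch below (an illustration with made-up samples, not the notebook's Wolfram Language code) selects one sex, counts the ring values, and prints a text histogram. The helper name `ring_counts` is hypothetical.

```python
from collections import Counter

# Hypothetical (sex, rings) pairs (not real data).
samples = [
    ("M", 9), ("F", 10), ("M", 9), ("M", 11),
    ("I", 7), ("F", 10), ("M", 10), ("M", 9),
]


def ring_counts(data, sex):
    """Count how many samples of the given sex have each ring value."""
    return Counter(rings for s, rings in data if s == sex)


counts = ring_counts(samples, "M")
mode_value, mode_count = counts.most_common(1)[0]

# Text histogram: one row per ring value, one '#' per sample.
for rings in sorted(counts):
    print(f"{rings:>2} | {'#' * counts[rings]}")
print("most frequent ring count:", mode_value, "seen", mode_count, "times")
```

Changing the second argument of `ring_counts` selects a different group without touching the counting logic.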
Problem 5: Non-graphical exploration - find the correlation of the features.
The following code shows the correlation between the numerical features:
Which two features are the most correlated and which ones are the least correlated?
You can use the following code to find the maximum and minimum correlation values.
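To show what "maximum and minimum correlation values" means in practice, here is a minimal Python sketch (not the notebook's Wolfram Language code). It computes the Pearson correlation for every pair of distinct features and picks the largest and smallest. The feature values are made up, and `pearson` is a hand-written helper, not a library call.

```python
import math
from itertools import combinations

# Hypothetical feature columns (not real data).
features = {
    "Length": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Diameter": [2.0, 4.0, 6.0, 8.0, 10.0],
    "Rings": [5.0, 3.0, 6.0, 2.0, 7.0],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Only pairs of distinct features: a feature's correlation with itself
# is always 1 and would hide the interesting maximum.
pair_corr = {
    (a, b): pearson(features[a], features[b])
    for a, b in combinations(features, 2)
}

strongest = max(pair_corr.items(), key=lambda item: item[1])
weakest = min(pair_corr.items(), key=lambda item: item[1])

print("max:", strongest)
print("min:", weakest)
```

This ranks by the signed value. If the data contained strong negative correlations, ranking by absolute value would be the better measure of how strongly two features are related.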
Part 2: EDA of Content on a WebPage of Your Choice
In this section we will perform EDA on a webpage of your choice to see how we can quickly get quantitative and visual information about this page.
Explore a webpage of your choice
Set the URL for the page you want to explore:
Import the text from the webpage:
Problem 6: What is the page talking about?
Create a word cloud from the page text:
What are the most frequently used words on this page?
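A word cloud draws each word at a size that reflects how often it occurs, so the underlying data is a word-frequency table. The Python sketch below (an illustration, not the notebook's Wolfram Language code) builds that table for a made-up text. The short `STOPWORDS` set is an assumption for this example; real stopword lists are much longer.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to", "on"}


def word_frequencies(text):
    """Lowercase the text, split it into words, and count non-stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


# Made-up page text (not imported from any real page).
page_text = (
    "The abalone is a marine snail. Abalone shells are prized, "
    "and the shell of an abalone is iridescent."
)

freq = word_frequencies(page_text)
top = freq.most_common(3)
print(top)
```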
Problem 7: Analyze the text
The following code finds all the sentences on the page:
Number of sentences found on the page:
Create a histogram of the length of the sentences:
Sort the sentences by the number of words in them and show the top 5 longest sentences on the page:
Show the 5 shortest sentences on the page:
How many sentences did you find?
What are the shortest and the longest sentences on the page?
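The steps in this problem (split into sentences, count them, measure their lengths, sort) can be sketched in Python as follows. This is an illustration with made-up text, not the notebook's Wolfram Language code. The splitter is deliberately naive: it breaks after `.`, `!`, or `?` followed by whitespace, so it will mis-split abbreviations such as "e.g." in real pages.

```python
import re

# Made-up page text (not imported from any real page).
page_text = (
    "Abalone are marine snails. They cling to rocks! "
    "Why are their shells so colorful? Nacre."
)


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def word_count(sentence):
    """Number of whitespace-separated words in a sentence."""
    return len(sentence.split())


sentences = split_sentences(page_text)
lengths = [word_count(s) for s in sentences]

# Sort by word count: longest first, then shortest first.
longest_first = sorted(sentences, key=word_count, reverse=True)
shortest_first = sorted(sentences, key=word_count)

print("sentences found:", len(sentences))
print("word counts:", lengths)
print("longest:", longest_first[0])
print("shortest:", shortest_first[0])
```

The `lengths` list is exactly the data a histogram of sentence lengths would be drawn from.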
Problem 8: What sort of pictures are found on the page?
Import the images from the page:
Create a visual clustering of the images:
Do you find any unusual clusters (of images that are very different from other images on the page)?
If yes, why do you think the cluster is separate from the others?
Extra Credit: Get WikipediaData on a topic related to your webpage for comparison
The following function gets the plain text from the Wikipedia page on a particular topic. Provide the topic of your article in the space between the quotes:
Problem 9: Create a word cloud
Create a word cloud to see what is of most interest on this page.
What are the words most frequently used on this page? Compare the results with your previous word cloud. Do they seem similar or does Wikipedia seem to talk about something else?
Problem 10: Find a specific type of entity used often on this page.
You can look for parts of speech like nouns, adjectives, verbs, etc.
You can look for quantities like money, age, measurements, etc.
You can look for countries.
The following code looks for countries in the Wikipedia article about “abalone”:
Count how many times each entity is mentioned and sort the numbers in decreasing order:
Which entity is mentioned most in your Wikipedia article?
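The counting-and-sorting step can be illustrated in Python as below. This is a rough sketch, not the notebook's Wolfram Language code: instead of real entity recognition it matches names from a small, hypothetical lookup list, and the article text is made up.

```python
import re
from collections import Counter

# Hypothetical lookup list standing in for real entity recognition.
COUNTRIES = ["Australia", "Japan", "New Zealand", "South Africa"]

# Made-up article text (not taken from Wikipedia).
article = (
    "Farms operate in Japan and Australia. Japan also imports from "
    "New Zealand, and demand in Japan remains high."
)


def count_entities(text, names):
    """Count whole-word mentions of each name, most frequent first."""
    counts = Counter()
    for name in names:
        hits = re.findall(r"\b" + re.escape(name) + r"\b", text)
        if hits:
            counts[name] = len(hits)
    # most_common() returns (name, count) pairs in decreasing order of count.
    return counts.most_common()


ranked = count_entities(article, COUNTRIES)
for name, n in ranked:
    print(name, n)
```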
Submitting your work
1. Ensure you have filled in your NetID at the top of the notebook.
2. Save the notebook as a PDF file (alternatively, use "Print to PDF", but check that the PDF looks OK and is not garbled).
3. Upload to Gradescope.
4. To confirm that your submission was received, you may email your TA (sattwik2@illinois.edu).