Document Language Recognition Lab
Allow Mathematica to Initialize
Introduction
There are many ways to recognize the language of a document. The goal of this lab is to introduce you to a particularly computationally efficient one that uses “digraphs,” or combinations of two adjacent letters that appear in the text. The details are in the following handout: . From each document, you will compute a “digraph vector.” As you proceed, keep your focus on two central questions:
1. What is the typical angle between the digraph vectors of two documents written in the same language? What about documents written in different but related languages? And documents written in unrelated languages? Add your results to our class document: .
2. How could you leverage what you learned from your answer to the first question to design an algorithm that recognizes the language of an unclassified document? Imagine that you are an engineer at a tech company and your job is to make this algorithm work fast enough to deal with millions of documents per day.
Finding Digraph Vectors
We examine Wikipedia articles written in different languages. The following command produces the digraph vector for the French article about dogs. Punctuation and numbers are ignored, diacritics are removed, and all letters are converted to lower case. The first entry is the count of the digraph “aa”, the second is the count of “ab”, and so on up to the 676th entry, which counts “zz.” The second command does the same for the French article about cats.
In[]:=
french1=digraphVector["dog","French"]
In[]:=
french2=digraphVector["cat","French"]
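To make the construction concrete, here is a minimal sketch of how such a vector could be built from a plain string. The name toyDigraphVector is hypothetical and is not the lab's digraphVector, whose definition is not shown here. The sketch assumes that spaces are dropped along with punctuation, so digraphs may span word boundaries; the lab's function may treat spaces differently.
In[]:=
toyDigraphVector[text_String]:=Module[{letters,alphabet,allDigraphs,counts},
  (* keep only the letters a-z after removing diacritics and lowercasing *)
  letters=StringCases[ToLowerCase[RemoveDiacritics[text]],RegularExpression["[a-z]"]];
  alphabet=CharacterRange["a","z"];
  (* all 676 digraphs in the order "aa","ab",...,"zz" *)
  allDigraphs=Flatten[Outer[StringJoin,alphabet,alphabet]];
  (* count each pair of adjacent letters *)
  counts=Counts[StringJoin/@Partition[letters,2,1]];
  (* look up each digraph's count, using 0 when it never occurs *)
  Lookup[counts,allDigraphs,0]]
In[]:=
toyDigraphVector["banana"]
For the word “banana” the adjacent pairs are ba, an, na, an, na, so the resulting vector has a 2 in the “an” and “na” positions, a 1 in the “ba” position, and 0 everywhere else.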
Which Languages Can You Use?
You can find the complete list of Wikipedias in . For the purposes of this lab, look for ones that use the Latin script. Try some “risky” languages too, but don’t be surprised if you get unexpected results.
Computing Angles Between Digraph Vectors
After you have found the digraph vectors for two documents, say vector1 and vector2, you can compute the angle between them. We will use this angle to measure the “distance” between the two documents. Mathematica makes this computation relatively easy. The dot product of two vectors is computed with a period between them:
In[]:=
french1.french2
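As a small check of the syntax, here is the same operation on two short vectors chosen only for illustration. Each pair of matching entries is multiplied and the products are summed, so the result is 1*4+2*5+3*6, which is 32.
In[]:=
{1,2,3}.{4,5,6}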
To compute the desired angle, you will also need the magnitudes of both vectors. As a hint, these can be found using the dot product! What are you actually computing when you compute vector1.vector1?
Note: To make your results easier to digest, express the angle in degrees.
Hint: When computing the angle, write a general formula in terms of vectors v and w, then simply redefine v and w for each new pair of documents to avoid a lot of typing.
Some Useful Functions
Below are some functions you may find useful for this lab. Pay particular attention to the syntax: details such as capitalization and square brackets versus parentheses matter.
In[]:=
(Cos[17]+2)/13
In[]:=
N[(Cos[17]+2)/13]
In[]:=
Pi/2
In[]:=
Sqrt[(12+17)/2]
In[]:=
ArcCos[0]
In[]:=
N[Cos[french1.french2]]
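Mathematica's trigonometric functions work in radians, so ArcCos[0] above returns Pi/2 rather than 90. One possible way to convert a result to degrees, shown here as a sketch, is to multiply by 180/Pi and wrap the expression in N to get a decimal value. The command below returns 90.
In[]:=
N[ArcCos[0]*180/Pi]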
The Lab
Focus on the following questions in your investigation:
1. What is the typical angle between two documents both written in the same language? Pick a language, pick five pairs of documents written in it, compute angles, and average your results. Report to the class Google Doc. Now repeat for a different language.
2. What is the typical angle between two documents each written in one of two related languages (like French and Spanish, or German and Dutch)? Pick a pair of languages, pick five pairs of documents with one document in each language, compute angles, and average your results. Report to the class Google Doc. Now repeat for a different pair of related languages.
3. Finally, what is the typical angle between two documents each written in one of two unrelated languages (like French and Vietnamese, or German and Hungarian)? Pick a pair of languages, pick five pairs of documents with one document in each language, compute angles, and average your results. Report to the class Google Doc. Now repeat for a different pair of unrelated languages.
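For the averaging step in each question above, one option is Mathematica's built-in Mean. In this sketch, angle1 through angle5 are placeholder names standing in for the five angles (in degrees) that you computed; replace them with your own values.
In[]:=
Mean[{angle1,angle2,angle3,angle4,angle5}]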
Now step back and try to figure out how a company such as Google might go about classifying the language of a new document found on the web. How do digraphs help?