Abstract
This is the report for 433-352 Project 1. It summarizes what I have done in the project and discusses the n-gram ranking method I used for language identification.
1. Introduction
There are quite a few techniques and theories for language identification. In this report, I discuss the method I used in Project 1, namely n-gram ranking, and examine my results in detail. NLTK (Natural Language Toolkit) is used for tokenization. Finally, my findings about the 'out-of-place' value are discussed in detail.
2. Task
Our task in this project is to write one or more classifiers that identify which language or languages each document in a group of test documents is written in. To do that, the classifier is first trained over a large set of training documents written in 74 different languages, for which the language or languages are given. After training, the classifier classifies a set of test documents to see how well it performs. A Perl script is provided to score the performance of our classifiers.
3. Methodology
I implemented two classifiers, one based on n-gram ranking and one on multinomial Naïve Bayes. However, since my Naïve Bayes classifier does not work properly, I will concentrate only on the n-gram ranking method in this report.
In order to focus more on the classification methods, I chose Python to program the classifiers, because NLTK (Natural Language Toolkit) makes tokenization easy in Python, and Python handles text more conveniently than the other languages I know, such as Java and C.
I chose to do word unigram ranking, and the training model basically works like this:
1. It reads the words of all the training documents as tokens. For each language, the frequencies of all its words are calculated and stored in a dictionary D.
2. A rank is then assigned to each word within each language according to its frequency: the most frequent word gets rank 1, the next most frequent gets rank 2, and so on, with no upper bound.
This is the training model. After training, the classifier has a dictionary covering the 74 languages; under each language are the words that appear in the training documents, each with a rank attached. A sketch of this training step is given below.
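As a rough illustration, the training step might look like the following sketch; the function and variable names are my own for illustration, and NLTK's word_tokenize stands in for whatever tokenization is actually wanted:

    from collections import Counter
    from nltk.tokenize import word_tokenize

    def train(training_docs):
        """Build the rank dictionary D: language -> {word: rank}."""
        counts = {}  # language -> Counter of word frequencies
        for language, text in training_docs:
            counts.setdefault(language, Counter()).update(word_tokenize(text))
        D = {}
        for language, freq in counts.items():
            # The most frequent word gets rank 1, the next gets rank 2, and so on.
            D[language] = {word: rank
                           for rank, (word, _) in enumerate(freq.most_common(), start=1)}
        return D

Here training_docs is assumed to be an iterable of (language, text) pairs built from the training set.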
When it comes to classifying test documents, the classifier does the following to identify languages:
1. It reads all the words from a test document T and uses the same method described above to assign a rank to each word in the test document.
2. It then loops through all the words in the test document and, for each language in D, sums up the absolute differences between each word's rank in that language and its rank in T; we denote this sum by Vd.
Note that if a word from the test document does not appear in the training data for a language, I set the corresponding training rank to a constant that I call the out-of-place value. This value greatly affects the final result within a certain range, and I discuss it later in the report.
3. It finds the language with the smallest distance Vd; that is the language the test document is most likely written in.
Considering that there are dual-language documents, the classifier actually takes the two smallest distances and computes the gap between them, which I call a1. If a1 is smaller than a certain threshold, the classifier considers the test document to be written in those two languages instead of one; otherwise it just returns the language with the smallest distance. A sketch of this classification step is given below.
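To make the procedure concrete, here is a simplified sketch of the classification step. The constants and the way a1 is compared against the threshold (as a relative gap) are illustrative assumptions, not necessarily the exact code I submitted:

    from collections import Counter
    from nltk.tokenize import word_tokenize

    OUT_OF_PLACE = 1800   # constant rank used for words unseen in training
    A1_THRESHOLD = 0.2    # gap below which the document is treated as dual-language

    def classify(text, D):
        """Return the one or two most likely languages for the test document."""
        # Rank the words of the test document T exactly as in training.
        freq = Counter(word_tokenize(text))
        t_ranks = {word: rank
                   for rank, (word, _) in enumerate(freq.most_common(), start=1)}

        # Out-of-place distance Vd for every language in D.
        distances = {}
        for language, ranks in D.items():
            distances[language] = sum(abs(t_ranks[word] - ranks.get(word, OUT_OF_PLACE))
                                      for word in t_ranks)

        # Take the two closest languages; report both if they are close enough.
        (best, vd1), (second, vd2) = sorted(distances.items(), key=lambda kv: kv[1])[:2]
        if (vd2 - vd1) / vd2 < A1_THRESHOLD:
            return [best, second]
        return [best]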
4. Results
Before we start talking about the results, there is one thing we need to clarify first: the baseline. Considering the large proportion of English in the development documents, the baseline is established as the percentage of development documents that are in English, which gives 0.406.
The first time my classifier was working properly, I set it to treat every development document as a mono-lingual document, and the out-of-place value was arbitrarily set to 5,000. Here is the result from classifying the 1,000 development documents:
Precision   Recall   F-score
0.948       0.636    0.761
This result was significantly higher than the baseline. I took it as a benchmark and worked to improve the classifier's performance from there.
I then added code that enabled the classifier to identify documents written in two languages. The out-of-place value remained at 5,000, and here is the result:
Precision   Recall   F-score
0.875       0.758    0.813
Since documents containing two languages could now be identified as well, my overall f-score increased by 6.4%.
I then tinkered with the out-of-place value and the parameter a1, which determines whether there is a second language in the document, and improved my result slightly. Here is the result with the out-of-place value set to 1,800 and a1 set to 0.2:
Precision   Recall   F-score
0.868       0.787    0.825
This is the best f-score over the development documents that I obtained after trying out different parameter values.
After trying out different out-of-place values, I found that this parameter affects the overall identification rate quite a lot. This is discussed in more detail in the next section.
5. Observation/ Discussion
5.1. Out-of-place value
When I tinkered with my classifier, I found that changing the out-of-place value can affect the result greatly.
In order to see how the out-of-place value affects my result, I calculated the highest word rank across all languages in D, which is 688. Theoretically, the out-of-place value should be this number, or at least this is what I was told in the 352 tutorial. However, when I actually set the value to 689, I got a much worse result:
Precision   Recall   F-score
0.564       0.573    0.569
However, if I try large values, such as 10,000 or 100,000, the results hardly differ from one another, and the f-scores all remain in the 79%-80% range. Here is the result when the out-of-place value is set to 100,000:
Precision   Recall   F-score
0.873       0.735    0.798
I carefully thought about it and found out the following:
The out-of-place value affects results greatly when it is set small (small meaning anything near the word ranks in D), and stabilizes the result when it is set large. Let's look at an example to see exactly what this means:
Let x1 and x2 be the numbers of words that appear in the test document T but do not appear under Class 1 and Class 2 of the dictionary D respectively, and let rank1 and rank2 stand for the typical ranks (in T) of those unseen words. The contribution of the unseen words to the distance Vd for each class can then be approximated as:
Vd1 ≈ x1 * |rank1 - out-of-place-value|
Vd2 ≈ x2 * |rank2 - out-of-place-value|
Let's assume T is actually in Class 1. Then x1 must be much smaller than x2, because more of T's words will be missing from Class 2 than from Class 1. When the out-of-place value is set small, the ranks of the unseen words can greatly affect the Vd values, which means that whether T is assigned to Class 1 or Class 2 depends heavily on the ranks of the words that each class has never seen. This can produce false identifications, and it is why I got a much worse result when I set the out-of-place value to 689.
When the out-of-place value is set very large, the ranks of the unseen words contribute little to Vd compared with the out-of-place value itself: |rank - out-of-place-value| is roughly the same large number for every unseen word, so the much smaller x1 always makes Vd1 much smaller than Vd2, and T is always classified as Class 1.
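A toy calculation (with entirely made-up numbers, and looking only at the contribution of the unseen words) illustrates both regimes:

    def unseen_contribution(unseen_word_ranks_in_T, out_of_place):
        # Part of Vd contributed by words of T that a given class has never seen.
        return sum(abs(rank - out_of_place) for rank in unseen_word_ranks_in_T)

    # T really belongs to Class 1. Class 1 misses only 10 of T's words, but they
    # happen to be highly ranked in T; Class 2 misses 100 words ranked around 650.
    class1_unseen = [1] * 10
    class2_unseen = [650] * 100

    for oop in (689, 100000):
        print(oop,
              unseen_contribution(class1_unseen, oop),
              unseen_contribution(class2_unseen, oop))
    # oop = 689:    6880 vs 3900       -> Class 2 looks closer (wrong)
    # oop = 100000: 999990 vs 9935000  -> Class 1 looks closer (right)

With the small out-of-place value the ranks of the unseen words dominate and the wrong class wins; with the large value only the counts x1 and x2 matter.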
The arbitrary value of 5,000 that I set at the start luckily gave quite a good result. To get the best result possible, I did a manual binary search between 688 and 5,000 and found that the best result is obtained when the out-of-place value is around 1,800. When I searched further around 1,800 for a better value, the f-score only changed very slightly, so I settled on 1,800 as giving the best f-score overall.
I could have written an automated test to find the best out-of-place value, but due to lack of time (I had other assignments to see to) I never arrived at a very exact optimum. I think my analysis was reasonable, and the value 1,800 does give a reasonably good result; however, I did not find out why roughly 1,800 is the near-best value.
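Such an automated test could be a simple parameter sweep along the lines of the sketch below, where evaluate_fscore is a hypothetical helper that re-runs the classifier over the development documents with the given out-of-place value and returns its f-score:

    def best_out_of_place(evaluate_fscore, candidates=range(700, 5001, 100)):
        # Try each candidate out-of-place value and keep the one with the best f-score.
        scores = {value: evaluate_fscore(value) for value in candidates}
        return max(scores, key=scores.get)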
5.2. Classifier speed up
When I first finished my classifier, it took more than 20 minutes to train over the 8,000 training documents. I spent several hours optimizing the speed and found that moving all calculation-intensive operations outside of nested loops helps a great deal.
Below are the things I found most useful when speeding up the classifier:
1. When looping through a dictionary, loop over its keys rather than over its items.
2. If some calculations can be moved outside the main loops, do so and store the results in a look-up table; this makes the classifier several times faster when the loops are deeply nested.
Now it only takes about 1 minute to train over the 8,000 training documents and about 1 minute to classify the 1,000 development documents.
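As an illustration of the second point, here is a minimal sketch (with hypothetical names, not my actual classifier code) of hoisting the test document's ranking out of the per-language loop:

    from collections import Counter

    OUT_OF_PLACE = 1800

    def rank_words(tokens):
        # Most frequent token gets rank 1, the next gets rank 2, and so on.
        freq = Counter(tokens)
        return {word: rank for rank, (word, _) in enumerate(freq.most_common(), start=1)}

    def distances_slow(tokens, D):
        distances = {}
        for language, ranks in D.items():
            t_ranks = rank_words(tokens)   # recomputed once per language (74 times)
            distances[language] = sum(abs(t_ranks[w] - ranks.get(w, OUT_OF_PLACE))
                                      for w in t_ranks)
        return distances

    def distances_fast(tokens, D):
        t_ranks = rank_words(tokens)       # computed once, reused as a look-up table
        distances = {}
        for language, ranks in D.items():
            distances[language] = sum(abs(t_ranks[w] - ranks.get(w, OUT_OF_PLACE))
                                      for w in t_ranks)
        return distances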
6. Conclusion
Overall, I got my word unigram ranking language classifier up and running with an f-score significantly higher than the baseline. I also found out how the out-of-place value affects the performance of the classifier. Finally, I optimized the classifier so that it runs much faster.
References
Timothy Baldwin, 433-352 Lecture Notes, Department of Computer Science and Software Engineering, University of Melbourne.
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, Cambridge, England, 2008.