Monday, 28 September 2009

More on Data on the Web Project 1

After we finished project 1 and the report, we were asked to review other people's reports as part of the assessment. One guy said something funny in his review of my report, something like: this paper is written in a professional way and the results the author achieved are quite good; however, from the paper, it seems that the author got good results without spending much effort on it; everything just seems so easy to get done. After reading this, I was like, wtf? Isn't a report supposed to be easy to read and clearly illustrate my work in the project? Who says I have to show how hard I tried before I got this far? I did show in great detail how I tweaked the parameters to get the best performance possible. I think he's just jealous of my result, lol! He just doesn't know how much time I've put into it and how many sleepless nights I had to get the project done.

Anyway, back to the main thing I want to talk about here. I'm going to write down what I want to do next with the project. These ideas have kept flashing up in my mind since the submission of project one, so I figure I have to write them down to rest my mind in peace.

1. For identifying multilingual documents, I'll try another pre-processing method some fellows used - slicing the document into parts of equal size and evaluating each part individually. This should boost my f-score significantly, because my current way of determining multilingual documents is too vague and naive (though it works pretty well). With this slicing method, I can slice the document into as many parts as I want. Theoretically, the more parts a document is sliced into, the more accurate the result will be, but there's a trade-off between run-time efficiency and accuracy. I'll look into this later when I've got time.
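The slicing idea is simple enough to sketch. This is just a rough illustration, not the actual project code; `classify` stands in for whatever per-chunk language identifier gets plugged in (e.g. the n-gram ranking method), and the function name is made up for this example:

```python
def detect_languages(text, n_slices, classify):
    """Slice a document into n_slices roughly equal parts, classify
    each part independently, and report every language detected.

    `classify` is any function mapping a chunk of text to a language
    label; a document is multilingual if more than one label comes back.
    """
    size = max(1, len(text) // n_slices)
    slices = [text[i:i + size] for i in range(0, len(text), size)]
    return {classify(chunk) for chunk in slices}

# Toy usage with a fake classifier that labels a chunk by its first letter:
labels = detect_languages("aaaabbbb", 2, lambda chunk: chunk[0])
```

Raising `n_slices` is exactly the accuracy/run-time trade-off mentioned above: more slices means more classifier calls per document.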

2. Another thing I want to get done is my Naive Bayes multinomial method. I wrote the code, but it just behaves weirdly; I must have a hidden typo somewhere or have gotten some basic thing wrong. Still, I'm happy that the time I spent on the NB classifier paid off to some degree, because its running time went from 50 seconds per document down to one minute for all 1,000 test documents. With that speed-up, I may as well dig into it more and make it fully functional. The best part of NB is that, in contrast to the n-gram ranking method I used in the project, it doesn't need a large set of training documents to reach a relatively high identification accuracy. So it'd be interesting to see it up and running and compare its performance against n-gram ranking.
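For reference, a minimal multinomial NB language identifier looks roughly like this. This is a from-scratch sketch (character unigrams, Laplace smoothing, log probabilities), not the code from the project; the function names are invented for this example:

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (text, language) pairs.
    Multinomial NB over character unigrams with Laplace smoothing.
    Returns (model, vocab) where model maps language ->
    (log prior, {char: log likelihood})."""
    counts, doc_counts = {}, Counter()
    for text, lang in docs:
        counts.setdefault(lang, Counter()).update(text)
        doc_counts[lang] += 1
    vocab = {c for ctr in counts.values() for c in ctr}
    total_docs = sum(doc_counts.values())
    model = {}
    for lang, ctr in counts.items():
        total = sum(ctr.values())
        model[lang] = (
            math.log(doc_counts[lang] / total_docs),
            {c: math.log((ctr[c] + 1) / (total + len(vocab))) for c in vocab},
        )
    return model, vocab

def classify_nb(model, text):
    """Pick the language maximizing log P(lang) + sum of log P(char | lang)."""
    unseen = math.log(1e-9)  # crude fallback for characters never seen in training
    return max(
        model.items(),
        key=lambda kv: kv[1][0] + sum(kv[1][1].get(c, unseen) for c in text),
    )[0]
```

Working in log space like this avoids the floating-point underflow that multiplying many tiny probabilities causes, which is one classic source of an NB implementation "behaving weird".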

3. I also want to get another classifier up using artificial neural networks. We are going through neural networks in our AI subject at the moment, and they look pretty viable for language identification as far as I understand them. The model is solid, but I think the training time might be longer than the naive method I used because of the complexity of neural network models - the weights need to be adjusted until the outputs match the training patterns, and that looks quite computation intensive.
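To make the weight-adjustment loop concrete, here is the simplest possible case: a single perceptron, which nudges its weights after every wrong prediction and sweeps the training set for several epochs. A real language-identification network would be bigger (more features, hidden layers, backpropagation), so treat this purely as an illustration of why training is iteration-heavy:

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """samples: list of (feature_vector, label) with label in {0, 1}.
    Repeatedly adjusts the weights toward the training patterns -
    this inner loop over epochs x samples is the costly part."""
    n = len(samples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in samples:
            err = y - predict(weights, bias, x)
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
            bias += lr * err
    return weights, bias

def predict(weights, bias, x):
    """Threshold the weighted sum of the inputs at zero."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
```

Even this toy version does epochs x samples x features multiply-adds per run, which is where the "calculation intensive" impression comes from.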

I reckon that's pretty much it. I'll probably do it over the summer holiday, given the lack of time during this semester. I'm quite interested in data mining and artificial intelligence, and hopefully I can find a good mentor for my honours year. Yeah, hope I can get into the honours year.
