I'll collect some useful Linux commands in this post for future reference, and I'll update it whenever something new comes to mind.
How to view the content of a .tar.gz file?
# tar tf file.tar.gz
How to extract .tar.gz file?
# tar zvxf file.tar.gz
How to extract .tar.bz2 file?
# tar jvxf file.tar.bz2
How to recursively copy a folder with subfolders?
# cp -R sourcefolder destinationfolder
Here I can use -L to replace soft links with their actual content, and -P to never follow soft links. Note that adding -L may greatly increase the destination folder size.
-a is the same as -dpR, where -d means -P --preserve=links, and -p means --preserve=mode.
-s makes symbolic links instead of copying, same as ln -s.
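A quick way to see the difference between -P and -L (a throwaway sketch; the /tmp paths are made up for the demo):

```shell
# Set up a folder containing a real file and a symlink to it
mkdir -p /tmp/cpdemo/src
echo "hello" > /tmp/cpdemo/src/file.txt
ln -s file.txt /tmp/cpdemo/src/link.txt

# -P keeps link.txt as a symlink in the copy
cp -R -P /tmp/cpdemo/src /tmp/cpdemo/dst_p

# -L replaces link.txt with a real copy of file.txt
cp -R -L /tmp/cpdemo/src /tmp/cpdemo/dst_l

# Check: the -P copy is still a symlink, the -L copy is a regular file
[ -L /tmp/cpdemo/dst_p/link.txt ] && echo "dst_p keeps the symlink"
[ ! -L /tmp/cpdemo/dst_l/link.txt ] && echo "dst_l has a real file"
```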
How to make ls show human readable file size?
# ls -sh
or
# ls -lh
Note that h shows sizes in human-readable form but doesn't work alone; it has to be combined with an option that makes ls show file sizes, such as -l or -s.
How to make ls sort files by size and show their sizes one file per line?
# ls -1Ssh
Note that -1 shows one file per line, -S sorts files by size, -s shows file sizes and -h shows the sizes in human-readable form.
Wednesday, 30 September 2009
Monday, 28 September 2009
More on Data on the Web Project 1
After we finished project 1 and the report, we were asked to review other people's reports as part of the assessment. One guy said something funny in his review of my report, something like: this paper is written in a professional way and the results the author achieved are quite good; however, from the paper, it seems that the author got good results without spending much effort on it; everything just seems so easy to get done. After reading this, I was like, wtf? Isn't a report supposed to be easy to read and clearly illustrate my work in the project? Who says I have to show how hard I tried before I got this far? I did show in great detail how I tweaked the parameters to get the best performance possible. I think he's just jealous of my result, lol! He just doesn't know how much time I've put into it and how many sleepless nights it took to get the project done.
Anyway, back to the main thing I want to talk about here. I'm going to write down some things I want to do further with the project. These ideas have kept flashing up in my mind since the submission of project one, so I figure I'd better write them down to put my mind at rest.
1. For identifying multi-lingual documents, I'll try another pre-processing method some fellows used: slicing the document into parts of equal size and evaluating each part individually. This should boost my f-score significantly, because my current way of determining multilingual documents is too vague and naive (though it works pretty well). With this slicing method, I can slice the document into as many parts as I want. Theoretically, the more parts a document is sliced into, the more accurate the result will be, but there are trade-offs between run-time efficiency and accuracy. I'll look into this later when I've got time.
2. Another thing I want to get done is my Naive Bayes multinomial method. I wrote the code, but it just behaves weirdly; I think I must have some hidden typos or have gotten some basic thing wrong. Still, I'm happy that the time I spent on the NB classifier paid off to some degree: its running time went from 50 seconds per document to one minute for all 1,000 test documents. With that achievement, I may as well dig deeper into it and make it fully functional. The best part of NB is that, in contrast to the n-gram ranking method I used in the project, it doesn't need a large set of training documents to reach a relatively high identification accuracy. So it'd be interesting to see it up and running and compare its performance against n-gram ranking.
3. I also want to get another classifier up using artificial neural networks. We are going through neural networks in our AI subject at the moment, and they look pretty viable for language identification as far as I understand them. The model is solid, but I think the training time might be longer than for the naive method I used, because of the complexity of neural network models: weight values need to be adjusted until the network fits the training patterns, and that looks quite calculation-intensive.
I reckon that's pretty much about it. I'll probably do it on summer holiday due to lack of time during this semester. I'm quite interested in data mining and artificial intelligence, and hopefully I can find a good mentor for my honours year. Yeah, hope I can get entry to the honours year.
433-352 Data on the Web Project 1 report
Abstract
This is the report for 433-352 Project 1. It summarizes what I’ve done in the project and discusses the n-gram ranking method I used for language identification.
1. Introduction
There are quite a few techniques and theories for language identification. In this report, I will discuss the method I used in Project 1, i.e. n-gram ranking, and discuss my results in detail. NLTK (Natural Language Toolkit) is used for tokenization. Finally, my findings in relation to the ‘out-of-place’ value will be discussed in detail.
2. Task
Our task in this project is to write one or more classifiers that identify which language or languages each document in a group of test documents is written in. To do that, we train our classifiers over a set of training documents: a large collection written in 74 different languages, where each document is labelled with the language(s) it is written in. After training, the classifier classifies a set of test documents to see how well it performs. A Perl script is provided to measure the performance of our classifiers.
3. Methodology
I implemented two classifiers, based on n-gram ranking and Naïve Bayes Multinomial. However, since my Naïve Bayes classifier doesn’t work properly, I will concentrate only on the n-gram ranking method in this report.
In order to focus on the classification methods, I chose Python to program the classifiers: NLTK (Natural Language Toolkit) makes tokenization easy, and Python handles text more easily than the other languages I know, such as Java and C.
I chose to do word unigram ranking, and the training model basically works like this:
1. It reads the words of each language as tokens from all the training documents. For each language, the frequency of every word is calculated and stored in a dictionary D.
2. A rank is then given to each word within each language according to its frequency: the most frequent word gets rank 1, the next rank 2, and so on, with no upper bound on the rank.
This is the training model. After training, the classifier should have a dictionary containing 74 languages; under each language, there are words that appear in training documents and there is a ranking attached to each word.
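The training model above can be sketched in Python roughly as follows (a minimal sketch of the idea; the function and variable names are mine, not from the actual project code):

```python
from collections import Counter

def train(docs_by_language):
    """Build the rank dictionary D: language -> {word: rank}.

    docs_by_language maps each language name to a list of tokenized
    documents (each document being a list of words).
    """
    D = {}
    for language, docs in docs_by_language.items():
        freq = Counter()
        for doc in docs:
            freq.update(doc)
        # The most frequent word gets rank 1, the next rank 2, and so
        # on; there is no upper bound on the rank.
        D[language] = {word: rank
                       for rank, (word, _count) in enumerate(freq.most_common(), 1)}
    return D
```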
When it comes to classifying test documents, the classifier does the following to identify languages:
1. It reads all the words from a test document T, and uses the same method described above to assign a rank to each word in the test document.
2. It then loops through all the words in the test document and, for each language in D, sums up the absolute values of the rank differences between each word’s rank in that language and its rank in T; we denote this sum Vd.
Note that if a word from the test document doesn’t appear in the training data set, I set its rank in the training data set to a constant, which I call the out-of-place value. This value greatly affects the final result within a certain range, and I’ll discuss it later in the report.
3. It finds the smallest difference value, and the corresponding language is the one the test document is most likely written in.
Since there are dual-language documents, the classifier actually takes the two smallest differences and computes the difference between them; we call this difference a1. If a1 is smaller than a certain threshold, the classifier considers the test document to be written in both languages; otherwise it just returns the language with the smallest difference value.
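Putting the classification steps together, a rough sketch (again, the names and the exact form of the a1 comparison are my own interpretation; here a1 is treated as a gap relative to the best score):

```python
from collections import Counter

def classify(test_words, D, out_of_place=1800, a1_threshold=0.2):
    """Return the one or two most likely languages for a test document.

    D maps each language to a {word: rank} dictionary built in training.
    """
    # Rank the test document's words the same way as in training.
    freq = Counter(test_words)
    test_rank = {w: r for r, (w, _c) in enumerate(freq.most_common(), 1)}

    # Sum of absolute rank differences per language; words missing from
    # a language's dictionary get the constant out-of-place rank.
    scores = {lang: sum(abs(test_rank[w] - ranks.get(w, out_of_place))
                        for w in test_rank)
              for lang, ranks in D.items()}

    best, second = sorted(scores, key=scores.get)[:2]
    # If the two smallest scores are close, call it a dual-language document.
    if scores[best] and (scores[second] - scores[best]) / scores[best] < a1_threshold:
        return [best, second]
    return [best]
```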
4. Results
Before we start talking about the results, there’s one thing to clarify first – the baseline. Given the large proportion of English in the development documents, we establish the baseline as the percentage of development documents that are in English, which gives us a baseline of 0.406.
The first time my classifier was working properly, I set it to treat every development document as a mono-lingual document, with the out-of-place value arbitrarily set to 5,000. Here is the result I got from classifying the 1,000 development documents:
precision recall f-score
0.948 0.636 0.761
This result was significantly higher than the baseline. I took it as a benchmark and improved the classifier’s performance from there.
I then added code that enables the classifier to identify files written in two languages. The out-of-place value remained 5,000; below is my result:
precision recall f-score
0.875 0.758 0.813
Since more documents (containing two languages) could be identified now, my overall score had a 6.4% increase.
I then tinkered with the out-of-place value and the parameter a1, which determines whether there’s a second language in the document, and improved my result slightly. Here is the result with the out-of-place value set to 1,800 and a1 set to 0.2:
precision recall f-score
0.868 0.787 0.825
This is the best f-score over the development documents I got after trying out different parameter values.
After trying out different out-of-place values, I found that this value affects the overall identification rate quite a lot. It will be discussed in more detail in the next section.
5. Observation/ Discussion
5.1. Out-of-place value
When I tinkered with my classifier, I found that changing the out-of-place value can affect the result greatly.
To see how the out-of-place value affects my results, I calculated the highest word rank among all languages in D, which is 688. Theoretically, the out-of-place value should be this number – at least that’s what I was told in the 352 tutorial. However, when I actually set this value to 689, I got a much worse result:
precision recall f-score
0.564 0.573 0.569
However, when I try large values like 10,000 or 100,000, the results don’t differ much from one another; the f-scores all remain in the 79%–80% range. Here is the result with the out-of-place value set to 100,000:
precision recall f-score
0.873 0.735 0.798
I carefully thought about it and found out the following:
The out-of-place value affects results greatly when it is set small (small meaning anything near the word ranks in D), and stabilizes the result when it is set large. Let’s look at an example to see exactly what this means:
Let xi be the number of words that appear in test document T but do not appear in class i of dictionary D, and let ranki be the typical rank (in T) of those missing words. Suppose D contains Class 1 and Class 2. The contribution of the missing words to the difference Vd for each class can then be approximated as:
Vd1 = x1 * | rank1 – out-of-place-value|
Vd2 = x2 * | rank2 – out-of-place-value|
Let’s assume T actually belongs to Class 1. Then x1 must be much smaller than x2, because more of T’s words are missing from Class 2 than from Class 1. When the out-of-place value is set small, the rank values can greatly affect Vd, which means whether T is assigned to Class 1 or Class 2 depends heavily on the ranks of the words missing from each class. This can give us false identification results, and it is why I got a much worse result when I set the out-of-place value to 689.
When the out-of-place value is set very large, the individual ranks of the missing words contribute little to Vd: |rank – out-of-place-value| is roughly the same large number for every missing word, so the much smaller x1 always makes Vd1 much smaller than Vd2, and T is reliably classified as Class 1.
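A toy calculation (with invented numbers) illustrates both regimes, using the simplified Vd model above:

```python
def vd(missing_count, rank, oop):
    # Simplified Vd: contribution of the missing words only, assuming
    # they all sit near one typical rank in the test document.
    return missing_count * abs(rank - oop)

# Class 1 is the true class: only 10 missing words, but their typical
# rank in T is low (5). Class 2 misses 100 words whose typical rank
# (650) happens to sit near a small out-of-place value.
for oop in (700, 100_000):
    winner = "Class 1" if vd(10, 5, oop) < vd(100, 650, oop) else "Class 2"
    print(f"out-of-place={oop}: {winner} wins")
```

With the small value the wrong class wins, because the missing-word ranks dominate; with the large value the smaller missing-word count dominates and the true class wins.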
Luckily, the arbitrary value of 5,000 that I set at the start already gave quite a good result. To get the best result possible, I manually ran a binary search between the two values 688 and 5,000, and found that the best result is obtained with the out-of-place value set to around 1,800. When I searched further around 1,800 for a better value, the f-score only changed very slightly, so I decided that 1,800 gives the best f-score overall.
I could have written an automated test to find the best out-of-place value, but for lack of time (I had other assignments to see to) I never arrived at a very exact best value. Still, I think my analysis is reasonable, and 1,800 does give me a reasonably good result. However, I didn’t find out why 1,800 is close to the best value.
5.2. Classifier speed up
When I had just finished my classifier, it took more than 20 minutes to train over the 8,000 training documents. I spent several hours optimizing its speed, and found that moving all calculation-intensive operations out of nested loops helps a great deal.
Down below is the stuff I found most useful when speeding up the classifier:
1. When you only need a dictionary’s keys, loop over the keys directly rather than over its items; building (key, value) pairs you never use slows the loop down.
2. If some calculations can be moved outside the main loops, do so and store the calculated values in a look-up table; this can make the classifier several times faster when the loops are deeply nested.
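A generic sketch of the second tip (not the project code): precompute an expensive per-rank quantity once, instead of recomputing it inside the loop.

```python
import math

def rank_weight(rank):
    # Stand-in for some calculation-intensive per-word operation.
    return 1.0 / math.log(rank + 1)

def score_slow(word_ranks):
    # Recomputes rank_weight inside the loop for every word occurrence.
    return sum(rank_weight(r) for r in word_ranks)

def score_fast(word_ranks, max_rank=10000):
    # Build the look-up table once, outside the main loop (index 0 is a
    # placeholder; ranks start at 1).
    table = [0.0] + [rank_weight(r) for r in range(1, max_rank + 1)]
    return sum(table[r] for r in word_ranks)
```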
Now it only takes about 1 minute to train over the 8,000 documents and about 1 minute to classify 1000 documents.
6. Conclusion
Overall, I got my own word unigram ranking language classifier up and running with an f-score significantly higher than the baseline. I also found out how out-of-place value affects the performance of the classifier. Finally I optimized the classifier so that it runs much faster.
References
Timothy Baldwin, 433-352 Lecture Notes, Department of Computer Science and Software Engineering, University of Melbourne
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, Cambridge, England, 2008
Wednesday, 9 April 2008
manipulation of multi-dimensional arrays
Source:http://www.utdallas.edu/~rxc064000/dynamic_multi_dim.htm
23.1: Multidimensional Arrays and Functions
The most straightforward way of passing a multidimensional array to a function is to declare it in exactly the same way in the function as it was declared in the caller. If we were to call
func(a2);
then we might declare
void func(int a[5][7])
{
...
}
and it's clear that the array type which the caller passes is the same as the type which the function func accepts.
If we remember what we learned about simple arrays and functions, however, two questions arise. First, in our earlier function definitions, we were able to leave out the (single) array dimension, with the understanding that since the array was really defined in the caller, we didn't have to say (or know) how big it is. The situation is the same for multidimensional arrays, although it may not seem so at first. The hypothetical function func above accepts a parameter a, where a is an array of 5 things, where each of the 5 things is itself an array. By the same argument that applies in the single-dimension case, the function does not have to know how big the array a is, overall. However, it certainly does need to know what a is an array of. It is not enough to know that a is an array of "other arrays"; the function must know that a is an array of arrays of 5 ints. The upshot is that although it does not need to know how many "rows" the array has, it does need to know the number of columns. That is, if we want to leave out any dimensions, we can only leave out the first one:
void func(int a[][7])
{
...
}
The second dimension is still required. (For a three- or more dimensional array, all but the first dimension are required; again, only the first dimension may be omitted.)
The second question we might ask concerns the equivalence between pointers and arrays. We know that when we pass an array to a function, what really gets passed is a pointer to the array's first element. We know that when we declare a function that seems to accept an array as a parameter, the compiler quietly compiles the function as if that parameter were a pointer, since a pointer is what it will actually receive. What about multidimensional arrays? What kind of pointer is passed down to the function?
The answer is, a pointer to the array's first element. And, since the first element of a multidimensional array is another array, what gets passed to the function is a pointer to an array. If you want to declare the function func in a way that explicitly shows the type which it receives, the declaration would be
void func(int (*a)[7])
{
...
}
The declaration int (*a)[7] says that a is a pointer to an array of 7 ints. Since declarations like this are hard to write and hard to understand, and since pointers to arrays are generally confusing, I recommend that when you write functions which accept multidimensional arrays, you declare the parameters using array notation, not pointer notation.
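A tiny function makes the recommendation concrete (my own example, not from the original notes): the parameter is declared with array notation, but the compiler compiles it as if it were int (*a)[7].

```c
/* Sum one row of a 5x7-style array. The parameter a is declared with
   array notation but is really a pointer to an array of 7 ints. */
int row_sum(int a[][7], int row)
{
	int j, sum = 0;
	for(j = 0; j < 7; j++)
		sum += a[row][j];
	return sum;
}
```

Calling row_sum(a2, 0) on an int a2[5][7] passes &a2[0], of type int (*)[7], exactly as described above.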
What if you don't know what the dimensions of the array will be? What if you want to be able to call a function with arrays of different sizes and shapes? Can you say something like
void func(int x, int y, int a[x][y])
{
...
}
where the array dimensions are specified by other parameters to the function? Unfortunately, in C89 you cannot. (You can do so in FORTRAN, and in the extended language implemented by gcc; the C99 revision of the Standard, completed in 1999, added variable-length arrays, which make declarations like this legal in C99-conforming compilers.)
Finally, we might explicitly note that if we pass a multidimensional array to a function:
int a2[5][7];
func(a2);
we can not declare that function as accepting a pointer-to-pointer:
void func(int **a) /* WRONG */
{
...
}
As we said above, the function ends up receiving a pointer to an array, not a pointer to a pointer.
23.2: Dynamically Allocating Multidimensional Arrays
We've seen that it's straightforward to call malloc to allocate a block of memory which can simulate an array, but with a size which we get to pick at run-time. Can we do the same sort of thing to simulate multidimensional arrays? We can, but we'll end up using pointers to pointers.
If we don't know how many columns the array will have, we'll clearly allocate memory for each row (as many columns wide as we like) by calling malloc, and each row will therefore be represented by a pointer. How will we keep track of those pointers? There are, after all, many of them, one for each row. So we want to simulate an array of pointers, but we don't know how many rows there will be, either, so we'll have to simulate that array (of pointers) with another pointer, and this will be a pointer to a pointer.
This is best illustrated with an example:
#include <stdio.h>
#include <stdlib.h>

int **array;
int nrows = 5, ncolumns = 7;  /* sizes can be chosen at run time */
int i;

array = malloc(nrows * sizeof(int *));
if(array == NULL)
    {
    fprintf(stderr, "out of memory\n");
    exit(1);  /* or return an error */
    }
for(i = 0; i < nrows; i++)
    {
    array[i] = malloc(ncolumns * sizeof(int));
    if(array[i] == NULL)
        {
        fprintf(stderr, "out of memory\n");
        exit(1);  /* or return an error */
        }
    }
array is a pointer-to-pointer-to-int: at the first level, it points to a block of pointers, one for each row. That first-level pointer is the first one we allocate; it has nrows elements, with each element big enough to hold a pointer-to-int, or int *. If we successfully allocate it, we then fill in the pointers (all nrows of them) with a pointer (also obtained from malloc) to ncolumns number of ints, the storage for that row of the array. If this isn't quite making sense, a picture should make everything clear:
Once we've done this, we can (just as for the one-dimensional case) use array-like syntax to access our simulated multidimensional array. If we write
array[i][j]
we're asking for the i'th pointer pointed to by array, and then for the j'th int pointed to by that inner pointer. (This is a pretty nice result: although some completely different machinery, involving two levels of pointer dereferencing, is going on behind the scenes, the simulated, dynamically-allocated two-dimensional "array" can still be accessed just as if it were an array of arrays, i.e. with the same pair of bracketed subscripts.)
If a program uses simulated, dynamically allocated multidimensional arrays, it becomes possible to write "heterogeneous" functions which don't have to know (at compile time) how big the "arrays" are. In other words, one function can operate on "arrays" of various sizes and shapes. The function will look something like
func2(int **array, int nrows, int ncolumns)
{
}
This function does accept a pointer-to-pointer-to-int, on the assumption that we'll only be calling it with simulated, dynamically allocated multidimensional arrays. (We must not call this function on arrays like the "true" multidimensional array a2 of the previous sections). The function also accepts the dimensions of the arrays as parameters, so that it will know how many "rows" and "columns" there are, so that it can iterate over them correctly. Here is a function which zeros out a pointer-to-pointer, two-dimensional "array":
void zeroit(int **array, int nrows, int ncolumns)
{
int i, j;
for(i = 0; i < nrows; i++)
{
for(j = 0; j < ncolumns; j++)
array[i][j] = 0;
}
}
Finally, when it comes time to free one of these dynamically allocated multidimensional "arrays," we must remember to free each of the chunks of memory that we've allocated. (Just freeing the top-level pointer, array, wouldn't cut it; if we did, all the second-level pointers would be lost but not freed, and would waste memory.) Here's what the code might look like:
for(i = 0; i < nrows; i++)
free(array[i]);
free(array);
23.1: Multidimensional Arrays and Functions
The most straightforward way of passing a multidimensional array to a function is to declare it in exactly the same way in the function as it was declared in the caller. If we were to call
func(a2);
then we might declare
func(int a[5][7])
{
...
}
and it's clear that the array type which the caller passes is the same as the type which the function func accepts.
If we remember what we learned about simple arrays and functions, however, two questions arise. First, in our earlier function definitions, we were able to leave out the (single) array dimension, with the understanding that since the array was really defined in the caller, we didn't have to say (or know) how big it is. The situation is the same for multidimensional arrays, although it may not seem so at first. The hypothetical function func above accepts a parameter a, where a is an array of 5 things, where each of the 5 things is itself an array. By the same argument that applies in the single-dimension case, the function does not have to know how big the array a is, overall. However, it certainly does need to know what a is an array of. It is not enough to know that a is an array of "other arrays"; the function must know that a is an array of arrays of 5 ints. The upshot is that although it does not need to know how many "rows" the array has, it does need to know the number of columns. That is, if we want to leave out any dimensions, we can only leave out the first one:
func(int a[][7])
{
...
}
The second dimension is still required. (For a three- or more dimensional array, all but the first dimension are required; again, only the first dimension may be omitted.)
The second question we might ask concerns the equivalence between pointers and arrays. We know that when we pass an array to a function, what really gets passed is a pointer to the array's first element. We know that when we declare a function that seems to accept an array as a parameter, the compiler quietly compiles the function as if that parameter were a pointer, since a pointer is what it will actually receive. What about multidimensional arrays? What kind of pointer is passed down to the function?
The answer is, a pointer to the array's first element. And, since the first element of a multidimensional array is another array, what gets passed to the function is a pointer to an array. If you want to declare the function func in a way that explicitly shows the type which it receives, the declaration would be
func(int (*a)[7])
{
...
}
The declaration int (*a)[7] says that a is a pointer to an array of 7 ints. Since declarations like this are hard to write and hard to understand, and since pointers to arrays are generally confusing, I recommend that when you write functions which accept multidimensional arrays, you declare the parameters using array notation, not pointer notation.
What if you don't know what the dimensions of the array will be? What if you want to be able to call a function with arrays of different sizes and shapes? Can you say something like
func(int x, int y, int a[x][y])
{
...
}
where the array dimensions are specified by other parameters to the function? Unfortunately, in C, you cannot. (You can do so in FORTRAN, and you can do so in the extended language implemented by gcc, and you will be able to do so in the new version of the C Standard ("C9X") to be completed in 1999, but you cannot do so in standard, portable C, today.)
Finally, we might explicitly note that if we pass a multidimensional array to a function:
int a2[5][7];
func(a2);
we can not declare that function as accepting a pointer-to-pointer:
func(int **a) /* WRONG */
{
...
}
As we said above, the function ends up receiving a pointer to an array, not a pointer to a pointer.
23.2: Dynamically Allocating Multidimensional Arrays
We've seen that it's straightforward to call malloc to allocate a block of memory which can simulate an array, but with a size which we get to pick at run-time. Can we do the same sort of thing to simulate multidimensional arrays? We can, but we'll end up using pointers to pointers.
If we don't know how many columns the array will have, we'll clearly allocate memory for each row (as many columns wide as we like) by calling malloc, and each row will therefore be represented by a pointer. How will we keep track of those pointers? There are, after all, many of them, one for each row. So we want to simulate an array of pointers, but we don't know how many rows there will be, either, so we'll have to simulate that array (of pointers) with another pointer, and this will be a pointer to a pointer.
This is best illustrated with an example:
#include <stdio.h>
#include <stdlib.h>

int **array;
/* nrows, ncolumns and i are ints, declared and set elsewhere */

array = malloc(nrows * sizeof(int *));
if(array == NULL)
{
fprintf(stderr, "out of memory\n");
exit(1); /* or return an error */
}
for(i = 0; i < nrows; i++)
{
array[i] = malloc(ncolumns * sizeof(int));
if(array[i] == NULL)
{
fprintf(stderr, "out of memory\n");
exit(1); /* or return an error */
}
}
array is a pointer-to-pointer-to-int: at the first level, it points to a block of pointers, one for each row. That first-level pointer is the first one we allocate; it has nrows elements, with each element big enough to hold a pointer-to-int, or int *. If we successfully allocate it, we then fill in the pointers (all nrows of them) with a pointer (also obtained from malloc) to ncolumns number of ints, the storage for that row of the array. If this isn't quite making sense, picture the first-level block as a column of nrows row pointers, each pointing off to its own separately allocated row of ncolumns ints.
Once we've done this, we can (just as for the one-dimensional case) use array-like syntax to access our simulated multidimensional array. If we write
array[i][j]
we're asking for the i'th pointer pointed to by array, and then for the j'th int pointed to by that inner pointer. (This is a pretty nice result: although some completely different machinery, involving two levels of pointer dereferencing, is going on behind the scenes, the simulated, dynamically-allocated two-dimensional "array" can still be accessed just as if it were an array of arrays, i.e. with the same pair of bracketed subscripts.)
If a program uses simulated, dynamically allocated multidimensional arrays, it becomes possible to write "heterogeneous" functions which don't have to know (at compile time) how big the "arrays" are. In other words, one function can operate on "arrays" of various sizes and shapes. The function will look something like
func2(int **array, int nrows, int ncolumns)
{
...
}
This function does accept a pointer-to-pointer-to-int, on the assumption that we'll only be calling it with simulated, dynamically allocated multidimensional arrays. (We must not call this function on arrays like the "true" multidimensional array a2 of the previous sections). The function also accepts the dimensions of the arrays as parameters, so that it will know how many "rows" and "columns" there are, so that it can iterate over them correctly. Here is a function which zeros out a pointer-to-pointer, two-dimensional "array":
void zeroit(int **array, int nrows, int ncolumns)
{
int i, j;
for(i = 0; i < nrows; i++)
{
for(j = 0; j < ncolumns; j++)
array[i][j] = 0;
}
}
Finally, when it comes time to free one of these dynamically allocated multidimensional "arrays," we must remember to free each of the chunks of memory that we've allocated. (Just freeing the top-level pointer, array, wouldn't cut it; if we did, all the second-level pointers would be lost but not freed, and would waste memory.) Here's what the code might look like:
for(i = 0; i < nrows; i++)
free(array[i]);
free(array);
Tuesday, 1 April 2008
Parsing arguments for your shell script
By Carl Albing, JP Vossen, and Cameron Newham on July 17, 2007 (4:00:00 PM)
Source: http://www.linux.com/feature/118031
Suppose you want to have some options on your bash shell script, some flags that you can use to alter its behavior. You could do the parsing directly, using ${#} to tell you how many arguments have been supplied, and testing ${1:0:1} to test the first character of the first argument to see if it is a minus sign. You would need some if/then or case logic to identify which option it is and whether it takes an argument. What if the user doesn't supply a required argument? What if the user calls your script with two options combined (e.g., -ab)? Will you also parse for that? The need to parse options for a shell script is a common situation. Lots of scripts have options. Isn't there a more standard way to do this?
This article is excerpted from the newly published book bash Cookbook.
The solution -- use bash's built-in getopts command to help parse options. Here is an example, based largely on the example in the manpage for getopts:
#!/usr/bin/env bash
# cookbook filename: getopts_example
#
# using getopts
#
aflag=
bflag=
while getopts 'ab:' OPTION
do
case $OPTION in
a) aflag=1
;;
b) bflag=1
bval="$OPTARG"
;;
?) printf "Usage: %s: [-a] [-b value] args\n" $(basename $0) >&2
exit 2
;;
esac
done
shift $(($OPTIND - 1))
if [ "$aflag" ]
then
printf "Option -a specified\n"
fi
if [ "$bflag" ]
then
printf 'Option -b "%s" specified\n' "$bval"
fi
printf "Remaining arguments are: %s\n" "$*"
There are two kinds of options supported here. The first and simpler kind is an option that stands alone. It typically represents a flag to modify a command's behavior. An example of this sort of option is the -l option on the ls command. The second kind of option requires an argument. An example of this is the mysql command's -u option, which requires that a username be supplied, as in mysql -u sysadmin. Let's look at how getopts supports the parsing of both kinds.
The use of getopts has two arguments.
getopts 'ab:' OPTION
The first is a list of option letters. The second is the name of a shell variable. In our example, we are defining -a and -b as the only two valid options, so the first argument in getopts has just those two letters -- and a colon. What does the colon signify? It indicates that -b needs an argument, just like -u username or -f filename might be used. The colon needs to be adjacent to any option letter taking an argument. For example, if only -a took an argument we would need to write 'a:b' instead.
The getopts built-in will set the variable named in the second argument to the value that it finds when it parses the shell script's argument list ($1, $2, etc.). If it finds an argument with a leading minus sign, it will treat it as an option and put the option letter into the given variable ($OPTION in our example). Then it returns true (i.e., 0) so that the while loop will process the option, then continue to parse options by repeated calls to getopts until it runs out of arguments (or encounters a double minus -- to allow users to put an explicit end to the options). Then getopts returns false (i.e., non-zero) and the while loop ends.
Inside the loop, when the parsing has found an option letter for processing, we use a case statement on the variable $OPTION to set flags or otherwise take action when the option is encountered. For options that take arguments, that argument is placed in the shell variable $OPTARG (a fixed name not related to our use of $OPTION as our variable). We need to save that value by assigning it to another variable because as the parsing continues to loop, the variable $OPTARG will be reset on each call to getopts.
The third case of our case statement is a question mark, a shell pattern that matches any single character. When getopts finds an option that is not in the set of expected options ('ab:' in our example) then it will return a literal question mark in the variable ($OPTION in our example). So we could have made our case statement read \?) or '?') for an exact match, but the ? as a pattern match of any single character provides a convenient default for our case statement. It will match a literal question mark as well as matching any other single character.
In the usage message that we print, we have made two changes from the example script in the manpage. First, we use $(basename $0) to give the name of the script without all the extra pathnames that may have been part of how it was invoked. Secondly, we redirect this message to standard error (>&2) because that is really where such messages belong. All of the error messages from getopts that occur when an unknown option or missing argument is encountered are always written to standard error. We add our usage message to that chorus.
When the while loop terminates, we see the next line to be executed is:
shift $(($OPTIND - 1))
which is a shift statement used to move the positional parameters of the shell script from $1, $2, etc. down a given number of positions (tossing the lower ones). The variable $OPTIND is an index into the arguments that getopts uses to keep track of where it is when it parses. Once we are done parsing, we can toss all the options that we've processed by doing this shift statement. For example, if we had this command line:
myscript -a -b alt plow harvest reap
then after parsing for options, $OPTIND would be set to 4. By doing a shift of three ($OPTIND - 1) we would get rid of the options, and then a quick echo $* would give this:
plow harvest reap
So, the remaining (non-option) arguments are ready for use in your script (in a for loop perhaps). In our example script, the last line is a printf showing all the remaining arguments.
Monday, 1 October 2007
lib for the image processing stuff
This is the library file for that image processing project.
-- PPMlib.hs
-- Library for 433-152 Project 2007
-- Bernard Pope
module PPMlib where
import Text.ParserCombinators.Parsec
type Pixel = (Int, Int, Int)
data PPM
= PPM Int Int Int [Pixel]
deriving Show
getPixels :: PPM -> [Pixel]
getPixels (PPM _ _ _ ps) = ps
getWidth :: PPM -> Int
getWidth (PPM w _ _ _) = w
getHeight :: PPM -> Int
getHeight (PPM _ h _ _) = h
getMaxVal :: PPM -> Int
getMaxVal (PPM _ _ m _) = m
{- transform interface
example: transform verify greyScale "in.ppm" "out.ppm"
-}
transform :: (PPM -> Maybe String) -> (PPM -> PPM) -> String -> String -> IO ()
transform verify trans inFile outFile = do
result <- parser inFile
case result of
Left e -> print e
Right ppm -> do
case verify ppm of
Nothing -> do
let newPPM = trans ppm
case verify newPPM of
Nothing -> writeFile outFile $ pretty newPPM
Just err -> do
putStrLn "Error in transformed image"
putStrLn err
Just err -> do
putStrLn "Error in input image"
putStrLn err
{- pretty printing PPM files -}
pretty :: PPM -> String
pretty (PPM width height maxVal pixels)
= "P3 " ++ unwords [show width, show height] ++ "\n" ++
unlines (show maxVal : prettyPixels pixels)
where
prettyPixels :: [Pixel] -> [String]
prettyPixels = map prettyPixel
prettyPixel :: Pixel -> String
prettyPixel (r,g,b) = unwords $ map show [r,g,b]
{- parsing PPM files -}
parser :: String -> IO (Either ParseError PPM)
parser = parseFromFile parsePPM
parsePPM :: Parser PPM
parsePPM = do
string "P3"
junk
width <- parseInt
junk
height <- parseInt
junk
maxVal <- parseInt
junk
pixels <- many parseTriplet
return $ PPM width height maxVal pixels
comment :: Parser ()
comment = do
char '#'
manyTill anyChar newline
spaces
junk :: Parser ()
junk = do
spaces
optional comment
parseTriplet :: Parser (Int, Int, Int)
parseTriplet = do
red <- parseInt
spaces
green <- parseInt
spaces
blue <- parseInt
spaces
return (red, green, blue)
parseInt :: Parser Int
parseInt = do
ds <- many1 digit
return $ read ds
my Proj.hs
This is our project for 433-152: Algorithmic Problem Solving. I had finished the 8 functions a few days earlier, but comparing my program's output against the provided example files with the Unix tool 'diff' showed that the pictures differed. I then spent about 8 hours fixing and debugging. This was really painful! I struggled for two hours just with the placement of the base case for 'enlarge': it has to come before the main body of the function, but I didn't realize that! WTF!! Okay, I'll cut the crap; this is the project main file Proj.hs. I'll put the library file up here in another blog entry. Take a look at it if you're interested in Haskell or image processing!
-- Proj.hs
-- Your file for 433-152 Project 2007
-- Stub code by Bernard Pope and Anthony Wirth
module Proj where
import PPMlib
import Data.Map hiding (map)
type Coord = (Int, Int)
noVerify :: PPM -> Maybe String
noVerify _ = Nothing
verify :: PPM -> Maybe String
verify (PPM w h m pixel) =
-- if there's at least one wrong pixel, return error msg otherwise return Nothing
if not (all (verify' w h m) pixel) then Just "Invalid PPM file" else Nothing
-- see whether the attributes of the picture are valid
where
verify' w h m (r, g, b)
| g > m || r > m || b > m = False
| g < 0 || r < 0 || b < 0 = False
| w < 0 || h < 0 || m < 0 = False
| otherwise = True
-- simple one!
greyScale :: PPM -> PPM
greyScale (PPM w h m pixel) =
PPM w h m (map change pixel)
where
change :: (Int, Int, Int) -> (Int, Int, Int)
change (r, g, b) = (grey, grey, grey)
where
-- grey = floor (0.3 * (fromIntegral (r::Int)) + 0.59 * (fromIntegral (g::Int)) + 0.11 * (fromIntegral (b::Int)))
grey = div (r + b + g) 3
-- method is like the one above.
negative :: PPM -> PPM
negative (PPM w h m pixel) =
PPM w h m (map (invert m) pixel)
where
invert :: Int -> (Int, Int, Int) -> (Int, Int, Int)
invert m (a, b, c) = (m - a, m - b, m - c)
-- Stupid way which I used at first!
-- inverter m ([]) = []
-- inverter m ((a, b, c) : rest) = (m - a, m - b, m - c) : (inverter m rest)
-- This seems perfectly right, but it never stops when run.
-- replicate and concat
-- replicate n : get a list of n duplicated items
-- concat : flatten a list of lists into a single list with all the original items
-- (left, right) = splitAt halfway pixel. this is a really useful way, learnt from the lecture. A good way to split things up.
enlarge :: Int -> PPM -> PPM
enlarge n (PPM w h m pixel) =
PPM (w * n) (h * n) m (vertical w n (horizontal n pixel))
-- horizontally stretching
horizontal :: Int -> [(Int, Int, Int)] -> [(Int, Int, Int)]
horizontal n pixel = concat $ map (replicate n) pixel
-- vertically stretching
-- duplicate each row and append the handled rest to it
vertical :: Int -> Int -> [(Int, Int, Int)] -> [(Int, Int, Int)]
vertical w n [] = []
vertical w n pix = (concat $ replicate n row) ++ (vertical w n rest)
where
(row, rest) = splitAt (n * w) pix
--crap this is what I did at first, obviously it's wrong
--concat $ map (replicate n) pixel
--concat (map (replicate n) pixel)
-- I have an idea, but don't know how to write code for it.
-- I take the average value of the corresponding pixels and let it be the new pixel.
-- What I'm not sure is, what if w or h is not divisible by n? How to handle the remainder?
reduce :: Int -> PPM -> PPM
reduce _ _ = undefined
-- reconstruct rows in the reverse order, pretty simple.
reflectRow :: PPM -> PPM
reflectRow (PPM w h m pixel) =
PPM w h m (reflect w pixel)
reflect :: Int -> [a] -> [a]
-- base case
reflect w [] = []
-- append rows from the bottom to the top
reflect w pixel = (reflect w right) ++ left
where
(left, right) = splitAt w pixel
-- reverse all the pixels in each row and then append all the rows together!
reflectCol :: PPM -> PPM
reflectCol (PPM w h m pixel) =
PPM w h m (revcol w pixel)
revcol :: Int -> [a] -> [a]
revcol w [] = []
-- reverse and append
revcol w pixel = (reverse left) ++ (revcol w right)
where
(left, right) = splitAt w pixel
-- Really don't know how to do it
rotate :: Float -> Coord -> PPM -> PPM
rotate _ _ _ = undefined
-- a pretty easy one, used the function min
threshold :: Int -> PPM -> PPM
threshold t (PPM w h m pixel) =
PPM w h m (map change pixel)
where
change :: (Int, Int, Int) -> (Int, Int, Int)
change (a, b, c) = ((min t a), (min t b), (min t c))
-- First take out extra columns and then rows
crop :: Coord -> Coord -> PPM -> PPM
crop (a, b) (c, d) (PPM w h m pixel) =
PPM (d - b + 1) (c - a + 1) m (croprow h a c (d - b + 1) (cropcol w b d pixel))
-- eliminate columns that we don't need
cropcol w b d [] = []
cropcol w b d pix = image ++ (cropcol w b d rest)
where
-- get each row
(row, rest) = splitAt w pix
-- for each row, take out unwanted part to the left of the selected area
(junk1, rowrest) = splitAt b pix
-- for each row, take out the part to the right of the wanted image part
(image, junk2) = splitAt (d - b + 1) rowrest
--take out extra rows
croprow h a c w' [] = []
croprow h a c w' pix = image
where
-- extra rows above the selected image
(upperjunk, rest) = splitAt (a * w') pix
-- rows below the wanted part of image
(image, lowerjunk) = splitAt (w' * (c - a + 1)) rest
translate :: Coord -> PPM -> PPM
translate (a, b) (PPM w h m pixel) =
PPM w h m (transrow w h a (transcol w b pixel))
-- move the left edge of the picture to the correct column.
transcol :: Int -> Int -> [(Int, Int, Int)] -> [(Int, Int, Int)]
transcol w b [] = []
transcol w b pix = case b >= 0 of
True -> black b ++ left ++ transcol w b rest
False -> rightn ++ black (-b) ++ transcol w b rest
where
(row, rest) = splitAt w pix
(left, right) = splitAt (w - b) row
(leftn, rightn) = splitAt (-b) row
-- move the top edge to the correct row.
transrow :: Int -> Int -> Int -> [(Int, Int, Int)] -> [(Int, Int, Int)]
transrow w h a [] = []
transrow w h a pix = case a >= 0 of
True -> black (a * w) ++ image
False -> uimage ++ black ((-a) * w)
where
(image, rest) = splitAt ((h - a) * w) pix
(urest, uimage) = splitAt ((-a) * w) pix
-- make black pixels
black :: Int -> [(Int, Int, Int)]
black 0 = []
black n = (0, 0, 0) : black (n - 1)
-- don't have enough time to do it.
normalize :: PPM -> PPM
normalize _ = undefined
-- Proj.hs
-- Your file for 433-152 Project 2007
-- Stub code by Bernard Pope and Anthony Wirth
module Proj where
import PPMlib
import Data.Map hiding (map)
type Coord = (Int, Int)
noVerify :: PPM -> Maybe String
noVerify _ = Nothing
verify :: PPM -> Maybe String
verify (PPM w h m pixel) =
    -- if there's at least one bad pixel, return an error message; otherwise return Nothing
    if all (verify' w h m) pixel then Nothing else Just "Invalid PPM file"
    -- see whether the attributes of the picture are valid
    where
        verify' w h m (r, g, b)
            | g > m || r > m || b > m = False
            | g < 0 || r < 0 || b < 0 = False
            | w < 0 || h < 0 || m < 0 = False
            | otherwise = True
-- simple one!
greyScale :: PPM -> PPM
greyScale (PPM w h m pixel) =
    PPM w h m (map change pixel)
    where
        change :: (Int, Int, Int) -> (Int, Int, Int)
        change (r, g, b) = (grey, grey, grey)
            where
                -- grey = floor (0.3 * (fromIntegral (r::Int)) + 0.59 * (fromIntegral (g::Int)) + 0.11 * (fromIntegral (b::Int)))
                grey = div (r + b + g) 3
-- method is like the one above.
negative :: PPM -> PPM
negative (PPM w h m pixel) =
    PPM w h m (map (invert m) pixel)
    where
        invert :: Int -> (Int, Int, Int) -> (Int, Int, Int)
        invert m (a, b, c) = (m - a, m - b, m - c)
-- Stupid way which I used at first!
-- inverter m ([]) = []
-- inverter m ((a, b, c) : rest) = (m - a, m - b, m - c) : (inverter m rest)
-- This seems perfectly right, but it never stops when run.
-- replicate and concat
-- replicate n : get a list of n duplicated items
-- concat : flatten a list of lists into one list with all the original items
-- (left, right) = splitAt halfway pixel. This is a really useful idiom, learnt from the lecture. A good way to split things up.
enlarge :: Int -> PPM -> PPM
enlarge n (PPM w h m pixel) =
    PPM (w * n) (h * n) m (vertical w n (horizontal n pixel))
-- horizontally stretching
horizontal :: Int -> [(Int, Int, Int)] -> [(Int, Int, Int)]
horizontal n pixel = concat $ map (replicate n) pixel
-- vertically stretching
-- duplicate each row and append the handled rest to it
vertical :: Int -> Int -> [(Int, Int, Int)] -> [(Int, Int, Int)]
vertical w n [] = []
vertical w n pix = (concat $ replicate n row) ++ (vertical w n rest)
    where
        (row, rest) = splitAt (n * w) pix
-- crap, this is what I did at first; obviously it's wrong
-- concat $ map (replicate n) pixel
-- concat (map (replicate n) pixel)
-- I have an idea, but don't know how to write code for it.
-- I take the average value of the corresponding pixels and let it be the new pixel.
-- What I'm not sure is, what if w or h is not divisible by n? How to handle the remainder?
reduce :: Int -> PPM -> PPM
reduce _ _ = undefined
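The averaging idea described in the comments above can be sketched like this. This is only my guess at one workable shape: `reduceSketch` is a made-up name (not the stub's), the `PPM` declaration is an assumed mirror of what PPMlib presumably defines (redeclared here so the snippet stands alone), and the remainder when w or h is not divisible by n is simply discarded.

```haskell
-- A possible sketch for reduce: each output pixel is the average of the
-- n*n source pixels it covers. Assumes PPM is built as
-- PPM width height maxval pixels; leftover rows/columns are dropped.
data PPM = PPM Int Int Int [(Int, Int, Int)] deriving (Eq, Show)

reduceSketch :: Int -> PPM -> PPM
reduceSketch n (PPM w h m pixel) =
    PPM w' h' m [ avgBlock bx by | by <- [0 .. h' - 1], bx <- [0 .. w' - 1] ]
    where
        w' = w `div` n
        h' = h `div` n
        -- average the n*n source pixels covered by output pixel (bx, by)
        avgBlock bx by =
            let ps = [ pixel !! ((by * n + dy) * w + (bx * n + dx))
                     | dy <- [0 .. n - 1], dx <- [0 .. n - 1] ]
                avg f = sum (map f ps) `div` (n * n)
            in (avg (\(r, _, _) -> r), avg (\(_, g, _) -> g), avg (\(_, _, b) -> b))
```

Reducing a 2x2 image by 2 this way yields a single pixel whose channels are the (integer) average of the four inputs.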
-- reconstruct rows in the reverse order, pretty simple.
reflectRow :: PPM -> PPM
reflectRow (PPM w h m pixel) =
    PPM w h m (reflect w pixel)
reflect :: Int -> [a] -> [a]
-- base case
reflect w [] = []
-- append rows from the bottom to the top
reflect w pixel = (reflect w right) ++ left
    where
        (left, right) = splitAt w pixel
-- reverse all the pixels in each row and then append all the rows together!
reflectCol :: PPM -> PPM
reflectCol (PPM w h m pixel) =
    PPM w h m (revcol w pixel)
revcol :: Int -> [a] -> [a]
revcol w [] = []
-- reverse and append
revcol w pixel = (reverse left) ++ (revcol w right)
    where
        (left, right) = splitAt w pixel
-- Really don't know how to do it
rotate :: Float -> Coord -> PPM -> PPM
rotate _ _ _ = undefined
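One way rotate could be approached is inverse nearest-neighbour sampling: for every destination pixel, rotate its coordinates backwards by the angle around the given centre, and copy whatever source pixel lands there (black when it falls outside the image). A hedged sketch under that assumption; `rotateSketch` is my own name, and the `PPM` declaration is an assumed stand-in for PPMlib's, included only so the snippet is self-contained.

```haskell
-- Sketch of rotation by theta radians about centre (cx, cy), sampling
-- the source image with the inverse rotation so every destination
-- pixel gets exactly one value. Assumes PPM width height maxval pixels.
data PPM = PPM Int Int Int [(Int, Int, Int)] deriving (Eq, Show)

rotateSketch :: Float -> (Int, Int) -> PPM -> PPM
rotateSketch theta (cx, cy) (PPM w h m pixel) =
    PPM w h m [ sample x y | y <- [0 .. h - 1], x <- [0 .. w - 1] ]
    where
        sample x y =
            let dx = fromIntegral (x - cx)
                dy = fromIntegral (y - cy)
                -- rotate backwards by theta to find the source coordinate
                sx = cx + round (dx * cos theta + dy * sin theta)
                sy = cy + round (dy * cos theta - dx * sin theta)
            in if sx >= 0 && sx < w && sy >= 0 && sy < h
                  then pixel !! (sy * w + sx)
                  else (0, 0, 0)
```

A sanity check on the design: rotating by 0 must give the image back unchanged, since the inverse mapping is then the identity.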
-- a pretty easy one, used the function min
threshold :: Int -> PPM -> PPM
threshold t (PPM w h m pixel) =
    PPM w h m (map change pixel)
    where
        change :: (Int, Int, Int) -> (Int, Int, Int)
        change (a, b, c) = ((min t a), (min t b), (min t c))
-- First take out extra columns and then rows
crop :: Coord -> Coord -> PPM -> PPM
crop (a, b) (c, d) (PPM w h m pixel) =
    PPM (d - b + 1) (c - a + 1) m (croprow h a c (d - b + 1) (cropcol w b d pixel))
-- eliminate columns that we don't need
cropcol w b d [] = []
cropcol w b d pix = image ++ (cropcol w b d rest)
    where
        -- get each row
        (row, rest) = splitAt w pix
        -- for each row, take out the unwanted part to the left of the selected area
        (junk1, rowrest) = splitAt b row
        -- for each row, take out the part to the right of the wanted image part
        (image, junk2) = splitAt (d - b + 1) rowrest
-- take out extra rows
croprow h a c w' [] = []
croprow h a c w' pix = image
    where
        -- extra rows above the selected image
        (upperjunk, rest) = splitAt (a * w') pix
        -- rows below the wanted part of the image
        (image, lowerjunk) = splitAt (w' * (c - a + 1)) rest
translate :: Coord -> PPM -> PPM
translate (a, b) (PPM w h m pixel) =
    PPM w h m (transrow w h a (transcol w b pixel))
-- move the left edge of the picture to the correct column.
transcol :: Int -> Int -> [(Int, Int, Int)] -> [(Int, Int, Int)]
transcol w b [] = []
transcol w b pix = case b >= 0 of
        True -> black b ++ left ++ transcol w b rest
        False -> rightn ++ black (-b) ++ transcol w b rest
    where
        (row, rest) = splitAt w pix
        (left, right) = splitAt (w - b) row
        (leftn, rightn) = splitAt (-b) row
-- move the top edge to the correct row.
transrow :: Int -> Int -> Int -> [(Int, Int, Int)] -> [(Int, Int, Int)]
transrow w h a [] = []
transrow w h a pix = case a >= 0 of
        True -> black (a * w) ++ image
        False -> uimage ++ black ((-a) * w)
    where
        (image, rest) = splitAt ((h - a) * w) pix
        (urest, uimage) = splitAt ((-a) * w) pix
-- make black pixels
black :: Int -> [(Int, Int, Int)]
black 0 = []
black n = (0, 0, 0) : black (n - 1)
-- didn't have enough time to do this one.
normalize :: PPM -> PPM
normalize _ = undefined
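One common reading of normalize is plain contrast stretching: rescale the channel values so the darkest one in the image becomes 0 and the brightest becomes m. A sketch under that assumption; `normalizeSketch` is a name I made up, the `PPM` declaration mirrors what PPMlib presumably defines (redeclared so the snippet stands alone), and a completely flat image is returned unchanged to avoid dividing by zero.

```haskell
-- Sketch of normalize as contrast stretching over all channels.
-- Assumes PPM is built as PPM width height maxval pixels.
data PPM = PPM Int Int Int [(Int, Int, Int)] deriving (Eq, Show)

normalizeSketch :: PPM -> PPM
normalizeSketch (PPM w h m pixel) =
    PPM w h m (map stretch pixel)
    where
        values = concatMap (\(r, g, b) -> [r, g, b]) pixel
        lo = minimum values
        hi = maximum values
        stretch (r, g, b) = (scale r, scale g, scale b)
        scale v
            | hi == lo  = v                        -- flat image: nothing to stretch
            | otherwise = (v - lo) * m `div` (hi - lo)
```

For example, an image whose values all lie in [10, 20] would be stretched so 10 maps to 0 and 20 maps to the maximum value m.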