The Power of Shell
Using shell programming for corpus analysis
Sheng Li
University of Birmingham
January, 2012
1. My story
I still remember the first supervision meeting with my supervisor. He asked me which programming language I used.
"Sorry, none," I answered.
Then he said, "Do not worry. All you need to know is Shell."
Later, I worried about my coding skills and asked him whether I needed to learn a programming language. He answered that Shell is very helpful for corpus analysis, and that learning sed or awk as well would be even better.
At the beginning of the last academic year, we received three lectures on using the shell for corpus analysis as part of Research Methods of CL, which covered the most fundamental (but very helpful) parts of shell.
I kept doubting it until I began to use shell by myself!
Now I do not doubt it any more, because of the magical power of shell.
2. Why do we use shell?
Shell is a very simple but robust language, and it has many ‘dialects’, for example bash, zsh, and csh. The most common one is bash, which stands for Bourne-again shell. You can find bash on Mac and on most Linux distributions.
Efficiency
What we need is not only speed, but also accuracy.
- Speed
  - It is much faster than GUI tools such as UAM CorpusTool or AntConc.
  - E.g.: comparison with AntConc.
- Accuracy
  - If you use a GUI tool, the accuracy depends on the designer’s understanding.
  - E.g.: the wc command compared with the word count function in GUI tools.
- Robustness
  - Shell can handle various materials, ranging from a few individual lines to ten-million-word files (maybe even larger data, but I have only tried ten-million-word data so far).
  - E.g.: I used shell to deal with my one-million-tweet corpus, and its performance was quite good.
- Simple
  - Usually, a shell program is only one line or a few lines long, much simpler than programs in other languages and much more efficient than analysis in a GUI.
3. Converting
Question 3.1: How can we convert a PDF to another readable format?
We often need to deal with PDF files, which is a big problem!
Please download a thesis from eTheses @ Bham (http://etheses.bham.ac.uk/464/), and put it in your Desktop directory.
Solution:
1. $ pdftotext FILENAME
# The default output is a txt file.
2. $ pdf2html FILENAME
# The default output includes an indexed html file and a normal html file.
Answer 3.1
$ pdftotext Desktop/Cheung09PhD.pdf
- pdftotext is usually available by default.
- However, pdf2html must be installed separately (the package name may differ between systems):
$ sudo apt-get install pdf2html
On Mac OS X, you can use MacPorts (it must be installed first, as must the following two alternatives):
$ sudo port install pdf2html
Or Homebrew:
$ brew install pdf2html
Or Gentoo Prefix (highly recommended):
$ emerge pdf2html
- These tools work by converting the file format rather than by performing OCR.
- They are super fast!!!
Sometimes, if you input a file in a non-English encoding, such as a Big5 file (a traditional Chinese encoding), Terminal might not display it correctly. Hence, you need to convert it to a universal encoding, e.g. UTF-8.
Solution:
$ iconv -f FROM_ENCODING -t TO_ENCODING INPUTFILE > OUTPUTFILE
If you would like to know which encodings iconv can convert, you can use:
$ iconv -l
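As a minimal sketch of such a conversion, the snippet below builds a tiny ISO-8859-1 (Latin-1) file and converts it to UTF-8. The file names and the choice of ISO-8859-1 are assumptions made only so that the example is self-contained; for a Big5 file you would pass the Big5 name that iconv -l reports on your system to -f instead.

```shell
# Throwaway directory and file names, both hypothetical.
workdir=$(mktemp -d)

# Write the word "café" in ISO-8859-1: \351 is the single byte for "é".
printf 'caf\351\n' > "$workdir/latin1.txt"

# Convert from ISO-8859-1 to UTF-8 and save the result in a new file.
iconv -f ISO-8859-1 -t UTF-8 "$workdir/latin1.txt" > "$workdir/utf8.txt"

# Compare the sizes: "é" takes one byte in Latin-1 but two bytes in UTF-8.
latin1_bytes=$(wc -c < "$workdir/latin1.txt" | tr -d ' ')
utf8_bytes=$(wc -c < "$workdir/utf8.txt" | tr -d ' ')
echo "before: $latin1_bytes bytes, after: $utf8_bytes bytes"
```

Writing the result to a new file, rather than over the input, keeps the original available in case the source encoding was guessed wrongly.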
4. Read a large file
Question 4.1: How can we deal with a huge plain text file or a tabular file?
Often, we will have some 10M+ txt files as corpus data, or some 10M+ csv/tsv files, or even larger ones. Opening and reading them in a text editor will be extremely slow (if you are a vim or emacs user, that will be fine). Forget about MS Office, orz!
A few simple shell commands will be very helpful.
Solutions: These commands read from the beginning of a file (the first 10 lines by default).
$ head FILENAME
$ head -n NUMofLINES FILENAME
Question 4.2: How can we read a large file from the end?
Solutions:
$ tail FILENAME
$ tail -n NUMofLINES FILENAME
Question 4.3: How can we read a large file freely?
Solutions:
$ less FILENAME
$ more FILENAME
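The following sketch shows head and tail on a generated sample file, and how the two can be piped together to look at a stretch in the middle of a file. The file name and the line numbers are made up for illustration.

```shell
# Throwaway directory and file name, both hypothetical.
workdir=$(mktemp -d)

# Build a 1000-line sample file: "line 1", "line 2", ...
awk 'BEGIN { for (i = 1; i <= 1000; i++) print "line " i }' > "$workdir/big.txt"

# First three lines of the file.
first_three=$(head -n 3 "$workdir/big.txt")

# Last two lines of the file.
last_two=$(tail -n 2 "$workdir/big.txt")

# Lines 500 to 502: take the first 502 lines, then keep the last 3 of those.
middle=$(head -n 502 "$workdir/big.txt" | tail -n 3)

printf '%s\n' "$first_three" "$last_two" "$middle"
```

Because head stops reading once it has printed enough lines, this stays fast even when the file is very large.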
5. Word count and other simple statistics
Question 5.1: How can we count the words, lines, or characters of one large file?
MS Word has a word count function, and so does other software. We can also do this with a simple command.
Solution:
$ wc FILENAME
$ wc -option FILENAME
wc simply means Word Count. By default it prints the number of lines, words, and bytes; it also has several options:
1. -l for lines. In shell, a line ends with \n or \012.
2. -w for words
3. -c for bytes
4. -m for characters
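A small sketch of the individual counts, using a two-line sample file whose name and contents are invented for the example. Here -l counts lines, -w counts words, and -c counts bytes; reading the file through < keeps the file name out of the output, and tr strips the padding that some versions of wc add.

```shell
# Throwaway directory and file name, both hypothetical.
workdir=$(mktemp -d)
printf 'the cat sat\non the mat\n' > "$workdir/sample.txt"

# Count lines, words, and bytes separately.
lines=$(wc -l < "$workdir/sample.txt" | tr -d ' ')
words=$(wc -w < "$workdir/sample.txt" | tr -d ' ')
bytes=$(wc -c < "$workdir/sample.txt" | tr -d ' ')

echo "$lines lines, $words words, $bytes bytes"
```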
A line ends with a newline (hitting RETURN); in a regex, it is \n.
5.2 A nicer solution
In addition, if you are familiar with awk, a simple awk script can be much nicer. (See The AWK Programming Language, p. 14.)
$ awk '{ nc = nc + length($0) + 1   # characters on this line, plus 1 for the newline
         nw = nw + NF                # words (fields) on this line
       }
END { print NR, "lines,", nw, "words,", nc, "characters" }' FILENAME
NF: number of fields on the current line
NR: number of lines read so far
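To see what the awk script prints, the sketch below runs it on a small sample file; the file name and contents are assumptions for the example. For plain ASCII text the three numbers should agree with wc -l, wc -w, and wc -c.

```shell
# Throwaway directory and file name, both hypothetical.
workdir=$(mktemp -d)
printf 'the cat sat\non the mat\n' > "$workdir/sample.txt"

# Accumulate characters (nc) and words (nw) line by line, then report.
summary=$(awk '{ nc = nc + length($0) + 1   # +1 for the newline
                 nw = nw + NF }             # NF = fields on this line
           END { print NR, "lines,", nw, "words,", nc, "characters" }' \
          "$workdir/sample.txt")

echo "$summary"
```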
5.3 A more advanced solution
We can also use unigrams to calculate the file size. The idea is to tokenise the original file into a word list: each line contains only one word, and all punctuation marks are removed. Then you can just use wc to count the lines, which gives the word count of the original file. I will not go into detail about this, but provide the script here.
tr -sc 'A-Za-z' '\012' < FILENAME | # convert every run of non-letters to a single NEW LINE
wc -l # count the lines
The complete script is:
$ tr -sc 'A-Za-z' '\012' < FILENAME | wc -l
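The tokenising idea can be sketched as follows. This version assumes that every character other than an ASCII letter separates words (the tr -sc idiom from Unix for Poets), so punctuation is dropped but a form such as "I'm" is split into two tokens. The sample sentence and file name are invented.

```shell
# Throwaway directory and file name, both hypothetical.
workdir=$(mktemp -d)
printf 'Hello, world! Hello again.\n' > "$workdir/text.txt"

# -c: complement the set (everything that is NOT a letter),
# -s: squeeze each run of such characters into one newline.
tokens=$(tr -sc 'A-Za-z' '\012' < "$workdir/text.txt")
echo "$tokens"

# One token per line, so the line count is the word count.
word_count=$(tr -sc 'A-Za-z' '\012' < "$workdir/text.txt" | wc -l | tr -d ' ')
echo "$word_count words"
```

If the text starts with punctuation, the output begins with one empty line, so the count can be off by one; that is usually negligible for a large corpus.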
6. Looking at specific patterns
Question 6.1: If we want to do some manual analysis by looking at specific patterns, what can we do?
In corpus analysis, manual analysis is a must. Sometimes it is difficult to do this with a GUI tool.
Solution:
$ grep -OPTION PATTERN INPUTFILE
$ grep -OPTION REGEX INPUTFILE
# You can use regular expressions to improve the accuracy and robustness.
Generally, grep outputs the whole line containing the pattern you searched for; however, the -o option (only matching) outputs only the part of the line that matches. This enriches the frequency count function of grep.
Question 6.2: How can we look at a pattern regardless of case?
Solution:
$ grep -i PATTERN INPUTFILE
The option -i means ignore case, so grep treats lowercase and uppercase as the same pattern.
Question 6.3: Sometimes basic grep is not powerful enough. What can we do?
One example is looking at word variants, where basic grep does not help. E.g.: if we want to find “I am”, “I’m”, and “Im” at once, what can we do?
Solution:
$ grep -E PATTERN INPUTFILE
or
$ egrep -OPTION PATTERN INPUTFILE
This uses extended regular expressions, strongly recommended! Personally, I prefer egrep to grep.
$ egrep -i "\bi( am|m|'m)\b" INPUTFILE
# \b means word boundary.
Because regular expressions are very greedy or ambitious, they will match any possible pattern in the file. Without \b, any word that contains "im", such as "him" or "time", would also be included in the result. Thus, we must use some constraint, such as the word boundary, to exclude them.
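The effect of the word boundary can be checked with a small sketch. The sample lines and file name are invented, and grep -E is used as the equivalent of egrep; \b is assumed to be supported, as it is in GNU grep.

```shell
# Throwaway directory and file name, both hypothetical.
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
I am here
Give him time
Im fine
I'm ok
It is important
EOF

# Without \b, words that merely contain "im" also match.
without_boundary=$(grep -Eic "i( am|m|'m)" "$workdir/sample.txt")

# With \b, only the stand-alone forms "I am", "Im" and "I'm" match.
with_boundary=$(grep -Eic "\bi( am|m|'m)\b" "$workdir/sample.txt")

echo "matching lines without \\b: $without_boundary, with \\b: $with_boundary"
```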
Notice: grep and egrep output whole lines. If you simply combine grep or egrep with wc to count the occurrences of a pattern, the result will be wrong whenever one line contains more than one match.
Question 6.4: What can we do when one line contains more than one match?
Solution:
$ grep -o PATTERN FILENAME
This outputs only the matched text, so if one line contains more than one match, every match is printed on its own line. Combined with wc -l, the result should then be accurate.
Very important notice: for Mac users, the bundled grep version is too old, and there is a serious conflict between the -i and -o options. If you combine them, you will get a wrong result. Please update your grep immediately through MacPorts or Homebrew, as described above!
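A short sketch of the difference between counting lines and counting occurrences, on an invented three-line file in which one line contains the pattern twice:

```shell
# Throwaway directory and file name, both hypothetical.
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
I am here and I am happy
I am tired
nothing to see
EOF

# Plain grep prints each matching LINE once, however many matches it holds.
line_count=$(grep "I am" "$workdir/sample.txt" | wc -l | tr -d ' ')

# grep -o prints each MATCH on its own line, so wc -l counts occurrences.
match_count=$(grep -o "I am" "$workdir/sample.txt" | wc -l | tr -d ' ')

echo "$line_count matching lines, $match_count occurrences"
```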
Question 6.5: Is there a convenient way to deal with counting in grep?
Sometimes, using pipeline is tedious, because you may forget it.
Solution:
$ grep -c PATTERN FILENAME
With the -c option, you can count the matching lines easily.
Notice: You can always combine different options, but make sure they do not conflict. Please read the man page carefully. In particular, -c counts matching lines, not matches, even when it is combined with -o.
E.g.:
$ egrep -io "\bi( am|m|'m)\b" INPUTFILE | wc -l # This script counts all occurrences of "I am", "I'm", and "Im" in the file.
$ grep -v PATTERN FILENAME
(to look at the lines that do not match the pattern)
You can combine the different options above.
Always be aware of the ambitiousness or greed of regular expressions!!!
NB: Please update grep to the newest version; older versions have a serious bug with the -i and -o options.
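The sketch below illustrates what -c and -v report on a small invented file. It relies on -c counting matching lines rather than individual matches, so adding -o does not change the number; occurrences are counted by piping -o into wc -l instead.

```shell
# Throwaway directory and file name, both hypothetical.
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
the cat
a dog
the bird and the fish
no luck
EOF

# Lines that contain "the".
matching_lines=$(grep -c "the" "$workdir/sample.txt")

# Lines that do NOT contain "the" (-v inverts the match).
other_lines=$(grep -vc "the" "$workdir/sample.txt")

# -c still counts lines when combined with -o ...
lines_with_o=$(grep -oc "the" "$workdir/sample.txt")

# ... so occurrences are counted with -o and wc -l.
occurrences=$(grep -o "the" "$workdir/sample.txt" | wc -l | tr -d ' ')

echo "$matching_lines matching, $other_lines non-matching, $occurrences occurrences"
```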
7. Regular expression
This is used for fuzzy matching.
If you are familiar with the CQP Syntax or the Simple Query Syntax used on BNCweb, they are quite similar to regular expressions.
I will not go into detail about this, because it would take ages to discuss. You may refer to some cheatsheets.
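As a very small taste, the sketch below tries a few common constructs of extended regular expressions on an invented word list: ? (optional character), . (any single character), ^ and $ (start and end of line), | (alternation), and [...] (character set).

```shell
# Throwaway directory and file name, both hypothetical.
workdir=$(mktemp -d)
printf '%s\n' color colour cat cut cot walked walking talk > "$workdir/words.txt"

# "u?" makes the u optional: matches color and colour.
spelling=$(grep -Ec 'colou?r' "$workdir/words.txt")

# "." is any single character, anchored to the whole line: cat, cut, cot.
c_t=$(grep -Ec '^c.t$' "$workdir/words.txt")

# Alternation at the end of the line: walked, walking.
suffix=$(grep -Ec '(ed|ing)$' "$workdir/words.txt")

# A character set at the start of the line: words beginning with c or t.
initial=$(grep -Ec '^[ct]' "$workdir/words.txt")

echo "$spelling $c_t $suffix $initial"
```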
8. Some tricks
There are many useful keyboard shortcuts in Terminal.
tab: complete the relevant file or command name
ctrl+c: abort a command
ctrl+a: go to the beginning of the current line
ctrl+e: go to the end of the current line
ctrl+u: erase the whole line
ctrl+l: clear the screen
q: quit a pager such as less or man
$ man COMMAND to look at the manual.
9. Final points
Any programming language is just like a foreign language (precisely, programming languages are just artificial languages): if you can master a foreign language, then you can master a programming language.
All you need to do is keep practising.
Keep it simple, stupid! (The KISS philosophy)
10. Extended reading
- Use $ man COMMAND to consult the manual in shell.
- egrep for linguists by Nikolaj Lindberg, STTS Södermalms talteknologiservice. (Highly recommended!)
- grep for linguists by Stuart Robinson
- Unix™ for Poets by Kenneth Ward Church, AT&T Bell Laboratories. (The ultimate manual, which I am still learning.)
- Ngrams by Kenneth Ward Church, AT&T Bell Laboratories.
- The AWK Programming Language by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger. (Old stuff, but rather handy and comprehensive!)
- Why you should learn just a little Awk – A Tutorial by Example by Greg Grothaus, Google.
- Unix Shell Text Processing Tutorial (grep, cat, awk, sort, uniq) by Xah Lee
- Sculpting text with regex, grep, sed, awk, emacs and vim by Matt Might, University of Utah