Tuesday, 21 February 2012

Get Sed Savvy tr

Many thanks to Eric Wendelin, who kindly agreed to let me translate and republish his posts:
Yes, you may translate any of my articles and post them. Attribution would be appreciated, but not required.

Thanks a lot!

Cheers,
Eric

Get sed savvy – part 1

Original post

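Today I continue my tour of command-line tools, and the topic is sed. sed (Stream EDitor) is the most complex tool I have covered so far; it is a world of its own. Cramming everything into one post would be too much, so I will cover it in several parts.

The essence of sed is search and replace, so we will start there and branch out to other uses later.

Tutorial

If you are on Windows, install Cygwin or a similar tool. sed also uses regular expressions, so you may want a regex reference at hand.

"sed reads its input in order, line by line, applies its commands, and writes the result out line by line."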

sed 's/#FF0000/#0000FF/g' main.css

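We can read it like this: in main.css, search [s/] for red [#FF0000/] and replace it with blue [#0000FF], globally [/g]. Note two things:

  1. sed does not modify the file; it only prints the result to the screen;

  2. without "g", sed replaces only the first match on each line.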
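
Both notes are easy to check on a throwaway file. The following is a minimal sketch, assuming a made-up file demo.css in a temporary directory (neither name comes from the original post):

```shell
# Work in a scratch directory so nothing real is touched (demo.css is hypothetical).
tmp=$(mktemp -d)
printf 'a { color: #FF0000; border-color: #FF0000; }\n' > "$tmp/demo.css"

# Without g: only the first match on the line is replaced.
first_only=$(sed 's/#FF0000/#0000FF/' "$tmp/demo.css")
echo "$first_only"

# With g: every match on the line is replaced.
all=$(sed 's/#FF0000/#0000FF/g' "$tmp/demo.css")
echo "$all"

# No -i was used, so the file on disk still contains the original colour.
cat "$tmp/demo.css"
```

Previewing the output like this before adding -i is a cheap way to avoid damaging a real file.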

So we can change the command to edit the file in place:

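sed -i -r 's/#(FF0000|F00)\b/#0F0/g' main.css

This is the example from the earlier find tutorial: replace red with green in a CSS file. The -i option edits the file in place, and -r enables extended regular expressions. Sheila pointed out in a comment on the find post that sed's -i does not work on Solaris, so she suggests using something like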

perl -e s///g -i

instead.

Suppose we want to change every colour setting. The best way is probably a sed script like this:

# sedscript - one command per line
s/#00CC00/#9900CC/g
s/#990099/#000000/g
s/#0000FF/#00FF00/g
...

# use sedscript with -f

sed -i -f sedscript *.css

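Here sedscript is simply the script file we just created. Note that the commands inside the script file do not need to be quoted. This way we can replace every colour setting in the CSS files in one go.

More examples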

# Trim whitespace from beginning and end of line
# You *might* have to type a tab instead of \t here depending on your version of sed
sed -r 's/^[ \t]*//;s/[ \t]*$//'

# Delete all occurrences of foo
sed 's/foo//g'

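Conclusion

By now you can probably see how a one-line sed command can modify many files at once. Used well, such commands will greatly improve your efficiency.

Here are a few good sed tutorials (besides this page, of course!):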

USEFUL ONE-LINE SCRIPTS FOR SED – Eric Pement

Sed – UNIX Stream Editor – Cheat Sheet – Peteris Krumins

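I would say 90% of my sed scripts are search and replace, so you are already well on your way. As I said earlier, sed has far too many variations to cover at once, so I will gradually introduce deleting lines, adding line numbers, printing specific lines, and other tricks. Finally, I hope you will share your favourite sed commands in the comments.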

Get sed savvy – part 2

Original post

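Now that you know a little about the Stream EDitor, let's extend what we know about substitution and printing lines.

Suppose we want others to understand which functions we use in a simple CSS or JavaScript file. We want to review all recent changes, extract every comment, and write the comments to another file (for a wiki, say). Automating this would greatly improve our efficiency and teamwork.

Tutorial

If you are on Windows, install Cygwin.

# Extract single-line comments - grep would be better for this, but sed can do it too
sed -n '/\/\//p' blah.js > /tmp/comments.out

# Extract multi-line comments
sed -n '/\/\*/,/\*\//p' blah.js >> /tmp/comments.out

These sed commands look complicated, so let's break them down. -n tells sed not to print anything until you tell it exactly what to print. The comma makes sed select everything between the lines matching the two patterns, inclusive. In this example, every line from one containing /* through one containing */ is printed by p (print).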
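
To see the two addresses at work, here is a small sketch that runs on a made-up JavaScript file (the file contents and the temporary directory are assumptions for illustration only):

```shell
tmp=$(mktemp -d)
cat > "$tmp/blah.js" <<'EOF'
// single-line comment
var x = 1;
/* first line of block
   second line of block */
var y = 2;
EOF

# -n suppresses automatic printing; p prints only the lines the address selects.
single=$(sed -n '/\/\//p' "$tmp/blah.js")

# start,end: every line from one matching /* through the next one matching */.
block=$(sed -n '/\/\*/,/\*\//p' "$tmp/blah.js")

printf '%s\n' "$single" "$block"
```

The slashes and asterisks must be escaped because / delimits the pattern and * is a regex operator.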

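Of course, we can combine the two commands into one killer script.

# sed script file
/\/\//p
/\/\*/,/\*\//p

# Use sed to print all comments
sed -n -f sedscr blah.js > /tmp/comments.out

This gives us a handy file of JavaScript comments that we can post to a wiki or use to compare versions. Note that sed prints whole lines: if a comment only begins or ends partway through a line, you get the code on that line as well. It is not a perfect solution, but it is quick and convenient.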

More examples

# Print all lines longer than eighty characters
sed -n '/^.\{81\}/p' myfile

# Delete blank lines
sed '/^$/d' myfile

# Substitution optimised for speed
sed '/Yahoo/ s//Not Microhoo/g' myfile

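Conclusion

By now you should be fairly comfortable with sed. Used together with find and grep, it will help you feel at home on the command line.

I suggest you practise these commands, even in situations where you do not strictly need them, until they become familiar. After that, these tools will make you far more productive.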

Get sed savvy – part 3

Original post

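This time we will learn sed's delete (d), read (r) and write (w) commands, which make your sed toolkit much more powerful. The material here should help you handle 99% of the tasks sed is suited to.

We will move on to tools such as awk next. If you are not set up yet, install Cygwin and review the first two parts.

Tutorial

Templates are the best way to reuse code, and sed makes working with templates much more efficient.

Suppose we have an HTML template that we use often, which looks like this: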

<html>
    <head>
        <title>template.html</title>
    </head>
    <body>
        <div id="nav">Navigation here</div>
        <div id="content">
%%CONTENT%%
        </div>
    </body>
</html>

We want to replace "%%CONTENT%%" with the contents of an .htmlf file. sed's r (read) command makes this simple:

sed '/^%%CONTENT%%/r fragment.htmlf' template.html

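This command inserts the contents of fragment.htmlf after the "%%CONTENT%%" line but leaves the placeholder itself in the output, so we add a delete command to remove it: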

sed -e '/^%%CONTENT%%/r fragment.htmlf' -e '/^%%CONTENT%%/d' template.html > whole.html

This may not look like much, but its power lies in its simplicity. Whenever I need to generate wiki or HTML text, this script proves invaluable.
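
The read-then-delete idiom can be tried end to end with a cut-down template. This is only a sketch: the three-line template and the one-line fragment are invented for the demonstration, and everything happens in a temporary directory.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# A shortened stand-in for template.html (hypothetical content).
cat > template.html <<'EOF'
<div id="content">
%%CONTENT%%
</div>
EOF
printf '<p>Hello</p>\n' > fragment.htmlf

# r queues fragment.htmlf for output after the placeholder line;
# d then deletes the placeholder line itself.
sed -e '/^%%CONTENT%%/r fragment.htmlf' -e '/^%%CONTENT%%/d' template.html > whole.html
cat whole.html
```

The queued file is still written out even though d ends the cycle for the placeholder line, which is why the two commands can share the same address.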

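Now let's look at the write command (w). Suppose we want to split a CSV file according to the value in its last column. We could use grep or awk (coming soon), but sed does the job in a single pass:

#sedscript file
/,[0-9]+$/w numbers.csv
/,[A-Za-z]+$/w letters.csv
/,[^A-Za-z0-9]+$/w symbols.csv

sed -r -f sedscript original.csv

As a result, numbers.csv contains every line whose last column is a number, and likewise for letters.csv and symbols.csv. A script like this could help you split a huge address book into several files. This is only a simple example; you can probably think of more elaborate uses.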
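
A quick way to convince yourself that the split works is to run it on three sample rows. The rows below are made up, -E is used as the portable spelling of -r, and -n is added (an assumption on my part) so that sed does not also echo every line to the terminal:

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical sample data: one row per category.
cat > original.csv <<'EOF'
alice,42
bob,xyz
carol,#!
EOF

# Each rule writes the lines it matches to its own file.
cat > sedscript <<'EOF'
/,[0-9]+$/w numbers.csv
/,[A-Za-z]+$/w letters.csv
/,[^A-Za-z0-9]+$/w symbols.csv
EOF

sed -n -E -f sedscript original.csv
cat numbers.csv letters.csv symbols.csv
```

The trailing $ anchors each pattern to the end of the line, which is what ties the match to the last column.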

More examples

# Print everything outside the <html> tags

sed '/<html>/,/<\/html>/d' myfile.html

# Convert DOS line endings (\r\n) to Unix line endings (\n)
sed 's/$//' myfile        #Windows
sed 's/.$//' myfile       #Linux/UNIX
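
The Linux/UNIX variant works because the last character of every line in a DOS file is the carriage return. A small sketch on a made-up two-line file, with tr shown as an alternative that does not depend on every line ending in CR LF:

```shell
tmp=$(mktemp -d)
printf 'one\r\ntwo\r\n' > "$tmp/dos.txt"   # hypothetical DOS-style file

# s/.$// removes the last character of each line, here the CR.
sed 's/.$//' "$tmp/dos.txt" > "$tmp/unix.txt"

# tr -d '\r' deletes every CR, wherever it appears.
tr -d '\r' < "$tmp/dos.txt" > "$tmp/unix2.txt"

cat "$tmp/unix.txt"
```

Note that s/.$// also eats the last character of lines that have no CR, so it is only safe on files that are DOS-formatted throughout.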

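Conclusion

You have now learned a good part of sed's commands, which should help you edit and search text more effectively. You can wrap them in executable scripts and call those directly. One neat script could find every comment in a set of files and publish them to a wiki, which brings the team closer together, right?

Of course you can bookmark all this, but you will only truly master these commands by trying them. Please share your experiences and your own list of one-liners; that would be great!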

Sunday, 19 February 2012

The Beauty of Grep tr

The original post is here: http://eriwen.com/tools/grep-is-a-beautiful-tool/

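Global Regular Expression Print (grep) is probably an essential tool for every command-line user. Like find, it becomes far more powerful when combined with other commands.
The short tutorial below shows how concise and capable grep is. If you are on Windows, download Cygwin; if you are new to regular expressions, there is a good regex tutorial here.

Tutorial

Suppose we want to find duplicate function names in our JavaScript files. Let's see how to do that with basic grep. The same technique helps you hunt down all sorts of duplicates, for example:
  1. Duplicate HTML tags
  2. CSS tags in use
  3. Duplicate Java tags
  4. And so on

Search all JS files in a directory for "function"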

grep "function" *.js
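This prints every line containing "function" in all JS files in the current directory. It would be even better if the output included line numbers and file names.

Print lines that begin with a function definition, with line numbers and file names

grep -EHn "^\s*(function \w+|\w+ = function)" *.js # -E: extended regex, -H: print file names, -n: print line numbers
Depending on how you write your JS, this pattern may skip comments, anonymous functions, and some other statements.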

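Print a sorted list of the "function {name}" matches:

grep -Eho "^\s*function \w+" *.js | sort
-o prints only the matching part of each line, -E enables extended regex, and -h suppresses the file names. I then pipe the result to sort, so the output is a sorted list of "function <name>" entries. If you do not have many functions or files, you can spot the duplicates by scanning this list. Let's go a step further for longer lists.

Print only the duplicates

grep -hEo "^\s*function \w+" *.js | sort | uniq -d
There we go! This command prints only the duplicated entries. Yes, I know we could do this with awk or other tools, but I do not want to get into the details of awk here ;). I had actually written the awk command here and then deleted it; leave a comment if you are interested.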
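
Here is the whole pipeline on two tiny made-up files. It is a sketch with two deliberate deviations from the command in the post: the POSIX classes [[:space:]] and [[:alnum:]_] stand in for \s and \w, and a sed step strips leading indentation, because uniq -d would otherwise treat an indented match and an unindented one as different lines.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical input: init is declared in both files, once with indentation.
printf 'function init() {}\nfunction render() {}\n' > a.js
printf '  function init() {}\nfunction save() {}\n' > b.js

dups=$(grep -Eho '^[[:space:]]*function [[:alnum:]_]+' *.js \
  | sed 's/^[[:space:]]*//' | sort | uniq -d)
echo "$dups"
```

uniq only compares adjacent lines, which is why sort has to come first.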

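More examples

Count the lines containing "function" in each JS file

grep -c "function" *.js

Print the lines that do not contain "function"

grep -v "function" *.js

List all processes matching pidgin (non-Windows platforms)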

ps -ef | grep pidgin

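Conclusion

grep is one of the most frequently used command-line tools. Understanding its basic features will improve your efficiency, and there is much more to grep than this, so you will come to appreciate its elegance with practice.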

Monday, 13 February 2012

Effective Bash Shorthand tr

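Original post by Eric Wendelin: Effective bash shorthand
The Markdown text of the Chinese translation of this post can be downloaded here: https://gist.github.com/1806679
Let me show you how to get the most out of bash. Bash has all kinds of tricks and shortcuts, and the ones I want to share are those that help me every single day.
I will walk through examples of the history features, brace expansion and filename expansion, plus a few other tips.

Master your command history

Those who forget their command history are doomed to retype it. History is probably the biggest productivity feature of any shell.
You can inspect your history with the history command. By default bash keeps the last 500 commands, and you can narrow the output:
  1. Print the last 10 entries
     history 10

  2. Search the history for cmd
     history | grep cmd

Every entry is numbered, so you can run one again with ! followed by its number.
Suppose I need to copy a file into a directory and then change into that directory. A quick way to do it: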
cp myfile.txt my/directory/path
cd !$  # cd my/directory/path
Or when I forget to use sudo:
vi /etc/fstab  # oops!
sudo !!  # sudo vi /etc/fstab
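Or I want to rerun the most recent command that starts with mount without typing it again.
Suppose the earlier command was:
mount 192.168.0.100:/my/path/to/music /media/music
All I need is:
!mount  # repeats the most recent mount command
Note that I often (though not always) use Ctrl-R, which searches your history as you type. Better still, you can preview the command before running it.
Some other examples: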
eric@sawyer:~$ echo foo -a bar baz
foo -a bar baz

eric@sawyer:~$ echo !:3-4 # words 3 to 4 of the previous command
bar baz

eric@sawyer:~$ !-2 # run the second most recent command
foo -a bar baz

eric@sawyer:~$ ^ba^ya^ # replace the first "ba" in the previous command with "ya"
foo -a yar baz

eric@sawyer:~$ !^:p #MUCH cooler than "echo ..." ;)
foo

eric@sawyer:~$ !?bar? # run the most recent command containing "bar"
foo -a bar baz

eric@sawyer:~$ !:gs/ba/ya/ # replace every "ba" with "ya"

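Quick reference

!!  # expands to the last command

!-3 # expands to the third most recent command

!^  # expands to the first argument of the last command

!:2 # expands to the second argument of the last command

!$  # expands to the last argument of the last command

!*  # expands to all arguments of the last command, but not the command itself

!42 # expands to command number 42 in the history

!foo    # expands to the most recent command starting with "foo"

!?baz?   # expands to the most recent command containing "baz"

^foo^bar^ # replaces the first "foo" in the last command with "bar"

!:gs/foo/bar/   # replaces every "foo" in the last command with "bar"

<any_above>:p  # prints the command instead of executing it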

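Useful .bashrc settings for history

Copy and paste these straight into ~/.bashrc
~/.bashrc

# Do not store consecutive duplicate commands in the history
export HISTCONTROL=ignoredups

# Keep more commands in the history file
shopt -s histappend
export HISTFILE=~/long_history
export HISTFILESIZE=50000

# Number of commands kept in the in-session history (no reason to be stingy)
export HISTSIZE=9999

# Ignore repeated commands and other unimportant ones
export HISTIGNORE="&:[ ]*:exit"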

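.inputrc settings

If you are still fond of the up and down arrow keys (the default history navigation keys), you can paste the following into ~/.inputrc. With it, after typing the start of a command you can use the up and down arrows to search the history for commands beginning with what you typed. I tend to use other methods, but this is pretty cool too, isn't it?
~/.inputrc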
"\eOA": history-search-backward
"\e[A": history-search-backward
"\eOB": history-search-forward
"\e[B": history-search-forward
"\eOC": forward-char
"\e[C": forward-char
"\eOD": backward-char
"\e[D": backward-char

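Further reading

Peteris Krumins wrote an excellent article, The Definitive Guide to Bash Command Line History, which covers much of the above and may deepen your understanding of history.

Brace expansion

No list of shorthand would be complete without brace expansion. In essence, braces {} let you repeat part of a command with several different values. Take this example:
# Quickly make a backup
cp file.txt{,.bak}
# Equivalent to 'cp file.txt file.txt.bak'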
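This of course saves you from typing part of the path twice.
Suppose I want to create a directory skeleton. I can do it like this:
mkdir -p {src,test}/com/eriwen/{data,view}
This expands to src/com/eriwen/data, src/com/eriwen/view, test/com/eriwen/data and test/com/eriwen/view and creates all four directories. A single line like this saves me a lot of time.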
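
A safe way to preview what a brace expression turns into is to hand it to echo before running the real command. The sketch below wraps the call in bash -c only so that it behaves the same from any shell, since brace expansion is a bash feature rather than part of POSIX sh:

```shell
# Preview the mkdir arguments without creating anything.
dirs=$(bash -c 'echo {src,test}/com/eriwen/{data,view}')
echo "$dirs"

# An empty alternative expands to the bare name, which is how the backup trick works.
backup=$(bash -c 'echo file.txt{,.bak}')
echo "$backup"
```

Braces expand left to right, so all the src paths come before the test paths.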

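Better filename expansion

You probably use * as a wildcard all the time, but bash offers much more than that. Keep in mind, though, that a find | grep combination is sometimes more practical; see Find is a beautiful tool.
Besides *, you can use ? to match any single character, or [] to match any single character from a set or range. For example: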
ls
# prints "myfile netbeans.conf netbeans-6.5rc2 netbeans-6.5 netbeans-6.7 src"
ls netbeans-6.?
# matches "netbeans-6.5 netbeans-6.7"
ls netbeans-6.[1-5]*
# matches "netbeans-6.5rc2 netbeans-6.5"
.bashrc entries for better filename expansion
# Include dot (.) files in the results of expansion
shopt -s dotglob
# Case-insensitive matching for filename expansion
shopt -s nocaseglob
# Enable extended pattern matching
shopt -s extglob
cd shorthand
There are also a few tricks for jumping to frequently used directories. For example:
# Lame way to go home
cd ~

# The cool way
cd
You can also use cd - to return to the previous directory:
pwd  # prints /home/eriwen/src
cd /my/webserver/directory

# Do something...

cd -
# Now I'm back in /home/eriwen/src
If you want to take this further, try pushd and popd, which keep a stack of directories.
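
As a minimal sketch of the directory stack (the choice of /tmp and / is arbitrary, and bash -c is used only because pushd and popd are bash builtins):

```shell
# pushd DIR: remember the current directory on a stack, then cd to DIR.
# popd:      return to the directory on top of the stack.
back=$(bash -c 'cd /tmp && pushd / >/dev/null && popd >/dev/null && pwd')
echo "$back"
```

Unlike cd -, which only remembers one previous directory, the stack can hold several, so nested detours unwind in order.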

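Conclusion

Making good use of history, brace expansion and the other shortcuts can save you a great deal of time. Still, nothing saves more time than not typing at all. Automating a task is always best; when you cannot, at least do it the smart way.
I have not covered tilde expansion, shell parameter expansion or bash's key bindings. You may want to learn about those, but I find them less useful than what I have covered here.
Do you have any shortcuts of your own? Please share them in the comments!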

Friday, 10 February 2012

A very stupid wget crawler

As I mentioned in the last post, my girlfriend wanted to analyse some online texts and spent ten hours manually extracting 100 samples from the Internet (basically Ctrl+C and Ctrl+V), so I decided to help her.

The shell has two very powerful download tools, wget and curl, and I personally prefer wget. I tried using wget to crawl the website but, unfortunately, retrieved only the site's index file. I modified my wget command and still got nothing. A friend helped me check the site and found that its pages are probably generated dynamically with JavaScript, so a simple wget crawl would struggle to retrieve them.

That left me two options: map the site and then download it, or write an address file for wget. The first option was too challenging for me. Fortunately, the site's addresses are very regular: they can be treated as an arithmetic series, although a large one (about 400,000 terms). Hence I chose the second option and used the shell to create an address list for wget.

The addresses look like this:

http://blabla.com/1 
http://blabla.com/2
...

So I first create two separate lists: one containing "http://blabla.com/" repeated on every line, and another containing a very long arithmetic series. Then I combine them line by line. It sounds simple, but it turned out to be a bit tough.

In the shell, jot can repeat a string as many times as you like, so this part is simple:

$ jot -b http://blabla.com/ 400000 # -b prints the given word repeatedly

This creates a list with 400,000 copies of the prefix.

Likewise, seq generates a sequence of numbers of any length:

$ seq 1 400000 # the default increment is 1

This creates a list of the numbers from 1 to 400000.

Fairly simple, right?

The third step is to combine the two big lists.

The shell has several commands that can combine files, such as join and paste, but I could not get them to work (maybe I am too stupid; please let me know if these two commands can combine two lists from two separate files).

So I turned to the Internet and found a very helpful post: Bash: Redirecting Input from Multiple Files. It provides three scripts, and I tried the second one:

#!/bin/bash

i=0
while read line
do
    f1[$i]="$line"
    let i++
done <$1

i=0
while read line
do
    f2[$i]="$line"
    let i++
done <$2

i=0
while [[ "${f1[$i]}" ]]
do
    echo ${f1[$i]} ${f2[$i]}
    let i++
done

(You can save it as a .sh file and run it with bash in Terminal; it uses arrays and [[ ]], which plain sh may not support.)

It worked perfectly, except for a space between the two parts, which was very annoying:

http://blabla.com/ 2 

The output is treated as two columns by default, so I needed to remove the space between them. For this I used another shell tool: sed.

sed is really powerful, but I find it not very easy to handle. To delete the space, I wrote:

$ sed -e 's/ //' # I did not use g at the end, but it worked. Can anyone explain why?
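
A likely explanation, offered as a sketch rather than a definitive answer: sed applies the s command to every line in turn, and without g it replaces the first match on each line. Every line of the address file contains exactly one space, so g was not needed. The sample strings below are made up to show the difference:

```shell
# One space per line: the first match is the only match, so g changes nothing.
one=$(printf 'http://blabla.com/ 2\n' | sed -e 's/ //')

# Several spaces on one line: without g only the first is removed...
first=$(printf 'a b c\n' | sed -e 's/ //')

# ...and with g all of them are.
all=$(printf 'a b c\n' | sed -e 's/ //g')

printf '%s\n' "$one" "$first" "$all"
```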

However, I had been careless: when I created the prefix list, I wrote http:blabla.com. I had forgotten the two slashes! So I used sed again:

$ sed 's_:_://_g'

OK, the 12 MB address file is done.

Then, using wget's -i option, I could easily crawl the site. To avoid being blocked, I also added the --random-wait option, although this made the crawl very slow.

This stupid wget crawler is still running as I write, hmm.


Please let me know if there is an easier way to do any of the above. Thank you!

Note: the script from Linux Journal is fairly slow: it took about 30 minutes to process the two lists. I hope I can figure out how to use join or paste soon.
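
For what it is worth, paste can do the line-by-line merge directly once it is told to use no delimiter. The following is a small sketch under stated assumptions: the file names are invented, n is 5 instead of 400000 to keep the demonstration short, and yes piped into head stands in for jot, which is not installed everywhere.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1
n=5   # the post uses 400000

# Build the two lists (prefixes.txt and numbers.txt are hypothetical names).
yes 'http://blabla.com/' | head -n "$n" > prefixes.txt
seq 1 "$n" > numbers.txt

# paste joins corresponding lines; -d '\0' means "no delimiter",
# so no space ends up between the prefix and the number.
paste -d '\0' prefixes.txt numbers.txt > urls.txt

# Alternative with no intermediate files: prepend the prefix to each number.
seq 1 "$n" | sed 's|^|http://blabla.com/|' > urls2.txt

cat urls.txt
```

Both variants stream their input, so they should stay fast on lists of this size.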









Tuesday, 7 February 2012

Shall we group small samples first?

When we have big data, we tend to treat it very carefully; with small data, however, we may forget that careful handling matters just as much.


A friend asked me to help with some textual data. She has a number of plain text files, ranging from 100 to 300 words each, and wants to study the word choice in them. Specifically, she is interested in the frequency of each word type within an individual file (which sounds a bit unusual), not merely the overall word frequency.

To my knowledge, this case is very similar to my tweet corpus. It contains about one million tweets and about fourteen million words in total, which means the average tweet is about fourteen words long. In my case, I roughly grouped the tweets into two categories: general tweets and conversational tweets (this criterion is only rough guidance; what I really care about is looking at the data in a reasonable way. In addition, it follows Sinclair's external criteria).

I ran all the analyses on my desktop. Although the Mac is very powerful, when I used AntConc to read the 20k+ files (the original data were stored in more than 20,000 individual txt files), it took about 10 minutes to generate a keyword list (much like a unigram list). I also tried to look at concordances in detail, but each one took 3-4 minutes to generate! I then switched to the shell and wrote some very simple commands to look at the data. That was more efficient, but still a bit slow. Later I divided the data into the two groups described above: the general subcorpus has about 550k tweets and the dialogue subcorpus about 450k. This categorisation not only speeds up the analysis but also gives me new ideas. For example, I can compare the two groups.

Now let me explain the most important reason for categorising the data. Basically, I regard it as a grouping method. As you can see, an individual tweet is extremely small, which makes comparisons between individual tweets meaningless. Why? The small size brings a new problem: the data are very sparse unless you look at them as a whole. Any comparison is therefore either meaningless or impossible to make.

Also, in linguistics we often talk about Zipf's law, which indicates that "about half of them occur once only, a quarter twice only, and so on" (Sinclair, 2004). For small data, however, this pattern may not hold. (Consider a similar case: why do we need the t-score for small samples? If sample size did not matter, we could use the z-score for data of any size.)

Back to our case: it is an extreme one, but it is convincing. If we want to apply Zipf's law to each individual tweet, is that possible or acceptable? No, definitely not. Or, if we compare individual tweets with one another, is that possible or acceptable? No, absolutely not. Thus we need to look at the data in a different way: grouping them according to some external criteria. Only then can we look at the data in a reasonable way.

To answer my friend's question, I would suggest a similar approach: group the data based on the metadata of the original texts. Though each of her texts is much longer than one of my tweets, it is still not a good idea to look at them individually. Then we can normalise the different groups and look for similarities or differences.
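
One way to make "group, then normalise" concrete is to compute a word's relative frequency per group instead of per file. The sketch below is only an illustration: the directory names, the sample sentences and the per-1,000-tokens scale are all my assumptions, not something taken from my friend's data.

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical groups, each holding a few tiny texts.
mkdir general dialogue
printf 'the cat sat on the mat\n' > general/1.txt
printf 'the dog\n'               > general/2.txt
printf 'you said the thing\n'    > dialogue/1.txt

# Relative frequency of WORD in GROUP, as occurrences per 1,000 tokens.
rel_freq() {  # usage: rel_freq GROUP WORD
  cat "$1"/*.txt | tr -s '[:space:]' '\n' |
    awk -v w="$2" '$0 == w { n++ } NF { total++ } END { printf "%.1f\n", 1000 * n / total }'
}

rel_freq general the
rel_freq dialogue the
```

Because both figures are on the same per-1,000 scale, groups of different sizes can be compared directly, which raw counts from individual files would not allow.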



Sinclair, J. (2004). Corpus and Text — Basic Principles. in Developing Linguistic Corpora: a Guide to Good Practice. http://www.ahds.ac.uk/guides/linguistic-corpora/chapter1.htm

Thursday, 2 February 2012

Rethinking of Sentiment Analysis



    Given the boldness of their claims, I believe they ought to either publish their methods and their code, or withdraw these claims. (Kuleshov, 2010)