When we have big data, we often treat it very carefully; with small data, however, we may forget how much careful handling matters.
A friend asked me to help out with some textual data. She has a number of plain text files, ranging from 100 to 300 words each, and wants to know about the word choice in these files. Specifically, she is interested in the frequency of word types within each individual file (which sounds a bit unusual), not merely the overall word frequency.
To my knowledge, this case is very similar to my tweet corpus. It has about one million tweets and about fourteen million words in total, which is to say the average tweet is about fourteen words long. I roughly grouped the tweets into two categories: general tweets and conversational tweets. (This criterion is only a rough guide; what I really care about is looking at the data in a reasonable way. It also follows Sinclair's external criteria.)
I ran all the analyses on my desktop. Although the Mac is very powerful, when I used AntConc to read the 20k+ files (the original data were stored in 20k+ individual txt files), it took about 10 minutes to generate a keyword list (something like a unigram list). I also tried to look at concordances in detail, but each one took 3-4 minutes to generate! Then I switched to the shell and wrote some very simple commands to look at the data. That was more efficient, but still a bit slow. Later on, I divided the data into two groups as described above: the general subcorpus has about 550k tweets, and the dialogue subcorpus about 450k. This categorisation not only speeds up the analysis but also gives me some new ideas. For example, I can compare the two groups of data.
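As a rough illustration of the unigram-list step, here is a minimal Python sketch. It is not the shell commands I actually used; the regex tokeniser, the function names and the `tweets` folder in the usage note are all assumptions made for the example.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed, very crude tokeniser: lowercase runs of letters, digits and apostrophes.
TOKEN = re.compile(r"[a-z0-9']+")

def tokenize(text):
    """Split a text into lowercase word tokens."""
    return TOKEN.findall(text.lower())

def unigram_counts(texts):
    """Count word frequencies over an iterable of texts, pooled together."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

def read_folder(folder):
    """Yield the contents of every .txt file in a folder (hypothetical layout)."""
    for path in sorted(Path(folder).glob("*.txt")):
        yield path.read_text(encoding="utf-8", errors="ignore")

# Small in-memory demo; the two texts are made up.
demo = unigram_counts(["the cat sat", "The cat ran"])
for word, freq in demo.most_common(3):
    print(freq, word)
```

For a real folder, something like `unigram_counts(read_folder("tweets"))` would pool all files into one list, reading each file only once.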
OK, let me explain the most important reason for categorising the data. Basically, I regard it as a grouping method. As you can see, an individual tweet is extremely small, which means that comparing individual tweets with each other is meaningless. Why? The small size brings a new problem: the data are very sparse unless you look at them as a whole. This means that the comparison is either meaningless or impossible to make.
Also, in linguistics, we often talk about Zipf's law, which indicates that, of the word types in a corpus, "about half of them occur once only, a quarter twice only, and so on" (Sinclair, 2004). For small data, however, this pattern may not hold. (Consider a similar case: why do we need the t-score for small samples? If the data size did not matter, we could use the z-score for data of any size.)
Back to the case: although it is an extreme one, it is convincing. If we want to apply Zipf's law to each individual tweet, is that possible or acceptable? No, definitely not. Or, if we compare individual tweets with each other, is that possible or acceptable? No, absolutely not. Thus, we need to look at the data in a different way: grouping them according to some external rules. Only in this way can we look at the data reasonably.
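To make the sparsity point concrete, one can compare the frequency spectrum (how many word types occur once, twice, and so on) of a single short text with that of the pooled group. This is only a minimal Python sketch: the three toy texts are made up, and the tokeniser and function names are assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Assumed crude tokeniser: lowercase runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def frequency_spectrum(tokens):
    """Map each frequency f to the number of word types occurring exactly f times."""
    type_counts = Counter(tokens)
    return Counter(type_counts.values())

def hapax_share(tokens):
    """Proportion of word types that occur exactly once."""
    type_counts = Counter(tokens)
    if not type_counts:
        return 0.0
    hapaxes = sum(1 for freq in type_counts.values() if freq == 1)
    return hapaxes / len(type_counts)

# Toy data, invented for illustration only.
texts = ["good morning everyone", "good luck with the exam", "the exam was good"]
single = tokenize(texts[0])
pooled = [tok for text in texts for tok in tokenize(text)]

print(hapax_share(single))   # every type in the single text occurs once
print(hapax_share(pooled))   # pooling lets some types repeat
print(dict(frequency_spectrum(pooled)))
```

In a single very short text almost every type is a hapax, so there is no distribution to speak of; a spectrum only starts to take shape once texts are pooled.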
To answer my friend's question, I would suggest a similar approach: grouping the data based on the metadata of the original files. Though each of her files is much longer than one of my tweets, it is still not a very good idea to look at them individually. Once grouped, we can normalise the frequencies in the different groups and look at the similarities or differences.
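The grouping and normalising steps could be sketched in Python roughly as follows. The `region` metadata field, the sample records and the per-1,000-words scale are assumptions for illustration, not her actual data.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Assumed crude tokeniser: lowercase runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def group_counts(records, key):
    """Pool word counts by one metadata field.

    records: iterable of (metadata dict, text) pairs.
    """
    grouped = defaultdict(Counter)
    for meta, text in records:
        grouped[meta[key]].update(tokenize(text))
    return grouped

def per_thousand(counts):
    """Normalise raw counts to frequency per 1,000 tokens of the group."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: 1000 * freq / total for word, freq in counts.items()}

# Hypothetical records: (metadata, text).
records = [
    ({"region": "A"}, "loan repaid loan"),
    ({"region": "B"}, "loan helped my shop"),
]
groups = group_counts(records, "region")
for name, counts in groups.items():
    print(name, per_thousand(counts))
```

Normalising matters because the groups will rarely be the same size, so raw counts cannot be compared directly.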
Sinclair, J. (2004). Corpus and Text — Basic Principles. in Developing Linguistic Corpora: a Guide to Good Practice. http://www.ahds.ac.uk/guides/linguistic-corpora/chapter1.htm
That's a very interesting case. However, I am a bit confused about whether you are talking about a top-down approach (compiling a big corpus and then dividing it into sub-corpora) or a bottom-up one (having many small texts and, instead of looking at them individually, combining them into a big corpus). I think the latter is what corpus compilation is about, while the former is a way of investigating variation within a corpus. In addition, I am not sure what your friend wants to do with her corpus. Can you explain her research topic?
Hi Bas, thanks.
As for her research question, as I mentioned, they would like to investigate online reviews or comments on micro-financing projects. They actually have thousands of projects, but each project got only one review from the donor and one from the receiver, and these reviews are very concise, ranging from one hundred to three hundred words. What they want to know is whether there are any specific keywords or patterns relevant to micro-financing, so that they can use the result to improve the evaluation of these reviews.
Considering the data size, you can see that each individual file is very small, so it is not very practical to compare them individually. Thus I suggest grouping them according to some external criteria, such as location or age.
I also asked my boss, and he agrees with me. However, I find it really frustrating trying to convince my friend, as I mentioned on the PGtip page. Sigh!