
Collocations in Corpora and Statistical Methods
Explore the concept of collocations in corpora and statistical methods, considering distributions of words, lexical variation, and measures of collocational strength. Learn about the empiricist view of meaning and characteristics of collocations according to Firth. Discover the importance of frequency and regularity in language development through examples and analysis.
Corpora and Statistical Methods Albert Gatt
In this lecture: We have considered distributions of words and lexical variation in corpora. Today we consider collocations: definition and characteristics; measures of collocational strength; experiments on corpora; hypothesis testing.
Part 1 Collocations: Definition and characteristics
A motivating example: Consider phrases such as strong tea, strong support, powerful drug, but ?powerful tea, ?powerful support, ?strong drug. Traditional semantic theories have difficulty accounting for these patterns. strong and powerful seem to be near-synonyms: do we claim they have different senses? What is the crucial difference?
The empiricist view of meaning: Firth's view (1957): "You shall know a word by the company it keeps." This is a contextual view of meaning, akin to that espoused by Wittgenstein (1953). In the Firthian tradition, attention is paid to patterns that crop up with regularity in language. Contrast symbolic/rationalist approaches, emphasising polysemy, componential analysis, etc. Statistical work on collocations tends to follow the Firthian tradition.
Defining collocations: Collocations are "statements of the habitual or customary places of [a] word" (Firth 1957). Characteristics/expectations: regular/frequently attested; occur within a narrow window (a span of a few words); not fully compositional; non-substitutable; non-modifiable; display category restrictions.
Frequency and regularity: We know that language is regular (non-random) and rule-based; this aspect is emphasised by rationalist approaches to grammar. We also need to acknowledge that frequency of usage is an important factor in language development. Why do big and large collocate differently with different nouns?
Regularity/frequency: f(strong tea) > f(powerful tea); f(credit card) > f(credit bankruptcy); f(white wine) > f(yellow wine) (even though white wine is actually yellowish).
Narrow window (textual proximity): Usually, we specify an n-gram window within which to analyse collocations: bigram: credit card, credit crunch; trigram: credit card fraud, credit card expiry. The idea is to look at co-occurrence of words within a specific n-gram window. We can also count n-grams with intervening words: federal (.*) subsidy matches federal subsidy, federal farm subsidy, federal manufacturing subsidy.
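Below is a minimal Python sketch (not part of the original slides) of both counting strategies; the toy sentence is invented for illustration, and the regular expression allows one optional intervening word in the spirit of federal (.*) subsidy.

```python
import re
from collections import Counter

# Toy corpus; in practice this would be a large tokenised corpus.
text = ("the federal subsidy was cut while the federal farm subsidy "
        "and the federal manufacturing subsidy were kept")
tokens = text.split()

# Plain bigrams: adjacent word pairs.
bigrams = Counter(zip(tokens, tokens[1:]))
print(bigrams[("federal", "subsidy")])    # 1

# Allow up to one intervening word between the two items.
pattern = re.compile(r"\bfederal(?: \w+)? subsidy\b")
print(pattern.findall(text))
# ['federal subsidy', 'federal farm subsidy', 'federal manufacturing subsidy']
```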
Textual proximity (continued): Usually collocates of a word occur close to that word, but they may still occur across a longer span. Examples: within a bigram: white wine, powerful tea; beyond a bigram: knock on the door; knock on X's door.
Non-compositionality: white wine is not really "white"; the meaning is not fully predictable from the component words + syntax. signal interpretation, a term used in Intelligent Signal Processing, has connotations that go beyond the compositional meaning. Similarly: regression coefficient, good practice guidelines. Extreme cases are idioms such as kick the bucket, whose meaning is completely frozen.
Non-substitutability: If a phrase is a collocation, we can't substitute a word in the phrase for a near-synonym and still have the same overall meaning. E.g.: white wine vs. yellow wine; powerful tea vs. strong tea.
Non-modifiability: Often, there are restrictions on inserting additional lexical items into the collocation, especially in the case of idioms. Example: kick the bucket vs. ?kick the large bucket. NB: this is a matter of degree! Non-idiomatic collocations are more flexible.
Category restrictions: Frequency alone doesn't indicate collocational strength: by the is a very frequent phrase in English but not a collocation. Collocations tend to be formed from content words: A+N: powerful tea; N+N: regression coefficient, mass demonstration; N+PREP+N: degrees of freedom.
Collocations in a broad sense: In many statistical NLP applications, the term collocation is understood quite broadly: any phrase which is frequent/regular enough, including proper names (New York), compound nouns (elevator operator), set phrases (part of speech), and idioms (kick the bucket).
Why are collocations interesting? Several applications need to know about collocations: terminology extraction: technical or domain-specific phrases crop up frequently in text (oil prices); document classification: specialist phrases are good indicators of the topic of a text; named entity recognition: names such as New York tend to occur together frequently; phrases like new toy don't.
Example application: Parsing. She spotted the man with a pair of binoculars. 1. [VP spotted [NP the man [PP with a pair of binoculars]]] 2. [VP spotted [NP the man] [PP with a pair of binoculars]] A parser might prefer (2) if spot/binoculars are frequent co-occurrences in a window of a certain width.
Example application: Generation. NLG systems often need to map a semantic representation to a lexical/syntactic one, and they shouldn't use the wrong adjective-noun combinations: clean face vs. ?immaculate face. Lapata et al. (1999) ran an experiment asking people to rate different adjective-noun combinations; the frequency of a combination was a strong predictor of people's preferences. They argue that NLG systems need to be able to make contextually informed decisions in lexical choice.
Frequency-based approach. Motivation: if two (or three, or more) words occur together a lot within some window, they're a collocation. Problems: frequent collocations under this definition include with the, onto a, etc., which are not very interesting.
Improving the frequency-based approach. Justeson & Katz (1995) propose a part-of-speech filter: only look at word combinations of the right category, e.g. N + N: regression coefficient; N + PREP + N: jack in (the) box. This dramatically improves the results, since content-word combinations are more likely to be phrases.
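A rough sketch of such a filter, assuming the corpus has already been POS-tagged; the Penn Treebank tag names and the two allowed patterns below are illustrative only, not the full Justeson & Katz pattern set.

```python
from collections import Counter

# Assume a POS-tagged corpus: a list of (word, tag) pairs.
tagged = [("the", "DT"), ("strong", "JJ"), ("tea", "NN"), ("was", "VBD"),
          ("on", "IN"), ("the", "DT"), ("table", "NN"),
          ("near", "IN"), ("the", "DT"), ("regression", "NN"),
          ("coefficient", "NN"), ("printout", "NN")]

# Keep only bigrams whose tag sequence looks like a phrase (A+N or N+N here).
ALLOWED = {("JJ", "NN"), ("NN", "NN")}

candidates = Counter(
    (w1, w2)
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
    if (t1, t2) in ALLOWED
)
print(candidates.most_common())
# e.g. [(('strong', 'tea'), 1), (('regression', 'coefficient'), 1), ...]
```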
Case study: strong vs. powerful. See Manning & Schütze '99, Sec. 5.2. Motivation: try to distinguish the meanings of two quasi-synonyms, using data from the New York Times corpus. Basic strategy: find all bigrams <w1, w2> where w1 = strong or powerful; apply a POS filter to remove strong on [crime], powerful in [industry], etc.
Case study (continued). Sample results from Manning & Schütze '99: f(strong support) = 50; f(strong supporter) = 10; f(powerful force) = 13; f(powerful computers) = 10. Teaser: would you also expect powerful supporter? What's the difference between strong supporter and powerful supporter?
Limitations of frequency-based search. It only works for fixed phrases, but collocations can be "looser", allowing interpolation of other words: knock on [the, X's, a] door; pull [a] punch. Simple frequency won't do for these: different interpolated words dilute the frequency.
Using mean and variance. General idea: include bigrams even at a distance, w1 X w2, as in pull a punch. Strategy: find co-occurrences of the two words in windows of varying length; compute the mean offset between w1 and w2; compute the variance of the offset between w1 and w2. If the offsets are randomly distributed, then we have high variance and conclude that <w1, w2> is not a collocation.
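A small illustrative sketch of the offset computation described above; the helper name, window size, and example sentence are all invented for the purpose of the example.

```python
from statistics import mean, stdev

def offsets(tokens, w1, w2, max_window=4):
    """Collect signed offsets (position of w2 minus position of w1)
    whenever the two words co-occur within max_window tokens."""
    out = []
    for i, tok in enumerate(tokens):
        if tok != w1:
            continue
        lo, hi = max(0, i - max_window), min(len(tokens), i + max_window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] == w2:
                out.append(j - i)
    return out

tokens = "she did not pull a single punch and he did not pull his punch".split()
offs = offsets(tokens, "pull", "punch")
print(offs)                      # [3, 2]
print(mean(offs), stdev(offs))   # low variance suggests a (loose) collocation
```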
Example outcomes (M&S '99). Position of strong with respect to opposition: mean = -1.15, standard deviation = 0.67, i.e. most occurrences are strong [...] opposition. Position of strong with respect to for: mean = -1.12, standard deviation = 2.15, i.e. for occurs anywhere around strong; the SD is higher than the mean, and we can get strong support for, for the strong support, etc.
More limitations of frequency. If we use simple frequency or mean & variance, we have a good way of ranking likely collocations. But how do we know if a frequent pattern is frequent enough? Is it above what would be predicted by chance? We need to think in terms of hypothesis testing. Given <w1, w2>, we want to compare: the hypothesis that they are non-independent; the hypothesis that they are independent.
Preliminaries: Hypothesis testing and the binomial distribution
Permutations. Suppose we have the 5 words {the, dog, ate, a, bone}. How many permutations (possible orderings) are there of these words? the dog ate a bone; dog the ate a bone; etc. In general, n! = n × (n − 1) × ... × 1, so there are 5! = 120 ways of permuting 5 words.
Binomial coefficient. Slight variation: how many different choices of three words are there out of these 5? This is known as an "n choose k" problem, in our case 5 choose 3: C(n, k) = n! / (k!(n − k)!). For our problem, this gives us 10 ways of choosing three items out of 5.
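As a quick sanity check of the arithmetic (assuming Python 3.8+, which provides math.comb):

```python
import math

# 5! = 120 orderings of five distinct words.
print(math.factorial(5))   # 120

# "5 choose 3": number of ways to pick three words out of five.
print(math.comb(5, 3))     # 10
```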
Bernoulli trials A Bernoulli (or binomial) trial is like a coin flip. Features: 1. There are two possible outcomes (not necessarily with the same likelihood), e.g. success/failure or 1/0. 2. If the situation is repeated, then the likelihoods of the two outcomes are stable.
Sampling with/without replacement. Suppose we're interested in the probability of pulling out a function word from a corpus of 100 words, and we pull out words one by one without putting them back. Is this a Bernoulli trial? We have a notion of success/failure: w is either a function word ("success") or not ("failure"); but our chances aren't the same across trials: they change as words are removed, since we sample without replacement.
Cutting corners. If the sample (e.g. the corpus) is large enough, then we can assume a Bernoulli situation even if we sample without replacement. Suppose our corpus has 52 million words, success = pulling out a function word, and there are 13 million function words. First trial: p(success) = .25. Second trial (if the first word drawn was a function word): p(success) = 12,999,999/51,999,999 ≈ .24999999, i.e. still effectively .25. On very large samples, the chances remain relatively stable even without replacement.
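The same arithmetic can be checked directly; the exact fractions show how little the probability moves after a single draw.

```python
from fractions import Fraction

corpus_size = 52_000_000
function_words = 13_000_000

# Probability of a function word on the first draw, and on the second draw
# given that the first draw was itself a function word (no replacement).
p1 = Fraction(function_words, corpus_size)
p2 = Fraction(function_words - 1, corpus_size - 1)
print(float(p1))   # 0.25
print(float(p2))   # 0.2499999855..., i.e. still ~0.25
```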
Binomial probabilities - I. Let π represent the probability of success on a Bernoulli trial (e.g. our simple word game on a large corpus). Then p(failure) = 1 − π. Problem: what are the chances of achieving success 3 times out of 5 trials? Assumption: each trial is independent of every other. (Is this assumption reasonable?)
Binomial probabilities - II. How many ways are there of getting success three times out of 5? Several: SSSFF, SFSFS, SFSSF, ... To estimate the number of possible ways of getting k outcomes from n possibilities, we use the binomial coefficient: C(5, 3) = 5! / (3!(5 − 3)!) = 120 / (6 × 2) = 10.
Binomial probabilities - III. 5 choose 3 gives 10. Given independence, each of these sequences is equally likely. What's the probability of a sequence? It's an AND problem (multiplication rule): P(SSSFF) = π·π·π·(1 − π)·(1 − π) = π^3 (1 − π)^2; P(SFSFS) = π·(1 − π)·π·(1 − π)·π = π^3 (1 − π)^2. (They all come out the same.)
Binomial probabilities - IV. The binomial distribution states that, given n Bernoulli trials with probability π of success on each trial, the probability of getting exactly k successes is: b(k; n, π) = C(n, k) π^k (1 − π)^(n − k), where C(n, k) is the number of different ways of getting k successes out of n, and π^k (1 − π)^(n − k) is the probability of each such sequence of successes and failures.
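A short sketch of this formula as code; binom_pmf is just an illustrative helper, not a library function.

```python
from math import comb

def binom_pmf(k, n, p):
    """b(k; n, p): probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 3 successes in 5 trials with p = 0.25.
print(binom_pmf(3, 5, 0.25))                          # 0.087890625
# Sanity check: the probabilities over all k sum to 1.
print(sum(binom_pmf(k, 5, 0.25) for k in range(6)))   # 1.0 (up to rounding)
```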
Expected value and variance. Expected value of X over n trials: E[X] = nπ. Variance of X over n trials: Var(X) = nπ(1 − π), where π is our probability of success.
The logic of hypothesis testing. The typical scenario in hypothesis testing compares two hypotheses: 1. the research hypothesis; 2. a null hypothesis. The idea is to set up our experiment (study, etc.) in such a way that, if we show the null hypothesis to be false, then we can affirm our research hypothesis with a certain degree of confidence.
H0 for collocation studies: there is no real association between w1 and w2, i.e. the occurrence of <w1, w2> is no more likely than chance. More formally, H0: P(w1 & w2) = P(w1)P(w2), i.e. w1 and w2 are independent.
Some more on hypothesis testing. Our research hypothesis (H1): <w1, w2> are strong collocates, P(w1 & w2) > P(w1)P(w2). The null hypothesis H0: P(w1 & w2) = P(w1)P(w2). How do we know whether our results are sufficient to affirm H1? I.e. how big is our risk of wrongly rejecting H0?
The notion of significance. We generally fix a level of confidence in advance. In many disciplines, we're happy with being 95% confident that the result we obtain is correct, so we accept a 5% chance of error. Therefore, we state our results at p = 0.05: the probability of wrongly rejecting H0 is 5% (0.05).
Tests for significance. Many of the tests we use involve: 1. having a prior notion of what the mean/variance of a population is, according to H0; 2. computing the mean/variance on our sample of the population; 3. checking whether the sample mean/variance is different from the value predicted by H0, at 95% confidence.
The t-test: strategy. Obtain the mean (x̄) and variance (s²) for a sample. H0: the sample is drawn from a population with mean μ and variance σ². Estimate the t value: this compares the sample mean/variance to the expected (population) mean/variance under H0. Check if any difference found is significant enough to reject H0.
Computing t. Calculate the difference between the sample mean and the expected population mean, and scale the difference by the variance: t = (x̄ − μ) / √(s² / N). Assumption: the population is normally distributed. If t is big enough, we reject H0. The magnitude of t, given our sample size N, is simply looked up in a table. Tables tell us what the level of significance is (the p-value, or likelihood of making a Type 1 error, i.e. wrongly rejecting H0).
Example: new companies. We think of our corpus as a series of bigrams, and each sample we take is an indicator variable (Bernoulli trial): value = 1 if a bigram is new companies, value = 0 otherwise. Compute P(new) and P(companies) using standard MLE. H0: P(new companies) = P(new)P(companies).
Example continued. We have computed the likelihood of our bigram of interest under H0. Since this is a Bernoulli trial, this is also our expected mean. We then compute the actual sample probability of <w1, w2> (new companies), compute t, and check significance.
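A hedged sketch of the whole calculation; t_score is a hypothetical helper, the counts passed in are invented for illustration (they are not the New York Times figures), and the variance uses the Bernoulli form s² = x̄(1 − x̄), which for rare bigrams is approximately x̄.

```python
import math

def t_score(count_w1, count_w2, count_bigram, n_bigrams):
    """One-sample t-score for a bigram: compare the observed bigram probability
    with the probability expected under H0 (independence of the two words)."""
    x_bar = count_bigram / n_bigrams                        # observed P(w1 w2)
    mu = (count_w1 / n_bigrams) * (count_w2 / n_bigrams)    # expected mean under H0
    s2 = x_bar * (1 - x_bar)                                # Bernoulli variance
    return (x_bar - mu) / math.sqrt(s2 / n_bigrams)

# Made-up counts for illustration:
print(t_score(count_w1=15_000, count_w2=4_500, count_bigram=8, n_bigrams=14_000_000))
```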
Uses of the t-test. It is often used to rank candidate collocations, rather than to compute significance. Stop word lists must be used, else all bigrams will be significant: e.g. M&S report 824 out of 831 bigrams passing the significance test. Reason: language is just not random; its regularities mean that, if the corpus is large enough, all bigrams will occur together regularly and often enough to be significant. Kilgarriff (2005): any null hypothesis will be rejected on a large enough corpus.
Extending the t-test to compare samples. Variation on the original problem: what co-occurrence relations are best to distinguish between two words w1 and w1' that are near-synonyms, e.g. strong vs. powerful? Strategy: find all bigrams <w1, w2> and <w1', w2> (e.g. strong tea, strong support, and their powerful counterparts); check, for each collocate w2, whether it occurs significantly more often with w1 than with w1'. NB: this is a two-sample t-test.
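A possible sketch of this comparison, treating each occurrence of w1 (or w1') as a Bernoulli trial whose "success" is co-occurrence with the collocate in question; the counts below are invented for illustration.

```python
import math

def two_sample_t(count1, n1, count2, n2):
    """Two-sample t-score comparing how often a collocate occurs with
    word w1 (e.g. 'strong') versus its near-synonym w1' (e.g. 'powerful')."""
    x1, x2 = count1 / n1, count2 / n2        # sample means (co-occurrence rates)
    s1, s2 = x1 * (1 - x1), x2 * (1 - x2)    # Bernoulli variances
    return (x1 - x2) / math.sqrt(s1 / n1 + s2 / n2)

# Invented counts: 'support' after strong vs. after powerful.
print(two_sample_t(count1=50, n1=7_000, count2=2, n2=1_500))
```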