
Clustering & Decomposition in Document Analysis
Explore clustering methods like K-Means algorithm for decomposing documents into authorial components. Understand the importance of choosing the right feature sets and seed selection in achieving accurate clustering results.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Lecture 7: Clustering and Decomposition of a Document to Authorial Components Moshe Koppel and Navot Akiva
Clustering The idea is to define natural clusters Position in space depends on feature set Need to have some notion of what you want in order to choose right feature types e.g., cluster by topic: content words e.g., cluster by author: style features With many features, clusters aren t so elegant
How Many Clusters? Some methods decide in advance Some methods try to optimize k based on data e.g., in previous example, would want 3 clusters Caution: some optimization criteria biased towards one giant cluster or many tiny ones
Flat Clustering: K-Means Algorithm Select K random docs {s1, s2, sK} as seeds. Until clustering converges or other stopping criterion: For each doc di: Assign dito the cluster cjsuch that dist(xi, sj) is minimal. For each cluster cj: sj= (cj) (Update the seeds to the centroid of each cluster)
K-Means Example (K=2) Pick seeds Reassign clusters Compute centroids Reassign clusters x x Compute centroids x x x x Reassign clusters Converged!
Termination conditions Several possibilities, e.g., A fixed number of iterations. Doc partition unchanged. Centroid positions don t change.
Convergence Why should the K-means algorithm ever reach a fixed point? A state in which clusters don t change. K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm. EM is known to converge. Number of iterations could be large.
Convergence of K-Means Define goodness measure of cluster k as sum of squared distances from cluster centroid: Gk= i(di ck)2 G = kGk (sum over all diin cluster k) Reassignment monotonically decreases G since each vector is assigned to the closest centroid.
Dependence on Seed Choice Results depend on random seed selection. Some tricks for avoiding bad seeds: Choose seed far from any already chosen Try out multiple seed sets If you start with B and E as centroids you converge to {A,B,C} and {D,E,F} If you start with D and F you converge to {A,B,D,E} and {C,F}
A Problem The methods we ve discussed so far don t always give very satisfying results. Have a look at this example .
Spectral Methods Because of examples like that, spectral methods have been developed. Here s the idea: Imagine weighted edges between points, where weight reflects similarity. Zero out all edges except k nearest neighbors with high weights. Now make the cheapest cuts that result in separate components.
Spectral Clustering Properly formulated, the min Ncut problem is NP-hard. But there are some nice linear algebra tricks we can use to get reasonable approximate solutions: 1. Pre-processing Construct a matrix representation of the dataset. 2. Decomposition Compute eigenvalues and eigenvectors of the matrix. Map each point to a lower-dimensional representation based on one or more eigenvectors. 3. Grouping Assign points to two or more clusters, based on the new representation.
New Problem: Separating Document Components Often documents consist of multiple authorial components. Our object is to tease apart the components of a composite document.
Basic Idea Divide the document into natural chunks (e.g., chapters, paragraphs) Vectorize chunks using some feature set Cluster the vectors
Classic Example: The Bible Great historic and cultural interest Much prior research on components Has been manually tagged in every conceivable way
Test Case Let s throw the chapters of Ezekiel and Jeremiah into a hat and see if we can separate them out. Each is presumably the work of a single (primary) author. There s no reason to think it s the same author. They are thematically quite similar, though far from identical.
Clustering Jeremiah+Ezekiel Chunks = chapters (no sequence info) Features = bag of words Cluster method = Ncut # clusters = 2
Results using all words Jeremiah Ezekiel Cluster 1 29 28 Cluster 2 23 20 Not too exciting. We must be picking up thematic or genre-related differences that cross books. Let s try using only function words.
Results using function words Jeremiah Ezekiel Cluster 1 34 28 Cluster 2 18 20 Not any better. Let s try a new approach.
A Better Idea Exploit the fact that different authors use different synonyms for the same idea (e.g., makel/mateh). It would be really convenient if it turned out that Jeremiah and Ezekiel made consistently different choices for various synsets. Note that words aren t synonyms, rather word senses are synonyms. (For example mateh=staff is a synonym of makel, but mateh=tribe is not.)
Automatically Finding Synonyms There are various clever methods for identifying synsets, but most are not exact enough for our purpose. Conveniently, for the Bible, we have many useful tools, including careful translations and manual sense tagging. We identify as synonyms word senses that are translated into the same English word (e.g., makel=staff and mateh=staff). (We even get sense disambiguation for free.) Due to polysemy (in English), this method overshoots. We manually delete mistakes. (This is the only manual intervention we will ever do.)
Synonym Method The usual similarity measures (e.g., cosine, inverse Euclidean distance) aren t quite right. If two docs use the same synonym, they are similar. If two docs use opposite synonyms, they are different. If one of the docs uses one of the synonyms, but the other doesn t, cosine would regard them as different. But are they? For measuring similarity, we only consider synsets represented in both docs.
Results using synonyms Jeremiah Ezekiel Cluster 1 46 6 Cluster 2 7 41 Now we re getting somewhere. But we re not done yet
Core of a Cluster Some chapters are in a cluster because they really belong there; some just have to be somewhere. Let s consider only chapters near one centroid and far from the other. These are the cores of the respective clusters. 36 2 12 3 36 7 4
Cluster cores Jeremiah Ezekiel Cluster 1 36 2 Cluster 2 0 36
Cluster cores Jeremiah Ezekiel Cluster 1 36 2 Cluster 2 0 36 Ezekiel 1, 10
Expanding the Core Now that we have a core, we can use supervised methods (e.g., SVM) to learn a boundary. In fact, we can use function words as our features. Using synonyms will just get us back where we started. And besides, FW are generally very reliable for supervised authorship attribution.
SVM expansion of core Jeremiah Ezekiel Cluster 1 52 1 0 47 Cluster 2 Two training examples are misclassed by SVM. Incredibly, these are Ezekiel 1 and 10, which were part of Jeremiah core, but are on Ezekiel side of optimal SVM boundary. The only exception is Ezekiel 42, a non-core chapter which lies in the SVM margin.
Decomposing Unsegmented Text Until now, we ve assumed that our units (chapters) were all pure Ezekiel or pure Jeremiah. That s a major unrealistic assumption. Let s create a munged Jer-iel book by alternately taking a random number of verses from each book until we run out of verses. Can we tease that apart?
Chunks Might Be Mixed xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxx
Chunks Might Be Mixed xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxx
Decomposing Unsegmented Text Start with some arbitrary chunking of the munged text. (Some of these chunks will be mixed.) Now proceed exactly as above, until you have cluster cores. With a bit of luck, the chunks in the core will be the pure chunks.
A Munged Document xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Decomposing Unsegmented Text As above, use SVM to learn a classifier, using the core chunks as training examples. But at the last step, classify individual verses instead of chapters. Some smoothing techniques are necessary to: Deal with the sparseness of features in individual verses Take advantage of serial correlation
A Munged Document xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
A Munged Document xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
A Munged Document xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Multiple Books The same method works if we create a munged book out of k>2 constituent books. BUT, we need to be given the k. So we don t yet know how many components there are, only how to best split into a given number of components.
Open Question Can we extend this method so it can be used also for texts that don t have canonical translations like the Bible? Can we use this method to identify plagiarized sections of a text?