There are 3 questions to answer below:
We learned about text clustering methods for documents by representing each document as a vector of non-stop-words and comparing the similarity of documents using the Tanimoto Cosine Distance metric.
1. Write pseudocode that takes as input a corpus (set) of documents and creates vectors for each document, where the vectors do not contain stop-words and are weighted by the term frequency multiplied by the log of the inverse document frequency, as described in the course module.
DocumentVectorSet documentVectorSet =
CreateDocumentVectors(documentSet);
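For reference, a minimal Java sketch of this weighting step (the tokenized-document representation, the map-based sparse vector, and the small stop-word list are assumptions for illustration, not part of the assignment):

import java.util.*;

public class CreateDocumentVectorsSketch {

    // Illustrative stop-word list; assumes documents are lists of lowercase tokens.
    static final Set<String> STOP_WORDS =
        Set.of("the", "a", "an", "and", "of", "to", "is", "in", "on");

    // Returns one TF-IDF weighted vector (term -> weight) per document.
    static List<Map<String, Double>> createDocumentVectors(List<List<String>> documents) {
        int n = documents.size();

        // Document frequency: number of documents containing each non-stop-word.
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : documents) {
            Set<String> seen = new HashSet<>();
            for (String term : doc) {
                if (!STOP_WORDS.contains(term) && seen.add(term)) {
                    docFreq.merge(term, 1, Integer::sum);
                }
            }
        }

        // Weight = term frequency * log(N / document frequency).
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : documents) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                if (!STOP_WORDS.contains(term)) {
                    tf.merge(term, 1, Integer::sum);
                }
            }
            Map<String, Double> vector = new HashMap<>();
            tf.forEach((term, count) ->
                vector.put(term, count * Math.log((double) n / docFreq.get(term))));
            vectors.add(vector);
        }
        return vectors;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("the", "duck", "swims", "in", "the", "water"),
            List.of("the", "duck", "skates", "on", "the", "ice"));
        createDocumentVectors(docs).forEach(System.out::println);
    }
}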
2. Write pseudocode that takes two document vectors and measures their similarity.
Similarity similarity =
DocumentSimilarity(documentVectorA, documentVectorB);
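For reference, a minimal sketch of a plain cosine similarity over the sparse vectors produced above (the course's Tanimoto/cosine variant may use a different denominator; the Map-based vector representation is an assumption carried over from the previous sketch):

import java.util.Map;

public class DocumentSimilaritySketch {

    static double documentSimilarity(Map<String, Double> a, Map<String, Double> b) {
        // Dot product over the terms the two vectors share.
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }

        // Euclidean norms of each vector.
        double normA = Math.sqrt(a.values().stream().mapToDouble(w -> w * w).sum());
        double normB = Math.sqrt(b.values().stream().mapToDouble(w -> w * w).sum());

        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Double> docA = Map.of("duck", 1.2, "water", 0.7);
        Map<String, Double> docB = Map.of("duck", 1.2, "ice", 0.7);
        // Without any context handling, these two documents score as fairly similar.
        System.out.println(documentSimilarity(docA, docB));
    }
}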
After performing K-means clustering, let us suppose that we examine the clusters by sight and assign names to them. For example, one cluster may represent documents about sports, another may represent documents about politics, and yet another may represent documents about animals. Let us assume that we assign each cluster a name such as sports, politics, and animals.
Sometimes, words are used in multiple contexts. For example, the word duck is ambiguous. Sometimes it means a waterfowl and would fall into the animal category. Sometimes it is used in politics, such as a lame duck congress, and would fall into the politics category. Sometimes it is used in sports, such as in the name of the National Hockey League team the Anaheim Ducks, and would fall into the sports category. Knowing in which context the word is used makes the clustering much better. To understand why, suppose that we had two documents, one with the words duck and water, and the other with the words duck and ice. Without understanding the context of the word duck, our similarity metric may actually find that these documents are similar. However, when duck appears with water, the word duck probably refers to an animal, whereas when duck appears with ice, the word duck probably refers to sports. With this knowledge, our similarity metric would find these documents not very similar at all.
Suppose we had a library of words that are used in multiple contexts such as:
String[] multiContextWords = {"duck", "crane", "book", …};
Suppose also that we have a multi-dimensional array that shows the multi-context words and common words that are used with them:
String[][] wordContext = {
{"duck (animal)", "zoo", "feathers", "water", …},
{"duck (sports)", "hockey", "Anaheim", "ice", …},
{"duck (politics)", "congress", "lame", …},
{"crane (animal)", "bird", "water", …},
{"crane (construction)", "building", "equipment", …},
…};
3. Modify the CreateDocumentVectors() pseudocode from above to take advantage of the multiContextWords[] and wordContext[][] arrays to create better document vectors so that the subsequent call to DocumentSimilarity() will better distinguish contexts.
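For reference, a minimal sketch of one possible context-resolution step that could run before the TF-IDF weighting inside CreateDocumentVectors(): each ambiguous token is rewritten into its sense-tagged form (for example, duck (animal)) by picking the context row whose cue words overlap the document the most. The helper names, the lowercased cue words, and the "word (sense)" first-column convention are assumptions for illustration:

import java.util.*;

public class ContextAwareVectorsSketch {

    static final Set<String> MULTI_CONTEXT_WORDS = Set.of("duck", "crane", "book");

    // Cue words are lowercased here to match the assumed lowercase tokens.
    static final String[][] WORD_CONTEXT = {
        {"duck (animal)", "zoo", "feathers", "water"},
        {"duck (sports)", "hockey", "anaheim", "ice"},
        {"duck (politics)", "congress", "lame"},
        {"crane (animal)", "bird", "water"},
        {"crane (construction)", "building", "equipment"},
    };

    // Replaces an ambiguous term with its sense-tagged form, chosen by counting
    // how many cue words from each context row also appear in the document.
    static String resolveContext(String term, Set<String> docTerms) {
        String best = term;
        int bestOverlap = 0;
        for (String[] row : WORD_CONTEXT) {
            if (!row[0].startsWith(term + " (")) {
                continue;                       // row describes a different word
            }
            int overlap = 0;
            for (int i = 1; i < row.length; i++) {
                if (docTerms.contains(row[i])) {
                    overlap++;
                }
            }
            if (overlap > bestOverlap) {
                bestOverlap = overlap;
                best = row[0];                  // e.g. "duck (sports)"
            }
        }
        return best;                            // unchanged if no cue words match
    }

    // Rewrite the document's tokens before the TF-IDF weighting step, so that
    // "duck" next to "ice" becomes "duck (sports)" and no longer matches
    // "duck (animal)" in another document.
    static List<String> disambiguate(List<String> doc) {
        Set<String> docTerms = new HashSet<>(doc);
        List<String> rewritten = new ArrayList<>();
        for (String term : doc) {
            rewritten.add(MULTI_CONTEXT_WORDS.contains(term)
                    ? resolveContext(term, docTerms) : term);
        }
        return rewritten;
    }

    public static void main(String[] args) {
        System.out.println(disambiguate(List.of("duck", "water")));  // [duck (animal), water]
        System.out.println(disambiguate(List.of("duck", "ice")));    // [duck (sports), ice]
    }
}

With the tokens rewritten this way, the duck-and-water document and the duck-and-ice document no longer share the term duck, so the subsequent call to DocumentSimilarity() scores them as not very similar, as intended.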