Probabilistic CKY Practice Test Feedback
Discussion and feedback on practice test questions related to probabilistic CKY algorithm. The session includes analysis of sample answers and solutions for questions on LM tuning, hyperparameter settings, and improving language models for medical dialogue systems.
Presentation Transcript
Welcome! The Week 7 live session will begin shortly. Plan for today: discussion/feedback on practice test questions; probabilistic CKY.
Practice test feedback/reflection
This test was a real midterm in 2020. Sample answers and discussion are based on real responses. The final exam can have similar questions, but is also likely to have more questions that ask you to synthesize across the whole course.
- Qs 1-2, 3(a), 4: see the solutions/marking scheme provided.
- Q5(b): a stretch question, so we will discuss it only very briefly.
- Main discussion today: Q3(b), Q5(a), maybe Q6.
Q3(b): LM tuning ... Developing an LM to use in a dialogue system for the medical domain. ... Train the LM on Wikipedia; tune hyperparameters on a dev set. Perplexities are given for two different hyperparameter settings (A and B) on two different development sets (Wikipedia and medical).
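For reference, the standard definition of perplexity (not restated on the slide): for a test set of N tokens,

  PP(w_1 \dots w_N) = P(w_1 \dots w_N)^{-1/N} = \exp\!\Big(-\tfrac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1 \dots w_{i-1})\Big)

Lower perplexity means the model assigns higher probability to that data, so a comparison between settings is only meaningful on a dev set that resembles the deployment domain.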
Sample answers: no credit? partial credit? full credit? Which hyperparameter setting should they use for the LM in the dialogue system, and why?
A. "Setting B is a better choice for the language model. In medical domain, setting B has a better performance. In addition, an obvious overfitting occurs while using setting A."
B. "For perplexity, lower perplexity indicates that the language model is more likely to generated desirable outputs. The system is specialized for medical domain. Therefore, setting B is better."
What about these? Which hyperparameter setting should they use for the LM in the dialogue system, and why?
C. "Setting B is the better choice for the language model because it has a lower perplexity on the medical dev set."
D. "Setting B is better because it has smaller perplexities in the medical set. Although Setting A has smaller perplexities on Wikipedia, it is unuseful because the final target is a dialogue system for a specialized medical domain. So we need to choose the model based on the best performance on the medical set."
And finally, this one?
"Setting B might be a better choice for their language model. The reasons are:
1. This model will be used in the medical domain, so the perplexity computed with the medical domain is relatively more important than the Wikipedia domain. When we compare these two settings, we can see Setting B has a smaller perplexity in the medical domain.
2. The model is built based on the Wikipedia training data, and the hyperparameters are set based on a subset of Wikipedia data. And from the result, we can see the model with Setting A is very likely more overfitting to the Wikipedia data but not the medical data, while the goal of the model is to be used in the medical domain. Thus, Setting B might lead to a better result for the medical domain.
3. The hyperparameters are tuned on the Dev set, while the tested results are also based on the Dev set. And from the result, we can see Setting A has exactly much lower perplexity in the same Dev set but higher in medical domain data, which means the model with Setting A is probably overfitting to the Dev set than the model with Setting B."
Some incorrect answers (Q5(a): if word w7 in a sequence is changed, can the tag assigned to w5 change?)
- "After the modification in w7, tag assigned to w5 will not change since w7 occurred after w6, which is one word after w5."
- "In HMM, there are 2 assumptions: (1) the probability of a particular tag only depends on the previous tag and (2) the probability of an output observation "word-i" only depends on the tag that produced the observation "tag-i" and not on any other tag or any other observations."
- "No, it's impossible. With a trained HMM, we could use the Viterbi algorithm, so the probability of w in a sequence only depends on the former and current words, which is, P(t5|w5) depends on P(t4|w4) as well as transition and emission probabilities on w5,t5, and further related to the former ones, so the change of w7 won't impact."
The key point
Changing a single word can change any or all tags in the sequence, because the Viterbi algorithm finds the best sequence of tags for the whole sentence. When w7 changes, its tag can change. This in turn affects the probabilities of the surrounding tags (through the transition probabilities), which in turn affect other tags in the sequence.
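In standard bigram-HMM notation (not spelled out on the slide), Viterbi solves a single global optimization over the whole sequence:

  \hat{t}_1^{n} = \arg\max_{t_1^{n}} \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)

Changing one emission factor P(w_7 \mid t_7) can therefore change the argmax at every position, even though each individual factor only looks at adjacent tags.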
Common misconception
Many answers mentioned the generative model, where we move left-to-right through the sequence, and w7 is independent of w5 given the tags. But consider this example:

  pack the dogs   (w6 w7 w8)
  pack of dogs    (w6 w7 w8)

Suppose "the" can only be DT, and "of" can only be P. Clearly, the best path might now use a different tag for t6 (before t7), and in turn for t5 and earlier.
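A minimal sketch of this effect (the tag set, vocabulary, and all probabilities below are invented for illustration; they are not from the exam):

```python
# Toy bigram HMM: all probabilities are made up for this illustration.
tags = ["VB", "NN", "DT", "P", "NNS"]

start = {"VB": 0.5, "NN": 0.5, "DT": 0.0, "P": 0.0, "NNS": 0.0}

trans = {  # trans[t1][t2] = P(t2 | t1)
    "VB":  {"VB": 0.05, "NN": 0.05, "DT": 0.60, "P": 0.10, "NNS": 0.20},
    "NN":  {"VB": 0.05, "NN": 0.05, "DT": 0.10, "P": 0.60, "NNS": 0.20},
    "DT":  {"VB": 0.05, "NN": 0.25, "DT": 0.00, "P": 0.10, "NNS": 0.60},
    "P":   {"VB": 0.05, "NN": 0.25, "DT": 0.10, "P": 0.00, "NNS": 0.60},
    "NNS": {"VB": 0.20, "NN": 0.20, "DT": 0.20, "P": 0.20, "NNS": 0.20},
}

emit = {  # emit[t][w] = P(w | t); unseen words get probability 0
    "VB":  {"pack": 0.5},
    "NN":  {"pack": 0.5},
    "DT":  {"the": 1.0},
    "P":   {"of": 1.0},
    "NNS": {"dogs": 1.0},
}

def viterbi(words):
    """Most probable tag sequence for words under the toy HMM."""
    V = [{t: start[t] * emit[t].get(words[0], 0.0) for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: V[-1][p] * trans[p][t])
            col[t] = V[-1][prev] * trans[prev][t] * emit[t].get(w, 0.0)
            ptr[t] = prev
        V.append(col)
        back.append(ptr)
    best = max(tags, key=lambda t: V[-1][t])  # best final tag
    path = [best]
    for ptr in reversed(back):                # follow backpointers
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["pack", "the", "dogs"]))  # -> ['VB', 'DT', 'NNS']
print(viterbi(["pack", "of", "dogs"]))   # -> ['NN', 'P', 'NNS']
```

Changing only the middle word flips the tag of the word before it: the best path to "dogs" now runs through P rather than DT, which makes NN a better tag than VB for "pack".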
Q6(b): an even harder HMM question
Example of an uninformative answer: "This quantity represents the time step t that maximize the product of forward and backward probabilities of tag PRO."
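For contrast, an informative answer would say what that quantity tells us. Assuming the standard forward/backward definitions, \alpha_t(q)\,\beta_t(q) = P(w_1 \dots w_N, T_t = q), so

  \arg\max_t \alpha_t(\mathrm{PRO})\,\beta_t(\mathrm{PRO})

is the position in the sentence where the tag PRO is most probable given the entire observed word sequence (the constant factor P(w_1 \dots w_N) does not affect the argmax).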
On uninformative answers
Answers like this don't demonstrate any understanding; they just repeat parts of the question. It is unclear whether students thought their answers were sufficient, or were just hoping. You will not lose marks for guessing, but it may take time away from answering other questions.
Q6: how to assess your answer
Look over the marking rubric and the examples of better and worse answers in the feedback document. Try to identify how each answer matches the rubric description. Common distinguishing features are: clarity; good use of specific examples; tying the discussion to this specific situation. In this case, the question explicitly asked for examples; but even when it doesn't, providing specific examples can often improve an answer. Be realistic when assessing your own answer. Nearly everyone should get at least 1 mark, and it's not hard to get 2. Achieving 4+ is very difficult: only ~10% of students in 2020 did this well (though another 15% got 3.5).