A decision tree fails when it can't use the input-feature values to predict the target values, which occurs when the input features don't distinguish among examples with different target values; for example, two messages with identical input-feature values, one spam and one not, can never be separated by any split.
One possibility is to define a set of n spam words, and then set up an n+1 dimensional space. The first n dimensions, one per spam word, are binary valued and indicate whether or not the associated spam word appears in the message. The last dimension is also binary valued and indicates whether or not the message is spam.
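As a concrete illustration, here is a minimal Python sketch of this first representation; the spam-word list and example messages are invented for the example.

SPAM_WORDS = ["free", "winner", "viagra"]          # n = 3 spam words

def encode(message: str, is_spam: bool) -> list[int]:
    """Return an (n+1)-dimensional binary vector: one indicator per
    spam word, followed by the spam/not-spam label."""
    words = set(message.lower().split())
    features = [1 if w in words else 0 for w in SPAM_WORDS]
    return features + [1 if is_spam else 0]

print(encode("You are a WINNER of a free prize", True))   # [1, 1, 0, 1]
print(encode("Meeting moved to Tuesday", False))          # [0, 0, 0, 0]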
Alternatively, and even simpler, set up a 2-dimensional space. The x axis indicates the total number of spam words appearing in a message, and the y axis is binary valued and indicates whether or not the message is spam. There would be no difference.

The problem with direct utility estimation is that it assumes the actions are deterministic and fails to account for any bleed-over into a state resulting from actions other than the intended action.
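For context, direct utility estimation treats the observed reward-to-go from each visit to a state as a sample of that state's utility and averages those samples. A minimal sketch, with invented trial data and no discounting:

from collections import defaultdict

# Direct utility estimation: average the observed rewards-to-go per state.
# Each trial is a list of (state, reward) pairs; the data below is invented.
def direct_utility_estimates(trials):
    totals = defaultdict(float)
    counts = defaultdict(int)
    for trial in trials:
        reward_to_go = 0.0
        for state, reward in reversed(trial):   # accumulate reward-to-go backwards
            reward_to_go += reward
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

trials = [[("s", 0.0), ("a", 0.0), ("goal", 1.0)],
          [("s", 0.0), ("b", 0.0), ("trap", -1.0)]]
print(direct_utility_estimates(trials))   # "s" averages to 0.0 over the two trials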
It is not possible. Suppose the sum-of-squares error dropped significantly for k, but remained more or less the same for k - 1 and k - 2.
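Assuming the sum-of-squares error here is the within-cluster error reported by k-means, a sketch of how it might be inspected over a range of k (synthetic data, scikit-learn assumed):

import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs, so the sum-of-squares error
# should drop sharply up to k = 3 and level off afterwards.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 2)) for c in (0, 3, 6)])

for k in range(1, 6):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k = {k}: sum-of-squares error = {sse:.2f}")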
Define the point estimator to be 42. The maximum error is
Pr(C) represents the prior probability of the classification (the prediction before any evidence is observed); it is also needed to apply Bayes' rule.
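For reference, with evidence e, Bayes' rule here takes the form Pr(C | e) = Pr(e | C) Pr(C) / Pr(e), so the prior Pr(C) appears directly in the numerator.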
A Q-learning agent operates in the world shown to the right. The start state is marked “s”; the remaining states are terminal states. The agent can move up, down, left, or right with a 50% probability of success; that is, half the time the agent moves as directed, and half the time the agent doesn't move. The reward for reaching a terminal state is given in the state; the reward for staying at the start state is 0.
Describe the entries in the Q array discovered by the agent.
There are five states and four actions, so the Q array is a 5×4 table. Four of the states are terminal, so any action taken in those states is ignored and no further policy moves are made from them. Assuming there is no reward for staying in a terminal state, the expected utility for those 4×4 = 16 entries is 0.
The fifth state is the start state. There is a 0.5 probability that a down action will succeed; if it does, the utility is 1, and if it doesn't, the utility is 0. The expected utility of the attempt is then 0.5(1) + 0.5(0) = 0.5. If the action fails, there is again a 0.5 probability the next attempt will succeed. Assuming independence between moves (a dodgy assumption, but a simplifying one), the expected utility is the sum of the expected utilities of the individual moves, leading to an overall expected utility of 0.5. A similar argument holds for the other moves from the start state, except the reward in those cases is -1, so the expected utility is -0.5.

If alpha is greater than 1, the update overcompensates for each error, which makes it difficult to converge to the mean and can drive the estimate into instability when reacting to changes.
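Under the same simplification used above (each attempted move treated as an independent trial that either reaches the intended terminal state or fails with reward 0), here is a minimal sketch of how the start-state entries of the Q array could be estimated; the trial count and alpha are illustrative choices, not part of the exercise.

import random

ACTIONS = ["up", "down", "left", "right"]
REWARD = {"up": -1.0, "down": 1.0, "left": -1.0, "right": -1.0}   # terminal rewards

def sample_return(action):
    """One attempted move from the start state: succeed with probability 0.5."""
    return REWARD[action] if random.random() < 0.5 else 0.0

def learn_start_state(trials=20000, alpha=0.05):
    # Terminal-state rows of the Q array are never updated and stay at 0,
    # so only the four start-state entries are tracked here.
    Q = {a: 0.0 for a in ACTIONS}
    for _ in range(trials):
        a = random.choice(ACTIONS)            # explore uniformly
        target = sample_return(a)
        # With alpha in (0, 1] this is a stable running average of the
        # sampled returns; alpha > 1 would overshoot each target and the
        # estimates would fail to settle, as noted above.
        Q[a] += alpha * (target - Q[a])
    return Q

print(learn_start_state())   # roughly {'up': -0.5, 'down': 0.5, 'left': -0.5, 'right': -0.5}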