
Final Project: Reinforcement Learning
CS XXXXXXXXXX, Fall 2022, Introduction to Data Science
Due Date: Dec. 14, 11:59 PM (EST)
WARNING: This project might be hard for some of you: please start as soon as possible!
Remarks. You are expected to write a short essay that covers in detail your approaches and answers to
the questions below. It is highly recommended that you first state your approaches and ideas at a high level
and then show how they apply to the two concrete examples given here. Your score on this project
will be based on both your answers to the specific questions and your overall writing.
Consider the following game. There is a special die with N sides, where the ith side shows
the number i for each 1 ≤ i ≤ N. Let [N] := {1, 2, 3, . . . , N}, the set of integers ranging from 1 to N. Let
p ∈ [0, 1]^N be a vector of length N such that the ith entry of p, denoted by p_i, represents the probability
that the die lands on the ith side (so that we see the number i) when rolled once. For example, N = 4
and p = (0, 1/2, 1/4, 1/4) means that if we roll the die once, we will see the numbers 1, 2, 3, and 4
with probability 0, 1/2, 1/4, and 1/4, respectively. There is another binary vector q ∈ {0, 1}^N, where the ith
entry of q, denoted by q_i, indicates whether the ith side is BAD (q_i = 1) or not (q_i = 0).
Game Rules. At the beginning, you have $0 at hand. Suppose that at some time you have x < K dollars at
hand, where K is a parameter known in advance. You have two choices: either "accept" the challenge
or "quit". (Case 1) If your choice is "quit", then the game is over and you walk away with x dollars. (Case 2) If
your choice is "accept", then you roll the die once and see a random number X ∈ [N] with the probability
specified by p. There are two subcases. (1) If q_X = 1, i.e., the Xth side is BAD, then you lose all current
money at hand; (2) if q_X = 0, i.e., the Xth side is not BAD, then you get a reward of f(X), where f is
a function of X. In this case, you will have x + f(X) dollars. Here is a tricky part: if x + f(X) ≥ K (bear
in mind that K is a parameter known in advance), then the game is over, and you take x + f(X) dollars and
walk away; otherwise, you continue the game with x + f(X) dollars at hand. Attention: if you accept the
challenge, roll the die, and get X such that q_X = 1, you lose all the money at hand but the game is NOT over:
you can still continue to play the game with $0 at hand. The game is over only when either you choose to quit
or you have at least K dollars at hand. Note that the following key components uniquely define the game:
(N, p, q, f, K).
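To make the game dynamics concrete before formulating anything, here is a minimal simulator sketch of a game defined by (N, p, q, f, K). The names used below (play_episode, policy, rng) are illustrative choices of this sketch and are not part of the assignment statement.

import numpy as np

def play_episode(N, p, q, f, K, policy, rng=None):
    """Play one episode; policy(x) returns "accept" or "quit" given x dollars at hand."""
    rng = rng or np.random.default_rng()
    x = 0                                    # the game starts with $0 at hand
    while x < K:                             # the game ends once x >= K
        if policy(x) == "quit":
            return x                         # walk away with x dollars
        side = rng.choice(np.arange(1, N + 1), p=p)   # roll the die once
        if q[side - 1] == 1:                 # BAD side: lose everything, game continues
            x = 0
        else:                                # good side: collect the reward f(side)
            x += f(side)
    return x                                 # reached at least K dollars, game over

# Example: the Question 1 instance under a naive "always accept" policy.
p1 = np.full(6, 1 / 6)
q1 = np.array([1, 0, 1, 0, 1, 0])
f1 = lambda X: max(X ** 2, 23)
print(play_episode(6, p1, q1, f1, 150, policy=lambda x: "accept"))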
(Question 1) Consider a simple case where N = 6, p = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6). In other words, we
have a “normal” die with six sides, and each side will appear with the same chance if we roll once. Let
q = (1, 0, 1, 0, 1, 0), f(X) = max(X², 23), and K = 150. You are asked to do the following.
(a) Formulate the above game as a reinforcement learning system. Please specify the key components of the
game (S, A, P, R), where S is the state space, A is the action space, P is the transition probability matrix,
and R is the reward function. For simplicity, you can assume the discount factor γ = 1. Please specify clearly
the terminal state space (S_T) and the non-terminal state space (S_N).
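As a sketch of one possible encoding (not the only valid one): take the non-terminal states to be the dollar amounts 0, 1, ..., K-1, add a single absorbing terminal state, and let the reward of a transition be the change in money (f(X) on a good roll, -x on a BAD roll, 0 on quitting), so that with γ = 1 the value of a state is the expected additional money collected from it onward. The helper name build_mdp and the array layout below are assumptions of this sketch.

import numpy as np

def build_mdp(N, p, q, f, K):
    """Return P[a, s, s'] and expected-reward R[a, s] for the encoding described above.
    Assumes f takes positive integer values, so money amounts stay on the integer grid."""
    n_states = K + 1                         # states 0..K-1 are non-terminal, state K is "done"
    TERM = K
    P = np.zeros((2, n_states, n_states))    # action 0 = "quit", action 1 = "accept"
    R = np.zeros((2, n_states))
    P[:, TERM, TERM] = 1.0                   # the terminal state is absorbing

    for x in range(K):
        P[0, x, TERM] = 1.0                  # "quit": episode ends, no further reward
        for i in range(1, N + 1):            # "accept": roll the die once
            if q[i - 1] == 1:                # BAD side: back to $0, losing x dollars
                P[1, x, 0] += p[i - 1]
                R[1, x] += p[i - 1] * (-x)
            else:                            # good side: gain f(i) dollars
                nxt = x + f(i)
                R[1, x] += p[i - 1] * f(i)
                P[1, x, TERM if nxt >= K else nxt] += p[i - 1]
    return P, R, TERM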
(b) Compute the optimal value function V* and the optimal policy π*. You can try either the value iteration
method or the dynamic programming method. Please make sure to state explicitly the values of V*(s) and
π*(s) for all s ∈ S_N, where S_N refers to the non-terminal state space. Based on your results, state explicitly
the maximum expected total reward you will get in this game when starting with $0. (If you use the value
iteration method, please try different tolerance parameters ε to make sure your algorithm converges properly.)
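A value-iteration sketch over the arrays produced by the build_mdp helper above; the stopping tolerance and the array-based sweep are choices of this sketch, not a prescribed procedure.

import numpy as np

def value_iteration(P, R, gamma=1.0, tol=1e-10, max_iter=100_000):
    """Iterate V <- max_a [ R(a, .) + gamma * P(a) V ] until the sup-norm change is below tol."""
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        Q = R + gamma * (P @ V)      # Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * V[s']
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=0)       # greedy policy: 0 = "quit", 1 = "accept"

# Example usage on the Question 1 instance:
# P, R, TERM = build_mdp(6, np.full(6, 1/6), [1, 0, 1, 0, 1, 0], lambda X: max(X**2, 23), 150)
# V, pi = value_iteration(P, R)
# print(V[0])   # maximum expected total reward when starting with $0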
(c) Please try the approach of linear programming (LP) to compute the optimal value function V* and
the optimal policy π*. You should explicitly specify the following elements in the LP: variables, objective
function, and constraints. Again, please state explicitly the values of V*(s) and π*(s) for all s ∈ S_N. Based
on your results, state explicitly the maximum expected total reward you will get in this game when starting
with $0.
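A sketch of the standard LP for optimal values, solved here with scipy.optimize.linprog: minimize Σ_s V(s) subject to V(s) ≥ R(s, a) + γ Σ_{s'} P(s'|s, a) V(s') for every state s and action a. The non-negativity bound below is harmless here because quitting always guarantees at least 0, and it also pins the absorbing terminal state's value at 0; the helper name and solver choice are assumptions of this sketch.

import numpy as np
from scipy.optimize import linprog

def lp_values(P, R, gamma=1.0):
    """Solve the LP described above and recover a greedy policy from its solution."""
    n_actions, n_states, _ = P.shape
    c = np.ones(n_states)                    # objective: minimize sum_s V(s)
    A_ub, b_ub = [], []
    for a in range(n_actions):
        for s in range(n_states):
            # V(s) >= R[a, s] + gamma * P[a, s] . V   <=>   (gamma * P[a, s] - e_s) . V <= -R[a, s]
            row = gamma * P[a, s].copy()
            row[s] -= 1.0
            A_ub.append(row)
            b_ub.append(-R[a, s])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0, None)] * n_states, method="highs")
    V = res.x
    Q = R + gamma * (P @ V)
    return V, Q.argmax(axis=0)               # 0 = "quit", 1 = "accept"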
(Question 2) Consider a special case where N = 5, p = (1/2, 1/4, 1/8, 1/16, 1/16), q = (0, 1, 0, 1, 0),
f(X) = min(5, 2X), and K = 150. Answer the same questions (a), (b), and (c), as shown in Question 1.
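Assuming the helper sketches above (build_mdp, value_iteration, lp_values), only the game parameters change for Question 2, and the two solution methods should agree on the optimal value at $0:

import numpy as np

p2 = np.array([1/2, 1/4, 1/8, 1/16, 1/16])
q2 = np.array([0, 1, 0, 1, 0])
f2 = lambda X: min(5, 2 * X)
P, R, TERM = build_mdp(5, p2, q2, f2, 150)
V_vi, pi_vi = value_iteration(P, R)
V_lp, pi_lp = lp_values(P, R)
print(V_vi[0], V_lp[0])    # maximum expected total reward starting from $0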

Solution

1.a)
S = state space
A = action space
P = transition probability matrix
R = reward function
b = behavior policy
γ = discount factor
With respect to the given policy:
Terminal state space (S_T) =
Loop for each episode:
    Initialize and store S_0 ≠ terminal
    Select and store A_0 ~ b(·|S_0)
    T ← ∞
    Loop for t = 0, 1, ...:
        If t < T, then: Take action A_t
            Observe and store the next reward as R_{t+1}, and the...