
Final Project: Reinforcement Learning
CS XXXXXXXXXX, Fall 2022, Introduction to Data Science
Due Date: Dec. 14, 11:59 PM (EST)
WARNING: This project might be hard for some of you: please start as soon as possible!
Remarks. You are expected to write a short essay that covers in detail your approaches and answers to
the questions below. It is highly recommended that you first state your approaches and ideas at a high level
and then show how they apply to the two concrete examples given here. Your score on this project
will be based on both your answers to the specific questions and your overall writing.
Consider the following game. There is a special die with N sides, where the ith side shows
the number i for each 1 ≤ i ≤ N. Let [N] := {1, 2, 3, . . . , N}, the set of integers ranging from 1 to N. Let
p ∈ [0, 1]^N be a vector of length N such that the ith entry of p, denoted by p_i, represents the probability
that the die lands on the ith side (so that we see the number i) when rolled once. For example, N = 4
and p = (0, 1/2, 1/4, 1/4) means that if we roll the die once, we will see the numbers 1, 2, 3, and 4
with probability 0, 1/2, 1/4, and 1/4, respectively. There is another binary vector q ∈ {0, 1}^N, where the ith
entry of q, denoted by q_i, indicates whether the ith side is BAD (q_i = 1) or not (q_i = 0).
Game Rules. At the beginning, you have $0 at hand. Suppose that at some time you have x < K dollars at
hand, where K is a parameter known in advance. You have two choices: either "accept" the challenge
or "quit". (Case 1) If your choice is "quit", then the game is over and you walk away with x dollars. (Case 2) If
your choice is "accept", then you roll the die once and see a random number X ∈ [N] with the probability
specified by p. There are two subcases. (1) If q_X = 1, i.e., the Xth side is BAD, then you lose all current
money at hand; (2) if q_X = 0, i.e., the Xth side is not BAD, then you get a reward of f(X), where f is
a function of X. In this case, you will have x + f(X) dollars. Here is a tricky part: if x + f(X) ≥ K (bear
in mind that K is a parameter known in advance), then the game is over, and you take x + f(X) dollars and
walk away; otherwise, you continue the game with x + f(X) dollars at hand. Attention: if you accept the
challenge, roll the die, and get X such that q_X = 1, you lose all the money at hand but the game is NOT over:
you can still continue to play the game with $0 at hand. The game is over only when either you choose to quit
or you have at least K dollars at hand. Note that the following key components uniquely define the game:
(N, p, q, f, K).
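To make the game dynamics concrete before formulating anything, here is a minimal simulator sketch of a game defined by (N, p, q, f, K). The names used below (play_episode, policy, rng) are illustrative choices of this sketch and are not part of the assignment statement.

import numpy as np

def play_episode(N, p, q, f, K, policy, rng=None):
    """Play one episode; policy(x) returns "accept" or "quit" given x dollars at hand."""
    rng = rng or np.random.default_rng()
    x = 0                                    # the game starts with $0 at hand
    while x < K:                             # the game ends once x >= K
        if policy(x) == "quit":
            return x                         # walk away with x dollars
        side = rng.choice(np.arange(1, N + 1), p=p)   # roll the die once
        if q[side - 1] == 1:                 # BAD side: lose everything, game continues
            x = 0
        else:                                # good side: collect the reward f(side)
            x += f(side)
    return x                                 # reached at least K dollars, game over

# Example: the Question 1 instance under a naive "always accept" policy.
p1 = np.full(6, 1 / 6)
q1 = np.array([1, 0, 1, 0, 1, 0])
f1 = lambda X: max(X ** 2, 23)
print(play_episode(6, p1, q1, f1, 150, policy=lambda x: "accept"))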
(Question 1) Consider a simple case where N = 6, p = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6). In other words, we
have a “normal” die with six sides, and each side will appear with the same chance if we roll once. Let
q = (1, 0, 1, 0, 1, 0), f(X) = max(X², 23), and K = 150. You are asked to do the following.
(a) Formulate the above game as a reinforcement learning system. Please specify the key components of the
game (S, A, P, R), where S is the state space, A is the action space, P is the transition probability matrix,
and R is the reward function. For simplicity, you can assume the discount factor γ = 1. Please specify clearly
the terminal state space (S_T) and the non-terminal state space (S_N).
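As a sketch of one possible encoding (not the only valid one): take the non-terminal states to be the dollar amounts 0, 1, ..., K-1, add a single absorbing terminal state, and let the reward of a transition be the change in money (f(X) on a good roll, -x on a BAD roll, 0 on quitting), so that with γ = 1 the value of a state is the expected additional money collected from it onward. The helper name build_mdp and the array layout below are assumptions of this sketch.

import numpy as np

def build_mdp(N, p, q, f, K):
    """Return P[a, s, s'] and expected-reward R[a, s] for the encoding described above.
    Assumes f takes positive integer values, so money amounts stay on the integer grid."""
    n_states = K + 1                         # states 0..K-1 are non-terminal, state K is "done"
    TERM = K
    P = np.zeros((2, n_states, n_states))    # action 0 = "quit", action 1 = "accept"
    R = np.zeros((2, n_states))
    P[:, TERM, TERM] = 1.0                   # the terminal state is absorbing

    for x in range(K):
        P[0, x, TERM] = 1.0                  # "quit": episode ends, no further reward
        for i in range(1, N + 1):            # "accept": roll the die once
            if q[i - 1] == 1:                # BAD side: back to $0, losing x dollars
                P[1, x, 0] += p[i - 1]
                R[1, x] += p[i - 1] * (-x)
            else:                            # good side: gain f(i) dollars
                nxt = x + f(i)
                R[1, x] += p[i - 1] * f(i)
                P[1, x, TERM if nxt >= K else nxt] += p[i - 1]
    return P, R, TERM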
(b) Compute the optimal value function V* and the optimal policy π*. You can try either the value iteration
method or the dynamic programming method. Please make sure to state explicitly the values of V*(s) and
π*(s) for all s ∈ S_N, where S_N refers to the non-terminal state space. Based on your results, state explicitly
the maximum expected total reward you will get in this game when starting with $0. (If you use the value
iteration method, please try different tolerance parameters ε to make sure your algorithm converges properly.)
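A value-iteration sketch over the arrays produced by the build_mdp helper above; the stopping tolerance and the array-based sweep are choices of this sketch, not a prescribed procedure.

import numpy as np

def value_iteration(P, R, gamma=1.0, tol=1e-10, max_iter=100_000):
    """Iterate V <- max_a [ R(a, .) + gamma * P(a) V ] until the sup-norm change is below tol."""
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        Q = R + gamma * (P @ V)      # Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] * V[s']
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=0)       # greedy policy: 0 = "quit", 1 = "accept"

# Example usage on the Question 1 instance:
# P, R, TERM = build_mdp(6, np.full(6, 1/6), [1, 0, 1, 0, 1, 0], lambda X: max(X**2, 23), 150)
# V, pi = value_iteration(P, R)
# print(V[0])   # maximum expected total reward when starting with $0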
(c) Please try the approach of linear programming (LP) to compute the optimal value function V* and
the optimal policy π*. You should explicitly specify the following elements in the LP: variables, objective
function, and constraints. Again, please state explicitly the values of V*(s) and π*(s) for all s ∈ S_N. Based
on your results, state explicitly the maximum expected total reward you will get in this game when starting
with $0.
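A sketch of the standard LP for optimal values, solved here with scipy.optimize.linprog: minimize Σ_s V(s) subject to V(s) ≥ R(s, a) + γ Σ_{s'} P(s'|s, a) V(s') for every state s and action a. The non-negativity bound below is harmless here because quitting always guarantees at least 0, and it also pins the absorbing terminal state's value at 0; the helper name and solver choice are assumptions of this sketch.

import numpy as np
from scipy.optimize import linprog

def lp_values(P, R, gamma=1.0):
    """Solve the LP described above and recover a greedy policy from its solution."""
    n_actions, n_states, _ = P.shape
    c = np.ones(n_states)                    # objective: minimize sum_s V(s)
    A_ub, b_ub = [], []
    for a in range(n_actions):
        for s in range(n_states):
            # V(s) >= R[a, s] + gamma * P[a, s] . V   <=>   (gamma * P[a, s] - e_s) . V <= -R[a, s]
            row = gamma * P[a, s].copy()
            row[s] -= 1.0
            A_ub.append(row)
            b_ub.append(-R[a, s])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0, None)] * n_states, method="highs")
    V = res.x
    Q = R + gamma * (P @ V)
    return V, Q.argmax(axis=0)               # 0 = "quit", 1 = "accept"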
(Question 2) Consider a special case where N = 5, p = (1/2, 1/4, 1/8, 1/16, 1/16), q = (0, 1, 0, 1, 0),
f(X) = min(5, 2X), and K = 150. Answer the same questions (a), (b), and (c), as shown in Question 1.
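Assuming the helper sketches above (build_mdp, value_iteration, lp_values), only the game parameters change for Question 2, and the two solution methods should agree on the optimal value at $0:

import numpy as np

p2 = np.array([1/2, 1/4, 1/8, 1/16, 1/16])
q2 = np.array([0, 1, 0, 1, 0])
f2 = lambda X: min(5, 2 * X)
P, R, TERM = build_mdp(5, p2, q2, f2, 150)
V_vi, pi_vi = value_iteration(P, R)
V_lp, pi_lp = lp_values(P, R)
print(V_vi[0], V_lp[0])    # maximum expected total reward starting from $0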

Solution

1.a)
S = state space
A = action space
P = transition probability matrix
R = reward function
b = behavior policy
γ = discount factor
With respect to the given policy:
Terminal state space (S_T) =
Loop for each episode:
    Initialize and store S_0 ≠ terminal
    Select and store A_0 ~ b(·|S_0)
    T ← ∞
    Loop for t = 0, 1, ...:
        If t < T, then: Take action A_t
            Observe and store the next reward as R_{t+1}, and the...