📄 policy_iteration.m
字号:
function [p, V, Q, iter] = policy_iteration(T, R, discount_factor, use_val_iter, oldp)% POLICY_ITERATION% [new_policy, V, Q, niters] = policy_iteration(T, R, discount_factor, use_val_iter, old_policy)%% If use_val_iter is not specified, we use value determination instead of value iteration.% If the old_policy is not specified, we use an arbitrary initial policy.%% T(s,a,s') = prob(s' | s, a)% R(s,a)S = size(T,1);A = size(T,2);p = zeros(S,1);Q = zeros(S, A);oldQ = Q;if nargin < 4 use_val_iter = 0;endif nargin < 5 oldp = ones(S,1); % arbitrary initial policyend V = max(R, [], 2); % initial value fniter = 1;done = 0;while ~done iter = iter + 1; if use_val_iter V = value_iteration(T, R, discount_factor, V); else V = value_determination(oldp, T, R, discount_factor); end Q = Q_from_V(V, T, R, discount_factor); [V, p] = max(Q, [], 2); if isequal(p, oldp) | approxeq(Q, oldQ, 1e-3) % if we just compare p and oldp, it might oscillate due to ties % However, it may converge faster than Q done = 1; end oldp = p; oldQ = Q;end
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -