Handling Out-of-distribution Scenarios in Offline Reinforcement Learning

MPhil Thesis Defence


Title: "Handling Out-of-distribution Scenarios in Offline Reinforcement 
Learning"

By

Mr. Hon Hing CHAK


Abstract

In standard reinforcement learning (RL), environment interactions are usually available during training, enabling continuous exploration and performance improvement. However, in many applications, models must be trained on pre-existing datasets without any online interaction with the environment. This setting is known as offline reinforcement learning (offline RL). Recent studies in offline RL combine traditional RL techniques with some form of regularization, which typically aims to keep the learned policy close to the dataset-generating (behavior) policy. This addresses the extrapolation errors that arise when evaluating the value of out-of-distribution state-action pairs, a problem known as distributional shift. However, most regularization techniques assume that the environment states encountered by the agent during deployment stay close to the dataset distribution, so that the agent can identify suitable actions to minimize distributional shift. In many real-world applications with highly stochastic environments, this assumption may not hold. When an unfamiliar, i.e., out-of-distribution (OOD), state is encountered, the agent may select actions that were never regularized during training. These unregularized actions can cause further distributional shift in later interactions, forming a vicious cycle.
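
As a point of reference, many such regularization schemes can be summarized by a behavior-regularized objective of the following general form (a generic sketch for illustration, not the specific formulation studied in this thesis), where \pi_\beta denotes the dataset-generating policy, \mathcal{D} the offline dataset, D a chosen divergence, and \alpha the regularization strength:

\max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D}}\big[ Q(s, \pi(s)) \big] \;-\; \alpha \, \mathbb{E}_{s \sim \mathcal{D}}\big[ D\big(\pi(\cdot \mid s) \,\|\, \pi_\beta(\cdot \mid s)\big) \big]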

In this thesis, we propose an offline RL model that combines the standard actor-critic architecture with a Wasserstein-1 divergence critic, inspired by the Wasserstein Generative Adversarial Network with gradient penalty (WGAN-GP), to address the issue of OOD states. We build a gradient-penalized critic network that captures the divergence from the dataset's state-action distribution under the Wasserstein-1 distance, and we extend this distance to the full state space during training. When the model encounters unfamiliar environment states during deployment, it can still output actions close to the marginal action distribution of the dataset. In our experiments, we evaluated the model in a real-world application: an automatic cremation control system. We show that our model produces actions similar to those of human operators in OOD states, whereas models from previous studies fail to produce meaningful actions in those states. In addition, our model achieves stable performance and outperforms the human baseline.
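
To make the role of the gradient penalty concrete, the sketch below illustrates a WGAN-GP-style Wasserstein-1 critic over state-action pairs. It is a minimal PyTorch illustration under assumed names (WassersteinCritic, critic_loss, lambda_gp) and hyperparameters, not the implementation used in the thesis:

# Minimal sketch (illustrative only) of a gradient-penalized Wasserstein-1
# critic over state-action pairs, in the spirit of WGAN-GP.
import torch
import torch.nn as nn

class WassersteinCritic(nn.Module):
    """Scores state-action pairs; trained to be approximately 1-Lipschitz."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def critic_loss(critic, data_s, data_a, policy_a, lambda_gp=10.0):
    """WGAN-GP-style critic loss: separate dataset actions from policy
    actions (conditioned on dataset states), with a gradient penalty that
    softly enforces the 1-Lipschitz constraint."""
    # Treat policy actions as fixed samples when training the critic.
    policy_a = policy_a.detach()

    # Wasserstein-1 term: push dataset scores up and policy scores down.
    w1_term = critic(data_s, policy_a).mean() - critic(data_s, data_a).mean()

    # Gradient penalty on random interpolations between dataset and policy actions.
    eps = torch.rand(data_a.size(0), 1, device=data_a.device)
    interp_a = (eps * data_a + (1.0 - eps) * policy_a).detach().requires_grad_(True)
    scores = critic(data_s, interp_a)
    grads = torch.autograd.grad(scores.sum(), interp_a, create_graph=True)[0]
    penalty = ((grads.norm(2, dim=-1) - 1.0) ** 2).mean()

    return w1_term + lambda_gp * penalty

In this sketch, minimizing the loss trains the critic to assign higher scores to dataset state-action pairs than to policy-generated ones, while the gradient penalty keeps the critic close to 1-Lipschitz, which underpins the Wasserstein-1 interpretation of its output.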


Date:  			Thursday, 11 August 2022

Time:			2:00pm - 4:00pm

Zoom Meeting:
https://hkust.zoom.us/j/99482645453?pwd=R2VLRVZmemF2L01DRjR2cXQ1MktmQT09

Committee Members:	Prof. Raymond Wong (Supervisor)
 			Prof. Dit-Yan Yeung (Chairperson)
 			Prof. Shing-Chi Cheung


**** ALL are Welcome ****