Online Limited Memory Neural-Linear Bandits with Likelihood Matching

Author: Ofir Nabati

This post summarizes the work Online Limited Memory Neural-Linear Bandits with Likelihood Matching, accepted to ICML 2021. The code is available here.

Joint work with Tom Zahavy and Shie Mannor.

[Figure: Scheme of our proposed method NeuralLinear-LiM2.]

We propose a new neural-linear bandit algorithm: it uses a deep neural network as a function approximator, while exploration is performed by a linear contextual bandit that uses the network's last-layer activations as features. Our main contribution is a mechanism called likelihood matching for dealing with the drift these features undergo during training, under finite memory constraints. The basic idea of likelihood matching is to compute new priors for the reward using the statistics of the old representation whenever the representation changes. We call our algorithm NeuralLinear-LiM2, or LiM2 for short.
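To make the construction concrete, here is a minimal NumPy sketch (ours, not the paper's code) of the linear exploration layer that sits on top of the network: per-arm Bayesian linear regression over the last-layer activations, with Thompson sampling for action selection. The prior scale, noise variance, and function names are assumptions of the sketch; the likelihood-matching prior update that LiM2 adds on top is not shown.

```python
import numpy as np

def init_arm(d, prior_precision=1.0):
    """Sufficient statistics of one arm's Bayesian linear model over d features."""
    return prior_precision * np.eye(d), np.zeros(d)

def thompson_select(phi_x, arms, rng):
    """Pick an arm by sampling a weight vector from each arm's posterior and
    scoring the current last-layer features phi_x."""
    scores = []
    for precision, b in arms:
        cov = np.linalg.inv(precision)           # posterior covariance of the weights
        mean = cov @ b                           # posterior mean of the weights
        w = rng.multivariate_normal(mean, cov)   # Thompson sample
        scores.append(w @ phi_x)                 # sampled estimate of this arm's reward
    return int(np.argmax(scores))

def update_arm(arm, phi_x, reward, noise_var=0.25):
    """Rank-one posterior update for the arm that was played."""
    precision, b = arm
    return (precision + np.outer(phi_x, phi_x) / noise_var,
            b + phi_x * reward / noise_var)
```

In the full algorithm, `phi_x` would be the network's last-layer activations for the current context, and likelihood matching recomputes the per-arm priors whenever the network (and hence the representation) is updated.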

Continue Reading Online Limited Memory Neural-Linear Bandits with Likelihood Matching

Detecting Rewards Deterioration in Episodic Reinforcement Learning

Author: Ido Greenberg

This post summarizes the work Detecting Rewards Deterioration in Episodic Reinforcement Learning, accepted to ICML 2021. The code is available here.

Introduction

A major challenge in real-world RL is the need to trust the agent, and in particular to know whenever its performance begins to deteriorate. Unlike the framework of many works on robust RL, in real-world problems we often cannot rely on further exploration for adjustment, nor on a known model of the environment. Rather, we must detect the performance degradation as soon as possible, with as few assumptions about the environment as possible. Once detected, corresponding safety mechanisms can be activated (e.g. switching to manual control).

We address this problem in an episodic setup where the rewards within every episode are NOT assumed to be independent, identically-distributed, or based on a Markov process. We suggest a method that exploits a reference dataset of recorded episodes assumed to be “valid”, and detects degradation of rewards compared to this reference dataset.
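As a rough, self-contained illustration of the idea (not the paper's exact statistic, which also accounts for the correlations between time steps within an episode), one can compare the per-time-step mean rewards of the newly observed episodes to the reference, weighting each time step by the inverse of its variance on the reference data; the weighting scheme and the NumPy interface below are assumptions of the sketch.

```python
import numpy as np

def degradation_statistic(ref_episodes, new_episodes):
    """Weighted comparison of per-time-step rewards against a reference set.

    ref_episodes, new_episodes: arrays of shape (n_episodes, episode_len).
    Large negative values indicate that rewards have deteriorated.
    """
    ref_mean = ref_episodes.mean(axis=0)                # reference mean reward per time step
    ref_var = ref_episodes.var(axis=0, ddof=1) + 1e-8   # reference variance per time step
    weights = 1.0 / ref_var                             # down-weight noisy time steps
    diff = new_episodes.mean(axis=0) - ref_mean         # observed change per time step
    return float(np.sum(weights * diff) / np.sqrt(np.sum(weights)))
```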

We show that our test is optimal under certain assumptions; is better than the current common practice even under weaker assumptions; and is empirically better than several alternative mean-change tests on standard control environments – in certain cases by orders of magnitude.

In addition, we suggest a Bootstrap mechanism for False-Alarm Rate control (BFAR), which is applicable to episodic (i.e., non-i.i.d.) data.
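A minimal sketch of the bootstrap idea, reusing the `degradation_statistic` sketch above (the quantile-based threshold and parameter names are our assumptions): resample whole episodes from the reference set, so that within-episode dependence is preserved, recompute the statistic many times, and alarm only when the observed statistic falls below a low quantile of that bootstrap distribution.

```python
import numpy as np

def bootstrap_threshold(ref_episodes, n_new, alpha=0.01, n_boot=2000, seed=0):
    """Threshold such that a 'valid' batch of n_new episodes triggers a false
    alarm with probability roughly alpha."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        # Resample whole episodes (not individual time steps) from the reference set.
        idx = rng.choice(ref_episodes.shape[0], size=n_new, replace=True)
        stats.append(degradation_statistic(ref_episodes, ref_episodes[idx]))
    return float(np.quantile(stats, alpha))  # alarm when the observed statistic is below this
```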

Our detection method is entirely external to the agent, and in particular does not require model-based learning. Furthermore, it can be applied to detect changes or drifts in any episodic signal.

Continue Reading Detecting Rewards Deterioration in Episodic Reinforcement Learning

Reproducing a Reproducible Robotics Benchmark

Authors: Omer Cohen & Raveh Ben Simon

Supervised by: Orr Krupnik

Visit our published materials for this project on GitHub!

Introduction

Standardized evaluation measures have aided the progress of machine learning in disciplines such as computer vision, machine translation, and even simulated robotics. However, real-world robotic learning has suffered from a lack of benchmark setups. To tackle this issue, a group of researchers from UC Berkeley developed a unique, cheap, easy-to-set-up robotic arm environment called REPLAB. The environment is presented as a benchmark for robotic reaching and object manipulation. The details of this environment are laid out in the article:

REPLAB: A Reproducible Low-Cost Arm Benchmark Platform for Robotic Learning, by Brian Yang, Jesse Zhang, Vitchyr Pong, Sergey Levine, and Dinesh Jayaraman.

Further technical details about how to set up your own REPLAB cell, along with links to the paper and original results, are given on the project website.

We set out to construct our own REPLAB cell in an attempt to gauge its reproducibility.

Continue Reading Reproducing a Reproducible Robotics Benchmark

Deep Reinforcement Learning Works – Now What?

Author: Chen Tessler
chen {dot} tessler {at} campus {dot} technion {dot} ac {dot} il

Two years ago, Alex Irpan wrote a post about why “Deep Reinforcement Learning Doesn’t Work Yet”. Since then, we have made huge algorithmic advances, tackling most of the problems raised by Alex. We have methods that are sample efficient [1, 21] and can learn in an off-policy batch setting [22, 23]. When a reward function is lacking, we now have methods that can learn from preferences [24, 25], and even methods better suited to escaping bad local extrema when the return is non-convex [14, 26]. Moreover, we are now capable of training robust agents [27, 28] which can generalize to new and previously unseen domains!

Continue Reading Deep Reinforcement Learning Works – Now What?

Causal Inference

Why is Causal Inference Important for Reinforcement Learning?

Authors: Guy Tennenholtz, Shie Mannor and Uri Shalit
guytenn {at} gmail {dot} com

There has been growing interest in relating Causal Inference (CI) to Reinforcement Learning (RL). While there have been some great achievements in solving high-dimensional RL problems, research at the intersection of RL and CI is still in its infancy. What makes these problems hard, and how do they relate to RL? In this post we’ll give our view.

Continue Reading Why is Causal Inference Important for Reinforcement Learning?

Why does reinforcement learning not work (for you)?

Author: Shie Mannor
shie {at} ee {dot} technion {dot} ac {dot} il

 

So you run a reinforcement learning (RL) algorithm and it performs poorly. What then? Basically, you can try some other algorithm out of the box: PPO/AxC/*QN/Rainbow/etc… [1, 2, 3, 4] and hope for the best. This approach rarely works. But, why? Why don’t we have a “ResNet” for RL? By that I mean, why don’t we have a network architecture that gets you to 90% of the desired performance with 10% of the effort?

Continue Reading Why does reinforcement learning not work (for you)?

Due to the limitation to Gaussian policies, the approach is incapable of converging to the global optimum, whereas policy gradient approaches over the entire set of policies are guaranteed (in the tabular case) to converge to a global optimum.

Distributional Policy Optimization: An Alternative Approach for Continuous Control

Chen Tessler*, Guy Tennenholtz* and Shie Mannor
Published at NeurIPS 2019
Paper, Code

What?

We propose a new optimization framework, named Distributional Policy Optimization (DPO), which optimizes a distributional loss (as opposed to the standard policy gradient).

Unlike Policy Gradient methods, DPO is not limited to parametric distribution families (such as Gaussian or delta distributions) and can thus cope with non-convex returns.
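To get intuition for the claim about Gaussian policies and non-convex returns, here is a small toy illustration (our own sketch, not an experiment from the paper): with a symmetric bimodal reward over a one-dimensional action, the gradient of the expected return with respect to the mean of a Gaussian policy vanishes exactly between the two modes, so gradient ascent initialized there stalls at a return far below what a policy concentrated on either mode achieves.

```python
import numpy as np

def reward(a):
    """Symmetric bimodal reward over a 1-D action: sharp peaks at a = -1 and a = +1."""
    return np.exp(-(a - 1.0) ** 2 / 0.02) + np.exp(-(a + 1.0) ** 2 / 0.02)

def expected_return(mu, sigma):
    """E[r(a)] under a Gaussian policy N(mu, sigma^2), by numerical integration."""
    a = np.linspace(-4.0, 4.0, 8001)
    pdf = np.exp(-(a - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return np.sum(reward(a) * pdf) * (a[1] - a[0])

# The gradient of the return w.r.t. the mean vanishes at mu = 0 by symmetry,
# so gradient ascent started between the modes never moves the mean.
eps = 1e-3
grad_mu = (expected_return(eps, 1.0) - expected_return(-eps, 1.0)) / (2 * eps)

# Even the best variance reachable from there is far below a policy
# that concentrates its mass on one of the modes.
best_stuck = max(expected_return(0.0, s) for s in np.linspace(0.05, 2.0, 100))
one_mode = expected_return(1.0, 0.05)

print(f"d(return)/d(mu) at mu=0:        {grad_mu:+.4f}")    # ~ 0
print(f"best return with mu stuck at 0: {best_stuck:.3f}")  # ~ 0.12
print(f"return concentrated on a mode:  {one_mode:.3f}")    # ~ 0.9
```

A policy class that is not restricted to a single Gaussian (e.g. a distribution free to place mass on both modes) does not face this obstacle, which is the setting DPO targets.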

Author: Chen Tessler @tesslerc

Continue Reading Distributional Policy Optimization: An Alternative Approach for Continuous Control