

Reinforcement Learning Using a Continuous Time Actor-Critic Framework with Spiking Neurons

Overview of attention for an article published in PLoS Computational Biology, April 2013

Mentioned by

13 X users
1 Facebook page
2 Google+ users

Readers on

324 Mendeley
9 CiteULike
Title
Reinforcement Learning Using a Continuous Time Actor-Critic Framework with Spiking Neurons
Published in
PLoS Computational Biology, April 2013
DOI 10.1371/journal.pcbi.1003024
Pubmed ID
Authors

Nicolas Frémaux, Henning Sprekeler, Wulfram Gerstner

Abstract

Animals repeat rewarded behaviors, but the physiological basis of reward-based learning has only been partially elucidated. On one hand, experimental evidence shows that the neuromodulator dopamine carries information about rewards and affects synaptic plasticity. On the other hand, the theory of reinforcement learning provides a framework for reward-based learning. Recent models of reward-modulated spike-timing-dependent plasticity have made first steps towards bridging the gap between the two approaches, but faced two problems. First, reinforcement learning is typically formulated in a discrete framework, ill-adapted to the description of natural situations. Second, biologically plausible models of reward-modulated spike-timing-dependent plasticity require precise calculation of the reward prediction error, yet it remains to be shown how this can be computed by neurons. Here we propose a solution to these problems by extending the continuous temporal difference (TD) learning of Doya (2000) to the case of spiking neurons in an actor-critic network operating in continuous time, and with continuous state and action representations. In our model, the critic learns to predict expected future rewards in real time. Its activity, together with actual rewards, conditions the delivery of a neuromodulatory TD signal to itself and to the actor, which is responsible for action choice. In simulations, we show that such an architecture can solve a Morris water-maze-like navigation task, in a number of trials consistent with reported animal performance. We also use our model to solve the acrobot and the cartpole problems, two complex motor control tasks. Our model provides a plausible way of computing reward prediction error in the brain. Moreover, the analytically derived learning rule is consistent with experimental evidence for dopamine-modulated spike-timing-dependent plasticity.
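
At the algorithmic level, the architecture the abstract describes rests on Doya's continuous-time TD error, delta(t) = r(t) + dV/dt - V(t)/tau_r, which the critic computes in real time and which drives both value learning and action learning. Below is a minimal rate-based sketch of that actor-critic loop on a hypothetical 1D reaching task, assuming linear function approximation and a Gaussian exploration policy; it is not the paper's spiking implementation, and the task, names, and parameter values are illustrative assumptions.

```python
# Minimal rate-based sketch of the continuous-time TD actor-critic
# summarized in the abstract. The paper's model uses spiking neurons
# and a water-maze-like task; the toy task, function names, and all
# parameter values below are illustrative assumptions, not the
# authors' implementation.

import numpy as np

rng = np.random.default_rng(0)

DT = 0.01       # simulation time step (s)
TAU_R = 1.0     # reward discount time constant tau_r, as in Doya (2000)
GOAL = 1.0      # goal position on the 1D track [0, 1]

# Linear function approximation over Gaussian, place-cell-like features.
CENTERS = np.linspace(0.0, 1.0, 20)
WIDTH = 0.05

def features(x):
    """Population code for position x: one Gaussian bump per center."""
    return np.exp(-(x - CENTERS) ** 2 / (2 * WIDTH ** 2))

w_critic = np.zeros_like(CENTERS)  # critic: V(x) = w_critic . phi(x)
w_actor = np.zeros_like(CENTERS)   # actor: mean action mu(x) = w_actor . phi(x)
SIGMA = 0.5                        # exploration noise of the Gaussian policy
ALPHA_C, ALPHA_A = 0.1, 0.05       # critic / actor learning rates

def value(x):
    return w_critic @ features(x)

for trial in range(200):
    x = 0.0
    for step in range(int(20.0 / DT)):  # 20 s time-out per trial
        phi = features(x)
        mu = w_actor @ phi
        a = mu + SIGMA * rng.standard_normal()  # exploratory action (velocity)

        x_new = float(np.clip(x + a * DT, 0.0, 1.0))
        done = x_new >= GOAL
        r = 10.0 if done else 0.0

        # Continuous-time TD error (Euler discretization of
        # delta(t) = r(t) + dV/dt - V(t) / tau_r):
        v = value(x)
        v_new = 0.0 if done else value(x_new)
        delta = r + (v_new - v) / DT - v / TAU_R

        # The critic learns to predict future reward; the same delta is
        # broadcast to the actor, mimicking a global neuromodulator.
        w_critic += ALPHA_C * delta * phi * DT
        w_actor += ALPHA_A * delta * (a - mu) / SIGMA ** 2 * phi * DT

        x = x_new
        if done:
            break
    if (trial + 1) % 50 == 0:
        status = f"goal in {step * DT:.1f} s" if done else "timed out"
        print(f"trial {trial + 1}: {status}")
```

In the paper itself, both the value prediction and the policy are carried by populations of spiking neurons, and delta(t) is delivered as a dopamine-like neuromodulatory signal to critic and actor alike; the sketch above shows only the underlying rate-level logic.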

X Demographics

The data shown below were collected from the profiles of 13 X users who shared this research output.
Mendeley readers

The data shown below were compiled from readership statistics for 324 Mendeley readers of this research output.

Geographical breakdown

Country          Count   As %
Germany              6     2%
United Kingdom       6     2%
Switzerland          5     2%
France               5     2%
United States        5     2%
Italy                1    <1%
Austria              1    <1%
Sweden               1    <1%
Turkey               1    <1%
Other                4     1%
Unknown            289    89%

Demographic breakdown

Readers by professional status   Count   As %
Student > Ph.D. Student            102    31%
Researcher                          62    19%
Student > Master                    53    16%
Student > Bachelor                  28     9%
Professor                           11     3%
Other                               30     9%
Unknown                             38    12%
Readers by discipline                  Count   As %
Computer Science                          78    24%
Engineering                               48    15%
Neuroscience                              46    14%
Agricultural and Biological Sciences      45    14%
Physics and Astronomy                     17     5%
Other                                     43    13%
Unknown                                   47    15%