Spike-Based Reinforcement Learning in Continuous State and Action Space: When Policy Gradient Methods Fail
Read::
- Spike-Based Reinforcement Learning in Continuous State and Action Space: When Policy Gradient Methods Fail. E. Vasilaki, N. Frémaux, R. Urbanczik, W. Senn, W. Gerstner (2009). NA reading citation
Print::
Zotero Link:: NA
PDF:: NA
Files:: Snapshot; Vasilaki et al_2009_Spike-Based Reinforcement Learning in Continuous State and Action Space.pdf
Reading Note:: E. Vasilaki, N. Frémaux, R. Urbanczik, W. Senn, W. Gerstner (2009)
Web Rip::
TABLE without id
file.link as "Related Files",
title as "Title",
type as "type"
FROM "" AND -"ZZ. planning"
WHERE citekey = "vasilakiSpikeBasedReinforcementLearning2009a"
SORT file.cday DESC
Abstract
Changes of synaptic connections between neurons are thought to be the physiological basis of learning. These changes can be gated by neuromodulators that encode the presence of reward. We study a family of reward-modulated synaptic learning rules for spiking neurons on a learning task in continuous space inspired by the Morris water maze. The synaptic update rule modifies the release probability of synaptic transmission and depends on the timing of presynaptic spike arrival, postsynaptic action potentials, as well as the membrane potential of the postsynaptic neuron. The family of learning rules includes an optimal rule derived from policy gradient methods as well as reward-modulated Hebbian learning. The synaptic update rule is implemented in a population of spiking neurons using a network architecture that combines feedforward input with lateral connections. Actions are represented by a population of hypothetical action cells with strong Mexican-hat connectivity and are read out at theta frequency. We show that in this architecture, a standard policy gradient rule fails to solve the Morris water maze task, whereas a variant with a Hebbian bias can learn the task within 20 trials, consistent with experiments. This result does not depend on implementation details such as the size of the neuronal populations. Our theoretical approach shows how learning new behaviors can be linked to reward-modulated plasticity at the level of single synapses and makes predictions about the voltage and spike-timing dependence of synaptic plasticity and the influence of neuromodulators such as dopamine. It is an important step towards connecting formal theories of reinforcement learning with neuronal and synaptic properties.
Quick Reference
Time course $\varepsilon$
The time course $\varepsilon$ of the EPSP represents the effect of presynaptic activity at the location of the synapse.
EPSP / IPSP: excitatory / inhibitory postsynaptic potential
$u_i$: the membrane potential of neuron $i$
$g$: a positive function that increases with the membrane potential $u_i$; see also Eq. (24).
$\rho_i$: roughly equates to "firing rate". The membrane potential $u_i$ enters in the function $g$ that determines the instantaneous firing rate $\rho_i$.
$Y_i(t)$: the spike train of the postsynaptic neuron
$\delta$: the Dirac function. Postsynaptic spikes are treated as events and described by the function $Y_i(t)$.
$\tau_c$: a parameter with units of time
The parameter $\tau_c$ is a specific feature of our model which allows us to turn the model from a strict policy gradient method ($\tau_c = 0$, see Methods) into a naive Hebbian model ($\tau_c \to \infty$; see the discussion of the postsynaptic factor below).
Thus we are able to link and compare these conceptually different rules via the modification of $\tau_c$.
We note that for small firing rates $\rho_i \ll 1/\tau_c$, Eq. (9) approximates the optimal policy gradient rule, while for larger firing rates it enhances the Hebbian component of the rule.
For $\tau_c \to \infty$, the term in the square brackets goes to $Y_i(t)$, so that in this limit learning is driven by the Hebbian correlation term. In the main body of the simulation results, we pick a fixed value of $\tau_c$, which implies that we use a policy gradient method with a Hebbian bias.
postsynaptic factor
Encapsulated by the square brackets in Eq. (8) and visualized as a function of membrane potential in Figure 1.
For the case of $\tau_c \to \infty$, the postsynaptic factor depends only on spike timing, but not on the membrane potential of the postsynaptic neuron.
Top Comments
Topics
Eligibility Traces
Spike Response Model neuron
eligibility trace time constant
Appears to be
Tasks
Extracted Annotations and Comments
Introduction
The resulting synaptic update rules can be formulated as a differential equation in continuous time that has the form of a three-factor rule
The term $e_{ij}$, called the eligibility trace, picks up the correlations between pre- and postsynaptic activity just as in a Hebbian learning rule and convolves these with a low-pass filter.
However, the final weight change is implemented only in the presence of a reward signal which is delivered at the time when the animal hits the target. The choices of considered in this paper are: and , where is the reward signal averaged over many trials.
In contrast to earlier work of Xie and Seung [32], but similar to [33-35], our approach takes into account spiking neurons with refractoriness and includes examples such as the standard integrate-and-fire model. Under certain conditions on the refractoriness [34], our learning rule can be identified with a standard STDP model, but modulated by a third factor [33-36]. In contrast to most earlier work, our learning rule is applied to a network of neurons that combines feed-forward input with lateral interactions.
Three-factor learning rule for spiking neurons
We consider a Spike Response Model neuron with index $i$ that receives input from other neurons $j$. The $f$-th input spike from neuron $j$ arrives at time $t_j^f$ at a synapse onto neuron $i$ and causes there an excitatory (or inhibitory) postsynaptic potential (EPSP or IPSP) of time course $\varepsilon(t-t_j^f)$ and amplitude $w_{ij}$.
The EPSPs and IPSPs of all incoming spikes are added to the membrane potential $u_i$ of neuron $i$. Spikes are generated stochastically with an instantaneous rate (or stochastic intensity) $\rho_i(t)=g(u_i(t))$,
where $g$ is a positive function that increases with the membrane potential $u_i$; see also Eq. (24).
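Stochastic spike generation from an instantaneous rate can be sketched in discrete time as follows. This is a minimal illustration, not the paper's implementation: the names `spike_probability` and `sample_spike`, the bin width `dt`, and the assumption that the rate is constant within a bin are all introduced here for the example.

```python
import math
import random


def spike_probability(rho, dt):
    """Probability of at least one spike in a bin of width dt.

    Assumes the stochastic intensity rho is constant within the bin,
    so the probability of no spike is exp(-rho * dt).
    """
    return 1.0 - math.exp(-rho * dt)


def sample_spike(rho, dt, rng):
    """Draw one Bernoulli sample: True if the neuron spikes in this bin."""
    return rng.random() < spike_probability(rho, dt)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    # Hypothetical rate of 0.05 spikes per ms, simulated for 1000 bins of 1 ms.
    n_spikes = sum(sample_spike(0.05, 1.0, rng) for _ in range(1000))
    print("spikes in 1000 ms:", n_spikes)
```

For small `rho * dt` the spike probability is close to `rho * dt`, which is why the rate is often read informally as a firing probability per unit time.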
Immediately after a spike of neuron $i$ at time $t_i^f$, the neuron enters into a state of relative refractoriness, which is implemented by a hyperpolarizing spike afterpotential $\eta(t-t_i^f)$.
Thus the total membrane potential of the Spike Response Model neuron is [20]:
**Equation 6:**
$$ u_i(t)=u_{\text{rest}}+\sum_{j=1}^N w_{ij} \sum_{t_j^f \in x_j} \varepsilon\left(t-t_j^f\right)+\sum_{t_i^f \in y_{i,t}} \eta\left(t-t_i^f\right) $$
where $u_{\text{rest}}$ is the resting potential, $x_j$ is the set of presynaptic spikes, and $y_{i,t}=\left\{t_i^1, t_i^2, \ldots, t_i^F<t\right\}$ is the set of postsynaptic spikes up to time $t$. Using this neuron model, we can calculate the probability that neuron $i$ generates a specific spike train with firing times $t_i^1, t_i^2, t_i^3, \ldots$ during a trial of duration $T$ [34], see Methods, Eq. (25). Some of the spikes of neuron $i$ occur just before a reward is delivered, others not. The aim of learning is to change the synaptic weights $w_{ij}$ so that the probability of receiving a reward $R$ increases. We consider learning rules of the form
**Equation 7:**
$$ \frac{d w_{ij}}{dt}(t)=\alpha(R-b)\,\delta\left(t-t_{hit}\right) e_{ij}(t) $$
where $\alpha$ is the learning rate (controlling the amplitude of weight updates), $t_{hit}$ the moment when the animal hits the target or the wall, $R$ the reward (positive for finding the target, negative for bumping into a wall), and $b$ a reward baseline, for instance an estimate of the positive reward based on past experience.
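The membrane potential of Eq. (6) can be evaluated directly from the spike times. The sketch below is only an illustration: the kernel shapes (a difference of exponentials for $\varepsilon$, a decaying exponential for $\eta$) and every numerical constant are assumptions made for the example, since the notes do not give the kernels used in the paper.

```python
import math

U_REST = -70.0   # resting potential in mV (illustrative value)
TAU_M = 20.0     # EPSP decay time constant in ms (assumed)
TAU_S = 5.0      # EPSP rise time constant in ms (assumed)
ETA_0 = -5.0     # afterpotential amplitude in mV (assumed)
TAU_REF = 10.0   # afterpotential decay time constant in ms (assumed)


def epsilon(s):
    """Assumed EPSP time course: zero before spike arrival, then rise and decay."""
    if s <= 0.0:
        return 0.0
    return math.exp(-s / TAU_M) - math.exp(-s / TAU_S)


def eta(s):
    """Assumed hyperpolarizing spike afterpotential (relative refractoriness)."""
    if s <= 0.0:
        return 0.0
    return ETA_0 * math.exp(-s / TAU_REF)


def membrane_potential(t, weights, pre_spikes, post_spikes):
    """Eq. (6): resting potential + weighted EPSPs + spike afterpotentials.

    weights[j] is w_ij, pre_spikes[j] is the list of spike times x_j,
    post_spikes is the list of the neuron's own earlier spike times.
    """
    u = U_REST
    for w_ij, x_j in zip(weights, pre_spikes):
        u += w_ij * sum(epsilon(t - t_f) for t_f in x_j)
    u += sum(eta(t - t_f) for t_f in post_spikes if t_f < t)
    return u


if __name__ == "__main__":
    # One presynaptic neuron firing at 5 ms, one postsynaptic spike at 8 ms.
    print(membrane_potential(10.0, [2.0], [[5.0]], [8.0]))
```

Because both kernels vanish for non-positive arguments, only spikes that occurred before time $t$ contribute, matching the definition of $y_{i,t}$.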
The eligibility trace $e_{ij}$ evolves according to Equation 8:
where $Y_i(t)=\sum_f \delta(t-t_i^f)$ is the spike train of the postsynaptic neuron, $\delta$ the Dirac function, $\tau_e$ the eligibility trace time constant, $\tau_c$ a parameter with units of time, and $\rho_i'$ the derivative of the function $g$.
Because of the parameter $\tau_c$, the learning equations (9) and (8) define a family of learning rules, rather than one single instance of a rule.
The parameter $\tau_c$ is a specific feature of our model which allows us to turn the model from a strict policy gradient method ($\tau_c = 0$, see Methods) into a naive Hebbian model ($\tau_c \to \infty$; see the discussion of the postsynaptic factor below).
Thus we are able to link and compare these conceptually different rules via the modification of $\tau_c$.
We note that for small firing rates $\rho_i \ll 1/\tau_c$, Eq. (9) approximates the optimal policy gradient rule, while for larger firing rates it enhances the Hebbian component of the rule.
For $\tau_c \to \infty$, the term in the square brackets goes to $Y_i(t)$, so that in this limit learning is driven by the Hebbian correlation term. In the main body of the simulation results, we pick a fixed value of $\tau_c$, which implies that we use a policy gradient method with a Hebbian bias.
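The limits above can be checked against one candidate form of the eligibility dynamics. This is an assumption: Eq. (8) itself is missing from these notes, and the form below is chosen only to match the verbal description (a decay with the eligibility trace time constant, written $\tau_e$ here, plus the presynaptic EPSP multiplied by a bracketed postsynaptic term). The symbol names $\rho_i$ for the instantaneous firing rate, $Y_i$ for the postsynaptic spike train and $\tau_c$ for the time parameter are likewise assumed and should be checked against the paper.

$$ \frac{d e_{ij}}{dt} = -\frac{e_{ij}}{\tau_e} + \frac{\rho_i'(t)}{\rho_i(t)}\left[Y_i(t)-\frac{\rho_i(t)}{1+\tau_c\,\rho_i(t)}\right]\sum_{t_j^f \in x_j}\varepsilon\left(t-t_j^f\right) $$

Under this assumption, the subtracted term in the bracket behaves as follows:

$$ \tau_c = 0:\ \ \frac{\rho_i}{1+\tau_c\rho_i}=\rho_i, \qquad \tau_c\rho_i \ll 1:\ \ \frac{\rho_i}{1+\tau_c\rho_i}=\rho_i\left(1-\tau_c\rho_i+O\!\left((\tau_c\rho_i)^2\right)\right)\approx\rho_i, \qquad \tau_c\to\infty:\ \ \frac{\rho_i}{1+\tau_c\rho_i}\le\frac{1}{\tau_c}\to 0 $$

So for $\tau_c = 0$ the bracket is $Y_i-\rho_i$, the spike train minus its expected rate, which is the structure of a policy gradient rule. For $\tau_c\to\infty$ only $Y_i$ survives, and the update depends on coincidences between postsynaptic spikes and presynaptic EPSPs, which is the Hebbian case.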
The estimate $b$ of the positive reward is calculated as a running mean, updated at the end of each trial from the reward obtained at the end of that trial (1 or 0), with a fixed width of the averaging window.
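A running-mean baseline and the resulting weight change at the end of a trial can be sketched as below. The update formula in `update_baseline` is an assumed, commonly used form of a running mean (the exact equation is missing from these notes), and the names `update_baseline`, `weight_change`, and the window width `m` are hypothetical. The weight change follows from integrating Eq. (7) over the Dirac pulse at $t_{hit}$.

```python
def update_baseline(b, reward, m):
    """Assumed running-mean update: move b a fraction 1/m towards the new reward.

    b      -- current baseline estimate
    reward -- reward at the end of the trial (1 or 0)
    m      -- width of the averaging window, in trials
    """
    return b + (reward - b) / m


def weight_change(alpha, reward, b, e_ij_at_hit):
    """Integral of Eq. (7) over the Dirac pulse: alpha * (R - b) * e_ij(t_hit)."""
    return alpha * (reward - b) * e_ij_at_hit


if __name__ == "__main__":
    b = 0.0
    for trial, reward in enumerate([0, 1, 1, 0, 1], start=1):
        dw = weight_change(alpha=0.5, reward=reward, b=b, e_ij_at_hit=2.0)
        b = update_baseline(b, reward, m=10)
        print(trial, round(dw, 4), round(b, 4))
```

With this form, a reward above the current baseline strengthens synapses with a positive eligibility trace, and a reward below the baseline weakens them.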
We will now show that Eqs. (7) and (8) can be interpreted as a three-factor learning rule for spiking neurons, within the general framework outlined in the introduction.
Presynaptic factor.
Presynaptic spike arrival causes an EPSP.
The time course of the EPSP represents the effect of presynaptic activity at the location of the synapse.
We emphasize that the term presynaptic factor does not imply that this factor is implemented presynaptically; rather, it refers to a term caused by the activity of the presynaptic neuron $j$.
Postsynaptic factor.
Postsynaptic activity is represented by both the timing of postsynaptic action potentials and the postsynaptic membrane potential $u_i$.
The membrane potential enters in the function $g$ that determines the instantaneous firing rate $\rho_i$.
Postsynaptic spikes are treated as events and described by the function $Y_i(t)$.
The postsynaptic factor is encapsulated by the square brackets in Eq. (8) and visualized as a function of membrane potential in Figure 1.
For the case of $\tau_c \to \infty$, the postsynaptic factor depends only on spike timing, but not on the membrane potential of the postsynaptic neuron.
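The dependence of the postsynaptic factor on membrane potential and on the time parameter can be explored numerically. The sketch below rests on several assumptions that are not confirmed by these notes: the bracket has the form $Y_i - \rho_i/(1+\tau_c\rho_i)$, the rate function $g$ is exponential (so that $\rho_i'/\rho_i$ is a constant), and all constants and function names are hypothetical.

```python
import math

RHO_0 = 0.06     # rate at threshold in 1/ms (hypothetical)
THETA = -50.0    # threshold in mV (hypothetical)
DELTA_U = 2.0    # width of the threshold region in mV (hypothetical)


def g(u):
    """Assumed exponential rate function; positive and increasing in u."""
    return RHO_0 * math.exp((u - THETA) / DELTA_U)


def bracket_without_spike(u, tau_c):
    """Assumed bracket term between postsynaptic spikes (Y_i = 0)."""
    rho = g(u)
    return -rho / (1.0 + tau_c * rho)


def eligibility_step(e, epsp, spiked, u, tau_c, tau_e, dt):
    """One Euler step of the assumed eligibility dynamics.

    epsp   -- summed EPSP time course at the synapse in this bin
    spiked -- True if the postsynaptic neuron fired in this bin
    """
    slope = 1.0 / DELTA_U  # rho'/rho for the exponential g above
    drift = -e / tau_e + slope * bracket_without_spike(u, tau_c) * epsp
    e_new = e + dt * drift
    if spiked:
        # The Dirac pulse in Y_i integrates to 1 over the bin.
        e_new += slope * epsp
    return e_new


if __name__ == "__main__":
    for tau_c in (0.0, 5.0, math.inf):
        factors = [bracket_without_spike(u, tau_c) for u in (-60.0, -50.0, -45.0)]
        print(tau_c, [round(f, 4) for f in factors])
```

In this sketch the no-spike term becomes more negative as the membrane potential rises when `tau_c` is zero, saturates at `-1 / tau_c` for finite positive `tau_c`, and vanishes when `tau_c` is infinite, so that only spike coincidences change the trace.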
The presynaptic and postsynaptic factors both enter into the eligibility trace $e_{ij}$ of Eq. (8), which is a quantity that must be stored locally at the synapses from neuron $j$ to neuron $i$.
The eligibility trace of the synapse from $j$ to $i$ is updated by a finite positive amount whenever a postsynaptic action potential occurs within the time span of an EPSP at this synapse.
Hence the eligibility trace picks up (potentially causal) correlations between presynaptic spike arrival and postsynaptic spike firing. If an EPSP occurs without a postsynaptic spike, the eligibility trace decays smoothly at a rate proportional to .
In particular, if the membrane potential is high, but no postsynaptic spike is triggered, the eligibility trace decreases strongly. However, in the limit $\tau_c \to \infty$ such a depression of the synapse does not occur.
Thus, for $\tau_c \to \infty$ the eligibility trace is naive Hebbian in the sense that it is increased if postsynaptic spikes occur shortly after (and potentially triggered by) presynaptic spike arrival. If a synapse is not active (that is, in the absence of an EPSP at the synapse), the eligibility trace always decays with a slow time constant $\tau_e$ in the range of seconds. Whatever the choice of $\tau_c$, the eligibility trace uses only local quantities that are available at the site of the synapse and stores locally the correlations between pre- and postsynaptic activity averaged over several seconds. In the