Deep Recurrent Encoder: A scalable end-to-end network to model brain signals


Understanding how the brain responds to sensory inputs from non-invasive brain recordings like magnetoencephalography (MEG) can be particularly challenging: (i) the high-dimensional dynamics of mass neuronal activity are notoriously difficult to model, (ii) signals can greatly vary across subjects and trials, and (iii) the relationship between these brain responses and the stimulus features is non-trivial. These challenges have led the community to develop a variety of preprocessing and analytical (almost exclusively linear) methods, each designed to tackle one of these issues. Instead, we propose to address these challenges through a specific end-to-end deep learning architecture, trained to predict the MEG responses of multiple subjects at once. We successfully test this approach on a large cohort of MEG recordings acquired during a one-hour reading task. Our Deep Recurrent Encoder (DRE) reliably predicts MEG responses to words with a three-fold improvement over classic linear methods. We further describe a simple variable importance analysis to investigate the MEG representations learned by our model and recover the expected evoked responses to word length and word frequency. Lastly, we show that, contrary to linear encoders, our model captures modulations of the brain response in relation to baseline fluctuations in the alpha frequency band. The quantitative improvement of the present deep learning approach paves the way to a better characterization of the complex dynamics of brain activity from large MEG datasets.

Abbreviations: MEG, magnetoencephalography; EEG, electroencephalography.
* Equally contributing authors.

K E Y W O R D S
Magnetoencephalography (MEG), Encoding, Forecasting, Reading, Deep Learning

| INTRODUCTION
A major goal of cognitive neuroscience consists of identifying how the brain responds to distinct experimental conditions. While descriptive statistics and statistical tests are classically used to analyze neural data [1], this approach is not suited to predict how the brain should react to new conditions. The resulting models of the brain can thus be particularly challenging to compare. By contrast, predictive encoding models [2,3] can be directly trained to predict brain responses to various experimental conditions, and compared on their ability to accurately predict novel conditions. For example, encoding models allow the estimation of integration constants in the brain [4,5], and of the hierarchical organization of visual [6] and speech processing [7,8]. Beyond MEG, predictive models have enabled automatic segmentation [9] and dynamical system identification [10,11]. In functional Magnetic Resonance Imaging, predictive encoding models are starting to emulate complex neural processing [12] and are a step towards discovering new phenomena [13,14]. Yet, this general objective of developing encoding models faces three major challenges when working with the non-invasive and time-resolved signals collected by magneto- and electro-encephalography (M/EEG).
Challenge 1: rich response dynamics

M/EEG signals are known and promoted for their excellent temporal resolution. While this ability to measure cognitive processes at a millisecond time-scale offers unique opportunities for fine chronometry of neural responses in humans, it also makes such signals notoriously difficult to analyze. For example, brain responses to audio streams overlap in time, making their identification difficult. To address this issue in the context of encoding models, it is standard to employ a Temporal Receptive Field (TRF) model [15,16,17,18,19,20,21,22,23,24]¹. TRF models are commonly designed to predict neural responses to exogenous stimulation by fitting a linear regression model with a fixed time-lag window of past sensory stimuli. By doing so, the predictions derived from TRFs are only influenced by stimulus descriptors, preventing them from modulating their response based on previous brain activity. Consequently, unless the basal activity from previous time points is introduced as an exogenous feature, a TRF cannot learn to capture neuronal adaptation responses [27], nor can it learn to vary an evoked response as a function of the pre-stimulus alpha power [28,29].
¹ Also referred to as Finite Impulse Response (FIR) analysis in fMRI [25], and Distributed Lag modeling in statistics [26].

Challenge 2: inter-trial and inter-subject variability

Neuronal recordings in general, and M/EEG in particular, can be extremely variable across trials and subjects [30,3,31,29,32]. To reduce the nuisance factors behind these variations, such as eye blinks, head movements, cardiac, and face muscle activity, which corrupt MEG recordings, it is common to make use of multiple sessions and subjects within a study. For example, several methods based on spatial filtering [33,34,35,36,37] or "hyper-alignment" use linear models such as canonical correlation analysis (CCA), partial least squares regression (PLS), multi-view ICA, and back-to-back regression (B2B) [38,39,40,41,42,43] to isolate the brain responses shared across trials and/or individuals.
However, these denoising techniques can also remove relevant signals. For example, VanRullen [29] has repeatedly shown that evoked responses to sensory input can be modulated by pre-stimulus alpha activity in a predictable way. Averaging trials, or filtering out this variability during preprocessing, would therefore prevent the identification of such phenomena.

Challenge 3: identifying the relationship between brain responses and stimulus features
A large part of cognitive neuroscience aims to identify how the brain responds to stimulus features. For example, are V1 neurons tuned to respond to luminance, contrast, oriented lines, or faces? When and where does this elicit a response? To tackle this issue, it is common to present many stimuli to the subject, and fit a general linear model (GLM) to predict brain responses given a set of hypothetical features [2]. This approach can be limited, as GLMs only reveal brain responses to features predetermined by the analyst [44,25,45,46], and understanding interactions between features often requires explicitly modeling these interactions (e.g. as cross-terms to form a quadratic polynomial) and feeding them to a linear regression [47].
Here, we propose to simultaneously address these three core challenges with a unique end-to-end "Deep Recurrent Encoding" (DRE) neural network trained to robustly predict brain responses from both (i) past MEG activity and (ii) current experimental conditions. We test DRE on 99 subjects recorded with MEG during a one-hour long reading task, and show that our model (1) better predicts MEG responses than standard models, (2) efficiently captures inter-trial and inter-subject variability, and (3) identifies feature-specific responses as well as interactions between basal activity and these evoked responses.

| MATERIALS AND METHOD
This section presents, with consistent and self-contained mathematical notations, a methodological progression from linear to nonlinear encoding models of neural dynamics as observed with MEG. It also discusses the statistical and computational benefits of recurrent models, as well as the novel methodological ideas proposed with the DRE model.

| Problem formalization
In the case of MEG, the measured magnetic fields x reflect a tiny subset of brain dynamics h: specifically, a partial and macroscopic summation of the synaptic input to cortical pyramidal cells. Given the physics of electromagnetic field propagation, it is standard to assume that these neuronal magnetic fields have a linear, stationary, and instantaneous relationship with the magnetic fields measured via MEG sensors [48]. We refer to this "readout operator" as C, a matrix which is subject-specific because it depends on the location of pyramidal neurons in the cortex and thus on the anatomy of each subject. Furthermore, the brain dynamics, governed by a function f, evolve according to their past and to external stimuli u [49]. In sum, we can formulate the problem as follows:

    x_t = C^⊤ h_t,    h_t = f(h_{t−1}, …; u_t, …)

where the exact number of past states and stimuli entering f is made explicit in the Notations below.

| Operational Objective
Here, we aim to parameterize f with θ, and subsequently learn θ and C to obtain a statistical (as opposed to biologically constrained, as in [50]) generative model of observable brain activity that accurately predicts the MEG activity x ∈ ℝ^{d_x} given an initial state and a series of past stimuli.

Notations
We denote by u_t ∈ ℝ^{d_u} the stimulus with d_u encoded features at time t, by x_t ∈ ℝ^{d_x} the MEG recording with d_x sensors, by x̂_t its estimate, and by h_t ∈ ℝ^{d_h} the underlying brain activity. Because the true underlying brain dynamics are never known, h will always refer to a model estimate. To facilitate the parametrization of f, it is common in the modeling of dynamical systems to explicitly create a "memory buffer" by concatenating successive lags. We adopt the bold notation h_{t−1:t−τ_h} := [h_{t−1}, ..., h_{t−τ_h}] ∈ ℝ^{d_h τ_h} for the flattened concatenation of τ_h ∈ ℕ time-lagged vectors. With these notations, the dynamical models considered in this paper are described as:

    h_t = f(h_{t−1:t−τ_h}, u_{t:t−τ_u}),    x_t = C^⊤ h_t        (1)

where
• f: ℝ^{d_h τ_h + d_u(τ_u+1)} → ℝ^{d_h} governs brain dynamics given the preceding brain states and external stimuli;
• C ∈ ℝ^{d_h × d_x} is a linear, stationary, instantaneous, and subject-specific observability operator that makes a subset of the underlying brain dynamics observable to the MEG sensors.

Temporal Receptive Field (TRF)
Temporal receptive fields (TRF) [15] are arguably the most common model for predicting neural time series in response to exogenous stimulation. The TRF equation is that of control-driven linear dynamics:

    h_t = B u_{t:t−τ_u}        (2)

where B ∈ ℝ^{d_h × d_u(τ_u+1)} is the convolution kernel that maps the stimuli to the brain response, and θ = {B}. By definition, the TRF kernel encodes the input-output properties of the system, namely, its characteristic time scale, its memory, and thus its ability to sustain an input over time. A computational drawback is that the TRF kernel size scales linearly with the duration of the neural response to the stimulus. For example, a dampened oscillation evoked by the stimulus could last one hundred time samples (τ_u = 99) and would require B ∈ ℝ^{d_h × 100 d_u} to reach 100 steps in the past, even though oscillatory dynamics can be compactly written as a second-order differential equation expressing h_t in terms of only two of its own past states (h_{t−1}, h_{t−2})². Emulating this, we will introduce a recurrent component to the TRF model to tackle the issue of dimensionality.
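To make this concrete, a TRF reduces to ridge regression on a lagged design matrix. Below is a minimal numpy sketch (the helper names `trf_fit` and `trf_predict` are hypothetical, and this is not the implementation used in the study):

```python
import numpy as np

def _lagged_design(u, tau_u):
    """Build the lagged design matrix [u_t, u_{t-1}, ..., u_{t-tau_u}]."""
    T, d_u = u.shape
    U = np.zeros((T, d_u * (tau_u + 1)))
    for k in range(tau_u + 1):
        U[k:, k * d_u:(k + 1) * d_u] = u[:T - k]
    return U

def trf_fit(u, x, tau_u, lam=1.0):
    """Ridge regression from lagged stimulus features u (T, d_u) to MEG x (T, d_x).
    Returns B of shape (d_u * (tau_u + 1), d_x)."""
    U = _lagged_design(u, tau_u)
    reg = lam * np.eye(U.shape[1])
    return np.linalg.solve(U.T @ U + reg, U.T @ x)  # closed-form ridge solution

def trf_predict(u, B, tau_u):
    return _lagged_design(u, tau_u) @ B
```

Note how the design matrix grows as d_u(τ_u + 1) columns: with hundreds of lags and sensors, this is exactly the dimensionality issue discussed above.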

Recurrent Temporal Receptive Field (RTRF)
A Recurrent Temporal Receptive Field (RTRF) is a linear auto-regressive model. The RTRF with exogenous input can model a time series from its own past (e.g., past brain activity) and from exogenous stimuli. This corresponds to recurrent dynamics with control:

    h_t = A h_{t−1:t−τ_h} + B u_{t:t−τ_u}        (3)

where the matrix A ∈ ℝ^{d_h × d_h τ_h} encodes the recurrent dynamics of the system, and θ = {A, B}.
The dependency of h_t on h_{t−1} in (3) means we need to unroll the expression of h_{t−1} in order to compute h_t.
However, it has been shown that linear models perform poorly in this case, as terms of the form A^t (A to the power t) will appear, with either exponentially exploding or vanishing eigenvalues. This rules out optimization with first-order methods due to the poor conditioning of the problem [51], or using a closed-form solution. To circumvent unrolling the expression of h_{t−1}, we need to obtain it from what is measured at time t − 1. This however assumes the existence of an inverse relationship from x_{t−1} to h_{t−1}, which we assume here to be linear by using the pseudo-inverse of C:

    h_{t−1} = (C^⊤)^† x_{t−1}        (4)

As a result, h_t and x_t are identifiable to one another, and (3) can be solved in closed form as a regular linear system [52]. Initializing the RTRF dynamics with the pre-stimulus data can be written as:

    h_t = (C^⊤)^† x_t,  for t ∈ {1, ..., τ}

where τ_h is chosen to match the pre-stimulus duration τ.
Though the recurrent component of the RTRF is able to reduce the receptive field τ_u of the TRF, it is nevertheless constrained to maintain a 'sufficiently big' receptive field τ_h in order to be initialized over τ_h steps. The following model, DRE, avoids this issue, and also does not require that h_t and x_t be identifiable via linear inversion.

Deep Recurrent Encoder (DRE)
DRE is an architecture based on the Long Short-Term Memory (LSTM) computational block [53]. It is useful to think of the LSTM as a "black-box nonlinear dynamical model", which composes the RTRF building block with nonlinearities and a memory module that reduces the need for receptive fields, so that τ_h = 1 and τ_u = 0. It is employed here to capture nonlinear dynamics evoked by a stimulus. A single LSTM layer can be formulated as [53]:

    f_t = σ(W_f [h_{t−1}, u_t] + b_f)
    i_t = σ(W_i [h_{t−1}, u_t] + b_i)
    o_t = σ(W_o [h_{t−1}, u_t] + b_o)
    m̃_t = tanh(W_m [h_{t−1}, u_t] + b_m)        (5)
    m_t = f_t ⊙ m_{t−1} + i_t ⊙ m̃_t
    h_t = o_t ⊙ tanh(m_t)

where σ denotes the logistic sigmoid, both σ and the tanh nonlinearity are applied element-wise, ⊙ is the Hadamard (element-wise) product, and f_t, i_t, o_t ∈ (0, 1)^{d_m} are data-dependent vectors modeled as forget (or drop), input, and output gates, respectively. The memory module m_t ∈ ℝ^{d_m} thus interpolates between a "past term" m_{t−1} ∈ ℝ^{d_m} and a "prediction term" m̃_t ∈ ℝ^{d_m} computed from h_{t−1} and u_t. The "prediction term" (the m̃_t equation in (5)) resembles that of the previous RTRF model, except that it is here composed with a tanh nonlinearity which conveniently normalizes the signals.
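The gating mechanism above can be sketched in a few lines of numpy (illustrative shapes only; `W` stacks the four gate weight matrices, and this is of course not the trained DRE block):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, m_prev, u_t, W, b):
    """One step of a standard LSTM cell, mirroring equations (5).

    h_prev, m_prev: (d_m,) hidden state and memory; u_t: (d_u,) input;
    W: (4 * d_m, d_m + d_u) stacked gate weights; b: (4 * d_m,) biases.
    """
    z = W @ np.concatenate([h_prev, u_t]) + b
    d_m = h_prev.shape[0]
    i_t = sigmoid(z[:d_m])               # input gate, values in (0, 1)
    f_t = sigmoid(z[d_m:2 * d_m])        # forget (drop) gate
    o_t = sigmoid(z[2 * d_m:3 * d_m])    # output gate
    m_tilde = np.tanh(z[3 * d_m:])       # "prediction term"
    m_t = f_t * m_prev + i_t * m_tilde   # interpolate past memory and prediction
    h_t = o_t * np.tanh(m_t)             # observable hidden state
    return h_t, m_t
```

The tanh composition keeps |h_t| < 1 regardless of t, which is precisely why the recurrence remains numerically stable where the linear RTRF recursion explodes or vanishes.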
Again, the dependency of h_t on h_{t−1} in (3) meant that we needed to unroll the expression of h_{t−1} to compute h_t.
While this is numerically unstable for the RTRF, the LSTM is designed such that h_t and its gradient are stable even for large values of t. As a result, h_t and x_t do not need to be identifiable to one another. In other words, contrary to the RTRF, the LSTM allows h_t to represent a hidden state containing potentially more information than its corresponding observation x_t.
We now motivate three modifications made to the standard LSTM.
First, we help it recognize when (not) to sustain a signal, by augmenting the control u_t with a mask embedding p_t ∈ {0, 1} indicating whether the provided MEG signal generates the current brain response (i.e. 1 before word onset and 0 thereafter). Second, we automatically learn to align subjects with a dedicated subject embedding layer. Indeed, a shortcoming of standard brain encoding analyses is that they are commonly performed on each subject separately.
However, this implies that one cannot exploit potential similarities across subjects. Here, we adapt the LSTM in the spirit of Défossez et al. [54] so that a single model is able to leverage information across multiple subjects. We do this by augmenting the control u_t with a "subject embedding" s ∈ ℝ^{d_s} that is learned for each subject. Note that this amounts to learning a matrix in ℝ^{d_s × n_s} that is applied to the one-hot encoding of the subject number. In other words, each subject has a vectorized representation that is one column of the embedding matrix. Setting d_s < n_s allows us to use the same LSTM block to model subject-wise variability, and to train across subjects simultaneously while leveraging similarities across subjects.
Third, for comparability purposes, the RTRF and the LSTM should access the same pre-stimulus MEG information x_{τ:1}.
Incorporating the initial MEG, before word onset, is done by augmenting the control with p_t ⊙ x_t. The extended control reads ũ_t = [u_t, s, p_t, p_t ⊙ x_t], and the LSTM with augmented control finally reads:

    h_t = LSTM(h_{t−1}, m_{t−1}, ũ_t)        (6)

i.e. equations (5) with u_t replaced by ũ_t. In practice, to maximize expressivity, two modified LSTM blocks are stacked on top of one another (Figure 1).
Having introduced a nonlinear dynamical system for the brain response h_t, we can also extend the model (1) by challenging the linear instantaneous mixing from the brain response h_t to the measurements x_t. Introducing two new nonlinear functions d and e, respectively parametrized by θ_2 and θ_3, a more general model formally reads:

    h_t = f_{θ_1}(h_{t−1:t−τ_h}, e_{θ_3}(x_{t+τ_x:t−τ_x}), ũ_{t:t−τ_u})        (7)
    x̂_t = d_{θ_2}(h_{t+τ_x:t−τ_x})

where τ_x allows us to capture a small temporal window of data around x_t, and τ_u is taken to be much larger than τ_x. In other terms, the DRE model generalizes the linear instantaneous measurement of the previous models with a "convolutional autoencoder" [55]. The e (encoder) function is formed by convolutions and the d (decoder) function uses transposed convolutions, where both functions are two layers deep (Figure 1)³.
In practice, we use a kernel size K = 4 for the convolutions. This impacts the receptive field of the network and the parameter τ_x. Equation (7) implies that the number of time samples in h and x is the same. However, a strong benefit of the convolutional auto-encoder is to reduce the number of time steps by using a stride S larger than 1. A stride of 2 halves the temporal dimension: it boils down to taking every other time sample from the output of the convolved time series. Given that the LSTM module is by nature sequential, this reduces the number of time steps it has to consider when learning, which accelerates both training and evaluation.
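The effect of the stride on the temporal dimension can be checked with a toy single-channel convolution (a simplified sketch; the actual encoder uses multi-channel learned kernels):

```python
import numpy as np

def conv1d(x, kernel, stride=2, pad=1):
    """Minimal 1-D convolution over time (single channel, illustrative only).

    x: (T,) signal, kernel: (K,). With stride 2, the output keeps
    every other step of the convolved series.
    """
    x = np.pad(x, pad)
    K = kernel.size
    out_len = (x.size - K) // stride + 1
    return np.array([x[i * stride:i * stride + K] @ kernel
                     for i in range(out_len)])
```

With T = 300 samples (2.5 s at 120 Hz), K = 4, stride 2, and one sample of zero-padding, the output has 150 time steps: half as many for the sequential LSTM to traverse.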
Further, there is evidence that LSTMs can only pass information over a limited number of time steps [57]. In practice, we use d_h output channels for the convolutional encoder. In summary, our DRE model generalizes the TRF and RTRF models by using nonlinearities both in the dynamics of the brain response h_t and in its measurement x_t (see (1)), respectively with LSTM cells and a convolutional autoencoder. Importantly, the DRE is equipped with a subject embedding, allowing us to learn a joint model for the group of subjects.

| Optimization losses
The dynamics of the above three models (TRF, RTRF, DRE) are given by different expressions of f_θ, as well as by the mappings between x and h: via C for the TRF and RTRF, or via d and e for the DRE.
At test time, the models aim to accurately forecast MEG data from initial steps combined with subsequent stimuli.
Consequently, one should train the models in the same setup. This boils down to minimizing a "multi-step-ahead" ℓ2 prediction error:

    L(θ) = Σ_{t=τ+1}^{T} ‖x_t − x̂_t‖₂²        (8)

This "multi-step-ahead" minimization requires unrolling the recurrent expression of h_t over the preceding time steps, which the LSTM-based DRE model is able to do [58]. The DRE model takes as input the observed MEG at the beginning of the sequence, and must predict the future MEG measurements using the (augmented) stimulus ũ_t.
Note that the mapping to and from the latent space, e θ 3 and d θ 2 , are learned jointly with the dynamical operator f θ 1 .
Furthermore, the DRE has reduced temporal receptive fields; the computational load is thus lightened, which allows for a high-dimensional latent space.
Given that the RTRF model is linear and can suffer from numerical instabilities (see above), it is trained with the "one-step-ahead" version of the predictive loss, with squared ℓ2 regularization:

    L(θ) = Σ_t ‖x_t − x̂_t‖₂² + λ‖θ‖₂²        (9)

TRF models are also trained with this "one-step-ahead" loss. As mentioned above, the linear models (TRF) require a larger receptive field than the nonlinear DRE. Large receptive fields induce a computational burden, because each time lag comes with a spatial dimension of size d_h or d_u. To tackle this issue, C is chosen to reduce this spatial dimension. In practice, we choose to learn C separately from the dynamics to simplify the training procedure of the linear models. Given a participant, we fit a Principal Component Analysis (PCA) with 40 components on their averaged (evoked) MEG data: the resulting PCA map is taken to be the matrix C ∈ ℝ^{d_x × 40}. The resulting latent space thus explains most of the variance of the original recording. Indeed, when training the TRF model on all 270 MEG sensors with no hidden state (6.4 ± 0.22%, mean and SEM across subjects) or using a 40-component PCA (6.43 ± 0.17%), we obtained similar performances. The pseudo-inverse C^† required to compute the previous latent state h_{t−1} is also obtained from the PCA model. Note that dimensionality reduction via linear demixing is a standard preprocessing step in MEG analysis [34,59,60].
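The spatial dimensionality reduction used for the linear models can be sketched with an SVD-based PCA (`fit_pca` is a hypothetical helper; scikit-learn's PCA would behave equivalently):

```python
import numpy as np

def fit_pca(evoked, n_components=40):
    """PCA via SVD on averaged (evoked) MEG data of shape (times, sensors).

    Returns C (d_x x n_components), the spatial map projecting sensors to the
    latent space, and its pseudo-inverse for mapping back.
    """
    Xc = evoked - evoked.mean(axis=0)        # center each sensor
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    C = Vt[:n_components].T                  # top spatial components
    C_pinv = np.linalg.pinv(C)               # (n_components, d_x)
    return C, C_pinv
```

Because the columns of C are orthonormal, the pseudo-inverse here is simply the transpose, so projecting to the latent space and back is cheap.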

Evaluation Method
Following seminal works (e.g. by Kay et al. [61], Güçlü and van Gerven [62]), models are evaluated using the Pearson correlation R (between −1 and 1) between the model prediction x̂ and the true MEG signals x, for each channel and each time sample (after the initial state) independently⁴. When comparing the overall performance of the models, we average over all time steps after the stimulus onset, and over all MEG sensors, for each subject independently.
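Computed independently per channel and time sample, this metric can be written as follows (a numpy sketch with trials on the first axis; `pearson_r` is an illustrative helper):

```python
import numpy as np

def pearson_r(x_true, x_pred):
    """Pearson correlation across trials, computed independently for every
    channel and time sample.

    Inputs: (n_trials, n_channels, n_times) arrays.
    Returns: (n_channels, n_times) array of R values.
    """
    xt = x_true - x_true.mean(axis=0)
    xp = x_pred - x_pred.mean(axis=0)
    num = (xt * xp).sum(axis=0)
    den = np.sqrt((xt ** 2).sum(axis=0) * (xp ** 2).sum(axis=0))
    return num / den
```

Averaging the resulting map over post-stimulus time steps and sensors then gives a single score per subject, as described above.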

Feature Importance
To investigate what a model actually learns, we use Permutation Feature Importance [63], which measures the drop in prediction performance when the j-th input feature u^j is shuffled:

    ΔR_j = R(x, x̂) − R(x, x̂^{(j)})

where x̂^{(j)} is the prediction obtained after shuffling feature j across words. By tracking ΔR over time and across MEG channels, we can locate in time and space the contribution of a particular feature (e.g. word length) to the brain response.
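A minimal sketch of this permutation procedure (hypothetical helper names; `predict` and `score` stand for any fitted model and scoring function):

```python
import numpy as np

def feature_importance(predict, u, x_true, j, score, rng):
    """Permutation Feature Importance: drop in prediction score when input
    feature j is shuffled across trials.

    predict: maps stimulus features u (n_trials, ..., d_u) to predicted MEG;
    score: compares true and predicted signals (e.g. Pearson R).
    """
    base = score(x_true, predict(u))
    u_perm = u.copy()
    idx = rng.permutation(u.shape[0])
    u_perm[..., j] = u[idx, ..., j]               # shuffle only feature j
    return base - score(x_true, predict(u_perm))  # Delta R
```

A feature the model ignores yields ΔR ≈ 0, while a feature driving the prediction yields a large positive drop.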

Experiment
The model weights are optimized with the training and validation sets, while the penalization λ for the linear models (TRF and RTRF) is optimized with a grid search over five values distributed logarithmically between 10^−3 and 10^3.
Training of the DRE is performed with ADAM [64], using a learning rate of 10^−4 and PyTorch's default parameters [65] for the running averages of the gradient and its square. The training is stopped when the error on the validation set increases. In practice, the DRE and the DRE-PCA were trained over approximately 20 and 80 epochs, respectively.
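The stopping rule can be sketched as a generic loop (illustrative only; the actual training uses PyTorch and ADAM as stated above, and `step_fn` / `val_loss_fn` are hypothetical callbacks):

```python
def train_with_early_stopping(step_fn, val_loss_fn, max_epochs=200, patience=1):
    """Generic early-stopping loop: stop once the validation error stops
    improving.

    step_fn(): runs one training epoch; val_loss_fn(): returns the current
    validation loss. Returns (epochs run, best validation loss).
    """
    best, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        step_fn()
        loss = val_loss_fn()
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # validation error increased
                break
    return epoch + 1, best
```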

Statistics
Each subject score is obtained using the model prediction on held-out trials, using the learned subject embeddings as the "alignment function", similar to Chen et al. [66], Zhang et al. [67], Haxby et al. [68], Bazeille et al. [42], Richard et al. [43]. To test the reliability of our effects (e.g. prediction performance, feature importance, model comparison), we assess confidence intervals and p-values using a non-parametric Wilcoxon rank test across subjects.
When applicable, we correct these estimates for multiple comparisons using a false discovery rate (FDR) correction across time samples and channels. Note that subjects can be treated as independent observations to derive meaningful p-values, since the statistics are based on held-out data independent from the training set.

Noise Ceiling
Noise ceilings are typically estimated using batches of repeated conditions [69,70,62], to evaluate the maximal amount of explainable variance.This involves multiple presentations of the same stimulus characterized by a given feature set.In our case, however, sentences are presented only once to each subject.Further, we cannot control one of the variables input to our encoding models: namely, the baseline brain activity.

Experimental design
We analyze 99 subjects from the Mother Of Unification Studies (MOUS) dataset [71], who performed a one-hour reading task while being recorded with a 273-channel CTF MEG scanner. The task consisted of reading approximately 2,700 words flashed on a screen in rapid series. Words were presented sequentially in groups of 10 to 15, with a mean inter-stimulus interval of 0.78 s (min: 300 ms, max: 1,400 ms). Sequences were either random word lists or actual sentences (50% each). Both conditions were used here; however, this study does not investigate the differences between these two conditions. Out of the original 102 subjects, 3 were discarded because we could not reliably parse their stimulus channels.

Stimulus preprocessing
We focus on four well-known features associated with reading, namely word length (i.e., the number of letters in a word), word frequency in natural language (as derived by the wordfreq Python package [72], and measured on a logarithmic scale), and binary indicators for the first and last words of the sequence. At a given time t, each stimulus u_t ∈ ℝ^4 is therefore encoded with four values, fed to the models as a square function that is non-zero for the duration of the stimulus. Each feature is standardized to have zero mean and unit variance. Word length is expected to elicit an early (from t = 100 ms) visual response in posterior MEG sensors, whereas word frequency is expected to elicit a late (from t = 400 ms) left-lateralized response in anterior sensors. In the present task, word length and word frequency are correlated (R = −0.48).
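The square-function encoding of a word can be sketched as follows (`encode_word` is a hypothetical helper; features are assumed to be z-scored beforehand, and the feature order shown is illustrative):

```python
import numpy as np

def encode_word(features, n_times, onset, duration):
    """Encode one word as u_t in R^4: a square function holding the z-scored
    feature values for the duration of the stimulus, zero elsewhere.

    Assumed feature order: length, log-frequency, is-first, is-last.
    """
    u = np.zeros((n_times, 4))
    u[onset:onset + duration] = features  # non-zero only while the word is shown
    return u
```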

MEG Preprocessing
As we are primarily interested in evoked responses [73], we band-pass filtered between 1 and 30 Hz and downsampled the data to 120 Hz using the MNE software [74] with default settings: i.e. a FIR filter with a Hamming window, a lower transition bandwidth of 1 Hz with -6 dB attenuation at 0.50 Hz and a 7.50 Hz upper transition bandwidth with an attenuation of -6 dB at 33.75 Hz.
To limit the interference of large artefacts on model training, we use Scikit-learn's RobustScaler with default settings [75] to normalize each sensor using the 25th and 75th quantiles. Following this step, most of the MEG signals have a scale around 1. Since we observed a few large-scale outliers, we chose to reject any 3-second segment that contains amplitudes higher than 16 in absolute value (fewer than 0.8% of the time samples).
These continuous data are then segmented between 0.5 s before and 2 s after word onset, yielding a three-dimensional MEG tensor per subject: words, sensors, and time samples relative to word onset. For each subject, we form a training, validation, and test set using respectively 70%, 10%, and 20% of these segments, ensuring that two segments from different sets do not originate from the same word sequence, to avoid information leakage. This corresponds to 191K, 27K, and 53K segments for the train, validation, and test sets, respectively. Each segment has a spatial dimension of 273 sensors and a temporal dimension of 300 time points (2.5 s sampled at 120 Hz).
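A leakage-free split can be obtained by assigning whole word sequences, rather than individual segments, to each set. A hypothetical sketch:

```python
import numpy as np

def split_by_sequence(seq_ids, fractions=(0.7, 0.1, 0.2), seed=0):
    """Assign word segments to train/valid/test so that no word sequence is
    shared across sets (avoids leakage between overlapping segments).

    seq_ids: array mapping each segment to its sequence.
    Returns one boolean mask per set.
    """
    rng = np.random.default_rng(seed)
    sequences = rng.permutation(np.unique(seq_ids))
    n = len(sequences)
    cuts = np.cumsum([int(f * n) for f in fractions[:-1]])
    groups = np.split(sequences, cuts)          # disjoint sets of sequences
    return [np.isin(seq_ids, g) for g in groups]
```

Splitting at the sequence level matters here because consecutive 2.5 s segments within a sequence overlap in time, so a naive segment-level split would leak test data into training.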
For clarity, some figures use global field power (GFP) to summarize effects over time.GFP refers to the standard deviation across MEG channels of an average evoked response.

Model hyper parameters
We compare the three models introduced in Section 2.3 over n_s = 99 subjects. For the TRF, we use a lag on the control of τ_u = 250 time steps (about 2 s). This corresponds to the duration of the signal after the stimulus onset.
For the RTRF, we use τ_u = τ_h = 40 time steps. This lag is close to the minimum inter-word duration of 300 ms, and corresponds to the part of the initial MEG (i.e. 333 ms out of 500 ms, at 120 Hz) that is passed to the model to predict the 2.5 s MEG sequence during the evaluation.
For the DRE model, we use a subject embedding of dimension d_s = 16, a latent state of dimension d_h = 512, a kernel size K = 4, and a stride S = 2. The subject embeddings are initialized as Gaussian vectors with zero mean and unit variance, while the weights of the convolutional layers and the LSTM are initialized using the default "Kaiming" initialization [76]. Like its linear counterpart, the DRE is given the first 333 ms of the MEG signal to predict the complete 2.5 s of a training sample.

Ablation study
To investigate the importance of the different components of the DRE model, we implement an ablation study by fitting the model with all but one of its components. To this end, we compare the DRE to: i) the DRE without the 333 ms of pre-stimulus initial MEG data (DRE NO-INIT), ii) the DRE trained in the 40-dimensional PCA space used for the linear models (DRE PCA), iii) the DRE devoid of a subject embedding (DRE NO-SUBJECT), and iv) the DRE devoid of the convolutional auto-encoder (DRE NO-CONV).

| Code availability
The code developed for the present study is available at https://github.com/facebookresearch/deepmeg-recurrent-encoder.

| RESULTS

We first evaluate the DRE's ability to predict brain responses to written words presented in rapid serial visual presentation and measured with MEG, and compare these brain predictions to those of linear encoding models (TRF, RTRF). Then, we show with ablation experiments which elements of the DRE help address the challenges of rich dynamics and of inter-subject and inter-trial variability. Finally, we show how feature importance helps address the third challenge introduced above, namely: identifying the relationship between brain responses and stimulus features.
Modeling rich MEG dynamics: model comparison.
DRE outperforms the baseline TRF and RTRF models with up to a three-fold improvement (Figure 2). To provide a fair comparison between the models, we also compare the TRF to a NO-INIT DRE, i.e. to a DRE architecture that ignores the pre-stimulus MEG activity. The results show that DRE NO-INIT consistently outperforms the TRF (mean correlation score R = 0.077, averaged across all subjects, time samples, and channels; standard error of the mean across subjects: ±0.002, vs. R = 0.064 ± 0.002 for the TRF). This difference is strongly significant (p < 10^−17) under a Wilcoxon test across subjects. Similarly, DRE (R = 0.20 ± 0.003) significantly (p < 10^−17) outperforms RTRF (R = 0.10 ± 0.003) when both models are given the pre-stimulus MEG activity as input. To verify that this gain is not trivially accounted for by the limited dimensionality of the RTRF (trained on 40 principal components because of computational limitations), we trained the DRE on the same PCA-reduced data as the RTRF. The results confirm that DRE obtains a higher performance (R = 0.16 ± 0.003, p < 10^−16) than RTRF. Overall, these results suggest that DRE better models rich M/EEG dynamics than linear models.

Subject embeddings efficiently capture inter-individual variability
To evaluate the importance of the subject-embedding layer in capturing inter-individual variability, we trained the DRE without a subject embedding layer (DRE NO-SUB). The comparison between DRE and DRE NO-SUB reveals a clear difference (ΔR = 0.038, p < 10^−17). This result shows that the subject embedding layer efficiently re-aligns subjects' brain responses to model the dynamics specific to each subject or shared across subjects.

Recurrence efficiently captures inter-trial variability.
Brain responses to sensory input are known to vary with ongoing brain activity [29,77]. Recurrent models (RTRF, DRE) are thus well suited to capture such phenomena: initialized with 333 ms of pre-stimulus MEG, they can use basal brain activity to modulate the post-stimulus MEG predictions. Our results confirm this prediction: the TRF is outperformed by the RTRF (0.10 ± 0.003, p < 10^−16), with an average performance increment of ΔR = 0.03, and DRE (0.20 ± 0.003) outperforms DRE NO-INIT (0.077 ± 0.002, p < 10^−17), with an average performance increment of ΔR = 0.12.

DRE's recurrence specifically captures alpha-dependent evoked responses.
To further explore how DRE learns inter-trial variability, we investigate a well-known interaction between evoked responses and pre-stimulus activity. Specifically, brain responses to sensory input are known to be modulated by pre-stimulus oscillatory activity in the "alpha" frequency range (8-13 Hz) [29]. To test whether this phenomenon can be detected in the present dataset, we compared the average evoked responses to words for "high pre-stimulus alpha" versus "low alpha" trials, using a median split for each subject separately. The results (Figure 3) show an effect of up to 50 × 10⁻¹² T in fronto-temporal channels, peaking around 400 ms after word onset. Critically, while the single-trial predictions of DRE capture this phenomenon, neither TRF nor RTRF learns to modulate its evoked responses depending on alpha power (Figure 3B, bottom). This interaction between pre-stimulus alpha activity and evoked responses varies with the content of words and, more specifically, with their frequency in natural language: a factorial split between "high alpha" versus "low alpha"

F I G U R E 3 A: Brain responses obtained for all three models (TRF, RTRF, DRE) and the actual data (Truth), averaged over the test set for all 99 subjects. Qualitatively, all models learn the mean MEG response, although with variable precision. B: Difference of the average brain response between trials with high vs. low pre-stimulus alpha power (8-13 Hz). This analysis shows that DRE modulates the predicted brain response as a function of pre-stimulus alpha power. C: Global Field Power (GFP) of the brain response as a function of pre-stimulus alpha power and word frequency. Qualitatively, all models capture the stimulus modulation of the brain response, but only the DRE captures the alpha modulation, which varies with latency. D: GFP of the brain response averaged before the peak (0-100 ms), during the ramp-up of the peak (100-500 ms), and after the peak of the evoked response (500-800 ms). The number of stars indicates the level of significance (**: 10⁻², ***: 10⁻³, ****: 10⁻⁴). The marginal pre-stimulus alpha effect (light vs. dark) is followed by the marginal word frequency effect (black), and then by an interaction (grey) between pre-stimulus alpha and word frequency (red vs. blue). Only DRE captures the same effects as in the actual data.
trials and "high word frequency" versus "low word frequency" trials resulted in both main and interaction effects (Figure 3C-D). Specifically, the comparison between these 2 × 2 conditions reveals three main phases. First, a main effect of alpha can be observed before the evoked response (light vs. dark lines in Figure 3C, p < 10⁻³). Second, the main effect of word frequency becomes significant from ≈ 200 ms (blue vs. red lines, p < 10⁻⁴). Finally, the main effect of alpha starts to fade after ≈ 500 ms (p > 10⁻² for high-frequency words), but its interaction with the stimulus remains significant (p < 10⁻²). Critically, DRE learns these interactions between pre-stimulus alpha power and stimulus responses, while the linear models do not.

Interpreting nonlinear and/or high-dimensional models is notoriously challenging [78]. This issue poses strong limitations on the application of deep learning to neural recordings, where interpretability remains a major goal [3, 79].
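For concreteness, the per-subject median split on pre-stimulus alpha power used above can be sketched as follows (a minimal numpy illustration on synthetic data; the sampling rate, window length, and array shapes are placeholders, not those of the actual dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic single-subject data: trials x channels x time samples.
sfreq, n_baseline = 120, 200          # Hz, pre-stimulus samples (placeholders)
meg = rng.standard_normal((50, 8, 500))

# Alpha (8-13 Hz) power in the pre-stimulus window, averaged over channels.
freqs = np.fft.rfftfreq(n_baseline, d=1 / sfreq)
psd = np.abs(np.fft.rfft(meg[:, :, :n_baseline], axis=-1)) ** 2
alpha = psd[:, :, (freqs >= 8) & (freqs <= 13)].mean(axis=(1, 2))

# Median split within the subject, then average the evoked responses.
high = meg[alpha >= np.median(alpha)].mean(axis=0)
low = meg[alpha < np.median(alpha)].mean(axis=0)

# Global Field Power (std across channels) of the high-minus-low difference.
gfp = (high - low).std(axis=0)        # one value per time sample
print(gfp.shape)  # (500,)
```

The same split by word frequency, crossed with the alpha split, gives the 2 × 2 factorial design analyzed above.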
While DRE faces the same types of issues as any deep neural network, we show below that a simple feature importance analysis of the predictions of this model (as opposed to its parameters) yields results that are consistent with those obtained by linear models and with those described in the neuroscientific literature (cf. Section 2.5).
Feature importance quantifies the loss of prediction performance ∆R when a single feature is shuffled across words, as compared to a non-shuffled prediction. Here, we focus our analysis on word length and word frequency, as these two features have been repeatedly associated with early sensory responses in the visual cortex and late lexical responses in the temporal cortex, respectively [80, 81]. As expected, the feature importance for word length in Figure 4 peaked around 150 ms in posterior MEG channels, whereas the feature importance of word frequency peaked around 400 ms in fronto-temporal MEG channels, for both the TRF and the DRE models. Furthermore, we recover a second phenomenon known from the literature: the lateralization of lexical processing in the brain. Indeed, for word frequency, Figure 4 shows an effect similar in shape across hemispheres, but significantly higher in amplitude in the left hemisphere (e.g. p < 10⁻¹⁰ in the frontal region and p < 10⁻¹² in the temporal region for the DRE).
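This shuffling procedure can be sketched in a few lines (a hypothetical `predict` function stands in for a trained encoder; the toy data are only meant to show that an informative feature yields ∆R > 0 while an ignored one yields ∆R ≈ 0):

```python
import numpy as np

def pearson_r(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum()))

def permutation_importance(predict, X, y, j, rng):
    """Delta-R for feature column j: score of the intact prediction minus
    the score obtained after shuffling column j across words (trials)."""
    base = pearson_r(predict(X).ravel(), y.ravel())
    X_shuffled = X.copy()
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    return base - pearson_r(predict(X_shuffled).ravel(), y.ravel())

# Toy encoder whose output depends only on feature 0 (e.g. word length).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))                  # words x features
y = np.outer(X[:, 0], np.ones(10))                 # words x time samples
y = y + 0.1 * rng.standard_normal(y.shape)
predict = lambda X: np.outer(X[:, 0], np.ones(10))

print(permutation_importance(predict, X, y, 0, rng) > 0.5)   # informative feature
print(permutation_importance(predict, X, y, 1, rng) == 0.0)  # ignored feature
```

Because only the model's inputs and outputs are involved, the same procedure applies unchanged to the TRF and the DRE.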
These results suggest that, in spite of being high dimensional and nonlinear, DRE can be interpreted similarly to linear models in the present context.

| DISCUSSION
The present study demonstrates that DRE outperforms several of its linear counterparts at predicting MEG time series.
In particular, it addresses the three challenges introduced above. First, the complex, nonlinear and non-stationary dynamics can be efficiently modeled by deep convolutional LSTM layers. Second, inter-trial and inter-individual variability can be addressed with recurrence (i.e. MEG-INIT) and subject embeddings, respectively. Finally, the relationship between stimulus features and brain responses can be interpreted in light of a permutation-based feature importance analysis.
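The subject-embedding component can be illustrated with a minimal sketch (the embedding dimension, the random table standing in for learned weights, and the `model_input` helper are all hypothetical): each subject is assigned a learned vector that is concatenated to the stimulus features at every word, so a single network can be trained across all subjects while absorbing inter-individual variability in a low-dimensional code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, emb_dim, n_features = 99, 16, 2

# Stand-in for a learned embedding table (one row per subject).
embeddings = rng.standard_normal((n_subjects, emb_dim))

def model_input(stim_features, subject_id):
    """Append the subject's embedding to each word's stimulus features."""
    n_words = stim_features.shape[0]
    emb = np.tile(embeddings[subject_id], (n_words, 1))
    return np.concatenate([stim_features, emb], axis=1)

x = model_input(rng.standard_normal((5, n_features)), subject_id=3)
print(x.shape)  # (5, 18): 2 stimulus features + 16 embedding dimensions
```

In the actual model the embedding table is trained jointly with the network weights rather than fixed.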
Overall, the present study shows that the gain in prediction performance obtained by deep learning algorithms need not come at the price of interpretability. Indeed, we show here that DRE can be probed a posteriori to reveal how evoked responses relate to each stimulus feature and/or to pre-stimulus brain activity. This feature importance analysis supplements ongoing efforts to open black-box models of brain activity. For example, Güçlü and van Gerven [62] used a recurrent neural network to predict fMRI recordings and quantified the impact of stimulus features by correlating them with the model's hidden state. Similarly, Keshishian et al. [5] analyzed the activations of a deep convolutional network with TRF to show how they capture "dynamical receptive fields". In both cases, however, these post-hoc analyses rely on (i) linear assumptions and (ii) the inner activations of the model. By contrast, the permutation feature importance used here focuses on probabilistic dependencies between input features and the model's predictions, which generalize the linear dependencies measured by correlation [63]. This approach can thus be applied to any black-box predictive model.

Deep neural networks have not yet emerged as the go-to method for neuroimaging [82, 83]. Nevertheless, several studies have successfully used deep learning architectures to model brain signals [84, 85, 86, 87, 45]. In particular, deep nets trained on natural images [88, 89, 90], sounds [5], or text [91] are increasingly used as encoding models for predicting neural responses to sensory stimulation [6, 92, 7, 8]. Conversely, deep nets have also been trained to decode sensory-motor signals from neural activity, successfully reconstructing text [93] from electrocorticography (ECoG) recordings, or images [94] from functional magnetic resonance imaging (fMRI). Despite these successes, we would argue that a wider impact of deep learning in human cognitive neuroscience is possibly limited by a combination of factors: (i) low signal-to-noise ratio, (ii) small datasets, and (iii) a lack of high temporal resolution, where nonlinear dynamics may be the most prominent. The present experimental results make a step toward addressing these issues, and could thus open an avenue towards leveraging the many existing shorter naturalistic-stimulus datasets collected on many subjects. This could be an alternative to making new long recordings of many hours of data from a handful of subjects [95].
While the DRE's architecture may be efficient at handling the dynamical structure of brain data, the dynamics assessed in this study are driven by specific linguistic features (i.e. word length and word frequency). By contrast, recent ECoG and MEG studies have used more complex word features, represented as activations of deep networks pretrained on visual or language tasks [96, 97, 98], and then predicted the brain response in a way that is unaware of the dynamics (using a linear classifier for each time sample independently). Given the successes independently observed with these two approaches, a natural extension of this work would be to combine them and learn to map complex stimuli to brain responses using (1) rich representations of the stimuli (such as the activations of a pretrained deep network), followed by (2) a rich dynamical model such as the DRE. It is, however, worth pointing out that this approach would naturally lead to higher-dimensional parameter spaces, which would require larger datasets to limit potential overfitting.
The present work is based on a deterministic, predictive deep learning framework. Other, complementary approaches such as Hidden Markov Models (HMMs) and Gaussian Processes (GPs) have been proposed to model brain data in a probabilistic framework. Such approaches have been exploited to explore the spatio-temporal statistics of fMRI or MEG data [99, 100], but also in an encoding context [101, 102]. In particular, [99] combines GPs with dynamical system modeling to account for MEG responses to tactile input and shows that this captures meaningful modulations of oscillatory activity. Such an approach may offer a promising avenue to further clarify the interaction between baseline alpha oscillations and visual responses captured by DRE (see Figure 3). Similarly, [102] shows that Gaussian modeling efficiently learns to predict fMRI responses to visual stimuli and, importantly, can be inverted to achieve zero-shot decoding of individual characters. By contrast, our encoding approach would require additional fine-tuning to transfer DRE to a novel decoding task. Overall, a key advantage of probabilistic models like GPs is their ability to quantify uncertainty in the predictions, which in the present forecasting scenario would likely increase at late latencies. While the proposed approach does not offer this possibility, the present study benefits from the highly optimized ecosystem of deep learning, which allows us to efficiently deal with the large size of raw MEG data (273 MEG channels sampled at 1,200 Hz and recorded for 60 min in 99 subjects).
It is worth noting that, because the losses used to train the models in this paper are MSEs evaluated in the time domain, the TRF, RTRF, and DRE are solely trained and evaluated on their ability to predict the amplitude of the neural signal at each time sample. Consequently, the objective focuses exclusively on "evoked" activity, i.e. neural signals that are strictly phase-locked to stimulus onset or to past brain signals [73]. A quick time-frequency decomposition of the models' predictions suggests that none of them captures "induced" activity, e.g. changes in the amplitude of neural oscillations with a non-deterministic phase. A fundamentally different loss would be necessary to capture such non-phase-locked neural dynamics.
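This distinction is easy to demonstrate numerically (a toy simulation, unrelated to the actual dataset): averaging trials in the time domain, which is the best an MSE-trained model can reproduce, preserves a phase-locked "evoked" oscillation but cancels an "induced" oscillation of identical amplitude whose phase varies across trials.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(0, 1, 1 / 200)          # 1 s at 200 Hz
n_trials = 500

# Evoked: 10 Hz oscillation with a fixed phase on every trial.
evoked = np.sin(2 * np.pi * 10 * t)
# Induced: same frequency and amplitude, but a random phase per trial.
phases = rng.uniform(0, 2 * np.pi, n_trials)[:, None]
trials = evoked + np.sin(2 * np.pi * 10 * t + phases)

# The trial average retains the evoked part; the induced part averages out.
avg = trials.mean(axis=0)
print(np.abs(avg - evoked).max() < 0.2)  # True: induced activity has vanished
```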
As in many other scientific disciplines, deep neural networks will undoubtedly complement, if not displace, the myriad of analytical pipelines developed over the years, moving the field toward standardized end-to-end modeling. While such methodological developments may improve our ability to predict how the brain responds to various exogenous stimuli, the present attempt already highlights the many challenges that this approach entails. Nevertheless, the present results hopefully demonstrate that deep networks are a relevant technology to capture the complex neural dynamics collected non-invasively by MEG, and likely also by EEG. Indeed, part of the variance in the signal is explained by noise alone, whose amplitude is on a scale ten times larger than that of the evoked response. However, the ordering of model performance and the conclusions are left unchanged.

Note on convolution and computational efficiency
The introduction of convolutional layers is here mainly motivated by computational efficiency: convolutional layers reduce training time on an NVIDIA V100 GPU from 2.6 h for a DRE devoid of convolutional layers down to 1.4 h for our DRE. The performance of these two models is similar, although with a slight benefit in favour of the DRE (NO-CONV: 0.194 ± 0.003; DRE: 0.197 ± 0.003; p < 10⁻⁵).
F I G U R E 2 A: Model Comparison. Predictive performance of the different models over 99 subjects (dots; chance = 0). Boxplots indicate the median and interquartile range, while gray lines follow each subject. The gray brackets indicate the p-values obtained with pair-wise Wilcoxon comparisons across subjects. Initialization and nonlinearity increase predictive performance. B: Ablation study of our model (DRE). Mean Pearson R obtained by removing a component (convolutional auto-encoder or subject embedding) from, or adding an element (PCA embedding composed with the convolutional auto-encoder) to, the DRE architecture and retraining the model, together with the p-values obtained by comparing the resulting scores to those of the normal DRE. Convolutions marginally impact the performance of the DRE (they are essentially used for speed), but training over all MEG sensors (vs. in PCA space) and using subject embeddings designed for realignment are responsible for an increase of approximately 4%.

F I G U R E 4 Feature importance helps interpret the links between brain responses and stimulus features. Permutation importance (∆R) of word length (left column) and word frequency (right column), as a function of spatial location (color-coded by channel position; see top-left rainbow topography) and time relative to word onset, for the two main encoding models (rows). The amplitude of each line represents the mean across words and subjects for each channel. An FDR-corrected Wilcoxon test across subjects assesses significance for each channel and time sample independently. Non-significant effects (p-value higher than 5%) are displayed in black. Overall, this feature importance analysis confirms that early visual and late frontal responses are modulated by word length and word frequency, respectively, as expected.

Figure 5 reports an alternative scoring of the main models using the R² score instead of the Pearson correlation R. The R² score is interpretable as explained variance: it is upper bounded by 1, unlike the MSE, which depends on the scaling of the input data.
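The difference between the two metrics can be made concrete with a small sketch (toy data; these helpers mirror the standard definitions, not the paper's exact evaluation code): the Pearson R is invariant to any positive rescaling of the prediction, whereas R² penalizes amplitude errors and can become arbitrarily negative.

```python
import numpy as np

def pearson_r(y_true, y_pred):
    yt, yp = y_true - y_true.mean(), y_pred - y_pred.mean()
    return float((yt * yp).sum() / np.sqrt((yt ** 2).sum() * (yp ** 2).sum()))

def r2_score(y_true, y_pred):
    # Explained variance: 1 - residual sum of squares / total sum of squares.
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)

rng = np.random.default_rng(0)
y = rng.standard_normal(1000)
pred = 3 * y  # right waveform, wrong gain

print(round(pearson_r(y, pred), 6))  # 1.0: R ignores the scaling error
print(r2_score(y, pred) < 0)         # True: R2 heavily penalizes it
```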