Estimating smooth and sparse neural receptive fields with a flexible spline basis

Spatio-temporal receptive field (STRF) models are frequently used to approximate the computation implemented by a sensory neuron. Typically, such STRFs are assumed to be smooth and sparse. Current state-of-the-art approaches for estimating STRFs based on empirical Bayes are often not computationally efficient in high-dimensional settings, as encountered in sensory neuroscience. Here, we pursued an alternative approach and encoded prior knowledge for STRF estimation by choosing a set of basis functions with the desired properties: natural cubic splines. Our method is computationally efficient and can be easily applied to a wide range of existing models. We compared the performance of spline-based methods to non-spline ones on simulated and experimental data, showing that spline-based methods consistently outperform their non-spline versions.


| INTRODUCTION
Spatio-temporal receptive fields (STRFs) are frequently used in neuroscience to approximate the computation implemented by a sensory neuron. Such models consist typically of one or more linear filters summing up sensory inputs across time and space, followed by a static nonlinearity and a probabilistic output process [1]. For this, different distributions can be used depending on the type of recorded neuronal responses.
The simplest way to estimate a STRF for a given data set is arguably to compute the spike-triggered average (STA), the average over all stimuli preceding a spike [2]. However, the results of the spike-triggered analysis are often noisy due to limited data and prone to be distorted when the neuron is stimulated with correlated noise or natural stimuli [3]. The maximum likelihood estimate (MLE), also known as the whitened STA (wSTA) [4], does not suffer from such shortcomings, but it requires even more data to converge, as the whitening procedure tends to amplify noise at frequencies with low power, making it a less preferred option in many experimental settings. A modified wSTA that removes low-power frequencies can result in smoother STRFs, where the threshold for the frequency cutoff is normally chosen by cross-validation [4].
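In matrix notation, with stimulus design matrix X and response y, the two estimators above each reduce to one line of linear algebra. A minimal numpy sketch (function names are ours, not RFEst's API):

```python
import numpy as np

def sta(X, y):
    """Spike-triggered average: response-weighted average of the stimuli.
    X: (n_samples, n_pixels) stimulus design matrix, y: (n_samples,) spike counts."""
    return X.T @ y / y.sum()

def wsta(X, y):
    """Whitened STA (the maximum likelihood estimate under the Linear-Gaussian
    model): decorrelates the stimulus before averaging by solving (X'X) w = X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```

For white noise stimuli the two estimators agree up to scaling; for correlated stimuli only the wSTA is unbiased, at the cost of amplifying noise at low-power frequencies.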
It has been suggested that these shortcomings can be overcome by computing the maximum a posteriori (MAP) estimates within the framework of Bayesian inference [5]. Under this framework, prior knowledge about STRFs such as sparsity, smoothness [6], and locality [7] can be encoded into a prior covariance matrix, whose hyperparameters are learnt automatically from the data via evidence optimization (also known as empirical Bayes, or Type II maximum likelihood in the literature [8]). These algorithms thus provide STRF estimates with the desired properties even with little or noisy data. Although elegant in theory, the evidence optimization step of this framework is computationally costly: the cubic complexity in the dimensionality of the parameter space renders it difficult to scale to high-dimensional settings as often encountered in sensory neuroscience [5]. Also, current popular implementations of MAP estimates work mostly under the Linear-Gaussian encoding model, which assumes additive Gaussian noise on the neural response, and are restricted to simple one-filter models.
An alternative approach to encoding prior knowledge for estimating STRFs is to parameterize the STRF with a set of basis functions with the desired properties. In this way, the model fitting becomes much more efficient as the number of parameters to be optimized is significantly reduced, and some degree of smoothness is automatically enforced.
Previous studies used a raised-cosine basis (Figure 1A) for 1D retinal ganglion cell STRFs in macaque monkey [9] and Gaussian spectral or Morlet wavelet bases for 2D auditory STRFs in ferrets [10]. These basis functions were chosen to capture certain STRF properties in their corresponding studies, which makes them relatively inflexible and hard to generalize to other settings. Besides, they are typically controlled by multiple hyperparameters, which makes the model selection process burdensome and time-consuming.
Here, we propose to use an alternative basis for receptive field estimation: natural cubic splines ( Figure 1B).
These are known to be the smoothest possible interpolant and can be easily extended to high-dimensional settings with minimal assumptions and few hyperparameters [11]. This spline basis can be incorporated into a wide range of existing receptive field models (Figure 2), such as the Linear-Gaussian (LG) and the Linear-Nonlinear-Poisson (LNP) model [12,9] for single STRF estimation, as well as Linear-Nonlinear-Linear-Nonlinear (LNLN) cascade models [13,14] and spike-triggered clustering methods [14,15] for subunit estimation. When combined with proper regularization, smooth and sparse STRFs can be retrieved with very little data and in a computationally efficient manner, while their predictive performance reaches state-of-the-art levels. Furthermore, we introduce diagnostic tools for assessing the quality of an estimated STRF: (1) a significance test based on the analytical solution of the STRF covariance matrix and (2) a permutation test based on model prediction. We provide all developed methods as part of the RFEst Python toolbox.

F I G U R E 1 Illustration of the raised-cosine and natural cubic spline bases. A: A typical raised-cosine basis as in [9], controlled by several parameters, such as the number of basis functions, the degree of nonlinearity and the placement of endpoints. The choice of these parameters determines the final goodness of fit. B: An equally spaced 1D natural cubic spline basis, controlled by only one parameter: the number of basis functions (aka the degrees of freedom). C: Fitted 1D filters using different bases and different sets of parameters. D: An example of 2D natural cubic splines. Left: Two selected bases of the 2D natural cubic splines. Middle: Weighted bases after fitting to a simulated 2D STRF. Right: An example of a fitted 2D STRF overlaid on the ground truth.
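For concreteness, a 1D natural cubic spline basis with equally spaced knots can be built from the classical truncated-power construction (Hastie et al., The Elements of Statistical Learning, Section 5.2.1). This is a sketch of the kind of basis described in the text, not RFEst's exact implementation; knot placement and scaling may differ:

```python
import numpy as np

def natural_cubic_basis(x, df):
    """Natural cubic spline basis evaluated at points x, with df equally
    spaced knots (and hence df basis functions). Beyond the boundary knots
    the functions are linear, which is what makes the natural spline the
    smoothest interpolant."""
    x = np.asarray(x, float)
    knots = np.linspace(x.min(), x.max(), df)  # equally spaced, as in the paper

    def d(k):
        # truncated cubic differences, Eq. 5.5 in Hastie et al.
        return (np.maximum(x - knots[k], 0) ** 3
                - np.maximum(x - knots[-1], 0) ** 3) / (knots[-1] - knots[k])

    cols = [np.ones_like(x), x]                     # constant and linear terms
    cols += [d(k) - d(df - 2) for k in range(df - 2)]
    return np.column_stack(cols)                    # shape (len(x), df)
```

The single hyperparameter is `df`, the number of basis functions, matching the claim in the text that only one hyperparameter needs to be chosen.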

| RESULTS
We developed the RFEst Python toolbox for spline-based STRF estimation in various models, including the single-filter LG and LNP models as well as the multi-filter LNLN cascade model (Figure 2). We measured the performance of the spline-based methods and compared it with their non-spline versions or other previous methods, using simulated data as well as neural recordings from the retina and the visual cortex. To quantify the goodness of fit on the simulated benchmark data, we measured the mean squared error (MSE) between the ground-truth STRF and the STRFs estimated by the various methods, and studied its dependence on the amount of available data. For the experimental data, we computed the correlation between the predicted and measured responses on a held-out test set as a performance measure.

F I G U R E 2 Overview of the different spatio-temporal receptive field (STRF) models used in this study. A: Linear-Gaussian (LG) encoding model: y = Xw + ε, ε ∼ N(0, σ²), where X is the stimulus design matrix, y the response and w the STRF. B: Linear-Nonlinear-Poisson (LNP) model: y ∼ Poisson(f(Xw)), where f is the filter nonlinearity. C: Linear-Nonlinear-Linear-Nonlinear-Poisson (LNLN) model: y ∼ Poisson(g(f(Xw))), where g is the output nonlinearity. Optionally, an autoregressive term convolving the previous response with a response-history filter can be added before the nonlinearity to improve the model's predictive performance.

| Estimating receptive fields from simulated LG, LNP and LNLN models
First, we simulated the responses y of a sensory neuron with additive Gaussian noise, using a STRF with 30 time frames and 40 pixels in space, in response to white and pink noise stimuli of various lengths (see Methods 4.1 for details). We then fit LG models (Figure 2A) to these responses using various STRF estimation techniques.
We measured the MSEs between the ground truth STRF and the different STRF estimates with respect to the ratio of the number of samples n to the number of parameters d = 30 × 40, averaged over 10 different random seeds ( Figure 3A). As expected, the STA and wSTA approached the ground truth only in high data-to-parameters regimes.
For limited data (e.g. n/d=4), the estimated STRF showed substantial background noise ( Figure 3B). In the case of pink noise, the STA was in fact never able to recover the STRF due to the high correlation between pixels in the stimulus [3] ( Figure 3C-D).
We also applied more advanced techniques to estimate the STRF to the simulated data, such as automatic smoothness determination (ASD, aka MAP with smoothness prior [6]) and automatic locality determination (ALD, aka MAP with locality prior [7]). These estimates were much closer to the ground truth already with a relatively small amount of data, with ALD outperforming ASD for both white and pink noise ( Figure 3). However, their computation time for estimating the optimized prior was long due to the high computational complexity.
We then applied a spline-based version of the wSTA (SPL wSTA), with and without L1 regularization on the basis function coefficients (see Methods). This simple change of basis led to a STRF estimator which performed very well for all n-d-ratios tested, even better than ASD and ALD, especially when using L1 regularization (Figure 3). At the same time, this estimator was also computationally very efficient, as the number of coefficients to be estimated was reduced to the number of basis functions b (for this example, the STRF had d = 30 × 40 = 1200 parameters, whereas the number of basis functions was only b = 9 × 12 = 108).

We next simulated responses y from an LNP model (Figure 2B) with Poisson noise, similar to before, using white and pink noise for stimulation. For simplicity, we used an exponential nonlinearity and treated it as known during the estimation procedure. We estimated the linear filter from different amounts of data using the STA, wSTA, ASD and ALD, as well as spline and non-spline LNP models fitted via gradient descent with L1 regularization (referred to as LNP+SPL and LNP, respectively). Note that the original ASD and ALD were designed not for the LNP but for the LG model (though some progress has been made for the former [16]). Despite this model mismatch, they still recovered good estimates of the STRF (as demonstrated in the original papers [6,7]).
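The change of basis underlying the SPL wSTA amounts to solving the least-squares problem in the b-dimensional coefficient space rather than the d-dimensional pixel space. A hypothetical sketch (the function name is ours):

```python
import numpy as np

def spline_wsta(X, y, S):
    """Spline-parameterized whitened STA: solve the Linear-Gaussian MLE in
    basis-coefficient space. With w = S @ beta, the normal equations shrink
    from d x d to b x b. X: (n, d) design matrix, y: (n,) response,
    S: (d, b) spline basis matrix with b << d. Returns the (d,) filter."""
    XS = X @ S                                   # project stimulus onto the basis
    beta = np.linalg.solve(XS.T @ XS, XS.T @ y)  # b x b system instead of d x d
    return S @ beta                              # map back to pixel space
```

Because the solve involves a b x b rather than a d x d covariance matrix, the cubic cost of the inversion drops accordingly, which is the efficiency gain described in the text.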
Again, we found that different methods resulted in estimates of different quality depending on the available amount of data (Figure 4). Here, in contrast to the LG model, we changed the x-axis of Figure 4 […].

We finally investigated how recently proposed subunit models (Figure 2C), in particular an LNLN cascade model [13,17] and spike-triggered clustering or non-negative matrix factorization methods [15,14], can be made more data-efficient. The assumption behind the spike-triggered clustering methods is that each spike can be attributed to a specific subunit of a STRF, so that each subunit can be viewed as an STA of a cluster of spikes. If this assumption holds, then a spline-approximated STA should serve as a better centroid for clustering, as the smoothed STA is closer to the ground-truth subunit STRF. We used k-means clustering in place of the previously published soft-clustering method [15], and used a multiplicative update rule [18] for semi-nonnegative matrix factorization (semi-NMF) [14].
Moreover, we modified the previously published semi-NMF [14] and imposed the nonnegative constraint on the weight matrix instead of the subunit matrix, which allowed us to retrieve antagonistic STRFs.
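The clustering idea above can be sketched with plain k-means on the spike-triggered ensemble, where each cluster centroid serves as an STA-like estimate of one subunit. This is a minimal stand-in for the published soft-clustering method [15], not RFEst's implementation; a spline-smoothed variant would additionally project each centroid onto the spline basis:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def subunits_by_kmeans(X, y, k, seed=0):
    """Estimate k subunit filters by clustering the spike-triggered ensemble.
    X: (n, d) stimuli, y: (n,) spike counts, k: assumed number of subunits.
    Returns a (k, d) array of centroid filters."""
    ste = np.repeat(X, y.astype(int), axis=0)  # one row per spike
    centroids, _ = kmeans2(ste, k, minit='++', seed=seed)
    return centroids
```

The assumption that each spike is attributable to one subunit means the spike-triggered stimuli form a mixture whose component means are the (scaled) subunit filters, which is exactly what k-means recovers.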
We simulated responses y with Poisson noise from the model by stimulating two 20x20 pixels antagonistic center-surround subunits with white noise stimuli of different lengths. Both filter outputs first went through an exponential nonlinearity, were then summed up, and finally went through a softplus output nonlinearity.
We estimated subunit STRFs using k-means clustering or semi-NMF of the spike-triggered ensemble, and by maximum-likelihood estimation of the LNLN model directly, with or without a spline basis (Figure 5). Similar to the results from the previous simulations, the subunit STRFs retrieved from spline-based models were generally much more similar to the ground truth than those from models without splines, and required a smaller amount of data to achieve a similar level of performance. For example, when using a spline basis, we were able to retrieve the two subunit filters at high quality using only 4 minutes of data. With this amount of data, all other methods still showed essentially only noise with a hint of signal in the estimate of the linear filter.
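The direct maximum-likelihood fit of a spline-parameterized LNLN model minimizes the Poisson negative log-likelihood over the k sets of basis coefficients. A sketch under the simulation's assumptions (exponential subunit nonlinearity, softplus output nonlinearity); we use L-BFGS instead of plain gradient descent and omit the L1 term for brevity:

```python
import numpy as np
from scipy.optimize import minimize

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0)

def fit_spline_lnln(X, y, S, k, seed=0):
    """MLE of rate = softplus(sum_j exp(X @ S @ beta_j)), y ~ Poisson(rate).
    Optimizes k * b basis coefficients jointly and returns (k, d) filters.
    A sketch of the direct LNLN fit described in the text."""
    b = S.shape[1]
    XS = X @ S  # precompute the (n, b) basis-projected stimulus

    def nll(theta):
        B = theta.reshape(k, b)
        drive = np.exp(np.clip(XS @ B.T, -20, 20)).sum(axis=1)
        rate = softplus(drive) + 1e-8  # keep log well-defined
        return np.sum(rate - y * np.log(rate))

    theta0 = 0.01 * np.random.default_rng(seed).standard_normal(k * b)
    res = minimize(nll, theta0, method='L-BFGS-B')
    return (S @ res.x.reshape(k, b).T).T
```

Because only k·b coefficients are optimized instead of k·d, both the per-iteration cost and the amount of data needed shrink, matching the efficiency gains reported for the spline LNLN fits.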
It is worth noting that the computational efficiency of LNLN estimation was improved considerably by using a spline basis, as the number of coefficients to be estimated was significantly reduced. In contrast, the computation time for both spike-triggered clustering methods actually increased when incorporating splines, because the modified versions add computational steps to the original algorithms (see Methods 4.6).

| Comparing computation time between different methods and model selection

For the wSTA and MAP estimates, the most time-consuming step is the d × d covariance matrix inversion. SPL, on the other hand, was less time-consuming than wSTA and MAP, even though the time complexity was still cubic. The improvement was due to the reduced number of model parameters obtained by parameterizing the STRF with the spline basis (b coefficients), which resulted in a much smaller covariance matrix to invert, thereby speeding up the most time-consuming step of the algorithm substantially. Thus, spline-based STRF estimators offer the best available compromise between STRF estimation quality and computation time.
The total amount of time spent on finding an optimal STRF depends on many factors, such as how the model is initialized and the choice of learning rate and stopping criteria for gradient descent optimization. Importantly, the way hyperparameters are selected also contributes significantly: most commonly, the model prediction performance is monitored on a held-out data set over a discrete grid of hyperparameters. Even for evidence optimization methods, whose main selling point is to learn the optimal hyperparameters automatically, cross-validation can still be used to assess model goodness of fit and generalization performance [6,19]. For spline-based GLMs, this requires refitting the same model as often as there are hyperparameter sets. Given that the splines are equally spaced, a good strategy to accelerate the process is to measure the performance of the SPL wSTA over a grid of candidate numbers of basis functions, from very few to many, choose the value at which the performance starts to plateau, and then fit SPL+L1 with gradient descent to retrieve a sparse STRF.
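The plateau strategy for choosing the number of basis functions can be written as a small search routine. This is a sketch of the strategy described in the text; the tolerance `tol` and the `score` callback (which should fit an SPL wSTA with the given df and return validation correlation) are our own illustrative choices:

```python
import numpy as np

def pick_df(candidate_dfs, score, tol=0.01):
    """Scan candidate degrees of freedom from few to many and return the
    smallest df whose validation score is within `tol` of the best seen,
    i.e. the point where performance starts to plateau."""
    scores = np.array([score(df) for df in candidate_dfs])
    best = scores.max()
    for df, s in zip(candidate_dfs, scores):
        if s >= best - tol:
            return df
    return candidate_dfs[-1]
```

Since the SPL wSTA is cheap to compute (a b x b solve per candidate), scanning the whole grid this way is far faster than cross-validating the full gradient-descent fit for every hyperparameter set.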

| Diagnostic tools
In addition, even if a well-regularized model is chosen according to predictive performance, additional diagnostic tools can be useful to decide whether a certain STRF is reasonable and shows a clear structure (Figure 7). We devised three diagnostic tools that can be employed to this end: (1) the confidence interval (CI) of the fitted STRF and its corresponding p-value based on Wald statistics (with the null hypothesis that all STRF coefficients are zero); (2) the gap between the prediction performance on the training and validation sets; and (3) a permutation test comparing the prediction performance on the validation set to chance level.

To demonstrate their usefulness, we investigated a ground-truth 3D STRF under a Linear-Gaussian model and showcased four different scenarios (Figure 7). First, we looked at a well-fitted STRF (Figure 7B). Here, the CI was narrow and the STRF significantly different from an all-zero STRF (p<0.001), the gap between the training and validation prediction metrics was small, and the prediction on the validation set was better than chance (p<0.001). The confidence interval and its corresponding p-value were determined from the training data. Next, we looked at scenarios where the amount of data was small or the signal-to-noise ratio (SNR) was low (Figure 7C and D).

F I G U R E 7 Examples of diagnosing STRF quality with confidence interval, p-value and prediction performance. A: Ground-truth STRF with dimensions 25x25x25. The 3D STRF is the Kronecker product of the respective temporal and spatial RFs. B: A well-fitted STRF with sample size equal to d. (i) The estimated spatial RF at the minimum of the temporal RF, indicated by the gray vertical dotted line in (ii); (ii) the estimated temporal RF; (iii) the change of the correlation between the data and the predicted response over all gradient descent iterations; (iv) the prediction performance on the permuted validation set. C: Same as B but with half the amount of data. D: Same as B but with stronger additive Gaussian noise. E: Same as B but with the simulated responses permuted.
Here, the STRF did not pass the significance test due to a lack of statistical power, but the model predictions on the validation set could still be better than chance. Additionally, if the gap between the performance metric on the training and validation sets gets too large, it is advisable to refit the model with stronger regularization. Finally, in the case of data with extremely low SNR (Figure 7E), which we generated by randomly permuting the simulated response, the prediction performance of the fitted STRF was at chance level according to all indicators.
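The permutation diagnostic can be sketched in a few lines: compare the observed prediction correlation against the null distribution obtained by correlating the prediction with permuted responses. A minimal sketch, with our own function name and defaults:

```python
import numpy as np

def permutation_pvalue(y_true, y_pred, n_perm=1000, seed=0):
    """Permutation test for whether model predictions beat chance.
    Returns (observed correlation, p-value), where the p-value is the
    fraction of permuted-response correlations at least as large as the
    observed one (with an add-one correction to keep p strictly positive)."""
    rng = np.random.default_rng(seed)
    r_obs = np.corrcoef(y_true, y_pred)[0, 1]
    r_null = np.array([np.corrcoef(rng.permutation(y_true), y_pred)[0, 1]
                       for _ in range(n_perm)])
    p = (np.sum(r_null >= r_obs) + 1) / (n_perm + 1)
    return r_obs, p
```

Permuting the responses destroys the stimulus-response relationship while preserving the response distribution, so the null distribution reflects exactly the "chance level" referred to in the text.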

| Application to experimental data
We next documented the complete process of using gradient-based estimation for the parameters of spline-based LNP and LNLN models in a previously published experimental data set: extracellularly recorded spikes from tiger salamander retinal ganglion cells (RGC), stimulated by 3D non-repeat checkerboard white noise [14] ( Figure 9A). In addition, we applied the same techniques to 2-photon calcium imaging data recorded from a mouse retinal ganglion cell soma stimulated by checkerboard white noise ( Figure 9B) (using the positive gradient of the calcium trace as the response). We compared the results of spline-based models with non-spline versions.
For both data sets, only the first 10 minutes of the whole recording were used for fitting the models, another 2 minutes were used for model validation, and another 4 sets of 2 minutes for testing. The predictive performance was measured as the average of the four correlation coefficients between the predicted and measured responses. We set the number of subunits to k = 4 throughout. The dimensionalities of the 3D STRFs were [20,25,25] and [25,15,15], respectively. We added diagnostic information as in Figure 7. Moreover, a contour was drawn on the spatial filter to illustrate the spatial extent of the STRF.
We fitted the non-spline LNP and LNLN models with different L1 regularization weights from 0 to 10, in steps of 1.
For the spline LNP and LNLN models, the number of basis functions (aka degrees of freedom, df) for the temporal and spatial dimensions (assuming the x and y dimensions used the same df) was first selected based on the predictive performance of the maximum likelihood estimates on the validation set (from 1/3 to 2/3 of the pixels in each dimension, in steps of 1). For data set (1), the grid search took 11.2 minutes on the same hardware setup used for Figure 6 and yielded the optimal df = [9,9,9]; for data set (2), it took 2.4 minutes and yielded the optimal df = [12,9,9] (Figure 8). See also Supplementary Section 1 for how different amounts of training data affect the optimal df.
Then, we fitted the spline LNP and LNLN models with L1 regularization weights ranging from 0.5 to 1.5, in steps of 0.1 (as we observed, the spline models required smaller regularization values due to the reduced number of model parameters).
The ordered grid search over the L1 weight, starting with the smallest value, was interrupted if the current predictive performance was lower than the previous one. All models were initialized with maximum likelihood estimates (plus small additive Gaussian noise) and fitted by gradient descent for at most 1500 iterations, with an early stop deployed via two triggers: either the training cost changed by less than 1e-5 over the last 10 iterations, or the validation cost increased monotonically over the last 10 iterations. Sometimes the optimization did not stop right after reaching the lowest validation cost, because the adaptive learning rate caused the cost to change in a zigzag manner; regardless of the stopping trigger, the model at the lowest validation cost was returned. The computation time of each fit for all four models is summarized by box plots in Figure 9A(iv) and B(iv).
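The two early-stopping triggers described above can be written as a small helper; a minimal sketch with the paper's stated tolerance and patience as defaults:

```python
def should_stop(train_costs, valid_costs, tol=1e-5, patience=10):
    """Early-stopping check: stop when the training cost changed less than
    `tol` over the last `patience` iterations, or the validation cost
    increased monotonically over that window. Regardless of which trigger
    fires, the caller should keep the parameters from the iteration with the
    lowest validation cost."""
    if len(train_costs) < patience + 1:
        return False
    recent_train = train_costs[-(patience + 1):]
    if max(recent_train) - min(recent_train) < tol:
        return True  # training cost plateaued
    recent_valid = valid_costs[-(patience + 1):]
    if all(b > a for a, b in zip(recent_valid, recent_valid[1:])):
        return True  # validation cost rising monotonically
    return False
```

Requiring a strictly monotone rise in the validation cost (rather than any single increase) is what makes the rule robust to the zigzag cost trajectories mentioned in the text.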
Fitting a 3D STRF with gradient descent directly was previously considered infeasible unless certain regularization was imposed [17], and the situation becomes even worse for subunit models, as the number of model parameters increases linearly with the number of subunits. Previous methods [14,15] (as shown in Figure 5) relied on dimensionality reduction techniques to retrieve subunit STRFs, but those methods cannot be applied to 2-photon calcium imaging data directly without an extra spike-inference step. Spline GLMs not only overcome both of these problems, but also allow us to assess the quality of the estimated STRF with CIs and their corresponding p-values (in principle, the same statistics can also be applied to non-spline GLMs, yet the larger number of model coefficients would result in a high computational cost for computing the STRF posterior covariance).
We found that for both example neurons from the two data sets, the STRFs estimated by non-spline models were noisy and the prediction performance was only moderate (Figure 9A,B).

| DISCUSSION
Here, we showed that natural cubic splines form a good basis for STRF estimation, tackling one of the main challenges in sensory neuroscience: to efficiently estimate high-dimensional STRFs with limited data. Using this basis function set, we showed that inference for single-filter LN models with different noise models, as well as for LNLN cascade models, can be performed efficiently on both spiking and two-photon imaging data, even if the signal-to-noise ratio is low (see also Table 1 for how spline-based methods compare to previous methods). We provide an implementation of the proposed methods in the RFEst package for Python.

| Choice of basis functions
We chose the natural cubic spline basis to represent the linear filters as the default in our toolbox because, among the many other spline bases such as cyclic cubic splines, B-splines, or thin-plate splines, natural cubic splines have been shown to be the smoothest possible interpolant through any set of data, and theoretically near-optimal for closely approximating any true underlying function [11]. Conveniently, assuming all knots are spaced equally in each dimension, natural cubic splines require setting only one hyperparameter: the degrees of freedom, or the number of basis functions. This is unlike the raised-cosine basis used previously [9], which is governed by multiple hyperparameters. Also, natural cubic splines can be easily extended to high-dimensional settings, also known as tensor product bases, by simply taking the row-wise Kronecker product of the basis matrices along each dimension (see Methods 4.4), making them uniquely suitable for estimating STRFs. There is also an interesting connection between smoothing splines and Gaussian processes: as shown by [20,21], spline regression converges to the MAP solution of a Gaussian process regression under certain constraints.
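On a full regular grid, the row-wise Kronecker product of the per-dimension bases reduces to the plain Kronecker product of the basis matrices, so the tensor-product construction is a one-liner. A sketch of the construction referenced in Methods 4.4 (not RFEst's exact code):

```python
import numpy as np

def tensor_basis(S_t, S_x):
    """Tensor-product basis for a 2D STRF evaluated on the full grid.
    S_t: (d_t, b_t) basis along time, S_x: (d_x, b_x) basis along space.
    Returns a (d_t * d_x, b_t * b_x) basis for the flattened filter, so a
    separable filter kron(S_t @ a, S_x @ b) equals tensor_basis(S_t, S_x)
    applied to the coefficient vector kron(a, b)."""
    return np.kron(S_t, S_x)
```

A third dimension is handled the same way by one more Kronecker product, which is why the approach scales naturally to the 3D STRFs used for the experimental data.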
Through extensive validation with simulated and real experimental data sets, we showed that STRFs estimated with this basis are not only more accurate than those from previous state-of-the-art methods, but that the estimation procedure is also computationally much more efficient. The estimated STRFs were smoothed automatically, as they were linear combinations of a set of smooth basis functions. Further, sparsity and localization could often be achieved by adding an L1 regularization term to the loss function, which pushed the coefficients of the less relevant basis functions to zero. Moreover, by parameterizing STRFs with basis functions, the number of coefficients was significantly reduced; hence model fitting became much more computationally efficient, and the amount of data needed for the optimization to converge was also significantly smaller.

| Potential pitfalls of using splines in non-ideal situations
One caveat of using splines to fit STRFs is that some STRFs might not need smoothing, so that using splines would lead to suboptimal results. For example, if a STRF occupies only one pixel in space, or its amplitude changes sharply in the temporal profile, a spline-based model would likely yield overly smoothed filters. However, one can identify these cases by monitoring the cross-validation performance on held-out data while increasing the number of basis functions in each dimension. If the performance keeps increasing as the number of basis functions approaches the number of pixels in each dimension, a spline-free model could be preferable.
Another concern of using splines is that it would be difficult to tell a poor fit from a good one in some cases, as a poorly-fitted spline-based STRF would also look like a smooth STRF, unlike pixel-based methods that yield a noisy STRF when the fit is poor. In these cases, the diagnostic tools we introduced in Section 2.3 and Figure 7 would be helpful to assess the quality of the STRFs.

| Other approaches to efficient estimation of receptive fields
Similar to our approach, fast automatic smoothness determination (fASD) [22] has been suggested to speed up the evidence optimization in ASD. Because the smoothness prior covariance can be represented in frequency coordinates, and many high-frequency coefficients of the Fourier-transformed filter tend to be small, the size of the prior covariance matrix can be reduced by cutting off those low-value frequencies (an idea similar to the modified wSTA [4]), leading to a speed-up of the optimization process. However, the efficiency of fASD is controlled entirely by the initial frequency truncation, and the truncation level has to be fixed before the evidence optimization starts; one therefore needs to fall back on grid search and cross-validation to choose a good starting point. This makes the automatic hyperparameter selection via evidence optimization less automatic, and comparable in effort to finding the optimal degrees of freedom for the spline basis. Also, as fASD (and the original ASD) rely on only a few hyperparameters, one might not need evidence optimization at all, but could instead find the optimal hyperparameters entirely through cross-validation. Importantly, fASD is so far only applicable within the framework of linear models with additive Gaussian noise, while our approach is ready to use in nonlinear models with non-Gaussian noise, as shown for the LNP and LNLN models.
Regarding the inference for multiple STRFs in LNLN-cascade models, previous work has attempted to provide efficient algorithms by first retrieving subunit STRFs through spike-triggered clustering methods such as soft-clustering [15] or semi-NMF [14]. These were subsequently used to estimate other model parameters via gradient descent while holding the subunit STRFs fixed. Utilizing only the collection of stimuli that elicit a spike, these approaches have advantages in terms of the computational cost in time and memory compared to the direct fitting of the complete model using gradient descent [13,17].
The reliance of these algorithms on the spike-triggered ensemble under white noise stimulation, which makes them easy to use, is also their main disadvantage: they can only be applied to spike data under white noise stimulation. In many cases, experiments today record more diverse measurements from neurons, such as two-photon calcium imaging [23,24] or synaptic current recordings [25], under more diverse types of stimulation, such as correlated noise and natural stimuli [26]. In those cases, spike-triggered analysis cannot be used directly without modification, as the spike-triggered average is no longer the maximum likelihood estimate [3]. In contrast, our method relies neither on the Gaussianity of the stimulus nor on the discreteness of the response, and can in principle be generalized to other stimulus distributions and other response types.
In addition, current spike-triggered clustering methods rely on temporally collapsing the 3D stimuli to further reduce the model complexity, assuming all subunits share the same temporal response kernel. This limits the usability of these methods to a few very specific cell types, and does not account for the spatio-temporally inseparable computations implemented by neurons receiving inputs from multiple sources, such as bi-stratified On-Off direction-selective cells in the retina [27]. In such cases, a more general approach such as ours should be preferred.

| Further extensions
Here, we only showcased how to estimate STRFs using a natural cubic spline basis in a set of popular models, with standard gradient descent and L1 regularization. This simple spline add-on can be directly incorporated into other previous models [13,17] with different constraints, regularization and optimization techniques, as it only changes the linear step of the models without modifying other parts.
Furthermore, to improve the model prediction performance, a response-history filter (as shown in Figure 2) and a flexible nonlinearity can be added to the model [28,17], both of which can also be parameterized by a 1D spline basis. With such extra flexibility, a model can better adapt to the cell-specific firing threshold and to the adaptation to the stimulus ensemble. Still, even with a response-history filter, a STRF model may not account for the total variability of the neuronal responses if the response is influenced by other sources. Future extensions of the current GLM for STRFs could incorporate more diverse sources of behavioral data [29] of the experimental animal, under experimental or naturalistic settings, as 1D time-dependent filters.

| METHODS

| Simulated data
For the simulated data shown in Figure 3 and 4, we used a 2D rank-2 STRF with 30 time frames and 40 pixels in space (d = 1200) as shown as "ground truth" in both Figures and generated responses with two types of flicker bar stimuli: Gaussian white noise and pink noise. For LNP models, we used a fixed exponential nonlinearity.
For Figure 5, we used two 20x20 pixels antagonistic center-surround STRFs with Gaussian white noise as stimulus. We used a fixed exponential function for the filter nonlinearity and a softplus function for the output nonlinearity. We controlled the mean firing rate of all simulated spike trains to about 21 Hz by adjusting the intercept of the corresponding models (for Figures 4A and C, the intercepts were 3.5 and 2.5, respectively; for Figure 5, the intercept was 0) and by multiplying λ by a constant R = 10.
For all simulations, we measured the similarity of the ground truth and the model estimates by computing the MSE between the respective normalized STRFs. The STRFs were normalized by element-wise division by the matrix norm.
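The similarity measure used for the simulations is a two-liner; a minimal sketch (the function name is ours):

```python
import numpy as np

def normalized_mse(w_true, w_hat):
    """MSE between two STRFs after element-wise division by their matrix
    (Frobenius) norms, as used to compare estimates to the ground truth."""
    a = w_true / np.linalg.norm(w_true)
    b = w_hat / np.linalg.norm(w_hat)
    return np.mean((a - b) ** 2)
```

Normalizing both STRFs first makes the measure invariant to the overall scale of the filters, so only the shape mismatch is penalized.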
For Figure 3, the similarity was measured for various stimulus lengths (n_samples), which were scaled by different factors (factor ∈ {0.5, 1, 2, 4, 8, 16, 32, 64}) with respect to the number of coefficients of the STRF. For Figure 4, […].

For the simulated data shown in Figure 7, we used a 3D STRF with 25 time frames and 25x25 pixels in space (d = 25 × 25 × 25 = 15625), where the spatial profile was a 2D Gaussian field and the temporal profile had a sharp dip close to zero time lag, and generated responses with Gaussian white noise (sample size n = d unless noted otherwise).

| Experimental data
We used a previously published data set for Figure 9A: extracellularly recorded spikes from tiger salamander retinal ganglion cells [14], available to download at https://gin.g-node.org/gollischlab/Liu_etal_2017_RGC_spiketrains_for_STNMF.
Another two data sets of visual neurons and one data set of a non-visual neuron were used for Supplementary Figures 2 and 3:

ntnu.edu/kavli/research/grid-cell-data;
The details of the second data set for Figure 9B are described below.

| Experimental procedures for 2-photon calcium imaging
Animals and tissue preparation

Two photon imaging and light stimulation
A MOM-type two-photon microscope (designed by W. Denk, MPI, Martinsried; purchased from Sutter Instruments/Science Products), as described previously [33,34], was used for this study. Briefly, the system was equipped with a mode-locked laser. Light stimuli were projected through the objective lens [34]. A digital light processing (DLP) LightCrafter 4500 (Texas Instruments, Dallas, TX), coupled with an external unit with light-emitting diodes (LEDs; "green", 576 nm, and UV, 387 nm) matching the spectral sensitivities of mouse M- and S-opsins (for details, see [23,35]), was used in this study. Both LEDs were intensity-calibrated to range from 0.1 · 10³ ("black" background) to 20.0 · 10³ ("white" full field) photoisomerizations P* s⁻¹ per cone. The light stimulus was centered before every experiment, ensuring that its center corresponded to the center of the microscope's scan field. For all experiments, the tissue was kept at a constant mean stimulator intensity level for ≥ 15 s after laser scanning started and before light stimuli were presented.

Data pre-processing
Ca²⁺ imaging data were pre-processed using custom-written scripts in IGOR Pro. Regions of interest (ROIs) were manually defined based on the outline of the soma. For each ROI, Ca²⁺ traces were extracted using the image analysis toolbox SARFIA for IGOR Pro [36]. A stimulus time marker embedded in the recorded data served to align the traces relative to the light stimulus at a temporal resolution of 2 ms. All stimulus-aligned traces, together with the relative ROI positions, were exported for further analysis.

| Previous methods
Spike-triggered average (STA) The STA was computed as the average of all stimuli that preceded a spike [1]:

w_STA = (1/n) X^T y,

where X is the stimulus design matrix, y is the response, and n is the number of spikes (or, for non-spike data, the number of samples, i.e. rows, in X and y).
Whitened spike-triggered average (wSTA) The maximum-likelihood estimator under a Gaussian noise model, or whitened STA (wSTA), was computed by multiplying the STA by the inverse of the stimulus auto-correlation matrix:

w_wSTA = (X^T X)^{-1} X^T y.

In our code base, we did not compute the matrix inverse explicitly but computed the least-squares solution instead, which is common practice.
Maximum-a-posteriori estimator (MAP) Similarly, the MAP estimate under a Gaussian noise model was computed by adding the inverse of the prior covariance to the stimulus auto-correlation matrix:

w_MAP = (X^T X + σ² C^{-1})^{-1} X^T y,

where C is a prior covariance matrix of choice, e.g. corresponding to a smoothness or sparseness assumption [6,7]. Its hyperparameters can be optimized by minimizing the negative log-evidence [6,7]:

-log E = (1/2) [ log|C| - log|Λ| + n log(2πσ²) + y^T y / σ² - μ^T Λ^{-1} μ ],

where σ² is the variance of the additive Gaussian noise, and μ and Λ are the posterior mean and covariance, respectively:

μ = Λ X^T y / σ²  and  Λ = (X^T X / σ² + C^{-1})^{-1}.
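As a sketch, the MAP estimate can be computed directly with NumPy (the function name `map_estimate` is hypothetical; for large d one would avoid the explicit inverse of C):

```python
import numpy as np

def map_estimate(X, y, C, sigma2=1.0):
    """MAP estimate under Gaussian noise with prior covariance C:
    w_MAP = (X^T X + sigma^2 C^{-1})^{-1} X^T y."""
    A = X.T @ X + sigma2 * np.linalg.inv(C)
    return np.linalg.solve(A, X.T @ y)
```

With a very broad prior (C with large entries) this reduces to the wSTA, while a narrow prior shrinks the estimate toward zero.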

Smoothness prior covariance matrix for ASD
The 1D ASD prior covariance matrix is formulated as follows [6]:

C_ASD,ij = exp( -ρ - Δ_ij / (2δ²) ),

where Δ_ij is the squared distance between STRF coefficients w_i and w_j, ρ controls the scale, and δ controls the smoothness.
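Under these definitions, the ASD covariance can be constructed as follows (a sketch; `asd_covariance` is a hypothetical name):

```python
import numpy as np

def asd_covariance(coords, rho, delta):
    """1D ASD prior covariance: C_ij = exp(-rho - Delta_ij / (2 delta^2)),
    where Delta_ij is the squared distance between coefficient positions."""
    d2 = (coords[:, None] - coords[None, :]) ** 2  # pairwise squared distances
    return np.exp(-rho - d2 / (2.0 * delta ** 2))
```

Larger δ makes off-diagonal entries decay more slowly, encoding a smoother prior; ρ sets the overall scale exp(-ρ) on the diagonal.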

Locality prior covariance matrix for ALD
The 1D ALD prior covariance matrix is formulated as follows [7]:

C_ALD = C_s^{1/2} B C_f B^T C_s^{1/2},

where B is an orthogonal basis matrix for the 1D discrete Fourier transform, and C_s and C_f are the diagonal ALD prior covariances in space and frequency, respectively. Their diagonals are Gaussian envelopes,

C_s,ii = exp( -(χ_i - ν_s)² / (2τ_s²) ),

where τ_s is a scalar controlling the shape and extent of the local region, χ holds the STRF coefficient coordinates in space-time, and ν_s is the center coordinate of the STRF; and

C_f,ii = exp( -(ω_i - ν_f)² / (2τ_f²) ),

where τ_f is a scalar controlling the scale of smoothness in frequency, ω holds the STRF coefficient coordinates in frequency, and ν_f is the center coordinate of the STRF in the Fourier domain.
A high-dimensional prior covariance matrix can be approximated by taking the Kronecker product (⊗) of the covariance matrices in each dimension.

| Estimating STRFs with Natural Cubic Regression Splines
To generate a natural cubic spline basis matrix (S), we used the implementation from Patsy (0.5.1), a Python equivalent of the original implementation in the R package mgcv [11]. The only hyperparameter needed to generate a 1D spline basis matrix is the number of basis functions, or degrees of freedom (df), as we set all knots to be equally spaced.
Here, S is the spline basis matrix evaluated at the knot locations x; the basis functions are listed in Table 2.
An illustration of 1D and 2D natural cubic splines can be seen in Figure 1B. A spline-based STRF under a Gaussian noise model can be fitted in a similar manner as the wSTA:

b_SPL,wSTA = (S^T X^T X S)^{-1} S^T X^T y,    (12)

where b is the vector of spline coefficients, and the corresponding STRF is w_SPL,wSTA = S b_SPL,wSTA.

TA B L E 2 Basis functions for natural cubic spline.
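In code, the spline-based wSTA amounts to a least-squares solve against the reduced design matrix XS, whose normal equations are exactly those of equation (12) (a sketch with the hypothetical name `spline_wsta`):

```python
import numpy as np

def spline_wsta(X, y, S):
    """Spline-based wSTA: solve min_b ||X S b - y||^2, whose normal
    equations are (S^T X^T X S) b = S^T X^T y, then map back w = S b."""
    XS = X @ S                                    # reduced design matrix
    b, *_ = np.linalg.lstsq(XS, y, rcond=None)    # no explicit inverse
    return b, S @ b
```

As with the plain wSTA, the least-squares route avoids forming the matrix inverse explicitly.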
The spline basis matrix can be extended to higher dimensions, which is also called a tensor product smooth [11], by simply taking the Kronecker product (⊗) of the basis matrices in each dimension: S = S_1 ⊗ S_2 (⊗ S_3).
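For example, the basis matrix for a 2D STRF is the Kronecker product of a temporal and a spatial 1D basis; random matrices stand in for actual spline bases here purely to illustrate the shapes:

```python
import numpy as np

nt, nx = 30, 40  # STRF dimensions (time frames x spatial pixels)
kt, kx = 7, 9    # number of basis functions per dimension
St = np.random.default_rng(3).normal(size=(nt, kt))  # placeholder temporal basis
Sx = np.random.default_rng(4).normal(size=(nx, kx))  # placeholder spatial basis
S = np.kron(St, Sx)  # tensor product smooth, shape (nt*nx, kt*kx)
```

The number of coefficients to estimate drops from nt·nx = 1200 to kt·kx = 63, which is the source of the method's efficiency.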

| Spline-based GLMs with gradient descent
The spline coefficients can also be fitted by gradient descent on the cost function defined for each model, with L1 regularization.
For the LG model, the cost function is the sum of squared errors with an L1 penalty:

J(b) = ||y - X S b||² + α ||b||₁,

where α is the weight of the L1 regularization.
For the LNP and LNLN models, the cost function is the negative log-likelihood (up to constants) plus the L1 penalty:

J(b) = Σ_t [ λ_t Δ - y_t log(λ_t Δ) ] + α ||b||₁,

where λ = f(X S b) is the conditional intensity, or instantaneous firing rate, Δ is the size of the time bin, and f(·) is a nonlinearity applied to the filter output, which can be fixed as a softplus or exponential nonlinearity.
We used JAX [37] for automatic differentiation and the Adam optimizer [38] for parameter optimization.
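A minimal sketch of this fitting procedure, using plain NumPy with the analytic gradient for a fixed exponential nonlinearity instead of JAX/Adam (all names are hypothetical):

```python
import numpy as np

def fit_lnp_spline(X, y, S, alpha=0.0, dt=1.0, lr=0.01, n_iter=2000):
    """Gradient descent on the LNP negative log-likelihood with lam = exp(X S b).
    NLL(b) = sum(lam*dt - y*log(lam*dt)); for exp, grad = (XS)^T (dt*lam - y),
    plus the L1 subgradient alpha * sign(b)."""
    XS = X @ S
    b = np.zeros(S.shape[1])
    for _ in range(n_iter):
        lam = np.exp(XS @ b)
        grad = XS.T @ (dt * lam - y) + alpha * np.sign(b)
        b -= lr * grad / len(y)
    return b
```

Because the Poisson likelihood with an exponential nonlinearity is log-concave in b, plain gradient descent converges to the global optimum; the JAX version simply replaces the hand-derived gradient with automatic differentiation.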

| Spline-based spike-triggered clustering methods
Both k-means clustering and semi-NMF can be formulated as a matrix factorization problem:

V ≈ W H,

where V is the spike-triggered ensemble, W is the matrix of clustering centroids (or subunits), and H is the matrix of cluster labels (or subunit weights). A spline basis can easily be incorporated into k-means clustering by assuming that each centroid can be approximated by the same spline basis matrix, W_SPL,centroid = S B, where B is the spline coefficient matrix for the subunits and can be computed as the least-squares solution of the linear matrix equation S B = W. For semi-NMF, splines can also be combined with the update rules [18], where * is point-wise multiplication, A⁺ = (|A| + A)/2 and A⁻ = (|A| − A)/2.
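The spline-constrained k-means step can be sketched as a least-squares projection of the centroids onto the spline subspace (`project_centroids` is a hypothetical name):

```python
import numpy as np

def project_centroids(W, S):
    """Constrain centroids to the spline subspace: find B minimizing
    ||S B - W||_F (least-squares solution of S B = W), return S B."""
    B, *_ = np.linalg.lstsq(S, W, rcond=None)
    return S @ B
```

Centroids that already lie in the span of S are left unchanged; noisy centroids are smoothed toward the nearest spline-representable shape.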

| Confidence interval and statistical significance of the estimated STRFs
The asymptotic posterior distribution of the estimated STRF coefficients can be computed analytically [11]:

b ~ N(b̂, V_b),

where b̂ are the fitted STRF coefficients and V_b is the posterior covariance. For the LG model:

V_b = σ² (S^T X^T X S)^{-1},

where σ² is estimated from the residual sum of squares between the data and the predicted response on the training set. For the LNP model:

V_b = (S^T X^T W X S)^{-1},

where W is a diagonal matrix of the usual GLM weights, whose entries depend on the nonlinearity f(·) of the model evaluated at the filter output X S b. Thus, the standard error of b is the square root of the diagonal of V_b, and the corresponding confidence interval (CI) of the estimated STRF is b ± z · SE(b) (e.g. z = 1.96 for a 95% CI). Given the computed confidence interval, a Wald chi-squared test can be used to determine whether a STRF is statistically significant (against the null hypothesis that all STRF coefficients are zero), with the Wald statistic T = b V_b^{-1} b^T.
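A sketch of the confidence interval and Wald test given b and V_b (hypothetical helper, using SciPy's chi-squared survival function):

```python
import numpy as np
from scipy.stats import chi2

def wald_test(b, V_b):
    """Wald chi-squared test of b against the null that all coefficients
    are zero: T = b V_b^{-1} b^T, compared to chi^2 with len(b) dof.
    Also returns the element-wise 95% confidence interval."""
    T = b @ np.linalg.solve(V_b, b)
    p = chi2.sf(T, df=len(b))
    se = np.sqrt(np.diag(V_b))
    ci = (b - 1.96 * se, b + 1.96 * se)
    return T, p, ci
```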

| Permutation Test
To evaluate the predictive performance of the estimated STRF, we permuted the validation stimulus in time and made predictions over 100 repetitions. A one-tailed one-sample Student's t-test was used to evaluate whether the model prediction was better than chance level.
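The permutation test can be sketched as follows (hypothetical helper; `score_fn` stands for any prediction-quality metric, e.g. the correlation between prediction and response, and the one-tailed p-value asks whether the null scores lie significantly below the model score):

```python
import numpy as np
from scipy.stats import ttest_1samp

def permutation_test(model_score, X_val, y_val, score_fn, n_perm=100, seed=0):
    """Score the model against a null distribution built by permuting
    the validation stimulus in time (sketch, not the published code)."""
    rng = np.random.default_rng(seed)
    null_scores = [score_fn(X_val[rng.permutation(len(X_val))], y_val)
                   for _ in range(n_perm)]
    t, p_two = ttest_1samp(null_scores, model_score)
    p_one = p_two / 2 if t < 0 else 1 - p_two / 2  # one-tailed
    return p_one, null_scores
```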

| Visualization of 3D STRF
For visualization of the 3D STRFs in Figures 7 and 9, we first performed singular value decomposition (SVD) to separate the spatial and temporal components of the estimated 3D STRF. The coordinates of the extremum pixel in the spatial component were selected, and the temporal filter of the original STRF is shown for those coordinates. The confidence interval of the same extremum point was plotted as a gray shaded region on the temporal filter. For the spatial filter, we show the frame of the original STRF at the extremum point of the temporal filter (indicated by a vertical dotted line). The maximum value of the color map was set to the maximum absolute value of the STRF.
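The SVD-based separation into temporal and spatial components can be sketched as (`separate_strf` is a hypothetical name; the space axes are flattened into one dimension):

```python
import numpy as np

def separate_strf(strf):
    """Split a (time x space) STRF into its leading rank-1 temporal and
    spatial components via SVD, as used for the 3D visualizations."""
    U, s, Vt = np.linalg.svd(strf, full_matrices=False)
    temporal = U[:, 0] * np.sqrt(s[0])  # leading temporal component
    spatial = Vt[0] * np.sqrt(s[0])     # leading spatial component
    return temporal, spatial
```

For a space-time-separable (rank-1) STRF, the outer product of the two returned components reproduces the STRF exactly.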

| Comparison of Computation Time
For the comparison of computation time in Figure 6, we used simulated 2D RFs of different sizes (15x15, 20x20 and 30x30 pixels, respectively) and white noise stimuli. Simulations were carried out at different lengths of time in the same manner as in Figure 4.

| Code availability
All methods (spline-based GLMs, evidence optimization and spike-triggered clustering methods) were implemented in Python 3 and are available on GitHub (https://github.com/berenslab/RFEst). Jupyter notebooks for all figures are also available on GitHub (https://github.com/huangziwei/notebooks_RFEst).

| The number of spline basis functions vs. the amount of data used for inference
We investigated how different amounts of training data affect the optimal number of basis functions (df) estimated by cross-validation.
To this end, we used the complete recording of the example cell from Figure 9A. The last 2 minutes of the recording were used as the validation set. The rest of the recording was then divided into segments of different lengths (1, 2, 4, 8, 16, 32 and 64 minutes). We used the same grid-search procedure as in Figure 8 for all recording lengths, but with a smaller grid: from 6 to 11 for the temporal dimension and from 8 to 11 for the spatial dimensions, as we observed in Figure 8 that the optimal values would most likely fall into these ranges. For most lengths, the optimal df was [9, 10, 10]. When only 1 minute of data was used, the optimal number of basis functions was smaller in all dimensions (df = [8, 8, 8]). On the other hand, when about 1 hour of data was used, the df for the spatial dimensions increased by one, while the df for the temporal dimension stayed the same. Overall, the effect of the amount of data on the optimal df was not very strong.