The problem of perfect predictors in statistical spike train models

Generalized Linear Models (GLMs) have been used extensively in statistical models of spike train data. However, the maximum likelihood estimates of the model parameters and their uncertainty can be challenging to compute in situations where response and non-response can be separated by a single predictor or a linear combination of multiple predictors. Such situations are likely to arise in many neural systems due to properties such as refractoriness and incomplete sampling of the signals that influence spiking. In this paper, we describe multiple classes of approaches to address this problem: using an optimization algorithm with a fixed iteration limit, computing the maximum likelihood solution in the limit, Bayesian estimation, regularization, change of basis, and modifying the search parameters. We demonstrate a specific application of each of these methods to spiking data from rat somatosensory cortex and discuss the advantages and disadvantages of each. We also provide a roadmap for selecting a method based on the particular analysis issues and scientific goals of a problem.


| INTRODUCTION
Generalized linear models (GLM) provide a powerful tool for relating observed neural spiking data to the biological, behavioral, and stimulus signals that influence them [1, 2, 3]. These models express the conditional intensity of spiking, which defines the probability of observing a spike in any small time interval, in terms of a design matrix whose columns represent the signals, or covariates, of interest. GLMs possess a number of properties that make them well suited to spike train modeling [4,3,5,6]. One important property is that GLMs can be formulated to guarantee that the likelihood of the spiking data is convex with respect to the model parameters, allowing for rapid computation of their maximum likelihood estimates (MLE) [7,8]. Multiple algorithms exist for maximizing the likelihood function, such as simulated annealing [9,10], Fisher scoring [11], and gradient descent methods [12]. Because the likelihood function is convex in GLMs, most statistical software packages implement maximum likelihood estimation for GLMs using the computationally efficient Iteratively Reweighted Least Squares (IRLS) algorithm [13,14,15]. Another useful property of GLMs is that in most cases, the parameter estimates have an asymptotically multivariate normal distribution with a covariance matrix determined by the Fisher information, which is computed as part of the IRLS algorithm [13,14]. This makes it easy to compute confidence intervals (CI) about individual parameters or about the firing intensity, as well as providing for simple hypothesis tests, e.g. likelihood ratio tests, about whether firing rates are influenced by specific sets of covariates [13,14]. Accurate estimation of the Fisher information also helps determine the extent to which one signal or set of signals is confounded with another in its influence on neural spiking [13,14].
For some GLMs, the algorithm to maximize the likelihood may fail to converge because the likelihood is actually maximized in the limit as one or multiple parameters tend to ±∞. This problem is sometimes referred to as the monotone likelihood or non-existence of the MLE problem [16]. For example, consider a point process, such that whenever a non-negative covariate X is nonzero, no events occur. For a Poisson regression model relating X to the rate of events, the rate should go to zero, and therefore the log rate should go to −∞, whenever X is nonzero. This can only be achieved by making the parameter multiplying X be equal to −∞. Another perspective is that this situation arises when the response and non-response can be separated by a single predictor or a linear combination of multiple predictors. For this reason, this phenomenon is more commonly referred to as the complete separation problem [17].
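A small numpy sketch makes the monotone-likelihood behavior concrete: with an invented covariate that is nonzero only when no events occur, the Poisson log-likelihood keeps increasing as that covariate's coefficient decreases, so no finite maximizer exists. All data and the function name here are illustrative.

```python
import numpy as np

# Toy data: whenever the covariate x is nonzero, no events are observed.
x = np.array([0, 0, 1, 1, 0, 1], dtype=float)
y = np.array([2, 1, 0, 0, 3, 0], dtype=float)

def poisson_loglik(beta0, beta1):
    """Poisson log-likelihood (up to the y! constant) for
    log(lambda_i) = beta0 + beta1 * x_i."""
    lam = np.exp(beta0 + beta1 * x)
    return np.sum(y * np.log(lam) - lam)

# Holding the intercept fixed, the likelihood keeps increasing as beta1 -> -inf:
lls = [poisson_loglik(0.7, b1) for b1 in (-1.0, -5.0, -10.0, -20.0)]
assert all(np.diff(lls) > 0)  # monotone likelihood: no finite MLE for beta1
```

Because the separated rows contribute only the term $-e^{\beta_0+\beta_1}$, their contribution approaches zero from below as $\beta_1 \to -\infty$, which is what drives the divergence.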
This problem is well-studied in the case of logistic regression with small to medium-sized data [18,19,20,21,22,17].
Heinze et al. explore a number of potential solutions for this case, including the omission of predictors that result in complete separation, fixing the parameter values of such predictors prior to fitting the model, changing the form of the model, and using an ad-hoc adjustment [23,24]. They arrive at a preferred procedure using a modification proposed by Firth to reduce the bias of maximum likelihood estimates in GLMs [25,26,27].
The problem of complete separation is discussed in the statistical literature on point processes such as Cox processes as well [28,29], and has been pointed out in the context of Poisson models for neural spiking [30], but remedies have not been discussed in detail in the neural modeling literature, despite the fact that it is likely to arise in many neural systems due to properties such as refractoriness and incomplete sampling of the signals that influence spikes.
In these models, separation occurs when any covariate forces the system not to spike whenever it takes on a nonzero value. Such covariates are called perfect predictors, since one can perfectly predict no spiking when they are nonzero.
For example, a model for a neuron with an absolute refractory period lasting longer than 1 ms will include a perfect predictor: the indicator of whether there has been a spike in the past millisecond. Perfect predictors can also appear in models when nonzero values of certain predictors are sampled infrequently, even in cases where a lot of data exists. For example, a model fit to a rat's hippocampal place field that includes an indicator for being in a corner of the environment that the rat tends to avoid may suggest that this variable is a perfect predictor, even though the neuron might occasionally fire in that region given enough time. We say that perfect predictors of the kind in the first example are structural, while ones of the kind in the second example are the result of sampling.
We posit that when these issues arise in statistical neural models, researchers adopt one of a set of ad-hoc methods to avoid dealing with them, and the issues do not make it into the discussion sections of the resulting papers. However, perfect prediction can lead to a variety of issues limiting the utility of point process GLMs, which may not be alleviated by ad-hoc approaches. One important issue is that the computed Fisher information in these fitted models does not accurately reflect the variance-covariance structure of the parameter estimates, which makes quantifying uncertainty challenging. In particular, the computed Fisher information is often close to singular, and if inappropriately used to construct confidence bounds and perform hypothesis tests, it typically leads to extremely large standard error estimates for the perfect predictors and inaccurate covariance estimates between the perfect predictors and all other signals [23,31,32].
Another issue is that in the presence of perfect predictors, certain parameters may have estimated values that are completely determined by the properties of the estimation method. Since the true maximum likelihood estimates occur at ±∞, these parameter estimates are biased, and the amount of bias can differ substantially based on the choice of methods. Bias and variance issues become more important when the estimated parameters themselves are used to make inferences about neurobiological mechanisms of spiking. For instance, in the context of single-neuron spiking models, parameters related to the influence of past spike history have been interpreted in terms of a combination of ionic currents and a moving threshold, allowing researchers to relate inferred parameters to these biological processes [33,34,35,36]. The next issue is that the algorithm used to maximize the likelihood may not converge, and will typically only stop when some fixed iteration limit has been reached. This often involves an order of magnitude more computational time than fitting a GLM without perfect predictors. In some cases, the optimization algorithm reaches a region of the likelihood surface where the computations become numerically unstable, and subsequent iterations can actually lead to a reduction in likelihood.
In this paper, we explore a range of different approaches for dealing with perfect predictors in point process GLMs.
These approaches fall into multiple categories: Interpreting the IRLS output despite lack of convergence, fixing the parameter estimates for perfect predictors, Bayesian estimation methods, regularization methods, re-parameterizing the model to remove perfect predictors, and modifications to the parameter search domain or stopping criteria. In the Methods section, we introduce these different approaches and explain how they deal with the problem of separation.
In the Results section, we select one specific method from each category and illustrate its application to modeling spiking data from a rat cortical neuron, in vitro. Furthermore, we compare these methods and illustrate how they deal with different problems associated with complete separation. Finally, in the Discussion section, we discuss the properties of these approaches, point out some advantages and disadvantages of each method, and introduce a potential road map to select the most appropriate method to use in various situations.

| Problem Formulation
Let $X$ be the design matrix with $p$ predictors of neural spiking (columns) and $n$ observations (rows), and $Y = (y_1, \ldots, y_n)^T$ be the response vector of spike counts, where the superscript $T$ denotes the transpose of a matrix. We use $x_{ij}$, $x_{i\cdot}$, and $x_{\cdot j}$ respectively to denote the $ij$-th element, $i$-th row, and $j$-th column of the matrix $X$. Let $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ be the parameter vector, $\theta = X\beta$, and $\lambda = \mathbb{E}(Y|X)$ be the conditional expectation of $Y$ given $X$. Let $g(\cdot)$ be the link function that relates the conditional expectation of $Y$ to the linear combination of predictors, i.e. $g(\lambda) = \theta$. Here, we work with the canonical link function for the Poisson point process, which is the log function [13]. The Poisson process GLM is described by

$$y_i \sim \mathrm{Poisson}(\lambda_i), \tag{1a}$$
$$\log \lambda_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}, \tag{1b}$$

where $\lambda_i$ is the firing rate for the $i$-th observation and $1 \le i \le n$. Perfect prediction presents challenges for any algorithm that is focused on maximizing the likelihood when its surface is nearly flat over a large subset of the parameter space. This includes approaches such as simulated annealing [9,10], Fisher scoring [11], and gradient descent methods [12]. Because of the convex likelihood surface in GLMs, the IRLS algorithm is widely used to find the maximum likelihood solution in this case. This algorithm starts with an initial vector $\beta^{(0)}$ and then computes $\beta^{(s)}$ recursively, until convergence, using the update equation

$$\beta^{(s)} = \beta^{(s-1)} + I\big(\beta^{(s-1)}\big)^{-1}\, U\big(\beta^{(s-1)}\big), \tag{2}$$

where $U(\beta)$ and $I(\beta)$ respectively denote the score function and the Fisher information computed at $\beta$. For the model described in Eq. 1, the log-likelihood function $l(\beta)$, the score function $U(\beta)$, and the Fisher information $I(\beta)$ are defined as

$$l(\beta) = \sum_{i=1}^{n} \left[\, y_i \log \lambda_i - \lambda_i - \log(y_i!)\, \right], \tag{3a}$$
$$U(\beta) = X^T (Y - \lambda), \tag{3b}$$
$$I(\beta) = X^T W X, \tag{3c}$$

where $W$ is the matrix of weights defined by $W^{-1} = \left( \frac{d\theta}{d\lambda} \right)^2 V$ and $V$ is the variance function evaluated at $\lambda$ [13]. For our choice of link function $g(\cdot)$, $W$ is a diagonal matrix with $i$-th diagonal element equal to $\lambda_i$.
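The update of Eq. 2 for the Poisson model with log link can be sketched in a few lines of numpy (an illustrative implementation, not the optimized routine of any particular statistics package; the function name is ours):

```python
import numpy as np

def irls_poisson(X, y, n_iter=25):
    """Fit a Poisson GLM with log link by Fisher scoring (IRLS).
    X is assumed to include a column of ones for the intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        lam = np.exp(X @ beta)               # lambda_i = exp(x_i . beta)
        U = X.T @ (y - lam)                  # score function
        I = X.T @ (X * lam[:, None])         # Fisher information, W = diag(lambda)
        beta = beta + np.linalg.solve(I, U)  # the update of Eq. 2
    return beta

# Sanity check: with an intercept-only model, exp(beta0) equals the mean count.
y = np.array([0., 1., 2., 1., 0., 2.])
X = np.ones((6, 1))
beta = irls_poisson(X, y)
assert np.isclose(np.exp(beta[0]), y.mean())
```

In the presence of perfect predictors this loop never satisfies a convergence criterion, which is exactly the failure mode discussed above.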
The $j$-th column of $X$ is a perfect predictor if a nonzero value in this column implies a zero spike count, i.e. for any $i \in \{1, \ldots, n\}$, $x_{ij} \neq 0$ implies $y_i = 0$. Additionally, a set of columns $j_1, \ldots, j_m$ generates a perfect predictor if some linear combination of them is a perfect predictor, i.e. there exists a set of weights $a_1, \ldots, a_m$ such that whenever $a_1 x_{ij_1} + \cdots + a_m x_{ij_m} \neq 0$, then $y_i = 0$. We let $S$ be the set of indices corresponding to columns that generate perfect predictors, and define the set of perfect parameters as $\{\beta_j : j \in S\}$. The maximum likelihood estimators for these parameters diverge to $\pm\infty$. We say that the $i$-th row is perfectly predicted if there is a $j \in S$ such that $x_{ij} \neq 0$.
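The single-column part of this definition translates directly into code (a sketch; detecting perfect predictors generated by linear combinations of columns requires more machinery, e.g. a linear program, and the function name is ours):

```python
import numpy as np

def perfect_columns(X, y):
    """Indices j such that x_ij != 0 implies y_i = 0.
    Single-column check only; combinations of columns can also
    generate perfect predictors and are not detected here."""
    return [j for j in range(X.shape[1])
            if np.any(X[:, j] != 0) and np.all(y[X[:, j] != 0] == 0)]

# Toy example: column 1 is nonzero only on rows with zero counts.
X = np.array([[1., 0.],
              [1., 1.],
              [1., 0.],
              [1., 1.]])
y = np.array([2., 0., 1., 0.])
assert perfect_columns(X, y) == [1]
```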
A naive way to deal with perfect predictors is to omit them from the model by removing all perfectly predicted rows and all columns corresponding to perfect parameters from X . However, perfect predictors are extremely informative and removing them can harm the model substantially in terms of goodness-of-fit and statistical power. Putting this naive approach aside, in what follows we briefly explore different families of approaches that can be used to deal with perfect predictors.

| Fixed Iteration Limit
The first approach for dealing with perfect predictors is to use an optimization algorithm such as the IRLS algorithm to estimate the MLE without adjusting the model or estimation procedure. As pointed out before, in the presence of perfect predictors the actual maximum likelihood solution is achieved only when perfect parameters are equal to ±∞, and therefore the parameter estimates will depend on when the algorithm terminates, typically after a fixed iteration limit is reached.

| Maximum Likelihood Limit
Eq. 1b can also be rewritten as

$$\log \lambda_i = \beta_0 + \sum_{j \notin S} \beta_j x_{ij} + \sum_{j \in S} \beta_j\, |x_{ij}|, \tag{4}$$

since for perfect parameters ($j \in S$) taking the absolute value of $x_{ij}$ only affects the sign of $\beta_j$ and does not change the predicted values of the model. In this form, the likelihood increases monotonically as each perfect parameter decreases, so the maximum likelihood solution is achieved when, for any $j \in S$, $\beta_j$ is equal to its limiting value $-\infty$. Therefore, one approach is to set the perfect parameters equal to their limiting value manually, and then use an algorithm that maximizes the likelihood, e.g. the IRLS algorithm, to estimate the non-perfect parameters after omitting the perfect predicting columns and perfectly predicted rows from $X$ and $Y$. The final model then includes both the parameters estimated with the perfect predictors removed and the perfect parameters set to $-\infty$. In practice, a perfect parameter cannot be set numerically equal to $-\infty$, so a very large negative number is often used instead.
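A possible sketch of this procedure, with a hypothetical placeholder fitter standing in for the IRLS step (all names and the choice of the large negative stand-in value are ours):

```python
import numpy as np

BIG_NEG = -50.0   # numerical stand-in for the limiting value -infinity

def fit_in_the_limit(X, y, perfect, fit_fn):
    """Fix perfect parameters at their limit, drop the perfectly predicted
    rows and the perfect columns, and fit the rest with any GLM fitter."""
    keep_rows = ~np.any(X[:, perfect] != 0, axis=1)
    keep_cols = [j for j in range(X.shape[1]) if j not in perfect]
    beta = np.full(X.shape[1], BIG_NEG)
    beta[keep_cols] = fit_fn(X[keep_rows][:, keep_cols], y[keep_rows])
    return beta

# Placeholder fitter for the demo: intercept-only Poisson MLE (log of the
# mean count); in practice this would be an IRLS routine.
def intercept_mle(Xs, ys):
    return np.array([np.log(ys.mean())])

X = np.array([[1., 0.], [1., 1.], [1., 0.], [1., 1.]])
y = np.array([2., 0., 1., 0.])
beta = fit_in_the_limit(X, y, perfect=[1], fit_fn=intercept_mle)
assert np.isclose(beta[0], np.log(1.5)) and beta[1] == BIG_NEG
```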

| Bayesian Estimation
The next approach to dealing with the problem of separation is using a Bayesian GLM [37,38]. In this approach, we treat the parameters $\beta$ as random variables, assign them prior distributions $\pi$, and compute and maximize their posterior distribution given the observed data. For the case where $\pi$ is a zero-mean normal distribution with covariance matrix $\Sigma$, the posterior log-likelihood function, the posterior score function, and the posterior Fisher information are respectively given by

$$l_b(\beta) = l(\beta) - \tfrac{1}{2}\, \beta^T \Sigma^{-1} \beta, \tag{5a}$$
$$U_b(\beta) = U(\beta) - \Sigma^{-1} \beta, \tag{5b}$$
$$I_b(\beta) = I(\beta) + \Sigma^{-1}, \tag{5c}$$

where the subscript $b$ stands for Bayesian. In order to use the IRLS algorithm to fit the Bayesian GLM, these quantities can be substituted into Eq. 2 to maximize $l_b(\beta)$. An appropriate choice of $\Sigma$ can solve the perfect predictor problem in a couple of ways: the diagonal terms of $\Sigma$ prevent any individual parameter from diverging to $\pm\infty$, and the off-diagonal terms of $\Sigma$ can be used to impose smoothness between sets of parameters, so that sampling issues are less likely to cause perfect predictors.
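Under the zero-mean normal prior, the modified IRLS iteration can be sketched as follows (toy data; the prior covariance $\Sigma_{ij} = c^{|i-j|}$ anticipates the structure used in the case study below, and the function name is ours):

```python
import numpy as np

def map_irls_poisson(X, y, Sigma_inv, n_iter=50):
    """MAP estimation for a Poisson GLM with a zero-mean Gaussian prior
    N(0, Sigma) on beta, using the modified IRLS quantities of Eq. 5."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        lam = np.exp(X @ beta)
        U_b = X.T @ (y - lam) - Sigma_inv @ beta      # posterior score (5b)
        I_b = X.T @ (X * lam[:, None]) + Sigma_inv    # posterior Fisher info (5c)
        beta = beta + np.linalg.solve(I_b, U_b)
    return beta

# Toy separated data: column 1 is a perfect predictor.
X = np.array([[1., 0.], [1., 1.], [1., 0.], [1., 1.]])
y = np.array([2., 0., 1., 0.])

# Smoothness-inducing prior covariance Sigma_ij = c**|i-j|, applied to both
# parameters purely for illustration.
c = 0.9
idx = np.arange(X.shape[1])
Sigma = c ** np.abs(idx[:, None] - idx[None, :])

beta = map_irls_poisson(X, y, np.linalg.inv(Sigma))
assert np.all(np.isfinite(beta)) and beta[1] < 0  # perfect parameter stays finite
```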

| Regularization Methods
Another approach that is commonly used to address the problem of separation is regularization. In this approach, instead of maximizing the log-likelihood function, a weighted combination of the log-likelihood function and a penalty function $p(\beta)$ is maximized. The penalty function is intended to prevent parameters from diverging by penalizing large parameter values. The Lasso and Ridge penalty functions are two of the most commonly used [39]. A penalty function that has been used specifically to deal with the problem of separation in logistic regression is Firth's penalty function [24]. In fitting a regularized GLM, the regularized log-likelihood function, the regularized score function, and the regularized Fisher information are computed according to

$$l_r(\beta) = l(\beta) - \Lambda\, p(\beta), \tag{6a}$$
$$U_r(\beta) = U(\beta) - \Lambda\, \nabla p(\beta), \tag{6b}$$
$$I_r(\beta) = I(\beta) + \Lambda\, \nabla^2 p(\beta), \tag{6c}$$

where the subscript $r$ stands for regularization and $\Lambda$ is a tuning parameter determining the amount of penalization. To extend the IRLS procedure, these quantities can be used directly in Eq. 2 to maximize the regularized log-likelihood function. It is worth mentioning that some Bayesian approaches provide estimates equivalent to regularization methods with specific penalty functions [40,41,42]. For instance, the Bayesian GLM described in Eq. 5 can be framed as a regularization method with the penalty function $p(\beta) = \frac{1}{2} \beta^T \Sigma^{-1} \beta$.
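For the ridge penalty $p(\beta) = \beta^T \beta$, the penalized quantities of Eq. 6 give the following sketch (the intercept is penalized too, purely for brevity; in practice it is often left unpenalized, and the function name is ours):

```python
import numpy as np

def ridge_irls_poisson(X, y, Lam=0.1, n_iter=50):
    """Penalized IRLS for a Poisson GLM with ridge penalty p(beta) = beta'beta:
    grad p = 2*beta and Hessian of p = 2*I in the quantities of Eq. 6."""
    beta = np.zeros(X.shape[1])
    P = 2.0 * Lam * np.eye(X.shape[1])
    for _ in range(n_iter):
        lam = np.exp(X @ beta)
        U_r = X.T @ (y - lam) - P @ beta      # penalized score
        I_r = X.T @ (X * lam[:, None]) + P    # penalized Fisher information
        beta = beta + np.linalg.solve(I_r, U_r)
    return beta

# The perfect predictor (column 1) now gets a finite, negative estimate.
X = np.array([[1., 0.], [1., 1.], [1., 0.], [1., 1.]])
y = np.array([2., 0., 1., 0.])
beta = ridge_irls_poisson(X, y)
assert np.all(np.isfinite(beta)) and beta[1] < 0
```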

| Change of Basis
The next family of methods focuses on reparameterizing the model in order to impose smoothness between specific sets of parameters. This method can be used when we believe that the receptive field has smoothness properties that are not explicitly captured by the model structure. For example, it is common to use estimated spike rates in nonoverlapping spatial bins to model place fields, despite the fact that this representation does not capture the tendency of rates in adjacent bins to be close. Similarly, models of refractoriness or bursting often focus on autocovariance in lagged bins, ignoring the tendency of neurons to have a smooth history-dependence structure. One way of imposing smoothness on predictors is to use basis functions such as smoothing splines [43,44] or raised cosines [45]. Employing this approach, Eq. 1b is replaced by

$$\log \lambda_i = \beta_0 + \sum_{k=1}^{q} \beta_k\, g_k(x_{i\cdot}), \tag{7}$$

where $g_k$ is the $k$-th basis function and $q$ denotes the number of basis functions. Any standard optimization algorithm, such as the IRLS algorithm described in Eq. 2, can be used to fit this model. This approach is capable of alleviating issues related to both structural and sampling perfect predictors by introducing new, smoother predictors $g_k(x_{i\cdot})$ that integrate information over a range of the receptive field that is better sampled.
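As one concrete choice of basis, a simplified raised-cosine basis can be built as follows (linearly spaced centers for simplicity, unlike the log-spaced centers often used in [45]; the function name is ours):

```python
import numpy as np

def raised_cosine_basis(n_basis, n_lags):
    """Raised-cosine basis for spike-history effects. Returns an
    (n_lags, n_basis) matrix B; the reparameterized history design matrix
    is then X_hist @ B, replacing n_lags lag columns with n_basis smooth
    columns."""
    lags = np.arange(n_lags, dtype=float)
    centers = np.linspace(0.0, n_lags - 1.0, n_basis)
    width = 2.0 * (centers[1] - centers[0])   # support of each cosine bump
    arg = np.clip((lags[:, None] - centers[None, :]) * np.pi / width,
                  -np.pi, np.pi)              # clip -> bump is 0 outside support
    return 0.5 * (1.0 + np.cos(arg))

B = raised_cosine_basis(n_basis=10, n_lags=200)
assert B.shape == (200, 10)
assert np.all(B >= 0) and np.all(B.sum(axis=1) > 0)  # every lag is covered
```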

| Modified Search Criteria
Another approach, sometimes referred to as constrained optimization, is to modify the search or stopping criteria of the estimation algorithm to prevent parameters from diverging [46,47,48]. One explicit way of doing so is to restrict the search space of the optimization algorithm and identify optimal parameters within this restricted space. Another way of preventing the estimated parameters from diverging is to require parameter updates to improve the likelihood by at least a threshold value. For example, if we use the IRLS algorithm to maximize the likelihood, we can update $\beta_j$ in Eq. 2 only if the $j$-th component of the score, $U_j(\beta)$, is greater than a selected threshold. This prevents the IRLS algorithm from making adjustments to $\beta$ where the log-likelihood function is nearly flat, which is characteristic of the likelihood in the presence of perfect predictors.
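A minimal way to realize the restricted-search idea is to project each IRLS update back onto the allowed ball (an illustrative sketch, not the only possible constrained-optimization scheme; the function name and toy data are ours):

```python
import numpy as np

def bounded_irls_poisson(X, y, r, n_iter=100):
    """IRLS with the search restricted to {beta : sum_j beta_j^2 <= r} over
    the non-intercept parameters, by projecting each update onto the ball."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        lam = np.exp(X @ beta)
        U = X.T @ (y - lam)
        I = X.T @ (X * lam[:, None])
        beta = beta + np.linalg.solve(I, U)
        nrm2 = float(np.sum(beta[1:] ** 2))
        if nrm2 > r:                           # project back onto the ball
            beta[1:] *= np.sqrt(r / nrm2)
    return beta

# With a perfect predictor (column 1), the estimate stops at the boundary
# instead of diverging; here r = (1 non-intercept parameter) * d**2, d = -5.
X = np.array([[1., 0.], [1., 1.], [1., 0.], [1., 1.]])
y = np.array([2., 0., 1., 0.])
beta = bounded_irls_poisson(X, y, r=(-5.0) ** 2)
assert np.isclose(beta[1], -5.0)
```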

| Data Set and the Specific Estimation Algorithms
In this section, we pick one specific method from each family of approaches introduced in the Methods section and compare them on the problem of fitting spiking data collected from rat cortical neurons 1 . This data set, described previously [49,50], consists of repetitions of a 60-second stimulation protocol, where for the first 39 seconds both the injected current waveform in pA and the voltage responses are provided in bins of length 1 ms. We only use the first 39 seconds of these repetitions to construct our training set. Two repetitions are used to fit the model and the remaining repetitions are used for resampling, to compute cross-validation statistics, and to examine whether perfect predictors arise due to subsampling. Different pairs of repetitions are used to fit each model multiple times (10 times in total), and the mean and standard deviation of each performance measure are reported in Table 1.
We denote the firing rate of the neuron at time t by λ (t ), the spiking count in the interval (t , t + ∆t ) by ∆N t , the firing history of the neuron up to time t by H t , and let p be a fixed value representing the length of history which we believe can influence the current spike intensity. We divide the whole range of current stimulus levels into q disjoint intervals, and let I j (t ) be an indicator function that takes value 1 if the input current at time t falls into the j -th interval.
We relate the firing rate at time $t$ to the past history of spike counts and the input current using the following point process GLM:

$$\log \lambda(t \mid H_t) = \beta_0 + \sum_{j=1}^{p} \beta_j\, \Delta N_{t-j} + \sum_{j=1}^{q} \beta_{p+j}\, I_j(t), \tag{8}$$

where $p$ and $q$ are set to 200 ms and 6, respectively. In general, statistical model identification procedures, such as stepwise regression, can be used to select the order of a model. Here, our purpose is not to identify the most parsimonious model, but to introduce a model that can capture specific features of the data and to compare multiple fitting procedures in their handling of perfect predictors.
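A sketch of how a design matrix for a model of this form might be assembled (the quantile-based binning of current levels, the synthetic data, and the function name are our assumptions; the paper's exact bin edges are not specified here):

```python
import numpy as np

def build_design(spikes, current, p=200, q=6):
    """Design matrix in the spirit of Eq. 8: intercept, p spike-history lags,
    and q stimulus-level indicator columns."""
    n = len(spikes)
    X = np.ones((n - p, 1 + p + q))
    for j in range(1, p + 1):                  # Delta N_{t-j}, lags 1..p
        X[:, j] = spikes[p - j:n - j]
    edges = np.quantile(current, np.linspace(0.0, 1.0, q + 1))[1:-1]
    bins = np.digitize(current[p:], edges)     # bin index in 0..q-1
    for j in range(q):                         # indicator I_j(t)
        X[:, 1 + p + j] = (bins == j)
    return X, spikes[p:]

# Illustrative use with synthetic data:
rng = np.random.default_rng(0)
spikes = rng.binomial(1, 0.05, size=1000).astype(float)
current = rng.normal(size=1000)
X, y = build_design(spikes, current)
assert X.shape == (800, 207) and y.shape == (800,)
assert np.allclose(X[:, 201:].sum(axis=1), 1.0)  # one level indicator per row
```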
When we fit this model for a single neuron using the IRLS algorithm directly, we obtain a fit with 31 perfect predictors. Twenty of these are associated with the parameters that capture history dependence (the first sum of Eq. 8) at small lags, due to the refractory period of the neuron. These likely reflect structural perfect predictors, in that we would not expect to see any spike occur immediately after another even if we collected much more data. There are also two perfect predictors corresponding to the two lowest current indicator functions, $I_1(t)$ and $I_2(t)$. It is not immediately clear whether these are structural or sampling perfect predictors (i.e. whether the neuron has a non-negligible probability of firing when the stimulus current is this low). In this case, when we increase the data size (by using the extra trials in the training set to fit the model), these perfect predictors remain, suggesting that they may be structural as well. The remaining perfect predictors occurred either in the history component of the model, at specific lags whose neighbors were not perfect predictors, or in the stimulus response component, at stimulus levels that we expect could drive some amount of neural spiking. To analyze whether these perfect predictors are structural or due to sampling, we expand the training data by including additional trials and fit the same model described in Eq. 8. We observe that all of these perfect predictors vanish, and hence we categorize them as sampling perfect predictors, since they existed only because of limited sampling in the training data. Without using any of the methods described in the Methods section, all the perfect predictors in this example diverged to large negative numbers, limited only by the finite iteration limit of the IRLS algorithm.
In order to compare the general approaches defined in the Methods section, we selected one specific method from each family that, in our estimation, is the most natural option or is most likely to be used by neural data analysts when encountering the problem of separation. However, we focus only on those advantages and disadvantages of each method that can be generalized to all methods of the family from which it was selected. Below, we specify the method selected from each family of approaches.
Fixed Iteration Limit. We use the IRLS algorithm which is widely used for fitting point process GLMs. We set the iteration bound equal to 100, and run the IRLS algorithm without adjusting the model or estimation procedure.
We call this specific method "Standard IRLS" throughout the rest of this paper.

Maximum Likelihood Limit.
In this method, the perfectly predicted rows and the columns associated with perfect parameters are removed from X and Y prior to fitting the model. The final model includes the estimated parameters when perfect predictors are removed, combined with perfect parameters set to −∞.

Bayesian Estimation.
There are two types of parameters in the model described by Eq. 8: the ones corresponding to past spike counts ($\beta_{1:p}$) and the ones corresponding to the input current ($\beta_{p+1:p+q}$). We consider separate prior distributions for these two groups, as we observed no correlation between their parameters. Each prior is a zero-mean multivariate normal distribution with a covariance matrix $\Sigma$ with elements $\Sigma_{ij} = c^{|i-j|}$, where $c \in [0, 1]$. This choice of $\Sigma$ imposes a temporal correlation on the parameters that falls off geometrically as the distance between them increases, and therefore can deal with the problem of separation by imposing smoothness between nearby parameter estimates. We selected $c = 0.9$ using leave-one-out cross-validation. We note that more advanced methods for estimating this parameter are possible, e.g. treating it as a hyperparameter in a hierarchical Bayesian model.

Regularization.
We used ridge regression, which adds an $L_2$ penalty term, to fit the model parameters because of its common usage in the neural data analysis literature. The penalty function in Eq. 6 is therefore $p(\beta) = \beta^T \beta$. The choice of $\Lambda$ can significantly affect the outcome of the regularized model, so we use cross-validation to determine it. The value that gave the best model fit was $\Lambda = 0.1$. This specific method is called "Ridge GLM" throughout the rest of this paper.
Change of Basis. We reparameterized the model using a cardinal spline basis expansion [43,44]. Selecting the knots involves a trade-off: with too many knots the basis can remain flexible enough to leave the separation problem intact, while with too few knots, fine structure in the data will be lost due to excessive imposed smoothness. This makes it challenging to determine the ideal number and location of the knots. One complicated but well-established option is to use Bayesian curve-fitting with free-knot spline algorithms [51]. A simpler option is to distribute the knots regularly and use cross-validation only to select their number. In this case study, we used the latter approach, picking the number of knots that gives the best goodness-of-fit. This specific method is called "Cubic Smoothing Spline" throughout the rest of this paper.

Modified Search Criterion.
As an example of a modified search criterion approach, we restrict the search space of the IRLS algorithm to $R = \{\beta : \sum_{j=1}^{p+q} \beta_j^2 \le r\}$. We set $r = (p+q)\, d^2$, where $p+q$ is the total number of parameters (excluding the intercept) and $d = -5$ is the threshold for identifying whether a parameter is perfect based on the output of the Standard IRLS method. We call this method "Bounded Search" in the rest of this paper.

| Comparison of Specified Methods
First, we compare the different methods based on their goodness-of-fit. One approach for investigating goodness-of-fit is to examine Kolmogorov-Smirnov (KS) plots comparing the empirical and theoretical distributions of transformed spike times based on the time-rescaling theorem [52,53]. Another approach for comparing goodness-of-fit between models is to examine the proportion of deviance in a null model that is explained by each, expressed as

$$R = 1 - \frac{D_M}{D_N}. \tag{9}$$

Here, $D_M$ and $D_N$ denote the deviance [13] of the proposed model and the null model, respectively, where the null model is a point process with constant intensity. For a Poisson GLM with the canonical log link function, the deviance is given by

$$D = 2 \sum_{i=1}^{n} \left[\, y_i \log\!\left(\frac{y_i}{\hat{\lambda}_i}\right) - (y_i - \hat{\lambda}_i)\, \right], \tag{10}$$

where $\hat{\lambda}$ is the estimated intensity from the model. The value of $R$ gives the fraction of the null-model deviance that the proposed model explains. This measure does not attempt to account for over-fitting and tends to be higher for larger models. Therefore, we also consider a cross-validated version of $R$, denoted $R_{CV}$, which can be used directly to assess the predictive power of the fitted models. Finally, we compare the different methods based on the computational resources they require. To do so, we compute the relative run time and the relative peak memory of each method with respect to Standard IRLS on the same system.
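The deviance and the ratio $R$ can be computed directly from these definitions (a sketch assuming the usual convention that $y_i \log(y_i/\hat{\lambda}_i) = 0$ when $y_i = 0$; function names are ours):

```python
import numpy as np

def poisson_deviance(y, lam_hat):
    """Poisson deviance; the y*log(y/lam) term is taken as 0 when y = 0."""
    safe_y = np.where(y > 0, y, 1.0)
    term = np.where(y > 0, y * np.log(safe_y / lam_hat), 0.0)
    return 2.0 * np.sum(term - (y - lam_hat))

def deviance_ratio(y, lam_model):
    """R = 1 - D_M / D_N: fraction of null-model deviance explained."""
    lam_null = np.full_like(y, y.mean(), dtype=float)  # constant-intensity null
    return 1.0 - poisson_deviance(y, lam_model) / poisson_deviance(y, lam_null)

# Sanity check: a model that predicts the constant mean explains nothing.
y = np.array([0., 2., 1., 3., 0., 2.])
assert np.isclose(deviance_ratio(y, np.full_like(y, y.mean())), 0.0)
```

Computing $R_{CV}$ amounts to evaluating `deviance_ratio` on held-out data using intensities predicted by the model fit on the training data.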
All models have been fit 10 times, each using a different pair of trials from the training set, and mean and standard deviation values are computed. The Cubic Smoothing Spline method shows the largest relative peak memory after Standard IRLS, because it needs to construct and store a new design matrix before running the IRLS algorithm. We note that the variability of the $R_{CV}$ and relative run time values suggests that, for this example, differences in these measures are not likely due to chance. While this is true for this specific example, these trends may or may not hold for other analysis problems and model structures; a more systematic analysis would be necessary to establish statistical significance between methods for any particular inference problem.
Another issue that accompanies the complete separation or perfect prediction problem is that the observed Fisher information at the maximum likelihood solution typically does not accurately reflect the variance-covariance structure of the parameter estimates. In particular, the observed Fisher information is often close to singular, and if inappropriately used to construct confidence bounds and perform hypothesis tests, typically leads to extremely large standard error estimates for the perfect predictors and inaccurate covariance between the perfect predictors and all other signals [23,31,32]. Below, we examine how each of the methods deals with this problem.
Bootstrapping represents one alternative to the Fisher information for computing standard error estimates for perfect predictors under the Standard IRLS estimator. In this approach, random subsets of the original data are used to fit the model multiple times, and the empirical distribution of the estimated values is then used to compute standard error estimates for all parameters. Bootstrapping is far more computationally expensive than the other proposed methods, but results in more robust standard error estimates for IRLS, and is therefore considered our benchmark. We focus on the first 100 predictors, which are associated with the influence of past spike counts at lags of 1-100 ms, and illustrate the estimated correlation between one fixed predictor and all other predictors. For the fixed predictor, we consider a sampling perfect predictor (left column), a non-perfect predictor (middle column), and a putative structural perfect predictor associated with the neuron's refractory period (right column). We observe that for non-perfect predictors, all methods perform more or less similarly, with Bayesian Estimation providing a smoother version of the correlation structure estimated by the Ridge GLM and Standard IRLS methods (middle column). In this case, we can see a local correlation structure for parameters around the fixed (60th) predictor, which is expected: the effects of firing activity at lags $i$ and $i+1$ are likely to be correlated. For the sampling perfect predictor (left column), we see that Standard IRLS results in zero correlation with all non-perfect predictors. However, we know that this predictor is perfect only due to sampling, and hence we expect to see a temporal structure similar to what we observed between two non-perfect predictors in the middle column. Ridge GLM shows this local correlation structure weakly, but with substantially reduced effect size and statistical power.
On the other hand, the correlation estimated by the Bayesian GLM is much more aligned with what we expect to see, and clearly shows a temporal correlation structure similar to that between non-perfect predictors. For the putative structural perfect parameter, Standard IRLS again estimates all correlations with non-perfect predictors to be zero. In this case, this might be the preferred inference; if it is impossible for the neuron to fire during its refractory period, this should be true regardless of the influence of a spike beyond the refractory period. Ridge GLM also gives very small correlations here, but the Bayesian GLM still estimates a smooth correlation structure, consistent with the prior distribution.

DISCUSSION
In this paper, we investigated the problem of complete separation in GLMs, which leads to the failure of the IRLS algorithm to converge when a single predictor or a linear combination of multiple predictors completely separates a response from a non-response. This occurs frequently in point process GLMs due, first, to structural properties of neurons and their receptive fields that prevent a response (e.g., refractoriness) and, second, to incomplete sampling of the signals that influence spiking, which leads to no observed neural responses. This phenomenon can have a substantial impact on modeling and statistical inference from neural data. We presented broad classes of approaches to deal with this problem (section 2) and illustrated how they compare in practice (section 3). Here, we further discuss some of the advantages and disadvantages associated with each family of approaches.

Standard IRLS
Advantages. Already implemented in many statistical software packages. Describes the existing data set optimally. Inference for non-perfect parameters is typically reliable.
Disadvantages. Very slow; continues to run until the iteration limit is reached. Fitted models generalize poorly to other datasets. Poor estimates of the variance-covariance structure for sampling perfect parameters make inference about these parameters challenging.
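The divergence described above can be reproduced on a toy example. In the hypothetical simulation below, we force zero spikes wherever an indicator covariate is active, mimicking a sampling perfect predictor; its IRLS estimate then drifts toward −∞ with every iteration, while the non-perfect coefficient stabilizes near its true value.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)                      # ordinary (non-perfect) covariate
z = (x > 1.5).astype(float)                 # indicator of a rarely visited region
y = rng.poisson(np.exp(-1.0 + 0.5 * x))
y[z == 1] = 0                               # no spikes ever observed while z = 1,
                                            # so z is a sampling perfect predictor
X = np.column_stack([np.ones(n), x, z])

beta = np.zeros(3)
trace = []
for _ in range(15):                         # Newton-Raphson / IRLS iterations
    mu = np.exp(X @ beta)
    beta = beta + np.linalg.solve(X.T @ (X * mu[:, None]) + 1e-9 * np.eye(3),
                                  X.T @ (y - mu))
    trace.append(beta[2])

# beta[2] keeps decreasing at every iteration (no finite MLE exists),
# while beta[1] remains stable near its generating value of 0.5.
```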

Maximum Likelihood Limit
Advantages. Fast. Describes the existing data optimally. Inference for non-perfect parameters is typically reliable.
Disadvantages. Fitted models generalize poorly to other datasets. Not typically implemented in statistical software packages. No explicit computation of the Fisher information for estimating the variance-covariance structure of the perfect parameters.
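One possible sketch of the maximum-likelihood-limit idea, under the simplifying assumption of binary (0/1) candidate predictors: perfect predictors are detected, their coefficients are set to −∞, and the remaining parameters are refit on the reduced model. The detection rule and helper names are ours, not the paper's implementation.

```python
import numpy as np

def fit_in_the_limit(X, y, n_iter=30):
    """Maximum-likelihood-limit sketch for 0/1 predictors."""
    active = X != 0
    # a predictor is perfect if no spike is ever observed while it is active
    perfect = np.array([active[:, j].any() and not y[active[:, j]].any()
                        for j in range(X.shape[1])])
    keep = ~perfect
    # with beta_j = -inf, bins where predictor j is active have rate 0 and
    # contribute nothing to the likelihood of the remaining parameters
    rows = ~active[:, perfect].any(axis=1)
    Xr, yr = X[np.ix_(rows, keep)], y[rows]
    beta = np.zeros(Xr.shape[1])
    for _ in range(n_iter):                 # Newton-Raphson on the reduced model
        mu = np.exp(Xr @ beta)
        beta = beta + np.linalg.solve(Xr.T @ (Xr * mu[:, None]), Xr.T @ (yr - mu))
    out = np.full(X.shape[1], -np.inf)
    out[keep] = beta
    return out

# hypothetical example: intercept, one binary covariate, one perfect predictor
rng = np.random.default_rng(3)
n = 400
f = rng.integers(0, 2, n).astype(float)
y = rng.poisson(0.5, n)
z = np.zeros(n)
z[np.where(y == 0)[0][:30]] = 1.0           # active only where no spike occurred
X = np.column_stack([np.ones(n), f, z])
beta = fit_in_the_limit(X, y)               # beta[2] is exactly -inf
```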

Bayesian Estimation
Advantages. Provides increased statistical power and prediction accuracy when appropriate priors are available.
Allows for precise estimation of confidence bounds and covariance between parameters. Prior can be chosen to distinguish between covariates that may lead to structural or sampling perfect predictors.

Disadvantages. Requires methods for selecting appropriate priors. Results may be sensitive to the choice of prior.
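As a minimal illustration, the sketch below computes a MAP estimate under a generic unit-variance Gaussian prior (a choice of ours, not the paper's prior) together with a Laplace approximation to the posterior covariance. Unlike plain IRLS, the perfect predictor's estimate and its posterior variance remain finite.

```python
import numpy as np

def map_poisson(X, y, prior_prec, n_iter=50):
    """MAP estimate for a Poisson GLM under beta ~ N(0, prior_prec^{-1}),
    with the Laplace approximation to the posterior covariance (a sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        grad = X.T @ (y - mu) - prior_prec @ beta     # log-posterior gradient
        hess = X.T @ (X * mu[:, None]) + prior_prec   # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)
    mu = np.exp(X @ beta)
    cov = np.linalg.inv(X.T @ (X * mu[:, None]) + prior_prec)
    return beta, cov

# Hypothetical data with a sampling perfect predictor in the last column.
rng = np.random.default_rng(4)
n = 800
x = rng.normal(size=n)
y = rng.poisson(np.exp(-1.0 + 0.5 * x))
z = np.zeros(n)
z[np.where(y == 0)[0][:40]] = 1.0                     # active only in spike-free bins
X = np.column_stack([np.ones(n), x, z])

beta, cov = map_poisson(X, y, prior_prec=np.eye(3))
# beta stays finite, and cov provides a well-defined variance-covariance
# structure that includes the perfect predictor.
```

A smoothing prior (e.g., a precision matrix penalizing second differences between neighboring lags) would reproduce the smooth correlation structure discussed in section 3; the identity precision here is only the simplest possible stand-in.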

Regularization Methods
Advantages. Already implemented in many software packages. Controls perfect parameters with minimal impact on non-perfect parameters.
Disadvantages. Sensitive to the tuning parameter, whose selection typically requires an exhaustive search for an optimal value; this makes the method relatively slow. Can result in diminished statistical power (compared to Bayesian Estimation and Change of Basis) since it treats structural and sampling perfect predictors identically.
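The tuning step can be sketched as an exhaustive grid search scored on held-out log-likelihood, which is the part that makes regularization comparatively slow. The penalty grid, the simulated data, and the convention of leaving the intercept unpenalized are illustrative assumptions.

```python
import numpy as np

def ridge_poisson(X, y, lam, n_iter=40):
    """Poisson GLM with penalty (lam/2)*||beta||^2 on all terms except the
    intercept (a sketch; packaged implementations differ in details)."""
    p = X.shape[1]
    P = lam * np.eye(p)
    P[0, 0] = 0.0                                  # unpenalized intercept
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        beta = beta + np.linalg.solve(X.T @ (X * mu[:, None]) + P,
                                      X.T @ (y - mu) - P @ beta)
    return beta

def poisson_loglik(X, y, beta):
    eta = X @ beta
    return float(np.sum(y * eta - np.exp(eta)))

# Exhaustive search over a penalty grid, scored on held-out data.
rng = np.random.default_rng(5)
n = 1200
x = rng.normal(size=n)
y = rng.poisson(np.exp(-1.0 + 0.5 * x))
z = np.zeros(n)
z[np.where(y == 0)[0][:50]] = 1.0                  # sampling perfect predictor
X = np.column_stack([np.ones(n), x, z])
tr, va = slice(0, 800), slice(800, None)

grid = [0.01, 0.1, 1.0, 10.0]
best = max(grid, key=lambda lam: poisson_loglik(X[va], y[va],
                                                ridge_poisson(X[tr], y[tr], lam)))
```

Even the smallest penalty in the grid keeps the perfect predictor's estimate finite, at the cost of one full refit per candidate value.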

Change of Basis
Advantages. Leads to more parsimonious models that take advantage of known structure in receptive field models. With a judicious choice of model structure, this can lead to a substantial increase in statistical power and model interpretability and reductions in computational time. Prior knowledge can be used to select a model structure that distinguishes structural and sampling perfect predictors.
Disadvantages. Selecting an appropriate model structure can be challenging. Some reparameterizations may not solve the perfect prediction problem. The results may be quite sensitive to the choice of model structure.
Inferences may require inversion of transformations associated with reparameterization.

Modifying Search Criterion
Advantages. Has minimal impact on non-perfect parameters.
Disadvantages. Results may be sensitive to the choice of criterion. It can result in diminished statistical power (compared to Bayesian Estimation and Change of Basis) since it treats structural and sampling perfect predictors identically.
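A simple instance of this idea is to declare convergence when the log-likelihood gain per iteration falls below a tolerance: under complete separation the likelihood flattens out even though the coefficient diverges, so this criterion halts with a finite (large negative) estimate. The sketch below is our own, not a particular package's implementation.

```python
import numpy as np

def irls_loglik_stop(X, y, tol=1e-6, max_iter=500):
    """Newton/IRLS with a modified convergence criterion: stop when the
    log-likelihood improvement drops below tol, rather than waiting for the
    gradient or parameter change to vanish (which never happens under
    complete separation)."""
    def loglik(b):
        eta = X @ b
        return float(np.sum(y * eta - np.exp(eta)))
    beta = np.zeros(X.shape[1])
    ll = loglik(beta)
    it = 0
    for it in range(1, max_iter + 1):
        mu = np.exp(X @ beta)
        beta = beta + np.linalg.solve(
            X.T @ (X * mu[:, None]) + 1e-9 * np.eye(X.shape[1]), X.T @ (y - mu))
        new_ll = loglik(beta)
        if new_ll - ll < tol:          # likelihood has flattened out: stop
            break
        ll = new_ll
    return beta, it

# separated data: an indicator active only in spike-free bins
rng = np.random.default_rng(6)
n = 1000
x = rng.normal(size=n)
y = rng.poisson(np.exp(-1.0 + 0.5 * x))
z = np.zeros(n)
z[np.where(y == 0)[0][:40]] = 1.0
X = np.column_stack([np.ones(n), x, z])
beta, iters = irls_loglik_stop(X, y)
# terminates well before max_iter with all estimates finite
```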
Knowing the potential advantages and disadvantages of each of these approaches can help modelers select one, or a subset, of these methods to explore their data. We can also use these to develop a sequence of questions to guide this selection; for example, if the questions are more focused on global properties of the place field (e.g., its center and width), then a smoothed estimate based on a change to a spline basis might be more appropriate.
In some studies the objective is not only to estimate the coding properties of neurons, but also to infer physiological features of the underlying neural system. For instance, in the context of single-neuron spiking models, parameters related to the influence of past spike history have been interpreted in terms of a combination of ionic currents and a moving threshold, allowing researchers to relate inferred parameters to these biological processes [33,34,35,36].
In this case, model parameter estimates of −∞ may lead to substantively different inferences than ones with large negative values, even if both lead to nearly identical predictions of the spike intensity at all times. Therefore, it is important for modelers to understand how bias related to the choice of approach can affect such physiological inferences. For this example, the maximum likelihood limit approach might be used and parameter estimates of −∞ might suggest that physiological inferences are not reliable, or the Bayesian approach might be used, with priors based on physiological knowledge related to currents and threshold dynamics.
Perfect predictors can arise in neural data due to many different factors, such as structural properties of the receptive field, lack of data for some regions of the receptive field, or even features of the receptive field that cannot be sampled under some experimental conditions. In this paper, we focused on point process GLMs, but complete separation can also happen in other domains of neuroscience. In particular, GLMs are used for binomial and multinomial data, which occur quite often when one looks at behavioral measures in neuroscience experiments. The ideas presented in this paper are not limited to point process GLMs, and can be applied directly to studies that involve other kinds of GLMs. Neuroscience experiments going forward are likely to yield larger data sets with many more neurons, and potentially with receptive field structures that involve many predictors simultaneously. As a result, the models needed to explain such large and complex data sets are more likely to encounter the problem of complete separation. We anticipate that the types of methods and decision procedures discussed in this paper will become increasingly important as experiments and data sets evolve to capture more complex structures of the stimuli.

ACKNOWLEDGEMENTS
The work for this project was supported by the Simons Collaboration on the Global Brain, 542971, and the National Institute of Mental Health, MH105174. We also thank the CRCNS data sharing website, and Thomas Berger, Richard Naud, Henry Markram, and Wulfram Gerstner for collecting the data for challenge A of the 2009 quantitative single-neuron modeling competition (the data used in this paper) and publishing it publicly on CRCNS.org.
four consecutive elements $\{s_{j,i-1}, s_{j,i}, s_{j,i+1}, s_{j,i+2}\}$ that are given by
$$\begin{pmatrix} s_{j,i-1} & s_{j,i} & s_{j,i+1} & s_{j,i+2} \end{pmatrix} = \begin{pmatrix} 1 & \alpha & \alpha^2 & \alpha^3 \end{pmatrix} \begin{pmatrix} 0 & 1 & 0 & 0 \\ -t & 0 & t & 0 \\ 2t & t-3 & 3-2t & -t \\ -t & 2-t & t-2 & t \end{pmatrix},$$
where $\alpha$ is the relative position within the knot interval and $0 \le t \le 1$ is called the tension parameter and must be picked prior to constructing $S$. In order to find the relation between the original parameters ($\beta_i$ in Eq. 1b) and the parameters after changing basis ($\tilde{\beta}_i$ in Eq. 7), we start by simplifying the right hand side of Eq. 7.
$$\lambda_j = \exp\!\Big(\tilde{\beta}_0 + \sum_{i=1}^{p} x_{j,i} \sum_{k=1}^{q} s_{i,k}\,\tilde{\beta}_k\Big) \qquad (13)$$
Comparing Eq. 1b and Eq. 13, and noting that the second sum in Eq. 13 is the $i$-th element of the vector $S\tilde{\beta}$, reveals the relation between $\beta$ and $\tilde{\beta}$:
$$\beta_0 = \tilde{\beta}_0, \qquad \beta_{1:p} = S\tilde{\beta}_{1:q}.$$
Based on this relation, any new predictor is a linear combination of the original predictors, and as long as some non-perfect predictor appears with nonzero weight in this combination, the new predictor will not be perfect. This can be assured by choosing the knots such that there exists at least one non-perfect predictor between every two successive knots.
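This knot-placement rule can be checked numerically. The sketch below builds $S$ from the standard cardinal-spline blending matrix (the ghost-knot boundary handling is our own simplification, and the simulated perfect lag is hypothetical) and verifies that every spline predictor touching the perfect lag also puts nonzero weight on non-perfect lags.

```python
import numpy as np

def cardinal_spline_basis(p, knots, t=0.5):
    """Cardinal-spline basis S of shape (p, len(knots)+2) with tension t,
    built from the standard blending matrix; one ghost control point pads
    each end. An illustrative construction, not the paper's exact code."""
    M = np.array([[0.0, 1, 0, 0],
                  [-t, 0, t, 0],
                  [2 * t, t - 3, 3 - 2 * t, -t],
                  [-t, 2 - t, t - 2, t]])
    S = np.zeros((p, len(knots) + 2))
    for i in range(1, p + 1):
        # interval index (clipped so boundary lags use the end intervals)
        j = min(max(np.searchsorted(knots, i, side="right"), 1), len(knots) - 1)
        a = (i - knots[j - 1]) / (knots[j] - knots[j - 1])
        S[i - 1, j - 1:j + 3] = np.array([1.0, a, a**2, a**3]) @ M
    return S

# 60 history lags; suppose lag 40 is a sampling perfect predictor.
p, perfect_lag = 60, 40
knots = np.arange(0.0, p + 1, 6.0)       # >= one non-perfect lag between knots
S = cardinal_spline_basis(p, knots)

perfect = np.zeros(p, dtype=bool)
perfect[perfect_lag - 1] = True
# every spline column that involves the perfect lag also weights non-perfect
# lags, so no column of X @ S can itself be perfect
ok = all(((S[:, k] != 0) & ~perfect).any()
         for k in range(S.shape[1]) if S[perfect_lag - 1, k] != 0)
```

Each row of this $S$ sums to one (the blending weights form a partition of unity), which is a convenient sanity check on the construction.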