Correlated Components Analysis - Extracting Reliable Dimensions in Multivariate Data

How does one find dimensions in multivariate data that are reliably expressed across repetitions? For example, in a brain imaging study one may want to identify combinations of neural signals that are reliably expressed across multiple trials or subjects. For a behavioral assessment with multiple ratings, one may want to identify an aggregate score that is reliably reproduced across raters. Correlated Components Analysis (CorrCA) addresses this problem by identifying components that are maximally correlated between repetitions (e.g. trials, subjects, raters). Here we formalize this as the maximization of the ratio of between-repetition to within-repetition covariance. We show that this criterion maximizes repeat-reliability, defined as mean over variance across repeats, and that it leads to CorrCA or to multi-set Canonical Correlation Analysis, depending on the constraints. Surprisingly, we also find that CorrCA is equivalent to Linear Discriminant Analysis for zero-mean signals, which provides an unexpected link between classic concepts of multivariate analysis. We present an exact parametric test of statistical significance based on the F-statistic for normally distributed independent samples, and present and validate shuffle statistics for the case of dependent samples. Regularization and extension to non-linear mappings using kernels are also presented. The algorithms are demonstrated on a series of data analysis applications, and we provide all code and data required to reproduce the results.


| INTRODUCTION
Consider the following scenario: A group of people are shown a movie while their neural activity is recorded. Guided by the assumption that the movie evokes similar brain responses in different subjects, the goal is to identify movie-related brain activity that is common across individuals [1]. Because activity is distributed across multiple sensors, we would like to find linear combinations of sensors that have a similar time course in all subjects. In other words, we would like to identify underlying components or factors that have high inter-subject correlation.
In a different setting, the task may be to assess motor skills. A clinician observes a group of individuals performing various tasks and provides a rating for each person and task [2]. Typically, ratings are combined across tasks so that each person obtains a single aggregate performance score. But how should individual task ratings be combined? Traditionally, this is done by simple averaging of related scores. We propose instead a data-driven approach that aims to make the aggregate scores as consistent as possible across different raters. This means that we want to combine scores so as to maximize inter-rater reliability.
What is common in these two scenarios is the goal of identifying directions in multivariate data sets that are reliably reproduced across repetitions. Correlated Components Analysis (CorrCA) accomplishes this goal. CorrCA was originally developed in the context of neuroimaging studies to extract similar activity in multiple subjects [1], and was re-developed independently to extract reliable brain responses across repeated trials in a single subject [3]. Here we will show that the technique can also be used to identify aggregate ratings with high inter-rater reliability.
More generally, CorrCA is applicable whenever a data volume is available with dimensions T × D × N, where N indexes repeated measures taken with the same sensors. The method identifies directions in the D-dimensional space along which the data maximally correlate between the N repeated measurements, with correlation measured across the T samples. This is similar to Principal Components Analysis [PCA, 4] in that CorrCA finds a set of D-dimensional projection vectors that linearly decompose a data set (of size T × D). But instead of capturing dimensions with maximum variance within a single data set, as in PCA, CorrCA captures the dimensions with maximum correlation between the N data sets. As we will show here, the optimality criterion of CorrCA can also be used to derive Canonical Correlation Analysis [CCA, 5], including multi-set CCA [MCCA, 6,7]. The methods differ in that CorrCA uses a single set of linear projections for all data sets, assuming the repeated measures are all taken in the same space, whereas CCA and MCCA yield a different set of projection vectors for each data set. MCCA is well established for the purpose of identifying similar brain activity across subjects [8,9,10,11,12,13]. Here we will show, for brain signals and rating data, that CorrCA can achieve better performance than MCCA by virtue of the smaller number of free parameters.
The main contribution of this paper is to provide a didactic yet formal characterization of CorrCA and, with this, establish a few novel theoretical results. In particular, we show that maximizing the correlation between repeated measurements is indeed equivalent to maximizing the reliability of the projected data. By "reliability" we mean the average over repetitions, divided by the variance over repetitions, which is also a sensible definition for signal-to-noise ratio. We establish a direct link to MCCA, as well as to Linear Discriminant Analysis (LDA). CorrCA and MCCA both maximize inter-subject correlation, defined here as the ratio of between- to within-subject covariance, while LDA maximizes the between-class over the within-class variance. Surprisingly, LDA and CorrCA yield the same result for zero-mean signals. This direct equivalence allows us to formulate a strict statistical test for significance based on Fisher's variance ratio (F-statistic). The equivalence to LDA also suggests a straightforward extension of CorrCA to non-linear mappings using the well-known "kernel" approach. We also provide a validation of CorrCA for identifying shared dimensions using simulated data, and validate the proposed tests of statistical significance. Another important contribution of this paper is to formalize inter-subject correlation (ISC) as a correlation metric that applies to more than two signals. ISC differs from conventional definitions of correlation, yet seamlessly integrates into the frameworks of Linear Discriminant Analysis, Canonical Correlation Analysis, and the F-statistic.
Before we present the mathematical treatment, we begin with a simple two-dimensional example to illustrate the basic concept of identifying reliable signal components that are contained in a shared linear dimension.

| Extracting shared dimensions -an illustrative example
For this illustrative example we consider a simulation of electrical brain signals recorded from multiple subjects. Assume that we have two sensors, x_1(t) and x_2(t), which record temporal signals from each of two subjects, as shown in Figure 1 A and B. Panel C shows the same data as a scatter plot, with each time point of the two subjects connected by a gray line. The locations along the gray lines are determined by the subject-dependent waveform, while the shift of the parallel lines from one time instance to the next is determined by the common waveform. It is important to note that while both directions are shared by the two subjects, only one of these dimensions expresses a common waveform. Now we select a linear transformation, V, to map these data onto a component space, y(t) = V x(t), such that the component signals are maximally correlated between subjects. With this choice, the time courses of both subjects in the first component dimension, y_1(t), are now perfectly correlated (panel D). Thus, we have extracted the common source of variation in these two subjects, whereas the second component dimension, y_2(t), now captures a waveform that is not necessarily shared between subjects (panel E). As a result, the gray lines in the scatter plot are now vertical (panel F), separating the waveforms into a reproducible signal dimension (y_1) and a non-reproducible noise dimension (y_2). Evidently, V captured the source directions in these data (blue and red arrows). The objective of Correlated Components Analysis is to identify the unknown projection matrix V by analyzing the time courses x(t). The fundamental assumption of CorrCA is that there is a reproducible signal and a non-reproducible noise, and that the directions of the reproducible signal are shared between subjects.
Readers wishing to omit the theoretical treatment below may proceed directly to Section 3, which presents several applications of CorrCA to real-world data sets.

Figure 1: A common source s(t) is captured by voltage sensors x_{1l}(t) and x_{2l}(t) for subject l. This common source contributes to both subjects with identical scaling a and b: x_{1l}(t) = a s(t) + c ξ_l(t), x_{2l}(t) = b s(t) + d ξ_l(t), where ξ_l(t) is a source that contributes a different value to each subject l, but with the same mixing coefficients c, d. Note that voltages from multiple current sources combine additively in a resistive medium. (A) Time courses x_{1l}(t) for both subjects. (B) Time courses x_{2l}(t). (C) Same data, but now individual time points are connected (by gray lines) between the subjects. The source directions [a, b] and [c, d] are indicated with a red and a blue arrow, respectively, and are the same for both subjects, i.e., they are "shared" dimensions. (D) Projection y_{1l}(t) with maximal correlation. The time courses of the two subjects are perfectly correlated, with inter-subject correlation ISC = 1. (E) Projection y_{2l}(t) that is not common between subjects. (F) Along dimension y_1, both subjects' data are expressed identically, i.e., this is the ("reliable") source dimension that is shared between the two subjects. Values along dimension y_2 differ randomly between subjects; this dimension is not "reliable" across subjects.

| Problem statement: Maximal inter-subject correlation
The objective of CorrCA is to find dimensions in multivariate data that maximize correlation between repetitions.
Denote the D-dimensional data by x_{li} ∈ ℝ^D, where i = 1, …, T enumerates exemplars, and l = 1, …, N represents repeated renditions of these data. In the case of brain signals, the exemplars represent time points, so that i indexes a sequence in time of length T, and l indexes N repeated recordings (subjects or trials). In the case of a behavioral assessment instrument, i may index individuals receiving a rating, while l indexes multiple raters, or repeated ratings given to the same individuals by a single rater. In both instances, the observations are D-dimensional (D sensors that record brain signals, or D items rated in the assessment instrument). For simplicity, in the following we will employ the terms "time" and "subjects" in reference to i and l, respectively, such that correlation is measured between subjects and across time. When appropriate, we will rephrase the resulting expressions for the case of assessing inter-rater correlation or test-retest correlation.
The goal is now to identify a linear combination of these D measures, defined by a projection vector v ∈ ℝ^D:

y_{li} = v^T x_{li},    (1)

such that the correlation between the N subjects (repetitions) is maximized. We define inter-subject correlation (ISC) as the ratio of the between-subject covariance, r_B, and the within-subject covariance, r_W:

ρ = r_B / ((N − 1) r_W).    (2)

The between-subject covariance is a sum over all pairs of subjects, and the within-subject covariance is a sum over all subjects:

r_B = Σ_{l=1}^N Σ_{k=1,k≠l}^N Σ_{i=1}^T (y_{li} − ȳ_{l*}) (y_{ki} − ȳ_{k*}),    (3)

r_W = Σ_{l=1}^N Σ_{i=1}^T (y_{li} − ȳ_{l*})²,    (4)

where ȳ_{l*} = (1/T) Σ_{i=1}^T y_{li} is the sample mean of the projected data for subject l. The factor (N − 1)^{−1} in definition (2) is required so that the correlation is normalized to ρ ≤ 1 (see Appendix G.2). To simplify the notation, in our "variance" and "covariance" definitions we generally omit common scaling factors; in this case we have omitted (T − 1)^{−1} N^{−1}.
While we refer to ρ as the inter-subject correlation (ISC), it could also be termed the inter-rater correlation or inter-repeat correlation (IRC) depending on the context. In Appendix G.1 we discuss this definition of correlation and its relation to other measures of correlation, such as intra-class correlation or Pearson's correlation coefficient. Note that r W is also the variance around the subject-mean.
By inserting Eq. (1) into definition (2) it follows readily that

ρ = (v^T R_B v) / ((N − 1) v^T R_W v),    (5)

where R_B and R_W are the between-subject and within-subject covariance matrices of x_{li}, defined analogously to (3)-(4):

R_B = Σ_{l=1}^N Σ_{k=1,k≠l}^N Σ_{i=1}^T (x_{li} − x̄_{l*}) (x_{ki} − x̄_{k*})^T,    (6)

R_W = Σ_{l=1}^N Σ_{i=1}^T (x_{li} − x̄_{l*}) (x_{li} − x̄_{l*})^T.    (7)

Here x̄_{l*} = (1/T) Σ_{i=1}^T x_{li} is the sample-mean vector for subject l. The projection vector v that maximizes ρ can be found as the solution of ∂ρ/∂v = 0, which yields:

R_B v = (N − 1) ρ R_W v.    (8)

Assuming that R_W is invertible, v is an eigenvector of R_W^{−1} R_B with eigenvalue λ = (N − 1) ρ. Maximal ISC is therefore achieved by projecting the data onto the eigenvector of R_W^{−1} R_B with the largest eigenvalue. We refer to this principal eigenvector as v_1. In addition to this projection with maximal ISC, there may be additional dimensions that capture significant ISC in the data. Our goal is to find all such projection vectors, v_d, d = 1 … D. Thus, a natural objective is to maximize ISC summed over all components. However, we want each component to capture a different, independent aspect of the data. We therefore require the components to be uncorrelated, similar to the constraints enforced by Independent Components Analysis or Canonical Correlation Analysis. (The alternative of requiring orthogonal projection vectors, as in PCA, is discussed in Appendix A.5.) Thus, in total, the goal is to maximize the following cost function with respect to V = [v_1, …, v_D],

J(V) = Σ_{d=1}^D ρ_d,    (9)

subject to the constraint that components are mutually uncorrelated, i.e., v_d^T R_W v_c = 0 for all c ≠ d. The solution to this problem is given by the eigenvector matrix V that satisfies the eigenvalue equation (see Appendix A.4):

R_W^{−1} R_B V = V Λ,    (10)

where Λ is a diagonal matrix whose diagonal terms λ_d = (N − 1) ρ_d measure the ISC, ρ_d, of the corresponding projections of the data. Note that these additional projections of the data are uncorrelated, as desired, because their correlation matrix, R_W^y, defined as in Eq.
(7) but for y, is diagonal:

R_W^y = V^T R_W V = Λ_W.    (11)

The fact that Λ_W is a diagonal matrix is a well-known property of the generalized eigenvalue equation (10). In the illustrative example of Figure 1 with D = 2, the eigenvalues were ρ_1 = 1 and ρ_2 = 0.04 for the red and blue directions, respectively (see Figure 1 D and 1 F). Evidently, the shared dimension in which the two subjects vary in unison (red) has been identified. A concrete application of CorrCA to the problem of identifying brain activity that is shared between subjects is described in section 3.1. A thorough evaluation of the conditions under which the algorithm identifies the correct projection vectors v_d will be presented in section 4.1 (Fig. 7). The main finding is that the shared waveforms (signal) need to project in the same direction across subjects. Performance degrades if there is additive noise, or equivalently, if the directions of the non-reproducible waveforms (noise) differ between subjects.
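To make the recipe concrete, here is a minimal numpy sketch of CorrCA as derived above. This is our own illustration, not the authors' released implementation; the function name and the (N, T, D) data layout are choices made for this example.

```python
import numpy as np

def corrca(X):
    """Correlated Components Analysis (sketch).

    X : array of shape (N, T, D) holding N repetitions (subjects/trials)
        of T samples in D dimensions (sensors/items).
    Returns projections V (columns sorted by decreasing ISC) and the
    ISC value of each component.
    """
    N, T, D = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)        # subtract subject means
    # Within-subject covariance R_W, Eq. (7), and between-subject
    # covariance R_B, Eq. (6), as an explicit sum over pairs l != k.
    Rw = sum(Xc[l].T @ Xc[l] for l in range(N))
    Rb = sum(Xc[l].T @ Xc[k]
             for l in range(N) for k in range(N) if k != l)
    # Eigenvectors of Rw^{-1} Rb, Eq. (10); eigenvalues are (N - 1) * rho.
    lam, V = np.linalg.eig(np.linalg.solve(Rw, Rb))
    order = np.argsort(-lam.real)
    return V.real[:, order], lam.real[order] / (N - 1)
```

For data containing a waveform that is shared across all repetitions, the first component should attain an ISC near 1, while components carrying only repetition-specific noise attain ISC near 0.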
Given a set of extracted correlated components, the question arises as to how many of these components exhibit statistically significant correlations. For zero mean, independent test data we can use the F-statistic to determine statistical significance (see section 2.4, Eq. 26). Statistical testing is described in Appendix D, where we also treat the case of non-independent data using shuffle statistics. The methods for statistical testing will be validated with numerical simulations in section 4.

| Fast computation of inter-subject correlation
In Eq. (6), the term k = l is excluded from the sum over k. Note that the excluded term is precisely R_W, so that adding the two covariances together completes the sum over all pairs, including k = l. We will refer to this, therefore, as the total covariance:

R_T = R_B + R_W.    (12)

This relationship is useful because R_T can be simplified to an outer product of the data summed over repetitions,

R_T = Σ_{i=1}^T [Σ_{l=1}^N (x_{li} − x̄_{l*})] [Σ_{k=1}^N (x_{ki} − x̄_{k*})]^T,    (13)

so that R_B = R_T − R_W can be computed with a single sum over repetitions rather than a sum over all N(N − 1) pairs.
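As a quick numerical sanity check (our own, with arbitrary made-up data), the pairwise sum of Eq. (6) can be compared against the fast route via the total covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, D = 5, 100, 4
X = rng.standard_normal((N, T, D))
Xc = X - X.mean(axis=1, keepdims=True)   # remove each repetition's mean

# Naive between-repetition covariance: explicit sum over pairs l != k.
Rb = sum(Xc[l].T @ Xc[k] for l in range(N) for k in range(N) if k != l)
# Within-repetition covariance.
Rw = sum(Xc[l].T @ Xc[l] for l in range(N))

# Fast route: total covariance from the repetition-summed data, Eq. (13).
Xs = Xc.sum(axis=0)
Rt = Xs.T @ Xs

# R_T = R_B + R_W, Eq. (12)
assert np.allclose(Rt, Rb + Rw)
```

The naive route costs O(N²) matrix products, the fast route a single one after an O(N) sum.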
| Relationship to Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) operates on labeled data and finds projections that maximally separate the different classes. It is based on between-class and within-class scatter matrices. In the present notation, index i enumerates the T classes and index l enumerates the N exemplars in each class (see Table 2 in Appendix G.1 for an overview of the use of indices in the different algorithms). The vectors

x̄_{*i} = (1/N) Σ_{l=1}^N x_{li},    (14)

x̄_{**} = (1/(NT)) Σ_{l=1}^N Σ_{i=1}^T x_{li},    (15)

are the class mean and grand mean vectors, respectively. With these, the scatter matrices are

S_B = N Σ_{i=1}^T (x̄_{*i} − x̄_{**}) (x̄_{*i} − x̄_{**})^T,    (16)

S_W = Σ_{l=1}^N Σ_{i=1}^T (x_{li} − x̄_{*i}) (x_{li} − x̄_{*i})^T.    (17)

Scatter matrices also satisfy an additivity rule: S_T = S_B + S_W [17]. The goal of LDA is to maximize the separation between classes, which is defined as the ratio of between- to within-class variance:

S = (v^T S_B v) / (v^T S_W v).    (18)

As before, the maximal separation S is obtained with the eigenvector of S_W^{−1} S_B corresponding to the largest eigenvalue. And as before, we can extract additional projections that achieve separation of classes. If we require that these projections be uncorrelated, then they are given by the additional eigenvectors of S_W^{−1} S_B. This concept was first proposed by [18] and generalizes the case of two classes introduced by [19]. We refer to this as classical LDA, and note that it maximizes the sum of variance ratios S computed separately for each projection (see A.4). An alternative approach is needed if instead the goal is to maximize the ratio of summed variances [20], or if one prefers orthogonal projections [21].
This optimality criterion and the resulting eigenvalue problem look strikingly similar to those of CorrCA in (5). Despite the similar naming, however, the scatter matrices differ from the between-subject and within-subject covariance matrices, and it is not immediately obvious how the two optimality criteria relate to one another. There is, however, a close relationship between the two (see Appendix B):

S_B = (1/N) (R_B + R_W),    (19)

S_W = ((N − 1)/N) R_W − (1/N) R_B + T S_M,    (20)

where S_M is the covariance of the sample means x̄_{l*} across subjects:

S_M = Σ_{l=1}^N (x̄_{l*} − x̄_{**}) (x̄_{l*} − x̄_{**})^T.    (21)

When these sample-mean vectors are equal across subjects, then S_M = 0 and the scatter matrices are linear combinations of the between- and within-subject covariance matrices. Note that equal sample means, x̄_{l*} = x̄_{k*}, do not imply equal class means, x̄_{*i} = x̄_{*j}. Zero-mean signals will naturally satisfy this condition and will typically still have different class means. Under this assumption, we can also write a simple functional relationship between class separation S and inter-subject correlation ρ:

S = (1 + (N − 1) ρ) / ((N − 1)(1 − ρ)).    (22)

This relationship will be useful in section 2.4, where we establish a link between ISC and the classic F-statistic (Eq. 26), which yields a parametric test for statistical significance of ISC. Note that S is monotonically increasing with ρ (for ρ ≤ 1), because the slope of this relationship is strictly positive:

dS/dρ = N / ((N − 1)(1 − ρ)²) > 0.    (23)

This means that maximizing class separation S in (18) also results in maximal ISC ρ, provided that we equate the means.
Thus, finding vectors v that maximize S of Eq. (18) gives the same set of solutions as maximizing ρ in Eq. (5).
To see this, note that the goal of maximizing a ratio of two matrices, such as in Eq. (5), can also be achieved by joint diagonalization of the two matrices (see Appendix A.1):

V^T R_B V = Λ_B,  V^T R_W V = Λ_W,    (24)

with Λ_B and Λ_W diagonal. Because S_B and S_W are linear combinations of R_B and R_W, the V that satisfies (24) also diagonalizes S_B and S_W. Thus, this V captures the eigenvectors of R_W^{−1} R_B as well as of S_W^{−1} S_B, and satisfies in both cases the constraint of mutually uncorrelated components. Because the eigenvalues of the two problems are monotonically related, the eigenvectors are sorted in the same order for both problems.
Thus, the solutions to both optimization problems are the same, including the constraint of uncorrelated components.
To summarize, CorrCA and classical LDA yield the same result, provided that the sample mean vectors are equal across subjects.
The equivalence between CorrCA and LDA is perhaps surprising. Note that LDA attempts to separate classes so that exemplars of one class do not overlap with the exemplars of other classes. In the illustrative example of Figure 1, time samples i take on the role of classes. There are 20 classes, each with two exemplars (connected with gray lines).
Classes are well separated in the direction of component y_1 (red arrow), whereas they overlap in the original dimensions x_1 and x_2. Note that the signals are zero-mean across samples but have differing class means. In this toy example, there is no variation between the two exemplars in each class (vertical lines in panel F indicate that the y_1 values are identical within a class). Thus, CorrCA has found a dimension of the data (y_1) in which the classes are perfectly separated. Zero within-class variance leads to infinite separation (Eq. 18), and this in turn to unit correlation (Eq. 22). More generally, we have demonstrated with Eq. (22) that increasing separation between samples is equivalent to increasing correlation across samples. To understand this, note that separation is high when multiple repeats have similar values, yet these values have to differ between samples. This means that multiple repeats have to vary together across samples. In other words, the variations across samples have to be highly correlated across repeats.
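The claimed relationship between separation and correlation, Eq. (22), is easy to verify numerically. The following sketch (our own check, with arbitrary zero-mean data and an arbitrary projection) computes ρ from the repetition covariances and S from the scatter matrices, treating samples as classes and repetitions as exemplars:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 3, 50        # N repetitions (exemplars per class), T samples (classes)
X = rng.standard_normal((N, T, 4))
X = X - X.mean(axis=1, keepdims=True)   # equalize sample means (zero mean)

v = rng.standard_normal(4)
Y = X @ v                               # projected data, shape (N, T)

# ISC, Eq. (2): between- over within-repetition covariance.
rw = np.sum(Y ** 2)                     # repetition means are zero
rt = np.sum(Y.sum(axis=0) ** 2)         # total covariance, Eqs. (12)-(13)
rho = (rt - rw) / ((N - 1) * rw)

# LDA separation, Eq. (18): classes are samples i, exemplars repetitions l.
class_means = Y.mean(axis=0)
sb = N * np.sum(class_means ** 2)       # grand mean is zero
sw = np.sum((Y - class_means) ** 2)
S = sb / sw

# Eq. (22): S = (1 + (N - 1) * rho) / ((N - 1) * (1 - rho))
assert np.isclose(S, (1 + (N - 1) * rho) / ((N - 1) * (1 - rho)))
```

The identity holds exactly (up to floating point) for any projection vector, not only the CorrCA solutions, as long as the per-repetition means are equalized.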
Notice that the equivalence between CorrCA and LDA requires the same number of exemplars in each class, which is not usually the case in conventional LDA. In fact, there is a one-to-one correspondence between exemplars across classes (gray lines in panels C and F of Figure 1).

| Maximum reliability and the F-statistic
In the present context, the scatter matrices S B and S W are interesting for another, perhaps more important reason.
Consider the case where l represents repeated measurements by different raters, or repeated measures on the same subjects, so that ρ now measures inter-rater correlation, or inter-repeat correlation (IRC). According to Eq. (16), S_B measures (except for a scaling factor) the sample covariance of x̄_{*i}, which is the mean across repeats. On the other hand, according to Eq. (17), S_W measures the sample covariance averaged over repeats. When projected onto y with v, they define the variance of the mean across repeats, σ²_ȳ, and the average variance around these means, σ̄²_y, respectively. The ratio of the two variances is a sensible definition of signal-to-noise ratio (SNR):

SNR = σ²_ȳ / σ̄²_y = (v^T S_B v) / (v^T S_W v).    (25)

We take this SNR also as a metric of repeat reliability. In this view, the results of the previous section show that maximizing correlation between repeats is equivalent to maximizing repeat reliability. In particular, Eq. (22) provides the relationship between SNR and ISC. An application of CorrCA to the problem of identifying ratings that are reliably reproduced between different raters is described in Sections 3.2 and 3.3. In Section 3.4 we describe results on the problem of identifying components of brain activity that are reliably reproduced across repeated trials. In all instances, the goal is to maximize ρ, which is equivalent to maximizing S.

Table 1: Relationship between the optimality criteria of different algorithms.
A component decomposition method that specifically maximizes SNR was previously described in de Cheveigné and Parra [23]. This method can be formulated as a joint diagonalization (JD) problem similar to CorrCA or LDA. When SNR is defined as in Eq. (25), then the JD approach diagonalizes matrices S T and S B in the present notation. By the same argument as in section 2.3, it is clear that this JD approach provides the same solution as LDA and CorrCA. The optimization criteria for the three techniques are summarized in Table 1.
We note that S is a ratio of variances, which permits a direct link to the F-statistic as used in the analysis of variance:

F = (v^T S_B v / (T − 1)) / (v^T S_W v / (T(N − 1))) = (T(N − 1) / (T − 1)) S.    (26)

For normally and independently distributed samples this quantity follows the F-distribution with d_1 = T − 1 and d_2 = T(N − 1) degrees of freedom [24, Chapter 6]. We can therefore use the F-distribution to assess statistical significance for ρ. This gives us a strict parametric test of significance for each CorrCA component (on unseen data, and provided the means are equalized, i.e., x̄_{l*} = x̄_{**}). We will leverage this observation when we perform tests for statistical significance of ρ evaluated on independently and identically distributed test data. For non-independent samples, which are more typical of the temporal signals of interest, we will rely instead on shuffle statistics, which can also be applied directly to training data (see Appendix D). Both methods will be evaluated for accuracy in section 4.
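For dependent samples, the shuffle statistics mentioned above can be sketched as follows. This is our own minimal illustration of a circular-shift surrogate test (the exact procedure of Appendix D may differ): each repetition is circularly shifted by a random offset, which preserves its autocorrelation while destroying the temporal alignment that carries the ISC.

```python
import numpy as np

def isc(Y):
    """ISC of projected data Y with shape (N, T), following Eq. (2)."""
    N, T = Y.shape
    Yc = Y - Y.mean(axis=1, keepdims=True)
    rw = np.sum(Yc ** 2)
    rb = np.sum(Yc.sum(axis=0) ** 2) - rw      # Eqs. (12)-(13)
    return rb / ((N - 1) * rw)

def circular_shift_pvalue(Y, n_shuffles=200, seed=0):
    """p-value of the observed ISC under circularly shifted surrogates."""
    rng = np.random.default_rng(seed)
    N, T = Y.shape
    observed = isc(Y)
    null = np.empty(n_shuffles)
    for s in range(n_shuffles):
        shifts = rng.integers(0, T, size=N)
        null[s] = isc(np.stack([np.roll(Y[l], shifts[l])
                                for l in range(N)]))
    # Conservative p-value that includes the observed value itself.
    return (1 + np.sum(null >= observed)) / (1 + n_shuffles)
```

A component carrying a waveform shared across repetitions yields a small p-value, bounded below by 1/(n_shuffles + 1); in practice one would apply this to each extracted component in turn.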

| Forward model
An important aspect of multivariate models is parameter interpretation. CorrCA is an example of a "backward model" in the sense that the observed data are transformed into components that capture the sources of covariation. It shares this property with a host of other methods, such as PCA and ICA. An alternative strategy is "forward modeling", such as Factor Analysis or General Linear Models, which transform known variables to explain the observed data [25]. Such models often capture the physical processes underlying data generation. For instance, in the case of electromagnetic brain signals, a forward model refers to the "image" that a current source in the brain generates at the scalp sensors.
For CorrCA, the backward model is given by the matrix V, which maps observations x to components y = V^T x.
A corresponding forward model can be defined as the projection A that best recovers the measurements x from the components y [26,27]. The least-squares estimate of this forward-model projection is

Â = R_W V (V^T R_W V)^{−1}.    (27)

Note that the columns â_k of this matrix capture how correlated a putative source y_k is with the observed sensor recordings x. This approach of recovering the forward model from the backward model is explained in more detail in [25], along with a discussion of how forward and backward model parameters are to be interpreted. In the illustrative example of Figure 1, the forward model is shown in panel C as red and blue arrows for the first and second component, respectively. Evidently, the directions recovered by CorrCA match the underlying source orientations described in that example.
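In code, the estimate of Eq. (27) is a one-liner. The following sketch (our own, assuming the pooled within-subject covariance R_W plays the role of the data covariance) also illustrates a useful special case: when V is square and invertible, the forward model reduces to the inverse of the backward model.

```python
import numpy as np

def forward_model(Rw, V):
    """Least-squares forward model for the backward model y = V^T x.

    Rw : (D, D) data covariance (here: pooled within-subject covariance).
    V  : (D, K) matrix of projection vectors (columns).
    Returns A of shape (D, K); column a_k is the sensor pattern of y_k.
    """
    return Rw @ V @ np.linalg.inv(V.T @ Rw @ V)
```

For square invertible V, A = Rw V (V^T Rw V)^{-1} simplifies to (V^T)^{-1}, independent of Rw; and for orthonormal V with identity covariance, pattern and filter coincide.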
The utility of the forward model will be demonstrated for the case of brain activity, where we are interested in how the activity of a recovered neural source manifests at the sensor level ( Figure 2 and Figure 5). These forward models can be used directly to perform source localization to identify the spatial origin of the corresponding current sources [25]. In the case of questionnaire ratings (section 3.3) we will interpret the meaning of a component by inspecting the forward model, as it measures the correlation of a component with answers of raters on different topics. For the case of ratings, we may also be interested in which items lead to a reliable aggregate score (sections 3.2). In these cases, an analysis of backward model weights may provide valuable insight on what data features are most important to construct a reliable aggregate measure.
Note that if the original data are uncorrelated, then R_W is diagonal and the projections V are orthogonal. In that case the forward and backward models are identical, except for an overall scale for each component, and we can directly inspect the backward model, as is customarily done in PCA, which has orthogonal projections whose component "loadings" are inspected. In the example of Figure 4 D, the covariance R_W is approximately diagonal, and one can inspect the results with either the forward or the backward model.

| Relationship to multi-set Canonical Correlation Analysis
Thus far we have assumed that all repetitions receive a common projection vector. This is sensible in the case of multiple trials, if we expect the sources of interest to behave similarly across repetitions. However, in the case of diverse subjects or raters, it may be more appropriate to weight each repetition individually. For instance, brain activity recorded at a given location may differ between subjects due to anatomical or functional differences. This brings us to the case treated by Canonical Correlation Analysis (CCA) for M = 2 data sets, or more generally, multi-set CCA (MCCA) for M > 2. There are multiple implementations of MCCA [7,28]. We will derive one of these instantiations as a maximization of ISC as defined in Eq. (2). Assign to every subject l a unique projection vector v_l:

y_{li} = v_l^T x_{li}.    (28)

By inserting this into definition (2), it follows readily that

ρ = (Σ_{l=1}^N Σ_{k=1,k≠l}^N v_l^T R_{lk} v_k) / ((N − 1) Σ_{l=1}^N v_l^T R_{ll} v_l),    (29)

where R_{lk} is the cross-covariance matrix between x_{li} and x_{ki}:

R_{lk} = Σ_{i=1}^T (x_{li} − x̄_{l*}) (x_{ki} − x̄_{k*})^T.    (30)

The projection vectors v_l that maximize ρ can be found as the solutions of ∂ρ/∂v_l = 0, which yields for each l the following equation:

Σ_{k=1,k≠l}^N R_{lk} v_k = (N − 1) ρ R_{ll} v_l.    (31)

This set of equations for all v_l can be written as a single equation via concatenation into a single N D-dimensional column vector,

v = [v_1^T, v_2^T, …, v_N^T]^T,    (32)

yielding

R v = λ D v,    (33)

where λ = (N − 1) ρ + 1, R is the N D × N D matrix combining all blocks R_{lk}, and the covariance matrices R_{ll} are arranged in a block-diagonal matrix D:

R = (R_{lk})_{l,k=1,…,N},  D = diag(R_{11}, …, R_{NN}).    (34)

If D is invertible, the stationary points of the ISC are the eigenvectors of D^{−1} R, or more generally, the generalized eigenvectors of Eq. (33). We can arrange all such eigenvectors as the columns of a single matrix V. The eigenvector with the largest eigenvalue λ maximizes the ISC, because ρ increases linearly with λ. The subsequent eigenvectors maximize ISC subject to the constraint that the component signals are uncorrelated with the previous components (see Appendix A.4).
In other words, the solution V diagonalizes D:

V^T D V = Λ_D,    (35)

where Λ_D is diagonal. The formulation of MCCA as an eigenvalue problem can be derived in multiple ways [29]. In Kettenring [7] it is presented as the maximization of the eigenvalue of the correlation matrix between subjects (MAXVAR criterion), but it can also be derived by maximizing the sum of correlations between subjects (SUMCORR criterion) [30]. When the number of free parameters of MCCA (one projection vector per subject) is large relative to the data size (N > T/D), CorrCA may be preferable due to the reduced number of free parameters. This will be demonstrated in section 3.1.
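The block-structured eigenvalue problem of Eq. (33) translates directly into numpy. Again a sketch under our own conventions (data layout (N, T, D); the ISC of a component is recovered as ρ = (λ − 1)/(N − 1)):

```python
import numpy as np

def mcca(X):
    """Multi-set CCA by maximizing ISC with subject-specific projections.

    X : array of shape (N, T, D). Returns the eigenvalues lam of Eq. (33),
    sorted in decreasing order, and the projection vectors W, reshaped so
    that W[l, :, j] is subject l's vector for component j.
    """
    N, T, D = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    R = np.zeros((N * D, N * D))     # all cross-covariance blocks R_lk
    Dm = np.zeros((N * D, N * D))    # block diagonal of the R_ll
    for l in range(N):
        for k in range(N):
            Rlk = Xc[l].T @ Xc[k]
            R[l*D:(l+1)*D, k*D:(k+1)*D] = Rlk
            if l == k:
                Dm[l*D:(l+1)*D, k*D:(k+1)*D] = Rlk
    lam, W = np.linalg.eig(np.linalg.solve(Dm, R))
    order = np.argsort(-lam.real)
    return lam.real[order], W.real[:, order].reshape(N, D, N * D)
```

On data where every subject carries the same waveform under a subject-specific mixing, the top component attains ρ near 1 even though no single common projection exists, which is precisely the case CorrCA cannot handle.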
When there are only two data sets (M = 2) then the present MCCA algorithm reduces to conventional CCA [31, see Table 4.1]. In the illustrative example of Figure 1, CorrCA by definition gives the same result for both subjects (red and blue arrows), but we find that the directions derived with CCA differ between subjects (not shown). In section 3.3, we will apply CCA to ratings data from two distinct raters (pairs of parent and child), in which case distinct projections may be warranted.

| APPLICATIONS
In the following sections we demonstrate existing and new applications of CorrCA, while also examining issues related to statistical significance testing and regularization of the model. We also compare CorrCA with MCCA/CCA and kernel-CorrCA.
Briefly, subjects in these experiments watched movies or listened to narrated stories while their neural responses were recorded with multiple sensors using electroencephalography (EEG), i.e., electrical potentials measured at multiple locations on the scalp surface. CorrCA was then used to extract components that were most reliable across subjects. These experiments showed that brain signals are significantly correlated between subjects on a rapid time scale of less than a second (signals were high-pass filtered at 0.5 Hz). While the ISC values are small (ρ ≈ 0.05), they are statistically significant and very reliable across sessions [1], even in realistic environments [34]. Interestingly, the level of correlation between subjects as measured by ISC is indicative of how attentive subjects are towards the stimulus [33] and, therefore, predictive of how well individuals remember the content of the videos [14, see Appendix G.3 for a definition of ISC for individual subjects]. ISC of the EEG also predicts the retention of TV and online audiences [32,35].
Here we reanalyzed the data of [14] to compare CorrCA with MCCA (see section 2.6) on their ability to learn reliable projections of the data, while also using the opportunity to demonstrate forward-model computation (section 2.5) and statistical testing (Appendix D). The data set consisted of the EEG from N = 18 subjects recorded while they watched videos (see [14]).

F I G U R E 2 Here MCCA was regularized with TSVD (K = 12; see Appendix C), while CCA was not regularized. Darker shades indicate that the learned components were found to be statistically significant using circular shuffle statistics at α < 0.05 (see Appendix D): 3 significant components were found with CorrCA, while 1 was identified with MCCA. Light shades indicate non-significant ISC. (B) ISC measured on independent test data. Test-set performance was computed by training on 4/5 of the time samples and testing on the 1/5 of left-out samples. Here CorrCA produces higher ISCs than MCCA across 4 of the top 5 components (only the first is significant). (C) The spatial distributions of the first three correlated components, displayed here as "forward models" (see section 2.5). The three components exhibit smooth topographies with unique spatial profiles, suggesting that each component recovers genuine and distinct neural activity. (D) The forward models of the subject-specific components learned by MCCA. Each column displays the forward models of the components from an individual subject (i.e., subjects 1, 5, 9, 13, and 17). The components are markedly less smooth, consistent with over-fitting, as MCCA learns separate parameters for each of the 18 subjects. For details on this data set, please refer to Cohen and Parra [14].
On the training data, the ISC was higher for MCCA than for CorrCA (Figure 2 A). However, when evaluating ISC on unseen test data, the ISC drops substantially for MCCA as compared to the training data, and CorrCA yielded higher ISC values than MCCA for 4 of the first 5 dimensions (Figure 2 B). Both of these findings are indicative of over-fitting with MCCA, despite the use of regularization.
Both CorrCA and MCCA learned spatial filters that were applied to the multi-electrode EEG data to maximize the ISC. The activity captured by each spatial filter (component) is best depicted by "forward models", which are illustrated for the first three correlated components in Figure 2 C and D.

| Maximizing inter-rater reliability
The theory laid out in section 2.4 shows that one can use CorrCA to identify components with maximal inter-rater reliability. In the following example, clinicians assessed the motor skill of children by measuring performance on various standard tasks (Zurich Neuromotor Assessment). The rating obtained for each task is the time taken by the child to perform it. This time is normalized relative to the values of a standard group of children of the same age [37].
Although this is an objective measure, there is nonetheless variability due to variable performance of the child when repeating the task, and also subjective variability in the observer operating the stop-watch. In this specific data set, we have ratings from T = 30 children on D = 12 tasks from N = 2 raters. The 12 measures are mostly positively correlated across the children (Figure 3 A, see panel G for descriptive task labels). Reliability between raters is measured here as the IRC, which differs across tasks (Figure 3 B).
For many assessment instruments, individual ratings are summed to obtain a total score. It is also natural to compute sub-scores by combining ratings on related items. In these sums, ratings effectively receive equal weights, which is a sensible but nonetheless arbitrary choice. CorrCA can instead provide weightings that give the most reliable response across raters, as reproducibility of results is a desirable goal. In this specific example, the goal is to find the weightings of the 12 motor tasks that provide reliable aggregate (sub-) scores, and which capture independent aspects of these ratings. This has the potential of simplifying the assessment of motor performance while increasing its reliability. An important property of these components is that, by design, they are uncorrelated with one another (uncorrelated across the children).
In this data set, children were also evaluated a second time by the same rater in order to assess test-retest reliability (Figure 3 C). Comparing panels B and C, it is evident that the test-retest reliability is somewhat higher than the inter-rater reliability (i.e., the evaluation does indeed depend to some degree on the subjective judgment of the rater). The components extracted with the highest IRC also exhibit the strongest test-retest correlations (at least for the first few components). This confirms that these aggregate measures generalize to data that was not used in the CorrCA optimization, and that minimizing subjective variability of the observer also minimizes test-retest variability.
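The inter-rater correlation of a given weighting can be computed directly as the ratio of between-rater to within-rater covariance of the aggregate score. A small sketch (the helper and its names are ours; CorrCA then amounts to choosing the weights w that maximize this quantity):

```python
import numpy as np

def irc(w, X):
    """Inter-rater correlation of the aggregate score y = w @ x.
    X has shape (N_raters, D_items, T_subjects)."""
    Y = np.einsum('d,ndt->nt', w, X)           # aggregate score per rater
    Y = Y - Y.mean(axis=1, keepdims=True)      # demean each rater's scores
    N, T = Y.shape
    within = np.sum(Y * Y) / T                 # sum of per-rater variances
    total = np.sum(Y.sum(axis=0) ** 2) / T     # variance of the rater-sum
    return (total - within) / ((N - 1) * within)
```

With equal weights, w = np.ones(D) / D, this reproduces the reliability of the conventional total score, which can then be compared against the CorrCA-optimized weighting.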

| Identifying agreement between child and parent
We also provide an additional example of CorrCA on subjective ratings data related to the style of parenting. In this example, the goal is to identify questions on which parent and child agree. Specifically, we analyzed data collected with the Alabama Parenting Questionnaire [APQ, 38] as part of the Healthy Brain Network initiative [16]. In the APQ, parents and children answered D = 42 questions pertaining to their relationship by providing a numerical value from 1 to 5 for each question. From this, the APQ averages various questions to provide aggregate scores related to different aspects: "involvement score", "positive parenting score", "poor supervision score", "inconsistent discipline score", and "corporal punishment score". At the time of this analysis, complete data were available for T = 616 child/parent pairs.
Given that parents and children may genuinely differ in their judgments, it would make sense to allow for different weightings of the questions. Therefore, we also applied MCCA to these data as described in section 2.6.

F I G U R E 4 Ratings from the Alabama Parenting Questionnaire for T = 616 families and D = 42 questions, with N = 2 ratings from a child and one of its parents. Here, the goal of CorrCA is to maximize IRC, i.e., to identify agreement between child and parent. Data were collected by the Healthy Brain Network initiative [16]. Data were divided evenly and at random into a training and a test set. Second and third columns show results for training and test sets, respectively. Red error bars indicate the standard deviation across 100 random partitions of the data into training and test sets, and sampling with replacement within that (bootstrap estimates). Dark-shaded bars indicate significant IRC at α < 0.05 (F-statistic, Bonferroni corrected, with T = 308 for panels A and C; and circular shuffle statistic with T = 616 for panels B and E).
We have also tested the non-linear kernel-CorrCA method (see Appendix F) on this ratings dataset (Figure 4 E). There was a small gain in terms of inter-rater correlation in the test data (Figure 4 F). However, the projections of the strongest components were not substantially different from the linear projections, suggesting that there were no strong non-linear relationships in the data. We found a similar result with kernel-CorrCA (a small gain in inter-trial correlation, but no substantially different projections) for the data presented in section 3.4 (results not shown).

| Identifying stimulus-related components in EEG with high SNR
PCA is often used to extract a low-dimensional representation of high-dimensional data. To this end, PCA selects orthogonal projections that capture most of the variance of the projected samples. When multiple repetitions of the data are available, CorrCA can be used for the same purpose of reducing dimensionality of the data, but with the objective of capturing dimensions that are reliably reproduced across repetitions. Here we apply this to a conventional scenario in neuroimaging: neural activity is collected over time with multiple sensors, while the same experimental stimulus is repeated in multiple trials. We propose to use CorrCA to identify spatio-temporal patterns of activity that are well preserved across trials, similar to what was proposed in [3,23,39,40]. Ideally, the corresponding time-courses of multiple components are uncorrelated in time so that they capture different activity. CorrCA guarantees this by design as it diagonalizes R W . CorrCA is therefore similar to other source separation methods such as Independent Components Analysis [41], Denoising Source Separation [42], or Joint-Decorrelation [23]. In fact, CorrCA can be viewed as a form of blind-source separation, as no labels are required for the analysis.
Here we analyze EEG data that were collected during an Eriksen Flanker Task [26]. Subjects were presented with a collection of arrows pointing left or right. They were asked to decide as quickly as possible between two alternative choices (the direction of the center arrow) by pushing corresponding buttons with either their left or right hand. We analyze data from a single participant who responded correctly in 550 trials, and made an error in 46 trials (Figure 5).
This error, which participants notice as soon as they make their incorrect response, leads to a negative EEG potential 0-100 ms after the button press (this period is indicated between the black and red vertical lines in Figure 5). In our earlier work we specifically looked for a projection that distinguishes the EEG evoked response between error and correct responses [26]. Here the goal is instead to identify reliable error responses. When we apply CorrCA to these data, we obtain a series of components with high inter-trial correlation (Figure 5). Note that CorrCA has reduced the D = 64 dimensional data to just a few components that capture the reliable portion of the data (high SNR; panel F). By compactly representing neural activity in a small number of components, it becomes easier to probe for changes in activity due to manipulation of experimental variables, whose effects on the original D = 64 dimensional data may be obscured due to the lower SNR and the multiple-comparison problem.
Like most data decomposition algorithms, CorrCA does not require that the underlying data have any particular time structure; samples are treated as exchangeable. For time series data, this implies that the correlated signal of interest must be observed with the same delay in each channel and repetition. If this assumption is violated, performance will drop.

| Effect of regularization on the reliability of EEG responses
A common numerical difficulty with generalized eigenvalue problems is the required matrix inverse. In the case of CorrCA, it is necessary to invert R W in equation (8). When some of the eigenvalues are small (or even zero), then random fluctuations due to noise will dominate the inverse. It is thus important to regularize the estimate of R W . In the past, we have used the Truncated Singular Value Decomposition (TSVD) [45] and shrinkage [46] for this purpose [1,32,33,14].
These methods are described in detail in Appendix C.
To evaluate the effects of these regularization methods on the results of CorrCA we use data from our study on neural responses to advertisements [32, we select one commercial among several aired during the 2012 Super Bowl].
For both the TSVD and shrinkage approaches, we spanned the space of regularization strength (a single parameter, K and γ, in each case) and measured ISC. The training set was the EEG data from 12 subjects during the first viewing of the ad, while the EEG data collected during the second viewing served as the test set.
The sum of the first 3 ISC values for both regularization methods followed a similar trajectory with increasing regularization strength (decreasing K and increasing γ; Figure 6 B and C).

| VALIDATION ON SIMULATED DATA
We evaluated the conditions under which CorrCA identifies the true source directions. We also determined the performance of CorrCA at identifying components with high ISC, and compared this against a simple benchmark based on PCA of the subject-average data. Additionally, we compared the accuracy of the statistical tests described in Appendix D at estimating the correct number of components. To this end, we generated artificial data with known properties. The data are generated assuming a known number of correlated components (signals shared across subjects) and additive spatially correlated noise not shared across subjects. The following factors were varied parametrically: the number of subjects (raters, repeats), N , the number of samples per subject, T , the number of dimensions, D , the number of correlated components, K , and the signal-to-noise ratio (SNR) as defined in the space of multivariate measurements.
Each simulated correlated component was identically reproduced in each subject, so that the ISC of all components was equal to one. We further varied the temporal dependency structure of the signal and noise components (IID or pink noise), the distribution of the signal and noise components (Gaussian or χ²-distributed), and the heterogeneity of the signal and noise subspaces (either the same or different for all subjects). Signal components were generated to be correlated across subjects, but uncorrelated with each other. Noise components were uncorrelated with each other as well as across subjects. The number of noise components was fixed to D. The following default parameters were used unless otherwise noted: N = 5, T = 200, D = 30, K = 10, SNR = 0 dB, no regularization, IID Gaussian signal and noise components, ISC = 1 for all K components.

PARRA, HAUFE, DMOCHOWSKI

F I G U R E 6 Regularization increases reliability of EEG test data. The inter-subject correlations of EEG responses from 12 subjects viewing a popular television advertisement [32]. We took the data of the highest-scoring advertisement and maximized ISC while varying levels of TSVD and shrinkage regularization. (A) The spatial distributions of the three most correlated components. Note that the distributions of the second and third components were distinct for unregularized, TSVD-regularized and shrinkage-regularized CorrCA. (B) Performance with TSVD as a function of K (decreasing K corresponds to stronger regularization), and (C) performance with shrinkage as a function of γ (increasing γ corresponds to stronger regularization). Components were extracted on the data from a first viewing of the advertisement (training) and evaluated on data from a second viewing (test). Reliability peaks at γ = 0.4 for shrinkage, and when truncating R_W to K = 20 dimensions, with very similar peak values of ≈ 0.16 (sum of first three inter-subject correlations) for the two regularization approaches.
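The simulated-data setup described above can be sketched as follows. This is a toy generator under our own simplifying assumptions (e.g., a crude Frobenius-norm SNR normalization), not the exact procedure of Appendix E:

```python
import numpy as np

def simulate(N=5, T=200, D=30, K=10, snr_db=0.0, seed=0):
    """Toy data: K components shared identically across N repetitions
    (ISC = 1) plus repetition-specific spatially correlated noise."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((K, T))          # shared signal components
    A_s = rng.standard_normal((D, K))        # common signal mixing matrix
    gain = 10.0 ** (snr_db / 20.0)
    X = np.empty((N, D, T))
    for l in range(N):
        A_n = rng.standard_normal((D, D))    # per-repetition noise mixing
        noise = A_n @ rng.standard_normal((D, T))
        sig = A_s @ S
        X[l] = gain * sig / np.linalg.norm(sig) + noise / np.linalg.norm(noise)
    return X

X = simulate()
```

Varying N, T, D, K, and snr_db then reproduces the parameter sweeps described in this section.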
Details on data generation and evaluation methods can be found in Appendix E.
The accuracy of the recovered components depends on the amount of noise that is collinear to each original correlated dimension. In the IID case, all estimates of K converge to the true number of 10 for high SNR (panel C). Here, the parametric test and the surrogate-data approach using random circular shifts perform similarly, while the surrogate-data approach using phase scrambling performs considerably worse at low SNR, presumably due to the absence of a well-defined notion of frequency and phase in the sense of the Fourier transform. In the case of dependent (pink noise) samples, both surrogate-data approaches perform identically, converging to the true value of K for high SNR (panel D). In contrast, the parametric test, which incorrectly assumes IID data, overestimates the number of correlated components across the entire SNR range, as expected.

| Accuracy as a function of Signal-to-Noise Ratio
Note that the significance test is applied here to multiple components (D = 30). Under appropriate conditions, all three methods converge to the correct number of dimensions in these simulated data (K = 10). This indicates that all three methods (described in Appendix D) properly account for these multiple comparisons.

| Accuracy of estimating dimensionality of correlated components
The performance of CorrCA as a function of the true number of correlated components, K , is depicted in Figure 8.
Here, T = 200, D = 30, N = 5, and SNR = 0 dB, while K was varied from 3 to 28. As the overlap between signal and noise subspaces increases with K, the average ISC of the reconstructed correlated components drops (panel A). A similar decline is seen in the reconstruction of simulated correlated components and their forward models (data not shown).
As a result, the percentage of correlated components that can be recovered (i.e., reach statistical significance) also decreases with increasing K (panels B & C).
We also compared the ISC values obtained with CorrCA against a simple benchmark, namely, PCA of the subject-averaged data. While this simple method captures the correct directions in the high-SNR case, CorrCA significantly outperforms it at medium and low SNR values (Figure 7 A). The same is true for simulations with a varying number of underlying dimensions (Figure 8 A), where CorrCA outperforms this simple method in all instances tested. Consequently, the number of correlated components is less well estimated the higher K is. For IID data, the estimated number is bounded from above by the true number for all statistical approaches, whereas for pink noise data the parametric F-test may overestimate K (Figure 8 B & C).
Figure 9 shows the estimated number of correlated components as a function of the number of data points, T, and the number of repetitions (subjects/raters), N, with fixed D = 30, K = 10, SNR = 0 dB. In the case of IID data, all estimates converge to the true value of 10 for increasing T or N. This is also true for pink noise data in combination with non-parametric statistical tests. For the parametric test, however, the number of seemingly correlated components diverges as T increases, and is overestimated regardless of N.
For Gaussian distributed components, we considered an additional case where the observations x_l(t) were dichotomized (thus, Bernoulli distributed) by taking their sign. This simulation scenario resembles the typical setting in which two raters provide a binary assessment of the same issue. While non-Gaussianity of the underlying components does not substantially affect the extraction of correlated components, extraction is considerably more difficult in the presence of dichotomous observations. An additional analysis was conducted for IID Gaussian data and N = 5 to assess the performance of CorrCA when signal and noise covariance structures (as defined by the mixing matrices A_s^l and A_n^l, l = 1, ..., N) differ across subjects (Figure 10).
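The PCA-of-the-subject-average benchmark used in these comparisons can be written in a few lines (our sketch):

```python
import numpy as np

def pca_of_average(X, k=3):
    """Benchmark: top-k principal spatial directions of the data
    averaged over repetitions (X: repetitions x dims x samples)."""
    M = X.mean(axis=0)                       # repetition average
    M = M - M.mean(axis=1, keepdims=True)    # demean over samples
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k]                          # top-k spatial directions
```

Averaging suppresses repetition-specific noise by roughly a factor of √N, so at high SNR the principal directions of the average align with the shared-signal subspace; CorrCA additionally exploits the within-repetition covariance, which is consistent with its advantage at low SNR.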

| CONCLUSION
The goal of this work was to provide a formal, yet didactic and comprehensive treatment of this new component analysis technique. The analytic development resulted in a direct link to MCCA and a surprising equivalence to LDA, thus linking the new method with classic concepts of multivariate analysis. We also found an efficient scheme for computing ISC/IRC which, surprisingly, does not require the computation of pairwise correlations. We also identified the F-statistic as an exact parametric test of statistical significance, which is valid in the case of normally distributed, independent samples. For the case of dependent samples we demonstrated and validated circular-shift shuffle statistics as an efficient non-parametric test of significance. We also presented an extension to non-linear mappings using kernels. Finally, perhaps the most important contribution of this work is the formalization of inter-subject correlation. Conventional metrics, such as pairwise Pearson's correlation, are limited to two signals. Inter-subject correlation as defined here is applicable to more than two signals. ISC differs from conventional definitions of correlation (see the appendix on inter-class correlation), yet seamlessly integrates with the well-established mathematics of linear discriminants, canonical correlation, and the F-statistic.
We make all code and data available at http://parralab.org/corrca, and for long-term storage here [48]. This code is significantly faster than previous instantiations and has been extensively evaluated. All figures presented in this paper can be reproduced with this new code base and associated data. We hope that this analysis and code inspires new uses for correlated component analysis.

| ACKNOWLEDGMENTS
We are grateful to Valentin Rousson, who pointed us to work on inter-class correlation, and to Oskar Jenni of the Children's University Hospital in Zurich, Switzerland, who collected the data published in [37], which we reanalyzed here in Figure 3.
Similarly, we would like to thank Michael Milham for pointing us to the problem of identifying parent/child agreement and Lindsay Alexander for providing access to the Healthy Brain Network data [16] for Figure 4. We also thank Samantha Cohen for providing the data from [14] for Figure 2. We thank Samantha Cohen, Jens Madsen, Michael X Cohen, and two anonymous reviewers for useful comments on an earlier version of this manuscript. We want to thank Flaviano Morone for assisting with the proof in Appendix A.4.

The performance of CorrCA in Figure 10 (panels C & D) is a restatement of the dependence on SNR seen in Figure 7. The result illustrates that identical directions of the signal components across subjects are a fundamental assumption of CorrCA. If this assumption is suspected to be violated, the use of MCCA (section 2.6) may be beneficial, or approaches that can gradually move from CorrCA to MCCA (Kamronn et al., 2015).

F I G U R E 1 0 CorrCA performance as a function of signal and noise distributions. Other parameters were fixed to T = 200, D = 30, K = 10, SNR = 0 dB. (A, B) Performance for Gaussian and squared Gaussian (χ²-distributed) signal and noise components, as well as for dichotomized (Bernoulli-distributed) observations. Data points were either drawn IID or sampled from a pink noise process. While non-Gaussianity of the underlying components does not substantially affect the extraction of correlated components, the same is considerably more difficult in the presence of dichotomous observations. (C, D) Performance on IID Gaussian data when signal and noise covariance structure differs across repetitions (e.g., subjects, raters). Significant drops in the number of estimated correlated components as well as the average ISC are observed if the signal covariance, the noise covariance, or both, differ across subjects.

A.1 | Joint diagonalization
Here we review the well-known relationship between the joint diagonalization of two symmetric matrices A and B and the eigenvectors of B^(-1) A. Consider eigenvectors arranged as columns in V with the corresponding eigenvalues in the diagonal matrix Λ:

A V = B V Λ. (36)

To find the solution V of this equation, we can replace the positive definite matrix B with its Cholesky factorization,

B = L L^T. (37)

Inserting this into (36) and left-multiplying with L^(-1) yields

M U = U Λ, (39)

with M = L^(-1) A L^(-T) and U = L^T V. Equation (39) is a conventional eigenvalue equation with the same eigenvalues as (36).
Since M is symmetric, its eigenvectors U can be chosen orthonormal, U^T U = I. It is now easy to see that the corresponding eigenvectors V = L^(-T) U diagonalize A:

V^T A V = U^T M U = Λ.

They also diagonalize B, which we see by left-multiplying (36) with V^T and right-multiplying with Λ^(-1). This yields:

V^T B V = V^T A V Λ^(-1) = I.

Thus, the eigenvectors that solve (36) jointly diagonalize A and B. The reverse is also true. A matrix V that satisfies

V^T A V = Λ_A, (42)
V^T B V = Λ_B, (43)

where Λ_A and Λ_B are diagonal matrices, also satisfies (36). To see this, solve (43) for V^T and substitute into (42), which yields

Λ_B V^(-1) B^(-1) A V = Λ_A,

and therefore

B^(-1) A V = V Λ_B^(-1) Λ_A,

which is the same as the eigenvalue equation (36) with Λ = Λ_B^(-1) Λ_A.
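The joint-diagonalization property is easy to verify numerically (a sketch; `scipy.linalg.eigh(A, B)` solves the generalized problem A v = λ B v and returns B-orthonormal eigenvectors):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
D = 4
A = rng.standard_normal((D, D))
A = A + A.T                                   # symmetric
B = rng.standard_normal((D, D))
B = B @ B.T + D * np.eye(D)                   # positive definite
lam, V = eigh(A, B)                           # solves A V = B V diag(lam)
assert np.allclose(V.T @ B @ V, np.eye(D), atol=1e-8)   # diagonalizes B
assert np.allclose(V.T @ A @ V, np.diag(lam), atol=1e-8)  # diagonalizes A
```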

A.3 | Identifiability
An additional important observation is that the eigenvectors of (36) are only uniquely defined for unique eigenvalues.
If an eigenvalue is degenerate, say with multiplicity K, then the corresponding eigenvectors span a K-dimensional subspace, and any vector within that subspace is a solution to the eigenvalue equation with that eigenvalue. Any rotation within this subspace will yield a valid solution. Thus, when estimating components, it is possible that solutions with nearby ISC are "mixed". Such components are not uniquely identifiable. One should also note that the sign of any eigenvector is arbitrary: if v_d is a solution, then so is -v_d. One can always arbitrarily swap the sign of v_d for any component d, which also reverses the sign of y_d^l and a_d, the d-th column of the forward model A. Thus, keep in mind that in all figures the signs of individual v_d and a_d are arbitrary. The same problem arises with most signal decomposition methods such as PCA, ICA or CCA. A common approach used in practice to "fix" the sign is to select the sensor/dimension with the strongest forward-model coefficient and set that to a positive value (if indeed it is significantly larger than other coefficients of opposite sign).
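The sign convention mentioned above (make the strongest forward-model coefficient of each component positive) can be implemented in a few lines (our helper, not part of the released code):

```python
import numpy as np

def fix_sign(V, A):
    """Flip each component's sign so that its largest-magnitude
    forward-model coefficient is positive. V, A are D x K."""
    idx = np.abs(A).argmax(axis=0)              # strongest sensor per component
    s = np.sign(A[idx, np.arange(A.shape[1])])
    s[s == 0] = 1.0                             # leave all-zero columns alone
    return V * s, A * s
```

Flipping a column of V together with the same column of A leaves the model itself unchanged, since the component time course y_d and its forward model a_d flip together.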

A.4 | Optimization of sum ISC
The optimization criterion (9) has the form

J(V) = Σ_d (v_d^T A v_d) / (v_d^T B v_d),

and is subject to the constraints that components are uncorrelated (relative to B): v_d^T B v_c = 0, for d ≠ c.
It will be convenient to convert this optimality criterion into conventional Rayleigh quotients. To do so, use the Cholesky decomposition (37) as before and substitute v_d = L^(-T) w_d and A = L M L^T to yield

J(W) = Σ_d (w_d^T M w_d) / (w_d^T w_d),

with the constraint now being a conventional orthogonality constraint:

w_d^T w_c = 0, for d ≠ c. (51)

One can now optimize for the matrix W with the w_d as its column vectors. The solution to this constrained optimization problem is well established [49, Corollary 4.3.39]. It is given by W = U, the eigenvector matrix of equation (39). Inverting the substitution gives the solution to the original problem, which is given by the generalized eigenvalue equation (36).
To see that the eigenvector matrix of M maximizes J(W), first note that, apart from the constraint, the terms in the sum can be optimized independently, and that they all have the same form. One can therefore consider the maximum of a single term, written here simply as

ρ(w) = (w^T M w) / (w^T w), (52)

which is to be optimized subject to the orthogonality constraint (51). From here on the argument proceeds conventionally. Let u_i, i = 1 ... D, be the orthonormal eigenvectors of matrix M with eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_D. We can express the solution vectors in this orthonormal basis as

w = Σ_i α_i u_i, (53)

where α_i is the projection of w on that basis of eigenvectors, α_i = u_i^T w. Inserting (53) into (52) gives

ρ(w) = (Σ_i λ_i α_i²) / (Σ_i α_i²) ≤ λ_1. (54)

This upper bound is attained for α_1 > 0 and all other α_i equal to zero, and thus w_1 = u_1. For the second vector we require that it is orthogonal to the first, and thus we have to constrain α_1 = 0. The sum in (54) now excludes the first dimension and therefore has an upper bound of λ_2, which is attained for w_2 = u_2. The argument continues iteratively until the last component, w_D = u_D, which is orthogonal to all other projections. It also happens to be the minimum of ρ.

A.5 | Uncorrelated components vs orthogonal projections
We have required here that the extracted components are uncorrelated, i.e., the projection vectors diagonalize R_W: V^T R_W V is diagonal. This means that the component dimensions of y_i^l are uncorrelated, but only on average over all subjects/repeats l. Note that for an individual subject, the y_i^l may not be uncorrelated. The same is true for MCCA as derived in section 2.6. Instead of uncorrelated components, we could have required that the projection vectors are orthogonal, V^T V = I, i.e., that the linear transformation is limited to rotations. To find the rotations that maximize ISC, one would have to resort to constrained optimization algorithms [e.g. 21]. A related issue arises in the context of Canonical Correlation Analysis, which attempts to find a linear mapping between two data sets such that the components are uncorrelated [50,51]. Alternatively, the Procrustes problem aims to find a mapping between two data sets with orthogonal projections [52]. [The latter is what is used for "hyper-alignment" between subjects in fMRI by 53]. CCA and Procrustes are related, and in fact one can smoothly titrate between orthogonal and uncorrelated solutions [54]. In our case we favor uncorrelated components, because we want the components to capture independent information. At a minimum, this means the components have to be uncorrelated. In that sense, our method can be seen as a source separation method, similar to Independent Component Analysis [41], Denoising Source Separation [42], or others [55].
One advantage of restricting the transformations to rotations is that Euclidean distances in the new space are preserved.
This means that differences found in the original sensor space are preserved in the new component space. Another advantage is that all original dimensions contribute equally to the new dimensions. In contrast, the non-orthogonal shearing operation of CorrCA or CCA may discount some dimensions and emphasize others. Both of these advantages of orthonormal mappings, however, require that all new dimensions are considered. This is typically the case in applications of hyper-alignment [56]. For the purpose of identifying dimensions that are reliably reproduced, however, we find in practice that many dimensions are not significantly correlated between repeats, and one will typically want to remove those. Similarly, one will want to dismiss sensors in the original data that are not reproducible across repeats while emphasizing others. Ultimately, however, the application will dictate whether one should enforce uncorrelated components or orthogonal projections. In either case, the goal can remain to maximize ISC, and the solution can be found either through an eigenvalue equation (as we presented here for uncorrelated components) or with constrained optimization [for orthogonal projections, along the lines of 21].

B | RELATIONSHIP BETWEEN SCATTER MATRICES AND BETWEEN-AND WITHIN-SUBJECT COVARIANCE MATRICES
Matrix S_M, defined in (21), captures the variance of the mean across subjects and is zero when all subjects have equal mean: x̄_l* = x̄_**. With this result, scatter and covariance matrices can be related as follows:

C | REGULARIZATION AND ROBUSTNESS
Note that a high inter-subject correlation corresponds to large eigenvalues of R −1 W R B . These eigenvalues will be large if the selected direction exhibits large power in the space of R B , or low power in the space of R W . The former is meaningful and desired, but the latter can occur with rank-deficient or impoverished data and may lead to spuriously high intersubject correlation. Truncating the eigenvalue spectrum of R W has the desired effect of forcing the extracted data dimensions to have both high power in R B and R W . This makes it less likely to find spuriously reliable dimensions.
Formally, the TSVD approach is to select $V$ according to (59), where the within-subject covariance is now regularized with the rank-$K$ approximation
$$\tilde{R}_W = \sum_{k=1}^{K} \lambda_k u_k u_k^\top,$$
where $u_k$ and $\lambda_k$ are the $K$ principal eigenvectors of $R_W$ and their associated eigenvalues. The inverse of $\tilde{R}_W$ is computed on this principal subspace as
$$\tilde{R}_W^{-1} = \sum_{k=1}^{K} \lambda_k^{-1} u_k u_k^\top,$$
which then left-multiplies (59) and yields the regularized solution to the CorrCA problem.
The parameter K is the number of eigenvectors to retain in the representation of R W . Decreasing this number strengthens the level of regularization, and K = D is equivalent to not regularizing.
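A minimal sketch of this truncation in Python; the function name `tsvd_inverse` and the random stand-in matrices are ours for illustration (see the authors' provided code for the actual implementation):

```python
import numpy as np

def tsvd_inverse(Rw, K):
    """Rank-K (truncated) inverse of a symmetric covariance matrix.

    Keeps only the K principal eigenvectors of Rw; directions with
    near-zero within-subject power are discarded rather than amplified,
    so they cannot produce spuriously high ISC.
    """
    evals, evecs = np.linalg.eigh(Rw)           # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:K]           # indices of the K largest
    U, lam = evecs[:, idx], evals[idx]
    return U @ np.diag(1.0 / lam) @ U.T

# Illustrative use: regularized matrix for the CorrCA eigenvalue problem.
# Rw and Rb are random stand-ins, not real data.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
Rw, Rb = A @ A.T, np.eye(8)
evals, V = np.linalg.eig(tsvd_inverse(Rw, K=4) @ Rb)
```

With `K` equal to the full dimension, `tsvd_inverse` reduces to the ordinary matrix inverse, consistent with the statement that $K = D$ is equivalent to not regularizing.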
Shrinkage regularization operates by a similar principle: dimensions of the data exhibiting low variance (corresponding to low eigenvalues of $R_W$) are enriched by incrementing these eigenvalues:
$$\tilde{\Lambda}_W = (1-\gamma)\Lambda_W + \gamma\bar{\lambda} I,$$
where $\gamma$ is the shrinkage parameter with $0 \le \gamma \le 1$, and $\bar{\lambda} = \mathrm{Tr}(R_W)/D$ is the mean eigenvalue of the (unregularized) within-subject covariance. The effect of shrinkage, which retains the full dimensionality of the covariance matrix, is that the smaller eigenvalues are increased and the larger eigenvalues are reduced (hence the term "shrinkage").
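Shrinkage is a one-line operation on the covariance matrix; the sketch below (function name and example matrix are ours for illustration) shows the blend toward a scaled identity:

```python
import numpy as np

def shrink(Rw, gamma):
    """Shrinkage regularization of a covariance matrix.

    Blends Rw with a scaled identity: small eigenvalues are raised and
    large ones lowered toward the mean eigenvalue Tr(Rw)/D, while the
    full rank of the matrix is preserved.
    """
    D = Rw.shape[0]
    lam_bar = np.trace(Rw) / D
    return (1.0 - gamma) * Rw + gamma * lam_bar * np.eye(D)

# Example on a random stand-in covariance
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
Rw = A @ A.T
Rw_reg = shrink(Rw, gamma=0.3)
```

Note that `gamma = 0` leaves the matrix unchanged, `gamma = 1` yields the isotropic matrix $\bar{\lambda} I$, and the trace (total variance) is preserved for any `gamma`.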
The advantage of shrinkage regularization is the simplicity of its implementation: the regularized covariance can be computed directly as $\tilde{R}_W = (1-\gamma) R_W + \gamma\bar{\lambda} I$, without any further modifications to the generalized eigenvalue routine. TSVD instead requires more careful handling of the null-space when computing the generalized eigenvalues (see provided code). On the other hand, TSVD has the important advantage that the solutions $V$ to the regularized eigenvalue problem (59) diagonalize $\tilde{R}_W$ as well as the original $R_W$.
This means that the extracted components are strictly uncorrelated, whereas for shrinkage that is only approximately the case.
The performance of TSVD and shrinkage can be evaluated by maximizing ISC on one data set (training set) and using the resulting projection vectors to measure ISC on a different data set (test set). We do this on one of the application examples in section 3.5.
Note that the TSVD approach can also be used to regularize the MCCA algorithm. In this case, the regularized inverse $\tilde{D}^{-1}$ has to be computed separately for each block of the block-diagonal matrix $D$ in (34).

D | TESTING FOR STATISTICAL SIGNIFICANCE
The assumption of independent samples may be reasonable for rating data, where rated individuals may be independent from one another, but it is not fulfilled for time-series data such as EEG recordings. Here, we present three approaches to estimate the significance of individual correlated components, as well as the dimensionality of the correlated subspace, for IID as well as auto-correlated data.
Parametric null distribution-For IID data, we can use the F-statistic given in Eq. (26) to test for significant ISC (provided the means are equalized). Evidently, this statistic is not applicable to ρ values obtained on training data, as we have adapted the parameters to optimize this statistic itself. However, we can use the optimal projection vectors to assess statistical significance on separate, previously unseen test data. An important limitation is that the T exemplars have to be independent. This is perhaps true for rating data, but typically not for temporal signals, where samples in time are often correlated. For such data, the non-parametric tests discussed below are more appropriate. To avoid overestimating the number of components due to multiple testing (there are always D components to be tested), we adopt a Bonferroni correction. This limits the family-wise error rate, i.e., the probability of making one or more false discoveries among the D tests. Thus, only components with p < α/D are considered significant, where α is the desired significance level and p is computed using the F-statistic.
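The Bonferroni selection step can be sketched as follows; the function name is ours, and the p-values are assumed to come from the per-component F-test evaluated on held-out test data:

```python
import numpy as np

def significant_components(pvals, alpha=0.05):
    """Bonferroni-corrected selection among the D tested components.

    pvals: one p-value per component, computed on held-out test data with
    projections learned on training data. Returns indices of components
    significant at family-wise error rate alpha.
    """
    pvals = np.asarray(pvals)
    return np.flatnonzero(pvals < alpha / pvals.size)

# With D = 3 tests, the per-test threshold is 0.05 / 3, so only the
# first (hypothetical) p-value survives the correction
sig = significant_components([0.001, 0.02, 0.2])
```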
Phase-scrambled surrogate data-For auto-correlated data we have to use shuffle statistics. This approach was introduced by [57] as a tool to test for nonlinearities in time-series data. The idea is to create data that are consistent with the null hypothesis but otherwise resemble the observed experimental data in terms of temporal and spatial correlations. The original surrogate data of [57] preserve the amplitude spectrum of the original data but use a random phase spectrum, which is achieved through an application of the Fourier transform and its inverse. This approach was extended by [58] to additionally maintain the spatial covariance structure of multivariate data by using identical random phase shifts for all variables. Here, we use the approach of Prichard and Theiler to test for significant ρ, based on the consideration that no inter-subject correlation can be present after phase-scrambling the data of each subject. Variants of this approach have previously been used to test for significant correlations [59,33,14,60]. We test the validity of the three methods on simulated data in section 4. The goal is to determine under which conditions these methods determine the correct number of underlying correlated components.
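A minimal sketch of multivariate phase scrambling in the style of Prichard and Theiler (function name ours; see the authors' provided code for the implementation actually used):

```python
import numpy as np

def phase_scramble(x, rng):
    """Multivariate phase-scrambled surrogate data.

    x: array of shape (D, T). Each channel keeps its amplitude spectrum,
    and the SAME random phases are added to all channels, which preserves
    the spatial covariance structure. Applied independently per subject,
    this destroys any inter-subject correlation.
    """
    D, T = x.shape
    X = np.fft.rfft(x, axis=1)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=X.shape[1])
    phases[0] = 0.0                    # keep the DC component real
    if T % 2 == 0:
        phases[-1] = 0.0               # keep the Nyquist bin real
    return np.fft.irfft(X * np.exp(1j * phases), n=T, axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 256))
surr = phase_scramble(x, rng)
```

Repeating this per subject and recomputing ρ on each surrogate data set yields the empirical null distribution against which the observed ISC is compared.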

E | VALIDATION ON SIMULATED DATA: DATA GENERATION AND EVALUATION METHODS
The generated signal and noise components, $s_l(t) \in \mathbb{R}^K$ and $n_l(t) \in \mathbb{R}^D$, $t = 1, \dots, T$, $l = 1, \dots, N$, were mapped to the measurement space as $x_l^s(t) = A_l^s s_l(t)$, $x_l^n(t) = A_l^n n_l(t)$ and normalized as $x_l^s(t) \leftarrow x_l^s(t)/\|x_l^s\|_F$, $x_l^n(t) \leftarrow x_l^n(t)/\|x_l^n\|_F$, where $\|x\|_F$ denotes the Frobenius norm of the multivariate data set $x(t)$. The mixing matrices $A_l^s = O_l^s D_l^s \in \mathbb{R}^{D \times K}$ and $A_l^n = O_l^n D_l^n \in \mathbb{R}^{D \times D}$ were generated as products of matrices $O_l^s \in \mathbb{R}^{D \times K}$ ($O_l^n \in \mathbb{R}^{D \times D}$) with random orthonormal columns and diagonal matrices $D_l^s \in \mathbb{R}^{K \times K}$ ($D_l^n \in \mathbb{R}^{D \times D}$), the diagonal entries of which were drawn randomly as $D_{ii} = \exp(d_i)$, $d_i \sim \mathcal{N}(0, 1)$, and normalized to a maximum of 1. This ensured that the non-zero eigenvalues of the signal and noise covariance matrices decayed exponentially, as is the case for many real-world data sets. Unless otherwise noted, the same mixing matrices $A_l^s \equiv A^s$ and $A_l^n \equiv A^n$ were used for all subjects, reflecting the fundamental assumption of CorrCA that the spatial covariance structure of signal and noise components is the same in all subjects. The emulated data were generated as $x_l(t) = \xi x_l^s(t) + (1 - \xi) x_l^n(t)$, where the scalar parameter $\xi = 10^{\mathrm{SNR}/20} / (1 + 10^{\mathrm{SNR}/20})$ was used to adjust the SNR. Note that the SNR was defined here in terms of the contribution of the signal and noise components to the Frobenius norm of the entire multivariate measurement. It is, therefore, different from the definition provided in Eqs. (18) and (25), which applies to single channels or components and relates to the ISC through Eq. (22). Note that in most of our experiments, the ISC of each simulated signal component was set to 1, corresponding to infinite SNR according to Eq. (22).
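The generation recipe above can be sketched for a single subject as follows (an illustrative re-implementation under the stated assumptions; variable names are ours):

```python
import numpy as np

def mixing_matrix(D, K, rng):
    """A = O @ diag(d): random orthonormal columns times exponential gains
    normalized to a maximum of 1, so the induced eigenvalue spectrum
    decays roughly exponentially."""
    O, _ = np.linalg.qr(rng.standard_normal((D, D)))
    d = np.exp(rng.standard_normal(K))
    return O[:, :K] @ np.diag(d / d.max())

# One subject's emulated measurement at a given SNR (in dB)
rng = np.random.default_rng(0)
D, K, T, snr_db = 8, 3, 500, 0.0
A_s, A_n = mixing_matrix(D, K, rng), mixing_matrix(D, D, rng)
x_s = A_s @ rng.standard_normal((K, T))     # mixed signal components
x_n = A_n @ rng.standard_normal((D, T))     # mixed noise components
x_s /= np.linalg.norm(x_s)                  # Frobenius normalization
x_n /= np.linalg.norm(x_n)
xi = 10**(snr_db / 20) / (1 + 10**(snr_db / 20))
x = xi * x_s + (1 - xi) * x_n               # emulated measurement
```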
We applied CorrCA to the emulated data and measured its performance in terms of the achieved ISC (averaged over the first $K$ components) on the training set. We also measured the average ISC for a test set, i.e., using the same CorrCA projections $V$ on new emulated data of identical size. Moreover, we measured the performance of CorrCA in reconstructing the true (emulated) correlated components $s_l(t)$ and their corresponding mixing matrices $A_l^s$ by comparing them to $y_l(t)$ and the estimated forward model $A$ (section 2.5). The performance metric for this was the Pearson correlation between individual simulated and reconstructed components, or columns of $A^s$. With this, we measure whether each component was correctly identified. However, because components with equal ISC can be arbitrarily mixed within a subspace (Appendix A.3), we also assessed identifiability in terms of the angle between the subspaces spanned by the first $K$ columns of the estimated $\hat{A}$ and $A^s$, as well as the angle between the simulated components $s_l(t)$ and the first $K$ dimensions of $y_l(t)$ reconstructed by CorrCA. Angles were normalized to the interval [0, 1]. Finally, we estimated the number of simulated correlated components, $K$, using phase-scrambled surrogate data, surrogate data obtained using random circular shifts, and the parametric F-test. Empirical null distributions were obtained using 1000 surrogate data sets. The number of components with p-values smaller than α = 0.05 was used as an estimate for $K$. For the parametric test, the $T$ samples were randomly split into training and test sets of equal size, where the CorrCA projections $V$ were obtained on the training set and the statistical assessment was conducted on the test set. For this approach, the median number of components with p-values smaller than α = 0.05 across 100 random train/test splits was used as the estimate of $K$.
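The normalized subspace angle used here can be computed from the principal angles via an SVD; the sketch below is our own implementation of this standard construction:

```python
import numpy as np

def subspace_angle(A, B):
    """Largest principal angle between the column spans of A and B,
    normalized to [0, 1] (0: identical subspaces, 1: orthogonal)."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.clip(np.linalg.svd(Qa.T @ Qb, compute_uv=False), -1.0, 1.0)
    return float(np.arccos(s.min()) / (np.pi / 2))
```

Because the subspace angle is invariant to mixing of the columns within a subspace, it complements the per-component Pearson correlations when components with equal ISC are arbitrarily rotated against each other.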

F | EXTENSION TO NON-LINEAR MAPS USING KERNELS
So far we have only considered linear transformations between x and y. Now, given the close relationship between CorrCA and LDA, we can extend CorrCA to non-linear transformations following the approach of kernel-LDA [61,62,63,64]. Assume a non-linear mapping $\Phi(\cdot)$ for the data of subject $l$:
$$Z_l = \Phi(X_l).$$
In this notation, $X_l$ has dimension $D \times T$, spanning the entire multivariate time series for subject $l$, and $Z_l$ has dimensions $D' \times T'$. Thus, the non-linear transformation $\Phi(\cdot)$ is a mapping from space-time to a new space with possibly quite different dimensions.² The approach now is to first extract non-linear features of the data with $\Phi(\cdot)$, and then combine these features linearly with projection $v$:
$$y_l = v^\top Z_l.$$
Note that the transformation $\Phi(\cdot)$ does not need to be defined explicitly. Instead, the "kernel trick" [66] allows one to specify this non-linearity only implicitly, by defining the inner product of vectors in this new $D'$-dimensional feature space with a kernel function $\mathcal{K}(\cdot, \cdot)$:
$$\mathcal{K}(X_k, X_l) = Z_k^\top Z_l.$$

² In the example of brain signals, $Z_l$ could represent, for instance, the power spectrum in source space [65], where $D'$ captures the number of sources and $T'$ the number of frequency bins. In the application examples and in the code we simplify this by applying the non-linear mapping to each time point separately, so that $T' = T$ and we have a purely spatial non-linear mapping, similar to conventional kernel-LDA. However, this more general formulation allows one in principle to reduce the time dimension as well. This is important because the computational complexity of the kernel approach scales with $O(T'^3)$, and thus becomes prohibitive for long time-domain signals.
Knowing how to compute the inner product is sufficient, if all the expressions in the algorithm can be formulated in terms of this inner product.³ In a slight abuse of notation, we will write $K_{kl} = \mathcal{K}(X_k, X_l)$ as shorthand for this matrix defined for each pair of subjects $l, k$. Note that the dimensions of $K_{kl}$ are $T' \times T'$, and thus this matrix can potentially be quite large. In cases where the non-linear transformation $\Phi(\cdot)$ can be expressed explicitly, and $D' \ll T'$, one may choose to operate directly with $Z_l$. Otherwise, it will be beneficial to leverage the definition of the non-linear transformation in terms of $K_{kl}$. In the following, we will rewrite the optimization criterion $\rho$ in terms of $K_{kl}$ alone.
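For concreteness, a pairwise kernel matrix can be sketched as below. We use a Gaussian kernel as in the demonstration later in this appendix; the function name and the random stand-in feature matrices are ours:

```python
import numpy as np

def gaussian_kernel(Zk, Zl, sigma=1.0):
    """Kernel matrix K_kl between subjects k and l.

    Columns of Zk and Zl (shape D' x T') are the per-exemplar feature
    vectors; the result has shape T' x T', one entry per pair of
    exemplars, so it grows with the number of samples rather than with
    the feature dimension.
    """
    sq = (np.sum(Zk**2, axis=0)[:, None]
          + np.sum(Zl**2, axis=0)[None, :]
          - 2.0 * Zk.T @ Zl)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
Zk, Zl = rng.standard_normal((5, 30)), rng.standard_normal((5, 30))
K_kl = gaussian_kernel(Zk, Zl)
```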
The crucial step is to define the projection vector in terms of the samples of the mean across subjects:
$$v = \sum_i \alpha_i \bar{z}_{*i} = \bar{Z}_* \alpha,$$
where $\bar{z}_{*i}$ is the across-subject mean vector for exemplar $i$ in the non-linearly mapped space (the $i$th column of $\bar{Z}_*$). Here, $\alpha_i$ are parameters that indicate how the exemplars $\bar{z}_{*i}$ are to be combined to represent the projection vector $v$. The approach will be to find the optimal vector $\alpha$ instead of vector $v$, and thus we need to rewrite the optimality criterion in terms of $\alpha$.⁴ Combining (65) with (67), one can now express the non-linear components $y$ as a linear combination of $K_{lk}$ as follows:
$$y_l = v^\top Z_l = \alpha^\top \bar{Z}_*^\top Z_l = \alpha^\top \bar{K}_{*l},$$
where the bar in $\bar{K}_{*l}$ indicates that we are taking the mean, and the asterisk specifies over which index this mean is taken. It analogously follows that $v^\top \bar{Z}_* = \alpha^\top \bar{K}_{**}$.
Denote the columns of the matrices $Z_l$ and $\bar{K}_{*l}$ as $z_{li}$ and $\bar{k}_{*li}$, respectively, so that $\bar{k}_{*li} = \bar{Z}_*^\top z_{li}$.

³ Note that this inner product is only over dimension $D'$. The function $\mathcal{K}(\cdot, \cdot)$ is therefore only a "kernel" in this $D'$-dimensional space.

⁴ As an alternative to (67), we could have also defined the projection vector in terms of the samples of all subjects. The derivations can all be repeated analogously and the resulting algorithm is the same, but with $NT'$ degrees of freedom in $\alpha$. In the provided code, this is referred to as the "full model", whereas the equations derived here for the expansion in terms of the mean (67) are referred to as the "mean model". In the example provided in Figure 11, the numerical results are very similar for both models. The question as to whether the mean model is sufficient is empirical and comes down to whether there is enough diversity in the mean response to capture the individual variations across subjects. Mathematically, this implies that the expansion in terms of the mean is complete, i.e., the mean response matrix $\bar{Z}_*$ is full rank.
The within-subject and total variance of the non-linear features $z_{li}$ can thus be written in terms of the within-subject and total covariance of the kernel vectors $\bar{k}_{*li}$, denoted $C_W$ and $C_T$. Because of the symmetry of these expressions with the definitions of $R_W$ and $R_T$, we have again $C_B = C_T - C_W$, and can therefore write the inter-subject correlation in the non-linear space as a function of the new parameters $\alpha$ as follows:
$$\rho = \frac{1}{N-1} \frac{\alpha^\top C_B \alpha}{\alpha^\top C_W \alpha}.$$
This can be solved again as an eigenvalue problem, but now with the within- and between-subject covariance matrices of the kernel vectors $\bar{k}_{*li}$ and the corresponding projection vector $\alpha$. Note that a similar result is obtained for the kernel version of CCA, namely, the conventional CCA algorithm is applied to kernel matrices instead of the original data [67].
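Because the expressions are symmetric with the linear case, one generic eigenvalue routine covers both: applied to raw data it gives CorrCA, and applied to the per-subject kernel vectors (as feature columns) it gives the kernel variant. The sketch below is ours, with our own normalization conventions; the unnormalized within/between matrices differ from $R_W$, $R_B$ only by constant factors, which leave the eigenvectors unchanged:

```python
import numpy as np

def corrca_eig(features):
    """Generic CorrCA eigenvalue solve.

    features: list of N arrays, each (P, T): T exemplars of P features
    per subject. Returns eigenvalues and projection vectors sorted by
    decreasing eigenvalue; eigenvalue / (N - 1) corresponds to the
    component's inter-subject correlation.
    """
    centered = [F - F.mean(axis=1, keepdims=True) for F in features]
    Rw = sum(F @ F.T for F in centered)        # within-subject covariance
    S = sum(centered)                          # sum across subjects
    Rb = S @ S.T - Rw                          # between = total - within
    evals, V = np.linalg.eig(np.linalg.solve(Rw, Rb))
    order = np.argsort(evals.real)[::-1]
    return evals.real[order], V.real[:, order]

# Sanity check: two identical "subjects" make every component perfectly
# correlated, so all eigenvalues equal N - 1 = 1
rng = np.random.default_rng(0)
Z = rng.standard_normal((3, 50))
evals, V = corrca_eig([Z, Z.copy()])
```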
To demonstrate the method, we generated a simple two-dimensional example where the shared dimension is non-linear, and this non-linear relationship has to be discovered from the data alone. As with the example for linear CorrCA (Figure 1), we have simulated $N = 2$ "subjects", each providing two signals ($D = 2$), now with $T = 40$ samples in time (Figure 11 A & B). In order to test whether the algorithm finds the correct non-linear mapping, we generate data with a known non-linear relationship. In this case the two subjects share the amplitude (the distance from the origin in the two-dimensional input space of $x$; see Figure 11 C). However, the phase angle in this two-dimensional plane is selected randomly at each time point and for each subject. We apply kernel-CorrCA using Gaussian kernels [63], keeping the time axis unchanged ($T' = T$; see also our accompanying code). The first component dimension found by kernel-CorrCA, $y_1$, is approximately linear in the amplitude and is independent of the phase in the original 2D plane (Figure 11 G & H).
The algorithm therefore discovered, in the first component, amplitude as the shared non-linear dimension. The second component dimension, $y_2$, is essentially a non-linear function of component $y_1$ (Figure 11 F), hence it also has high ISC. The specific non-linearity between the two components is arbitrary, but does enforce that the two variables are uncorrelated. We obtain nearly identical results regardless of whether we use Gaussian or tanh kernels [see code and 63], although the specific relationship between $y_1$ and $y_2$ differs.

FIGURE 11. Kernel-CorrCA. In this demonstrative 2D example, two subjects share amplitude but have a random phase. The goal is for the algorithm to discover the common amplitude signal, a non-linear transformation of the data, from 60 samples. (A-B) Two dimensions of the original data, with one trace for each subject. (C) Samples are connected between the two subjects with gray lines, as in Figure 1. Circles of varying radii are drawn for visualization purposes. Note that connected points have the same distance from the centers, i.e., subjects share an amplitude. (D-E) After kernel-CorrCA using Gaussian kernels, signals between subjects are strongly correlated (ISC=1). (F) Points are in a new non-linear space, which is necessarily curved to decorrelate the two dimensions. (G-H) Colored lines correspond to the circles in panel C. The first dimension $y_1$ is independent of phase and grows linearly with amplitude. This means that the algorithm identified the non-linear dimension that is shared between the two subjects.

G.1 | Relationship of ISC to Intra-Class Correlation
The definition of inter-subject correlation in the present work is similar to the definition of intra-class correlation (ICC) first introduced by [68]. Fisher's ICC is referred to as pairwise intra-class correlation and differs from Pearson's correlation coefficient [69]. A more modern definition of intra-class correlation, which in the present context would allow for a different number of subjects per sample (i.e., $N_i$), was introduced by [70]. The definition of pairwise ICC, $r_p$, differs from our definition of ISC, $\rho$, in that it subtracts the total mean $\bar{y}_{**}$, whereas we subtract the sample mean of each subject, $\bar{y}_{l*}$. The two are identical in the case of unbiased ratings, i.e., $S_M = 0$ as defined in Eq. (21). A definition comparable to the ISC can be found in [6] in the context of MCCA under the term "modified inter-class correlation", but it is not commonly seen in the literature. Instead, the generalized eigenvalue solution to MCCA presented here has often been derived using the sum of correlations, subject to a normalization constraint [e.g. 28,71,29].
In Table 2 we summarize various uses of these correlation measures. For each case the indices play different roles.
In the context of ICC, index $i$ represents different classes, whereas $l, k$ are different exemplars of a class. Each class can have a different number of exemplars $N_i$. In the context of ISC, $l, k$ are subjects and $i$ represents the measures taken, perhaps in time. In the case of ratings, $l, k$ are the raters and $i$ are the items being rated. In Karlin et al. [70], the measure $r_p$ is used to assess intra-familial correlations (correlations between siblings). In this case, $k, l$ index the siblings, whereas $i$ enumerates the families. The algorithms listed in Table 2 aim to extract components in multi-dimensional data that maximize the corresponding statistic. Where no algorithm is listed, either CorrCA or LDA could be used.
The definition of ISC/IRC in Eq. (2) can be thought of as a Pearson correlation coefficient, where we concatenate the signals (after subtracting the individual subject/rater means) corresponding to all possible pairs of subjects. In contrast to conventional Pearson correlation, with this symmetric definition (where we include pairs $(l, k)$ as well as $(k, l)$), the signals of different subjects have to have identical scale in order to achieve perfect ISC/IRC (see section G.2). In the case of ratings, this means that different raters can only differ in their mean ratings, but otherwise have to provide identical deviations from the mean in order to achieve perfect inter-rater correlation. This means that this metric is sensitive to multiplicative bias (i.e., each rater/subject having a different scale). The issue of bias in the context of inter-rater agreement is discussed in more detail by [2], who argue that Pearson correlation should be used to assess test-retest reliability to account for learning effects of subjects from one test to the next. If there is an undesirable proportional bias, one can easily correct for it by standardizing the data, i.e., dividing by the standard deviation for each subject/rater in each dimension prior to applying CorrCA.
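The scale sensitivity can be made concrete with a small sketch for $N = 2$ (the function name is ours, and the symmetric pairwise correlation below is our illustration of the concatenation idea, not Eq. (2) itself): ordinary Pearson correlation between a signal and a scaled copy of it is 1, but the symmetric pairwise definition penalizes the multiplicative bias.

```python
import numpy as np

def symmetric_pairwise_corr(y1, y2):
    """Pearson correlation over the concatenation of both ordered pairs
    (l, k) and (k, l), after removing each subject's own mean."""
    a, b = y1 - y1.mean(), y2 - y2.mean()
    x = np.concatenate([a, b])
    y = np.concatenate([b, a])
    return (x @ y) / np.sqrt((x @ x) * (y @ y))

rng = np.random.default_rng(0)
y1 = rng.standard_normal(1000)
r_scaled = symmetric_pairwise_corr(y1, 2.0 * y1)   # = 4/5 = 0.8, not 1
r_equal = symmetric_pairwise_corr(y1, y1.copy())   # = 1.0
```

For the scaled pair, the cross-terms contribute $4S$ while each concatenated signal has energy $5S$ (with $S$ the energy of the demeaned signal), giving exactly 0.8 regardless of the data.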
As we have shown here, maximizing IRC is equivalent to maximizing inter-rater reliability, where we have defined reliability in Eq. (25) as the variance of the mean over the mean of the variance. For instance, if the rating is the speed of performing a task (as in application 3.2), this means that a 100-millisecond variation across repeats is fairly reliable for a task that takes 5 seconds to complete, but quite unreliable if the task only takes 0.5 seconds on average.
It should be noted that, when ratings are on a discrete numerical scale, IRC may not be an appropriate measure of reliability. For instance, a rating on an integer scale of 1, 2, 3, 4, 5 may have an inter-rater variability of ±1.0. With the present definition of IRC, $\rho$, in Eq. (2), this same variability would be considered half as reliable if the mean rating is 4 as compared to a mean of 2. Whether this is desirable or not may depend on the specific rating application. Finally, IRC can only be used meaningfully for ratings that are on a proportional scale. For categorical ratings one should instead use the conventional Cohen's κ or similar inter-rater reliability measures.

G.2 | ISC is normalized
To our knowledge, the present definition of ρ in Eq. (2) does not appear in this exact form in the literature, except for our previous work [14]. Therefore, we take some time here to show that this definition of correlation indeed has a maximal value of 1, and that this occurs only when all signals are identical across subjects (not just proportional, as in Pearson's correlation), except for a subject-dependent mean, which is permitted.
Without loss of generality, let us assume that all signals are zero mean,ȳ l * = 0. To show that ρ ≤ 1, we can also demonstrate, equivalently, that the following expression is non-negative: