Probabilistic representations as building blocks for higher-level vision

Current theories of perception suggest that the brain represents features of the world as probability distributions, but can such uncertain foundations provide the basis for everyday vision? Perceiving objects and scenes requires knowing not just how features (e.g., colors) are distributed but also where they are located and which other features they are combined with. Using a Bayesian computational model, we recover the probabilistic representations used by human observers to search for odd stimuli among distractors. Importantly, we found that the brain integrates information between feature dimensions and across spatial locations, leading to more precise representations than when such integration is not possible. We also uncover representational asymmetries and biases, show their spatial organization, and argue against simplified "summary statistics" accounts. Our results confirm that probabilistically encoded visual features are bound with other features and to particular locations, showing how probabilistic representations can serve as a foundation for higher-level vision.


| INTRODUCTION
How the brain represents the visual world is a long-standing question in cognitive science. One captivating idea is that the brain builds statistical models that describe probability distributions of visual features in the environment (Fiser et al., 2010; Knill & Pouget, 2004; Lange et al., 2020; Pouget et al., 2000; Rao et al., 2002; Tanrıkulu, Chetverikov, Hansmann-Roth, et al., 2021; Zemel et al., 1998). By combining information about different features and their locations, the brain can then form representations of objects and scenes. Indeed, the idea that the brain represents feature distributions matches our conscious visual experience well. Most objects, such as the apple in Figure 1A, contain a multitude of feature values that can be quantified as a probability distribution, and we are seemingly aware of these feature constellations. Surprisingly, most studies of probabilistic representations do not test how such constellations are represented, assuming instead that a stimulus is described by a single value, such as the orientation of a Gabor patch in vision studies or the hue of an item in working memory experiments, and that the only uncertainty comes from sensory noise. While this unrealistic assumption was noted long ago (Zemel et al., 1998), it is still prevalent, leaving open the possibility that the results can be explained by alternative models that do not assume detailed representations of probability distributions (Block, 2018; Rahnev, 2017).
Here, we aim to close this gap and ask 1) whether the visual system is capable of quickly forming precise representations of heterogeneous stimuli, representations that reflect the probability distribution of their features, and 2) whether such representations can be bound to other features or to spatial locations, thereby serving as building blocks for upstream object and scene processing.
What precisely do we mean by probabilistic perceptual representations? We assume that the brain operates with probabilistic representations if any of the internal variables used in perceptual decision-making is represented probabilistically, that is, allowing for uncertainty in its values (similar to, e.g., Koblinger et al., 2021). Often, the concept of probabilistic representations is embedded in the context of Bayesian models. Bayesian perceptual models assume that observers are presented with a stimulus s that generates sensory observations x with a certain probability p(x|s). The observer knows the parameters of this generative model, that is, p(x|s), and can invert it to compute p(s|x), the probability that a distal stimulus s has a certain value given a sensory observation x. Importantly, this approach focuses on how a stimulus is inferred from sensory observations. This inferred probability p(s|x) is associated with the probabilistic representation. This is intuitively agreeable as long as the focus is on a simple single stimulus but becomes murky when stimuli are heterogeneous, like the ones shown in Figure 1 (and, in reality, homogeneous stimuli do not exist), as it is not clear what an observer infers, or should infer, in this case.
We approach the problem of probabilistic representations differently, asking instead whether an observer can represent the probability that a stimulus (e.g., an apple or a set of lines in Figure 1) could have a feature with a certain value (e.g., the red color on an apple or a line with a certain orientation). For example, are apples likely to contain red? In Bayesian terms, this means extending the model outlined above to a stimulus-feature-observation hierarchy and asking whether observers can represent probability distributions within this hierarchy, specifically, whether a probability distribution of features given a heterogeneous stimulus is represented within an observer's generative model.
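The inversion step described above can be made concrete with a small numerical sketch (our own illustration, not the authors' model code), assuming Gaussian sensory noise and a Gaussian prior over a discretized orientation space:

```python
import numpy as np

# Hypothetical sketch: inverting a generative model p(x | s)
# to obtain p(s | x) over a discretized orientation space.
s = np.linspace(-90, 90, 181)            # candidate stimulus orientations (deg)

def gaussian(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2)

prior = gaussian(s, 0.0, 20.0)           # assumed prior over stimuli, p(s)
prior /= prior.sum()

x_obs = 15.0                             # a single noisy sensory observation
likelihood = gaussian(x_obs, s, 5.0)     # p(x | s), assumed sensory noise SD = 5

posterior = likelihood * prior           # Bayes' rule: p(s | x) ∝ p(x | s) p(s)
posterior /= posterior.sum()

# The posterior peaks between the observation and the prior mean.
print(s[np.argmax(posterior)])           # 14.0
```

The posterior over the grid combines the sensory likelihood with the prior, which is the sense in which p(s|x) is "inferred" from the observation.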

| Probabilistic representations of heterogeneous stimuli
How can the brain represent heterogeneous stimuli, that is, stimuli that have more than one feature value? The visual system may track each feature value at each location to form a precise representation isomorphic to the stimulus.
However, this would be extremely costly in terms of computational resources and unnecessary, or even misleading, for action, because specific feature values can vary from one moment to another with changes in viewpoint, lighting, etc. (Kristjánsson, 2022). Another possibility, explored in the "summary statistics" 1 or "ensemble perception" literature (Ariely, 2001; Cohen et al., 2016; Haberman & Whitney, 2012; Rahnev, 2017; Treisman, 2006), is that only a few values, for example, the mean and the variance, are represented. Note that the concept of probabilistic representations is rarely used in this field, because an observer can compute summary statistics without using probability distributions. For example, an average of the features can be computed arithmetically from sensory observations. However, such representations are functionally equivalent to a simplified probability distribution (Figure 1A). We believe that such simplified representations are also unlikely because multiple stimuli can have the same summary statistics while being quite different from each other. More realistically, the brain could compromise by approximating feature distributions in the responses of neuronal populations that capture important aspects of stimuli without being too detailed (Figure 1A).
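The point that summary statistics underdetermine a stimulus can be shown with a quick simulation (our illustration; the particular distributions are arbitrary): two stimulus sets with matching mean and variance but very different shapes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two orientation samples with (approximately) matching mean and variance
# but very different shapes (arbitrary example values).
unimodal = rng.normal(0.0, 20.0, 100_000)    # single peak at 0, SD = 20
bimodal = rng.choice([-20.0, 20.0], 100_000)  # two sharp peaks at ±20

print(round(unimodal.mean(), 1), round(unimodal.std(), 1))  # ≈ 0.0 20.0
print(round(bimodal.mean(), 1), round(bimodal.std(), 1))    # ≈ 0.0 20.0

# A "mean + variance" summary cannot distinguish these stimuli;
# a representation of the full distribution shape can.
```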
Previous studies have indeed shown that the visual system encodes the approximate distribution of visual features and uses it in perceptual decision-making (Girshick et al., 2011; Seriès & Seitz, 2013). However, most of these findings are confined to relatively long-term learning of environmental statistics. If feature probability distributions are to be useful for everyday visual tasks, such as object recognition or scene segmentation, the brain needs to learn feature distributions quickly and effortlessly. Importantly, we have recently provided evidence that such rapid learning may occur in simple cases by studying how human observers learn to ignore distracting stimuli while searching the visual scene (Chetverikov et al., 2016, 2017d, 2019; Hansmann-Roth et al., 2019). The basic idea of this Feature Distribution Learning paradigm is to use role-reversal effects upon response times, when targets and distractors change their roles between visual search trials, to reveal observers' expectations about upcoming search displays. Priming in visual search is a well-known phenomenon: search is faster when features of targets or distractors repeat from trial to trial, even when observers do not have to rely on previous trials, that is, when a target is defined as an odd-one-out (Kristjánsson, 2022; Kristjánsson & Campana, 2010; Kristjánsson & Driver, 2008; Lamy et al., 2008; Maljkovic & Nakayama, 1994).
When the targets and distractors switch features ("role reversal"), the search is slower. In our previous experiments, observers were asked to find an odd-one-out item in a search array where, importantly, distractor features (colors or orientations) were randomly drawn from a given probability distribution for several trials rather than having constant features. A test trial (introducing the role reversal) was then presented with a target of varying similarity to the previously learned distractors. We found that response times as a function of this similarity parameter followed the shape of the previously learned probability distribution, whether it was Gaussian, uniform, skewed, or even bimodal. That is, the search was slowed in proportion to how unexpected the target was, given the previously learned environmental statistics. This shows that representations of the shape of feature probability distributions in the visual input (similar to scene statistics; Oliva & Torralba, 2001; Rosenholtz, 2016) are not limited to long-term learning but can be formed rapidly.
This previous work was, however, limited to simple scenarios with a single feature distribution present, while real environments contain multiple objects (each with multiple features) and scene parts with many different features.
Furthermore, knowledge about the statistics of a given feature (e.g., orientation) in isolation is not very useful. Observers need to know where in the external world a given feature distribution is located and which other features should be bound with it (related to the "binding" problem; Treisman, 1996) to recognize objects or segment scenes. Notably, such binding to spatiotopic locations and to other features does not necessarily require any additional neural machinery, because information about feature distributions can be readily encoded in neural population responses (Pouget et al., 2000; Sahani & Dayan, 2003; Vértes & Sahani, 2018; Zemel et al., 1998). Evidence for such effortless integration of probabilistic visual inputs is, however, still lacking.

1 Note that this is different from image-computable summary statistics approaches based on the statistics of the outputs from multi-level image processing filters (Balas et al., 2009; Freeman & Simoncelli, 2011; Portilla & Simoncelli, 2000). While these are related, the statistics in the ensemble perception literature are conceptualized in a more abstract way, more consistent with the type of questions we are interested in here. Yet, even in the image-computable statistics literature it has been demonstrated that images identical in a model statistical space might still be distinguishable by observers, suggesting that image-computable summary statistics do not fully match human perception (Wallis et al., 2016).

[Figure 1 caption, fragment: … (upper-left), distractors were drawn from two distributions that were either mixed together or separated by location or color, with one example of the spatial separation shown here. We assumed that observers would form a distractor representation by learning which distractors are more probable, as shown in previous studies (bottom-left). On test trials (upper-right), we varied the similarity between the target and previously learned distractors. We then measured response times, assuming that they should be monotonically related to the probability of a given target being a distractor based on a simplified ideal observer model (bottom-right). C: Example stimuli used on learning trials in Experiment 1.]
Ensemble averaging studies testing how observers estimate probabilistic properties of several sets of stimuli provide some initial support for the hypothesis that probabilistic information can be bound to locations or other features.
It is well known that observers can estimate the average of a perceptual ensemble, such as the mean orientation of a set of lines (Alvarez, 2011; Haberman & Whitney, 2012; Whitney & Yamanashi Leib, 2018). Notably, they can estimate properties of subsets grouped by location or by other features, although this causes performance detriments (Attarha & Moore, 2015a, 2015b; Attarha et al., 2014; Chong & Treisman, 2005; Oriet & Brand, 2013; Utochkin & Vostrikov, 2017). This means that at least a summary representation, functionally equivalent to a simplified probabilistic representation based on mean and variance, can be bound to a location. If, for example, a mean can be computed only for a whole set, separate probabilistic representations of different subsets would, in our opinion, be less likely. Yet, this approach has only provided evidence for single-point estimates (e.g., the mean) but no direct evidence for binding of feature probability distributions. Here, we aim to overcome the limitations of previous studies and test how observers encode properties of feature distributions and bind them with both spatial locations and other features.

| RESULTS
In three experiments, observers viewed dressed-down versions of the environment that allowed precise control over the critical aspects of feature distributions. Observers searched for an unknown oddball target that differed from other items in orientation and judged whether it was in the upper or lower half of the stimulus matrix (Figure 1B). Observers did this quickly and accurately despite not knowing the target or distractor parameters before each block (average response time across experiments and conditions M = 754 ms, SD = 197; proportion correct M = 0.90, SD = 0.04; see Figure S1 for raw RTs on test trials by condition).
In all experiments, the trials were organized in miniblocks of intertwined learning and test trials. In each miniblock, during five to seven learning trials, distractor stimuli were randomly drawn from two probability distributions that were the same within each miniblock but differed between miniblocks. Crucially, learning trials were organized in different ways depending on the condition. In Experiment 1, the distractors from the two distributions were either mixed together (Baseline), colored differently (Color), or separated into different halves of the visual field (Spatial; see details below). On test trials, we randomly varied the similarity of the current target to non-targets from preceding trials (Figure 1B) with the aim of understanding how observers represent complex heterogeneous stimuli such as visual search distractors. The distractors on test trials were always drawn from a Gaussian distribution centered 60° to 120° away from the current target. We assumed that during the learning trials observers encode the distractors and that the distractor representation can be revealed by the response times on search trials. This would be consistent with our previous results showing that response times follow the shape of the probability density function of the distractors, whether Gaussian or uniform, leftwards or rightwards skewed (Chetverikov et al., 2016, 2017d), or even bimodal versus uniform (Chetverikov et al., 2017c). To lay the groundwork for the analyses of the empirical data, we first modeled the relationship between distractor representations and response times in a Bayesian observer model.
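The learning-trial structure can be sketched as follows (a hypothetical simulation of the design; the parameter values here are illustrative, not the exact experimental parameters):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sketch of a learning trial: distractor orientations drawn from
# two Gaussian distributions, either separated by hemifield (Spatial)
# or intermixed (Baseline). All values are illustrative.
def learning_trial(mean_cw=-20.0, mean_ccw=20.0, sd=10.0, n_per_side=18,
                   condition="Spatial"):
    cw = rng.normal(mean_cw, sd, n_per_side)
    ccw = rng.normal(mean_ccw, sd, n_per_side)
    if condition == "Spatial":
        left, right = cw, ccw                        # one distribution per hemifield
    else:                                            # Baseline: intermixed
        pooled = rng.permutation(np.concatenate([cw, ccw]))
        left, right = pooled[:n_per_side], pooled[n_per_side:]
    return left, right

left, right = learning_trial()
print(left.mean() < 0 < right.mean())                # True for the Spatial condition
```

In the Spatial condition the two hemifields carry systematically different orientation statistics, which is what would let an observer bind a learned distribution to a location.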

| Bayesian observer model
How do behavioral responses depend on distractor representations from previous trials? To answer this question and to reconstruct distractor representations from the behavioral responses of our observers, we built a Bayesian memory-guided observer model linking observers' internal representations of distractors to response times (Figure 2A). The model is described briefly here; the full description is available in the Methods.
We first describe the general structure of the model shown in Figure 2A. The observer obtains noisy sensory observations x_{i,t} of the stimuli s_i at each location i; these observations are not identical to the stimuli because of sensory noise that has a probability distribution p(x_{i,t} | s_i). In other words, a given stimulus might result in different sensory responses, and, conversely, a given sensory observation might correspond to different stimuli.

[Figure 2 caption, fragment: A: … Using knowledge about the sensory noise distribution and the approximation of feature distributions for targets and distractors, p*(s_i | L_T = i) and p*(s_i | L_T ≠ i), observers compute the probabilities that the sensory observations at a given location correspond to the target, p(L_T = i | x_i), or a distractor, p(L_T ≠ i | x_i). These probabilities are combined into a decision variable d_i used to make a decision or to continue gathering evidence if the currently available observations do not provide enough evidence for the decision (see details in Methods). B: The Bayesian observer model enables predictions about response times for a given representation of distractor stimuli based on the information acquired from previous trials, p*(s_i | L_T ≠ i, θ_prev) (see text for more details; different example distributions are shown in blue and green). Crucially, there is a monotonic relationship between the two, with response times on test trials increasing as distractor probability increases. C: In our analyses, we used the monotonic relationship between probabilistic representations and response times to recover the representation of distractors (right) based on the response times on test trials (left). Here, the data from an example observer in the Spatial condition are split based on whether the target was located in the left (orange) or in the right (blue) hemifield. We then estimated the parameters of the representation, such as the mean expected orientation (dashed orange line), the SD, and the across-distribution bias (the shift in the mean towards the other distribution relative to the true mean, shown by the dashed black line).]
Crucially, we do not assume that either the task parameters, such as p(s_i | L_T = i) and p(s_i | L_T ≠ i), or the stimuli are known to the observer. However, the observer knows the parameters of the sensory noise p(x_{i,t} | s_i) and has approximate knowledge of the target and distractor distributions, denoted with an asterisk, p*(s_i | L_T = i) and p*(s_i | L_T ≠ i). This knowledge can be further separated into knowledge based on the previous and on the current trial, e.g., p*(s_i | L_T ≠ i, θ_prev) and p*(s_i | L_T ≠ i, θ_curr) for distractors, with θ_prev and θ_curr corresponding to the latent variables describing the parameters of the previous and the current trial, respectively. That is, in contrast to traditional normative (ideal observer) models, our observer is not omniscient and does not know what the distribution of distractors was in the previous or the current trial. We assume instead that the observer has learned some approximation of the distractor distribution from previous trials and combines it with information about the current trial to improve search efficiency.
Using this knowledge, the observer aims to find the target by comparing, for each location, the probability that the sensory observations are caused by a target present at that location, p(L_T = i | x_i), against the probability that they are caused by a distractor, p(L_T ≠ i | x_i):

d_i = log p(L_T = i | x_i) - log p(L_T ≠ i | x_i)    (1)

where x_i = {x_{i,1}, x_{i,2}, …, x_{i,t=K}} are the samples obtained for location i up until a decision threshold is reached.
How can the observer estimate the probabilities in Eq. 1? Here we focus on the distractor-related part in the denominator, but similar derivations can be done for the target. First, following Bayes' rule, the posterior probability that a distractor is at a given location is proportional to the likelihood of the samples being drawn from the distractor distribution:

p(L_T ≠ i | x_i) ∝ p(x_i | L_T ≠ i)    (2)

Assuming that the samples are independent in time, their probability in log-space is equal to a sum of the individual probabilities:

log p(x_i | L_T ≠ i) = Σ_{t=1}^{K} log p(x_{i,t} | L_T ≠ i)    (3)

Importantly, the probability of a single observation corresponding to a distractor can be found by integrating over all possible stimulus values:

p(x_{i,t} | L_T ≠ i) = ∫ p(x_{i,t} | s_i) p*(s_i | L_T ≠ i) ds_i    (4)

In other words, the observer combines knowledge about the sensory noise and the distractor distribution to estimate the probability that a given sensory observation corresponds to a distractor.
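The integral in Eq. 4 can be approximated numerically on a discretized orientation space. The sketch below is ours (not the authors' implementation) and assumes Gaussian sensory noise and a Gaussian distractor distribution with illustrative parameter values:

```python
import numpy as np

# Numerical approximation of Eq. 4: the probability of a sensory observation
# under the distractor distribution is the sensory noise likelihood
# integrated against p*(s | L_T ≠ i). Parameter values are illustrative.
s = np.linspace(-90.0, 90.0, 721)
ds = s[1] - s[0]

def gaussian_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

p_s_distractor = gaussian_pdf(s, 20.0, 10.0)   # distractors centered at 20°

def p_x_given_distractor(x, noise_sd=5.0):
    # ∫ p(x | s) p*(s | L_T ≠ i) ds, discretized as a weighted sum
    return np.sum(gaussian_pdf(x, s, noise_sd) * p_s_distractor) * ds

# An observation near the distractor mean is far more probable under the
# distractor distribution than one far from it:
print(p_x_given_distractor(20.0) > p_x_given_distractor(-40.0))  # True
```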
Notably, we are mainly interested in the test trials, where the parameters of the current trial are independent of the parameters of the previous trials; hence:

p*(s_i | L_T ≠ i) ∝ p*(s_i | L_T ≠ i, θ_prev) p*(s_i | L_T ≠ i, θ_curr)    (5)

To reiterate: in our experiments, by design, the parameters of the current trial are controlled with respect to the current stimuli (i.e., the distractors on the current test trial are drawn from a distribution with a mean 60° to 120° off the current test target). Hence, only p*(s_i | L_T ≠ i, θ_prev) matters for relative changes in response times.
Notably, if sensory observations are obtained with high frequency and sensory noise is low relative to the uncertainty in the distractor representations, the log-sum of probabilities can be approximated as:

Σ_{t=1}^{K} log p(x_{i,t} | L_T ≠ i) ≈ K log p*(s_i | L_T ≠ i)    (6)

In words, if many samples are acquired at a given location, the log-probability that they are caused by a distractor is approximately equal to the number of samples obtained times the log-probability that a stimulus at this location is a distractor.
Using Eqs. 1, 5, and 6 (and analogous equations for target representations), the observer can then estimate the decision variable for all the samples obtained as:

d_i ∝ K [log p*(s_i | L_T = i) - log p*(s_i | L_T ≠ i)]    (7)

with constants subsumed under the proportionality sign. In words, the decision variable at the moment a decision is made is proportional to the difference between the evidence that a stimulus is a target and the evidence that it is a distractor, multiplied by the number of samples obtained.
Finally, assuming that target and distractor representations are independent and noting that response time is proportional to the number of observations needed to reach a decision, RT ∝ K, the observer's representation of distractor features learned from previous trials is related to response times:

RT ∝ [C_0 - C_1 log p*(s_i | L_T ≠ i, θ_prev)]^(-1)    (8)

where C_0 and C_1 are constants (see details in Methods). In words, there is an inverse relationship between response times and the approximate likelihood that a given stimulus is a distractor, p*(s_i | L_T ≠ i, θ_prev), with the information obtained from previous trials described by a set of latent parameters, θ_prev.
While this decision model is relatively simple, it provides good intuition for observer behavior in the task (a more optimal model is provided in the Supplement, but the conclusions do not depend on model choice). The model does not make any assumptions about how the observer learns these parameters. However, it shows that when the probability that a stimulus at a given location (e.g., a test target) is a distractor is lower, response times are lower as well, and vice versa.
This model provides an important insight, namely, that observers' representations are monotonically related to response times (Figure 2B). Hence, the monotonic relationship between the distribution parameters (mean, standard deviation, and skewness) reconstructed from RTs and the true representation parameters would hold under any other monotonic transformation (for example, if RTs are log-transformed and the baseline is subtracted, as we do in our analyses; see also Figure S2). In other words, the analysis above shows how response times can be used to approximately reconstruct observers' representations of distractors and estimate their parameters.
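The monotonicity of the RT mapping derived above can be checked in a few lines (an illustrative sketch with arbitrary constants, not a fit to data):

```python
import numpy as np

# Sketch of the reciprocal RT relationship derived above, with arbitrary
# constants C0 and C1: predicted response times grow monotonically with the
# probability of the probed orientation under the learned distractor
# representation.
s = np.linspace(-90.0, 90.0, 181)
learned = np.exp(-0.5 * (s / 15.0) ** 2)      # learned distractor representation
learned /= learned.sum()

C0, C1 = 30.0, 1.0
rt = 1.0 / (C0 - C1 * np.log(learned + 1e-12))

# Sorting orientations by their probability under the representation
# sorts the predicted RTs as well, i.e., the mapping is monotonic:
order = np.argsort(learned)
print(bool(np.all(np.diff(rt[order]) >= 0)))  # True
```

Because the mapping is monotonic, ordering test targets by their RTs recovers the ordering of their probabilities under the learned representation, which is what licenses the reconstruction used in the analyses.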

| Binding orientation probabilities to locations and colors
Having shown how observer response times can be related to distractor representations, we now turn to the empirical data. Observers' response times to different test targets allow us to infer which orientations were the most difficult to find, resulting in the longest response times. As explained above, this methodology has enabled us to reveal how observers can represent distractor distributions in surprising detail (Chetverikov et al., 2016, 2017d). Crucially, we can then reconstruct observers' representations of the probability distributions during learning trials (see Methods).
The experiments differed in the structure of the learning trials. There were three conditions in Experiment 1. The learning trials in the Spatial condition were organized so that the distractor distributions in the left and the right hemifields differed, to mimic the clustering of similar visual stimuli in the real world. In the Color condition, instead of spatial grouping, different distractor subsets were grouped by color while individual items were randomly distributed. Finally, in the Baseline condition, items from the two distributions had the same color and were randomly distributed (Figure 1C).

[Figure 3 (panel label: "Spatial, 60° distance") Spatial structure of probabilistic representations. A: Example stimuli (left column), recovered mean expected orientations (middle column), and the across-distribution biases in mean expected orientations relative to the true orientations at a given location (right column). The stimuli show a single learning trial from the search task in the corresponding experiment. The mean expected orientation (MEO) was then computed at each location relative to the overall average orientation in the preceding learning block. For presentation purposes, the data were rearranged so that the distribution in the left hemifield (or in columns 1, 2, 5, 6 in the stripes condition) was oriented clockwise relative to the overall mean. The biases in MEO were computed by subtracting the mean orientation for a given part of the distribution (e.g., in the left/right hemifield in the Spatial condition of Experiment 1) and recoding the resulting errors so that positive values correspond to a bias towards the other distribution. B: Average MEO by column of the stimulus matrix in the spatial conditions. Small dots show the data for individual observers; larger dots and bars show means and 95% CIs, respectively. Dashed horizontal lines show the true means for a given part of the distribution.]
First, we report the results on the mean expected orientations (MEO), corresponding to the means of the recovered representations (Figure 2C). If observers ignore the separation of the two parts of the distribution, then the MEO should match the mean of the overall distribution, but it should differ between the distributions if the representations are bound to locations or colors. For example, if observers accurately learn the properties of the distributions, the MEO should be at +20° relative to the overall mean in the Spatial condition when the test line is presented in the hemifield that previously had distractors with an average relative orientation of +20°.
We found that in the Spatial condition, observers' representations in each hemifield followed the actual physical distractor distribution (Figure 3). The estimated MEO relative to the overall mean was M = -14.02° (SD = 6.02) and M = 14.90° (SD = 5.14) for probes of the clockwise (CW) and counterclockwise-shifted (CCW) distributions, respectively.
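Orientation is a circular variable with a 180° period, so an average expected orientation has to be computed as a circular mean. A minimal sketch of such a computation (our illustration; the paper's actual estimation procedure is described in its Methods):

```python
import numpy as np

# Circular mean for orientations (180° period): orientations are doubled to
# map them onto the full circle, averaged as unit vectors, and halved back.
def mean_orientation(deg):
    doubled = np.deg2rad(np.asarray(deg) * 2.0)
    mean_angle = np.arctan2(np.sin(doubled).mean(), np.cos(doubled).mean())
    return np.rad2deg(mean_angle) / 2.0

print(round(mean_orientation([10.0, 20.0]), 6))   # 15.0 (matches arithmetic mean)
print(round(mean_orientation([85.0, -85.0]), 6))  # 90.0, not 0: wraps at ±90°
```

The second example shows why a naive arithmetic mean fails: 85° and -85° are nearly the same orientation, and their circular mean is 90°, not 0°.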

| Encoding orientation probabilities at different spatial scales
Having established that observers associate information about the most likely orientations with specific locations or colors, we then asked whether we could uncover the origins of the observed biases by assessing the recovered representations in the Spatial condition in more detail (for this and later analyses, we increased the sensitivity of our analyses by combining the data from the Spatial group in Experiment 1 with an additional participant group that performed the same task; see Methods). We computed the MEO using the aggregated data from all participants for each location in the stimulus matrix in this condition. As Figure 3 shows, across-distribution biases were stronger closer to the boundary between the two hemifields. We then tested this observation by directly comparing MEOs for test trials with targets presented at the boundary (two central columns) between the hemifields against other test trials. We found that the bias was significantly larger at the boundary between the two distributions than in the other columns (M = 4.80° (SD = 6.99) and M = 9.04° (SD = 11.36), b = 4.23, 95% HPDI = [0.21, 8.32], BF = 42.34; Figure 3B). However, the biases were also significantly above zero outside the boundary (BF = 248). This suggests that the distribution representations are not homogeneous and influence each other strongly when they are close in space, but this mutual influence also extends beyond the immediately neighboring locations (see Discussion).

| Bias strength depends on similarity and spatial arrangement
In two follow-up studies, we further investigated observers' representations of spatially-grouped heterogeneous stimuli.
In Experiment 2, we tested whether the similarity between the distributions along the tested feature dimension (orientation) affects the strength of the across-distribution biases. Recent studies suggest that similarity is an important factor determining whether information is pooled at different levels of perceptual processing (e.g., Coen-Cagli et al., 2015; Herrera-Esposito et al., 2021; Manassi et al., 2012; Qiu et al., 2013; Utochkin et al., 2018).
We hypothesized that the bias should be stronger when the stimuli from the two distributions are more likely to have the same cause in the external world. For example, the boundary effect in Experiment 1 might occur because stimuli that are close in space are more likely to belong to the same object. By the same reasoning, if the two distributions are less similar, they are less likely to have the same cause, and the biases should be weaker.
To test this, we used the same spatial arrangement as in the Spatial condition of Experiment 1, but the distribution means were now 60° away from each other instead of 40° as in Experiment 1 (see example stimuli in Figure 3A). We […]

In Experiment 3, we tested whether an even more complex spatial arrangement would allow us to recover the "map" of observers' expected orientations. To this end, the stimuli were organized in "stripes" of two matrix columns, with the two different distributions from Experiment 1 (means separated by 40°) positioned at odd and even stripes (counterbalanced across blocks; Figure 3A). We found that observers expected clockwise-rotated orientations (M = 6.20°, SD = 9.91) at the locations of stripes rotated 20° clockwise relative to the overall mean and counterclockwise-rotated orientations (M = -11.04°, SD = 17.11) at the other stripe locations. However, the across-distribution bias (M = 11.70°, SD = 7.52) was stronger than in the Spatial condition of Experiment 1 (b = 5.90, 95% HPDI = [2.50, 9.33], BF = 4.30). This demonstrates that while separating distributions in space helps observers track them (as shown in Experiments 1 and 2), the benefits of spatial organization decrease as the organization becomes more complex.

| Higher-order parameters of probabilistic representations
Next, we asked whether observers' representations contain more information about the distributions than just their average. We used the reconstructed distractor representations (Figure 4A) […] When the distributions were less well separated, observers were more uncertain in their estimates, leading to distractor representations with higher SDs (Figure 4B).

[Figure 4 caption, fragment: … Dashed vertical lines show the mean of the representation (black) and the true mean of the stimulus distributions (light gray). Note that the representations are aligned so that when two distributions are present, the true mean at the tested location is clockwise (-20° or -30°) while the other mean is counterclockwise (20° or 30° relative to the true mean). B: Estimated parameters (bias, SD, and skewness). Large dots and error bars show the mean across observers for a given parameter and the associated 95% confidence intervals. Smaller dots show data for individual subjects.]
We also expected that the distribution presented at the tested location or in the tested color would be weighted more heavily in the resulting representation, causing an asymmetry. Alternatively, if observers only use the mean and variance to encode the distribution (as assumed by "summary statistics" accounts), then the represented distribution should be symmetric. We found that observers' representations were asymmetric in all conditions, with a higher probability mass on the side corresponding to the distribution presented at the tested location or in the tested color, M = -0.03, 95% CI = [-0.04, -0.02]. Notably, however, no differences between conditions were found, BF = 1.99 × 10^-6, indicating that the asymmetry is not affected by the way the distributions are organized in the display. In sum, observers represent not only the average stimulus values but also their variability, and the representations are skewed towards distributions presented at other locations or in different colors.
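The logic of this asymmetry test can be illustrated with simulated data (entirely hypothetical values; the mixture weights and means are ours): a pure mean-plus-variance summary is symmetric, whereas a representation that also assigns probability mass to the other distribution has nonzero skew.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical illustration: a representation reflecting only the local
# distribution is symmetric, while one "pulled" towards a second distribution
# at +20° becomes skewed. All weights and means are illustrative.
local_only = rng.normal(-20.0, 10.0, 100_000)
pulled = np.concatenate([rng.normal(-20.0, 10.0, 80_000),   # local distribution
                         rng.normal(20.0, 10.0, 20_000)])   # mass at the other one

def skewness(x):
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

print(abs(skewness(local_only)) < 0.05)  # True: symmetric
print(skewness(pulled) > 0.5)            # True: skewed towards the other distribution
```

A mean-and-variance summary would describe both samples with nearly identical first two moments; only the third moment (skewness) reveals the pull towards the other distribution.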

| DISCUSSION
Our main hypothesis was that observers extract information about the probabilities of visual features from heterogeneous stimuli and bind the resulting probabilistic representations with locations on the one hand and with other features on the other. Our results clearly support both proposals. Importantly, this demonstrates for the first time how the visual system can build probabilistic representations of the visual world by extracting information about the features of complex heterogeneous stimuli.
A visual search task allowed us to uncover representations of heterogeneous distractors. We formulated a Bayesian observer model and demonstrated analytically and through simulations that response times are a monotonic function of observers' expectations about distractor orientations, supporting earlier empirical findings (Chetverikov et al., 2016, 2017b, 2019; Hansmann-Roth et al., 2019, 2021; Tanrıkulu et al., 2020). Using this knowledge, we were able to estimate the characteristics of observers' representations (their means, precision, and skewness) and to assess how they vary depending on whether observers can associate them with locations or with other, task-irrelevant features, such as color. We found that observers both encode and combine feature distributions in scenes containing two different distributions. The representations generally follow the physical distribution of the stimuli for a given location or a given color, but importantly, observers are also biased towards the other distribution. The strength of the bias depends on the degree of separation between the distributions. When the distributions were separated in space, observers' representations of one distribution were less influenced by the other distribution, compared to when they were separated by color or were intermixed (Baseline condition). Furthermore, as we found in Experiment 3, more complex spatial arrangements ("stripes") increased the biases towards the other distribution. In sum, observers bind probabilistic representations of visual features to locations and other features, but such binding is not impenetrable, reminiscent of 'illusory conjunctions' of discrete feature values (Treisman & Schmidt, 1982).
We were then able to recover the representation of the distribution at different spatial scales. We found that for spatial separation, the biases are stronger at the boundary between the two distributions. This is reminiscent of the hierarchical organization of information about feature probabilities within a scene proposed for perceptual ensembles (Alvarez, 2011;Haberman & Whitney, 2012). Such hierarchical ensemble models suggest that observers represent information about feature probabilities at different levels: for example, the orientation statistics at a particular location are combined to form a representation for a group of items, which are, in turn, combined to form an overall ensemble representation. Our results agree with this idea: the stimuli observers expect at a given location depend not only on what was previously shown at this location but also on stimuli presented at other locations. Crucially, biases were also present for the Color condition as well as for the non-boundary locations in the Spatial condition of Experiment 1. This indicates that the results cannot be explained by purely local summation of the inputs. It remains to be tested whether there are actual separable representations of probability distributions at different levels, or just a unified spatio-featural map guiding observer responses.
We hypothesized that the representations should be more biased by each other when they are more likely to have the same cause in the external world. This could provide a normative explanation for the boundary effect: sensory input from adjacent locations is likely to be caused by the same object and should therefore be integrated, while locations far away from each other should be treated separately. Similarly, for example, in multisensory integration studies, auditory and visual signals are less likely to be integrated when there is a large discrepancy in their locations (Körding et al., 2007; Shams & Beierholm, 2010). However, in Experiment 1, we found across-distribution biases at locations far from the other distribution. We reasoned that this is because the stimuli themselves are similar enough to be potentially caused by the same object, and the inputs are therefore integrated even from non-neighboring locations. In Experiment 2, we tested this explanation by asking if the similarity between the distributions themselves in the tested feature domain (orientation) also plays a role. We found that when the distributions were made more dissimilar, the biases were observed only at the boundary between the distributions but not at other locations. That is, observers no longer take into account the input from non-neighboring locations when stimuli are dissimilar. Speculatively, introducing longer learning streaks could also help to reduce the bias by increasing the precision of the representations (Chetverikov et al., 2017c). This supports the proposed normative explanation and suggests that the principles of information integration for heterogeneous visual inputs are the same as for other cases, such as multisensory integration or estimation of complex visual features.
We then tested if observers represent more than just the mean distractor orientation. We found that observers represent the distractor variability (i.e., the standard deviation or width of their representations), which varies in a predictable fashion with the separability between distractor distributions. When distractor distributions are poorly separated (e.g., only by color or are organized in 'stripes'), their representations are wider, indicating more uncertainty.
Furthermore, the representations are asymmetric, with a fatter tail on the side of the orientations matching the tested location or color. While we are agnostic about the specific mechanisms of how the information is integrated across different parts of the visual field, speculatively, such asymmetric distributions can be seen as the output of a hypothetical weighting process. As a computational abstraction, the resulting representation can be seen as a normalized sum of basis functions (similar to a kernel density estimator). The weight of a given basis function could depend on how well it matches the stimuli across the visual field (i.e., how many distractors had a certain orientation) and on their relevance to the current goals. The presence of skewness indicates that observers do not simply represent the distractors with a (biased) mean and variance: their representations have a complex shape, with more relevant information (e.g., previous orientations at a tested location) weighted more heavily and less relevant information (e.g., previous orientations at other locations) weighted less, but still influencing the outcome. However, we did not find an effect of condition on the distribution asymmetry, which would be expected if the matching and non-matching parts of the distribution were combined as a weighted mixture with weights depending on the degree of separation. The absence of the condition effect might be related to the greater difficulty of precisely estimating skewness as opposed to mean or variance. Nevertheless, the overall skew in the representations indicates how sophisticated the learning can be: various factors, such as the amount of information about the underlying probability distribution and the task-relevance of the stimuli, are taken into account in the representation and determine how different parts of the display are weighted.
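As a concrete illustration of this computational abstraction (a sketch, not the fitted model from our analyses), the code below builds a normalized sum of von Mises basis functions over orientation space. The sample orientations, relevance weights, and bandwidth are hypothetical; the point is only that a lower-weighted, non-matching input still skews the resulting representation.

```python
import numpy as np

def weighted_orientation_kde(samples_deg, weights, bw_deg=10.0, grid_deg=None):
    """Normalized weighted sum of circular (von Mises) basis functions over
    orientation space (period 180 deg), akin to a kernel density estimate."""
    if grid_deg is None:
        grid_deg = np.arange(0.0, 180.0, 1.0)
    # Double angles so that orientation (period 180 deg) maps onto the circle.
    grid = np.deg2rad(2.0 * np.asarray(grid_deg))
    centers = np.deg2rad(2.0 * np.asarray(samples_deg))
    kappa = 1.0 / np.deg2rad(2.0 * bw_deg) ** 2  # rough bandwidth -> concentration
    w = np.asarray(weights, float)
    w = w / w.sum()
    # Each sample contributes a von Mises bump scaled by its relevance weight.
    density = sum(wi * np.exp(kappa * np.cos(grid - c))
                  for wi, c in zip(w, centers))
    return np.asarray(grid_deg), density / density.sum()

# Hypothetical inputs: orientations near 90 deg at the tested location get high
# weight; a 110-deg input from another location gets a low, non-zero weight.
grid, pmf = weighted_orientation_kde([85.0, 90.0, 95.0, 110.0],
                                     [0.3, 0.3, 0.3, 0.1])
```

The resulting probability mass function peaks near the high-weight orientations but carries extra mass on the 110° side, i.e., it is skewed towards the less relevant input.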
These findings indicate that observers represent information about distractor features as a probability distribution rather than only in terms of summary statistics, in contrast to popular "summary statistics" accounts. For example, Treisman (2006) argued that statistical processing is a distinct mode of perceptual and attentional analysis of stimulus sets. She proposed that because of limited attentional capacity, statistical summaries are generated that include the mean, variance, and perhaps the range. These summaries enable rapid assessment of the general properties and layout of natural scenes (Chong & Treisman, 2005; Emmanouil & Treisman, 2008). Similarly, Rahnev (Rahnev, 2017; Yeon & Rahnev, 2020) argued that observers represent only a summary consisting of the most likely stimulus and the associated strength of evidence, and Cohen et al. (2016) used summary statistics to explain the richness of conscious experience. Our results argue against such views, since the representations that are bound together are far more detailed than these accounts imply. Instead, the brain might approximate the visual input by using a complex set of parameters that provide accurate descriptions of feature probabilities (Freeman & Simoncelli, 2011; Rosenholtz, 2020).
A recent finding may provide insights into why summary statistics accounts have been so popular. Hansmann-Roth et al. (2021) reasoned that optimal behavior requires the encoding of full feature distributions, not only summaries, but observers might be unable to explicitly report the full distribution. This is analogous to how difficult it might be to verbally describe the variety of colors of an apple without resorting to simplifications (see Figure 1A). Hansmann-Roth et al. tested observers' representations both implicitly and explicitly, and while explicit judgments were limited to the mean and variance of feature distributions, implicit measures revealed detailed representations of the same distributions. More information was therefore available to observers than studies of summary statistics, which have mostly relied on explicit measures, have indicated. Crucially, Hansmann-Roth et al. were able to uncover why: revealing these detailed representations requires implicit methods, where representations are probed by assessing the effects of role reversals between targets and distractors, as we do here.
Can the results observed in our study be explained by simple sensory adaptation mechanisms? Sensory adaptation is a well-studied phenomenon: at the neural level, exposure to a stimulus alters neural responses to subsequently presented stimuli, while behaviorally, adaptation often results in a repulsive bias, with estimates of, for example, orientation in an adjustment task shifted away from the adapter (Clifford et al., 2007; Schwartz et al., 2007). In our task, observers were exposed to a certain distractor distribution within each miniblock, so their behavior could be influenced by adaptation. We tested this idea by developing a variation of the model reported here assuming that the observer does not encode the knowledge about the distractors directly (p*(s_i | L_T ≠ i, θ_prev) is flat) but instead the sensory space is warped to efficiently encode the distractor distribution (Stocker & Simoncelli, 2005; Wei & Stocker, 2012). The results (not shown) indicated that the search would be most efficient when a target matches previous distractors, contrary to our findings and the well-known role-reversal effects in the visual search literature (Kristjánsson & Driver, 2008). The intuition here is that when targets and distractors are on opposite sides of the adapter, they are actually pushed together due to the circularity of orientation space. Furthermore, recent studies using a similar behavioral task (Rafiei et al., 2021; Pascucci et al., 2022) also speak against the involvement of adaptation. Rafiei et al. (2021) looked at how the perception of a current search target and of a neutral line is affected by distractors and previous targets. Importantly for the current issue, they found that when distractors are similar to the test item, its estimates are pulled towards the distractors, while the adaptation profile is generally repulsive for similar items.
In a recent study by Pascucci, Ceylan, and Kristjánsson (2022), the authors found that when observers passively viewed an array of lines that all came from the same distribution (no singleton), no learning of the distribution occurred. In summary, both the modelling results and the empirical data suggest that sensory adaptation cannot explain our findings.
Where does the uncertainty in the distractor representation come from, and how do observers learn about it?
Our study is agnostic on this issue. We do not assume any parametric form for this representation (that is, we do not assume that observers represent, for example, the mean or variance). Note, however, that it is unlikely that a single-trial representation is a simple Gaussian and that the skewness appears only when aggregating across trials. This idea was tested in a recent study by Chetverikov et al. (2020), where observers had to find two targets on each trial with a mixture distribution similar to the Baseline condition used here. By comparing a model assuming that the distribution is approximated as a single-peaked function with a model assuming that it is approximated as a mixture of two parts, we found that the hypothesis of a simple Gaussian representation is not supported by the data.
It might well be, however, that at a more detailed level, for example, at the level of a single location or a single moment in time, the representation is a simple one. Yet, from a computational perspective, these simpler representations have to be combined into an aggregated representation of distractors, shown here and in our previous studies to correspond to a probability distribution of distractors with more detail than simple summary statistics would allow.
In our experiments, observers learn the distractor feature by combining inputs from heterogeneous stimuli across several trials in each block, and it can be argued that this is different from perceiving a single stimulus on a single trial.
However, the visual cortex aggregates information on many different timescales (de Lange et al., 2018). Even on a single trial, perception unfolds in time and at each moment is dependent on what has been seen before. And even for a simple stimulus, the visual cortex receives inputs from many retinal neurons that are affected by processing noise, potentially indistinguishable from the input from varying features. Indeed, this is why stimulus variability ('external noise') is often used to manipulate visual uncertainty (Barthelmé & Mamassian, 2009;Hénaff et al., 2020). We therefore believe that distinguishing "simple" and "complex" perception is impossible. However, our results clearly show that information about feature probabilities is available for visually-guided behavior.

| SUMMARY
Taken together, our results show that observers can not only encode probabilities of features from heterogeneous stimuli in detail, but also integrate them with both locations and other features that have different distributions. These results arguably represent the strongest support yet for the long-standing idea that the brain builds probabilistic models of the world (Chetverikov et al., 2017a;Fiser et al., 2010;Knill & Pouget, 2004;Orhan & Ma, 2015;Rao et al., 2002;Sahani & Dayan, 2003;Tanrıkulu, Chetverikov, Hansmann-Roth, et al., 2021) and show that probabilistic representations can serve as building blocks for object and scene processing. Notably, such representations are not simply limited to summary statistics (e.g., a combination of mean and variance; Cohen et al., 2016). Our results also indicate that observers do not represent physical stimuli precisely, but instead construct an approximation influenced by input from other stimuli. This probabilistic perspective stands in sharp contrast to views where discrete features of individual stimuli are either bound together to form objects or processed "statistically" (Rosenholtz, 2020;Treisman, 2006). Instead, we suggest that the probabilistic representations are automatically bound to locations and other features since such binding occurred even though it was not required in the task. Probabilistic representations are therefore not acquired in isolation but constitute an integral part of perception.

| Participants
In total, eighty observers (fifty female, age M = 23.10) participated in the experiments. Twenty observers (ten female, age M = 25.45) participated in the first experiment (Baseline, Spatial, and Color conditions) split across two sessions.
Twenty observers (fourteen female, age M = 25.00) participated in Experiment 2 ("Spatial, 60° distance") and another twenty (thirteen female, age M = 25.45) in Experiment 3 ("Spatial, stripes"). Finally, data from an additional twenty observers (thirteen female, age M = 16.50) were collected for the Spatial condition of Experiment 1 to increase the sensitivity of the spatial analyses.

| Procedure
In Experiment 1, each participant performed a search task in five conditions. In each condition, on each trial, observers were presented with 8×8 matrices of 64 lines (line length: 0.71° of visual angle; matrix size: 16×16°; uniform noise of ±0.5° was added to each line coordinate). The goal was to find the odd-one-out line whose orientation differed most from the others. Sessions were divided into miniblocks of 5 to 7 learning trials followed by 1 or 2 test trials (the number of trials was chosen randomly for each block; the variation in the number of trials was introduced to decrease the effect of temporal expectations, Shurygina et al., 2019). During learning trials, the overall mean of the distractors stayed the same within each miniblock (but varied randomly between miniblocks), with half of the distractors drawn from one distribution and the other half from another distribution, the properties of the distributions differing between conditions. Baseline: two truncated Gaussian distributions with SD = 10° and a range of 40°, with means separated by 40° (±20° relative to the overall mean); all stimuli had the same color (white); half of the distractors were drawn randomly from one distribution and half from the other, then they were positioned randomly within the stimuli matrix, and then a randomly chosen distractor was replaced with a target.
Spatial: two distributions (either a truncated Gaussian with SD = 10° and a range of 40° or a uniform distribution with a range of 40°, in random combinations) with means separated by 40° (±20° relative to the overall mean); all stimuli had the same color (white); one distribution was shown in the left half of the matrix, the other in the right half.
Color: the same distributions as in the Spatial condition were used, but lines drawn from one distribution were blue, while lines from the other distribution were yellow. Positions for each line within the stimuli matrix were chosen randomly.
In all cases, two lines were added to each distractor distribution with their orientation equal to the minimal and maximal values from that distribution range. As a result, Gaussian and uniform distributions always had the same range. The target orientation on each trial was drawn randomly from a uniform distribution ranging between 60°and 120°relative to the mean distractor orientation.
On test trials, distractors came from a single Gaussian distribution with SD = 10° (range-restricted in the same way as described above) and a mean sampled from a 0-180° uniform distribution, while target orientation was determined in the same way as on the learning trials (that is, selected randomly from a 60-120° range relative to the current distractor mean). In the Color condition, half of the lines from that distribution were blue, half were yellow.
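The learning-trial generation just described can be sketched as follows. This is a simplified illustration for the Spatial condition only; the function names and use of NumPy are ours, and details such as the coordinate jitter and the uniform-distribution variant are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

def truncated_gaussian_orientations(mean, sd=10.0, half_range=20.0, n=30, rng=rng):
    """Draw orientations from a Gaussian truncated to mean +/- half_range (deg)."""
    out = []
    while len(out) < n:
        s = rng.normal(mean, sd)
        if abs(s - mean) <= half_range:
            out.append(s)
    return np.array(out)

def spatial_learning_trial(rng=rng):
    """One 8x8 learning display for the Spatial condition (sketch): two
    distributions with means +/-20 deg around a random overall mean, one per
    matrix half; the range endpoints are appended as extra lines; one random
    distractor is replaced by a target 60-120 deg off the overall mean."""
    overall = rng.uniform(0.0, 180.0)
    lines = np.zeros((8, 8))
    for half, offset in ((slice(0, 4), -20.0), (slice(4, 8), +20.0)):
        m = overall + offset
        vals = truncated_gaussian_orientations(m, n=30, rng=rng)
        vals = np.concatenate([vals, [m - 20.0, m + 20.0]])  # range endpoints
        rng.shuffle(vals)
        lines[:, half] = vals.reshape(8, 4)
    # Replace a random line with the target, 60-120 deg from the overall mean.
    r, c = rng.integers(0, 8, size=2)
    lines[r, c] = overall + rng.choice([-1.0, 1.0]) * rng.uniform(60.0, 120.0)
    return np.mod(lines, 180.0), (int(r), int(c)), overall
```

Every non-target line in the left half then lies within 20° (circularly) of the left distribution mean, mirroring the truncation described above.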
The Baseline condition had 2304 trials, while the Spatial and Color conditions had 5376 trials each with the higher number of trials used in the latter case to counterbalance additional factors (distribution type combinations).
The trials were split into two (for Baseline) or four sessions (other conditions) with a break for rest halfway within each session. Observers participated in each session at a separate time depending on availability but with a break of no less than two hours between sessions and no more than two sessions within a day. The order of sessions with respect to conditions was counterbalanced with a Latin square design.
Experiments 2 and 3 followed the same general procedure as the Spatial condition of Experiment 1. In Experiment 2, the means of the distributions were separated by 60° (±30° relative to the overall mean) instead of 40° as in Experiment 1. In Experiment 3, the two distributions were separated by 40°, as in Experiment 1, but arranged in "stripes", so that lines drawn from the first distribution were positioned in the 1st, 2nd, 5th, and 6th columns of the stimuli matrix while the other columns were populated with lines from the second distribution. In both experiments, each participant took part in two sessions of 1536 trials each, with a rest period halfway through each session.

| Data processing
For our main analyses of interest, incorrect responses were excluded, and response times were log-transformed and centered by subtracting the mean for each participant. Then, to reduce the noise in RT measurements, spatial and featural confounds were removed (the results remain similar when no corrections are applied). First, the effect of the distance between target locations on consecutive trials and the effect of the target location were removed by regressing out fifth-degree polynomials of the absolute distance (in degrees of visual angle) between the target locations on the current and the previous trials and of the current target's horizontal and vertical coordinates. Then, we also removed potential influences of the well-known oblique effect (search speed differs between oblique and cardinal stimuli; Chetverikov et al., 2017a; Wolfe et al., 1999) by regressing out fifth-degree polynomials of target and distractor obliqueness, computed as the absolute distance in degrees to the nearest cardinal orientation.
The regression was run separately for each experiment and condition.
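The residualization step described above can be sketched as follows, assuming ordinary least squares with each confound entered as polynomial terms of degrees 1 through 5 (the exact design matrix in the actual analysis may differ):

```python
import numpy as np

def regress_out(y, confounds, degree=5):
    """Residualize y on polynomial expansions (degrees 1..degree) of each
    confound via ordinary least squares; returns residuals plus the mean,
    so the corrected values keep the original overall level."""
    y = np.asarray(y, float)
    X = [np.ones_like(y)]                      # intercept
    for c in confounds:
        c = np.asarray(c, float)
        for d in range(1, degree + 1):
            X.append(c ** d)                   # polynomial term of degree d
    X = np.column_stack(X)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta + y.mean()
```

For a toy check, log RTs generated as a pure polynomial of a single confound are reduced to a flat series after correction.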
To reconstruct observers' distractor representations, we used the response times on the first test trial in each miniblock. We then converted the response times as a function of the similarity between the test target and the previous distractor mean to a probabilistic representation and estimated its parameters.
To convert the noisy response times into probabilities, we first smoothed RT as a function of the test target and the previous distractor mean using the local regression approach (a generalization of the moving average) for each observer in each condition. To account for circularity, we appended 1/6 of the data from each end of the orientation space to the opposite end before smoothing. In analyses applied to each stimulus location, we further assumed that RTs are a smooth function of the stimuli matrix row within the local regression, while columns of the stimuli matrix were treated independently. We then transformed the smoothed RT function into a probability mass function by subtracting the baseline and normalizing to one. Finally, we computed the parameters of the recovered probabilistic representation: the mean expected orientation (circular mean), circular standard deviation, and circular skewness as defined by Pewsey (2004). Note that under the hypothesized Bayesian observer model, the estimated standard deviation and skewness are monotonically related to the true parameters of the distractor representation but are not identical to them (additionally confirmed in simulations, Figure S2).
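The circular moments can be computed from a reconstructed probability mass function as follows. This is a sketch: we double the angles so that orientation, which has a 180° period, maps onto the full circle, and use standard trigonometric-moment definitions, with skewness following Pewsey (2004).

```python
import numpy as np

def circular_moments(orient_deg, pmf):
    """Circular mean, SD, and skewness of a pmf over orientations (period 180
    deg). Angles are doubled onto the circle; results are mapped back."""
    phi = np.deg2rad(2.0 * np.asarray(orient_deg))
    p = np.asarray(pmf, float)
    p = p / p.sum()
    z = np.sum(p * np.exp(1j * phi))                 # first trigonometric moment
    mu = np.angle(z)                                 # mean direction (doubled)
    R = np.abs(z)                                    # mean resultant length
    mean_deg = (np.rad2deg(mu) / 2.0) % 180.0
    sd_deg = np.rad2deg(np.sqrt(-2.0 * np.log(R))) / 2.0
    skew = np.sum(p * np.sin(2.0 * (phi - mu)))      # Pewsey-style b2
    return mean_deg, sd_deg, skew

# Illustrative pmfs: symmetric von Mises-shaped pmfs centered at 90 deg,
# one narrow (high concentration) and one wide.
grid = np.arange(0.0, 180.0)
phi = np.deg2rad(2.0 * grid)
narrow = np.exp(8.0 * np.cos(phi - np.pi)); narrow /= narrow.sum()
wide = np.exp(2.0 * np.cos(phi - np.pi)); wide /= wide.sum()
```

A symmetric pmf yields zero skewness, and the narrower pmf yields a smaller circular SD, matching the intended use of these parameters.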
Unless stated otherwise, we used Bayesian hierarchical regression with the brms (Bürkner, 2017) package in R.
Note that while we include Bayes factor values in the description of the results, we were mostly interested in measuring the effects of the variables of interest in our models, and hence the models included the default flat (uniform) priors for regression coefficients. Given that Bayes factors are prior-dependent, we believe that the information provided by the 95% highest posterior density intervals (HPDI) is more useful for judging the results than the Bayes factors.
To make sure that the conclusions do not depend on the particular analytic approach, we repeated the analyses using conventional frequentist statistical tests with the same results (the report using this approach is provided alongside the data in an online repository, see data availability statement).

| Bayesian observer model
In our experiments, participants located a target among a set of distractors and indicated if it was in the upper or the lower part of the stimuli matrix. On each trial, the experimenter sets the task parameters, namely, the parameters of the target distribution, p(s_i | L_T = i), and the parameters of the distractor distribution, p(s_i | L_T ≠ i), for each location i = 1 . . . N in the stimuli matrix, as well as the target location, L_T. These parameters were then used to generate the stimuli s_i at each location.
Neither the task parameters nor the stimuli are known to the Bayesian observer. Instead, at each moment in time t within a trial, the observer obtains sensory observations at each location, x_i,t. These observations are not identical to the stimuli because of the presence of sensory noise, p(x_i,t | s_i). That is, a given stimulus might result in different sensory responses, and, conversely, a given sensory observation might correspond to different stimuli. We assume that the observations are distributed independently at each location and at each moment in time.
To make an optimal decision in a particular task, the observer needs to know the relationship between the sensory observations and the task-relevant quantities. For the visual search task used in our study, we assumed that observers compare, for each location, the probability that the sensory observations are caused by a target present at that location, p(L_T = i | x_i), where x_i = {x_i,t=1, . . . , x_i,t=K} are the samples obtained for location i up until time K, against the probability that they are caused by a distractor, p(L_T ≠ i | x_i):

d_i = p(L_T = i | x_i) / p(L_T ≠ i | x_i)

The observer then decides that a given item is a target as soon as the decision variable at a given location reaches a certain threshold B. Although this decision rule is not fully optimal, because the observer makes a decision for each item individually, it greatly reduces the task complexity, and we believe that it allows for a more realistic model (simulations based on a more complex but more optimal model are described in the supplement and lead to identical conclusions).
The observer can compute the probability of the hypotheses L_T = i and L_T ≠ i given the sensory data using Bayes' rule:

p(L_T = i | x_i) = p(x_i | L_T = i) p(L_T = i) / p(x_i)

In words, the probability of the hypothesis that a target is at a given location, L_T = i, given a set of sensory observations x_i, is equal to the likelihood of the data under this hypothesis multiplied by the prior probability of this hypothesis, p(L_T = i), and divided by the probability of the observations, p(x_i).
Assuming that the priors p(L_T = i) = 1/N and p(L_T ≠ i) = 1 − 1/N are the same for all locations, the decision variable can then be rewritten in log-space as the difference between the log-likelihoods in favor of the two hypotheses (up to an additive constant that can be absorbed into the threshold):

log d_i = Σ_t=1..K [ log p(x_i,t | L_T = i) − log p(x_i,t | L_T ≠ i) ]

What, then, are the probabilities of sensory observations under each hypothesis, p(x_i,t | L_T = i) and p(x_i,t | L_T ≠ i)?
To compute them, the observer needs to take into account how the stimuli are distributed under each hypothesis and how the sensory noise is distributed for each stimulus. We assume that the sensory noise distribution is known to the observer through long-term exposure to the visual environment (that is, the observer knows p(x_i,t | s_i)).
However, to determine how probable it is that sensory observations correspond to the search target, the observer must also know what defines targets and distractors. The experimenter knows that only certain orientations describe a target, but the observer is not omniscient and does not know the true distributions of target and distractor stimuli, approximating them instead as p*(s_i | L_T = i) and p*(s_i | L_T ≠ i). The probability of sensory observations under each hypothesis can then be computed by marginalizing over the possible stimuli:

p(x_i,t | L_T = i) = ∫ p(x_i,t | s_i) p*(s_i | L_T = i) ds_i
p(x_i,t | L_T ≠ i) = ∫ p(x_i,t | s_i) p*(s_i | L_T ≠ i) ds_i

The probability distributions p*(s_i | L_T = i) and p*(s_i | L_T ≠ i) correspond to the observer's approximate representation of the target and distractor distributions. Notably, each of them can be further separated into a representation based on the previous trials and one based on the current trial:

p*(s_i | L_T ≠ i) = ∫ p*(s_i | L_T ≠ i, θ) p(θ) dθ,  θ = {θ_prev, θ_curr}

with θ_prev and θ_curr corresponding to independent latent variables describing the parameters of the previous and the current trial for the observer (similar equations for targets are omitted for brevity). In our experiments, by design, the parameters of the current trial are controlled with respect to the current stimuli (i.e., the distractors on the current test trial are drawn from a distribution with a mean 60° to 120° off the current test target). Hence, only p*(s_i | L_T ≠ i, θ_prev) matters for relative changes in response times.
In our analyses, we wanted to reconstruct the representation of distractor stimuli using the response times for different test targets. Because the decision time is proportional to the number of samples when the sampling frequency is constant, we aimed to relate the number of samples K to the observer's approximate representation of distractors based on the previous trials, p*(s_i | L_T ≠ i, θ_prev).
Assuming that the sensory observations are obtained with high frequency, we can approximate the total evidence in favor of a given hypothesis:

Σ_t=1..K log p(x_i,t | L_T = i) ≈ K · E_x[ log p(x | L_T = i) ]

We expect the sensory noise to be low compared to the noise in the target and distractor representations. Then, the following approximation is valid:

Σ_t=1..K log p(x_i,t | L_T = i) ≈ K · log p*(s_i | L_T = i) + C

where C is a constant. Similar derivations can be used for the total evidence for the alternative hypothesis, based on p(x_i,t | L_T ≠ i).
Then, given that a decision is made when log d_i = log B:

K ≈ log B / [ log p*(s_i | L_T = i) − log p*(s_i | L_T ≠ i) ]

Given that the target and distractor parameters are independently manipulated in the experiment, log p*(s_i | L_T = i) can be treated as a constant. Similarly, p*(s_i | L_T ≠ i, θ_curr) would be constant, as discussed above. Given that RT ∝ K, we can then approximate the response time as follows:

RT ≈ C_0 + C_1 · log p*(s_i | L_T ≠ i, θ_prev)

where C_0 and C_1 are constants. In words, the decision time is inversely related to the evidence (in log-space) that a given stimulus is a target rather than a distractor: when the likelihood that the stimulus is a distractor increases, response times increase.
We highlight that this model provides an important insight, namely, that observers' representations are monotonically related to response times. Hence, even though C_0 and C_1 are unknown, the relationship between the moments (mean, standard deviation, and skewness) of observers' representations reconstructed from RT and the true representations would hold under any other monotonic transformation (for example, when RTs are log-transformed and baseline RTs are subtracted, as we do in our analyses).
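To illustrate the model's core prediction, the simulation below (a sketch with illustrative, not fitted, parameter values) implements the per-item sequential decision rule in doubled-orientation space: a flat target likelihood is compared against a von Mises distractor representation, and sampling continues until log d_i reaches log B. Targets closer to the expected distractor orientation should take longer to reach the threshold.

```python
import numpy as np

rng = np.random.default_rng(7)

def vm_logpdf(x, mu, kappa):
    """Log density of a von Mises distribution (in doubled-orientation space)."""
    return kappa * np.cos(x - mu) - np.log(2.0 * np.pi * np.i0(kappa))

def samples_to_threshold(target_ori, dist_mean, *, kappa_rep=1.0, kappa_s=20.0,
                         log_B=10.0, max_steps=2000, rng=rng):
    """Accumulate log evidence that the item is a target (flat likelihood over
    orientation) rather than a distractor (von Mises representation); return
    the number of sensory samples needed to reach the threshold log_B."""
    s = np.deg2rad(2.0 * target_ori)          # stimulus, angle-doubled
    mu = np.deg2rad(2.0 * dist_mean)          # distractor representation mean
    log_d = 0.0
    for k in range(1, max_steps + 1):
        x = rng.vonmises(s, kappa_s)          # noisy sensory observation
        log_d += -np.log(2.0 * np.pi) - vm_logpdf(x, mu, kappa_rep)
        if log_d >= log_B:
            return k
    return max_steps

# Mean decision time when the target is 60 vs 90 deg from the distractor mean:
k60 = np.mean([samples_to_threshold(60.0, 0.0) for _ in range(200)])
k90 = np.mean([samples_to_threshold(90.0, 0.0) for _ in range(200)])
```

A target 60° from the distractor mean takes more samples on average than one 90° away, mirroring the role-reversal slowing and the monotonic RT relationship derived above.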

Supplement 1. Bayesian observer model combining information across locations.
The model reported in the main text presents a simplified version of the decision-making process, assuming that stimuli at each location are analyzed separately. We believe that such a model might be more realistic as it greatly simplifies the computations that observers have to make. However, for the sake of completeness, here we briefly describe a more complex, conditionally-optimal, memory-guided Bayesian observer model. We refer to this model as conditionally optimal for two reasons. First, a memory-guided observer is by definition not fully optimal in our task, where the test trial parameters are unrelated to the previous learning trials. However, given that the task parameters repeat throughout learning trials, using the information from previous trials might be beneficial when the observer does not know that the trial parameters have changed. Second, we assume that the observer's learning or memory of the stimulus features might not be ideal, hence they use approximations of the feature distributions. We show that under this more complex and more optimal model, the predictions with respect to the monotonic relationship between response times and expected distractor probabilities stay the same.

| Task structure
Participants have to locate a target among a set of distractors and indicate whether it is in the upper or lower half of the stimulus matrix. The experimenter sets the task parameters, namely the target distribution, p(s_i | L_T = i), and the distractor distribution, p(s_i | L_T ≠ i), for each location i = 1 . . . N in the stimulus matrix (with the top half having indices from 1 to N/2 and the bottom half from N/2 + 1 to N), as well as the target location (L_T), to generate the stimuli (s_i) at each location. Here, L_T = i and L_T ≠ i indicate that the target is or is not at location i, or in other words, that the target location is or is not i, respectively.

| Ideal observer model
At each moment in time t = 1 . . . K (with K as the decision moment) and at each location i, the observer obtains sensory observations x_{i,t} corrupted by the presence of sensory noise:

x_{i,t} ~ f_VM(x_{i,t}; s_i, κ_s),  (S.1)

where f_VM is a von Mises distribution density with concentration parameter κ_s quantifying the amount of noise. We assume that the observations are distributed independently at each location and at each moment in time:

p(x | s) = ∏_{i=1}^{N} ∏_{t=1}^{K} f_VM(x_{i,t}; s_i, κ_s).  (S.2)

To make an optimal decision in a particular task, the observer needs to compare the probability that a target is located in the upper half of the stimulus matrix with the probability that it is located in the lower half:

d = p(C = 1 | x) / p(C = 2 | x),  (S.3)

where C = 1 and C = 2 correspond to the two hypotheses about the target location. After applying the log transformation, the decision variable can be expressed as a difference in the amount of evidence for the two hypotheses:

log d = log p(C = 1 | x) − log p(C = 2 | x).  (S.4)

The decision time, assuming a certain threshold B, can then be found as the time K at which the decision variable reaches the threshold. The average decision time can be found by estimating when the expectation of log d becomes equal to log B:

E[log d] = log B.  (S.5)

The probabilities for each hypothesis C = 1 and C = 2 can be found using Bayes' rule.
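The accumulation-to-threshold process just described can be sketched for a single location. This is a simplified illustration, not the paper's implementation: it assumes known target and distractor predictive densities, and all parameter values and function names are illustrative.

```python
import numpy as np

def vm_logpdf(x, mu, kappa):
    # Log density of a von Mises distribution (angles in radians)
    return kappa * np.cos(x - mu) - np.log(2 * np.pi * np.i0(kappa))

def decision_time(s, mu_target, mu_distractor, kappa_s=2.0, kappa_rep=3.0,
                  log_b=4.6, max_t=10_000, rng=None):
    """Draw noisy observations of stimulus s and accumulate log-likelihood-
    ratio evidence until |log d| reaches the threshold log B; return K."""
    rng = rng or np.random.default_rng(0)
    log_d = 0.0
    for k in range(1, max_t + 1):
        x = rng.vonmises(s, kappa_s)  # one noisy sensory observation
        log_d += vm_logpdf(x, mu_target, kappa_rep) - vm_logpdf(x, mu_distractor, kappa_rep)
        if abs(log_d) >= log_b:
            return k
    return max_t

# A stimulus consistent with the target hypothesis yields a quick decision
k = decision_time(s=0.0, mu_target=0.0, mu_distractor=np.pi / 2)
assert 1 <= k < 10_000
```

Raising log_b (a stricter error criterion) or lowering kappa_s (noisier observations) lengthens the simulated decision times, which is the qualitative behavior the model relies on.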
For example, for C = 1:

p(C = 1 | x) = p(x | C = 1) p(C = 1) / p(x).  (S.6)

Because the observer does not know what stimuli are presented and only knows the sensory observations, the likelihood p(x | C = 1) needs to be computed by averaging (marginalizing) over the unknown stimulus values:

p(x | C = 1) = ∫ p(x | s) p*(s | C = 1, θ) ds.  (S.7)

Because the target can only be present at one location, the likelihood p(x | C = 1) is computed by summing over the possibilities of finding a target at each particular location:

p(x | C = 1) = Σ_{i=1}^{N/2} p(L_T = i) p*(x | L_T = i, θ),  (S.8)

where, similarly to the main text, we use an asterisk to denote probability distributions as approximated by the observer through a set of parameters related to previous and current trials, θ = {θprev, θcurr}. That is, we assume that the observer is unaware of the true distributions p(s_i | L_T = i) and p(s_i | L_T ≠ i) and approximates them instead using the information available. Note that the sum is computed separately for each half of the stimulus matrix, hence the upper limit N/2 in Eq. S.8.
If a target is at location i, it cannot be anywhere else. Hence:

p*(x | L_T = i, θ) = p*(x_i | L_T = i, θ) ∏_{j ≠ i} p*(x_j | L_T ≠ j, θ).  (S.9)

Using Eq. S.9, it can be further shown that:

p(x | C = 1) = [∏_{j=1}^{N} p*(x_j | L_T ≠ j, θ)] Σ_{i=1}^{N/2} p(L_T = i) p*(x_i | L_T = i, θ) / p*(x_i | L_T ≠ i, θ).  (S.10)

Note that the product in the square brackets is the same for all locations, and the remaining part of the equation is a ratio of the probability that the measurements at a given location come from the target against the probability that they come from a distractor, similarly to the model described in the main text.
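The sum-of-ratios form just derived can be checked numerically with a toy sketch (all densities and parameter values here are illustrative, assumed for this example rather than taken from the fitted model):

```python
import numpy as np

def vm_pdf(x, mu, kappa):
    # Von Mises density (angles in radians)
    return np.exp(kappa * np.cos(x - mu)) / (2 * np.pi * np.i0(kappa))

def log_lik_half(x, half, mu_t, kappa_t, mu_d, kappa_d):
    """log p(x | C): the common product of distractor likelihoods over all
    locations, times the sum of target/distractor ratios over the candidate
    target locations in `half` (uniform prior over those locations)."""
    x = np.asarray(x)
    common = np.sum(np.log(vm_pdf(x, mu_d, kappa_d)))  # bracketed product
    ratios = vm_pdf(x[half], mu_t, kappa_t) / vm_pdf(x[half], mu_d, kappa_d)
    return common + np.log(np.mean(ratios))

# 36 observations near 0 rad (distractors), one outlier near +90 deg in the top half
x = np.zeros(36)
x[3] = np.pi / 2
top, bottom = np.arange(18), np.arange(18, 36)
ll_top = log_lik_half(x, top, mu_t=np.pi / 2, kappa_t=3.0, mu_d=0.0, kappa_d=8.0)
ll_bottom = log_lik_half(x, bottom, mu_t=np.pi / 2, kappa_t=3.0, mu_d=0.0, kappa_d=8.0)
assert ll_top > ll_bottom  # evidence correctly favors a target in the top half
```

Because the bracketed product is shared by both hypotheses, only the ratio sums matter for the decision variable, which is what makes the log-space comparison tractable.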
The probability that a given stimulus is a target (or a distractor) depends on both the previous and the current trial, θ = {θprev, θcurr}. For each location and each location-specific hypothesis (L_T = i and L_T ≠ i), the current-trial parameters need to be computed separately because of the nature of the odd-one-out task: a target is defined as the item most different from the distractors. For simplicity, we assumed that observers use the following circular normal approximation for the distractors on the current trial, based on the sensory observations:

p*(s_i | L_T ≠ i, θcurr) = f_VM(s_i; µ_{j≠i}, κ_{j≠i}),  (S.12)

where µ_{j≠i} and κ_{j≠i} are estimated from the observations at the locations other than i. In words, when the observer needs to estimate how likely it is that the stimulus at location i is a distractor, they approximate the distribution of stimuli with a von Mises (circular normal) distribution based on the sensory observations from the other locations.
The observer might also use the knowledge that, in the task design, the target distribution is on average 90° away from the mean of the distractors. We again assume a von Mises approximation:

p*(s_i | L_T = i, θcurr) = f_VM(s_i; µ_{j≠i} + 90°, κ_T),  (S.13)

where κ_T is the expected precision of the target distribution. In contrast to the distractor distribution precision, which can be estimated from the samples on the current trial (κ_{j≠i}), the target distribution precision cannot be estimated on a single trial (there is only one target stimulus in a given trial) and has to be based on other sources of information (e.g., learning throughout the experiment).
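The current-trial approximations just described can be sketched as a moment-based von Mises fit to the observations at the other locations, with the target mean shifted by 90°. This is a minimal sketch under simplifying assumptions: angles are treated as a full circle in radians (ignoring the 180° orientation wrap), and the kappa(R) formula is a standard piecewise approximation (Fisher, 1993), not necessarily the estimator used in the paper.

```python
import numpy as np

def vm_fit(angles):
    """Moment-based von Mises fit: circular mean, plus a common piecewise
    approximation of the concentration kappa from the resultant length R."""
    z = np.mean(np.exp(1j * np.asarray(angles)))
    mu, r = np.angle(z), np.abs(z)
    if r < 0.53:
        kappa = 2 * r + r ** 3 + 5 * r ** 5 / 6
    elif r < 0.85:
        kappa = -0.4 + 1.39 * r + 0.43 / (1 - r)
    else:
        kappa = 1 / (r ** 3 - 4 * r ** 2 + 3 * r)
    return mu, kappa

def current_trial_approx(x, i):
    """Distractor approximation from observations at locations other than i,
    and the 90-degree-shifted target mean (here in radians)."""
    mu, kappa = vm_fit(np.delete(np.asarray(x), i))
    return (mu, kappa), mu + np.pi / 2

rng = np.random.default_rng(2)
x = rng.vonmises(0.5, 8.0, size=36)        # observations clustered near 0.5 rad
(mu_d, kappa_d), mu_t = current_trial_approx(x, 0)
assert abs(mu_d - 0.5) < 0.2 and kappa_d > 2.0
```

Leaving out location i when estimating the distractor parameters is what makes the approximation hypothesis-specific, as Eq. S.12 requires.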
Given that the measurement noise is independent across locations and time points, each of the location-specific likelihoods above can be further expressed as:

p*(x_i | L_T = i, θ) = ∏_{t=1}^{K} ∫ f_VM(x_{i,t}; s_i, κ_s) p*(s_i | L_T = i, θ) ds_i,  (S.14)

and analogously for L_T ≠ i. Then, assuming that the prior probability of each decision alternative is the same, the decision variable can be expressed in log-space as:

log d = log Σ_{i=1}^{N/2} [p*(x_i | L_T = i, θ) / p*(x_i | L_T ≠ i, θ)] − log Σ_{i=N/2+1}^{N} [p*(x_i | L_T = i, θ) / p*(x_i | L_T ≠ i, θ)].  (S.15)

The decision time, assuming a certain threshold B, can then be found as the time K at which the decision variable reaches the threshold.

| Simulations
To estimate the behavior of the observer under this model, we simulated the decision-making process and estimated the mean response times while varying the properties of the distractor representation p*(s_i | L_T ≠ i, θprev). The task parameters were based on the actual experiment design. We used 36 stimuli on each trial, with one stimulus being the test target (s_{L_T}) and the rest being distractors. The distractors on each simulated trial were distributed as p(s_i | L_T ≠ i) = f_VM(s_i; µ_D, κ_D), where µ_D ~ U(s_{L_T} + 60°, s_{L_T} + 120°) (that is, the mean of the distractors is set 60° to 120° away from the test stimulus) and κ_D = 8.7 (approximately equivalent to a standard deviation of 10° in orientation space). The sensory observations were assumed to be noisy (κ_s = 2, approximately equivalent to a standard deviation of 24° in orientation space; note that this is the noise level for samples collected at each moment in time). The observers' target representation was assumed to be linked to the distractor representation as p*(s_i | L_T = i, θprev) = f_VM(s_i; µ_{D,prev}, κ_T), with κ_T = 3.35 (based on a normal approximation to the uniform target distribution with a 60° range used in the experiments). The same κ_T was used for target-related computations based on the current-trial data (Eq. S.13). The decision threshold was set to log B = 4.60, corresponding to a 1% probability of error if the observer's assumptions are correct. For each test target from 1° to 180° in half-degree steps, we simulated 56 trials for each combination of distractor representation parameters.
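A trial-generation step along these lines can be sketched as follows. This is an illustrative reconstruction, not the authors' code: function names are invented, and mapping the concentration parameter onto the doubled circle (used so that orientations wrap at 180°) is a simplification.

```python
import numpy as np

DEG = np.pi / 180
rng = np.random.default_rng(3)

def make_trial(test_target_deg, n=36, kappa_d=8.7):
    """One simulated trial: a test target plus n-1 von Mises distractors whose
    mean is drawn uniformly 60-120 degrees away from the test target.
    Orientations are sampled on the doubled circle so they wrap at 180 deg."""
    mu_d = test_target_deg + rng.uniform(60.0, 120.0)
    draws = rng.vonmises(2 * mu_d * DEG, kappa_d, size=n - 1)  # doubled angles
    distractors = (draws / (2 * DEG)) % 180.0
    return test_target_deg % 180.0, distractors

target, distractors = make_trial(30.0)
assert distractors.shape == (35,)
assert np.all((0 <= distractors) & (distractors < 180))
```

Sweeping test_target_deg from 1° to 180° and repeating make_trial per parameter combination reproduces the structure of the simulation design described above.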
We ran simulations for the wrapped skewed normal distribution with the mean varied from −60° to 60° in 20° steps, the standard deviation varied from 20° to 60° in 10° steps, and the skew varied from −10 to 10 in steps of 2. The results of the simulations (Figure S3) confirmed the findings obtained with the simplified model: the means are recovered precisely, while for the standard deviation and skewness the monotonic relationship holds.

FIGURE S1. Raw response times for test trials in different conditions (x-axis: test target orientation, centered at the learned distractors, °; y-axis: RT on test trials, ms). The curves show the average RT, with the 95% CI shown with shading. Different colors indicate RT for test trials in which the test target matches (in location or color) the clockwise- and counterclockwise-shifted parts of the learned distractor distribution. Dashed lines show the mean orientations of the corresponding distribution parts. The average was estimated with locally-weighted regression (LOESS, an extension of the moving-window approach) that accounted for the circularity of the orientation space by padding the data at each end of the [−90°, 90°] range with 1/6th of the data from the other end.

FIGURE S2. Simulated parameters under the simplified Bayesian observer model. We simulated response times under the assumptions of the simplified Bayesian observer model described in the main text and applied the same approach as used for the real data to test whether the assumed monotonic relationship between the true and recovered parameters holds. First, we used a simple wrapped normal distribution (top) with means varying from −80° to 80° in 20° steps and standard deviations varying from 5° to 60° in 5° steps. For each parameter combination, the RTs were computed using Eq. 16, and we then estimated the parameters of the recovered distribution. As is evident from the plots, the mean estimates were identical to the true means, while the standard deviation was overestimated, although the overall monotonic relationship held. The skewness estimate was at zero, as expected for the symmetric wrapped normal distribution. Second, we simulated data using the wrapped skewed normal distribution (Pewsey, 2004), with means again varying from −80° to 80° in 20° steps, the scale parameter varying from 5° to 60° in 5° steps, and the skewness parameter varying from −10 to 10 in steps of 1. For the means and standard deviations, the conclusions were the same as for the wrapped normal distribution. Similarly, the skewness estimates (bottom panel axes: true vs. estimated skewness, a.u.) followed the changes in the true skewness parameter monotonically (note that the sign of the estimated circular skewness is the opposite of the skewness parameter of the skewed wrapped normal distribution because of how it is defined; see Pewsey, 2004). In sum, the mean estimates match the true means, and the standard deviation and skewness estimates depend monotonically on the true standard deviation and skewness parameters.
FIGURE S3. Simulated parameters under the more optimal Bayesian observer model. We simulated response times under the assumptions of the more complex Bayesian observer model described in this Supplement and applied the same approach as used for the real data to test whether the assumed monotonic relationship between the true and recovered parameters holds. The results were similar to the simulations with the simplified model (Figure S2): the mean estimates were identical to the true means, while for the standard deviation and skewness the monotonic relationship holds (note that the sign of the estimated circular skewness is the opposite of the skewness parameter of the skewed wrapped normal distribution because of how it is defined; see Pewsey, 2004). In sum, the mean estimates match the true means, and the standard deviation and skewness estimates depend monotonically on the true standard deviation and skewness parameters.
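The circular skewness referenced in the captions can be computed from the first two trigonometric moments. The sketch below uses one common standardized measure (Fisher, 1993); whether this exact statistic matches the one used in the analyses is an assumption of this illustration.

```python
import numpy as np

def circular_skewness(a):
    """Standardized sample circular skewness (Fisher, 1993):
    s = R2 * sin(mu2 - 2*mu1) / (1 - R1)^(3/2),
    computed from the first two trigonometric moments of the angles a."""
    a = np.asarray(a)
    m1 = np.mean(np.exp(1j * a))   # first trigonometric moment
    m2 = np.mean(np.exp(2j * a))   # second trigonometric moment
    r1, mu1 = np.abs(m1), np.angle(m1)
    r2, mu2 = np.abs(m2), np.angle(m2)
    return r2 * np.sin(mu2 - 2 * mu1) / (1 - r1) ** 1.5

# A set of angles symmetric about 0 has (numerically) zero circular skewness
assert abs(circular_skewness(np.linspace(-1.0, 1.0, 201))) < 1e-8
# An asymmetric set yields a nonzero value, whose sign depends on the definition
assert circular_skewness(np.array([0.0, 0.0, 0.0, 1.0])) != 0.0
```

The sign convention of this measure is exactly the kind of definitional detail responsible for the sign flip relative to the skewed wrapped normal's skewness parameter noted above.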