Is the richness of our visual world an illusion? – Trans-saccadic memory for complex scenes

Susan J. Blackmore

School of Psychology, University of the West of England, St. Matthias College, Bristol BS16 2JP, England

Gavin Brelstaff
Kay Nelson
Tom Troscianko

Perceptual Systems Research Centre, Department of Psychology, University of Bristol, 8 Woodland Rd, Bristol BS8 1TN, England.


Abstract

Our construction of a stable visual world, despite the presence of saccades, is discussed. A computer graphics method was used to explore trans-saccadic memory for complex images. Images of real-life scenes were presented under four conditions: they stayed still or moved in an unpredictable direction (forcing an eye movement), while simultaneously changing or staying the same. Changes were the appearance, disappearance or rotation of an object in the scene. Subjects detected the changes easily when the image did not move, but when it moved their performance fell to chance. A grey-out period was introduced to mimic that which occurs during a saccade. This also reduced performance, but not to chance levels.

These results reveal the poverty of trans-saccadic memory for real-life complex scenes. They are discussed with respect to Dennett’s view that much less information is available in vision than our subjective impression leads us to believe. Our stable visual world may be constructed out of a brief retinal image and a very sketchy, higher-level representation along with a pop-out mechanism to redirect attention. The richness of our visual world is, to this extent, an illusion.

1 Introduction

Not only are we blind to many aspects of our personal visual world, we are also surprisingly unaware of this fact. Under normal circumstances, for example, we do not notice that we blink; that we have large retinal blind spots; that our instantaneous spatial, chromatic, and temporal resolution varies dramatically with eccentricity; and that our vision is interrupted several times a second by rapid eye movements (saccades). Indeed, despite all of these considerable distractions, we believe that we see a complete, dynamic picture of a stable, uniformly detailed and colourful world.

It is even tempting to suppose that there is a “me” in there and a place from where “I” observe. Dennett (1991) calls this the “Cartesian Theatre” and argues that this powerful illusion is propped up by a “nearly impenetrable barrier of intuitions” (p 322). One of these intuitions is that a complete visual picture of the observable world is present in the mind at any time – that consciousness “contains” a rich model of the visible world. This, claims Dennett, is simply not true. Only in the fovea is the information detailed and rich, and every time the eyes move this detailed information is overwritten.

The illusion that we are simultaneously aware of every aspect of the view in front of us could arise because almost every question asked of our visual system is seamlessly answered. Rather than the answers being retained in visual memory, some argue that the outside world itself acts as a visual memory – rapidly accessible by looking again (Minsky, 1985). A sense of presence might build up over time as different glances answer different questions. A time lag would thus be predicted between the onset of the visual input at a given fixation and conscious awareness based on it: the time taken to integrate this new input with at least one of the existing higher-level representations of the scene. This might relate to delays of up to half a second in sensory experience reported by Libet (1982).

If this is true it has some very odd implications for the view out of my window. Unless I am looking straight at that tree, it is not represented in any detail in my visual system, the representation having been washed away at the last saccade. It only seems to be available to consciousness because I can look again.

Dennett gives the example of walking into a room covered with identical portraits of Marilyn Monroe. You can immediately see that there are hundreds of portraits even though you can only have looked at one or two of them. How can you “see” them all at once? He answers that you don’t: there is no representation of each one in the visual system. Rather, they are represented as being present; as perhaps “more of the same”. The details are not needed, are not represented and are not “present in consciousness” – whatever it may feel like. Thus he reveals an important distinction – between the presence of representation and the representation of presence. The fact that we confuse these two contributes to the illusion.

One may object that if one picture had a slight difference, a moustache or scribbled-on glasses, say, you would notice. In fact, you might notice, but only because of separate mechanisms (discussed below) that operate within each fixation.

Is this true? There are three questions here: (1) How does the visual system detect changes in the environment? (2) How much information is retained across each saccade? (3) How much missing detail is “filled in” by the cognitive system? Although “filling-in” is the topic of much current debate (e.g. Ramachandran, 1993; Ramachandran & Gregory, 1991; Dennett, 1993), it is the first two questions that are most relevant here.

1.1  Detection of Change

If the “looking again” strategy, described above, is to succeed, it may rely on the external world remaining stable across the time required to change fixation (to “look again”). Since this time is relatively short, the visual system may be able to get away with it. During a single fixation the visual system is, in fact, highly sensitive to many kinds of spatial, temporal and chromatic changes in the visual input. This high sensitivity to change is supported by several special mechanisms: e.g. retinal adaptation (Ditchburn, 1973), “pop-out” systems (Treisman & Gelade, 1980) and motion detectors (Reichardt, 1961). Is it this special sensitivity to change during each fixation that gives us an inflated impression of our awareness of change between fixations?

There is an odd implication of this view – that changes occurring between fixations should not be easily detectable. In other words, little information need be retained from one saccade to the next.

1.2  Trans-saccadic Memory

Just what information is retained after each saccade? There must presumably be some kind of trans-saccadic integration; otherwise no model of the world could be constructed from visual information. The question is what kind. At one extreme might be a very low-level process in which successive retinal representations are fused into a continuous and detailed representation. At the other extreme might be a very high-level or abstract integration.

Towards the former end of the spectrum is the spatiotopic fusion hypothesis (e.g. Breitmeyer, 1984). This suggests that eye movements are compensated for and successive pictures fused according to environmental coordinates rather than retinal ones. Towards the other extreme, Pollatsek, Rayner & Henderson (1990) suggest something like location-independent object detectors. Dennett’s view also lies far towards this extreme, implying that most information is lost on each saccade and the brain does not bother to fill in the missing details. Only a high-level abstract representation remains.

The evidence clearly favours the latter end of the spectrum. Stimulus displacements during a saccade are hard to detect (Bridgeman & Mayer, 1983), and there is little effect on the reading of words if the case of individual letters is changed (Pollatsek, Rayner & Collins, 1984). Pollatsek et al. (1984) also showed that subjects identified a line drawing of an object faster when they had had an extrafoveal preview – implying integration across the saccade. However, moving the target did not abolish the preview benefit, suggesting that integration is not location-dependent (Pollatsek, Rayner & Henderson, 1990). Instead, it seems to rely on internal spatial relationships within the stimulus, as Irwin (1991) demonstrated using simple dot patterns – further evidence against spatiotopic fusion.

What sort of memory is responsible? The very short-lived iconic store (Sperling, 1960) is an unlikely candidate because it is tied to retinal, not environmental, coordinates. More plausible is the short-term visual store (Phillips, 1974), which is not disrupted by a bright light or pattern mask, is not tied to anatomical coordinates and is affected by pattern complexity. Indeed, Irwin (1991) recently concluded that trans-saccadic memory is long-lasting, undetailed, of limited capacity, and not tied to spatial position. There need be no separate memory for this; it is probably identical to visual short-term memory.

O’Regan and Levy-Schoen (1983) suggested that information is only integrated across saccades if it is semantically encoded (e.g. “in front of the red table”). However, more recent work by Hayhoe, Lachter and Feldman (1991) argues that the memory representation is likely to be something intermediate between the purely sensory and the purely semantic extremes. Hayhoe et al. suggest that the most likely candidate is a spatially ordered visual buffer (originally proposed by Feldman, 1985). This is a map-like representation in which each visual location has an associated set of features, or parameters. It precedes object recognition, but is precise enough to support geometrical judgements such as the angle-categorisation task used by Hayhoe et al.

However, the experiments described above used synthetic, simple stimuli. It is unclear how the possible memory representations deal with real-world images (or, indeed, what capacity they have and how this relates to objects in the scene). In order to address this issue, it is necessary to use complex images. Dennett (1991, p 361; Grimes pers. comm.) has recently taken part in experiments using an eye-tracker in which the stimulus is changed during saccades. Subjects can apparently read text without noticing anything odd while words or letters are changed in this way. Relatively large changes in complex images, such as the appearance or disappearance of people or objects, also go unnoticed when they occur during a saccade. In a lecture demonstration, Dennett showed pairs of Grimes’ pictures one after the other. The changes were so obvious that most people laughed unless, by chance, the change happened during a saccade, when it could not be seen. The effect is dramatic and counter-intuitive. It implies, as Dennett suggested, that very little information is retained after a saccade, even less than suggested by Irwin, Zacks and Brown (1990).

1.3  Saccadic Suppression

Conceivably the effect could be due to saccadic suppression, a mechanism supposed to suppress processing during saccades. However, it has long been known that saccadic suppression, if it exists, is not total (e.g. Dodge, 1900; see Carpenter, 1988, for a review). For example, Brooks, Yates & Coleman (1980) report that a dot stimulus moving relative to the saccadic motion can be seen, suggesting that suppression is not complete. In any case, is there any need to suppress anything? If low-level detailed information is never stored, there may not be. During saccades, motion detectors and pop-out mechanisms will be overloaded and so convey no useful information. Dennett concludes that the brain treats all this with benign neglect.

We have developed a novel method to test Dennett’s unexpected claims. The idea is simple and no eye-tracking system is required. Any change made to an image is synchronised with a rapid displacement of the entire image in a random direction. Subjects must therefore move their eyes to see any change, and performance can be compared with a no-movement condition. This provides a simple technique to explore what information is retained as we move our eyes about the world.

2  Method

2.1  Apparatus

In all experiments, images were presented on the screen of an IBM RS/6000 workstation. The visual display unit was an IBM 6019 monitor, the screen of which was 36 cm wide and 30 cm high. When subjects sat at a comfortable viewing distance (approximately 50 cm), the screen subtended approximately 41 x 34 degrees and the image approximately 10 degrees. The resolution of the screen was 1280 x 1024 pixels.
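The subtenses quoted above follow from the physical screen size and the viewing distance. A minimal sketch of this arithmetic (Python; the 50 cm distance is the approximate figure given above) is shown below; the quoted figures correspond to the small-angle approximation, with the exact formula giving slightly smaller values.

    import math

    SCREEN_W_CM, SCREEN_H_CM = 36.0, 30.0   # IBM 6019 screen dimensions (see text)
    VIEW_DIST_CM = 50.0                     # approximate comfortable viewing distance

    def visual_angle_deg(size_cm, distance_cm):
        """Exact visual angle subtended by an extent seen from a given distance."""
        return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))

    def small_angle_deg(size_cm, distance_cm):
        """Small-angle approximation: size/distance in radians, converted to degrees."""
        return math.degrees(size_cm / distance_cm)

    print(visual_angle_deg(SCREEN_W_CM, VIEW_DIST_CM), small_angle_deg(SCREEN_W_CM, VIEW_DIST_CM))
    # approx 39.6 deg (exact) vs 41.3 deg (small-angle), i.e. roughly the 41 deg quoted above
    print(visual_angle_deg(SCREEN_H_CM, VIEW_DIST_CM), small_angle_deg(SCREEN_H_CM, VIEW_DIST_CM))
    # approx 33.4 deg (exact) vs 34.4 deg (small-angle), roughly the 34 deg quoted above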

The images were obtained by sampling a normal scene (typically of our laboratory) with a monochrome video camera. Images were prepared in pairs: for example, one pair showed a breakfast scene with a glass of milk that was full in one picture and nearly empty in the other; a second pair showed a street scene with and without a person; a third showed a person’s head with either one or two earrings. An example is given in Fig 1.

The images were digitised and held in the computer’s memory as 256 x 256, 8-bit (256 grey-level) images. Each image was de-blurred by applying a moderate amount of high-pass filtering, so that it appeared as a normal, sharp, monochrome picture. The subtense of each image was about 8 degrees square. Mean luminance was approximately 30 cd m-2 (Minolta Spot Chroma Meter). Each image could be positioned anywhere on the larger screen. The background colour of the screen was set to mid-grey.
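The de-blurring is a high-pass (unsharp-mask) style operation. A minimal sketch of such a filter is given below in Python/NumPy; the original software and exact filter parameters are not specified here, so the kernel width and gain are illustrative assumptions only.

    import numpy as np
    from scipy import ndimage

    def unsharp_mask(image, sigma=2.0, gain=0.7):
        """Boost high spatial frequencies: subtract a blurred (low-pass) copy
        and add a fraction of the residual back to the original image."""
        img = image.astype(float)
        low = ndimage.gaussian_filter(img, sigma)      # low-pass (blurred) version
        high = img - low                               # high-pass residual
        out = img + gain * high                        # moderately sharpened image
        return np.clip(out, 0, 255).astype(np.uint8)   # back to 8-bit grey levels

    # e.g. applied to a 256 x 256, 8-bit image held as a NumPy array:
    # sharp = unsharp_mask(raw_image)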

A new image could be displayed in any location, and the old one extinguished, within a frame blanking interval (approx 1 ms). All images used in an experimental run were stored in the frame buffer and could therefore be displayed virtually instantaneously. This is a standard computer-graphic technique (Troscianko & Low, 1985; Harris, Makepeace & Troscianko, 1987; Brelstaff & Wilson, 1994).
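The essential point is that every stimulus needed in a run is resident in display memory before the run begins, so that extinguishing one image and drawing the next costs only a buffer update that completes within the blanking interval. A rough modern equivalent, sketched here with the pygame library (the file names and window size are illustrative; the original implementation wrote directly to the RS/6000 frame buffer), might look like this:

    import pygame

    pygame.init()
    screen = pygame.display.set_mode((1280, 1024))
    MID_GREY = (128, 128, 128)

    # Pre-load and convert all stimuli once, before the run, so that each
    # presentation is a fast blit from memory rather than a disk read.
    images = {name: pygame.image.load(name).convert()
              for name in ("breakfast_a.png", "breakfast_b.png")}   # illustrative file names

    def show(name, centre):
        """Extinguish whatever is on screen and draw one image at 'centre',
        committing the change in a single display flip."""
        screen.fill(MID_GREY)                        # mid-grey background, as in the text
        rect = images[name].get_rect(center=centre)
        screen.blit(images[name], rect)
        pygame.display.flip()                        # swap synchronised with the display refresh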

2.2  Pilot Study

We carried out a pilot experiment and two experiments in the main study. In the pilot experiment, a pair of images was prepared. The first image was presented in the centre of the screen for approximately 5 sec. Next, this image was extinguished and the second image was displayed in an unpredictable location on the screen after the frame blanking interval. This shift in image location elicited at least one saccade, since subjects were instructed to move their gaze to the new image location as the image “jumped”. In this preliminary experiment, five out of six naive subjects failed to spot the mismatch in the number of chairs between the two images; to them, the second image appeared identical to the first.

This pilot experiment suggested that there was an interesting inability to register changes in the visual scene across a saccadic eye-movement, and the two main experiments were designed to address this question in greater detail.

2.3  Design

In Experiment 1, we compared the following cases:

  1. The image changes (or does not change) and moves (as in the pilot experiment);
  2. The image changes (or does not change) and stays in the same place.

Thus, Experiment 1 was similar to the pilot experiment except that larger sets of images and subjects were used. We predicted that image changes would be easy to see in the absence of image displacement (and hence of a saccade). However, such a result alone would not show whether the impairment in the displaced condition arises (a) because an eye movement is made, or (b) because any interruption of vision gives a similar result. Since saccades are often thought of as producing a “grey-out” during their time course, we introduced a grey-out externally, by presenting a blank mid-grey interstimulus field for 250 ms. It was therefore hoped that Experiment 2 would clarify whether an eye movement was necessary to produce the effect of poor object retention.

In Experiment 2, the following conditions applied:

  1. Image changes and moves (as in Exp 1);
  2. Image changes but stays in the same place; a mid-grey interstimulus interval (ISI) separates the two images in time.

2.4  Procedure

The subject sat in a darkened room at a comfortable distance from the screen. In both experiments, the first image appeared at the centre of the screen for 2 sec, then a priming beep sounded. The second image was then presented in either a different location or the same location, depending on the condition. Different locations were always a fixed angular distance away (8 deg) but in a random direction.
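The 8 deg displacement in a random direction can be converted into a pixel offset using the screen geometry from the Apparatus section. A minimal sketch (Python; the 50 cm viewing distance is again the approximate figure quoted earlier):

    import math, random

    SCREEN_W_CM, SCREEN_W_PX = 36.0, 1280     # from the Apparatus section
    VIEW_DIST_CM = 50.0                       # approximate viewing distance
    PX_PER_CM = SCREEN_W_PX / SCREEN_W_CM

    def deg_to_px(angle_deg, distance_cm=VIEW_DIST_CM):
        """On-screen extent, in pixels, of a given visual angle at this distance."""
        size_cm = 2 * distance_cm * math.tan(math.radians(angle_deg) / 2)
        return size_cm * PX_PER_CM

    def random_displacement(angle_deg=8.0):
        """Pixel offset of fixed angular size (8 deg) in a uniformly random direction."""
        radius = deg_to_px(angle_deg)
        theta = random.uniform(0, 2 * math.pi)
        return (radius * math.cos(theta), radius * math.sin(theta))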

The subject’s task was to press one of two mouse buttons, reporting whether the second image was the same as, or different from, the first one. Equal numbers of “same” and “different” image pairs were used, with each stimulus pair selected at random. Each subject saw a total of 30 pairs of images in Experiment 1 (15 image pairs with a change in one item, plus 15 “same” image pairs). In Experiment 2, each subject saw 24 pairs of images. All orders were randomised.
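To draw the design and procedure together, the sketch below expresses one trial as a simple list of display events. The 2 sec first-image duration, priming beep, 8 deg jump and 250 ms grey field are taken from the text above; the function and event names themselves are purely illustrative.

    import random

    FIRST_IMAGE_MS = 2000   # first image shown at the centre of the screen for 2 sec
    GREY_ISI_MS = 250       # mid-grey interstimulus field used in Experiment 2
    JUMP_DEG = 8            # fixed angular distance of the image displacement

    def trial_events(condition, change):
        """Display schedule for one trial.
        condition: 'move'     - second image jumps 8 deg in a random direction (forces a saccade)
                   'static'   - second image appears in the same place (Experiment 1)
                   'grey_out' - same place, preceded by a 250 ms mid-grey field (Experiment 2)
        change:    True if the second image differs from the first in one item."""
        second = "image_B" if change else "image_A"
        events = [("image_A", "screen centre", FIRST_IMAGE_MS),
                  ("priming beep", None, 0)]
        if condition == "move":
            direction = random.uniform(0, 360)        # random direction, fixed angular distance
            events.append((second, f"{JUMP_DEG} deg away, direction {direction:.0f} deg", None))
        elif condition == "grey_out":
            events.append(("mid-grey field", "full screen", GREY_ISI_MS))
            events.append((second, "screen centre", None))
        else:                                          # 'static' (no-saccade condition)
            events.append((second, "screen centre", None))
        events.append(("wait for same/different mouse response", None, None))
        return events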

In pilot experiments, we had found some learning effects. To avoid these, no subject saw any image pair more than once. Subjects were asked not to communicate about the experiment with others.

Each subject was given a practice trial before each experiment, with a stimulus different from that used in the experiment.

2.5  Subjects

Five naive subjects took part in Experiment 1, and twelve in Experiment 2. All had normal or corrected-to-normal vision.

3  Results

Table 1 shows the results of Experiment 1.

Condition              saccade                      no saccade

Image pair         same      different          same      different

% Correct           92          61               100          87

Table 1. Mean results of Experiment 1 (percentage of correct responses).

 

Performance was better in the no-saccade condition than in the saccade condition; the number of correct responses was significantly different between the conditions (t=9.13, df=4, p<0.0005).

Table 2 shows the results of Experiment 2.

Condition          saccade        static grey-out

% Correct            55                 65

Table 2. Mean results of Experiment 2 (percentage of correct responses).

Statistical analysis of these results showed that:

  1. The number of correct responses in the saccade condition did not differ significantly from chance (t=1.48, p>0.05, 1-tailed).
  2. The number of correct responses in the grey-out condition was significantly above chance (t=8.16, p<0.01, 1-tailed).
  3. The number of correct responses differed significantly between the two conditions (t=2.79, p<0.02, 2-tailed).
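A sketch of the kind of analysis reported above is given below (Python/SciPy), taking per-subject proportion-correct scores as input. No actual data are reproduced, and the use of a related-samples test for the between-condition comparison is an assumption based on each subject apparently serving in both conditions.

    import numpy as np
    from scipy import stats

    CHANCE = 0.5   # guessing rate for a same/different judgement

    def analyse_experiment2(saccade_scores, greyout_scores):
        """Per-subject proportion-correct scores for the two conditions
        (one value per subject); the real data are not reproduced here."""
        saccade = np.asarray(saccade_scores, dtype=float)
        greyout = np.asarray(greyout_scores, dtype=float)

        # One-sample t-tests of each condition against chance (50% correct).
        # SciPy reports two-tailed p values; halve them for the one-tailed
        # tests quoted above, given the predicted direction of the effect.
        vs_chance_saccade = stats.ttest_1samp(saccade, CHANCE)
        vs_chance_greyout = stats.ttest_1samp(greyout, CHANCE)

        # Between-condition comparison (two-tailed); a related-samples test is
        # assumed because each subject appears to have served in both conditions.
        between = stats.ttest_rel(saccade, greyout)
        return vs_chance_saccade, vs_chance_greyout, between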

 

4  Discussion

These results show that when subjects had to move their eyes to detect a change between two complex images, they could not do so reliably. Indeed, for changes that could easily be detected when the image did not move, their performance was close to chance when it did move (forcing a saccade). A simple grey-out and delay also reduced performance, but not to chance levels. Taken together, these results show the fragility of visual memory for a complex scene. The ability to detect even large changes is partly disrupted by a simple delay and more or less destroyed by a saccade.

This suggests that Dennett may be right. Although we appear to have available “in consciousness” a complete, detailed and stable representation of the world, this may in fact be constructed out of only two things: (a) a transient retinal image, which is detailed only in the fovea and lasts only until the next saccade, and (b) a sketchy, higher-level model of the whole scene that contains too little information for even relatively large changes to be noticed. The suggestion by Hayhoe et al. (1991) of an intermediate-level store, at a level before object recognition, is broadly in keeping with our data; but the implication of our findings is that such a store has severely limited capacity to encode parts of the scene that are not actively attended to. The impression that we can easily see anything that changes is provided by the pop-out mechanisms that redirect attention. If this is so, it shows that our intuitions about our own visual function are far from useful in understanding the construction of our convincing and stable visual world.

Acknowledgements

The research facilities used in this study were provided by a grant from the Defence Research Agency (grant no D/ER1/9/4/2034/102/RARDE).

References

Breitmeyer,B.G. (1984) Visual masking: an integrative approach. New York; OUP.

Brelstaff GJ, Wilson JB (1994) Generating colour and texture verniers. Int. J. Psychophysiol. 16, 199-208

Bridgeman,B., Mayer,M. (1983) Failure to integrate visual information from successive fixations. Bulletin of the Psychonomic Society, 21, 285-286.

Brooks,B.A., Yates,J.T., Coleman,R.D. (1980) Perception of images moving at saccadic velocities during saccades and during fixation. Experimental Brain Research, 40, 71-78

Carpenter,R.H.S. (1988) Movements of the Eyes. London; Pion, 313-322

Dennett,D.C. (1991) Consciousness Explained. Boston; Little, Brown & Co.

Dennett,D.C. (1993) Reply to my critics. In: Dennett and his Critics: Demystifying Mind. Ed. Dahlbom,B. Oxford; Blackwell.

Ditchburn,R.W. (1973) Eye Movements and Visual Perception, Oxford; Clarendon Press.

Dodge,R. (1900) Visual perception during eye movement. Psychological Review, 7, 454-465

Feldman,J.A. (1985) Four frames suffice: A provisional model of vision and space. Behavioral and Brain Sciences 8, 265-289

Harris,J.P., Makepeace,A.P.W., Troscianko,T. (1987) Cathode ray tube displays in psychophysiological research. J. Psychophysiol 4, 413-429

Hayhoe,M., Lachter,J., Feldman,J. (1991) Integration of form across saccadic eye movements, Perception, 20, 393-402

Irwin,D.E. (1991) Information integration across saccadic eye movements, Cognitive Psychology, 23, 420-456

Irwin,D.E., Zacks,J.L., Brown,J.S. (1990) Visual memory and the perception of a stable visual environment, Perception and Psychophysics, 47, 35-46

Libet,B. (1982) Brain stimulation in the study of neuronal functions for conscious sensory experiences, Human Neurobiology, 1, 235-242

Minsky,M. (1985) The Society of Mind, New York; Simon & Schuster.

O’Regan,J.K., Levy-Schoen,A. (1983) Integrating visual information from successive fixations: Does trans-saccadic fusion exist? Vision Research, 23, 765-768

Phillips,W.A. (1974) On the distinction between sensory storage and short-term visual memory, Perception and Psychophysics, 16, 283-290

Pollatsek,A., Rayner,K., Collins,W.E. (1984) Integrating pictorial information across eye movements, Journal of Experimental Psychology: General, 113, 426-442

Pollatsek,A., Rayner,K., Henderson,J.M. (1990) Role of spatial location in integration of pictorial information across saccades, Journal of Experimental Psychology: Human Perception and Performance, 16, 199-210

Ramachandran,V.S. (1993) Filling in gaps in perception: Part 2. Scotomas and phantom limbs. Current Directions in Psychological Science, 2, 56-65

Ramachandran,V.S., Gregory,R.L. (1991) Perceptual filling-in of artificially induced scotomas in human vision, Nature, 350, 699-702

Reichardt,W. (1961) Autocorrelation, a principle for the evaluation of sensory information by the central nervous system. In Sensory Coding, Ed. Rosenblith,W.A., New York; Wiley.

Sperling,G. (1960) The information available in brief visual presentations, Psychological Monographs, 74 (Whole No. 498).

Treisman,A., Gelade,G. (1980) A feature-integration theory of attention, Cognitive Psychology, 12, 97-136

Troscianko,T., Low,I. (1985) A technique for presenting isoluminant stimuli using a microcomputer. Spatial Vision, 1, 197-202