(This is a talk I gave before the Columbia School of Linguistics on February 17, 2002.)
I must confess that my title does not employ certain terminology quite accurately and that I chose it mainly because it sounded so clever. Instead of "the medium is not the message", it should really read "the signal is not the sign." You are undoubtedly aware that the air is the medium through which the sounds of language travel. Sound is a disturbance of the air, or a signal, which is now travelling from my mouth to your ears. Furthermore, when I used the word "message" instead of the word "sign," I probably made William Diver roll in his grave. For Diver, the message was not just a function of the signs, it also involved the scene or the circumstances where the signs are employed.
But I might be forgiven for this facetious use of Marshall McLuhan’s expression, because my thesis follows in the spirit of his point. I’m claiming that just as people are distracted by the fancy electronic media only to ignore the trashy messages transmitted, linguists have been distracted by the sounds of language and have been ignoring the true units of language, the articulatory gestures. I’m even going to blame some of the electronic media for this distraction.
Actually, Diver used the word "signal" a little unconventionally – he considered morphemes to be signals, as you can see in the following excerpt from his paper on phonology as human behavior:
The overall interrelationship among phonology, physiological phonetics, and acoustic phonetics can be sketched as follows: the speaker learns the signals of the language (the morphemes) as made up of a limited number of distinct articulatory gestures; these are the phonological units. In the particular circumstance of the individual acts of speech, the attempt at the articulatory gesture produces certain vocal movements. These can be recorded and observed on, for instance, X-ray film. The vocal movements in turn shape and excite resonant cavities, and the resulting sounds can be recorded and analyzed with the spectrograph. The movements and sounds, then, are consequences of the articulatory gestures.
I intend to use the word "signal" more conventionally, as impulses which propagate through a medium bearing information to a receiver. Sound waves and light waves are examples of signals.
Now, I suppose you noticed that in the excerpt Diver names the articulatory gestures as the phonological units. That is the thrust of my paper, only I am going to give a semiotic perspective on the matter. I hope to visit the fields of semiotics, psychology, and information theory, and to show how they point to the articulatory gestures, not their reflexes, the sounds, as the phonological units of language. I want to show that just as a statue is not constituted by the light it reflects, speech is not constituted by the sounds that the gestures reflect or at times produce. I hope to help us overcome some natural and manmade predispositions and understand how something which we cannot visualize nevertheless can be an entity that we perceive.
Not much appears to have been written in the field of semiotics on the physical manifestation of the sign. Saussure, who claimed that linguistics is just a part of semiology, thought that the linguistic sign is a combination of a concept plus a sound image. This entirely mental definition seems too abstract and difficult to study scientifically. In a well-publicized diagram, the late Thomas Sebeok lists various sources of signs, from natural to manmade to extraterrestrial. In another, he lists various channels through which signs can be communicated. But he never seems to enter the argument as to whether the sound or the articulatory gesture is the linguistic sign. Umberto Eco presents a very elaborate theory of semiotics, and he seems to endorse the tacit linguistic assumption that the sound is the sign.
Sense (mode) | Sight | Hearing | Smell | Touch | Taste
Medium (means of transmission) | Ether / space | Air | Air | None / air (radiator) | None / negative (must enter mouth)
Signal (what is transmitted) | Light | Sound | Odor / gas | Texture / temperature / pressure | Chemicals (sweet, sour, etc.)
Gestural sign | ASL gesture | /dog/ (articulatory gestures) | Putting on perfume | Patting on back | Putting salt in sugar bowl
Meaning of gesture | A banana | A dog | Formality | Congratulation | Anger at spouse
Permanent sign | Clothing / writing | ? | Perfume | Itchy clothing | Salt in sugar bowl
Non-gestural sign | Painting | Music | Making of perfume | Statue | A gourmet meal
Distortion | Perspective | Speaker variation | ? | Positions of hands | ?
I have prepared a chart on your handout (see above), where I list all five senses in columns and some semiotic phenomena in rows. In the first row below the headings, I list the medium through which the signal can be transmitted, and then the customary names for the signal. The next row lists some example gestures that can be transmitted through the medium, followed on the next row by the meaning of those examples. Umberto Eco would call these examples replicas, since they are easily replicable. I call them gestures, because they are non-permanent. The next row gives examples of permanent signs, ones that last more than a few seconds. Eco does not distinguish fleeting signs from permanent ones. He would, however, distinguish the next row, since it lists signs which are not easily replicable. Essentially, I agree with Eco that we need to distinguish signs which are easily replicable from those which are not. We produce replicable signs daily when we speak, and in this case we should be able to conceive of the gestures as the signs. I would not consider the brush strokes performed in the production of a painting to be a kind of phonology of the painting. The exception might be found in an impressionistic painting, but notice that the signs would be the painted strokes on the canvas, not the gestures of the artist's hand and brush used to produce them. Only another artist, trained in the impressionist school, could interpret the painting as something he might have produced himself.
So when we compare the various modes of communication, it seems clear that at least when we are dealing with familiar gestures, those gestures are candidates to be considered the true signs, as opposed to the signal produced by them. This seems particularly compelling in the case of American Sign Language gestures, since the gestures are so visual, and the word "sign" is part of the name of the language. The distinction between the signal and the sign is even more evident when you consider visual perspective. In the case of a sign language gesture, your view of it changes depending on the physical orientation of the speaker relative to you, the "listener." For example, the ASL gesture for "I" or "me" is made by pointing to your torso with your index finger. Seen straight on, the hand in this gesture recedes from the addressee as it approaches the signer's torso. If the signer is seen in profile, the hand appears to move to the left or to the right. In the case of a non-gestural sign, say, a statue, the light image, the signal, hitting your retina can vary enormously as you move around the statue. There will certainly be some details completely missing in certain views and prominent in others. What is constant is the shape of the statue. And that is what constitutes the sign. This shape can also be perceived by touching the statue. In the case of ASL, the constant is the gestures. And Helen Keller, who was both blind and deaf, perceived gestures through her sense of touch.
I will call visual perspective a kind of distortion, to distinguish it from static and other forms of random interference in the signal. Distortion may involve loss of information, but it is entirely regular and predictable. These distortions of perspective even occur in the written language, changing the apparent shapes of letters as you re-orient the book you are reading. But is there anything in spoken language corresponding to perspective which might be called distortion? The sound spectrogram certainly doesn't change depending on the angle from which it is recorded. Later on I will propose the variation between individual speakers as a source of non-linear distortion of speech.
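The regularity of this kind of distortion is easy to simulate. Below is a minimal Python sketch, with coordinates and viewing angles invented purely for illustration: the same three-dimensional "me" gesture, viewed straight on and in profile, produces two very different two-dimensional signals, yet the mapping from sign to signal is completely deterministic, unlike static.

    import numpy as np

    # Three moments of a hypothetical "me" gesture: the hand moves purely
    # in depth (z), in toward the signer's torso. Units are cm, invented.
    gesture = np.array([[0.0, 0.0, 0.0],
                        [0.0, 0.0, 10.0],
                        [0.0, 0.0, 20.0]])

    def view(points, angle_deg):
        """Rotate the scene about the vertical axis, then project it
        orthographically onto the viewer's image plane (depth is lost)."""
        a = np.radians(angle_deg)
        rot = np.array([[np.cos(a), 0.0, np.sin(a)],
                        [0.0, 1.0, 0.0],
                        [-np.sin(a), 0.0, np.cos(a)]])
        return (points @ rot.T)[:, :2]  # the two-dimensional retinal signal

    print(view(gesture, 0))   # straight on: the image barely changes
    print(view(gesture, 90))  # in profile: the same sign sweeps 20 cm sideways

The invariant across all viewing angles is the three-dimensional gesture itself, which is the point of the statue analogy above.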
But why do so many linguists consider sounds to be the fundamental units of phonology, and by implication the signs of spoken language, disregarding the articulatory gestures? For one thing, the name phonology itself means "study of sound." Personally, I would prefer a term like "gestology." Another reason probably involves our strongly visual orientation. We can see the shape of the statue, so we know that its shape is the sign. We cannot easily see our linguistic articulators, and so we have a hard time conceiving of them, or "picturing" them, as it were, as the signs of language. I remember when I first saw the standard diagram of the articulators, a lateral x-ray view of the head, I found it hard to believe that my tongue was so stubby. Ironically, there has always been a tacit undercurrent of "gestology" in linguistics. The International Phonetic Alphabet has always named its sounds by their place and manner of articulation. And I suspect that even the most ardent binary phonologist in practice uses the working term labio-dental spirant, rather than strident-anterior-continuant consonant. We name our sounds after the gesture, because it is nearly impossible to think of other natural phenomena with similar sounds, Diver's famous snake-like sibilant notwithstanding. It's a little odd that linguists have ignored the gestures, since the vast majority of sound changes in the evolution of languages are small adjustments of the gestures, sometimes producing large changes on the spectrogram, as when a stop becomes a homorganic spirant.
When the sound spectrogram came along in the mid-twentieth century, there was great hope of making phonology an exact science, and of building an automatic speech analyzer. Now one could finally "see" speech. Despite intensive training, however, nobody learned to read the spectrograms, not even hearing people. Nonetheless, acoustic phonology was born, reinforcing the belief that the sounds were the signs. But Saussure lived a long time before the spectrogram, didn't he? He did live most of his life in the era of telephony, however, and I believe this may have been an influence. What is the midpoint between the two talking crew-cut men in the well-known illustration to his Cours? Why, the sound of course! They might as well have been on the phone; in fact, the sagging dotted line of information flow between them looks suspiciously like a wire. The telephone transmits another kind of signal, but it is a reflex of the sound signal.
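For anyone who has never met one: a spectrogram is just short-time Fourier energy, frequency plotted against time. Here is a minimal sketch, assuming only NumPy and SciPy, with a pure 440 Hz tone standing in for a recorded utterance:

    import numpy as np
    from scipy.signal import spectrogram

    fs = 16000                            # sample rate in Hz
    t = np.arange(fs) / fs                # one second of signal
    x = np.sin(2 * np.pi * 440 * t)       # a pure tone standing in for speech

    # 25 ms analysis windows with 10 ms hops, typical values for speech work.
    f, times, Sxx = spectrogram(x, fs, nperseg=400, noverlap=240)
    print(Sxx.shape)                      # (frequency bins, time frames)

The result is exactly the kind of picture phoneticians pore over; the question I am raising is whether the picture is the sign.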
When I read a word, I do not hear a sound image of my own voice or anybody else's. On the other hand, many people move their lips when they read. No matter what written word I try, I cannot detect any mental sound image in my head. The linguistic mental sound images I do have are tied directly to certain speakers. I can easily call up my sister's "Hi, Tom" telephonic greeting. I suppose that most of us can mentally recall the expression "they're grrrreat!" in the voice of Tony the Tiger with all its nuances and intonation. In these cases, the expressions are fixed, and the symbols are the entire expressions. Mentally, they are not even analyzed into morphemes, much less phonemes.
Although now I am a big proponent of articulatory gestures, I am still quite visually oriented like everyone else. A few months after I had convinced myself that the gestures were the signs of language, I gave myself a shock when I thought of sound symbolism. If the gestures were the signs, then shouldn’t they look like the thing they represent? In the expression "ding-dong", shouldn’t at least the shape of the articulators resemble a bell? Should I throw this theory in the trash? Well, my paper isn’t in the trash quite yet, because my hands don’t smell like a banana. In ASL, the sign for a banana is made by sticking one finger up in the air and pretending to peel it with the other. This is an obvious example of an iconic visual sign. The articulators are made to visually resemble the object they represent. But they don’t have to resemble the banana through any other senses. The hands don’t have to smell like a banana, or feel like a banana, or sound like a banana. So the articulatory gestures only have to sound like a bell when they pronounce "ding-dong."
As a matter of fact, sign language is much better at iconicity than speech, and many, many ASL signs resemble the things or actions they represent. One reason is that the hands have three dimensions to work with, whereas the vocal tract is essentially two-dimensional. With ASL, a three-dimensional sign is then projected onto a two-dimensional retina. In speech, a two-dimensional gesture is projected onto a one-dimensional sound. Your intonation is preserved in the fundamental frequency, and your articulators are reflected in the resonances known as formants. Because vision has at least one extra dimension, much more information per second is delivered to your brain through your optic nerves than through your auditory nerves. No wonder vision is so much more compelling than hearing. No wonder speech needs to produce symbols at a faster rate, and employs the double articulation to create its wealth of vocabulary. Both speech and sign language compress at least one dimension by a process of distortion with loss of information. The receiver must reconstruct the gestures through what is probably an instinctual, hard-wired process, both in seeing sign language and in hearing speech. This might explain why people can't seem to learn to read spectrograms, yet they somehow understand speech through the auditory channel.
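To make the two layers concrete, here is a rough Python sketch, not a serious analysis tool: it synthesizes a crude vowel, a 120 Hz pulse train passed through two resonators standing in for the vocal tract, then recovers the intonation layer by autocorrelation and the articulatory layer by linear prediction. All the frequencies are invented.

    import numpy as np
    from scipy.signal import lfilter
    from scipy.linalg import solve_toeplitz

    fs, f0 = 16000, 120
    src = np.zeros(fs)
    src[::fs // f0] = 1.0                          # glottal pulse train
    x = src
    for fmt in (700.0, 1200.0):                    # stand-ins for formants
        r, w = 0.97, 2 * np.pi * fmt / fs
        x = lfilter([1.0], [1.0, -2 * r * np.cos(w), r * r], x)

    frame = x[2000:2000 + 2048] * np.hamming(2048) # one analysis window
    ac = np.correlate(frame, frame, "full")[2047:] # autocorrelation, lag 0 up

    # Intonation: F0 is the lag of the strongest repetition (60-400 Hz range).
    lag = np.argmax(ac[fs // 400: fs // 60]) + fs // 400
    print("estimated F0:", round(fs / lag))        # close to 120

    # Articulation: formants are the resonances of an all-pole (LPC) fit.
    p = 8
    a = solve_toeplitz(ac[:p], ac[1:p + 1])        # Levinson-style equations
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[(roots.imag > 0) & (abs(roots) > 0.8)]
    freqs = np.sort(np.angle(roots) * fs / (2 * np.pi))
    print("estimated formants:", freqs.round())    # close to 700 and 1200

Both layers ride on a single one-dimensional waveform, which is the compression I am describing.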
On the other hand, some people are fairly good at lip reading. Too bad for the hearing-impaired that such a large portion of our articulators is hidden from view. Too bad for linguists that spectrograms are so easy to make, while the articulatory gestures are so hard to record. The fact that ordinary people seem to be so interested in seeing what other people's articulators are doing helps convince me that the gestures are the signs. I often find myself watching other people's mouths as I listen to their conversation. Perhaps some of you are watching my mouth right now. I have it on the word of a sound editor that speech is the one sound that must be most carefully synchronized in films. It's certainly the easiest way for me to recognize that a movie is out of synch. Most convincing of all, however, is a well-known experiment first done many years ago. Subjects are shown videos of people pronouncing the syllable /ga/, dubbed over with the sound /ba/. The subjects report hearing /da/. This is now known as the "McGurk effect."
So listeners apparently will use any cues they are given in order to understand what the speaker is saying, even visual ones. When you think about it, this is the way communication systems usually work. Take the signaling system used in papal elections, for example. The cardinals, huddled in a room, vote repeatedly until a pope is elected, each time burning the paper ballots after tallying them. If the vote is inconclusive, they add straw to the fire to create dark smoke, indicating that they need more time. The outside world gets the message by looking at the chimney outside the building. I would claim that the sign in this system is the gesture of burning either paper or straw, not the shade of the smoke. The smoke might be considered the signal. Obviously, in setting up this system, they had to be sure that the color of the smoke could be used to distinguish straw from paper. But the cardinals do not adjust the amount of straw based on viewing conditions – they might not know whether more or less straw is needed to compensate for rain or fog. They might not even know what the weather conditions are outside. On the other hand, if a video camera were installed at the fireplace, the outside world could see whether the cardinals were adding straw or not. Even if the smoke were ambiguous, everyone would conclude that there is still no pope if they saw the hand of a cardinal adding straw to the fireplace.
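The division of labor can be caricatured in a few lines of toy Python, with every number invented: the gesture is the message the receiver is after, the shade of the smoke is merely the signal, and fog degrades that signal without the cardinals being able to compensate.

    import random

    def smoke_shade(straw_added, fog):
        """Channel: gesture in, observed shade out (larger = darker)."""
        base = 0.8 if straw_added else 0.2
        return base + random.uniform(-fog, fog)  # weather blurs the signal

    def decode(shade):
        """The receiver infers the gesture; the shade itself means nothing."""
        return "no pope yet" if shade > 0.5 else "habemus papam"

    random.seed(1)
    for fog in (0.1, 0.4):
        shade = smoke_shade(straw_added=True, fog=fog)
        print(f"fog={fog}: shade={shade:.2f} -> {decode(shade)}")

A camera at the fireplace would bypass the noisy channel entirely and show the gesture itself, which is why it would settle any ambiguity.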
The same is true for spoken language. It is my job as a speaker to enunciate clearly, and to phrase my talk in the way I think will be most intelligible. Your job, as listener, is to understand nee in any way you can. Sometimes this process is unconscious. For example, how many of you heard me intentionally mispronounce "me" as "nee" in the phrase "understand me"? In any case, I cannot adjust my articulatory gestures to help you understand me, but you may need to reinterpret your perceptions in order to understand which articulatory gestures I am employing. The articulatory movements are performed to match a specific target, and I believe that the target is articulatory, not auditory. When setting up our linguistic capacity, evolution had to create enough articulators and places of articulation to produce a certain number of distinguishable sounds. Whenever a language evolves, any newly emerging articulatory target must be auditorily distinguishable from all the others of a given speaker. What they sound like is irrelevant. Now, these articulatory targets cannot be changed by individuals. There are no two ways to hit the target. When people with speech defects employ the wrong articulatory gestures, it is almost always obvious. Of course, children acquiring a first language use their auditory feedback to adjust and perfect their gestures, but their models are gestural, not auditory.
Why do I believe this? Because of cross-speaker distortion. As I suggested before, there is no standard sound image of oneself. How could a baby model its speech on itself? On the other hand, a baby cannot exactly reproduce its mother's sounds either, since its vocal tract is too small. How could it hone its articulations on her sounds? It has been known for many years that there are considerable differences among men's, women's, and children's speech, and that these differences are non-linear. This is one reason why artificial speech recognition is so difficult. What hasn't been known until fairly recently is that infants have an apparently innate capacity to identify an articulatory gesture no matter who the speaker is.
Some of you may have heard about the experimental technique of training babies to turn their heads in reaction to audio or visual stimuli. Well, in 1991, Patricia Kuhl used this technique in an experiment and found that six-month-old infants, trained to distinguish /a/ from /ae/ in a given speaker, could transfer this ability directly to the speech of various other speakers, without further training. This ability is called talker normalization. And it is all a child needs to accomplish the task of honing its articulatory gestures when it grows old enough to talk. The new talker only needs to tell when its own gestures match its parents' gestures. The sounds produced may be quite different, but the child can stop adjusting gestures when the normalized sounds match. Of course, infants must have some ability of this kind, since their parents' articulatory gestures produce different sounds than their own. If they didn't have this ability, then firstborn children should have a natural handicap in language learning, since they don't have any sibling sound models on which to base their own speech. I suspect that, if anything, they acquire speech faster than their younger siblings do.
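As a toy illustration of talker normalization, here is the trick sociophoneticians call Lobanov normalization: z-score the formant values within each talker's own vowel space. This is not Kuhl's procedure, and the formant figures below are invented, though shaped like the real pattern, with the child's vowel space sitting far above the man's.

    import numpy as np

    # (F1, F2) in Hz for the vowels /a/ and /ae/, per speaker (made up).
    speakers = {
        "man":   np.array([[730.0, 1090.0],    # /a/
                           [660.0, 1720.0]]),  # /ae/
        "child": np.array([[1030.0, 1370.0],   # /a/
                           [1010.0, 2320.0]]), # /ae/
    }

    def normalize(formants):
        """Z-score each formant within one talker's own vowel space."""
        return (formants - formants.mean(axis=0)) / formants.std(axis=0)

    for name, f in speakers.items():
        print(name, normalize(f).round(2))
    # After normalization the two talkers' /a/-versus-/ae/ contrast lines
    # up exactly, even though the raw frequencies never coincide.

Whatever computation infants actually perform, something with this effect must be available to them, or Kuhl's transfer result would be impossible.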
So if the articulatory gesture is a psychological entity, it probably ought to be a linguistic entity as well. In most language instruction books, the phonological units of language are described by their articulatory gestures, although at times there are impressionistic sound descriptions as well. With language tapes, learners are supposed to replicate what they did as children with their first language: that is, experiment with their articulators until they can "match" what's on the tape. If we ultimately want a highly detailed description of what's happening not only with the articulators but also acoustically, it seems we could start either with an auditory description or a gestural description, and then generate the other. But as we remove redundant details from either side of the function, assuming always that the redundant details can be generated by universal (across languages) rules, I suspect that we would wind up with a simpler description on the articulatory side than on the acoustic side. Given any tape recording as input, there must be some universal function that could output the gestures, even if the function might have to be solved by trial and error, as the language learner would do with tapes. But if we restrict ourselves to featural descriptions, I suspect it would be easier to generate the phonetic features from the gestural features than the reverse. And if we still wanted to go the auditory route, given that male and female voices are different and non-linearly related, it seems likely that we'd have to pick one sex or the other as the standard. An average of these non-linearly related spectrograms might represent nobody's voice at all.
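The trial-and-error route can be put in toy form. In the Python sketch below, every feature and cue value is invented: the forward map from gesture to acoustics is a simple function, and "recognition" is nothing more than searching the small gestural space for an input whose synthesis matches the observed cues.

    from itertools import product

    PLACES = ("labial", "alveolar", "velar")
    MANNERS = ("stop", "fricative")

    def synthesize(place, manner):
        """Forward direction: gesture -> rough acoustic cues (toy values)."""
        burst = {"labial": 800, "alveolar": 4000, "velar": 2000}[place]
        noisy = manner == "fricative"
        return (burst, noisy)

    def recognize(cues):
        """Inverse direction: trial-and-error search over the gestures."""
        return [g for g in product(PLACES, MANNERS) if synthesize(*g) == cues]

    print(recognize((4000, False)))   # -> [('alveolar', 'stop')]

The forward direction is a one-line table lookup; the inverse direction exists only as a search, which is exactly the asymmetry I am suggesting between the two kinds of description.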
Of course, articulatory gestures as phonological units are really nothing new. In the 1930s, Freeman Twaddell defended articulatory fractions as the basis for classifying phonetic relations in his article "On defining the phoneme." Ironically, this was the famous article which divorced phonology from any physical considerations. William Diver suggested that the skewness of the distribution of the phonemes was entirely due to articulatory considerations. I believe this theory should be tested experimentally, if it hasn't been done already. One could test whether subjects make more errors in identifying phonemes in words like /bab/, /dad/, and /gag/ than in words with different initials and finals, since we know there is a tendency for languages to avoid such alliterative combinations. If subjects had no particular trouble with them, we could conclude that Diver was right: the reason must lie in production, not perception.
Researchers from Haskins Laboratories in New Haven have always tended to side with gestures rather than the sounds. Alvin Liberman’s motor theory of speech perception is probably the best known of the early theories. Now there is a new theory called Articulatory Phonology, championed by Catherine Browman and Louis Goldstein, also at Haskins Laboratories. They create "gestural scores" of short words, graphing the positions of various articulators over time. The gestures translate into vocal tract variables, which can be translated into acoustic realizations. Some might call this phonetics rather than phonology; nonetheless, they have come up with some interesting findings. For example, they have found that the gestural scores get compressed in a predictable manner during rapid speech. Sometimes articulations become so compressed that they become entirely simultaneous with others. For example, the final "t" of the word "perfect" can disappear into the other consonants in an expression like "perfect memory". In this extreme case, the "t" may be completely inaudible, yet the tongue still forms it, even though the airflow to it is already cut off by the "c". No wonder some dialects drop the "t" entirely! Elsewhere, Browman and Goldstein have shown how gestural analysis can explain as one simple phenomenon what is traditionally considered variation in multiple allophones.
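Without pretending to reproduce Browman and Goldstein's actual tract-variable model, a back-of-the-envelope score in Python shows the effect. Each gesture below is a tuple of articulator, target, and invented start and end times in milliseconds; compressing the timeline slides the tongue-tip gesture for the "t" entirely under its neighbors.

    def compress(score, factor):
        """Scale every gesture's timing toward zero (rapid speech)."""
        return [(art, tgt, round(s * factor), round(e * factor))
                for art, tgt, s, e in score]

    score = [                     # the "...fect me..." region, times invented
        ("tongue body", "velar closure (c)",    0, 120),
        ("tongue tip",  "alveolar closure (t)", 100, 180),
        ("lips",        "labial closure (m)",   160, 280),
    ]

    for g in compress(score, 0.6):
        print(g)
    # At factor 0.6 the /t/ spans 60-108 ms, but the velar closure still
    # holds until 72 ms and the lips close by 96 ms: the gesture is made,
    # yet there is no audible release.

The gesture survives in the score even when nothing of it survives in the sound, which is precisely the Browman and Goldstein finding described above.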
In this paper, I have not presented any new experimental evidence, nor have I really presented a new theory. But my goal may be bigger than that, in some sense. I have tried to show you the articulatory gestures in a new light, as the signs or symbols of language. As such, they can be seen as part of a broader discipline, semiotics. I hope that a semiotic point of view can be useful in other ways as it draws comparisons among different symbolic systems. It could possibly impose a kind of discipline on linguistics. I might also venture the hope that automatic speech recognition could be improved if the distortion function from gestures to sound could be inverted and applied first to the sounds to produce a representation of the gestures. If I am right, the gestures should be easier to identify across speakers, and they may even be easier to segment.
In comparing speech to sign language, I have found more similarities than I have time to explain here. Many of these comparisons are quite interesting, as they are not simply attempts to find traditional linguistic units in sign language. Many linguistic units, such as syllables, don’t seem to serve any function in speech anyway. But some of the debates raging in linguistics can be tested on sign language and might be resolved. For example, when I see the phonemes as gestures, I can see a sign language counterpart to the question of whether diphthongs should be counted as one sound or more than one sound. In any case, the analogies and comparisons can be fascinating, as when one compares the sound produced by the vocal cords and reflected on the other articulators to the sun’s light shining on the hands of ASL signers. Yes, we can speak in the dark, but our light turns on only when we exhale, and blinks off for the voiceless sounds. One result of this unreliable source of illumination is our messy spectrograms, which are notoriously difficult to read.