What Happens to the Speaking Subject When the Listener is a Computer?

Affective Computing of the Speaking Voice

Jessica Feldman · 29.10.2024 · Article, Issue 01

I am sitting with a group of colleagues on a Saturday night. We are in the back of a bar, getting some dinner and drinks after a conference. The room is loud with boisterous conversations. There is rock music playing in the background, and we are having a playful discussion about serious stuff: the role of the fourth estate, whether it is worth voting in America, does surveilling the police really keep brutality in check? What are we all working on? I start describing my research into affective listening software. They want to know if it works, how it’s used, and if it learns. I am not sure about the first question, but I take out my iPhone and open the “Moodies” app. “Moodies” is an app designed to evaluate the user’s emotional state by listening to the acoustic qualities of the voice. My friend across the table is excited to try it. She has a warm, slightly twangy, high-pitched voice. It sounds to me like there is always a smile in her voice, and I have thought that her voice makes her sound friendly even when she is saying challenging or contrary things. She speaks to the app, mostly about nothing. She describes the scene, tells it that we are sitting in a bar having some drinks, testing it out. After twenty seconds, it beeps and gives her an evaluation of her emotional state: “Anger: anger and bitterness. Pride. Possessiveness.” She laughs uproariously. “That’s not right!” My colleagues posit problems with the input—the room is too noisy, it’s actually hearing the background music, she didn’t speak long or loudly enough, she needs to hold the mic closer to her mouth. She tries it again, and this time it tells us: “Anger: A loud and emotional state. Radical Leadership. Fanaticism. Dichotomy.” She laughs again and shakes her head, hands the phone back to me. I try it now. I’m feeling a bit tired and weak as I’m getting over a migraine, but also perhaps a little nervous and guarded as I am just getting to know these new friends. 
I feel as though I am fine-tuned to others’ reactions right now, speaking softly and listening a lot, trying to be careful about what I say and how I say it. I similarly speak to the app about nothing of consequence: the bar, the day, the neighborhood. After twenty seconds, it gives me my reading: “Dominance: preaching, forceful leadership, dominance. Aggressive communication. Anger and contempt.” I am taken aback and laugh. “That’s DEFINITELY not true! That is the opposite of true.” I shake my head and smile, deny the allegations. We all decide together that it doesn’t really work. But I think perhaps we’re a bit shaken by it. Does it work? Is it telling us something secret about our feelings that we are trying not to share? Is it telling us something about ourselves that we don’t even know or want to admit?

“The ‘grain’ is the body in the voice as it sings, the hand as it writes, the limb as it performs. If I perceive the ‘grain’ in a piece of music and accord this ‘grain’ a theoretical value (the emergence of the text in the work), I inevitably set up a new scheme of evaluation which will certainly be individual—I am determined to listen to my relation with the body of the man or woman singing or playing and that relation is erotic—but in no way ‘subjective’ (it is not the psychological “subject” in me who is listening; the climactic pleasure hoped for is not going to reinforce—to express—that subject but, on the contrary, to lose it). The evaluation will be made outside of any law, outplaying not only the law of culture but equally that of anticulture.”1

“Based on our team of physics, neuropsychology, and decision-making experts, we have managed to decode the human intonation using 10-15-second voice segments. We discovered that emotions create universal patterns in all voice frequencies and intensities. Our patented core engine includes hundreds of mood variations as well as a complete emotional decision-making model based on our vocally detected intonations. By listening and focusing on how people speak rather than trying to understand what they say, our technology taps into a much stronger source of emotional information. Furthermore, since emotions are both universal and intuitive, Emotions Analytics solutions need no complex and cumbersome sets of rules and syntax to try and convert words into meanings.”2

The realm of the voice and the realm of the affective share the distinction of the ineffable. Here, human instincts, raw flesh, autonomic reactions, sweat, nerves, animal chemistry, and gut reactions leave their marks in sound. Such expressions are imagined as transmitting their effects to other sentient creatures, somehow bypassing language and touching our pleasure points, stirring our souls, or hitting us where it hurts before we can make meaning of it. Digital listening has recently latched onto the intersection of these realms, aiming to evaluate—and to predict—a speaker’s mood, personality, truthfulness, confidence, and mental health based on algorithmic evaluations of the acoustic parameters of the voice. In order to consider the psychological and political values embedded in this technique of listening, it is necessary to look closely at the code, patents, and marketing language used by the emerging affective listening software. What happens to the speaking, feeling subject when the listener is a computer? How do these algorithms imagine, and attempt to quantify, the human soul, and to what ends? The digital encoding of the affective realm reveals a collection of (man-made) guidelines for listening, which in turn lead to a prescription of what is listenable—of what counts as legible and possible. I here consider which cultural and ethical values are deeply embedded in this software, and what kinds of intersubjectivities this digitized listening might prescribe or foreclose.

Although theorists like Deleuze and Guattari often describe the affective realm as the most unquantifiable plane of experience, the affective sciences are actually all about counting feelings, and about which feelings count. In 1677, Spinoza defined affect as a “passion of the mind” expressed only through the vitality of the body, not in language. Spinoza counted three such passions: desire, joy, and sorrow.3 Psychologist Silvan Tomkins took up this idea again in the 1960s, increasing the count to nine, and theorizing universal bodily expressions particular to each one.4 Sound figures prominently in his descriptions of affective communication, both as a literal expression and as a metaphor (e.g., the transmission of affect is called “affective resonance”). Bodies can transfer emotions without words, as packs of animals transmit fear amongst themselves or one crying baby can set a whole nursery wailing. Phenomenologist Max Scheler called this the “contagion of emotion.”5 This gave rise in the 1970s to M. F. Basch’s theory of “primitive empathy,” which transmits the “raw data of emotion”6 through wailing voices or nervous skin.

That this raw data could be described using discrete and limited categories meant that feelings could then be digitized—coded for the computer in terms of their quantitative affiliation with one or another itemizable affect (95% joy and 5% sorrow, for example). These codification processes have flourished in the past quarter-century—in tandem with the rise of portable personal computing—leading to an outpouring of research and new technologies in digital listening. Although most such tools claim to merely register in digital form an affective meaning in the voice that is “natural” and universal to the human body, I instead find that these technologies of encoding assert certain forms of recognition of the self and other that are based primarily on the capacities of the computer and the priorities of the market. Far from being modeled on fundamental truths of the human body (if such truths exist), “the affective” has recently gained validity as a psycho-epistemological category in tandem with, perhaps because of, the rise of personal and predictive computing.

The affects gain traction because they are numbered, and therefore the affective is something a computer can handle. “Affective computing” emerged in the late 1990s and early 2000s as an area of technical research focused mainly on designing software that could identify and respond to human emotions based on the machine coding of facial gestures.7 More recently, these practices have expanded to include listening software. Such projects quantify vocal expressions and claim to read their affective content according to a rubric that understands these signals as indexing entries in various libraries of affective and emotional labels. This gave rise to a new mode of listening: one that claimed to be both cybernetic and sympathetic. Our feelings become heard, recognizable—in fact, worthy of recognition—as they become perceptible to the computer.

But this is not—or ought not be—an easy task. Roland Barthes has written about the struggle to describe sound using linguistic codes. Language, he says, “manages very badly” at discussing music and tends to fall back on that “poorest of linguistic categories,” the adjective.8 Listening to music has the effect of reassuring and constituting the subject—culturally and relationally—and this effect gets expressed by the listener in adjectival terms. Sound is described with the most subjective vocabulary, using words with no set quantitative or universal meaning: loud, soft, moving, violent, sweet, rich, harsh, etc. For Barthes, the “grain” of the voice is a site of escape from the “problem of the adjective.” The voice—especially the untrained voice—carries unintentional and inevitable traces of the individual body from which it emanates: the dimensions of the singer’s lungs, the flesh of his tongue, the shape of his teeth, etc. Although the same could be said for any expressive gesture issuing from the body, the voice is unmediated—it travels directly from the mouth to the ear without the intervention of an instrument—and therefore puts the speaker’s (or singer’s) and listener’s bodies into an immediate, erotic relation. For Barthes, this grain is something beyond codification and culture—it evinces only the body; speaks from the body and to the body.

For affective listening software, uncontrollable, unintended, or habitual vocal inflections are imagined as signifying not just the body, but also the emotions, intentions, desires, fears, personality, and mental health of the speaker—in short, the soul. These technologies listen for and quantify changes in pitch, timbre, volume, and pacing—the “musical” and “granular” parameters of the voice—in order to ascribe affective meaning to these changes. Emerging and recent listening technologies like Nemesysco’s9 “Voice Risk Analysis,” Beyond Verbal’s10 “emotion analytics” software, and Cogito’s11 “Dialog” do their listening in a range of contexts, from healthcare administration to artificial intelligence to financial investing. Benefits administrators use this software to screen claimants for both wellness and sincerity. Some healthcare providers use the software in call centers for diagnostic purposes, particularly to detect depression. Self-tracking mobile phone apps offer the user a description of her mood and its history. Customer-service call centers now use “automatic dialogue systems” to detect if a speaker is angry or frustrated. Military training simulators are incorporating the software to measure stress levels. Human resources departments use it to weed out job applicants.12 Finally, some recent studies have been directed at developing hyper-focused voice surveillance systems that alert the authorities when tensions are running high.
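To make the mechanics concrete, the pipeline such products describe—extract coarse acoustic features from the signal, then map them onto a closed vocabulary of affect labels—can be sketched in a few lines. Everything below is a hypothetical illustration, not the method of any product named above: the two features are crude stand-ins for pitch and volume, and the thresholds and labels are invented to show how a fixed rubric turns continuous sound into a named feeling.

```python
import math

def rms_energy(samples):
    """Root-mean-square amplitude: a crude proxy for vocal 'volume'."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def zero_crossing_rate(samples):
    """Zero crossings per sample: a crude proxy for pitch/brightness."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return crossings / len(samples)

def classify_affect(samples):
    """Map two acoustic features onto an invented library of affect labels.

    The thresholds (0.5, 0.05) and the four labels are placeholders chosen
    for illustration; they encode the rubric, not any truth about the voice.
    """
    energy = rms_energy(samples)
    zcr = zero_crossing_rate(samples)
    if energy > 0.5 and zcr > 0.05:
        return "anger"      # loud and high-pitched
    if energy > 0.5:
        return "dominance"  # loud and low-pitched
    if zcr > 0.05:
        return "anxiety"    # quiet and high-pitched
    return "calm"           # quiet and low-pitched

# A loud, rapidly oscillating test tone lands in the "anger" bin.
loud_high = [0.9 * math.sin(2 * math.pi * 0.2 * n) for n in range(1000)]
print(classify_affect(loud_high))  # → anger
```

Even this toy version makes the essay’s point visible in code: whatever the speaker meant, the output can only ever be one of the labels the designer wrote into the rubric.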

By and large, the evolution of these products reveals a broader techno-cultural shift: from truth to prophecy. Although these designs derive originally from lie-detection technologies, most contemporary affective listening products are couched in the languages of prediction and control: self-tracking, targeted marketing, investing, and risk-management. Companies claim to offer high ROIs (returns on investment) by detecting the investment-worthiness of a CEO, the likelihood of a claimant to benefit from rehab, when a user will become depressed or anxious, whether a worker will perform well, and even a speaker’s “illegal intentions.” Accuracy is not the goal here. Rather than proving an existing fact, these tools are instead focused on describing a field of psychic possibility. Such products “work” by providing their users with probabilities that allow them to invest their time and money better. Arjun Appadurai’s explanation of the risk economy hinges in part on Weber’s definition of magic as “some sort of irrational reliance on any sort of technical procedure, in the effort to handle the problems of evil, justice, and salvation.”13 Indeed, these technologies attempt to do something magical—to know the unknowable, to quantify the uncertain, to see into the soul in order to predict the future.

What does magic have to do with ethical listening? Barthes makes the connection clear: “the musical adjective becomes legal whenever an ethos of music is postulated, each time, that is, that music is attributed a regular—natural or magical—mode of signification.”14 Ethics and politics enter into listening when the sound is described using a language that is given a general and repeatable meaning, thereby both limiting, and making accountable, its expressions. In this case, the encoding of the affective realm in the voice is the moment where cultural values get embedded into these technologies. The feelings in the sounds are evaluated and named: stressed, tired, deceptive, passive, happy, angry, anxious, etc. Insofar as ethics concerns the recognition of the other,15 the rubrics of recognition used by these technologies prescribe their ethical vocabulary. They show us their model of the human soul, and in this model, show us what structures of feeling “count” as recognizable, or worthy of recognition. The translation from sound to language is where the magic happens, and where the ethics of these technologies comes into view.

A second ethos surfaces when these technologies meet the market. The products—regardless of their relationship with the natural, rational, or magical—have efficacy insofar as they are being used. Their applications retroactively prescribe the possible meanings that they assign to certain vocal qualities. A comparative study of the patents and early available open-source code of these technologies reveals common methods of extracting data from the vocal signal, but a great variety of ways of assigning meaning to that data in the early phases of ideation and development of these products. Not surprisingly, what these technologies claim to listen for changes as they hit the market. Regardless of the diverse range of psychological models proposed in early documentation of these technologies, their applications are remarkably similar. The affective listening software being marketed today is coalescing around a few standard uses: prediction and risk management in the realms of benefits administration, labor relations, financial investing, and surveillance.

This is a revised extract of an article initially published as “The Problem of the Adjective,” Transposition 6 (2016).

Jessica Feldman (she/her) is a sound and new media artist, researcher and assistant professor in the Department of Global Communications at the American University of Paris.

