As smart assistants and voice interfaces become more common, we’re giving away a new form of personal data — our speech. This goes far beyond just the words we say out loud.
Speech lies at the heart of our social interactions, and we unwittingly reveal much about ourselves when we talk. When someone hears a voice, they immediately start picking up on accent and intonation and making assumptions about the speaker’s age, education, personality, etc. We do this so we can make a good guess at how best to respond to the person speaking.
But what happens when machines start analyzing how we talk? The big tech firms are coy about exactly what they are planning to detect in our voices and why, but Amazon has a patent that lists a range of traits they might collect, including identity (“gender, age, ethnic origin, etc.”), health (“sore throat, sickness, etc.”), and feelings (“happy, sad, tired, sleepy, excited, etc.”).
This worries me — and it should worry you, too — because algorithms are imperfect. And voice is particularly difficult to analyze, because the signals we give off are inconsistent and ambiguous. What’s more, the inferences that even humans draw from them are distorted by stereotypes.

Take the example of trying to identify sexual orientation. There is a style of speaking, with raised pitch and swooping intonations, that some people assume signals a gay man. But confusion often arises because some heterosexual men speak this way and many gay men don’t. Listening experiments show that human aural “gaydar” is right only about 60% of the time. Studies of machines attempting to detect sexual orientation from facial images have reported a success rate of about 70%. Sound impressive? Not to me, because that means those machines are wrong 30% of the time. And I would expect success rates to be even lower for voices, because how we speak changes depending on who we’re talking to. Our vocal anatomy is very flexible, which allows us to be oral chameleons, subconsciously adjusting our voices to fit in better with the person we’re speaking with.
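To put those error rates in concrete terms, here is a minimal back-of-the-envelope sketch. The 60% and 70% accuracy figures are the ones from the studies mentioned above; the population sizes are hypothetical round numbers chosen purely for illustration.

```python
# Rough arithmetic only: accuracy figures come from the studies cited above;
# the population sizes are hypothetical and chosen for illustration.

def expected_mislabeled(accuracy: float, population: int) -> int:
    """Expected number of people such a classifier would label incorrectly."""
    return round((1 - accuracy) * population)

for accuracy in (0.60, 0.70):                  # aural "gaydar" vs. facial images
    for population in (1_000, 100_000_000):    # a small sample vs. a large user base
        print(f"{accuracy:.0%} accurate over {population:,} people "
              f"-> ~{expected_mislabeled(accuracy, population):,} mislabeled")
```

Even at the higher of those two accuracies, a classifier deployed across a hypothetical user base of 100 million people would mislabel tens of millions of them.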
We should also be concerned about companies collecting imperfect information on the other traits mentioned in Amazon’s patent, including gender and ethnic origin. Machine learning applications trained on samples of speech will pick up the societal biases baked into that data. We have already seen this happen in similar technologies. Type the Turkish “O bir hemşire. O bir doktor.” into Google Translate and you’ll get “She is a nurse” and “He is a doctor.” Despite “o” being a gender-neutral third-person pronoun in Turkish, the presumption that a doctor is male and a nurse is female arises because the data used to train the translation algorithm is skewed by the gender imbalance in medical jobs. Such problems also extend to race: one study showed that in typical data used for machine learning, African American names appear more often alongside unpleasant words such as “hatred”, “poverty”, and “ugly”, while European American names appear more often with pleasant words such as “love”, “lucky”, and “happy”.
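To see how that kind of skew propagates, here is a toy sketch in Python. It is emphatically not how Google Translate works; it simply shows how a system that resolves the gender-neutral Turkish “o” by picking whichever English pronoun its training text most often pairs with a job title will reproduce the occupational bias in that text. The corpus and its counts below are invented for illustration.

```python
from collections import Counter

# A tiny invented "training corpus" of (pronoun, occupation) pairs that
# mirrors the occupational gender skew described above. Real training
# data is vastly larger, but the mechanism is the same.
corpus = (
    [("he", "doctor")] * 8 + [("she", "doctor")] * 2 +
    [("she", "nurse")] * 9 + [("he", "nurse")] * 1
)

# Count how often each pronoun appears alongside each occupation.
counts: dict[str, Counter] = {}
for pronoun, job in corpus:
    counts.setdefault(job, Counter())[pronoun] += 1

def resolve_pronoun(job: str) -> str:
    """Resolve the gender-neutral Turkish 'o' by choosing whichever
    English pronoun the training data pairs with this job most often."""
    return counts[job].most_common(1)[0][0]

# "O bir doktor." / "O bir hemşire." -> the skewed counts decide the gender.
for job in ("doctor", "nurse"):
    print(f"{resolve_pronoun(job).capitalize()} is a {job}.")
```

Swap the counts around and the “translation” flips: the system has no notion of gender in Turkish at all, only the frequencies in the text it was fed.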
The big tech firms want voice devices to work better, and this means understanding how things are being said. After all, the meaning of a simple phrase like “I’m fine” changes completely if you switch your voice from neutral to angry. But where will they draw the line? A smart assistant that detects anger, for example, could start to learn a lot about how you get along with your spouse from the tone of your voice. Will Google then start displaying advertisements for marriage counseling when it detects a troubled relationship? I’m not suggesting that anyone will do this deliberately. The trouble with complex machine learning systems is that problems like these tend to arise in unanticipated and unintended ways. AI might also detect a strong accent and infer that the speaker is less educated, because the training data has been skewed by societal stereotypes; that could lead a smart speaker to dumb down its responses to people with strong accents. Tech firms need to get smarter about how to avoid such prejudices in their systems.

There are already worrying examples of voice analysis being used on phone lines for benefit claimants to detect potentially false claims. The UK government wasted £2.4 million on a voice lie-detection system that was scientifically incapable of working.