Jeff Bier’s Impulse Response—"Could You Repeat That, Please?"

Submitted by Jeff Bier on Tue, 10/22/2013 - 22:01

Lately I've been spending many hours on conference calls: Early morning calls with colleagues in Europe and India, mid-day calls with customers in the U.S., and late evening calls with partners in China. I often find these calls difficult and fatiguing--not because of what people are saying, but because I frequently have trouble understanding what people are saying.

At first, I chalked this up to what seemed like the obvious explanations: "It's late in the day… It's my seventh call today… The person talking has an accent." But I suspected there was more to it, and that poor audio quality was at least partly to blame. It doesn't seem right that, nearly 100 years after the first transcontinental telephone call, and more than 30 years after digital signal processors began to be used in telephony, we're still suffering with crummy sound. And, really, it's more than an annoyance; poor audio quality makes it harder to communicate, especially when combined with participant accents, background noise and other challenges.

To be fair, sound quality has improved in recent years for certain kinds of voice calls. Mobile phone calls these days are less likely to be exercises in "Can you hear me now?", thanks to improved network coverage, better speech compression algorithms, and sophisticated noise cancellation techniques. And when there's a good Internet connection available, VoIP calls can deliver improved fidelity, thanks to increased audio bandwidth.

But, sadly, most of my conference calls still go through legacy telephone networks, and they often sound terrible. I think there are a few culprits here. Clearly the limited audio bandwidth of the legacy phone network is a major factor. As Polycom co-founder Jeff Rodman explains in this excellent paper (PDF), the fact that most telephone calls transmit audio only between 300 Hz and 3200 Hz is more than a mere annoyance:

Crandall noted in 1917, "It is possible to identify most words in a given context without taking note of the vowels... the consonants are the determining factors in... articulation." "Take him to the map" has a very different meaning from "take him to the mat," and a handyman may waste a lot of time fixing a "faucet" when the faulty component was actually the "soffit."

… This critical role of consonants in speech presents a serious challenge for the telephone network. The reason for this is that the energy in consonant sounds is carried predominantly in the higher frequencies, often beyond the telephone's bandwidth entirely. While most of the average energy in English speech is in the vowels, which lie below 3 kHz, the most critical elements of speech, the consonants, lie above. The difference between "f" and "s," for example, is found entirely in the frequencies above 3 kHz; indeed, above the 3.3 kHz telephone bandwidth entirely.

Rodman goes on to explain why the combination of limited audio bandwidth and accents is problematic:

Understanding accented speech can be much more difficult than native speech, both because of the presence of an accent, and because grammar, pronunciation, and even word selection are much different than the listener expects… This increases the importance of the physical parameters of speech communication (bandwidth, reverberation, amplitude, interaction, and noise) because it is no longer safe to assume that an unclear word can be deduced from its grammatical context.

Now things are starting to make sense to me: The limited bandwidth of conventional phone calls doesn't simply make them sound bad--it forces the listener to work harder to resolve the ambiguities that the missing frequencies introduce. And decoding those ambiguities can be significantly more difficult when the person talking is not a native speaker of the language.
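To get a feel for how much that ceiling throws away, here's a rough back-of-the-envelope sketch--my own illustration, not anything the phone network actually runs--that band-limits a wideband recording to the roughly 300-3400 Hz telephone band and then checks how much of the energy above 3 kHz survives. It assumes Python with NumPy and SciPy, and a hypothetical 16 kHz mono file named speech_16k.wav:

```python
# Rough sketch: simulate the legacy telephone band by band-limiting
# wideband speech to roughly 300-3400 Hz, then compare how much energy
# above 3 kHz (where "f" and "s" are distinguished) each version keeps.
# Assumes NumPy/SciPy; "speech_16k.wav" is a hypothetical 16 kHz mono file.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

rate, speech = wavfile.read("speech_16k.wav")
speech = speech.astype(np.float64)

# 4th-order Butterworth band-pass, 300-3400 Hz, as second-order sections
sos = butter(4, [300, 3400], btype="bandpass", fs=rate, output="sos")
narrowband = sosfiltfilt(sos, speech)

def energy_above(signal, cutoff_hz):
    """Fraction of total spectral energy above cutoff_hz."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / rate)
    return spectrum[freqs > cutoff_hz].sum() / spectrum.sum()

print("wideband energy above 3 kHz:   %.3f" % energy_above(speech, 3000))
print("narrowband energy above 3 kHz: %.3f" % energy_above(narrowband, 3000))

wavfile.write("speech_telephone_band.wav", rate, narrowband.astype(np.int16))
```

Playing the two files back to back is a quick way to hear how much of the "f"-versus-"s" distinction simply isn't there anymore.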

But bandwidth isn't the only problem. In my experience, the most difficult calls are often the ones involving speakerphones in multiple locations. In addition to limited bandwidth, speakerphones introduce several additional challenges, such as the need to cancel the speaker-to-microphone echoes in each room, and the attenuation, reverberation and interference introduced by the distance between the speakerphone and the call participants. Plus, the simple fact that these calls often involve a large number of people adds a layer of complexity, as listeners may struggle to determine who is speaking. The recently introduced Dolby Voice product attacks this last challenge, and is a promising development. Still, I think much more is needed. For the sake of the sanity of millions of businesspeople around the world who participate in conference calls daily, we can and should do better.
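For the curious, the echo-cancellation piece of that puzzle is a classic adaptive-filtering problem. Below is a bare-bones normalized LMS (NLMS) sketch--an illustration of the general technique, not what any particular speakerphone (or Dolby Voice) actually implements--where the signal names, filter length, and step size are all hypothetical:

```python
# Bare-bones sketch of acoustic echo cancellation with an NLMS adaptive filter.
# far_end: audio sent to the room loudspeaker
# mic:     microphone signal = near-end talker + echo of far_end
# Both are assumed to be NumPy arrays sampled at the same rate.
import numpy as np

def nlms_echo_canceller(far_end, mic, num_taps=256, mu=0.5, eps=1e-6):
    """Estimate the loudspeaker-to-microphone echo path and subtract it."""
    weights = np.zeros(num_taps)           # estimated echo path (FIR taps)
    output = np.zeros(len(mic))
    for n in range(num_taps, len(mic)):
        x = far_end[n - num_taps:n][::-1]  # most recent far-end samples
        echo_estimate = weights @ x
        error = mic[n] - echo_estimate     # residual: near-end speech + misadjustment
        weights += (mu / (x @ x + eps)) * error * x   # NLMS weight update
        output[n] = error
    return output
```

In a real room, of course, the echo path drifts as people move and doors open, which is part of why production cancellers pair this kind of adaptation with double-talk detection and residual echo suppression.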

One would think that, as robust Internet connectivity rapidly becomes ubiquitous, the 3 kHz bandwidth limitation of the legacy network would soon disappear. But until everyone (or at least everyone who participates in conference calls) uses VoIP, the legacy phone network will continue to be the default medium for these calls. What will it take to make VoIP the default instead? Or is there another solution?

And what about those speakerphones? While decent-quality speakerphones have become remarkably affordable in recent years, it seems that some breakthrough is needed to solve the intelligibility problems that frequently plague conference calls involving participants seated around large conference tables in several locations.

Digital signal processing has been used to solve many tough problems. I suspect it can be used to solve this one, too. Who's game to try?

Jeff Bier is president of BDTI and founder of the Embedded Vision Alliance. Please post a comment here or send him your feedback at http://www.BDTI.com/Contact.


TheMicroMan Thu, 10/24/2013 - 06:27

I don't know, Jeff. I blame much of today's hard-of-hearing telephony on modern science and human behavior.

Indeed the POTS circuits were designed within the constraints of human hearing, much like today's signal processing algorithms are tuned to the (now better-understood) limitations of the human ear. In the old days at Ma Bell we used to test end-to-end quality of service by sending a 1000 Hz tone across the wires to the end point. Noise and attenuation picked up through electromechanical switches, repeaters (amps), toll equipment, shoddy connectors/connections, crosstalk in bundles of color-coded wires, and long, wet wires all added up. (There was a not-politically-correct jingle used to remember the wire sequence BBROYGBVGW - and I don't know whether the wire sequence came first or the resistor identifier.) Generally, above 3 kHz nobody cared. There were special circuits that we tuned up and tested for the radio stations that carried their signals from studio to broadcast tower, which as I recall would handle up to 5 kHz for AM and up to 15 kHz for FM (there really was better music on FM!). But voice was definitely centered on 1 kHz.

Today, I blame bad telephony more on the accumulation of lossy compression algorithms, sloppy assumptions, and bad habits. MP3 algorithms (I know, not telephony), adaptive bit compression rates, shortcuts in algorithm processing, variance in microphones / speakers, and the inevitable "that's good enough - we've got to save cost" ... all contribute distortion. (What's your other article this week? "Imperfect Processing...") After the fourth round of saving a picture as a JPEG, it has fuzzed up quite a bit compared to that first version.
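If you want to see that generational loss for yourself, here's a quick sketch (Python with Pillow; "original.png" is just a placeholder for whatever picture you have lying around) that decodes and re-encodes the same image four times:

```python
# Quick sketch of JPEG generational loss: re-save the same picture a few
# times and watch it drift from the original. Assumes Pillow and NumPy;
# "original.png" is a placeholder file name.
import numpy as np
from PIL import Image

original = Image.open("original.png").convert("RGB")
img = original

for generation in range(1, 5):
    img.save("resave.jpg", quality=75)                 # lossy encode
    img = Image.open("resave.jpg").convert("RGB")      # decode it again
    diff = np.abs(np.asarray(img, dtype=np.int16) -
                  np.asarray(original, dtype=np.int16))
    print("generation %d: mean pixel error %.2f" % (generation, diff.mean()))
```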

Run across foreign telephone systems, POTS wires, cellular transmissions, Internet protocol, and DIY connections and you'll pick up some mud. Add in telephone bridges, speaker phones, cellular phones, and headsets. Noise suppression, echo cancellation, crosstalk/reflection, feedback, and processing delays can introduce their own effects. It's remarkable that anything intelligible gets through.

When was the last time someone really sold their cell phone on its amazingly clear voice quality? There's a Sertain brand of Korean phoneS that may Sound good to the phone'S owner, but the guy on the other end gets a broken up Signal. "No, I can hear you just fine." Someone prioritized their signal processing to the receiving side and maybe scrimped on the outgoing.

Then there's the weed-whacker/leaf blower/garbage truck/construction work just outside the window or the listener hammering away on his keyboard, ordering/eating lunch, or getting txt message notifications. And so many people think their magical cell phones will pick up anything they say even with their head turned away. My wife thinks that as long as you're in the room, a speakerphone will pick up your voice. I'm always creeped out when I'm in the men's room and some guy's over there doing his business and talking on his cell phone - and maybe flushing. That background noise doesn't get cancelled out.

California callers can be the worst, with everyone on bad Bluetooth headsets or speakerphones, driving down the freeway honking their horn with the radio on and the sun roof open while passing an 18-wheeler. Hey, call me when you get home!

And then there's the accents, the non-native speakers, the soft voices, and the sore throats. And am I still alert and paying attention, not reading some email as I multi-task?

I'm not sure how much you can blame on the old POTS. (When was the last time you dialed "O" for Operator? Does anybody answer? Did that cost you $5?)
