Dolby Voice: A Spatially Inclusive Conferencing Choice

Submitted by BDTI on Wed, 09/18/2013 - 22:02

Our team at BDTI includes people based in California, Colorado, Illinois, and Rhode Island. And our customers are all over the world. So, we all spend a lot of time in conference calls...often-frustrating conference calls.

Any of you in a similar situation will immediately know what I'm talking about. Conventional conferencing systems blend together mono narrowband signals coming from the various callers (with cellular voice being the most egregious sonic choice), outputting a mixed mono narrowband signal back to each participant. Background noise isn't appreciably suppressed, leading to abundant use of the "mute" switch (and plenty of subsequent "sorry, I was on mute and therefore talking to myself" comments and repeats). Clumsy, artificial communication styles are employed in an attempt to "take turns" and preclude "talking over each other." When multiple participants are talking at once, at best only the words of the loudest two or three (and usually only those of the loudest one) end up getting transmitted to the others. And even the sounds of those few most strident speakers tend to blend together, because unlike in real life, volume and tone are the only differentiators (Figure 1).

Figure 1. Folks aren't very happy with conventional conference call systems, judging from the results of a Dolby-sponsored survey.

Enter Dolby Labs. Senior Product Marketing Manager Jeff Smith uses a cocktail party analogy to explain what's missing. How is it that our ears and brains can more-or-less "tune in" to the voice of just one person, when plenty of others are also speaking around him or her, as well as around the listener? Locality is the key, according to Smith, and it's also the pivotal piece missing from conventional audio conferencing systems. Even if multiple people at a cocktail party are all speaking at the same time and at roughly the same volume, they're occupying unique locations in the 3-D space around each of us. As a result, we're able to focus our attention on just one (or several, if we're particularly adept) of them.

Dolby Labs, as many of you know, has extensive expertise in "placing" sounds within a virtual 3-D sound field generated by a set of headphones or pair of speakers, using amplitude, phase, and frequency shifts and other techniques derived from an understanding of how our ears and brains ("really strong signal processors," in Smith's words) discern audio source location. The company has adapted these same techniques for use with its Dolby Voice conference call technology, which lead partner British Telecom (BT) is in the process of productizing on a worldwide basis. BT's MeetMe works with any number of conference clients, conventional and Dolby Voice-cognizant (Figure 2).

Figure 2. British Telecom, the initial Dolby Voice implementer, is in the process of stitching together multi-O/S soft clients, server-side coordination software, and a planned dedicated-function and Dolby-branded conference room device.
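
The amplitude and phase cues mentioned above can be illustrated with a minimal Python sketch. To be clear, this is not Dolby's algorithm, which the company did not detail; it is a generic textbook combination of constant-power panning (a level cue) and the Woodworth approximation of interaural time difference (a timing cue). The function name `place_mono`, the head radius, and the default sample rate are all assumptions made for illustration.

```python
import math

HEAD_RADIUS_M = 0.0875    # assumed average head radius, in meters
SPEED_OF_SOUND = 343.0    # meters per second, in room-temperature air


def place_mono(samples, azimuth_deg, sample_rate=16000):
    """Return (left, right) channels for a mono signal placed at azimuth_deg.

    -90 is hard left, 0 is center, +90 is hard right. Hypothetical helper,
    not part of any Dolby or BT API.
    """
    az = math.radians(max(-90.0, min(90.0, azimuth_deg)))

    # Level cue: constant-power pan, so perceived loudness stays steady.
    pan = (az + math.pi / 2) / 2              # ranges from 0 to pi/2
    gain_l, gain_r = math.cos(pan), math.sin(pan)

    # Timing cue: Woodworth approximation of the interaural time difference.
    itd_s = HEAD_RADIUS_M / SPEED_OF_SOUND * (abs(az) + math.sin(abs(az)))
    delay = round(itd_s * sample_rate)        # whole samples of delay
    pad = [0.0] * delay

    near = list(samples) + pad                # ear closer to the source
    far = pad + list(samples)                 # ear farther away hears it later
    if az >= 0:                               # source on the right
        left = [gain_l * s for s in far]
        right = [gain_r * s for s in near]
    else:                                     # source on the left
        left = [gain_l * s for s in near]
        right = [gain_r * s for s in far]
    return left, right
```

Calling `place_mono(voice, 60)` yields a right channel that is both louder and earlier than the left. The sketch only covers frontal positions; distinguishing front from rear also depends on the frequency-dependent cues the article mentions, which are beyond a few lines of code.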

Each sound source's incoming bitstream is augmented at the conventional Linux server nexus (using another set of Dolby-developed software) with metadata indicating an assigned location in 3-D space. For very high participant counts, some degree of location "mixing" inevitably occurs, to minimize the per-client downstream bandwidth required. Simplistically, for a five-participant conference call, the following assignments might be encountered:

  • Right rear
  • Right front
  • Center
  • Left front
  • Left rear
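
One plausible way to arrive at an assignment like the five positions above is to spread callers evenly across an arc around the listener. The Python sketch below does exactly that. The function name `assign_azimuths`, the 240-degree arc, and the use of a single azimuth angle as the "metadata" are all assumptions for illustration; Dolby did not describe its actual placement logic or metadata format.

```python
def assign_azimuths(participants, arc_deg=240.0):
    """Map each participant to an azimuth in degrees, spread evenly over an arc.

    0 is straight ahead, negative is to the listener's left, positive to the
    right; magnitudes above 90 fall behind the listener. Hypothetical helper.
    """
    n = len(participants)
    if n == 0:
        return {}
    if n == 1:
        return {participants[0]: 0.0}       # a lone caller sits dead center
    step = arc_deg / (n - 1)
    return {
        name: -arc_deg / 2 + i * step
        for i, name in enumerate(participants)
    }


def describe(azimuth_deg):
    """Turn an azimuth into a coarse label like those in the list above."""
    if azimuth_deg == 0:
        return "center"
    side = "left" if azimuth_deg < 0 else "right"
    depth = "rear" if abs(azimuth_deg) > 90 else "front"
    return f"{side} {depth}"


positions = assign_azimuths(["Ann", "Bob", "Cy", "Dee", "Eve"])
for name, az in positions.items():
    print(f"{name}: {az:+.0f} degrees ({describe(az)})")
```

With five callers the angles come out at -120, -60, 0, +60 and +120 degrees, matching the left rear, left front, center, right front and right rear positions.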

Participants using conventional devices will hear only a traditional mono mixed presentation of the conference call. In contrast, participants running the Windows, Mac OS, Android or iOS MeetMe soft client, or using the upcoming Dolby-branded conference-calling device, receive not only each caller's audio but also corresponding per-caller location metadata. Dolby Voice-enabled participants will therefore benefit from location-derived speaker discernment. In either case, however, background noise suppression and other audio enhancements will produce a conference call experience that Smith claims is superior not only to conventional narrowband but also to more modern wideband (i.e., "HD Voice") approaches (Figure 3).

Figure 3. Dolby Voice users on traditional telephones still prefer it to conventional narrow- and wideband approaches, according to the company; the Dolby Voice preference further increases for clients with access to stereo headsets that can present the technology's spatial features.
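
For the participants on conventional devices, the mono presentation described above amounts to summing the other callers' signals into one channel. The Python sketch below shows a common conferencing approach known as "mix-minus," in which each listener's own voice is left out of the mix he or she receives. Whether the Dolby Voice server works this way is an assumption; the function name `mono_mix_for` and the dictionary-of-sample-lists layout are hypothetical.

```python
def mono_mix_for(listener, streams):
    """Sum every caller's samples except the listener's own into one channel.

    `streams` maps a caller name to a list of float samples in [-1.0, 1.0].
    The result is clamped to the same range to avoid overflow when several
    callers are loud at once. Hypothetical helper for illustration only.
    """
    others = [s for name, s in streams.items() if name != listener]
    if not others:
        return []
    length = max(len(s) for s in others)
    mix = [0.0] * length
    for stream in others:
        for i, value in enumerate(stream):
            mix[i] += value                  # shorter streams simply end early
    return [max(-1.0, min(1.0, v)) for v in mix]
```

Once the signals are summed this way, location information is gone for good, which is why the spatially aware clients instead receive per-caller audio along with its location metadata.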

All that's required to run the MeetMe soft client, aside from a device running an appropriate operating system, is a stereo analog or USB (i.e., digital) headset (perhaps obviously, a mono audio output won't allow for generation of the virtual 3-D soundfield). A digital headset is preferred, because it allows MeetMe to discern the headset manufacturer and model and subsequently implement additional audio pre- and post-processing to account for any microphone and/or speaker shortcomings Dolby has encountered.

In the interest of full disclosure, I'll note that Smith and I initially attempted to conduct the Dolby Voice briefing over a prototype MeetMe connection, but I was unsuccessful in joining the session, therefore necessitating a conventional call (narrowband cellular in my case, wideband VoIP in his). The fact that I was running Windows 7 as a virtual machine, on Mac OS 10.7 via VMware Fusion, may have been a factor. Dolby Voice itself didn't seem to be at the root of the problem, because the drivers ostensibly installed without issue and the subsequent pre-connection speaker and microphone tests I conducted worked fine. MeetMe is still under development by BT, prior to widespread deployment later this year, so hopefully any remaining wrinkles will be ironed out between now and then.

Speaking of schedules, to date Dolby has delivered Windows and Mac OS X client software to its premier partner; the Windows code is further along at BT, thereby explaining why I ran it in a virtual machine (not having access to any Windows-native hardware here). Android and iOS clients are scheduled to follow from Dolby by the end of this month. BT plans to productize its initial Windows- and Mac OS X-only MeetMe implementation in the fourth quarter, with mobile O/S client support following in the first quarter of next year. And sometime next summer, the Dolby-branded and BT-supplied conference room device, intended to be a functional superset of a conventional Polycom or equivalent IP conferencing product, is scheduled to begin shipping.

Dolby Voice, according to Smith, represents a fairly significant evolution of a prior-generation surround sound chat client called Dolby Axon, intended for computer gamers. Dolby Voice both decreases latency and increases quality, albeit at a bandwidth tradeoff; whereas the average downstream bandwidth with Dolby Axon was 16 Kbps, with Dolby Voice it's closer to 24 Kbps. Smith is quick to point out, however, that the algorithm is highly adaptive, based on the available bandwidth at each client, the number of clients participating in the call, and the aggregate sound characteristics at any point in time. During silent "lulls" in the conversation, for example, downstream bandwidth is near zero. The typical upstream bitrate generated by each client is on the order of 8-12 Kbps.
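
To put those bitrates in perspective, the short Python calculation below converts them to data volume for a one-hour call. It assumes that "Kbps" means 1,000 bits per second and that the quoted averages hold for the whole call, which is a simplification given the adaptive behavior Smith describes. The function name `call_megabytes` is made up for this example.

```python
def call_megabytes(kbps, minutes):
    """Data volume in megabytes (10**6 bytes) at a constant bitrate."""
    bits = kbps * 1000 * minutes * 60
    return bits / 8 / 1e6


rates = [
    ("Dolby Axon downstream", 16),
    ("Dolby Voice downstream", 24),
    ("Dolby Voice upstream, low end", 8),
    ("Dolby Voice upstream, high end", 12),
]
for label, kbps in rates:
    print(f"{label}: {call_megabytes(kbps, 60):.1f} MB per hour")
```

At these averages, an hour of Dolby Voice costs each client roughly 10.8 MB downstream, versus 7.2 MB for Dolby Axon, plus 3.6 to 5.4 MB upstream.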

For a Dolby Voice demonstration that's operating system- and browser-independent, requiring only installation of the Adobe Flash plugin, head to www.dolby.com/voice. There, you'll also find an informational video and various documentation items. Perhaps obviously, the company is highly motivated to sign up additional Dolby Voice licensees, a fact which leads to the final important piece of information: cost. While unwilling to quote specifics, Smith reassured me that an end-to-end Dolby Voice implementation, including the intermediary server, would cost the same as, if not less than, a conventional telecom bridge-based approach from Radisys or a competitor.

Dolby Voice is an imaginative means of using digital signal processing technology to improve the quality of increasingly common conference calls. Its widespread adoption is by no means assured at this early stage in its life; BT and Dolby will need to execute flawlessly on their planned rollouts of both the existing computer-based clients and the upcoming mobile operating system applications, for example, and Dolby will also need to sign up many more licensees. The technology's reliance on a dual-speaker setup to "place" conference call participants within a virtual 3-D soundfield is unfortunately also incompatible with the single transducers commonly found in handsets, headsets and speakerphones. Nonetheless, assuming the demo I auditioned reflects the shipping product's capabilities, it's an impressive achievement.
