The PNW Section met for a second time in October at Opus 4 Studios in Bothell, WA, for an interesting diversion comparing the perception of audio and video, presented by James (JJ) Johnston of Microsoft. Since the final target for almost all audio and video is a human being, understanding how humans perceive each kind of signal is paramount to understanding how to treat it. 18 AES members, including AES President-Elect Bob Moses, and 15 non-members attended.
JJ Johnston worked 26 years for AT&T Bell Labs and its successor, AT&T Research Labs. He is an IEEE Fellow and an AES Fellow. He was one of the inventors and standardizers of MPEG-1/2 Audio Layer 3 and MPEG-2 AAC. He received his BSEE and MSEE from Carnegie Mellon University in 1975 and 1976, respectively. Most recently he has been working in the area of auditory perception of soundfields, ways to capture and represent soundfield cues, and ways to expand the limited sense of realism available in standard audio playback for both captured and synthetic performances. He is a committee member of the AES PNW Section.
The first slide, comparing block diagrams of the auditory and visual perception systems, noted that audio perception has an immediate frequency analysis component that the eye does not have. Both systems then perform similar amplitude analysis and feature extraction. Thus, the eye is a weak time analyzer, and the ear is a good time and frequency analyzer. JJ continued with additional differences between the functions of the eye and ear: the ear analyzes a single variable (pressure), while the eye works with four variables, RGB (red/green/blue) color plus luminance (brightness), and its response changes with direction.
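As a rough illustration of that first point (not an example from the talk), the Python sketch below uses a Fourier transform as a crude stand-in for the ear's built-in frequency analysis; the signal and its component frequencies are invented for the example.

    import numpy as np

    # Crude stand-in for the cochlea: frequency analysis of a single
    # variable (sound pressure) -- an operation the eye has no analog for.
    fs = 48000                                  # sample rate in Hz (assumed)
    t = np.arange(fs) / fs                      # one second of signal
    pressure = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 1000 * t)

    spectrum = np.abs(np.fft.rfft(pressure))    # magnitude spectrum
    freqs = np.fft.rfftfreq(len(pressure), d=1 / fs)

    # The mixture is opaque in the time domain but separates immediately
    # in frequency: two clean peaks, at 440 Hz and 1000 Hz.
    for f in freqs[spectrum > 0.1 * spectrum.max()]:
        print(f"component at {f:.0f} Hz")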
He said that most theories of auditory response recognize 3 levels of memory as sounds are recognized. These are sequential processing steps with different persistence times: loudness memory (about 200 ms); feature memory (seconds); and long-term memory. Audio is not persistent like a picture, but becomes encoded in the brain in a highly compressed way as "audio features" extracted by a temporal analysis, whereas you can extract information from a still picture via a spatial analysis, for which there is no analog in audio. Auditory glitches, such as a brief burst of digital noise, are recognized strongly because of the ear's frequency analysis. Video glitches, such as a repeated TV frame, typically go almost unnoticed. Dynamic range response may be similar for the eye and ear, but the ear adapts to sounds in at most a few seconds, while the eye can take several minutes to adapt. On an energy-versus-neural-impulses basis, the ear is measurably more non-linear in response than the eye. Despite this, 16 bits are required for acceptable digital audio because of the strong frequency selectivity of the ear, while a non-uniform quantization of 8-10 bits suffices for vision.
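To make that last contrast concrete, here is a small Python sketch (again, not from the talk) comparing uniform quantization, as used for linear PCM audio, with a non-uniform, gamma-style quantization of the kind that lets vision get by with 8-10 bits; the power-law companding curve and the error metric are illustrative assumptions.

    import numpy as np

    def quantize_uniform(x, bits):
        # Uniform quantizer on [0, 1] -- analogous to linear PCM audio.
        levels = 2 ** bits - 1
        return np.round(x * levels) / levels

    def quantize_companded(x, bits, gamma=2.2):
        # Non-uniform quantizer: compress through a power law before
        # quantizing, expand afterwards, so small values get finer steps.
        levels = 2 ** bits - 1
        code = np.round(x ** (1 / gamma) * levels) / levels
        return code ** gamma

    x = np.linspace(0.01, 0.1, 10000)   # the dark tenth of the range

    for name, q in [("uniform  ", quantize_uniform(x, 8)),
                    ("companded", quantize_companded(x, 8))]:
        worst = (np.abs(q - x) / x).max()      # worst-case relative error
        print(f"8-bit {name}: worst relative error {worst:.1%} in the darks")

At the same 8 bits, the companded version keeps the relative error in the dark range several times smaller than the uniform one, which is the sense in which 8-10 non-uniform bits suffice for vision.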
Both systems have 2 receptors for stereo response - the ear uses time and level differences between the 2 ears to derive directional information, while the eyes can use parallax for distance cues, with direction cues intrinsic to the 2D image. So in the eye, direction is primary and distance secondary, while with the ear, direction is secondary and distance tertiary. The eye's color response also varies with brightness level, with no equivalent in the ear. A discussion began on how some people report "hearing" colors, a condition known as synesthesia.
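As one concrete example of the time-difference cue, the sketch below evaluates the simplified Woodworth formula for interaural time difference; the choice of formula, the head radius, and the speed of sound are textbook assumptions, not values given at the meeting.

    import numpy as np

    def itd_seconds(azimuth_deg, head_radius=0.0875, c=343.0):
        # Simplified Woodworth model for a distant source:
        # ITD = (r / c) * (theta + sin(theta)), with azimuth 0 straight
        # ahead and 90 degrees directly to one side.
        theta = np.radians(azimuth_deg)
        return head_radius / c * (theta + np.sin(theta))

    for az in (0, 15, 45, 90):
        print(f"azimuth {az:2d} deg -> ITD {itd_seconds(az) * 1e6:3.0f} us")

The largest difference, roughly 650 microseconds at 90 degrees, is the kind of cue the ear turns into a sense of direction.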
Gamma correction, the non-linear light-dark response curve applied to video, works because harmonic distortion doesn't matter to the eye, which has no frequency analyzer. A similar non-linearity applied to audio would be quite audible; the ear requires linear capture and reproduction.
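To see why the same trick fails for audio, this sketch (an illustration, not from the talk) passes a pure tone through a gamma-style power-law curve and measures the harmonic distortion that results; the tone frequency and gamma value are assumptions.

    import numpy as np

    fs, f0 = 48000, 1000                # sample rate and test-tone frequency
    t = np.arange(fs) / fs
    tone = np.sin(2 * np.pi * f0 * t)

    # Gamma-style power-law transfer curve, applied symmetrically so it
    # can act on a bipolar audio signal.
    gamma = 2.2
    distorted = np.sign(tone) * np.abs(tone) ** (1 / gamma)

    # With a 1-second signal at fs Hz, FFT bin k corresponds to k Hz.
    spectrum = np.abs(np.fft.rfft(distorted))
    fundamental = spectrum[f0]
    for n in (3, 5, 7):                 # the symmetric curve adds odd harmonics
        db_down = 20 * np.log10(fundamental / spectrum[n * f0])
        print(f"harmonic {n} ({n * f0} Hz): {db_down:4.1f} dB below the fundamental")

The eye never resolves such added components as separate entities; the ear's frequency analysis picks them out directly, which is why audio must stay linear.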
Finally, for audio, keeping the time domain intact is important; for vision, the spatial domain matters more and time matters less.
A snack break and prize drawing were then held. About 14 fortunate attendees walked away with swag ranging from CDs to earplugs to lanyards, plus a color photo provided by JJ Johnston.
JJ fielded some Q & A afterwards. Special thanks to Dr. Michael Matesky's Opus 4 Studios for hosting the event.
Reported by Gary Louie, PNW Section Secretary