White Paper: Music Performance and Instruction over High-Speed IP Networks
November 2008

INTRODUCTION

High-speed IP networks are creating opportunities for new kinds of real-time applications that connect artists and audiences across the world. A new generation of audio-visual technology is required to deliver the exceptionally high quality needed to enjoy performances over IP networks. This paper discusses how the Manhattan School of Music (MSM) uses Polycom technology for music performance and instruction over the high-speed Internet2 network that connects schools and universities in the United States and other countries around the world.

There are three major challenges to transmitting high-quality music performances over IP networks: (1) true acoustic representation, (2) efficient and lossless compression and decompression that preserves the performance quality, and (3) recovery from IP packet loss in the network. This paper analyzes each of these three elements and provides an overview of the mechanisms developed by Polycom to deliver exceptional audio and video quality, and of the special modifications made for Music Mode in Polycom equipment.

Figure 1 - Pinchas Zukerman, at Canada's National Arts Centre, teaches a student in New York over video

MUSIC PERFORMANCE AND INSTRUCTION AT MANHATTAN SCHOOL OF MUSIC

Manhattan School of Music's use of live, interactive videoconferencing technology (ITU-T H.323 and H.320) for music performance and education began in 1996. At that time, world-renowned musician and MSM faculty member Pinchas Zukerman, violinist and composer, brought the concept of incorporating live, two-way audio-visual communication into the Zukerman Performance Program at the School. The idea behind its incorporation was the mutually beneficial arrangement of providing students with more lessons (i.e., access to instruction) while also accommodating the touring schedule of a world-class performing musician. Without such a technological opportunity, students were limited to having lessons with Zukerman when he was physically on campus. In prior academic years, this could amount to only one or two lessons per semester. Within a year of adopting videoconferencing for music education, Zukerman was offering his students regular videoconference lessons from Albuquerque to New Zealand. Figure 1 is a photograph of one of these sessions.

After the early success of the Zukerman Performance Program Videoconference Lesson Project, Manhattan School of Music further envisioned the potential of this technology-powered medium to support and develop many more aspects of the institution's educational and artistic mission. Through the development and creative use of broadband videoconferencing and related instructional technologies, Manhattan School of Music's newly instituted Distance Learning Program could provide access to artistic and academic resources that enhance student education in musical performance while heightening global community awareness of and participation in the musical arts. As the first conservatory in the United States to use videoconferencing technology for music education, Manhattan School of Music could feature its Distance Learning initiatives to build audiences for the future; to preserve and expand musical heritage; to foster leadership, creativity, and technical innovation in support of new performance and educational opportunities; and to facilitate cross-cultural communication, understanding, and appreciation through music.

APPLICATION EXAMPLES

Today, the Manhattan School of Music Distance Learning Program reaches over 1,700 students each academic year, including learners from 27 of the 50 United States and 16 other countries to date. Musical applications in classical and jazz music include regular public lessons referred to as "master classes"; private lessons; jazz clinics on performance-related techniques; instrumental and vocal coaching on musical interpretation; sectional rehearsals for large ensemble performances; workshops on musically related topics such as 'sports medicine' for musicians; professional development for faculty; educational and community outreach music programs for K-12 schools, libraries, and hospitals around the country; composer colloquia with notable composers around the world; panel discussions on diverse themes and topics, such as copyright protection for musicians in the digital age or an anniversary-year celebration of a notable composer such as Leonard Bernstein; and "mock" audition sessions designed to help aspiring young musicians prepare for live auditions with symphony orchestras or big band ensembles.
More recent applications currently under development include simultaneous Webcasting of live videoconference exchanges, "telementoring" sessions on career and professional advancement, and remote auditioning from far-reaching locales such as Beijing and Shanghai in the People's Republic of China. Since over 40 percent of the Manhattan School of Music student body comes from outside of the United States, having the opportunity to audition for Manhattan School of Music through live, interactive videoconferencing from one's home country would provide a significant savings for students and their families, and also open up the opportunity for talented, yet economically-disadvantaged, students to audition for one of the world's leading music conservatories.

Figure 2 illustrates the use of interactive videoconferencing technology for live music instruction between student and master teacher at remote locations. Both images show public lessons or master classes within a concert hall setting that enable large groups of people to observe and participate in these learning and performance exchanges.

Figure 2 - Classical Programs at MSM

MSM integrates professional audio and video technology with the Polycom HDX codec to create an optimal, high-performance virtual learning environment. This technology includes HD video projection and cameras, as well as professional-grade condenser audio microphones through an external digital audio mixing console. Audio and video signals are then captured into DV format for future use, reference, and access such as MPEG-2 (DVDs) or H.264 Web streaming.

Both of the images in Figure 2 show world-renowned classical musicians teaching talented young professionals through interactive videoconferencing. In the left-hand image, Maestro Pinchas Zukerman, a pioneer in the use of videoconferencing for music performance, is teaching a student located at the Manhattan School of Music from Ottawa, Ontario, where he maintains the post of Artistic Director of Canada's National Arts Centre Orchestra. On the right of Figure 2, Thomas Hampson, renowned American baritone and leading singer at the Metropolitan Opera House, is teaching a student at the Curtis Institute of Music in Philadelphia from the Manhattan School of Music campus. These images demonstrate the ability of music institutions to both import and export artistic resources between remote sites.
Similarly, in the field of jazz music, the great jazz pianist Kenny Barron and saxophonist David Liebman (Figure 3) teach talented jazz students in remote locations in Canada from Manhattan School of Music's newly established William R. and Irene D. Miller Recital Hall, fully equipped with HD videoconferencing capabilities.

Figure 3 - Jazz programs at MSM

Presently, Manhattan School of Music sees no end to the possible uses and applications of videoconferencing technology for music performance and education. The field of music has already demonstrated the need, use, and desire to use cutting-edge technology such as videoconferencing to teach and reach new audiences around the globe. Ultimately, live music performance is a shared, participatory, community-oriented activity which gives outlet and meaning to human expression, and therefore should not be limited by physical or geographic boundaries.

TECHNICAL CHALLENGES

The previously mentioned applications cover the two broad areas of applied music, as well as thematic or topically-driven music programs. Both types require a highly interactive virtual environment. Within the context of a music lesson, the teacher and student engage in constant and rapid-fire exchanges of speaking, playing, and gesturing—and they do all of this simultaneously or over one another. Musicians are trained to process multiple, simultaneous sensory stimuli, and therefore, as they hone this skill, they come to demand this capability within a working, learning, or teaching environment. The "rules" of the exchange are different from speech etiquette and therefore test the responsiveness of full-duplex audio to a high degree.

What would be required in order to achieve educationally useful and beneficial exchanges and performances in music? Here's the list:

- A seamless virtual environment conducive to learning/teaching/performing (low latency)
- True, accurate sonic representation of the acoustical properties of sound and music
- Functional and expressive elements of music
  - Functional: melody, harmony (pitches), meter (rhythm), and form (composition)
  - Expressive: timbre, dynamics, tempo, articulation, and texture, and the contrasts therein
- Stereo sound and acoustic echo cancellation
- Full-frequency response (20 Hz – 22 kHz)—the full human auditory bandwidth, including harmonic and non-harmonic overtones of fundamental tones

Despite the Distance Learning Program's continual growth and progress, the underlying videoconferencing technology was adversely affecting the quality of live, interactive transmissions for music performance and education. The Program was growing, but not to its full potential, given the ongoing requests and demonstrated need for music videoconferencing programs around the globe. MSM's program relied heavily on an interactive technology that was designed for applications with speech, not musical sound. In fact, the acoustical properties of musical sound demand different technological requirements than speech—so much so that, without the necessary modifications, musical sound transmitted through videoconferencing systems would never be truly satisfactory for music performance and training at a high level. Some of the myriad issues and problems the program faced in previous generations of video codecs included:

- Limited frequency bandwidth (up to 7 kHz)
- Mono instead of stereo sound
- Low bit rate sound quality (8-bit)
- Lack of dynamics and dynamic range
- Muting or loss of sound transmission
- Network and codec latency resulting in compromised audio-visual communication
- Echo cancellation artifacts including noise, echo, and 'squealing' sounds

MSM did experiment with other codec solutions, given its access and membership to the Internet2 community and Abilene network; at the same time, however, it was also seeking a solution that would be widely deployed, standardized, reliable, efficient, cost-effective, and easy to use.

MSM approached Polycom engineering with this particular conundrum: the inherent incompatibility of musical sound with standard video codecs. Through discussion and study of MSM's unique application, engineering indicated that, with modifications made to certain underlying advanced acoustic technologies in the audio codec, MSM's special needs and requirements could potentially be met. To test this theory, Polycom engineers and MSM collaborated on a series of live music tests and experiments with different musical instruments and ensembles. They found that these newly incorporated alterations produced very promising results. Further modifications were tested, and the final outcome of these tests was the creation of a special audio feature set, deployed in the Polycom VSX and Polycom HDX codec lines, called Music Mode—a system specially designed for the transmission of live, interactive, acoustic music performances and exchanges.

AUDIO AND VIDEO TRANSMISSION BASICS

Due to bandwidth limitations, sending uncompressed audio and video is not an option in most IP networks today. Audio and video streams have to be encoded (compressed) at the sender side to reduce the network bandwidth used. At the receiver side, the real-time audio-video stream has to be decoded and played at full audio and video quality. Figure 4 summarizes the encoding and decoding concept. The critical element here is the equipment's ability to preserve a high audio quality of 50 Hz to 22 kHz and a high video quality of at least HD 720p.

Figure 4 - Audio and video transmission basics

IP networks inherently lose packets when there are bottlenecks and congestion, and mechanisms for recovery from packet loss in the IP network are required to deliver a high-quality performance to the receiver. Compressed audio and video are packetized in so-called Real-time Transport Protocol (RTP) packets, and then transmitted over the IP network. The Lost Packet Recovery (LPR) mechanism, developed by Polycom to overcome this issue, is discussed later in this paper.

ADVANCED AUDIO CODING

The standard for voice transmission quality was set about 120 years ago with the invention of the telephone. Based on the technical capabilities at the time, it was decided that transmitting acoustic frequencies from 300 Hz to 3300 Hz is sufficient for a regular conversation. Even today, basic narrow-band voice encoders, such as ITU-T G.711, work in this frequency range, and are therefore referred to as 3.3 kHz voice codecs. Another important characteristic of a voice codec is the bit rate. For example, G.711 has a bit rate of 64 kbps; that is, transmitting voice in G.711 format requires a network bandwidth of 64 kbps (plus network protocol overhead). Other narrow-band codecs are G.729A (3.3 kHz, 8 kbps), G.728 (50 Hz – 3.3 kHz, 16 kbps), and AMR-NB (3.3 kHz, 4.75 – 12.2 kbps).
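As a concrete illustration of the "plus network protocol overhead" remark above, the short sketch below estimates the on-the-wire rate of a G.711 stream. The 20 ms packetization interval is an assumed, typical value rather than a figure from this paper.

# Rough on-the-wire bandwidth for a G.711 voice stream carried in RTP/UDP/IPv4.
# The 20 ms packetization interval is an assumed typical value, not from this paper.
CODEC_RATE_BPS = 64_000          # G.711 payload bit rate
PACKET_INTERVAL_S = 0.020        # audio carried per packet (assumption)
HEADER_BYTES = 20 + 8 + 12       # IPv4 + UDP + RTP headers

payload_bytes = CODEC_RATE_BPS * PACKET_INTERVAL_S / 8       # 160 bytes per packet
packets_per_second = 1 / PACKET_INTERVAL_S                   # 50 packets per second
wire_rate_bps = (payload_bytes + HEADER_BYTES) * 8 * packets_per_second
print(f"G.711 on the wire: {wire_rate_bps / 1000:.0f} kbps") # ~80 kbps before link-layer framing

In other words, the 64 kbps codec consumes roughly 80 kbps of IP bandwidth once the per-packet headers are counted.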
Advanced audio coding went far beyond basic voice quality with the development of wideband codecs that support 7 kHz, 14 kHz, and most recently 22 kHz audio. ITU-T G.722 was the first practical wideband codec, providing 7 kHz audio bandwidth at 48 to 64 kbps. Polycom entered the forefront of audio codec development with Siren 7 (7 kHz audio at 16 – 32 kbps), standardized by ITU-T as G.722.1 (7 kHz, 16/24/32 kbps). Figure 5 shows the most popular codecs.

Figure 5 - Advanced audio encoding

SIREN 22 HIGHLIGHTS

The patented Polycom Siren 22 algorithm offers breakthrough benefits compared to earlier super-wideband audio technology. The technology, which is offered with royalty-free licensing terms, provides CD-quality audio for better clarity and less listener fatigue in audio and visual communications applications. Siren 22's advanced stereo capabilities make it ideal for acoustically tracking an audio source's movement and help deliver an immersive experience.

Siren 22 Stereo covers the acoustic frequency range up to 22 kHz. While the human ear usually does not hear above 16 kHz, the higher upper limit in Siren 22 delivers additional audio frequencies that are especially important for music. Siren 22 offers 40-millisecond algorithmic delay using 20-millisecond frame lengths for natural and spontaneous real-time communication. Siren 22 stereo requires relatively low bit rates of 64, 96, or 128 kbps—leaving more bandwidth available for improved video quality.

Siren 22 requires 18.4 MIPS (Million Instructions Per Second) for encoder plus decoder operation, compared to 100+ MIPS for competing algorithms such as MP3 or MPEG-4 AAC-LD. Therefore, Siren 22 can be used with lower-cost processors that consume less battery power, for example, in PDAs, mobile phones, or even wrist watches.
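To make the frame-length and bit-rate figures above concrete, the sketch below works out how much encoded audio each 20 ms Siren 22 frame carries at the three quoted stereo bit rates. Packing one frame per RTP packet is an assumption made for illustration, not a statement about the Polycom implementation.

# Bytes of encoded audio per 20 ms Siren 22 frame at the bit rates quoted above.
FRAME_LENGTH_S = 0.020                     # 20 ms frames (from the text)
for bit_rate_kbps in (64, 96, 128):        # Siren 22 stereo rates (from the text)
    frame_bytes = bit_rate_kbps * 1000 * FRAME_LENGTH_S / 8
    print(f"{bit_rate_kbps} kbps -> {frame_bytes:.0f} bytes per frame")
# 64 kbps -> 160 bytes, 96 kbps -> 240 bytes, 128 kbps -> 320 bytes.
# One frame per RTP packet (an assumption) would add ~40 bytes of RTP/UDP/IPv4
# headers per frame, i.e. about 16 kbps of overhead at 50 packets per second.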
Siren 22 handles speech, music, and natural sounds with equal ease. Most codecs listed in Figure 5 are designed for voice and break up when presented with natural sounds or music. Table 1, below, focuses on audio codecs that are designed for music and natural sounds.

Table 1 - Audio codecs comparison

MP3 is an audio codec that is well known because of its use in portable music players. This codec is not designed for real-time audio-visual communication and features relatively high delay and bit rate. MPEG-4 AAC-LD is a newer audio codec that lowered the delay (hence LD, for Low Delay) but has even higher complexity than MP3. The higher complexity means that more powerful and more expensive hardware is required to run the AAC codec, which makes it less suitable for telephones and mobile devices.

G.719 HIGHLIGHTS

The new ITU-T G.719 full-band codec is based on Polycom Siren 22 and Ericsson's advanced audio techniques. The G.719 designation gives this codec high visibility and signifies the importance of the embedded technology. In its decision, ITU-T cited the strong and increasing demand for audio coding providing the full human auditory bandwidth. Conferencing systems are increasingly used for more elaborate presentations, often including music and sound effects, both of which occupy a wider audio bandwidth than that of speech. In today's multimedia-enriched presentations, playback of audio and video from DVDs and PCs has become common practice. New telepresence systems provide high definition (HD) video and audio quality to the user, and require high-quality media delivery to create the immersive experience. Extending the quality of remote meetings helps reduce travel, which in turn reduces greenhouse gas emissions and limits climate change.

Table 1 highlights the close relationship between G.719 and Siren 22. Bandwidth and delay are similar but, most importantly, G.719 inherits the low complexity of Siren 22 and is therefore a great choice for implementation on telephones and mobile devices. In order to take audio quality to the next level, the G.719 maximum bit rate was increased to 256 kbps—compared to the Siren 22 maximum bit rate of 128 kbps—and now allows completely transparent coding of music material. For example, the dual-size transform used in G.719 is of great benefit for percussive music sounds. It is therefore expected that G.719 will provide even better quality than Siren 22 for music performances and instruction over IP networks. Polycom is committed to supporting G.719 in its product line—along with the well-established Siren 22, which guarantees interoperability with the installed base of endpoints and telephones.
ADVANCED ACOUSTIC TECHNOLOGIES

Polycom video endpoints, such as the Polycom HDX 9000 system, deliver the highest level of voice and video quality for applications such as video conferencing, telepresence, and vertical applications for the education, healthcare, and government markets. They implement key acoustic technologies, including Automatic Gain Control (AGC), Automatic Noise Suppression (ANS), Noise Fill, and Acoustic Echo Cancellation (AEC), that are very useful in voice communication.

Very soon we realized that the technologies that guarantee true music reproduction are often in conflict with the technologies developed to improve the user experience in a standard bi-directional voice call. We will explain how each of these mechanisms works for voice transmissions, and then focus on the modifications that had to be made in standard audio-visual communications equipment for the transmission of music performances (Music Mode).

AUTOMATIC GAIN CONTROL (AGC)

AGC is a technology that compensates for either audio or video input level changes by boosting or lowering incoming signals to match a preset level. There are a variety of options for implementing audio AGC in audio-visual equipment. Polycom systems have multiple AGC instances—including those for room microphones and for VCR/DVD input—and they are all turned on by default.

Here, we will focus on the AGC for the connected microphones. The function is activated by speech and music. It ignores white noise. For example, if a computer or projector fan is working close to the microphone, AGC will not ramp up the gain.

The Polycom AGC is very powerful, and automatically levels an input signal in the range of ±6 dB from nominal while covering distances of up to 12 feet / 3.6 meters (depending on the room characteristics and on the person talking). Nominal is the volume of an average person speaking in a normal voice about 2 feet / 60 cm away from the microphone. As a rule of thumb, 3 dB is twice the nominal signal strength (10^(3/10) = 10^0.3 = 1.995) and 6 dB is four times the nominal signal strength (10^(6/10) = 10^0.6 = 3.981). Figure 6 graphically represents the AGC concept.

Figure 6 - Automatic Gain Control

In the nominal case, AGC does not modify the gain; that is, the input volume is equal to the output volume. When the speaker moves farther away, the microphone picks up a lower volume, and the AGC in the audio-visual equipment has to amplify the volume to keep the output constant.

As we found in our research, however, AGC completely destroys the natural dynamic range of music. If the music volume increases, AGC kicks in and compensates by decreasing it. If the music becomes quiet, AGC automatically increases the volume. There is no good way to overcome this behavior, so we disabled AGC in Music Mode.
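A minimal sketch of the leveling idea described above: measure the short-term level of a block of samples against the nominal level, then boost or cut, clamped to the ±6 dB window. The nominal level, block size, and clamping details are illustrative assumptions, not the Polycom implementation.

import math

MAX_CORRECTION_DB = 6.0        # AGC window from the text: about +/-6 dB around nominal

def agc_gain(block, nominal_rms, max_db=MAX_CORRECTION_DB):
    """Return the linear gain that would bring one audio block back to nominal level.
    block is a list of samples; nominal_rms is the preset target level (assumed)."""
    rms = math.sqrt(sum(s * s for s in block) / len(block)) or 1e-12
    level_db = 20 * math.log10(rms / nominal_rms)          # how far we are from nominal
    correction_db = max(-max_db, min(max_db, -level_db))   # boost if quiet, cut if loud
    return 10 ** (correction_db / 20)

# Why this breaks music: a deliberate crescendo raises level_db, so the AGC cuts
# the gain and flattens the dynamics, which is why Music Mode disables AGC.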
AUTOMATIC NOISE SUPPRESSION (ANS) AND NOISE FILL

White noise is a mix of all frequencies. ANS detects white noise at the sender side and filters it out. The sender codec measures the signal strength when the audio volume is the lowest (for instance, when no one is speaking), and uses this value as a filter threshold for the automatic noise suppression algorithm.

Polycom ANS reduces background white noise by up to 9 dB, or about eight times (10^(9/10) = 10^0.9 = 7.943). If the signal energy is above the threshold, the gain stays at 1. If it is below the threshold, the gain is reduced by 9 dB, or about eight times.

Many noise suppression implementations make voice sound artificial, as it does on a cell phone. Sophisticated technical implementations, such as the one in Polycom endpoints, deliver artifact-free audio and make the noise suppression completely transparent, that is, not perceptible to users.

Also important is the algorithm's reaction time. For example, if a fan turns on, ANS should not take more than a few seconds to recalculate the new threshold and remove the noise. Note that while ANS is very useful for eliminating the noise of laptop and projector fans and vents, it does not eliminate clicks and snaps—this is done by a separate algorithm in Polycom systems (the keyboard auto-mute function).

Noise Fill is a receiver-side function and works like comfort noise in telephony—a low level of noise is added at the receiver side to reassure the user that the line is up. Noise Fill is necessary because all echo cancellers mute the microphones in single talk—when one side is talking and the other is not—so that the quiet side does not send any traffic over the network. To make sure users do not think the line is completely down, the receiver system adds some noise, and hides the fact that the other side is muted in single talk. Figure 7 provides a graphic representation of ANS and Noise Fill.

Figure 7 - ANS and Noise Fill

Unfortunately, ANS is not designed for music. The problem is that, with a sustained musical note—one that lasts for several seconds, for example—the ANS algorithm reacts as if it were noise and removes it. Imagine how ANS would destroy a piece of classical music. Adding noise at the receiver side would not help either—one would be far better off simply playing the audio that comes over the network. After analyzing ANS and Noise Fill, we concluded that both have to be disabled in Music Mode.
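The fragment below sketches the threshold behavior described above as a simple per-block noise gate. The block-energy measure and the way the threshold is tracked are assumptions made for illustration; a real ANS works at a much finer spectral level.

SUPPRESSION_DB = 9.0                               # reduction quoted in the text
SUPPRESSION_GAIN = 10 ** (-SUPPRESSION_DB / 10)    # power-domain factor, roughly 1/8

def ans_gain(block_energy, noise_floor_estimate):
    """Power-domain gain of a crude noise gate: pass a block whose energy is above
    the learned noise-floor threshold, attenuate anything at or below it by 9 dB."""
    if block_energy > noise_floor_estimate:
        return 1.0                                 # treated as speech or music
    return SUPPRESSION_GAIN                        # treated as background noise

# The failure mode for music: a long, quiet sustained note decays toward the
# learned threshold, gets classified as noise, and is attenuated, which is why
# Music Mode turns ANS (and Noise Fill) off.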
ACOUSTIC ECHO CANCELLATION (AEC)

Echo in telecommunications can have many causes: it may be a result of acoustic coupling at the remote site, or it may be caused by the communications equipment itself. Whatever the cause, it creates audio that is sent to the remote site and fed back through the speakers at the sender site. Through acoustic coupling between speakers and microphones at the sender site, this feedback is sent again through the created loop and adds even more echo on the line. AEC technology has to be deployed at the receiver site to filter out the echo. A graphical representation of the AEC concept is shown in Figure 8.

Figure 8 - Acoustic Echo Cancellation

Basic AEC characteristics include the operating range and the adaptive filter length. For systems that support the 50 – 22,000 Hz acoustic frequency range, AEC should be able to support the same range. Adaptive filter length is the maximum delay of the echo for which the system can compensate. Polycom leads this area, providing a maximum adaptive filter length of 260 ms.

An additional benefit of the Polycom implementation is that the AEC algorithm does not need any learning sequence; that is, Polycom does not send out white noise to train it. The AEC algorithm trains quickly enough on real voice and audio.

If the system supports stereo, it is important that the AEC can identify the multiple paths of the stereo loudspeakers, and quickly adapt to microphones that are moved (if you move the microphone, you change the echo path and the adaptive filter has to learn the new path). Our AEC implementation adapts within two words of speech; that is, echo comes back for a short time (one or two words) and then the canceller adjusts.

Musicians want to be able to play a sustained note (for example, by pressing the sustain pedal on a piano) and hear it all the way down, even 1 dB above the noise floor. If the Polycom HDX standard AEC is active during a music performance transmission, the receiver will get artifacts, and low notes can be cut out. Therefore, we developed special tunings in the echo canceller that prevent very quiet musical sounds from being cut out. Knowing that people will use Music Mode in a quiet studio environment, we changed the AEC thresholds to be much more aggressive than in standard mode.

Musicians care not only about the upper limit of 22 kHz but also about the ability to encode and decode lower frequencies. The Polycom Siren 22 encoder/decoder is flat in frequency response down to 20 Hz and even down to DC. However, the echo canceller in Polycom HDX video endpoints goes down to 50 Hz because of noise issues that crop up quite often in conference rooms. Manhattan School of Music is therefore using the Polycom SoundStructure discussed below in this paper. The echo canceller in SoundStructure goes all the way down to 20 Hz, and delivers the bass response that musicians seek.
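For readers unfamiliar with the adaptive filter mentioned above, here is a minimal, textbook-style NLMS echo canceller. It is an illustrative sketch, not Polycom's algorithm, and the filter length, step size, and sample rate are arbitrary choices.

import numpy as np

def nlms_echo_canceller(far_end, mic, num_taps=2048, mu=0.5, eps=1e-8):
    """Textbook NLMS acoustic echo canceller (illustrative only).
    far_end: samples sent to the loudspeaker (the reference signal)
    mic:     samples captured by the microphone (local talker plus echo)
    num_taps covers the echo delay the filter can model; for example, a 260 ms
    echo tail at 48 kHz would need about 12,480 taps."""
    taps = np.zeros(num_taps)                  # estimate of the echo path
    history = np.zeros(num_taps)               # most recent far-end samples
    cleaned = np.zeros(len(mic))
    for n in range(len(mic)):
        history = np.roll(history, 1)
        history[0] = far_end[n]
        echo_estimate = taps @ history         # predicted echo at the microphone
        error = mic[n] - echo_estimate         # local signal plus residual echo
        taps += (mu / (eps + history @ history)) * error * history  # NLMS update
        cleaned[n] = error
    return cleaned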
INSTALLED AUDIO

Installed audio systems provide superior audio in conference rooms, auditoriums, classrooms, meeting rooms, courtrooms, and concert halls. When connected to audio-video endpoints such as Polycom HDX systems, they enhance the acoustic capabilities. Figure 9 shows the Polycom SoundStructure installed audio system, which supports up to 16 microphones and speakers.

Figure 9 - Installed audio configuration

SoundStructure allows Manhattan School of Music to deliver the applications listed above to larger rooms while keeping the full 22 kHz stereo audio and enjoying advanced stereo echo cancellation.

Perfect interworking between the Polycom HDX video endpoint and SoundStructure is assured through a fully digital connection between the two (the C-Link interface) that allows bi-directional exchange of audio as well as shared mute and volume control. There is no need to configure the two components to work with each other—an auto-discovery mechanism makes any manual configuration unnecessary.

If the loudspeaker and microphone tonality is not correct, the equalization function in SoundStructure can improve the sound quality, thereby improving the music. For instance, if the loudspeakers are weak in the treble, the equalizer can correct for this deficiency with a treble boost.

One big advantage of SoundStructure is that it provides a separate stereo echo canceller for 8 microphones (one can also expand in multiples of 8 by cascading SoundStructure systems). Polycom HDX systems provide separate stereo echo cancellers for up to two external microphones for capturing audio from music ensembles. Often a separate microphone is used to pick up each instrument.

ADVANCED HIGH DEFINITION VIDEO TECHNOLOGY

While it is secondary to audio, video plays an important role in transmitting music performances. The ability to see the performers in high resolution enhances the experience, as does the ability to zoom in on a solo player or to see additional content related to the performance.

First, let's look at the changed expectations for video quality. The vertical axis of the diagram in Figure 10 shows the video quality, while the horizontal axis shows the bandwidth necessary to transmit video at this quality across the network.

Figure 10 - Video Quality and Required Network Bandwidth

The Common Intermediate Format (CIF) has a resolution of 352 x 288 pixels. Current video technology allows CIF-quality video to be transmitted at 30 fps over 56 kbps (256 kbps delivers good video quality when there is a lot of movement). Standard Definition (SD) is the quality of a regular TV and has a resolution of 704 x 480 pixels. Current video technology allows transmitting SD-quality video at 30 fps over 256 kbps (512 kbps delivers good video quality with a lot of movement).

The latest generation of products supports High Definition (HD), starting with 720p HD, that is, 1280 x 720 pixels with progressive scan. Current video codec technology allows transmitting HD-quality video at 30 fps over 1 – 2 Mbps. HD 720p at 60 fps requires at least 2 Mbps. In June 2008, Polycom demonstrated its HD 1080p technology at InfoComm and is embedding it in the product portfolio. The required bandwidth for an HD 1080p connection is around 3 Mbps.
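The helper below simply restates the call rates quoted above as a lookup, so that the best format a given call rate can carry is easy to read off. The table values come from the figures in the text; the function itself and its tie-breaking are illustrative assumptions.

# (resolution, fps) -> minimum call rate in kbps quoted in the text above
MIN_CALL_RATE_KBPS = {
    ("CIF", 30): 56,
    ("SD", 30): 256,
    ("720p", 30): 1000,
    ("720p", 60): 2000,
    ("1080p", 30): 3000,
}

def best_video_format(call_rate_kbps: int):
    """Pick the most demanding format that still fits the call rate (illustrative)."""
    affordable = [fmt for fmt, rate in MIN_CALL_RATE_KBPS.items() if rate <= call_rate_kbps]
    return max(affordable, key=lambda fmt: MIN_CALL_RATE_KBPS[fmt]) if affordable else None

print(best_video_format(1500))   # ('720p', 30) with the figures above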
FAR END CAMERA CONTROL

Far End Camera Control (FECC) is a feature that allows controlling the video camera at the remote site—assuming that the remote site supports the feature and uses a PTZ (pan, tilt, zoom) camera. Figure 11 describes the functionality and lists the technical parameters of the Polycom EagleEye HD camera.

Figure 11 - PTZ Camera Control

The feature can be used, for example, to see a particular group of musicians within an orchestra, or to focus on a solo player. Technologies for automatic detection of the audio source will allow the system to focus automatically on a solo player and then quickly refocus on another musician or on the whole orchestra, based on intelligent audio source detection algorithms. More detail about the way FECC is implemented in H.323 and SIP is available in the Polycom white paper Migrating Visual Communications from H.323 to SIP.

DUAL STREAM FOR CONTENT SHARING

Video systems today support two parallel video channels. While the "live" channel is used for the live video, the "content" channel is used for sending text, pictures, or videos associated with the "live" channel. When transmitting live music performances, the content channel can be used for transmitting any supporting information related to the performance—pictures, background videos, or, in the case of opera in a foreign language, subtitles in the local language.

Dual Video Streams allows a "presentation" (sometimes also called "content") audio-video stream to be created in parallel to the primary "live" audio-video stream. This second stream is used to share any type of content: slides, spreadsheets, X-rays, and video clips, for example. Polycom's pre-standard version of this technology is called Polycom People+Content IP technology. H.239 is heavily based on intellectual property from the People+Content technology and became the ITU-T standard that allows interoperability between different vendors. Figure 12 describes this functionality.

Figure 12 - Dual Stream Function

While the function works well on single-monitor systems, it is especially powerful in multi-screen setups (video endpoints can support up to four monitors). In Figure 12, the "presentation" stream is created in parallel to the "live" stream, and the content is displayed on the left screen of the receiver system.

The benefit of this functionality is that users can share not just slides or spreadsheets but also moving images: Flash video, movie clips, commercials, and so on. The "presentation" channel has flexible resolution, frame rates, and bit rates. For dynamic images, it can support full High-Definition video at 30 frames per second; for static content, such as slides, it can work, for example, at 3 frames per second and save bandwidth in the IP network. Another major benefit of using a video channel for content sharing is that the media is encrypted (by AES in H.323 and by SRTP in SIP). In addition, once firewall and NAT traversal works for the "live" stream, it works for the "presentation" channel as well, and there is no need for a separate traversal solution. More detail about the way Dual Stream is implemented in H.323 and SIP is available in the Polycom white paper Migrating Visual Communications from H.323 to SIP.

The above video technologies in concert can not only enhance the experience of traditional music performances but also allow for experimentation with new art forms.

GIVING AUDIO HIGH PRIORITY

In standard video applications, audio quality is automatically reduced when the total available bandwidth for the video call decreases. For example, if a Polycom HDX system places a video call at 1 Mbps, it will automatically set the audio to the best quality: Siren 22 at 128 kbps. If the video call is at 256 kbps – 1 Mbps, HDX will use a slightly lower quality: Siren 22 at 96 kbps. Audio quality is only Siren 14 at 48 kbps if the total bandwidth for the video call is below 256 kbps. Figure 13 depicts this scenario and the audio settings.

Figure 13 - High Priority Audio

When music is transmitted, audio must stay at the best quality level no matter how much total bandwidth is assigned to the video call; therefore, a Polycom HDX system in Music Mode keeps using the highest quality (Siren 22 at 128 kbps) down to 256 kbps calls. This makes sure that the audio gets high priority and video remains secondary in a Music Mode call.
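The audio-priority rule described above is easy to express as a small decision function. The sketch below encodes the quoted thresholds; the exact boundary handling (inclusive versus exclusive) is an assumption.

def select_audio_codec(call_rate_kbps: int, music_mode: bool) -> tuple[str, int]:
    """Return (codec, audio bit rate in kbps) following the behavior described above.
    Boundary handling at exactly 256 kbps and 1 Mbps is an assumption."""
    if music_mode and call_rate_kbps >= 256:
        return ("Siren 22", 128)          # Music Mode holds the best audio quality
    if call_rate_kbps >= 1000:
        return ("Siren 22", 128)
    if call_rate_kbps >= 256:
        return ("Siren 22", 96)
    return ("Siren 14", 48)

print(select_audio_codec(512, music_mode=False))  # ('Siren 22', 96)
print(select_audio_codec(512, music_mode=True))   # ('Siren 22', 128)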
PACKET LOSS AND LOST PACKET RECOVERY (LPR)

Packet loss is a common problem in IP networks because IP routers and switches drop packets when links become congested and when their buffers overflow. Real-time streams, such as voice and video, are sent over the User Datagram Protocol (UDP), which does not retransmit packets. Even if a UDP/IP packet gets lost, retransmission does not make much sense, since the retransmitted packet would arrive too late; playing it would destroy the real-time experience.

Lost Packet Recovery (LPR) is a new method of video error concealment for packet-based networks, and is based on forward error correction (FEC). Additional packets that contain recovery information are sent along with the original packets in order to reconstruct the packets that were lost during transmission.

For example, suppose you have 2% packet loss at 4 Mbps (losing approximately 8 packets for every 406 packets in a second). After engaging LPR, the packet loss can be reduced to less than 1 packet per 5 minutes, or 0.00082%. (1 packet per 5 minutes = 1 packet / (406 packets/sec × 300 sec) = 1 packet / 121,800 packets = 0.00082%.)

Figure 14 depicts the functions in LPR—both sender side (top) and receiver side (bottom).

Figure 14 - Lost Packet Recovery

LPR includes a network model that estimates the amount of bandwidth the network can currently carry. This network model is driven by the received-packet and lost-packet statistics. From the model, the system determines the optimal bandwidth and the strength of the FEC protection required to protect the media flow from irrecoverable packet loss. This information is fed back to the transmitter through the signaling channel, which then adjusts its bit rate and FEC protection level to match the measured capacity and quality of the communication channel. The algorithm is designed to adapt extremely quickly to changing network conditions—such as cross congestion from competing network traffic over a wide area network (WAN).

The LPR encoder takes RTP packets from the RTP Tx channel, encapsulates them into LPR data packets, and inserts the appropriate number of LPR recovery packets. The encoder is configured by the signaling connection, usually when the remote LPR decoder signals that it needs a different level of protection.

The LPR decoder takes incoming LPR packets from the RTP Rx channel. It reconstructs any missing LPR data packets from the received data packets and recovery packets. It then removes the LPR encapsulation (thereby converting them into the original RTP packets) and hands them back so they can be processed and forwarded to the video decoder. The decoder has been optimized for compute and latency.

LPR has advanced dynamic bandwidth allocation (DBA) capabilities. Figure 15 illustrates the behavior.

Figure 15 - LPR DBA Example

When packet loss is detected, the bit rate is initially dropped by approximately the same percentage as the packet loss rate. At the same time, FEC is turned on and recovery packets begin to be inserted. This two-pronged approach provides the fastest restoration of the media streams affected by loss, ensuring that there is little or no loss perceived by the user. The behavior can be modified by the system administrator through configuration.

When the system determines that the network is no longer congested, it reduces the amount of protection, and ultimately goes back to no protection. If the system determines that the network congestion has lessened, it will also increase the bandwidth back towards the original call rate (up-speed). This allows the system to deliver media at the fastest rate that the network can safely carry. Of course, the system will not increase the bandwidth above the limits allowed by the system administrator. The up-speed is gradual, and the media is protected by recovery packets while the up-speed occurs, which ensures that it has no impact on the user experience. In the example in Figure 15, up-speeds are in increments of 10% of the previous bit rate.

If the packet loss does not disappear, the system continues to monitor the network, and finds the protection amount and bandwidth that deliver the best user experience possible given the network conditions.
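The arithmetic in the example and in the dynamic bandwidth behavior above can be checked with a few lines. The figures are the ones quoted in the text, and the initial rate reduction simply restates the "drop by about the same percentage as the loss rate" rule.

# Check the packet-loss figures quoted above.
packets_per_second = 406           # ~4 Mbps video call
loss_rate = 0.02                   # 2% packet loss

lost_per_second = packets_per_second * loss_rate          # ~8 packets per second before LPR
residual_loss = 1 / (packets_per_second * 300)            # 1 packet per 5 minutes with LPR
print(f"before LPR: {lost_per_second:.1f} lost packets/s")
print(f"with LPR:   {residual_loss:.5%} residual loss")    # ~0.00082%

# First reaction of the dynamic bandwidth allocation: reduce the call rate by
# roughly the measured loss percentage, then protect the stream with FEC.
call_rate_kbps = 4000
reduced_rate_kbps = call_rate_kbps * (1 - loss_rate)       # ~3920 kbps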
A recent evaluation of Polycom Lost Packet Recovery (LPR) by independent analysts from Wainhouse Research (www.wainhouse.com) concluded the following: "While most of the video systems on the market today include some form of packet loss or error concealment capability, Polycom LPR is one of only two error protection schemes available today that uses forward error correction (FEC) to recover lost data. One of LPR's differentiators and strengths is that it protects all parts of the video call, including the audio, video, and content / H.239 channels, from packet loss."

CONCLUSION

Manhattan School of Music and Polycom have worked closely to set key parameters in the Polycom audio-visual equipment to support music performance and instruction over high-speed IP networks. The unique collaboration between musicians and audio engineers led to the creation of Music Mode in the Polycom HDX and Polycom VSX endpoints. In the administration Web page of the system, Music Mode is just one check mark. Clicking on it, users of Polycom video systems do not realize the amount of work and the wealth of technology know-how behind this simple option—nor should they: Music Mode users are musicians, not engineers. The only thing they need to care about is that, thousands of miles away, their notes are being played true to the original, without any distortions or artifacts.

ABOUT THE AUTHORS

Christianne Orto is Assistant Dean for Distance Learning and Director of Recording at the Manhattan School of Music. She has led the conservatory's distance learning program since its inception in 1996.

Stefan Karapetkov is Emerging Technologies Director at Polycom, Inc., where he focuses on visual communications market and technology analysis. He has spent more than 13 years in product management, new technology development, and product definition. His blog is http://videonetworker.blogspot.com/.

ACKNOWLEDGEMENTS

We would like to thank Peter Chu, Stephen Botzko, Minjie Xie, and Jeff Rodman from Polycom, and President Robert Sirota and Trustee Marta Istomin from MSM, for their contributions and support for this paper.

©2008 Polycom, Inc. All rights reserved. Polycom and the Polycom logo design are registered trademarks of Polycom, Inc. All other trademarks are the property of their respective owners. Information is subject to change without notice.