White Paper: Music Performance and Instruction over High-Speed IP Networks
November 2008

INTRODUCTION

High-speed IP networks are creating opportunities for new kinds of real-time applications that connect artists and audiences across the world. A new generation of audio-visual technology is required to deliver the exceptionally high quality needed to enjoy performances over IP networks. This paper discusses how the Manhattan School of Music (MSM) uses Polycom technology for music performance and instruction over the high-speed Internet2 network that connects schools and universities in the United States and other countries around the world.

There are three major challenges to transmitting high-quality music performances over IP networks: (1) true acoustic representation, (2) efficient and lossless compression and decompression that preserves the performance quality, and (3) recovery from IP packet loss in the network. This paper analyzes each of these three elements and provides an overview of the mechanisms developed by Polycom to deliver exceptional audio and video quality, and of the special modifications made for Music Mode in Polycom equipment.

Figure 1 - Pinchas Zukerman, at Canada's National Arts Centre, teaches a student in New York over video

MUSIC PERFORMANCE AND INSTRUCTION AT MANHATTAN SCHOOL OF MUSIC

Manhattan School of Music's use of live, interactive videoconferencing technology (ITU-T H.323 and H.320) for music performance and education began in 1996. At that time, world-renowned musician and MSM faculty member Pinchas Zukerman, violinist and composer, brought the concept of incorporating live, two-way audio-visual communication into the Zukerman Performance Program at the School. The idea behind its incorporation was the mutually beneficial arrangement of providing students with more lessons (i.e., access to instruction) while also accommodating the touring schedule of a world-class performing musician. Without such a technological opportunity, students were limited to having lessons with Zukerman when he was physically on campus. In prior academic years, this could amount to only one or two lessons per semester. Within a year of adopting videoconferencing for music education, Zukerman was offering his students regular videoconference lessons from Albuquerque to New Zealand. Figure 1 is a photograph of one of these sessions.

After the early success of the Zukerman Performance Program Videoconference Lesson Project, Manhattan School of Music further envisioned the potential of this technology-powered medium to support and develop many more aspects of the institution's educational and artistic mission. Through the development and creative use of broadband videoconferencing and related instructional technologies, Manhattan School of Music's newly instituted Distance Learning Program could provide access to artistic and academic resources that enhance student education in musical performance while heightening global community awareness of and participation in the musical arts. As the first conservatory in the United States to use videoconferencing technology for music education, Manhattan School of Music could feature its Distance Learning initiatives to build audiences for the future; to preserve and expand musical heritage; to foster leadership, creativity, and technical innovation in support of new performance and educational opportunities; and to facilitate cross-cultural communication, understanding, and appreciation through music.

APPLICATION EXAMPLES

Today, the Manhattan School of Music Distance Learning Program reaches over 1,700 students each academic year, including learners from 27 of the 50 United States and 16 other countries to date. Musical applications in classical and jazz music include regular public lessons referred to as "master classes"; private lessons; jazz clinics on performance-related techniques; instrumental and vocal coaching on musical interpretation; sectional rehearsals for large ensemble performances; workshops on musically related topics such as 'sports medicine' for musicians; professional development for faculty; educational and community outreach music programs for K-12 schools, libraries, and hospitals around the country; composer colloquia with notable composers around the world; panel discussions on diverse themes and topics, such as copyright protection for musicians in the digital age or an anniversary-year celebration of a notable composer such as Leonard Bernstein; and "mock" audition sessions designed to help aspiring young musicians prepare for live auditions with symphony orchestras or big band ensembles.
More recent applications currently under development include simultaneous Webcasting of live videoconference exchanges, "telementoring" sessions on career and professional advancement, and remote auditioning from far-reaching locales such as Beijing and Shanghai in the People's Republic of China. Since over 40 percent of the Manhattan School of Music student body comes from outside of the United States, having the opportunity to audition for Manhattan School of Music through live, interactive videoconferencing from one's home country would provide a significant savings for students and their families, and also open up the opportunity for talented, yet economically-disadvantaged, students to audition for one of the world's leading music conservatories.

Figure 2 illustrates the use of interactive videoconferencing technology for live music instruction between student and master teacher at remote locations. Both images show public lessons or master classes within a concert hall setting that enable large groups of people to observe and participate in these learning and performance exchanges.

Figure 2 - Classical Programs at MSM

MSM integrates professional audio and video technology with the Polycom HDX codec to create an optimal, high-performance virtual learning environment. This technology includes HD video projection and cameras, as well as professional-grade condenser audio microphones through an external digital audio mixing console. Audio and video signals are then captured into DV format for future use, reference, and access such as MPEG-2 (DVDs) or H.264 Web streaming.

Both of the images in Figure 2 show world-renowned classical musicians teaching talented young professionals through interactive videoconferencing. In the left-hand image, Maestro Pinchas Zukerman, a pioneer in the use of videoconferencing for music performance, is teaching a student located at the Manhattan School of Music from Ottawa, Ontario, where he maintains the post of Artistic Director of Canada's National Arts Centre Orchestra. On the right of Figure 2, Thomas Hampson, renowned American baritone and leading singer at the Metropolitan Opera House, is teaching a student at the Curtis Institute of Music in Philadelphia from the Manhattan School of Music campus. These images demonstrate the ability of music institutions to both import and export artistic resources between remote sites.
Similarly, in the field of jazz music, the great jazz pianist Kenny Barron and saxophonist David Liebman (Figure 3) teach talented jazz students in remote locations in Canada from Manhattan School of Music's newly established William R. and Irene D. Miller Recital Hall, fully equipped with HD videoconferencing capabilities.

Figure 3 - Jazz programs at MSM

Presently, Manhattan School of Music sees no end to the possible uses and applications of videoconferencing technology for music performance and education. The field of music has already demonstrated the need, use, and desire to use cutting-edge technology such as videoconferencing to teach and reach new audiences around the globe. Ultimately, live music performance is a shared, participatory, community-oriented activity which gives outlet and meaning to human expression, and therefore should not be limited by physical or geographic boundaries.

TECHNICAL CHALLENGES

The previously mentioned applications cover the two broad areas of applied music, as well as thematic or topically-driven music programs. Both types require a highly interactive virtual environment. Within the context of a music lesson, the teacher and student engage in constant and rapid-fire exchanges of speaking, playing, and gesturing—and they do all of this simultaneously or over one another. Musicians are trained to process multiple, simultaneous sensory stimuli, and therefore, as they hone this skill, they come to demand this capability within a working, learning, or teaching environment. The "rules" of the exchange are different from speech etiquette and therefore test the responsiveness of full-duplex audio to a high degree.

What would be required in order to achieve educationally useful and beneficial exchanges and performances in music? Here's the list:

- A seamless virtual environment conducive to learning/teaching/performing (low latency)
- True, accurate sonic representation of the acoustical properties of sound and music
- Functional and expressive elements of music
  - Functional: melody, harmony (pitches), meter (rhythm), and form (composition)
  - Expressive: timbre, dynamics, tempo, articulation, and texture, and the contrasts therein
- Stereo sound and acoustic echo cancellation
- Full-frequency response (20 Hz – 22 kHz)—the full human auditory bandwidth, including harmonic and non-harmonic overtones of fundamental tones

Despite the Distance Learning Program's continual growth and progress, the underlying videoconferencing technology was adversely affecting the quality of live, interactive transmissions for music performance and education. The Program was growing, but not to its full potential, given the ongoing requests and demonstrated need for music videoconferencing programs around the globe. MSM's program relied heavily on an interactive technology that was designed for applications with speech, not musical sound. In fact, the acoustical properties of musical sound demand different technological requirements than speech—so much so that, without the necessary modifications, musical sound transmitted through videoconferencing systems would never be truly satisfactory for music performance and training at a high level. Some of the myriad issues and problems the program faced in previous generations of video codecs included:

- Limited frequency bandwidth (up to 7 kHz)
- Mono instead of stereo sound
- Low bit rate sound quality (8-bit)
- Lack of dynamics and dynamic range
- Muting or loss of sound transmission
- Network and codec latency resulting in compromised audio-visual communication
- Echo cancellation artifacts including noise, echo, and 'squealing' sounds

MSM did experiment with other codec solutions, given its access and membership to the Internet2 community and Abilene network; at the same time, however, it was also seeking a solution that would be widely deployed, standardized, reliable, efficient, cost-effective, and easy to use.

MSM approached Polycom engineering with this particular conundrum: the inherent incompatibility of musical sound with standard video codecs. Through discussion and study of MSM's unique application, engineering indicated that, with modifications made to certain underlying advanced acoustic technologies in the audio codec, MSM's special needs and requirements could potentially be met. To test this theory, Polycom engineers and MSM collaborated on a series of live music tests and experiments with different musical instruments and ensembles. They found that these newly incorporated alterations produced very promising results. Further modifications were tested, and the final outcome of these tests was the creation of a special audio feature set, deployed in the Polycom VSX and Polycom HDX codec lines, called Music Mode—a system specially designed for the transmission of live, interactive, acoustic music performances and exchanges.

AUDIO AND VIDEO TRANSMISSION BASICS

Due to bandwidth limitations, sending uncompressed audio and video is not an option in most IP networks today. Audio and video streams have to be encoded (compressed) at the sender side to reduce the network bandwidth used. At the receiver side, the real-time audio-video stream has to be decoded and played at full audio and video quality. Figure 4 summarizes the encoding and decoding concept. The critical element here is the equipment's ability to preserve a high audio quality of 50 Hz to 22 kHz and a high video quality of at least HD 720p.

Figure 4 - Audio and video transmission basics

IP networks inherently lose packets when there are bottlenecks and congestion, and mechanisms for recovery from packet loss in the IP network are required to deliver a high-quality performance to the receiver. Compressed audio and video are packetized in so-called Real-time Transport Protocol (RTP) packets, and then transmitted over the IP network. The Lost Packet Recovery (LPR) mechanism, developed by Polycom to overcome this issue, is discussed later in this paper.

ADVANCED AUDIO CODING

The standard for voice transmission quality was set about 120 years ago with the invention of the telephone. Based on the technical capabilities at the time, it was decided that transmitting acoustic frequencies from 300 Hz to 3300 Hz is sufficient for a regular conversation. Even today, basic narrow-band voice encoders, such as ITU-T G.711, work in this frequency range, and are therefore referred to as 3.3 kHz voice codecs. Another important characteristic of a voice codec is the bit rate. For example, G.711 has a bit rate of 64 kbps; that is, transmitting voice in G.711 format requires a network bandwidth of 64 kbps (plus network protocol overhead). Other narrow-band codecs are G.729A (3.3 kHz, 8 kbps), G.728 (50 Hz – 3.3 kHz, 16 kbps), and AMR-NB (3.3 kHz, 4.75 – 12.2 kbps).
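As a concrete illustration of the "plus network protocol overhead" remark above, the short sketch below estimates the on-the-wire rate of a G.711 stream. The 20 ms packetization interval is an assumed, typical value rather than a figure from this paper.

# Rough on-the-wire bandwidth for a G.711 voice stream carried in RTP/UDP/IPv4.
# The 20 ms packetization interval is an assumed typical value, not from this paper.
CODEC_RATE_BPS = 64_000          # G.711 payload bit rate
PACKET_INTERVAL_S = 0.020        # audio carried per packet (assumption)
HEADER_BYTES = 20 + 8 + 12       # IPv4 + UDP + RTP headers

payload_bytes = CODEC_RATE_BPS * PACKET_INTERVAL_S / 8       # 160 bytes per packet
packets_per_second = 1 / PACKET_INTERVAL_S                   # 50 packets per second
wire_rate_bps = (payload_bytes + HEADER_BYTES) * 8 * packets_per_second
print(f"G.711 on the wire: {wire_rate_bps / 1000:.0f} kbps") # ~80 kbps before link-layer framing

In other words, the 64 kbps codec consumes roughly 80 kbps of IP bandwidth once the per-packet headers are counted.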
Advanced audio coding went far beyond basic voice quality with the development of wideband codecs that support 7 kHz, 14 kHz, and most recently 22 kHz audio. ITU-T G.722 was the first practical wideband codec, providing 7 kHz audio bandwidth at 48 to 64 kbps. Polycom entered the forefront of audio codec development with Siren 7 (7 kHz audio at 16 – 32 kbps), standardized by ITU-T as G.722.1 (7 kHz, 16/24/32 kbps). Figure 5 shows the most popular codecs.

Figure 5 - Advanced audio encoding

SIREN 22 HIGHLIGHTS

The patented Polycom Siren 22 algorithm offers breakthrough benefits compared to earlier super-wideband audio technology. The technology, which is offered with royalty-free licensing terms, provides CD-quality audio for better clarity and less listener fatigue in audio and visual communications applications. Siren 22's advanced stereo capabilities make it ideal for acoustically tracking an audio source's movement and help deliver an immersive experience.

Siren 22 Stereo covers the acoustic frequency range up to 22 kHz. While the human ear usually does not hear above 16 kHz, the higher upper limit in Siren 22 delivers additional audio frequencies that are especially important for music. Siren 22 offers 40-millisecond algorithmic delay using 20-millisecond frame lengths for natural and spontaneous real-time communication. Siren 22 stereo requires relatively low bit rates of 64, 96, or 128 kbps—leaving more bandwidth available for improved video quality.

Siren 22 requires 18.4 MIPS (Million Instructions Per Second) for encoder plus decoder operation, compared to 100+ MIPS for competing algorithms such as MP3 or MPEG-4 AAC-LD. Therefore, Siren 22 can be used with lower-cost processors that consume less battery power, for example, in PDAs, mobile phones, or even wrist watches.
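To make the frame-length and bit-rate figures above concrete, the sketch below works out how much encoded audio each 20 ms Siren 22 frame carries at the three quoted stereo bit rates. Packing one frame per RTP packet is an assumption made for illustration, not a statement about the Polycom implementation.

# Bytes of encoded audio per 20 ms Siren 22 frame at the bit rates quoted above.
FRAME_LENGTH_S = 0.020                     # 20 ms frames (from the text)
for bit_rate_kbps in (64, 96, 128):        # Siren 22 stereo rates (from the text)
    frame_bytes = bit_rate_kbps * 1000 * FRAME_LENGTH_S / 8
    print(f"{bit_rate_kbps} kbps -> {frame_bytes:.0f} bytes per frame")
# 64 kbps -> 160 bytes, 96 kbps -> 240 bytes, 128 kbps -> 320 bytes.
# One frame per RTP packet (an assumption) would add ~40 bytes of RTP/UDP/IPv4
# headers per frame, i.e. about 16 kbps of overhead at 50 packets per second.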
Siren 22 handles speech, music, and natural sounds with equal ease. Most codecs listed in Figure 5 are designed for voice and break up when presented with natural sounds or music. Table 1, below, focuses on audio codecs that are designed for music and natural sounds.

Table 1 - Audio codecs comparison

MP3 is an audio codec that is well known because of its use in portable music players. This codec is not designed for real-time audio-visual communication and features relatively high delay and bit rate. MPEG-4 AAC-LD is a newer audio codec that lowered the delay (hence LD, for Low Delay) but has even higher complexity than MP3. The higher complexity means that more powerful and more expensive hardware is required to run the AAC codec, which makes it less suitable for telephones and mobile devices.

G.719 HIGHLIGHTS

The new ITU-T G.719 full-band codec is based on Polycom Siren 22 and Ericsson's advanced audio techniques. The G.719 designation gives this codec high visibility and signifies the importance of the embedded technology. In its decision, ITU-T cited the strong and increasing demand for audio coding providing the full human auditory bandwidth. Conferencing systems are increasingly used for more elaborate presentations, often including music and sound effects, both of which occupy a wider audio bandwidth than that of speech. In today's multimedia-enriched presentations, playback of audio and video from DVDs and PCs has become common practice. New telepresence systems provide high definition (HD) video and audio quality to the user, and require high-quality media delivery to create the immersive experience. Extending the quality of remote meetings helps reduce travel, which in turn reduces greenhouse gas emissions and limits climate change.

Table 1 highlights the close relationship between G.719 and Siren 22. Bandwidth and delay are similar but, most importantly, G.719 inherits the low complexity of Siren 22 and is therefore a great choice for implementation on telephones and mobile devices. In order to take audio quality to the next level, the G.719 maximum bit rate was increased to 256 kbps—compared to the Siren 22 maximum bit rate of 128 kbps—and now allows completely transparent coding of music material. For example, the dual-size transform used in G.719 is of great benefit for percussive music sounds. It is therefore expected that G.719 will provide even better quality than Siren 22 for music performances and instruction over IP networks. Polycom is committed to supporting G.719 in its product line—along with the well-established Siren 22, which guarantees interoperability with the installed base of endpoints and telephones.
ADVANCED ACOUSTIC TECHNOLOGIES

Polycom video endpoints, such as the Polycom HDX 9000 system, deliver the highest level of voice and video quality for applications such as video conferencing, telepresence, and vertical applications for the education, healthcare, and government markets. They implement key acoustic technologies, including Automatic Gain Control (AGC), Automatic Noise Suppression (ANS), Noise Fill, and Acoustic Echo Cancellation (AEC), that are very useful in voice communication.

Very soon we realized that the technologies that guarantee true music reproduction are often in conflict with the technologies developed to improve the user experience in a standard bi-directional voice call. We will explain how each of these mechanisms works for voice transmissions, and then focus on the modifications that had to be made in standard audio-visual communications equipment for the transmission of music performances (Music Mode).

AUTOMATIC GAIN CONTROL (AGC)

AGC is a technology that compensates for either audio or video input level changes by boosting or lowering incoming signals to match a preset level. There are a variety of options for implementing audio AGC in audio-visual equipment. Polycom systems have multiple AGC instances—including those for room microphones and for VCR/DVD input—and they are all turned on by default.

Here, we will focus on the AGC for the connected microphones. The function is activated by speech and music. It ignores white noise. For example, if a computer or projector fan is working close to the microphone, AGC will not ramp up the gain.

The Polycom AGC is very powerful, and automatically levels an input signal in the range of ±6 dB from nominal while covering distances of up to 12 feet / 3.6 meters (depending on the room characteristics and on the person talking). Nominal is the volume of an average person speaking in a normal voice about 2 feet / 60 cm away from the microphone. As a rule of thumb, 3 dB is twice the nominal signal strength (10^(3/10) = 10^0.3 = 1.995) and 6 dB is four times the nominal signal strength (10^(6/10) = 10^0.6 = 3.981). Figure 6 graphically represents the AGC concept.

Figure 6 - Automatic Gain Control

In the nominal case, AGC does not modify the gain; that is, the input volume is equal to the output volume. When the speaker moves farther away, the microphone picks up a lower volume, and the AGC in the audio-visual equipment has to amplify the volume to keep the output constant.

As we found in our research, however, AGC completely destroys the natural dynamic range of music. If the music volume increases, AGC kicks in and compensates by decreasing it. If the music becomes quiet, AGC automatically increases the volume. There is no good way to overcome this behavior, so we disabled AGC in Music Mode.
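A minimal sketch of the leveling idea described above: measure the short-term level of a block of samples against the nominal level, then boost or cut, clamped to the ±6 dB window. The nominal level, block size, and clamping details are illustrative assumptions, not the Polycom implementation.

import math

MAX_CORRECTION_DB = 6.0        # AGC window from the text: about +/-6 dB around nominal

def agc_gain(block, nominal_rms, max_db=MAX_CORRECTION_DB):
    """Return the linear gain that would bring one audio block back to nominal level.
    block is a list of samples; nominal_rms is the preset target level (assumed)."""
    rms = math.sqrt(sum(s * s for s in block) / len(block)) or 1e-12
    level_db = 20 * math.log10(rms / nominal_rms)          # how far we are from nominal
    correction_db = max(-max_db, min(max_db, -level_db))   # boost if quiet, cut if loud
    return 10 ** (correction_db / 20)

# Why this breaks music: a deliberate crescendo raises level_db, so the AGC cuts
# the gain and flattens the dynamics, which is why Music Mode disables AGC.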
AUTOMATIC NOISE SUPPRESSION (ANS) AND NOISE FILL

White noise is a mix of all frequencies. ANS detects white noise at the sender side and filters it out. The sender codec measures the signal strength when the audio volume is the lowest (for instance, when no one is speaking), and uses this value as a filter threshold for the automatic noise suppression algorithm.

Polycom ANS reduces background white noise by up to 9 dB, or about eight times (10^(9/10) = 10^0.9 = 7.943). If the signal energy is above the threshold, the gain stays at 1. If it is below the threshold, the gain is reduced by 9 dB, or about eight times.

Many noise suppression implementations make voice sound artificial, as it does on a cell phone. Sophisticated technical implementations, such as the one in Polycom endpoints, deliver artifact-free audio and make the noise suppression completely transparent, that is, not perceptible to users.

Also important is the algorithm's reaction time. For example, if a fan turns on, ANS should not take more than a few seconds to recalculate the new threshold and remove the noise. Note that while ANS is very useful for eliminating the noise of laptop and projector fans and vents, it does not eliminate clicks and snaps—this is done by a separate algorithm in Polycom systems (the keyboard auto-mute function).

Noise Fill is a receiver-side function and works like comfort noise in telephony—a low level of noise is added at the receiver side to reassure the user that the line is up. Noise Fill is necessary because all echo cancellers mute the microphones in single talk—when one side is talking and the other is not—so that the quiet side does not send any traffic over the network. To make sure users do not think the line is completely down, the receiver system adds some noise, and hides the fact that the other side is muted in single talk. Figure 7 provides a graphic representation of ANS and Noise Fill.

Figure 7 - ANS and Noise Fill

Unfortunately, ANS is not designed for music. The problem is that, with a sustained musical note—one that lasts for several seconds, for example—the ANS algorithm reacts as if it were noise and removes it. Imagine how ANS would destroy a piece of classical music. Adding noise at the receiver side would not help either—one would be far better off simply playing the audio that comes over the network. After analyzing ANS and Noise Fill, we concluded that both have to be disabled in Music Mode.
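The fragment below sketches the threshold behavior described above as a simple per-block noise gate. The block-energy measure and the way the threshold is tracked are assumptions made for illustration; a real ANS works at a much finer spectral level.

SUPPRESSION_DB = 9.0                               # reduction quoted in the text
SUPPRESSION_GAIN = 10 ** (-SUPPRESSION_DB / 10)    # power-domain factor, roughly 1/8

def ans_gain(block_energy, noise_floor_estimate):
    """Power-domain gain of a crude noise gate: pass a block whose energy is above
    the learned noise-floor threshold, attenuate anything at or below it by 9 dB."""
    if block_energy > noise_floor_estimate:
        return 1.0                                 # treated as speech or music
    return SUPPRESSION_GAIN                        # treated as background noise

# The failure mode for music: a long, quiet sustained note decays toward the
# learned threshold, gets classified as noise, and is attenuated, which is why
# Music Mode turns ANS (and Noise Fill) off.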
ACOUSTIC ECHO CANCELLATION (AEC)

Echo in telecommunications can have many causes: it may be a result of acoustic coupling at the remote site, or it may be caused by the communications equipment itself. Whatever the cause, it creates audio that is sent to the remote site and fed back through the speakers at the sender site. Through acoustic coupling between speakers and microphones at the sender site, this feedback is sent again through the created loop and adds even more echo on the line. AEC technology has to be deployed at the receiver site to filter out the echo. A graphical representation of the AEC concept is shown in Figure 8.

Figure 8 - Acoustic Echo Cancellation

Basic AEC characteristics include the operating range and the adaptive filter length. For systems that support the 50 – 22,000 Hz acoustic frequency range, AEC should be able to support the same range. Adaptive filter length is the maximum delay of the echo for which the system can compensate. Polycom leads this area, providing a maximum adaptive filter length of 260 ms.

An additional benefit of the Polycom implementation is that the AEC algorithm does not need any learning sequence; that is, Polycom does not send out white noise to train it. The AEC algorithm trains quickly enough on real voice and audio.

If the system supports stereo, it is important that the AEC can identify the multiple paths of the stereo loudspeakers, and quickly adapt to microphones that are moved (if you move the microphone, you change the echo path and the adaptive filter has to learn the new path). Our AEC implementation adapts within two words of speech; that is, echo comes back for a short time (one or two words) and then the canceller adjusts.

Musicians want to be able to play a sustained note (for example, by pressing the sustain pedal on a piano) and hear it all the way down, even 1 dB above the noise floor. If the Polycom HDX standard AEC is active during a music performance transmission, the receiver will get artifacts, and low notes can be cut out. Therefore, we developed special tunings in the echo canceller that prevent very quiet musical sounds from being cut out. Knowing that people will use Music Mode in a quiet studio environment, we changed the AEC thresholds to be much more aggressive than in standard mode.

Musicians care not only about the upper limit of 22 kHz but also about the ability to encode and decode lower frequencies. The Polycom Siren 22 encoder/decoder is flat in frequency response down to 20 Hz and even down to DC. However, the echo canceller in Polycom HDX video endpoints goes down to 50 Hz because of noise issues that crop up quite often in conference rooms. Manhattan School of Music is therefore using the Polycom SoundStructure discussed below in this paper. The echo canceller in SoundStructure goes all the way down to 20 Hz, and delivers the bass response that musicians seek.
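For readers unfamiliar with the adaptive filter mentioned above, here is a minimal, textbook-style NLMS echo canceller. It is an illustrative sketch, not Polycom's algorithm, and the filter length, step size, and sample rate are arbitrary choices.

import numpy as np

def nlms_echo_canceller(far_end, mic, num_taps=2048, mu=0.5, eps=1e-8):
    """Textbook NLMS acoustic echo canceller (illustrative only).
    far_end: samples sent to the loudspeaker (the reference signal)
    mic:     samples captured by the microphone (local talker plus echo)
    num_taps covers the echo delay the filter can model; for example, a 260 ms
    echo tail at 48 kHz would need about 12,480 taps."""
    taps = np.zeros(num_taps)                  # estimate of the echo path
    history = np.zeros(num_taps)               # most recent far-end samples
    cleaned = np.zeros(len(mic))
    for n in range(len(mic)):
        history = np.roll(history, 1)
        history[0] = far_end[n]
        echo_estimate = taps @ history         # predicted echo at the microphone
        error = mic[n] - echo_estimate         # local signal plus residual echo
        taps += (mu / (eps + history @ history)) * error * history  # NLMS update
        cleaned[n] = error
    return cleaned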
INSTALLED AUDIO

Installed audio systems provide superior audio in conference rooms, auditoriums, classrooms, meeting rooms, courtrooms, and concert halls. When connected to audio-video endpoints such as Polycom HDX systems, they enhance the acoustic capabilities. Figure 9 shows the Polycom SoundStructure installed audio system, which supports up to 16 microphones and speakers.

Figure 9 - Installed audio configuration

SoundStructure allows Manhattan School of Music to deliver the applications listed above to larger rooms while keeping the full 22 kHz stereo audio and enjoying advanced stereo echo cancellation.

Perfect interworking between the Polycom HDX video endpoint and SoundStructure is assured through a fully digital connection between the two (the C-Link interface) that allows bi-directional exchange of audio as well as shared mute and volume control. There is no need to configure the two components to work with each other—an auto-discovery mechanism makes any manual configuration unnecessary.

If the loudspeaker and microphone tonality is not correct, the equalization function in SoundStructure can improve the sound quality, thereby improving the music. For instance, if the loudspeakers are weak in the treble, the equalizer can correct for this deficiency with a treble boost.

One big advantage of SoundStructure is that it provides a separate stereo echo canceller for 8 microphones (one can also expand in multiples of 8 by cascading SoundStructure systems). Polycom HDX systems provide separate stereo echo cancellers for up to two external microphones for capturing audio from music ensembles. Often a separate microphone is used to pick up each instrument.

ADVANCED HIGH DEFINITION VIDEO TECHNOLOGY

While it is secondary to audio, video plays an important role in transmitting music performances. The ability to see the performers in high resolution enhances the experience, as does the ability to zoom in on a solo player or to see additional content related to the performance.

First, let's look at the changed expectations for video quality. The vertical axis of the diagram in Figure 10 shows the video quality, while the horizontal axis shows the bandwidth necessary to transmit video at this quality across the network.

Figure 10 - Video Quality and Required Network Bandwidth

The Common Intermediate Format (CIF) has a resolution of 352 x 288 pixels. Current video technology allows CIF-quality video to be transmitted at 30 fps over 56 kbps (256 kbps delivers good video quality when there is a lot of movement). Standard Definition (SD) is the quality of a regular TV and has a resolution of 704 x 480 pixels. Current video technology allows transmitting SD-quality video at 30 fps over 256 kbps (512 kbps delivers good video quality with a lot of movement).

The latest generation of products supports High Definition (HD), starting with 720p HD, that is, 1280 x 720 pixels with progressive scan. Current video codec technology allows transmitting HD-quality video at 30 fps over 1 – 2 Mbps. HD 720p at 60 fps requires at least 2 Mbps. In June 2008, Polycom demonstrated its HD 1080p technology at InfoComm and is embedding it in the product portfolio. The required bandwidth for an HD 1080p connection is around 3 Mbps.
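The helper below simply restates the call rates quoted above as a lookup, so that the best format a given call rate can carry is easy to read off. The table values come from the figures in the text; the function itself and its tie-breaking are illustrative assumptions.

# (resolution, fps) -> minimum call rate in kbps quoted in the text above
MIN_CALL_RATE_KBPS = {
    ("CIF", 30): 56,
    ("SD", 30): 256,
    ("720p", 30): 1000,
    ("720p", 60): 2000,
    ("1080p", 30): 3000,
}

def best_video_format(call_rate_kbps: int):
    """Pick the most demanding format that still fits the call rate (illustrative)."""
    affordable = [fmt for fmt, rate in MIN_CALL_RATE_KBPS.items() if rate <= call_rate_kbps]
    return max(affordable, key=lambda fmt: MIN_CALL_RATE_KBPS[fmt]) if affordable else None

print(best_video_format(1500))   # ('720p', 30) with the figures above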
FAR END CAMERA CONTROL

Far End Camera Control (FECC) is a feature that allows controlling the video camera at the remote site—assuming that the remote site supports the feature and uses a PTZ (pan, tilt, zoom) camera. Figure 11 describes the functionality and lists the technical parameters of the Polycom EagleEye HD camera.

Figure 11 - PTZ Camera Control

The feature can be used, for example, to see a particular group of musicians within an orchestra, or to focus on a solo player. Technologies for automatic detection of the audio source will allow the system to focus automatically on a solo player and then quickly refocus on another musician or on the whole orchestra, based on intelligent audio source detection algorithms. More detail about the way FECC is implemented in H.323 and SIP is available in the Polycom white paper Migrating Visual Communications from H.323 to SIP.

DUAL STREAM FOR CONTENT SHARING

Video systems today support two parallel video channels. While the "live" channel is used for the live video, the "content" channel is used for sending text, pictures, or videos associated with the "live" channel. When transmitting live music performances, the content channel can be used for transmitting any supporting information related to the performance—pictures, background videos, or, in the case of opera in a foreign language, subtitles in the local language.

Dual Video Streams allows a "presentation" (sometimes also called "content") audio-video stream to be created in parallel to the primary "live" audio-video stream. This second stream is used to share any type of content: slides, spreadsheets, X-rays, and video clips, for example. Polycom's pre-standard version of this technology is called Polycom People+Content IP technology. H.239 is heavily based on intellectual property from the People+Content technology and became the ITU-T standard that allows interoperability between different vendors. Figure 12 describes this functionality.

Figure 12 - Dual Stream Function

While the function works well on single-monitor systems, it is especially powerful in multi-screen setups (video endpoints can support up to four monitors). In Figure 12, the "presentation" stream is created in parallel to the "live" stream, and the content is displayed on the left screen of the receiver system.

The benefit of this functionality is that users can share not just slides or spreadsheets but also moving images: Flash video, movie clips, commercials, and so on. The "presentation" channel has flexible resolution, frame rates, and bit rates. For dynamic images, it can support full High-Definition video at 30 frames per second; for static content, such as slides, it can work, for example, at 3 frames per second and save bandwidth in the IP network. Another major benefit of using a video channel for content sharing is that the media is encrypted (by AES in H.323 and by SRTP in SIP). In addition, once firewall and NAT traversal works for the "live" stream, it works for the "presentation" channel as well, and there is no need for a separate traversal solution. More detail about the way Dual Stream is implemented in H.323 and SIP is available in the Polycom white paper Migrating Visual Communications from H.323 to SIP.

The above video technologies in concert can not only enhance the experience of traditional music performances but also allow for experimentation with new art forms.

GIVING AUDIO HIGH PRIORITY

In standard video applications, audio quality is automatically reduced when the total available bandwidth for the video call decreases. For example, if a Polycom HDX system places a video call at 1 Mbps, it will automatically set the audio to the best quality: Siren 22 at 128 kbps. If the video call is at 256 kbps – 1 Mbps, HDX will use a slightly lower quality: Siren 22 at 96 kbps. Audio quality is only Siren 14 at 48 kbps if the total bandwidth for the video call is below 256 kbps. Figure 13 depicts this scenario and the audio settings.

Figure 13 - High Priority Audio

When music is transmitted, audio must stay at the best quality level no matter how much total bandwidth is assigned to the video call; therefore, a Polycom HDX system in Music Mode keeps using the highest quality (Siren 22 at 128 kbps) down to 256 kbps calls. This makes sure that the audio gets high priority and video remains secondary in a Music Mode call.
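The audio-priority rule described above is easy to express as a small decision function. The sketch below encodes the quoted thresholds; the exact boundary handling (inclusive versus exclusive) is an assumption.

def select_audio_codec(call_rate_kbps: int, music_mode: bool) -> tuple[str, int]:
    """Return (codec, audio bit rate in kbps) following the behavior described above.
    Boundary handling at exactly 256 kbps and 1 Mbps is an assumption."""
    if music_mode and call_rate_kbps >= 256:
        return ("Siren 22", 128)          # Music Mode holds the best audio quality
    if call_rate_kbps >= 1000:
        return ("Siren 22", 128)
    if call_rate_kbps >= 256:
        return ("Siren 22", 96)
    return ("Siren 14", 48)

print(select_audio_codec(512, music_mode=False))  # ('Siren 22', 96)
print(select_audio_codec(512, music_mode=True))   # ('Siren 22', 128)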
PACKET LOSS AND LOST PACKET RECOVERY (LPR)

Packet loss is a common problem in IP networks because IP routers and switches drop packets when links become congested and when their buffers overflow. Real-time streams, such as voice and video, are sent over the User Datagram Protocol (UDP), which does not retransmit packets. Even if a UDP/IP packet gets lost, retransmission does not make much sense, since the retransmitted packet would arrive too late; playing it would destroy the real-time experience.

Lost Packet Recovery (LPR) is a new method of video error concealment for packet-based networks, and is based on forward error correction (FEC). Additional packets that contain recovery information are sent along with the original packets in order to reconstruct the packets that were lost during transmission.

For example, suppose you have 2% packet loss at 4 Mbps (losing approximately 8 packets for every 406 packets in a second). After engaging LPR, the packet loss can be reduced to less than 1 packet per 5 minutes, or 0.00082%. (1 packet per 5 minutes = 1 packet / (406 packets/sec × 300 sec) = 1 packet / 121,800 packets = 0.00082%.)

Figure 14 depicts the functions in LPR—both sender side (top) and receiver side (bottom).

Figure 14 - Lost Packet Recovery

LPR includes a network model that estimates the amount of bandwidth the network can currently carry. This network model is driven by the received-packet and lost-packet statistics. From the model, the system determines the optimal bandwidth and the strength of the FEC protection required to protect the media flow from irrecoverable packet loss. This information is fed back to the transmitter through the signaling channel, which then adjusts its bit rate and FEC protection level to match the measured capacity and quality of the communication channel. The algorithm is designed to adapt extremely quickly to changing network conditions—such as cross congestion from competing network traffic over a wide area network (WAN).

The LPR encoder takes RTP packets from the RTP Tx channel, encapsulates them into LPR data packets, and inserts the appropriate number of LPR recovery packets. The encoder is configured by the signaling connection, usually when the remote LPR decoder signals that it needs a different level of protection.

The LPR decoder takes incoming LPR packets from the RTP Rx channel. It reconstructs any missing LPR data packets from the received data packets and recovery packets. It then removes the LPR encapsulation (thereby converting them into the original RTP packets) and hands them back so they can be processed and forwarded to the video decoder. The decoder has been optimized for compute and latency.

LPR has advanced dynamic bandwidth allocation (DBA) capabilities. Figure 15 illustrates the behavior.

Figure 15 - LPR DBA Example

When packet loss is detected, the bit rate is initially dropped by approximately the same percentage as the packet loss rate. At the same time, FEC is turned on and recovery packets begin to be inserted. This two-pronged approach provides the fastest restoration of the media streams affected by loss, ensuring that there is little or no loss perceived by the user. The behavior can be modified by the system administrator through configuration.

When the system determines that the network is no longer congested, it reduces the amount of protection, and ultimately goes back to no protection. If the system determines that the network congestion has lessened, it will also increase the bandwidth back towards the original call rate (up-speed). This allows the system to deliver media at the fastest rate that the network can safely carry. Of course, the system will not increase the bandwidth above the limits allowed by the system administrator. The up-speed is gradual, and the media is protected by recovery packets while the up-speed occurs, which ensures that it has no impact on the user experience. In the example in Figure 15, up-speeds are in increments of 10% of the previous bit rate.

If the packet loss does not disappear, the system continues to monitor the network, and finds the protection amount and bandwidth that deliver the best user experience possible given the network conditions.
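The arithmetic in the example and in the dynamic bandwidth behavior above can be checked with a few lines. The figures are the ones quoted in the text, and the initial rate reduction simply restates the "drop by about the same percentage as the loss rate" rule.

# Check the packet-loss figures quoted above.
packets_per_second = 406           # ~4 Mbps video call
loss_rate = 0.02                   # 2% packet loss

lost_per_second = packets_per_second * loss_rate          # ~8 packets per second before LPR
residual_loss = 1 / (packets_per_second * 300)            # 1 packet per 5 minutes with LPR
print(f"before LPR: {lost_per_second:.1f} lost packets/s")
print(f"with LPR:   {residual_loss:.5%} residual loss")    # ~0.00082%

# First reaction of the dynamic bandwidth allocation: reduce the call rate by
# roughly the measured loss percentage, then protect the stream with FEC.
call_rate_kbps = 4000
reduced_rate_kbps = call_rate_kbps * (1 - loss_rate)       # ~3920 kbps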
A recent evaluation of Polycom Lost Packet Recovery (LPR) by independent analysts from Wainhouse Research (www.wainhouse.com) concluded the following: "While most of the video systems on the market today include some form of packet loss or error concealment capability, Polycom LPR is one of only two error protection schemes available today that uses forward error correction (FEC) to recover lost data. One of LPR's differentiators and strengths is that it protects all parts of the video call, including the audio, video, and content / H.239 channels, from packet loss."

CONCLUSION

Manhattan School of Music and Polycom have worked closely to set key parameters in the Polycom audio-visual equipment to support music performance and instruction over high-speed IP networks. The unique collaboration between musicians and audio engineers led to the creation of Music Mode in the Polycom HDX and Polycom VSX endpoints. In the administration Web page of the system, Music Mode is just one check mark. Clicking on it, users of Polycom video systems do not realize the amount of work and the wealth of technology know-how behind this simple option—nor should they: Music Mode users are musicians, not engineers. The only thing they need to care about is that, thousands of miles away, their notes are being played true to the original, without any distortions or artifacts.

ABOUT THE AUTHORS

Christianne Orto is Assistant Dean for Distance Learning and Director of Recording at the Manhattan School of Music. She has led the conservatory's distance learning program since its inception in 1996.

Stefan Karapetkov is Emerging Technologies Director at Polycom, Inc., where he focuses on visual communications market and technology analysis. He has spent more than 13 years in product management, new technology development, and product definition. His blog is http://videonetworker.blogspot.com/.

ACKNOWLEDGEMENTS

We would like to thank Peter Chu, Stephen Botzko, Minjie Xie, and Jeff Rodman from Polycom, and President Robert Sirota and Trustee Marta Istomin from MSM, for their contributions and support for this paper.

©2008 Polycom, Inc. All rights reserved. Polycom and the Polycom logo design are registered trademarks of Polycom, Inc. All other trademarks are the property of their respective owners. Information is subject to change without notice.