TWI296406B - Audio content sensor and method thereof - Google Patents

Info

Publication number
TWI296406B
TWI296406B (application TW94143690A)
Authority
TW
Taiwan
Prior art keywords
audio
sound
value
data
error
Prior art date
Application number
TW94143690A
Other languages
Chinese (zh)
Other versions
TW200723248A (en)
Inventor
Yi Sheng Tang
Original Assignee
Inst Information Industry
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inst Information Industry filed Critical Inst Information Industry
Priority to TW094143690A priority Critical patent/TW200723248A/en
Publication of TW200723248A publication Critical patent/TW200723248A/en
Application granted granted Critical
Publication of TWI296406B publication Critical patent/TWI296406B/zh

Links

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)
  • Detection And Prevention Of Errors In Transmission (AREA)

Description

IX. DESCRIPTION OF THE INVENTION

[Technical Field of the Invention]

The present invention relates to an audio content-aware device and method, and more particularly to a device and method for the graded (prioritized) transmission of audio content.

[Prior Art]

The Internet is a communication technology. Because it standardizes a common set of protocols, every compliant terminal can interconnect, which has made the Internet an emerging transmission medium: text, sound, images, video, and multimedia combinations of these can all be delivered over its far-reaching network.

Voice over Internet Protocol (VoIP) is an emerging technology that carries telephone speech over the Internet. It is poised to displace traditional telecommunications, offering a more cost-effective communication medium that better matches market demand. Call quality, of course, depends heavily on network bandwidth and congestion. To reduce transmission delay and to ensure the continuity and integrity of speech, several mechanisms have been devised to lower the sensitivity of voice data to transmission delay. Quality of Service (QoS) provisions allow voice and video data transmitted over the network to receive transmission priority instead of queuing with other data. The Resource Reservation Protocol (RSVP) lets the two endpoints of a communication reserve a guaranteed share of network capacity, making the real-time, uninterrupted transmission of audio signals over the network a practical reality.

"Audio" broadly refers to the sounds humans can hear. Speech sounds fall roughly into three classes: voiced sounds (e.g. /a/), unvoiced sounds (also called fricatives and plosives, e.g. /sh/, /n/, /p/), and mixed sounds. Speech is produced by air driven from the lungs through the vocal tract and out the mouth. The vibration of the vocal cords determines the pitch of the voice; the shape of the vocal tract determines its character, and that shape changes only about every 10-100 ms. Sound represents the variation of air density over time — essentially a continuous function — so before such a signal can be stored in a computer it must first be digitized.

The error-resilience mechanisms used by current audio/video streaming systems typically retransmit lost packets or apply channel coding. These approaches have the following drawbacks:
1. The protection is usually unrelated to the characteristics of the content, so a doubly-protected packet may yield no real gain in the video or audio quality perceived at the client.
2. Designs that do consider content characteristics are designed for video data only; none are designed for audio data.

Looking ahead at Internet applications, VoIP is both a technology and a trend; it will restructure the entire voice-telephony industry, and as it enters daily life under continual technical progress, transmission errors become unavoidable. How to minimize the impact of transmission errors on the important parts of the audio content, and thereby raise quality, is clearly an important problem.

[Summary of the Invention]

In view of the problems described above, the main object of the present invention is to provide an audio content-aware device and method which, by analyzing the audio content before transmission and concealing errors on reception, minimize the impact of transmission errors on the important parts of the audio and suppress popping artifacts.

The disclosed audio content-aware device comprises: an audio capture module for capturing audio data; a buffer for temporarily storing the graded audio data; a network module for transmitting the encoded, graded audio data to, and receiving audio data from, a streaming server having an Unequal Error Protection (UEP) mechanism; and an audio playback module for playing the repaired audio data. The device is characterized by:

an audio analysis module which divides the audio data to be transmitted into a plurality of frames, performs feature extraction on each, and grades each frame by its corresponding Shock Value

S = aP + (1 - a)V,

where the P (Power Estimation) value is the frame's temporal volume intensity, the V (Variance) value is the frame's temporal volume variance, and a is a parameter adjusting the relative weight of the time-domain volume intensity and the time-domain volume variance, with 0 ≤ a ≤ 1; and

an error compensation module which performs error detection on the received audio data and repairs lost frame data.

In accordance with these objects and advantages, the method of the invention comprises the following steps: first, the sender receives audio data; next, the Shock Value of the audio data is analyzed and the data graded; the graded audio data is then streamed to a streaming server, which forwards it with UEP to the receiver; the receiver checks whether any packet has been lost, and when audio data is missing, the audio data of the lost packets is repaired.

The features and implementation of the invention are described in detail below as preferred embodiments, with reference to the drawings.

[Detailed Description of the Embodiments]

The present invention discloses an audio content-aware device and method. In the following detailed description, numerous specific details are set forth to provide a thorough understanding of the invention. Those skilled in the art will appreciate that the invention may be practiced without these details, or with substituted components or methods. In other instances, well-known methods, procedures, components, and circuits are not described in detail so as not to obscure the focus of the invention unnecessarily.

Referring to FIG. 1, the block diagram of the architecture: in the networked system of the invention, the sender grades the audio, encodes and compresses it, and streams it to a streaming server 200, which forwards it over the Internet 500 to the receiver; both the sender and the receiver are audio content-aware devices 100 as proposed herein. Please also refer to FIG. 3, the flow chart of the method; the details of each step are explained below.

To handle audio grading and error compensation, the audio content-aware device 100 must include at least the following units:

The audio capture module 110 captures audio data 300 at the start of the method (step 310). The source may be plain audio data, such as sound received by a microphone, or audio/video data 400; since video and audio are carried separately during transmission, the audio portion alone can be extracted.

The audio analysis module 120 performs feature extraction on the audio data and assigns each frame an importance grade according to the volume intensity of the audio data 300 in the time domain and the size of its volume variance in the time domain (step 320). One embodiment of the feature extraction is as follows:

Sound analysis is usually "short-time analysis", because audio data 300 is relatively stationary over short intervals. The sound is first cut into frames 210, each about 20 ms long — roughly 256 or 512 samples — and the signal of each frame 210 is then analyzed. If the frame 210 is too large, the time-varying character of the audio data 300 cannot be captured; conversely, if the frame 210 is too small, the characteristics of the audio data 300 cannot be extracted. In general, a frame 210 must be long enough to contain several fundamental periods of the audio data 300. (Choosing the frame length as a power of 2 also facilitates the Fast Fourier Transform.)

In general, after the sound is divided into frames 210, the signal of each frame 210 is converted from the time domain to the frequency domain. This conversion can be achieved with a filter bank, a (modified) DCT, or a combination of filters and transforms. Each frequency-domain signal is then turned into transmittable data through quantization. Once quantized, the signal inevitably differs from the original; the difference between the quantized signal and the original is called the error signal. Since human hearing covers a limited frequency range, roughly 20 Hz to 20 kHz, minimizing this error signal — or placing it in frequency ranges where the ear is less sensitive — is the most critical part of processing the audio data 300. This work relies entirely on the psychoacoustic model of the human ear: according to this model, when one sound is present in the environment, certain components of other sounds are masked so that the ear cannot hear them. Such masking occurs, for different reasons and in different forms, in both the time domain and the frequency domain — temporal (non-simultaneous) masking and frequency-domain (simultaneous) masking.
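The short-time framing and time-to-frequency conversion described above can be sketched in a few lines of Python/NumPy. The roughly 20 ms frame length, the power-of-2 frame size of 512 samples, and the FFT come from the text; the random stand-in signal and the choice of a Hann window are assumptions made for the example:

```python
import numpy as np

def split_into_frames(signal, frame_len=512):
    """Cut a 1-D signal into non-overlapping frames 210 of frame_len samples.
    A power-of-2 length such as 512 suits the Fast Fourier Transform."""
    n_frames = len(signal) // frame_len
    return signal[:n_frames * frame_len].reshape(n_frames, frame_len)

def frame_spectra(frames):
    """Convert each time-domain frame to a magnitude spectrum
    (a Hann window is applied before the FFT; this windowing is assumed)."""
    window = np.hanning(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1))

signal = np.random.randn(48_000)    # stand-in for captured audio data 300
frames = split_into_frames(signal)
spectra = frame_spectra(frames)
print(frames.shape, spectra.shape)  # (93, 512) (93, 257)
```

Each row of `spectra` is then the input to the quantization and masking analysis that follows.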

Through this analysis of masking between sounds, special arrangements can be made for the error signal. Specifically, the psychoacoustic analysis identifies frequencies to which the ear is relatively insensitive because of masking, and during quantization of the audio data 300 more of the error signal can be hidden at those frequencies. In practice, the sound is converted to spectral data by the Fast Fourier Transform and analyzed by the auditory model, which produces, for the quantizer, the maximum error the ear can tolerate at each frequency. If, after quantization, the quantization error of every frequency-domain signal is below its tolerable maximum, the processed audio data 300 is indistinguishable from the original to the human ear.

If the bit rate is so low that the available bits are insufficient, and the error at some frequencies must exceed the tolerable maximum, the excess error is allocated to the less sensitive frequencies. Because the bit-rate constraint and the maximum tolerable error must be satisfied simultaneously, this error shaping is usually performed iteratively in a loop to achieve the best result. If the change between adjacent frames 210 should not be too abrupt, the frames 210 may be allowed to overlap; the overlap may range from 1/2 to 2/3 of the frame 210 length, and the more the overlap, the greater the amount of computation. Assuming the audio within a frame 210 is stationary, features can then be computed for that frame 210, such as the zero-crossing rate (ZCR).
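As a minimal sketch of the overlapped framing and the zero-crossing-rate feature just described — the 1/2-to-2/3 overlap range and the ZCR itself come from the text, while the toy sample values and the exact ZCR normalization are assumptions:

```python
def overlapped_frames(signal, frame_len=8, overlap=0.5):
    """Slice a signal into frames that overlap by `overlap` of their length;
    the text allows 1/2 to 2/3, and more overlap means more computation."""
    hop = int(frame_len * (1 - overlap))
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

sig = [0.2, -0.1, 0.3, -0.4, 0.5, -0.2, 0.1, 0.0, -0.3, 0.2, -0.1, 0.4]
frames = overlapped_frames(sig)
print([round(zero_crossing_rate(f), 2) for f in frames])  # [0.86, 0.86]
```

A high ZCR typically indicates unvoiced (fricative-like) content, one of the per-frame cues available alongside volume and pitch.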

Besides the zero-crossing rate (ZCR), typical frame features include volume, pitch, and Mel-frequency cepstral coefficients (MFCCs). Based on the zero-crossing rate, volume, and pitch, endpoint detection is performed and the feature information within the detected endpoints is retained. The three principal sound features are described as follows:

Pitch: represents how high or low a sound is, and can be likened to the fundamental frequency — the reciprocal of the fundamental period.

Timbre: represents the content of the sound (for example, a vowel in English), and can be likened to the variation of the waveform within one fundamental period.

Volume: represents the strength of the sound, also called "intensity" or "energy", and can be likened to the amplitude of the signal within a frame. It is commonly computed in two ways: as the sum of absolute values, or as the sum of squared values; in the latter case the base-10 logarithm is taken and multiplied by 10, giving a value in decibels (dB) — a logarithmic, relative measure that better matches the ear's perception of loudness.

After extracting features from the audio data 300, the invention performs a statistical analysis using two parameters — P (Power Estimation; hereafter P, representing the volume intensity of the sound in the time domain) and V (Variance; hereafter V, representing the volume variance of the sound in the time domain) — to estimate the amplitude energy of the audio data 300 in the time domain.

The audio analysis module 120 estimates P and V and, weighting their respective importance, computes the Shock Value 220:

S = aP + (1 - a)V,

where the value a is the parameter that adjusts the relative weight of the time-domain volume intensity and the time-domain volume variance. The parameter S 220 then serves as a reference index representing the degree of oscillation of the frame 210. The audio analysis module 120 evaluates the importance grade of a frame 210 according to its Shock Value S: the larger the S value 220 of a frame 210, the larger the amplitude oscillation of the sound — which usually indicates more important content — so the more emphasis is placed on protecting that frame 210, and conversely, the less. If the transmitted data is audio/video data 400, the error-resilience design for the audio data 300 is coordinated with the video data, e.g. a retransmission window of at most one Group of Pictures (GOP).

"Volume", however, is only an engineering formula approximating human hearing, and the ear's actual perception can differ from it considerably. To distinguish the two, the inventor uses "subjective volume" for the loudness the ear actually hears. For example, sounds of the same amplitude but different frequencies produce quite different subjective volume; the ear's sensitivity to sounds of different frequencies is its frequency response.
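The grading step can be sketched as follows. The formula S = aP + (1 - a)V and the rule that a larger S marks a more important frame come from the text; the concrete definitions of P as mean squared amplitude and V as sample variance are assumptions, since the patent does not give their exact formulas, and the S < 1 low-priority threshold is taken from the claims:

```python
def shock_value(frame, a=0.5):
    """S = a*P + (1-a)*V with 0 <= a <= 1.
    P: time-domain volume intensity (assumed here: mean squared amplitude).
    V: time-domain volume variance (assumed here: sample variance)."""
    n = len(frame)
    p = sum(x * x for x in frame) / n
    mean = sum(frame) / n
    v = sum((x - mean) ** 2 for x in frame) / n
    return a * p + (1 - a) * v

def grade(frame, a=0.5, threshold=1.0):
    """Frames whose S value falls below the threshold are low priority
    (typically silence or background noise)."""
    return "high" if shock_value(frame, a) >= threshold else "low"

loud = [3.0, -2.5, 2.8, -3.1]     # large oscillation -> important content
quiet = [0.01, -0.02, 0.01, 0.0]  # near-silence -> low priority
print(grade(loud), grade(quiet))  # high low
```

The parameter a lets the designer bias the grade toward raw loudness (a near 1) or toward oscillation (a near 0), as the text describes.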
After the S value 220 is attached, the audio data 300 is first placed in the buffer 130; once a sufficient amount of data has accumulated, the network module 140 streams it to the streaming server 200, which then forwards it over the Internet 500 to the receiver (step 330). The streaming server 200 provides an Unequal Error Protection (UEP) mechanism — for example, variable channel coding. Unequal error protection is a highly efficient way of protecting source-coded data; it is used in many speech and audio coding systems operating over error-prone channels such as mobile telephone networks or Digital Audio Broadcasting. The encoded bits are classified by error sensitivity and protected to different degrees; in the present invention, audio signals with different S values 220 can thus be given correspondingly different degrees of protection. When the Internet 500 becomes congested, some of the compressed audio data 300 sent by the streaming server 200 may be lost.

Referring to FIG. 2, a schematic diagram of audio data: suppose four frames 210 — (a), (b), (c), and (d) — are transmitted from left to right. If frame (a) fails to arrive, error resilience triggers automatic retransmission of the lost packet; and because the S value 220 of the next frame (b) is relatively very low, frame (a) may be retransmitted repeatedly even at the cost of never sending frame (b) — that is, the probability that the more important audio data 300 arrives intact is raised. In the other case, if frame (b) itself is lost in transit, the streaming server 200 or the sender can inspect its S value 220, decide that no retransmission is needed, and proceed directly to frame (c). In this way the system avoids the situation in which important audio fails to arrive during congestion while the audio data 300 that does arrive consists merely of content-free frames 210.

Accordingly, the receiver likewise stores the incoming graded audio data 300 (or audio/video data 400) in its buffer 130 and waits until a sufficient amount of data has accumulated before playback by the audio playback module 160. If data is lost in transit, error protection determines whether a packet is missing (step 340), so that errors in the audio data 300 caused by transmission can be corrected. Under the default retransmission delay of 20 ms, one retransmission is possible. If the S value 220 of the next frame 210 is below 1, its content is usually silence or background noise, meaning its priority is low; since such low-priority frames 210 fail more often and can adequately be represented by a silent signal, after the speech is decoded the error compensation module 150 — for example, a silence generator — fills the failed frame 210 slot with silent frame 210 data (step 350). The audio playback module 160 then plays the result, and the human ear 600 hears sound consistent with natural hearing.

The audio processing disclosed by the invention is not only simple but requires little additional computation, reducing the burden on system resources. Moreover, audio processed by this method remains a signal that any standard audio encoder/decoder can handle, so there is no compatibility problem, and any of various audio compression coding schemes may be used. During data transmission, even when packets are lost, smooth sound is maintained and popping artifacts are effectively reduced, so the user hears continuous audio without any sense that packets were lost; combined with the underlying transmission error-resilience mechanisms, the presentation of audio and video is excellently protected.

While the invention has been disclosed above by way of the preferred embodiments, they are not intended to limit the invention; those skilled in the art may make modifications and refinements without departing from the spirit and scope of the invention, whose patent protection is defined by the appended claims.

[Brief Description of the Drawings]
FIG. 1 is a block diagram of the architecture of the invention;
FIG. 2 is a schematic diagram of the audio data of the invention; and
FIG. 3 is a flow chart of the method of the invention.

[Description of Main Element Symbols]
100 audio content-aware device
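The receiver-side repair just described — detect a lost packet (step 340) and fill its slot with a silent frame (step 350) — can be sketched as follows; the list-of-packets representation and the use of `None` to mark a lost packet are assumptions made for the example:

```python
def conceal(received, frame_len=4):
    """received: list of (s_value, samples) tuples, with None marking a
    packet lost in transit.  Lost frames are replaced by silent frames,
    as the error compensation module 150 (a silence generator) does."""
    repaired = []
    for packet in received:
        if packet is None:                      # step 340: loss detected
            repaired.append([0.0] * frame_len)  # step 350: fill with silence
        else:
            s_value, samples = packet
            repaired.append(samples)
    return repaired

stream = [
    (2.5, [0.9, -0.8, 0.7, -0.9]),  # frame (a): high S value, arrived
    None,                           # frame (b): lost; its S value was low
    (1.8, [0.5, -0.6, 0.4, -0.5]),  # frame (c): arrived
]
repaired = conceal(stream)
print(repaired[1])  # [0.0, 0.0, 0.0, 0.0]
```

Because the substituted frame is silent rather than absent, playback stays continuous and the pop that a dropped frame would cause is avoided.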

110 audio capture module
120 audio analysis module
130 buffer
140 network module
150 error compensation module
160 audio playback module
200 streaming server
210 frame
220 S value
300 audio data
400 audio/video data
500 Internet
600 human ear
Step 310 the sender receives audio data
Step 320 analyze the Shock Value of the audio data and grade it
Step 330 stream the graded audio data to the streaming server, which forwards it over the Internet to the receiver
Step 340 the receiver determines whether any packet has been lost
Step 350 repair the audio data of the lost packets
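Putting steps 310 through 350 together, a toy end-to-end pass might look like this. Only the Shock Value formula, the UEP idea (high-S frames are better protected), and the silence fill come from the text; the simulated lossy channel, the seed, and all concrete numbers are assumptions:

```python
import random

def sender(frames, a=0.5):
    """Steps 310-320: receive frames and grade each by S = a*P + (1-a)*V."""
    graded = []
    for f in frames:
        n = len(f)
        p = sum(x * x for x in f) / n
        m = sum(f) / n
        v = sum((x - m) ** 2 for x in f) / n
        graded.append((a * p + (1 - a) * v, f))
    return graded

def channel(graded, loss=0.5, seed=1):
    """Step 330, crudely simulated UEP: frames with S >= 1 are treated as
    fully protected and always arrive; low-S frames may be dropped."""
    rng = random.Random(seed)
    return [pkt if pkt[0] >= 1.0 or rng.random() > loss else None
            for pkt in graded]

def receiver(packets, frame_len):
    """Steps 340-350: detect lost packets and substitute silent frames."""
    return [pkt[1] if pkt is not None else [0.0] * frame_len
            for pkt in packets]

frames = [[2.0, -2.0, 2.0, -2.0],   # loud: S = 4.0
          [0.0, 0.01, 0.0, -0.01],  # near-silence: S << 1
          [1.5, -1.5, 1.5, -1.5]]   # loud: S = 2.25
out = receiver(channel(sender(frames)), frame_len=4)
print(out[0] == frames[0], out[2] == frames[2])  # True True
```

Whatever the channel does to the low-priority frame, the two high-S frames always reach the listener intact — which is exactly the asymmetry the grading is meant to buy.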

Claims (1)

X. CLAIMS:

1. An audio content-aware device, the device having an audio capture module for capturing a plurality of audio data, a buffer for temporarily storing the graded audio data, a network module for streaming the graded audio data to a streaming server having an Unequal Error Protection (UEP) mechanism and for receiving a plurality of audio data sent by the streaming server over the Internet, and an audio playback module for playing the repaired audio data, characterized by:
an audio analysis module for dividing each audio data to be transmitted into a plurality of frames and performing feature extraction, and for grading each frame by its corresponding Shock Value S = aP + (1 - a)V, where the P (Power Estimation) value is the frame's time-domain volume intensity, the V (Variance) value is the frame's time-domain volume variance, and a is a parameter adjusting the relative weight of the time-domain volume intensity and the time-domain volume variance, with 0 ≤ a ≤ 1; and
an error compensation module for performing error detection on each received audio data and repairing lost frame data.
2. The device of claim 1, wherein the length of each frame is 20 ms.
3. The device of claim 1, wherein lost frame data is replaced by a silent audio signal.
4. The device of claim 3, wherein frames whose corresponding Shock Value is less than 1 are of low priority.
5. The device of claim 1, wherein the error detection uses error protection to determine whether a packet has been lost.
6. The device of claim 1, wherein the error compensation module is a silence generator.
7. An audio content-aware transmission method, comprising the following steps:
a sender divides the audio data to be transmitted into a plurality of frames and performs feature extraction;
each frame is graded by its corresponding Shock Value S = aP + (1 - a)V, where the P (Power Estimation) value is the frame's time-domain volume intensity, the V (Variance) value is the frame's time-domain volume variance, and a (0 ≤ a ≤ 1) is a parameter adjusting the relative weight of the two;
the graded audio data is streamed to a streaming server having an Unequal Error Protection (UEP) mechanism, and forwarded by the streaming server over the Internet to a receiver; and
the receiver performs error detection and repairs the audio data when packets are lost.
8. The method of claim 7, wherein the length of each frame is 20 ms.
9. The method of claim 7, wherein lost frame data is replaced by a silent audio signal.
10. The method of claim 7, wherein frames whose corresponding Shock Value is less than 1 are of low priority.
11. The method of claim 7, wherein the error detection uses error protection to determine whether a packet has been lost.
TW094143690A 2005-12-09 2005-12-09 Audio content sensor and method thereof TW200723248A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW094143690A TW200723248A (en) 2005-12-09 2005-12-09 Audio content sensor and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW094143690A TW200723248A (en) 2005-12-09 2005-12-09 Audio content sensor and method thereof

Publications (2)

Publication Number Publication Date
TW200723248A TW200723248A (en) 2007-06-16
TWI296406B true TWI296406B (en) 2008-05-01

Family

ID=45068757

Family Applications (1)

Application Number Title Priority Date Filing Date
TW094143690A TW200723248A (en) 2005-12-09 2005-12-09 Audio content sensor and method thereof

Country Status (1)

Country Link
TW (1) TW200723248A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI743786B (en) * 2020-05-18 2021-10-21 瑞昱半導體股份有限公司 Audio processing device and associated audio processing method

Also Published As

Publication number Publication date
TW200723248A (en) 2007-06-16
