KR20070046272A

KR20070046272A - Method for objective speech quality assessment

Info

Publication number: KR20070046272A
Application number: KR1020050102755A
Authority: KR
Inventors: 이민기; 김경태; 강홍구; 박영철; 윤대희
Original assignee: 연세대학교 산학협력단
Priority date: 2005-10-31
Filing date: 2005-10-31
Publication date: 2007-05-03
Also published as: KR100729555B1

Abstract

The present invention relates to a method for objective speech quality assessment, and more particularly to a method that assesses speech quality by assigning weights to packet loss, using only information about the synthesized (distorted) speech and without receiving any information about the original speech.

To this end, the present invention provides a method for objective speech quality assessment comprising: a first step in which a speech decoder decodes a speech signal that has passed through a communication channel and synthesizes a speech signal for every frame; a second step in which a speech characteristic classifier classifies the speech signal synthesized (distorted) in the first step according to its statistical characteristics; a third step in which a packet loss detector receives the distorted speech synthesized by the speech decoder and detects whether packet loss occurred on the way through the speech decoder; and a fourth step in which, when packet loss is detected in the third step, speech quality is assessed by assigning a weight to the packet loss.

Speech quality, speech quality assessment, single-ended, double-ended, packet loss, weight, non-intrusive, intrusive, objective, subjective assessment method

Description

Method for Objective Speech Quality Assessment

FIG. 1 is a block diagram showing the single-ended and double-ended schemes.

FIG. 2 is a flowchart of the process of assessing speech quality in the method for objective speech quality assessment according to the present invention.

FIG. 3 is a diagram showing an example of assigning weights in consideration of speech characteristics.

FIG. 4 is a scatter plot, obtained by simulation, of the scores of the method for objective speech quality assessment according to the present invention against PESQ-LQ.

<Description of reference numerals for the main parts of the drawings>

210: decoding step    220: classification step

230: detection step    240: evaluation step

The present invention relates to a method for objective speech quality assessment, and more particularly to a method that assesses speech quality by assigning weights to packet loss, using only information about the synthesized (distorted) speech and without receiving any information about the original speech.

Speech quality can be assessed subjectively through repeated listening tests with multiple users. The MOS (Mean Opinion Score) method is generally used. It is formalized in ITU-T P.800: after listening to a distorted speech signal, a test subject subjectively rates the degree of distortion on a scale of 1 to 5, as shown in Table 1 below.

Although this method has the advantage of being directly related to the speech quality perceived by the user, repeating it across diverse environments consumes considerable time, effort, and cost, which limits its practical use.

To overcome this problem, objective speech quality assessment methods that estimate the subjectively rated MOS were introduced. They are divided into time-domain methods, frequency-domain methods, and psychoacoustic-domain methods. Of these, psychoacoustic-domain assessment is the most widely used, because among the methods that compute the algebraic difference between the original and the distorted speech it is known to represent subjective speech quality best.

Several algorithms for psychoacoustic-domain assessment have been standardized by the ITU-T. A representative example is the intrusive algorithm, also known as the double-ended method, standardized in ITU-T P.862. ITU-T P.862 is known as PESQ (Perceptual Evaluation of Speech Quality) and compares, in the psychoacoustic domain, the original speech used at the transmitter with the distorted speech synthesized at the receiver.

PESQ assesses perceived speech quality from the algebraic difference between the original and the distorted speech in the psychoacoustic domain. However, it is an intrusive method that must compare the transmitter's original speech with the receiver's distorted speech every time, so it places additional load on the network and raises problems such as synchronizing the transmitter and the receiver.

To address these problems of PESQ, a non-intrusive algorithm, also known as the single-ended method, was developed and standardized in ITU-T P.563. It assesses speech quality from the distorted speech alone, without information about the original speech. The difference between the two schemes is apparent from the block diagram in FIG. 1.

However, because the single-ended method has no information about the original speech, it estimates a pseudo reference signal and computes the amount of noise from the difference between that pseudo reference signal and the distorted speech. As a result, its correlation with subjective speech quality is lower than that of PESQ.

The present invention was devised in view of the above. Its object is to provide a method for objective speech quality assessment in the psychoacoustic domain that is based on the single-ended scheme. Without receiving information about the original speech, and using only information about the speech synthesized (distorted) by the speech decoder, the method assesses speech quality when packet loss occurs from the degree of speech quality degradation and from weights determined in several ways.

Hereinafter, a preferred embodiment of the method for objective speech quality assessment according to the present invention is described in detail with reference to the accompanying drawings.

FIG. 2 is a flowchart of the process of assessing speech quality in the method according to the present invention. The method comprises a first step (S210) of decoding the speech signal, a second step (S220) of classifying it according to speech characteristics, a third step (S230) of detecting whether a packet has been lost, and a fourth step (S240) of assessing speech quality when packet loss has occurred.

In the first step (S210), the speech decoder decodes the speech signal that has passed through the communication channel. The speech signal carries speech information in units of frames, and the speech decoder outputs a synthesized (distorted) speech signal for every frame.

If the frame information of the speech signal is lost, however, the speech decoding step mitigates the loss by selecting a suitable frame from the previously used speech information and reusing it.

An algorithm of this kind is called a packet-loss concealment algorithm. Specifically, when the speech information of a frame is lost and packet loss occurs, the speech decoder at the receiver conceals the loss by reconstructing the lost speech information from the speech information of the previous frame, as prescribed by the packet-loss concealment algorithm. If the speech being reconstructed is stationary (time-invariant), the reconstructed speech information is reliable; if it is non-stationary (time-varying), the reconstructed speech information is not reliable.
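The simplest form of the concealment described above can be sketched as repeating the last correctly received frame. This is a minimal illustration only: the function name `conceal_by_repetition`, the use of `None` to mark a lost frame, and the choice of silence before the first good frame are assumptions made here, and actual codecs define their own, more elaborate concealment procedures.

```python
from typing import List, Optional

Frame = List[float]


def conceal_by_repetition(frames: List[Optional[Frame]], frame_len: int) -> List[Frame]:
    """Replace every lost frame (marked None) with a copy of the last good frame."""
    output: List[Frame] = []
    last_good: Frame = [0.0] * frame_len  # assumed: silence until a frame arrives
    for frame in frames:
        if frame is None:
            # Lost frame: reuse the most recent correctly received frame.
            output.append(list(last_good))
        else:
            last_good = frame
            output.append(list(frame))
    return output


# Frames 2, 4 and 5 were lost in transit.
received = [[0.1, 0.2], None, [0.3, 0.4], None, None]
print(conceal_by_repetition(received, frame_len=2))
```

Repetition works well while the speech is stationary, which is why the reliability of the reconstructed frame depends on the speech class around the loss.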

Also in the first step (S210), the speech signal is synthesized by the packet-loss concealment algorithm that is determined by the specific coder. The specific coder here is the one prescribed by the speech system in use, for example the coder used for Internet telephony or for mobile communication.

The second step (S220) is the speech characteristic classification step. It receives the distorted speech synthesized by the speech decoder in the first step and classifies it according to its statistical characteristics.

For the statistical characteristics, the frame classification used in the SMV (Selectable Mode Vocoder) can be used, for example. The SMV classification assigns each frame to one of the following classes, listed in Table 2 below: silence, noise-like unvoiced, unvoiced, onset, non-stationary voiced, and voiced. Class 4 is not used and is therefore omitted.

However, the SMV classification is not the only way to classify speech characteristics; other statistical characteristics may be used depending on the speech system. For example, speech characteristics may be classified using only the various parameters received by the speech decoder.

In the third step (S230), the packet loss detector detects whether packet loss occurred in the distorted speech synthesized by the speech decoder. This step determines only the presence or absence of packet loss at the speech decoder.

Next, step S231 checks the result. If packet loss occurred, the process proceeds to step S240. The case in which no packet loss occurred is not a feature of the present invention, so its description is omitted.

In the fourth step (S240), when packet loss has occurred, speech quality is assessed by assigning a weight to the packet loss. The step consists of step 4-1 (S241), which determines the degree of speech quality degradation of the frame; step 4-2 (S242), which determines the amount of speech quality degradation perceived by the listener; and step 4-3 (S243), which performs a regression analysis on that amount to obtain a MOS value and thereby assess speech quality.

In step 4-1 (S241), when packet loss has been detected in the third step, the degree of speech quality degradation of the current frame, d_pkloss(n), is determined in consideration of the speech characteristics before and after the packet loss. This degree of degradation d_pkloss(n) is the packet loss weight.

In particular, the packet loss weight in step 4-1 (S241) can be determined either according to speech characteristics or according to a probability model.

First, consider how the packet loss weight is determined according to speech characteristics. In general, a person on a call remembers a drop in call quality that occurred within a certain recent period, but no longer remembers it once more than that period has passed.

Accordingly, the amount of speech quality degradation is obtained in one of two ways: by summing the degrees of degradation over a fixed interval (for example, 8 seconds), or by assigning a fixed weight (for example, 1) to the most recent degree of degradation and progressively smaller weights (for example, approaching 0) to older ones before summing.

FIG. 3 shows an example of assigning packet loss weights in consideration of these speech characteristics. The y-axis is the packet loss weight and the x-axis is the speech characteristic class of Table 2. Panel (a) is the noise-like unvoiced case, (b) the unvoiced case, (c) the onset case, (d) the non-stationary voiced case, and (e) the stationary voiced case. As FIG. 3 shows, the weight is large in intervals where the speech changes abruptly, such as the onset case (c).

Here the state transition 6-6-6, the one most frequently observed in ordinary speech intervals, is taken as the reference with a weight of 1, and the packet loss weights of the other state transitions are expressed relative to it. (The 6 denotes SMV class 6, voiced, from the speech characteristic classification step; 0, 1, 2, 3, and 5 likewise denote the corresponding SMV classes.)

In addition, when packet losses occur consecutively, the amount of speech quality degradation may be determined by multiplying the packet loss weight of FIG. 3 by the number of consecutive losses. Alternatively, instead of multiplying by the count directly, an optimal weight may be found through repeated experiments and applied as a new weight.
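The lookup-and-multiply rule just described can be sketched as follows. The weight values in `EXAMPLE_WEIGHTS` are hypothetical placeholders, not the values plotted in FIG. 3, and the names `packet_loss_weight`, `Transition`, and the fallback `default` are assumptions introduced for illustration.

```python
from typing import Dict, Tuple

Transition = Tuple[int, int, int]  # (previous, current, next) SMV frame classes

# Hypothetical weights for illustration only; the actual values are those of FIG. 3.
EXAMPLE_WEIGHTS: Dict[Transition, float] = {
    (6, 6, 6): 1.0,  # reference transition: steady voiced speech
    (2, 3, 6): 2.5,  # assumed larger weight around an onset
}


def packet_loss_weight(
    transition: Transition,
    consecutive_losses: int = 1,
    weights: Dict[Transition, float] = EXAMPLE_WEIGHTS,
    default: float = 1.0,
) -> float:
    """Return the per-transition weight multiplied by the number of consecutive losses."""
    if consecutive_losses < 1:
        raise ValueError("consecutive_losses must be at least 1")
    # Transitions missing from the table fall back to the reference weight (an assumption).
    return weights.get(transition, default) * consecutive_losses


print(packet_loss_weight((6, 6, 6)))
print(packet_loss_weight((2, 3, 6), consecutive_losses=2))
```

The experimentally tuned alternative mentioned in the text would replace the plain multiplication by a function of the burst length fitted to listening data.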

Next, consider how the packet loss weight is determined according to a probability model. Unlike the method based on speech characteristics, the probability-model method determines the packet loss weight, and thus assesses speech quality, when the states before and after the packet loss are known.

For example, suppose the preceding speech class is 1, the following speech class is 6, and the current speech class can be either 2 or 5, with a probability of 30% that it is 2 and 70% that it is 5. When a choice must be made between the packet loss weight for 1-2-6 and the packet loss weight for 1-5-6, either the weight for the more probable transition 1-5-6 is taken, or the two weights are combined using the probabilities 0.3 and 0.7 as weighting factors. In this way, speech quality can be assessed by assigning packet loss weights according to a probability model.
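Both options in this example can be sketched in a few lines. The probabilities 0.3 and 0.7 come from the text; the packet loss weights 1.8 and 1.2 for the 1-2-6 and 1-5-6 transitions are hypothetical values chosen only to make the sketch concrete, as are the function names.

```python
from typing import List, Tuple

# Each candidate is (probability of the current class, packet loss weight of the transition).
Candidate = Tuple[float, float]


def most_likely_weight(candidates: List[Candidate]) -> float:
    """Option 1: take the weight of the most probable transition."""
    return max(candidates, key=lambda c: c[0])[1]


def expected_weight(candidates: List[Candidate]) -> float:
    """Option 2: combine the weights using the probabilities as weighting factors."""
    return sum(p * w for p, w in candidates)


# 1-2-6 with probability 0.3 and 1-5-6 with probability 0.7; weights are assumed values.
candidates = [(0.3, 1.8), (0.7, 1.2)]
print(most_likely_weight(candidates))
print(expected_weight(candidates))
```

The first option keeps a weight that actually appears in the table, while the second smooths between candidates when neither class is clearly dominant.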

In step 4-2 (S242), the degree of speech quality degradation (packet loss weight) determined in step 4-1 (S241) is multiplied by the time weight for each of the past N frames, and the products are summed to determine the amount of speech quality degradation d(n). The amount of degradation d(n) is given by Equation 1 below.

Here W_m is the time weight and d_pkloss(n) is the degree of speech quality degradation (packet loss weight).
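Equation 1 itself is not reproduced in this text. A form consistent with the description, a sum over the past N frames of the time weight multiplied by the packet loss weight, would be the following. The convention that m counts frames backwards from the current frame n, starting at 0, is an assumption.

```latex
d(n) = \sum_{m=0}^{N-1} W_m \, d_{\mathrm{pkloss}}(n - m)
```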

The overall amount of speech quality degradation is obtained from the time weights as follows. To obtain the amount of degradation currently perceived, the degrees of degradation (packet loss weights) over a predetermined period (for example, 8 seconds) are accumulated. If the speech does not exceed this predetermined period, the time weight W_m in Equation 1 is set to 1 when computing the overall amount of degradation.

For speech longer than the predetermined period, however, the time weight W_m is given different values over time so that the most recent packet loss weights count more than older ones. For example, a time weight of 1 can be used within the last 8 seconds, and a value smaller than 1 for anything older than 8 seconds.
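The time-weighted accumulation can be sketched as below. The 8-second window and the weight of 1 inside it follow the text; the exponential decay beyond the window, the `decay` value, and the 10 ms default frame length are assumptions made for illustration, since the text only requires a value smaller than 1.

```python
from typing import Sequence


def time_weight(age_seconds: float, window_seconds: float = 8.0, decay: float = 0.5) -> float:
    """Weight 1 inside the window; an assumed exponential decay for older frames."""
    if age_seconds <= window_seconds:
        return 1.0
    return decay ** (age_seconds - window_seconds)


def degradation_amount(
    d_pkloss: Sequence[float],
    frame_seconds: float = 0.01,
    window_seconds: float = 8.0,
    decay: float = 0.5,
) -> float:
    """Sum the per-frame packet loss weights, each scaled by its time weight.

    d_pkloss[-1] belongs to the current frame; earlier entries are older frames.
    Frames without packet loss contribute 0.
    """
    total = 0.0
    for m, d in enumerate(reversed(d_pkloss)):
        total += time_weight(m * frame_seconds, window_seconds, decay) * d
    return total


# Three 10 ms frames, all inside the window, so every time weight is 1.
print(degradation_amount([1.0, 0.0, 2.0]))
```

With short utterances every frame falls inside the window and the function reduces to a plain sum, matching the case in which W_m is set to 1.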

In step 4-3 (S243), speech quality is assessed by performing a regression analysis on the amount of speech quality degradation determined in step 4-2, yielding a MOS value, the standard measure of subjective speech quality. Equation 2 below gives the MOS value through a P-th order regression.

Here d(n) is the amount of speech quality degradation and α_k is the regression coefficient of order k. The regression equation used depends on the type of speech classification method and the type of speech coder.
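Equation 2 itself is not reproduced in this text. Given that it is a P-th order regression in d(n) with one coefficient α_k per order, a consistent form would be the polynomial below. Starting the sum at k = 0, so that α_0 acts as the intercept, is an assumption.

```latex
\mathrm{MOS}(n) = \sum_{k=0}^{P} \alpha_k \, d(n)^{k}
```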

FIG. 4 is a scatter plot, obtained by simulation, of the scores of the method according to the present invention against PESQ-LQ. The x-axis is the speech quality score of the present invention (PKLOSS-MOS-EST) and the y-axis is the speech quality score according to PESQ-LQ.

The experiment used speech samples from the Coded Speech Data used by the ITU-T (ITU-T Supplement 23), spoken in English by two female and two male speakers. Packet loss was generated at random at rates from 0 to 10%, 64,000 times in total, and the speech samples decoded with the G.729 speech coder were used.

As described above, PESQ is a double-ended (intrusive) method that requires both the original and the distorted speech. It is accurate enough that the ITU-T reports its correlation with subjective speech quality as 0.9 to 0.95, and a conversion to the LQ (Listening Quality) scale, which better reflects perceived speech quality, has recently been proposed. Therefore, by computing the correlation between PESQ-LQ and the method according to the present invention, the correlation between the present invention and subjective speech quality was obtained indirectly.

FIG. 4 shows the distribution of the MOS values estimated by the method according to the present invention, after the final regression analysis, against the MOS values estimated by PESQ-LQ. The correlation was 0.9116.
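A correlation figure of this kind can be computed as sketched below, assuming the Pearson correlation coefficient is meant (the text does not name the measure). The score lists are made-up placeholders, not data from the experiment.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


# Placeholder scores: estimated MOS from the method vs. PESQ-LQ for the same samples.
estimated_mos = [3.9, 3.4, 2.8, 2.1, 1.7]
pesq_lq = [4.0, 3.3, 3.0, 2.2, 1.5]
print(round(pearson(estimated_mos, pesq_lq), 4))
```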

In conclusion, the correlation of 0.9116 between the MOS values estimated by the method according to the present invention and those estimated by PESQ-LQ indicates that the correlation between the method and subjective speech quality can also be expected to be close to 1. Subjective speech quality can therefore be assessed by the method according to the present invention.

Although the present invention has been described using a preferred embodiment, the scope of the present invention is not limited to that specific embodiment and should be construed according to the appended claims.

As described above, the method for objective speech quality assessment according to the present invention assesses speech quality using only information about the synthesized speech signal and information about packet loss, so it requires little computation and can determine the MOS value in real time.

In addition, the method according to the present invention is single-ended yet uses packet loss weight parameters, which raises the correlation with subjective speech quality, the low value of which is the weakness of the single-ended scheme.

Claims

A method for objective speech quality assessment, comprising:

a first step of decoding, in a speech decoder, a speech signal that has passed through a communication channel and synthesizing a speech signal for every frame;

a second step of classifying, in a speech characteristic classifier, the speech signal synthesized (distorted) in the first step according to its statistical characteristics;

a third step of receiving, in a packet loss detector, the distorted speech synthesized by the speech decoder and detecting whether packet loss occurred while passing through the speech decoder; and

a fourth step of, when packet loss is detected in the third step, assessing speech quality by assigning a weight to the packet loss.

The method of claim 1, wherein the first step

comprises synthesizing the speech signal by a packet-loss concealment algorithm determined according to a specific coder.

The method of claim 1 or 2, wherein the fourth step comprises:

a step 4-1 of determining the degree of speech quality degradation (packet loss weight) based on the packet loss detected in the third step;

a step 4-2 of multiplying the degree of speech quality degradation (packet loss weight) determined in step 4-1 by a time weight and summing the products to determine the amount of speech quality degradation; and

a step 4-3 of assessing speech quality by performing a regression analysis on the amount of speech quality degradation determined in step 4-2.

The method of claim 4, wherein the packet loss weight

is determined according to the speech characteristics classified in at least one frame before and after the packet loss detected in the third step.

The method of claim 4, wherein, when the packet loss

occurs consecutively, the packet loss weight is multiplied by the number of occurrences.

The method of claim 4, wherein, when the packet loss

occurs consecutively, an optimal weight is used.

The method of claim 4, wherein the packet loss weight

is determined according to a probability model of the packet loss detected in the third step.

The method of claim 3, wherein the time weight of step 4-2

reflects how the perception of sound quality degradation changes over time.