KR101233628B1

KR101233628B1 - Voice conversion method and terminal device having the same

Info

Publication number: KR101233628B1
Application number: KR1020100127528A
Authority: KR
Inventors: 최현규; 조용준
Original assignee: 유비벨록스(주)
Priority date: 2010-12-14
Filing date: 2010-12-14
Publication date: 2013-02-14
Also published as: KR20120066276A

Abstract

Disclosed are a voice conversion method that converts the emotional portion of a person's voice into a normal form before transmission, and a terminal device to which the method is applied. The method includes: collecting a voice signal and detecting a first specific section in which the degree of regularity among the components of the voice signal is at or above a predetermined level; converting the detected first specific section into a frequency band and then amplifying it to generate a second specific section; analyzing the voice waveform in the second specific section to extract emotion inclusion information; comparing the emotion inclusion information with emotion reference information accumulated through learning and judging whether a reference value is met or exceeded, thereby determining an emotion-mixed state; and, when the emotion-mixed state is determined, lowering the specific section of the voice signal that carries the emotion inclusion information to or below the reference value and transmitting it.
Accordingly, the invention delivers a purified voice signal to the other party, which prevents damage to the speaker's image in the eyes of the other party and can improve interpersonal relationships.

Description

VOICE CONVERSION METHOD AND TERMINAL DEVICE HAVING THE SAME

The present invention relates to a voice conversion method and a terminal device to which the method is applied, and more particularly to a voice conversion method that converts the emotional portion of a person's voice into a normal form before transmission, and a terminal device to which the method is applied.

Recently, human-machine interface technologies have been developed so that mechanical devices can recognize human emotions. These technologies aim to quantify the emotional elements contained in the human body or voice and put them to use in everyday life.

However, quantifying the emotions contained in the human body or voice and implementing that in a mechanical device is far from easy, because human emotions are subtle and highly variable, which makes them hard to reduce to accurate measured values.
Despite this difficulty, it would be highly desirable if, during a telephone call, the emotion in the speaker's voice could be purified before the voice is delivered to the other party.

(deleted)

The present invention has been devised to solve the above problem. Its object is to provide a voice conversion method that extracts the emotion inclusion information contained in the speaker's voice, converts the voice into a normal voice free of emotion, and delivers it to the other party, and a terminal device to which the method is applied.

To achieve the above object and to perform the characteristic functions described below, the present invention has the following features.

According to an embodiment of the present invention, there is provided a voice conversion method comprising: (a) collecting a voice signal and detecting a first specific section in which the degree of regularity among the components of the voice signal is at or above a predetermined level; (b) converting the detected first specific section into a frequency band and then amplifying it to generate a second specific section; (c) analyzing the voice waveform in the second specific section to extract emotion inclusion information, the emotion inclusion information including amplitude magnitude, length, angle, maximum value, amount of energy, and number of feature points; (d) comparing the emotion inclusion information with emotion reference information accumulated through learning and judging whether a reference value is met or exceeded, thereby determining an emotion-mixed state; and (e) when the emotion-mixed state is determined, lowering the specific section of the voice signal that carries the emotion inclusion information to or below the reference value and transmitting it.

Here, step (e) may apply a lossy compression method to the voice signal in the specific section to reduce the amount of energy, and lower the section to or below the reference value before transmission.

In addition, the emotion reference information is emotion-laden data accumulated by identifying the phonemic and prosodic elements of a number of sample voice signals. The phonemic elements may include pitch, energy, and speaking-rate parameter information, and the prosodic elements may include section pitch, mean energy, standard deviation, and maximum value parameter information.

In addition, the lossy compression may use any one method selected from the group consisting of CELP, G.711, G.726, HILN, AMR, and Speex.

According to another aspect of the present invention, there is provided a terminal device comprising: a first specific section detection unit that collects a voice signal and detects a first specific section in which the degree of regularity among the components of the voice signal is at or above a predetermined level; a second specific section generation unit that converts the detected first specific section into a frequency band and then amplifies it to generate a second specific section; an emotion inclusion information extraction unit that analyzes the voice waveform in the second specific section to extract emotion inclusion information, the emotion inclusion information including amplitude magnitude, length, angle, maximum value, amount of energy, and number of feature points; an emotion inclusion state determination unit that compares the emotion inclusion information with emotion reference information accumulated through learning and judges whether a reference value is met or exceeded, thereby determining an emotion-mixed state; and a voice signal adjustment unit that, when the emotion-mixed state is determined, lowers the specific section of the voice signal that carries the emotion inclusion information to or below the reference value and transmits it.

Here, the terminal device is either a mobile terminal or a fixed terminal, and the mobile terminal may be any one selected from the group consisting of an LTE terminal, a W-CDMA terminal, an HSDPA terminal, a WiBro terminal, a PDA terminal, and a CDMA terminal.

According to the present invention, the emotion inclusion information, that is, the part of the speaker's voice in which emotion is mixed, is extracted, and the specific section of the voice signal corresponding to that information is lossily compressed and adjusted to or below the reference value. A purified voice signal is thereby delivered to the other party, which prevents damage to the speaker's image in the eyes of the other party and can greatly improve interpersonal relationships.

FIG. 1 is a diagram illustrating a voice conversion method according to an embodiment of the present invention.
FIGS. 2 and 3 are diagrams illustrating voice waveforms of the second specific section, used to explain the emotion inclusion information according to an embodiment of the present invention.
FIG. 4 is a block diagram illustrating a terminal device 200 according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art can readily practice the invention. In the drawings, like reference numerals denote the same or similar functions throughout.

FIG. 1 is a diagram illustrating a voice conversion method according to an embodiment of the present invention. FIG. 1 is described with reference to FIGS. 2 and 3 as concrete examples.

Referring to FIG. 1, the voice conversion method according to an embodiment of the present invention comprises detecting a specific section of the voice signal (S110), generating a second specific section (S120), extracting emotion inclusion information (S130), determining an emotion-mixed state (S140), and adjusting the voice signal to or below a reference (S150).

First, step S110 collects the voice signal uttered by the speaker and detects a first specific section in which the degree of regularity among the components of the voice signal is at or above a predetermined level. The first specific section contains speech roughly the length of a particular word or sentence, and because the regularity of that speech varies strongly, it is very likely to contain an emotional change. The first specific section can therefore be found readily by checking where the degree of regularity variation is at or above the predetermined level.
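The text does not define how the "degree of regularity" is measured. As a minimal sketch, assuming that the frame-to-frame change in short-time energy is an acceptable proxy, step S110 could look like the following. The function names, `frame_len`, and `threshold` are hypothetical and do not come from the patent.

```python
from typing import List, Tuple


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_first_sections(
    samples: List[float], frame_len: int, threshold: float
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges in which the energy change between
    successive frames is at or above `threshold`.

    The energy change is an ASSUMED stand-in for the patent's
    "degree of regularity variation".
    """
    energies = frame_energies(samples, frame_len)
    sections: List[Tuple[int, int]] = []
    start = None
    for k in range(1, len(energies)):
        irregular = abs(energies[k] - energies[k - 1]) >= threshold
        if irregular and start is None:
            start = k  # an irregular run begins at this frame
        elif not irregular and start is not None:
            sections.append((start * frame_len, k * frame_len))
            start = None
    if start is not None:  # run extends to the end of the signal
        sections.append((start * frame_len, len(energies) * frame_len))
    return sections
```

With a quiet signal that contains two loud bursts, the function returns one range covering the frames where the energy jumps. A perfectly steady signal yields no sections.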

Next, step S120 converts the detected first specific section into a frequency band and then amplifies it to generate a second specific section. The conversion into a frequency band is performed by a filter, and amplifying the band-converted section of the voice signal yields the second specific section of the enlarged voice signal. The filter is preferably a band-limiting filter. The second specific section generated in this way is used to identify the emotion inclusion information, as described below.
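The patent names only "a band-limiting filter" followed by amplification. One simple way to illustrate the idea, assuming a cascade of a one-pole high-pass and a one-pole low-pass filter, is sketched below. The cutoff frequencies, the gain, and the function name are hypothetical choices, not values from the patent.

```python
import math
from typing import List


def band_limit_and_amplify(
    section: List[float],
    sample_rate: float,
    low_hz: float,
    high_hz: float,
    gain: float,
) -> List[float]:
    """Band-limit `section` to roughly [low_hz, high_hz], then multiply by `gain`."""
    dt = 1.0 / sample_rate
    rc_hp = 1.0 / (2.0 * math.pi * low_hz)   # time constant of the high-pass stage
    rc_lp = 1.0 / (2.0 * math.pi * high_hz)  # time constant of the low-pass stage
    a_hp = rc_hp / (rc_hp + dt)
    a_lp = dt / (rc_lp + dt)

    out: List[float] = []
    hp_prev = 0.0  # previous high-pass output
    x_prev = 0.0   # previous input sample
    lp_prev = 0.0  # previous low-pass output
    for x in section:
        hp = a_hp * (hp_prev + x - x_prev)   # removes content below low_hz
        lp = lp_prev + a_lp * (hp - lp_prev)  # removes content above high_hz
        out.append(gain * lp)                 # amplification ("enlarged" section)
        hp_prev, x_prev, lp_prev = hp, x, lp
    return out
```

For example, `band_limit_and_amplify(section, 8000, 300.0, 3400.0, 2.0)` keeps a typical telephone speech band and doubles its level. A sharper filter could be substituted without changing the surrounding steps.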

Next, step S130 analyzes the voice waveform in the second specific section to extract the emotion inclusion information. The analysis determines properties of the waveform in the second specific section such as its amplitude magnitude, length, angle, maximum value, amount of energy, and number of feature points, from which the emotion inclusion information is extracted.

The extraction of the emotion inclusion information is described below with reference to FIGS. 2 and 3.

FIGS. 2 and 3 illustrate voice waveforms of the second specific section, used to explain the emotion inclusion information according to an embodiment of the present invention. FIG. 2 shows a voice waveform in a form from which emotion inclusion information can be extracted, and FIG. 3 marks, with reference numerals, the elements of the FIG. 2 waveform from which the emotion inclusion information is extracted.

That is, as shown in FIG. 3, the emotion inclusion information is extracted by determining, in the voice waveform 100 of the specific section of the voice signal, the amplitude magnitude 110, the length 120, the angle 130, the maximum value 140, the amount of energy 150, and the number of feature points such as the maximum value 140.
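The six quantities listed above can be computed directly from the samples of a section. The sketch below is one possible reading, with two assumptions that the patent does not spell out: the "angle" is taken as the rise angle from the first sample to the peak, and "feature points" are taken as strict local maxima. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class EmotionFeatures:
    amplitude: float     # peak-to-peak amplitude (110)
    length: int          # section length in samples (120)
    angle: float         # ASSUMED: rise angle in radians, first sample to peak (130)
    maximum: float       # largest sample value (140)
    energy: float        # sum of squared samples (150)
    feature_points: int  # ASSUMED: number of strict local maxima


def extract_features(section: List[float]) -> EmotionFeatures:
    """Compute the emotion inclusion features of one second specific section."""
    if not section:
        raise ValueError("section must not be empty")
    peak_index = max(range(len(section)), key=lambda i: section[i])
    maximum = section[peak_index]
    # atan2(rise, run): rise in amplitude units, run in samples.
    angle = math.atan2(maximum - section[0], peak_index)
    local_maxima = sum(
        1
        for i in range(1, len(section) - 1)
        if section[i - 1] < section[i] > section[i + 1]
    )
    return EmotionFeatures(
        amplitude=maximum - min(section),
        length=len(section),
        angle=angle,
        maximum=maximum,
        energy=sum(x * x for x in section),
        feature_points=local_maxima,
    )
```

The resulting record can be flattened into a numeric vector for the comparison performed in step S140.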

Next, step S140 compares the emotion inclusion information extracted in step S130 with the emotion reference information accumulated through learning, judges whether the reference value is met or exceeded, and thereby determines the emotion-mixed state.

Here, the emotion reference information refers to emotion-laden data accumulated by identifying the phonemic and prosodic elements of a number of sample voice signals. The phonemic elements are parameter information measurable in the voice signal spoken by the other party, such as pitch, energy, and speaking rate. The prosodic elements are parameter information of the voice signal spoken by the other party such as section pitch, mean energy, standard deviation, and maximum value.

The emotion reference information is thus the result of analyzing, from the phonemic parameters, the count and magnitude of the pitch, the magnitude of the energy, and the speaking rate, and, from the prosodic parameters, the count and magnitude of the section pitch, the mean energy, the standard deviation, the maximum value, and so on. In short, the emotion reference information is accumulated data analyzed in finer detail with respect to the amplitude magnitude, length, angle, maximum value, and number of oscillation repetitions of the voice waveform.

Accordingly, step S140 finds the emotion inclusion information that most closely matches each defined element of the learned emotion reference information, compares the two, and judges whether the reference value is met or exceeded, which determines the emotion-mixed state.
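The closest-match comparison can be sketched as a nearest-neighbor lookup followed by a threshold test. The patent does not say how closeness is scored, so the example assumes Euclidean distance mapped to a similarity in (0, 1]. The labels, vectors, and the reference value are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def closest_reference(
    features: List[float], references: Dict[str, List[float]]
) -> Tuple[str, float]:
    """Return (label, distance) of the reference vector nearest to `features`."""

    def distance(ref: List[float]) -> float:
        return math.sqrt(sum((f - r) ** 2 for f, r in zip(features, ref)))

    label = min(references, key=lambda name: distance(references[name]))
    return label, distance(references[label])


def is_emotion_mixed(
    features: List[float],
    references: Dict[str, List[float]],
    reference_value: float,
) -> bool:
    """True when the similarity to the nearest emotional reference is at or
    above `reference_value`. The similarity formula is an ASSUMPTION."""
    _, dist = closest_reference(features, references)
    similarity = 1.0 / (1.0 + dist)  # 1.0 for an exact match, toward 0 when far
    return similarity >= reference_value
```

In practice the feature vectors would need to be normalized to comparable scales before distances are meaningful.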

Alternatively, the emotion reference information of the phonemic and prosodic elements may be used to find and compare the Gaussian mixture distributions for the emotion inclusion information contained in the specific section of the voice signal, and the emotion-mixed state may be determined by judging whether the largest probability value among them is at or above the reference value.
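A minimal sketch of this alternative, assuming one diagonal-covariance Gaussian mixture per emotion class and scoring in the log domain, is shown below. The model parameters would come from training on the sample voice signals. The names and data layout here are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# One mixture component: (weight, means, variances), diagonal covariance assumed.
Component = Tuple[float, List[float], List[float]]


def gmm_log_likelihood(x: List[float], components: List[Component]) -> float:
    """log p(x) under a diagonal-covariance Gaussian mixture."""
    log_terms = []
    for weight, means, variances in components:
        log_pdf = sum(
            -0.5 * (math.log(2.0 * math.pi * var) + (xi - mu) ** 2 / var)
            for xi, mu, var in zip(x, means, variances)
        )
        log_terms.append(math.log(weight) + log_pdf)
    peak = max(log_terms)
    # log-sum-exp keeps the sum numerically stable
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def most_likely_emotion(
    x: List[float], models: Dict[str, List[Component]]
) -> Tuple[str, float]:
    """Return the (label, log-likelihood) pair with the largest likelihood."""
    scored = ((label, gmm_log_likelihood(x, comps)) for label, comps in models.items())
    return max(scored, key=lambda pair: pair[1])
```

The returned log-likelihood would then be compared with a reference value expressed on the same scale to decide whether the section is in the emotion-mixed state.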

Finally, step S150 lowers the specific section that carries the emotion inclusion information, within the voice signal determined to be in the emotion-mixed state, to or below the reference value of the emotion-mixed state, so that a voice signal with reduced emotional content can be transmitted over a wired or wireless communication network. The reference value here is the value set in step S140.

In addition, step S150 further applies a lossy compression method to the voice signal in the specific section to reduce the amount of energy carried by the emotion inclusion information. Given the emotion-mixed state determined in step S140, the lossy compression applies any one method selected from the group consisting of CELP, G.711, G.726, HILN, AMR, and Speex: the energy that carries the emotion-mixed state is reduced, and the remaining region, excluding the reduced frequency region, is lossily compressed.
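To make "lossy compression" concrete, the sketch below uses the continuous μ-law companding curve with μ = 255, the curve that the μ-law variant of G.711 approximates. It is only an illustration of a lossy round trip: the real G.711 codec uses a segmented 8-bit encoding, and the 127-level quantizer and function names here are simplifying assumptions.

```python
import math

MU = 255.0  # companding constant of the mu-law curve


def mulaw_encode(x: float) -> int:
    """Compress a sample in [-1, 1] to an integer code in [-127, 127]."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range input
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return round(y * 127)


def mulaw_decode(code: int) -> float:
    """Expand an integer code back to an approximate sample in [-1, 1]."""
    y = code / 127.0
    return math.copysign(((1.0 + MU) ** abs(y) - 1.0) / MU, y)
```

Decoding an encoded sample returns a value close to, but generally not equal to, the original, which is what makes the scheme lossy. Small amplitudes are preserved with finer resolution than large ones.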

On top of this result, step S150 further lowers the section to or below the reference value of the emotion-mixed state, suppressing the voice signal that carries emotion inclusion information as far as possible so that a voice at a normal level is transmitted to the other party.
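The lowering operation can be pictured as scaling the flagged section until its level falls under the reference. The sketch below assumes that the reference value is expressed as an RMS level, which the patent does not state. `margin` and the function name are hypothetical.

```python
import math
from typing import List


def attenuate_section(
    samples: List[float],
    start: int,
    end: int,
    reference_rms: float,
    margin: float = 0.9,
) -> List[float]:
    """Return a copy of `samples` in which samples[start:end] is scaled so that
    its RMS level is `margin * reference_rms`, i.e. below the reference.

    Treating the reference value as an RMS level is an ASSUMPTION.
    """
    section = samples[start:end]
    if not section:
        return list(samples)
    rms = math.sqrt(sum(x * x for x in section) / len(section))
    target = reference_rms * margin
    if rms <= target:
        return list(samples)  # already below the reference, leave untouched
    gain = target / rms
    return list(samples[:start]) + [x * gain for x in section] + list(samples[end:])
```

Only the flagged section is changed, so the rest of the utterance keeps its original level.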

In this way, the present embodiment converts the voice signal of the specific section that carries the emotion-mixed state into a normal form, which offers the advantage of suppressing emotion-laden speech as far as possible.

A terminal device to which the voice conversion method described above can be applied is illustrated below.

FIG. 4 is a block diagram illustrating a terminal device 200 according to an embodiment of the present invention.

As shown in FIG. 4, the terminal device 200 according to an embodiment of the present invention includes a first specific section detection unit 210, a second specific section generation unit 220, an emotion inclusion information extraction unit 230, an emotion inclusion state determination unit 240, a voice signal adjustment unit 250, an information providing unit 260, a control unit 270, and an information storage unit 280.

First, the first specific section detection unit 210 collects a voice signal and detects a first specific section in which the degree of regularity among the components of the voice signal is at or above a predetermined level. The second specific section generation unit 220 converts the first specific section detected by the first specific section detection unit 210 into a frequency band and then amplifies it to generate a second specific section. The form of the second specific section has been fully discussed with reference to FIGS. 2 and 3 and is not repeated here.

The emotion inclusion information extraction unit 230 analyzes the voice waveform in the second specific section to extract the emotion inclusion information, which it can obtain by analyzing the amplitude magnitude, length, angle, maximum value, amount of energy, and number of feature points of the waveform. The emotion inclusion state determination unit 240 compares the emotion inclusion information extracted by the emotion inclusion information extraction unit 230 with the emotion reference information accumulated through learning, judges whether the reference value is met or exceeded, and thereby determines the emotion-mixed state.

When the emotion inclusion state determination unit 240 determines the emotion-mixed state, the voice signal adjustment unit 250 lowers the specific section of the voice signal that carries the emotion inclusion information to or below the reference value, and additionally applies a lossy compression method to the voice signal in that section to reduce the amount of energy. The information providing unit 260 then transmits the result produced by the voice signal adjustment unit 250 to the other party's terminal over a wired or wireless communication network.

The wired or wireless communication network refers to a network capable of sending and receiving data over the Internet Protocol using various wired and wireless communication technologies, such as the Internet, an intranet, or a mobile communication network. For a wireless network, the terminal device may be any one selected from the group consisting of an LTE terminal, a W-CDMA terminal, an HSDPA terminal, a WiBro terminal, a PDA terminal, and a CDMA terminal. The terminal device may also be a wired terminal.

Finally, the control unit 270 controls the data flow between the modules, and the information storage unit 280 stores the data processed by each module.
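The way the units of FIG. 4 hand data to one another can be sketched as a small pipeline. This is only an illustration of the data flow described above: each unit is modeled as a plain callable, the `process` method plays the role of the control unit 270, and all names and signatures are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Signal = List[float]
Section = Tuple[int, int]


@dataclass
class TerminalDevice:
    detect: Callable[[Signal], List[Section]]     # detection unit 210
    generate: Callable[[Signal], Signal]          # generation unit 220
    extract: Callable[[Signal], List[float]]      # extraction unit 230
    is_emotional: Callable[[List[float]], bool]   # determination unit 240
    adjust: Callable[[Signal, Section], Signal]   # adjustment unit 250
    send: Callable[[Signal], None]                # information providing unit 260
    storage: List[dict] = field(default_factory=list)  # information storage unit 280

    def process(self, signal: Signal) -> Signal:
        """Route data between the units (role of the control unit 270)."""
        out = list(signal)
        for start, end in self.detect(signal):
            second = self.generate(signal[start:end])
            features = self.extract(second)
            emotional = self.is_emotional(features)
            self.storage.append({"section": (start, end), "emotional": emotional})
            if emotional:
                out = self.adjust(out, (start, end))
        self.send(out)
        return out
```

Because every unit is injected, any of the earlier per-step sketches, or a production implementation, can be plugged in without changing the control flow.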

In this way, by putting the voice conversion method into practice, the present embodiment offers the major advantage that a technique previously applied offline, as in conventional lie detectors, can be applied in structures such as wired and wireless networks.

Embodiments of the present invention have been described above with reference to the accompanying drawings, but those of ordinary skill in the art will understand that the invention can be practiced in other specific forms without changing its technical idea or essential features. The embodiments described above are therefore illustrative in all respects and not restrictive.

100: voice waveform 110: amplitude magnitude
120: length of the voice waveform 130: angle of the voice waveform
140: maximum value of the voice waveform 150: amount of energy of the voice waveform
200: terminal device 210: first specific section detection unit
220: second specific section generation unit 230: emotion inclusion information extraction unit
240: emotion inclusion state determination unit 250: voice signal adjustment unit
260: information providing unit 270: control unit
280: information storage unit

Claims

A voice conversion method comprising:
(a) collecting a voice signal and detecting a first specific section in which the degree of regularity among the components of the voice signal is at or above a predetermined level;
(b) converting the detected first specific section into a frequency band and then amplifying it to generate a second specific section of the enlarged voice signal;
(c) analyzing the voice waveform in the second specific section to extract emotion inclusion information, the emotion inclusion information including amplitude magnitude, length, angle, maximum value, amount of energy, and number of feature points;
(d) comparing the emotion inclusion information with emotion reference information accumulated through learning and judging whether a reference value is met or exceeded, thereby determining an emotion-mixed state; and
(e) when the emotion-mixed state is determined, lowering the specific section of the voice signal that carries the emotion inclusion information to or below the reference value and transmitting it,
wherein the emotion reference information is emotion-laden data accumulated by identifying phonemic and prosodic elements of a plurality of sample voice signals, the phonemic elements comprise pitch, energy, and speaking-rate parameters measured in the voice signal, and the prosodic elements comprise section pitch, mean energy, standard deviation, and maximum value parameters of the voice signal.

The method of claim 1, wherein step (e) applies a lossy compression method to the voice signal in the specific section that carries the emotion inclusion information to reduce the amount of energy, and lowers the section to or below the reference value before transmission.

(deleted)

The method of claim 1, wherein the phonemic elements include pitch, energy, and speaking-rate parameter information, and the prosodic elements include section pitch, mean energy, standard deviation, and maximum value parameter information.

The method of claim 2, wherein the lossy compression uses any one method selected from the group consisting of CELP, G.711, G.726, HILN, AMR, and Speex.

A terminal device comprising:
a first specific section detection unit configured to collect a voice signal and detect a first specific section in which the degree of regularity among the components of the voice signal is at or above a predetermined level;
a second specific section generation unit configured to convert the detected first specific section into a frequency band and then amplify it to generate a second specific section of the enlarged voice signal;
an emotion inclusion information extraction unit configured to analyze the voice waveform in the second specific section to extract emotion inclusion information, the emotion inclusion information including amplitude magnitude, length, angle, maximum value, amount of energy, and number of feature points;
an emotion inclusion state determination unit configured to compare the emotion inclusion information with emotion reference information accumulated through learning and judge whether a reference value is met or exceeded, thereby determining an emotion-mixed state; and
a voice signal adjustment unit configured to, when the emotion-mixed state is determined, lower the specific section of the voice signal that carries the emotion inclusion information to or below the reference value and transmit it,
wherein the emotion reference information is emotion-laden data accumulated by identifying phonemic and prosodic elements of a plurality of sample voice signals, the phonemic elements comprise pitch, energy, and speaking-rate parameters measured in the voice signal, and the prosodic elements comprise section pitch, mean energy, standard deviation, and maximum value parameters of the voice signal.

(deleted)