KR100206799B1

KR100206799B1 - Camcorder capable of discriminating the voice of a main object

Info

Publication number: KR100206799B1
Application number: KR1019960034206A
Authority: KR
Inventors: 정한; 김기백
Original assignee: 구자홍; 엘지전자주식회사
Priority date: 1996-08-19
Filing date: 1996-08-19
Publication date: 1999-07-01
Also published as: KR19980014999A

Abstract

The present invention relates to a speaker-recognizing camcorder. Conventional approaches identified a specific person by distinguishing the tone of the voice, but could not cope with the fact that even the same person's tone varies with the situation. Discrimination based on vocabulary usage required enough input speech to support a decision and an enormous amount of computation, so it could not be applied to products. The present invention therefore uses a neural network and an array of multiple microphones to learn the voice of a specific person in advance, then identifies and extracts that person's voice from mixed audio in which it is blended with other sounds, so that the voice can be recorded together with the video signal or the camcorder can be controlled by the specific speaker's voice.

Description

Speaker Recognition Camcorder

FIG. 1 is a block diagram of the speaker-recognizing camcorder of the present invention.

FIG. 2 is a block diagram of the speaker-recognizing camcorder of the present invention that can be controlled by the voice of a specific speaker.

FIG. 3 is a detailed view of the speaker recognition unit in FIGS. 1 and 2.

FIG. 4 is a configuration diagram of the neural network in FIG. 3.

FIG. 5 is a flowchart of the method for generating a control signal adapted to the speaker.

* Explanation of reference numerals for the main parts of the drawings

11, 21: image input unit
12, 22: image processing unit
13, 23: voice input unit
14, 24: specific voice input unit
15, 25: speaker recognition unit
16, 26: recording unit
17: feature decoding unit
27: control signal generation unit
28: control unit
29: key input unit

The present invention is directed to extracting or emphasizing only the voice of a specific speaker in a signal in which noise or other people's voices are mixed. In particular, it relates to a speaker-recognizing camcorder fitted with multiple microphones and a neural network circuit that recognizes a specific person's voice. The camcorder processes the audio input along with the video, distinguishes and extracts the specific person's voice, and either records only that voice together with the video signal or records the audio with that voice emphasized.

In the general field of speech recognition, research has been divided into two areas: recognition of a specific person and recognition of commands.

In the command recognition area, development of a combined technique, namely recognition of specific commands spoken by a specific person, has also proceeded in parallel.

Recognition of a specific person by voice at first relied mainly on distinguishing the tone of the voice, that is, on recognizing its frequency, but this could not cope with the fact that even the same person's tone changes with the situation.

Next came voiceprint analysis, which analyzes the timbre of a specific person's voice. Like a fingerprint, each individual has a unique voiceprint, and analyzing it makes it possible to determine whether a voice belongs to a specific person.

However, voiceprint analysis requires a considerable amount of speech input, and the computation needed for discrimination is so large that it is difficult to apply to actual products.

A further approach is discrimination by vocabulary usage. A specific person's conversation contains words and phrases that the person uses in a distinctive way, and these are combined to identify that person's speech, much like handwriting identification.

In this case the input speech must be plentiful enough to support a decision, and the computation required is still enormous, so this approach too is difficult to apply to products.

Accordingly, to solve the above difficulty of application to products, an object of the present invention is to provide a speaker-recognizing camcorder that uses a neural network and an array of multiple microphones to learn a specific person's voice in advance, and then identifies and extracts that person's voice from mixed audio containing it and records it together with the video signal.

Another object of the present invention is to provide a speaker-recognizing camcorder that can be controlled by the voice of a specific speaker.

To achieve the above object, the speaker-recognizing camcorder of the present invention comprises, as shown in FIG. 1: an image input unit (11) that inputs a moving image; an image processing unit (12) that processes the image input from the image input unit (11) and outputs it; a specific voice input unit (14) that inputs only the voice of a specific speaker; a speaker recognition unit (15) that learns, through a neural network, the voice input through the specific voice input unit (14), and extracts and recognizes only the specific voice from the noisy audio supplied by a voice input unit (13); a feature decoding unit (17) that decodes the voice recognized by the speaker recognition unit (15) and extracts its features; and a recording unit (16) that records the voice extracted by the feature decoding unit (17) together with the image processed by the image processing unit (12).

The speaker-recognizing camcorder of the present invention that can be controlled by the voice of a specific speaker comprises, as shown in FIG. 2: an image input unit (21) that inputs a moving image; an image processing unit (22) that processes the image input from the image input unit (21) and outputs it; a specific voice input unit (24) that inputs only the voice of a specific speaker; a speaker recognition unit (25) that learns, through a neural network, the voice input through the specific voice input unit (24), and extracts and recognizes only the specific voice from the noisy audio supplied by a voice input unit (23); voice-inputter recognition means; specific voice input means for inputting the specific speaker's voice; a recording unit (26) that records the specific speaker's voice from the specific voice input means together with the video signal from the image processing unit (22); a control signal generation unit (27) that compares the signal output from the speaker recognition unit (25) with the features of the various control signals to make a decision, converts the determined voice of the specific speaker into a control signal, and outputs it; and a control unit (28) that controls the recording unit (26) according to the control signal generated by the control signal generation unit (27) and the key signal from a key input unit (29).

The operation and effects of the present invention configured as above are described in detail below.

When the specific speaker's voice is input through the microphone (MIC) constituting the specific voice input unit (14), the specific speaker's voice mixed with noise is input through the voice input unit (13), which consists of a microphone array having two or more microphones.

Here, the microphones of the array are mounted on the outside of the camcorder and arranged at regular intervals.

The first feature extraction unit (15a) and the second feature extraction unit (15b) of the speaker recognition unit (15) then extract the feature components of the voice signals.

Linear prediction (LP) coefficients or similar can be used as the features. In this process the input voice signal is converted from analog to digital and sampled, a window is applied to group the samples into frames, and n-th order LPC (linear predictive coding) coefficients are extracted for each frame.
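By way of illustration only, the framing, windowing, and per-frame LPC extraction described above can be sketched as follows. The source does not name a window shape, a frame length, or a method for solving for the coefficients; the Hamming window, the autocorrelation method with the Levinson-Durbin recursion, and the names `frame_signal` and `lpc` are assumptions made for this sketch.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split digitized samples into frames and apply a Hamming window.

    Assumes frame_len >= 2. The window shape is an assumption; the source
    only says that a window is applied.
    """
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + i] * window[i] for i in range(frame_len)])
    return frames

def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame."""
    return [sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
            for lag in range(max_lag + 1)]

def lpc(frame, order):
    """n-th order LPC by the Levinson-Durbin recursion.

    Returns a[1..order] such that x[n] is predicted by sum_k a[k] * x[n-k].
    """
    r = autocorrelation(frame, order)
    if r[0] == 0.0:              # silent frame: nothing to predict
        return [0.0] * order
    a = [0.0] * (order + 1)
    err = r[0]                   # prediction error energy
    for i in range(1, order + 1):
        if err <= 0.0:           # degenerate frame: stop refining
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err            # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]
```

Each frame then yields one feature vector of length `order`, for example `[lpc(f, 10) for f in frame_signal(samples, 256, 128)]`, where the order and frame sizes are again illustrative values rather than figures from the source.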

The features extracted in this way are connected to the input and output terminals of the neural network (15c).

The neural network (15c) then performs learning: it is trained to model the mapping so that, for the signal input from the first feature extraction unit (15a), it produces the signal delivered by the second feature extraction unit (15b).

As shown in FIG. 4, the neural network (15c) consists of an input layer, a hidden layer, and an output layer.

The input layer receives the feature vectors of the current frame, the previous frame, and the next frame; the hidden layer is modeled with a nonlinear function and the output layer with a linear function. That is, when the specific speaker's voice together with noise is applied to the input and the target value is set to the specific speaker's voice at that moment, learning proceeds as described above and the neural network (15c) acquires the ability to extract that speaker's voice.
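As a non-limiting sketch, the three-layer network described above can be written as a small multilayer perceptron whose input is the previous, current, and next feature vectors concatenated, with a nonlinear hidden layer and a linear output layer. The source does not state the nonlinearity, the layer sizes, the learning rule, or how the first and last frames are handled; the `tanh` activation, stochastic gradient descent on squared error, the repetition of edge frames, and the names `stack_context` and `TinyMLP` are all assumptions of this sketch.

```python
import math
import random

def stack_context(frames):
    """Concatenate previous, current, and next feature vectors per frame.

    Edge frames reuse themselves as the missing neighbour (an assumption).
    """
    out = []
    for t in range(len(frames)):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, len(frames) - 1)]
        out.append(prev + frames[t] + nxt)
    return out

class TinyMLP:
    """Input layer -> tanh hidden layer -> linear output layer."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                   for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = [sum(w * hi for w, hi in zip(row, h)) + b
             for row, b in zip(self.w2, self.b2)]
        return h, y

    def train_step(self, x, target, lr=0.05):
        """One gradient step on 0.5 * squared error; returns the loss before the step."""
        h, y = self.forward(x)
        dy = [yi - ti for yi, ti in zip(y, target)]
        # Back-propagate to the hidden layer before the output weights change.
        dh = [(1.0 - h[j] ** 2) * sum(dy[k] * self.w2[k][j] for k in range(len(dy)))
              for j in range(len(h))]
        for k in range(len(dy)):
            for j in range(len(h)):
                self.w2[k][j] -= lr * dy[k] * h[j]
            self.b2[k] -= lr * dy[k]
        for j in range(len(h)):
            for i in range(len(x)):
                self.w1[j][i] -= lr * dh[j] * x[i]
            self.b1[j] -= lr * dh[j]
        return 0.5 * sum(d * d for d in dy)
```

In use, the noisy feature sequence would be passed through `stack_context` and paired frame by frame with the clean feature sequence of the specific speaker as targets, repeating `train_step` over the pairs until the error stops falling. After training, `forward` applied to the features of a new mixed signal returns the estimate of the specific speaker's features.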

When training of the neural network (15c) is complete, a neural network for the specific speaker has been obtained. When actual noise or other speakers' voices are then mixed into the input to the microphone array, features are extracted from that signal and passed through the specific speaker's neural network, which outputs a signal in which that speaker's voice is emphasized.

When the voice features of the specific speaker have been extracted through the first feature extraction unit (15a), the second feature extraction unit (15b), and the neural network (15c) and output to the feature decoding unit (17), the feature decoding unit (17) decodes the features of the input voice signal and sends the resulting voice signal to the recording unit (16).

Meanwhile, the image processing unit (12) processes the video signal input from the image input unit (11) and sends it to the recording unit (16), which outputs this video, together with the voice signal sent from the feature decoding unit (17), to the recording system and records it.

In short, when other speakers' voices, noise, and the specific speaker's voice are input through the voice input unit (13), the speaker recognition unit (15) extracts the specific speaker's voice features through the neural network, the feature decoding unit (17) decodes them, and the decoded voice is output to the recording unit (16) and recorded.

In this way the specific speaker's voice is extracted and recorded, while ambient noise and other speakers' voices are removed.

Next, the case of controlling the camcorder with the specific speaker's voice is described with reference to FIG. 2.

Recognition of the specific speaker through the neural network is the same as above. When the features extracted through the neural network of the speaker recognition unit (25) are output to the control signal generation unit (27), the control signal generation unit (27) compares them with the prestored features of the various control signals, makes a decision, and outputs the resulting control signal to the control unit (28).

The control unit (28) then operates the camcorder by controlling the recording unit (26) according to the control signal received from the control signal generation unit (27), or operates the camcorder according to a key signal input through the key input unit (29).

To allow the camcorder to be operated by voice, the control signal generation unit (27), which generates the control signal corresponding to the specific speaker's voice, builds a phoneme model adapted to the specific speaker from training data, as shown in FIG. 5. It normalizes the input voice by probability and makes its decision by comparing the result against threshold values for the utterance content and for the speaker's voice.
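The decision step described above can be illustrated, purely as a sketch, by normalizing per-command scores into probabilities and accepting a command only when both the content score and the speaker score clear their thresholds. The source does not specify how the scores are computed or normalized; the use of log-likelihoods from speaker-adapted command models, the softmax normalization, the speaker log-likelihood ratio, the threshold values, and the names `normalize` and `decide` are hypothetical.

```python
import math

def normalize(logliks):
    """Turn per-command log-likelihoods into probabilities that sum to 1."""
    m = max(logliks.values())
    exps = {cmd: math.exp(v - m) for cmd, v in logliks.items()}
    total = sum(exps.values())
    return {cmd: e / total for cmd, e in exps.items()}

def decide(command_logliks, speaker_llr,
           content_threshold=0.6, speaker_threshold=0.0):
    """Return the accepted command name, or None if either check fails.

    command_logliks: hypothetical scores of the utterance under each
        speaker-adapted command model.
    speaker_llr: hypothetical log-likelihood ratio of the specific speaker's
        model against a background model.
    """
    posteriors = normalize(command_logliks)
    best = max(posteriors, key=posteriors.get)
    if posteriors[best] < content_threshold:   # utterance content unclear
        return None
    if speaker_llr < speaker_threshold:        # not the specific speaker
        return None
    return best
```

A call such as `decide({"record": -10.0, "stop": -14.0, "zoom": -15.0}, speaker_llr=2.5)` selects `"record"`, while the same scores with a speaker ratio below the threshold are rejected, so that only the specific speaker's commands reach the control unit (28).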

The decision is then converted into the control signal corresponding to the specific speaker's voice and output to the control unit (28).

In this way the camcorder can be operated by having it learn the specific speaker's voice.

As described in detail above, the present invention can record the specific speaker's voice, or remove ambient noise and record only the speakers' voices, and allows the camcorder to be operated by the specific speaker's voice, thereby achieving a more highly functional camcorder.

Claims

1. A speaker-recognizing camcorder comprising: speaker recognition means for extracting and learning the features of a specific voice when the voice of a specific speaker is input through specific voice input means, and for recognizing only the voice of the specific speaker even when multiple voices including noise are input through voice input means; feature decoding means for extracting the features of the voice of the specific speaker recognized by the speaker recognition means; and recording means for recording the extracted voice of the specific speaker together with the image signal processed by image processing means.

2. The camcorder of claim 1, wherein the speaker recognition means comprises: feature extraction means for extracting the voice features of the specific speaker; and a neural network for learning and modeling the features from the feature extraction means.

3. A speaker-recognizing camcorder controllable by the voice of a specific speaker, comprising: speaker recognition means for recognizing only the voice of a specific speaker when audio including noise is input through voice input means; specific voice input means for inputting the specific speaker's voice; recording means for recording together the audio signal input through the specific voice input means and the video signal processed by image processing means; control signal generation means for comparing the signal output from the speaker recognition means with the features of various control signals to make a decision, and for converting the determined voice of the specific speaker into a control signal and outputting it; and control means for controlling the recording means according to the control signal generated by the control signal generation means and a key signal from key input means.