KR20170007107A

KR20170007107A - Speech Recognition System and Method

Info

Publication number: KR20170007107A
Application number: KR1020160064193A
Authority: KR
Inventors: 김동현; 이민규
Original assignee: 한국전자통신연구원
Priority date: 2015-07-10
Filing date: 2016-05-25
Publication date: 2017-01-18

Abstract

The present invention relates to a speech recognition system and method. The invention automatically identifies the spoken language during speech recognition, without a separate process for user registration or recognition-language setting such as using a button to manually select a language, and thereby processes multilingual speech recognition effectively. It supports automatic speech recognition of each language on a single terminal even when speakers of different languages use it, which increases user convenience.

Description

Technical Field

The present invention relates to a speech recognition system and method, and more particularly to a speech recognition system and method that perform language identification and speech recognition simultaneously in order to process multilingual speech recognition effectively.

Conventional offline or online speech recognition systems have been applied to multilingual speech recognition on user terminals, typically by providing a separate speech recognition button for each language so that the recognizer for the appropriate language is started depending on the situation. Some conventional systems also determine the language in use automatically from text content held by the user or from device information. To determine the language this way, the user must be registered in an online server in advance, and the user's spoken language is fixed in advance depending on the user terminal. In other words, such methods assume that one terminal is used by only one person, so it is difficult to perform automatic speech recognition with a single terminal in settings such as a multilingual conference, where the speech of several speakers must be recognized.

Other conventional speech recognition systems perform language identification by phoneme recognition using multilingual common phonemes. This approach builds a statistical language model of phoneme occurrence patterns and uses it for language identification, identifying the language in real time from speech data. Still other conventional systems propose identifying the language with a DNN (Deep Neural Network), which is widely used for acoustic models; here the final output nodes are assigned to the individual languages when the DNN structure is created. However, all of these examples rely on a dedicated recognizer that performs only language identification using acoustic data as its main information, which has the drawback of requiring a separate language-identification recognizer whose role differs from that of the basic speech recognizer.

Accordingly, the present invention has been devised to solve the problems described above. An object of the present invention is to provide a speech recognition system and method that process multilingual speech recognition effectively by automatically identifying the spoken language during speech recognition of the speaker, without a separate process for user registration or recognition-language setting such as using a button to manually select the language to be spoken, and that increase user convenience by automatically performing speech recognition in each language even when people who speak different languages use a single terminal.

In addition, so that the language can be identified simultaneously with speech recognition from the utterance alone, without registering user information or relying on terminal information, the present invention performs speech recognition and language identification at the same time. This avoids the time lag of the conventional approach, in which a language identifier must run first in order to start the speech recognizer matched to the speaker's language. In this method, speech recognizers for several languages run simultaneously; using the language identification scores generated during recognition, recognizers with low scores are stopped quickly and the result of the high-scoring recognizer is shown, which yields the speech recognition result for the corresponding language.

The present invention also proposes three ways of performing language identification while speech recognition is in progress. The first uses a parallel speech recognition configuration that can reuse existing per-language speech recognizers and identifies the language by observing the language identification score every frame or every several frames. The second reduces computation cost by sharing part or all of the acoustic model to compute the acoustic model score, while the language network search for each language proceeds in parallel and the language is identified by observing the score every frame or every several frames. The third performs not only the acoustic score computation but also the language network search in a single integrated network: the entire acoustic model is shared and the individual language networks are combined into one network for the search. The invention thus proposes a speech recognition system and method that perform speech recognition and language identification simultaneously.

The technical problems addressed by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the following description.

To summarize the features of the present invention, a speech recognition system for simultaneously performing language identification and speech recognition according to one aspect of the invention includes: a speech processing unit that receives a speech signal and analyzes it to extract feature data; and a language identification speech recognition unit that receives the feature data, performs language identification and speech recognition simultaneously, and feeds the identified language information back to the speech processing unit. The speech processing unit outputs the speech recognition result of the language identification speech recognition unit according to the identified language information that is fed back.

The language identification speech recognition unit identifies the language of the speech signal by analyzing the likelihood of the feature data with reference to an acoustic model and a language model.

The language identification speech recognition unit may include: a plurality of language decoders, each of which performs speech recognition on the feature data in parallel and, referring to the acoustic model and language model of its language, computes a language identification score by analyzing the likelihood for every one or more speech signal frames based on the feature data; and a language determination module that refers to the language identification scores received from the plurality of language decoders and accumulated, determines the language corresponding to the target language decoder selected according to a decision rule as the identified language, and outputs the identified language information.

Based on the accumulated language identification scores, the language determination module may send decoding end commands one after another to the lower-scoring language decoders to terminate them, and the speech processing unit may output the speech recognition result of the target language decoder that finally remains.

The language identification score may consist of the sum of the acoustic model score and the language model score, the reciprocal of the number of tokens for similar language candidates that arise during network search, or a combination of these.
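The score definition above can be illustrated with a minimal sketch. The text does not say how the terms are combined, so the linear mix below, the function name `language_id_score`, and the `token_weight` parameter are assumptions made purely for illustration.

```python
from typing import Optional


def language_id_score(am_score: float,
                      lm_score: float,
                      token_count: Optional[int] = None,
                      token_weight: float = 0.0) -> float:
    """Combine decoder scores into one language identification score.

    am_score, lm_score: log-likelihood-style scores (higher is better).
    token_count: number of active search tokens, i.e. competing hypotheses;
        fewer tokens suggests the decoder is more certain.
    token_weight: hypothetical mixing weight for the reciprocal-token term.
    """
    # Base form named in the text: acoustic score plus language model score.
    score = am_score + lm_score
    # Optional term named in the text: reciprocal of the token count.
    # The weighted sum is an assumption, not the patent's formula.
    if token_count is not None and token_count > 0:
        score += token_weight * (1.0 / token_count)
    return score


if __name__ == "__main__":
    print(language_id_score(-10.0, -2.0))
    print(language_id_score(-10.0, -2.0, token_count=4, token_weight=8.0))
```

With `token_weight` left at zero the function reduces to the plain sum of the two model scores.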

The decision rule may include a scheme that, frame by frame, terminates in turn each language decoder whose accumulated language identification score differs from the highest value by a reference value or more, or a scheme that applies a reference value varying over time and, frame by frame, terminates in turn each language decoder whose accumulated score differs from the highest value by the reference value for that frame or more.
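Both variants of the decision rule can be expressed as one function that takes the reference value as a function of the frame index. This is a sketch under assumptions: the name `prune_decoders`, the dictionary layout, and the particular threshold schedules are hypothetical, and the text does not say how the reference value should change over time.

```python
from typing import Callable, Dict, List


def prune_decoders(cumulative: Dict[str, float],
                   frame_index: int,
                   threshold: Callable[[int], float]) -> List[str]:
    """Return the languages whose decoders should be terminated at this frame.

    cumulative: accumulated language identification score per language.
    threshold: reference value as a function of the frame index; a constant
        function gives the fixed rule, any other function the time-varying rule.
    """
    best = max(cumulative.values())
    margin = threshold(frame_index)
    # A decoder is terminated when it trails the best score by the margin or more.
    losers = [lang for lang, score in cumulative.items() if best - score >= margin]
    # Lowest score first, so termination commands go out "in turn".
    return sorted(losers, key=lambda lang: cumulative[lang])


if __name__ == "__main__":
    scores = {"ko": -100.0, "en": -130.0, "ja": -112.0}
    print(prune_decoders(scores, 10, lambda t: 20.0))
    print(prune_decoders(scores, 40, lambda t: max(10.0, 50.0 - t)))
```

The second call uses a margin that tightens as more frames are observed, which is only one possible schedule.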

The language identification speech recognition unit may include: an acoustic model sharing unit that shares part of the acoustic model of each language, or the whole of a predetermined multilingual acoustic model, and computes an acoustic model score by analyzing the likelihood for every one or more speech signal frames based on the feature data; a plurality of language network decoders, each of which shares the acoustic model score and performs speech recognition on the feature data in parallel, computing a language identification score as the sum of the shared acoustic model score and a language model score calculated from the feature data with reference to a language model; and a language determination module that refers to the language identification scores received from the plurality of language network decoders and accumulated, determines the language corresponding to the target language decoder selected according to a decision rule as the identified language, and outputs the identified language information.
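The point of the shared configuration above is that acoustic scores are computed once per frame and reused by every language network decoder. The toy sketch below shows only that data flow. All class names are hypothetical, each frame is assumed to already hold per-phone log-likelihoods, and a per-phone log-probability table stands in for a real language network.

```python
from typing import Dict, List

PhoneScores = Dict[str, float]  # phone label -> log-likelihood


class SharedAcousticModel:
    """Toy acoustic model sharing unit; counts how often it is evaluated."""

    def __init__(self) -> None:
        self.calls = 0

    def score(self, frame: PhoneScores) -> PhoneScores:
        self.calls += 1
        return frame  # assumption: the frame already carries phone scores


class LanguageNetworkDecoder:
    """Toy per-language decoder that adds its own language model score."""

    def __init__(self, lang: str, phone_lm: PhoneScores) -> None:
        self.lang = lang
        self.phone_lm = phone_lm  # stand-in for a language network
        self.cumulative = 0.0

    def step(self, am_scores: PhoneScores) -> float:
        # Best phone in this language's inventory: shared AM score + own LM score.
        self.cumulative += max(am_scores[p] + lm
                               for p, lm in self.phone_lm.items()
                               if p in am_scores)
        return self.cumulative


def run(frames: List[PhoneScores], am: SharedAcousticModel,
        decoders: List[LanguageNetworkDecoder]) -> Dict[str, float]:
    for frame in frames:
        am_scores = am.score(frame)      # computed once per frame
        for decoder in decoders:
            decoder.step(am_scores)      # reused by every decoder
    return {d.lang: d.cumulative for d in decoders}
```

In this toy setup the acoustic model is evaluated once per frame regardless of how many languages are active, which is where the cost saving comes from.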

Based on the accumulated language identification scores, the language determination module may send decoding end commands one after another to the lower-scoring language network decoders to terminate them, and the speech processing unit may output the speech recognition result of the target language decoder that finally remains.

The language identification speech recognition unit may include: an acoustic model sharing unit that shares the whole of a predetermined multilingual acoustic model, using multilingual common phonemes together with language-specific distinguishing phonemes, and computes an acoustic model score by analyzing the likelihood for every one or more speech signal frames based on the feature data; and a combined network decoder that merges the language networks of a plurality of individual languages into one, performs speech recognition on the feature data using the resulting integrated language network without language distinction, computes a language identification score as the sum of the shared acoustic model score and a language model score calculated from the feature data with reference to a language model, and outputs the character string with the highest score based on the language identification score.

The speech processing unit may output the determined character string, which is the speech recognition result of the combined network decoder, through a predetermined output interface.

A speech recognition method in an apparatus for simultaneously performing language identification and speech recognition according to another aspect of the present invention includes: receiving a speech signal and analyzing it to extract feature data; receiving the feature data, performing language identification and speech recognition simultaneously, and outputting identified language information; and outputting the corresponding speech recognition result through a predetermined output interface according to the identified language information.

In the step of outputting the identified language information, the language of the speech signal may be identified by analyzing the likelihood of the feature data with reference to an acoustic model and a language model.

The step of outputting the identified language information may include: having each of a plurality of language decoders perform speech recognition on the feature data in parallel and, referring to the acoustic model and language model of its language, compute a language identification score by analyzing the likelihood for every one or more speech signal frames based on the feature data; and referring to the language identification scores received from the plurality of language decoders and accumulated, determining the language corresponding to the target language decoder selected according to a decision rule as the identified language, and outputting the identified language information.

In the step of outputting the identified language information, decoding end commands may be sent one after another to the lower-scoring language decoders based on the accumulated language identification scores to terminate them, and the speech recognition result of the target language decoder that finally remains may be output.

The step of outputting the identified language information may include: sharing part of the acoustic model of each language, or the whole of a predetermined multilingual acoustic model, and computing an acoustic model score by analyzing the likelihood for every one or more speech signal frames based on the feature data; having each of a plurality of language network decoders share the acoustic model score and perform speech recognition on the feature data in parallel, computing a language identification score as the sum of the shared acoustic model score and a language model score calculated from the feature data with reference to a language model; and referring to the language identification scores received from the plurality of language network decoders and accumulated, determining the language corresponding to the target language decoder selected according to a decision rule as the identified language, and outputting the identified language information.

In the step of outputting the identified language information, decoding end commands may be sent one after another to the lower-scoring language network decoders based on the accumulated language identification scores to terminate them, and the speech recognition result of the target language decoder that finally remains may be output.

The step of outputting the identified language information may include: sharing the whole of a predetermined multilingual acoustic model, using multilingual common phonemes together with language-specific distinguishing phonemes, and computing an acoustic model score by analyzing the likelihood for every one or more speech signal frames based on the feature data; and, in a combined network decoder that merges the language networks of a plurality of individual languages into one, performing speech recognition on the feature data using the integrated language network without language distinction, computing a language identification score as the sum of the shared acoustic model score and a language model score calculated from the feature data with reference to a language model, and outputting the character string with the highest score based on the language identification score.

The speech recognition method may further include outputting the determined character string, which is the speech recognition result of the combined network decoder, through a predetermined output interface.

According to the speech recognition system and method of the present invention, the spoken language can be identified automatically during speech recognition of the speaker, without a separate process for user registration or recognition-language setting such as using a button to manually select the language to be spoken, so multilingual speech recognition can be processed effectively. Existing methods determine the language by recording it in advance in the user's registration details on the user's terminal. The present invention instead starts language identification the moment speech is delivered, so it needs no prior setup and does not depend on the user terminal.

In addition, the speech recognition system and method of the present invention support automatic multilingual speech recognition, so that speech recognition in each language is performed automatically even when people who speak different languages use a single terminal, which increases user convenience. This can be applied, for example, to recording the proceedings of a meeting among speakers of several different languages, such as a multilingual conference.

Furthermore, because the speech recognition system and method of the present invention determine the language from scores measured while performing speech recognition on speech uttered in real time, the speech recognition result can be obtained quickly without first running a dedicated language-identification recognizer.

FIG. 1 is a conceptual diagram of a speech recognition system according to an embodiment of the present invention that performs language identification and speech recognition simultaneously on input speech.
FIG. 2A is a first specific example of the language identification speech recognition unit of FIG. 1, in which the input speech is sent to each language decoder so that parallel speech recognition and language determination are performed simultaneously.
FIG. 2B is a flowchart illustrating the operation of the speech recognition system of FIG. 2A.
FIG. 3A is a second specific example of the language identification speech recognition unit of FIG. 1, in which language identification speech recognition of the input speech is divided between an acoustic model sharing unit and language network decoders.
FIG. 3B is a flowchart illustrating the operation of the speech recognition system of FIG. 3A.
FIG. 4A is a third specific example of the language identification speech recognition unit of FIG. 1, in which language identification speech recognition of the input speech is divided between an acoustic model sharing unit and a combined network unit.
FIG. 4B is a flowchart illustrating the operation of the speech recognition system of FIG. 4A.
FIG. 5 is a specific example of the combined network decoder of FIG. 4.
FIG. 6 is a diagram illustrating an example of a method of implementing a speech recognition system according to an embodiment of the present invention.

Hereinafter, some embodiments of the present invention are described in detail with reference to the exemplary drawings. In assigning reference numerals to the components in each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the embodiments, detailed descriptions of related well-known configurations or functions are omitted where they would obstruct understanding of the embodiments.

Terms such as first, second, A, B, (a), and (b) may be used in describing the components of the embodiments. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless explicitly so defined in this application.

FIG. 1 is a conceptual diagram of a speech recognition system 500 according to an embodiment of the present invention that performs language identification and speech recognition simultaneously on input speech.

Referring to FIG. 1, the speech recognition system 500 according to an embodiment of the present invention includes a speech processing unit 100 and a language identification speech recognition unit 200.

The speech recognition system 500 according to an embodiment of the present invention is a device that can be mounted on and operate in a user terminal capable of communicating over a wired or wireless network supporting wired Internet communication, wireless Internet communication such as WiFi and WiBro, mobile communication such as WCDMA and LTE, or WAVE (Wireless Access in Vehicular Environment) wireless communication. For example, the user terminal includes wired terminals such as desktop PCs and other dedicated communication terminals and, depending on the communication environment, may also include wireless terminals such as smartphones, wearable devices capable of voice/video calls, tablet PCs, and notebook PCs.

The speech processing unit 100 receives a speech signal delivered online over such a network or captured through the microphone of the user terminal, and extracts feature data through speech signal analysis such as frequency analysis. When the identified language information (e.g., a character string or word string, or information on which of the multiple languages was spoken) is fed back from the language identification speech recognition unit 200, the speech processing unit 100 may perform post-processing that outputs the speech recognition result in various forms. Once the language is identified, the speech processing unit 100 may, according to the identified language information, make the speech recognition result of the language identification speech recognition unit 200 available to other applications through a predetermined output interface, and may display the result as text on the user terminal or provide a translation of it into another language. Once the language is identified, the speech processing unit 100 may also stop extracting language-independent feature data and perform signal analysis that extracts feature data more effectively for the identified language.

The language identification speech recognition unit 200 receives the feature data of the speech signal from the speech processing unit 100, performs language identification and speech recognition simultaneously, and feeds the identified language information back to the speech processing unit 100. Referring to a database that stores and manages acoustic models (multilingual common phonemes, distinguishing phonemes of individual languages, etc.) and a database that stores and manages language models (syllable and word-phrase characteristics of individual languages, etc.), the language identification speech recognition unit 200 may identify the language of the speech signal by analyzing the likelihood of the feature data, that is, its similarity to the acoustic and language models.

FIG. 2A shows a speech recognition system 510 having a first specific example of the language identification speech recognition unit 200 of FIG. 1, in which the input speech is sent to each language decoder so that parallel speech recognition and language determination are performed simultaneously.

Referring to FIG. 2A, the speech recognition system 510 according to the first specific example includes the speech processing unit 100 of FIG. 1 and a language identification speech recognition unit 200 made up of a plurality of language (network) decoders 211 to 219 (e.g., N decoders, where N is a natural number) and a language determination module 220.

The configuration of FIG. 2A includes: extracting features from speech delivered from a terminal or a network and transmitting them simultaneously to the individual language decoders 211 to 219, frame by frame or in bundles of multiple frames; transmitting language identification scores from the individual language decoders 211 to 219 to the language determination module 220, frame by frame or in bundles of multiple frames; comparing the transmitted cumulative language identification scores using the decision rule of the language determination module 220 and sending commands that stop the low-scoring language decoders one after another; and presenting the speech recognition result of the last remaining, high-scoring language decoder, so that speech recognition is performed automatically together with language identification.

The operation of the speech recognition system 510 of FIG. 2A is described below with reference to the flowchart of FIG. 2B.

The speech processing unit 100 receives a speech signal delivered online through a network or captured through the microphone of a user terminal, extracts feature data through signal analysis such as frequency analysis, and may deliver the feature data simultaneously to each of the language decoders 211 to 219, frame by frame or in units of multiple frames (S21). The speech processing unit 100 may also store the feature data in a predetermined memory and manage that memory so that the language decoders 211 to 219 share it and are permitted to access it.
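The frame-by-frame or bundled delivery of step S21 can be sketched as follows. This is only an illustration under stated assumptions: the names `frame_bundles`, `dispatch`, and `decoder_inboxes` are hypothetical, and the per-decoder inbox lists merely stand in for whatever transport or shared memory an actual system would use.

```python
def frame_bundles(frames, bundle_size=1):
    """Yield consecutive groups of feature frames; bundle_size=1 sends every frame."""
    if bundle_size < 1:
        raise ValueError("bundle_size must be at least 1")
    for start in range(0, len(frames), bundle_size):
        yield frames[start:start + bundle_size]


def dispatch(frames, decoder_inboxes, bundle_size=1):
    """Hand each bundle to every decoder inbox so all decoders see the same data."""
    for bundle in frame_bundles(frames, bundle_size):
        for inbox in decoder_inboxes.values():
            # The same bundle object is appended to each inbox, loosely
            # mirroring decoders that read from one shared memory.
            inbox.append(bundle)
```

With five frames and a bundle size of two, every decoder receives three bundles, the last of which holds the single remaining frame.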

Each of the language decoders 211 to 219 is a decoder for speech recognition of an individual language (e.g., Korean, English, French, Japanese). The language decoders 211 to 219 perform speech recognition on the feature data from the speech processing unit 100 in parallel, and each may compute a language identification score by analyzing, for every one or more speech signal frames, the likelihood of the feature data with reference to the acoustic model and language model of its language (S22). Each language (network) decoder 211 to 219 may refer to a local database in which the acoustic model, language model, and the like are stored and managed (local language identification search) and, in some cases, may refer to a plurality of databases in which the acoustic model, language model, and the like are stored and managed on a server on the wired or wireless network described above (network language identification search).

The language determination module 220 refers to the cumulative language identification scores received from the language decoders 211 to 219, selects a target language decoder according to a decision rule (e.g., choosing the decoder with the higher score), and determines the language corresponding to the target language decoder to be the identified language (S23). The language determination module 220 transmits the identified language information (e.g., the character string recognized by the decoder of the identified language) to the speech processing unit 100, and terminates the language decoders other than the target language decoder among the language decoders 211 to 219 by sending them a decoding end command. For example, based on the cumulative language identification scores received from the language decoders 211 to 219, the language determination module 220 may send decoding end commands to the low-scoring language decoders one after another to terminate them.

Accordingly, upon receiving a decoding end command, each language decoder other than the target language decoder among the language decoders 211 to 219 immediately terminates speech recognition and computation and responds to the language determination module 220 (S24). The target language decoder, which does not receive a decoding end command, outputs the character string (or word string) recognized as the result of its speech recognition. The speech processing unit 100 may output the speech recognition result of the last remaining target language decoder.

Thus, in the present invention, the language identification speech recognition unit 200 uses a multilingual parallel speech recognizer scheme in order to perform language identification at the same time as speech recognition. The language determination module 220 sends decoding end commands preferentially to the language decoders with low language identification scores, according to the decision rule. In other words, speech recognition and language identification proceed together by stopping, one after another, all language decoders that the decision rule finds to differ greatly in likelihood from the spoken language.
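The flow of steps S21 to S24 can be sketched as a small simulation. It is a minimal sketch, not the patented implementation: `StubDecoder`, `run_parallel`, and the `margin` parameter are hypothetical names, the per-frame scores are made up, and a real decoder would search its acoustic and language models instead of replaying numbers. The margin is assumed to be positive so that the best decoder is never stopped.

```python
from dataclasses import dataclass, field


@dataclass
class StubDecoder:
    """Hypothetical stand-in for one language decoder (211 to 219)."""
    language: str
    frame_scores: list            # made-up per-frame language identification scores
    cumulative: float = 0.0
    active: bool = True
    words: list = field(default_factory=list)

    def step(self, t):
        # A real decoder would search its network; here we only replay scores.
        self.cumulative += self.frame_scores[t]
        self.words.append(f"{self.language}-word-{t}")


def run_parallel(decoders, num_frames, margin):
    """Decode frame by frame and stop decoders trailing the best by `margin` (> 0)."""
    for t in range(num_frames):
        active = [d for d in decoders if d.active]
        for d in active:
            d.step(t)
        best = max(d.cumulative for d in active)
        # Visit the lowest-scoring decoders first, mirroring the order in the text.
        for d in sorted(active, key=lambda d: d.cumulative):
            if best - d.cumulative >= margin:
                d.active = False  # plays the role of the decoding end command
    winner = max((d for d in decoders if d.active), key=lambda d: d.cumulative)
    return winner.language, winner.words
```

In this toy setting the stopped decoders do no further work after their end command, which is the property that keeps the parallel scheme affordable.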

Although this method runs several language decoders at the same time, the decoders for languages other than the spoken language are stopped quickly by the decision rule of the language determination module 220, so the service can be provided by an online server that hosts many short-lived language decoders. In addition, because the acoustic models and language models already trained for each language are used, there is the advantage that no new model needs to be created for language identification.

Meanwhile, the language identification score computed by each of the language decoders 211 to 219 can be calculated in several ways. Note in advance that the language network decoders 241 to 249 of FIG. 3A may calculate the language identification score in the same ways, as described below.

First, the language identification score may be the sum of an acoustic model score, which is the result of likelihood analysis against the acoustic model, and a language model score, which is the result of likelihood analysis against the language model. Relying on the basic property that a word string closer to the spoken language yields a higher score, the language identification score may be transmitted to the language determination module 220 every frame or every several frames and compared with the scores from the other language decoders.

Alternatively, during the network language identification search, the language (network) decoders 211 to 219 generate tokens, which are language candidate information that may include data such as the paths or addresses of similar language candidates, and the number of tokens per frame or per several frames may be used as the language identification score. That is, when the match with the acoustic model or language model is close, the candidate words narrow down and the number of tokens decreases; when no candidate matches well, the search pursues similar candidates, so the candidates multiply and the number of tokens increases. Because of this property, a low score (that is, a large reciprocal of the token count) is the favorable value in this method.
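The two kinds of per-frame score described above can be written down as follows. The function names are hypothetical, and treating the scores as log-likelihoods (so that adding them is meaningful) is an assumption; the text itself only says the two scores are summed.

```python
def likelihood_score(am_score, lm_score):
    """Frame score as the sum of acoustic and language model scores.

    Higher is better. The inputs are assumed to be log-likelihoods.
    """
    return am_score + lm_score


def token_score(num_active_tokens):
    """Frame score derived from the number of live search tokens.

    Fewer tokens indicate a closer match, so the reciprocal is returned
    to obtain a value where larger again means better.
    """
    if num_active_tokens <= 0:
        raise ValueError("token count must be positive")
    return 1.0 / num_active_tokens
```

Returning the reciprocal is one way to make both score types point in the same direction; a decision rule could equally compare raw token counts and prefer the smaller one.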

When language identification scores obtained by these methods, or by a combination of them, are transmitted to the language determination module 220, the language determination module 220 identifies the language according to the decision rule. The decision rule mainly used accumulates the sum of the acoustic model score and the language model score frame by frame and compares the cumulative values across decoders; when a decoder's cumulative score falls below the highest cumulative score by a threshold or more, the decoders with low cumulative scores are terminated one after another. Conversely, for a decision rule based on the token count, the rule may be to terminate a language decoder with a large token count when its token count, accumulated frame by frame, exceeds the smallest cumulative token count by the threshold or more.
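Both decision rules can be expressed as pure functions over the cumulative values held by the language determination module. The sketch below assumes the cumulative values arrive as dictionaries keyed by language; the function names and the example thresholds are illustrative only.

```python
def stop_by_score(cumulative_scores, threshold):
    """Languages whose cumulative score trails the best by >= threshold.

    Returned worst first, which is the order in which end commands are sent.
    """
    best = max(cumulative_scores.values())
    trailing = [lang for lang, score in cumulative_scores.items()
                if score < best and best - score >= threshold]
    return sorted(trailing, key=lambda lang: cumulative_scores[lang])


def stop_by_tokens(cumulative_tokens, threshold):
    """Languages whose cumulative token count exceeds the smallest by >= threshold.

    Returned with the largest token count first.
    """
    fewest = min(cumulative_tokens.values())
    trailing = [lang for lang, count in cumulative_tokens.items()
                if count > fewest and count - fewest >= threshold]
    return sorted(trailing, key=lambda lang: cumulative_tokens[lang], reverse=True)
```

For example, with cumulative scores of -10 (Korean), -14 (French), -25 (English), and -40 (Japanese) and a threshold of 10, the score rule stops Japanese first and then English, while French stays active.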

Such a decision rule may combine the two kinds of score described above, and the threshold need not be fixed but may vary as a linear function of time. That is, by applying a threshold that varies with time, the language decoders whose cumulative language identification score differs from the highest cumulative score by the threshold or more may be terminated one after another, frame by frame. Furthermore, the decoders 211 to 219 and 241 to 249 for different languages may differ not only in their acoustic models but also in how their language networks are constructed, which makes a direct comparison of scores difficult. The scores of the decoders therefore need to be suitably scaled, based on comparative experiments carried out in advance, before the decision rule is applied.
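A time-varying threshold and per-decoder score scaling might be combined as in the following sketch. The parameter names (`start`, `slope`, `scale`) are hypothetical, and the text does not say whether the threshold should grow or shrink over time or how the scale factors are obtained, so the values used with this code are assumptions.

```python
def linear_threshold(frame_index, start, slope):
    """Threshold as a linear function of time, measured in frames."""
    return start + slope * frame_index


def decoders_to_stop(cumulative, scale, frame_index, start, slope):
    """Apply per-language scaling, then the time-varying threshold rule."""
    # Scale factors would come from comparison experiments run in advance;
    # languages without an entry are left unscaled.
    scaled = {lang: score * scale.get(lang, 1.0)
              for lang, score in cumulative.items()}
    threshold = linear_threshold(frame_index, start, slope)
    best = max(scaled.values())
    return sorted(lang for lang, score in scaled.items()
                  if score < best and best - score >= threshold)
```

With a negative slope the rule becomes stricter as more frames accumulate, so a decoder that survives the first frames can still be stopped later on the same score gap.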

FIG. 3A shows a speech recognition system 520 having a second embodiment of the language identification speech recognition unit of FIG. 1, in which language identification speech recognition of the input speech is computed separately by an acoustic model sharing unit and language network decoders.

Referring to FIG. 3A, the speech recognition system 520 according to the second embodiment includes the speech processing unit 100 of FIG. 1 and a language identification speech recognition unit 200 that comprises an acoustic model sharing unit 230, a plurality of language network decoders 241 to 249 (e.g., N decoders, where N is a natural number), and a language determination module 250.

The configuration of FIG. 3A includes: extracting features from speech delivered from a terminal or a network and transmitting them to the acoustic model sharing unit 230; calculating scores with a partially shared or fully shared acoustic model and transmitting them simultaneously to the language network decoders 241 to 249 of the individual languages, frame by frame or in bundles of multiple frames; transmitting language identification scores from the language network decoders 241 to 249 to the language determination module 250, frame by frame or in bundles of multiple frames; comparing the transmitted cumulative language identification scores using the decision rule of the language determination module 250 and sending commands that stop the low-scoring language network decoders one after another; and presenting the speech recognition result of the last remaining, high-scoring language network decoder, so that speech recognition is performed automatically together with language identification.

The operation of the speech recognition system 520 of FIG. 3A is described below with reference to the flowchart of FIG. 3B.

The speech processing unit 100 receives a speech signal delivered online through a network or captured through the microphone of a user terminal, extracts feature data through signal analysis such as frequency analysis, and may deliver the feature data to the acoustic model sharing unit 230, frame by frame or in units of multiple frames (S31). The speech processing unit 100 may also store the feature data in a predetermined memory and manage that memory so that the acoustic model sharing unit 230 is permitted to access it.

The acoustic model sharing unit 230 receives the feature data from the speech processing unit 100, computes an acoustic model score by analyzing, for every one or more speech signal frames, the likelihood of the feature data with reference to the multilingual acoustic model and the like, and may output the score to the language network decoders 241 to 249 so that it is shared among them (S32).

General speech recognition is the process of extracting feature data from an input speech signal and finding the optimal word path by computing acoustic model scores and language model scores over a word-level language network. Here, as another way of identifying the language while performing speech recognition, FIG. 3A uses a method in which the language network decoders 241 to 249 share the acoustic model sharing unit 230 to reduce the cost of calculating the acoustic model score for each language, and search the language network of each language in parallel while transmitting language identification scores to the language determination module 250. This method reduces the computation of the acoustic model score, which incurs the largest computational cost in speech recognition; in a speech recognizer that uses a deep neural network (DNN) acoustic model (AM), as is now common, acoustic model scoring can account for as much as 80% of the total computation. The acoustic model sharing unit 230 of FIG. 3A delivers the calculated acoustic model scores to the language network decoders 241 to 249 every frame or over multiple frames, so that the language network searches proceed in parallel.
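The saving from sharing acoustic scoring can be illustrated by counting model evaluations. This is a toy sketch: `CountingAcousticModel` is a hypothetical placeholder whose scores are meaningless, and the only point is that the shared arrangement evaluates the model once per frame instead of once per frame per language.

```python
class CountingAcousticModel:
    """Toy acoustic model that counts how often it is evaluated."""

    def __init__(self):
        self.calls = 0

    def score(self, frame):
        self.calls += 1
        # Placeholder values; a real model would return per-phone likelihoods.
        return {"phone_a": -abs(frame), "phone_b": -abs(frame) - 1.0}


def decode_separately(frames, languages, model):
    """Each language decoder scores every frame on its own."""
    for _ in languages:
        for frame in frames:
            model.score(frame)


def decode_shared(frames, languages, model):
    """Score each frame once and hand the same result to every decoder."""
    received = {lang: [] for lang in languages}
    for frame in frames:
        scores = model.score(frame)        # one evaluation per frame
        for lang in languages:
            received[lang].append(scores)  # shared with each language decoder
    return received
```

For three frames and four languages, the separate arrangement performs twelve evaluations and the shared arrangement three.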

Here, acoustic model sharing schemes fall broadly into two types. The first shares part of the structure of the acoustic models of the individual languages; the second generates an acoustic model using multilingual common phonemes (phones) and shares the entire acoustic model across the given languages. In the first method, which calculates acoustic model scores by sharing part of the structure of the per-language acoustic models, the overall structure of a DNN acoustic model can be divided into an input layer, hidden layers, and an output layer, and the acoustic model is trained by sharing all layers except the output layer or by sharing only the hidden layers. This gives the benefit of sharing the acoustic model structure (input layer or hidden layers) while keeping output layer nodes that carry the phoneme characteristics specific to each language. In the second method, which generates an acoustic model using multilingual common phonemes and calculates acoustic model scores by sharing the entire acoustic model, the phonemes shared across languages and the individual phonemes that are not shared are all defined with reference to the multilingual common phonemes, and the acoustic models of all languages are trained together. Compared with the first method, the number of output layer nodes of the DNN acoustic model is larger relative to a single language, but the entire acoustic model can be shared.
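The first sharing scheme (shared lower layers, language-specific output layers) can be sketched with a tiny hand-written network. Everything concrete here is an assumption made for illustration: the layer sizes, the weights, the ReLU activation, and the names `SHARED_W`, `OUTPUT_HEADS`, and `acoustic_scores` do not come from the source.

```python
import math


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


def log_softmax(values):
    peak = max(values)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in values))
    return [v - log_sum for v in values]


# Shared hidden layer: 2 input features -> 3 hidden units (made-up weights).
SHARED_W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
SHARED_B = [0.0, 0.1, -0.1]

# Language-specific output layers: one node per phone of that language.
OUTPUT_HEADS = {
    "ko": ([[0.2, 0.1, -0.3], [0.0, 0.4, 0.1], [-0.1, 0.2, 0.3]], [0.0, 0.0, 0.0]),
    "en": ([[0.3, -0.2, 0.1], [-0.1, 0.5, 0.2]], [0.1, -0.1]),
}


def acoustic_scores(features):
    """Run the shared layer once, then every language's output layer."""
    hidden = relu(dense(features, SHARED_W, SHARED_B))  # computed a single time
    return {lang: log_softmax(dense(hidden, w, b))
            for lang, (w, b) in OUTPUT_HEADS.items()}
```

Under the second scheme the per-language heads would collapse into a single output layer covering the union of common and language-specific phones, which is why that output layer is larger than any single language's.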

The acoustic model scores calculated by the acoustic model sharing unit 230 are transmitted simultaneously to the language network decoders 241 to 249 of the respective languages, frame by frame or in bundles of multiple frames, and each language network decoder 241 to 249 combines the shared acoustic model score with its language model score to carry out the language network search and perform speech recognition. Each of the language network decoders 241 to 249 is a decoder for speech recognition of an individual language (e.g., Korean, English, French, Japanese). Operating in parallel, each language network decoder 241 to 249 generates a language identification score by summing the shared acoustic model score from the acoustic model sharing unit 230 and a language model score calculated with reference to its language model, and passes it to the language determination module 250 to obtain approval to continue the network search (S32). Each language network decoder 241 to 249 may compute the language model score by analyzing, for every one or more speech signal frames, the likelihood based on the feature data with reference to the language model and the like.

The acoustic model sharing unit 230 and the language network decoders 241 to 249 may refer to a local database in which the acoustic model, language model, and the like are stored and managed (local language identification search) and, in some cases, may refer to a plurality of databases in which the acoustic model, language model, and the like are stored and managed on a server on the wired or wireless network described above (network language identification search).

The language determination module 250 refers to the cumulative language identification scores from the language network decoders 241 to 249, selects a target language decoder according to a decision rule (e.g., choosing the decoder with the higher score), and determines the language corresponding to the target language decoder to be the identified language (S33). The language determination module 250 transmits the identified language information (e.g., the character string recognized by the decoder of the identified language) to the speech processing unit 100, and sends decoding end commands, one after another, to the language decoders other than the target language decoder among the language network decoders 241 to 249. That is, based on the cumulative language identification scores, the language determination module 250 may send decoding end commands to the low-scoring language network decoders one after another to terminate them.

Accordingly, upon receiving a decoding end command, each language decoder other than the target language decoder among the language network decoders 241 to 249 immediately terminates speech recognition and computation and responds to the language determination module 250 (S34). The target language decoder, which does not receive a decoding end command, outputs the character string (or word string) recognized as the result of its speech recognition. The speech processing unit 100 then outputs the speech recognition result of the last remaining target language decoder.

FIG. 4A shows a speech recognition system 530 having a third embodiment of the language identification speech recognition unit of FIG. 1, in which language identification speech recognition of the input speech is computed separately by an acoustic model sharing unit and a combined network unit.

Referring to FIG. 4A, the speech recognition system 530 according to the third embodiment includes the speech processing unit 100 of FIG. 1 and a language identification speech recognition unit 200 that comprises an acoustic model sharing unit 260 and a combined network decoder 270.

The configuration of FIG. 4A includes: extracting features from speech delivered from a terminal or a network and transmitting them to the acoustic model sharing unit 260; calculating scores in the acoustic model sharing unit 260 with an acoustic model that is fully shared by means of common phonemes, and transmitting them to the combined network decoder 270, frame by frame or in bundles of multiple frames; and searching, in the combined network decoder 270, a single network that combines the language networks of the individual languages and presenting the recognized result, so that speech recognition is performed automatically together with language identification.

The operation of the speech recognition system 530 of FIG. 4A is described below with reference to the flowchart of FIG. 4B.

The speech processing unit 100 receives a speech signal delivered online through a network or captured through the microphone of a user terminal, extracts feature data through signal analysis such as frequency analysis, and may deliver the feature data to the acoustic model sharing unit 260, frame by frame or in units of multiple frames (S41). The speech processing unit 100 may also store the feature data in a predetermined memory and manage that memory so that the acoustic model sharing unit 260 is permitted to access it.

The acoustic model sharing unit 260 receives the feature data from the speech processing unit 100, computes an acoustic model score by analyzing, for every one or more speech signal frames, the likelihood of the feature data with reference to the multilingual acoustic model and the like, and may output the score to the combined network decoder 270 so that it is shared (S42). Similarly to the acoustic model sharing unit 230 of FIG. 3A, the acoustic model sharing unit 260 may, in order to compute the acoustic model score, use the method of generating an acoustic model using a multilingual common terminal among user terminals such as smartphones and sharing the entire acoustic model of the given languages (an acoustic model fully shared by means of the common phonemes of the multiple languages).

Using the acoustic model scores delivered by the acoustic model sharing unit 260 for every one or more speech signal frames, the combined network decoder 270 performs speech recognition by carrying out the network decoding computation on the feature data over a single integrated network that combines the networks of the respective languages (S42).

That is, the combined network decoder 270 outputs the character string (or word string) determined to have the highest score, based on a language identification score that sums the shared acoustic model score from the acoustic model sharing unit 260 and a language model score calculated with reference to the language models of the multiple languages (e.g., Korean, English, French, Japanese) (S43). The combined network decoder 270 may compute the language model score by analyzing, for every one or more speech signal frames, the likelihood based on the feature data with reference to the multilingual language model and the like.

The approach of FIG. 4A uses one acoustic model and one integrated language network in order to perform language identification and speech recognition simultaneously. For this method, the acoustic model sharing unit 260 computes the acoustic model score using the multilingual common phonemes together with the language-specific distinguishing phonemes. The combined network decoder 270 then operates on an integrated language network, without language distinction, in which the language networks of the individual languages are combined into one (referring to the language model databases of the respective languages or to a single integrated language model database for the multiple languages). On this network it combines the phonemes produced while the acoustic model sharing unit 260 computes the acoustic model score on the DNN acoustic model, and it may compute the language model score and the language identification score with reference to the integrated language model database and the like. By building up the character string (or word string) whose language identification score, the sum of the acoustic model score and the language model score, is the highest, the combined network decoder 270 can be configured so that multilingual search is possible in a single integrated network.

As shown in FIG. 5, the combined network decoder 270 has a decoding network structure in which the language networks of a plurality of individual languages (e.g., N networks, where N is a natural number) are integrated into one. In view of the efficiency and performance of the network configuration, the combined network decoder 270 may use a simple combination in which only the beginnings and ends of the individual language networks are connected, that is, a form in which the individual language networks are merely gathered together (with computation performed separately in each network). Preferably, however, it uses a strong combination in which the individual language networks are restructured, that is, an integrated network form in which, after a restructuring step, proper nouns and frequently used loanwords are linked to one another to create a tightly coupled network (with a single computation performed in one network).
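The simple combination can be sketched by gathering the word paths of each language under one shared start and picking the best-scoring path, from which the language follows without a separate decision module. This is a heavily simplified sketch under stated assumptions: paths are enumerated explicitly rather than searched in a graph, the names `combine_networks` and `decode_combined` are hypothetical, and all words and scores are made up.

```python
def combine_networks(networks):
    """Gather per-language word paths into one list of tagged hypotheses.

    `networks` maps a language to a list of (words, lm_score) pairs.
    """
    combined = []
    for language, paths in networks.items():
        for words, lm_score in paths:
            combined.append((language, tuple(words), lm_score))
    return combined


def decode_combined(combined, am_scores):
    """Return the (language, words) of the path with the highest total score.

    The total is the language model score of the path plus the shared
    acoustic score of each word on it.
    """
    def total(hypothesis):
        _, words, lm_score = hypothesis
        return lm_score + sum(am_scores[w] for w in words)

    language, words, _ = max(combined, key=total)
    return language, words
```

A strong combination would instead merge the networks themselves, for example by linking shared proper nouns and loanwords, which this enumeration does not attempt to model.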

Using one shared acoustic model and one integrated language network in this way has two advantages. First, as in FIG. 3A, computation cost is saved by using a single acoustic model sharing unit 230. Second, unlike FIG. 2A and FIG. 3A, no separate language determination module 220/250 is needed: the language is determined automatically, by the word strings with high likelihood, through the search of the single integrated language network. Because the networks of several languages are combined, a large amount of memory is required, but the search can be performed efficiently if the network search is organized as parallel processes.

When the speech processing unit 100 receives as feedback the language information identified by the combined network decoder 270, that is, the character string (or word string) determined to have the highest score, it can perform post-processing that presents the recognized result. The speech processing unit 100 can also stop extracting language-independent feature data and perform signal analysis that extracts feature data effectively for the identified language.
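The feedback behavior described here, switching from language-independent feature extraction to extraction suited to the identified language, can be illustrated with a small Python sketch. Everything in it is hypothetical: the `SpeechFrontEnd` class, the `"generic"` key, and the placeholder extractors stand in for whatever signal analysis an actual speech processing unit would perform.

```python
from typing import Callable, Dict, List, Optional

Extractor = Callable[[List[float]], List[float]]


class SpeechFrontEnd:
    """Toy front end that changes its feature extractor on language feedback."""

    def __init__(self, extractors: Dict[str, Extractor]) -> None:
        if "generic" not in extractors:
            raise ValueError("a 'generic' extractor is required")
        self.extractors = extractors
        self.language: Optional[str] = None  # unknown until feedback arrives

    def on_language_feedback(self, language: str) -> None:
        # Switch only if a tuned extractor exists for the identified language.
        if language in self.extractors:
            self.language = language

    def extract(self, frame: List[float]) -> List[float]:
        key = self.language if self.language is not None else "generic"
        return self.extractors[key](frame)


front_end = SpeechFrontEnd({
    "generic": lambda frame: list(frame),          # language-independent placeholder
    "ko": lambda frame: [2.0 * x for x in frame],  # placeholder for Korean-tuned analysis
})
before = front_end.extract([1.0, 2.0])
front_end.on_language_feedback("ko")
after = front_end.extract([1.0, 2.0])
```

Falling back to the generic extractor when no tuned extractor exists is a design choice of this sketch, not something the specification states.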

FIG. 6 is a diagram illustrating an example of how the speech recognition system 500 according to an embodiment of the present invention may be implemented. The speech recognition system 500 according to an embodiment of the present invention may consist of hardware, software, or a combination thereof. For example, the speech recognition system 500 may be implemented as the computing system 1000 shown in FIG. 6.

The computing system 1000 may include at least one processor 1100, a memory 1300, a user interface input device 1400, a user interface output device 1500, a storage 1600, and a network interface 1700, which are connected via a bus 1200. The processor 1100 may be a central processing unit (CPU) or a semiconductor device that executes processing for instructions stored in the memory 1300 and/or the storage 1600. The memory 1300 and the storage 1600 may include various types of volatile or nonvolatile storage media. For example, the memory 1300 may include a ROM (Read Only Memory) 1310 and a RAM (Random Access Memory) 1320.

Accordingly, the steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by the processor 1100, or in a combination of the two. The software module may reside in a storage medium (that is, the memory 1300 and/or the storage 1600) such as RAM, flash memory, ROM, EPROM, EEPROM, a register, a hard disk, a removable disk, or a CD-ROM. An exemplary storage medium is coupled to the processor 1100, and the processor 1100 can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral with the processor 1100. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a user terminal. Alternatively, the processor and the storage medium may reside as discrete components in a user terminal.

As described above, the speech recognition system 500 according to the present invention can automatically identify the spoken language while recognizing a speaker's speech, without any separate procedure for user registration or recognition-language setting such as pressing a button to manually select the language to be spoken, and can thereby process multilingual speech recognition effectively. Whereas existing methods determine the language by recording the language used in the user's registration details on the user's terminal in advance, the present invention starts language identification at the moment the speech is delivered, so it requires no advance preparation and does not depend on the user terminal. In addition, the speech recognition system 500 according to the present invention supports automatic multilingual speech recognition, so that speech in each language is recognized automatically even when people who speak different languages use a single terminal, which increases user convenience. This can be applied to recording the proceedings of meetings among people who speak several different languages, such as multilingual conferences. Furthermore, because the speech recognition system 500 according to the present invention determines the language from scores measured while performing speech recognition on speech uttered in real time, the speech recognition result can be obtained quickly without first running a dedicated language identification recognizer.

The foregoing description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention.

Accordingly, the embodiments disclosed in the present invention are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present invention.

Speech processing unit (100)
Language identification speech recognition unit (200)
Plurality of language decoders (211 to 219)
Language determination module (220)
Acoustic model sharing unit (230)
Plurality of language network decoders (241 to 249)
Language determination module (250)
Acoustic model sharing unit (260)
Combined network decoder (270)

Claims

1. A speech recognition system comprising:
a speech processing unit that analyzes a speech signal and extracts feature data; and
a language identification speech recognition unit that performs language identification and speech recognition using the feature data and feeds the identified language information back to the speech processing unit,
wherein the speech processing unit outputs the speech recognition result of the language identification speech recognition unit according to the fed-back identified language information.

2. The speech recognition system of claim 1,
wherein the language identification speech recognition unit identifies the language of the speech signal by analyzing similarity with reference to an acoustic model and a language model.

3. The speech recognition system of claim 1,
wherein the language identification speech recognition unit comprises:
a plurality of language decoders, each of which performs speech recognition on the feature data in parallel and computes a language identification score by analyzing similarity for each speech signal frame based on the feature data with reference to the acoustic model and language model of its language; and
a language determination module that, with reference to the accumulated language identification scores received from the plurality of language decoders, determines the language corresponding to a target language decoder selected according to a determination rule to be the identified language, and outputs the identified language information.

4. The speech recognition system of claim 3,
wherein the language determination module sequentially sends a decoding end command to language decoders with lower scores, based on the accumulated language identification scores, to terminate their operation, and the speech processing unit outputs the speech recognition result of the target language decoder that remains last.

5. The speech recognition system of claim 3,
wherein the language identification score comprises
a value obtained by summing an acoustic model score and a language model score, the reciprocal of the number of tokens for similar language candidates generated during the network search, or a combination thereof.

6. The speech recognition system of claim 3,
wherein the determination rule comprises
a scheme that terminates, in order, each language decoder whose accumulated language identification score differs from the highest value by at least a reference value, or a scheme that terminates each language decoder whose language identification score, taken frame by frame, differs from the highest value by at least a reference value.

7. The speech recognition system of claim 1,
wherein the language identification speech recognition unit comprises:
an acoustic model sharing unit that computes an acoustic model score by analyzing similarity for each of one or more speech signal frames based on the feature data, sharing part of the per-language acoustic models of multiple languages or a predetermined multilingual acoustic model;
a plurality of language network decoders, each of which shares the acoustic model score and performs speech recognition on the feature data in parallel, and computes a language identification score by summing the shared acoustic model score and a language model score computed from the feature data with reference to a language model; and
a language determination module that, with reference to the accumulated language identification scores received from the plurality of language network decoders, determines the language corresponding to a target language decoder selected according to a determination rule to be the identified language, and outputs the identified language information.

8. The speech recognition system of claim 7,
wherein the language determination module sequentially sends a decoding end command to language network decoders with lower scores, based on the accumulated language identification scores, to terminate their operation, and the speech processing unit outputs the speech recognition result of the target language network decoder that remains last.

9. The speech recognition system of claim 1,
wherein the language identification speech recognition unit comprises:
an acoustic model sharing unit that computes an acoustic model score by analyzing similarity for each of one or more speech signal frames based on the feature data, using multilingual common phonemes together with language-specific distinguishing phonemes; and
a combined network decoder that performs speech recognition on the feature data using an integrated language network, with no language distinction, in which the language networks of a plurality of individual languages are integrated into one, computes a language identification score by summing the shared acoustic model score and a language model score computed from the feature data with reference to a language model, and outputs the character string determined to have the highest score based on the language identification score.

10. The speech recognition system of claim 9,
wherein the speech processing unit outputs, through a predetermined output interface, the determined character string as the speech recognition result of the combined network decoder.

11. A speech recognition method comprising:
analyzing a speech signal and extracting feature data;
performing language identification and speech recognition using the feature data and outputting identified language information; and
outputting a speech recognition result through a predetermined output interface according to the identified language information.

12. The speech recognition method of claim 11,
wherein the outputting of the identified language information identifies the language of the speech signal by analyzing similarity with reference to an acoustic model and a language model.

13. The speech recognition method of claim 11,
wherein the outputting of the identified language information comprises:
computing, in each of a plurality of language decoders that perform speech recognition on the feature data in parallel, a language identification score by analyzing similarity for each of one or more speech signal frames based on the feature data with reference to the acoustic model and language model of the corresponding language; and
determining, with reference to the accumulated language identification scores received from the plurality of language decoders, the language corresponding to a target language decoder selected according to a determination rule to be the identified language, and outputting the identified language information.

14. The speech recognition method of claim 13,
wherein, in the outputting of the identified language information, a decoding end command is sequentially sent to language decoders with lower scores, based on the accumulated language identification scores, to terminate their operation, and the speech recognition result of the target language decoder that remains last is output.

15. The speech recognition method of claim 13,
wherein the language identification score comprises
a value obtained by summing an acoustic model score and a language model score, the reciprocal of the number of tokens for similar language candidates generated during the network search, or a combination thereof.

16. The speech recognition method of claim 13,
wherein the determination rule comprises
a scheme that terminates, in order, each language decoder whose accumulated language identification score differs from the highest value by at least a reference value, or a scheme that terminates each language decoder whose language identification score, taken frame by frame, differs from the highest value by at least a reference value.

17. The speech recognition method of claim 11,
wherein the outputting of the identified language information comprises:
computing an acoustic model score by analyzing similarity for each of one or more speech signal frames based on the feature data, sharing part of the per-language acoustic models of multiple languages or a predetermined multilingual acoustic model;
computing, in each of a plurality of language network decoders that share the acoustic model score and perform speech recognition on the feature data in parallel, a language identification score by summing the shared acoustic model score and a language model score computed from the feature data with reference to a language model; and
determining, with reference to the accumulated language identification scores received from the plurality of language network decoders, the language corresponding to a target language decoder selected according to a determination rule to be the identified language, and outputting the identified language information.

18. The speech recognition method of claim 17,
wherein, in the outputting of the identified language information, a decoding end command is sequentially sent to language network decoders with lower scores, based on the accumulated language identification scores, to terminate their operation, and the speech recognition result of the target language network decoder that remains last is output.

19. The speech recognition method of claim 11,
wherein the outputting of the identified language information comprises:
computing an acoustic model score by analyzing similarity for each of one or more speech signal frames based on the feature data, using multilingual common phonemes together with language-specific distinguishing phonemes; and
in a combined network decoder in which the language networks of a plurality of individual languages are integrated into one, performing speech recognition on the feature data using the integrated language network with no language distinction, computing a language identification score by summing the shared acoustic model score and a language model score computed from the feature data with reference to a language model, and outputting the character string determined to have the highest score based on the language identification score.

20. The speech recognition method of claim 19,
further comprising outputting, through a predetermined output interface, the determined character string as the speech recognition result of the combined network decoder.