KR20160030168A

KR20160030168A - Voice recognition method, apparatus, and system

Info

Publication number: KR20160030168A
Application number: KR1020167000254A
Authority: KR
Inventors: 김사무엘; 오현오; 송명석
Original assignee: 주식회사 윌러스표준기술연구소
Priority date: 2013-07-09
Filing date: 2014-07-09
Publication date: 2016-03-16
Also published as: WO2015005679A1

Abstract

The present invention relates to a voice recognition apparatus, system, and method for improving voice recognition performance by using a user's personal information. A voice recognition system according to an embodiment of the present invention may include: a terminal that receives a voice signal and collects the user's personal information; a private server that receives the voice signal and the personal information from the terminal, classifies the personal information into preset categories and stores it, and transmits the voice signal and at least part of the stored personal information to a voice recognition server; and a voice recognition server that performs voice recognition based on the voice signal and the personal information transmitted from the private server and generates a voice recognition result.

Description

Voice recognition method, apparatus, and system

The present invention relates to a voice recognition apparatus, system, and method, and more particularly to a voice recognition apparatus, system, and method for improving voice recognition performance by using a user's personal information.

Voice recognition is one of the key technologies that can ease interaction between a user and a terminal. Through voice recognition, a terminal can listen to the user's voice, understand it, and provide the user with an appropriate service based on what it has understood.

In general, voice recognition uses a voice recognition model built by extracting statistical features from a large amount of utterance data and language data. A voice recognition apparatus analyzes the user's voice, measures its similarity to the previously built voice recognition model, and infers the information contained in the user's voice.

However, despite remarkable recent progress, voice recognition technology is currently used only in very limited fields relative to its broad potential. This is due to several limitations of the technology. One limitation stems from the use of a generalized voice recognition model rather than one tailored to the characteristics of the individual user. In addition, the limited computing power of current voice recognition terminals is a major obstacle.

The present invention has been devised to solve the above problems. Its object is to provide a voice recognition system that collects information about a user and uses it to personalize, for that user, the acoustic model and the language model used in the voice recognition process, thereby improving performance.

In addition, the present invention aims to provide a voice recognition system with robust security that does not leak the collected personal information of the user without the user's consent.

According to an embodiment of the present invention for solving the above problems, there may be provided a voice recognition system comprising: a terminal that receives a voice signal from a user and collects the user's personal information; a private server that receives the voice signal and the personal information from the terminal, classifies the personal information into preset categories and stores it, and transmits the voice signal and at least part of the stored personal information to a voice recognition server; and a voice recognition server that performs voice recognition based on the voice signal and the personal information transmitted from the private server and generates a voice recognition result, wherein the personal information transmitted from the private server to the voice recognition server is personal information that the user has set as public, and the voice recognition server includes an acoustic model unit that selects at least one of a phoneme, a syllable, and a word corresponding to the voice signal, a language model unit that forms a character string by referring to the sentence structure of a language, and an environment controller that selects the acoustic model and the language model to be used by the acoustic model unit and the language model unit in the voice recognition process.

Here, the environment controller selects at least one acoustic model and at least one language model by referring to the personal information transmitted to the voice recognition server.
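The selection step performed by the environment controller can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify a selection rule, so the tag-overlap ranking, the `Model` type, and all model names and attribute values below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    tags: frozenset  # hypothetical: user or environment attributes the model suits


def select_models(models, public_info, top_k=1):
    """Rank models by how many of their tags match the public personal info."""
    info_values = set(public_info.values())
    ranked = sorted(models, key=lambda m: len(m.tags & info_values), reverse=True)
    return ranked[:top_k]


# Hypothetical model inventories held by the voice recognition server.
acoustic_models = [
    Model("am_general", frozenset()),
    Model("am_adult_female", frozenset({"female", "adult"})),
    Model("am_car_noise", frozenset({"in_car"})),
]
language_models = [
    Model("lm_general", frozenset()),
    Model("lm_office", frozenset({"office"})),
]

# Only information the user has set as public reaches the server.
public_info = {"gender": "female", "age_group": "adult", "environment": "office"}

chosen_am = select_models(acoustic_models, public_info)
chosen_lm = select_models(language_models, public_info)
print(chosen_am[0].name, chosen_lm[0].name)
```

Raising `top_k` returns several models per type, which matches the "at least one" wording; a real system would presumably use a richer matching criterion than tag overlap.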

In addition, the personal information includes user behavior information, collected from records of user actions and from measurements of user actions, and user state information, which indicates the user's own personal details and the user's situation. The user behavior information includes a user online record, which collects the user's online activity and Internet usage history; user location information, which indicates the user's actual location; user connection information, which is the user's communication identification information; and user device usage information, which is collected during interaction between the user and the terminal. The user state information includes user attribute information, which indicates the user's personal details and personality, physical, and emotional state, and environment attribute information, which indicates the characteristics of the environment surrounding the user.

In addition, the terminal receives the user state information directly from the user, or infers it from at least one of the voice signal and the user behavior information.

Alternatively, the private server receives the user state information directly from the user, or infers the user state information from at least one of the voice signal and the user behavior information.

The voice recognition server derives a plurality of voice recognition results and transmits them to the private server, together with information on the types of acoustic model and language model used in the voice recognition process.

Here, the private server selects at least one of the plurality of voice recognition results transmitted from the voice recognition server, using both the public personal information and the non-public personal information.

In addition, the private server selects at least one of the plurality of voice recognition results transmitted from the voice recognition server by applying, to the probability value of each of the plurality of voice recognition results, a weight based on the public personal information and the non-public personal information, and then selecting the voice recognition result that has the highest resulting probability value.
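The weighting and selection described above can be illustrated with a short sketch. It rests on assumptions that are not in the text: the weight is applied multiplicatively, it is derived from how many words of a candidate appear in the user's personal information, and the field names (`keywords`, `contacts`, `am`, `lm`) and the `boost` factor are hypothetical.

```python
def pick_result(candidates, public_info, private_info, boost=2.0):
    """Reweight each candidate's probability and return the best candidate."""
    # Words known from public and non-public personal information (assumed fields).
    known = set(public_info.get("keywords", [])) | set(private_info.get("contacts", []))

    def weighted(candidate):
        hits = sum(1 for word in candidate["text"].split() if word in known)
        return candidate["prob"] * (boost ** hits)  # assumed multiplicative weight

    return max(candidates, key=weighted)


# Results as they might arrive from the voice recognition server, each with the
# types of acoustic model and language model that produced it.
candidates = [
    {"text": "call john", "prob": 0.40, "am": "am_general", "lm": "lm_general"},
    {"text": "call joan", "prob": 0.35, "am": "am_general", "lm": "lm_general"},
]

# Non-public information never leaves the private server, but it can be used here.
private_info = {"contacts": ["joan"]}

best = pick_result(candidates, {}, private_info)
print(best["text"])
```

Because the reweighting runs on the private server (or on the terminal in the embodiment without a private server), non-public information can influence the final choice without being sent to the voice recognition server.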

According to another embodiment of the present invention, there may be provided a voice recognition system comprising: a first terminal that receives a voice signal from a first user and collects the first user's personal information; a private server that receives the voice signal and the personal information from the first terminal, classifies the personal information into preset categories and stores it, and transmits the voice signal and at least part of the stored personal information to a voice recognition server; a voice recognition server that performs voice recognition based on the voice signal and the personal information transmitted from the private server to generate a voice recognition result, and transmits the voice recognition result to a second terminal; and a second terminal that receives the voice recognition result and outputs it to a second user, wherein the personal information transmitted from the private server to the voice recognition server is personal information that the first user has set as public, and the voice recognition server includes an acoustic model unit that selects at least one of a phoneme, a syllable, and a word corresponding to the voice signal, a language model unit that forms a character string by referring to the sentence structure of a language, and an environment controller that selects the acoustic model and the language model to be used by the acoustic model unit and the language model unit in the voice recognition process.

Here, the environment controller selects at least one acoustic model and at least one language model by referring to the personal information of the first user transmitted to the voice recognition server.

The first terminal receives the user state information directly from the user, or infers it from at least one of the voice signal and the user behavior information.

Alternatively, the private server selects at least one of the plurality of voice recognition results transmitted from the voice recognition server by applying, to the probability value of each of the plurality of voice recognition results, a weight based on the public personal information and the non-public personal information, and then selecting the voice recognition result that has the highest resulting probability value.

The second terminal outputs the voice recognition result as speech.

Here, the second terminal receives the first user's personal information from the private server and, when converting the voice recognition result into speech, forms the speech by referring to the first user's personal information.

Alternatively, when converting the voice recognition result into speech, the second terminal forms the speech by referring to at least one of separately stored voice characteristics and environment characteristics.

In addition, the second terminal further includes a translation unit that translates languages, and the translation unit translates the voice recognition result into a language selected by the second user.

According to yet another embodiment of the present invention, there may be provided a voice recognition system comprising: a terminal that receives a voice signal from a user, collects the user's personal information, and transmits the voice signal and at least part of the personal information to a voice recognition server; and a voice recognition server that performs voice recognition based on the voice signal and the personal information transmitted from the terminal to generate a voice recognition result, wherein the personal information transmitted from the terminal to the voice recognition server is personal information that the user has set as public, and the voice recognition server includes an acoustic model unit that selects at least one of a phoneme, a syllable, and a word corresponding to the voice signal, a language model unit that forms a character string by referring to the sentence structure of a language, and an environment controller that selects the acoustic model and the language model to be used by the acoustic model unit and the language model unit in the voice recognition process.

Here, the environment controller selects at least one of an acoustic model and a language model by referring to the personal information transmitted to the voice recognition server.

The terminal receives the user state information directly from the user, or infers it from at least one of the voice signal and the user behavior information.

Alternatively, the voice recognition server receives the user state information directly from the user, or infers the user state information from at least one of the voice signal and the user behavior information.

In addition, the voice recognition server derives a plurality of voice recognition results and transmits them to the terminal, together with information on the types of acoustic model and language model used in the voice recognition process.

The terminal selects at least one of the plurality of voice recognition results transmitted from the voice recognition server, using both the public personal information and the non-public personal information.

Alternatively, the terminal selects at least one of the plurality of voice recognition results transmitted from the voice recognition server by applying, to the probability value of each of the plurality of voice recognition results, a weight based on the public personal information and the non-public personal information, and then selecting the voice recognition result that has the highest resulting probability value.

The voice recognition server derives a plurality of voice recognition results and selects at least one of them using the public personal information.

Alternatively, the voice recognition server derives a plurality of voice recognition results, applies a weight based on the public personal information to the probability values of the plurality of voice recognition results, and selects the voice recognition result that has the highest of the weighted probability values.

According to yet another embodiment of the present invention, there may be provided a voice recognition method comprising: receiving a voice signal from a user; collecting the user's personal information; generating a voice recognition result from the voice signal based on the voice signal and the personal information; and selecting a final voice recognition result from the voice recognition result, wherein generating the voice recognition result from the voice signal based on the voice signal and the personal information further includes selecting an acoustic model and a language model by referring to the personal information that the user has set as public.

Here, collecting the user's personal information further includes: acquiring personal information entered directly by the user; and inferring user state information from at least one of the voice signal and user behavior information.

Generating the voice recognition result from the voice signal based on the voice signal and the personal information produces a plurality of voice recognition results and, for each of them, also produces information on the types of acoustic model and language model used when performing the voice recognition.

In addition, selecting the final voice recognition result uses both the public personal information and the non-public personal information.

Selecting the final voice recognition result applies, to the probability of each of the plurality of voice recognition results, a weight based on the public personal information and the non-public personal information, and then selects the voice recognition result that has the highest resulting probability value.

According to the present invention, a user's personal information can be collected, and an acoustic model and a language model personalized to the user can be selected using the collected personal information. Performing voice recognition with the selected personalized acoustic model and language model can increase the success rate of voice recognition.

In addition, according to an embodiment of the present invention, the user's personal information is stored only in a private space such as the user's terminal or a private server, and only public personal information is transmitted to the voice recognition server where voice recognition is performed, so the user's personal information can be strongly protected.
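As a minimal sketch of this separation, the private space could keep a per-category public flag and strip everything else before contacting the voice recognition server. The category names, the `public` flag layout, and the request format below are hypothetical; the text only states that personal information is stored by category and that only public information is transmitted.

```python
# Hypothetical profile kept on the private server (or terminal): each preset
# category stores a value plus a flag saying whether the user made it public.
stored_info = {
    "age_group": {"value": "adult", "public": True},
    "region": {"value": "Seoul", "public": True},
    "contacts": {"value": ["joan", "minsu"], "public": False},
    "location": {"value": (37.57, 126.98), "public": False},
}


def build_request(voice_signal, stored_info):
    """Bundle the voice signal with only the personal info marked public."""
    public_info = {
        category: entry["value"]
        for category, entry in stored_info.items()
        if entry["public"]
    }
    return {"voice": voice_signal, "personal_info": public_info}


request = build_request(b"\x00\x01", stored_info)
print(sorted(request["personal_info"]))
```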

In addition, according to an embodiment of the present invention, the user can transmit a voice recognition result to another person and can thereby exchange voice recognition results with that person in real time.

In addition, according to an embodiment of the present invention, it is possible to provide a voice recognition system in which each component can be placed freely according to the performance of the terminal, the private server, and the voice recognition server.

FIG. 1 is a diagram showing a voice recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing a voice recognition system according to an embodiment of the present invention.
FIG. 3 is a diagram showing a voice recognition system according to another embodiment of the present invention.
FIG. 4 is a diagram showing a voice recognition system according to yet another embodiment of the present invention.
FIG. 5 is a diagram showing an embodiment of a voice recognition system including a private server.
FIG. 6 is a diagram showing another embodiment of a voice recognition system including a private server.
FIG. 7 is a diagram showing another embodiment of a voice recognition system including a terminal and a voice recognition server.
FIG. 8 is a diagram showing an embodiment of a voice recognition system that transmits a first user's voice recognition result to a second user.
FIG. 9 is a diagram showing a voice recognition method according to an embodiment of the present invention.

Best Mode for Carrying Out the Invention

The present invention relates to a voice recognition apparatus, system, and method for improving voice recognition performance by using a user's personal information while protecting that personal information. Preferred embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a diagram showing a voice recognition apparatus 100 according to an embodiment of the present invention.

Referring to FIG. 1, the voice recognition apparatus 100 according to the present invention may include an input unit 112, a feature extraction unit 130, an acoustic model unit 152, a language model unit 154, and an output unit 114. In FIG. 1, solid lines indicate the flow of the voice signal and the voice recognition result, and dotted lines indicate the flow of additional information needed for voice recognition. As shown in FIG. 1, the acoustic model unit 152 may include a plurality of acoustic models, and the language model unit 154 may likewise include a plurality of language models. Details are covered in the descriptions of the acoustic model unit 152 and the language model unit 154.

Although the voice recognition apparatus 100 is called an "apparatus" for convenience of description, it may exist as software, as hardware, or as a combination of software and hardware. The voice recognition apparatus 100 may take the form of a PC installed at a particular location, or the form of an easily portable terminal such as a smartphone, a laptop, or a wearable device.

The input unit 112 is a component that collects the voice of the user 800 and converts it into an electrical signal. A device such as a microphone is typically used, but the input unit is not limited thereto. The input unit 112 may collect video signals as well as voice signals, and may capture, for example, the shape of the face of the user 800 using a video input device such as a camera. By using a video input device in the input unit 112, the voice recognition apparatus 100 according to the present invention may be configured to infer the sound currently being pronounced from the shape of the face or mouth of the user 800.

The feature extraction unit 130 may generate, from the collected voice signal, the basic information needed for voice recognition. It divides the voice signal collected through the input unit 112 into specific intervals (frames) and extracts information such as the energy distribution across the frequency bands of the voice. The per-band information may be expressed as numerical vectors, and these vectors may be used as voice features. Methods for extracting features from a voice signal include LPC (Linear Predictive Coding) cepstrum, PLP (Perceptual Linear Prediction) cepstrum, Mel Frequency Cepstral Coefficients (MFCC), and filter bank energy analysis, but are not limited thereto.
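A rough sketch of the framing and per-band energy computation is given below. It is a simplified stand-in, not any of the named methods: it uses a naive DFT and equal-width frequency bands, whereas MFCC or PLP would use perceptually spaced filters. The frame size, hop, band count, and test tone are assumptions chosen for illustration.

```python
import cmath
import math


def split_frames(signal, size, hop):
    """Cut the signal into overlapping frames of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def band_energies(frame, n_bands):
    """Return one energy value per equal-width frequency band (the feature vector)."""
    n = len(frame)
    power = []
    for k in range(n // 2):  # naive DFT over the non-negative frequency bins
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        power.append(abs(s) ** 2)
    width = len(power) // n_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]


# Hypothetical input: 0.1 s of a 1 kHz tone sampled at 8 kHz.
rate = 8000
signal = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(800)]

features = [band_energies(f, n_bands=4) for f in split_frames(signal, size=64, hop=32)]
print(len(features), features[0].index(max(features[0])))
```

With these settings each band spans 1 kHz, so the tone's energy falls in the second band (index 1) of every frame.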

The acoustic model unit 152 may determine the basic unit of language that corresponds to the voice feature extracted by the feature extraction unit 130. The basic unit of language may be a phoneme, a syllable, a word, and so on. For example, the acoustic model unit 152 analyzes whether the sound a user produces when saying "dog" in English actually corresponds to /d/, /o/, and /g/, the phonemes of the word "dog", and recognizes the user's voice signal as the respective phonemes.

In a voice signal, even the same word can sound different depending on who pronounces it and on where the word appears in a sentence. A great deal of utterance data is therefore needed to determine which voice feature corresponds to which basic unit of language. According to a preferred embodiment of the voice recognition apparatus 100 according to the present invention, the acoustic model unit 152 may communicate with a voice database 372 that stores this large amount of utterance data. In a training phase, the acoustic model unit 152 may refer to the large amount of utterance data stored in the voice database 372 to generate a statistical acoustic model that determines the basic unit of language corresponding to each voice feature. The acoustic model unit 152 may measure the similarity between the voice feature corresponding to each phoneme in the generated acoustic model and the voice feature sent from the feature extraction unit 130, and select the phoneme with the highest similarity. The acoustic model unit 152 may then combine the selected phonemes to generate a word. When determining the basic unit of language (a phoneme, syllable, word, and so on) that corresponds to the acoustic model, the acoustic model unit 152 may select one or more candidates as its result. A hidden Markov model (HMM) or a neural network may be used to generate the acoustic model in the acoustic model unit 152, but the method is not limited thereto.
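The similarity measurement can be illustrated with a toy nearest-template matcher. This is an assumption-laden simplification: it replaces the HMM or neural network mentioned in the text with fixed per-phoneme template vectors and uses negative Euclidean distance as the similarity; the templates and observed features are invented for the example.

```python
import math

# Hypothetical "trained" templates: one representative feature vector per phoneme.
templates = {
    "d": [0.9, 0.1, 0.0],
    "o": [0.1, 0.8, 0.1],
    "g": [0.0, 0.2, 0.8],
}


def similarity(a, b):
    """Higher is more similar; here simply the negative Euclidean distance."""
    return -math.dist(a, b)


def best_phonemes(feature, top_n=1):
    """Return the top_n phonemes whose templates are most similar to the feature."""
    ranked = sorted(templates, key=lambda p: similarity(feature, templates[p]), reverse=True)
    return ranked[:top_n]


# Hypothetical features for three frames sent by the feature extraction unit.
observed = [[0.8, 0.2, 0.1], [0.2, 0.7, 0.1], [0.1, 0.1, 0.9]]

# Combine the best phoneme of each frame into a word.
word = "".join(best_phonemes(f)[0] for f in observed)
print(word)
```

Calling `best_phonemes(feature, top_n=2)` returns more than one candidate per frame, which corresponds to selecting one or more results.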

The language model unit 154 may form a character string by referring to the sentence structure of a language. In every language, the words in a sentence are arranged according to certain rules. The language model unit 154 refers to this sentence structure to determine the ordering relationships among characters and, when a particular character has been recognized, predicts the characters that can follow it. On the assumption that the user 800 has spoken in accordance with the grammar or rules of the language, the language model unit 154 can treat characters that do not fit the structure of the character string as misrecognized and eliminate them from the candidates, which raises the success rate of character string recognition.

However, people often do not speak with exact grammar in everyday life, and this must be taken into account. In addition, sentences with similar meanings may be expressed with very different sentence structures depending on the age, gender, and place of residence of the speaker. The language model unit 154 may go through a separate training phase in order to recognize these varied sentence structures correctly, and may form a statistical language model through that training phase. To form the language model, the language model unit 154 needs to communicate with the language database 374, which stores a vast number of sentence structures, just as the acoustic model unit 152 described above communicates with its database. The language model unit 154 may generate at least one character string as the result of character string recognition. As that result, the language model unit 154 may generate a set of character strings in a lattice structure, in which the words contained in the at least one character string are represented as nodes and the connections between them, each labeled with its probability, are represented as branches. More than one kind of character may be selected as a possible successor of a given character in a sentence. The probability of the connection between a first character and the next character may be determined differently depending on the type of language model. Even when a single language model is used in the speech recognition process, a plurality of character strings may be formed depending on the order of the characters, and the probability of forming each character string may accordingly be calculated differently.
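The lattice of nodes and probability-labeled branches can be sketched as follows. The toy vocabulary, the branch probabilities, the `<s>` and `</s>` boundary markers, and the function name `enumerate_strings` are all assumptions for illustration; the probability of a string is taken here as the product of the branch probabilities along its path.

```python
# Hypothetical lattice: each node (word) maps to its successors and branch probabilities.
LATTICE = {
    "<s>": {"call": 0.6, "tall": 0.4},
    "call": {"mom": 0.7, "tom": 0.3},
    "tall": {"mom": 0.2, "tom": 0.8},
    "mom": {"</s>": 1.0},
    "tom": {"</s>": 1.0},
}

def enumerate_strings(lattice, node="<s>", prob=1.0, words=()):
    """Yield (word tuple, probability) for every path from node to the end marker."""
    if node == "</s>":
        yield words, prob
        return
    for nxt, p in lattice[node].items():
        next_words = words if nxt == "</s>" else words + (nxt,)
        yield from enumerate_strings(lattice, nxt, prob * p, next_words)

# All candidate strings, most probable first.
candidates = sorted(enumerate_strings(LATTICE), key=lambda c: c[1], reverse=True)
```

A single lattice therefore yields several candidate strings, each with its own probability, even though only one language model was used to build it.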

The speech recognition apparatus 100 may refer to a plurality of acoustic models and a plurality of language models in the speech recognition process, and may thereby generate a plurality of speech recognition results. The speech recognition apparatus 100 may refer to the probability associated with each character string, select the character string with the highest probability as the final speech recognition result, and transmit it to the output unit 114. A plurality of character strings may be generated even when a single acoustic model and a single language model are used; in this case as well, the speech recognition server 300 may select the character string with the highest probability as the final speech recognition result.
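Selecting the final result from several sets of candidates might look like the sketch below. The function name `select_final_result` and the input layout are hypothetical, and the sketch assumes that probabilities produced by different models are directly comparable, which the description does not state.

```python
def select_final_result(result_sets):
    """Pick the most probable string across the outputs of several model pairs.

    result_sets: iterable of lists of (string, probability) pairs,
    one list per acoustic/language model combination (assumed layout).
    """
    best = {}
    for results in result_sets:
        for text, prob in results:
            # Keep the highest probability seen for each distinct string.
            best[text] = max(prob, best.get(text, 0.0))
    if not best:
        return None
    return max(best, key=best.get)
```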

The output unit 114 outputs the speech recognition result of the speech recognition apparatus 100. The speech recognition result may include at least one of the basic units of language recognized by the acoustic model unit 152 and the character string recognized by the language model unit 154. The speech recognition result may take the form of a plurality of character strings, or of the lattice-structured set of character strings described above.

Although FIG. 1 shows the input unit 112 and the output unit 114 as being included in the speech recognition apparatus 100, the present invention is not limited thereto, and the input unit 112 and the output unit 114 may be provided as components separate from the speech recognition apparatus 100. For example, when the speech recognition apparatus 100 is a smartphone, the user may attach a high-performance microphone to the smartphone to capture speech for more accurate recognition. The user may also connect a large monitor, a beam projector, a multi-channel speaker, or the like to the smartphone to output the speech recognition result in various ways.

The acoustic database 372 and the language database 374 described above may be provided separately from the speech recognition apparatus 100, as shown in FIG. 1, but the present invention is not limited thereto. In particular, if the computing power and storage capacity of the speech recognition apparatus 100 are sufficient, the two databases 372 and 374 may be included in the speech recognition apparatus 100.

FIG. 2 is a diagram illustrating a speech recognition system 1000A according to an embodiment of the present invention.

Referring to FIG. 2, the speech recognition system 1000A according to an embodiment of the present invention may include a terminal 200 and a speech recognition server 300. The terminal 200 may include an input unit 212, a feature extraction unit 230, and an output unit 214. The speech recognition server 300 may include an acoustic model unit 352, a language model unit 354, an acoustic database 372, and a language database 374.

Depending on the computing power of the terminal 200, the feature extraction unit 230 may instead be included in the speech recognition server 300, and the acoustic database 372 and the language database 374 of the speech recognition server 300 may also be located outside the speech recognition server 300.

The input unit 212, feature extraction unit 230, output unit 214, acoustic model unit 352, language model unit 354, acoustic database 372, and language database 374 of FIG. 2 correspond to the input unit 112, feature extraction unit 130, output unit 114, acoustic model unit 152, language model unit 154, acoustic database 372, and language database 374 of FIG. 1, so a detailed description of them is omitted.

The speech recognition system 1000A of FIG. 2, composed of the terminal 200 and the speech recognition server 300, has the following advantages. First, the terminal 200 performs only the minimum input and output and basic speech signal processing needed for speech recognition, so the burden on the terminal 200, which has comparatively low computing power, is small. Instead, most of the computation is performed at high speed by the speech recognition server 300, which has comparatively high processing and storage capacity, and the terminal 200 only needs to receive the result. Because wired and wireless communication environments such as the Internet have advanced far beyond what was available in the past, the terminal 200 and the speech recognition server 300 can communicate freely. In addition, a wide variety of terminals 200 with differing computing power are being developed and sold, and providing a separate speech recognition process for each terminal 200 may be inefficient. By having the speech recognition server 300 perform most of the computation, as shown in FIG. 2, a system can be implemented that is independent of the type of terminal 200. Of course, the overall speech recognition system 1000A may also be implemented by freely dividing the speech recognition processing steps between the terminal 200 and the speech recognition server 300 with reference to the processing capability of the specific terminal 200.

Meanwhile, a plurality of speech recognition servers 300 may be provided, and the plurality of speech recognition servers 300 may perform cloud-based distributed speech recognition (DSR). Distributed speech recognition refers to a technique in which, to improve speech recognition performance in a wireless communication environment, the features of a speech signal are converted into digital data and transmitted, and the speech recognition servers process them in a distributed manner. Distributed speech recognition can improve the processing speed of speech recognition operations and the efficiency of memory use.

Although FIG. 2 shows the terminal 200 receiving the speech recognition result from the speech recognition server 300 and outputting it through the output unit, the present invention is not limited thereto, and the speech recognition result may be transmitted to a device or output apparatus other than the terminal 200 of FIG. 2.

FIG. 3 is a diagram illustrating a speech recognition system 1000B according to another embodiment of the present invention.

Referring to FIG. 3, the speech recognition system 1000B according to an embodiment of the present invention may include a terminal 200 and a speech recognition server 300. The terminal 200 may include an input unit 212, a feature extraction unit 230, and an output unit 214. The speech recognition server 300 may include an acoustic model unit 352, a language model unit 354, an acoustic database 372, a language database 374, and a configuration controller 380.

Depending on the computing power of the terminal 200, the feature extraction unit 230 may instead be included in the speech recognition server 300, and the acoustic database 372 and the language database 374 may also be located outside the speech recognition server 300.

A detailed description of the components common to FIG. 3 and FIG. 2 would be redundant and is omitted.

The configuration controller 380 selects the acoustic model and the language model that the acoustic model unit 352 and the language model unit 354 will use in the speech recognition process. The acoustic model unit 352 and the language model unit 354 may refer to a plurality of acoustic models and language models in the speech recognition process. The speech features of an acoustic model may differ depending on the speaker's age group, gender, and use of dialect, and speech features may also change with the background noise, reverberation, and similar conditions of the place where the utterance is made. Because the words used and the order of words within a sentence may vary with the speaker's age group, gender, and use of dialect, a variety of language models may be formed depending on the characteristics of the speakers used in the training phase. The speech recognition server 300 can raise the success rate of speech recognition by using a plurality of these varied acoustic models and language models in the speech recognition process.

When distributed speech recognition is performed in the configuration of FIG. 3, each speech recognition server 300 may perform speech recognition using a different acoustic model and language model, and the speech recognition results generated by the various models may be combined into one and transmitted back to the terminal 200. Alternatively, the speech recognition servers 300 may all use the same acoustic model and language model while performing the individual speech recognition processing steps in parallel, thereby increasing the processing speed of speech recognition.
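The first mode, in which each server applies a different model and the results are combined, can be sketched with a thread pool standing in for the servers. Everything here is hypothetical: `make_server` returns a stub that yields fixed hypotheses rather than decoding the features, and merging by sorting on probability is only one possible way of combining results.

```python
from concurrent.futures import ThreadPoolExecutor

def make_server(model_name, hypotheses):
    """Return a stand-in for one recognition server; a real one would decode the features."""
    def recognize(features):
        return [(text, prob, model_name) for text, prob in hypotheses]
    return recognize

def distributed_recognize(servers, features):
    """Send the same features to every server in parallel and merge their hypotheses."""
    if not servers:
        return []
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        partial = list(pool.map(lambda server: server(features), servers))
    merged = [hyp for result in partial for hyp in result]
    # Most probable hypothesis first; each entry is (text, probability, model name).
    return sorted(merged, key=lambda hyp: hyp[1], reverse=True)
```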

FIG. 4 is a diagram illustrating a speech recognition system 1000C according to still another embodiment of the present invention.

Referring to FIG. 4, the speech recognition system 1000C according to an embodiment of the present invention may include a terminal 200 and a speech recognition server 300. The terminal 200 may include an input unit 212, a personal information collection unit 220, a feature extraction unit 230, a personal information analysis unit 240, and an output unit 214. The speech recognition server 300 may include an acoustic model unit 352, a language model unit 354, an acoustic database 372, a language database 374, and a configuration controller 380.

Depending on the computing power of the terminal 200, at least one of the feature extraction unit 230 and the personal information analysis unit 240 may instead be included in the speech recognition server 300, and the acoustic database 372 and the language database 374 of the speech recognition server 300 may also be located outside the speech recognition server 300.

A detailed description of the components common to FIG. 4 and FIG. 3 would be redundant and is omitted.

In the present invention, personal information may include user behavior information collected from records of the user's actions and from measurements of those actions. Personal information may also include the user's own identity details and user status information indicating the user's situation.

The user behavior information may include a user online record, user location information, user connection information, and user device utilization information.

The user online record is information collected from the online activity and Internet usage history of the user 800. It may include posts such as text, photographs, music, and video that the user 800 has created on a social network service (SNS); simple expressions of opinion made by the user on the SNS, such as emoticons, like/dislike, or agree/disagree; the user's list of online contacts; Internet browser search and visit histories; a list of favorite sites; and the like.

The user location information is information indicating the physical location of the user 800. It may include the location that the user 800 has determined with a positioning system such as GPS, location information presented through smartphone applications that provide location-based services, the access location referenced when the user goes online over a wired or wireless network, and the like.

The user connection information is the communication identification information of the user 800, and may include the telephone number, e-mail address, physical address, and the like of the user 800.

The user device utilization information is information collected in the course of interaction between the user 800 and the terminal 200. It may include the types of devices that the user 800 uses, the usage time and frequency of each device, the types of applications that the user 800 runs on a PC, smartphone, or the like, the usage time and frequency of each application, a list of installed applications, a list of applications downloaded online, and the like.

Meanwhile, the user status information may include user attribute information and environment attribute information.

The user attribute information is information indicating the user's identity details and the user's personality, physical condition, and emotional state. It may include the age, gender, ethnicity, dialect, occupation, income, level of education, state of health, emotional state, personality, and the like of the user 800.

The environment attribute information is information indicating the characteristics of the user's surroundings. It may include the background noise and degree of reverberation, which are acoustic characteristics of the space in which the user is located, as well as season, time, weather, and climate information.

The personal information of the user 800 listed above may be collected by the personal information collection unit 220. The personal information collection unit 220 may automatically collect the personal information of the user 800 while the user 800 operates the terminal 200, and may also receive at least one of the user behavior information and the user status information directly from the user 800. The personal information collection unit 220 may be included in the terminal 200 of FIG. 4 that performs speech recognition, or in an external terminal, server, or the like for which the user 800 has been authenticated or has consented to the collection of personal information, and may collect the personal information of the user 800 there.

The personal information analysis unit 240 analyzes the collected personal information. In particular, the personal information analysis unit 240 may infer user status information from at least one of the user behavior information and the speech signal collected through the input unit 212. For example, the personal information analysis unit 240 may distinguish whether the user 800 is male or female by identifying the frequency band in which the energy of the speech signal is mainly distributed. The personal information analysis unit 240 may also analyze the waveform of the vowel portions of the speech signal to assess the condition of the vocal cords of the user 800, and from this infer the age, state of health, and the like of the user 800. As another example, when the user 800 frequently searches through the terminal 200 for information such as cosmetics discounts, clothing sites, photographs of luxury goods, entertainment news, or online parenting communities, the personal information analysis unit 240 may determine that the user 800 is likely to be female. Likewise, when the current location of the user 800 as determined through GPS or the like is inside a concert hall, the personal information analysis unit 240 may infer the background noise level and degree of reverberation of the concert hall.

That is, the user status information may be inferred by the personal information analysis unit 240. As described above, however, the user status information may also be entered directly through the personal information collection unit 220.

The personal information analysis unit 240 may calculate a probability value for each item of the user attribute information and the environment attribute information. For example, the personal information analysis unit 240 may analyze the per-frequency energy distribution of the speech signal of the user 800 and determine that the probability that the user 800 is male is 80%. As another example, when the user 800 directly enters an age of 75 through the personal information collection unit 220, the personal information analysis unit 240 may set the probability that the user 800 is a senior to 100%. As yet another example, the personal information analysis unit 240 may refer to the Internet search history of the user 800 and determine that the probability that the user 800 is female is 70%, the probability that the user 800 is a student is 90%, and the probability that the user 800 lives in Seoul is 60%.
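The per-item probability values can be pictured as a small profile object. The class name `AttributeProfile`, its methods, and the rule that a directly entered value is never overwritten by a later inference are assumptions for illustration; the description does not say how conflicting estimates are reconciled.

```python
class AttributeProfile:
    """Holds one probability per attribute item (hypothetical structure)."""

    def __init__(self):
        self.probabilities = {}  # item name -> probability in [0, 1]
        self.confirmed = set()   # items entered directly by the user

    def set_direct(self, item):
        """Record an item the user entered directly; its probability is 100%."""
        self.probabilities[item] = 1.0
        self.confirmed.add(item)

    def set_inferred(self, item, probability):
        """Record an inferred probability unless the user already confirmed the item."""
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probability must be in [0, 1]")
        if item not in self.confirmed:
            self.probabilities[item] = probability
```

With this sketch, an inferred value of 0.8 for `"male"` and a direct entry for `"senior"` yield a profile of `{"male": 0.8, "senior": 1.0}`.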

The personal information analysis unit 240 may perform the analysis of personal information continuously. Because the user's personal information is continuously collected by the personal information collection unit 220, the amount of personal information may grow over time. The personal information analysis unit 240 may repeat the analysis each time the amount of personal information changes, or may re-analyze the personal information at regular intervals according to a preset scheme. As the amount of personal information grows and its types become more varied, the personal information analysis unit 240 can infer the user status information more accurately, and can thereby improve the accuracy of the probability value of each item of the personal information.

The personal information analysis unit 240 may also infer a behavior pattern of the user 800 from the collected personal information. Suppose, for example, that the user 800 is a student who travels between home and school at fixed times. The personal information analysis unit 240 may refer to time information and to location information, such as GPS data, transmitted from the personal information collection unit 220 to infer where the user 800 is likely to be in each time period. In this case, the personal information analysis unit 240 may determine that the user 800 spends a specific period of time at 'school', and may collect or infer the environment attribute information 'school' for that period.
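One simple way to derive such a time-of-day pattern is to count where the user was observed in each hour and keep the most frequent place. This is a sketch under assumed inputs: the `(hour, place)` observation format and the name `infer_pattern` are hypothetical, and a practical system would need to handle sparse or noisy location data.

```python
from collections import Counter, defaultdict

def infer_pattern(observations):
    """Map each hour of the day to the place most often observed at that hour.

    observations: iterable of (hour, place) pairs, e.g. derived from GPS logs.
    """
    by_hour = defaultdict(Counter)
    for hour, place in observations:
        by_hour[hour][place] += 1
    return {hour: counts.most_common(1)[0][0] for hour, counts in by_hour.items()}
```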

Preferably, the personal information analysis unit 240 analyzes the personal information of the user 800 that has already been collected or is being collected by means of big data techniques, but the present invention is not limited thereto.

FIG. 4 shows the personal information collection unit 220 and the personal information analysis unit 240 as being included in the terminal 200. According to FIG. 4, the terminal 200 may, through the personal information collection unit 220 and the personal information analysis unit 240, receive the personal information of the user 800 directly and infer user status information from the collected user behavior information. However, the speech recognition system 1000C according to the present invention is not limited to the configuration of FIG. 4, and the personal information analysis unit 240 may instead be included in the speech recognition server 300. Because the processing and storage capacity of the speech recognition server 300 far exceed those of the terminal 200, a personal information analysis unit 240 included in the speech recognition server 300 can infer personal information more readily than one included in the terminal 200.

Meanwhile, the user 800 may classify the collected personal information according to preset categories. This classification may be carried out on every device in which the personal information of the user 800 is stored. Preferably, the user 800 classifies the personal information into public personal information and private personal information according to whether it may be disclosed.

According to FIG. 4, the personal information inferred by the personal information analysis unit 240 or entered through the personal information collection unit 220 may be transmitted to the configuration controller 380 of the speech recognition server 300. The personal information transmitted at this time may consist only of the information that the user has permitted to be disclosed. The probability of each item of the personal information determined by the personal information analysis unit 240 may also be transmitted to the speech recognition server 300.
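The restriction to disclosed information can be sketched as a filter applied before transmission. The name `build_payload` and the dictionary layout (item name mapped to its probability) are assumptions for illustration only.

```python
def build_payload(personal_info, public_items):
    """Keep only the items the user has agreed to disclose.

    personal_info: dict mapping item name -> probability.
    public_items: collection of item names classified as public by the user.
    """
    return {item: prob for item, prob in personal_info.items() if item in public_items}
```

Private items never leave the terminal in this sketch, while the probabilities of the public items travel with them.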

The configuration controller 380 may select at least one of an acoustic model and a language model by referring to the transmitted personal information. The configuration controller 380 may also select at least one acoustic model and at least one language model by referring to the transmitted personal information. For example, when the configuration controller 380 receives the public personal information 'child', it may select an acoustic model associated with 'child'. The configuration controller 380 may also select a language model associated with 'child'; when the acoustic model unit 352 and the language model unit 354 hold both an acoustic model and a language model associated with 'child', it may select both models.

The configuration controller 380 may also use the user's pattern information inferred by the personal information analysis unit 240. As in the example above, when the user 800 is a student, the personal information analysis unit 240 may infer that the user 800 goes to 'school' during a specific time period and spends the day there. The configuration controller 380 may refer to this pattern information and to time information, and select the acoustic model and language model corresponding to 'school' when recognizing the speech of the user 800 during that time period.

Meanwhile, when the user 800 has entered no personal information at all, or when little or no personal information has been analyzed or inferred, the configuration controller 380 may select all available acoustic models and language models. The configuration controller 380 may likewise select all available acoustic models and language models when none is associated with the received personal information. Alternatively, when no acoustic model or language model is directly associated with the received personal information, the configuration controller 380 may select an acoustic model and a language model that approximate the received personal information. For example, if the personal information received by the speech recognition server 300 contains only 'student' but the language model unit 354 has no language model for 'student', the configuration controller 380 may select the 'youth' language model held by the speech recognition server 300.
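The selection and fallback behavior described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `ModelRegistry` class, the `select_models` function, and the `NEAREST_ATTRIBUTE` table are hypothetical names introduced here, and the rule that an empty match falls back to every available model is one reading of the text.

```python
from dataclasses import dataclass, field

# Hypothetical fallback table: maps an attribute that has no dedicated model
# to a related attribute (e.g. 'student' -> 'youth'). Illustrative only.
NEAREST_ATTRIBUTE = {"student": "youth"}


@dataclass
class ModelRegistry:
    """Models held by the recognition server, keyed by attribute label."""
    acoustic: dict = field(default_factory=dict)
    language: dict = field(default_factory=dict)


def _pick(models: dict, attributes: list) -> list:
    """Return labels of models matching the attributes, directly or approximately."""
    chosen = []
    for attr in attributes:
        # Prefer a direct match, then the approximate attribute if one is known.
        label = attr if attr in models else NEAREST_ATTRIBUTE.get(attr)
        if label in models and label not in chosen:
            chosen.append(label)
    # No information given, or nothing matched: use every available model.
    return chosen or list(models)


def select_models(registry: ModelRegistry, public_attributes: list) -> tuple:
    """Select acoustic and language model labels for one recognition request."""
    return (_pick(registry.acoustic, public_attributes),
            _pick(registry.language, public_attributes))
```

In this sketch the acoustic and language models are chosen independently, so a request can end up with a specific language model and the full set of acoustic models, which mirrors the case where only one of the two model types exists for an attribute.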

By selecting acoustic and language models that match the personal information in this way, the configuration controller 380 can personalize the acoustic model and the language model to the speech of the user 800. Using the personalized acoustic model and language model during recognition allows the speech recognition system 1000C to recognize speech more accurately.

The speech recognition server 300 may refer to a plurality of acoustic models and language models while performing speech recognition. The speech recognition server 300 may generate a plurality of character strings as speech recognition results; in this case, it may refer to the per-item probability values transmitted from the personal information analysis unit 240 and apply a weight based on them to each character string. The speech recognition server 300 may select, as the final speech recognition result, the character string with the highest probability value after the weights are applied.
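One way to write this selection rule, as a sketch in notation that does not appear in the original: let $k$ index the personal information items whose models were used, $p_k$ be the probability reported for item $k$, $w_k = f(p_k)$ be the weight derived from it by some function $f$, and $P_k(W)$ be the probability of candidate string $W$ under the models for item $k$. Assuming the weight is combined multiplicatively, the final result is

```latex
\hat{W} = \operatorname*{arg\,max}_{k,\,W} \; w_k \, P_k(W), \qquad w_k = f(p_k)
```

The multiplicative form and the function $f$ are assumptions made for illustration; the text only requires that the weight be based on the per-item probability.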

Meanwhile, the speech recognition system 1000C configured as shown in FIG. 4 can also be useful in the training stage in which acoustic models and language models are built. When a large number of randomly recorded speech signals are input to the terminal 200 through the input unit 212, the features of each speech signal are extracted and transmitted to the speech recognition server 300 together with the analyzed personal information. The speech recognition server 300 can build a variety of acoustic models and language models by performing the training stage with reference to the transmitted speech features and personal information. In particular, by selecting only the speech signals that correspond to a specific personal information item and using them for training, acoustic and language models specialized for that item can be built. For example, if the speech recognition system 1000C can separately identify the speech of elderly men through the personal information analysis unit 240, it can use only that speech to build an acoustic model and a language model specialized for elderly men, and these models can later be used in the recognition stage to analyze speech signals classified as 'elderly' or 'male'.
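The data selection step for attribute-specific training can be sketched as below. The function names, the `(features, attributes)` sample layout, and the 0.5 probability threshold are assumptions introduced for illustration; the actual training of acoustic and language models is outside the scope of this sketch.

```python
from collections import defaultdict


def group_by_attribute(samples, min_probability=0.5):
    """Group feature vectors by inferred attribute for attribute-specific training.

    `samples` is an iterable of (features, attributes) pairs, where `attributes`
    maps a label such as 'elderly' or 'male' to the probability estimated by
    the analysis step. The 0.5 threshold is an arbitrary illustrative value.
    """
    groups = defaultdict(list)
    for features, attributes in samples:
        for label, probability in attributes.items():
            if probability >= min_probability:
                groups[label].append(features)
    return dict(groups)


def select_for(samples, required_labels, min_probability=0.5):
    """Return the features of samples that carry every required label."""
    return [
        features
        for features, attributes in samples
        if all(attributes.get(label, 0.0) >= min_probability
               for label in required_labels)
    ]
```

For the elderly-male example, `select_for(samples, ["elderly", "male"])` would return only the recordings attributed to both labels, and that subset would then be passed to whatever training procedure the server uses.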

In the speech recognition system 1000C shown in FIG. 4, a plurality of speech recognition servers 300 may be provided to perform distributed speech recognition processing.

FIG. 5 is a diagram showing an embodiment of a speech recognition system 1000D that includes a private server 400.

According to FIG. 5, the speech recognition system 1000D according to an embodiment of the present invention may include a terminal 200, a speech recognition server 300, and a private server 400. The terminal 200 may include an input unit 212, a personal information collection unit 220, a personal information analysis unit 240, and an output unit 214. The speech recognition server 300 may include an acoustic model unit 352, a language model unit 354, an acoustic database 372, a language database 374, and a configuration controller 380. The private server 400 may include a feature extraction unit 430 and a personal information storage unit 460.

The feature extraction unit 430 included in the private server 400 may be identical to the feature extraction unit 230 included in the terminal 200 of FIG. 4.

Depending on the computing capability of the terminal 200, the feature extraction unit 430 may be included in the terminal 200, and at least one of the personal information collection unit 220 and the personal information analysis unit 240 may be included in the private server 400. The feature extraction unit 430 may also be included in the speech recognition server 300. A configuration in which the acoustic database 372 and the language database 374 reside outside the speech recognition server 300 is also possible.

Detailed descriptions of the components common to FIG. 5 and FIG. 4 are omitted to avoid repetition.

The private server 400 may receive the speech signal and personal information from the terminal 200, and may classify the personal information into preset categories and store it. The private server 400 may also transmit the speech signal and at least part of the stored personal information to the speech recognition server 300.

The feature extraction unit 430 of the private server 400 may extract features from the speech signal transmitted from the terminal 200 and transmit them to the speech recognition server 300. When transmitting the speech features to the speech recognition server 300, the private server 400 may encrypt them. When the feature extraction unit 430 is included in the speech recognition server 300, the private server 400 may transmit an encrypted speech signal to the speech recognition server 300. Because the private server 400 can encrypt the speech features or the speech signal in this way, it can prevent the leakage of personal information that could be inferred from unencrypted speech.

The personal information storage unit 460 of the private server 400 stores the personal information transmitted from the terminal 200. The personal information storage unit 460 may store personal information entered directly by the user 800, personal information inferred from the speech signal of the user 800, and other personal information inferred from the personal information of the user 800. Preferably, this personal information is transmitted from the personal information analysis unit 240. As described with reference to FIG. 4, the personal information analysis unit 240 may calculate a probability value for each item of personal information, and these probability values may also be stored in the personal information storage unit 460.

As described with reference to FIG. 4, the user 800 may classify personal information into preset categories, and this classification may be performed in the private server 400. By operating the terminal 200, the user 800 may classify the personal information stored in the terminal 200 and the private server 400 according to categories of the user's own choosing or according to preset categories, and may store the classified personal information in the personal information storage unit 460 of the private server 400. The private server 400 may classify and store the personal information of the user 800 as disclosable personal information and private personal information, but is not limited thereto.

As described above, the private server 400 may store the personal information of the user 800 and apply various encryption techniques to secure it. The private server 400 stores the user's personal information under a contract with the user 800, or after user authentication and the user's consent to the collection of personal information. Because the private server 400 is provided separately from the speech recognition server 300, it can prevent user information from leaking at the speech recognition server 300, which processes large volumes of speech recognition in a public domain. The personal information of the user 800 is freely exchanged between the terminal 200 and the private server 400. However, the security of the private server 400 keeps that personal information from leaving the private server 400 and leaking onto subsequent network connections. In particular, by transmitting to the speech recognition server 300 only the personal information that the user 800 has set as public, the private server 400 can prevent personal information that the user does not wish to disclose from leaking.
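The public-only forwarding rule can be sketched as a simple filter on the private server side. The `PersonalItem` record, the `build_outbound_payload` function, and the payload layout are hypothetical; encryption of the audio is assumed to have happened beforehand and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonalItem:
    label: str          # e.g. 'male', 'elderly'
    probability: float  # probability value reported by the analysis unit
    public: bool        # True only if the user allowed disclosure


def build_outbound_payload(encrypted_audio: bytes, stored_items: list) -> dict:
    """Assemble what the private server forwards to the recognition server."""
    # Items not marked public never leave the private server.
    disclosed = [item for item in stored_items if item.public]
    return {
        "audio": encrypted_audio,  # assumed to be encrypted already
        "attributes": {item.label: item.probability for item in disclosed},
    }
```

Keeping the filter at the single point where the outbound message is built means that adding a new private item to storage cannot accidentally expose it, since disclosure depends only on the `public` flag.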

The advantages obtained when the private server 400 sits between the terminal 200 and the speech recognition server 300, as in FIG. 5, are as follows. Suppose, for example, that the user 800 wants to use a speech recognition service but is worried about the leakage of personal information that speech recognition entails, such as leakage of the speech signal. The user 800 can transmit personal information and the speech signal to a highly trusted private server 400 operator, and that operator can transmit to the server 300 providing the speech recognition service only at least one of the encrypted speech signal and the encrypted speech features, together with the personal information that the user has permitted to be disclosed. The speech recognition service provider 300 can only extract character strings using the speech signal and the disclosed personal information; it cannot determine who the speaker actually is or what characteristics the user has, so the user's personal information is protected. In addition, because the private server 400 adds an intermediate stage between the terminal 200 and the speech recognition server 300, the steps of speech recognition can be distributed across the components, reducing the load on the terminal 200 and the speech recognition server 300.

In the speech recognition system 1000D shown in FIG. 5, a plurality of speech recognition servers 300 may be provided to perform distributed speech recognition processing.

Meanwhile, the speech recognition server 300 may refer to the personal information transmitted from the private server 400 and select an acoustic model and a language model that match it, thereby raising the recognition success rate. As in the case of FIG. 4, when a plurality of acoustic models and language models are used in the recognition process and a plurality of character strings are generated as recognition results, the speech recognition server 300 may refer to the per-attribute probabilities transmitted from the personal information analysis unit 240 and apply a weight based on them to each character string. The speech recognition server 300 may select, as the final speech recognition result, the character string with the highest probability value after the weights are applied.

Although FIG. 5 shows the speech recognition result generated by the speech recognition server 300 being transmitted to the terminal 200 by way of the private server 400, the invention is not limited to this; the speech recognition result may also be transmitted directly from the speech recognition server 300 to the terminal 200. The speech recognition server 300 may also transmit the speech recognition result to devices other than the terminal 200 and the private server 400.

FIG. 6 is a diagram showing another embodiment 1000E of a speech recognition system that includes a private server.

According to FIG. 6, the speech recognition system 1000E according to an embodiment of the present invention may include a terminal 200, a speech recognition server 300, and a private server 400. The terminal 200 may include an input unit 212, a personal information collection unit 220, a personal information analysis unit 240, and an output unit 214. The speech recognition server 300 may include an acoustic model unit 352, a language model unit 354, an acoustic database 372, a language database 374, and a configuration controller 380. The private server 400 may include a feature extraction unit 430, a personal information storage unit 460, and a result recomputation unit 490.

Depending on the computing capability of the terminal 200, the feature extraction unit 430 may be included in the terminal 200, and at least one of the personal information collection unit 220 and the personal information analysis unit 240 may be included in the private server 400. The feature extraction unit 430 may also be included in the speech recognition server 300. To distribute the computational load evenly across the terminal 200, the speech recognition server 300, and the private server 400 that make up the speech recognition system 1000E, the result recomputation unit 490 may be included in at least one of the terminal 200 and the speech recognition server 300. A configuration in which the acoustic database 372 and the language database 374 reside outside the speech recognition server 300 is also possible.

Detailed descriptions of the components common to FIG. 6 and FIG. 5 are omitted to avoid repetition.

The speech recognition server 300 may generate a plurality of speech recognition results. Here, the plurality of speech recognition results may include a set of character strings in a lattice structure generated by the language model unit 354. The speech recognition server 300 may transmit the plurality of speech recognition results to the private server 400.

In doing so, the speech recognition server 300 may also transmit type information for the acoustic models and language models used in the recognition process, and each speech recognition result may include the type information of its acoustic model and language model. The type information may be used to distinguish which acoustic model and language model each speech recognition result came from. Preferably, the type information is transmitted from the configuration controller 380 to the private server 400, but the invention is not limited to this; it may also be transmitted from the acoustic model unit 352, the language model unit 354, or another component of the speech recognition server 300.

The result recomputation unit 490 may select the optimal speech recognition result from among the speech recognition results transmitted from the speech recognition server 300. When the speech recognition server 300 transmits a plurality of speech recognition results together with the type information of the acoustic model and language model used for each, the result recomputation unit 490 may use that type information to pick out the optimal result. In doing so, the result recomputation unit 490 may refer to the personal information of the user 800 stored in the personal information storage unit 460. The personal information referred to may include both the personal information that the user 800 has set as public and the personal information that has not been set as public. A specific example of how the result recomputation unit 490 picks the optimal speech recognition result follows.

First, assume that the user 800 speaks English, and that the private server 400 holds the disclosed personal information 'male' and also stores the private personal information 'elderly' and 'Texas dialect (United States)'. The private server 400 may transmit the disclosed personal information 'male' to the speech recognition server 300. The configuration controller 380 of the speech recognition server 300 could perform speech recognition using generalized acoustic and language models. For more accurate recognition, however, the configuration controller 380 may select the acoustic model and language model corresponding to the personal information 'male'. In addition, the configuration controller 380 may select acoustic and language models built from regional dialect speech data, such as 'New Jersey dialect' and 'Boston dialect', and may also select acoustic and language models for various age groups. If the speech recognition server 300 holds a language model for 'Texas dialect' but no associated acoustic model, the configuration controller 380 may select only the 'Texas dialect' language model. If the speech recognition server 300 holds no acoustic model for 'Texas dialect' but does hold acoustic models for regions geographically close to Texas, such as New Mexico, Oklahoma, Arkansas, and Louisiana, the configuration controller 380 may select the acoustic models of those regional dialects. The configuration controller 380 may select every kind of acoustic model and language model held by the speech recognition server 300. Even when the speech recognition server 300 holds no acoustic model or language model matching the disclosed personal information, the configuration controller 380 may select every kind of acoustic model and language model held by the speech recognition server 300. When the acoustic and language models for 'male', 'elderly', 'child', 'young adult', 'Texas dialect', 'New Jersey dialect', and 'Boston dialect' have been used in the recognition process and a speech recognition result has been generated for each, the speech recognition server 300 transmits these results and their type information to the private server 400. The result recomputation unit 490 of the private server 400 may select, as the final speech recognition result, at least one of the result corresponding to the disclosed personal information 'male' and the results corresponding to the private personal information 'elderly' and 'Texas dialect', and may select all three. The result recomputation unit 490 may also select, as the final speech recognition result, the character string with the highest probability among all the speech recognition results.

According to FIG. 5, the per-item probability values of the personal information determined by the personal information analysis unit 240 may be stored in the personal information storage unit 460. In FIG. 6, likewise, the per-item probability values of the personal information may be stored in the personal information storage unit 460. Of course, the various kinds of information generated by the personal information analysis unit 240 may also be transmitted directly to the result recomputation unit 490 without passing through the personal information storage unit 460.

When a plurality of acoustic models and language models are used in the recognition process and a plurality of character strings are generated as speech recognition results, the result recomputation unit 490 may refer to the per-item probability values and apply a weight based on them to each character string. The result recomputation unit 490 may select, as the final speech recognition result, the character string with the highest probability value after the weights are applied.

In the preceding example, the result recomputation unit 490 may form weight 1, weight 2, and weight 3 from the probabilities of 'male', 'elderly', and 'Texas dialect', respectively, and apply each weight to the corresponding result. The result recomputation unit 490 may obtain a final probability by multiplying the probability value of each word sequence produced by the 'male' acoustic and language models by weight 1, and the other results are handled in the same way. The final probability may, however, be obtained in various ways and is not limited to this multiplication. The result recomputation unit 490 may select the word sequence with the highest final probability among all the computed values.
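The multiplicative rescoring in this example can be sketched as follows. The `rescore` function and the `(text, model_type, probability)` tuple layout are hypothetical, the weight is taken to be the stored attribute probability itself, and results from models that match none of the user's stored attributes are given weight 0.0; all of these are illustrative choices rather than details from the description.

```python
def rescore(candidates, attribute_probabilities):
    """Pick the best candidate after weighting by attribute probability.

    `candidates` is a list of (text, model_type, probability) tuples, where
    `model_type` is the type information naming the model that produced the
    text. `attribute_probabilities` maps the user's stored attributes, public
    and private, to their probability values.
    """
    best_text, best_score = None, float("-inf")
    for text, model_type, probability in candidates:
        # Weight = stored probability of the matching attribute (assumption);
        # model types unrelated to the user contribute nothing.
        weight = attribute_probabilities.get(model_type, 0.0)
        score = weight * probability
        if score > best_score:
            best_text, best_score = text, score
    return best_text, best_score
```

Because the private attributes are consulted only inside this function, the sketch is consistent with running the rescoring on the private server, where 'elderly' and 'Texas dialect' are available without ever being sent to the recognition server.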

When the user 800 receives a plurality of final speech recognition results, the user may check the content of each result, select the one that best matches the user's intent, and send that selection to the speech recognition system 1000E. Alternatively, the user 800 may rate the accuracy of every final speech recognition result and send the accuracy rating of each result to the speech recognition system 1000E. The speech recognition system 1000E may then generate user feedback information that includes the user's selection of a final speech recognition result and the accuracy ratings. Based on the user feedback information, the speech recognition system 1000E may assign accuracy weights to the acoustic models and language models, and these accuracy weights may be used in subsequent recognition to improve accuracy. For example, the speech recognition system 1000E may add the accuracy weight to the probability of each word sequence produced as a speech recognition result and select the word sequence with the highest probability as the final speech recognition result.
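A feedback loop of this kind might be sketched as below. The `FeedbackWeights` class, the add-one smoothing used to turn selection counts into an accuracy weight, and the `pick_with_feedback` helper are all assumptions introduced for illustration; the description does not specify how the accuracy weight is computed.

```python
from collections import defaultdict


class FeedbackWeights:
    """Running accuracy weight per model type, built from user feedback."""

    def __init__(self):
        self._selected = defaultdict(int)
        self._shown = defaultdict(int)

    def record(self, shown_model_types, selected_model_type):
        """Record one interaction: which results were shown and which was chosen."""
        for model_type in shown_model_types:
            self._shown[model_type] += 1
        self._selected[selected_model_type] += 1

    def weight(self, model_type):
        """Fraction of times this model's result was chosen, with add-one smoothing."""
        return (self._selected[model_type] + 1) / (self._shown[model_type] + 2)


def pick_with_feedback(candidates, weights):
    """Add the accuracy weight to each candidate's probability and take the maximum.

    `candidates` is a non-empty list of (text, model_type, probability) tuples.
    """
    return max(candidates, key=lambda c: c[2] + weights.weight(c[1]))[0]
```

With this scheme a model type that has never been shown starts at a neutral weight of 0.5, and repeated user selections gradually raise the weight of the models whose results the user prefers.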

FIG. 7 is a diagram showing another embodiment 1000F of a speech recognition system that includes a terminal and a speech recognition server.

According to FIG. 7, the speech recognition system 1000F according to an embodiment of the present invention may include a terminal 200 and a speech recognition server 300. The terminal 200 may include an input unit 212, a personal information collection unit 220, a feature extraction unit 230, a personal information analysis unit 240, a personal information storage unit 260, a result recomputation unit 290, and an output unit 214. The speech recognition server 300 may include an acoustic model unit 352, a language model unit 354, an acoustic database 372, a language database 374, and a configuration controller 380. The speech recognition server 300 may include at least one speech signal analysis unit 250 that includes the acoustic model unit 352 and the language model unit 354.

The feature extraction unit 230, the personal information storage unit 260, and the result recalculation unit 290 included in the terminal 200 of FIG. 7 may be identical to the feature extraction unit 430, the personal information storage unit 460, and the result recalculation unit 490 included in the private server 400 of FIG. 6.

Depending on the computing capability of the terminal 200, at least one of the personal information collection unit 220, the personal information analysis unit 240, the result recalculation unit 290, the personal information storage unit 260, and the feature extraction unit 230 may instead be included in the speech recognition server 300. A configuration in which the acoustic database 372 and the language database 374 reside outside the speech recognition server 300 is also possible.

Detailed descriptions of the components common to FIG. 7 and FIG. 6 are omitted to avoid repetition.

The terminal 200 of FIG. 7 includes the feature extraction unit 430, the personal information storage unit 460, and the result recalculation unit 490 that were included in the private server 400 of FIG. 6, so that the terminal 200 itself also handles personal information analysis and selection of the final speech recognition result. In particular, the terminal 200 of FIG. 7 is a configuration suited to the high-performance smartphones recently available on the market, and the same system structure can also be applied to personal computers used at home.

In the speech recognition system 1000F of FIG. 7, the terminal 200 exchanges only the voice signal, the personal information set as public, and the speech recognition results with the speech recognition server 300. The speech recognition system 1000F of FIG. 7 also has a simple structure in which no speech recognition stage exists outside the terminal 200 and the speech recognition server 300, and this simplicity is an advantage of the system structure. In terms of security as well, each user only needs to take care of the personal information stored in his or her own terminal 200 and does not need to arrange any separate security measures. The speech recognition system 1000F of FIG. 7 merges the terminal 200 and the private server 400 of FIG. 6 into one, which is a particular strength when processing personal information. In the case of FIG. 6, the personal information is stored separately in the private server 400, so an additional step of connecting to the private server 400 may be required when personal information must be deleted or modified at the request of the user 800. In the case of FIG. 7, however, the user 800 can easily manage personal information directly through the terminal 200. In addition, during the result recalculation process, the user 800 can easily select a result according to his or her own preference.

FIG. 8 is a diagram illustrating an embodiment of a speech recognition system 1000G that transmits the speech recognition result of a first user 800a to a second user 800b.

Referring to FIG. 8, the speech recognition system 1000G according to an embodiment of the present invention may include a first terminal 500, a speech recognition server 300, a private server 400, and a second terminal 600. The first terminal 500 may include an input unit 512, a personal information collection unit 520, and a personal information analysis unit 540. The second terminal 600 may include a signal receiving unit 610, a translation unit 620, a result selection unit 630, an output signal selection unit 640, a voice signal conversion unit 650, an attribute storage unit 652, and an output unit 614.

The speech recognition server 300 and the private server 400 of FIG. 8 are the same as the speech recognition server 300 and the private server 400 of FIG. 6, and the output unit 614 included in the second terminal 600 may be identical to the output unit 214 included in the terminal 200 of FIG. 6. Like the terminal 200 of FIG. 6, the first terminal 500 may also include its own output unit, so that the first user 800a can check his or her own speech recognition result.

Detailed descriptions of the components common to FIG. 8 and FIG. 6 are omitted to avoid repetition.

The signal receiving unit 610 receives the speech recognition result of the first user 800a. The signal receiving unit 610 may receive a plurality of speech recognition results. In addition, the signal receiving unit 610 may receive the personal information of the first user 800a and the voice features of the first user 800a. In this case, the signal receiving unit 610 may receive from the private server 400 only the personal information that the first user 800a has set as public. The voice features of the first user 800a may be those extracted from the voice signal of the first user 800a by the feature extraction unit 430 and stored in the private server 400. The personal information of the first user 800a may be used when the second terminal 600 outputs speech.

When the language of the first user 800a differs from the language used by the second user 800b, the translation unit 620 translates the speech recognition result of the first user 800a into the language of the second user 800b. The translation unit 620 may generate a plurality of translation results for a plurality of speech recognition results. In addition, the translation unit 620 may perform a normalization process on the speech recognition result. Here, normalization refers to converting the speech recognition result into a character string that conforms to standard usage. However, the invention is not limited to this arrangement, and the normalization process may be handled by another component of the speech recognition system 1000G.

The result selection unit 630 selects at least one of the plurality of speech recognition results and the plurality of translation results. The result selection unit 630 may select a result according to a selection input from the second user 800b. The result selection unit 630 may also refer to the personal information of the second user 800b and select a result consistent with it. The result selection unit 630 may be merged with the translation unit 620 into a single component.

The output signal selection unit 640 determines the output format used when outputting the selected result. The output signal selection unit 640 may receive from the second user 800b an input specifying the output format, such as "video output" or "audio output". The output signal selection unit 640 may also determine the output format according to the configuration of the second terminal 600. For example, when the second terminal 600 has no separate video output means and is equipped only with an audio output means such as a speaker, the output signal selection unit 640 selects the audio output format.
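The decision made by the output signal selection unit 640 can be illustrated with a short sketch. The function name `choose_output_format`, the two-valued `OutputFormat` enumeration, and the fallback order (display first, then speaker) are assumptions made for illustration; the description only states that the format may follow either the user's input or the capabilities of the second terminal 600.

```python
from enum import Enum
from typing import Optional


class OutputFormat(Enum):
    VIDEO = "video"
    AUDIO = "audio"


def choose_output_format(has_display: bool,
                         has_speaker: bool,
                         requested: Optional[OutputFormat] = None) -> OutputFormat:
    """Pick an output format from the device capabilities and an optional user request."""
    available = []
    if has_display:
        available.append(OutputFormat.VIDEO)
    if has_speaker:
        available.append(OutputFormat.AUDIO)
    if not available:
        raise ValueError("terminal has no output means")
    # Honor the user's request when the terminal can actually produce it.
    if requested in available:
        return requested
    # Otherwise fall back to whatever the terminal supports (assumed order).
    return available[0]
```

A speaker-only terminal therefore always yields the audio format, even if the user asked for video output.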

The output signal selection unit 640 may decide to output the selected result as speech, in which case it may transmit the selected result to the voice signal conversion unit 650. The voice signal conversion unit 650 converts the selected result into a voice signal. In doing so, the voice signal conversion unit 650 may generate the output speech by referring to the personal information and voice feature information of the first user 800a received by the signal receiving unit 610. In other words, if the first user 800a is a Korean woman and the second user 800b is an American, the second user 800b can hear what the first user 800a said in Korean as English speech, and that speech can retain the Korean female voice characteristics particular to the first user 800a.

Meanwhile, the attribute storage unit 652 may store voice features and environment features. The voice features may include the voice features of celebrities, and the environment features may include the reverberation characteristics and spatial information of various spaces.

The voice signal conversion unit 650 may generate speech by referring to the voice features and environment features stored in the attribute storage unit 652. For example, by referring to the information in the attribute storage unit 652, the voice signal conversion unit 650 may output the selected result in the voice of a famous entertainer, or may generate a voice that carries the characteristics of a highly reverberant concert hall.

The speech recognition system 1000G configured as in FIG. 8 can be used as a real-time interpretation system. That is, the first user 800a and the second user 800b can each speak freely in their own language and hear what the other party said in their own language. FIG. 8 shows only the structure in which the speech recognition result is delivered from the first user 800a to the second user 800b, but the invention is not limited to this, and a bidirectional system in which both users can speak and listen at the same time is also possible. That is, a bidirectional interpretation system can be implemented if the first user 800a and the second user 800b each have their own private server 400 and each have a terminal that combines the first terminal 500, which serves as the input means of the speech recognition system, with the second terminal 600, which serves as the output means.

Similarly to FIG. 8, a real-time interpretation system without the private server 400 may be constructed by adding a second terminal to the configuration of FIG. 7.

FIG. 9 is a diagram illustrating a speech recognition method according to an embodiment of the present invention.

Referring to FIG. 9, the speech recognition method according to the present invention may include receiving a voice signal from a user (S100), collecting personal information of the user (S200), generating speech recognition results from the voice signal based on the voice signal and the personal information (S300), selecting a final speech recognition result from the speech recognition results (S400), and outputting the final speech recognition result (S500).

Receiving a voice signal from the user (S100) is the step in which the voice signal, the basic information for speech recognition, is received through a means such as a microphone.

Collecting personal information of the user (S200) is the step of obtaining personal information for raising the speech recognition success rate. The personal information may include user behavior information collected from records of user actions and from the results of measuring user actions, and user state information indicating the user's own identity information and the user's situation. Collecting personal information of the user (S200) may further include acquiring personal information directly entered by the user (S220) and inferring user state information by analyzing at least one of the voice signal and the collected user behavior information (S240). Collecting personal information (S200) may be performed before receiving the voice signal from the user (S100), or may be performed after receiving the voice signal from the user (S100) has been completed.

Generating speech recognition results from the voice signal based on the voice signal and the personal information (S300) may additionally include selecting an acoustic model and a language model by referring to the personal information that the user has set as public (S320). By referring to personalized acoustic and language models, step S300 can generate speech recognition results with high accuracy.
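Step S320 can be pictured as a lookup keyed only by the public part of the personal information. The sketch below is a minimal illustration under stated assumptions: the `PersonalItem` record, the category names (`language`, `environment`, `interest`), the model labels, and the default-model naming are all hypothetical and do not come from the description.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class PersonalItem:
    category: str   # hypothetical category name, e.g. "language"
    value: str
    public: bool    # True if the user has set this item as public


# Hypothetical model catalogs keyed by (language, context).
ACOUSTIC_MODELS: Dict[Tuple[str, Optional[str]], str] = {
    ("ko", "car"): "am_ko_car",
    ("ko", "home"): "am_ko_home",
}
LANGUAGE_MODELS: Dict[Tuple[str, Optional[str]], str] = {
    ("ko", "medicine"): "lm_ko_medical",
}


def public_view(items: Iterable[PersonalItem]) -> Dict[str, str]:
    # Only items set as public are visible to the model selection step.
    return {item.category: item.value for item in items if item.public}


def select_models(items: Iterable[PersonalItem]) -> Tuple[str, str]:
    info = public_view(items)
    language = info.get("language", "en")
    acoustic = ACOUSTIC_MODELS.get(
        (language, info.get("environment")), f"am_{language}_default")
    language_model = LANGUAGE_MODELS.get(
        (language, info.get("interest")), f"lm_{language}_default")
    return acoustic, language_model
```

Because private items are filtered out before the lookup, a private interest cannot influence which language model is chosen; it can only be used later, when the final result is selected.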

Meanwhile, step S300 may generate a plurality of speech recognition results, and each speech recognition result may have its own probability value.

Step S300 may also indicate, together with each speech recognition result, the types of the acoustic model and language model that were used to generate it.

Selecting a final speech recognition result from the speech recognition results (S400) is the step of selecting the optimal character string from the plurality of speech recognition results. Step S400 may select a speech recognition result using both the user's public personal information and the user's private personal information. Step S400 may also add a weight based on the personal information to the probability of each of the plurality of speech recognition results and select the speech recognition result that has the highest resulting probability value.
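One way to read step S400 is as a rescoring pass over the candidate strings. The sketch below assumes, purely for illustration, that the personal-information weight is a fixed bonus for every word of a candidate that also appears in a set of terms drawn from the user's public and private personal information (for example, names in a contact list). The function name, the bonus value, and the word-matching rule are assumptions, not the method defined in the description.

```python
from typing import Iterable, Set, Tuple


def select_final_result(results: Iterable[Tuple[str, float]],
                        personal_terms: Set[str],
                        bonus: float = 0.1) -> str:
    """Return the candidate text with the highest weighted probability.

    results: (text, probability) pairs produced by the recognizer.
    personal_terms: lowercase words taken from public and private personal information.
    """
    def weighted(item: Tuple[str, float]) -> float:
        text, probability = item
        words = set(text.lower().split())
        # Assumed weighting: one bonus per word that matches a personal term.
        return probability + bonus * len(words & personal_terms)

    best_text, _ = max(results, key=weighted)
    return best_text
```

With no personal terms the recognizer's own ranking is kept; with a matching term, a slightly less probable candidate can overtake the top one.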

Outputting the final speech recognition result (S500) outputs the speech recognition result using text, video, sound, or the like. In step S500, the speech recognition result may be presented to another user.

Although not shown in FIG. 9, the method may additionally include transmitting the final speech recognition result to another user, translating the final speech recognition result into the language used by that user, and outputting it to that user through video or sound.

By using a speech recognition method such as the one shown in FIG. 9, personalized acoustic and language models can be referenced, which can raise the speech recognition success rate.

Although the present invention has been described above through specific embodiments, those skilled in the art will be able to make modifications and changes without departing from the spirit of the invention. Accordingly, anything that a person skilled in the art to which the invention pertains can readily infer from the detailed description and the embodiments should be construed as falling within the scope of the invention.

Mode for Carrying Out the Invention

The relevant details have been described above in the best mode for carrying out the invention.

The present invention can be used in terminals and speech recognition systems such as speech recognition applications for smartphones and PCs, telemarketing, home appliances with speech recognition functions, speech-recognizing vehicles, and voice-based interpreters that operate in real time.

Claims

A speech recognition system comprising:
a terminal that receives a voice signal from a user and collects personal information of the user;
a private server that receives the voice signal and the personal information from the terminal, classifies the personal information into predetermined categories and stores it, and transmits the voice signal and at least some of the stored personal information to a speech recognition server; and
the speech recognition server, which performs speech recognition based on the voice signal and the personal information transmitted from the private server and generates a speech recognition result,
wherein the personal information transmitted from the private server to the speech recognition server is personal information that the user has set as public, and
wherein the speech recognition server comprises:
an acoustic model unit that selects at least one of phonemes, syllables, and words corresponding to the voice signal;
a language model unit that forms a character string by referring to the sentence structure of a language; and
a configuration controller that selects the acoustic model and the language model to be used by the acoustic model unit and the language model unit in the speech recognition process.

The speech recognition system according to claim 1,
wherein the configuration controller selects at least one acoustic model and at least one language model by referring to the personal information transmitted to the speech recognition server.

The speech recognition system according to claim 1,
wherein the personal information includes:
user behavior information collected from records of user actions and from the results of measuring user actions, and user state information indicating the user's own identity information and the user's situation,
wherein the user behavior information comprises:
user online activity information collected from the user's online activity and Internet usage records,
user location information indicating the user's actual location,
user connection information, which is communication identification information of the user, and
user device information collected in the course of interaction between the user and the terminal, and
wherein the user state information comprises:
user attribute information indicating the user's identity information and the user's personality, physical, and emotional state, and
environment attribute information indicating the characteristics of the surrounding environment in which the user is located.

The speech recognition system according to claim 3,
wherein the terminal
receives the user state information directly from the user or infers the user state information from at least one of the voice signal and the user behavior information.

The speech recognition system according to claim 3,
wherein the private server
receives the user state information directly from the user or infers the user state information from at least one of the voice signal and the user behavior information.

The speech recognition system according to claim 1,
wherein the speech recognition server
derives a plurality of speech recognition results and transmits them to the private server together with type information on the acoustic model and the language model used in the speech recognition process.

The speech recognition system according to claim 6,
wherein the private server
selects at least one of the plurality of speech recognition results transmitted from the speech recognition server,
the selection being made using the public personal information and the private personal information.

The speech recognition system according to claim 6,
wherein the private server
selects at least one of the plurality of speech recognition results transmitted from the speech recognition server
by adding a weight based on the public personal information and the private personal information to the probability value of each of the plurality of speech recognition results and selecting the speech recognition result that has the highest resulting probability value.

A speech recognition method comprising:
receiving a voice signal from a user;
collecting personal information of the user;
generating a speech recognition result from the voice signal based on the voice signal and the personal information; and
selecting a final speech recognition result from the speech recognition result,
wherein generating the speech recognition result from the voice signal based on the voice signal and the personal information further comprises selecting an acoustic model and a language model with reference to the personal information that the user has set as public.

The method of claim 9,
wherein collecting the personal information of the user further comprises:
acquiring personal information directly input by the user; and
inferring user state information from at least one of the voice signal and user behavior information.

The method of claim 9,
wherein generating the speech recognition result from the voice signal based on the voice signal and the personal information comprises
generating a plurality of speech recognition results and, for each of the plurality of speech recognition results, generating type information on the acoustic model and the language model used in the speech recognition.

The method of claim 11,
wherein selecting the final speech recognition result comprises
selecting the final speech recognition result using the public personal information and the private personal information.

The method of claim 11,
wherein selecting the final speech recognition result comprises
adding a weight based on the public personal information and the private personal information to the probability of each of the plurality of speech recognition results and selecting the speech recognition result that has the highest resulting probability value.