KR20190133325A

KR20190133325A - Speech recognition method and apparatus

Info

Publication number: KR20190133325A
Application number: KR1020180058093A
Authority: KR
Inventors: 이재석
Original assignee: 카페24 주식회사
Priority date: 2018-05-23
Filing date: 2018-05-23
Publication date: 2019-12-03
Also published as: KR102114365B1

Abstract

One aspect of the present invention discloses a voice recognition device that recognizes voice according to a mode. The voice recognition device comprises: a signal acquisition unit that acquires a voice signal; a signal analyzer that analyzes the acquired voice signal, determines whether it is a first signal associated with a person's normal voice or a second signal associated with a whisper, and calls at least one of a first signal recognition unit and a second signal recognition unit; the first signal recognition unit, which recognizes the first signal according to a first voice recognition algorithm in response to a call from the signal analyzer; and the second signal recognition unit, which recognizes the second signal according to a second voice recognition algorithm in response to a call from the signal analyzer.

Description

Speech recognition method and apparatus {SPEECH RECOGNITION METHOD AND APPARATUS}

The present invention relates to a voice recognition method and, more particularly, to an information input method based on voice recognition.

Recently, as voice recognition technology has advanced, many voice-recognition-based functions have been built into smartphones. Typically, most smartphones provide a voice command input function, such as Siri on the iPhone. Such a voice interface is more natural and intuitive than a touch interface and has therefore drawn attention as a next-generation interface that can compensate for the shortcomings of the touch interface.

For most people, speaking loudly to a machine in a public place is embarrassing and unnatural. As a result, a voice interface is difficult to use in public places that are crowded or where quiet is required. This is regarded as the greatest drawback of the voice interface and as a major obstacle to its wider adoption. For this reason, the voice interface is mainly used only in very limited situations where the user is alone, such as in a car. Accordingly, there is a need for a method that allows a voice interface to be used freely in public places without disturbing others.

An object according to one aspect of the present invention, made to solve the above problem, is to provide a voice recognition device that acquires a voice input from a user, analyzes the intensity and vocal range of the voice, determines on that basis whether the input is voiced speech produced with the vocal cords or a whisper, and recognizes the voice according to the corresponding mode.

To achieve the above object, a voice recognition device according to one aspect of the present invention may include: a signal acquisition unit that acquires a voice signal; a signal analyzer that analyzes the acquired voice signal, determines whether it is a first signal associated with a person's normal voice or a second signal associated with a whisper, and calls at least one of a first signal recognition unit and a second signal recognition unit; the first signal recognition unit, which recognizes the first signal according to a first voice recognition algorithm in response to a call from the signal analyzer; and the second signal recognition unit, which recognizes the second signal according to a second voice recognition algorithm in response to a call from the signal analyzer.

The voice recognition device may further include an information input unit that generates and inputs command information based on text-based information recognized by at least one of the first signal recognition unit and the second signal recognition unit.

The information input unit may include a command conversion unit that converts the text-based information recognized by at least one of the first signal recognition unit and the second signal recognition unit into a command, and a command input unit that inputs the converted command.

The command conversion unit may convert the text into a command using a pre-stored command model.

The recognized text information may be displayed on a display unit.

A given character in the text information displayed on the display unit may be corrected by character input through a user interface.

A given character in the text information displayed on the display unit may be selected and corrected by voice re-input through the user interface.

An icon for text correction may be displayed near the text shown on the display unit.

The command model distinguishes important words from non-important words, and when the converted command is displayed, the important words and non-important words may be displayed so that they can be told apart.

The information input unit may be automatically activated when no call is in progress.

The signal analyzer may determine whether the acquired voice signal is the first signal or the second signal by analyzing at least one of its intensity and its vocal range.

The signal analyzer may determine whether the acquired voice signal is the first signal or the second signal based on at least one of whether its intensity is greater than a first threshold and whether its vocal range is greater than a second threshold.

At least one of the first threshold and the second threshold may vary according to the level of noise around the voice recognition device.

At least one of the first threshold and the second threshold may be set based on a pre-stored user voice profile.

The second signal recognition unit may recognize the second signal using a voice model associated with whispered voice signal recognition.

The voice recognition device may further include an air pressure sensor that senses the pressure of air expelled by the user, and the signal analyzer may determine whether the signal is the first signal or the second signal based on whether the pressure value from the air pressure sensor is greater than a third threshold.

The air pressure sensor may be disposed within a 2 cm radius of the voice signal acquisition unit.

While the voice recognition device is in a call, the first signal recognition unit may be deactivated and only the second signal recognition unit activated, and the voice recognition device may further include a signal conversion unit that converts the information recognized by the second signal recognition unit into a normal voice signal.

The voice recognition device may further include a storage unit that holds a first user voice signal containing the normal voice signal characteristics of a first user and a second user voice signal containing the normal voice signal characteristics of a second user, and the signal conversion unit may determine whether the acquired voice signal corresponds to the first user voice signal or the second user voice signal and convert it into the corresponding voice signal.

The signal conversion unit may convert the signal into a normal voice signal corresponding to the characteristics of the acquired voice signal.

Within one continuous utterance, the signal analyzer may identify a first section as a section to be processed by the first signal recognition unit and a second section as a section to be processed by the second signal recognition unit, and call the first signal recognition unit and the second signal recognition unit accordingly.

The first signal recognition unit may be implemented by a first processor and the second signal recognition unit by a second processor, so that recognition of the first and second sections of the one continuous utterance may be performed substantially simultaneously and in parallel.

To achieve the above object, a voice recognition method according to one aspect of the present invention may include: acquiring a voice signal; analyzing the acquired voice signal to determine whether it is a first signal associated with a person's normal voice or a second signal associated with a whisper, and calling at least one of a first signal recognition unit and a second signal recognition unit; recognizing, in the first signal recognition unit and in response to a call from the signal analyzer, the first signal according to a first voice recognition algorithm; and recognizing, in the second signal recognition unit and in response to a call from the signal analyzer, the second signal according to a second voice recognition algorithm.

To achieve the above object, a voice recognition device according to another aspect of the present invention may include: a signal acquisition unit that acquires a voice signal; a signal analyzer that analyzes the acquired voice signal, determines whether it is a first signal associated with a person's normal voice or a second signal associated with a whisper, and calls at least one of a first signal recognition unit and a second signal recognition unit; the first signal recognition unit, which recognizes the first signal according to a first voice recognition algorithm in response to a call from the signal analyzer; and the second signal recognition unit, which recognizes the second signal according to a second voice recognition algorithm in response to a call from the signal analyzer. The device may further include a signal conversion unit that, depending on whether a call is in progress, converts the information recognized by at least one of the first signal recognition unit and the second signal recognition unit into a text-based input signal or a normal voice signal.

To achieve the above object, a voice recognition method according to another aspect of the present invention may include: acquiring a voice signal; analyzing the acquired voice signal to determine whether it is a first signal associated with a person's normal voice or a second signal associated with a whisper, and calling at least one of a first signal recognition unit and a second signal recognition unit; recognizing, in the first signal recognition unit and in response to a call from the signal analyzer, the first signal according to a first voice recognition algorithm; and recognizing, in the second signal recognition unit and in response to a call from the signal analyzer, the second signal according to a second voice recognition algorithm. The method may further include converting, depending on whether a call is in progress, the signal obtained from at least one of the first signal recognition unit and the second signal recognition unit into a text-based input signal or a normal voice signal.
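
The call-state-dependent behavior described in the aspects above can be illustrated with a minimal sketch. It assumes one reading of the text: during a call only the second (whisper) signal recognition unit is active and the result is converted to a normal voice signal, and otherwise the result becomes a text-based input signal. The names `ConvertedOutput`, `active_recognizers`, and `convert_output` are hypothetical, and the `content` field merely carries the recognized text; an actual device would synthesize audio for the normal voice signal.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ConvertedOutput:
    """Result of the (hypothetical) signal conversion step."""
    kind: str      # "normal_voice_signal" or "text_input_signal"
    content: str   # recognized text; stands in for synthesized audio when in a call


def active_recognizers(in_call: bool) -> FrozenSet[str]:
    """Recognition units enabled for the given call state (assumed policy)."""
    if in_call:
        # During a call, only the second (whisper) recognition unit is active.
        return frozenset({"second"})
    return frozenset({"first", "second"})


def convert_output(recognized: str, in_call: bool) -> ConvertedOutput:
    """Choose the output form of the recognized information from the call state."""
    kind = "normal_voice_signal" if in_call else "text_input_signal"
    return ConvertedOutput(kind, recognized)
```

For example, `convert_output("call me later", in_call=True)` yields an output of kind `"normal_voice_signal"`, while the same text outside a call is treated as command input.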

According to the voice recognition device of one aspect of the present invention, a user can speak aloud where speaking loudly is permitted and whisper where it is difficult, such as in a library or on the subway, and the mobile phone automatically detects which is being used and recognizes the user's voice without any mode selection.

FIG. 1 is a block diagram schematically showing a voice recognition device according to an embodiment of the present invention;
FIG. 2 is a flowchart showing in detail the process by which a voice recognition device according to an embodiment of the present invention calls the first signal recognition unit and the second signal recognition unit to recognize normal voice and whispered voice;
FIG. 3 is a conceptual diagram illustrating whisper recognition based on the air pressure sensor of a voice recognition device according to an embodiment of the present invention;
FIGS. 4A and 4B are conceptual diagrams showing the variability of the threshold for whisper recognition in a voice recognition device according to an embodiment of the present invention;
FIG. 5 is a conceptual diagram showing a mode for correcting whisper-related text recognized by a voice recognition device according to an embodiment of the present invention;
FIG. 6 is a block diagram showing the command model database of a voice recognition device according to an embodiment of the present invention;
FIG. 7 is a diagram showing a screen on which a command generated through the command model database of FIG. 6 is displayed on a display unit;
FIG. 8 is a conceptual diagram illustrating a processing method used when a voice recognition device according to an embodiment of the present invention recognizes voice in whisper mode during a call; and
FIG. 9 is a block diagram schematically showing a voice recognition device according to another embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail.

However, this is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be termed a second component, and similarly a second component may be termed a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may be present in between. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no other component is present in between.

The terms used in this application are used only to describe particular embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" are intended to indicate that the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification exist, and should be understood not to exclude in advance the existence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and should not be interpreted in an idealized or excessively formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. To facilitate overall understanding, the same reference numerals are used for the same components in the drawings, and redundant descriptions of the same components are omitted.

Throughout this specification, a whisper refers to a whispered voice signal, used for example as a form of quiet, private conversation. As a paralinguistic phenomenon, whispering is treated as distinct from normal voice, so that the two can be distinguished from each other. The speech production process begins with exhalation from the lungs passing through the glottis, generating a variable-pitch signal that resonates in the vocal tract and nasal cavity and exits through the mouth. Within the vocal tract, oral cavity, and nasal cavity, the positions of the velum, tongue, and lips play an important role in shaping speech sounds. These may collectively be called vocal tract modulators.

In contrast, whispering can be used in situations where clear conversation is desired but the loudness of normal voice is not permitted, and it is in particular an essential means of communication for people with laryngeal disorders. Whispered speech is generally less perceptible and less intelligible. The main difference between normally phonated speech and whispering is that whispering involves no vocal cord vibration. This may occur when vocal cord vibration is physiologically blocked during whispering or, in pathological cases, when the vocal cords have been removed because of a disease or its treatment or are blocked by a disease of the vocal system.

In this specification, there is no single definition of the term "whispered speech". Whispered speech can be broadly classified into the soft whisper and the stage whisper, which differ slightly. A soft whisper (quiet whisper) is produced by a person who normally speaks, in order to deliberately reduce perceptibility, for example by whispering into another person's ear, and is generally used comfortably and with ease. It is produced without vibration of the vocal folds, is common in everyday life, and resembles the form of whisper produced by laryngectomy patients.

A stage whisper, on the other hand, is used when the listener is some distance from the speaker. To produce a stage whisper, the speech must be deliberately made to sound whispered. Stage whispers involve some partial phonation, which requires vibration of the vocal folds.

Although the voice recognition device according to an embodiment of the present invention is basically configured for soft whispers, the whisper in the input signal may also be a stage whisper. The characteristics of whispered speech can be considered in terms of the acoustic features arising from the way whispered speech is produced and the spectral features compared with normal speech. Whispered voice recognition can therefore be performed separately from normal voice recognition.

FIG. 1 is a block diagram schematically showing a voice recognition device according to an embodiment of the present invention. As shown in FIG. 1, the voice recognition device according to an embodiment of the present invention may include a voice signal acquisition unit 110, a signal analyzer 120, a first signal recognition unit 130, a second signal recognition unit 140, and an information input unit 150.

Referring to FIG. 1, the voice recognition device according to an embodiment of the present invention is a computing device that operates according to a preset program. The voice recognition device is a terminal that can detect a voice signal through a built-in sensor and process the detected voice signal with a processor. It may be referred to as a user terminal, and such a user terminal may also be called a mobile station (MS), user equipment (UE), user terminal (UT), wireless terminal, access terminal (AT), terminal, fixed or mobile subscriber unit, subscriber station (SS), cellular phone, wireless device, wireless communication device, wireless transmit/receive unit (WTRU), mobile node, mobile, mobile station, personal digital assistant (PDA), smartphone, laptop, netbook, personal computer, wireless sensor, wearable device, consumer electronics (CE) device, or by other terms.

Various embodiments of the user terminal may include, but are not limited to, a cellular phone, a smartphone with a wireless communication function, a personal digital assistant (PDA) with a wireless communication function, a wireless modem, a portable computer with a wireless communication function, an imaging device such as a digital camera with a wireless communication function, a gaming device with a wireless communication function, a music storage and playback appliance with a wireless communication function, an Internet appliance capable of wireless Internet access and browsing, and portable units or terminals that integrate combinations of such functions.

The voice signal acquisition unit 110 acquires the voice uttered by the user and may include a microphone.

The signal analyzer 120 may analyze the voice acquired by the voice signal acquisition unit 110. It may be implemented in hardware such as a microprocessor. Based on the intensity, vocal range, and/or frequency of the acquired voice signal, the signal analyzer 120 determines whether the acquired voice is a person's normal voice or a whisper-related voice. It may sense the intensity (volume) of the sound input to the voice recognition sensor.

The signal analyzer 120 may sense the intensity of the acquired voice signal and determine whether it is a whispered voice or a normal voice based on whether the intensity exceeds a first threshold. In some cases, a change in volume may be compared with a volume-change threshold to determine whether the voice is a whisper. Alternatively, the signal analyzer 120 may distinguish whispered voice from normal voice based on the vocal range of the acquired voice signal. A whispered voice may lie markedly below or markedly above the vocal range of normal voice. Accordingly, one or more second thresholds may be set, and a voice signal whose vocal range is markedly lower or higher than the threshold or thresholds may be judged to be a whisper-related signal. Alternatively, a particular range may be designated and treated as the range associated with whispering. The two may also be distinguished through frequency analysis of the voice. However, because classification based only on the intensity, vocal range, and/or frequency of the voice signal may not be very accurate, various auxiliary measures are discussed in the present invention. These may include recognition through air pressure, the relationship with ambient noise, signal spectrum analysis, and the use of a user profile.
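
The intensity-based decision, together with a threshold that follows ambient noise, can be illustrated with a minimal sketch. The specific rule used here (the first threshold sits a fixed margin above the measured noise level and never falls below a floor) and all names and numeric values are assumptions made for illustration; the specification states only that intensity is compared with a first threshold and that the threshold may vary with ambient noise.

```python
import math
from typing import Sequence


def rms_db(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame in dB relative to full scale (1.0)."""
    if not frame:
        raise ValueError("frame must not be empty")
    mean_square = sum(s * s for s in frame) / len(frame)
    # Clamp to avoid log10(0) on digital silence.
    return 10.0 * math.log10(max(mean_square, 1e-12))


def first_threshold_db(noise_db: float, margin_db: float = 20.0,
                       floor_db: float = -40.0) -> float:
    """Hypothetical rule: threshold is margin_db above ambient noise, at least floor_db."""
    return max(floor_db, noise_db + margin_db)


def classify_frame(frame: Sequence[float], noise_db: float) -> str:
    """Return "normal" if the frame intensity exceeds the first threshold, else "whisper"."""
    return "normal" if rms_db(frame) > first_threshold_db(noise_db) else "whisper"
```

With these assumed values, a frame at about -26 dB is treated as normal voice in a quiet room (noise at -60 dB, threshold -40 dB) but as a whisper in a noisier setting (noise at -45 dB, threshold -25 dB), which is the kind of threshold variability the text describes.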

Once the signal analysis unit 120 has distinguished the normal voice signal from the whisper-related voice signal, it may call the first signal recognition unit 130 to recognize the normal voice signal and call the second signal recognition unit 140 to recognize the whisper voice signal. Because normal and whisper-related voice signals differ in signal characteristics and in the algorithms needed to recognize them, the different modules 130 and 140 are called and controlled to operate separately. When a speaker moves from a normal voice to a whisper within a single sentence or a single continuous utterance, the signal analysis unit 120 may analyze the signal characteristics in the manner described above, identify the normal-voice and whisper sections, and switch so that the second signal recognition unit 140 is activated in place of the first signal recognition unit 130, or the first signal recognition unit 130 in place of the second signal recognition unit 140. That is, within one continuous utterance, a first section may be identified as a processing section for the first signal recognition unit 130 and a second section as a processing section for the second signal recognition unit 140, and the two units may then be called as appropriate based on this identification information. In addition, when the first signal recognition unit 130 and the second signal recognition unit 140 are implemented on a plurality of processors, the two may perform signal recognition in parallel at substantially the same time.
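The section-by-section switching described above can be illustrated with a minimal Python sketch. It assumes that each short frame of the utterance has already been labelled "normal" or "whisper" by the signal analysis unit; the function name, the label strings, and the `ROUTES` table are hypothetical and are not taken from the specification.

```python
from itertools import groupby

# Hypothetical routing table: which recognition unit handles which label.
ROUTES = {
    "normal": "first_signal_recognition_unit_130",
    "whisper": "second_signal_recognition_unit_140",
}


def split_into_sections(frame_labels):
    """Group consecutive frames that share a label into sections.

    Returns a list of (label, start, end) tuples, where end is exclusive.
    """
    sections = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        sections.append((label, start, start + length))
        start += length
    return sections


def route_sections(frame_labels):
    """Return (unit_name, start, end) for each section of the utterance."""
    return [
        (ROUTES[label], start, end)
        for label, start, end in split_into_sections(frame_labels)
    ]
```

Each returned tuple names the unit that would be called for that section, so an utterance that starts in a normal voice and ends in a whisper yields one section per unit.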

The first signal recognition unit 130 may convert the voice signal into text-based information using a general normal-speech recognition algorithm. The normal-speech recognition algorithm may include an MFCC (Mel Frequency Cepstral Coefficient) algorithm. In this approach, noise and background sound are removed from the input signal to extract the actual effective sound, which is then divided into short-time intervals, and the effective speech is extracted through spectral analysis of each interval. This algorithm is not mandatory, however; LPC (Linear Prediction Coefficients) and LPCC (Linear Prediction Cepstral Coefficient) methods may also be used.

The second signal recognition unit 140 may use the whisper voice database 142 to draw on the various whisper-related voice models it contains. A whisper recognition algorithm using these models may include an algorithm that reconstructs speech from the whisper voice signal. This may include an analysis module that analyzes the input signal to form a representation of it, an enhancement module that modifies the representation to adjust the spectrum of the input signal, and a synthesis module that reconstructs speech from the modified representation. In adjusting the spectrum, an algorithm may be used that changes the bandwidth of one or more formants in the spectrum to achieve a predetermined spectral energy distribution and amplitude for those formants.

The first signal recognition unit 130 and the second signal recognition unit 140 may convert the acquired voice signal into text-based information using the normal-speech recognition algorithm and the whisper recognition algorithm, respectively. Here, text may include letters, numbers, mathematical expressions, and various symbols.

According to an embodiment of the present invention, the first signal recognition unit 130 and the second signal recognition unit 140 may also be implemented as hardware processors, and a processor may be programmed with instructions for performing the related functions. The signal analysis unit 120, the first signal recognition unit 130, and the second signal recognition unit 140 may be implemented as separate functional blocks within a single processor or as a plurality of processors.

The information input unit 150 may display, through the display unit 160, the text-based information recognized by the first signal recognition unit 130 and the second signal recognition unit 140. It then matches the converted text against the command database 170 to generate command information and inputs that information to the device. For command conversion, the information input unit 150 may include a command conversion unit (not shown) that converts the text-based information into a command and a command input unit (not shown) that inputs the converted command. The device may operate according to the command generated by the information input unit 150. For example, for the command "Call xx", the sentence structure is parsed so that the term "call" prepares an outgoing call and the term "xx" identifies the call target, and the device places a call to the pre-stored contact "xx". The mapping between commands and control operations may be programmed in advance.
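As a rough illustration of matching recognized text against a command table, the following Python sketch maps an action keyword and a stored contact name to a command record. The keyword table, the action names, the contact list, and the phone number are all hypothetical placeholders; the specification does not define the structure of the command database 170.

```python
# Hypothetical stand-in for the command database 170: keyword -> action name.
COMMAND_KEYWORDS = {"call": "DIAL", "text": "SEND_MESSAGE"}


def to_command(recognized_text, contacts):
    """Convert recognized text into a command record, or None if no match.

    contacts maps a stored contact name to a phone number.
    """
    words = recognized_text.lower().split()
    # The first action keyword found in the text selects the control operation.
    action = next(
        (COMMAND_KEYWORDS[w] for w in words if w in COMMAND_KEYWORDS), None
    )
    # The first stored contact mentioned in the text becomes the target.
    target = next((name for name in contacts if name.lower() in words), None)
    if action is None or target is None:
        return None
    return {"action": action, "target": target, "number": contacts[target]}
```

Returning `None` when either the action or the target is missing gives the caller a natural point at which to ask the user to correct the recognized text.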

Because the recognition rate for whispers may not be very high, an algorithm for correcting the result may be added to prevent incorrect command input. This may be implemented as correction of the text information at the speech recognition stage, or as correction of the command converted from the text. For example, when the user reviews the normal-speech and/or whisper-recognized text shown on the display unit 160 and wishes to correct part of it, the user may enter text through the user interface 180 (e.g., a keyboard, mouse, or touchpad) and thereby change the recognized text. In this case, the information input unit 150 may search the command database 170 for the related command based on the changed text and convert the text into the corresponding command.

FIG. 2 is a flowchart illustrating in detail a process in which a voice recognition device according to an embodiment of the present invention calls the first signal recognition unit and the second signal recognition unit to recognize normal and whispered speech.

Referring to FIG. 2, the voice signal acquisition unit acquires a voice signal (S210). The acquired voice signal may be speech uttered by the user. The acquired voice signal is provided to the signal analysis unit, which compares its magnitude with a first threshold (S220). A signal larger than the first threshold may be regarded as a normal voice signal and provided to the first signal recognition unit. After the intensity comparison, the pitch range of the acquired voice signal is compared with a second threshold (S230). If the pitch range is higher than the second threshold, the signal may be regarded as a normal voice signal and provided to the first signal recognition unit; a voice signal below the second threshold may be regarded as a whisper-related voice signal. These determinations need not be applied uniformly, however: a signal with low intensity but a pitch range above a third threshold may be judged to be a high-pitched whisper and likewise recognized as a whisper voice signal. Alternatively, a plurality of thresholds may define an interval, and signals within it may be recognized as whisper voice signals.

The intensity and pitch-range comparisons (S220, S230) need not be performed in any particular order.

For a signal classified as a whisper through the intensity and pitch-range comparisons (S220, S230) (e.g., a signal with both low intensity and a low pitch range), the signal analysis unit calls the second signal recognition unit (S240) and controls it to recognize the signal (S250). Conversely, for a signal classified as a normal voice (e.g., a signal with high intensity or a high pitch range), the signal analysis unit calls the first signal recognition unit (S245) and controls it to recognize the signal (S255).
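The decision flow of steps S220 and S230 can be sketched in Python as follows. This is only one possible reading of the flowchart: the parameter names, the units (dB for intensity, Hz for pitch), and the order in which the optional third threshold is checked are assumptions made for illustration.

```python
def classify_voice(intensity_db, pitch_hz, first_threshold,
                   second_threshold, third_threshold=None):
    """Return "normal" or "whisper" for one voice signal.

    first_threshold applies to intensity (S220); second_threshold applies
    to pitch range (S230); third_threshold, if given, marks the pitch
    above which a quiet signal is treated as a high-pitched whisper.
    """
    # S220: a signal louder than the first threshold is a normal voice.
    if intensity_db > first_threshold:
        return "normal"
    # Optional exception: quiet but very high-pitched is still a whisper.
    if third_threshold is not None and pitch_hz > third_threshold:
        return "whisper"
    # S230: a quiet signal with a high pitch range is a normal voice.
    if pitch_hz > second_threshold:
        return "normal"
    # Quiet and low-pitched: whisper, handled by the second unit (S240).
    return "whisper"
```

With a first threshold of 40 and a second threshold of 200, a quiet low-pitched signal such as `classify_voice(30, 150, 40, 200)` is routed to the second signal recognition unit, while a loud signal is routed to the first.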

According to another embodiment of the present invention, in addition to comparing the voice intensity and pitch range directly with thresholds, the amount of change in voice intensity and in pitch range (or frequency) may be compared with an intensity-change threshold and a pitch-range-change threshold (or frequency-change threshold) to distinguish whisper-related signals from normal voice signals.

Once each voice signal has been assigned to the first or second signal recognition unit in this way, the first and second signal recognition units may convert the signals provided to them into text-based information (S260).

FIG. 3 is a conceptual diagram illustrating whisper recognition based on an air pressure sensor in a voice recognition device according to an embodiment of the present invention.

Referring to FIG. 3, the user terminal may detect that the user is speaking close to the terminal in order to whisper. As the user moves closer, the breath accompanying the speech reaches the terminal more strongly. That is, the strength of the breath the user delivers to the user terminal may differ markedly between uttering a normal voice signal and uttering a whisper.

The voice recognition device (user terminal) according to an embodiment of the present invention may further include an air pressure sensor that senses air flow and/or pressure. The sensor is preferably placed close to the microphone; for example, it may preferably be placed within a 2 cm radius of the microphone. This allows the device to detect the change in breath strength that accompanies whispering and to classify the utterance as a whisper. That is, a threshold may be set for the air pressure level and/or the amount of change in air pressure, and the signal may be judged to be a whisper voice signal when a level and/or change greater than that threshold is detected.
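A minimal Python sketch of the pressure-based check is given below. It assumes the sensor delivers a short list of readings in arbitrary units and that either condition alone (level or change) is enough to flag a whisper; the function name and both threshold parameters are hypothetical.

```python
def is_whisper_by_pressure(readings, level_threshold, change_threshold):
    """Flag a whisper from air pressure readings taken near the microphone.

    readings is a chronological list of sensor values. The signal is
    flagged when the latest level exceeds level_threshold or when the
    increase since the previous reading exceeds change_threshold.
    """
    if not readings:
        return False
    latest = readings[-1]
    # With a single reading there is no change to measure yet.
    change = latest - readings[-2] if len(readings) >= 2 else 0.0
    return latest > level_threshold or change > change_threshold
```

In practice such a check would likely be combined with the intensity and pitch-range comparisons rather than used alone, in line with the combined-factor approach described in this specification.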

According to an embodiment of the present invention, factors other than voice intensity, pitch range, frequency, and/or air pressure may also serve as criteria for identifying a whisper. These other factors may include touch detection on a specific area of the terminal (e.g., using a pressure sensor) to sense contact by the edge of the user's hand, and detection of a change in light level using a light sensor. Factors such as heat, light, and temperature may likewise be used through light sensors, temperature sensors, pressure sensors, and the like, and any combination of two or more of these factors may be used to identify a whisper.

FIGS. 4A and 4B are conceptual diagrams illustrating how the thresholds for whisper recognition may be varied in a voice recognition device according to an embodiment of the present invention.

Referring to FIG. 4A, the thresholds related to voice intensity and/or pitch range (or frequency) may change according to the ambient noise. This behavior can be changed through a setting on the user terminal, so that threshold variation according to ambient noise may or may not be used. When it is used and there is almost no ambient noise, as in a library, the user is likely to whisper more quietly than in ordinary situations, so it may be desirable to lower the threshold so that only voice signals quieter than usual are treated as whispers. Accordingly, the threshold may be set to change in proportion to the ambient sound level (dB). For this purpose, a noise sensor for measuring ambient sound may be additionally provided in the device. In the embodiment of FIG. 4A, the voice intensity/pitch-range threshold for identifying a whisper is 40 (e.g., dB) when the ambient noise is 50 dB, and may be set to fall to 20 when the ambient noise is 0 dB.
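The two operating points given for FIG. 4A (threshold 20 at 0 dB ambient noise, threshold 40 at 50 dB) can be joined by a straight line, as in the Python sketch below. Linear interpolation and clamping outside the 0 to 50 dB range are assumptions; the specification only states that the threshold changes in proportion to the ambient sound level.

```python
def whisper_threshold(ambient_noise_db, quiet=(0.0, 20.0), loud=(50.0, 40.0)):
    """Interpolate the whisper threshold from the ambient noise level.

    quiet and loud are (noise_db, threshold) pairs; noise levels outside
    that range are clamped to the nearest end point.
    """
    (noise_lo, thr_lo), (noise_hi, thr_hi) = quiet, loud
    clamped = min(max(ambient_noise_db, noise_lo), noise_hi)
    fraction = (clamped - noise_lo) / (noise_hi - noise_lo)
    return thr_lo + (thr_hi - thr_lo) * fraction
```

Under these assumptions an ambient level of 25 dB, halfway between the two points, gives a threshold halfway between 20 and 40.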

Referring to FIG. 4B, the device may vary the thresholds related to voice intensity and/or pitch range (or frequency) according to the user's voice characteristics. The device may hold a profile of voice characteristics for at least one user; that is, it may hold a first voice profile for a first user and a second voice profile for a second user. In the embodiment of FIG. 4B, a threshold of 15 (dB) may be used when the whisper determination is made in the default state, without a voice profile for a specific user, whereas a threshold of 30 may be used with the first user's voice profile and a threshold of 45 with the second user's voice profile. That is, a threshold corresponding to the specific user may be used.
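The per-user thresholds of FIG. 4B amount to a lookup with a default, which can be sketched in Python as follows. The user identifiers and the dictionary layout are hypothetical; only the values 15, 30, and 45 come from the embodiment.

```python
DEFAULT_THRESHOLD = 15  # used when no user profile applies

# Hypothetical profile store: user identifier -> whisper threshold.
PROFILE_THRESHOLDS = {
    "user_1": 30,
    "user_2": 45,
}


def threshold_for(user_id=None):
    """Return the whisper threshold for a matched user, else the default."""
    return PROFILE_THRESHOLDS.get(user_id, DEFAULT_THRESHOLD)
```

Passing `None`, or an identifier with no stored profile, falls back to the default threshold.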

The voice profiles may be created in a user-profile input mode by separately capturing a voice signal the user utters in a normal voice and one uttered in a whisper, and then analyzing each. Alternatively, a profile may be generated after capturing and analyzing only one of the two, either the normal voice signal or the whisper signal. Because it may be desirable to set a higher pitch-range threshold for a user with a naturally high-pitched voice, a plurality of profiles may be stored in advance through per-user voice signal analysis; when a whisper-related voice is detected, the voice signal is analyzed and matched against the stored users, and the threshold for the matched user is retrieved to determine whether the voice is a whisper. This profile-based threshold adjustment may or may not be used. Alternatively, without matching through voice signal analysis, the user may manually select in advance a mode that uses the first user's profile, and the whisper determination may then be made accordingly.

According to an embodiment of the present invention, voice recognition that distinguishes whispers from normal speech may be run by manually activating a voice recognition mode, or may be set to activate automatically whenever the device is not in a call.

FIG. 5 is a conceptual diagram illustrating a mode for correcting whisper-related text recognized by a voice recognition device according to an embodiment of the present invention.

Referring to the left drawing of FIG. 5, the device may display on the display unit the text-based information recognized by the first signal recognition unit and/or the second signal recognition unit. The user can then check whether the recognized text matches what was actually spoken and correct any errors. To this end, an icon 510 for text correction is displayed together with the text-based information, and when the user clicks or touches the icon 510, the device allows the recognized information to be corrected. Because the recognition rate for whispers may be lower than that for normal speech, the text correction function is likely to be used more often for whispers. Accordingly, in an embodiment of the present invention, the text correction function may be left inactive for normal speech recognition and activated only for whisper recognition. This is not mandatory, however.

In the lower-right drawing of FIG. 5, the user can see that the displayed recognition result reads "Place a text to Director Kim". The user may recognize that the expression "text" (520) is wrong and correct that portion. When the icon 510 is pressed, a user interface such as a keyboard may be displayed for editing the voice command, and the user may type to change the phrase 520 to the correct expression, such as "call".

Referring to the lower-right drawing of FIG. 5, the device may mark the portion of the text to be corrected using a first marker 530 and a second marker 532. That is, the first marker 530 marks the start of the portion to be corrected and the second marker 532 marks its end. The device may then apply voice recognition again to that portion alone. That is, the user may select part of the recognized text and speak the intended words again so that the selected portion is corrected. In the lower-right embodiment of FIG. 5, after the portion "Director Kim" is selected, if the user says "Chairman" in a whisper or a normal voice, the device accepts this as new input, classifies it as a whisper or normal voice, and changes the portion to "Chairman". The device then displays a "recognition complete" icon and, when it is clicked, determines that voice recognition has been completed correctly.
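Replacing only the marked portion of the recognized text can be sketched in Python as a slice-and-join operation. The function name and the use of character offsets for the first and second markers are assumptions for illustration; the re-recognized phrase is passed in as a plain string rather than produced by a recognizer.

```python
def replace_marked_span(text, start, end, new_phrase):
    """Replace text[start:end] with a newly recognized phrase.

    start and end play the role of the first marker 530 and the second
    marker 532; end is exclusive.
    """
    if not 0 <= start <= end <= len(text):
        raise ValueError("markers must lie within the text and be ordered")
    return text[:start] + new_phrase + text[end:]


recognized = "Call Director Kim"
start = recognized.index("Director Kim")
end = start + len("Director Kim")
corrected = replace_marked_span(recognized, start, end, "Chairman")
```

Text outside the two markers is left untouched, so only the selected portion has to be spoken and recognized again.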

The correction of recognized text described above may be applied in a similar way to the command converted by the information input unit. That is, a command may be corrected even after the recognized text has been converted into a command.

According to another embodiment of the present invention, to improve the accuracy of the normal-speech recognition algorithm in the first signal recognition unit and the whisper recognition algorithm in the second signal recognition unit, a deep learning algorithm such as a CNN (Convolutional Neural Network) or RNN (Recurrent Neural Network) may be run using, as a training data set, the acquired voice signals paired with either the corrected text or text that was recognized correctly without correction. That is, (acquired voice signal, text information) pairs may be generated as a training data set, so that each acquired voice signal becomes training data for the normal-speech and/or whisper recognition algorithm. Some of the (acquired voice signal, text information) pairs may be set aside as a validation data set and others as a test data set. Furthermore, the acquired voice signal may be compared against pre-stored user profiles to identify the user, enabling learning specific to that user. For example, the normal and/or whisper recognition algorithm for user 1 may be trained by machine learning using (user 1 acquired voice signal, text information (corrected or correctly recognized)) as training data.
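Dividing the collected (acquired voice signal, text information) pairs into training, validation, and test sets can be sketched in Python as below. The 80/10/10 proportions, the fixed shuffle seed, and the function name are assumptions; the specification does not state how the pairs are divided.

```python
import random


def split_dataset(pairs, train_ratio=0.8, validation_ratio=0.1, seed=0):
    """Shuffle (signal, text) pairs and split them into three subsets.

    Returns (training, validation, test); whatever remains after the
    training and validation shares goes to the test set.
    """
    shuffled = list(pairs)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_validation = int(len(shuffled) * validation_ratio)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return training, validation, test
```

For user-specific learning, the same split could be applied separately to the pairs attributed to each identified user.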

FIG. 6 is a block diagram illustrating a command model database of a voice recognition device according to an embodiment of the present invention. As shown in FIG. 6, the command model database 600 according to an embodiment of the present invention may include an important-word model 610 and an unimportant-word model 620.

Referring to FIG. 6, the device may convert recognized text into a command using a command model. The command model may include the important-word model 610 and the unimportant-word model 620, where an important word is a word that plays a key role when the device performs a control operation on the terminal. Important words may include names of people who may be the target of a call or text message, action-related terms (e.g., call, text, or a messenger service such as KakaoTalk), and terms related to information retrieval. When a command converted from recognized text is displayed, it may be desirable to identify these words and render them so that they stand out to the user.

Unimportant words include words such as postpositional particles, adverbs, and mimetic or onomatopoeic words, which have little effect on the operation of the terminal even when entered incorrectly in a command.

FIG. 7 is a diagram illustrating a screen on which a command generated through the command model database of FIG. 6 is displayed on the display unit.

Referring to FIG. 7, as described above, a command converted from recognized text may also be displayed through the display unit. A command generated through the command model database, which distinguishes important words from unimportant words, may also be corrected. The displayed command may show important and unimportant words differently. For example, in "Call Director Kim", the name "Director Kim" (710) and the control-related word "call" (712) are important words and may be set apart from unimportant words using bold type, a different color, and/or underlining. The user may then click the command correction icon 714 to correct the command in which the important words are marked. The correction may be made by text input and/or by speaking again. According to an embodiment of the present invention, in response to a click on the command correction icon 714, the cursor for command correction may move to the positions of the important words 710 and 712 so that correction centers on the important words. That is, when the correction icon 714 is clicked, "Director Kim" (710) may first be selected as the text to be corrected, and in response to a cursor-movement input the selection may move directly to "call" (712). This allows text to be selected for correction one important word at a time.
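Stepping the correction cursor from one important word to the next can be sketched in Python as follows. The token list, the set standing in for the important-word model 610, and the wrap-around after the last important word are assumptions made for illustration.

```python
def important_positions(tokens, important_words):
    """Return the indices of tokens that appear in the important-word set."""
    return [i for i, token in enumerate(tokens) if token in important_words]


def next_cursor(positions, current=None):
    """Return the next important-word index after current.

    With no current position (the correction icon was just clicked) the
    first important word is selected; after the last one the selection
    wraps back to the first.
    """
    if not positions:
        return None
    if current is None or current not in positions:
        return positions[0]
    return positions[(positions.index(current) + 1) % len(positions)]


tokens = ["Director Kim", "call", "please"]
positions = important_positions(tokens, {"Director Kim", "call"})
first = next_cursor(positions)
second = next_cursor(positions, first)
```

Here the first click selects "Director Kim" and the next cursor-movement input jumps straight to "call", skipping the unimportant word.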

According to an embodiment of the present invention, when the signal analysis unit recognizes the acquired voice signal as a whisper, the user is likely to be in an environment with little ambient noise where attracting attention should be avoided as far as possible. In that case, depending on the user's mode setting, a control operation that automatically reduces the terminal's audio output to below a specific threshold and a control operation that reduces the terminal's light output may be triggered, alone or in combination.

FIG. 8 is a conceptual diagram illustrating how a voice recognition device according to an embodiment of the present invention handles speech recognized in whisper mode during a call.

Referring to FIG. 8, the function of distinguishing whispered speech from normal speech may be activated not only for command input when no call is in progress, but also while a call is in progress. In this case, the input voice signal may be converted not only into a text-based signal but also into a special form of voice signal. Because a normal voice signal is already recognized well, its recognition may be left inactive during a call so that only whisper voice signals are recognized. That is, only the second signal recognition unit is activated, and an amplified voice signal is generated based on the recognized whisper voice signal. Beyond simple amplification, the signal may also be changed into a normal voice signal that matches the user's profile. As described above, the device may hold normal and/or whisper voice profiles for a plurality of users. Once the acquired voice signal has been analyzed and attributed to a specific user, the device may retrieve that user's normal voice profile and process the voice signal by overlaying the profile on the information recognized by the second signal recognition unit. In other words, because whispered speech may be hard for the other party to hear, the speech is converted into a normal voice signal of a certain loudness, identical or similar to the user's own normal voice, before it is delivered to the other party's terminal, even when the user is whispering. This minimizes inconvenience in the conversation.
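The in-call flow described above (identify the speaker, fetch that speaker's normal voice profile, then raise the whisper to normal loudness) can be sketched roughly as follows. This is only an illustrative sketch: the `VoiceProfile` fields, the nearest-feature speaker match, and the gain calculation are assumptions introduced here, not details given in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceProfile:
    # All fields are hypothetical; the specification does not define a profile format.
    user_id: str
    whisper_feature_hz: float  # representative frequency feature of the user's whisper
    normal_pitch_hz: float     # representative pitch of the user's normal voice
    normal_level: float        # target loudness of the normal voice (0..1)


def identify_user(observed_feature_hz, profiles):
    """Pick the stored user whose whisper feature is closest to the observed one."""
    return min(profiles, key=lambda p: abs(p.whisper_feature_hz - observed_feature_hz))


def convert_whisper_for_call(text, observed_feature_hz, observed_level, profiles):
    """Describe the normal voice signal to synthesize for the other party.

    `text` stands for the information recognized by the second signal
    recognition unit; the returned dict is a stand-in for a real synthesizer input.
    """
    profile = identify_user(observed_feature_hz, profiles)
    # Amplification: scale the whisper up to the user's normal loudness.
    gain = profile.normal_level / observed_level if observed_level > 0 else 1.0
    return {
        "speaker": profile.user_id,
        "text": text,
        "pitch_hz": profile.normal_pitch_hz,  # overlay the user's normal voice profile
        "gain": gain,
    }


profiles = [
    VoiceProfile("user_a", 2000.0, 120.0, 0.5),
    VoiceProfile("user_b", 3000.0, 210.0, 0.5),
]
result = convert_whisper_for_call("call me back later", 2100.0, 0.125, profiles)
print(result["speaker"], result["pitch_hz"], result["gain"])
```

A real device would pass the resulting parameters to a speech synthesizer or voice conversion model; that stage is outside the scope of this sketch.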

Further, according to an embodiment of the present invention, when whispered speech recognized during a call is converted into a normal voice signal, the target normal voice signal may be chosen according to the characteristics of the whisper voice signal. For example, if the pitch range of the input whisper voice signal is somewhat higher than that of an average whisper voice signal, it may be converted into a first normal voice signal based on a pre-stored first normal voice profile with a somewhat higher pitch range. Conversely, if the pitch range of the input whisper voice signal is lower than average, it may be converted into a second normal voice signal based on a normal voice profile with a somewhat lower pitch range. The characteristics of a whisper voice signal can be classified by comparing its loudness, pitch range, and utterance length (whether the speech is fast or slow) with at least one of a plurality of pre-stored thresholds. The device holds a normal voice profile corresponding to each characteristic and retrieves the matching profile to convert the whisper into an appropriate normal voice signal. This in-call voice conversion mode may be enabled or disabled by user setting, and whether multiple voice profiles are considered is likewise determined by user setting. For example, a single default normal voice profile may be used so that every whisper is converted into one specific normal voice signal regardless of its characteristics. In that case, however, the result may differ from the user's actual call voice and sound unnatural to the other party.
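The threshold comparison and profile lookup described above can be illustrated with a short sketch. The threshold values, feature names, and profile labels below are hypothetical placeholders, and only the pitch range drives the selection here, following the example in the text; loudness and speaking rate are classified but not used further.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WhisperFeatures:
    loudness_db: float
    pitch_range_hz: float
    syllables_per_sec: float


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical pre-stored thresholds; the specification gives no numeric values.
    loudness_db: float = 30.0
    range_high_hz: float = 220.0
    range_low_hz: float = 180.0
    rate_sps: float = 5.0


DEFAULT_PROFILE = "normal_profile_default"
FIRST_PROFILE = "normal_profile_1"   # somewhat higher pitch range
SECOND_PROFILE = "normal_profile_2"  # somewhat lower pitch range


def classify(features, thresholds=Thresholds()):
    """Compare each whisper characteristic with its pre-stored threshold."""
    return {
        "loud": features.loudness_db > thresholds.loudness_db,
        "high_range": features.pitch_range_hz > thresholds.range_high_hz,
        "low_range": features.pitch_range_hz < thresholds.range_low_hz,
        "fast": features.syllables_per_sec > thresholds.rate_sps,
    }


def select_normal_profile(features, use_characteristics=True):
    """Return the label of the normal voice profile to convert into."""
    if not use_characteristics:
        # User setting: always convert with the single default profile.
        return DEFAULT_PROFILE
    labels = classify(features)
    if labels["high_range"]:
        return FIRST_PROFILE
    if labels["low_range"]:
        return SECOND_PROFILE
    return DEFAULT_PROFILE


print(select_normal_profile(WhisperFeatures(25.0, 260.0, 4.0)))
```

The `use_characteristics` flag mirrors the user setting mentioned in the text: when it is off, the default profile is always used, at the cost of a less natural-sounding result.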

FIG. 9 is a block diagram schematically showing a voice recognition device according to another embodiment of the present invention. As shown in FIG. 9, the voice recognition device according to this embodiment may include a voice signal acquisition unit 910, a signal analysis unit 920, a first signal recognition unit 930, a second signal recognition unit 940, and a signal conversion unit 950.

Referring to FIG. 9, the voice signal acquisition unit 910, the signal analysis unit 920, the first signal recognition unit 930, and the second signal recognition unit 940 may perform functions identical or similar to those of the voice signal acquisition unit 110, the signal analysis unit 120, the first signal recognition unit 130, and the second signal recognition unit 140 of FIG. 1 (see the description of FIG. 1).

The signal conversion unit 950 determines whether the device is currently in a call and converts the information recognized by the first signal recognition unit 930 and the second signal recognition unit 940 into either a text-based command signal or a normal voice signal. For example, when no call is in progress, the signal acquired by the voice signal acquisition unit 910 is treated as a voice command rather than call speech, and the information recognized through the first signal recognition unit 930 and the second signal recognition unit 940 is converted into text-based command information. Conversely, when whisper-related speech is recognized during a call, the first signal recognition unit 930 may be deactivated as appropriate, because a normal voice signal is already being output as it is. When only the second signal recognition unit 940 is active and a whisper voice signal is detected, the information recognized by the second signal recognition unit 940 may be converted into a normal voice signal. For this conversion, the device may use a default normal voice signal, a user profile, and/or a normal voice profile corresponding to the voice characteristics of the whisper.
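The branching performed by the signal conversion unit 950 can be summarized in a small dispatch sketch. The function name, the `SignalKind` enum, and the tuple return format are assumptions made for illustration; in particular, treating normal speech during a call as an unmodified pass-through is one reading of the text rather than a stated requirement.

```python
from enum import Enum


class SignalKind(Enum):
    NORMAL = "normal"    # first signal (normal voice)
    WHISPER = "whisper"  # second signal (whisper)


def convert_recognized(recognized_text, kind, in_call):
    """Decide the output form of recognized speech.

    Returns a (mode, payload) tuple:
      ("command", text)      - not in a call: text-based command information
      ("normal_voice", text) - in a call, whisper: text to re-voice as normal speech
      ("passthrough", None)  - in a call, normal voice: send the audio unchanged
    """
    if not in_call:
        # Both recognition units feed the command path when no call is active.
        return ("command", recognized_text)
    if kind is SignalKind.WHISPER:
        # Only the second signal recognition unit is active during a call.
        return ("normal_voice", recognized_text)
    # Assumption: the first recognition unit is inactive, so nothing is converted.
    return ("passthrough", None)


print(convert_recognized("open the camera", SignalKind.WHISPER, in_call=False))
print(convert_recognized("see you at seven", SignalKind.WHISPER, in_call=True))
```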

According to yet another embodiment of the present invention, a mouth-shape determination mode may be added to the schemes for executing commands or converting to normal call voice based on voice recognition. The mouth-shape determination mode is enabled when the user changes the setting to "mouth-shape recognition mode", and it recognizes speech from the shape of the user's mouth using a camera included in the terminal. For example, a rounded mouth shape is recognized as the vowel "ㅗ" being pronounced, and consonants can be inferred from the relationship between the preceding and following characters. An algorithm for this purpose may be coded separately and executed in a third signal recognition unit (not shown). In addition to the pre-stored basic mouth-shape recognition algorithm, the device may extract user-specific mouth movement characteristics from video of the user speaking (for example, a video call or a video file) so that mouth-shape recognition is better tuned to that user.

Although the invention has been described above with reference to the drawings and embodiments, this does not mean that the scope of protection of the present invention is limited by those drawings or embodiments. Those skilled in the art will understand that the present invention may be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

Claims

A voice recognition device comprising:
a signal acquisition unit for acquiring a voice signal;
a signal analysis unit that analyzes the acquired voice signal to determine whether it is a first signal associated with a person's normal voice or a second signal associated with a whisper, and calls at least one of a first signal recognition unit and a second signal recognition unit;
the first signal recognition unit, which recognizes the first signal according to a first voice recognition algorithm in response to a call from the signal analysis unit; and
the second signal recognition unit, which recognizes the second signal according to a second voice recognition algorithm in response to a call from the signal analysis unit.

The voice recognition device of claim 1,
further comprising an information input unit that generates and inputs command information based on text-based information recognized by at least one of the first signal recognition unit and the second signal recognition unit.

The voice recognition device of claim 2, wherein the information input unit comprises:
a command conversion unit that converts text-based information recognized by at least one of the first signal recognition unit and the second signal recognition unit into a command; and
a command input unit that inputs the converted command.

The voice recognition device of claim 3, wherein
the conversion into a command is performed using a pre-stored command model.

The voice recognition device of claim 2,
wherein the recognized text information is displayed on a display unit.

The voice recognition device of claim 5,
wherein a given character included in the text information displayed on the display unit is modified by text input through a user interface.

The voice recognition device of claim 5,
wherein a given character included in the text information displayed on the display unit is specified and modified by voice re-input through a user interface.

The voice recognition device of claim 5,
wherein an icon for text correction is displayed around the text displayed on the display unit.

The voice recognition device of claim 4, wherein
the command model distinguishes important words from non-important words, and
when the converted command is displayed, the important words and the non-important words are displayed so as to be distinguishable from each other.

The voice recognition device of claim 2,
wherein the information input unit is automatically activated when no call is in progress.

The voice recognition device of claim 1,
wherein the signal analysis unit analyzes at least one of the intensity and the pitch range of the acquired voice signal to determine whether it is the first signal or the second signal.

The voice recognition device of claim 11, wherein the signal analysis unit
determines whether the acquired voice signal is the first signal or the second signal based on at least one of whether the intensity of the acquired voice signal is greater than a first threshold and whether the pitch range of the acquired voice signal is greater than a second threshold.

The voice recognition device of claim 12,
wherein at least one of the first threshold and the second threshold varies according to the amount of noise around the voice recognition device.

The voice recognition device of claim 12,
wherein at least one of the first threshold and the second threshold is set based on a pre-stored user voice profile.

The voice recognition device of claim 1,
wherein the second signal recognition unit recognizes the second signal using a voice model associated with whisper voice signal recognition.

The voice recognition device of claim 1,
further comprising an air pressure sensor that senses the pressure of air exhaled by the user,
wherein the signal analysis unit determines whether the acquired voice signal is the first signal or the second signal based on whether the pressure value from the air pressure sensor is greater than a third threshold.

The voice recognition device of claim 16,
wherein the air pressure sensor is disposed within a radius of 2 cm from the voice signal acquisition unit.

The voice recognition device of claim 1, wherein,
when the voice recognition device is in a call, the first signal recognition unit is deactivated and only the second signal recognition unit is activated,
the device further comprising a signal conversion unit that converts the information recognized by the second signal recognition unit into a normal voice signal.

The voice recognition device of claim 18,
further comprising a storage unit that holds a first user voice signal including normal voice signal characteristics of a first user and a second user voice signal including normal voice signal characteristics of a second user,
wherein the signal conversion unit determines whether the acquired voice signal is the first user voice signal or the second user voice signal and converts it into the corresponding voice signal.

The voice recognition device of claim 18,
wherein the signal conversion unit converts into a normal voice signal corresponding to the characteristics of the acquired voice signal.

The voice recognition device of claim 1,
wherein, within one continuous utterance, the signal analysis unit identifies a first section as a section to be processed by the first signal recognition unit and a second section as a section to be processed by the second signal recognition unit, and calls the signal recognition units accordingly.

The voice recognition device of claim 21,
wherein the first signal recognition unit is implemented as a first processor and the second signal recognition unit is implemented as a second processor, so that the signal recognition operations for the first and second sections of the one continuous utterance are performed substantially in parallel.

A voice recognition method comprising:
acquiring a voice signal;
analyzing, by a signal analysis unit, the acquired voice signal to determine whether it is a first signal associated with a person's normal voice or a second signal associated with a whisper, and calling at least one of a first signal recognition unit and a second signal recognition unit;
recognizing, by the first signal recognition unit, the first signal according to a first voice recognition algorithm in response to a call from the signal analysis unit; and
recognizing, by the second signal recognition unit, the second signal according to a second voice recognition algorithm in response to a call from the signal analysis unit.

A voice recognition device comprising:
a signal acquisition unit for acquiring a voice signal;
a signal analysis unit that analyzes the acquired voice signal to determine whether it is a first signal associated with a person's normal voice or a second signal associated with a whisper, and calls at least one of a first signal recognition unit and a second signal recognition unit;
the first signal recognition unit, which recognizes the first signal according to a first voice recognition algorithm in response to a call from the signal analysis unit;
the second signal recognition unit, which recognizes the second signal according to a second voice recognition algorithm in response to a call from the signal analysis unit; and
a signal conversion unit that converts information recognized by at least one of the first signal recognition unit and the second signal recognition unit into a text-based input signal or a normal voice signal based on whether a call is in progress.

A voice recognition method comprising:
acquiring a voice signal;
analyzing, by a signal analysis unit, the acquired voice signal to determine whether it is a first signal associated with a person's normal voice or a second signal associated with a whisper, and calling at least one of a first signal recognition unit and a second signal recognition unit;
recognizing, by the first signal recognition unit, the first signal according to a first voice recognition algorithm in response to a call from the signal analysis unit;
recognizing, by the second signal recognition unit, the second signal according to a second voice recognition algorithm in response to a call from the signal analysis unit; and
converting information obtained from at least one of the first signal recognition unit and the second signal recognition unit into a text-based input signal or a normal voice signal based on whether a call is in progress.