KR20170052082A - Method and apparatus for voice recognition based on infrared detection - Google Patents


Info

Publication number
KR20170052082A
Authority
KR
South Korea
Prior art keywords
speech
image
data
speech recognition
vocal
Prior art date
Application number
KR1020150154045A
Other languages
Korean (ko)
Inventor
한명수
Original Assignee
SELVAS AI Inc. (주식회사 셀바스에이아이)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SELVAS AI Inc. (주식회사 셀바스에이아이)
Priority to KR1020150154045A
Publication of KR20170052082A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress-induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems

Abstract

The present invention relates to an infrared detection-based speech recognition method and apparatus. The method comprises: receiving vocal data, including vocal image data and vocal sound data, from a voicing source; processing the vocal image data into a voiced airflow image based on infrared detection and a vocal mouth image based on visible light detection; generating a speech recognition voice feature based on the vocal sound data and a speech recognition image feature based on the vocal image data; and outputting a speech recognition result through pattern recognition based on the speech recognition voice feature and the speech recognition image feature.

Description

METHOD AND APPARATUS FOR VOICE RECOGNITION BASED ON INFRARED DETECTION

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an infrared detection-based speech recognition method and apparatus, and more particularly, to an infrared detection-based speech recognition method and apparatus capable of recognizing a speaker's speech by detecting, with infrared rays, the airflow the speaker produces while voicing.

As speech recognition has come into wide everyday use, the underlying technology has improved considerably. In general, speech recognition captures and analyzes a human voice so that a device such as a computer can understand a person's language. Speech recognition has thus developed into a new input method that lets users operate devices more conveniently.

Speech recognition has developed toward capturing the human voice and recognizing it ever more accurately. Accordingly, methods that analyze human utterances as acoustic data have come into wide use. For example, one technique separates a speaker's voice from noise so that the speech can be recognized clearly in a noisy environment.

However, techniques that rely only on human speech for recognition are often limited in accuracy and in the conditions under which they can be used. Accordingly, visual data, and not only the acoustic data of the voice, is used for speech recognition. For example, visual data of the speaker's lip shape can be used together with the speaker's acoustic data.

However, a speech recognition method that uses the speaker's lip shape as visual data has the problem that the speaker must be photographed from the front. In addition, a speech recognition method that uses only acoustic data suffers a very low recognition success rate when the signal-to-noise ratio (SNR) is very small, such as in a very noisy environment or when the speaker's voice is very quiet.

Accordingly, there is a growing need for a method that can use, for speech recognition, not only the sound data of a speaker but also the airflow generated from the speaker's mouth.

[Related Technical Literature]

A speech recognition apparatus and a speech recognition method thereof (Korean Patent Publication No. 10-2014-0024536)

SUMMARY OF THE INVENTION

It is an object of the present invention to provide an infrared detection-based speech recognition method and apparatus capable of recognizing speech even in a noisy environment or when the speech output of a speaker is low.

Another object of the present invention is to provide an infrared detection-based speech recognition method and apparatus capable of recognizing speech by detecting, with an infrared camera, the airflow generated from a speaker's mouth.

The problems of the present invention are not limited to the above-mentioned problems, and other problems not mentioned can be clearly understood by those skilled in the art from the following description.

According to an aspect of the present invention, an infrared detection-based speech recognition method includes: receiving vocal data, including vocal image data and vocal sound data, from a voicing source; processing the vocal image data into a voiced airflow image based on infrared detection and a vocal mouth image based on visible light detection; generating a speech recognition voice feature based on the vocal sound data and a speech recognition image feature based on the vocal image data; and outputting a speech recognition result through pattern recognition based on the speech recognition voice feature and the speech recognition image feature.

According to still another aspect of the present invention, generating the speech recognition image feature includes calculating at least one of the velocity, the amount, and the pressure of the voicing airflow propagated from the voicing source, through infrared detection of the voiced airflow image.

According to another aspect of the present invention, the infrared ray detection is characterized by detecting a far-infrared ray.

According to another aspect of the present invention, the step of generating a speech recognition image feature includes a step of recognizing a speech interval and distinguishing phonemes based on the speech recognition image feature.

According to another aspect of the present invention, the pattern recognition is based only on the speech recognition image feature when no vocal sound data is detected or when the noise in the vocal sound data outweighs the speaker's voice.

According to another aspect of the present invention, an infrared detection-based speech recognition apparatus includes: a receiving unit for receiving vocal data, including vocal image data and vocal sound data, from a voicing source; a processing unit for processing the vocal image data into a voiced airflow image based on infrared detection and a vocal mouth image based on visible light detection, generating a speech recognition voice feature based on the vocal sound data, and generating a speech recognition image feature based on the vocal image data; and an output unit for outputting a speech recognition result through pattern recognition based on the speech recognition voice feature and the speech recognition image feature.

According to another aspect of the present invention, the processing unit calculates at least one of the velocity, the amount, and the pressure of the voicing air current propagated from the voicing source through the infrared detection of the voiced air current image.

According to another aspect of the present invention, the infrared ray detection is characterized by detecting a far-infrared ray.

According to still another aspect of the present invention, the processing unit identifies a voice section and distinguishes phonemes based on a voice recognition image feature.

According to another aspect of the present invention, the pattern recognition is based only on the speech recognition image feature when no vocal sound data is detected or when the noise in the vocal sound data outweighs the speaker's voice.

According to an aspect of the present invention, there is provided a computer-readable medium storing instructions for providing an infrared detection-based speech recognition method. The instructions cause: receiving vocal data, including vocal image data and vocal sound data, from a voicing source; processing the vocal image data into a voiced airflow image based on infrared detection and a vocal mouth image based on visible light detection; generating a speech recognition voice feature based on the vocal sound data and a speech recognition image feature based on the vocal image data; and outputting a speech recognition result through pattern recognition based on the speech recognition voice feature and the speech recognition image feature.

The details of other embodiments are included in the detailed description and drawings.

The present invention provides an infrared detection-based speech recognition method and apparatus capable of recognizing speech even in a noisy environment or in a case where the speech output of a speaking person is low.

The present invention provides an infrared detection-based speech recognition method and apparatus capable of detecting speech by detecting an air current generated from a mouth of a speaking person using an infrared camera.

The effects according to the present invention are not limited by the contents exemplified above, and more various effects are included in the specification.

FIG. 1 shows a schematic configuration of an infrared detection-based speech recognition module according to an embodiment of the present invention.
FIG. 2 illustrates a procedure for recognizing a speech of a speaking person according to an infrared detection-based speech recognition method according to an embodiment of the present invention.
FIG. 3 illustrates an exemplary photographing for infrared detection based speech recognition according to an embodiment of the present invention.
FIG. 4 illustrates an exemplary configuration and a flow of a speech recognition process according to an exemplary embodiment of the present invention.
FIG. 5 illustrates exemplary infrared detection results of phonemes according to phonemes in a speech recognition process according to an embodiment of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The advantages and features of the present invention, and the manner of achieving them, will become apparent from the embodiments described hereinafter in conjunction with the accompanying drawings. The present invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art; the invention is defined only by the scope of the claims.

Like reference numerals refer to like elements throughout the specification unless otherwise specified.

The features of the various embodiments of the present invention may be partially or entirely combined with each other, and, as those skilled in the art will appreciate, various technical interlocking and driving among them are possible; the embodiments may be practiced independently of each other or in association with each other.

In the present specification, speech recognition basically means an operation in which an electronic device interprets a voice uttered by a speaker and recognizes its contents as text. Specifically, when the waveform of a voice uttered by a speaker is input to the electronic device, a voice feature including pattern information of the voice can be obtained by analyzing the waveform. The voice feature is then compared with previously learned acoustic and linguistic statistical data, so that the text most likely to match the input voice can be recognized.

In the present specification, a speaker is a user who generates vocal data to be subjected to speech recognition.

In the present specification, vocal data is data that is uttered by a voicing source and propagated to the speech recognition module, and includes vocal image data and vocal sound data. More specifically, the vocal data can be propagated to the infrared detection-based speech recognition module not only as speech but also in an image format.

In the present specification, the vocal image data is image data that can be obtained through the photographing equipment such as a camera among the vocal data, and means the result of photographing the vocalization state of the speaking person. The vocal image data includes a vocal mouth image based on visible light detection and a vocal air image based on infrared detection.

In the present specification, a voiced airflow image is an image of the airflow generated from a speaker's mouth during voicing, captured around the mouth by a camera. The voiced airflow image may be the result of photographing the airflow around the speaker's mouth with an infrared detecting device.

In the present specification, a vocal mouth image is an image of the mouth shape taken while the speaker is voicing, that is, an image of mouth-shape changes that can be photographed in the visible-ray region.

In the present specification, vocal sound data is data in the form of voice transmitted to the voice recognition module, and includes all sounds other than the vocal image data. That is, the vocal sound data may include not only the speaker's voice but also the noise present between the voicing source and the recognition module.

Herein, the speech recognition voice feature is vocal sound data processed by the speech recognition module for speech recognition, that is, data processed or extracted for conversion from sound to text. For example, a speech recognition voice feature results from analyzing the voice level, waveform, and so on, and includes the time at which the voice is pronounced, the boundaries between phonemes, and the predicted phonemes.

In the present specification, the speech recognition image feature is vocal image data processed by the speech recognition module for speech recognition, that is, data processed or extracted for conversion from image to text. For example, the speech recognition image feature includes the mouth shape captured in an image, a predicted phoneme corresponding to that mouth shape, the infrared flow of the voicing airflow propagating from the voicing source, the phoneme intervals according to that flow, and the predicted phonemes.

In the present specification, pattern recognition collectively refers to techniques, such as probabilistic methods or artificial-neural-network methods running on a computing device, that recognize the phonemes a speaker pronounced by using the speech recognition voice feature and the speech recognition image feature.

Various embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

FIG. 1 shows a schematic configuration of an infrared detection-based speech recognition module according to an embodiment of the present invention.

Referring to FIG. 1, the infrared detection-based speech recognition module 100 includes a receiving unit 110, a processing unit 120, and an output unit 130. The infrared detection-based speech recognition module 100 receives various types of data from the utterance originator and recognizes the speech through pattern recognition.

The receiving unit 110 receives the vocal data from the voicing source. The vocal data means all data that can be collected when the speaker voices an utterance; the receiving unit 110 mainly receives the vocal image data and the vocal sound data. The receiving unit 110 transmits the received vocal data to the processing unit 120.

The processing unit 120 processes and analyzes the vocal image data and the vocal sound data received from the receiving unit 110. The processing unit 120 generates a speech recognition image feature from the vocal image data and a speech recognition voice feature from the vocal sound data, and transmits the generated features to the output unit 130.

The output unit 130 converts the speech recognition voice features and the speech recognition image features generated by the processing unit 120 into text through pattern recognition, and may output the final speech recognition result as text after result post-processing. Accordingly, the vocal data received as voice and image can be converted into text through the infrared detection-based speech recognition module 100. The text output from the output unit 130 may also be displayed through other modules or devices.

Each of the components of the infrared detection-based speech recognition module 100 is illustrated as an individual component for convenience of explanation, and may be implemented in one module or separated into two or more components according to an implementation method.

FIG. 2 illustrates a procedure for recognizing a speaker's speech according to an infrared detection-based speech recognition method according to an embodiment of the present invention. For convenience of explanation, the description refers to FIG. 1.

The infrared detection-based speech recognition method according to the present invention is initiated by the receiving unit 110 receiving speech data including voiced image data and voiced speech data from a speech source (S110).

The receiving unit 110 receives the vocal data, which includes the vocal image data and the vocal sound data. That is, the infrared detection-based speech recognition module 100 can receive both kinds of data through the receiving unit 110 and process them separately through the processing unit 120. The receiving unit 110 may be connected to a voice receiving module and a video receiving module of the voice recognition device that includes the infrared detection-based voice recognition module 100.

The processing unit 120 processes the voiced air image based on the infrared light detection and the voiced image based on the visible light detection based on the voiced image data (S120).

The processing unit 120 may process the vocal image data to separate the voiced airflow image and the vocal mouth image. The vocal image data can be propagated to the receiving unit 110 from a photographing apparatus capable of detecting both infrared and visible rays. Accordingly, the processing unit 120 can classify the data photographed with infrared rays as the voiced airflow image, and the data photographed with visible light as the vocal mouth image.

When the receiving unit 110 receives the vocal image data through infrared imaging alone, the processing unit 120 can separate the airflow from the speaker's face and process each. That is, the processing unit 120 can extract the vocal mouth image from the speaker's face in the infrared-captured image, and extract the voiced airflow image from the image of the airflow generated from the speaker's mouth.
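Although the patent does not give an implementation of this infrared-only separation, it can be sketched as a simple temperature-band split. The frame format, temperature thresholds, and function name below are illustrative assumptions, not part of the disclosure.

```python
# Hypothetical sketch: splitting a single infrared frame into a face (mouth)
# region and an exhaled-airflow region by temperature thresholding.
# The temperature bands are invented for illustration.

FACE_MIN_C = 30.0     # assumption: skin stays near body temperature
AIRFLOW_MIN_C = 25.0  # assumption: exhaled air is warmer than room air, cooler than skin

def split_infrared_frame(frame):
    """Classify each pixel of a 2-D temperature map (degrees Celsius).

    Returns two masks of the same shape: face_mask for pixels in the
    skin-temperature band, airflow_mask for the intermediate band that
    plausibly corresponds to the voicing airflow.
    """
    face_mask = [[t >= FACE_MIN_C for t in row] for row in frame]
    airflow_mask = [[AIRFLOW_MIN_C <= t < FACE_MIN_C for t in row] for row in frame]
    return face_mask, airflow_mask

# Toy 3x4 frame: warm "face" pixels, a lukewarm "airflow" plume, cool background.
frame = [
    [22.0, 23.0, 22.5, 22.0],
    [34.5, 28.0, 26.5, 23.0],
    [33.8, 27.5, 24.0, 22.0],
]
face, airflow = split_infrared_frame(frame)
```

In practice the two masks would feed the vocal mouth image and voiced airflow image processing paths, respectively.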

The processing unit 120 can process voiced speech data in addition to the voiced image data. That is, the processing unit 120 can analyze the sound waveform from the utterance voice data and obtain the voice pattern information using a known utterance voice data processing algorithm.

The processing unit 120 generates a speech recognition voice feature based on the vocal sound data, and generates a speech recognition image feature based on the vocal image data (S130).

The processing unit 120 generates a speech recognition voice feature based on the vocal sound data. More specifically, the processing unit 120 can generate the voice feature by identifying phoneme regions and distinguishing phonemes from the analyzed voice waveform. That is, the processing unit 120 may analyze the vocal sound data to produce the timing of the actual speech, the boundaries between phonemes, and the predicted phonemes.
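As a hedged illustration of the voice-feature step, the sketch below flags speech frames by short-time energy, a common first stage for finding the timing of actual speech. The frame length, threshold, and sample values are invented for the example; they are not the module's actual algorithm.

```python
# Illustrative sketch (not the patent's algorithm): frame a waveform and
# flag frames whose short-time energy exceeds a threshold, giving a rough
# speech/non-speech segmentation from which phoneme timing could start.

def short_time_energy(samples, frame_len=4):
    """Split samples into non-overlapping frames and return per-frame energy."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    return energies

def speech_frames(samples, frame_len=4, threshold=0.01):
    """Return one boolean per frame: True where energy suggests speech."""
    return [e > threshold for e in short_time_energy(samples, frame_len)]

silence = [0.001, -0.002, 0.001, 0.0]   # invented near-silent samples
voiced = [0.5, -0.4, 0.45, -0.5]        # invented voiced samples
flags = speech_frames(silence + voiced)
```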

The processing unit 120 generates a speech recognition image feature based on the vocal image data. More specifically, the processing unit 120 may analyze the voiced airflow image and the vocal mouth image to generate the image feature. That is, the processing unit 120 may analyze the vocal mouth image for the predicted phonemes according to mouth shape, and analyze the infrared image of the voicing airflow to produce voicing strength, phoneme duration, and predicted phonemes.

The output unit 130 outputs a speech recognition result through pattern recognition based on the speech recognition speech feature and the speech recognition image feature (S140).

More specifically, the output unit 130 analyzes the time at which the actual voice was pronounced, the phonemes, and the expected phonemes included in the speech recognition voice feature, together with the voicing timing, voicing intensity, and expected phonemes included in the speech recognition image feature. For example, when the speaker utters the syllable 'go', the speech recognition voice feature may include 'go', 'o', and 'no' as expected phonemes, together with the probability that each matches the actual phoneme. In addition, the speech recognition image feature may include 'go' and 'do' as expected phonemes, with the probability that each matches the actual phoneme. Accordingly, the pattern recognition can analyze the expected phonemes included in the speech recognition voice feature and the speech recognition image feature, and the output unit 130 can determine the phoneme with the highest probability as the speech recognition result.
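The highest-probability decision described above can be sketched as a simple fusion of per-modality candidate probabilities. The product rule, the floor value, and the candidate syllables and numbers are assumptions for illustration only; the patent does not specify a fusion formula.

```python
# Hedged sketch: each modality proposes candidate phonemes with match
# probabilities; a product score (one of several reasonable fusion rules)
# picks the winner.

def fuse_candidates(voice_probs, image_probs):
    """Combine per-phoneme probabilities from both modalities.

    Candidates missing from one modality get a small floor probability
    rather than zero, so one modality cannot silently veto the other.
    """
    floor = 1e-3
    candidates = set(voice_probs) | set(image_probs)
    scores = {
        c: voice_probs.get(c, floor) * image_probs.get(c, floor)
        for c in candidates
    }
    return max(scores, key=scores.get), scores

voice_probs = {"go": 0.5, "o": 0.3, "no": 0.2}   # invented acoustic candidates
image_probs = {"go": 0.7, "do": 0.3}             # invented image candidates
best, scores = fuse_candidates(voice_probs, image_probs)
```

With these invented numbers, 'go' wins because it scores well in both modalities, matching the idea of choosing the phoneme with the highest combined probability.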

FIG. 3 illustrates an exemplary photographing for infrared detection based speech recognition according to an embodiment of the present invention.

Referring to FIG. 3, the infrared detection-based photographing apparatus 210 photographs the voicing airflow 220 generated from the speaker. The infrared detection-based photographing apparatus 210 photographs both the speaker and the voicing airflow 220, and may be capable of infrared photographing only, or of visible-light photographing together with infrared photographing.

The voicing airflow 220 is detected by infrared rays, and its brightness may be displayed differently depending on temperature. The voicing airflow 220 may also be displayed in different colors depending on temperature.

The photographing range 230 is determined by the infrared detection-based photographing apparatus 210 and may be set automatically, by detecting the voicing airflow 220, so as to include both the speaker and the voicing airflow 220.
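A minimal sketch of setting the photographing range automatically might compute a bounding box over all detected speaker and airflow pixels. The boolean-mask input format is an assumption carried over from an earlier detection step, not something the patent specifies.

```python
# Illustrative sketch: derive a photographing range (bounding box) that
# covers every detected face or airflow pixel, mirroring the idea that the
# apparatus sets its range to include both the speaker and the airflow.

def bounding_box(mask):
    """Return (row_min, row_max, col_min, col_max) over True pixels, or None."""
    coords = [
        (r, c)
        for r, row in enumerate(mask)
        for c, hit in enumerate(row)
        if hit
    ]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), max(rows), min(cols), max(cols)

mask = [
    [False, False, False],
    [True,  True,  False],   # speaker plus airflow plume
    [False, True,  True],
]
box = bounding_box(mask)
```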

FIG. 4 illustrates an exemplary configuration and a flow of a speech recognition process according to an exemplary embodiment of the present invention.

Referring to FIG. 4, the vocal sound data 311 includes a speaker voice 312 and a noise 313. The vocal sound data 311 may be separated into the speaker voice 312 and the noise 313 by the processing unit 120, and the speaker voice 312 can be converted into a voice feature 315 through a known voice-data processing algorithm.
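The separation of the speaker voice 312 from the noise 313 is not specified in detail; the sketch below uses a crude time-domain noise-floor subtraction as a stand-in, with invented sample values, rather than the "known voice-data processing algorithm" the patent refers to.

```python
# Minimal stand-in for the voice/noise separation step: estimate the noise
# floor from a lead-in segment assumed to contain no speech, then subtract
# it from each sample's magnitude (a crude time-domain analogue of
# spectral subtraction).

def denoise(samples, noise_lead_in=4):
    """Subtract the mean magnitude of the lead-in from every sample."""
    lead = samples[:noise_lead_in]
    noise_floor = sum(abs(s) for s in lead) / len(lead)
    cleaned = []
    for s in samples:
        mag = max(abs(s) - noise_floor, 0.0)
        cleaned.append(mag if s >= 0 else -mag)
    return cleaned

# Invented samples: four quiet noise samples, then louder voiced samples.
samples = [0.1, -0.1, 0.1, -0.1, 0.6, -0.5, 0.55]
cleaned = denoise(samples)
```

Real systems would work in the frequency domain; this only illustrates the idea of removing an estimated noise contribution before feature extraction.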

The vocal image data 321 is separated by the processing unit 120 into a voiced airflow image 322 and a vocal mouth image 323. The voiced airflow image 322 is obtained by detecting infrared rays, while the vocal mouth image 323 is analyzed by detecting infrared or visible light. The voiced airflow image 322 and the vocal mouth image 323 may be analyzed in the processing unit 120 and converted into, or extracted as, an image feature 325. The vocal image data 321 can be analyzed through infrared detection, more specifically through long-wavelength infrared detection.

The pattern recognition module 400 receives the voice feature 315 and the image feature 325 and outputs a speech recognition result 410. The pattern recognition module 400 considers all possible matches between the expected phonemes and the actual phonemes contained in the voice feature 315 and the image feature 325. In particular, if no vocal sound data is detected because the speaker's voice is too quiet, or if the noise in the vocal sound data outweighs the speaker's voice, the pattern recognition module 400 can determine the speech recognition result 410 from the image feature 325 alone.
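The fallback rule just described can be captured in a few lines. The energy-ratio criterion here is an illustrative stand-in for whatever test the module actually applies to decide that the voice channel is unusable.

```python
# Hedged sketch of the fallback rule: when no voiced speech is detected,
# or the estimated noise exceeds the speaker's voice, rely on the image
# feature alone; otherwise fuse both modalities.

def choose_modalities(voice_energy, noise_energy):
    """Return which features the pattern recognition should use."""
    if voice_energy == 0.0 or noise_energy > voice_energy:
        return "image_only"
    return "voice_and_image"

# Invented energy values for the three regimes described in the text.
quiet = choose_modalities(0.0, 0.2)    # voice not detected at all
noisy = choose_modalities(0.1, 0.5)    # noise outweighs the voice
normal = choose_modalities(0.8, 0.1)   # usable voice channel
```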

The speech recognition result 410 includes the expected phonemes determined through the pattern recognition module 400. For example, if the actual utterance is 'gomawo' ('thank you'), the pattern recognition module 400 may determine the speech recognition result 410 as 'go', 'ma', and 'wo' for each phoneme. That is, the speech recognition result 410 can be regarded as the set of expected phonemes determined through the pattern recognition module 400. The speech recognition result 410 may be converted according to the device that outputs it; for example, it may be converted to text format when output to a display device such as a monitor, or to sound when output to an output device such as a loudspeaker.

FIG. 5 illustrates exemplary infrared detection results of phonemes according to phonemes in a speech recognition process according to an embodiment of the present invention.

Referring to FIGS. 5(a) to 5(c), images are captured at regular time intervals while the speaker utters 'gomawo'. Accordingly, the infrared detection-based photographing apparatus 210 successively detects the first voicing airflow 511 and the first mouth shape 521, the second voicing airflow 513 and the second mouth shape 523, and the third voicing airflow 515 and the third mouth shape 525. FIGS. 5(a) to 5(c) illustrate part of an image sequence photographed successively by the infrared detection-based photographing apparatus 210.

Accordingly, the first to third voicing airflows 511, 513, and 515 and the first to third mouth shapes 521, 523, and 525 detected by the infrared detection-based photographing apparatus 210 can be displayed continuously through a display device.

The processing unit 120 analyzes the continuous voicing airflow images to generate a speech recognition image feature, and can compare the voiced airflow image with an airflow model to identify speech regions and distinguish phonemes. The processing unit 120 may analyze the voicing airflow images as shown in FIG. 5 to calculate changes in the velocity, amount, pressure, and so on of the voicing airflow propagating from the voicing source. Accordingly, the processing unit 120 can generate the speech recognition image feature on the basis of the calculated changes.
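As a sketch of the variation calculation, the code below uses the count of airflow pixels per frame as a proxy for the amount of airflow and its frame-to-frame delta as a proxy for the rate of change. Velocity and pressure would require optical-flow or calibration data the patent does not detail, so only the amount is sketched, and the mask sequence is invented.

```python
# Illustrative sketch: from consecutive airflow masks, track how much
# airflow is visible per frame and how quickly it changes between frames.

def airflow_amounts(masks):
    """True-pixel count per frame (a proxy for the amount of airflow)."""
    return [sum(hit for row in mask for hit in row) for mask in masks]

def amount_deltas(amounts):
    """Frame-to-frame change in the airflow amount."""
    return [b - a for a, b in zip(amounts, amounts[1:])]

masks = [
    [[False, False], [False, False]],  # before voicing
    [[True,  False], [True,  False]],  # airflow appears
    [[True,  True],  [True,  False]],  # airflow grows
]
amounts = airflow_amounts(masks)
deltas = amount_deltas(amounts)
```

A rising delta marks a voicing onset and could anchor the phoneme intervals the text mentions.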

The processing unit 120 analyzes the first to third voicing airflows 511, 513, and 515 and the first to third mouth shapes 521, 523, and 525 shown in FIGS. 5(a) to 5(c), and the expected phonemes can be determined through the pattern recognition module 400. For example, the processing unit 120 may analyze the continuous airflow and mouth-shape images including the first to third voicing airflows 511, 513, and 515 and the first to third mouth shapes 521, 523, and 525 to generate a speech recognition image feature, which is supplied, together with the speech recognition voice feature, to the models of the pattern recognition module 400 (e.g., a vocal image model, an acoustic model, and a language model). However, each mouth shape 521, 523, and 525 is used together with the voicing airflow to determine the expected phonemes, and may not be used in some cases.

The output unit 130 may output the final speech recognition result by combining the analyses of the first to third voicing airflows 511, 513, and 515 and the first to third mouth shapes 521, 523, and 525.

The infrared detection-based speech recognition module 100 receives vocal data in voice and video formats. In particular, the processing unit 120 generates a speech recognition image feature based on the voiced airflow image obtained by detecting the infrared radiation of the voicing airflow, and the output unit 130 outputs the speech recognition result through pattern recognition together with the speech recognition voice feature. Accordingly, even in a noisy environment or when the speaker's voice is very quiet, the infrared detection-based speech recognition module 100 can detect the voicing airflow along with the mouth shape and perform more accurate speech recognition.

In this specification, each block or each step may represent a part of a module, segment or code that includes one or more executable instructions for executing the specified logical function (s). It should also be noted that in some alternative embodiments, the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially concurrently, or the blocks or steps may sometimes be performed in reverse order according to the corresponding function.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in a RAM memory, a flash memory, a ROM memory, an EPROM memory, an EEPROM memory, a register, a hard disk, a removable disk, a CD-ROM or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor, which is capable of reading information from, and writing information to, the storage medium. Alternatively, the storage medium may be integral with the processor. The processor and the storage medium may reside within an application specific integrated circuit (ASIC). The ASIC may reside within the user terminal. Alternatively, the processor and the storage medium may reside as discrete components in a user terminal.

Although the embodiments of the present invention have been described in detail with reference to the accompanying drawings, it is to be understood that the present invention is not limited to those embodiments and that various changes and modifications may be made without departing from the scope of the present invention. Therefore, the embodiments disclosed herein are intended to illustrate rather than limit the scope of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments; the above-described embodiments are illustrative in all aspects and not restrictive. The scope of protection of the present invention should be construed according to the following claims, and all technical ideas within the scope of equivalents should be construed as falling within the scope of the present invention.

100 Infrared detection based voice recognition module
110 Receiving unit
120 Processor
130 Output unit
210 Infrared detection base
220 Vocal source
230 Capture range
311 Vocal sound data
312 Speaker's voice
313 Noise
315 Voice features
321 Vocal image data
322 Vocal airflow image
323 Vocal mouth shape image
325 Image features
400 Pattern recognition module
410 Speech recognition result
511 First vocal airflow
513 Second vocal airflow
515 Third vocal airflow
521 First mouth shape
523 Second mouth shape
525 Third mouth shape

Claims (11)

Receiving vocal data including vocal image data and vocal sound data from a vocal source;
Processing a vocal image based on the vocal image data into a vocal airflow image based on infrared detection and a vocal mouth shape image based on visible light detection;
Generating a speech recognition voice feature based on the vocal sound data and generating a speech recognition image feature based on the vocal image data; And
Outputting a speech recognition result through pattern recognition based on the speech recognition voice feature and the speech recognition image feature.
The method according to claim 1,
Wherein the step of generating the speech recognition image feature includes calculating at least one of a velocity, an amount, and a pressure of a vocal airflow propagated from the vocal source through infrared detection of the vocal airflow image.
The method according to claim 1,
Wherein the infrared detection detects far-infrared rays.
The method according to claim 1,
Wherein the step of generating the speech recognition image feature includes identifying a speech section and distinguishing phonemes based on the speech recognition image feature.
The method according to claim 1,
Wherein the pattern recognition is based only on the speech recognition image feature when the vocal sound data is not detected or when the proportion of noise in the vocal sound data is greater than that of the speaker's voice.
A receiving unit for receiving vocal data including vocal image data and vocal sound data from a vocal source;
A processor for processing a vocal image based on the vocal image data into a vocal airflow image based on infrared detection and a vocal mouth shape image based on visible light detection, generating a speech recognition voice feature based on the vocal sound data, and generating a speech recognition image feature based on the vocal image data; And
An output unit for outputting a speech recognition result through pattern recognition based on the speech recognition voice feature and the speech recognition image feature.
The apparatus according to claim 6,
Wherein the processor calculates at least one of a velocity, an amount, and a pressure of a vocal airflow propagated from the vocal source through infrared detection of the vocal airflow image.
The apparatus according to claim 6,
Wherein the infrared detection detects far-infrared rays.
The apparatus according to claim 6,
Wherein the processor identifies a speech section and distinguishes phonemes based on the speech recognition image feature.
The apparatus according to claim 6,
Wherein the pattern recognition is based only on the speech recognition image feature when the vocal sound data is not detected or when the proportion of noise in the vocal sound data is greater than that of the speaker's voice.
Receiving vocal data including vocal image data and vocal sound data from a vocal source,
Processing a vocal image based on the vocal image data into a vocal airflow image based on infrared detection and a vocal mouth shape image based on visible light detection,
Generating a speech recognition voice feature based on the vocal sound data and generating a speech recognition image feature based on the vocal image data,
And outputting a speech recognition result through pattern recognition based on the speech recognition voice feature and the speech recognition image feature.
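The claimed pipeline can be illustrated with a minimal sketch. This is not the patent's implementation: the feature extractors below (log-energy for audio, mean infrared intensity and mouth-region area for the airflow and mouth-shape images) are simplified stand-ins chosen for illustration, and the signal-to-noise threshold is a hypothetical parameter. What the sketch does reflect from the claims is the structure: separate voice and image feature streams, and the fallback of claims 5 and 10 where pattern recognition relies on the image feature alone when no vocal sound is detected or noise dominates the speaker's voice.

```python
import numpy as np

def audio_features(vocal_sound, frame_len=256):
    # Stand-in acoustic feature: per-frame log-energy. A real system
    # would use MFCCs or similar spectral features.
    n = len(vocal_sound) // frame_len
    frames = vocal_sound[: n * frame_len].reshape(n, frame_len)
    return np.log(np.sum(frames ** 2, axis=1) + 1e-12)

def image_features(airflow_frames, mouth_frames):
    # Per-frame mean infrared intensity stands in for the airflow
    # velocity/amount/pressure estimate of claims 2 and 7; the
    # fraction of bright pixels stands in for the mouth-shape feature.
    airflow = np.array([f.mean() for f in airflow_frames])
    mouth = np.array([(f > 0.5).mean() for f in mouth_frames])
    return np.stack([airflow, mouth], axis=1)

def recognize(vocal_sound, airflow_frames, mouth_frames, snr_db):
    # Claims 5 and 10: if no vocal sound is detected, or noise
    # outweighs the speaker's voice (here: negative SNR in dB),
    # pattern recognition uses the image features only.
    img = image_features(airflow_frames, mouth_frames)
    if vocal_sound is None or snr_db < 0.0:
        return {"mode": "image-only", "features": img}
    aud = audio_features(vocal_sound)
    return {"mode": "audio+image", "features": (aud, img)}
```

A downstream pattern recognizer (numeral 400 in the drawings) would consume these features; the dictionary returned here merely marks which modality combination was selected.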
KR1020150154045A 2015-11-03 2015-11-03 Method and apparatus for voice recognition based on infrared detection KR20170052082A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020150154045A KR20170052082A (en) 2015-11-03 2015-11-03 Method and apparatus for voice recognition based on infrared detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020150154045A KR20170052082A (en) 2015-11-03 2015-11-03 Method and apparatus for voice recognition based on infrared detection

Publications (1)

Publication Number Publication Date
KR20170052082A true KR20170052082A (en) 2017-05-12

Family

ID=58740263

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020150154045A KR20170052082A (en) 2015-11-03 2015-11-03 Method and apparatus for voice recognition based on infrared detection

Country Status (1)

Country Link
KR (1) KR20170052082A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689464A (en) * 2019-10-09 2020-01-14 重庆医药高等专科学校 Mouth shape recognition-based English pronunciation quality assessment method


Similar Documents

Publication Publication Date Title
CN108305615B (en) Object identification method and device, storage medium and terminal thereof
US6185529B1 (en) Speech recognition aided by lateral profile image
US10482872B2 (en) Speech recognition apparatus and speech recognition method
JP4795919B2 (en) Voice interval detection method
Luettin Visual speech and speaker recognition
EP2562746A1 (en) Apparatus and method for recognizing voice by using lip image
JP4715738B2 (en) Utterance detection device and utterance detection method
JP2007264473A (en) Voice processor, voice processing method, and voice processing program
JPH11219421A (en) Image recognizing device and method therefor
JP2010256391A (en) Voice information processing device
JP2008310382A (en) Lip reading device and method, information processor, information processing method, detection device and method, program, data structure, and recording medium
JP5040778B2 (en) Speech synthesis apparatus, method and program
KR101187600B1 (en) Speech Recognition Device and Speech Recognition Method using 3D Real-time Lip Feature Point based on Stereo Camera
JP2018013549A (en) Speech content recognition device
JP2007199552A (en) Device and method for speech recognition
KR20170073113A (en) Method and apparatus for recognizing emotion using tone and tempo of voice signal
JPH0792988A (en) Speech detecting device and video switching device
An et al. Detecting laughter and filled pauses using syllable-based features.
WO2020079733A1 (en) Speech recognition device, speech recognition system, and speech recognition method
US20150039314A1 (en) Speech recognition method and apparatus based on sound mapping
WO2020250828A1 (en) Utterance section detection device, utterance section detection method, and utterance section detection program
KR102265874B1 (en) Method and Apparatus for Distinguishing User based on Multimodal
JP4775961B2 (en) Pronunciation estimation method using video
KR20170052082A (en) Method and apparatus for voice recognition based on infrared detection
JP2005276230A (en) Image recognition apparatus