KR20150054445A - Sound recognition device - Google Patents

Sound recognition device

Info

Publication number
KR20150054445A
KR20150054445A (Application KR1020130136890A)
Authority
KR
South Korea
Prior art keywords
voice
user
ambiguity
speech
speech recognition
Prior art date
Application number
KR1020130136890A
Other languages
Korean (ko)
Inventor
강병옥
정호영
전형배
이성주
이윤근
Original Assignee
한국전자통신연구원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국전자통신연구원 filed Critical 한국전자통신연구원
Priority to KR1020130136890A
Publication of KR20150054445A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams

Abstract

According to an embodiment of the present invention, a voice recognition device includes a voice database, an ambiguity application module, and a decoding module. The voice database stores user read-aloud voice data, corresponding to the voice of a user reciting a script written with the distribution of phonemes in mind, and accumulated natural-language voice data of the user, corresponding to user voice input in advance for each situation. The ambiguity application module compares the user's read-aloud voice data and natural-language voice data to extract situation-specific ambiguity weight values for the ambiguity of each phoneme, and applies the ambiguity weight values to a pre-set voice model for the respective situations. Upon receiving voice input from the user, the decoding module runs a voice recognition operation based on the voice model to which the ambiguity weight values were applied in advance by the ambiguity application module. The decoding module also performs a further voice recognition operation, based on the classification parameters and context information, for the phonemes or phoneme segments of the user's voice whose ambiguity weight values are higher than a pre-set weight value for the respective situations.

Description

[0001] The present invention relates to a speech recognition device.

More particularly, the present invention relates to a speech recognition apparatus that can improve recognition performance for user speech in domains such as presentation/meeting transcription, call center recording, and medical/legal services.

Generally, a speech recognition apparatus uses a Hidden Markov Model (HMM) as its speech recognition method. During recognition, a Viterbi search is performed: it determines the most likely candidate word by comparing the features of the currently input speech against HMMs constructed in advance by training on the candidate words to be recognized.
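As an illustration of the Viterbi search described above, the following is a minimal sketch of Viterbi decoding over a discrete-observation HMM. It is a toy example for exposition, not code from the patent, and all names are illustrative.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Find the most likely hidden-state path for an observation sequence.

    obs: sequence of observation indices
    start_p[s]: initial probability of state s
    trans_p[s, t]: probability of moving from state s to state t
    emit_p[s, o]: probability of state s emitting observation o
    """
    n_states = len(start_p)
    T = len(obs)
    # work in log-probabilities to avoid underflow on long utterances
    v = np.full((T, n_states), -np.inf)
    back = np.zeros((T, n_states), dtype=int)
    v[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = v[t - 1] + np.log(trans_p[:, s])
            back[t, s] = int(np.argmax(scores))
            v[t, s] = scores[back[t, s]] + np.log(emit_p[s, obs[t]])
    # backtrack from the best final state
    path = [int(np.argmax(v[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```

In a real recognizer the states would be sub-phoneme HMM states and the observations continuous acoustic feature vectors, but the dynamic-programming recursion is the same.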

The HMM models speech at the level of basic phoneme units. In other words, most Korean speech recognition engines form words and sentences by combining the phonemes arriving at the speech recognition engine with the phoneme models stored in the engine's databases.

An HMM is a doubly stochastic model that estimates an unobservable process through another, observable process. Using HMMs in speech recognition therefore means modeling the minimum phoneme units of speech and constructing the speech recognition apparatus from those models.

In a speech recognition apparatus, read (scripted) speech and natural speech differ markedly in acoustic space, both within the same speaker and across individual speakers, and natural speech additionally exhibits disfluencies such as slurring and hesitation. These characteristics appear in most natural-speech interfaces, such as presentation/meeting transcription, call center recording, and medical/legal services, as opposed to dictation domains served by trained speakers, such as broadcast news. Because acoustic-space variation in natural speech differs greatly with the domain and speaker, and varies with the situation even for the same speaker and domain, there is a limit to what speaker adaptation techniques or acoustic models trained on general-purpose data can achieve.

It is an object of the present invention to provide a speech recognition apparatus that improves recognition performance for user speech in domains such as presentation/meeting transcription, call center recording, and medical/legal services.

The speech recognition apparatus according to an embodiment includes: a voice database storing user read-aloud speech data, corresponding to user speech uttered from a script designed in consideration of phoneme distribution, and user natural-language speech data accumulated from previously input user speech; an ambiguity application module that compares the user read-aloud speech data with the user natural-language speech data to extract context-specific ambiguity weights for the ambiguity of each phoneme, and applies those weights to a preset acoustic model; and a decoding module that, when user speech is input, performs speech recognition based on the acoustic model to which the context-specific ambiguity weights were applied by the ambiguity application module, and performs speech recognition based on context information and classifier parameters set for phonemes or phoneme intervals of the uttered user speech whose ambiguity weight is equal to or greater than a set weight.

The voice database according to the embodiment may include a first database in which the user read-aloud speech data is stored in advance, and a second database in which the user natural-language speech data is classified and stored by situation.

The ambiguity estimating unit according to the embodiment may determine the degree of variation in acoustic space for each phoneme included in the user read-aloud speech data and the user natural-language speech data, and estimate a context-specific ambiguity weight for the ambiguity of each phoneme.
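One plausible way to realize the estimation just described is to compare, per phoneme, the spread of acoustic features between read-aloud and natural speech, and use the ratio as the ambiguity weight. The sketch below assumes this interpretation; the function name, the use of mean per-dimension variance, and the feature representation are illustrative assumptions, not details from the patent.

```python
import numpy as np

def estimate_ambiguity_weights(read_feats, natural_feats):
    """Estimate a per-phoneme ambiguity weight from acoustic-space variation.

    read_feats / natural_feats: dict mapping a phoneme label to an array of
    acoustic feature vectors (e.g. MFCC frames) observed for that phoneme.
    The weight is sketched here as the ratio of feature variance in natural
    speech to variance in read (scripted) speech: phonemes whose realization
    drifts more in spontaneous speech score higher.
    """
    weights = {}
    for phoneme in read_feats:
        read_var = np.var(np.asarray(read_feats[phoneme]), axis=0).mean()
        nat_var = np.var(np.asarray(natural_feats[phoneme]), axis=0).mean()
        # guard against a zero-variance read-aloud sample
        weights[phoneme] = nat_var / max(read_var, 1e-8)
    return weights
```

A weight near 1 then means the phoneme is realized about as consistently in natural speech as when read aloud, while a large weight flags a phoneme whose pronunciation is unstable in spontaneous speech.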

The decoding module according to the exemplary embodiment may include a first decoding module that performs phoneme-by-phoneme recognition of the uttered user speech based on the acoustic model, excluding phonemes whose context-specific ambiguity weight is equal to or higher than the set weight, and a second decoding module that performs recognition of the phonemes excluded by the first decoding module by extracting the corresponding phonemes from the context information and the classifier parameters.

The speech recognition apparatus according to the embodiment can refine the acoustic model preset for recognizing the user's voice based on accumulated user speech data, and thereby secure reliability and accuracy in recognizing the user's voice.

FIG. 1 is a control block diagram showing the control configuration of a speech recognition apparatus according to an embodiment.

The following merely illustrates the principles of the invention. Those skilled in the art will therefore be able to devise various apparatuses that, although not explicitly described or shown herein, embody the principles of the invention and fall within its concept and scope. All conditional terms and examples recited in this specification are, in principle, expressly intended only to aid understanding of the inventive concept, and are not to be construed as limited to the specifically recited embodiments and conditions.

The detailed description, as well as the principles, aspects, and embodiments of the invention and the specific examples thereof, is intended to cover their structural and functional equivalents. Such equivalents include not only currently known equivalents but also those to be developed in the future, that is, any element devised to perform the same function regardless of its structure.

Thus, for example, the block diagrams herein should be understood to illustrate exemplary conceptual aspects embodying the principles of the invention. Similarly, all flowcharts, state transition diagrams, pseudocode, and the like represent various processes that may be substantially embodied on a computer-readable medium and executed by a computer or processor, whether or not the computer or processor is explicitly shown.

The functions of the various elements shown in the figures, including functional blocks depicted as a processor or a similar concept, may be provided by dedicated hardware, or by hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, a single shared processor, or a plurality of individual processors, some of which may be shared.

Also, the explicit use of terms such as "processor" or "control" should not be interpreted as referring exclusively to hardware capable of running software, and may be understood to implicitly include, without limitation, digital signal processor (DSP) hardware, read-only memory (ROM), random access memory (RAM), and non-volatile memory for storing software. Other well-known hardware may also be included.

In the claims of this specification, an element expressed as a means for performing a function described in the detailed description encompasses, for example, any combination of circuit elements performing that function, or software in any form, including firmware or microcode, coupled with appropriate circuitry for executing that software so as to perform the function. Since the functions provided by the variously enumerated means are combined in the manner required by the claims, any means capable of providing those functions is to be regarded as equivalent to the means identified in this specification.

The above and other objects, features, and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. In the following description, detailed descriptions of known technologies related to the present invention will be omitted when they might unnecessarily obscure the gist of the present invention.

FIG. 1 is a control block diagram showing the control configuration of a speech recognition apparatus according to an embodiment.

Referring to FIG. 1, the speech recognition apparatus 300 may include a voice database (DB) 310, an ambiguity application module 320, and first and second decoding modules 330 and 340.

The voice database 310 stores user read-aloud speech data, corresponding to user speech uttered from a script designed in consideration of phoneme distribution, and user natural-language speech data accumulated from previously input user speech.

Here, the voice database 310 may include a first database 312 in which the user read-aloud speech data is stored in advance, and a second database 314 in which the user natural-language speech data is classified and stored by situation.

That is, the second database 314 may classify the user natural-language speech data, for at least one of previously input user speech and the recognition results obtained by recognizing that speech, by situation, such as presentation/meeting transcription, call center recording, or medical/legal services, and accumulate and store it.
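The two-part database described above can be sketched as follows. This is a minimal illustrative data structure; the class and method names are assumptions for exposition and do not come from the patent.

```python
from collections import defaultdict

class VoiceDatabase:
    """Toy sketch of the two-part voice database: a first store holding
    read-aloud (scripted) utterances, and a second store that files
    natural-language utterances under the situation they came from
    (e.g. "meeting", "call_center", "medical")."""

    def __init__(self):
        self.read_aloud = []              # first database (312 in FIG. 1)
        self.natural = defaultdict(list)  # second database (314 in FIG. 1)

    def add_read_aloud(self, utterance):
        self.read_aloud.append(utterance)

    def add_natural(self, situation, utterance, transcript=None):
        # accumulate the speech and, optionally, its recognition result
        self.natural[situation].append((utterance, transcript))

    def natural_for(self, situation):
        return self.natural[situation]
```

Keying the natural-speech store by situation is what lets the ambiguity weights later be extracted per situation rather than globally.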

The ambiguity application module 320 may include an ambiguity estimating unit 322 that determines the degree of variation in acoustic space for each phoneme included in the user read-aloud speech data and the user natural-language speech data and estimates a context-specific ambiguity weight for each phoneme's ambiguity, and an ambiguity applying unit 324 that applies the ambiguity weights estimated by the ambiguity estimating unit 322 to the acoustic model.

The first decoding module 330 may output a speech recognition result s20 by applying the user speech v20 to the acoustic model set for recognition of the user speech v20.

That is, when the user speech v20 is input, the first decoding module 330 may apply the minimum-phoneme-unit models constituting the acoustic model to perform phoneme-by-phoneme recognition of the user speech v20, excluding the phonemes whose context-specific ambiguity weight is equal to or higher than the set weight.

The second decoding module 340 may perform recognition of the phonemes at or above the set weight, for which recognition was not performed in the first decoding module 330, by extracting the corresponding phonemes from the context information and the classifier parameters.
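The division of labor between the two decoding modules can be sketched as a simple routing step: low-ambiguity segments go through the acoustic model, high-ambiguity segments are deferred to the classifier-based path. The function names and the pluggable-callable design are illustrative assumptions, not the patent's implementation.

```python
def two_pass_decode(segments, weights, threshold, acoustic_decode, classifier_decode):
    """Route each phoneme segment to the appropriate decoding path.

    segments: phoneme segments of the input utterance
    weights: dict mapping a segment to its context-specific ambiguity weight
    threshold: the preset weight above which the acoustic model is distrusted
    acoustic_decode: stands in for the first decoding module (acoustic model)
    classifier_decode: stands in for the second decoding module
                       (context information + classifier parameters)
    """
    result = []
    for segment in segments:
        if weights.get(segment, 0.0) < threshold:
            result.append(acoustic_decode(segment))    # first decoding module
        else:
            result.append(classifier_decode(segment))  # second decoding module
    return result
```

The point of the split is that the expensive context/classifier machinery only runs on the segments the acoustic model is known, from the read-aloud vs. natural comparison, to handle poorly.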

The context information and the classifier parameters may be extracted from a separate repository and classifier and stored in the second decoding module 340, but are not limited thereto.

It will be understood by those skilled in the art that various changes in form and detail may be made without departing from the spirit and scope of the invention as defined by the appended claims. The embodiments disclosed herein are intended to illustrate, not limit, the technical idea of the present invention, and the scope of that idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the following claims, and all technical ideas within the scope of their equivalents should be construed as falling within the scope of the present invention.

Claims (1)

A speech recognition apparatus comprising: a voice database in which user read-aloud speech data, corresponding to user speech uttered from a script designed in consideration of phoneme distribution, and user natural-language speech data, accumulated in response to previously input user speech, are stored by situation;
an ambiguity application module configured to compare the user read-aloud speech data with the user natural-language speech data to extract a context-specific ambiguity weight for the ambiguity of each phoneme, and to apply the context-specific ambiguity weights to a preset acoustic model; and
a decoding module configured to perform speech recognition based on the acoustic model to which the context-specific ambiguity weights are applied by the ambiguity application module when user speech is input, and to perform speech recognition based on the context information and the classifier parameters set for a phoneme or phoneme interval whose ambiguity weight is equal to or greater than the set weight.

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020130136890A KR20150054445A (en) 2013-11-12 2013-11-12 Sound recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020130136890A KR20150054445A (en) 2013-11-12 2013-11-12 Sound recognition device

Publications (1)

Publication Number Publication Date
KR20150054445A true KR20150054445A (en) 2015-05-20

Family

ID=53390593

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020130136890A KR20150054445A (en) 2013-11-12 2013-11-12 Sound recognition device

Country Status (1)

Country Link
KR (1) KR20150054445A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020091123A1 (en) * 2018-11-02 2020-05-07 주식회사 시스트란인터내셔널 Method and device for providing context-based voice recognition service
CN111883113A (en) * 2020-07-30 2020-11-03 云知声智能科技股份有限公司 Voice recognition method and device
CN111883113B (en) * 2020-07-30 2024-01-30 云知声智能科技股份有限公司 Voice recognition method and device

Similar Documents

Publication Publication Date Title
US10930270B2 (en) Processing audio waveforms
US9299347B1 (en) Speech recognition using associative mapping
US9311915B2 (en) Context-based speech recognition
US9202462B2 (en) Key phrase detection
US8972260B2 (en) Speech recognition using multiple language models
US9552811B2 (en) Speech recognition system and a method of using dynamic bayesian network models
US20150269931A1 (en) Cluster specific speech model
JP6812843B2 (en) Computer program for voice recognition, voice recognition device and voice recognition method
CN106875936B (en) Voice recognition method and device
US10089978B2 (en) Detecting customers with low speech recognition accuracy by investigating consistency of conversation in call-center
US9799325B1 (en) Methods and systems for identifying keywords in speech signal
WO2014183373A1 (en) Systems and methods for voice identification
CN112420026A (en) Optimized keyword retrieval system
KR20170007107A (en) Speech Recognition System and Method
JP7191792B2 (en) Information processing device, information processing method and program
US9959887B2 (en) Multi-pass speech activity detection strategy to improve automatic speech recognition
CN111640423B (en) Word boundary estimation method and device and electronic equipment
KR20150054445A (en) Sound recognition device
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
US11328713B1 (en) On-device contextual understanding
WO2022226782A1 (en) Keyword spotting method based on neural network
CN113658593B (en) Wake-up realization method and device based on voice recognition
Kalantari et al. Incorporating visual information for spoken term detection
Kanrar i-Vector used in Speaker Identification by Dimension Compactness
CN117037801A (en) Method for detecting speech wheel and identifying speaker in real teaching environment based on multiple modes

Legal Events

Date Code Title Description
WITN Withdrawal due to no request for examination