KR20150054445A - Sound recognition device - Google Patents

Sound recognition device

Info

Publication number
KR20150054445A
KR20150054445A (Application KR1020130136890A)
Authority
KR
South Korea
Prior art keywords
voice
user
ambiguity
speech
speech recognition
Prior art date
Application number
KR1020130136890A
Other languages
Korean (ko)
Inventor
강병옥
정호영
전형배
이성주
이윤근
Original Assignee
한국전자통신연구원
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 한국전자통신연구원 filed Critical 한국전자통신연구원
Priority to KR1020130136890A
Publication of KR20150054445A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams

Abstract

According to an embodiment of the present invention, a voice recognition device includes a voice database, an ambiguity application module, and a decoding module. The voice database stores user read-aloud voice data, corresponding to the voice of a user reciting a script written with the distribution of phonemes in mind, and accumulated natural-language voice data of the user, corresponding to user voice input in advance for each situation. The ambiguity application module compares the user's read-aloud voice data and natural-language voice data to extract situation-specific ambiguity weight values for the ambiguity of each phoneme, and applies the ambiguity weight values to a pre-set voice model for the respective situations. Upon receiving voice input from the user, the decoding module runs a voice recognition operation based on the voice model to which the ambiguity weight values were applied in advance by the ambiguity application module. The decoding module also performs a further voice recognition operation, based on the classification parameters and context information, for the phonemes or phoneme segments of the user's voice whose ambiguity weight values are higher than a pre-set weight value for the respective situations.

Description

[0001] The present invention relates to a speech recognition device.

More particularly, the present invention relates to a speech recognition apparatus that can improve recognition performance for user speech in domains such as presentation/meeting transcription, call center recording, and medical/legal services.

Generally, a speech recognition apparatus uses a Hidden Markov Model (HMM) as its speech recognition method. During recognition, a Viterbi search is performed: it determines the most likely candidate word by comparing the features of the currently input speech against HMMs constructed in advance by training on the candidate words to be recognized.
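As an illustration of the Viterbi search described above, the following is a minimal sketch of Viterbi decoding over a discrete-observation HMM. It is a toy example for exposition, not code from the patent, and all names are illustrative.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Find the most likely hidden-state path for an observation sequence.

    obs: sequence of observation indices
    start_p[s]: initial probability of state s
    trans_p[s, t]: probability of moving from state s to state t
    emit_p[s, o]: probability of state s emitting observation o
    """
    n_states = len(start_p)
    T = len(obs)
    # work in log-probabilities to avoid underflow on long utterances
    v = np.full((T, n_states), -np.inf)
    back = np.zeros((T, n_states), dtype=int)
    v[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = v[t - 1] + np.log(trans_p[:, s])
            back[t, s] = int(np.argmax(scores))
            v[t, s] = scores[back[t, s]] + np.log(emit_p[s, obs[t]])
    # backtrack from the best final state
    path = [int(np.argmax(v[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```

In a real recognizer the states would be sub-phoneme HMM states and the observations continuous acoustic feature vectors, but the dynamic-programming recursion is the same.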

The HMM models speech at the level of basic phoneme units. In other words, most Korean speech recognition engines form words and sentences by combining the phonemes arriving at the speech recognition engine with the phoneme models stored in the engine's databases.

An HMM is a doubly stochastic model that estimates an unobservable process through another, observable process. Using HMMs in speech recognition therefore means modeling the minimum phoneme units of speech and constructing the speech recognition apparatus from those models.

In a speech recognition apparatus, read (scripted) speech and natural speech differ markedly in acoustic space, both within the same speaker and across individual speakers, and natural speech additionally exhibits disfluencies such as slurring and hesitation. These characteristics appear in most natural-speech interfaces, such as presentation/meeting transcription, call center recording, and medical/legal services, as opposed to dictation domains served by trained speakers, such as broadcast news. Because acoustic-space variation in natural speech differs greatly with the domain and speaker, and varies with the situation even for the same speaker and domain, there is a limit to what speaker adaptation techniques or acoustic models trained on general-purpose data can achieve.

It is an object of the present invention to provide a speech recognition apparatus that improves recognition performance for user speech in domains such as presentation/meeting transcription, call center recording, and medical/legal services.

The speech recognition apparatus according to an embodiment includes: a voice database storing user read-aloud speech data, corresponding to user speech uttered from a script designed in consideration of phoneme distribution, and user natural-language speech data accumulated from previously input user speech; an ambiguity application module that compares the user read-aloud speech data with the user natural-language speech data to extract context-specific ambiguity weights for the ambiguity of each phoneme, and applies those weights to a preset acoustic model; and a decoding module that, when user speech is input, performs speech recognition based on the acoustic model to which the context-specific ambiguity weights were applied by the ambiguity application module, and performs speech recognition based on context information and classifier parameters set for phonemes or phoneme intervals of the uttered user speech whose ambiguity weight is equal to or greater than a set weight.

The voice database according to the embodiment may include a first database in which the user read-aloud speech data is stored in advance, and a second database in which the user natural-language speech data is classified and stored by situation.

The ambiguity estimating unit according to the embodiment may determine the degree of variation in acoustic space for each phoneme included in the user read-aloud speech data and the user natural-language speech data, and estimate a context-specific ambiguity weight for the ambiguity of each phoneme.
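One plausible way to realize the estimation just described is to compare, per phoneme, the spread of acoustic features between read-aloud and natural speech, and use the ratio as the ambiguity weight. The sketch below assumes this interpretation; the function name, the use of mean per-dimension variance, and the feature representation are illustrative assumptions, not details from the patent.

```python
import numpy as np

def estimate_ambiguity_weights(read_feats, natural_feats):
    """Estimate a per-phoneme ambiguity weight from acoustic-space variation.

    read_feats / natural_feats: dict mapping a phoneme label to an array of
    acoustic feature vectors (e.g. MFCC frames) observed for that phoneme.
    The weight is sketched here as the ratio of feature variance in natural
    speech to variance in read (scripted) speech: phonemes whose realization
    drifts more in spontaneous speech score higher.
    """
    weights = {}
    for phoneme in read_feats:
        read_var = np.var(np.asarray(read_feats[phoneme]), axis=0).mean()
        nat_var = np.var(np.asarray(natural_feats[phoneme]), axis=0).mean()
        # guard against a zero-variance read-aloud sample
        weights[phoneme] = nat_var / max(read_var, 1e-8)
    return weights
```

A weight near 1 then means the phoneme is realized about as consistently in natural speech as when read aloud, while a large weight flags a phoneme whose pronunciation is unstable in spontaneous speech.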

The decoding module according to the exemplary embodiment may include a first decoding module that performs phoneme-by-phoneme recognition of the uttered user speech based on the acoustic model, excluding phonemes whose context-specific ambiguity weight is equal to or higher than the set weight, and a second decoding module that performs recognition of the phonemes excluded by the first decoding module by extracting the corresponding phonemes from the context information and the classifier parameters.

The speech recognition apparatus according to the embodiment can refine the acoustic model preset for recognizing the user's voice based on accumulated user speech data, and thereby secure reliability and accuracy in recognizing the user's voice.

FIG. 1 is a control block diagram showing the control configuration of a speech recognition apparatus according to an embodiment.

The following merely illustrates the principles of the invention. Those skilled in the art will therefore be able to devise various apparatuses that, although not explicitly described or shown herein, embody the principles of the invention and fall within its concept and scope. All conditional terms and examples recited in this specification are, in principle, expressly intended only to aid understanding of the inventive concept, and are not to be construed as limited to the specifically recited embodiments and conditions.

The detailed description, as well as the principles, aspects, and embodiments of the invention and the specific examples thereof, is intended to cover their structural and functional equivalents. Such equivalents include not only currently known equivalents but also those to be developed in the future, that is, any element devised to perform the same function regardless of its structure.

Thus, for example, the block diagrams herein should be understood to illustrate exemplary conceptual aspects embodying the principles of the invention. Similarly, all flowcharts, state transition diagrams, pseudocode, and the like represent various processes that may be substantially embodied on a computer-readable medium and executed by a computer or processor, whether or not the computer or processor is explicitly shown.

The functions of the various elements shown in the figures, including functional blocks depicted as a processor or a similar concept, may be provided by dedicated hardware, or by hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, a single shared processor, or a plurality of individual processors, some of which may be shared.

Also, the explicit use of terms such as "processor" or "control" should not be interpreted as referring exclusively to hardware capable of running software, and may be understood to implicitly include, without limitation, digital signal processor (DSP) hardware, read-only memory (ROM), random access memory (RAM), and non-volatile memory for storing software. Other well-known hardware may also be included.

In the claims of this specification, an element expressed as a means for performing a function described in the detailed description encompasses, for example, any combination of circuit elements performing that function, or software in any form, including firmware or microcode, coupled with appropriate circuitry for executing that software so as to perform the function. Since the functions provided by the variously enumerated means are combined in the manner required by the claims, any means capable of providing those functions is to be regarded as equivalent to the means identified in this specification.

The above and other objects, features, and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. In the following description, detailed descriptions of known technologies related to the present invention will be omitted when they might unnecessarily obscure the gist of the present invention.

FIG. 1 is a control block diagram showing the control configuration of a speech recognition apparatus according to an embodiment.

Referring to FIG. 1, the speech recognition apparatus 300 may include a voice database (DB) 310, an ambiguity application module 320, and first and second decoding modules 330 and 340.

The voice database 310 stores user read-aloud speech data, corresponding to user speech uttered from a script designed in consideration of phoneme distribution, and user natural-language speech data accumulated from previously input user speech.

Here, the voice database 310 may include a first database 312 in which the user read-aloud speech data is stored in advance, and a second database 314 in which the user natural-language speech data is classified and stored by situation.

That is, the second database 314 may classify the user natural-language speech data, for at least one of previously input user speech and the recognition results obtained by recognizing that speech, by situation, such as presentation/meeting transcription, call center recording, or medical/legal services, and accumulate and store it.
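The two-part database described above can be sketched as follows. This is a minimal illustrative data structure; the class and method names are assumptions for exposition and do not come from the patent.

```python
from collections import defaultdict

class VoiceDatabase:
    """Toy sketch of the two-part voice database: a first store holding
    read-aloud (scripted) utterances, and a second store that files
    natural-language utterances under the situation they came from
    (e.g. "meeting", "call_center", "medical")."""

    def __init__(self):
        self.read_aloud = []              # first database (312 in FIG. 1)
        self.natural = defaultdict(list)  # second database (314 in FIG. 1)

    def add_read_aloud(self, utterance):
        self.read_aloud.append(utterance)

    def add_natural(self, situation, utterance, transcript=None):
        # accumulate the speech and, optionally, its recognition result
        self.natural[situation].append((utterance, transcript))

    def natural_for(self, situation):
        return self.natural[situation]
```

Keying the natural-speech store by situation is what lets the ambiguity weights later be extracted per situation rather than globally.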

The ambiguity application module 320 may include an ambiguity estimating unit 322 that determines the degree of variation in acoustic space for each phoneme included in the user read-aloud speech data and the user natural-language speech data and estimates a context-specific ambiguity weight for each phoneme's ambiguity, and an ambiguity applying unit 324 that applies the ambiguity weights estimated by the ambiguity estimating unit 322 to the acoustic model.

The first decoding module 330 may output a speech recognition result s20 by applying the user speech v20 to the acoustic model set for recognition of the user speech v20.

That is, when the user speech v20 is input, the first decoding module 330 may apply the minimum-phoneme-unit models constituting the acoustic model to perform phoneme-by-phoneme recognition of the user speech v20, excluding the phonemes whose context-specific ambiguity weight is equal to or higher than the set weight.

The second decoding module 340 may perform recognition of the phonemes at or above the set weight, for which recognition was not performed in the first decoding module 330, by extracting the corresponding phonemes from the context information and the classifier parameters.
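The division of labor between the two decoding modules can be sketched as a simple routing step: low-ambiguity segments go through the acoustic model, high-ambiguity segments are deferred to the classifier-based path. The function names and the pluggable-callable design are illustrative assumptions, not the patent's implementation.

```python
def two_pass_decode(segments, weights, threshold, acoustic_decode, classifier_decode):
    """Route each phoneme segment to the appropriate decoding path.

    segments: phoneme segments of the input utterance
    weights: dict mapping a segment to its context-specific ambiguity weight
    threshold: the preset weight above which the acoustic model is distrusted
    acoustic_decode: stands in for the first decoding module (acoustic model)
    classifier_decode: stands in for the second decoding module
                       (context information + classifier parameters)
    """
    result = []
    for segment in segments:
        if weights.get(segment, 0.0) < threshold:
            result.append(acoustic_decode(segment))    # first decoding module
        else:
            result.append(classifier_decode(segment))  # second decoding module
    return result
```

The point of the split is that the expensive context/classifier machinery only runs on the segments the acoustic model is known, from the read-aloud vs. natural comparison, to handle poorly.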

The context information and the classifier parameters may be extracted from a separate repository and classifier and stored in the second decoding module 340, but are not limited thereto.

It will be understood by those skilled in the art that various changes in form and detail may be made without departing from the spirit and scope of the invention as defined by the appended claims. The embodiments disclosed herein are intended to illustrate, not limit, the technical idea of the present invention, and the scope of that idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the following claims, and all technical ideas within the scope of their equivalents should be construed as falling within the scope of the present invention.

Claims (1)

A speech recognition apparatus comprising: a voice database in which user read-aloud speech data, corresponding to user speech uttered from a script designed in consideration of phoneme distribution, and user natural-language speech data, accumulated in response to previously input user speech, are stored by situation;
an ambiguity application module configured to compare the user read-aloud speech data with the user natural-language speech data to extract a context-specific ambiguity weight for the ambiguity of each phoneme, and to apply the context-specific ambiguity weights to a preset acoustic model; and
a decoding module configured to perform speech recognition based on the acoustic model to which the context-specific ambiguity weights are applied by the ambiguity application module when user speech is input, and to perform speech recognition based on the context information and the classifier parameters set for a phoneme or phoneme interval whose ambiguity weight is equal to or greater than the set weight.

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020130136890A KR20150054445A (en) 2013-11-12 2013-11-12 Sound recognition device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020130136890A KR20150054445A (en) 2013-11-12 2013-11-12 Sound recognition device

Publications (1)

Publication Number Publication Date
KR20150054445A true KR20150054445A (en) 2015-05-20

Family

ID=53390593

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020130136890A KR20150054445A (en) 2013-11-12 2013-11-12 Sound recognition device

Country Status (1)

Country Link
KR (1) KR20150054445A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020091123A1 (en) * 2018-11-02 2020-05-07 주식회사 시스트란인터내셔널 Method and device for providing context-based voice recognition service
CN111883113A (en) * 2020-07-30 2020-11-03 云知声智能科技股份有限公司 Voice recognition method and device
CN111883113B (en) * 2020-07-30 2024-01-30 云知声智能科技股份有限公司 Voice recognition method and device

Similar Documents

Publication Publication Date Title
US10930270B2 (en) Processing audio waveforms
US9299347B1 (en) Speech recognition using associative mapping
US9311915B2 (en) Context-based speech recognition
US9202462B2 (en) Key phrase detection
US8972260B2 (en) Speech recognition using multiple language models
US9552811B2 (en) Speech recognition system and a method of using dynamic bayesian network models
US20150269931A1 (en) Cluster specific speech model
JP6812843B2 (en) Computer program for voice recognition, voice recognition device and voice recognition method
CN106875936B (en) Voice recognition method and device
US10089978B2 (en) Detecting customers with low speech recognition accuracy by investigating consistency of conversation in call-center
US9799325B1 (en) Methods and systems for identifying keywords in speech signal
WO2014183373A1 (en) Systems and methods for voice identification
CN112420026A (en) Optimized keyword retrieval system
KR20170007107A (en) Speech Recognition System and Method
JP7191792B2 (en) Information processing device, information processing method and program
US9959887B2 (en) Multi-pass speech activity detection strategy to improve automatic speech recognition
CN111640423B (en) Word boundary estimation method and device and electronic equipment
KR20150054445A (en) Sound recognition device
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
US11328713B1 (en) On-device contextual understanding
WO2022226782A1 (en) Keyword spotting method based on neural network
CN113658593B (en) Wake-up realization method and device based on voice recognition
Kalantari et al. Incorporating visual information for spoken term detection
Kanrar i-Vector used in Speaker Identification by Dimension Compactness
CN117037801A (en) Method for detecting speech wheel and identifying speaker in real teaching environment based on multiple modes

Legal Events

Date Code Title Description
WITN Withdrawal due to no request for examination