JP5235187B2 - Speech recognition apparatus, speech recognition method, and speech recognition program

Info

Publication number: JP5235187B2
Application number: JP2009260836A
Authority: JP (Japan)
Prior art keywords: speech, language model, recognition, speaker, adaptation
Priority/filing date: 2009-11-16
Legal status: Expired - Fee Related
Other languages: Japanese (ja)
Other versions: JP2011107314A (en)
Inventors: 済央 野本, 浩和 政瀧, 敏 高橋, 理 吉岡
Original Assignee: 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Application filed by 日本電信電話株式会社
Priority to JP2009260836A
Publication of JP2011107314A
Application granted; publication of JP5235187B2


Abstract

PROBLEM TO BE SOLVED: To adapt a language model to dialogue and perform speech recognition with a high recognition rate, without learning from a large amount of data, without creating evaluation data, and without a large amount of preparation or calculation.

SOLUTION: The speech recognition technique includes the steps of: recognizing dialogue speech; extracting a feature value from a speech signal; performing speech recognition using the feature value obtained from the speech signal containing the utterance content of a predetermined speaker A, an acoustic model, and the pre-adaptation language model to determine a recognition result A'; determining the adapted language model using only the recognition result A' and the pre-adaptation language model; and performing speech recognition using the feature value obtained from the speech signal containing the utterance content of a speaker B other than the predetermined speaker, the acoustic model, and the adapted language model to determine a recognition result B'.

COPYRIGHT: (C) 2011, JPO & INPIT

Description

  The present invention relates to a speech recognition apparatus, a speech recognition method, and a speech recognition program that extract the content of spoken conversation as text data.

  In speech recognition and statistical machine translation, for example, a language model is used as a linguistic constraint to improve recognition performance. When the application (task) of speech recognition is limited, it is generally held that recognition accuracy can be improved by using a language model constructed specifically for that application.

  The statistical N-gram language model, which has been widely used in recent years, requires a large amount of training data to achieve high performance. When the application is limited, however, it is generally difficult to collect a large amount of text data related to that application. To solve this problem, language model adaptation methods have been proposed, in which a language model trained on a large amount of text data, including text unrelated to the application, is adapted using in-domain text. Patent Document 1 is known as a language model generation apparatus that creates a language model suited to a target application (an adapted language model) without requiring text data from that application.

JP 2007-249050 A

  However, when speech recognition is performed with a conventional language model generation technique, evaluation data (speech data and its transcription text) must be created. Moreover, before speech recognition can be performed, a large number of cluster language models, synthetic cluster language models, and the like must be created, and each language model must be evaluated against the evaluation data, so the amount of preparation and calculation becomes enormous.

  To solve the above problem, the speech recognition technology according to the present invention recognizes conversational speech as follows. A feature amount is extracted from the speech signal. Speech recognition is performed using the feature amount obtained from a speech signal containing the utterance content of a predetermined speaker A, an acoustic model, and a pre-adaptation language model to obtain a recognition result A′. An adapted language model is obtained using only the recognition result A′ and the pre-adaptation language model. Speech recognition is then performed using the feature amount obtained from a speech signal containing the utterance content of a speaker B other than the predetermined speaker, the acoustic model, and the adapted language model to obtain a recognition result B′.

  The present invention provides language constraints that exploit the characteristics of conversation, and thus has the effect of improving the performance of the language model and raising the recognition rate without creating evaluation data and without requiring enormous preparation or calculation.

FIG. 1 shows a configuration example of the speech recognition apparatus 100.
FIG. 2 shows an example of the processing flow of the speech recognition apparatus 100.
FIG. 3 shows a configuration example of the language model adaptation unit 121.
FIG. 4 shows a configuration example of the speech recognition apparatus 100′.
FIG. 5 is a block diagram illustrating the hardware configuration of the speech recognition apparatus 100.
FIG. 6 shows a configuration example of the speech recognition apparatus 200.
FIG. 7 shows an example of the processing flow of the speech recognition apparatus 200.
FIG. 8 illustrates the selection method of the adaptive utterance selection unit 225.
FIG. 9 shows a configuration example of the speech recognition apparatus 200′.

  Hereinafter, embodiments of the present invention will be described in detail.

<Speech recognition apparatus 100>
FIG. 1 shows a configuration example of the speech recognition apparatus 100, and FIG. 2 shows an example of its processing flow. The speech recognition apparatus 100 according to the first embodiment is described below with reference to FIGS. 1 and 2.

  The speech recognition apparatus 100 includes a storage unit 103, a control unit 105, audio signal input terminals 107A and 107B, audio signal acquisition units 109A and 109B, feature amount analysis units 113A and 113B, recognition processing units 115A and 115B, a language model storage unit 117, an acoustic model storage unit 119, a language model adaptation unit 121, and a post-adaptation language model storage unit 123.

  The speech recognition apparatus 100 recognizes conversational speech. Conversation means communication in which two or more speakers exchange utterances on a common topic, and conversational speech means the corresponding speech information. The case with exactly two speakers is referred to as "dialogue"; in this embodiment, recognition of dialogue speech is described in order to simplify the explanation.

  In conventional speech recognition technology, the utterances of the two speakers in a dialogue are recognized independently of each other. A language model typified by the N-gram has generally been used as the language constraint, but there has been no framework for imposing a language constraint that takes the characteristics of dialogue into account.

  The present invention improves the performance of the language model, and thereby the recognition rate, by providing language constraints that take the characteristics of dialogue into consideration.

  The relevant characteristic of dialogue is that the utterances of the two speakers are strongly related: a keyword spoken by one speaker is very likely to be spoken by the other speaker as well. Therefore, in the present invention, when recognizing a dialogue, the recognition result for one speaker is used to impose a language constraint on the utterance content of the other speaker. Specifically, the language model is adapted to match the conversation content using a speech signal containing the utterance content of a predetermined speaker, and the adapted language model is then used to recognize a speech signal containing the utterance content of a speaker other than that predetermined speaker.

  Here, the predetermined speaker means a speaker whose recognition rate is expected to be high even with the pre-adaptation language model, for example because the speaker speaks clearly and the recording conditions are good. A speaker other than the predetermined speaker means a speaker whose recognition rate with the pre-adaptation language model is expected to be low, for example because the speaker speaks roughly and the recording conditions are poor. For example, when recognizing a dialogue between an operator and a customer at a call center, the operator, who can be expected to speak clearly under good recording conditions, is taken as the predetermined speaker, and the customer, whose speech may be rough and whose recording conditions may be poor, is taken as the speaker other than the predetermined speaker.
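  To make this flow concrete, the following is a minimal sketch of the overall procedure in Python. It is an illustration only, not the patent's implementation: the helpers (extract_features, recognize, adapt_lm) are hypothetical and are passed in as parameters, so the sketch fixes only the order of operations described above.

    from typing import Any, Callable

    def recognize_dialogue(
        audio_a: Any,                # speech signal with speaker A's (operator's) utterances
        audio_b: Any,                # speech signal with speaker B's (customer's) utterances
        extract_features: Callable[[Any], Any],
        recognize: Callable[[Any, Any, Any], str],  # (features, acoustic model, LM) -> text
        adapt_lm: Callable[[Any, str], Any],        # (pre-adaptation LM, result A') -> L''
        acoustic_model: Any,
        lm: Any,                     # pre-adaptation language model L
    ) -> tuple[str, str]:
        feats_a = extract_features(audio_a)                        # feature amount A4
        result_a = recognize(feats_a, acoustic_model, lm)          # recognition result A'
        adapted_lm = adapt_lm(lm, result_a)                        # L'' obtained from A' only
        feats_b = extract_features(audio_b)                        # feature amount B4
        result_b = recognize(feats_b, acoustic_model, adapted_lm)  # recognition result B'
        return result_a, result_b

  The point of the ordering is that speaker B's signal is decoded only after the language model has been adapted with A's result; in the first embodiment this happens once, after the call ends.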

  The processing performed by each unit is described below.

<Storage unit 103 and control unit 105>
The storage unit 103 stores and reads the input/output data and the intermediate data of each calculation process, which allows each process to proceed. The data need not necessarily go through the storage unit 103; it may be passed directly between the units. Note that the language model storage unit 117, the acoustic model storage unit 119, and the post-adaptation language model storage unit 123, described later, may be part of the storage unit 103.

  The control unit 105 controls each process.

<Audio signal input terminals 107A and 107B, audio signal acquisition units 109A and 109B>
The audio signal acquisition units 109A and 109B receive the analog audio signals A2 and B2 of the predetermined speaker A (for example, an operator) and of the speaker B other than the predetermined speaker (for example, a customer) via the audio signal input terminals 107A and 107B, respectively, convert them into digital audio signals A3 and B3, and output them (s109A, s109B).

<Feature amount analysis units 113A and 113B>
The feature amount analyzing units 113A and 113B extract (acoustic) feature amounts A4 and B4 from the digital speech signals A3 and B3 of the predetermined speaker A and a speaker B other than the predetermined speaker, respectively, and output them (s113A, s113B).

  As the feature quantities to be extracted, for example, dimensions 1 to 12 of the MFCC (Mel-Frequency Cepstrum Coefficients) and dynamic parameters such as ΔMFCC (the amount of change), power, and Δpower are used. CMN (cepstral mean normalization) processing may also be performed. The feature quantity is not limited to MFCC or power; any parameter used for speech recognition may be used. Specific feature extraction methods are well known and are not described here (for example, Reference 1: Sadaoki Furui, "Acoustic and Speech Engineering", Kindai Kagaku Sha, September 1992).
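  As an illustration of this feature extraction, here is a minimal sketch in Python using librosa; the library choice and parameter values are assumptions for illustration, not part of the patent.

    import numpy as np
    import librosa

    def extract_features(path: str, sr: int = 16000) -> np.ndarray:
        """12 MFCC dimensions + delta-MFCC + log power + delta-power, with CMN."""
        y, sr = librosa.load(path, sr=sr)
        # Dimensions 1-12 of the MFCC (coefficient 0, the overall energy, is dropped).
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)[1:13]
        mfcc = mfcc - mfcc.mean(axis=1, keepdims=True)   # CMN: cepstral mean normalization
        d_mfcc = librosa.feature.delta(mfcc)             # delta-MFCC: amount of change
        power = librosa.power_to_db(librosa.feature.rms(y=y) ** 2)  # log power
        d_power = librosa.feature.delta(power)           # delta-power
        return np.vstack([mfcc, d_mfcc, power, d_power]) # shape: (26, n_frames)

  With the default hop length, the MFCC and RMS tracks are frame-aligned, so stacking them yields one 26-dimensional feature vector per frame.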

<Language model storage unit 117 and acoustic model storage unit 119>
The language model storage unit 117 and the acoustic model storage unit 119 store a language model L and an acoustic model K in advance, respectively. The language model L may be a general-purpose language model or one constructed specifically for call centers. If a specialized language model is used as the pre-adaptation language model, the recognition rate of the recognition result A′ will be higher; and since the language model is then adapted on the basis of a more accurate recognition result, the recognition rate of the recognition result B′ obtained with the adapted language model is also expected to be high.

<Recognition processing unit 115A>
The recognition processing unit 115A performs speech recognition using the feature amount A4 extracted from the digital speech signal A3 containing the utterance content of the predetermined speaker A, the acoustic model K, and the pre-adaptation language model L (s115A). It receives the feature amount A4, obtains the recognition result A′ using the acoustic model K and the language model L as in the prior art, and outputs it. It also outputs the language model L′ used in the speech recognition. Since the specific recognition processing is based on a known method (for example, Reference 1), its description is omitted.

  The above processing (s109 to s115) is repeated until the dialogue (in the case of a call center, the call) ends. After the dialogue ends, the following processing is performed.

  The results of the audio signal acquisition process (s109B), the feature amount analysis process (s113B), and so on for the audio signal B2 containing the utterance content of the speaker B other than the predetermined speaker are stored in the storage unit 103 or the like; alternatively, these processes may be performed after the call ends. The speech recognition process (s115B) is performed by the recognition processing unit 115B after the call ends and after the language model adaptation (s121) described below.

<Language model adaptation unit 121>
The language model adaptation unit 121 obtains the adapted language model L″ using only the speech recognition result A′ (hereinafter "recognition result A′") of the speech signal A2 containing the utterance content of the predetermined speaker A and the language model L (s121). Here, "only the recognition result A′" means that the recognition result of the speech signal B2 containing the utterance content of the speaker B is not used. That is, using the recognition result A′ of the predetermined speaker A, the language model L is adapted to match the content of the dialogue, yielding the adapted language model L″.

  For example, weighting adaptation is one adaptation method. Weighting adaptation mixes N-grams learned from a plurality of text corpora of different scales; when mixing, each corpus is weighted in consideration of its scale and importance. In this embodiment, the pre-adaptation language model L and the N-gram learned from the recognition result A′ of the predetermined speaker A are mixed in consideration of the weight w, and the mixed N-gram serves as the adapted model. For example, let Pa(x) be the appearance frequency of the word x learned from a text corpus A with total word count m, let Pb(x) be the appearance frequency of the word x learned from a text corpus B with total word count n, and let w be the weight of corpus B at mixing time. Then the appearance frequency P(x) of the word x learned by weighting adaptation of the corpora A and B is expressed by the following equation:

    P(x) = ( m · Pa(x) + w · n · Pb(x) ) / ( m + w · n )   … (1)
When performing the adaptation, utterances that appear regardless of topic, such as "Yes" and "Eh", may be excluded. An appropriate value of the weight w is determined in advance through experiments or the like.
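  The weighting adaptation of equation (1) can be written directly at the unigram level. The sketch below is a simplified illustration (the actual embodiment mixes N-grams, and the filler-word list here is a made-up example):

    from collections import Counter

    FILLERS = {"yes", "eh"}  # topic-independent utterances excluded from adaptation

    def weighted_adaptation(counts_a: Counter, tokens_b: list[str], w: float) -> dict[str, float]:
        # Corpus A: training data of the pre-adaptation model L (m words).
        # Corpus B: the recognition result A' of speaker A (n words, weight w).
        counts_b = Counter(t for t in tokens_b if t not in FILLERS)
        m = sum(counts_a.values())
        n = sum(counts_b.values())
        vocab = set(counts_a) | set(counts_b)
        mixed = {}
        for x in vocab:
            pa = counts_a[x] / m               # Pa(x)
            pb = counts_b[x] / n if n else 0.0 # Pb(x)
            # Equation (1): P(x) = (m*Pa(x) + w*n*Pb(x)) / (m + w*n)
            mixed[x] = (m * pa + w * n * pb) / (m + w * n)
        return mixed

  Since m·Pa(x) and n·Pb(x) are simply word counts, the mixture reduces to (count_A(x) + w·count_B(x)) / (m + w·n), so the resulting probabilities still sum to one.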

  FIG. 3 shows a configuration example of the language model adaptation unit 121. The language model adaptation unit 121 includes a weighting unit 121a and an adaptation unit 121b.

  The weighting unit 121a creates the corpus B from the speech recognition result A′ of the predetermined speaker A and the language model L′ used in the speech recognition, and obtains the total word count n. It then multiplies n by the predetermined weight w to obtain wn.

  The adaptation unit 121b obtains the appearance frequency Pb(x) of the word x from the corpus B, obtains the total word count m and the appearance frequency Pa(x) of the word x from the pre-adaptation language model L and the corpus A, and calculates the learned appearance frequency P(x) of the word x from equation (1). The language model L is adapted using the appearance frequency P(x) to obtain the adapted language model L″.

<Adapted language model storage unit 123>
The post-adaptation language model storage unit 123 stores the adapted language model L″, separately from the pre-adaptation language model L. Since the recognition result A′ differs from call to call, the adapted language model L″ also differs from call to call.

<Recognition processing unit 115B>
The recognition processing unit 115B performs speech recognition using the feature amount B4 extracted from the speech signal B2 containing the utterance content of the speaker B other than the predetermined speaker, the acoustic model K, and the adapted language model L″ (s115B). It receives the feature amount B4 extracted from the digital speech signal B3 containing the utterance content of the speaker B, obtains the recognition result B′ using the acoustic model K and the adapted language model L″ as in the conventional technique, and outputs it.

<Effect>
In the present embodiment, the utterances with a high recognition rate (reliability) throughout the dialogue are used to adapt the language model, which is then used to recognize the utterances with a low recognition rate (reliability). With this configuration, the performance of the language model and the recognition rate of speech recognition can be improved without creating evaluation data and without requiring a large amount of preparation and calculation.

  In particular, when recognizing customer speech at a call center, the customer-side speech signal is recorded in a poor environment and little improvement can be expected from the acoustic model, so conventional speech recognition technology yields a low recognition rate; with the present invention, an improvement in the recognition rate can be expected.

<Others>
Note that when the speech recognition apparatus 100 receives the digital audio signals A3 and B3 instead of the analog audio signals A2 and B2, for example from the storage unit 103, a storage medium (not shown), or a communication device, the apparatus need not include the audio signal input terminals 107A and 107B or the audio signal acquisition units 109A and 109B.

  In this embodiment, the case of recognizing call (dialogue) speech between an operator and a customer at a call center has been described, but other conversational speech may be used, including conversations among three or more speakers. In that case, the group of speakers expected to have a high recognition rate with the pre-adaptation language model L (for example, speakers in a well-prepared sound collection environment whose speaking speed, vocabulary, and grammar are appropriate) is taken as A, and the group of speakers expected to have a low recognition rate (for example, speakers in a noisy sound collection environment, who speak fast, or whose vocabulary or grammar contains errors) is taken as B; the language model can then be adapted to the conversation content in the same manner as in this embodiment, taking the characteristics of conversation into consideration.

  In the present embodiment, it is assumed that the signal received at audio signal input terminal A contains the utterance content of a speaker whose recognition rate with the pre-adaptation language model L is expected to be high, and that the signal received at audio signal input terminal B contains the utterance content of a speaker whose recognition rate with the pre-adaptation language model L is expected to be low. However, whether a voice signal comes from a speaker with a high or a low expected recognition rate may instead be determined from, for example, the amount of noise in each signal or the speaking speed.

  Note that the speech recognition apparatus 100 does not necessarily have to output the recognition result A′. For example, if a call center wants to record only the customer's utterance content as text data, only the recognition result B′ may be output and stored.

  In this embodiment, the language model is adapted to the conversation after the call ends, but the call does not necessarily have to have ended. For example, the language model may be adapted from the recognition result A′ obtained within a predetermined time, and speech recognition of the utterance content of the speaker B within that predetermined time may then be performed using the adapted language model.

[Modification 1]
In the speech recognition apparatus 100 of the first embodiment, the speech signals containing the utterance contents of the predetermined speaker A and of the speaker B other than the predetermined speaker are input from separate audio signal input terminals and processed separately. For the speech recognition apparatus 100′ of this modification, the case where the speech signal containing the utterance contents of the predetermined speaker A and of the speaker B other than the predetermined speaker is input from a single audio signal input terminal is described.

<Speech recognition apparatus 100′>
FIG. 4 shows a configuration example of the speech recognition apparatus 100′. The speech recognition apparatus 100′ according to Modification 1 is described with reference to FIG. 4.

  The speech recognition apparatus 100′ includes a storage unit 103, a control unit 105, an audio signal input terminal 107, an audio signal acquisition unit 109, a speaker determination unit 111, a feature amount analysis unit 113, a recognition processing unit 115, a language model storage unit 117, an acoustic model storage unit 119, a language model adaptation unit 121, and a post-adaptation language model storage unit 123. Only the parts that differ from the first embodiment are described.

  The processing performed by each unit is described below.

<Audio signal input terminal 107 and audio signal acquisition unit 109>
The audio signal acquisition unit 109 acquires an analog audio signal including the utterance contents of the speaker A and the speaker B via the audio signal input terminal 107, converts it into a digital audio signal, and outputs it.

<Speaker determination unit 111>
The speaker determination unit 111 uses the digital audio signal to determine which speaker uttered the content contained in the signal, and outputs the result as speaker information. Since the specific speaker determination method is based on a known method (for example, Reference 1), its description is omitted.

<Feature amount analysis unit 113>
The feature amount analysis unit 113 extracts (acoustic) feature amounts from digital audio signals including the utterance contents of the speaker A and the speaker B, adds speaker information to each feature amount, and outputs the feature amount.

<Recognition processing unit 115>
The recognition processing unit 115 determines from the speaker information which speaker each feature amount belongs to, and performs speech recognition on the feature amounts extracted from the digital speech signal containing the utterance content of the predetermined speaker A, using the acoustic model K and the pre-adaptation language model L. It then outputs the recognition result A′ and the language model L′ used. The feature amounts extracted from the digital audio signal containing the utterance content of the speaker B are stored in the storage unit 103 or the like.

  The above processing is repeated until the dialogue (in the case of a call center, the call) ends. After the dialogue ends, the speech recognition apparatus 100′ performs the language model adaptation processing (s121) in the language model adaptation unit 121 as in the first embodiment, and obtains the adapted language model L″.

  The recognition processing unit 115 then receives the feature amounts extracted from the audio signal containing the utterance content of the speaker B from the storage unit 103 or the like, performs speech recognition using the acoustic model K and the adapted language model L″, and outputs the recognition result B′.

  The language model L before adaptation may be used at the start of dialogue, and the language model L ″ after adaptation may be used at the end of dialogue (after language model adaptation).

  With this configuration, the same effect as in the first embodiment can be obtained. Thus, each unit (audio signal acquisition unit, feature amount analysis unit, recognition processing unit, and so on) may be shared between the speakers, as here, or provided separately for each, as in the first embodiment.

<Hardware configuration>
FIG. 5 is a block diagram illustrating the hardware configuration of the speech recognition apparatus 100 according to the present embodiment. As illustrated in FIG. 5, the speech recognition apparatus 100 of this example includes a CPU (Central Processing Unit) 11, an input unit 12, an output unit 13, an auxiliary storage device 14, a ROM (Read Only Memory) 15, a RAM (Random Access Memory) 16, and a bus 17.

  The CPU 11 in this example includes a control unit 11a, a calculation unit 11b, and a register 11c, and executes various calculation processes according to the programs read into the register 11c. The input unit 12 is an input interface for inputting data, such as a keyboard and a mouse, and the output unit 13 is an output interface for outputting data. The auxiliary storage device 14 is, for example, a hard disk or a semiconductor memory, and stores the programs and various data for causing the computer to function as the speech recognition apparatus 100. The programs and data are expanded into the RAM 16 and used by the CPU 11 and the like. The bus 17 connects the CPU 11, the input unit 12, the output unit 13, the auxiliary storage device 14, the ROM 15, and the RAM 16 so that they can communicate. Specific examples of such hardware include a personal computer, a server apparatus, and a workstation.

<Program structure>
As described above, the programs for executing the processes of the speech recognition apparatus 100 according to the present embodiment are stored in the auxiliary storage device 14. The programs constituting the speech recognition program may be written as a single program sequence, or at least part of them may be stored in a library as separate modules.

<Cooperation between hardware and program>
The CPU 11 expands the programs and various data stored in the auxiliary storage device 14 into the RAM 16 according to the loaded OS program. The addresses on the RAM 16 at which the programs and data are written are stored in the register 11c of the CPU 11. The control unit 11a of the CPU 11 sequentially reads these addresses from the register 11c, reads the program and data from the indicated areas of the RAM 16, causes the calculation unit 11b to execute the operations indicated by the program, and stores the calculation results in the register 11c.

  FIG. 1 is a block diagram illustrating a functional configuration of the speech recognition apparatus 100 configured by reading and executing the above-described program in the CPU 11 as described above.

  Here, the storage unit 103, the language model storage unit 117, the acoustic model storage unit 119, and the post-adaptation language model storage unit 123 correspond to storage areas in the auxiliary storage device 14, the RAM 16, the register 11c, or other buffer or cache memory. The audio signal acquisition units 109A and 109B, the speaker determination unit 111, the feature amount analysis units 113A and 113B, the recognition processing units 115A and 115B, and the language model adaptation unit 121 are realized by causing the CPU 11 to execute the speech recognition program.

<Speech recognition apparatus 200>
FIG. 6 shows a configuration example of the speech recognition apparatus 200, and FIG. 7 shows an example of its processing flow. The speech recognition apparatus 200 according to the second embodiment is described with reference to FIGS. 6 and 7, focusing on the parts that differ from the first embodiment.

  The speech recognition apparatus 200 includes, in addition to the storage unit 103, the control unit 105, the audio signal input terminals 107A and 107B, the audio signal acquisition units 109A and 109B, the feature amount analysis units 113A and 113B, the recognition processing units 115A and 115B, the language model storage unit 117, the acoustic model storage unit 119, the language model adaptation unit 121, and the post-adaptation language model storage unit 123, an utterance section determination unit 223 and an adaptive utterance selection unit 225.

<Speech section determination unit 223>
The utterance section determination unit 223 receives the digital audio signal B3 containing the utterance content of the speaker B other than the predetermined speaker from the audio signal acquisition unit 109B, uses it to determine the utterance sections of the speaker B, and obtains and outputs utterance section information (s223). The utterance section information is, for example, a pair consisting of the utterance start time and end time. Since the specific utterance section determination method is based on a known method (for example, Reference 1), its description is omitted.
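  Since the patent leaves utterance section determination to known methods, the following is only a minimal energy-threshold sketch of such a detector (all thresholds and window sizes are illustrative assumptions; practical systems use more robust voice activity detection):

    import numpy as np

    def detect_utterance_sections(y: np.ndarray, sr: int,
                                  frame_ms: float = 25, hop_ms: float = 10,
                                  threshold_db: float = -35.0, min_dur: float = 0.2):
        """Return (start_time, end_time) pairs where frame energy exceeds the threshold.
        y is a 1-D numpy array of audio samples at sampling rate sr."""
        frame = int(sr * frame_ms / 1000)
        hop = int(sr * hop_ms / 1000)
        n_frames = max(0, 1 + (len(y) - frame) // hop)
        energy_db = np.array([
            10 * np.log10(np.mean(y[i * hop:i * hop + frame] ** 2) + 1e-10)
            for i in range(n_frames)
        ])
        active = energy_db > threshold_db
        sections, start = [], None
        for i, a in enumerate(active):
            if a and start is None:
                start = i                                    # utterance start frame
            elif not a and start is not None:
                s, e = start * hop / sr, (i * hop + frame) / sr
                if e - s >= min_dur:                         # drop very short blips
                    sections.append((s, e))
                start = None
        if start is not None:                                # utterance runs to the end
            sections.append((start * hop / sr, len(y) / sr))
        return sections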

<Adaptive utterance selection unit 225>
The adaptive utterance selection unit 225 receives the recognition result A ′ and the language model L ′ from the recognition processing unit 115A, and receives the utterance section information from the utterance section determination unit 223.

  Using the utterance section information of the speaker B other than the predetermined speaker, the adaptive utterance selection unit 225 selects the recognition results A′ of the speech signal A3 containing the utterance content of the predetermined speaker A, namely the n results before and the n results after the utterance section (s225). Here, n is an arbitrary natural number, for example 1 or 2.

  FIG. 8 is a diagram for explaining the selection method of the adaptive utterance selection unit 225. For example, given the utterance section information of the [t]-th utterance, made by customer B: when n = 1, the recognition results A′ of the [t−1]-th and [t+1]-th utterances of operator A are selected; when n = 2, the recognition results A′ of the [t−3]-th and [t+3]-th utterances of operator A are selected in addition to the [t−1]-th and [t+1]-th. However, at the start or end of the conversation, some of the n recognition results A′ of operator A before or after the utterance section of customer B may not exist. In that case, only the existing recognition results A′ may be selected. For example, when the conversation starts with the [t−3]-th utterance of operator A and n = 2, only the [t−3]-th recognition result A′ exists before the utterance section of the [t−2]-th customer B, so the existing [t−3]-th, [t−1]-th, and [t+1]-th recognition results A′ of operator A are selected.
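  Under the alternating turn-taking assumed in FIG. 8 (utterances ordered A, B, A, B, ...), operator A's utterances sit at odd offsets from customer B's index t, so the selection can be sketched as follows (a simplified illustration; the dictionary-based interface is an assumption):

    def select_adaptation_results(results_a: dict[int, str], t: int, n: int) -> list[str]:
        # results_a maps utterance index -> operator A's recognition result A'.
        # For customer B's t-th utterance, pick A's results at offsets
        # +/-1, +/-3, ..., +/-(2n-1); indices missing at the start or end of
        # the conversation are simply skipped, keeping only existing results.
        selected = []
        for k in range(1, n + 1):
            for idx in (t - (2 * k - 1), t + (2 * k - 1)):
                if idx in results_a:
                    selected.append(results_a[idx])
        return selected

  For example, with results_a = {0: "...", 2: "...", 4: "..."} and a customer utterance at t = 3, n = 2 selects indices 2 and 4 (offset 1) and index 0 (offset 3); index 6 does not exist and is skipped, matching the boundary handling described above.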

<Language model adaptation unit 121>
The language model adaptation unit 121 obtains the adapted language model L″ using only the n preceding and n following recognition results A′ selected by the adaptive utterance selection unit 225 and the language model L (s121). Whereas the first embodiment uses the recognition results A′ of the predetermined speaker A over the entire dialogue, this embodiment differs in that only the n recognition results A′ before and after the speech signal B2 to be recognized are used. The adaptation method itself is the same as in the first embodiment.

<Recognition processing unit 115B>
The recognition processing unit 115B performs speech recognition using the feature amount B4 extracted from the speech signal B2 containing the utterance content of the speaker B other than the predetermined speaker (corresponding to the utterance section), the acoustic model K, and the adapted language model L″ corresponding to the utterance section (s115B). Whereas the first embodiment uses the same adapted language model over the entire dialogue, in this embodiment the adapted language model is updated for each utterance section of the speaker B, so speech recognition is performed with a different adapted language model for each utterance section.

  The speech recognition apparatus 200 repeats the above process until the dialogue is finished (s228).

<Effect>
In this embodiment, the utterances with a high recognition rate before and after an utterance with a low recognition rate are used to adapt the language model, which is then used to recognize the utterance with a low recognition rate. With this configuration, the performance of the language model and the recognition rate can be improved without creating evaluation data and without requiring enormous preparation and calculation.

  Furthermore, with the configuration of this embodiment, adaptation is performed not with the content of the entire call but only with the utterances temporally adjacent to the utterance to be recognized, so the model can be adapted to locally appearing topics. This is effective when the topic of the conversation changes from moment to moment. It is also effective in call center conversations and the like in which the operator repeats back the content of the customer's utterances (parroting). When the operator parrots back immediately after the customer speaks, a sufficient effect is obtained even with n = 1, and the amount of calculation can be reduced. Moreover, once one utterance of A following an utterance of B is completed, the speech recognition process for B's utterance can be started.

  Note that the range of adaptation can be expanded by increasing n to 2, 3, and so on, but as n increases, the amount of calculation grows and the start of the speech recognition process is delayed. An appropriate n for adapting the language model to the conversation should therefore be determined in advance.

<Others>
As in Modification 1 of the first embodiment, this embodiment can be modified for the case where a voice signal containing the utterance contents of the predetermined speaker A and of the speaker B other than the predetermined speaker is input from a single audio signal input terminal. FIG. 9 shows a configuration example of the speech recognition apparatus 200′ for that case.

  In this case, the speech recognition apparatus 200 ′ does not require the utterance section determination unit 223, and the adaptive utterance selection unit 225 receives speaker information instead of the utterance section information.

  That is, the adaptive utterance selection unit 225 receives the speaker information from the speaker determination unit 111, and receives the recognition result A′ and the language model L′ from the recognition processing unit. Using the speaker information, the adaptive utterance selection unit 225 selects the recognition results A′ of the speech signal A3 containing the utterance content of the predetermined speaker A, namely the n results before and the n results after each utterance of the speaker B other than the predetermined speaker.

Reference numerals:
100, 100′, 200, 200′ Speech recognition apparatus
103 Storage unit
105 Control unit
109, 109A, 109B Audio signal acquisition unit
111 Speaker determination unit
113, 113A, 113B Feature amount analysis unit
115, 115A, 115B Recognition processing unit
117 Language model storage unit
119 Acoustic model storage unit
123 Post-adaptation language model storage unit
223 Utterance section determination unit
225 Adaptive utterance selection unit

Claims (3)

  1. A speech recognition device that recognizes conversational speech, comprising:
    a storage unit that stores an acoustic model and a language model;
    a feature amount analysis unit that extracts a feature amount from an audio signal;
    a recognition processing unit that performs speech recognition using a feature amount obtained from a speech signal containing the utterance content of a predetermined speaker A, the acoustic model, and a pre-adaptation language model to obtain a recognition result A′, and performs speech recognition using a feature amount obtained from a speech signal containing the utterance content of a speaker B other than the predetermined speaker, the acoustic model, and an adapted language model to obtain a recognition result B′;
    a language model adaptation unit that obtains the adapted language model using only the recognition result A′ and the pre-adaptation language model; and
    an adaptive utterance selection unit that, using an utterance section of the speaker B, selects the n recognition results A′ before and the n recognition results A′ after the utterance section,
    wherein the recognition result A′ used by the language model adaptation unit is the one selected by the adaptive utterance selection unit, and
    the recognition processing unit performs speech recognition using the feature amount obtained from the audio signal containing the utterance content of the speaker B in the utterance section, the acoustic model, and the adapted language model corresponding to the utterance section to obtain the recognition result B′.
    A speech recognition apparatus characterized by the above.
  2. A speech recognition method for recognizing conversational speech, comprising:
    a feature amount analysis step of extracting a feature amount from an audio signal;
    a recognition processing step A of performing speech recognition using a feature amount obtained from a speech signal containing the utterance content of a predetermined speaker A, an acoustic model, and a pre-adaptation language model to obtain a recognition result A′;
    a language model adaptation step of obtaining an adapted language model using only the recognition result A′ and the pre-adaptation language model;
    a recognition processing step B of performing speech recognition using a feature amount obtained from a speech signal containing the utterance content of a speaker B other than the predetermined speaker, the acoustic model, and the adapted language model to obtain a recognition result B′; and
    an adaptive utterance selection step of, using an utterance section of the speaker B, selecting the n recognition results A′ before and the n recognition results A′ after the utterance section,
    wherein the recognition result A′ used in the language model adaptation step is the one selected in the adaptive utterance selection step, and
    in the recognition processing step B, speech recognition is performed using the feature amount obtained from the speech signal containing the utterance content of the speaker B in the utterance section, the acoustic model, and the adapted language model corresponding to the utterance section to obtain the recognition result B′.
    A speech recognition method characterized by the above.
  3. A program for causing a computer to function as the speech recognition apparatus according to claim 1.
JP2009260836A 2009-11-16 2009-11-16 Speech recognition apparatus, speech recognition method, and speech recognition program Expired - Fee Related JP5235187B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2009260836A JP5235187B2 (en) 2009-11-16 2009-11-16 Speech recognition apparatus, speech recognition method, and speech recognition program


Publications (2)

Publication Number Publication Date
JP2011107314A JP2011107314A (en) 2011-06-02
JP5235187B2 (en) 2013-07-10

Family

ID=44230867

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2009260836A Expired - Fee Related JP5235187B2 (en) 2009-11-16 2009-11-16 Speech recognition apparatus, speech recognition method, and speech recognition program

Country Status (1)

Country Link
JP (1) JP5235187B2 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101478146B1 (en) * 2011-12-15 2015-01-02 한국전자통신연구원 Apparatus and method for recognizing speech based on speaker group
JP6179509B2 (en) * 2012-05-17 2017-08-16 日本電気株式会社 Language model generation apparatus, speech recognition apparatus, language model generation method, and program storage medium
JP5762365B2 (en) * 2012-07-24 2015-08-12 日本電信電話株式会社 Speech recognition apparatus, speech recognition method, and program
JP6277659B2 (en) * 2013-10-15 2018-02-14 三菱電機株式会社 Speech recognition apparatus and speech recognition method
KR20170030387A (en) 2015-09-09 2017-03-17 삼성전자주식회사 User-based language model generating apparatus, method and voice recognition apparatus

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000250581A (en) * 1999-02-25 2000-09-14 Atr Interpreting Telecommunications Res Lab Language model generating device and voice recognition device
EP2026327A4 (en) * 2006-05-31 2012-03-07 Nec Corp Language model learning system, language model learning method, and language model learning program
WO2008069308A1 (en) * 2006-12-08 2008-06-12 Nec Corporation Audio recognition device and audio recognition method



Legal Events

Date        Code  Title
2011-07-22  RD03  Notification of appointment of power of attorney (JAPANESE INTERMEDIATE CODE: A7423)
2012-03-07  A621  Written request for application examination (JAPANESE INTERMEDIATE CODE: A621)
2013-01-15  A977  Report on retrieval (JAPANESE INTERMEDIATE CODE: A971007)
2013-01-22  A131  Notification of reasons for refusal (JAPANESE INTERMEDIATE CODE: A131)
2013-02-28  A521  Written amendment (JAPANESE INTERMEDIATE CODE: A523)
2013-03-19  TRDD  Decision of grant or rejection written
2013-03-19  A01   Written decision to grant a patent or to grant a registration (utility model) (JAPANESE INTERMEDIATE CODE: A01)
2013-03-25  A61   First payment of annual fees during grant procedure (JAPANESE INTERMEDIATE CODE: A61)
            R150  Certificate of patent or registration of utility model (Ref document number: 5235187; Country of ref document: JP; JAPANESE INTERMEDIATE CODE: R150)
2016-04-05  FPAY  Renewal fee payment (payment until date; year of fee payment: 3)
            S531  Written request for registration of change of domicile (JAPANESE INTERMEDIATE CODE: R313531)
            R350  Written notification of registration of transfer (JAPANESE INTERMEDIATE CODE: R350)
            LAPS  Cancellation because of no payment of annual fees