WO2009101837A1 - Symbol Insertion Device and Symbol Insertion Method - Google Patents
Symbol Insertion Device and Symbol Insertion Method
- Publication number
- WO2009101837A1 PCT/JP2009/050641
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- symbol insertion
- model
- symbol
- speech
- models
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Definitions
- the present invention relates to a symbol insertion device and a symbol insertion method, and more particularly to a symbol insertion device and a symbol insertion method for inserting a specific symbol such as a punctuation mark into a transcription or a speech-recognized text.
- an example of a symbol insertion technique related to the present invention is described in Section 3.2 of Non-Patent Document 1.
- sentence boundaries are detected using the length of a pause taken by a speaker and the word information appearing before and after the pause.
- a character string X that does not include punctuation marks but includes pause information and a character string Y that includes punctuation marks are regarded as different languages, and the translation probability P(Y|X) ∝ P(X|Y)P(Y) is evaluated.
- the language model likelihood P(Y) is calculated both for the case where a pause is converted into a punctuation mark and for the case where it is not, at every position where such a conversion is possible, and the results are compared to determine whether to insert a punctuation mark.
- P(X|Y) uses a model that depends on the expressions before and after the pause and on the pause length.
- as the language model, a word 3-gram model learned from transcriptions of the CSJ (Corpus of Spontaneous Japanese), in which sentence boundaries are manually annotated, is used.
- consider a word string obtained by converting a speaker's utterance into text, for example, "if a warming effect is obtained this raises the perceived temperature by about two degrees". If the pause length immediately after "is obtained" is sufficiently long, that point is judged to be a sentence boundary and a period is inserted, yielding "if a warming effect is obtained." On the other hand, if the pause length immediately after "is obtained" is sufficiently short, the point is not judged to be a sentence boundary, and the whole string "if a warming effect is obtained, this raises the perceived temperature by about two degrees." is treated as one sentence. At such points, where an extremely long pause appears, the symbol insertion position can be detected with a certain degree of accuracy regardless of differences between speakers.
- however, the way of speaking generally differs from speaker to speaker, and how a speaker pauses at punctuation insertion points and which expressions the speaker uses at the ends of clauses and sentences differ accordingly. For example, even among speakers whose pause length immediately after "is obtained" is the same, for a fast speaker that pause is long relative to the pauses elsewhere in the utterance, so the likelihood that a punctuation mark should be inserted increases; the opposite holds for a slow speaker. In addition, some speakers rarely use clause-end and sentence-end expressions such as "is obtained."
- in the technique of Non-Patent Document 1, the symbol insertion likelihood is calculated using a single model (a word 3-gram model) learned from transcriptions of the CSJ (Corpus of Spontaneous Japanese). It is therefore not possible to make symbol insertion determinations that take into account differences in each speaker's way of speaking.
- the object of the present invention is to perform symbol insertion determination using symbol insertion models that match a speaker's linguistic and acoustic speaking characteristics, such as phrasing and pausing, and thereby to enable symbol insertion determination that takes differences in those characteristics into account.
- the symbol insertion device of the present invention is a symbol insertion device that inserts symbols into a word string obtained by transcribing speech information. A symbol insertion likelihood is calculated for each of a plurality of symbol insertion models provided according to different ways of speaking, and each likelihood is weighted according to the similarity between the speaking style features of the word string and a plurality of speaking style feature models and according to the association between each symbol insertion model and each speaking style feature model; whether to insert a symbol into the word string is then determined from the weighted likelihoods.
- according to the present invention, it is possible to determine whether to insert a symbol in consideration of differences in the speaking characteristics of each speaker.
- the reason is that, among the plurality of symbol insertion models corresponding to speakers' linguistic and acoustic speaking characteristics such as phrasing and pausing, the symbol insertion determination places emphasis on the symbol insertion model that matches the characteristics of the speaker of the input word string.
- a symbol insertion device 100 includes a processing device 101, an input device 102 connected to the processing device 101, n symbol insertion model storage devices 103-1 to 103-n, n speech style feature model storage devices 104-1 to 104-n, and an output device 105.
- the processing device 101 further includes word string information storage means 111, speech style feature similarity calculation means 112, symbol insertion likelihood calculation means 113, symbol insertion determination means 114, and symbol-inserted word string information storage means 115.
- the speech feature similarity calculating unit 112, the symbol insertion likelihood calculating unit 113, and the symbol insertion determining unit 114 can be realized by, for example, a computer constituting the processing apparatus 101 and a program executed thereon.
- the program is recorded on a computer-readable recording medium such as a magnetic disk, is read by the computer at startup or the like, and, by controlling the operation of the computer, realizes the speech style feature similarity calculation means 112, the symbol insertion likelihood calculation means 113, and the symbol insertion determination means 114 on the computer.
- each of the storage devices 103-1 to 103-n and 104-1 to 104-n, as well as the word string information storage means 111 and the symbol-inserted word string information storage means 115 in the processing device 101, can be realized by the main memory and auxiliary storage device provided in the computer.
- the input device 102 is a device for inputting, to the processing device 101, information on the spoken word string into which symbols are to be inserted, and is a keyboard, a file device, a data receiving device, or the like.
- the information on the spoken word string consists of the transcription text or speech recognition text together with the linguistic information (for example, part-of-speech information) and acoustic information (for example, pause information) required by the speech style feature similarity calculation means 112 and the symbol insertion likelihood calculation means 113.
- the word string information storage unit 111 stores information on the word string input from the input device 102.
- word information 1022 for each word is arranged in the order in which the words were uttered.
- each word information 1022 includes the surface information 1023 of the word, part-of-speech information 1024 indicating the part of speech and inflection form of the word, the pause length 1025 between the word and the immediately following word, and other information.
- the other information includes, for example, the speaking rate.
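The word string information described above (word string information 1021 holding per-word word information 1022) can be sketched as a simple data structure. This is an illustrative sketch only; the field names are assumptions, not identifiers from the patent.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WordInfo:
    """One unit of word information (1022): surface form, part of speech,
    pause length to the following word, and optional extras such as speaking rate."""
    surface: str                          # surface information (1023)
    pos: str                              # part-of-speech / inflection information (1024)
    pause_len: float                      # pause length after this word, in seconds (1025)
    speech_rate: Optional[float] = None   # other information, e.g. speaking rate


@dataclass
class WordStringInfo:
    """Word string information (1021): word information in utterance order."""
    words: list = field(default_factory=list)


ws = WordStringInfo()
ws.words.append(WordInfo("obtained", "verb", 0.8))
print(len(ws.words), ws.words[0].pause_len)
```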
- the symbol insertion model storage devices 103-1 to 103-n store symbol insertion models learned using the learning data of speakers having different speaking characteristics.
- the symbol insertion model models the probability of inserting a period or a comma, using linguistic information (word surface information, part-of-speech information, etc.) and acoustic information (pause information, etc.).
- the symbol insertion model can be created using a known technique such as a word n-gram model with symbols, a discriminative model such as a support vector machine (SVM) or conditional random field (CRF) (Non-Patent Document 1), or a rule-based model.
- each symbol insertion model itself is the same as the symbol insertion model used in Non-Patent Document 1 and the like; the difference from the conventional technique is that a symbol insertion model is provided for each distinct speaking style.
- the symbol insertion likelihood calculation means 113 reads the word string information 1021 from the word string information storage means 111 in units of word information 1022 and, for each word information 1022 and for each symbol insertion model stored in the symbol insertion model storage devices 103-1 to 103-n, calculates the symbol insertion likelihoods that a period or a comma is inserted immediately after the word having that word information and the likelihood that no symbol is inserted.
- the speaking style feature model storage devices 104-1 to 104-n store the speaking style feature models learned using the learning data of speakers having different speaking style features.
- the speaking style feature model models the features of a speaker's way of speaking using the speaker's speech information. Examples of such information include linguistic information (such as the frequency of sentence-end expressions) and acoustic information (such as pause information, speaking rate information, and the duration of utterances).
- the speech feature model storage devices 104-1 to 104-n correspond one-to-one with the symbol insertion model storage devices 103-1 to 103-n.
- the speaking style feature model stored in the speech style feature model storage device 104-i models the speaking style features of the speakers in the learning data used to learn the symbol insertion model stored in the corresponding symbol insertion model storage device 103-i.
- speaking style feature models using acoustic information include, for example, the speaking rate and pause length for each word, and the average speaking rate and average pause length for each clause or utterance unit.
- a simple example of a speaking style feature model using linguistic information is the frequency information of sentence-end expressions. Other examples of speaking style feature models are described in detail in the examples below.
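As an illustration of a speaking style feature model built from acoustic and linguistic information, the sketch below summarizes a speaker by average pause length and by the relative frequency of sentence-end expressions. The function name and the particular feature choices are assumptions for illustration, not the patent's definitions.

```python
from collections import Counter


def speaking_style_features(words, pauses, sentence_ends):
    """Hypothetical per-speaker summary: average pause length, plus the
    relative frequency of each sentence-end expression."""
    avg_pause = sum(pauses) / len(pauses)
    end_freq = Counter(sentence_ends)
    total = sum(end_freq.values())
    end_dist = {w: c / total for w, c in end_freq.items()}
    return {"avg_pause": avg_pause, "end_dist": end_dist}


feats = speaking_style_features(
    ["obtained", "rises", "degrees"],   # words uttered
    [0.8, 0.1, 1.2],                    # pause length after each word
    ["desu", "masu", "desu"],           # observed sentence-end expressions
)
print(round(feats["avg_pause"], 2), feats["end_dist"]["desu"])
```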
- the speech style feature similarity calculation means 112 reads the word string information 1021 from the word string information storage means 111 in units of word information 1022, extracts the speaking style features, and calculates the similarity between the extracted features and the speaking style features modeled by the speaking style feature models stored in the speech style feature model storage devices 104-1 to 104-n.
- the unit in which the speaking style features are extracted from the word string information 1021 may be every predetermined number of words, all utterances of one speaker, or the entire word string information 1021 stored in the word string information storage means 111.
- the symbol insertion determination means 114 weights the symbol insertion likelihood obtained for each symbol insertion model by the symbol insertion likelihood calculation means 113 for the word string stored in the word string information storage means 111, according to the similarity, obtained by the speech style feature similarity calculation means 112, between the speaking style features of the word string and the plurality of speaking style feature models, and according to the association between each symbol insertion model and each speaking style feature model. It then determines whether to insert a symbol into the word string and stores the information on the word string with symbols inserted according to the determination result in the symbol-inserted word string information storage means 115.
- in the present embodiment, the speech style feature model storage devices 104-1 to 104-n and the symbol insertion model storage devices 103-1 to 103-n correspond one-to-one, so the weighting is performed by multiplying the symbol insertion likelihood for each symbol insertion model obtained by the symbol insertion likelihood calculation means 113 by the similarity to the corresponding speaking style feature model.
- the symbol insertion determination means 114 uses the weighted symbol insertion likelihoods for each symbol insertion model to determine whether to insert a period or comma and, if so, which symbol to insert, for example by any of the following methods.
- Symbol insertion determination method 1: the sum of the top n′ (n′ is a constant with 1 ≤ n′ ≤ n; the same applies below) weighted period insertion likelihoods of the symbol insertion models is taken as the integrated period insertion likelihood, the sum of the top n′ weighted comma insertion likelihoods as the integrated comma insertion likelihood, and the sum of the top n′ weighted NULL insertion likelihoods (the likelihoods that neither a period nor a comma is inserted) as the integrated NULL insertion likelihood. The symbol with the largest integrated insertion likelihood is then used as the symbol insertion determination result. For example, if the integrated period insertion likelihood is the largest of the three, a determination result indicating that a period is inserted is generated.
- Symbol insertion determination method 2: the integrated insertion likelihood of the period is compared with a predetermined threshold; if it is equal to or greater than the threshold, a determination result indicating that a period is inserted is generated, and otherwise a determination result indicating that no period is inserted is generated.
- Symbol insertion determination method 3: for each symbol insertion model, the symbol (period, comma, or NULL) with the maximum insertion likelihood is obtained from the weighted period, comma, and NULL insertion likelihoods, and the symbol obtained most frequently across all symbol insertion models is taken by majority vote as the determination result.
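The three determination methods above can be sketched as follows, assuming for each candidate symbol a list of weighted insertion likelihoods, one per symbol insertion model. Function names and the score values are illustrative, not from the patent.

```python
from collections import Counter


def decide_method1(scores, n_prime):
    """Method 1: for each symbol, sum the top-n' weighted likelihoods across
    models; output the symbol with the largest integrated likelihood.
    `scores` maps symbol -> list of weighted likelihoods, one per model."""
    integrated = {c: sum(sorted(v, reverse=True)[:n_prime]) for c, v in scores.items()}
    return max(integrated, key=integrated.get)


def decide_method2(scores, n_prime, threshold, symbol):
    """Method 2: insert `symbol` only if its integrated likelihood reaches a threshold."""
    total = sum(sorted(scores[symbol], reverse=True)[:n_prime])
    return symbol if total >= threshold else "NULL"


def decide_method3(scores):
    """Method 3: each model votes for its maximum-likelihood symbol; majority wins."""
    n_models = len(next(iter(scores.values())))
    votes = [max(scores, key=lambda c: scores[c][i]) for i in range(n_models)]
    return Counter(votes).most_common(1)[0][0]


# toy weighted likelihoods for three models
scores = {"PERIOD": [0.7, 0.2, 0.6], "COMMA": [0.2, 0.3, 0.3], "NULL": [0.1, 0.5, 0.1]}
print(decide_method1(scores, 2), decide_method3(scores))
```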
- the output device 105 reads the information on the symbol-inserted word string from the symbol-inserted word string information storage means 115 and outputs it, and is a display, a printer, a file device, a data transmission device, or the like.
- when the processing device 101 receives information on the word string into which symbols are to be inserted from the input device 102, it stores the information in the word string information storage means 111 as shown in FIG. 2 (S101).
- next, the processing device 101 uses the speech style feature similarity calculation means 112 to read the word string information 1021 stored in the word string information storage means 111 and to extract the linguistic and acoustic features of the way of speaking, such as phrasing and pausing (S102).
- the unit for extracting the feature of the speaking method may be the entire word string information 1021 stored in the word string information storage unit 111 or may be for each predetermined number of words.
- next, the processing device 101 uses the speech style feature similarity calculation means 112 to read the speaking style feature models from the speech style feature model storage devices 104-1 to 104-n, and calculates the similarity between the features of the input word string extracted in step S102 and each speaking style feature model, for each extraction unit (S103).
- the similarity for each speech style feature model and each extraction unit calculated here is stored in a memory (not shown) in the speech style feature similarity calculation unit 112 until the symbol insertion determination is completed.
- next, the processing device 101 focuses on the word included in the first word information 1022 of the word string information 1021 stored in the word string information storage means 111 (S104).
- then, the symbol insertion models are read from the symbol insertion model storage devices 103-1 to 103-n, and the period insertion likelihood that a period is inserted immediately after the word of interest, the comma insertion likelihood that a comma is inserted, and the NULL insertion likelihood that nothing is inserted are calculated for each symbol insertion model (S105).
- next, the processing device 101 weights the symbol insertion likelihood obtained for each symbol insertion model by the symbol insertion likelihood calculation means 113, multiplying it by the similarity between the corresponding speaking style feature model and the speaking style features of the input word string (S106).
- the similarity used here is, among the similarities calculated and stored for each extraction unit, that of the extraction unit to which the word of interest belongs.
- next, the processing device 101 uses the symbol insertion determination means 114 to determine, according to any one of the symbol insertion determination methods (1) to (3) described above, whether a symbol is to be inserted immediately after the word of interest and, if so, whether it is a period or a comma (S107). The symbol insertion determination means 114 then generates output word information for the word of interest according to the symbol insertion determination result and stores it in the symbol-inserted word string information storage means 115 (S108).
- if it is determined that a period should be inserted, output word information is generated by appending period information to the word information of the word of interest; if it is determined that a comma should be inserted, comma information is appended; and if it is determined that NULL should be inserted, output word information containing only the information of the word of interest is generated. The output word information is then stored in the symbol-inserted word string information storage means 115.
- when the processing device 101 finishes the processing for the first word of the word string information 1021 stored in the word string information storage means 111, it shifts attention to the second word of the word string information 1021 (S109) and repeats the processing of steps S105 to S108.
- when the processing device 101 completes the processing for the last word of the word string information 1021 stored in the word string information storage means 111 (YES in step S110), it outputs the information on the symbol-inserted word string stored in the symbol-inserted word string information storage means 115 from the output device 105 (S111).
- as described above, according to the present embodiment, symbol insertion determination can be performed in consideration of differences in each speaker's way of speaking.
- the reason is that, among the plurality of symbol insertion models corresponding to speakers' linguistic and acoustic speaking characteristics such as phrasing and pausing, the punctuation insertion determination places emphasis on the symbol insertion model that matches the characteristics of the speaker of the input word string.
- Example of the first embodiment: next, an example of the present embodiment is described, focusing on the speaking style feature model, the symbol insertion model, the speaking style feature similarity calculation, the symbol insertion likelihood calculation, and the symbol insertion determination.
- learning data D A to D Z for each speaker A to Z are prepared.
- the learning data D_A of speaker A may be, for example, audio recordings of speaker A's everyday conversations or lectures, or text data obtained by manually transcribing such audio, inserting periods and commas, and annotating acoustic features such as pause lengths.
- speaking style feature models SM A to SM Z for each speaker A to Z are created using the learning data D A to D Z.
- a speaking style feature model for each speaker is created using linguistic and acoustic feature quantities for that speaker.
- as acoustic feature quantities, the pause length, the speaking rate, and the like can be considered.
- from the learning data D_Y of speaker Y, the positions where a symbol C_k (a period or comma) is inserted are found, and the word w_{Y,t} immediately before each such position and the pause length dur(w_{Y,t}, C_k) immediately after that word are extracted.
- then the distribution function f_dur(x, Y, C_k) of the speaking style feature model, for the case where the symbol C_k is inserted immediately after the word w_{Y,t}, is obtained.
- x is a speech feature quantity; in this case, the pause length.
- let N(w_{Y,t}, dur(w_{Y,t}, C_k)) denote the frequency with which the symbol C_k is inserted when the word w_{Y,t} is uttered with pause length dur(w_{Y,t}, C_k). The distribution function f_dur(x, Y, C_k) is then defined by the following equation (2):
- f_dur(x, Y, C_k) = Σ_{dur(w_{Y,t}, C_k) ≤ x} N(w_{Y,t}, dur(w_{Y,t}, C_k)) / Σ_{dur(w_{Y,t}, C_k)} N(w_{Y,t}, dur(w_{Y,t}, C_k))
- the denominator on the right side represents the total number of times the symbol C_k is inserted immediately after the word w_{Y,t} in the learning data D_Y, regardless of pause length, and the numerator represents the total number of insertions of the symbol C_k for which the pause length dur(w_{Y,t}, C_k) is less than or equal to x. In other words, the function represents the proportion of cases in which the pause length is at most x and the symbol C_k is inserted; it is the cumulative function of the likelihood (probability) that the symbol C_k is inserted, with the pause length as the variable.
- the distribution function f_dur(x, Y, C_1) represents speaker Y's usage characteristics of the period immediately after the word w_{Y,t}, and the distribution function f_dur(x, Y, C_2) likewise represents the usage characteristics of the comma immediately after the word w_{Y,t}.
- for NULL, the denominator on the right side is the same as in equation (2), and the numerator represents the total number of cases in which no symbol is inserted and the pause length dur(w_{Y,t}, NULL) is greater than the threshold x. That is, it represents the proportion of cases in which the pause length is greater than x and the symbol "NULL" is inserted.
- in the above example, the pause length is used as the acoustic feature quantity, but other acoustic feature quantities such as the speaking rate can be used instead, and multiple types of acoustic feature quantities, such as the pause length and the speaking rate together, can also be used.
- in that case, a distribution function is generated for each acoustic feature quantity and a weight is assigned to each.
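A minimal sketch of the distribution function f_dur as an empirical cumulative distribution, following the description of equation (2): given the pause lengths observed when a symbol was inserted after a word, it returns the fraction of insertions whose pause length is at most x.

```python
def make_fdur(pause_samples):
    """Build the empirical cumulative distribution f_dur(x): the fraction of
    observed symbol insertions whose pause length was <= x. The denominator is
    the total insertion count, the numerator the count with pause length <= x."""
    samples = sorted(pause_samples)
    total = len(samples)

    def f(x):
        return sum(1 for d in samples if d <= x) / total

    return f


# pause lengths (seconds) observed when, say, a period followed the word
f = make_fdur([0.2, 0.5, 0.9, 1.4])
print(f(0.5), f(1.0))
```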
- the speaking style feature model SM_Y in FIG. 4 corresponds to the distribution functions of speaker Y's speaking style feature model created as described above. Likewise, the speaking style feature models SM_A, SM_B, ..., SM_X, SM_Z in FIG. 4 correspond to the distribution functions of the models created for the speakers A to X and Z other than Y, in the same manner as for speaker Y.
- next, the speaking style feature models SM_A to SM_Z of speakers A to Z are merged bottom-up, combining pairs of models whose distribution functions are similar into single models.
- for example, the speaking style feature models SM_A and SM_B are merged into one speaking style feature model SM_AB, ..., and the speaking style feature models SM_Y and SM_Z into one speaking style feature model SM_YZ.
- a known clustering method is used for the merging.
- the speaking style feature models, merged down to a total of n, are finally stored in the n speech style feature model storage devices 104-1 to 104-n in FIG.
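The bottom-up merging of speaker models can be sketched as a simple agglomerative procedure. Here each model is reduced to a single summary statistic (mean pause length) for brevity; the patent compares full distribution functions, so this distance is only an illustrative stand-in.

```python
def merge_models(models, n_target):
    """Bottom-up merging sketch: repeatedly combine the two models that are
    closest until n_target models remain. Each model is (name, mean_pause);
    the distance is the mean-pause gap, an illustrative stand-in for
    comparing full distribution functions."""
    models = list(models)
    while len(models) > n_target:
        # find the closest pair of models
        i, j = min(
            ((a, b) for a in range(len(models)) for b in range(a + 1, len(models))),
            key=lambda p: abs(models[p[0]][1] - models[p[1]][1]),
        )
        (na, va), (nb, vb) = models[i], models[j]
        merged = (na + nb, (va + vb) / 2)
        models = [m for k, m in enumerate(models) if k not in (i, j)] + [merged]
    return models


out = merge_models([("A", 0.30), ("B", 0.32), ("Y", 0.90), ("Z", 0.95)], 2)
print(sorted(name for name, _ in out))
```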
- the speech style feature similarity calculation means 112 calculates, in units of words, the similarity of the way of speaking (speaking style feature similarity) from the word surface information and speech feature quantities (pause information and the like) stored in the word string information storage means 111 and from the distribution functions constituting the speaking style feature models stored in the speech style feature model storage devices 104-1 to 104-n. Specifically, let x_{t,j} be the value of the j-th speech feature quantity of a word w_t, and consider the j-th distribution function of the speaking style feature model stored in the i-th speech style feature model storage device 104-i; the similarity T_{i,k}(w_t) is obtained as the weighted sum, over the feature quantities j, of these distribution functions evaluated at x_{t,j} (equation (4)).
- T_{i,1}(w_t) is the similarity regarding the usage characteristics of the period immediately after the word w_t,
- T_{i,2}(w_t) is the similarity regarding the usage characteristics of the comma immediately after the word w_t, and
- T_{i,NULL}(w_t) is the similarity regarding the characteristic of inserting no symbol immediately after the word w_t.
- the possible values of j are 1 and 2.
- constants may be used for the weights a_{i,j}, or they may be adjusted empirically through preliminary experiments, or estimated in advance from teacher data by a known technique such as the steepest descent method.
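A sketch of the speaking style feature similarity as a weighted sum of per-feature distribution-function values, using weights a_{i,j} as described above. Since equation (4) itself is not reproduced in the text, the weighted-sum form and the toy distribution functions below are assumptions.

```python
def similarity(feature_values, dist_funcs, weights):
    """Speaking-style similarity T_{i,k}(w_t), sketched as the weighted sum
    over feature quantities j of the model's j-th distribution function
    evaluated at the word's j-th feature value x_{t,j}. The weighted-sum
    form is an assumption based on the weights a_{i,j} in the text."""
    return sum(a * f(x) for a, f, x in zip(weights, dist_funcs, feature_values))


# toy cumulative distribution functions for pause length (j=1) and speech rate (j=2)
f_pause = lambda x: min(x / 1.0, 1.0)
f_rate = lambda x: min(x / 10.0, 1.0)

t = similarity([0.5, 5.0], [f_pause, f_rate], [0.6, 0.4])
print(t)
```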
- the unit in which the feature quantities are extracted may be every predetermined number of words stored in the word string information storage means 111, as described above, or all of the stored words.
- a symbol insertion model is created so as to correspond one-to-one with a speaking style feature model; that is, it is created using all of the learning data used for the corresponding speaking style feature model, which was obtained by classifying the data by speaker and then clustering models with similar features bottom-up. For example, when the speaking style feature models SM_AB to SM_YZ in FIG. 4 are the final n models, the symbol insertion model KM_AB corresponding one-to-one to the speaking style feature model SM_AB is generated from the learning data D_A of speaker A and the learning data D_B of speaker B.
- the symbol insertion model can be created using a known technique such as a word-with-symbol n-gram model as described above.
- the created symbol insertion models KM AB to KM YZ are stored in the n symbol insertion model storage devices 103-1 to 103-n in FIG.
- the symbol insertion likelihood S_{i,k}(w_t) that the symbol C_k is inserted immediately after the word w_t having word information W_t is expressed by the following equation (5), using the likelihood function g_i(W, C), learned by the known technique for the i-th symbol insertion model, that the symbol C is inserted immediately after a word w having word information W:
- S_{i,k}(w_t) = g_i(W_t, C_k) / Σ_{k'} g_i(W_t, C_{k'})
- in equation (5), the numerator g_i(W_t, C_k) on the right side is the likelihood that the symbol C_k is inserted immediately after the word w_t when the word information of n words (n > 1) is input, and S_{i,k}(w_t) normalizes this likelihood over the symbols C_{k'} that can be inserted immediately after the word w_t.
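The normalization of equation (5) can be sketched directly: each raw likelihood g_i(W_t, C_k) is divided by the sum over all insertable symbols. The dictionary keys and values below are illustrative.

```python
def insertion_likelihood(g_values):
    """Normalized symbol insertion likelihood S_{i,k}(w_t): each raw
    likelihood g_i(W_t, C_k) divided by the sum over all symbols that can
    be inserted immediately after the word (equation (5), sketched)."""
    total = sum(g_values.values())
    return {c: g / total for c, g in g_values.items()}


s = insertion_likelihood({"PERIOD": 0.3, "COMMA": 0.1, "NULL": 0.6})
print(s["PERIOD"], round(sum(s.values()), 6))
```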
- the symbol insertion determination means 114 calculates a model-specific symbol insertion score using the symbol insertion likelihood calculated by the symbol insertion likelihood calculation means 113 and the speaking style similarity calculated by the speech style feature similarity calculation means 112.
- the model-specific symbol insertion score F_i(w_t, C_k) for the i-th symbol insertion model is calculated by the following equation, using the speaking style feature similarity T_{i,k}(w_t) of the word w_t to the speaking style feature model in the speech style feature model storage device 104-i and the symbol insertion likelihood S_{i,k}(w_t) for the symbol insertion model stored in the symbol insertion model storage device 103-i:
- F_i(w_t, C_k) = T_{i,k}(w_t) · S_{i,k}(w_t)
- that is, the model-specific symbol insertion score F_i(w_t, C_1) that a period is inserted immediately after the word w_t in the i-th symbol insertion model is the likelihood S_{i,1}(w_t), calculated from the i-th symbol insertion model, that a period is inserted immediately after w_t, weighted by the similarity T_{i,1}(w_t), calculated from the i-th speaking style feature model, regarding the usage characteristics of the period immediately after w_t.
- likewise, the score F_i(w_t, C_2) that a comma is inserted immediately after the word w_t is the comma insertion likelihood S_{i,2}(w_t) weighted by the similarity T_{i,2}(w_t) regarding the usage characteristics of the comma immediately after w_t.
- the score F_i(w_t, C_NULL) that no symbol is inserted immediately after the word w_t is the NULL insertion likelihood S_{i,NULL}(w_t), calculated from the i-th symbol insertion model, weighted by the similarity T_{i,NULL}(w_t).
- Example 1: the symbol insertion determination means 114 calculates the integrated symbol insertion score F(w_t, C_k) used for symbol insertion determination by the following formula, using the model-specific symbol insertion scores F_i(w_t, C_k).
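A sketch of this integration: the model-specific scores F_i = T_{i,k} · S_{i,k} are combined into an integrated score, optionally summing only the top n′ models as in determination method 1. The top-n′ option and the toy values are assumptions for illustration.

```python
def integrated_score(similarities, likelihoods, n_prime=None):
    """Integrated symbol insertion score F(w_t, C_k): the model-specific
    scores F_i = T_{i,k} * S_{i,k} summed over the models (over only the
    top n' models when n_prime is given, as in determination method 1)."""
    per_model = [t * s for t, s in zip(similarities, likelihoods)]
    if n_prime is not None:
        per_model = sorted(per_model, reverse=True)[:n_prime]
    return sum(per_model)


# toy similarities T_{i,k}(w_t) and likelihoods S_{i,k}(w_t) for three models
F = integrated_score([0.9, 0.1, 0.5], [0.8, 0.7, 0.4], n_prime=2)
print(round(F, 3))
```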
- Example 2: the symbol insertion determination means 114 calculates the integrated symbol insertion score F(w_t, C_k) used for symbol insertion determination in the same manner as in Example 1.
- the symbol insertion determination means 114 inserts the symbol immediately after the word w_t when the integrated symbol insertion score F(w_t, C_k) is larger than the threshold θ_k.
- the threshold θ_k may differ depending on the type of the symbol C_k and can be adjusted.
- Example 3: The symbol insertion determination means 114 calculates the model-specific symbol insertion scores F_i(w_t, C_k) in the same manner as in Example 1. Next, symbol insertion determination is performed for each symbol insertion model, and the symbol determined most often is used as the final output. Specifically, first, as shown in the following equation, the symbol Ĉ_i that maximizes the model-specific symbol insertion score F_i(w_t, C_k) is obtained for every symbol insertion model.
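The three determination examples can be sketched as follows. This is an illustrative sketch only: a plain sum over models stands in for the patent's exact integration formula, and all score values are invented.

```python
# Sketch of the three determination examples (values illustrative; the
# exact integration formula is given by the patent's equations).
from collections import Counter

SYMBOLS = ("PERIOD", "COMMA", "NULL")

def integrated_scores(per_model):
    """Example 1: combine the model-specific scores F_i(w_t, C_k);
    a plain sum over models stands in for the integration formula."""
    return {k: sum(f[k] for f in per_model) for k in SYMBOLS}

def decide_by_threshold(scores, thresholds):
    """Example 2: insert the best symbol C_k only if its integrated
    score exceeds the symbol-dependent, adjustable threshold theta_k."""
    best = max(scores, key=scores.get)
    return best if scores[best] > thresholds[best] else "NULL"

def decide_by_vote(per_model):
    """Example 3: each model votes for its best symbol C-hat_i; the
    symbol determined most often becomes the final output."""
    votes = [max(f, key=f.get) for f in per_model]
    return Counter(votes).most_common(1)[0][0]

per_model = [
    {"PERIOD": 0.4, "COMMA": 0.1, "NULL": 0.2},
    {"PERIOD": 0.3, "COMMA": 0.5, "NULL": 0.1},
    {"PERIOD": 0.6, "COMMA": 0.2, "NULL": 0.1},
]
scores = integrated_scores(per_model)  # PERIOD: 1.3, COMMA: 0.8, NULL: 0.4
```

Here two of three models favor the period, so both the integrated score and the majority vote select it.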
- When the processing device 101 receives information on a word string subject to symbol insertion from the input device 102, it stores the information in the word string information storage means 111 as shown in FIG. 2 (S201 in FIG. 5).
- Next, the processing device 101 focuses on the word contained in the first piece of word information 1022 in the word string information 1021 stored in the word string information storage means 111 (S202).
- The processing device 101 then causes the speaking-style feature similarity calculation means 112 to read the word information of the focused word from the word string information storage means 111 and the speaking-style feature models from the speaking-style feature model storage devices 104-1 to 104-n, and, using equation (4), to calculate the similarity of speaking style between the focused word and the n speaking-style feature models for each of the inserted symbols period, comma, and NULL (S203).
- The processing device 101 likewise causes the symbol insertion likelihood calculation means 113 to read the word information of the focused word from the word string information storage means 111 and the symbol insertion models from the symbol insertion model storage devices 103-1 to 103-n, and, using equation (5), to calculate, for each of the n symbol insertion models, the symbol insertion likelihood that a period, a comma, or NULL is inserted immediately after the focused word (S204).
- The processing device 101 then causes the symbol insertion likelihood calculation means 113 to weight the symbol insertion likelihood of each symbol insertion model by multiplying it, as in equation (6), by the speaking-style similarity between the focused word and the corresponding speaking-style feature model (S205).
- Next, the processing device 101 causes the symbol insertion likelihood calculation means 113 to determine, according to one of the symbol insertion determination methods of Examples 1 to 3 described above, whether or not to insert a punctuation mark immediately after the focused word and, if so, whether the inserted symbol is a period or a comma (S206). The symbol insertion likelihood calculation means 113 then generates output word information containing the focused word according to the symbol insertion determination result and stores it in the symbol-inserted word string information storage means 115 (S207).
- When the processing device 101 finishes the process focusing on the first word in the word string information 1021 stored in the word string information storage means 111, it shifts its focus to the second word in the word string information 1021 (S208) and repeats the processes of steps S203 to S207.
- When the processing device 101 completes the process focusing on the last word in the word string information 1021 stored in the word string information storage means 111 (YES in step S209), it outputs the information on the symbol-inserted word string stored in the symbol-inserted word string information storage means 115 from the output device 105 (S210).
- In the processing flow above, the speaking-style feature similarity and the symbol insertion likelihood of a word are calculated in the process focusing on that one word.
- Alternatively, the speaking-style feature similarities may first be calculated for all words, and the symbol insertion likelihoods then calculated by focusing on each word again. In that case, however, the calculated feature similarities must be stored until the weighting step.
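The word-by-word flow S201-S210 can be sketched as a simple loop. The model objects, the decision strategy, and the toy rule below are illustrative stand-ins for the patent's means 112-114, not the actual implementation.

```python
# Sketch of the word-by-word flow S201-S210. Symbols are the period
# "。", the comma "、", or NULL (insert nothing); all rules are toy ones.

def insert_symbols(words, models, decide):
    output = []
    for w in words:                        # S202/S208: focus on each word in turn
        per_model = []
        for m in models:                   # S203: speaking-style similarity T_{i,k}
            T = m["similarity"](w)
            S = m["likelihood"](w)         # S204: symbol insertion likelihood S_{i,k}
            per_model.append({k: T[k] * S[k] for k in T})  # S205: weighting
        symbol = decide(per_model)         # S206: symbol insertion determination
        output.append(w + ("" if symbol == "NULL" else symbol))  # S207
    return output                          # S210: symbol-inserted word string

# Toy model: words ending in "masu" (a typical sentence-final form in
# Japanese) make a period likely; this rule is purely hypothetical.
model = {
    "similarity": lambda w: {"。": 1.0, "NULL": 1.0},
    "likelihood": lambda w: ({"。": 0.9, "NULL": 0.1} if w.endswith("masu")
                             else {"。": 0.1, "NULL": 0.9}),
}
decide = lambda per_model: max(per_model[0], key=per_model[0].get)
result = insert_symbols(["omoi", "masu", "kyou"], [model], decide)
# result == ["omoi", "masu。", "kyou"]
```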
- Referring to FIG. 6, the symbol insertion device 200 according to the second embodiment differs from the symbol insertion device 100 according to the first embodiment shown in FIG. 1 in that the n symbol insertion models stored in the symbol insertion model storage devices 103-1 to 103-n and the m speaking-style feature models stored in the m speaking-style feature model storage devices 104-1 to 104-m do not correspond one to one.
- In the first embodiment, the learning data used for creating each symbol insertion model was the same as the learning data used for creating the corresponding speaking-style feature model, so the symbol insertion models corresponded one to one with the speaking-style feature models. However, depending on how the models are created, the learning data used to create a symbol insertion model and the learning data used to create a speaking-style feature model are not necessarily the same; the learning data used to create one symbol insertion model may mix the learning data used to create a plurality of speaking-style feature models.
- For example, when the speaking-style feature models SM_AB to SM_YZ are created in the same manner as in the first embodiment, while the symbol insertion models are created genre by genre (for example, genres such as news programs and variety programs when each speaker is an announcer), the n symbol insertion models and the m speaking-style feature models have no one-to-one correspondence.
- An object of the present embodiment is to make it possible to perform symbol insertion determination in consideration of the difference in the speaking-style characteristics of each speaker even under such circumstances.
- To this end, the present embodiment newly includes a model relevance storage device 201 that stores the relevance between the n symbol insertion models and the m speaking-style feature models, and the processing device 101 includes symbol insertion determination means 202 in place of the symbol insertion determination means 114.
- The model relevance storage device 201 stores the degree of association O_i,j between the symbol insertion model stored in the symbol insertion model storage device 103-i (1 ≤ i ≤ n) and the speaking-style feature model stored in the speaking-style feature model storage device 104-j (1 ≤ j ≤ m).
- FIG. 7 shows an example of the association degrees O_i,j stored in the model relevance storage device 201.
- For example, the model relevance O_2,4 between the symbol insertion model stored in the symbol insertion model storage device 103-2 and the speaking-style feature model stored in the speaking-style feature model storage device 104-4 is 0.03.
- Each model relevance O i, j is a constant determined by the degree of overlap between the learning data used for learning the symbol insertion model and the learning data used for learning the speech style feature model.
- the model relevance O i, j can be obtained using mutual information.
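One simple way to picture such a constant is the fraction of shared learning data. This is an illustrative sketch only: the patent allows mutual information as one way to obtain O_i,j, and a plain overlap fraction stands in here as an assumed alternative.

```python
# Sketch: a relevance constant O_{i,j} from the overlap of learning-data
# sets (a plain overlap fraction; the patent also allows deriving the
# constant via mutual information).

def relevance(symbol_model_data, style_model_data):
    """Fraction of the symbol insertion model's learning data that was
    also used to learn the speaking-style feature model."""
    a, b = set(symbol_model_data), set(style_model_data)
    return len(a & b) / len(a) if a else 0.0

# Symbol insertion model trained on speakers A-D, style model on C-F:
O = relevance({"A", "B", "C", "D"}, {"C", "D", "E", "F"})  # shared: C, D
```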
- The symbol insertion determination means 202 of the processing device 101 differs from the symbol insertion determination means 114 in the first embodiment in that the model-specific symbol insertion score F_i(w_t, C_k) of the i-th symbol insertion model is calculated using the following equation.
- That is, the model-specific symbol insertion score for inserting the symbol C_k is obtained, as in equation (6), using the speaking-style feature similarity as a weight, with the degree of association (correspondence) O_i,j between the data of the speaking-style feature model and that of the symbol insertion model applied as a further weight.
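Under one plausible reading of this modification (an assumption, since the exact equation is given only in the patent's formula), the relevance-weighted contributions of all m speaking-style feature models are summed:

```python
# Assumed form of the second-embodiment score (illustrative only):
#   F_i(w_t, C_k) = S_{i,k}(w_t) * sum_j O_{i,j} * T_{j,k}(w_t)
# i.e. the first-embodiment weighting, with each style model's similarity
# further weighted by its relevance O_{i,j} to symbol insertion model i.

def model_score_with_relevance(S_i, T_all, O_i):
    """S_i: likelihoods of symbol insertion model i; T_all: similarities
    T_{j,k} of the m style models; O_i: relevances O_{i,j} of model i."""
    return {
        k: S_i[k] * sum(o * T[k] for o, T in zip(O_i, T_all))
        for k in S_i
    }

S_i = {"PERIOD": 0.5, "NULL": 0.5}
T_all = [{"PERIOD": 1.0, "NULL": 0.2}, {"PERIOD": 0.0, "NULL": 1.0}]
O_i = [0.8, 0.2]  # model i is mostly related to the first style model
F_i = model_score_with_relevance(S_i, T_all, O_i)
# F_i["PERIOD"] == 0.5 * (0.8*1.0 + 0.2*0.0) == 0.4
```

Style models barely related to a symbol insertion model thus contribute little to its score.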
- According to the present embodiment, the same effects as those of the first embodiment can be obtained, and at the same time, since the speaking-style feature models and the symbol insertion models need not correspond one to one, the freedom in constructing the models can be increased.
- Referring to FIG. 8, the speech recognition device 300 according to the third embodiment has a configuration in which speech recognition means 311 and word string information generation means 312 are newly provided in the processing device 101 constituting the symbol insertion device 100 or 200 according to the first or second embodiment, and a microphone 301 is connected to the processing device 101.
- The speech recognition means 311 and the word string information generation means 312 can be realized by the computer constituting the processing device 101 and a program.
- The speech signal input from the microphone 301 is transmitted to the speech recognition means 311 of the processing device 101, where known speech recognition processing is performed, and the speech recognition result is output to the word string information generation means 312.
- The speech recognition result carries the time at which each word was uttered and linguistic information on the word (information such as part of speech and conjugation), and is output in a predetermined format.
- The word string information generation means 312 acquires the surface information and part-of-speech information of each word by pattern matching against the speech recognition result output in the predetermined format. It also calculates the pause length by taking the difference between the end time of one word and the start time of the following word. The word string information generation means 312 then generates the word string information 1021 as shown in FIG. 2 and stores it in the word string information storage means 111.
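The pause-length derivation from recognition timestamps can be sketched as follows. The record layout and field names are illustrative, not the patent's format.

```python
# Sketch: deriving the pause length after each word from recognizer
# timestamps, as the word string information generation means 312 does.

def with_pause_lengths(recognized):
    """recognized: list of (word, start_time, end_time) in seconds.
    The pause after a word is the gap to the next word's start time."""
    out = []
    for idx, (word, start, end) in enumerate(recognized):
        if idx + 1 < len(recognized):
            pause = recognized[idx + 1][1] - end
        else:
            pause = 0.0  # no following word
        out.append({"word": word, "pause": round(pause, 3)})
    return out

words = [("kyou", 0.00, 0.35), ("wa", 0.40, 0.55), ("hare", 1.05, 1.40)]
rows = with_pause_lengths(words)
# pauses: 0.05 after "kyou", 0.5 after "wa", 0.0 after "hare"
```

A long pause such as the 0.5 s gap after "wa" is exactly the kind of speech feature the speaking-style feature models evaluate.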
- symbols such as punctuation marks are inserted into the word string stored in the word string information storage unit 111 by the same configuration and operation as in the first or second embodiment.
- As described above, the present embodiment provides a speech recognition device that recognizes speech uttered by a speaker and automatically inserts symbols such as punctuation marks into the speech-recognized word string.
- The present invention can be applied to uses such as a speech recognition device that converts a speech signal into text, and a program for realizing such a speech recognition device on a computer.
- It can also be applied to content playback devices and content search devices that divide audio or video content into appropriate units and display, play back, or search the content in those divided units, as well as to transcription support devices for recorded audio data.
Abstract
Description
101…Processing device
102…Input device
103-1 to 103-n…Symbol insertion model storage devices
104-1 to 104-n…Speaking-style feature model storage devices
105…Output device
111…Word string information storage means
112…Speaking-style feature similarity calculation means
113…Symbol insertion likelihood calculation means
114, 202…Symbol insertion determination means
115…Symbol-inserted word string information storage means
201…Model relevance storage device
300…Speech recognition device
301…Microphone
311…Speech recognition means
312…Word string information generation means
Referring to FIG. 1, the symbol insertion device 100 according to the first embodiment of the present invention comprises a processing device 101, and, connected to this processing device 101, an input device 102, n symbol insertion model storage devices 103-1 to 103-n, likewise n speaking-style feature model storage devices 104-1 to 104-n, and an output device 105.
The sum of the top n' (n' is a constant not smaller than 1 and not larger than n; the same applies below) weighted period insertion likelihoods of the symbol insertion models is calculated as the integrated insertion likelihood of the period; the sum of the top n' weighted comma insertion likelihoods, as the integrated insertion likelihood of the comma; and the sum of the top n' weighted NULL insertion likelihoods (the likelihood that neither a period nor a comma is inserted), as the integrated insertion likelihood of NULL. Next, the symbol with the largest integrated insertion likelihood gives the symbol insertion determination result. For example, if the integrated insertion likelihood of the period is the largest of the three, a determination result to the effect that a period is to be inserted is generated.
In another variation, the sum of the top n' weighted period insertion likelihoods of the symbol insertion models is calculated as the integrated insertion likelihood of the period, and the sum of the top n' weighted comma insertion likelihoods as the integrated insertion likelihood of the comma. Next, when a priority order is predetermined among the plural symbols, for example period then comma in descending order of priority, the integrated insertion likelihood of the period is first compared with a predetermined threshold, and if it is equal to or greater than the threshold, a determination result to insert a period is generated. If the integrated insertion likelihood of the period is below the threshold, the integrated insertion likelihood of the comma is then compared with a predetermined threshold, and if it is equal to or greater than the threshold, a determination result to insert a comma is generated. If the integrated insertion likelihood of the comma is below the threshold, a determination result to insert no punctuation is generated.
In yet another variation, for each symbol insertion model, the symbol (period, comma, NULL) with the largest of the weighted period insertion likelihood, the weighted comma insertion likelihood, and the weighted NULL insertion likelihood is obtained, and the symbol (period, comma, NULL) obtained most often across all the symbol insertion models is determined by majority vote and taken as the determination result.
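The top-n' integration can be sketched compactly. The likelihood values and the choice n' = 2 below are illustrative, not from the patent.

```python
# Sketch of the top-n' integrated insertion likelihood: for each symbol,
# sum only the n' largest weighted likelihoods across the models, then
# take the symbol with the largest integrated likelihood.

def integrated_likelihood(per_model, n_prime):
    symbols = per_model[0].keys()
    return {
        k: sum(sorted((f[k] for f in per_model), reverse=True)[:n_prime])
        for k in symbols
    }

per_model = [
    {"。": 0.6, "、": 0.1, "NULL": 0.2},
    {"。": 0.2, "、": 0.5, "NULL": 0.4},
    {"。": 0.5, "、": 0.3, "NULL": 0.1},
]
scores = integrated_likelihood(per_model, n_prime=2)
# "。": 0.6 + 0.5, "、": 0.5 + 0.3, "NULL": 0.4 + 0.2
best = max(scores, key=scores.get)  # the period "。" is inserted
```

Limiting the sum to the top n' models keeps poorly matching models (low similarity weights) from diluting the decision.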
Next, one example of the present embodiment will be described, focusing on the speaking-style feature models, the symbol insertion models, the speaking-style feature similarity calculation, the symbol insertion likelihood calculation, and the symbol insertion determination.
The speaking-style feature similarity calculation means 112 calculates, word by word, the likelihood of closeness of speaking-style characteristics (the speaking-style feature similarity) from the surface information and speech feature values (such as pause information) of the words stored in the word string information storage means 111 and from the above-described distribution functions constituting the speaking-style feature models stored in the speaking-style feature model storage devices 104-1 to 104-n. Specifically, let x_t,j be the value of the j-th speech feature of a word w_t, let f_j(x_t,j, i, C_k) be the distribution function of the j-th speech feature of the speaking-style feature model stored in the i-th speaking-style feature model storage device 104-i, and let a_i,j be its weight; then the speaking-style feature similarity T_i,k(w_t) with respect to the speaking-style feature model stored in the i-th speaking-style feature model storage device 104-i is calculated by the following equation.
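A minimal sketch of this similarity, assuming an additive combination of weighted distribution functions with Gaussian densities standing in for f_j (the patent's equation (4) fixes the exact form; all parameters below are invented):

```python
# Sketch of the speaking-style feature similarity
#   T_{i,k}(w_t) = sum_j a_{i,j} * f_j(x_{t,j}, i, C_k)   (assumed form)
# with Gaussian pdfs standing in for the per-feature distribution functions.
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def style_similarity(features, model):
    """features: speech feature values x_{t,j} of the word (e.g. pause
    length). model: per-feature weight a_{i,j} and distribution params."""
    return sum(
        m["weight"] * gaussian_pdf(x, m["mean"], m["var"])
        for x, m in zip(features, model)
    )

model_i = [
    {"weight": 0.7, "mean": 0.3, "var": 0.01},     # pause length after w_t (s)
    {"weight": 0.3, "mean": 120.0, "var": 400.0},  # e.g. speaking rate
]
T = style_similarity([0.3, 120.0], model_i)  # both features at the model's mode
```

A word whose features sit near the model's modes scores higher than one that deviates from them.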
Each symbol insertion model is created using data corresponding to a speaking-style feature model, that is, using all of the learning data that was used to create that speaking-style feature model by classifying the data by speaker and then clustering models with similar characteristics in a bottom-up manner. For example, if the speaking-style feature models SM_AB to SM_YZ in FIG. 4 are the final n speaking-style feature models, the symbol insertion model KM_AB in one-to-one correspondence with the speaking-style feature model SM_AB is generated from the learning data D_A of speaker A and the learning data D_B of speaker B. As described above, the symbol insertion models can be created using a known technique such as a word n-gram model with symbols. The created symbol insertion models KM_AB to KM_YZ are stored in the n symbol insertion model storage devices 103-1 to 103-n of FIG. 1.
The symbol insertion likelihood calculation means 113 uses information such as the surface information and part-of-speech information of each word stored in the word string information storage means 111, together with the symbol insertion models stored in the symbol insertion model storage devices 103-1 to 103-n, to obtain, for each symbol insertion model, a symbol insertion likelihood indicating how probable it is that a symbol C_k is inserted immediately after a certain word w_t (or that no symbol is inserted, C_k = NULL). The symbol insertion likelihood S_i,k(w_t) that the symbol C_k is inserted immediately after the word w_t having word information W_t is expressed by the following equation, using the likelihood function g_i(W, C), learned by the aforementioned known technique, that the symbol C is inserted immediately after a word w having word information W under the i-th symbol insertion model.
(5-1) Example 1
The symbol insertion determination means 114 first calculates model-specific symbol insertion scores using the symbol insertion likelihoods calculated by the symbol insertion likelihood calculation means 113 and the speaking-style similarities calculated by the speaking-style feature similarity calculation means 112. Specifically, the model-specific symbol insertion score F_i(w_t, C_k) of the i-th symbol insertion model is calculated by the following equation, using the speaking-style feature similarity T_i,k(w_t) of the word w_t with respect to the speaking-style feature model of the speaking-style feature model storage device 104-i and the symbol insertion likelihood S_i,k(w_t) for the symbol insertion model stored in the symbol insertion model storage device 103-i.
The symbol insertion determination means 114 calculates the integrated symbol insertion score F(w_t, C_k) used for symbol insertion determination in the same manner as in Example 1.
The symbol insertion determination means 114 calculates the model-specific symbol insertion scores F_i(w_t, C_k) in the same manner as in Example 1. Next, symbol insertion determination is performed for each symbol insertion model, and the symbol determined most often is taken as the final output. Specifically, first, as shown in the following equation, the symbol Ĉ_i that maximizes the model-specific symbol insertion score F_i(w_t, C_k) is obtained for every symbol insertion model.
Referring to FIG. 6, the symbol insertion device 200 according to the second embodiment of the present invention differs from the symbol insertion device 100 according to the first embodiment shown in FIG. 1 in that the n symbol insertion models stored in the n symbol insertion model storage devices 103-1 to 103-n and the m speaking-style feature models stored in the m speaking-style feature model storage devices 104-1 to 104-m do not correspond one to one.
Referring to FIG. 8, the speech recognition device 300 according to the third embodiment of the present invention has a configuration in which speech recognition means 311 and word string information generation means 312 are newly provided in the processing device 101 constituting the symbol insertion device 100 or 200 according to the first or second embodiment, and a microphone 301 is connected to the processing device 101. The speech recognition means 311 and the word string information generation means 312 can be realized by the computer constituting the processing device 101 and a program.
Claims (29)
- A symbol insertion device for inserting symbols into a word string obtained by transcribing speech information, wherein a symbol insertion likelihood obtained, for a word string subject to symbol insertion, for each of a plurality of symbol insertion models provided according to speaking-style characteristics is weighted by the similarity between the speaking-style characteristics of the word string and a plurality of speaking-style feature models and by the relevance between the symbol insertion models and the speaking-style feature models, and symbol insertion into the word string is thereby determined.
- The symbol insertion device according to claim 1, comprising: symbol insertion likelihood calculation means for obtaining a symbol insertion likelihood for the word string for each of the plurality of symbol insertion models; speaking-style feature similarity calculation means for obtaining the similarity between the speaking-style characteristics of the word string and the plurality of speaking-style feature models; and symbol insertion determination means for weighting the symbol insertion likelihood obtained for the word string for each of the plurality of symbol insertion models by the similarity between the speaking-style characteristics of the word string and the plurality of speaking-style feature models and by the relevance between the symbol insertion models and the speaking-style feature models, and thereby determining symbol insertion into the word string.
- The symbol insertion device according to claim 1 or 2, wherein the speaking-style characteristics are at least acoustic feature values.
- The symbol insertion device according to claim 2 or 3, wherein the relevance is a constant determined by the degree of overlap between the learning data used for learning the symbol insertion model and the learning data used for learning the speaking-style feature model.
- The symbol insertion device according to any one of claims 2 to 4, comprising model relevance storage means for holding the relevance.
- The symbol insertion device according to any one of claims 2 to 5, wherein, where a symbol insertion likelihood obtained for the word string with a certain symbol insertion model and weighted by the similarity between the speaking-style characteristics of the word string and the plurality of speaking-style feature models and by the relevance between that symbol insertion model and the plurality of speaking-style feature models is defined as a model-specific symbol insertion likelihood, the symbol insertion determination means calculates, for each symbol to be inserted, the sum of a predetermined number of model-specific symbol insertion likelihoods selected from among the plurality of model-specific symbol insertion likelihoods obtained for the plurality of symbol insertion models, and determines the symbol with the largest sum as the symbol to be inserted.
- The symbol insertion device according to any one of claims 2 to 5, wherein, where a symbol insertion likelihood obtained for the word string with a certain symbol insertion model and weighted by the similarity between the speaking-style characteristics of the word string and the plurality of speaking-style feature models and by the relevance between that symbol insertion model and the plurality of speaking-style feature models is defined as a model-specific symbol insertion likelihood, the symbol insertion determination means calculates, for each symbol to be inserted, the sum of a predetermined number of model-specific symbol insertion likelihoods selected from among the plurality of model-specific symbol insertion likelihoods obtained for the plurality of symbol insertion models, and performs symbol insertion determination by comparing the sum with a threshold.
- The symbol insertion device according to claim 6 or 7, wherein the symbol insertion determination means selects the predetermined number of model-specific symbol insertion likelihoods in descending order of likelihood from among the plurality of model-specific symbol insertion likelihoods obtained for the plurality of symbol insertion models.
- The symbol insertion device according to any one of claims 2 to 5, wherein, where a symbol insertion likelihood obtained for the word string with a certain symbol insertion model and weighted by the similarity between the speaking-style characteristics of the word string and the plurality of speaking-style feature models and by the relevance between that symbol insertion model and the plurality of speaking-style feature models is defined as a model-specific symbol insertion likelihood, the symbol insertion determination means obtains, for each symbol insertion model, the insertion symbol that maximizes the model-specific symbol insertion likelihood, and performs symbol insertion determination by majority vote among the insertion symbols obtained for the respective symbol insertion models.
- A speech recognition device comprising: speech recognition means for performing speech recognition on input speech and outputting a speech recognition result; and word string information generation means for generating, from the speech recognition result output from the speech recognition means, the word string subject to symbol insertion to be input to the symbol insertion device according to any one of claims 1 to 9.
- A symbol insertion method for inserting symbols into a word string obtained by transcribing speech information, wherein a symbol insertion likelihood obtained, for a word string subject to symbol insertion, for each of a plurality of symbol insertion models provided according to speaking-style characteristics is weighted by the similarity between the speaking-style characteristics of the word string and a plurality of speaking-style feature models and by the relevance between the symbol insertion models and the speaking-style feature models, and symbol insertion into the word string is thereby determined.
- The symbol insertion method according to claim 11, comprising: a symbol insertion likelihood calculation step in which symbol insertion likelihood calculation means obtains a symbol insertion likelihood for the word string for each of the plurality of symbol insertion models; a speaking-style feature similarity calculation step in which speaking-style feature similarity calculation means obtains the similarity between the speaking-style characteristics of the word string and the plurality of speaking-style feature models; and a symbol insertion determination step in which symbol insertion determination means weights the symbol insertion likelihood obtained for the word string for each of the plurality of symbol insertion models by the similarity between the speaking-style characteristics of the word string and the plurality of speaking-style feature models and by the relevance between the symbol insertion models and the speaking-style feature models, and thereby determines symbol insertion into the word string.
- The symbol insertion method according to claim 11 or 12, wherein the speaking-style characteristics are at least acoustic feature values.
- The symbol insertion method according to claim 12 or 13, wherein the relevance is a constant determined by the degree of overlap between the learning data used for learning the symbol insertion model and the learning data used for learning the speaking-style feature model.
- The symbol insertion method according to any one of claims 12 to 14, wherein the symbol insertion determination means receives the relevance from model relevance storage means holding the relevance.
- The symbol insertion method according to any one of claims 12 to 15, wherein, where a symbol insertion likelihood obtained for the word string with a certain symbol insertion model and weighted by the similarity between the speaking-style characteristics of the word string and the plurality of speaking-style feature models and by the relevance between that symbol insertion model and the plurality of speaking-style feature models is defined as a model-specific symbol insertion likelihood, the symbol insertion determination means calculates, for each symbol to be inserted, the sum of a predetermined number of model-specific symbol insertion likelihoods selected from among the plurality of model-specific symbol insertion likelihoods obtained for the plurality of symbol insertion models, and determines the symbol with the largest sum as the symbol to be inserted.
- The symbol insertion method according to any one of claims 12 to 15, wherein, where a symbol insertion likelihood obtained for the word string with a certain symbol insertion model and weighted by the similarity between the speaking-style characteristics of the word string and the plurality of speaking-style feature models and by the relevance between that symbol insertion model and the plurality of speaking-style feature models is defined as a model-specific symbol insertion likelihood, the symbol insertion determination means calculates, for each symbol to be inserted, the sum of a predetermined number of model-specific symbol insertion likelihoods selected from among the plurality of model-specific symbol insertion likelihoods obtained for the plurality of symbol insertion models, and performs symbol insertion determination by comparing the sum with a threshold.
- The symbol insertion method according to claim 16 or 17, wherein the symbol insertion determination means selects the predetermined number of model-specific symbol insertion likelihoods in descending order of likelihood from among the plurality of model-specific symbol insertion likelihoods obtained for the plurality of symbol insertion models.
- The symbol insertion method according to any one of claims 12 to 15, wherein, where a symbol insertion likelihood obtained for the word string with a certain symbol insertion model and weighted by the similarity between the speaking-style characteristics of the word string and the plurality of speaking-style feature models and by the relevance between that symbol insertion model and the plurality of speaking-style feature models is defined as a model-specific symbol insertion likelihood, the symbol insertion determination means obtains, for each symbol insertion model, the insertion symbol that maximizes the model-specific symbol insertion likelihood, and performs symbol insertion determination by majority vote among the insertion symbols obtained for the respective symbol insertion models.
- A program for causing a computer constituting a symbol insertion device that inserts symbols into a word string obtained by transcribing speech information to function as means for weighting a symbol insertion likelihood obtained, for a word string subject to symbol insertion, for each of a plurality of symbol insertion models provided according to speaking-style characteristics, by the similarity between the speaking-style characteristics of the word string and a plurality of speaking-style feature models and by the relevance between the symbol insertion models and the speaking-style feature models, and thereby determining symbol insertion into the word string.
- The program according to claim 20, for further causing the computer to function as: symbol insertion likelihood calculation means for obtaining a symbol insertion likelihood for the word string for each of the plurality of symbol insertion models; speaking-style feature similarity calculation means for obtaining the similarity between the speaking-style characteristics of the word string and the plurality of speaking-style feature models; and symbol insertion determination means for weighting the symbol insertion likelihood obtained for the word string for each of the plurality of symbol insertion models by the similarity between the speaking-style characteristics of the word string and the plurality of speaking-style feature models and by the relevance between the symbol insertion models and the speaking-style feature models, and thereby determining symbol insertion into the word string.
- The program according to claim 20 or 21, wherein the speaking-style characteristics are at least acoustic feature values.
- The program according to claim 21 or 22, wherein the relevance is a constant determined by the degree of overlap between the learning data used for learning the symbol insertion model and the learning data used for learning the speaking-style feature model.
- The program according to any one of claims 21 to 23, wherein the computer comprises model relevance storage means for holding the relevance.
- The program according to any one of claims 21 to 24, wherein, where a symbol insertion likelihood obtained for the word string with a certain symbol insertion model and weighted by the similarity between the speaking-style characteristics of the word string and the plurality of speaking-style feature models and by the relevance between that symbol insertion model and the plurality of speaking-style feature models is defined as a model-specific symbol insertion likelihood, the symbol insertion determination means calculates, for each symbol to be inserted, the sum of a predetermined number of model-specific symbol insertion likelihoods selected from among the plurality of model-specific symbol insertion likelihoods obtained for the plurality of symbol insertion models, and determines the symbol with the largest sum as the symbol to be inserted.
- The program according to any one of claims 21 to 24, wherein, where a symbol insertion likelihood obtained for the word string with a certain symbol insertion model and weighted by the similarity between the speaking-style characteristics of the word string and the plurality of speaking-style feature models and by the relevance between that symbol insertion model and the plurality of speaking-style feature models is defined as a model-specific symbol insertion likelihood, the symbol insertion determination means calculates, for each symbol to be inserted, the sum of a predetermined number of model-specific symbol insertion likelihoods selected from among the plurality of model-specific symbol insertion likelihoods obtained for the plurality of symbol insertion models, and performs symbol insertion determination by comparing the sum with a threshold.
- The program according to claim 25 or 26, wherein the symbol insertion determination means selects the predetermined number of model-specific symbol insertion likelihoods in descending order of likelihood from among the plurality of model-specific symbol insertion likelihoods obtained for the plurality of symbol insertion models.
- The program according to any one of claims 21 to 24, wherein, where a symbol insertion likelihood obtained for the word string with a certain symbol insertion model and weighted by the similarity between the speaking-style characteristics of the word string and the plurality of speaking-style feature models and by the relevance between that symbol insertion model and the plurality of speaking-style feature models is defined as a model-specific symbol insertion likelihood, the symbol insertion determination means obtains, for each symbol insertion model, the insertion symbol that maximizes the model-specific symbol insertion likelihood, and performs symbol insertion determination by majority vote among the insertion symbols obtained for the respective symbol insertion models.
- The program according to any one of claims 20 to 28, for further causing the computer to function as speech recognition means for performing speech recognition on input speech and outputting a speech recognition result, and as word string information generation means for generating the word string subject to symbol insertion from the speech recognition result output from the speech recognition means.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2009553380A JP5141695B2 (ja) | 2008-02-13 | 2009-01-19 | 記号挿入装置および記号挿入方法 |
US12/863,945 US8577679B2 (en) | 2008-02-13 | 2009-01-19 | Symbol insertion apparatus and symbol insertion method |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2008031287 | 2008-02-13 | ||
JP2008-031287 | 2008-02-13 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2009101837A1 true WO2009101837A1 (ja) | 2009-08-20 |
Family
ID=40956867
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/050641 WO2009101837A1 (ja) | 2008-02-13 | 2009-01-19 | 記号挿入装置および記号挿入方法 |
Country Status (3)
Country | Link |
---|---|
US (1) | US8577679B2 (ja) |
JP (1) | JP5141695B2 (ja) |
WO (1) | WO2009101837A1 (ja) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103474062A (zh) * | 2012-08-06 | 2013-12-25 | 苏州沃通信息科技有限公司 | 一种语音识别方法 |
JP2015219480A (ja) * | 2014-05-21 | 2015-12-07 | 日本電信電話株式会社 | 対話状況特徴計算装置、文末記号推定装置、これらの方法及びプログラム |
JP2020064370A (ja) * | 2018-10-15 | 2020-04-23 | 株式会社野村総合研究所 | 文章記号挿入装置及びその方法 |
JP2020064630A (ja) * | 2019-10-11 | 2020-04-23 | 株式会社野村総合研究所 | 文章記号挿入装置及びその方法 |
JP2020160782A (ja) * | 2019-03-26 | 2020-10-01 | 日本放送協会 | 自然言語データ処理装置およびプログラム |
WO2023100433A1 (ja) * | 2021-11-30 | 2023-06-08 | 株式会社Nttドコモ | 文字列出力装置 |
WO2024029152A1 (ja) * | 2022-08-05 | 2024-02-08 | 株式会社Nttドコモ | 区切り記号挿入装置及び音声認識システム |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8719004B2 (en) * | 2009-03-19 | 2014-05-06 | Ditech Networks, Inc. | Systems and methods for punctuating voicemail transcriptions |
CN104143331B (zh) | 2013-05-24 | 2015-12-09 | 腾讯科技(深圳)有限公司 | 一种添加标点的方法和系统 |
CN104142915B (zh) * | 2013-05-24 | 2016-02-24 | 腾讯科技(深圳)有限公司 | 一种添加标点的方法和系统 |
US9508338B1 (en) * | 2013-11-15 | 2016-11-29 | Amazon Technologies, Inc. | Inserting breath sounds into text-to-speech output |
US9607613B2 (en) | 2014-04-23 | 2017-03-28 | Google Inc. | Speech endpointing based on word comparisons |
US10269341B2 (en) | 2015-10-19 | 2019-04-23 | Google Llc | Speech endpointing |
US20170110118A1 (en) * | 2015-10-19 | 2017-04-20 | Google Inc. | Speech endpointing |
KR101942521B1 (ko) | 2015-10-19 | 2019-01-28 | 구글 엘엘씨 | 음성 엔드포인팅 |
US10929754B2 (en) | 2017-06-06 | 2021-02-23 | Google Llc | Unified endpointer using multitask and multidomain learning |
CN112581982B (zh) | 2017-06-06 | 2024-06-25 | 谷歌有限责任公司 | 询问结束检测 |
JP6728116B2 (ja) * | 2017-09-21 | 2020-07-22 | 株式会社東芝 | 音声認識装置、音声認識方法およびプログラム |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS6234200A (ja) * | 1985-08-08 | 1987-02-14 | 工業技術院長 | 韻律情報を利用した会話音声理解方法 |
JP2000029496A (ja) * | 1998-05-13 | 2000-01-28 | Internatl Business Mach Corp <Ibm> | 連続音声認識において句読点を自動的に生成する装置および方法 |
JP2001134289A (ja) * | 1999-11-08 | 2001-05-18 | Just Syst Corp | 音声認識システム、方法及び記録媒体 |
JP2003295888A (ja) * | 2002-04-04 | 2003-10-15 | Mitsubishi Electric Corp | 音声認識装置及びプログラム |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0693221B2 (ja) * | 1985-06-12 | 1994-11-16 | 株式会社日立製作所 | 音声入力装置 |
CA2119397C (en) * | 1993-03-19 | 2007-10-02 | Kim E.A. Silverman | Improved automated voice synthesis employing enhanced prosodic treatment of text, spelling of text and rate of annunciation |
JP3232289B2 (ja) * | 1999-08-30 | 2001-11-26 | インターナショナル・ビジネス・マシーンズ・コーポレーション | 記号挿入装置およびその方法 |
JP4229627B2 (ja) * | 2002-03-28 | 2009-02-25 | 株式会社東芝 | ディクテーション装置、方法及びプログラム |
EP1422692A3 (en) * | 2002-11-22 | 2004-07-14 | ScanSoft, Inc. | Automatic insertion of non-verbalized punctuation in speech recognition |
US8095364B2 (en) * | 2004-06-02 | 2012-01-10 | Tegic Communications, Inc. | Multimodal disambiguation of speech recognition |
-
2009
- 2009-01-19 US US12/863,945 patent/US8577679B2/en active Active
- 2009-01-19 WO PCT/JP2009/050641 patent/WO2009101837A1/ja active Application Filing
- 2009-01-19 JP JP2009553380A patent/JP5141695B2/ja active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS6234200A (ja) * | 1985-08-08 | 1987-02-14 | 工業技術院長 | 韻律情報を利用した会話音声理解方法 |
JP2000029496A (ja) * | 1998-05-13 | 2000-01-28 | Internatl Business Mach Corp <Ibm> | 連続音声認識において句読点を自動的に生成する装置および方法 |
JP2001134289A (ja) * | 1999-11-08 | 2001-05-18 | Just Syst Corp | 音声認識システム、方法及び記録媒体 |
JP2003295888A (ja) * | 2002-04-04 | 2003-10-15 | Mitsubishi Electric Corp | 音声認識装置及びプログラム |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103474062A (zh) * | 2012-08-06 | 2013-12-25 | 苏州沃通信息科技有限公司 | 一种语音识别方法 |
JP2015219480A (ja) * | 2014-05-21 | 2015-12-07 | 日本電信電話株式会社 | 対話状況特徴計算装置、文末記号推定装置、これらの方法及びプログラム |
JP2020064370A (ja) * | 2018-10-15 | 2020-04-23 | 株式会社野村総合研究所 | 文章記号挿入装置及びその方法 |
JP2020160782A (ja) * | 2019-03-26 | 2020-10-01 | 日本放送協会 | 自然言語データ処理装置およびプログラム |
JP7253951B2 (ja) | 2019-03-26 | 2023-04-07 | 日本放送協会 | 自然言語データ処理装置およびプログラム |
JP2020064630A (ja) * | 2019-10-11 | 2020-04-23 | 株式会社野村総合研究所 | 文章記号挿入装置及びその方法 |
JP7229144B2 (ja) | 2019-10-11 | 2023-02-27 | 株式会社野村総合研究所 | 文章記号挿入装置及びその方法 |
WO2023100433A1 (ja) * | 2021-11-30 | 2023-06-08 | 株式会社Nttドコモ | 文字列出力装置 |
WO2024029152A1 (ja) * | 2022-08-05 | 2024-02-08 | 株式会社Nttドコモ | 区切り記号挿入装置及び音声認識システム |
Also Published As
Publication number | Publication date |
---|---|
US20100292989A1 (en) | 2010-11-18 |
JPWO2009101837A1 (ja) | 2011-06-09 |
US8577679B2 (en) | 2013-11-05 |
JP5141695B2 (ja) | 2013-02-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5141695B2 (ja) | 記号挿入装置および記号挿入方法 | |
US11727914B2 (en) | Intent recognition and emotional text-to-speech learning | |
US10037758B2 (en) | Device and method for understanding user intent | |
KR102582291B1 (ko) | 감정 정보 기반의 음성 합성 방법 및 장치 | |
JP4267385B2 (ja) | 統計的言語モデル生成装置、音声認識装置、統計的言語モデル生成方法、音声認識方法、およびプログラム | |
US11189277B2 (en) | Dynamic gazetteers for personalized entity recognition | |
JP5932869B2 (ja) | N−gram言語モデルの教師無し学習方法、学習装置、および学習プログラム | |
WO2005122144A1 (ja) | 音声認識装置、音声認識方法、及びプログラム | |
TW201203222A (en) | Voice stream augmented note taking | |
Deena et al. | Recurrent neural network language model adaptation for multi-genre broadcast speech recognition and alignment | |
KR101677859B1 (ko) | 지식 베이스를 이용하는 시스템 응답 생성 방법 및 이를 수행하는 장치 | |
CN112669842A (zh) | 人机对话控制方法、装置、计算机设备及存储介质 | |
CN104750677A (zh) | 语音传译装置、语音传译方法及语音传译程序 | |
Ostrogonac et al. | Morphology-based vs unsupervised word clustering for training language models for Serbian | |
CN117043859A (zh) | 查找表循环语言模型 | |
CN111508481B (zh) | 语音唤醒模型的训练方法、装置、电子设备及存储介质 | |
CN108899016B (zh) | 一种语音文本规整方法、装置、设备及可读存储介质 | |
Masumura et al. | Training a Language Model Using Webdata for Large Vocabulary Japanese Spontaneous Speech Recognition. | |
CN115132170A (zh) | 语种分类方法、装置及计算机可读存储介质 | |
US11468897B2 (en) | Systems and methods related to automated transcription of voice communications | |
JP2002091484A (ja) | 言語モデル生成装置及びこれを用いた音声認識装置、言語モデル生成方法及びこれを用いた音声認識方法、並びに言語モデル生成プログラムを記録したコンピュータ読み取り可能な記録媒体及び音声認識プログラムを記録したコンピュータ読み取り可能な記録媒体 | |
JP2006107353A (ja) | 情報処理装置および方法、記録媒体、並びにプログラム | |
JP4674609B2 (ja) | 情報処理装置および方法、プログラム、並びに記録媒体 | |
Kawahara | Intelligent transcription system based on spontaneous speech processing | |
KR102392992B1 (ko) | 음성 인식 기능을 활성화시키는 호출 명령어 설정에 관한 사용자 인터페이싱 장치 및 방법 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 09709820 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 12863945 Country of ref document: US |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2009553380 Country of ref document: JP |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 09709820 Country of ref document: EP Kind code of ref document: A1 |