WO2009136440A1 - Speech recognition dictionary creation support device, processing program, and processing method - Google Patents
Speech recognition dictionary creation support device, processing program, and processing method
- Publication number
- WO2009136440A1 (PCT/JP2008/058615)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- speech
- cluster
- phoneme string
- phoneme
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Definitions
- The present invention relates to a speech recognition dictionary creation support apparatus, processing program, and processing method that support the creation of a speech recognition dictionary used in speech recognition processing. More specifically, the present invention relates to processing that extracts, from speech data, the speech of unknown words that are candidate keywords to be registered in the speech recognition dictionary.
- In call center operations, there is a need to grasp business details such as the types of customer inquiries, the content of questions, and the time required for responses, and to use this information for business analysis and planning. In many call centers, therefore, the operator records the content of each response and the response records are analyzed later. However, at small call centers there may be no response records, or the records may contain little information, so there can be a need to record the conversation between customer and operator and analyze the spoken dialogue data.
- Conventionally, keywords are extracted from manuals and related documents concerning the work handled at a call center, and the speech data for those keywords is added to the keyword dictionary. Alternatively, an operator actually listens to the spoken dialogue data from the beginning and manually extracts the keyword portions.
- Patent Document 1 discloses a process that prepares a speech recognition grammar in advance on the assumption that unknown words will appear, extracts the acoustic feature information and phoneme sequence of each section in which an unknown word is assumed to occur, performs clustering based on that information, detects a representative phoneme sequence of each cluster as an unknown word, and additionally registers it in a dictionary.
- However, when keywords are extracted by manually listening to the spoken dialogue data, listening to the speech data takes a long time and the labor cost is very high.
- In the method of Patent Document 1, the sections in which an unknown word will be uttered are determined in advance by the grammatical structure used for speech recognition, so the method is difficult to apply to speech data recording dialogue that does not follow a fixed pattern.
- An object of the present invention is to provide an apparatus, a processing program, and a processing method for efficiently extracting unknown words that can serve as keywords from speech data, in order to support the creation and maintenance of a keyword dictionary for speech recognition processing.
- The disclosed device includes a speech data storage unit that stores speech data, and extracts prosodic information including at least a speech power value from the speech data. Based on the prosodic information, utterance sections in which the power value stays at or above a predetermined threshold for at least a predetermined time are extracted from the speech data, and each utterance section is divided into sections in which a power value at or above the predetermined threshold continues, generating divided speech data.
- Next, phoneme recognition processing is performed on the divided speech data, and the phoneme string data of each divided speech data is acquired.
- Clustering processing is performed on the phoneme string data, and clusters, each a set of classified phoneme string data, are generated.
- For each cluster, an evaluation value is calculated based on the prosodic information of the divided speech data corresponding to the phoneme string data constituting the cluster, and clusters whose evaluation value is at or above a certain value are selected as candidate clusters.
- For each candidate cluster, one phoneme string data is specified as the representative phoneme string from the phoneme string data constituting the cluster, and the divided speech data corresponding to the representative phoneme string is selected as the listening target speech data.
- The selected listening target speech data is an utterance section cut out of the speech data based on the power value, i.e., divided speech data corresponding to a keyword candidate word.
- Therefore, the worker does not need to listen to the speech data from the beginning; it suffices to listen only to the listening target data, the sections in which words that may be adopted as keywords are spoken.
- FIG. 1 shows a configuration example of the speech recognition dictionary creation support apparatus in an embodiment of the present invention. FIG. 2 shows an outline of its processing.
- FIG. 18 is a more detailed flowchart of the high-power-value acquisition process (step S730) for the divided data management table. FIG. 19 is a more detailed flowchart of the evaluation value calculation process based on pitch values (step S75). FIG. 20 is a more detailed flowchart of the large-pitch-range value acquisition process (step S750) for the divided data management table. FIG. 21 is a more detailed flowchart of the evaluation value calculation process based on word-likeness information (step S77). FIG. 22 is a more detailed flowchart of the listening target data selection process (step S9). FIG. 23 shows a configuration example of the speech recognition dictionary creation support apparatus in another embodiment of the present invention. FIG. 24 shows an outline of its processing.
- FIG. 1 is a diagram showing a configuration example of a speech recognition dictionary creation support apparatus 1 according to an embodiment of the present invention.
- The speech recognition dictionary creation support device 1 supports the creation and updating of the speech recognition dictionary 26 used to recognize the sections (partial data) of speech data in which keywords are uttered. It comprises a speech data storage unit 10, a prosodic information extraction unit 11, a speech data division unit 12, a phoneme string acquisition unit 13, a phoneme recognition unit 14, a clustering unit 15, an evaluation value calculation unit 16, a candidate cluster selection unit 17, a listening target data selection unit 18, and an adoption determination unit 19.
- the voice data storage unit 10 stores voice data 20 in which voice is recorded.
- The speech data 20 is, for example, data recording dialogue speech consisting mainly of spoken language, such as telephone response records received at a call center.
- The prosodic information extraction unit 11 extracts, from the speech data 20, prosody data 21 such as the speech power value and pitch value at fixed time intervals.
- The speech data division unit 12 identifies the utterance sections of the speech data 20, divides each utterance section at predetermined breaks, and generates divided speech data 22.
- The phoneme string acquisition unit 13 generates phoneme string data 23 corresponding to the divided speech data 22, based on the phonemes of the speech data 20 recognized by the phoneme recognition unit 14.
- the phoneme recognition unit 14 recognizes phonemes included in the voice data by a known voice recognition method.
- The clustering unit 15 cleans the phoneme string data 23, classifies the cleaned phoneme string data 23′ by a known clustering method, and generates cluster data 24 for the resulting clusters.
- the evaluation value calculation unit 16 calculates the evaluation value of each cluster of the cluster data 24 using a predetermined evaluation method.
- The candidate cluster selection unit 17 selects clusters having high evaluation values from the cluster data 24 as candidate clusters.
- The listening target data selection unit 18 specifies a representative phoneme string from the phoneme string data 23 constituting each candidate cluster, selects the divided speech data 22 corresponding to the representative phoneme string as listening target data, and stores it in the dictionary candidate phrase speech database 25.
- The adoption determination unit 19 reproduces the divided speech data 22 stored in the dictionary candidate phrase speech database 25, determines whether to use it as registration data for the speech recognition dictionary 26, and registers the divided speech data 22 determined to be adopted in the speech recognition dictionary 26.
- FIG. 2 is a diagram showing an outline of the processing of the speech recognition dictionary creation support apparatus 1.
- Step S1 Prosody data extraction
- The prosodic information extraction unit 11 calculates a power value and a pitch value at fixed time intervals from the speech data 20 stored in a predetermined storage unit and managed by the speech data management table 100, and generates a power value file and a pitch value file.
- FIG. 3 is a diagram showing an example of the audio data management table 100.
- The speech data management table 100 has the items wav_id, speech data, incidental information, and prosody data.
- wav_id stores the identification information of the speech data 20; the speech data item stores the file name of the speech data 20; the incidental information item stores attributes of the speech data 20 (such as the speaker's gender and name); and the prosody data item stores the file names of the power value file and the pitch value file generated from the speech data 20.
- FIG. 4 is a diagram showing an example of the prosody data 21.
- FIG. 4A shows an example of a power value file 21a, and FIG. 4B shows an example of a pitch value file (a1_pit.txt) 21b.
- The power value file 21a of FIG. 4A consists of a sequence of power values at fixed intervals (12.8 [msec]); each row represents [time, power value].
- The pitch value file 21b of FIG. 4B likewise consists of a sequence of pitch values at fixed intervals (12.8 [msec]); each row represents [time, pitch value]. Note that a pitch value is recorded only for the sections in which it can be calculated.
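The text does not prescribe how the power and pitch values are computed; the following is a minimal sketch of step S1, assuming 16-bit mono PCM input, RMS power per 12.8 msec frame, and a simple autocorrelation pitch estimate. All of these are illustrative choices rather than the patent's own method.

```python
import wave
import numpy as np

FRAME_MS = 12.8  # frame interval used in the FIG. 4 examples

def extract_prosody(wav_path, fmin=60.0, fmax=400.0):
    """Return ([(time_ms, power)], [(time_ms, pitch_hz)]) rows per frame."""
    with wave.open(wav_path, "rb") as w:
        rate = w.getframerate()
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    hop = int(rate * FRAME_MS / 1000)
    win = 2 * hop                        # wider window so one pitch period fits
    power_rows, pitch_rows = [], []
    for i in range(0, len(samples) - win, hop):
        t_ms = i * 1000.0 / rate
        frame = samples[i:i + hop].astype(np.float64)
        power_rows.append((t_ms, float(np.sqrt(np.mean(frame ** 2)))))
        seg = samples[i:i + win].astype(np.float64)
        seg -= seg.mean()
        ac = np.correlate(seg, seg, "full")[win - 1:]
        lo, hi = int(rate / fmax), int(rate / fmin)
        if ac[0] > 0 and hi < win:
            lag = lo + int(np.argmax(ac[lo:hi]))
            if ac[lag] / ac[0] > 0.3:    # pitch recorded only where computable
                pitch_rows.append((t_ms, rate / lag))
    return power_rows, pitch_rows
```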
- Step S2 Speech segment extraction
- Based on the power value file 21a, the speech data division unit 12 detects, as utterance sections, sections of the speech data 20 in which the power value stays at or above the threshold th1 and the continuous section lasts at least the minimum utterance time. The detected utterance sections are registered in the utterance section table 101.
- FIG. 5 is a diagram showing an example of the utterance section table 101.
- The utterance section table 101 has the items utterance_id, wav_id, start, and end.
- utterance_id stores the identification information of the utterance section; wav_id stores the identification information of the speech data 20 containing the utterance section; start stores the start time [msec] of the utterance section; and end stores its end time [msec].
- Step S3 Voice data division
- Based on the power value file 21a, the speech data division unit 12 detects, for each utterance section of the speech data 20, sections in which the power value stays at or above the threshold th2, and generates and stores divided speech data 22 from the speech data of each detected section. The generated divided speech data 22 is registered in the divided data management table 102.
- FIG. 6 is a diagram illustrating an example of the divided data management table 102.
- The divided data management table 102 has the items split_id, wav_id, start, and end.
- split_id stores the identification information of the divided speech data 22; wav_id stores the identification information of the speech data 20 containing it; start stores the start time [msec] of the divided speech data 22; and end stores its end time [msec].
- FIG. 7 is a diagram showing an example of speech segment extraction and voice data division.
- The upper part of FIG. 7 shows an example of the waveform of the speech data 20, and the lower part shows the corresponding power values.
- A section in which the speech power value of the speech data 20 stays above the threshold th1 for at least a certain time is detected as an utterance section.
- The divided speech data 22 is then generated by dividing each utterance section into sections in which the power value stays above the threshold th2 for at least a certain time.
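A minimal sketch of steps S2 and S3 over the power rows produced above; th1, th2, and the minimum utterance time are parameters the text leaves open, so the values here are placeholders.

```python
def runs_above(power_rows, threshold):
    """Yield (start_ms, end_ms) of maximal runs with power >= threshold."""
    start = None
    for t, p in power_rows:
        if p >= threshold and start is None:
            start = t
        elif p < threshold and start is not None:
            yield (start, t)
            start = None
    if start is not None:
        yield (start, power_rows[-1][0])

def extract_sections(power_rows, th1, th2, min_utter_ms=300.0):
    # Step S2: runs above th1 lasting at least the minimum utterance time.
    utterances = [(s, e) for s, e in runs_above(power_rows, th1)
                  if e - s >= min_utter_ms]
    # Step S3: within each utterance section, runs above th2 become splits.
    splits = []
    for s, e in utterances:
        inside = [(t, p) for t, p in power_rows if s <= t < e]
        splits.extend(runs_above(inside, th2))
    return utterances, splits
```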
- The thresholds th1 and th2 used in the utterance section extraction (step S2) and the speech data division (step S3) are set by either of the calculation processes shown in FIG. 8A and FIG. 8B.
- In FIG. 8A, the frequency distribution of all sound pressure values of the input speech data 20 is acquired, and the sound pressure value at the "valley" of the distribution, that is, the sound pressure value with the lowest frequency, is taken as the threshold th1.
- In FIG. 8B, the frequency distribution of the sound pressure values (the lower of each pair) at the points where the change (difference) in sound pressure of the input speech data exceeds a certain value is acquired, and the sound pressure value with the highest frequency in this distribution is taken as the threshold th1.
- The threshold th2 is calculated by the same processing, using as input the speech data 20 corresponding to the utterance section being processed.
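One way to locate the "valley" of the FIG. 8A rule is sketched below; the bin count and the choice of searching between the two largest peaks are assumptions, and the FIG. 8B variant is not implemented here.

```python
import numpy as np

def valley_threshold(power_values, bins=64):
    """Return the least-frequent power value between the two histogram modes."""
    hist, edges = np.histogram(power_values, bins=bins)
    lo, hi = sorted(np.argsort(hist)[-2:])      # the two most frequent bins
    valley = lo + int(np.argmin(hist[lo:hi + 1]))
    return float(edges[valley])

# th1 is computed from all power values of the speech data 20; th2 is computed
# the same way from the power values of the utterance section being processed.
```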
- Step S4 Acquisition of phoneme string
- the phoneme recognition unit 14 recognizes a phoneme from the divided speech data 22.
- The phoneme recognition unit 14 is a processing unit that performs known phoneme recognition processing.
- Any known speech recognition method that can output phoneme data as intermediate information may be used.
- For example, an engine such as the "Julius speech recognition engine (http://julius.sourceforge.jp/)" may be used.
- the phoneme may be a monophone, a triphone, or a lattice.
- The phoneme string acquisition unit 13 generates the phoneme string data 23 corresponding to each divided speech data 22, based on the phoneme recognition results produced by the phoneme recognition unit 14. The generated phoneme string data 23 is registered in the phoneme recognition result table 103.
- FIG. 9 is a diagram showing an example of the phoneme recognition result table 103.
- the phoneme recognition result table 103 includes items of split_id, phoneme recognition result, and cleaning result.
- split_id stores the identification information of the divided speech data 22; the phoneme recognition result stores the phoneme string data 23 generated by the phoneme recognition unit 14; and the cleaning result stores the phoneme string data 23′ produced by the cleaning process described later.
- Step S5 Phoneme string cleaning
- The phoneme string acquisition unit 13 applies predetermined cleaning rules to perform a cleaning process on the phoneme recognition results (phoneme strings) produced by the phoneme recognition unit 14; a sketch of these rules follows the FIG. 10 example below.
- Cleaning rule 1: Group long vowels (for example, "o:", "ou") with the corresponding single vowel (for example, "o").
- Cleaning rule 2: Remove uncertain results (for example, remove runs of the geminate consonant phoneme).
- Cleaning rule 3: Remove consonant sequences in the phoneme string.
- Cleaning rule 4: If there is a silent section (<sp>) in the phoneme string, divide the string at that point.
- FIG. 10 is a diagram showing an example of the phoneme recognition result table 103 in which the phoneme string data 23 ′ subjected to the cleaning process is stored.
- For example, the phoneme string data 23 "tqhou" is cleaned into the phoneme string data 23′ "hou".
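A minimal sketch of the four cleaning rules, assuming one character per phoneme, "q" as the geminate marker, and "<sp>" as silence; the canonical form chosen for rule 1 is an assumption, and only rules 2-4 are exercised by the FIG. 10 example.

```python
import re

def clean_phoneme_string(s):
    parts = s.split("<sp>")                           # rule 4: split at silences
    out = []
    for p in parts:
        p = re.sub(r"([aiueoN]):", r"\1", p)          # rule 1: long vowel -> base vowel
        p = re.sub(r"q{2,}", "", p)                   # rule 2: drop geminate runs
        p = re.sub(r"[^aiueoN](?=[^aiueoN])", "", p)  # rule 3: drop consonant sequences
        if p:
            out.append(p)
    return out

clean_phoneme_string("tqhou")  # -> ["hou"], matching the FIG. 10 example
```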
- Step S6 Clustering
- The clustering unit 15 classifies all the phoneme string data 23′ using a known clustering method and generates cluster data 24 for the resulting sets (clusters) of phoneme string data 23′.
- The cluster data 24 is implemented as the cluster table 104 shown in FIG. 11.
- The cluster table 104 has the items split_id, cluster ID, score, and selection result.
- split_id stores the identification information of the divided speech data 22 (phoneme string); the cluster ID stores the identification information of the cluster into which the phoneme string data 23 was classified; the score stores the evaluation value of the cluster; and the selection result stores information indicating whether the data was selected as listening target data.
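The text requires only "a known clustering method"; as one concrete instance, the sketch below greedily clusters cleaned phoneme strings by Levenshtein (edit) distance, with the distance cutoff as an assumed parameter.

```python
def levenshtein(a, b):
    """Edit distance between two phoneme strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cluster_phoneme_strings(items, max_dist=2):
    """items: [(split_id, phoneme_string)]; returns a list of clusters."""
    clusters = []
    for sid, s in items:
        for c in clusters:
            if any(levenshtein(s, t) <= max_dist for _, t in c):
                c.append((sid, s))
                break
        else:
            clusters.append([(sid, s)])
    return clusters
```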
- Step S7 Evaluation Value Calculation
- The evaluation value calculation unit 16 calculates an evaluation value (score) for each cluster in the cluster data 24 by combining one or more of the following evaluation processes.
- In this embodiment, the evaluation value calculation unit 16 performs all of the following evaluation processes and combines the calculated values into the score S (see equation (4) below).
- (1) Evaluation value calculation process S71: evaluation based on appearance frequency information
- As shown in FIG. 12, the evaluation value calculation unit 16 uses the speech data management table 100 and the divided data management table 102 to calculate the score A of each cluster by the following equation (1), and records it in the phoneme string appearance probability management table 105.
- Score A = (number of speech data in which a phoneme string of the cluster appears) / (number of all speech data)   (1)
- The score A corresponds to the document frequency (DF) used in document (text) data evaluation and rates clusters containing frequently appearing information more highly. Important words tend to be repeated many times during an utterance, so rating clusters that contain many instances of the same, relatively often uttered word as good clusters raises the evaluation accuracy.
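Equation (1) reduces to counting distinct recordings, as in this sketch (the argument names are assumptions about how the tables are joined):

```python
def score_a(cluster_wav_ids, total_wav_count):
    """Equation (1): recordings in which the cluster's phoneme string
    appears, divided by the number of all recordings (a DF-style measure)."""
    return len(set(cluster_wav_ids)) / total_wav_count
```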
- (2) Evaluation value calculation process S73: evaluation based on power values
- As shown in FIG. 12, the evaluation value calculation unit 16 adds a "high power" item to the divided data management table 102, and sets its flag (=1) when the power value of a divided speech data 22 exceeds the average power value of the speech data 20 that contains it.
- The proportion of divided speech data 22 whose high-power flag is set, relative to the number of all divided speech data, is then calculated by the following equation (2).
- Score B = (number of divided speech data with the flag set) / (number of all divided speech data)   (2)
- The score B rates clusters containing loudly uttered data more highly, on the premise that important words are uttered clearly and loudly. Exploiting the tendency for important words in an utterance to be spoken distinctly louder than the rest, and rating clusters that contain much data uttered in a louder voice as good clusters, raises the evaluation accuracy.
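A sketch of equation (2), assuming the high-power flags of FIG. 12 have already been computed; the score C of the next process is computed identically from the large-pitch-range flag.

```python
def score_b(cluster_split_ids, flagged_split_ids, total_split_count):
    """Equation (2): cluster members flagged as louder than their recording's
    average, divided by the number of all divided speech data."""
    hits = sum(1 for s in cluster_split_ids if s in flagged_split_ids)
    return hits / total_split_count
```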
- (3) Evaluation value calculation process S75: evaluation based on pitch values
- As shown in FIG. 12, the evaluation value calculation unit 16 likewise adds a "large pitch range" item to the divided data management table 102 and sets its flag (=1) when the pitch range of a divided speech data 22 exceeds the average pitch range of the speech data 20 that contains it. The proportion of divided speech data 22 whose large-pitch-range flag is set, relative to the number of all divided speech data, is calculated by the following equation (3).
- Score C = (number of divided speech data with the flag set) / (number of all divided speech data)   (3)
- The score C rates clusters containing data uttered with intonation (a wide pitch range) more highly, on the premise that important words are uttered clearly. Exploiting the tendency for important words in an utterance to be spoken with more intonation than the rest, and rating clusters that contain much data uttered with a wider pitch range as good clusters, raises the evaluation accuracy.
- (4) Evaluation value calculation process S77: evaluation based on word-likeness information
- When the process of step S77 is performed, the speech recognition dictionary creation support apparatus 1 is configured so that the evaluation value calculation unit 16 can refer to the morphological analysis dictionary 27 used in morphological analysis processing and to the characterizing rule storage unit 28.
- The evaluation value calculation unit 16 extracts from the morphological analysis dictionary 27 the words classified into parts of speech usable as keywords, such as nouns and verbs, and generates n-grams of the extracted words.
- Next, the common part of the phoneme string data 23′ constituting the cluster is extracted, and the common phoneme string (for example, "mobairumeru") is converted into a character string by referring to the characterizing rules.
- The appearance probability of the common phoneme string's character string within the extracted words is then calculated, and the appearance probability of each cluster is recorded in the phoneme string appearance probability management table 105. This appearance probability is the score D.
- The score D excludes interjections (fillers such as "etto" and "ano") from keyword selection and rates clusters containing data with high "word-likeness" more highly as keywords. Using this degree of keyword-likeness raises the evaluation accuracy.
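A minimal sketch of the score D idea, assuming a character bigram model built over keyword part-of-speech entries from a morphological analysis dictionary; the model form is an assumption, since the text only asks for an appearance probability over the extracted words.

```python
from collections import Counter

def build_bigram_model(keyword_strings):
    """Bigram relative frequencies over dictionary words usable as keywords."""
    bigrams = Counter()
    for w in keyword_strings:
        for a, b in zip(w, w[1:]):
            bigrams[a + b] += 1
    total = sum(bigrams.values()) or 1
    return {k: v / total for k, v in bigrams.items()}

def score_d(common_string, model):
    """Mean bigram probability of the cluster's common character string."""
    pairs = [common_string[i:i + 2] for i in range(len(common_string) - 1)]
    if not pairs:
        return 0.0
    return sum(model.get(p, 0.0) for p in pairs) / len(pairs)
```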
- Finally, the evaluation value calculation unit 16 calculates the score S of each cluster by the following equation (4):
- Score S = αA + βB + γC + δD   (4), where α+β+γ+δ=1 and 0 ≤ α, β, γ, δ ≤ 1.
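In code, equation (4) is simply a convex combination of the four scores; the equal weights below are an illustrative default, not values given in the text.

```python
def score_s(a, b, c, d, weights=(0.25, 0.25, 0.25, 0.25)):
    """Equation (4): weights must sum to 1, each in [0, 1]."""
    alpha, beta, gamma, delta = weights
    return alpha * a + beta * b + gamma * c + delta * d
```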
- Step S8 Candidate Cluster Selection
- The candidate cluster selection unit 17 selects clusters with high score values as candidate clusters based on the scores in the cluster table 104; for example, the clusters whose score is at least the threshold th3, or the top n clusters in descending score order.
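Both selection criteria fit in a few lines; th3 and n are tuning parameters the text leaves open.

```python
def select_candidates(cluster_scores, th3=None, top_n=None):
    """cluster_scores: {cluster_id: score}; apply threshold or top-n rule."""
    ranked = sorted(cluster_scores.items(), key=lambda kv: kv[1], reverse=True)
    if th3 is not None:
        return [cid for cid, s in ranked if s >= th3]
    return [cid for cid, _ in ranked[:top_n]]
```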
- Step S9 Selection of listening target data
- As shown in FIG. 15A, the listening target data selection unit 18 selects, for each candidate cluster in the cluster table 104, a representative phoneme string from the phoneme string data 23′ constituting the candidate cluster by one of the following methods (a sketch follows the list).
- Selection rule 1: The phoneme string with the longest string length among the cluster's phoneme strings is taken as the representative phoneme string.
- Selection rule 2: The phoneme string with the largest number of corresponding divided speech data is taken as the representative phoneme string.
- Selection rule 3: The "degree of word-likeness" of each phoneme string is calculated by the same process as step S77, and the phoneme string with the largest value is taken as the representative phoneme string.
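A sketch of the three selection rules; split_counts and likeness stand for the per-phoneme-string lookups implied by rules 2 and 3 and are assumptions about the data layout.

```python
def pick_representative(phoneme_strings, rule=1, split_counts=None, likeness=None):
    if rule == 1:                    # rule 1: longest phoneme string
        return max(phoneme_strings, key=len)
    if rule == 2:                    # rule 2: most corresponding divided data
        return max(phoneme_strings, key=lambda p: split_counts[p])
    return max(phoneme_strings, key=lambda p: likeness[p])  # rule 3: word-likeness
```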
- The divided speech data 22_1 corresponding to the selected representative phoneme string is selected, output as listening target data, and stored in the dictionary candidate phrase speech database 25.
- Alternatively, when a designation type 110 specifying attributes of the speech data is input from outside, the incidental information in the speech data management table 100 is referred to, and the divided speech data 22 whose attributes match the designated condition is stored in the dictionary candidate phrase speech database 25.
- The designation type 110 is information designating, for example, a voice with high sound pressure or a female voice; this makes it possible to specify voice qualities that are easy for the user to listen to.
- When the designation type 110 is not used, the incidental information item of the speech data management table 100 is unnecessary.
- The adoption determination unit 19 reproduces the divided speech data 22 stored in the dictionary candidate phrase speech database 25.
- The adoption determination unit 19 provides an interface through which a user who has listened to the reproduced speech can decide whether to adopt the data as registration data for the speech recognition dictionary 26; when adoption is input, the divided speech data 22 is registered in the speech recognition dictionary 26.
- FIGS. 16 to 21 are more detailed process flow diagrams of steps S71, S73, S75, and S77 of the evaluation value calculation process (step S7).
- FIG. 16 is a more detailed process flow diagram of the evaluation value calculation process (step S71) based on the appearance frequency information.
- First, the evaluation value calculation unit 16 substitutes the first cluster ID of the cluster table 104 into c_id and empties the check_wav set (step S710).
- If there is an unprocessed c_id (YES in step S711), a split_id whose cluster ID in the cluster table 104 is c_id is detected and substituted into s_id (step S712).
- The wav_id corresponding to s_id is acquired from the divided data management table 102 (step S714).
- If wav_id is not among the elements of the check_wav set (YES in step S715), wav_id is added to the check_wav set (step S716). If wav_id is already in the check_wav set (NO in step S715), the process returns to step S712.
- When the split_ids of the cluster have been processed, the check_wav set is emptied, the next cluster ID of the cluster table 104 is substituted into c_id, and the process returns to step S711 (step S718).
- If there is no unprocessed c_id (NO in step S711), the process ends.
- FIG. 17 is a more detailed process flow diagram of the evaluation value calculation process (step S73) based on the power value.
- First, the evaluation value calculation unit 16 acquires the "high power" values in the divided data management table 102 (step S730); the details of step S730 are described later.
- The first cluster ID of the cluster table 104 is substituted into c_id, and 0 (zero) is substituted into power and into s_id_num (step S731).
- If there is an unprocessed c_id (YES in step S732), a split_id whose cluster ID in the cluster table 104 is c_id is detected and substituted into s_id (step S733).
- If there is an unprocessed s_id (YES in step S734), s_id_num is incremented (increased by 1) (step S735), and the high-power data corresponding to s_id is acquired (step S736). If the high-power flag (1) is set (YES in step S737), power is incremented (step S738); if not (NO in step S737), the process returns to step S733.
- The next cluster ID of the cluster table 104 is substituted into c_id, 0 (zero) is substituted into power and s_id_num, and the process returns to step S732 (step S740).
- If there is no unprocessed c_id (NO in step S732), the process ends.
- FIG. 18 is a more detailed process flow diagram of the large power value acquisition process (step S730) of the divided data management table 102.
- For each wav_id i, the average power value (Ave_i) of i is calculated from the power value file 21a (step S7302).
- The next split_id is input into j; if there is an unprocessed j (YES in step S7307), the process returns to step S7304; if there is no unprocessed j (NO in step S7307), the process proceeds to step S7308.
- The next wav_id is input into i; if there is an unprocessed i (YES in step S7308), the process returns to step S7302; if there is no unprocessed i (NO in step S7308), the process ends.
- FIG. 19 is a more detailed process flow diagram of the evaluation value calculation process (step S75) based on the pitch value.
- First, the evaluation value calculation unit 16 acquires the "large pitch range" values in the divided data management table 102 (step S750); the details of step S750 are described later.
- The first cluster ID of the cluster table 104 is substituted into c_id, and 0 (zero) is substituted into pitch and into s_id_num (step S751).
- If there is an unprocessed c_id (YES in step S752), a split_id whose cluster ID in the cluster table 104 is c_id is detected and substituted into s_id (step S753).
- If there is an unprocessed s_id (YES in step S754), s_id_num is incremented (step S755), and the large-pitch-range value corresponding to s_id is acquired (step S756). If the large-pitch-range flag (1) is set (YES in step S757), pitch is incremented (step S758); if not (NO in step S757), the process returns to step S753.
- The next cluster ID of the cluster table 104 is substituted into c_id, 0 (zero) is substituted into pitch and s_id_num, and the process returns to step S752 (step S760).
- If there is no unprocessed c_id (NO in step S752), the process ends.
- FIG. 20 is a more detailed processing flowchart of the value acquisition processing (step S750) of the large pitch range of the divided data management table 102.
- If there is an unprocessed j (YES in step S7508), the process returns to step S7505; if there is no unprocessed j (NO in step S7508), the process proceeds to step S7509.
- When there is no unprocessed wav_id, the process ends.
- FIG. 21 is a more detailed process flow diagram of the evaluation value calculation process (step S77) based on word-likeness information.
- First, the evaluation value calculation unit 16 substitutes the first cluster ID of the cluster table 104 into c_id (step S770).
- All the phoneme string data 23′ whose cluster ID in the cluster table 104 is c_id are acquired (step S771), and their common phoneme string part is acquired (step S772). Referring to the characterizing rule storage unit 28, the character string of the common phoneme string part is acquired (step S773). Using the morphological analysis dictionary 27, the appearance probability of the common phoneme string part within the predetermined extracted words is calculated based on the n-gram data (step S774), and the appearance probability is stored in the phoneme string appearance probability management table 105 (step S775).
- The next cluster ID of the cluster table 104 is substituted into c_id (step S776). If there is an unprocessed c_id (YES in step S777), the process returns to step S771; if not (NO in step S777), the process ends.
- FIG. 22 is a more detailed process flow diagram of the listening target data selection process (step S9).
- First, the listening target data selection unit 18 sequentially acquires the cluster IDs of the clusters selected as candidate clusters from the cluster table 104 and substitutes each into c_id (step S90).
- If there is an unprocessed c_id (YES in step S91), a split_id whose cluster ID is c_id is detected from the cluster table 104 and substituted into s_id (step S92).
- If there is an unprocessed s_id (YES in step S93), the cleaning result (phoneme string data 23′) is acquired from the phoneme recognition result table 103 and substituted into onso (step S94). The number of vowels and "N" in onso is counted and set as length(s_id) (step S95), and the process returns to step S92.
- If there is no unprocessed s_id (NO in step S93), the s_id with the maximum length(s_id) is obtained and placed in the s_id_max set (step S96); there may be more than one such s_id.
- For each s_id in the s_id_max set, the wav_id is acquired from the divided data management table 102, and the incidental information is acquired from the speech data management table 100 (step S97).
- The split_id that matches the designation type 110 is added to the candidate_wav set (step S98), and the process returns to step S90.
- If there is no unprocessed c_id (NO in step S91), the divided speech data 22 corresponding to each split_id in the candidate_wav set is stored in the dictionary candidate phrase speech database 25 (step S99).
- As described above, the speech recognition dictionary creation support device 1 can automatically extract the speech data that are keyword candidates for registration in the speech recognition dictionary 26, and thereby support the speech recognition dictionary creation process.
- The processing can also be performed per job: a set X (with elements x) of all jobs in the call center is defined, and an unprocessed job x is selected.
- The listening target data for job x is then selected as follows.
- Of the phoneme string data 23′ produced by the phoneme string cleaning of step S5 in the process flow of FIG. 2, the phoneme string data 23′ of job x is subjected to the processes of steps S6 to S9. As a result, listening target data can be output for each job.
- FIG. 23 is a diagram showing a configuration example in another embodiment of the present invention.
- The configuration of the speech recognition dictionary creation support device 1′ in FIG. 23 is substantially the same as that of the speech recognition dictionary creation support device 1 shown in FIG. 1, except that a registration information generation unit 30, a characterizing rule storage unit 31, and a reading variation rule storage unit 32 are provided instead of the listening target data selection unit 18.
- The registration information generation unit 30 refers to the characterizing rule storage unit 31 and the reading variation rule storage unit 32 to convert the phonemes of the representative phoneme string into a character string and, based on the converted character string, generates registration data giving the notation or reading represented by the representative phoneme string and registers it in the speech recognition dictionary 26.
- The characterizing rule storage unit 31 stores characterizing rules, which are correspondence rules between phonemes and reading characters.
- The reading variation rule storage unit 32 stores variations of the reading character strings of phonemes.
- FIG. 24 is a diagram showing an outline of the processing of the speech recognition dictionary creation support apparatus 1 ′.
- The processing of steps S1 to S8 in FIG. 24 is the same as the processing steps with the same numbers shown in FIG. 2. After step S8, steps S30 to S32 are executed.
- Step S30 Acquisition of representative phoneme string
- The registration information generation unit 30 acquires the phoneme string data 23′ serving as the representative phoneme string from each candidate cluster for which the flag (○) is set in the cluster table 104.
- Step S31 Registration Data Creation
- As shown in FIG. 25, the registration information generation unit 30 refers to the characterizing rule storage unit 31 and generates the character string corresponding to the phoneme string of the representative phoneme string's phoneme string data 23′.
- The generated character string becomes the notation and reading of the divided speech data 22 corresponding to the representative phoneme string.
- Step S32 Addition to the dictionary
- the registration information generating unit 30 registers the generated registration data in the voice recognition dictionary 26.
- FIG. 26 is a diagram showing a more detailed processing flow of the registration data generation processing in step S31.
- First, the registration information generation unit 30 acquires one phoneme string x from the phoneme string data 23′ that is the representative phoneme string of a candidate cluster (step S310). If there is a phoneme string x (YES in step S311), the characterizing rules in the characterizing rule storage unit 31 are applied to convert the representative phoneme string into a character string y (step S312).
- Further, the reading variation rules in the reading variation rule storage unit 32 are applied to the character string y to acquire alternative character strings z1, z2, ... (step S313).
- Registration data is generated from these character strings and registered in the speech recognition dictionary 26.
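A minimal sketch of steps S312-S313; the rule tables below are illustrative stand-ins for the contents of the characterizing rule storage unit 31 and the reading variation rule storage unit 32, which the text does not enumerate.

```python
CHAR_RULES = {"mo": "モ", "ba": "バ", "i": "イ", "ru": "ル", "me": "メ"}
VARIATION_RULES = [("メル", "メール")]   # e.g. short form <-> long-vowel form

def to_reading(phonemes):
    """Step S312: convert a phoneme sequence into a reading string y."""
    return "".join(CHAR_RULES.get(p, "") for p in phonemes)

def reading_variants(y):
    """Step S313: apply reading variation rules to obtain z1, z2, ..."""
    variants = {y}
    for src, dst in VARIATION_RULES:
        if src in y:
            variants.add(y.replace(src, dst))
    return sorted(variants)

entries = reading_variants(to_reading(["mo", "ba", "i", "ru", "me", "ru"]))
# -> ["モバイルメル", "モバイルメール"]; each string becomes registration data.
```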
- In this way, the speech recognition dictionary creation support device 1′ can automatically generate the keyword information, extracted from the speech data 20, that is registered in the speech recognition dictionary 26.
Description
10 Speech data storage unit
11 Prosodic information extraction unit
12 Speech data division unit
13 Phoneme string acquisition unit
14 Phoneme recognition unit
15 Clustering unit
16 Evaluation value calculation unit
17 Candidate cluster selection unit
18 Listening target data selection unit
19 Adoption determination unit
100 Speech data management table
101 Utterance section table
102 Divided data management table
103 Phoneme recognition result table
104 Cluster table
105 Phoneme string appearance probability management table
20 Speech data
21 Prosody data
22 Divided speech data
23, 23′ Phoneme string data
24 Cluster data
25 Dictionary candidate phrase speech database
26 Speech recognition dictionary
30 Registration information generation unit
31 Characterizing rule storage unit
32 Reading variation rule storage unit
Claims (14)
- A speech recognition dictionary creation support device comprising: a speech data storage unit that stores speech data; a prosodic information extraction unit that extracts, from the speech data, prosodic information including at least a speech power value; a speech data division unit that, based on the prosodic information, extracts from the speech data utterance sections in which the time during which the power value is at or above a predetermined threshold reaches or exceeds a predetermined time, and divides each utterance section into sections in which a power value at or above a predetermined threshold continues for at least a fixed time, thereby generating divided speech data; a phoneme string acquisition unit that performs phoneme recognition processing on the divided speech data and acquires phoneme string data for each divided speech data; a clustering unit that performs clustering processing on the phoneme string data and generates clusters, each being a set of classified phoneme string data; an evaluation value calculation unit that calculates, for each cluster, an evaluation value based on the prosodic information of the divided speech data corresponding to the phoneme string data constituting the cluster; a candidate cluster selection unit that selects clusters whose evaluation value is at or above a certain value as candidate clusters; and a listening target data selection unit that, for each candidate cluster, specifies one phoneme string data from the phoneme string data constituting the cluster as a representative phoneme string and selects the divided speech data corresponding to the representative phoneme string as listening target speech data.
- The speech recognition dictionary creation support device according to claim 1, wherein the evaluation value calculation unit calculates the evaluation value of the cluster based on the number of phoneme string data whose corresponding divided speech data has a prosodic-information power value of at least a certain magnitude.
- The speech recognition dictionary creation support device according to claim 1 or 2, wherein the prosodic information extraction unit extracts prosodic information including a speech pitch value, and the evaluation value calculation unit calculates the evaluation value of the cluster based on the number of phoneme string data whose corresponding divided speech data has a prosodic-information pitch value range of at least a certain magnitude.
- The speech recognition dictionary creation support device according to any one of claims 1 to 3, wherein the evaluation value calculation unit calculates the appearance frequency of each phoneme string data across all divided speech data and calculates the evaluation value of the cluster based on that appearance frequency.
- The speech recognition dictionary creation support device according to any one of claims 1 to 4, wherein the evaluation value calculation unit comprises dictionary data for morphological analysis processing, extracts from the dictionary data words classified into predetermined parts of speech, calculates the appearance probability of the common part of the phoneme string data constituting the cluster within the extracted words, and calculates the evaluation value of the cluster based on that appearance probability.
- The speech recognition dictionary creation support device according to any one of claims 1 to 5, wherein the listening target data selection unit specifies, from the candidate cluster, the phoneme string data whose phoneme string length is longest as the representative phoneme string.
- The speech recognition dictionary creation support device according to any one of claims 1 to 5, wherein the listening target data selection unit specifies, from the candidate cluster, the phoneme string data with the largest number of corresponding divided speech data as the representative phoneme string.
- The speech recognition dictionary creation support device according to any one of claims 1 to 5, wherein the listening target data selection unit comprises dictionary data for morphological analysis processing, extracts from the dictionary data words classified into predetermined parts of speech, calculates the appearance probability of the phoneme string data constituting the candidate cluster within the extracted words, and specifies the phoneme string data with the highest appearance probability as the representative phoneme string.
- The speech recognition dictionary creation support device according to any one of claims 1 to 8, wherein the listening target data selection unit selects, from among the divided speech data corresponding to the representative phoneme string, the divided speech data with the largest speech power value.
- The speech recognition dictionary creation support device according to any one of claims 1 to 8, wherein the listening target data selection unit selects, from among the divided speech data corresponding to the representative phoneme string, the divided speech data with the largest speech pitch range.
- The speech recognition dictionary creation support device according to any one of claims 1 to 8, further comprising an incidental information storage unit that stores incidental information related to attributes of the speech data, wherein the listening target data selection unit acquires an externally input designation condition specifying attributes of speech data, refers to the incidental information, and selects, from the divided speech data corresponding to the representative phoneme string, divided speech data whose attributes match the designation condition.
- The speech recognition dictionary creation support device according to any one of claims 1 to 11, further comprising: a characterizing rule storage unit that stores characterizing rules indicating conversion rules between phonemes and characters; and a registration data generation unit that, based on the characterizing rules, converts each phoneme of the representative phoneme string into characters to generate a character string, and generates registration data for the speech recognition dictionary with the character string as notation or reading.
- A speech recognition dictionary creation support processing program for causing a computer to function as a processing device comprising: a speech data storage unit that stores speech data; a prosodic information extraction unit that extracts, from the speech data, prosodic information including at least a speech power value; a speech data division unit that, based on the prosodic information, extracts from the speech data utterance sections in which the time during which the power value is at or above a predetermined threshold reaches or exceeds a predetermined time, and divides each utterance section into sections in which a power value at or above a predetermined threshold continues for at least a fixed time, thereby generating divided speech data; a phoneme string acquisition unit that performs phoneme recognition processing on the divided speech data and acquires phoneme string data for each divided speech data; a clustering unit that performs clustering processing on the phoneme string data and generates clusters, each being a set of classified phoneme string data; an evaluation value calculation unit that calculates, for each cluster, an evaluation value based on the prosodic information of the divided speech data corresponding to the phoneme string data constituting the cluster; a candidate cluster selection unit that selects clusters whose evaluation value is at or above a certain value as candidate clusters; and a listening target data selection unit that, for each candidate cluster, specifies one phoneme string data from the phoneme string data constituting the cluster as a representative phoneme string and selects the divided speech data corresponding to the representative phoneme string as listening target speech data.
- A processing method executed by a computer having a speech data storage unit that stores speech data, the method comprising: a step of extracting, from the speech data stored in the speech data storage unit, prosodic information including at least a speech power value; a step of, based on the prosodic information, extracting from the speech data utterance sections in which the time during which the power value is at or above a predetermined threshold reaches or exceeds a predetermined time, and dividing each utterance section into sections in which a power value at or above a predetermined threshold continues for at least a fixed time, thereby generating divided speech data; a step of performing phoneme recognition processing on the divided speech data and acquiring phoneme string data for each divided speech data; a step of performing clustering processing on the phoneme string data and generating clusters, each being a set of classified phoneme string data; a step of calculating, for each cluster, an evaluation value based on the prosodic information of the divided speech data corresponding to the phoneme string data constituting the cluster; a step of selecting clusters whose evaluation value is at or above a certain value as candidate clusters; and a step of, for each candidate cluster, specifying one phoneme string data from the phoneme string data constituting the cluster as a representative phoneme string and selecting the divided speech data corresponding to the representative phoneme string as listening target speech data.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2008/058615 WO2009136440A1 (ja) | 2008-05-09 | 2008-05-09 | Speech recognition dictionary creation support device, processing program, and processing method |
JP2010510981A JP5454469B2 (ja) | 2008-05-09 | 2008-05-09 | Speech recognition dictionary creation support device, processing program, and processing method |
GB1018822.5A GB2471811B (en) | 2008-05-09 | 2008-05-09 | Speech recognition dictionary creating support device,computer readable medium storing processing program, and processing method |
US12/926,281 US8423354B2 (en) | 2008-05-09 | 2010-11-05 | Speech recognition dictionary creating support device, computer readable medium storing processing program, and processing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2008/058615 WO2009136440A1 (ja) | 2008-05-09 | 2008-05-09 | Speech recognition dictionary creation support device, processing program, and processing method |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/926,281 Continuation US8423354B2 (en) | 2008-05-09 | 2010-11-05 | Speech recognition dictionary creating support device, computer readable medium storing processing program, and processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2009136440A1 true WO2009136440A1 (ja) | 2009-11-12 |
Family ID: 41264502
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2008/058615 WO2009136440A1 (ja) | 2008-05-09 | 2008-05-09 | Speech recognition dictionary creation support device, processing program, and processing method |
Country Status (4)
Country | Link |
---|---|
US (1) | US8423354B2 (ja) |
JP (1) | JP5454469B2 (ja) |
GB (1) | GB2471811B (ja) |
WO (1) | WO2009136440A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011203349A (ja) * | 2010-03-24 | 2011-10-13 | Toyota Motor Corp | Speech recognition system and automatic search system |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9646613B2 (en) * | 2013-11-29 | 2017-05-09 | Daon Holdings Limited | Methods and systems for splitting a digital signal |
JP6413263B2 (ja) * | 2014-03-06 | 2018-10-31 | Denso Corp | Notification device |
US11276390B2 (en) * | 2018-03-22 | 2022-03-15 | Casio Computer Co., Ltd. | Audio interval detection apparatus, method, and recording medium to eliminate a specified interval that does not represent speech based on a divided phoneme |
CN111063372B (zh) * | 2019-12-30 | 2023-01-10 | 广州酷狗计算机科技有限公司 | 确定音高特征的方法、装置、设备及存储介质 |
CN111862954B (zh) * | 2020-05-29 | 2024-03-01 | 北京捷通华声科技股份有限公司 | 一种语音识别模型的获取方法及装置 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS63165900A (ja) * | 1986-12-27 | 1988-07-09 | Oki Electric Industry Co Ltd | Conversational speech recognition system |
JP2003241788A (ja) * | 2002-02-20 | 2003-08-29 | Ntt Docomo Inc | Speech recognition device and speech recognition system |
JP2007213005A (ja) * | 2006-01-10 | 2007-08-23 | Nissan Motor Co Ltd | Recognition dictionary system and updating method thereof |
JP2007293600A (ja) * | 2006-04-25 | 2007-11-08 | Ziosoft Inc | Medical server device, input device, proofreading device, browsing device, speech input report system, and program |
Family Cites Families (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0720889A | 1993-06-30 | 1995-01-24 | Omron Corp | Speech recognition device and method for unspecified speakers |
US5806030A (en) * | 1996-05-06 | 1998-09-08 | Matsushita Electric Ind Co Ltd | Low complexity, high accuracy clustering method for speech recognizer |
US5933805A (en) * | 1996-12-13 | 1999-08-03 | Intel Corporation | Retaining prosody during speech analysis for later playback |
US6092044A (en) * | 1997-03-28 | 2000-07-18 | Dragon Systems, Inc. | Pronunciation generation in speech recognition |
US6078885A (en) * | 1998-05-08 | 2000-06-20 | At&T Corp | Verbal, fully automatic dictionary updates by end-users of speech synthesis and recognition systems |
US6192333B1 (en) * | 1998-05-12 | 2001-02-20 | Microsoft Corporation | System for creating a dictionary |
US6317707B1 (en) * | 1998-12-07 | 2001-11-13 | At&T Corp. | Automatic clustering of tokens from a corpus for grammar acquisition |
US7120575B2 (en) * | 2000-04-08 | 2006-10-10 | International Business Machines Corporation | Method and system for the automatic segmentation of an audio stream into semantic or syntactic units |
JP2002116788A (ja) | 2000-10-04 | 2002-04-19 | Sharp Corp | Phoneme recognition device and method |
GB0027178D0 (en) * | 2000-11-07 | 2000-12-27 | Canon Kk | Speech processing system |
US6973427B2 (en) * | 2000-12-26 | 2005-12-06 | Microsoft Corporation | Method for adding phonetic descriptions to a speech recognition lexicon |
US20020087313A1 (en) * | 2000-12-29 | 2002-07-04 | Lee Victor Wai Leung | Computer-implemented intelligent speech model partitioning method and system |
WO2002073595A1 (fr) * | 2001-03-08 | 2002-09-19 | Matsushita Electric Industrial Co., Ltd. | Prosody generation device, prosody generation method, and program |
JP2002358095A (ja) * | 2001-03-30 | 2002-12-13 | Sony Corp | Speech processing device and speech processing method, program, and recording medium |
WO2003034402A1 (de) * | 2001-10-11 | 2003-04-24 | Siemens Aktiengesellschaft | Method for generating reference segments describing speech building blocks and method for modeling speech units of a spoken test pattern |
US20030120493A1 (en) * | 2001-12-21 | 2003-06-26 | Gupta Sunil K. | Method and system for updating and customizing recognition vocabulary |
WO2003077151A2 (en) * | 2002-03-05 | 2003-09-18 | Siemens Medical Solutions Health Services Corporation | A dynamic dictionary and term repository system |
JP3762327B2 (ja) * | 2002-04-24 | 2006-04-05 | Toshiba Corp | Speech recognition method, speech recognition device, and speech recognition program |
US7072828B2 (en) * | 2002-05-13 | 2006-07-04 | Avaya Technology Corp. | Apparatus and method for improved voice activity detection |
US7024353B2 (en) * | 2002-08-09 | 2006-04-04 | Motorola, Inc. | Distributed speech recognition with back-end voice activity detection apparatus and method |
US7047193B1 (en) * | 2002-09-13 | 2006-05-16 | Apple Computer, Inc. | Unsupervised data-driven pronunciation modeling |
JP2004117662A (ja) | 2002-09-25 | 2004-04-15 | Matsushita Electric Ind Co Ltd | Speech synthesis system |
AU2003290395A1 (en) * | 2003-05-14 | 2004-12-03 | Dharamdas Gautam Goradia | A system of interactive dictionary |
US7280963B1 (en) * | 2003-09-12 | 2007-10-09 | Nuance Communications, Inc. | Method for learning linguistically valid word pronunciations from acoustic data |
US8019602B2 (en) * | 2004-01-20 | 2011-09-13 | Microsoft Corporation | Automatic speech recognition learning using user corrections |
US7392187B2 (en) * | 2004-09-20 | 2008-06-24 | Educational Testing Service | Method and system for the automatic generation of speech features for scoring high entropy speech |
CA2597803C (en) * | 2005-02-17 | 2014-05-13 | Loquendo S.P.A. | Method and system for automatically providing linguistic formulations that are outside a recognition domain of an automatic speech recognition system |
US9245526B2 (en) * | 2006-04-25 | 2016-01-26 | General Motors Llc | Dynamic clustering of nametags in an automated speech recognition system |
US20080120093A1 (en) * | 2006-11-16 | 2008-05-22 | Seiko Epson Corporation | System for creating dictionary for speech synthesis, semiconductor integrated circuit device, and method for manufacturing semiconductor integrated circuit device |
JP4867654B2 (ja) * | 2006-12-28 | 2012-02-01 | Nissan Motor Co Ltd | Speech recognition device and speech recognition method |
JP5240457B2 (ja) * | 2007-01-16 | 2013-07-17 | 日本電気株式会社 | 拡張認識辞書学習装置と音声認識システム |
EP2126900B1 (en) * | 2007-02-06 | 2013-04-24 | Nuance Communications Austria GmbH | Method and system for creating entries in a speech recognition lexicon |
US8620658B2 (en) * | 2007-04-16 | 2013-12-31 | Sony Corporation | Voice chat system, information processing apparatus, speech recognition method, keyword data electrode detection method, and program for speech recognition |
US7472061B1 (en) * | 2008-03-31 | 2008-12-30 | International Business Machines Corporation | Systems and methods for building a native language phoneme lexicon having native pronunciations of non-native words derived from non-native pronunciations |
- 2008
  - 2008-05-09 JP JP2010510981A patent/JP5454469B2/ja not_active Expired - Fee Related
  - 2008-05-09 GB GB1018822.5A patent/GB2471811B/en not_active Expired - Fee Related
  - 2008-05-09 WO PCT/JP2008/058615 patent/WO2009136440A1/ja active Application Filing
- 2010
  - 2010-11-05 US US12/926,281 patent/US8423354B2/en not_active Expired - Fee Related
Also Published As
Publication number | Publication date |
---|---|
US8423354B2 (en) | 2013-04-16 |
US20110119052A1 (en) | 2011-05-19 |
GB201018822D0 (en) | 2010-12-22 |
JP5454469B2 (ja) | 2014-03-26 |
JPWO2009136440A1 (ja) | 2011-09-01 |
GB2471811A (en) | 2011-01-12 |
GB2471811B (en) | 2012-05-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10720164B2 (en) | System and method of diarization and labeling of audio data | |
EP1960997B1 (en) | Speech recognition system with huge vocabulary | |
JP6440967B2 (ja) | Sentence-end symbol estimation device, method, and program | |
JP5454469B2 (ja) | Speech recognition dictionary creation support device, processing program, and processing method | |
Imperl et al. | Clustering of triphones using phoneme similarity estimation for the definition of a multilingual set of triphones | |
Reichl et al. | Language modeling for content extraction in human-computer dialogues | |
JP2011013543A (ja) | Speech recognition device, method therefor, and program | |
Sárosi et al. | Automated transcription of conversational Call Center speech–with respect to non-verbal acoustic events | |
Nair et al. | Pair-wise language discrimination using phonotactic information | |
McMurtry | Information Retrieval for Call Center Quality Assurance | |
Anto et al. | Towards improving the performance of language identification system for Indian languages | |
Pittermann et al. | Integrating linguistic cues into speech-based emotion recognition | |
Noth et al. | Language identification in the context of automatic speech understanding | |
JPH0981185A (ja) | Continuous speech recognition device | |
Yang | Automatic Language Identification of Telephone Speech |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 08752501; Country of ref document: EP; Kind code of ref document: A1
| WWE | Wipo information: entry into national phase | Ref document number: 2010510981; Country of ref document: JP
| ENP | Entry into the national phase | Ref document number: 1018822; Country of ref document: GB; Kind code of ref document: A; Free format text: PCT FILING DATE = 20080512
| WWE | Wipo information: entry into national phase | Ref document number: 1018822.5; Country of ref document: GB
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | Ep: pct application non-entry in european phase | Ref document number: 08752501; Country of ref document: EP; Kind code of ref document: A1