WO2010113438A1 - Speech recognition processing system and speech recognition processing method - Google Patents
Speech recognition processing system and speech recognition processing method
- Publication number
- WO2010113438A1 (PCT/JP2010/002126)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- sound
- speaker
- speech recognition
- text data
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/32—Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M3/00—Automatic or semi-automatic exchanges
- H04M3/42—Systems providing special services or facilities to subscribers
- H04M3/487—Arrangements for providing information services, e.g. recorded voice services or time announcements
- H04M3/493—Interactive information services, e.g. directory enquiries ; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2201/00—Electronic components, circuits, software, systems or apparatus used in telephone systems
- H04M2201/38—Displays
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2201/00—Electronic components, circuits, software, systems or apparatus used in telephone systems
- H04M2201/40—Electronic components, circuits, software, systems or apparatus used in telephone systems using speech recognition
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04M—TELEPHONIC COMMUNICATION
- H04M2201/00—Electronic components, circuits, software, systems or apparatus used in telephone systems
- H04M2201/60—Medium conversion
Description
- The present invention relates to a voice recognition processing system and a voice recognition processing method.
- Patent Document 1 (Japanese Patent Laid-Open No. 2003-316375) describes a distributed dictation system including a communication processing unit for transmitting a voice recognition start instruction and encoded voice data to a host computer and for decoding received voice data, a speech recognition engine that recognizes the speech and creates text data, and a communication processing unit that returns the text data to a terminal device.
- This allows the text of the speech recognition result to be easily corrected on the terminal device. The document also describes establishing correspondence between recorded voice data and dictated text data using time information as a key.
- Patent Document 2 (Japanese Patent Application Laid-Open No. 2005-80110) describes an audio conference terminal device having a microphone that collects a speaker's voice and outputs voice information representing it, identification means provided near the microphone that outputs identification information based on owner information read from a recording medium on which that information is recorded, identification information adding means that attaches the identification information to the voice information, and transmission means that transmits the voice information with the identification information attached. This is said to provide a voice conference system in which multiple people can participate in a conference from a single location without using multiple lines, while the speaker can still be easily identified.
- Patent Document 3 (Japanese Patent Application Laid-Open No. 2003-036096) describes a voice recognition device comprising voice input means, voice recognition means for recognizing the content of the input voice, and status notification means for indicating the voice input or recognition status, in which the voice input means and the display means are arranged so that the user can see the display while facing the front of the voice input means. This makes it possible to speak appropriately while watching the display, yielding a voice recognition device with an excellent recognition rate.
- However, voice recognition by a recognition engine takes a certain amount of time. For example, it is difficult to recognize the voices of multiple speakers in a conference and output the results in substantially real time.
- In the system of Patent Document 1, a plurality of terminal devices for inputting speech are prepared, but the speech transmitted from these terminals is processed by a single speech recognition engine. The problem therefore remains that voice recognition results cannot be output quickly.
- To address this, a speech recognition engine may be provided for each speaker, or a plurality of speakers may be divided into several groups with a speech recognition engine provided for each group.
- This allows the voices of multiple speakers at one location to be recognized quickly.
- However, the present inventors have found that a new problem arises when parallel processing is performed by a plurality of speech recognition engines.
- Speech recognition by an engine takes a certain amount of time, and the more speech data there is to process, the longer it takes. If the engines process different amounts of speech data, the time from input of the target speech to output of its recognition result will therefore differ between engines.
- For example, while the first speech recognition engine is still processing the relatively long utterance "Today's agenda is ...," the second speech recognition engine finishes recognizing the relatively short utterance "Yes" and outputs its result first. If results are displayed and confirmed in real time in the order in which recognition completes, the recognition results appear in an order different from the actual order of the utterances, as shown in the figure. This is hard for the user to follow and risks confusion. At the same time, there is also a demand to display and confirm recognition results as soon as possible.
- An object of the present invention is to provide a speech recognition processing system and a speech recognition processing method that solve the above problem, namely that speech recognition results for a plurality of speakers are difficult for a user to grasp and confirm in real time.
- First speech recognition means for inputting a first speech that is a speech of a first speaker, performing speech recognition processing of the first speech, and outputting a speech recognition result as first text data
- Second speech recognition means for inputting a second speech that is the speech of the second speaker, performing speech recognition processing of the second speech, and outputting a speech recognition result as second text data
- Display processing means for displaying the first text data and the second text data on a display means in association with information indicating the first speaker and information indicating the second speaker, respectively;
- The display processing means, before the first text data and the second text data are output from the first and second speech recognition means, arranges the information indicating each speaker in the order in which the respective voices were uttered and displays it on the display means; when the first text data and the second text data are output from the first and second speech recognition means, it displays the corresponding text data on the display means in association with the information indicating each speaker. A speech recognition processing system of this kind is provided according to the present invention.
- According to the present invention, there is also provided a speech recognition processing method including a first display step of displaying, before the text data are output, the information indicating each speaker on display means arranged in the order in which the voices were uttered, and a second display step of displaying the first text data and the second text data on the display means in association with the information indicating the first speaker and the information indicating the second speaker, respectively.
- According to the present invention, the speech recognition results of multiple speakers' voices can be displayed quickly in a form that is easy for the user to grasp.
- Brief description of the drawings (excerpt): a figure showing an example of the screen displayed on the display during the processing of the speech recognition processing system shown in FIG. 1; a block diagram showing another example configuration of the speech recognition processing system in an embodiment of the present invention; and a figure for explaining the new problem that arises when parallel processing is performed by several speech recognition engines.
- FIG. 1 is a block diagram showing an example of the configuration of a speech recognition processing system in the present embodiment.
- the speech recognition processing system 100 includes a first speech recognition processing unit 110, a second speech recognition processing unit 120, a display processing unit 140, and a display 142.
- The first speech recognition processing unit 110 and the second speech recognition processing unit 120 have the same configuration; each inputs speech, performs speech recognition in parallel with the other, and sequentially outputs results as processing completes.
- The first speech recognition processing unit 110 includes a first voice input unit 112 (first voice input means), a first voice detection unit 114 (first voice detection means), and a first voice recognition unit 116 (first speech recognition means).
- The second voice recognition processing unit 120 has the same configuration as the first voice recognition processing unit 110 and includes a second voice input unit 122 (second voice input means), a second voice detection unit 124 (second voice detection means), and a second voice recognition unit 126 (second speech recognition means).
- the first voice input unit 112 and the second voice input unit 122 may be microphones, for example.
- the first voice input unit 112 and the second voice input unit 122 are installed in a conference room, for example, and input voices spoken by participants during the conference.
- Each of the first voice input unit 112 and the second voice input unit 122 can, for example, be placed near a specific speaker and configured to mainly pick up that speaker's voice.
- Here, it is assumed that the voice of speaker "A" (the first voice) is mainly input to the first voice input unit 112, and the voice of speaker "B" (the second voice) is mainly input to the second voice input unit 122.
- The first voice detection unit 114 performs known acoustic analysis on the voice input from the first voice input unit 112, detects the start and end of an utterance based on spectral power, signal-to-noise ratio (SNR), and the like, and generates voice data in units of one utterance. In addition, taking the time at which the voice was detected as the time at which it was uttered, the first voice detection unit 114 associates time information (first time information) indicating that time and information indicating the speaker with the voice data.
- For example, the speech recognition processing system 100 includes a clock unit (such as a clock circuit) that keeps the current time, and the first voice detection unit 114 acquires the time information from the clock unit. The first voice detection unit 114 outputs the voice data, with the time information and the information indicating the speaker associated, to the first voice recognition unit 116.
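- As a rough illustration of this detection step, the following Python sketch (not the patent's implementation; the name `detect_utterances` and the fixed energy threshold are our assumptions) segments an audio stream into utterances by frame energy and tags each utterance with its detection time and a speaker label:

```python
# A minimal sketch of the voice detection step (not the patent's actual
# implementation): segment audio into utterances by frame energy and tag
# each utterance with its detection time and a fixed speaker label.
import time
import numpy as np

def detect_utterances(samples, rate, speaker, frame_ms=20, threshold=0.01):
    """Yield utterances as dicts of speaker, detection time, and samples."""
    frame_len = int(rate * frame_ms / 1000)
    in_speech = False
    start = 0
    for i in range(0, len(samples) - frame_len, frame_len):
        frame = samples[i:i + frame_len]
        energy = float(np.mean(frame ** 2))  # crude stand-in for spectral power / SNR
        if energy >= threshold and not in_speech:
            in_speech, start = True, i
        elif energy < threshold and in_speech:
            in_speech = False
            # the detection time stands in for the utterance time (first time information)
            yield {"speaker": speaker,
                   "detected_at": time.strftime("%H:%M:%S"),
                   "audio": samples[start:i + frame_len]}

# Example: silence, then a louder burst, then silence again.
rate = 16000
audio = np.concatenate([np.zeros(rate // 2),
                        0.5 * np.sin(np.linspace(0, 440 * 2 * np.pi, rate // 2)),
                        np.zeros(rate // 2)])
for seg in detect_utterances(audio, rate, speaker="A"):
    print(seg["speaker"], seg["detected_at"], len(seg["audio"]))
```

A production detector would, as the text describes, rely on spectral power and SNR; the plain energy threshold here merely stands in for that acoustic analysis.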
- the first voice recognition unit 116 performs voice recognition processing on the voice data output from the first voice detection unit 114 and creates text data.
- the first voice recognition unit 116 outputs the created text data to the display processing unit 140.
- The second voice detection unit 124 and the second voice recognition unit 126 of the second voice recognition processing unit 120 have the same configuration as, and perform the same processing as, the first voice detection unit 114 and the first voice recognition unit 116 of the first voice recognition processing unit 110, respectively. That is, the second voice detection unit 124 associates time information indicating the detection time and information indicating the speaker with the voice data input from the second voice input unit 122, and outputs them to the second voice recognition unit 126. The second voice recognition unit 126 performs voice recognition on that voice data, creates text data, and outputs it to the display processing unit 140.
- the first speech recognition unit 116 and the second speech recognition unit 126 can each be configured to have the same function as a normal speech recognition engine. In the present embodiment, the first voice recognition unit 116 and the second voice recognition unit 126 can be configured to have equivalent processing capabilities.
- It can be set in advance that the voice input from the first voice input unit 112 is that of speaker "A" and the voice input from the second voice input unit 122 is that of speaker "B."
- Thereby, the first voice detection unit 114 determines that the speaker of the voice input from the first voice input unit 112 is speaker "A" and outputs "A" as the information indicating the speaker of the input voice; likewise, the second voice detection unit 124 determines that the speaker of the voice input from the second voice input unit 122 is speaker "B" and outputs "B."
- The display processing unit 140 performs processing for sequentially displaying the text data output from the first speech recognition unit 116 and the second speech recognition unit 126 on the display 142. Specifically, the display processing unit 140 displays the information indicating each voice's speaker and the recognition-result text data on the display 142 in the order in which the voices were uttered.
- More specifically, before the text data of the speech recognition results are output from the first speech recognition unit 116 and the second speech recognition unit 126, the display processing unit 140 arranges the information indicating the first speaker and the information indicating the second speaker on the display 142 in the order in which the respective voices were uttered. When the text data of the corresponding speech recognition results are output, it displays them on the display 142 in association with the information indicating each speaker.
- In the present embodiment, when a voice is input, the first voice detection unit 114 outputs the time information indicating the detection time and the information indicating the speaker to the display processing unit 140 before the text data is output from the first voice recognition unit 116. For example, the first voice detection unit 114 can output the voice data with the associated time and speaker information to the first voice recognition unit 116 and, at the same time, output the time information and the information indicating the speaker to the display processing unit 140.
- Similarly, when a voice is input, the second voice detection unit 124 outputs the time information indicating the detection time and the information indicating the speaker to the display processing unit 140 prior to the voice recognition processing by the second voice recognition unit 126. For example, it can output the voice data with the associated time and speaker information to the second voice recognition unit 126 and, at the same time, output the time information and the information indicating the speaker to the display processing unit 140.
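- The interplay between the detection units and the display processing unit can be pictured with a small sketch (our naming and data layout; the patent does not prescribe a data structure): a placeholder row appears as soon as an utterance is detected, and the recognized text is filled in later under the matching (time, speaker) key:

```python
# Illustrative sketch (our naming, not the patent's) of the display logic:
# a placeholder row is shown as soon as an utterance is detected, and the
# recognized text is filled in later, keyed by (detection time, speaker).
class DisplayProcessor:
    def __init__(self):
        self.rows = []  # kept sorted by detection time

    def on_utterance_detected(self, detected_at, speaker):
        # first display step: show who spoke, in utterance order
        self.rows.append({"time": detected_at, "speaker": speaker, "text": None})
        self.rows.sort(key=lambda r: r["time"])
        self.render()

    def on_recognition_result(self, detected_at, speaker, text):
        # second display step: attach the text to the matching placeholder
        for row in self.rows:
            if row["time"] == detected_at and row["speaker"] == speaker:
                row["text"] = text
                break
        self.render()

    def render(self):
        for row in self.rows:
            print(f'{row["time"]} {row["speaker"]}: {row["text"] or "..."}')
        print("---")

d = DisplayProcessor()
d.on_utterance_detected("13:12:10", "A")
d.on_utterance_detected("13:13:08", "B")
# B's short utterance finishes recognition first, but rows stay in spoken order
d.on_recognition_result("13:13:08", "B", "Thank you")
d.on_recognition_result("13:12:10", "A", "Now we will start the conference")
```

Because rows are inserted at detection time and kept sorted by that time, late-arriving recognition results never reorder the transcript, mirroring the placeholder-then-fill behavior described in the steps below.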
- FIG. 2 is a flowchart showing processing timing of the speech recognition processing system 100.
- FIG. 3 is a diagram illustrating an example of a screen displayed on the display 142.
- First, the utterance of speaker A, "Now we will start the conference," is input to the first voice detection unit 114 via the first voice input unit 112 (step S100). Suppose the first voice detection unit 114 detects this voice at 13:12:10 ("13:12:10"). The first voice detection unit 114 generates voice data, associates it with time information "13:12:10" indicating the detection time and information "A" indicating speaker A, and outputs it to the first voice recognition unit 116 (step S102).
- the first voice recognition unit 116 sequentially performs voice recognition processing on the voice data input from the first voice detection unit 114.
- That is, the first voice recognition unit 116 starts recognizing newly input voice data once recognition of the previously input voice data has finished. If recognition of the previous voice data has already been completed, recognition of the newly input voice data starts immediately.
- Meanwhile, the first voice detection unit 114 outputs the time information "13:12:10" indicating the detection time and the information "A" indicating speaker A to the display processing unit 140 (step S104).
- the display processing unit 140 displays information “A” indicating the speaker A on the display 142 (FIG. 3A).
- Next, the utterance of speaker A, "Today's agenda is ...," is input to the first voice detection unit 114 via the first voice input unit 112 (step S110). Suppose the first voice detection unit 114 detects this voice at 13:12:20 ("13:12:20"). The first voice detection unit 114 generates voice data, associates it with time information "13:12:20" indicating the detection time and information "A" indicating speaker A, and outputs it to the first voice recognition unit 116 (step S112).
- The first voice detection unit 114 also outputs the time information "13:12:20" indicating the detection time and the information "A" indicating speaker A to the display processing unit 140 (step S114).
- the display processing unit 140 displays information “A” indicating the speaker A on the display 142.
- Next, the utterance of speaker B, "Thank you," is input to the second voice detection unit 124 via the second voice input unit 122 (step S120). Suppose the second voice detection unit 124 detects this voice at 13:13:08 ("13:13:08"). The second voice detection unit 124 generates voice data, associates it with time information "13:13:08" indicating the detection time and information "B" indicating speaker B, and outputs it to the second voice recognition unit 126 (step S122).
- the second voice recognition unit 126 also sequentially performs voice recognition processing on the voice data input from the second voice detection unit 124.
- That is, the second voice recognition unit 126 starts recognizing newly input voice data once recognition of the previously input voice data has finished; if that recognition has already been completed, recognition of the newly input voice data starts immediately.
- The second voice detection unit 124 also outputs the time information "13:13:08" indicating the detection time and the information "B" indicating speaker B to the display processing unit 140 (step S124).
- the display processing unit 140 displays information “B” indicating the speaker B on the display 142 (FIG. 3B).
- the display processing unit 140 displays information indicating each speaker on the display 142 in order of time based on the time information associated therewith.
- Thereby, the user watching the display 142 can see that speaker B spoke after speaker A spoke twice. The user can also see in advance whose speech is currently undergoing recognition.
- In the figure, newly displayed portions are underlined. The actual screen of the display 142 may use such underlining, or reverse video or a cursor (flashing bar) may be shown instead. This makes it easy for the user watching the display 142 to see which part has been newly displayed.
- When the first speech recognition unit 116 finishes recognizing the speech "Now we will start the conference," it outputs the resulting text data to the display processing unit 140 together with the time "13:12:10" and "A" (step S130).
- Thereby, the display processing unit 140 displays the recognition result "Now we will start the conference" at the location corresponding to "A" on the display 142 (FIG. 3C).
- Here, the display processing unit 140 can use as keys the time information associated with the recognition-result text data and the time information associated with the previously output speaker information, and thereby display the recognition result "Now we will start the conference" at the location corresponding to "A." In that case, in step S130 the first speech recognition unit 116 may output the text data associated only with the time information "13:12:10."
- Next, when the second speech recognition unit 126 finishes recognizing the speech "Thank you," it outputs the resulting text data to the display processing unit 140 together with the time "13:13:08" and "B" (step S132).
- Thereby, the display processing unit 140 displays the recognition result "Thank you" at the location corresponding to "B" on the display 142 (FIG. 3D).
- When speaker A speaks next, processing similar to steps S102 and S104, or steps S112 and S114, is performed, and the display processing unit 140 displays the information "A" indicating speaker A on the display 142 (FIG. 3E).
- When the first speech recognition unit 116 finishes recognizing the speech "Today's agenda is ...," it outputs the resulting text data to the display processing unit 140 together with the time "13:12:20" and "A" (step S134). Thereby, the display processing unit 140 displays the recognition result "Today's agenda is ..." at the location corresponding to "A" on the display 142 (FIG. 3F).
- FIG. 3 shows an example in which the time is not displayed on the display 142, but the time can also be displayed, as shown in FIG. 4. FIGS. 4(a) to 4(d) show the same states as FIGS. 3(a) to 3(d), respectively.
- As described above, in the present embodiment, a plurality of speech recognition engines (the first speech recognition unit 116 and the second speech recognition unit 126) are provided and perform speech recognition in parallel, so that, for example, the voices of multiple speakers in a meeting can be recognized quickly.
- On the other hand, if recognition takes time in either the first speech recognition unit 116 or the second speech recognition unit 126, the order in which results are output may be reversed relative to the order in which the voices were actually uttered. In that case, as described with reference to FIG. 11, the results would be displayed in an order different from the actual utterance order, which is hard for the user to check and risks confusion.
- However, in the speech recognition processing system 100 of the present embodiment, information indicating the speaker of each voice is displayed on the display 142 in utterance order before the recognition results are output. Therefore, even if results are output in an order reversed from the actual utterance order, completed results can be displayed as soon as possible while the order in which each speaker spoke is preserved. As a result, the recognition results can be presented in an easy-to-view manner without confusing the user.
- FIG. 5 is a block diagram showing an example of the configuration of the speech recognition processing system in the present embodiment. Also in the present embodiment, the speech recognition processing system 100 has the same configuration as that described with reference to FIG. 1 in the first embodiment. In the present embodiment, the voice recognition processing system 100 is different from the first embodiment in that it further includes a volume comparison unit 150 in addition to the configuration of the voice recognition processing system 100 shown in FIG.
- Also in this embodiment, the voice of speaker "A" is mainly input to the first voice input unit 112, and the voice of speaker "B" is mainly input to the second voice input unit 122.
- However, if the first voice input unit 112 or the second voice input unit 122 picks up sound over a wide range, or if the speakers are close to each other, the voice of speaker B may also be input to the first voice input unit 112, and the voice of speaker A may also be input to the second voice input unit 122. In that case, the same voice may be recognized redundantly by both the first voice recognition unit 116 and the second voice recognition unit 126, or the speaker may not be identified correctly.
- In the present embodiment, the volume comparison unit 150 compares the volumes of the voices detected at the same time by the first voice detection unit 114 and the second voice detection unit 124, and determines which unit the voice properly belongs to. That is, the volume comparison unit 150 compares the volumes of the sounds output at the same time from the first voice input unit 112 and the second voice input unit 122; if the sound output from the first voice input unit 112 is louder than the sound output from the second voice input unit 122, it determines that the sound is the voice of speaker "A," and if the sound output from the second voice input unit 122 is louder than the sound output from the first voice input unit 112, it determines that the sound is the voice of speaker "B."
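- A minimal sketch of this comparison, assuming both microphones capture time-aligned frames of the same utterance (the function names are ours):

```python
# A minimal sketch (our naming) of the volume comparison step: the same
# utterance captured by both microphones is routed to the recognizer whose
# microphone heard it louder.
import numpy as np

def rms(samples):
    return float(np.sqrt(np.mean(np.asarray(samples, dtype=float) ** 2)))

def route_by_volume(capture_a, capture_b):
    """Return "A" if mic 1 heard the sound louder, otherwise "B"."""
    return "A" if rms(capture_a) > rms(capture_b) else "B"

# Speaker A talks: mic 1 (near A) captures a strong signal, mic 2 a faint one.
near = 0.8 * np.random.randn(16000)
far = 0.1 * np.random.randn(16000)
print(route_by_volume(near, far))  # "A" -> process in recognition unit 110
print(route_by_volume(far, near))  # "B" -> process in recognition unit 120
```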
- FIG. 6 is a flowchart showing processing timing of the speech recognition processing system 100.
- As in the first embodiment, suppose speaker A says "Now we will start the conference" and then "Today's agenda is ...," after which speaker B says "Thank you."
- First, the utterance of speaker A, "Now we will start the conference," is input to the first voice detection unit 114 via the first voice input unit 112 (step S200a).
- this voice is also input to the second voice detection unit 124 via the second voice input unit 122 (step S200b).
- Each of the first voice detection unit 114 and the second voice detection unit 124 generates voice data for this voice and associates it with the time of input.
- the volume comparison unit 150 compares the volume of the audio data generated by the first audio detection unit 114 and the second audio detection unit 124 (step S202).
- Since this voice is speaker A's, the volume of the voice data generated by the first voice detection unit 114 is greater than that of the voice data generated by the second voice detection unit 124. The volume comparison unit 150 therefore determines that this voice data should be processed by the first voice recognition processing unit 110, and notifies the first voice detection unit 114 and the second voice detection unit 124 of the determination (step S204).
- Based on this, the first voice detection unit 114 proceeds with the voice recognition and display processing described in the first embodiment, while the second voice detection unit 124 performs no further processing and waits for the next voice input.
- the volume comparison unit 150 compares the volume of the audio data generated by the first audio detection unit 114 and the second audio detection unit 124 (step S212).
- Since this voice is speaker B's, the volume of the voice data generated by the second voice detection unit 124 is greater than that of the voice data generated by the first voice detection unit 114. The volume comparison unit 150 therefore determines that this voice data should be processed by the second voice recognition processing unit 120, and notifies the first voice detection unit 114 and the second voice detection unit 124 of the determination (step S214).
- Based on this, the second voice detection unit 124 proceeds with the voice recognition and display processing described in the first embodiment, while the first voice detection unit 114 performs no further processing and waits for the next voice input.
- Also in the present embodiment, the same effects as in the first embodiment are obtained. Furthermore, even when a voice that should be processed by the first voice recognition processing unit 110 is also input to the second voice input unit 122 of the second voice recognition processing unit 120, or conversely, when a voice that should be processed by the second voice recognition processing unit 120 is also input to the first voice input unit 112 of the first voice recognition processing unit 110, the proper input is determined. This prevents the same voice from being recognized redundantly by both the first voice recognition unit 116 and the second voice recognition unit 126, and prevents the speaker from being misidentified.
- FIG. 7 is a block diagram showing an example of the configuration of the speech recognition processing system in the present embodiment.
- the speech recognition processing system 100 has the same configuration as that described with reference to FIG. 1 in the first embodiment.
- In the present embodiment, the speech recognition processing system 100 differs from the first embodiment in that it further includes a speaker specifying unit 160 and a speech feature data storage unit 162 in addition to the configuration shown in FIG. 1.
- In the present embodiment, a plurality of speakers can be divided into groups, with a first voice input unit 112 and a second voice input unit 122 provided per group; for example, the first voice input unit 112 may serve Company A and the second voice input unit 122 may serve Company B.
- In this case, each voice is an utterance by someone in one of the groups, and each speaker can be identified by comparison with stored voice feature data.
- the voice feature data storage unit 162 stores voice feature data of participants such as conferences to be subjected to voice recognition processing.
- The voice feature data can be any data indicating the characteristics of an individual speaker; for example, numerical data in which mel-frequency cepstral coefficients (MFCC), widely used in voice recognition systems, are recorded in the form of some mathematical model, such as a Gaussian mixture model (GMM), can be used.
- FIG. 8 is a diagram illustrating an example of the internal configuration of the audio feature data storage unit 162.
- As shown, the voice feature data storage unit 162 stores a data number column, a group column, a voice feature data column, and a name column in association with each other.
- The speaker identification unit 160 compares the voice data processed by the first voice recognition unit 116 and the second voice recognition unit 126 with the voice feature data stored in the voice feature data storage unit 162, and identifies the speaker of each piece of voice data. Specifically, the speaker identification unit 160 finds the voice feature data matching the features of the processed voice data, and outputs the corresponding speaker's "name" to the display processing unit 140 as the information indicating the speaker, in association with the recognition-result text data output from the first and second voice recognition units.
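- As one illustration of matching voice data against stored feature data, the sketch below fits one Gaussian mixture model (GMM) per enrolled speaker on MFCC frames and picks the best-scoring model. The use of scikit-learn, random toy features, and precomputed MFCC matrices are our assumptions; the patent only specifies MFCC features recorded in some mathematical-model form such as a GMM:

```python
# Illustrative sketch of speaker identification against stored feature data:
# one Gaussian mixture model (GMM) per enrolled speaker, scored on the MFCC
# frames of an utterance. Precomputed MFCC arrays and the names here are our
# assumptions, not the patent's.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def enroll(mfcc_frames, n_components=4):
    """Fit a GMM to an (n_frames, n_coeffs) MFCC matrix for one speaker."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=0)
    gmm.fit(mfcc_frames)
    return gmm

# Toy "voice feature data storage": name -> fitted model.
storage = {
    "Yamada": enroll(rng.normal(0.0, 1.0, size=(200, 13))),
    "Sato": enroll(rng.normal(3.0, 1.0, size=(200, 13))),
}

def identify(mfcc_frames):
    """Return the enrolled name whose model best explains the utterance."""
    return max(storage, key=lambda name: storage[name].score(mfcc_frames))

test = rng.normal(3.0, 1.0, size=(80, 13))  # utterance resembling Sato's voice
print(identify(test))  # -> "Sato"
```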
- FIG. 9 is a diagram illustrating an example of a screen displayed on the display 142 in the present embodiment. Also in the present embodiment, the procedure by which the display processing unit 140 displays the information indicating each speaker on the display 142 in time order, based on the associated time information, is the same as in the first embodiment (FIGS. 9A and 9B). Therefore, until a recognition result is displayed on the display 142, information indicating the group, such as "A" or "B," is displayed as the information indicating the speaker.
- In the present embodiment, when the display processing unit 140 displays a recognition result on the display 142, it simultaneously replaces the group information such as "A" or "B" with the name of the speaker of that voice.
- For example, suppose the features of the voice "Now we will start the conference" are determined to match the voice feature data "0011" of data No. "0001" stored in the voice feature data storage unit 162 shown in FIG. 8.
- the speaker identification unit 160 notifies the first speech recognition unit 116 that the name of the speaker of this speech is “Yamada”.
- When the first voice recognition unit 116 finishes recognizing the voice "Now we will start the conference," it outputs the resulting text data to the display processing unit 140 together with the time "13:12:10" and "Yamada."
- Thereby, the display processing unit 140 displays the recognition result "Now we will start the conference" at the location corresponding to "A" on the display 142, and replaces what was displayed as "A" with "Yamada" (FIG. 9(c)).
- The replacement of "A" with "Yamada" on the display 142 and the display of the recognition result "Now we will start the conference" at the location corresponding to "A" need not be performed simultaneously. For example, the first speech recognition unit 116 may output the name "Yamada" together with the time "13:12:10" to the display processing unit 140 before the recognition result is output, and the display processing unit 140 may first replace what was displayed as "A" with "Yamada."
- Thereafter, similarly, the display processing unit 140 displays each recognition result and simultaneously displays the name of each speaker as the information indicating that speaker (FIG. 9(d), FIG. 9(f)). For example, if someone in group "A" speaks after "Sato" says "Thank you," only "A" is displayed until the recognition result of that voice is displayed (FIG. 9(e)).
- When the recognition result "A little before that ..." is displayed, the display processing unit 140 simultaneously displays the speaker name "Kobayashi" as the information indicating the speaker (FIG. 9(g)).
- Also in the present embodiment, the same effects as in the first embodiment are obtained. Furthermore, according to the speech recognition processing system 100 of the present embodiment, even when voices are input from the same first voice input unit 112, different speakers are individually identified and displayed, so the recognition results can be presented in an easy-to-view manner without confusing the user.
- The speech recognition processing system 100 can also be applied to a configuration in which voice input units are provided for a plurality of judges, a plurality of prosecutors, a plurality of witnesses, and the like in a court hearing.
- each component of the speech recognition processing system 100 shown in the above drawings is not a hardware unit configuration but a functional unit block.
- Each component of the speech recognition processing system 100 includes a CPU, a memory of a computer, a program for realizing the components of this figure loaded in the memory, a storage unit such as a hard disk for storing the program, and a network connection interface. It is realized by any combination of hardware and software. It will be understood by those skilled in the art that there are various modifications to the implementation method and apparatus.
- the voice recognition processing system 100 described above can be configured to display a voice recognition result on the display 142 and simultaneously output a corresponding voice.
- FIG. 10 shows an example of the configuration of the speech recognition processing system 100 for realizing such a configuration.
- In this case, the voice recognition processing system 100 further includes an output processing unit 138, a voice recording unit 170, a voice storage unit 172, a voice output processing unit 174 (audio output processing means), and a speaker 176.
- the voice storage unit 172 stores each voice in association with time information indicating the time when each voice is detected by the voice detection unit.
- When the text data displayed on the display 142 is selected, the audio output processing unit 174 outputs from the speaker 176 the audio at the corresponding time stored in the audio storage unit 172, based on the time information associated with that text data.
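- A sketch of this playback path, assuming stored audio segments indexed by their detection time (our construction):

```python
# Sketch (our naming) of playback keyed by time information: stored audio is
# indexed by detection time, and selecting a transcript line replays it.
voice_storage = {}  # detection time -> audio samples

def store(detected_at, audio):
    voice_storage[detected_at] = audio

def on_text_selected(detected_at, play):
    """Look up the utterance detected at `detected_at` and play it."""
    audio = voice_storage.get(detected_at)
    if audio is not None:
        play(audio)

store("13:12:10", [0.0, 0.1, 0.2])          # placeholder samples
on_text_selected("13:12:10", play=print)    # print stands in for speaker output
```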
- In the above, the voice recognition processing system 100 was described as including two voice recognition processing units, but it can also be configured to include more voice recognition processing units.
- In that case, the speech recognition processing system 100 may have a function of attaching identification information indicating the input order to each input voice, and the order in which the information indicating each speaker and the speech recognition results are displayed can be controlled based on that identification information.
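- A sketch of that variation, replacing wall-clock time with a monotonically increasing sequence number (our construction; the patent says only that identification information indicates the input order):

```python
# Sketch: attach a global sequence number to each input instead of a
# timestamp; display order then follows the input order directly.
import itertools

seq = itertools.count()

def tag_input(speaker, audio):
    return {"order": next(seq), "speaker": speaker, "audio": audio}

u1 = tag_input("A", b"...")
u2 = tag_input("B", b"...")
rows = sorted([u2, u1], key=lambda u: u["order"])
print([r["speaker"] for r in rows])  # ['A', 'B'] regardless of arrival order
```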
- the first audio input unit 112 and the second audio input unit 122 may be configured by a plurality of microphones and a mixer that bundles signals acquired from these microphones into one signal.
- For example, a plurality of microphones may be provided for each group, and their signals may be bundled into one by a mixer or the like and input to the first voice detection unit 114 and the second voice detection unit 124, respectively.
Claims (13)

1. A speech recognition processing system comprising: first speech recognition means for inputting a first speech, which is the speech of a first speaker, performing speech recognition processing on the first speech, and outputting the speech recognition result as first text data; second speech recognition means for inputting a second speech, which is the speech of a second speaker, performing speech recognition processing on the second speech, and outputting the speech recognition result as second text data; and display processing means for displaying the first text data and the second text data on display means in association with information indicating the first speaker and information indicating the second speaker, respectively, wherein, before the first text data and the second text data are output from the first speech recognition means and the second speech recognition means, the display processing means arranges the information indicating each speaker in the order in which the respective voices were uttered and displays it on the display means, and when the first text data and the second text data are output from the first speech recognition means and the second speech recognition means, the display processing means displays the corresponding text data on the display means in association with the information indicating each speaker.

2. The speech recognition processing system according to claim 1, further comprising: first voice detection means for detecting the first voice, taking the time at which the first voice was detected as the time at which the first voice was uttered, performing processing to associate first time information indicating that detection time with the information indicating the first speaker, and outputting the first time information and the information indicating the first speaker to the display processing means prior to the speech recognition processing by the first speech recognition means; and second voice detection means for detecting the second voice, taking the time at which the second voice was detected as the time at which the second voice was uttered, performing processing to associate second time information indicating that detection time with the information indicating the second speaker, and outputting the second time information and the information indicating the second speaker to the display processing means prior to the speech recognition processing by the second speech recognition means, wherein the display processing means arranges the information indicating each speaker output from the first voice detection means and the second voice detection means in time order, based on the time information associated with each, and displays it on the display means.

3. The speech recognition processing system according to claim 2, wherein the display processing means displays the information indicating each speaker output from the first voice detection means and the second voice detection means on the display means together with the time information associated with each.

4. The speech recognition processing system according to claim 2 or 3, wherein: the first voice detection means outputs the first voice together with the first time information to the first speech recognition means; the first speech recognition means outputs the first text data together with the first time information to the display processing means; the second voice detection means outputs the second voice together with the second time information to the second speech recognition means; the second speech recognition means outputs the second text data together with the second time information to the display processing means; and the display processing means displays each piece of text data on the display means in association with the information, already displayed on the display means, indicating the speaker whose associated time information is the same.

5. The speech recognition processing system according to any one of claims 2 to 4, further comprising: first voice input means for mainly inputting the voice of the first speaker and outputting that voice to the first voice detection means; and second voice input means for mainly inputting the voice of the second speaker and outputting that voice to the second voice detection means, wherein the first voice detection means detects the voice output from the first voice input means as the first voice, and the second voice detection means detects the voice output from the second voice input means as the second voice.

6. The speech recognition processing system according to any one of claims 2 to 4, further comprising: first voice input means for mainly inputting the voice of the first speaker and outputting that voice; second voice input means for mainly inputting the voice of the second speaker and outputting that voice; and volume comparison means for comparing the volumes of the voices output simultaneously from the first voice input means and the second voice input means, determining that a voice is the first voice when the voice output from the first voice input means is louder than the voice output from the second voice input means, and determining that a voice is the second voice when the voice output from the second voice input means is louder than the voice output from the first voice input means, wherein the first voice detection means and the second voice detection means detect the first voice and the second voice, respectively, based on the determination of the volume comparison means.

7. The speech recognition processing system according to any one of claims 2 to 6, further comprising: voice storage means for storing the first voice and the second voice in association with the first time information and the second time information, respectively; and voice output processing means for, when the first text data or the second text data displayed on the display means is selected, outputting each voice at the corresponding time stored in the voice storage means, based on the first time information and the second time information associated with each piece of text data.

8. The speech recognition processing system according to claim 1, further comprising: first voice input means for mainly inputting the voice of the first speaker and outputting that voice; second voice input means for mainly inputting the voice of the second speaker and outputting that voice; and volume comparison means for comparing the volumes of the voices output simultaneously from the first voice input means and the second voice input means, determining that a voice is the first voice when the voice output from the first voice input means is louder than the voice output from the second voice input means, and determining that a voice is the second voice when the voice output from the second voice input means is louder than the voice output from the first voice input means.

9. The speech recognition processing system according to any one of claims 1 to 8, further comprising: voice feature data storage means for storing voice feature data of speakers' voices in association with information indicating each speaker; and speaker identification means for comparing the first voice and the second voice with the voice feature data stored in the voice feature data storage means and identifying the speakers of the first voice and the second voice, wherein the display processing means displays the first text data and the second text data on the display means in association with the information indicating the first speaker and the information indicating the second speaker identified by the speaker identification means, respectively.

10. A speech recognition processing method comprising: a first speech recognition step of inputting a first speech, which is the speech of a first speaker, performing speech recognition processing on the first speech, and outputting the speech recognition result as first text data; a second speech recognition step of inputting a second speech, which is the speech of a second speaker, performing speech recognition processing on the second speech, and outputting the speech recognition result as second text data; a first display step of displaying, before the first text data and the second text data are output from the first speech recognition step and the second speech recognition step, information indicating each speaker on display means, arranged in the order in which the respective voices were uttered; and a second display step of displaying, when the first text data and the second text data are output from the first speech recognition step and the second speech recognition step, the first text data and the second text data on the display means in association with the information indicating the first speaker and the information indicating the second speaker, respectively.

11. The speech recognition processing method according to claim 10, further comprising: a first voice detection step of detecting the first voice, taking the time at which the first voice was detected as the time at which the first voice was uttered, performing processing to associate first time information indicating that detection time with the information indicating the first speaker, and outputting the first time information and the information indicating the first speaker prior to the speech recognition processing in the first speech recognition step; and a second voice detection step of detecting the second voice, taking the time at which the second voice was detected as the time at which the second voice was uttered, performing processing to associate second time information indicating that detection time with the information indicating the second speaker, and outputting the second time information and the information indicating the second speaker prior to the speech recognition processing in the second speech recognition step, wherein the first display step displays the information indicating each speaker output from the first voice detection step and the second voice detection step on the display means, arranged in time order based on the time information associated with each.

12. A speech recognition processing program causing a computer to function as: first speech recognition means for inputting a first speech, which is the speech of a first speaker, performing speech recognition processing on the first speech, and outputting the speech recognition result as first text data; second speech recognition means for inputting a second speech, which is the speech of a second speaker, performing speech recognition processing on the second speech, and outputting the speech recognition result as second text data; and display processing means for displaying the first text data and the second text data on display means in association with information indicating the first speaker and information indicating the second speaker, respectively, wherein, before the first text data and the second text data are output from the first speech recognition means and the second speech recognition means, the display processing means arranges the information indicating each speaker in the order in which the respective voices were uttered and displays it on the display means, and when the first text data and the second text data are output, the display processing means displays the corresponding text data on the display means in association with the information indicating each speaker.

13. The speech recognition processing program according to claim 12, further causing the computer to function as: first voice detection means for detecting the first voice, taking the time at which the first voice was detected as the time at which the first voice was uttered, performing processing to associate first time information indicating that detection time with the information indicating the first speaker, and outputting the first time information and the information indicating the first speaker to the display processing means prior to the speech recognition processing by the first speech recognition means; and second voice detection means for detecting the second voice, taking the time at which the second voice was detected as the time at which the second voice was uttered, performing processing to associate second time information indicating that detection time with the information indicating the second speaker, and outputting the second time information and the information indicating the second speaker to the display processing means prior to the speech recognition processing by the second speech recognition means, wherein the display processing means displays the information indicating each speaker output from the first voice detection means and the second voice detection means on the display means, arranged in time order based on the time information associated with each.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/201,816 US8606574B2 (en) | 2009-03-31 | 2010-03-25 | Speech recognition processing system and speech recognition processing method |
JP2011507000A JP5533854B2 (ja) | 2009-03-31 | 2010-03-25 | Speech recognition processing system and speech recognition processing method
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2009-085532 | 2009-03-31 | ||
JP2009085532 | 2009-03-31 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010113438A1 true WO2010113438A1 (ja) | 2010-10-07 |
Family
ID=42827754
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2010/002126 WO2010113438A1 (ja) | 2009-03-31 | 2010-03-25 | Speech recognition processing system and speech recognition processing method
Country Status (3)
Country | Link |
---|---|
US (1) | US8606574B2 (ja) |
JP (1) | JP5533854B2 (ja) |
WO (1) | WO2010113438A1 (ja) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5201050B2 (ja) * | 2009-03-27 | 2013-06-05 | ブラザー工業株式会社 | Conference support device, conference support method, conference system, and conference support program |
CN102831894B (zh) * | 2012-08-09 | 2014-07-09 | 华为终端有限公司 | Instruction processing method, apparatus, and system |
WO2014085985A1 (zh) * | 2012-12-04 | 2014-06-12 | Itp创新科技有限公司 | Call transcription system and method |
US9842489B2 (en) * | 2013-02-14 | 2017-12-12 | Google Llc | Waking other devices for additional data |
US9548868B2 (en) * | 2013-09-06 | 2017-01-17 | International Business Machines Corporation | Gathering participants for meetings |
US9293141B2 (en) | 2014-03-27 | 2016-03-22 | Storz Endoskop Produktions Gmbh | Multi-user voice control system for medical devices |
WO2016035069A1 (en) * | 2014-09-01 | 2016-03-10 | Beyond Verbal Communication Ltd | System for configuring collective emotional architecture of individual and methods thereof |
JP5907231B1 (ja) * | 2014-10-15 | 2016-04-26 | 富士通株式会社 | Input information support device, input information support method, and input information support program |
JP6464411B6 (ja) * | 2015-02-25 | 2019-03-13 | Dynabook株式会社 | Electronic device, method, and program |
EP3115886B1 (de) * | 2015-07-07 | 2021-04-14 | Volkswagen Aktiengesellschaft | Method for operating a voice control system, and voice control system |
US10089061B2 (en) | 2015-08-28 | 2018-10-02 | Kabushiki Kaisha Toshiba | Electronic device and method |
US20170075652A1 (en) | 2015-09-14 | 2017-03-16 | Kabushiki Kaisha Toshiba | Electronic device and method |
KR101818980B1 (ko) * | 2016-12-12 | 2018-01-16 | 주식회사 소리자바 | Multi-speaker speech recognition correction system |
CN111684411A (zh) * | 2018-02-09 | 2020-09-18 | 谷歌有限责任公司 | Concurrent reception of multiple user speech inputs for translation |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2285895A (en) * | 1994-01-19 | 1995-07-26 | Ibm | Audio conferencing system which generates a set of minutes |
US6850609B1 (en) * | 1997-10-28 | 2005-02-01 | Verizon Services Corp. | Methods and apparatus for providing speech recording and speech transcription services |
US6477491B1 (en) * | 1999-05-27 | 2002-11-05 | Mark Chandler | System and method for providing speaker-specific records of statements of speakers |
JP2003036096A (ja) | 2001-07-23 | 2003-02-07 | Mitsubishi Electric Corp | Speech recognition device |
US7292543B2 (en) * | 2002-04-17 | 2007-11-06 | Texas Instruments Incorporated | Speaker tracking on a multi-core in a packet based conferencing system |
JP2005080110A (ja) | 2003-09-02 | 2005-03-24 | Yamaha Corp | Audio conference system, audio conference terminal device, and program |
US7305078B2 (en) * | 2003-12-18 | 2007-12-04 | Electronic Data Systems Corporation | Speaker identification during telephone conferencing |
JP2006251898A (ja) * | 2005-03-08 | 2006-09-21 | Fuji Xerox Co Ltd | Information processing device, information processing method, and program |
US20070133437A1 (en) * | 2005-12-13 | 2007-06-14 | Wengrovitz Michael S | System and methods for enabling applications of who-is-speaking (WIS) signals |
JP2007288539A (ja) * | 2006-04-17 | 2007-11-01 | Fuji Xerox Co Ltd | Conference system and conference method |
US20080263010A1 (en) * | 2006-12-12 | 2008-10-23 | Microsoft Corporation | Techniques to selectively access meeting content |
JP4867804B2 (ja) * | 2007-06-12 | 2012-02-01 | ヤマハ株式会社 | Speech recognition device and conference system |
US8315866B2 (en) * | 2009-05-28 | 2012-11-20 | International Business Machines Corporation | Generating representations of group interactions |
US8374867B2 (en) * | 2009-11-13 | 2013-02-12 | At&T Intellectual Property I, L.P. | System and method for standardized speech recognition infrastructure |
2010
- 2010-03-25: WO PCT/JP2010/002126 — WO2010113438A1, application filing (active)
- 2010-03-25: JP 2011-507000 — JP5533854B2 (active)
- 2010-03-25: US 13/201,816 — US8606574B2 (active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003316375A (ja) * | 2002-04-26 | 2003-11-07 | Ricoh Co Ltd | Distributed dictation system, program, and storage medium |
JP2005277462A (ja) * | 2004-03-22 | 2005-10-06 | Fujitsu Ltd | Conference support system, minutes generation method, and computer program |
JP2006050500A (ja) * | 2004-08-09 | 2006-02-16 | Jfe Systems Inc | Conference support system |
JP2006301223A (ja) * | 2005-04-20 | 2006-11-02 | Ascii Solutions Inc | Speech recognition system and speech recognition program |
JP2009005064A (ja) * | 2007-06-21 | 2009-01-08 | Panasonic Corp | IP telephone terminal and telephone conference system |
JP2009053342A (ja) * | 2007-08-24 | 2009-03-12 | Junichi Shibuya | Minutes creation device |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013125553A (ja) * | 2011-12-15 | 2013-06-24 | Toshiba Corp | Information processing device and recording program |
JP2017182822A (ja) * | 2017-05-08 | 2017-10-05 | 富士通株式会社 | Input information support device, input information support method, and input information support program |
JP2019164327A (ja) * | 2018-03-19 | 2019-09-26 | 株式会社リコー | Information processing device, information processing system, and information processing method |
JP7243145B2 (ja) | 2018-03-19 | 2023-03-22 | 株式会社リコー | Information processing device, information processing system, and information processing method |
Also Published As
Publication number | Publication date |
---|---|
JPWO2010113438A1 (ja) | 2012-10-04 |
US20110301952A1 (en) | 2011-12-08 |
JP5533854B2 (ja) | 2014-06-25 |
US8606574B2 (en) | 2013-12-10 |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 10758223; Country of ref document: EP; Kind code of ref document: A1
 | WWE | Wipo information: entry into national phase | Ref document number: 13201816; Country of ref document: US
 | WWE | Wipo information: entry into national phase | Ref document number: 2011507000; Country of ref document: JP
 | NENP | Non-entry into the national phase | Ref country code: DE
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 10758223; Country of ref document: EP; Kind code of ref document: A1