WO2011040056A1 - Speech translation system, first terminal device, speech recognition server device, translation server device, and speech synthesis server device - Google Patents
- Publication number
- WO2011040056A1 (PCT/JP2010/053419)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- translation
- unit
- result
- model
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
Definitions
- the speech recognition server device includes a speech recognition unit that acquires a speech recognition result and a speech recognition result transmission unit that transmits the speech recognition result, and the translation server device translates all or some of the two or more languages.
- the voice speaker attribute acquisition unit 304 may determine a speaker class (this speaker class is a kind of language speaker attribute) from the voice recognition result obtained by the voice recognition unit 308 performing voice recognition.
- the voice speaker attribute acquisition unit 304 holds a term dictionary with difficulty levels (a set of two or more pieces of term information, each associating a term with a difficulty level), acquires the difficulty levels (n1, n2, ...) of the terms included in the speech recognition result, and determines the speaker class (for example, difficulty high "0", difficulty middle "1", difficulty low "2"). Further, the voice speaker attribute acquisition unit 304 may determine the speaker class using the difficulty levels (n1, n2, ...) of one or more terms together with the presence or absence of a grammatical error.
- the voice speaker attribute acquisition unit 304 acquires as the speaker class, for example, a value obtained by adding "1" to the final difficulty level obtained from the one or more difficulty levels (difficulty high "0", difficulty middle "1", difficulty low "2").
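The class decision sketched above might look as follows in code. The term dictionary contents, the aggregation of per-term levels (here the most difficult one wins), the reading that the "+1" adjustment applies when a grammatical error is found, and the clamping to the 0..2 range are all illustrative assumptions, not the patented procedure.

```python
# Hypothetical term dictionary: term -> difficulty (0 = high, 1 = middle, 2 = low).
TERM_DICTIONARY = {
    "algorithm": 0,
    "weather": 1,
    "morning": 2,
}

def speaker_class(recognized_terms, has_grammar_error):
    """Determine a speaker class from per-term difficulty levels.

    Assumed rule: the most difficult (lowest-numbered) known term gives the
    final difficulty level; a grammatical error adds 1, clamped to 0..2.
    """
    levels = [TERM_DICTIONARY[t] for t in recognized_terms if t in TERM_DICTIONARY]
    if not levels:
        return 2  # no known terms: assume the easiest class
    final = min(levels)            # most difficult term dominates (assumption)
    if has_grammar_error:
        final = min(final + 1, 2)  # "+1" adjustment for a grammar error
    return final

print(speaker_class(["algorithm", "weather"], has_grammar_error=False))  # 0
print(speaker_class(["morning"], has_grammar_error=True))                # 2
```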
- the checking process for determining whether or not there is a grammatical error in the sentence is a known natural language process, and thus detailed description thereof is omitted.
- the voice speaker attribute acquisition unit 304 may acquire a speaker attribute by a method other than the method described above, or may acquire any speaker attribute.
- the voice recognition unit 308 recognizes the voice information received by the voice information reception unit 306 using the voice recognition model stored in the voice recognition model storage unit 302, and acquires a voice recognition result. It is preferable that the voice recognition unit 308 uses the voice recognition model selected by the voice recognition model selection unit 307 for this recognition.
- the voice recognition unit 308 may use any voice recognition method.
- the voice recognition unit 308 can be realized by a known technique.
- information on a speech recognition target language is included in, for example, speech translation control information.
- the speech translation control information is transferred between the first terminal device, the speech recognition server device, the translation server device, the speech synthesis server device, and the second terminal device 2.
- the voice recognition result is usually a character string in the original language (the language of the voice spoken by the user A of the first terminal device 1).
- the fourth speaker attribute accumulation unit 405 at least temporarily accumulates the speaker attributes received by the fourth speaker attribute receiving unit 403 in the fourth speaker attribute storage unit 401.
- the fourth speaker attribute accumulation unit 405 may store the speech translation control information in the fourth speaker attribute storage unit 401.
- the fourth speaker attribute accumulation unit 405 may be referred to as a fourth speech translation control information accumulation unit 405.
- the translation unit 408 translates the speech recognition result received by the speech recognition result receiving unit 406 into the target language using the translation model in the translation model storage unit 402, and acquires the translation result. It is preferable that the translation unit 408 uses the translation model selected by the translation model selection unit 407 for this translation. Information specifying the source language and the target language is included in, for example, the speech translation control information. Moreover, the translation method itself is not limited.
- the translation unit 408 can be realized by a known technique.
- the translation result transmission unit 410 transmits the translation result, which is a result of the translation processing performed by the translation unit 408, to the speech synthesis server device 5 directly or indirectly. In addition, it is preferable that the translation result transmission unit 410 directly or indirectly transmits the translation result to the speech synthesis server device 5 selected by the speech synthesis server selection unit 409.
- the speech synthesis model selection unit 506 selects one speech synthesis model from two or more speech synthesis models according to one or more speaker attributes received by the fifth speaker attribute reception unit 503.
- the fifth model selection means 5062 searches the speech synthesis model selection information management table from one or more speaker attributes stored in the fifth speaker attribute storage unit 501, and performs speech synthesis corresponding to the one or more speaker attributes. Get the model device identifier.
- the speech synthesis unit 507 acquires a speech synthesis model corresponding to the speech synthesis model identifier acquired by the fifth model selection unit 5062 from the speech synthesis model storage unit 502, and performs speech synthesis processing using the speech synthesis model. .
- The first speaker attribute storage unit 11, first server selection information storage unit 151, second speaker attribute storage unit 21, second server selection information storage unit 251, third speaker attribute storage unit 301, speech recognition model storage unit 302, third model selection information storage unit 3071, third server selection information storage unit 3091, fourth speaker attribute storage unit 401, translation model storage unit 402, fourth model selection information storage unit 4071, fourth server selection information storage unit 4091, fifth speaker attribute storage unit 501, speech synthesis model storage unit 502, and fifth model selection information storage unit 5061 are preferably non-volatile recording media, but may also be realized by volatile recording media.
- the process in which the above information is stored in the first speaker attribute storage unit 11 or the like is not limited.
- The first speaker attribute accumulation unit 13, first voice recognition server selection unit 15, first server selection information storage unit 151, second speaker attribute accumulation unit 23, second voice recognition server selection unit 25, second server selection unit 252, voice speaker attribute acquisition unit 304, third speaker attribute accumulation unit 305, voice recognition model selection unit 307, voice recognition unit 308, translation server selection unit 309, third model selection unit 3072, third server selection unit 3092, language speaker attribute acquisition unit 404, fourth speaker attribute accumulation unit 405, translation model selection unit 407, translation unit 408, speech synthesis server selection unit 409, fourth model selection unit 4072, fourth server selection unit 4092, fifth speaker attribute accumulation unit 504, speech synthesis model selection unit 506, speech synthesis unit 507, and fifth model selection unit 5062 can usually be realized by an MPU, memory, and the like.
- The processing procedure of the first speaker attribute accumulation unit 13 and the like is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may also be realized by hardware (a dedicated circuit).
- Step S603 The first speaker attribute accumulation unit 13 stores the accepted one or more speaker attributes in the first speaker attribute storage unit 11. The process returns to step S601.
- Step S607 The first server selection unit 152 reads one or more speaker attributes from the first speaker attribute storage unit 11.
- Step S608 The first server selection unit 152 applies one or more speaker attributes read in step S607 to the first server selection information (voice recognition server selection information management table) of the first server selection information storage unit 151. Then, the voice recognition server device 3 is selected.
- the selection of the speech recognition server device 3 means, for example, obtaining one speech recognition server device identifier.
- The first speaker attribute transmission unit 19 configures speech translation control information using one or more speaker attributes stored in the first speaker attribute storage unit 11. For example, the first speaker attribute transmission unit 19 acquires the identifier of the target language determined from the entered telephone number of the second terminal device 2, and acquires the identifier of the source language determined from the stored telephone number of the first terminal device 1. For example, since a telephone number includes a country code, the first speaker attribute transmission unit 19 determines the target language from the country code. The first speaker attribute transmission unit 19 holds a correspondence table of country codes and language identifiers (for example, a table having records such as "81: Japanese" and "82: Korean"). The first speaker attribute transmission unit 19 then configures the speech translation control information from the one or more speaker attributes stored in the first speaker attribute storage unit 11, the source language identifier, the target language identifier, and the like.
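A minimal sketch of the country-code lookup described above. Only the "81: Japanese" and "82: Korean" table entries come from the description; the extra entry, the phone-number format, and the helper names are assumptions for illustration.

```python
# Correspondence table of country codes to language identifiers.
# "81" and "82" follow the description; "1" is an assumed extra entry.
COUNTRY_CODE_TO_LANGUAGE = {
    "81": "Japanese",
    "82": "Korean",
    "1": "English",
}

def language_from_number(phone_number):
    """Return the language identifier for a number like '+81-80-1111-2256'."""
    digits = phone_number.lstrip("+")
    # Try the longest country-code prefix first (codes are 1 to 3 digits).
    for length in (3, 2, 1):
        code = digits[:length]
        if code in COUNTRY_CODE_TO_LANGUAGE:
            return COUNTRY_CODE_TO_LANGUAGE[code]
    return None

def build_speech_translation_control_info(own_number, partner_number, speaker_attributes):
    """Combine speaker attributes with source/target language identifiers."""
    return {
        "source_language": language_from_number(own_number),
        "target_language": language_from_number(partner_number),
        **speaker_attributes,
    }

info = build_speech_translation_control_info(
    "+81-80-1111-2256", "+1-555-0100", {"gender": "female", "age": 37})
print(info["source_language"], info["target_language"])  # Japanese English
```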
- During a call, it is preferable not to perform the processes of steps S607, S608, S609, and S611 again. That is, it is preferable to perform the processing of steps S607, S608, S609, and S611 once per call, or fewer times than the number of transmissions of voice information.
- the process is terminated by powering off or interruption for aborting the process.
- Step S703 The voice speaker attribute acquisition unit 304 acquires one or more speaker attributes from the voice information received in step S701. Such processing is called speaker attribute acquisition processing and will be described with reference to the flowchart of FIG.
- The third model selection unit 3072 searches the speech recognition model selection information management table using one or more speaker attributes included in the speech translation control information stored in the third speaker attribute storage unit 301, and acquires a voice recognition model identifier. That is, the third model selection unit 3072 selects a speech recognition model. Then, the third model selection unit 3072 reads the selected speech recognition model from the speech recognition model storage unit 302.
- The third server selection unit 3092 searches the translation server selection information management table using one or more speaker attributes included in the speech translation control information stored in the third speaker attribute storage unit 301, and acquires a translation server device identifier corresponding to the one or more speaker attributes.
- Step S708 The speech recognition result transmitting unit 310 transmits the speech recognition result obtained in step S706 to the translation server device 4 corresponding to the translation server device identifier acquired in step S707.
- Step S710 The third model selection unit 3072 determines whether or not the speech translation control information is stored in the third speaker attribute storage unit 301. If the speech translation control information is stored, go to step S711, otherwise go to step S712.
- the process is terminated by powering off or interruption for aborting the process.
- the voice speaker attribute acquisition unit 304 acquires one or more feature amounts from voice information (voice analysis).
- The feature vector data acquired by the voice speaker attribute acquisition unit 304 (a vector composed of one or more feature amounts) is, for example, MFCCs obtained by a discrete cosine transform of the output of a 24-channel filter bank using triangular filters. The static, delta, and delta-delta parameters each have 12 dimensions, and together with normalized power, delta power, and delta-delta power they form a 39-dimensional vector.
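A sketch of how such a 39-dimensional vector could be assembled from 13 static values per frame (12 MFCCs plus power). The simple frame-to-frame difference used here for the delta and delta-delta parameters is an assumption; practical systems typically use a regression over several neighboring frames.

```python
def deltas(frames):
    """Crude delta parameter: per-coefficient difference from the previous frame.
    (Real systems use a regression window; this is an illustrative assumption.)"""
    out = []
    for i, frame in enumerate(frames):
        prev = frames[i - 1] if i > 0 else frame  # repeat the first frame at the edge
        out.append([c - p for c, p in zip(frame, prev)])
    return out

def feature_vectors(static_frames):
    """static_frames: list of per-frame [12 MFCCs + 1 power] = 13 values.
    Returns 39-dimensional vectors: static + delta + delta-delta."""
    d = deltas(static_frames)
    dd = deltas(d)
    return [s + a + b for s, a, b in zip(static_frames, d, dd)]

frames = [[0.1 * (i + j) for j in range(13)] for i in range(5)]  # dummy MFCC+power
vecs = feature_vectors(frames)
print(len(vecs[0]))  # 39
```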
- Step S802 The voice speaker attribute acquisition unit 304 determines the gender of the speaker using the one or more feature amounts acquired in step S801.
- Step S804 The voice speaker attribute acquisition unit 304 acquires the speech speed from the voice information. Note that the processing for acquiring the speech speed is a known technique.
- Step S805 The voice speaker attribute acquisition unit 304 requests the voice recognition unit 308 to perform voice recognition processing, and obtains a voice recognition result.
- Step S806 The voice speaker attribute acquisition unit 304 performs natural language processing on the voice recognition result obtained in step S805, and determines a speaker class. Return to upper process.
- Step S901 The speech recognition result receiving unit 406 determines whether a speech recognition result has been received. If the voice recognition result is received, the process goes to step S902. If the voice recognition result is not received, the process returns to step S901.
- Step S902 The fourth speaker attribute receiving unit 403 determines whether or not the speech translation control information has been received. If the speech translation control information is received, the process proceeds to step S903, and if not received, the process proceeds to step S909.
- the language speaker attribute acquisition unit 404 performs natural language processing on the speech recognition result received in step S901, and acquires one or more language speaker attributes.
- the language speaker attribute acquisition unit 404 acquires a speaker class from the speech recognition result, for example.
- The fourth model selection unit 4072 searches the translation model selection information management table using one or more speaker attributes included in the speech translation control information received in step S902, or in the speech translation control information stored in the fourth speaker attribute storage unit 401, and acquires a translation model identifier. That is, the fourth model selection unit 4072 selects a translation model. Then, the fourth model selection unit 4072 reads the selected translation model from the translation model storage unit 402.
- Step S908 The translation result transmission unit 410 transmits the translation result obtained in Step S906 to the speech synthesis server device 5 corresponding to the speech synthesis server device identifier acquired in Step S907.
- Step S909 The fourth speaker attribute transmission unit 411 uses the speech translation control information stored in the fourth speaker attribute storage unit 401 as a speech synthesis server corresponding to the speech synthesis server device identifier acquired in step S907. Transmit to device 5. The process returns to step S901.
- Step S911 The fourth model selection unit 4072 reads the speech translation control information stored in the fourth speaker attribute storage unit 401. Go to step S905
- Step S912 The fourth model selection unit 4072 reads an arbitrary translation model stored in the translation model storage unit 402. Go to step S906.
- Step S1002 The fifth speaker attribute receiving unit 503 determines whether or not the speech translation control information has been received. If the speech translation control information is received, the process goes to step S1003. If not received, the process goes to step S1007.
- the fifth speaker attribute accumulation unit 504 at least temporarily stores the speech translation control information received in step S1002 in the fifth speaker attribute storage unit 501.
- the fifth model selection unit 5062 searches the speech synthesis model selection information management table using one or more speaker attributes included in the speech translation control information stored in the fifth speaker attribute storage unit 501. Then, a speech synthesis model identifier is acquired. That is, the fifth model selection unit 5062 selects a speech synthesis model. Then, the fifth model selection unit 5062 reads the selected speech synthesis model from the speech synthesis model storage unit 502.
- Step S1005 The speech synthesis unit 507 performs speech synthesis processing on the translation result received in step S1001, using the read speech synthesis model. Then, the voice synthesis unit 507 obtains voice information (speech synthesis result) obtained by voice synthesis.
- User A of the first terminal device 1 is a 37-year-old woman who speaks Japanese; Japanese is her native language.
- User B of the second terminal device 2 is a 38-year-old man who speaks English; English is his native language.
- the second speaker attribute storage unit 21 of the second terminal device 2 stores a second speaker attribute management table shown in FIG.
- The first server selection information storage unit 151 of the first terminal device 1 and the second server selection information storage unit 251 of the second terminal device 2 store the voice recognition server selection information management table shown in FIG.
- the voice recognition server selection information management table stores one or more records having attribute values of “ID”, “language”, “speaker attribute”, and “voice recognition server device identifier”.
- “Language” is a language for speech recognition.
- the “speaker attribute” includes “gender”, “age” (here, age category), and the like.
- the “voice recognition server device identifier” is information for communicating with the voice recognition server device 3, and is an IP address here.
- the first server selection information storage unit 151 only needs to have a record corresponding to the language “Japanese” in the speech recognition server selection information management table.
- the second server selection information storage unit 251 only needs to have a record corresponding to the language “English” in the voice recognition server selection information management table.
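The table lookup just described might be sketched as follows; the records and IP addresses are illustrative assumptions (only "186.221.1.27" appears in the example below), and the same pattern applies to the translation server, translation model, and speech synthesis selection tables.

```python
# Assumed voice recognition server selection information management table:
# language + speaker attributes -> voice recognition server device identifier.
RECOGNITION_SERVER_TABLE = [
    {"language": "Japanese", "gender": "female", "age": "30s",
     "server": "186.221.1.27"},
    {"language": "English", "gender": "male", "age": "30s",
     "server": "186.221.1.28"},  # assumed entry
]

def select_recognition_server(language, speaker_attributes):
    """Return the server identifier of the first record matching the language
    and every given speaker attribute, or None when nothing matches."""
    for record in RECOGNITION_SERVER_TABLE:
        if record["language"] != language:
            continue
        if all(record.get(k) == v for k, v in speaker_attributes.items()):
            return record["server"]
    return None

print(select_recognition_server("Japanese", {"gender": "female", "age": "30s"}))
# 186.221.1.27
```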
- the “translation server device identifier” is information for communicating with the translation server device 4, and here is an IP address.
- the fourth model selection information storage means 4071 of the translation server device 4 holds a translation model selection information management table shown in FIG.
- the translation model selection information management table stores one or more records having attribute values of “ID”, “source language”, “speaker attribute”, and “translation model identifier”.
- the “speaker attribute” includes “sex”, “age”, “second speaker class”, and the like.
- the “second speaker class” indicates whether the language used is native.
- the attribute value is “Y” if it is native, and “N” if it is not native.
- the “translation model identifier” is information for identifying the translation model, and is used, for example, to read out the translation model.
- the “translation model identifier” is a file name in which the translation model is stored.
- the “speech synthesis server device identifier” is information for communicating with the speech synthesis server device 5 and is an IP address here.
- the fifth model selection information storage means 5061 of the speech synthesis server device 5 holds a speech synthesis model selection information management table shown in FIG.
- the speech synthesis model selection information management table stores one or more records having attribute values of “ID”, “target language”, “speaker attribute”, and “speech synthesis model identifier”.
- the “speaker attribute” includes “sex”, “age”, “second speaker class”, and the like. It is more preferable to have “speaking speed” and “first speaker class” as “speaker attributes”.
- the “speech synthesis model identifier” is information for identifying a speech synthesis model, and is used, for example, for reading a speech synthesis model.
- “speech synthesis model identifier” is the name of a file in which the speech synthesis model is stored.
- user A tries to call user B.
- User A calls up the screen of FIG., a screen on the first terminal device 1 for inputting the telephone number of the other party (user B).
- Then, the first terminal device 1 reads the first speaker attribute management table.
- the user inputs the other party's language and the other party's telephone number, and presses the “call” button.
- In FIG. 19, it is assumed that the telephone number "080-1111-2256" is stored in a recording medium (not shown).
- the first voice reception unit 14 of the first terminal device 1 receives the voice “Good morning” of the user A.
- the first server selection unit 152 reads the speaker attributes in FIG. 11 from the first speaker attribute storage unit 11.
- the first speaker attribute transmitting unit 19 configures speech translation control information using one or more speaker attributes.
- the first speaker attribute transmission unit 19 configures the speech translation control information shown in FIG. 20, for example.
- This speech translation control information includes one or more speaker attributes and information (a language used by the other party [target language]) input by the user A from the screen of FIG.
- the speech translation control information includes a speech recognition server device identifier “186.221.1.27”.
- The first voice transmitting unit 16 digitizes the received voice "Good morning" and acquires the voice information of "Good morning". Then, the first voice transmitting unit 16 transmits the voice information to the speech recognition server device 3 identified by "186.221.1.27".
- the first speaker attribute transmission unit 19 transmits the speech translation control information of FIG. 20 to the speech recognition server device 3 identified by “186.221.1.27”.
- the voice information receiving unit 306 of the voice recognition server device 3 receives the voice information “Good morning”. Then, the third speaker attribute receiving unit 303 receives the speech translation control information of FIG.
- The voice speaker attribute acquisition unit 304 acquires one or more speaker attributes from the received voice information "Good morning". That is, the voice speaker attribute acquisition unit 304 acquires one or more feature amounts from the voice information "Good morning", and then acquires predetermined information using the one or more feature amounts.
- The speech translation control information in FIG. 20 includes speaker attributes such as gender and age. For speaker attributes that overlap with the speech translation control information (such as gender and age), the newly acquired speaker attributes may be given priority and used for speech recognition, the later translation, and speech synthesis.
- The third model selection unit 3072 searches the speech recognition model selection information management table using the one or more speaker attributes included in the speech translation control information stored in the third speaker attribute storage unit 301 and the one or more speaker attributes acquired by the voice speaker attribute acquisition unit 304, and selects a speech recognition model.
- the third model selection unit 3072 reads the selected speech recognition model “JR6” from the speech recognition model storage unit 302.
- the voice recognition unit 308 performs voice recognition processing on the received voice information using the read voice recognition model, and obtains a voice recognition result “Good morning”.
- the voice speaker attribute acquisition unit 304 requests the voice recognition unit 308 to perform voice recognition processing, and obtains a voice recognition result “Good morning”.
- The speech speaker attribute acquisition unit 304 performs natural language processing on the obtained speech recognition result, and acquires the first speaker class "A" because polite language is used.
- The voice speaker attribute acquisition unit 304 stores, for example, terms that constitute polite expressions and highly difficult terms, and may determine the first speaker class based on their appearance ratio.
- The voice speaker attribute acquisition unit 304 performs morphological analysis on the recognition result (in the Japanese example, 「おはようございます」) and divides it into two morphemes, 「おはよう」 and 「ございます」. The voice speaker attribute acquisition unit 304 then detects that 「ございます」 matches a managed term, and calculates the appearance ratio of managed terms as "50%".
- The voice speaker attribute acquisition unit 304 stores the judgment condition "A: the appearance ratio of managed terms is 5% or more; B: the appearance ratio is 1% or more and less than 5%; C: the appearance ratio is less than 1%", and based on this determines the first speaker class "A". Note that it is preferable that the voice speaker attribute acquisition unit 304 calculates the appearance ratio of managed terms every time, and determines and updates the first speaker class each time a conversational utterance is made.
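The appearance-ratio decision in this example can be sketched with the stated thresholds (5% and 1%); the romanized morpheme list and the managed-term set are assumptions for illustration.

```python
# Assumed set of managed terms (polite-language morphemes, romanized).
POLITE_TERMS = {"gozaimasu"}

def first_speaker_class(morphemes):
    """Map the appearance ratio of managed terms among the morphemes
    to class A (>= 5%), B (>= 1% and < 5%), or C (< 1%)."""
    matches = sum(1 for m in morphemes if m in POLITE_TERMS)
    ratio = 100.0 * matches / len(morphemes)
    if ratio >= 5.0:
        return "A"
    if ratio >= 1.0:
        return "B"
    return "C"

# "ohayou gozaimasu" splits into two morphemes; one matches -> 50% -> class A.
print(first_speaker_class(["ohayou", "gozaimasu"]))  # A
```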
- The third speaker attribute accumulation unit 305 adds the translation server device identifier "77.128.50.80" to the speech translation control information and stores it in the third speaker attribute storage unit 301.
- Such updated speech translation control information is shown in FIG.
- the speech recognition result transmission unit 310 transmits the speech recognition result “Good morning” to the translation server device 4 corresponding to the acquired translation server device identifier “77.128.50.80”.
- The fourth speaker attribute accumulation unit 405 at least temporarily stores the received speech translation control information (FIG. 22) in the fourth speaker attribute storage unit 401.
- the translation unit 408 performs translation processing on the received speech recognition result “Good morning” using the read translation model “JT4”. Then, the translation unit 408 obtains a translation result “Good morning.”
- The fourth speaker attribute accumulation unit 405 configures speech translation control information (FIG. 23) obtained by adding the speech synthesis server device identifier "238.3.55.7" to the received speech translation control information, and accumulates the speech translation control information in the fourth speaker attribute storage unit 401.
- the fourth speaker attribute transmission unit 411 transmits the speech translation control information of FIG. 23 to the speech synthesis server device 5 corresponding to the speech synthesis server device identifier “238.3.55.7”.
- the translation result receiving unit 505 of the speech synthesis server device 5 receives the translation result. Also, the fifth speaker attribute receiving unit 503 receives the speech translation control information of FIG.
- the fifth speaker attribute accumulation unit 504 accumulates the received speech translation control information in the fifth speaker attribute storage unit 501 at least temporarily.
- the speech synthesis unit 507 performs speech synthesis processing on the translation result "Good morning." using the read speech synthesis model. Then, the speech synthesis unit 507 obtains voice information (a speech synthesis result).
- the second voice receiving unit 27 of the second terminal apparatus 2 receives the voice synthesis result “Good morning”. Then, the second audio output unit 28 outputs the audio “Good morning”.
- the voice uttered by the user B of the second terminal device 2 in response to the output "Good morning" is converted by the same processing as described above, and the resulting voice is output at the first terminal device 1.
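The round trip just described — recognition, translation, synthesis, and output at the other party's terminal — can be sketched as a chain of three service calls. The three callables stand in for the speech recognition, translation, and speech synthesis server devices; they are toy assumptions, not the actual server interfaces.

```python
def speech_translate(audio, recognize, translate, synthesize):
    """Pass audio through the three server stages in order."""
    text = recognize(audio)        # speech recognition result
    translated = translate(text)   # translation result
    return synthesize(translated)  # speech synthesis result

# Toy stand-ins for the three server devices:
result = speech_translate(
    b"...",  # raw audio bytes
    recognize=lambda a: "\u304a\u306f\u3088\u3046",
    translate=lambda t: "Good morning.",
    synthesize=lambda t: f"<audio:{t}>",
)
print(result)  # <audio:Good morning.>
```

Each stage only sees the previous stage's output, which is why the speech translation control information must be forwarded alongside the results.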
- it is preferable that the speech synthesis processing be performed using the designated speech synthesis server device or speech synthesis model.
- for example, a user may wish to have speech synthesized in the target language using a speech synthesis model built from samples of his or her own voice, or using a speech synthesis server device that stores such a model.
- the first terminal device 1 stores a speech synthesis server device identifier for identifying a speech synthesis server device to be used or a speech synthesis model identifier for identifying a speech synthesis model.
- the speech translation control information is transmitted from the first terminal device 1 to the speech synthesis server device 5 via the speech recognition server device 3 and the translation server device 4.
- the first terminal device 1 performs the selection process of the voice recognition server device 3.
- the speech recognition server device 3 performs the speech recognition model selection process and the selection process for the translation server device 4.
- the translation server device 4 performs the translation model selection process and the selection process for the speech synthesis server device 5.
- the speech synthesis server device 5 performs a speech synthesis model selection process.
- FIG. 25 is a conceptual diagram of the speech translation system 6 in a case where one control device performs such server device selection processing.
- the first terminal device 251, the second terminal device 252, the speech recognition server device 253, the translation server device 254, and the speech synthesis server device 5 each receive the data to be processed from the control device 256, and each transmits the processed result back to the control device 256. That is, the first terminal device 251 transmits the audio information received from the user A to the control device 256. Then, the control device 256 determines a voice recognition server device 253 that performs voice recognition, and transmits the voice information to the voice recognition server device 253. Next, the voice recognition server device 253 receives the voice information, selects a voice recognition model as necessary, and performs voice recognition processing. Then, the speech recognition server device 253 transmits the speech recognition result to the control device 256.
- the control device 256 receives the speech recognition result from the speech recognition server device 253, and selects the translation server device 254 that performs translation. Then, the control device 256 transmits the speech recognition result to the selected translation server device 254.
- the translation server device 254 receives the speech recognition result, selects a translation model as necessary, and performs a translation process. Then, the translation server device 254 transmits the translation result to the control device 256.
- the control device 256 receives the translation result from the translation server device 254, and selects the speech synthesis server device 5 that performs speech synthesis. Then, the control device 256 transmits the translation result to the selected speech synthesis server device 5.
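The control device's role — selecting one server per stage and relaying each intermediate result to it — could be sketched as follows. The server registries and the selection policy (here, simply the first registered server) are assumptions for illustration; a real control device would match server capabilities against the speaker attributes.

```python
class ControlDevice:
    """Minimal sketch of the control device 256: it picks one server
    per stage and forwards the intermediate result to it."""

    def __init__(self, asr_servers, mt_servers, tts_servers):
        self.asr_servers = asr_servers
        self.mt_servers = mt_servers
        self.tts_servers = tts_servers

    def select(self, servers, attributes):
        # Selection policy is an assumption; the patent leaves it to
        # the speaker attributes in the speech translation control info.
        return servers[0]

    def handle(self, audio, attributes):
        asr = self.select(self.asr_servers, attributes)
        text = asr(audio)                 # speech recognition result
        mt = self.select(self.mt_servers, attributes)
        translated = mt(text)             # translation result
        tts = self.select(self.tts_servers, attributes)
        return tts(translated)            # speech synthesis result

ctrl = ControlDevice([lambda a: "hello"],
                     [lambda t: "\u3053\u3093\u306b\u3061\u306f"],
                     [lambda t: f"<audio:{t}>"])
print(ctrl.handle(b"...", {"gender": "male"}))
```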
- the first terminal device 251 includes a first speaker attribute storage unit 11, a first speaker attribute receiving unit 12, a first speaker attribute accumulation unit 13, a first voice reception unit 14, and a first voice transmission unit.
- the second terminal device 252 includes a second speaker attribute storage unit 21, a second speaker attribute receiving unit 22, a second speaker attribute accumulation unit 23, a second voice reception unit 24, a second voice transmission unit 26, a second voice receiving unit 27, a second voice output unit 28, and a second speaker attribute transmission unit 29.
- FIG. 27 is a block diagram of the control device 256.
- the control device 256 includes a speaker attribute storage unit 2561, a transmission / reception unit 2562, a speaker attribute accumulation unit 2563, a second speech recognition server selection unit 25, a translation server selection unit 309, and a speech synthesis server selection unit 409.
- the speaker attribute storage unit 2561 can store one or more speaker attributes.
- the speaker attribute storage unit 2561 may store speech translation control information.
- the transmission / reception unit 2562 transmits / receives various kinds of information to / from the first terminal device 251, the second terminal device 252, the speech recognition server device 253, the translation server device 254, and the speech synthesis server device 5.
- the various types of information include speech information, speech recognition results, translation results, speech synthesis results, speech translation control information (including some speaker attributes), and the like.
- the transmission / reception unit 2562 can be realized typically by wireless or wired communication means.
- the speaker attribute accumulation unit 2563 accumulates one or more speaker attributes (speech translation control information) received by the transmission / reception unit 2562 in the speaker attribute storage unit 2561.
- FIG. 28 is a block diagram of the voice recognition server device 253.
- the voice recognition server device 253 includes a third speaker attribute storage unit 301, a voice recognition model storage unit 302, a third speaker attribute receiving unit 303, a voice speaker attribute acquisition unit 304, a third speaker attribute accumulation unit 305, a voice information receiving unit 306, a speech recognition model selection unit 307, a speech recognition unit 308, a speech recognition result transmission unit 310, and a third speaker attribute transmission unit 311.
- FIG. 29 is a block diagram of the translation server device 254.
- the translation server device 254 includes a fourth speaker attribute storage unit 401, a translation model storage unit 402, a fourth speaker attribute receiving unit 403, a fourth speaker attribute accumulation unit 405, a speech recognition result receiving unit 406, a translation model selection unit 407, a translation unit 408, a translation result transmission unit 410, and a fourth speaker attribute transmission unit 411.
- the speech translation control information may be in an XML format as shown in FIG.
- the description language of the speech translation control information shown in FIG. 30 is referred to as a speech translation markup language, or STML (Speech Translation Markup Language).
- in FIG. 30, the gender (here, "male"), the age (here, "30"), and whether or not the speaker is a native speaker (here, "no") are described. FIG. 30 also shows information indicating the format of the output text (here, "SurfaceForm") as part of the speech translation control information.
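An STML-style fragment of speech translation control information can be handled with a standard XML parser. The element and attribute names below are illustrative assumptions modeled on the fields just described (gender, age, native, output format), not the exact STML schema of FIG. 30.

```python
import xml.etree.ElementTree as ET

# Illustrative STML-like fragment (element names are assumptions).
stml = """
<SpeechTranslationRequest>
  <User gender="male" age="30" native="no"/>
  <InputLanguage>ja</InputLanguage>
  <OutputLanguage>en</OutputLanguage>
  <Output format="SurfaceForm"/>
</SpeechTranslationRequest>
"""

root = ET.fromstring(stml)
user = root.find("User")
# Speaker attributes travel with the request and drive model selection.
print(user.get("gender"), user.get("age"), user.get("native"))  # male 30 no
print(root.findtext("OutputLanguage"))  # en
```

Because the same control information is forwarded from the recognition server through the translation server to the synthesis server, a self-describing format like this lets each server pick out only the attributes it needs.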
- the speech synthesis unit 507 may perform speech synthesis using a speech synthesis model in the speech synthesis model storage unit 502 such that the translation result received by the translation result receiving unit 505 matches the attributes indicated by the one or more speaker attributes received by the fifth speaker attribute receiving unit 503, and may acquire a speech synthesis result. The speech synthesis unit 507 may also perform speech synthesis using a speech synthesis model in the speech synthesis model storage unit 502 such that the translation result matches the attributes indicated by the speaker attributes of the speech translation control information, and may acquire a speech synthesis result. In such a case, it may be said that the speech synthesis unit 507 also performs the selection of a speech synthesis model.
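Matching a stored speech synthesis model against the received speaker attributes might look like the sketch below. The model records and the simple count-of-matches scoring rule are assumptions; the patent only requires that the chosen model fit the attributes.

```python
def select_synthesis_model(models, attributes):
    """Pick the stored model whose metadata matches the most
    speaker attributes (gender, age group, etc.)."""
    def score(model):
        return sum(1 for k, v in attributes.items() if model.get(k) == v)
    return max(models, key=score)

# Hypothetical model registry entries:
models = [
    {"id": "EM1", "gender": "male", "age_group": "30s"},
    {"id": "EM2", "gender": "female", "age_group": "20s"},
]
chosen = select_synthesis_model(models, {"gender": "male", "age_group": "30s"})
print(chosen["id"])  # EM1
```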
- the processing in the present embodiment may be realized by software. Then, this software may be distributed by software download or the like. Further, this software may be recorded on a recording medium such as a CD-ROM and distributed. This also applies to other embodiments in this specification.
- the software that realizes the first terminal device in this embodiment is the following program. That is, this program causes a computer to function as: a first voice receiving unit that receives voice; a first voice recognition server selection unit that selects one voice recognition server device among two or more voice recognition server devices according to one or more speaker attributes stored in a storage medium; and a first voice transmission unit that transmits, to the voice recognition server device selected by the first voice recognition server selection unit, voice information composed of the voice received by the first voice receiving unit.
- the software that realizes the speech recognition server device is a program for causing a computer to function as: a speech information receiving unit that receives speech information; a speech recognition model selection unit that selects one speech recognition model from two or more speech recognition models stored in a storage medium according to one or more speaker attributes stored in the storage medium; a speech recognition unit that performs speech recognition on the speech information received by the speech information receiving unit using the speech recognition model selected by the speech recognition model selection unit, and acquires a speech recognition result; and a speech recognition result transmission unit that transmits the speech recognition result.
- the software that realizes the voice recognition server device is also a program for causing a computer to function as: a voice information receiving unit that receives voice information; a speech recognition unit that performs speech recognition on the voice information received by the voice information receiving unit using a speech recognition model stored in a storage medium, and obtains a speech recognition result; a translation server selection unit that selects one translation server device among two or more translation server devices according to one or more speaker attributes stored in the storage medium; and a speech recognition result transmission unit that transmits the speech recognition result to the translation server device selected by the translation server selection unit.
- the software that realizes the speech recognition server device is also a program for causing the computer to further function as: a voice speaker attribute acquisition unit that acquires one or more speaker attributes related to speech from the speech information received by the speech information receiving unit; and a speaker attribute accumulation unit that accumulates the one or more speaker attributes acquired by the voice speaker attribute acquisition unit in a storage medium.
- the software for realizing the translation server device in the present embodiment is a program for causing a computer to function as: a fourth speaker attribute receiving unit that receives one or more speaker attributes; a speech recognition result receiving unit that receives a speech recognition result; a translation model selection unit that selects one translation model from two or more translation models stored in a storage medium according to the one or more speaker attributes received by the fourth speaker attribute receiving unit; a translation unit that translates the speech recognition result received by the speech recognition result receiving unit into a target language using the translation model selected by the translation model selection unit, and acquires a translation result; and a translation result transmission unit that transmits the translation result.
- the software for realizing the translation server device in the present embodiment is also a program for causing a computer to function as: a fourth speaker attribute receiving unit that receives one or more speaker attributes; a speech recognition result receiving unit that receives a speech recognition result; a translation unit that translates the speech recognition result received by the speech recognition result receiving unit into a target language using a translation model stored in a storage medium, and acquires a translation result; a speech synthesis server selection unit that selects one of two or more speech synthesis server devices according to the one or more speaker attributes; and a translation result transmission unit that transmits the translation result to the speech synthesis server device selected by the speech synthesis server selection unit.
- the software that realizes the translation server device is also a program for causing the computer to further function as: a language speaker attribute acquisition unit that acquires one or more speaker attributes related to language from the speech recognition result received by the speech recognition result receiving unit; and a fourth speaker attribute accumulation unit that accumulates the one or more speaker attributes acquired by the language speaker attribute acquisition unit in a storage medium.
- the software for realizing the speech synthesis server device is a program for causing a computer to function as: a fifth speaker attribute receiving unit that receives one or more speaker attributes; a translation result receiving unit that receives a translation result; a speech synthesis model selection unit that selects one speech synthesis model from two or more speech synthesis models stored in a storage medium according to the one or more speaker attributes received by the fifth speaker attribute receiving unit; a speech synthesis unit that synthesizes speech from the translation result using the speech synthesis model selected by the speech synthesis model selection unit, and obtains a speech synthesis result; and a speech synthesis result transmission unit that transmits the speech synthesis result to the second terminal apparatus.
- FIG. 31 shows the external appearance of a computer that executes the program described in this specification to realize the speech translation system or the like of the above-described embodiment.
- the above-described embodiments can be realized by computer hardware and a computer program executed thereon.
- FIG. 31 is an overview diagram of the computer system 340
- FIG. 32 is a diagram showing an internal configuration of the computer system 340.
- the computer system 340 includes a computer 341 including an FD drive 3411 and a CD-ROM drive 3412, a keyboard 342, a mouse 343, and a monitor 344.
- the computer 341 includes an MPU 3413, a bus 3414 connected to the CD-ROM drive 3412 and the FD drive 3411, a ROM 3415 for storing a program such as a boot-up program, a RAM 3416 for temporarily storing application program instructions and providing a temporary storage space, and a hard disk 3417 for storing application programs, system programs, and data.
- the computer 341 may further include a network card that provides connection to the LAN.
- the program does not necessarily have to include an operating system (OS), or a third-party program, for causing the computer 341 to execute the functions of the speech translation system of the above-described embodiment.
- the program only needs to include an instruction portion that calls an appropriate function (module) in a controlled manner and obtains a desired result. How the computer system 340 operates is well known and will not be described in detail.
- it goes without saying that two or more communication means may be physically realized by a single medium.
- each process may be realized by centralized processing in a single device (system), or may be realized by distributed processing among a plurality of devices.
- the speech translation system may be a single device, with the speech recognition server device, the translation server device, and the speech synthesis server device all included in that one device.
- in such a case, the transmission and reception of information correspond to the delivery of information within the device. That is, the terms "reception" and "transmission" above are to be interpreted broadly.
- the speech translation system when the speech translation system is centrally processed by a single device, the speech translation system has a configuration shown in FIG. 33, for example.
- the speech translation system includes a speech reception unit 3301, a third speaker attribute storage unit 301, a speech recognition model storage unit 302, a speech speaker attribute acquisition unit 304, a speech recognition model selection unit 307, a speech recognition unit 308, a translation model.
- a storage unit 402, a language speaker attribute acquisition unit 404, a translation model selection unit 407, a translation unit 408, a speech synthesis model storage unit 502, a speech synthesis model selection unit 506, a speech synthesis unit 507, and a speech synthesis result output unit 3302 are provided.
- the voice reception unit 3301 receives voice from the user. This voice is a voice to be translated.
- the voice reception unit 3301 can be constituted by, for example, a microphone and its driver software.
- the third speaker attribute storage unit 301 normally stores speaker attributes received from the user.
- the speaker attribute here is usually static speaker attribute information.
- the voice speaker attribute acquisition unit 304 acquires one or more voice speaker attributes from the voice information configured from the voice received by the voice reception unit 3301.
- the voice speaker attribute acquired here is mainly dynamic speaker attribute information, but may be static speaker attribute information.
- the speech recognition model selection unit 307 selects one speech recognition model from the two or more speech recognition models according to one or more of the speaker attributes in the third speaker attribute storage unit 301 or the speaker attributes acquired by the voice speaker attribute acquisition unit 304.
- the speech recognition unit 308 recognizes speech information composed of the speech received by the speech reception unit 3301 using the speech recognition model in the speech recognition model storage unit 302, and acquires a speech recognition result.
- the speech recognition unit 308 recognizes speech information using the speech recognition model selected by the speech recognition model selection unit 307 and acquires a speech recognition result.
- the language speaker attribute acquisition unit 404 acquires one or more language speaker attributes from the speech recognition result acquired by the speech recognition unit 308.
- the translation model selection unit 407 selects one translation model from two or more translation models according to one or more speaker attributes.
- the speaker attributes here are one or more of the speaker attributes in the third speaker attribute storage unit 301, the speaker attributes acquired by the voice speaker attribute acquisition unit 304, and the language speaker attributes acquired by the language speaker attribute acquisition unit 404.
- the translation unit 408 translates the speech recognition result into a target language using the translation model in the translation model storage unit 402, and acquires the translation result.
- note that in the speech translation system, the third speaker attribute storage unit 301, the voice speaker attribute acquisition unit 304, the speech recognition model selection unit 307, the language speaker attribute acquisition unit 404, the translation model selection unit 407, and the speech synthesis model selection unit 506 are not essential components.
Abstract
Description
(Embodiment 1)
The speaker attributes stored in the first speaker attribute storage unit 11 are usually information input by the user of the first terminal device 1, and are usually static speaker attribute information. The first speaker attribute storage unit 11 may store speech translation control information including one or more speaker attributes. In such a case, the first speaker attribute storage unit 11 may be called a first speech translation control information storage unit 11.
The voice speaker attribute acquisition unit 304, for example, performs spectrum analysis on the speech information and acquires one or more feature values. From the one or more feature values, the voice speaker attribute acquisition unit 304 determines speaker attributes such as the speaker's age, gender, speaking speed, and emotion. For example, the unit holds feature-value information (conditions with feature values as parameters) for determining whether a speaker is male and/or female, determines from the acquired feature values whether the speaker is male or female, and acquires gender information (for example, male "0", female "1"). The unit likewise holds feature-value information for determining a specific age or age group (for example, teens, twenties), determines the speaker's age or age group from the acquired feature values, and acquires age or age-group information (for example, "0" for age nine and below, "1" for teens). The voice speaker attribute acquisition unit 304 also analyzes the speech information and acquires the speaking speed (for example, 4.5 sounds per second); since techniques for acquiring speaking speed are publicly known, a detailed description is omitted. The unit may also acquire an emotion (a kind of dynamic speaker attribute information) from the acquired feature values. More specifically, the unit holds, for example, the pitch and power values for the emotion "normal". It then obtains the average, maximum, and minimum of the pitch and power values of the extracted voiced portion. Using the pitch and power values for "normal" together with the average, maximum, and minimum pitch and power of the voiced portion, the unit acquires the emotion "anger" when the average pitch is low and the average power is high. When the minimum pitch is high and the maximum power is low compared with the values for "normal", it acquires the emotion "sadness". When the feature values are large compared with the values for "normal", it acquires the emotion "joy".
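The pitch-and-power emotion judgment described above can be sketched as follows. The baseline values for the "normal" emotion and the exact comparison rules are assumptions for illustration; a real implementation would calibrate them from data.

```python
def classify_emotion(pitch_stats, power_stats, normal):
    """Compare pitch/power statistics of the voiced portion with the
    values held for the "normal" emotion (thresholds are assumptions)."""
    if pitch_stats["mean"] < normal["pitch"] and power_stats["mean"] > normal["power"]:
        return "anger"    # low average pitch, high average power
    if pitch_stats["min"] > normal["pitch"] and power_stats["max"] < normal["power"]:
        return "sadness"  # high minimum pitch, low maximum power
    if pitch_stats["mean"] > normal["pitch"] and power_stats["mean"] > normal["power"]:
        return "joy"      # feature values large overall
    return "normal"

normal = {"pitch": 120.0, "power": 60.0}  # hypothetical baseline
print(classify_emotion({"mean": 100.0, "min": 80.0},
                       {"mean": 70.0, "max": 75.0}, normal))  # anger
```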
It is preferable that the voice speaker attribute acquisition unit 304 acquire the emotion using the power and prosody among the acquired feature values. For a method of acquiring emotions, see the paper at the URL "http://www.kansei.soft.iwate-pu.ac.jp/abstract/2007/0312004126.pdf".
The unit of speech information from which the voice speaker attribute acquisition unit 304 acquires attributes such as speaking speed does not matter. That is, the unit may acquire attributes such as speaking speed per sentence, per word, per recognition result, or per group of sentences.
The first audio output unit 18 and the second audio output unit 28 can be realized by a speaker and its driver software, and the like.
It is preferable that the translation unit 408 translate the speech recognition result into the target language using the translation model selected by the translation model selection unit 407, and acquire the translation result.
The speech synthesis model selection unit 506 selects one speech synthesis model from two or more speech synthesis models according to one or more speaker attributes. The speaker attributes here are one or more of the speaker attributes in the third speaker attribute storage unit 301, the speaker attributes acquired by the voice speaker attribute acquisition unit 304, and the language speaker attributes acquired by the language speaker attribute acquisition unit 404.
The speech synthesis unit 507 synthesizes speech from the translation result using the speech synthesis model in the speech synthesis model storage unit 502, and acquires a speech synthesis result. It is preferable that the speech synthesis unit 507 synthesize speech using the speech synthesis model selected by the speech synthesis model selection unit 506.
The speech synthesis result output unit 3302 outputs the speech synthesis result acquired by the speech synthesis unit 507. "Output" here is a concept that includes audio output through a speaker or the like, transmission to an external device (usually an audio output device), accumulation in a recording medium, and delivery of the processing result to another processing device or another program. The speech synthesis result output unit 3302 can be constituted by a speaker and its driver software, and the like.
Claims (14)
- A speech translation system comprising a first terminal device for inputting speech, two or more speech recognition server devices, one or more translation server devices, and one or more speech synthesis server devices,
wherein the first terminal device comprises:
a first speaker attribute storage unit capable of storing one or more speaker attributes, each being an attribute value of a speaker;
a first speech reception unit that receives speech;
a first speech recognition server selection unit that selects one speech recognition server device among the two or more speech recognition server devices according to the one or more speaker attributes; and
a first speech transmission unit that transmits, to the speech recognition server device selected by the first speech recognition server selection unit, speech information composed of the speech received by the first speech reception unit;
the speech recognition server device comprises:
a speech recognition model storage unit capable of storing a speech recognition model for all of the two or more languages, or for some two or more of the languages;
a speech information receiving unit that receives the speech information;
a speech recognition unit that performs speech recognition on the speech information received by the speech information receiving unit using a speech recognition model in the speech recognition model storage unit, and acquires a speech recognition result; and
a speech recognition result transmission unit that transmits the speech recognition result;
the translation server device comprises:
a translation model storage unit capable of storing a translation model for all of the two or more languages, or for some two or more of the languages;
a speech recognition result receiving unit that receives the speech recognition result;
a translation unit that translates the speech recognition result received by the speech recognition result receiving unit into a target language using a translation model in the translation model storage unit, and acquires a translation result; and
a translation result transmission unit that transmits the translation result; and
the speech synthesis server device comprises:
a speech synthesis model storage unit capable of storing a speech synthesis model for all of the two or more languages, or for some two or more of the languages;
a translation result receiving unit that receives the translation result;
a speech synthesis unit that performs speech synthesis on the translation result received by the translation result receiving unit using a speech synthesis model in the speech synthesis model storage unit, and acquires a speech synthesis result; and
a speech synthesis result transmission unit that transmits the speech synthesis result to a second terminal device.
- A speech translation system comprising a first terminal device for inputting speech, one or more speech recognition server devices, one or more translation server devices, and one or more speech synthesis server devices,
wherein the first terminal device comprises:
a first speech reception unit that receives speech; and
a first speech transmission unit that transmits speech information composed of the speech received by the first speech reception unit to the speech recognition server device;
the speech recognition server device comprises:
a third speaker attribute storage unit capable of storing one or more speaker attributes, each being an attribute value of a speaker;
a speech recognition model storage unit capable of storing two or more speech recognition models for all of the two or more languages, or for some two or more of the languages;
a speech information receiving unit that receives the speech information;
a speech recognition model selection unit that selects one speech recognition model from the two or more speech recognition models according to the one or more speaker attributes;
a speech recognition unit that performs speech recognition on the speech information received by the speech information receiving unit using the speech recognition model selected by the speech recognition model selection unit, and acquires a speech recognition result; and
a speech recognition result transmission unit that transmits the speech recognition result;
the translation server device comprises:
a translation model storage unit capable of storing a translation model for all of the two or more languages, or for some two or more of the languages;
a speech recognition result receiving unit that receives the speech recognition result;
a translation unit that translates the speech recognition result received by the speech recognition result receiving unit into a target language using a translation model in the translation model storage unit, and acquires a translation result; and
a translation result transmission unit that transmits the translation result; and
the speech synthesis server device comprises:
a speech synthesis model storage unit capable of storing a speech synthesis model for all of the two or more languages, or for some two or more of the languages;
a translation result receiving unit that receives the translation result;
a speech synthesis unit that performs speech synthesis on the translation result received by the translation result receiving unit using a speech synthesis model in the speech synthesis model storage unit, and acquires a speech synthesis result; and
a speech synthesis result transmission unit that transmits the speech synthesis result to a second terminal device.
- A speech translation system comprising one or more speech recognition server devices, two or more translation server devices, and one or more speech synthesis server devices,
wherein the speech recognition server device comprises:
a third speaker attribute storage unit capable of storing one or more speaker attributes, each being an attribute value of a speaker;
a speech recognition model storage unit capable of storing a speech recognition model for all of the two or more languages, or for some two or more of the languages;
a speech information receiving unit that receives speech information;
a speech recognition unit that performs speech recognition on the speech information received by the speech information receiving unit using a speech recognition model in the speech recognition model storage unit, and acquires a speech recognition result;
a translation server selection unit that selects one translation server device among the two or more translation server devices according to the one or more speaker attributes; and
a speech recognition result transmission unit that transmits the speech recognition result to the translation server device selected by the translation server selection unit;
the translation server device comprises:
a translation model storage unit capable of storing a translation model for all of the two or more languages, or for some two or more of the languages;
a speech recognition result receiving unit that receives the speech recognition result;
a translation unit that translates the speech recognition result received by the speech recognition result receiving unit into a target language using a translation model in the translation model storage unit, and acquires a translation result; and
a translation result transmission unit that transmits the translation result; and
the speech synthesis server device comprises:
a speech synthesis model storage unit capable of storing a speech synthesis model for all of the two or more languages, or for some two or more of the languages;
a translation result receiving unit that receives the translation result;
a speech synthesis unit that performs speech synthesis on the translation result received by the translation result receiving unit using a speech synthesis model in the speech synthesis model storage unit, and acquires a speech synthesis result; and
a speech synthesis result transmission unit that transmits the speech synthesis result to a second terminal device.
- A speech translation system comprising one or more speech recognition server devices, one or more translation server devices, and one or more speech synthesis server devices,
wherein the speech recognition server device comprises:
a speech recognition model storage unit capable of storing a speech recognition model for all of the two or more languages, or for some two or more of the languages;
a speech information receiving unit that receives speech information;
a speech recognition unit that performs speech recognition on the speech information received by the speech information receiving unit using a speech recognition model in the speech recognition model storage unit, and acquires a speech recognition result; and
a speech recognition result transmission unit that transmits the speech recognition result to the translation server device;
the translation server device comprises:
a translation model storage unit capable of storing two or more translation models for all of the two or more languages, or for some two or more of the languages;
a fourth speaker attribute storage unit capable of storing one or more speaker attributes;
a speech recognition result receiving unit that receives the speech recognition result;
a translation model selection unit that selects one translation model from the two or more translation models according to the one or more speaker attributes;
a translation unit that translates the speech recognition result received by the speech recognition result receiving unit into a target language using the translation model selected by the translation model selection unit, and acquires a translation result; and
a translation result transmission unit that transmits the translation result; and
the speech synthesis server device comprises:
a speech synthesis model storage unit capable of storing a speech synthesis model for all of the two or more languages, or for some two or more of the languages;
a translation result receiving unit that receives the translation result;
a speech synthesis unit that performs speech synthesis on the translation result received by the translation result receiving unit using a speech synthesis model in the speech synthesis model storage unit, and acquires a speech synthesis result; and
a speech synthesis result transmission unit that transmits the speech synthesis result to a second terminal device.
- A speech translation system comprising one or more speech recognition server devices, one or more translation server devices, and two or more speech synthesis server devices,
wherein the speech recognition server device comprises:
a speech recognition model storage unit capable of storing a speech recognition model for all of the two or more languages, or for some two or more of the languages;
a speech information receiving unit that receives speech information;
a speech recognition unit that performs speech recognition on the speech information received by the speech information receiving unit using a speech recognition model in the speech recognition model storage unit, and acquires a speech recognition result; and
a speech recognition result transmission unit that transmits the speech recognition result to the translation server device;
the translation server device comprises:
a translation model storage unit capable of storing a translation model for all of the two or more languages, or for some two or more of the languages;
a fourth speaker attribute storage unit capable of storing one or more speaker attributes;
a speech recognition result receiving unit that receives the speech recognition result;
a translation unit that translates the speech recognition result received by the speech recognition result receiving unit into a target language using a translation model in the translation model storage unit, and acquires a translation result;
a speech synthesis server selection unit that selects one speech synthesis server device among the two or more speech synthesis server devices according to the one or more speaker attributes; and
a translation result transmission unit that transmits the translation result to the speech synthesis server device selected by the speech synthesis server selection unit; and
the speech synthesis server device comprises:
a speech synthesis model storage unit capable of storing a speech synthesis model for all of the two or more languages, or for some two or more of the languages;
a translation result receiving unit that receives the translation result;
a speech synthesis unit that performs speech synthesis on the translation result received by the translation result receiving unit using a speech synthesis model in the speech synthesis model storage unit, and acquires a speech synthesis result; and
a speech synthesis result transmission unit that transmits the speech synthesis result to a second terminal device.
- A speech translation system comprising one or more speech recognition server devices, one or more translation server devices, and one or more speech synthesis server devices,
wherein the speech recognition server device comprises:
a speech recognition model storage unit capable of storing a speech recognition model for all of the two or more languages, or for some two or more of the languages;
a speech information receiving unit that receives speech information;
a speech recognition unit that performs speech recognition on the speech information received by the speech information receiving unit using a speech recognition model in the speech recognition model storage unit, and acquires a speech recognition result; and
a speech recognition result transmission unit that transmits the speech recognition result to the translation server device;
the translation server device comprises:
a translation model storage unit capable of storing a translation model for all of the two or more languages, or for some two or more of the languages;
a speech recognition result receiving unit that receives the speech recognition result;
a translation unit that translates the speech recognition result received by the speech recognition result receiving unit into a target language using a translation model in the translation model storage unit, and acquires a translation result; and
a translation result transmission unit that transmits the translation result to the speech synthesis server device; and
the speech synthesis server device comprises:
a speech synthesis model storage unit capable of storing two or more speech synthesis models for all of the two or more languages, or for some two or more of the languages;
a fifth speaker attribute storage unit capable of storing one or more speaker attributes;
a translation result receiving unit that receives the translation result;
a speech synthesis model selection unit that selects one speech synthesis model from the two or more speech synthesis models according to the one or more speaker attributes;
a speech synthesis unit that performs speech synthesis on the translation result received by the translation result receiving unit using the speech synthesis model selected by the speech synthesis model selection unit, and acquires a speech synthesis result; and
a speech synthesis result transmission unit that transmits the speech synthesis result to a second terminal device.
- The speech translation system according to claim 1, wherein the first terminal device comprises:
a first speaker attribute reception unit that receives one or more speaker attributes; and
a first speaker attribute accumulation unit that accumulates the one or more speaker attributes in the first speaker attribute storage unit.
- The speech translation system according to claim 2 or claim 3, wherein the speech recognition server device further comprises:
a voice speaker attribute acquisition unit that acquires one or more speaker attributes related to speech from the speech information received by the speech information receiving unit; and
a third speaker attribute accumulation unit that accumulates the one or more speaker attributes acquired by the voice speaker attribute acquisition unit in the third speaker attribute storage unit.
- The speech translation system according to claim 4 or claim 5, wherein the translation server device further comprises:
a language speaker attribute acquisition unit that acquires one or more speaker attributes related to language from the speech recognition result received by the speech recognition result receiving unit; and
a fourth speaker attribute accumulation unit that accumulates the one or more speaker attributes acquired by the language speaker attribute acquisition unit in the fourth speaker attribute storage unit.
- The speech translation system according to claim 1, wherein speech translation control information, including a source language identifier specifying a source language used by the speaker, a target language identifier specifying a target language into which translation is performed, and one or more speaker attributes, is transmitted from the speech recognition server device via the one or more translation server devices to the speech synthesis server device, and
the speech recognition server selection unit, the speech recognition unit, the speech recognition model selection unit, the translation server selection unit, the translation unit, the translation model selection unit, the speech synthesis server selection unit, the speech synthesis unit, or the speech synthesis model selection unit
performs its respective processing using the speech translation control information.
- A first terminal device constituting the speech translation system according to claim 1.
- A speech recognition server device constituting the speech translation system according to claim 2 or claim 3.
- A translation server device constituting the speech translation system according to claim 4 or claim 5.
- A speech synthesis server device constituting the speech translation system according to claim 6.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020127008314A KR101683943B1 (ko) | 2009-10-02 | 2010-03-03 | 음성번역 시스템, 제1 단말장치, 음성인식 서버장치, 번역 서버장치, 및 음성합성 서버장치 |
EP10820177.3A EP2485212A4 (en) | 2009-10-02 | 2010-03-03 | LANGUAGE TRANSLATION SYSTEM, FIRST END DEVICE, VOICE RECOGNITION SERVER, TRANSLATION SERVER AND LANGUAGE SYNTHESIS SERV |
CN201080043645.3A CN102549653B (zh) | 2009-10-02 | 2010-03-03 | 语音翻译系统、第一终端装置、语音识别服务器装置、翻译服务器装置以及语音合成服务器装置 |
JP2011534094A JP5598998B2 (ja) | 2009-10-02 | 2010-03-03 | 音声翻訳システム、第一端末装置、音声認識サーバ装置、翻訳サーバ装置、および音声合成サーバ装置 |
US13/499,311 US8862478B2 (en) | 2009-10-02 | 2010-03-03 | Speech translation system, first terminal apparatus, speech recognition server, translation server, and speech synthesis server |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2009230442 | 2009-10-02 | ||
JP2009-230442 | 2009-10-02 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2011040056A1 true WO2011040056A1 (ja) | 2011-04-07 |
Family
ID=43825894
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2010/053419 WO2011040056A1 (ja) | 2009-10-02 | 2010-03-03 | 音声翻訳システム、第一端末装置、音声認識サーバ装置、翻訳サーバ装置、および音声合成サーバ装置 |
Country Status (6)
Country | Link |
---|---|
US (1) | US8862478B2 (ja) |
EP (1) | EP2485212A4 (ja) |
JP (1) | JP5598998B2 (ja) |
KR (1) | KR101683943B1 (ja) |
CN (2) | CN102549653B (ja) |
WO (1) | WO2011040056A1 (ja) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013202080A (ja) * | 2012-03-27 | 2013-10-07 | Advanced Telecommunication Research Institute International | コミュニケーションシステム、コミュニケーション装置、プログラムおよびコミュニケーション制御方法 |
JP2014519627A (ja) * | 2011-06-13 | 2014-08-14 | エムモーダル アイピー エルエルシー | 疎結合コンポーネントを使用した音声認識 |
WO2015004909A1 (ja) * | 2013-07-10 | 2015-01-15 | パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ | 話者識別方法及び話者識別システム |
CN105161112A (zh) * | 2015-09-21 | 2015-12-16 | 百度在线网络技术(北京)有限公司 | 语音识别方法和装置 |
JP2015537258A (ja) * | 2012-12-12 | 2015-12-24 | アマゾン テクノロジーズ インコーポレーテッド | 分散音声認識システムにおける音声モデル検索 |
JP2018060362A (ja) * | 2016-10-05 | 2018-04-12 | 株式会社リコー | 情報処理システム、情報処理装置、及び情報処理方法 |
US10216729B2 (en) | 2013-08-28 | 2019-02-26 | Electronics And Telecommunications Research Institute | Terminal device and hands-free device for hands-free automatic interpretation service, and hands-free automatic interpretation service method |
JP2019049742A (ja) * | 2012-08-10 | 2019-03-28 | ADC Technology Inc. | Voice response device |
WO2019111346A1 (ja) * | 2017-12-06 | 2019-06-13 | Sourcenext Corporation | Bidirectional speech translation system, bidirectional speech translation method, and program |
WO2019225028A1 (ja) * | 2018-05-25 | 2019-11-28 | Panasonic Intellectual Property Management Co., Ltd. | Translation device, system, method, program, and learning method |
JP2020155944A (ja) * | 2019-03-20 | 2020-09-24 | Ricoh Company, Ltd. | Speaker detection system, speaker detection method, and program |
USD897307S1 (en) | 2018-05-25 | 2020-09-29 | Sourcenext Corporation | Translator |
KR20220048578A (ko) * | 2020-10-13 | 2022-04-20 | KT Corporation | Cache server and method for adjusting a speech synthesis schedule, and speech synthesis server for performing speech synthesis |
Families Citing this family (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130243207A1 (en) * | 2010-11-25 | 2013-09-19 | Telefonaktiebolaget L M Ericsson (Publ) | Analysis system and method for audio data |
US9053096B2 (en) * | 2011-12-01 | 2015-06-09 | Elwha Llc | Language translation based on speaker-related information |
US8811638B2 (en) | 2011-12-01 | 2014-08-19 | Elwha Llc | Audible assistance |
US9107012B2 (en) | 2011-12-01 | 2015-08-11 | Elwha Llc | Vehicular threat detection based on audio signals |
US9159236B2 (en) | 2011-12-01 | 2015-10-13 | Elwha Llc | Presentation of shared threat information in a transportation-related context |
US9245254B2 (en) | 2011-12-01 | 2016-01-26 | Elwha Llc | Enhanced voice conferencing with history, language translation and identification |
US8934652B2 (en) | 2011-12-01 | 2015-01-13 | Elwha Llc | Visual presentation of speaker-related information |
US10875525B2 (en) | 2011-12-01 | 2020-12-29 | Microsoft Technology Licensing Llc | Ability enhancement |
US9368028B2 (en) | 2011-12-01 | 2016-06-14 | Microsoft Technology Licensing, Llc | Determining threats based on information from road-based devices in a transportation-related context |
US9064152B2 (en) | 2011-12-01 | 2015-06-23 | Elwha Llc | Vehicular threat detection based on image analysis |
JP5727980B2 (ja) * | 2012-09-28 | 2015-06-03 | Toshiba Corporation | Expression conversion device, method, and program |
CN103811003B (zh) * | 2012-11-13 | 2019-09-24 | Lenovo (Beijing) Co., Ltd. | Speech recognition method and electronic device |
US9959865B2 (en) | 2012-11-13 | 2018-05-01 | Beijing Lenovo Software Ltd. | Information processing method with voice recognition |
US9135916B2 (en) * | 2013-02-26 | 2015-09-15 | Honeywell International Inc. | System and method for correcting accent induced speech transmission problems |
CN104700836B (zh) | 2013-12-10 | 2019-01-29 | Alibaba Group Holding Limited | Speech recognition method and system |
US9230542B2 (en) * | 2014-04-01 | 2016-01-05 | Zoom International S.R.O. | Language-independent, non-semantic speech analytics |
US9412358B2 (en) | 2014-05-13 | 2016-08-09 | At&T Intellectual Property I, L.P. | System and method for data-driven socially customized models for language generation |
US9437189B2 (en) * | 2014-05-29 | 2016-09-06 | Google Inc. | Generating language models |
US9678954B1 (en) * | 2015-10-29 | 2017-06-13 | Google Inc. | Techniques for providing lexicon data for translation of a single word speech input |
WO2017187712A1 (ja) * | 2016-04-26 | 2017-11-02 | Sony Interactive Entertainment Inc. | Information processing device |
EP3455853A2 (en) * | 2016-05-13 | 2019-03-20 | Bose Corporation | Processing speech from distributed microphones |
KR102596430B1 (ko) * | 2016-08-31 | 2023-10-31 | Samsung Electronics Co., Ltd. | Speech recognition method and apparatus based on speaker recognition |
KR101917648B1 (ko) | 2016-09-08 | 2018-11-13 | Hyperconnect Inc. | Terminal and control method thereof |
CN106550156A (zh) * | 2017-01-23 | 2017-03-29 | 苏州咖啦魔哆信息技术有限公司 | Artificial-intelligence customer service system based on speech recognition and implementation method thereof |
CN108364633A (zh) * | 2017-01-25 | 2018-08-03 | MStar Semiconductor, Inc. | Text-to-speech system and text-to-speech method |
JP7197259B2 (ja) * | 2017-08-25 | 2022-12-27 | Panasonic Intellectual Property Corporation of America | Information processing method, information processing device, and program |
KR102450823B1 (ko) | 2017-10-12 | 2022-10-05 | Electronics and Telecommunications Research Institute | User-customized interpretation and translation apparatus and method |
US10665234B2 (en) * | 2017-10-18 | 2020-05-26 | Motorola Mobility Llc | Detecting audio trigger phrases for a voice recognition session |
CN110021290A (zh) * | 2018-01-08 | 2019-07-16 | Siemens Shanghai Medical Equipment Ltd. | Medical system and real-time language conversion method for a medical system |
US10691894B2 (en) * | 2018-05-01 | 2020-06-23 | Disney Enterprises, Inc. | Natural polite language generation system |
KR102107447B1 (ko) * | 2018-07-03 | 2020-06-02 | Hancom Inc. | Text-to-speech device providing a translation function based on selective application of speech models, and operating method thereof |
JP7143665B2 (ja) * | 2018-07-27 | 2022-09-29 | Fujitsu Limited | Speech recognition device, speech recognition program, and speech recognition method |
CN109388699A (zh) | 2018-10-24 | 2019-02-26 | Beijing Xiaomi Mobile Software Co., Ltd. | Input method, apparatus, device, and storage medium |
CN109861904B (zh) * | 2019-02-19 | 2021-01-05 | Tianjin ByteDance Technology Co., Ltd. | Name tag display method and device |
JPWO2021192719A1 (ja) * | 2020-03-27 | 2021-09-30 | ||
US20230351123A1 (en) * | 2022-04-29 | 2023-11-02 | Zoom Video Communications, Inc. | Providing multistream machine translation during virtual conferences |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2000148176A (ja) * | 1998-11-18 | 2000-05-26 | Sony Corp | Information processing device and method, providing medium, speech recognition system, speech synthesis system, translation device and method, and translation system |
JP2002311983A (ja) * | 2001-04-11 | 2002-10-25 | Atr Onsei Gengo Tsushin Kenkyusho:Kk | Translation telephone system |
JP2003058458A (ja) * | 2001-08-14 | 2003-02-28 | Nippon Telegr & Teleph Corp <Ntt> | Multilingual remote multi-user communication system |
JP2005031758A (ja) * | 2003-07-07 | 2005-02-03 | Canon Inc | Speech processing apparatus and method |
JP2005140988A (ja) * | 2003-11-06 | 2005-06-02 | Canon Inc | Speech recognition apparatus and method |
JP2008243080A (ja) | 2007-03-28 | 2008-10-09 | Toshiba Corp | Apparatus, method, and program for translating speech |
JP2009140503A (ja) | 2007-12-10 | 2009-06-25 | Toshiba Corp | Speech translation method and apparatus |
JP2009527818A (ja) * | 2006-02-17 | 2009-07-30 | Google Inc. | Encoding and adaptive, scalable accessing of distributed models |
Family Cites Families (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6052657A (en) * | 1997-09-09 | 2000-04-18 | Dragon Systems, Inc. | Text segmentation and identification of topic using language models |
US6219638B1 (en) * | 1998-11-03 | 2001-04-17 | International Business Machines Corporation | Telephone messaging and editing system |
US7263489B2 (en) * | 1998-12-01 | 2007-08-28 | Nuance Communications, Inc. | Detection of characteristics of human-machine interactions for dialog customization and analysis |
US6278968B1 (en) * | 1999-01-29 | 2001-08-21 | Sony Corporation | Method and apparatus for adaptive speech recognition hypothesis construction and selection in a spoken language translation system |
US6266642B1 (en) * | 1999-01-29 | 2001-07-24 | Sony Corporation | Method and portable apparatus for performing spoken language translation |
JP4517260B2 (ja) * | 2000-09-11 | 2010-08-04 | NEC Corporation | Automatic interpretation system, automatic interpretation method, and storage medium recording a program for automatic interpretation |
EP1217609A3 (en) * | 2000-12-22 | 2004-02-25 | Hewlett-Packard Company | Speech recognition |
JP2002245038A (ja) * | 2001-02-21 | 2002-08-30 | Ricoh Co Ltd | Multilingual translation system using portable terminal devices |
US6996525B2 (en) * | 2001-06-15 | 2006-02-07 | Intel Corporation | Selecting one of multiple speech recognizers in a system based on performance predictions resulting from experience |
JP2004048277A (ja) * | 2002-07-10 | 2004-02-12 | Mitsubishi Electric Corp | Communication system |
US7228275B1 (en) * | 2002-10-21 | 2007-06-05 | Toyota Infotechnology Center Co., Ltd. | Speech recognition system having multiple speech recognizers |
CN1221937C (zh) * | 2002-12-31 | 2005-10-05 | Beijing Tianlang Speech Technology Co., Ltd. | Speaking-rate-adaptive speech recognition system |
US20050144012A1 (en) * | 2003-11-06 | 2005-06-30 | Alireza Afrashteh | One button push to translate languages over a wireless cellular radio |
JP2005202884A (ja) * | 2004-01-19 | 2005-07-28 | Toshiba Corp | Transmission device, reception device, relay device, and transmission/reception system |
US8036893B2 (en) * | 2004-07-22 | 2011-10-11 | Nuance Communications, Inc. | Method and system for identifying and correcting accent-induced speech recognition difficulties |
US7624013B2 (en) * | 2004-09-10 | 2009-11-24 | Scientific Learning Corporation | Word competition models in voice recognition |
JP2006099296A (ja) * | 2004-09-29 | 2006-04-13 | NEC Corp | Translation system, translation communication system, machine translation method, and program |
WO2006083690A2 (en) * | 2005-02-01 | 2006-08-10 | Embedded Technologies, Llc | Language engine coordination and switching |
JP4731174B2 (ja) * | 2005-02-04 | 2011-07-20 | KDDI Corporation | Speech recognition device, speech recognition system, and computer program |
CN1953052B (zh) * | 2005-10-20 | 2010-09-08 | Toshiba Corporation | Method and apparatus for training a duration prediction model, duration prediction, and speech synthesis |
WO2007070558A2 (en) * | 2005-12-12 | 2007-06-21 | Meadan, Inc. | Language translation using a hybrid network of human and machine translators |
US7822606B2 (en) * | 2006-07-14 | 2010-10-26 | Qualcomm Incorporated | Method and apparatus for generating audio information from received synthesis information |
US7881928B2 (en) * | 2006-09-01 | 2011-02-01 | International Business Machines Corporation | Enhanced linguistic transformation |
US7702510B2 (en) * | 2007-01-12 | 2010-04-20 | Nuance Communications, Inc. | System and method for dynamically selecting among TTS systems |
CN101266600A (zh) * | 2008-05-07 | 2008-09-17 | 陈光火 | Multimedia multilingual interactive simultaneous translation method |
US8868430B2 (en) * | 2009-01-16 | 2014-10-21 | Sony Corporation | Methods, devices, and computer program products for providing real-time language translation capabilities between communication terminals |
US8515749B2 (en) * | 2009-05-20 | 2013-08-20 | Raytheon Bbn Technologies Corp. | Speech-to-speech translation |
US8386235B2 (en) * | 2010-05-20 | 2013-02-26 | Acosys Limited | Collaborative translation system and method |
2010
- 2010-03-03 JP JP2011534094A patent/JP5598998B2/ja active Active
- 2010-03-03 KR KR1020127008314A patent/KR101683943B1/ko active IP Right Grant
- 2010-03-03 CN CN201080043645.3A patent/CN102549653B/zh not_active Expired - Fee Related
- 2010-03-03 WO PCT/JP2010/053419 patent/WO2011040056A1/ja active Application Filing
- 2010-03-03 US US13/499,311 patent/US8862478B2/en not_active Expired - Fee Related
- 2010-03-03 CN CN201310130953.5A patent/CN103345467B/zh not_active Expired - Fee Related
- 2010-03-03 EP EP10820177.3A patent/EP2485212A4/en not_active Withdrawn
Non-Patent Citations (1)
Title |
---|
See also references of EP2485212A4 |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9454961B2 (en) | 2011-06-13 | 2016-09-27 | Mmodal Ip Llc | Speech recognition using loosely coupled components |
JP2014519627A (ja) * | 2011-06-13 | 2014-08-14 | Mmodal IP LLC | Speech recognition using loosely coupled components |
US9666190B2 (en) | 2011-06-13 | 2017-05-30 | Mmodal Ip Llc | Speech recognition using loosely coupled components |
JP2013202080A (ja) * | 2012-03-27 | 2013-10-07 | Advanced Telecommunication Research Institute International | Communication system, communication device, program, and communication control method |
JP2019049742A (ja) * | 2012-08-10 | 2019-03-28 | ADC Technology Inc. | Voice response device |
JP2015537258A (ja) * | 2012-12-12 | 2015-12-24 | Amazon Technologies, Inc. | Speech model retrieval in distributed speech recognition systems |
US10152973B2 (en) | 2012-12-12 | 2018-12-11 | Amazon Technologies, Inc. | Speech model retrieval in distributed speech recognition systems |
WO2015004909A1 (ja) * | 2013-07-10 | 2015-01-15 | Panasonic Intellectual Property Corporation of America | Speaker identification method and speaker identification system |
JPWO2015004909A1 (ja) * | 2013-07-10 | 2017-03-02 | Panasonic Intellectual Property Corporation of America | Speaker identification method and speaker identification system |
US9349372B2 (en) | 2013-07-10 | 2016-05-24 | Panasonic Intellectual Property Corporation Of America | Speaker identification method, and speaker identification system |
US10216729B2 (en) | 2013-08-28 | 2019-02-26 | Electronics And Telecommunications Research Institute | Terminal device and hands-free device for hands-free automatic interpretation service, and hands-free automatic interpretation service method |
CN105161112A (zh) * | 2015-09-21 | 2015-12-16 | Baidu Online Network Technology (Beijing) Co., Ltd. | Speech recognition method and device |
JP2022046660A (ja) * | 2016-10-05 | 2022-03-23 | Ricoh Company, Ltd. | Information processing system, information processing apparatus, program, and information processing method |
JP2018060362A (ja) * | 2016-10-05 | 2018-04-12 | Ricoh Company, Ltd. | Information processing system, information processing apparatus, and information processing method |
US12008335B2 (en) | 2016-10-05 | 2024-06-11 | Ricoh Company, Ltd. | Information processing system, information processing apparatus, and information processing method |
JP7338676B2 (ja) | 2016-10-05 | 2023-09-05 | Ricoh Company, Ltd. | Information processing system, information processing apparatus, program, and information processing method |
US10956686B2 (en) | 2016-10-05 | 2021-03-23 | Ricoh Company, Ltd. | Information processing system, information processing apparatus, and information processing method |
JP7000671B2 (ja) | 2016-10-05 | 2022-01-19 | Ricoh Company, Ltd. | Information processing system, information processing apparatus, and information processing method |
WO2019111346A1 (ja) * | 2017-12-06 | 2019-06-13 | Sourcenext Corporation | Bidirectional speech translation system, bidirectional speech translation method, and program |
JPWO2019111346A1 (ja) * | 2017-12-06 | 2020-10-22 | Sourcenext Corporation | Bidirectional speech translation system, bidirectional speech translation method, and program |
USD897307S1 (en) | 2018-05-25 | 2020-09-29 | Sourcenext Corporation | Translator |
WO2019225028A1 (ja) * | 2018-05-25 | 2019-11-28 | Panasonic Intellectual Property Management Co., Ltd. | Translation device, system, method, program, and learning method |
JP7259447B2 (ja) | 2019-03-20 | 2023-04-18 | Ricoh Company, Ltd. | Speaker detection system, speaker detection method, and program |
JP2020155944A (ja) * | 2019-03-20 | 2020-09-24 | Ricoh Company, Ltd. | Speaker detection system, speaker detection method, and program |
KR20220048578A (ko) * | 2020-10-13 | 2022-04-20 | KT Corporation | Cache server and method for adjusting a speech synthesis schedule, and speech synthesis server for performing speech synthesis |
KR102428296B1 (ko) * | 2020-10-13 | 2022-08-02 | KT Corporation | Cache server and method for adjusting a speech synthesis schedule, and speech synthesis server for performing speech synthesis |
Also Published As
Publication number | Publication date |
---|---|
EP2485212A4 (en) | 2016-12-07 |
CN102549653A (zh) | 2012-07-04 |
US8862478B2 (en) | 2014-10-14 |
JPWO2011040056A1 (ja) | 2013-02-21 |
EP2485212A1 (en) | 2012-08-08 |
CN102549653B (zh) | 2014-04-30 |
KR20120086287A (ko) | 2012-08-02 |
CN103345467B (zh) | 2017-06-09 |
JP5598998B2 (ja) | 2014-10-01 |
CN103345467A (zh) | 2013-10-09 |
US20120197629A1 (en) | 2012-08-02 |
KR101683943B1 (ko) | 2016-12-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5598998B2 (ja) | Speech translation system, first terminal apparatus, speech recognition server, translation server, and speech synthesis server | |
JP5545467B2 (ja) | Speech translation system, control device, and information processing method | |
US7689417B2 (en) | Method, system and apparatus for improved voice recognition | |
CN102792294B (zh) | System and method for hybrid processing in a natural language voice services environment | |
US9761241B2 (en) | System and method for providing network coordinated conversational services | |
JP2023022150A (ja) | Bidirectional speech translation system, bidirectional speech translation method, and program | |
WO2014010450A1 (ja) | Speech processing system and terminal device | |
JP5062171B2 (ja) | Speech recognition system, speech recognition method, and speech recognition program | |
JP2018017936A (ja) | Voice dialogue device, server device, voice dialogue method, speech processing method, and program | |
EP1899955B1 (en) | Speech dialog method and system | |
JP6580281B1 (ja) | Translation device, translation method, and translation program | |
JP5704686B2 (ja) | Speech translation system, speech translation device, speech translation method, and program | |
KR102376552B1 (ko) | Speech synthesis apparatus and speech synthesis method | |
US20170185587A1 (en) | Machine translation method and machine translation system | |
Fischer et al. | Towards multi-modal interfaces for embedded devices | |
JP2017009685A (ja) | Information processing device, information processing method, and program | |
CN118101877A (zh) | Subtitle generation method, system, storage medium, and electronic device for real-time communication | |
Di Fabbrizio et al. | Speech Mashups | |
JP2018097201A (ja) | Voice dialogue device and dialogue method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 201080043645.3 Country of ref document: CN |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 10820177 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2011534094 Country of ref document: JP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2010820177 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref document number: 20127008314 Country of ref document: KR Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 13499311 Country of ref document: US |