US20120215528A1 - Speech recognition system, speech recognition request device, speech recognition method, speech recognition program, and recording medium
- Publication number
- US20120215528A1 (Application No. US13/504,264)
- Authority
- US
- United States
- Prior art keywords
- speech recognition
- data
- speech
- mapping function
- acoustic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
- G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
- G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
- G10L15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187: Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
- G10L15/26: Speech to text systems
- G06F21/32: User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
- G10L17/00: Speaker identification or verification techniques
- G10L2015/025: Phonemes, fenemes or fenones being the recognition units
Definitions
- This invention relates to a speech recognition system, a speech recognition method, and a speech recognition program. Specifically, this invention relates to a speech recognition system, a speech recognition method, and a speech recognition program that prevent a third party from recovering the details of a recognition result, the content of the speech subjected to the speech recognition, the details of a speech recognition dictionary, or the like.
- A speech recognition technology using an information processing system extracts the language information included in input speech data.
- A system using the speech recognition technology can serve as a speech word processor if all the speech data are converted into text, and as a speech command input device if keywords included in the speech data are extracted.
- FIG. 7 illustrates an example of a related speech recognition system.
- The speech recognition system illustrated in FIG. 7 includes an utterance segment extraction unit, a feature vector extraction unit, an acoustic likelihood computation unit, a hypothesis search unit, and a database for speech recognition.
- The speech recognition system including such components operates as follows.
- First, a feature vector is extracted by taking out various features included in the speech at regular time intervals (frames).
- The features that are often used include, for example, cepstrum, power, and Δpower.
- A combination of a plurality of features is handled as a sequence (vector) and is referred to as a "feature vector".
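- As a rough illustration, the framing and feature computation described above might look like the following sketch (the frame and hop sizes are hypothetical, and log power with its delta stands in for a real cepstral front end such as MFCCs):

```python
import numpy as np

def extract_feature_vectors(samples, frame_len=400, hop=160):
    """Cut speech into frames (e.g. 25 ms / 10 ms at 16 kHz) and compute a
    toy per-frame feature vector of [log power, delta log power]."""
    feats, prev_logp = [], 0.0
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        logp = np.log(np.sum(frame ** 2) + 1e-10)    # frame power
        feats.append([logp, logp - prev_logp])       # power and delta power
        prev_logp = logp
    return np.array(feats)                           # shape: (num_frames, N)
```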
- The extracted feature vector of the speech is sent to the acoustic likelihood computation unit to obtain its likelihood (acoustic likelihood) with respect to each of a plurality of phonemes that are given in advance.
- The acoustic likelihood is a similarity to the model of each phoneme recorded in the acoustic model of the database.
- The similarity is generally expressed as a "distance" (magnitude of deviation) from the model, and hence "acoustic likelihood computation" is also referred to as "distance calculation".
- Intuitively, phonemes are obtained by dividing a phonetic unit into consonants and vowels, but even the same phoneme exhibits different acoustic features when the preceding or following phoneme differs.
- Phonemes that take the preceding and following phonemes into consideration in this way are referred to as "triphones" (trios of phonemes).
- The acoustic model widely used today expresses state transitions among the phonemes by a Hidden Markov Model (HMM). Accordingly, the acoustic model represents a set of HMMs on a triphone-to-triphone basis. In most implementations, each triphone is assigned an ID (hereinafter referred to as "phoneme ID") and is handled entirely by its phoneme ID in the subsequent processing stages.
- The hypothesis search unit references a language model against the acoustic likelihoods obtained by the acoustic likelihood computation unit to search for the word string having the highest likelihood.
- The language model may be considered as being composed of a dictionary and a strict language model.
- The dictionary gives a list of the vocabulary that can be handled by the (broad-sense) language model.
- Each word entry within the dictionary is assigned the phoneme string (phoneme ID string) of the corresponding word and its representation character string.
- The strict language model includes information obtained by modeling the likelihood (language likelihood) that a given group of words within the vocabulary appears consecutively in a given order. Grammar and N-gram are the strict language models most often used today.
- A grammar directly describes the adequacy of given word concatenations by using words, attributes of words, categories to which words belong, and the like.
- An N-gram is obtained by statistically computing the appearance likelihood of each word concatenation formed of N words based on its actual appearance frequency within a large corpus (text data for learning).
- Each entry of the dictionary is assigned an ID (hereinafter referred to as "word ID"), and the (strict) language model serves as a function that returns a language likelihood when a word ID string is input thereto.
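- As a concrete illustration of the N-gram case, a toy bigram (N = 2) model can be estimated from corpus counts as sketched below (maximum-likelihood counting only; a real model operates on word IDs and applies smoothing for unseen concatenations):

```python
from collections import Counter

def train_bigram(corpus_sentences):
    """Toy maximum-likelihood bigram model: P(w2 | w1) = C(w1 w2) / C(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for words in corpus_sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

p = train_bigram([["turn", "on", "the", "light"], ["turn", "off", "the", "light"]])
print(p("turn", "on"))   # 0.5
```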
- The search processing performed by the hypothesis search unit obtains the likelihoods (acoustic likelihoods) of phonemes from the feature vector string, determines whether conversion from the phoneme ID string into a word ID is allowed, obtains the appearance likelihood (language likelihood) of the word string from the word ID string, and finally finds the word string having the highest likelihood.
- A typical example of this kind of speech recognition system is the one described by T. Kawahara, A. Lee, T. Kobayashi, K. Takeda, N. Minematsu, S. Sagayama, K. Itou, A. Ito, M. Yamamoto, A. Yamada, T. Utsuro, and K. Shikano in "Free software toolkit for Japanese large vocabulary continuous speech recognition," Proc. ICSLP (International Conference on Spoken Language Processing).
- A language model is constructed, for example, by modeling only the words and phrases that appeared in the written minutes of meetings held in the past and in the speech of those meetings, along with their related words and phrases. This enables the vocabulary specific to a particular task or domain to be collected and its appearance patterns to be modeled.
- The acoustic model, on the other hand, is generally obtained by making full use of machine learning technology on a large amount of labeled speech data (a set of speech data annotated with which segment corresponds to which phoneme).
- Such speech data, the collection of which is costly, is not customized for each user but is prepared individually to suit the general properties of the expected use scenes.
- For example, an acoustic model learned from labeled telephone speech data is used for telephone speech recognition.
- Speech recognition is widely applicable to various purposes, but poses the problem of requiring a correspondingly large calculation amount, particularly in the above-mentioned hypothesis search processing.
- The speech recognition technology has developed by resolving the mutually contradictory objectives of increasing recognition precision and reducing the calculation amount, but even today there remain limitations, for example, on the vocabulary size that can be handled by a cellular telephone terminal and the like.
- In such cases it is more effective to execute the speech recognition processing on a remote server that can handle an abundant amount of calculation.
- A client-server form of speech recognition, in which the speech recognition processing is executed on the remote server and only a recognition result (or some action based on the result) is received on the local terminal, is therefore under active development.
- Patent Document 1 discloses an example of a speech recognition system having the implementation form described above.
- The speech recognition system disclosed in Patent Document 1 includes a client terminal and a server that communicate with each other via a network.
- The client terminal includes a speech detection unit (utterance extraction unit) for detecting a speech segment from input speech, a waveform compression unit for compressing the speech data of the detected segment, and a waveform transmission unit for transmitting the compressed waveform data to the server.
- The server includes a waveform reception unit for receiving the compressed waveform data transmitted from the client terminal, a waveform decompression unit for decompressing the received compressed speech, and an analysis unit and a recognition unit for analyzing the decompressed waveform and subjecting it to the speech recognition processing.
- The speech recognition system of Patent Document 1 including such components operates as follows. A sound (speech) taken into the client terminal is divided into speech segments and non-speech segments by the speech detection unit. The speech segments are compressed by the waveform compression unit and then transmitted to the server by the waveform transmission unit. The waveform reception unit of the server receives the data and sends it to the waveform decompression unit. The server causes the analysis unit to extract features from the waveform data decompressed by the waveform decompression unit, and finally causes the recognition unit to execute the speech recognition processing.
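- The flow of Patent Document 1 can be pictured with the following sketch (illustrative only; a simple amplitude threshold and zlib stand in for the document's speech detection and waveform codec, and recognize() is a stub for the server's analysis and recognition units):

```python
import zlib
import numpy as np

def recognize(samples):
    """Placeholder for the server's analysis and recognition units."""
    return "<recognition result>"

def client_send(samples, threshold=1e-3):
    """Client side: keep only the detected speech samples, compress, and ship."""
    voiced = samples[np.abs(samples) > threshold]        # crude speech detection
    return zlib.compress(voiced.astype(np.float32).tobytes())

def server_receive(payload):
    """Server side: decompress the waveform and run analysis/recognition."""
    samples = np.frombuffer(zlib.decompress(payload), dtype=np.float32)
    return recognize(samples)

print(server_receive(client_send(np.random.randn(16000).astype(np.float32))))
```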
- The operation of the speech recognition unit itself is essentially the same as that of a unit operating on a single host.
- In the system described above, the processing of FIG. 7 up to the step performed by the utterance extraction unit is executed by the client terminal, and the subsequent steps are executed by the server.
- In other systems, the processing up to the step corresponding to the feature vector extraction unit is performed on the client terminal.
- The client-server speech recognition technology has been developed mainly with the assumption of use on mobile terminals (such as cellular telephones, PDAs, PHSs, and netbooks).
- Its original object is to overcome the problem that speech recognition is difficult to perform on mobile terminals with poor processing performance because of the large calculation amount involved in the speech recognition processing.
- Recently, the processing performance of mobile terminals has improved while the speech recognition technology has been refined, and hence a client-server speech recognition system is not always necessary.
- Nevertheless, the client-server speech recognition system is drawing ever more attention as a form of software as a service (SaaS).
- The first problem is that the risk that the content of a user's utterance (speech signal) may be leaked to a third party increases in a case where a speech recognition function is realized as a service provided via a network. This is because, even if the secrecy of communications is protected by encrypting the speech data on the communication channel, the speech data may be decoded at least on the speech recognition server that provides the speech recognition service.
- The second problem is that the risk that the content expected to be uttered by the user, or special information related to the task or domain for which the user uses the speech recognition technology, may be leaked to a third party likewise increases in the case where the speech recognition function is realized as a service provided via the network.
- More or less customization of the language model is necessary in order to perform speech recognition with practical accuracy. Specifically, such customization may need to add, to the language model, vocabulary that expresses the special information related to the task or domain.
- The language model is essential in the hypothesis search stage of the speech recognition processing, and hence, in a system that performs the hypothesis search processing on a recognition server, the language model is put into a readable state at least on that server.
- The third party referred to herein includes the party (natural person, juridical person, or other system) that provides the speech recognition service. If a leak only to the speech recognition service provider is no problem, the communication channel and the language model file may simply be encrypted. However, in a case where information is to be kept secret even from the speech recognition service provider, the above-mentioned technology cannot handle the case. Further, other examples of the third party include a hacker or cracker that illegally breaks into a server, and a system (program) that performs such an act. This means that, in the case where the server providing the speech recognition service has been broken into, the speech data, analysis results, special information related to the task or domain, and the like may easily be acquired by the third party, and service users can take no countermeasures at all.
- This invention provides a speech recognition system capable of secret speech recognition which suppresses to a minimum the risk that the content of a user's utterance may be leaked to a third party in a case where a speech recognition function is realized as a service provided via a network.
- This invention also provides a speech recognition system capable of secret speech recognition which suppresses to a minimum the risk that the content expected to be uttered by the user, or special information related to the task or domain for which the user uses the speech recognition technology, may be leaked to a third party in the case where the speech recognition function is realized as a service provided via the network.
- A speech recognition system according to this invention includes: a first information processing device including a speech recognition processing unit for receiving data to be used for speech recognition transmitted via a network, carrying out speech recognition processing, and returning resultant data; and a second information processing device connected to the first information processing device via the network, for transmitting the data to be used for the speech recognition by the speech recognition processing unit after converting it into data having a format that prevents its content from being captured while still enabling the speech recognition processing unit to perform the speech recognition processing, and for constructing, from the resultant data returned by the first information processing device, a content that is a valid recognition result.
- A speech recognition request device according to this invention includes: a communication unit connected via a network to a speech recognition device including a speech recognition processing unit for receiving data to be used for speech recognition transmitted via the network, carrying out speech recognition processing, and returning resultant data; an information conversion unit for converting the data to be used for the speech recognition by the speech recognition processing unit into data having a format that prevents its content from being captured while still enabling the speech recognition processing unit to perform the speech recognition processing; and a recognition result construction unit for reconstructing the resultant data, returned from the speech recognition device after performing the speech recognition on the converted data, into a speech recognition result whose content is a valid recognition result, based on the converted content.
- According to this invention, it is possible to provide a speech recognition system capable of secret speech recognition which suppresses to a minimum the risk that the content of a user's utterance may be leaked to a third party in a case where a speech recognition function is realized as a service provided via a network.
- It is likewise possible to provide a speech recognition system capable of secret speech recognition which suppresses to a minimum the risk that the content expected to be uttered by the user, or special information related to the task or domain for which the user uses the speech recognition technology, may be leaked to a third party in the case where the speech recognition function is realized as a service provided via the network.
- FIG. 1 is a block diagram illustrating a configuration of a first embodiment.
- FIG. 2 is a flowchart illustrating speech recognition processing according to the first embodiment.
- FIG. 3 is a block diagram illustrating a configuration of a second embodiment.
- FIG. 4 is a block diagram illustrating a configuration of a third embodiment.
- FIG. 5 is a block diagram illustrating a configuration of a fourth embodiment.
- FIG. 6 is a block diagram illustrating a configuration of a fifth embodiment.
- FIG. 7 is a block diagram illustrating an example of a configuration of a speech recognition system.
- FIG. 8 is a block diagram illustrating an example of a configuration of the speech recognition system having a client-server structure.
- FIG. 1 illustrates a configuration of the first embodiment of this invention.
- The first embodiment of this invention includes a client 110 and a server 120.
- Each of them includes components for performing the following operations.
- The client 110 includes an utterance extraction unit 111, a feature vector extraction unit 112, a feature vector conversion unit 113, a phoneme ID conversion unit 114, a data transmission unit 115, a search result reception unit 116, and a recognition result construction unit 117. Further included therein is a database 118, which stores an acoustic model, a language model, and conversion/reconstruction data. The conversion/reconstruction data is used by the feature vector conversion unit 113, the phoneme ID conversion unit 114, and the recognition result construction unit 117. Note that the conversion/reconstruction data may be set in advance in the feature vector conversion unit 113, the phoneme ID conversion unit 114, and the recognition result construction unit 117.
- The utterance extraction unit 111 extracts speech from acoustic sound and outputs it as speech data. For example, segments that involve actual utterance (utterance segments) are extracted from the acoustic data by discriminating them from segments that do not (silent segments). Further, noise is separated from the speech and eliminated.
- The feature vector extraction unit 112 extracts a set (feature vector) of acoustic features such as cepstrum, power, and Δpower from the speech data.
- The feature vector conversion unit 113 converts the feature vector into data having a format that prevents a third party from capturing or perceiving its content. At this time, the feature vector conversion unit 113 performs the conversion so as to guarantee that, when the acoustic likelihood computation unit 122 a of the server 120 performs an acoustic likelihood calculation on the converted data by using the correspondingly converted acoustic model, the output has the same value as, or a value approximate to, the output obtained from the combination of the unconverted acoustic model and feature vector. Examples of the conversion include shuffling the order of the feature vector and adding dimensions that are redundant and can be ignored in terms of the calculation.
- The phoneme ID conversion unit 114 converts the phoneme IDs of the acoustic model and the language model into data having a format that prevents a third party from perceiving their contents. Further, information unnecessary for the speech recognition processing performed on the server 120 is deleted from the acoustic model and the language model. In addition, depending on the content of the conversion processing, information necessary for restoration is recorded in the database 118 as the conversion/reconstruction data. Examples of the conversion and deletion include shuffling the phoneme IDs and word IDs and deleting the representation character strings and the like from the language model, as in the sketch below. The kind of conversion processing to be performed may be supplied in advance or may be determined dynamically.
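- A minimal sketch of such ID shuffling follows (the helper name is hypothetical; the real unit also deletes representation strings and records the inverse mapping as conversion/reconstruction data):

```python
import random

def make_id_mapping(num_ids, seed):
    """Random permutation over IDs; the inverse is kept client-side only."""
    rng = random.Random(seed)
    shuffled = list(range(num_ids))
    rng.shuffle(shuffled)
    forward = {i: s for i, s in enumerate(shuffled)}
    inverse = {s: i for i, s in forward.items()}
    return forward, inverse

fwd, inv = make_id_mapping(num_ids=5000, seed=42)    # e.g. triphone IDs
mapped_pron = [fwd[p] for p in [13, 7, 42]]          # phoneme ID string of one word
assert [inv[p] for p in mapped_pron] == [13, 7, 42]  # the client can restore it
```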
- The data transmission unit 115 transmits the converted data, such as the feature vector, the acoustic model, and the language model, to the server 120 as appropriate.
- The search result reception unit 116 receives the output of a speech recognition unit 122, such as the maximum-likelihood word ID string, via a search result transmission unit 123 of the server 120.
- The recognition result construction unit 117 references the conversion/reconstruction data recorded in the database 118 for the maximum-likelihood word ID string received by the search result reception unit 116, and thereby restores the data subjected to conversion by the phoneme ID conversion unit 114.
- The recognition result construction unit 117 then references the unconverted language model by using the restored word IDs to construct a recognition result identical to that obtained by an existing system. That is, the server 120 that performs the speech recognition can be prevented from capturing the content of the data used for the speech recognition, almost without affecting the speech recognition result.
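- Continuing the sketch above, the reconstruction step amounts to applying the stored inverse mapping and looking the words up in the unconverted language model kept on the client (names hypothetical):

```python
def reconstruct_result(mapped_word_ids, inverse_word_map, dictionary):
    """Undo the word-ID shuffle and recover the surface forms.

    `dictionary` maps original word IDs to representation strings and is
    never sent to the server, so the server sees opaque IDs only.
    """
    original_ids = [inverse_word_map[w] for w in mapped_word_ids]
    return " ".join(dictionary[w] for w in original_ids)
```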
- The server 120 includes a data reception unit 121, the speech recognition unit 122, and the search result transmission unit 123.
- The data reception unit 121 receives the data used for the speech recognition from the client 110.
- The data received in this embodiment are the converted data, which include the feature vector, the acoustic model, and the language model.
- The speech recognition unit 122 references the acoustic model and the language model to search for the maximum-likelihood word string for a feature vector sequence. Note that the speech recognition unit 122, described in detail later, is divided into the acoustic likelihood computation unit 122 a and a hypothesis search unit 122 b.
- The acoustic likelihood computation unit 122 a obtains the acoustic likelihood of the feature vector for the respective phonemes within the acoustic model.
- The search result transmission unit 123 transmits the output of the speech recognition unit 122, such as the maximum-likelihood word ID string, to the client 110.
- In FIG. 2, (C) indicates processing on the client device and (S) indicates processing on the server device.
- When the speech recognition starts, the client device and the server device operate as follows.
- The conversion processing using a mapping function is described next.
- The conversion of the feature vector and the acoustic model using a mapping function, which is performed by the feature vector conversion unit 113 and the phoneme ID conversion unit 114, relates to the operation of the speech recognition unit 122, in particular the acoustic likelihood computation unit 122 a included therein. Described below as an example is a process for recovering the valid processing result in the case of using a mapping function.
- The processing performed by the acoustic likelihood computation unit 122 a obtains the likelihood of a given feature vector with respect to the respective phonemes. This can be expressed as processing that employs an acoustic likelihood function D(V, A), where V represents the feature vector, A represents the acoustic model, and M kinds of phonemes are included therein.
- The property required of a pair of mapping functions F = (f_v, f_a) is that D(V, A) = D(f_v(V), f_a(A)) always holds. Examples of mapping functions that satisfy such a property are given below.
- The feature vector, if it is a vector of N features, can be expressed as V = (v_1, ..., v_N).
- As one example, let f_v shift the suffixes of the respective elements of the feature vector by one, moving the N-th element to the head; that is, f_v(V) = (v_N, v_1, ..., v_{N-1}).
- Correspondingly, f_a is a function that shifts the model for the i-th feature within the acoustic model to the (i+1)-th position.
- More generally, a mapping that shifts the elements of the feature vector by k satisfies the required property.
- The order of the elements itself has no meaning, and hence a mapping (shuffle function) that rearranges the elements of the feature vector into an arbitrary order satisfies the required property as well.
- Further, suppose that c_k and c_k^{-1} are a group of known values that satisfy the above-mentioned property; the mappings (f_v, f_a) can then be given by using those values.
- Moreover, the acoustic likelihood is linear with respect to the likelihoods of the respective elements of the feature vector, so if a combination of a feature value whose acoustic likelihood contribution is zero and a model for that feature is known, the number of apparent dimensions of the feature vector can be increased by using that combination.
- Likewise, since the acoustic likelihood is linear with respect to the likelihoods of the respective elements of the feature vector, if the acoustic likelihood function D(v_i, A_{i,j}) for each feature is also linear, the number of apparent dimensions can be increased by dividing a given feature into a plurality of elements.
- If the acoustic likelihood computation unit 122 a is established on the basis of an acoustic likelihood function exhibiting such properties, as many arbitrary mapping functions as this embodiment requires can be obtained by combining the "shuffling of the feature vector" and the "extension of the number of apparent dimensions" described above.
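- Both constructions can be checked numerically. The sketch below assumes a diagonal-Gaussian acoustic likelihood, which is additive over the per-feature log-likelihoods as the text requires: permuting the feature vector together with the per-feature models leaves D unchanged, and a dummy feature equal to the mean of a Gaussian with variance 1/(2π) contributes exactly zero.

```python
import numpy as np

def D(V, means, variances):
    """Acoustic likelihood of V under one diagonal-Gaussian phoneme model."""
    return np.sum(-0.5 * np.log(2 * np.pi * variances)
                  - (V - means) ** 2 / (2 * variances))

rng = np.random.default_rng(0)
N = 12
V = rng.normal(size=N)
means, variances = rng.normal(size=N), rng.uniform(0.5, 2.0, size=N)

# (1) Shuffle: apply the same permutation to the vector and to the model.
perm = rng.permutation(N)
assert np.isclose(D(V, means, variances), D(V[perm], means[perm], variances[perm]))

# (2) Apparent-dimension extension: a feature equal to the mean of a Gaussian
# with variance 1/(2*pi) adds exactly zero log-likelihood.
V2 = np.append(V, 7.0)
m2 = np.append(means, 7.0)
s2 = np.append(variances, 1.0 / (2 * np.pi))
assert np.isclose(D(V, means, variances), D(V2, m2, s2))
```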
- As a result, the speech recognition unit 122 of the server 120 can obtain a recognition result that is the same as, or approximate to, the result obtained when no such conversion is performed.
- The conversion of the acoustic model and the language model performed by the phoneme ID conversion unit 114 relates to the inside of the speech recognition unit 122, in particular to the operation of the hypothesis search unit 122 b.
- For a dictionary L, a lookup function that returns either 0 or 1 in relation to all the words w included in L can be expressed as T(L, w, a_1, ..., a_N), which returns 1 when the phoneme string a_1, ..., a_N spells the word w included in L, and 0 otherwise.
- This function seems to have an extremely high calculation load, but can be computed quickly by using a TRIE structure or the like.
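- A sketch of such a trie-backed lookup over phoneme ID strings is given below (a toy dict-of-dicts TRIE; this simplified T drops the word argument and just reports whether the full string spells some dictionary word):

```python
def build_trie(lexicon):
    """lexicon: iterable of phoneme-ID tuples; dict-of-dicts trie with end marker."""
    root = {}
    for pron in lexicon:
        node = root
        for pid in pron:
            node = node.setdefault(pid, {})
        node["$"] = True          # word-end marker
    return root

def T(trie, phoneme_ids):
    """Return 1 if phoneme_ids is the pronunciation of some word, else 0."""
    node = trie
    for pid in phoneme_ids:
        node = node.get(pid)
        if node is None:
            return 0
    return 1 if "$" in node else 0

trie = build_trie([(3, 1, 4), (3, 1)])
assert T(trie, (3, 1)) == 1 and T(trie, (3, 1, 4)) == 1 and T(trie, (3,)) == 0
```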
- In practice, the phoneme ID string and the word ID are often used instead of the phoneme string itself and the word itself, respectively, but both correspond one-to-one to the phoneme and the word, and hence only the phoneme and the word are described below.
- Consider a pair of mapping functions G = (g_1, g_a), where g_1 maps the language model and g_a maps the phonemes.
- The property required of g_1 and g_a is that the following expression always holds true with respect to an arbitrary phoneme string a_1, ..., a_N:
- T(L, w, a_1, ..., a_N) = T(g_1(L), g_a(w), g_a(a_1), ..., g_a(a_N))
- After all, the above-mentioned two conversion processing steps may be any conversion processing steps that satisfy the following requirement.
- Since the processing of the hypothesis search unit 122 b is expressed as a search problem that regards the likelihood as a score and obtains the path exhibiting the highest score, only the magnitude relationship between likelihoods needs to be preserved. Hence what actually matters for the conversions F performed on the feature vector and the acoustic model is the property that this magnitude relationship is maintained after the conversion.
- Information on the respective words included in the language model is basically deleted, other than the phoneme ID string (with the phoneme IDs also converted by the mapping function as described above). This not only achieves secrecy but is also effective in reducing the communication amount.
- A speech recognition unit 122 that requests data which may lead to a leak of the word information should not be used for the speech recognition processing. For example, it is assumed in this embodiment that a speech recognition unit 122 that requests the display character string of a word is not used. In a case where a speech recognition processing unit that requests such data must be used at any cost, the leak may be avoided by, for example, mapping that data in the same manner as the phoneme IDs and word IDs.
- The feature vector conversion is executed each time a new feature vector is obtained.
- The conversion of the acoustic model and of the phoneme IDs of the language model may be performed once, prior to the speech recognition, as described above.
- If the same conversion is kept in use, however, the mapping function may be conjectured by using a statistical method or the like.
- The secrecy against the third party is therefore enhanced by periodically switching the behavior of the conversion operation, for example by changing the mapping function to another one.
- The switching may be performed, for example, once every several utterances or once every several minutes.
- When the calculation amount necessary for the conversion operation and the communication amount for transmitting the converted model to the server are taken into consideration, however, it is not appropriate to perform the switching very frequently.
- The timing and frequency of the switching may therefore be set in consideration of the overhead (the calculation amount necessary for the conversion operation and the communication amount for transmitting the converted model to the server) incurred by frequent switching. Further, the mapping may be altered at a timing at which the processing amount or the communication amount is low, for example during a silent segment.
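- The switching policy itself can be as simple as the following sketch (the thresholds are hypothetical; a deployment would weigh them against the re-upload overhead just described):

```python
import time

class MappingScheduler:
    """Re-draw the mapping function every `max_utts` utterances or `max_sec` seconds."""
    def __init__(self, max_utts=5, max_sec=180.0):
        self.max_utts, self.max_sec = max_utts, max_sec
        self.utts, self.since = 0, time.monotonic()

    def note_utterance(self):
        self.utts += 1

    def should_switch(self):
        return self.utts >= self.max_utts or time.monotonic() - self.since >= self.max_sec

    def switched(self):                 # call after re-uploading the converted models
        self.utts, self.since = 0, time.monotonic()
```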
- Because the embodiment performing the conversion with a mapping function converts the feature vector by the mapping function before transmitting it to the server, even if a third party obtains the feature vector on the communication channel or on the server, it is difficult for the third party to immediately restore the speech from it.
- The acoustic model is also converted, by a mapping function selected so as to return the same acoustic likelihood as for the unconverted feature vector, which guarantees that the same acoustic likelihood is computed, in other words, that the same recognition result is obtained, as in the case where the feature vector is not converted.
- The above-mentioned mode also avoids transmitting the representation character strings within the information on the respective word entries of the language model to the server, and converts the phoneme ID string indicating the pronunciation of each word entry by the mapping function before transmitting it to the server.
- Hence, even if a third party that knows the structure of the language model obtains the phoneme ID string, it is difficult for the third party to immediately learn information such as the pronunciation and surface form of the words included therein.
- The language model is likewise converted by a mapping function selected so as to return, for the same phoneme string, the same word outcome as the unconverted language model, which guarantees that the same outcome regarding the word, in other words, the same recognition result, is obtained as in the case where the language model is not converted.
- FIG. 3 is a block diagram illustrating a configuration of the second embodiment.
- A speech recognition system according to the second embodiment includes a plurality of speech recognition servers. Further, the information processing device that requests the speech recognition is itself a server.
- The plurality of speech recognition servers correspond to mutually different items of converted speech recognition information data (types A, B, and C in the figure).
- The server that requests the speech recognition stores in advance the specifications of the respective speech recognition servers, and stores the converted speech recognition information data to be transmitted to each of them. Note that such specifications of the speech recognition servers may be managed integrally with the conversion/reconstruction data or by another method.
- The server that requests the speech recognition uses its respective units to carry out the utterance extraction processing and the feature vector extraction processing, then selects the speech recognition server to be used, converts the information for speech recognition into data having a format that enables recovery of the valid processing result corresponding to the selected speech recognition server, and transmits the data to that server.
- The server that requests the speech recognition then uses its respective units to construct, from the result data returned by the speech recognition server, the speech recognition result being a valid recognition result, and outputs it.
- Next, a third embodiment is described by referring to FIG. 4. Note that, to clarify the description, descriptions of the same parts as those of the first and second embodiments are simplified or omitted.
- FIG. 4 is a block diagram illustrating a configuration of the third embodiment.
- The plurality of speech recognition servers of the speech recognition system according to the third embodiment provide only the service of hypothesis search processing.
- That is, the speech recognition servers are capable of performing both the acoustic likelihood computation processing and the hypothesis search processing, but provide only the hypothesis search processing as a service.
- The information processing device that requests the speech recognition includes an acoustic likelihood computation unit, and is thus able to perform the distance calculation itself.
- The plurality of speech recognition servers each perform the requested speech recognition processing (here, the hypothesis search processing) and return the result.
- The requesting terminal that requests the speech recognition stores in advance the specifications of the respective speech recognition servers, and stores the converted speech recognition information data to be transmitted to each of them. Note that such specifications of the speech recognition servers may be managed integrally with the conversion/reconstruction data or by another method.
- The requesting terminal uses its respective units to carry out the utterance extraction processing, the feature vector extraction processing, and the acoustic likelihood computation processing, then selects the speech recognition server to be used, converts the information on the computed acoustic likelihoods and the information for speech recognition into data having a format that enables recovery of the valid processing result corresponding to the selected server, and transmits the data to that server.
- The requesting terminal then uses its respective units to construct, from the result data returned by the speech recognition server, the speech recognition result being a valid recognition result, and outputs it.
- Such a configuration makes it possible to omit the shuffling processing for the acoustic model and the transmission of the acoustic model. That is, if the terminal has sufficient calculation ability to perform the acoustic likelihood computation processing, the communication amount can be reduced.
- Next, a fourth embodiment is described by referring to FIG. 5. Note that, to clarify the description, descriptions of the same parts as those of the other embodiments are simplified or omitted.
- FIG. 5 is a block diagram illustrating a configuration of the fourth embodiment.
- The plurality of speech recognition servers of the speech recognition system according to the fourth embodiment each provide a speech recognition service.
- The information processing device that requests the speech recognition includes an utterance dividing unit for performing time division on the input sound (speech) from which the feature vectors are extracted. Note that, instead of time division, the division may be performed in units of clauses or words of the speech.
- The information processing device that requests the speech recognition shuffles the sequence relationship between the divided items of speech data, subjects the resulting data to conversion as the information for speech recognition, transmits it separately to the plurality of speech recognition servers, and collectively reconstructs the results returned from the respective servers, as shown in the sketch below.
- The time-division interval, the shuffling method, and the speech recognition server serving as the transmission destination are switched as necessary.
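- The division and shuffling of the fourth embodiment might look like the following sketch (hypothetical fixed block size; each server sees only an out-of-order subset, and the tag table kept on the client restores the time sequence):

```python
import random

def divide_and_shuffle(frames, n_servers, seed):
    """Split a frame sequence into blocks, shuffle them, and deal them out.

    Returns per-server work lists of (tag, block); the tag-to-position table
    stays on the client so the results can be reassembled in time order.
    """
    blocks = [frames[i:i + 10] for i in range(0, len(frames), 10)]
    order = list(range(len(blocks)))
    random.Random(seed).shuffle(order)
    work = [[] for _ in range(n_servers)]
    for tag, pos in enumerate(order):
        work[tag % n_servers].append((tag, blocks[pos]))
    secret = {tag: pos for tag, pos in enumerate(order)}
    return work, secret

def reassemble(results, secret):
    """results: {tag: partial result}; restore the original time order."""
    return [res for _, res in sorted(results.items(), key=lambda kv: secret[kv[0]])]
```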
- FIG. 6 is a block diagram illustrating a configuration of a fifth embodiment.
- The speech recognition system according to the fifth embodiment has a mode in which a speech recognition server including the acoustic likelihood computation unit generates result data on the acoustic likelihoods and transfers the result data to another speech recognition server including the hypothesis search unit. The system may also be configured such that the secret speech identification device instructs the speech recognition server including the acoustic likelihood computation unit to perform the transfer itself, or such that the acoustic likelihood result data is divided and transferred to a plurality of speech recognition servers each including the hypothesis search unit.
- The speech data or the features extracted on the secret speech identification device serving as a client are divided, the sequence relationship between them is shuffled, and the respective servers are requested to perform the speech recognition.
- The secret speech identification device subjects the speech recognition results sent from the respective servers to the inverse of the shuffling performed before transmission, and reconstructs the content being the valid recognition result. That is, the secret speech identification device carries out the processing up to the feature vector extraction together with the reconstruction processing, while the servers carry out the rest.
- Such an operation can reduce the communication load and the load on the secret speech identification device.
- In another mode, a client terminal caused to perform the speech recognition receives the speech recognition result from the server and, in response to the result, executes second recognition processing for inserting the words, and the concatenation information on words, that were deleted from the dictionary. That is, information whose leak is feared, and which is therefore not included in the recognition result sent from the server, is regained by the second speech recognition processing (search processing).
- A second speech recognition unit is provided within the recognition result construction unit, and uses the recognition result output by the speech recognition unit (first speech recognition unit) on the server as its input.
- A word and its likelihood are assigned to each arc appearing in the graph structure generated partway through the search processing, and the search processing finds the path exhibiting the highest total sum of likelihoods.
- The recognition result construction unit converts these into word strings, and further converts the word strings into phoneme strings by using the pronunciation information. With this processing, exactly one phoneme string is obtained in the case where the maximum-likelihood word ID string is used as the input, and a plurality of phoneme strings are obtained otherwise.
- The second speech recognition unit takes the phoneme strings out of the recognition result returned from the server, and searches them for segments that match the phoneme strings of the deleted words and word concatenations.
- When a matching segment is found, the recognition result construction unit constructs the valid recognition result by replacing (inserting) the word or word concatenation into the corresponding part, as sketched below.
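- A sketch of this second search step follows (a hypothetical flat-list matcher; a real second speech recognition unit would search the lattice of candidate phoneme strings):

```python
def reinsert_deleted_words(phonemes, secret_lexicon):
    """Replace segments of the phoneme string that match a withheld word.

    phonemes: list of phoneme IDs reconstructed from the server result.
    secret_lexicon: {phoneme-ID tuple: word} kept only on the client.
    Returns a mixed list of unmatched phoneme IDs and recovered words.
    """
    out, i = [], 0
    while i < len(phonemes):
        for pron, word in secret_lexicon.items():
            if tuple(phonemes[i:i + len(pron)]) == pron:
                out.append(word)
                i += len(pron)
                break
        else:
            out.append(phonemes[i])
            i += 1
    return out

assert reinsert_deleted_words([1, 9, 9, 2], {(9, 9): "acme"}) == [1, "acme", 2]
```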
- With this configuration, the mapping for the word IDs becomes unnecessary, with the result that uploading only the acoustic model and the dictionary suffices, while the secrecy can still be ensured.
- Since the strict language model occupies most of the capacity of a broad-sense language model, this produces a remarkable reduction in the communication bandwidth between the server and the client.
- Another embodiment is configured so that the client terminal is relieved of the acoustic likelihood calculation without the acoustic model having to be uploaded. That is, the extraction of the features and the acoustic likelihood calculation are carried out on the server and the results transmitted, while the search processing is carried out on the client terminal. At this time, the acoustic data transmitted from the client terminal to the server is kept secret by an encryption operation that can be decrypted by the server and by the mapping operation that maps the content into data which cannot be perceived or captured by the server.
- Such a configuration operates effectively as a means for performing client-server speech recognition that guarantees secrecy without converting the language model in particular.
- The first effect is the ability to reduce the risk that the utterance content of a speaker may be leaked to a third party. This is because, even if the third party acquires the intermediate data (feature vector, phoneme ID string, and word ID string) obtained by converting the speech data, the third party needs to know the details of how the phoneme IDs and the like were converted in order to restore the original; performing the conversion appropriately can thus make it difficult for the third party to restore the speech data.
- The second effect is the ability to reduce the risk that special information related to a task or domain may be leaked from the language model to a third party.
- Since the language model temporarily retained on the server includes only minimal word information, such as the converted phoneme IDs, and the details of the phoneme ID conversion are unknown to the server, it can be made difficult for the third party to learn the details of the content of the language model.
- The third party referred to herein also includes the speech recognition service provider. Therefore, the indirect effects of this invention include the ability to perform speech recognition in the form of a network service even on speech whose secrecy is demanded extremely strongly, for example speech related to privacy or a trade secret.
- In addition, the speech recognition system may be configured in the following manner.
- A speech recognition system including: a first information processing device including a speech recognition processing unit for receiving data to be used for speech recognition transmitted via a network, carrying out speech recognition processing, and returning resultant data; and a second information processing device connected to the first information processing device via the network, for transmitting the data to be used for the speech recognition by the speech recognition processing unit after mapping it by using a mapping function unknown to the first information processing device, and for constructing a speech recognition result by modifying, based on the mapping function used, the resultant data returned from the first information processing device into the same result as would be obtained by performing the speech recognition without using the mapping function.
- A speech recognition system including a plurality of information processing devices that are connected to one another via a network, with a speech recognition processing unit included in at least one of them.
- In this system, the requesting information processing device converts at least one data structure of the data to be used for the speech recognition processing by the speech recognition processing unit by using a mapping function, and transmits the result to the information processing device including the speech recognition processing unit.
- The information processing device including the speech recognition processing unit carries out the speech recognition processing based on the converted data structure and transmits the result thereof.
- The requesting information processing device reconstructs the result of the speech recognition processing, which is affected by the mapping function, into a result of speech recognition processing that is not affected by the mapping function.
- A speech recognition system configured by using a mapping function in which: with regard to the reference relationship between an index that refers to specific data included in a given data structure and its reference destination, the destination to which a given index refers before the mapping does not necessarily match the destination to which the same index refers after the mapping; and it is guaranteed that data at a reference destination referred to by any index before the mapping is always referred to by some index after the mapping.
- A speech recognition system configured by using a mapping function that shuffles the indices referring to the specific data included in the given data structure.
- A speech recognition system configured by using a mapping function that adds an arbitrary number of indices to the specific data included in the given data structure.
- A speech recognition system in which at least one item of the data to be used for the speech recognition that is mapped by the mapping function is retained, before the mapping, only on the information processing device for inputting the sound to be subjected to the speech recognition.
- A speech recognition system in which the data to be used by the speech recognition processing unit has a structure to which at least one selected from the group consisting of the structure of an acoustic model, the structure of a language model, and the structure of a feature vector is mapped.
- A speech recognition system in which: the indices indicating the respective features included in the feature vector are mapped by using the mapping function given by the device for inputting the sound to be subjected to the speech recognition; and the indices to the models associated with the respective features within the acoustic model are mapped by using the mapping function given by the same device.
- A speech recognition system in which: the phoneme IDs, being indices to the phonemes included in the acoustic model, are mapped by using the mapping function given by the device for inputting the sound; the phoneme ID strings indicating the pronunciations of the respective words included in the language model are mapped by using the mapping function given by the device for inputting the sound; and at least the information on the representation character strings of the respective words included in the language model is deleted.
- A speech recognition system in which the word IDs, being indices to the respective words included in the language model, are mapped by using the mapping function given by the device for inputting the sound.
- A speech recognition system in which an information processing device for inputting speech data includes at least an acoustic likelihood computation unit and is configured to: map the phoneme ID strings indicating the pronunciations of the respective words included in the language model by using the mapping function given by that device, and delete at least the information on the representation character strings of the respective words included in the language model; compute the acoustic likelihoods of all known phonemes or of the necessary phonemes for each frame of the speech data, to generate a sequence of pairs of phoneme IDs and acoustic likelihoods, the phoneme IDs being mapped by using the given mapping function; and transmit the sequence of mapped phoneme ID and acoustic likelihood pairs, together with the mapped language model, to the information processing device including a hypothesis search unit.
- A speech recognition system in which an information processing device for inputting speech data is configured to: divide the speech data into blocks; map the time sequence of the divided blocks by using the mapping function given by that device; transmit the blocks of speech to an information processing device for performing the speech recognition based on the mapped time sequence; receive either a feature vector or a sequence of pairs of phoneme IDs and acoustic likelihoods from the information processing device performing the speech recognition; and restore the time sequence by using the inverse function of the given mapping function.
- The respective units of the speech recognition request device may be realized by hardware or by a combination of hardware and software.
- In the combined form, the respective units and various means are realized by loading a speech recognition program into a RAM and causing hardware such as a CPU to operate according to the program.
- The program may be distributed by being recorded on a recording medium.
- The program recorded on the recording medium is read into a memory by wire, wirelessly, or via the recording medium itself, and causes a control unit and the like to operate.
- Examples of the recording medium include an optical disc, a magnetic disk, a semiconductor memory device, and a hard disk.
- This invention can be applied for the purpose of increasing the secrecy in all the applications for performing the client-server speech recognition.
- this invention can be applied for constructing a SaaS-based speech recognition system for recognizing the speech including a trade secret. Further, this invention can be applied for constructing a SaaS-based speech recognition system for the speech high in privacy such as a diary.
- Consider a speech-controlled online store website that allows menu selection and the like to be performed by speech.
- If the website is constructed using a SaaS-based speech recognition system to which this invention is applied, the user can keep his or her purchase history and the like from being learned by, at the very least, the SaaS-based speech recognition provider. This also benefits the webmaster of the speech-controlled online store website, because the risk of a leak of customer information is reduced.
- the use of this invention also eliminates the need to retain, even temporarily, the users' speech and a language model containing vocabulary corresponding to the users' personal information on the self-managed speech recognition server, which helps avoid unintended leaks of that personal information to a cracker or the like.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Telephonic Communication Services (AREA)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/064,976 US9520129B2 (en) | 2009-10-28 | 2013-10-28 | Speech recognition system, request device, method, program, and recording medium, using a mapping on phonemes to disable perception of selected content |
US15/241,233 US9905227B2 (en) | 2009-10-28 | 2016-08-19 | Speech recognition system, request device, method, program, and recording medium, using a mapping on phonemes to disable perception of selected content |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2009247874 | 2009-10-28 | ||
JP2009-247874 | 2009-10-28 | ||
PCT/JP2010/068230 WO2011052412A1 (ja) | 2009-10-28 | 2010-10-12 | Speech recognition system, speech recognition request device, speech recognition method, speech recognition program, and recording medium |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2010/068230 A-371-Of-International WO2011052412A1 (ja) | 2009-10-28 | 2010-10-12 | Speech recognition system, speech recognition request device, speech recognition method, speech recognition program, and recording medium |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/064,976 Division US9520129B2 (en) | 2009-10-28 | 2013-10-28 | Speech recognition system, request device, method, program, and recording medium, using a mapping on phonemes to disable perception of selected content |
Publications (1)
Publication Number | Publication Date |
---|---|
US20120215528A1 (en) | 2012-08-23 |
Family
ID=43921838
Family Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/504,264 Abandoned US20120215528A1 (en) | 2009-10-28 | 2010-10-12 | Speech recognition system, speech recognition request device, speech recognition method, speech recognition program, and recording medium |
US14/064,976 Active 2031-05-10 US9520129B2 (en) | 2009-10-28 | 2013-10-28 | Speech recognition system, request device, method, program, and recording medium, using a mapping on phonemes to disable perception of selected content |
US15/241,233 Active US9905227B2 (en) | 2009-10-28 | 2016-08-19 | Speech recognition system, request device, method, program, and recording medium, using a mapping on phonemes to disable perception of selected content |
Family Applications After (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/064,976 Active 2031-05-10 US9520129B2 (en) | 2009-10-28 | 2013-10-28 | Speech recognition system, request device, method, program, and recording medium, using a mapping on phonemes to disable perception of selected content |
US15/241,233 Active US9905227B2 (en) | 2009-10-28 | 2016-08-19 | Speech recognition system, request device, method, program, and recording medium, using a mapping on phonemes to disable perception of selected content |
Country Status (3)
Country | Link |
---|---|
US (3) | US20120215528A1 (ja) |
JP (1) | JP5621993B2 (ja) |
WO (1) | WO2011052412A1 (ja) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9317736B1 (en) * | 2013-05-08 | 2016-04-19 | Amazon Technologies, Inc. | Individual record verification based on features |
CN105981099A (zh) * | 2014-02-06 | 2016-09-28 | Mitsubishi Electric Corporation | Speech search device and speech search method |
US10065124B2 (en) * | 2016-01-15 | 2018-09-04 | Disney Enterprises, Inc. | Interacting with a remote participant through control of the voice of a toy device |
JP6731609B2 (ja) * | 2016-05-13 | 2020-07-29 | Panasonic Intellectual Property Management Co., Ltd. | Data processing device, data processing system, data processing method, and data processing program |
CN106601257B (zh) * | 2016-12-31 | 2020-05-26 | Lenovo (Beijing) Co., Ltd. | Sound recognition method, device, and first electronic device |
JP6599914B2 (ja) * | 2017-03-09 | 2019-10-30 | Toshiba Corporation | Speech recognition device, speech recognition method, and program |
JP7088645B2 (ja) * | 2017-09-20 | 2022-06-21 | Nomura Research Institute, Ltd. | Data conversion device |
EP3496090A1 (en) * | 2017-12-07 | 2019-06-12 | Thomson Licensing | Device and method for privacy-preserving vocal interaction |
US10909983B1 (en) * | 2018-09-18 | 2021-02-02 | Amazon Technologies, Inc. | Target-device resolution |
JP7211103B2 (ja) * | 2019-01-24 | 2023-01-24 | Nippon Telegraph and Telephone Corporation | Sequence labeling device, sequence labeling method, and program |
JP6849977B2 (ja) * | 2019-09-11 | 2021-03-31 | Sockets Inc. | Device and method for generating synchronization information for text display, and speech recognition device and method |
CN111081256A (zh) * | 2019-12-31 | 2020-04-28 | Suzhou AISpeech Information Technology Co., Ltd. | Digit-string voiceprint password verification method and system |
KR20220010259A (ko) * | 2020-07-17 | 2022-01-25 | Samsung Electronics Co., Ltd. | Speech signal processing method and apparatus |
JP7567940B2 (ja) | 2021-01-15 | 2024-10-16 | Nippon Telegraph and Telephone Corporation | Learning method, learning system, and learning program |
WO2022215140A1 (ja) * | 2021-04-05 | 2022-10-13 | KPMG Ignition Tokyo, Inc. | Program, information processing device, and information processing method |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5893057A (en) | 1995-10-24 | 1999-04-06 | Ricoh Company Ltd. | Voice-based verification and identification methods and systems |
JP3601631B2 (ja) * | 1995-10-24 | 2004-12-15 | Ricoh Co., Ltd. | Speaker recognition system and speaker recognition method |
US20020107918A1 (en) * | 2000-06-15 | 2002-08-08 | Shaffer James D. | System and method for capturing, matching and linking information in a global communications network |
FR2820872B1 (fr) | 2001-02-13 | 2003-05-16 | Thomson Multimedia SA | Method, module, device, and server for voice recognition |
JP3885523B2 (ja) * | 2001-06-20 | 2007-02-21 | NEC Corporation | Server-client speech recognition device and method |
JP4425055B2 (ja) * | 2004-05-18 | 2010-03-03 | Nippon Telegraph and Telephone Corporation | Client-server speech recognition method, device used therefor, program thereof, and recording medium |
JP2006309356A (ja) * | 2005-04-26 | 2006-11-09 | Mark-I Inc | Schedule management system and schedule management method |
- 2010
  - 2010-10-12: US application 13/504,264 filed; published as US20120215528A1 (status: abandoned)
  - 2010-10-12: JP application 2011-538353 filed; granted as JP5621993B2 (status: active)
  - 2010-10-12: international application PCT/JP2010/068230 filed; published as WO2011052412A1
- 2013
  - 2013-10-28: US application 14/064,976 filed; granted as US9520129B2 (status: active)
- 2016
  - 2016-08-19: US application 15/241,233 filed; granted as US9905227B2 (status: active)
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6292782B1 (en) * | 1996-09-09 | 2001-09-18 | Philips Electronics North America Corp. | Speech recognition and verification system enabling authorized data transmission over networked computer systems |
US20040010409A1 (en) * | 2002-04-01 | 2004-01-15 | Hirohide Ushida | Voice recognition system, device, voice recognition method and voice recognition program |
US20060009980A1 (en) * | 2004-07-12 | 2006-01-12 | Burke Paul M | Allocation of speech recognition tasks and combination of results thereof |
US20090299743A1 (en) * | 2008-05-27 | 2009-12-03 | Rogers Sean Scott | Method and system for transcribing telephone conversation to text |
Cited By (38)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110295605A1 (en) * | 2010-05-28 | 2011-12-01 | Industrial Technology Research Institute | Speech recognition system and method with adjustable memory usage |
US20120310651A1 (en) * | 2011-06-01 | 2012-12-06 | Yamaha Corporation | Voice Synthesis Apparatus |
US9230537B2 (en) * | 2011-06-01 | 2016-01-05 | Yamaha Corporation | Voice synthesis apparatus using a plurality of phonetic piece data |
US20150012261A1 (en) * | 2012-02-16 | 2015-01-08 | Continetal Automotive Gmbh | Method for phonetizing a data list and voice-controlled user interface |
US9405742B2 (en) * | 2012-02-16 | 2016-08-02 | Continental Automotive Gmbh | Method for phonetizing a data list and voice-controlled user interface |
US20140136210A1 (en) * | 2012-11-14 | 2014-05-15 | At&T Intellectual Property I, L.P. | System and method for robust personalization of speech recognition |
CN105009206A (zh) * | 2013-03-06 | 2015-10-28 | Mitsubishi Electric Corporation | Speech recognition device and speech recognition method |
CN105009206B (zh) * | 2013-03-06 | 2018-02-09 | Mitsubishi Electric Corporation | Speech recognition device and speech recognition method |
US9431010B2 (en) * | 2013-03-06 | 2016-08-30 | Mitsubishi Electric Corporation | Speech-recognition device and speech-recognition method |
US9269355B1 (en) * | 2013-03-14 | 2016-02-23 | Amazon Technologies, Inc. | Load balancing for automatic speech recognition |
US20150348571A1 (en) * | 2014-05-29 | 2015-12-03 | Nec Corporation | Speech data processing device, speech data processing method, and speech data processing program |
US20180040320A1 (en) * | 2014-07-16 | 2018-02-08 | Panasonic Intellectual Property Corporation Of America | Method for controlling speech-recognition text-generation system and method for controlling mobile terminal |
US20160019893A1 (en) * | 2014-07-16 | 2016-01-21 | Panasonic Intellectual Property Corporation Of America | Method for controlling speech-recognition text-generation system and method for controlling mobile terminal |
US10515633B2 (en) * | 2014-07-16 | 2019-12-24 | Panasonic Intellectual Property Corporation Of America | Method for controlling speech-recognition text-generation system and method for controlling mobile terminal |
US10504517B2 (en) * | 2014-07-16 | 2019-12-10 | Panasonic Intellectual Property Corporation Of America | Method for controlling speech-recognition text-generation system and method for controlling mobile terminal |
US9824688B2 (en) * | 2014-07-16 | 2017-11-21 | Panasonic Intellectual Property Corporation Of America | Method for controlling speech-recognition text-generation system and method for controlling mobile terminal |
US20160055850A1 (en) * | 2014-08-21 | 2016-02-25 | Honda Motor Co., Ltd. | Information processing device, information processing system, information processing method, and information processing program |
US9899028B2 (en) * | 2014-08-21 | 2018-02-20 | Honda Motor Co., Ltd. | Information processing device, information processing system, information processing method, and information processing program |
CN111787012A (zh) * | 2014-11-07 | 2020-10-16 | Samsung Electronics Co., Ltd. | Speech signal processing method, and terminal and server implementing the same |
US11308936B2 (en) | 2014-11-07 | 2022-04-19 | Samsung Electronics Co., Ltd. | Speech signal processing method and speech signal processing apparatus |
EP3690879A3 (en) * | 2014-11-07 | 2020-08-26 | Samsung Electronics Co., Ltd. | Speech signal processing method and speech signal processing apparatus |
US20170229124A1 (en) * | 2016-02-05 | 2017-08-10 | Google Inc. | Re-recognizing speech with external data sources |
US9997173B2 (en) * | 2016-03-14 | 2018-06-12 | Apple Inc. | System and method for performing automatic gain control using an accelerometer in a headset |
US20170294188A1 (en) * | 2016-04-12 | 2017-10-12 | Fujitsu Limited | Apparatus, method for voice recognition, and non-transitory computer-readable storage medium |
US10733986B2 (en) * | 2016-04-12 | 2020-08-04 | Fujitsu Limited | Apparatus, method for voice recognition, and non-transitory computer-readable storage medium |
US20170316780A1 (en) * | 2016-04-28 | 2017-11-02 | Andrew William Lovitt | Dynamic speech recognition data evaluation |
US10192555B2 (en) * | 2016-04-28 | 2019-01-29 | Microsoft Technology Licensing, Llc | Dynamic speech recognition data evaluation |
US20190214014A1 (en) * | 2016-05-26 | 2019-07-11 | Nuance Communications, Inc. | Method And System For Hybrid Decoding For Enhanced End-User Privacy And Low Latency |
US10803871B2 (en) * | 2016-05-26 | 2020-10-13 | Nuance Communications, Inc. | Method and system for hybrid decoding for enhanced end-user privacy and low latency |
US20170365249A1 (en) * | 2016-06-21 | 2017-12-21 | Apple Inc. | System and method of performing automatic speech recognition using end-pointing markers generated using accelerometer-based voice activity detector |
US10950235B2 (en) * | 2016-09-29 | 2021-03-16 | Nec Corporation | Information processing device, information processing method and program recording medium |
US11011167B2 (en) * | 2018-01-10 | 2021-05-18 | Toyota Jidosha Kabushiki Kaisha | Communication system, communication method, and computer-readable storage medium |
CN111868717A (zh) * | 2018-03-20 | 2020-10-30 | Sony Corporation | Information processing device and information processing method |
US11373656B2 (en) * | 2019-10-16 | 2022-06-28 | Lg Electronics Inc. | Speech processing method and apparatus therefor |
US12124498B2 (en) * | 2020-01-09 | 2024-10-22 | Amazon Technologies, Inc. | Time code to byte indexer for partial object retrieval |
US11900921B1 (en) | 2020-10-26 | 2024-02-13 | Amazon Technologies, Inc. | Multi-device speech processing |
US11721347B1 (en) * | 2021-06-29 | 2023-08-08 | Amazon Technologies, Inc. | Intermediate data for inter-device speech processing |
US20240029743A1 (en) * | 2021-06-29 | 2024-01-25 | Amazon Technologies, Inc. | Intermediate data for inter-device speech processing |
Also Published As
Publication number | Publication date |
---|---|
US9520129B2 (en) | 2016-12-13 |
JP5621993B2 (ja) | 2014-11-12 |
US9905227B2 (en) | 2018-02-27 |
US20140058729A1 (en) | 2014-02-27 |
WO2011052412A1 (ja) | 2011-05-05 |
JPWO2011052412A1 (ja) | 2013-03-21 |
US20160358608A1 (en) | 2016-12-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9905227B2 (en) | Speech recognition system, request device, method, program, and recording medium, using a mapping on phonemes to disable perception of selected content | |
CN109493850B (zh) | Growth-type dialogue device | |
US20190005954A1 (en) | Wake-on-voice method, terminal and storage medium | |
JP6058807B2 (ja) | Method and system for speech recognition processing using search query information | |
TWI610295B (zh) | Computer-implemented method of decompressing and compressing transducer data for speech recognition, and computer-implemented speech recognition system | |
Sainath et al. | No need for a lexicon? evaluating the value of the pronunciation lexica in end-to-end models | |
JP6469252B2 (ja) | Account addition method, terminal, server, and computer storage medium | |
CN110288980A (zh) | Speech recognition method, model training method, apparatus, device, and storage medium | |
US9293137B2 (en) | Apparatus and method for speech recognition | |
EP2956939B1 (en) | Personalized bandwidth extension | |
TW201606750A (zh) | Speech recognition using foreign word grammar | |
KR20230107860A (ko) | Speech personalization and federated training using real noise | |
CN113724718B (zh) | Target audio output method, apparatus, and system | |
JP5558284B2 (ja) | Speech recognition system, speech recognition method, and speech recognition program | |
JP2023162265A (ja) | Text echo cancellation | |
US11948564B2 (en) | Information processing device and information processing method | |
US20120278079A1 (en) | Compressed phonetic representation | |
CN113793599A (zh) | Training method for speech recognition model, and speech recognition method and apparatus | |
JP5050175B2 (ja) | Information processing terminal with speech recognition function | |
KR20160055059A (ko) | Speech signal processing method and apparatus | |
JP4769121B2 (ja) | Server-client speech recognition method and apparatus, server-client speech recognition program, and recording medium | |
CN114283811A (zh) | Voice conversion method, apparatus, computer device, and storage medium | |
Sertsi et al. | Offline Thai speech recognition framework on mobile device | |
CN117059076A (zh) | Dialect speech recognition method, apparatus, device, and storage medium | |
CN118506773A (zh) | Joint speech and language model using large language models | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment | Owner name: NEC CORPORATION, JAPAN; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: NAGATOMO, KENTARO; REEL/FRAME: 028112/0918; Effective date: 20120417 |
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |