WO2011148594A1 - Voice recognition system, voice acquisition terminal, voice recognition distribution method and voice recognition program - Google Patents


Info

Publication number
WO2011148594A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
acquisition terminal
recognition
speech recognition
unit
Prior art date
Application number
PCT/JP2011/002764
Other languages
French (fr)
Japanese (ja)
Inventor
荒川隆行 (Takayuki Arakawa)
越仲孝文 (Takafumi Koshinaka)
Original Assignee
NEC Corporation (日本電気株式会社)
Priority date
Filing date
Publication date
Application filed by NEC Corporation
Publication of WO2011148594A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Definitions

  • the present invention relates to a voice recognition system, a voice acquisition terminal, a voice recognition sharing method, and a voice recognition program that share voice recognition processing between a terminal and a server device.
  • a distributed speech recognition (hereinafter referred to as DSR) system is widely used.
  • a terminal first performs speech detection, noise suppression, and feature amount conversion processing on speech input by a user, and transmits the compressed feature amount to a server.
  • the server device performs speech recognition based on the feature amount transmitted from the terminal, and transmits the recognition result to the terminal side.
  • the terminal notifies the user of the recognition result.
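  • The following Python sketch illustrates this division of labor. Every function here (detect_speech, extract_features, server_recognize) is an illustrative stand-in, since the patent describes the stages but not their implementations, and the noise-suppression stage is omitted for brevity.

```python
import numpy as np

# Illustrative stand-ins for the DSR stages described above.
def detect_speech(signal, threshold=0.01):
    """Speech detection: keep samples between the first and last loud sample."""
    loud = np.flatnonzero(np.abs(signal) > threshold)
    return signal[loud[0]:loud[-1] + 1] if loud.size else signal

def extract_features(signal, dim=13):
    """Feature amount conversion placeholder: pack samples into dim-wide vectors."""
    usable = len(signal) - len(signal) % dim
    return signal[:usable].reshape(-1, dim)

def server_recognize(features):
    """Server-side recognition placeholder."""
    return f"recognized {len(features)} feature vectors"

def terminal_side(signal):
    speech = detect_speech(signal)       # speech detection on the terminal
    features = extract_features(speech)  # feature amount conversion on the terminal
    result = server_recognize(features)  # recognition on the server
    return result                        # the terminal notifies the user of this result

print(terminal_side(np.random.randn(16000) * 0.1))
```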
  • FIG. 14 is an explanatory diagram showing a voice recognition system described in Patent Document 1.
  • a plurality of client stations (terminals) 320, 330, and 340 are connected to a server 310 via a public Internet network 350.
  • the terminal 330 includes an interface (IF) 331 that acquires a user's voice signal, a communication interface (COM) 332 that communicates with the server 310, a spectrum analysis unit (SAS) 333 that obtains a feature vector from the acquired voice signal, a speech recognition unit (SR) 334 that performs speech recognition from the feature vector, a speech controller (SC) 335 that distributes a part of the feature vector to the server 310 depending on the speech recognition result, and a controllable switch (SW) 336 that determines whether the feature vector is transmitted to the server 310 via the communication interface 332.
  • the server 310 includes a communication interface (COM) 312 that communicates with a terminal, and a speech recognition unit (REC) 314 that performs speech recognition from a feature quantity vector received from the terminal.
  • the speech recognition unit 334 on the terminal side performs speech recognition with a relatively small vocabulary, which requires little CPU power.
  • the server-side voice recognition unit 314 performs voice recognition with a relatively large vocabulary, which requires much CPU power. In this way, voice recognition with good response is achieved by distributing the voice recognition processing efficiently.
  • Patent Document 2 describes an information terminal that performs voice recognition processing in a shared manner.
  • the information terminal described in Patent Document 2 extracts feature points of the captured voice data, determines the complexity of the voice data, and determines a device that performs voice recognition processing according to the complexity.
  • processing related to speech recognition is shared by changing the amount of speech input signal transmitted according to the CPU load on the terminal side or the CPU load on the server side.
  • the speech recognition system described in Patent Document 1 distributes processing in consideration of only the CPU load, and therefore cannot share the processing performed in speech recognition appropriately. That is, considering only the CPU load is not sufficient to share the voice recognition processing appropriately among a plurality of devices.
  • An object of the present invention is to provide a voice recognition system, a voice acquisition terminal, a voice recognition sharing method, and a voice recognition program that can appropriately share voice recognition processing between a terminal and a server device.
  • the speech recognition system includes a speech acquisition terminal that receives speech and acquires an input signal representing the speech, and a speech recognition device that performs speech recognition based on information transmitted from the speech acquisition terminal.
  • the voice acquisition terminal includes a processing device determination unit that selects, according to the voice input situation, at least one device from among the voice acquisition terminal itself and the voice recognition device to perform the calculation processing of the feature amounts used for voice recognition, and that selects, according to the input situation, at least one device from among them to perform voice recognition based on the calculated feature amounts.
  • the processing device determination unit selects the device that performs the feature amount calculation processing and the device that performs voice recognition according to information representing at least one of the voice input environment, the task size, the status of the voice acquisition terminal itself, the status of the voice recognition device, and the communication status between the voice acquisition terminal and the voice recognition device.
  • the voice acquisition terminal according to the present invention is a voice acquisition terminal that receives an input of voice and acquires an input signal representing the voice, and includes a processing device determination unit that selects, according to the voice input situation, at least one device to perform the calculation processing of the feature amounts used for voice recognition from among a voice recognition device that performs voice recognition based on information transmitted from the voice acquisition terminal and the voice acquisition terminal itself, and that selects, according to the input situation, at least one device to perform voice recognition based on the calculated feature amounts from among the voice recognition device and the voice acquisition terminal itself.
  • the processing device determination unit is characterized by selecting the device that performs the feature amount calculation and the device that performs voice recognition according to information representing at least one of the voice input environment, the task size, the status of the voice acquisition terminal itself, the status of the voice recognition device, and the communication status between the voice acquisition terminal and the voice recognition device.
  • in the voice recognition sharing method according to the present invention, a voice acquisition terminal that receives a voice and acquires an input signal representing the voice selects, according to the voice input situation, at least one device to perform the calculation processing of the feature amounts used for voice recognition from among a voice recognition device that performs voice recognition based on information transmitted from the voice acquisition terminal and the voice acquisition terminal itself.
  • the voice acquisition terminal further selects, according to the input situation, at least one device to perform voice recognition based on the calculated feature amounts, and selects the device that performs the feature amount calculation processing and the device that performs voice recognition according to information representing at least one of the voice input environment, the task size, the status of the voice acquisition terminal itself, the status of the voice recognition device, and the communication status between the voice acquisition terminal and the voice recognition device.
  • the voice recognition program according to the present invention is applied to a computer that receives a voice and acquires an input signal representing the voice, and causes the computer to execute a processing device determination process of selecting, according to the voice input situation, at least one device to perform the calculation processing of the feature amounts used for voice recognition from among a voice recognition device that performs voice recognition based on information transmitted from the computer and the computer itself, and of selecting, according to the input situation, at least one device to perform voice recognition based on the calculated feature amounts from among the voice recognition device and the computer.
  • in the processing device determination process, the device that performs the feature amount calculation processing and the device that performs voice recognition are selected according to information representing at least one of the voice input environment, the task size, the status of the computer itself, the status of the voice recognition device, and the communication status between the computer and the voice recognition device.
  • the processing performed in voice recognition can thereby be appropriately shared between the terminal and the server apparatus.
  • FIG. 1 is a block diagram showing an example of a speech recognition system according to the first embodiment of the present invention.
  • the voice recognition system according to the present embodiment includes a terminal 100 and a server device 200.
  • the terminal 100 may be referred to as the terminal side
  • the server device 200 may be referred to as the server side.
  • the terminal 100 and the server device 200 are connected via, for example, a public Internet network.
  • the terminal 100 includes an input signal acquisition unit 101, a feature amount conversion unit 102, a voice recognition unit 103, a situation detection unit 104, a processing control unit 105, a transmission/reception unit 106, a recognition result integration unit 107, and a recognition result display unit 108.
  • the input signal acquisition unit 101 converts the input voice into input sound data (hereinafter referred to as an input signal). Specifically, the input signal acquisition unit 101 cuts out time-series input sound data collected by the microphone 99 or the like for each frame of unit time.
  • the feature quantity conversion unit 102 converts the time series of the input signal into a time series of feature quantities used for speech recognition.
  • the feature quantity conversion unit 102 converts an input signal into a feature quantity by using a method such as LPC (Linear Predictive Coding) cepstrum analysis or MFCC (Mel-Frequency Cepstrum Coefficient) analysis.
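  • As a concrete illustration of such a conversion, the sketch below computes MFCC features with the librosa library (an assumed dependency; the patent does not name one), using a 25 ms window and 10 ms shift to match the framing example given later in the text.

```python
import numpy as np
import librosa  # assumed dependency; the patent does not mandate a library

sr = 8000
signal = np.random.randn(sr).astype(np.float32)  # stand-in for 1 s of speech

# MFCC analysis, one of the methods named above. 13 coefficients per frame
# is a common choice, not a value specified by the patent; the window and
# shift (200 and 80 points at 8000 Hz) match the later framing example.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13, n_fft=200, hop_length=80)
print(mfcc.shape)  # (13, number_of_frames)
```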
  • the voice recognition unit 103 performs voice recognition based on the time series of the converted feature values. In addition, the voice recognition unit 103 calculates a score representing the recognition result at the same time.
  • the score representing the recognition result (hereinafter sometimes simply referred to as a score) is an index representing the likelihood of speech recognition.
  • the speech recognition unit 103 may calculate, as a score representing a recognition result, for example, a distance between the feature amount time series and an acoustic model, an acoustic score such as a likelihood, or a language score representing linguistic consistency.
  • the voice recognition unit 103 may obtain a score for the entire recognition result, or may obtain a score in various units such as for each frame, for each word, or for each utterance section. Note that a method for performing speech recognition based on a feature amount used for speech recognition and a method for calculating an index representing the likelihood of speech recognition are widely known, and thus description thereof is omitted here.
  • the situation detection unit 104 detects a situation where sound is input.
  • the status detection unit 104 detects various situations, such as the environment in which the voice is input, the statuses of the terminal 100 and the server device 200, the task size, and the state of the line between the terminal 100 and the server device 200 over which the input signal is transmitted.
  • the status detection unit 104 detects, for example, the CPU load and the memory usage status as the status of the terminal 100 and the server device 200.
  • the statuses of the terminal 100 and the server device 200 detected by the status detection unit 104 are not limited to the CPU load and the memory usage status.
  • task size is an index that represents the difficulty in speech recognition of utterances. For example, the number of vocabulary words that can be recognized by speech may be used as a scale representing the task size.
  • the complexity of utterance accepted by speech recognition may be used as a measure representing the task size. For example, utterance complexity may be expressed by keyword recognition or natural language recognition.
  • For example, when the vocabulary is small or the accepted utterances are simple, the situation detection unit 104 may determine that the difficulty of speech recognition is low.
  • Conversely, when the vocabulary is large or the accepted utterances are complicated, the situation detection unit 104 may determine that the difficulty of speech recognition is high.
  • the situation detection unit 104 may detect difficulty in speech recognition due to the large number of words or complexity based on specifications required by the application.
  • the environment where voice is input includes noise level and the like.
  • the noise level represents the degree of noise included in the input voice.
  • the magnitude (sound pressure) of the noise may be set as the noise level.
  • the situation detection unit 104 may detect the noise level from, for example, the sound pressure of the input signal input to the microphone 99 before the user speaks.
  • the status detection unit 104 may detect the CPU loads of the terminal 100 and the server device 200 and the line state between the devices using, for example, an API (Application Program Interface) provided by the OS (Operating System). For example, the status detection unit 104 may transmit a packet requesting the CPU usage status to the server side and calculate the CPU load from information included in the returned packet. Further, for example, the status detection unit 104 may transmit an ICMP (Internet Control Message Protocol) Echo request packet to the server side, measure the time until the ICMP packet returned from the server side is received, and detect the line state from the measured time.
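  • A minimal sketch of such a line-state probe follows. Because ICMP Echo normally requires raw-socket privileges, this version times a TCP connection instead, which is our substitution; the host and port are illustrative.

```python
import socket
import time

def probe_line_state(host, port=80, timeout=2.0):
    """Estimate the line state from a TCP connect round trip.

    A stand-in for the ICMP Echo measurement described in the text;
    host and port are illustrative values.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start   # seconds; small = healthy line
    except OSError:
        return None                           # treated as "line disconnected"

rtt = probe_line_state("example.com")
print("disconnected" if rtt is None else f"round trip: {rtt * 1000:.1f} ms")
```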
  • the method by which the situation detection unit 104 detects the CPU load and the line state between devices is not limited to the above methods.
  • the process control unit 105 includes a feature amount conversion device determination unit 105a and a voice recognition device determination unit 105b.
  • the process control unit 105 controls which apparatus is to execute the subsequent processing based on the situation detected by the situation detection unit 104.
  • the feature amount conversion device determination unit 105a determines which device is to execute the feature amount calculation process based on the detected situation.
  • the voice recognition device determination unit 105b determines which device should perform voice recognition based on the feature amount based on the detected situation.
  • the feature amount conversion device determination unit 105a and the speech recognition device determination unit 105b select whether to cause the terminal 100 to execute the processing, to cause the server device 200 to execute the processing, or to cause both the terminal 100 and the server device 200 to execute the processing.
  • a specific method by which the feature amount conversion device determination unit 105a and the speech recognition device determination unit 105b select which device to execute the subsequent processing will be described later.
  • the transmission/reception unit 106 transmits the time series of the input signal or the time series of the feature amounts to the server device 200 according to the determination results of the processing control unit 105 (more specifically, the feature amount conversion device determination unit 105a and the speech recognition device determination unit 105b). In addition, the transmission/reception unit 106 receives the result of speech recognition performed by the server device 200.
  • Specifically, when the feature amount conversion is performed on the server side, the transmission/reception unit 106 transmits the time series of the input signal to the server device 200.
  • When speech recognition is performed on the server side, the transmission/reception unit 106 transmits the time series of the feature amounts to the server device 200.
  • the recognition result integration unit 107 compares and integrates the result of speech recognition performed by the speech recognition unit 103 and the result of speech recognition performed by the server device 200 received by the transmission/reception unit 106. That is, the recognition result integration unit 107 selects the more appropriate result from the two and merges the selected results. For example, when speech recognition is performed on only one of the terminal side and the server side, the recognition result integration unit 107 may use the result obtained by that device as the speech recognition result. On the other hand, when speech recognition is performed on both the terminal side and the server side, the recognition result integration unit 107 may select the more likely speech recognition result and use the selected result as the speech recognition result.
  • the recognition result display unit 108 displays the result of speech recognition compared and integrated by the recognition result integration unit 107 to the user.
  • the recognition result display unit 108 is realized by, for example, a display device.
  • the server device 200 includes a transmission / reception unit 201, a processing control unit 202, a feature amount conversion unit 203, and a voice recognition unit 204. Server device 200 performs voice recognition based on information transmitted from terminal 100.
  • the transmission / reception unit 201 receives data transmitted from the terminal 100. Further, the transmission / reception unit 201 transmits the speech recognition result by the speech recognition unit 204 to the terminal 100.
  • the process control unit 202 determines subsequent process contents based on the information received from the terminal 100. Specifically, the processing control unit 202 determines the subsequent processing contents depending on whether the data received from the terminal 100 is a time series of input signals or a time series of feature amounts. For example, when the data received from the terminal 100 is a time series of input signals, the process control unit 202 causes the feature amount conversion unit 203 to execute a feature amount calculation process. On the other hand, when the data received from the terminal 100 is a time series of feature amounts, the process control unit 202 causes the speech recognition unit 204 to perform a speech recognition process using the feature amounts.
  • the feature value conversion unit 203 converts the time series of the received input signal into a time series of feature values used for speech recognition.
  • the speech recognition unit 204 performs speech recognition based on the time series of feature amounts converted by the feature amount conversion unit 203 or the time series of feature amounts received from the terminal 100.
  • the voice recognition unit 204 also calculates a score representing the recognition result at the same time. Note that methods for converting an input signal into feature amounts, performing speech recognition based on the feature amounts, and calculating an index representing the likelihood of speech recognition are widely known, and their description is omitted here.
  • the input signal acquisition unit 101, feature amount conversion unit 102, speech recognition unit 103, situation detection unit 104, processing control unit 105 (more specifically, the feature amount conversion device determination unit 105a and the speech recognition device determination unit 105b), transmission/reception unit 106, and recognition result integration unit 107 are realized by a CPU of a computer that operates according to a program (voice recognition program).
  • the program is stored in a storage unit (not shown) of the terminal 100, and the CPU reads the program, and in accordance with the program, the input signal acquisition unit 101, the feature amount conversion unit 102, the voice recognition unit 103, the situation detection unit. 104, the process control unit 105 (more specifically, the feature amount conversion device determination unit 105a and the speech recognition device determination unit 105b), the transmission / reception unit 106, and the recognition result integration unit 107 may be operated.
  • the input signal acquisition unit 101, the feature amount conversion unit 102, the speech recognition unit 103, the situation detection unit 104, the processing control unit 105, the transmission / reception unit 106, and the recognition result integration unit 107 are dedicated to each. It may be realized by hardware.
  • FIG. 2 is a flowchart showing an example of the operation on the terminal side.
  • FIG. 3 is a flowchart showing an example of the operation on the server side. First, the operation on the terminal side will be described with reference to FIG. 2.
  • the input signal acquisition unit 101 cuts out the collected time-series input sound data for each unit time frame (step S101).
  • the input signal acquisition unit 101 may sequentially cut out waveform data for a unit time while shifting a portion to be cut out from input sound data by a predetermined time.
  • this unit time is referred to as a frame width
  • this predetermined time is referred to as a frame shift.
  • For example, if the input sound data is 16-bit linear PCM (Pulse Code Modulation) with a sampling frequency of 8000 Hz, it contains waveform data for 8000 points per second.
  • the input signal acquisition unit 101 may extract the waveform data sequentially in time series at a frame width of 200 points (ie, 25 milliseconds) and a frame shift of 80 points (ie, 10 milliseconds).
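  • A minimal framing sketch under exactly these assumptions (8000 Hz input, a 200-point frame width, an 80-point frame shift):

```python
import numpy as np

SR = 8000            # sampling frequency from the example above
FRAME_WIDTH = 200    # 25 milliseconds at 8000 Hz
FRAME_SHIFT = 80     # 10 milliseconds at 8000 Hz

def cut_frames(signal, width=FRAME_WIDTH, shift=FRAME_SHIFT):
    """Cut time-series input sound data into overlapping unit-time frames."""
    count = 1 + max(0, (len(signal) - width) // shift)
    return np.stack([signal[i * shift : i * shift + width] for i in range(count)])

one_second = np.zeros(SR, dtype=np.int16)   # 8000 points of 16-bit linear PCM
frames = cut_frames(one_second)
print(frames.shape)   # (98, 200): about 98 frames per second at a 10 ms shift
```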
  • the feature quantity conversion device determination unit 105a determines, according to the situation detected by the situation detection unit 104, whether the terminal-side feature quantity conversion unit 102 converts the time series of the input signal into a time series of feature quantities, whether the transmission/reception unit 106 transmits the input signal to the server device 200 (that is, the server-side feature quantity conversion unit 203 performs the conversion), or whether both perform the conversion (step S102).
  • For example, the feature quantity conversion device determination unit 105a determines which device converts the time series of the input signal into a time series of feature quantities based on the following conditions (a combined sketch of these rules and the corresponding speech recognition rules appears after the conditions for step S105 below). 1. When the transmission line is disconnected, when the communication speed is very low, or when the CPU load of the server device 200 is high, the feature quantity conversion device determination unit 105a determines that the input signal is not transmitted to the server device 200 and the conversion to feature quantities is performed on the terminal side. 2. Otherwise, when the CPU load of the terminal 100 is high or the noise level is high, the feature quantity conversion device determination unit 105a determines that the input signal is transmitted to the server side and the conversion is not performed on the terminal side; that is, the feature quantity conversion is performed on the server side. 3. In all other cases, the feature quantity conversion device determination unit 105a determines that the input signal is transmitted to the server side and the terminal side also performs the conversion; that is, the feature quantity conversion is performed on both the terminal side and the server side.
  • When the feature quantity conversion device determination unit 105a determines that the conversion is performed on the terminal side ("terminal side" in step S102), the feature quantity conversion unit 102 converts the time series of the input signal cut out for each frame into a time series of feature quantities (step S103).
  • When the feature quantity conversion device determination unit 105a determines that the feature quantity conversion is performed on the server side ("server side" in step S102), it causes the transmission/reception unit 106 to transmit the input signal (step S104).
  • the transmission / reception unit 106 may compress and transmit the time series of the input signal for each unit, or may encode and transmit the time series of the input signal.
  • the transmission / reception unit 106 may add header information or the like to the head of the data to indicate that the content to be transmitted is an input signal.
  • the transmission/reception unit 106 may notify the server side of the data format before transmitting the input signal. This allows the server side to determine the content of the data it subsequently receives.
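  • One possible wire format realizing this idea is shown below: a small header announces whether the payload is an input signal or feature amounts, and the payload itself is compressed. The JSON-plus-zlib encoding is entirely our assumption; the patent only requires that the header identify the content.

```python
import json
import zlib

# Hypothetical framing: 4-byte header length, JSON header, compressed payload.
def pack(kind: str, payload: bytes) -> bytes:
    header = json.dumps({"type": kind, "length": len(payload)}).encode()
    return len(header).to_bytes(4, "big") + header + zlib.compress(payload)

def unpack(message: bytes):
    hlen = int.from_bytes(message[:4], "big")
    header = json.loads(message[4 : 4 + hlen])
    return header["type"], zlib.decompress(message[4 + hlen :])

kind, data = unpack(pack("input_signal", b"\x00\x01" * 100))
print(kind, len(data))  # input_signal 200
```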
  • the voice recognition device determination unit 105b determines, according to the situation detected by the situation detection unit 104, whether the terminal-side voice recognition unit 103 performs voice recognition based on the time series of the feature amounts converted by the feature amount conversion unit 102, whether the time series of the feature amounts is transmitted to the server device 200 (that is, the server-side speech recognition unit 204 performs the voice recognition processing), or whether both perform voice recognition (step S105).
  • For example, the voice recognition device determination unit 105b determines which device performs voice recognition based on the time series of the feature amounts according to the following conditions. 1. When the transmission line is disconnected, when the communication speed is very low, or when the CPU load of the server device 200 is high, the speech recognition device determination unit 105b determines that the feature amounts are not transmitted to the server device 200 and voice recognition is performed on the terminal side. 2. Otherwise, when the task size is large (that is, recognition is more difficult), the speech recognition device determination unit 105b determines that the feature amounts are transmitted to the server device 200 and voice recognition is not performed on the terminal side; that is, voice recognition is performed on the server side. 3. In all other cases, the speech recognition device determination unit 105b determines that the feature amounts are transmitted to the server device 200 and the terminal side also performs voice recognition; that is, voice recognition is performed on both the terminal side and the server side.
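  • The following is a minimal sketch of the two selection rules above (feature quantity conversion and voice recognition). The Situation fields and their encoding are illustrative assumptions; the patent prescribes the conditions but no data structures or thresholds.

```python
from dataclasses import dataclass

@dataclass
class Situation:
    line_disconnected: bool
    communication_slow: bool
    server_cpu_high: bool
    terminal_cpu_high: bool
    noise_high: bool
    task_large: bool

def decide_feature_conversion(s: Situation) -> str:
    if s.line_disconnected or s.communication_slow or s.server_cpu_high:
        return "terminal"   # condition 1: do not transmit to the server
    if s.terminal_cpu_high or s.noise_high:
        return "server"     # condition 2: send the input signal, skip local conversion
    return "both"           # condition 3: convert on both sides

def decide_recognition(s: Situation) -> str:
    if s.line_disconnected or s.communication_slow or s.server_cpu_high:
        return "terminal"   # condition 1: recognize locally
    if s.task_large:
        return "server"     # condition 2: a large task is more difficult
    return "both"           # condition 3: recognize on both sides

s = Situation(False, False, False, True, False, True)
print(decide_feature_conversion(s), decide_recognition(s))  # server server
```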
  • When the speech recognition device determination unit 105b determines that speech recognition is performed on the terminal side ("terminal side" in step S105), the speech recognition unit 103 performs speech recognition on the time series of feature amounts (step S106). Specifically, the speech recognition unit 103 searches for a corresponding word string in a storage unit (not shown) provided in the terminal 100 and uses the search result as the speech recognition result. At this time, the speech recognition unit 103 simultaneously calculates a score representing the recognition result.
  • When the voice recognition device determination unit 105b determines that voice recognition is performed on the server side ("server side" in step S105), it causes the transmission/reception unit 106 to transmit the feature amounts (step S107).
  • the transmission / reception unit 106 may compress and transmit the time series of feature amounts for each unit, or may encode and transmit the time series of feature amounts.
  • the transmission / reception unit 106 may add header information or the like to the head of the data to indicate that the content to be transmitted is a feature amount.
  • When speech recognition is performed on both the terminal side and the server side, the recognition result integration unit 107 selects between the terminal-side speech recognition result and the server-side speech recognition result and integrates the speech recognition results (step S109).
  • Note that the recognition result integration unit 107 may simply select one speech recognition result without integrating the results.
  • the recognition result integration unit 107 may select a speech recognition result having a higher score from the scores calculated by the speech recognition unit 103 and the speech recognition unit 204, for example. For example, the recognition result integration unit 107 may compare the speech recognition results in units of division such as words, and select a speech recognition result having a higher score in the compared division units.
  • Alternatively, the recognition result integration unit 107 may use only the terminal-side voice recognition result without using the voice recognition result from the server device 200.
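  • As one concrete, purely illustrative realization of this integration, the sketch below merges two word-level result lists by keeping, for each word position, the hypothesis with the higher score; the equal-length word alignment is our simplifying assumption, not the patent's.

```python
def integrate(terminal_result, server_result):
    """Merge (word, score) lists from the terminal and the server."""
    if server_result is None:       # e.g. nothing was received from the server
        return [w for w, _ in terminal_result]
    if terminal_result is None:     # recognition ran only on the server
        return [w for w, _ in server_result]
    merged = []
    for (t_word, t_score), (s_word, s_score) in zip(terminal_result, server_result):
        merged.append(t_word if t_score >= s_score else s_word)  # keep higher score
    return merged

terminal = [("play", 0.9), ("some", 0.4), ("jazz", 0.7)]
server = [("play", 0.8), ("sam", 0.3), ("jazz", 0.9)]
print(integrate(terminal, server))  # ['play', 'some', 'jazz']
```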
  • the recognition result display unit 108 displays the voice recognition results (step S110).
  • the recognition result display unit 108 may display the recognition result as a character string on a display device or the like.
  • Alternatively, the recognition result display unit 108 may output the voice recognition result as synthesized speech through headphones or a speaker (not shown).
  • the transmission / reception unit 201 receives data from the terminal side (step S201).
  • the transmission / reception unit 201 decompresses and decodes the data.
  • the processing control unit 202 changes the subsequent processing content according to the content of the received data (step S202).
  • When the received data is a time series of input signals ("input signal" in step S202), the feature amount conversion unit 203 converts the input signal into feature amounts (step S203).
  • When the received data is a time series of feature amounts ("feature amount" in step S202), the feature amount conversion unit 203 does not perform the feature amount conversion processing.
  • the server-side feature value conversion unit 203 converts the time series of the input signal into the time series of feature values for each frame.
  • the speech recognition unit 204 then performs speech recognition on the time series of feature amounts (step S204). Specifically, the speech recognition unit 204 searches for a corresponding word string in a storage unit (not shown) provided in the server device 200 and uses the search result as the speech recognition result. At this time, the speech recognition unit 204 simultaneously calculates a score representing the recognition result. Then, the transmission/reception unit 201 transmits the speech recognition result to the terminal 100 (step S205).
  • As described above, the feature quantity conversion device determination unit 105a selects where the feature quantities used for speech recognition are calculated according to the voice input situation (for example, the environment in which the voice is input, the task size, the statuses of the terminal 100 and the server device 200, and the state of the communication line).
  • Similarly, the voice recognition device determination unit 105b selects where voice recognition is performed according to the voice input situation. Therefore, the processing performed in voice recognition can be appropriately shared between the terminal 100 and the server device 200.
  • the status detection unit 104 determines the CPU load on the terminal side, the CPU load on the server side, the memory usage status on the terminal side, the memory usage status on the server side, the task size, the noise level, the status of the transmission line, etc. Detect. Then, the processing control unit 105 (more specifically, the feature amount conversion device determination unit 105a and the speech recognition device determination unit 105b) controls the processing content performed on the terminal side and the processing content performed on the server side. Therefore, efficient processing distribution can be performed and voice recognition with good response can be realized.
  • That is, the sharing destination is determined based on various factors other than the CPU load, such as the task size, the noise level, and the state of the line over which information is transmitted. Therefore, the processing performed in voice recognition can be appropriately shared between the terminal and the server device.
  • FIG. 4 is a block diagram illustrating an example of a speech recognition system according to the second embodiment of the present invention.
  • the same reference numerals as in FIG. 1 are attached to the same components, and their description is omitted.
  • the voice recognition system in this embodiment includes a terminal 300 and a server device 400.
  • the terminal 300 may be referred to as the terminal side
  • the server device 400 may be referred to as the server side.
  • the terminal 300 and the server device 400 are connected via, for example, a public Internet network.
  • the terminal 300 includes an input signal acquisition unit 101, a feature amount conversion unit 102, a voice recognition unit 103, a situation detection unit 104, a voice detection unit 301, a processing control unit 302, a transmission/reception unit 106, a recognition result integration unit 107, and a recognition result display unit 108. That is, the terminal 300 in the second embodiment differs from the terminal 100 in the first embodiment in that the voice detection unit 301 is added and the process control unit 105 in the first embodiment is replaced with the process control unit 302.
  • the voice detection unit 301 determines a voice section to be recognized from the time series of the input signal input to the input signal acquisition unit 101, and cuts the time series of the voice section. That is, the voice detection unit 301 extracts a voice section from a time series of input signals. For example, as described in Reference Document 1, the voice detection unit 301 may detect an utterance section by measuring the energy of framed voice data. In addition, as described in Reference Document 2, the voice detection unit 301 may detect a voice section using a plurality of feature amounts extracted from the input signal.
  • the method by which the voice detection unit 301 detects the voice section is not limited to the above method.
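  • The sketch below shows an energy-based utterance detector in the spirit of the first method; the threshold rule (a multiple of the mean energy of the first few frames) is an illustrative assumption, not a value from Reference Document 1.

```python
import numpy as np

def detect_voice_section(frames, energy_ratio=4.0, noise_frames=10):
    """Return (start, end) frame indices of the voice section, or None.

    The threshold is the mean energy of the first noise_frames frames
    times energy_ratio; both parameters are illustrative.
    """
    energy = (frames.astype(np.float64) ** 2).mean(axis=1)   # per-frame energy
    threshold = energy[:noise_frames].mean() * energy_ratio
    voiced = np.flatnonzero(energy > threshold)
    if voiced.size == 0:
        return None
    return voiced[0], voiced[-1] + 1

rng = np.random.default_rng(0)
frames = rng.normal(0, 0.01, (100, 200))
frames[40:60] += rng.normal(0, 0.5, (20, 200))   # louder "speech" in the middle
print(detect_voice_section(frames))               # approximately (40, 60)
```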
  • the process control unit 302 includes a voice detection device determination unit 302a, a feature amount conversion device determination unit 302b, and a voice recognition device determination unit 302c.
  • the process control unit 302 controls which apparatus is to execute the subsequent processing based on the situation detected by the situation detection unit 104. Specifically, the voice detection device determination unit 302a determines which device is to execute the process of extracting the voice section based on the detected situation. Further, the feature amount conversion device determination unit 302b determines which device is to execute processing for converting the input signal from which the speech section is extracted into the feature amount based on the detected state. Furthermore, the voice recognition device determination unit 302c determines which device is to execute the voice recognition processing based on the feature amount based on the detected situation. Note that the method by which the speech recognition device determination unit 302c selects a device is the same as the method by which the speech recognition device determination unit 105b in the first embodiment selects a device.
  • the voice detection device determination unit 302a, the feature amount conversion device determination unit 302b, and the voice recognition device determination unit 302c select whether to cause the terminal 300 to execute the processing, to cause the server device 400 to execute the processing, or to cause both the terminal 300 and the server device 400 to execute the processing. Note that the method by which the voice detection device determination unit 302a and the feature amount conversion device determination unit 302b select the device that executes the subsequent processing will be described later.
  • the server apparatus 400 includes a transmission / reception unit 201, a processing control unit 401, a voice detection unit 402, a feature amount conversion unit 203, and a voice recognition unit 204. That is, the server apparatus 400 in the second embodiment is different from the server apparatus 200 in the first embodiment in that a voice detection unit 402 is added. In the server device 400 in the second embodiment, the process control unit 202 in the first embodiment is replaced with a process control unit 401.
  • the processing control unit 401 determines subsequent processing contents based on the information received from the terminal 300. Specifically, the processing control unit 401 determines the subsequent processing contents depending on whether the data received from the terminal 300 is a time series of an input signal, a time series of an input signal obtained by cutting a speech section, or a time series of feature amounts. judge.
  • the process control unit 401 when the data received from the terminal 300 is a time series of the input signal, the process control unit 401 causes the voice detection unit 402 to execute a process of cutting out a voice section. In addition, when the data received from the terminal 300 is a time series of input signals obtained by cutting out voice segments, the processing control unit 401 causes the feature amount conversion unit 203 to execute a feature amount calculation process. Further, when the data received from the terminal 300 is a time series of feature amounts, the process control unit 401 causes the speech recognition unit 204 to perform a speech recognition process using the feature amounts.
  • the voice detection unit 402 determines the voice section to be recognized from the time series of the input signal received from the terminal 300 and cuts out the time series of the voice section. That is, the voice detection unit 402 extracts the voice section from the time series of the input signal.
  • the input signal acquisition unit 101, the feature amount conversion unit 102, the voice recognition unit 103, the situation detection unit 104, the voice detection unit 301, the processing control unit 302 (more specifically, the voice detection device determination unit 302a, the feature amount conversion device determination unit 302b, and the speech recognition device determination unit 302c), the transmission/reception unit 106, and the recognition result integration unit 107 are realized by a CPU of a computer that operates according to a program (speech recognition program).
  • Alternatively, each of these units may be realized by dedicated hardware.
  • FIG. 5 is a flowchart showing an example of the operation on the terminal side.
  • FIG. 6 is a flowchart showing an example of the operation on the server side. First, the operation on the terminal side will be described with reference to FIG. 5.
  • the input signal acquisition unit 101 cuts out the collected time-series input sound data for each unit-time frame (step S101).
  • the voice detection device determination unit 302a determines, according to the situation detected by the situation detection unit 104, whether the voice section to be recognized is determined and cut out from the time series of the input signal on the terminal side, whether the input signal is transmitted from the transmission/reception unit 106 to the server side (that is, the voice section is determined and cut out on the server side), or whether both determine and cut out the voice section (step S301).
  • For example, the voice detection device determination unit 302a determines which device determines and cuts out the voice section from the time series of the input signal based on the following conditions. 1. When the transmission line is disconnected, when the communication speed is very low, or when the CPU load of the server device 400 is high, the voice detection device determination unit 302a determines that the input signal is not transmitted to the server device 400 and the voice section is cut out on the terminal side. 2. Otherwise, when the CPU load of the terminal 300 is high, when a large amount of memory is used on the terminal side, or when the noise level is high, the voice detection device determination unit 302a determines that the input signal is transmitted to the server side and the voice section is not cut out on the terminal side; that is, the voice section is cut out on the server side. 3. In all other cases, the voice detection device determination unit 302a determines that the input signal is transmitted to the server side and the voice section is also cut out on the terminal side; that is, the voice section is cut out on both the terminal side and the server side.
  • When the voice detection device determination unit 302a determines that the voice section is cut out on the terminal side ("terminal side" in step S301), the voice detection unit 301 determines the voice section to be recognized from the time series of the input signal and cuts it out.
  • When the voice detection device determination unit 302a determines that the voice section is cut out on the server side ("server side" in step S301), it causes the transmission/reception unit 106 to transmit the input signal (step S104). As in the first embodiment, the transmission/reception unit 106 may compress the time series of the input signal for each unit before transmission, or may encode it before transmission.
  • the feature quantity conversion device determination unit 302b determines, according to the situation detected by the situation detection unit 104, whether the time series of the voice section extracted on the terminal side is converted into a time series of feature quantities on the terminal side, whether it is transmitted from the transmission/reception unit 106 to the server device 400 (that is, the server-side feature quantity conversion unit 203 performs the conversion), or whether both perform the conversion (step S303).
  • For example, the feature quantity conversion device determination unit 302b determines which device converts the extracted voice section into a time series of feature quantities based on the following conditions. 1. When the transmission line is disconnected, when the communication speed is very low, or when the CPU load of the server device 400 is high, the feature quantity conversion device determination unit 302b determines that the cut-out voice section is not transmitted to the server device 400 but converted into feature quantities on the terminal side. 2. Otherwise, when the CPU load of the terminal 300 is high, when a large amount of memory is used on the terminal side, or when the noise level is high, the feature quantity conversion device determination unit 302b determines that the extracted voice section is transmitted to the server side and the conversion is not performed on the terminal side; that is, the feature quantities are converted on the server side. 3. In all other cases, the feature quantity conversion device determination unit 302b determines that the extracted voice section is transmitted to the server side and the terminal side also performs the conversion; that is, the feature quantities are converted on both the terminal side and the server side.
  • Note that the voice detection device determination unit 302a and the feature quantity conversion device determination unit 302b determine, from the detected situation, whether the communication speed is low, whether the CPU loads of the terminal 300 and the server device 400 are high, whether a large amount of memory is used on the terminal side, and whether the noise level is high.
  • When the feature amount conversion device determination unit 302b determines that the conversion is performed on the terminal side ("terminal side" in step S303), the feature amount conversion unit 102 converts the time series of the input signal cut out for each frame into a time series of feature amounts (step S103).
  • When the feature amount conversion device determination unit 302b determines that the feature amount conversion is performed on the server side ("server side" in step S303), it causes the transmission/reception unit 106 to transmit the input signal of the cut-out voice section (step S304).
  • The subsequent processing, from the speech recognition device determination unit 302c determining the device that performs the speech recognition processing to the recognition result display unit 108 displaying the recognition result integrated by the recognition result integration unit 107, is similar to steps S105 to S110 illustrated in FIG. 2.
  • the transmission / reception unit 201 receives data from the terminal side (step S201).
  • the transmission / reception unit 201 decompresses and decodes the data.
  • the processing control unit 401 changes the subsequent processing content according to the content of the received data (step S401).
  • When the received data is a time series of input signals, the voice detection unit 402 determines the voice section to be subjected to voice recognition from the time series of the input signal and cuts out the determined voice section (step S402). At this time, the voice detection unit 402 may cut out the section with a margin of several frames before and after the voice section.
  • After the voice section is cut out, or when the received data is a time series of input signals from which the voice section has already been cut out, the feature amount conversion unit 203 converts the input signal into feature amounts (step S203).
  • After the feature amount conversion unit 203 converts the input signal into feature amounts, or when the received data is a time series of feature amounts ("feature amount" in step S401), the speech recognition unit 204 performs speech recognition (step S204).
  • the process in which the transmission / reception unit 201 transmits the speech recognition result to the terminal 300 is the same as the process in step S205 illustrated in FIG.
  • the voice detection process can increase the accuracy of voice recognition, but requires a lot of CPU power.
  • the voice detection device determination unit 302a selects which device performs voice segment extraction processing according to the voice input status.
  • the voice detection unit 301 extracts a voice section from the input signal. Therefore, for example, when the noise level is high, a voice recognition result with higher accuracy can be obtained by performing voice detection processing on the server side.
  • FIG. 7 is a block diagram showing an example of a speech recognition system according to the third embodiment of the present invention.
  • the same reference numerals as in FIG. 1 are attached to the same components, and their description is omitted.
  • the voice recognition system in this embodiment includes a terminal 500 and a server device 600.
  • the terminal 500 includes an input signal acquisition unit 101, a feature amount conversion unit 102, a voice recognition unit 103, a situation detection unit 104, a noise removal unit 501, a processing control unit 502, a transmission/reception unit 106, a recognition result integration unit 107, and a recognition result display unit 108. In the terminal 500, the process control unit 105 in the first embodiment is replaced with the process control unit 502.
  • the noise removing unit 501 removes noise components from the input signal.
  • the noise removal unit 501 removes noise components using a method such as spectral subtraction or a Wiener filter, for example.
  • the method by which the noise removing unit 501 removes noise components is not limited to the above method.
  • the noise removing unit 501 may remove the noise component using another widely known method.
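  • For illustration, a single-frame spectral subtraction sketch is given below. The noise-spectrum estimate (e.g., averaged over frames captured before the user speaks) and the spectral floor are our assumptions, and a practical implementation (or a Wiener filter) would differ in detail.

```python
import numpy as np

def spectral_subtraction(frame, noise_spectrum, floor=0.01):
    """Subtract a noise magnitude estimate from one frame's spectrum."""
    spectrum = np.fft.rfft(frame)
    magnitude = np.abs(spectrum) - noise_spectrum          # subtract noise estimate
    magnitude = np.maximum(magnitude, floor * np.abs(spectrum))  # keep a floor
    cleaned = magnitude * np.exp(1j * np.angle(spectrum))  # reuse the noisy phase
    return np.fft.irfft(cleaned, n=len(frame))

rng = np.random.default_rng(0)
noise = rng.normal(0, 0.05, 200)
noisy = np.sin(2 * np.pi * np.arange(200) / 20) + noise
noise_spectrum = np.abs(np.fft.rfft(noise))   # illustrative noise estimate
print(spectral_subtraction(noisy, noise_spectrum).shape)  # (200,)
```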
  • the process control unit 502 includes a noise removal device determination unit 502a, a feature amount conversion device determination unit 105a, and a speech recognition device determination unit 105b. Similar to the first embodiment, the processing control unit 502 controls which device is to execute the subsequent processing based on the situation detected by the situation detection unit 104. Specifically, the noise removal device determination unit 502a determines which device is to execute the noise removal process based on the detected situation. Further, the feature amount conversion device determination unit 105a determines which device is to execute the process of converting the input signal from which noise has been removed into the feature amount based on the detected state. The voice recognition device determination unit 105b is the same as that in the first embodiment.
  • the noise removal device determination unit 502a, the feature amount conversion device determination unit 105a, and the speech recognition device determination unit 105b select whether to cause the terminal 500 to execute the processing, to cause the server device 600 to execute the processing, or to cause both the terminal 500 and the server device 600 to execute the processing.
  • the noise removal device determination unit 502a may determine which device executes the noise removal processing by, for example, a method similar to that by which the feature amount conversion device determination unit 105a or the speech recognition device determination unit 105b selects a device. Note that the method by which the feature amount conversion device determination unit 105a and the speech recognition device determination unit 105b determine which device executes the subsequent processing is the same as in the first embodiment.
  • the server apparatus 600 includes a transmission / reception unit 201, a processing control unit 601, a noise removal unit 602, a feature amount conversion unit 203, and a voice recognition unit 204. That is, the server device 600 in the third embodiment is different from the server device 200 in the first embodiment in that a noise removing unit 602 is added. In the server device 600 according to the third embodiment, the process control unit 202 according to the first embodiment is replaced with a process control unit 601.
  • the processing control unit 601 determines the subsequent processing content based on the information received from the terminal 500. Specifically, the processing control unit 601 determines the subsequent processing contents depending on whether the data received from the terminal 500 is a time series of input signals, a time series of input signals from which noise is removed, or a time series of feature amounts. judge.
  • the processing control unit 601 when the data received from the terminal 500 is a time series of the input signal, the processing control unit 601 causes the noise removal unit 602 to execute processing for removing noise from the input signal. Also, when the data received from the terminal 500 is a time series of input signals from which noise has been removed, the processing control unit 601 causes the feature amount conversion unit 203 to execute a feature amount calculation process. Further, when the data received from the terminal 500 is a time series of feature amounts, the process control unit 601 causes the speech recognition unit 204 to perform a speech recognition process using the feature amounts.
  • the noise removing unit 602 removes noise from the input signal in the same manner as the noise removing unit 501. Note that the method by which the noise removing unit 602 removes noise may be the same method as the noise removing unit 501, or may be different.
  • the input signal acquisition unit 101, the feature amount conversion unit 102, the speech recognition unit 103, the situation detection unit 104, the noise removal unit 501, the processing control unit 502 (more specifically, the noise removal device determination unit 502a, the feature amount conversion device determination unit 105a, and the voice recognition device determination unit 105b), the transmission/reception unit 106, and the recognition result integration unit 107 are realized by a CPU of a computer that operates according to a program (voice recognition program).
  • the noise removing unit 501 and the noise removing unit 602 are provided in addition to the voice recognition system in the first embodiment. Note that the noise removing unit 501 and the noise removing unit 602 may be included in the speech recognition system in the second embodiment.
  • FIG. 8 is a flowchart showing an example of the operation on the terminal side.
  • FIG. 9 is a flowchart showing an example of the operation on the server side. First, the operation on the terminal side will be described with reference to FIG. 8.
  • the input signal acquisition unit 101 cuts out the collected time-series input sound data for each unit-time frame (step S101).
  • the noise removal device determination unit 502a determines, according to the situation detected by the situation detection unit 104, whether the processing of removing noise from the input signal is performed on the terminal side, on the server side, or on both (step S501).
  • When the noise removal device determination unit 502a determines that the noise of the input signal is removed on the terminal side ("terminal side" in step S501), the noise removal unit 501 removes noise from the time series of the input signal (step S502).
  • When the noise removal device determination unit 502a determines that the noise of the input signal is removed on the server side ("server side" in step S501), it causes the transmission/reception unit 106 to transmit the input signal (step S503).
  • The subsequent processing, from the feature amount conversion device determination unit 105a determining the device that calculates the feature amounts to the recognition result display unit 108 displaying the recognition result integrated by the recognition result integration unit 107, is the same as steps S102 to S110 illustrated in FIG. 2.
  • next, the operation on the server side will be described with reference to FIG. 9. The transmission / reception unit 201 receives data from the terminal side (step S201).
  • the processing control unit 601 changes the subsequent processing content according to the content of the received data (step S601).
  • when the received data is a time series of the input signal ("input signal" in step S601), the noise removal unit 602 removes noise from the time series of the input signal (step S602).
  • after the noise removal unit 602 removes noise, or when the received data is a time series of the input signal from which noise has already been removed ("noise-removed input signal" in step S601), the feature amount conversion unit 203 converts the input signal into feature amounts (step S203). Thereafter, the processing from performing speech recognition based on the feature amounts to transmitting the speech recognition result to the terminal 500 is the same as the processing in steps S204 to S205 illustrated in FIG. 3.
  • noise suppression processing can increase the accuracy of speech recognition, but requires a large amount of CPU power.
  • the noise removal device determination unit 502a therefore selects which device performs the noise component removal processing according to the voice input status.
  • when the own terminal is selected, the noise removing unit 501 removes the noise component from the input signal. For example, when the noise level is high, a speech recognition result with higher accuracy can be obtained by performing the noise suppression processing on the server side.
  • Embodiment 4. Next, a speech recognition system according to the fourth embodiment of the present invention will be described.
  • the process control unit 105, the process control unit 302, and the process control unit 502 (hereinafter referred to as each process control unit) control which device performs the subsequent processing based on the situation detected by the situation detection unit 104.
  • an index (hereinafter referred to as a score table) determined in advance according to the situation detected by the situation detection unit 104 is set, and each processing control unit controls the subsequent processing based on the score table.
  • each processing control unit calculates the total score according to the voice input status based on the score determined according to the status detected by the status detection unit 104. Then, each processing control unit compares the calculated total with a predetermined condition, and selects which device performs the feature amount calculation processing and voice recognition.
  • the score table is stored in advance in a storage unit (not shown) on the terminal side, for example.
  • FIG. 10 is an explanatory diagram showing an example of a score table.
  • the score table illustrated in FIG. 10 associates each situation detected by the situation detection unit 104 with a score indicating the weight of that situation. For example, when the situation detection unit 104 detects a situation where communication between the terminal side and the server side is disconnected, the weight of that situation is set to 5 points.
  • each processing control unit calculates the total V of the points corresponding to the situations detected by the situation detection unit 104, and selects which device executes the subsequent processing based on a predetermined condition. For example, a condition may be set such that when the total V is 4 or more, no information is transmitted to the server side (that is, the processing is performed on the terminal side); when the total V is 2 or more and less than 4, the feature amount is transmitted to the server side; and when V is less than 2, the input signal is transmitted to the server side.
  • each processing control unit selects which device executes the subsequent processing based on the conditions set in this way (see the sketch below); the set conditions are not limited to the above contents.
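Below is a minimal sketch of this score-based selection. Only the 5-point weight for a disconnected line comes from the example above; the other situations, their weights, and the thresholds of 4 and 2 points follow the conditions described in the text but are otherwise assumptions.

```python
# Hypothetical score table in the spirit of FIG. 10; only the 5-point
# weight for a disconnected line is taken from the example in the text.
SCORE_TABLE = {
    "line_disconnected": 5,
    "high_noise_level": 1,
    "large_task_size": 1,
    "server_cpu_busy": 2,
}

def select_processing_plan(detected_situations) -> str:
    v = sum(SCORE_TABLE.get(s, 0) for s in detected_situations)  # total V
    if v >= 4:
        return "all_on_terminal"    # send nothing to the server side
    if v >= 2:
        return "send_features"      # terminal computes features, server recognizes
    return "send_input_signal"      # server performs the downstream processing

print(select_processing_plan({"line_disconnected"}))                    # all_on_terminal
print(select_processing_plan({"high_noise_level", "server_cpu_busy"}))  # send_features
```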
  • the connection between the terminal-side device and the server-side device is not limited to one-to-one. Two or more terminal-side devices and server-side devices may be connected to each other.
  • FIG. 11 is an explanatory diagram showing an example of a voice recognition system in the present embodiment.
  • the speech recognition system in this embodiment includes terminals A to D, server apparatuses A to D, and a connection state controller 700.
  • the connection state controller 700 is connected between the terminals A to D and the server apparatuses A to D.
  • the configurations of the terminals A to D are the same as those of the terminals 100, 300, and 500 in the first to fourth embodiments.
  • the configurations of the server apparatuses A to D are the same as those of the server apparatuses 200, 400, and 600 in the first to fourth embodiments.
  • the connection state controller 700 selects the server devices A to D to which the terminals A to D are connected. Specifically, the control unit 701 of the connection state controller 700 selects the server devices A to D to which the terminals A to D are connected based on at least one of the data format transmitted from the terminal side, the CPU load on the server side, and the memory usage rate on the server side.
  • the data transmitted from the terminal side may include information indicating whether it is an input signal, an input signal obtained by cutting out a voice section, an input signal from which noise has been removed, or a feature amount.
  • the connection state controller 700 is realized by a server device, for example. Further, the control unit 701 of the connection state controller 700 is realized by a CPU included in the connection state controller 700, for example.
  • upon receiving a connection request including the data format to be transmitted from the terminal side, the control unit 701 selects a server device that can support the received data format. The control unit 701 may further narrow the selection among a plurality of candidate server devices on the basis of a low CPU load or a low memory usage. The number of server devices selected by the control unit 701 is not limited to one and may be two or more. After selecting the server device, the control unit 701 sets up a connection between the terminal that issued the connection request and the selected server device, and data is thereafter transmitted and received between that terminal and the server device.
  • the criteria for the control unit 701 to select a server device are not limited to the above.
  • the control unit 701 may select the server device using a criterion determined according to the data format transmitted from the terminal side, the CPU load on the server side, and the memory usage rate on the server side (a selection sketch follows).
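The following minimal sketch illustrates this selection, assuming hypothetical server records: candidates are first filtered by the requested data format and then ranked by CPU load, with memory usage as a tie-breaker.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    formats: set       # data formats this server device can accept
    cpu_load: float    # 0.0 (idle) to 1.0 (saturated)
    mem_usage: float   # 0.0 to 1.0

def select_servers(servers, requested_format, n=1):
    candidates = [s for s in servers if requested_format in s.formats]
    candidates.sort(key=lambda s: (s.cpu_load, s.mem_usage))
    return candidates[:n]   # one or more server devices may be selected

servers = [
    Server("A", {"input_signal", "features"}, cpu_load=0.7, mem_usage=0.5),
    Server("B", {"features"}, cpu_load=0.2, mem_usage=0.3),
]
print([s.name for s in select_servers(servers, "features")])  # ['B']
```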
  • since the control unit 701 selects the server device to be connected to each of the terminals A to D, the combination of the terminals A to D and the server devices A to D is not limited to the combinations of configurations exemplified in the embodiments.
  • the voice recognition system in the present embodiment may include, for example, the terminal 100 in the first embodiment and the server device 600 in the third embodiment.
  • the status detection unit 104 of a terminal may detect, as the line state, the server device with the lowest CPU load, the server device with the lowest memory usage, the server device with the fastest communication speed, and the like. The connection state controller 700 may then connect the terminal to a server device according to the connection request.
  • FIG. 12 is a block diagram showing an example of the minimum configuration of the speech recognition system according to the present invention.
  • FIG. 13 is a block diagram showing an example of the minimum configuration of the voice acquisition terminal according to the present invention.
  • the voice recognition system according to the present invention includes a voice acquisition terminal 80 (for example, the terminal 100) that receives a voice and acquires an input signal (for example, input sound data) representing the voice, and a voice recognition device 90 (for example, the server device 200) that performs voice recognition based on information transmitted from the voice acquisition terminal 80.
  • the voice acquisition terminal 80 includes a processing device determination unit 81 (for example, the processing control unit 105) that selects, according to the voice input status, at least one device that performs the calculation processing of the feature amount used for voice recognition from the own voice acquisition terminal 80 and the voice recognition device 90 (for example, the selection made by the feature amount conversion device determination unit 105a), and selects, according to the input status, at least one device that performs voice recognition based on the calculated feature amount from the own voice acquisition terminal 80 and the voice recognition device 90 (for example, the selection made by the voice recognition device determination unit 105b).
  • the processing device determination unit 81 selects the device that performs the feature amount calculation processing and the device that performs speech recognition according to information representing at least one of: the voice input environment (for example, the noise level); the task size (for example, the vocabulary size of the words that can be recognized, or the degree of utterance complexity); the situation of the own voice acquisition terminal 80 (for example, the CPU load and memory usage status of the terminal 100); the situation of the voice recognition device 90 (for example, the CPU load and memory usage status of the server device 200); and the communication status between the own voice acquisition terminal 80 and the voice recognition device 90 (for example, whether communication is disconnected or the communication speed is low).
  • with this configuration, the processing performed in voice recognition can be appropriately shared between the terminal and the server device.
  • the voice acquisition terminal 80 may include status detection means (for example, the status detection unit 104) that detects a voice input status.
  • the situation detection means (for example, the situation detection unit 104) detects a situation representing at least one of the voice input environment, the task size, the situation of the own voice acquisition terminal 80, the situation of the voice recognition device 90, and the communication situation between the own voice acquisition terminal 80 and the voice recognition device 90.
  • the processing device determination unit 81 may select a device that performs a feature amount calculation process and a device that performs voice recognition according to the situation detected by the situation detection unit.
  • the voice acquisition terminal 80 may include voice detection means (for example, a voice detection unit 301) that extracts a voice section from the input signal.
  • the processing device determination unit 81 may select at least one device that performs the voice section extraction processing from the own voice acquisition terminal 80 and the voice recognition device 90 according to the voice input status, and the voice detection means may extract the voice section from the input signal when the processing device determination means selects the own voice acquisition terminal 80.
  • the voice acquisition terminal 80 may include noise removing means (for example, the noise removing unit 501) for removing a noise component from the input signal.
  • the processing device determination unit 81 may select at least one device that performs the noise component removal processing from the own voice acquisition terminal 80 and the voice recognition device 90 according to the voice input status, and the noise removal means may remove the noise component from the input signal when the processing device determination means selects the own voice acquisition terminal 80.
  • the processing device determination unit 81 may calculate a total score (for example, the total V) according to the voice input status based on scores (for example, the score table illustrated in FIG. 10) that are predetermined indices according to the status detected by the status detection means, and may select the device that performs the feature amount calculation processing and the device that performs voice recognition by comparing the calculated total with a predetermined condition. Setting such a score table in advance makes it possible to make detailed judgments according to the environment.
  • the voice acquisition terminal 80 may include a communication unit (for example, the transmission / reception unit 106) that transmits information representing the detected input signal or information representing the calculated feature value to the voice recognition device.
  • the communication unit may notify the voice recognition device 90 of the data format of the information, and then transmit the information to the voice recognition device 90.
  • the communication unit may receive a voice recognition result for the information from the voice recognition device 90.
  • the speech recognition system may include at least one or more voice recognition devices 90 (for example, the server devices A to D) and a connection destination control device (for example, the connection state controller 700) that is connected between the voice acquisition terminal 80 and the voice recognition devices 90 and selects the connection destination of the voice acquisition terminal 80.
  • the connection destination control device may include a selection unit (for example, the control unit 701) that selects the connection destination of the voice acquisition terminal 80 based on at least one of the data format of the information transmitted from the voice acquisition terminal 80 to the voice recognition device 90 (for example, information representing an input signal or information representing a feature amount), the CPU load of each voice recognition device, and the memory usage rate of each voice recognition device.
  • the voice acquisition terminal illustrated in FIG. 13 also includes a processing device determination unit 81 (for example, a processing control unit 105).
  • the contents of the processing device determination unit 81 are the same as those of the unit shown in FIG. 12.
  • (Supplementary note 1) A speech recognition system comprising: a voice acquisition terminal that receives a voice and acquires an input signal representing the voice; and a voice recognition device that performs voice recognition based on information transmitted from the voice acquisition terminal, wherein the voice acquisition terminal includes processing device determination means that selects, according to the voice input status, at least one device that performs the calculation processing of the feature amount used for voice recognition from the own voice acquisition terminal and the voice recognition device, and selects, according to the input status, at least one device that performs voice recognition based on the calculated feature amount from the own voice acquisition terminal and the voice recognition device, and speech recognition means that performs speech recognition of the input signal based on the selection result, and wherein the processing device determination means selects the device that performs the feature amount calculation processing and the device that performs speech recognition according to information representing at least one of the voice input environment, the task size, the situation of the own voice acquisition terminal, the situation of the voice recognition device, and the communication situation between the own voice acquisition terminal and the voice recognition device.
  • (Supplementary note 2) The speech recognition system according to supplementary note 1, wherein the voice acquisition terminal includes situation detection means that detects the voice input situation, the situation detection means detects a situation representing at least one of the voice input environment, the task size, the situation of the own voice acquisition terminal, the situation of the voice recognition device, and the communication situation between the own voice acquisition terminal and the voice recognition device, and the processing device determination means selects the device that performs the feature amount calculation processing and the device that performs speech recognition according to the situation detected by the situation detection means.
  • (Supplementary note 3) The voice acquisition terminal includes voice detection means that extracts a voice section from the input signal, the processing device determination means selects at least one device that performs the voice section extraction processing from the own voice acquisition terminal and the voice recognition device according to the voice input status, and the voice detection means extracts the voice section from the input signal when the processing device determination means selects the own voice acquisition terminal.
  • (Supplementary note 4) The speech recognition system according to any one of the above supplementary notes, wherein the voice acquisition terminal includes noise removal means that removes a noise component from the input signal, the processing device determination means selects at least one device that performs the noise component removal processing from the own voice acquisition terminal and the voice recognition device according to the voice input status, and the noise removal means removes the noise component from the input signal when the processing device determination means selects the own voice acquisition terminal.
  • (Supplementary note 5) The speech recognition system according to any one of supplementary notes 2 to 4, wherein the processing device determination means calculates a total score according to the voice input status based on scores that are predetermined indices according to the situation detected by the situation detection means, and selects the device that performs the feature amount calculation processing and the device that performs speech recognition by comparing the total with a predetermined condition.
  • (Supplementary note 6) The speech recognition system according to any one of supplementary notes 1 to 5, wherein the voice acquisition terminal includes communication means that transmits information representing the acquired input signal or information representing the calculated feature amount to the voice recognition device, and the communication means notifies the voice recognition device of the data format of the information, then transmits the information to the voice recognition device, and receives the speech recognition result for the information from the voice recognition device.
  • (Supplementary note 7) The speech recognition system according to any one of supplementary notes 1 to 6, comprising at least one or more voice recognition devices and a connection destination control device that is connected between the voice acquisition terminal and the voice recognition devices and selects the connection destination of the voice acquisition terminal from among the voice recognition devices, wherein the connection destination control device includes selection means that selects the connection destination of the voice acquisition terminal based on at least one piece of information among the data format of the information transmitted from the voice acquisition terminal to the voice recognition device, the CPU load of each voice recognition device, and the memory usage rate of each voice recognition device.
  • (Supplementary note 8) The speech recognition system according to any one of supplementary notes 2 to 7, wherein the situation detection means detects a situation representing at least one of a noise level indicating the voice input environment, the number of words or the complexity of the speech recognition target indicating the task size, a CPU load or a memory usage rate indicating the situation of the own voice acquisition terminal, a CPU load or a memory usage rate indicating the situation of the voice recognition device, and the line state between the own voice acquisition terminal and the voice recognition device.
  • (Supplementary note 9) The speech recognition system according to any one of supplementary notes 1 to 8, wherein the voice acquisition terminal includes likelihood calculation means that calculates a likelihood representing the plausibility of a speech recognition result, and speech recognition result selection means that selects one speech recognition result from a plurality of speech recognition results, and wherein, when the speech recognition processing for the input signal is performed by both the voice acquisition terminal and the voice recognition device, the speech recognition result selection means selects the speech recognition result with the higher likelihood from the result obtained by the voice acquisition terminal and the result obtained by the voice recognition device.
  • (Supplementary note 10) The speech recognition system according to any one of supplementary notes 6 to 9, wherein the communication means transmits the input signal to the speech recognition device when the speech recognition device is selected as the device that performs the feature amount calculation processing, and transmits the feature amount calculated by the own voice acquisition terminal to the speech recognition device when the speech recognition device is selected as the device that performs speech recognition.
  • (Supplementary note 11) A voice acquisition terminal that receives a voice and acquires an input signal representing the voice, comprising processing device determination means that selects, according to the voice input status, at least one device that performs the feature amount calculation processing used for voice recognition from a voice recognition device that performs voice recognition based on information transmitted from the voice acquisition terminal and the own voice acquisition terminal, and selects, according to the input status, at least one device that performs voice recognition based on the calculated feature amount from the voice recognition device and the own voice acquisition terminal, wherein the processing device determination means selects the device that performs the feature amount calculation processing and the device that performs voice recognition according to information representing at least one of the voice input environment, the task size, the situation of the own voice acquisition terminal, the situation of the voice recognition device, and the communication situation between the own voice acquisition terminal and the voice recognition device.
  • (Supplementary note 12) The voice acquisition terminal according to supplementary note 11, further comprising situation detection means that detects the voice input situation, wherein the situation detection means detects a situation representing at least one of the voice input environment, the task size, the situation of the own voice acquisition terminal, the situation of the voice recognition device, and the communication situation between the own voice acquisition terminal and the voice recognition device, and the processing device determination means selects the device that performs the feature amount calculation processing and the device that performs voice recognition according to the situation detected by the situation detection means.
  • (Supplementary note 13) A voice recognition sharing method in which a voice acquisition terminal that receives a voice and acquires an input signal representing the voice selects, according to the voice input status, at least one device that performs the feature amount calculation processing used for voice recognition from a voice recognition device that performs voice recognition based on information transmitted from the voice acquisition terminal and the own voice acquisition terminal, and selects, according to the input status, at least one device that performs voice recognition based on the calculated feature amount from the voice recognition device and the own voice acquisition terminal, wherein the voice acquisition terminal selects the device that performs the feature amount calculation processing and the device that performs voice recognition based on information representing at least one of the voice input environment, the task size, the situation of the own voice acquisition terminal, the situation of the voice recognition device, and the communication situation between the own voice acquisition terminal and the voice recognition device.
  • (Supplementary note 14) The voice recognition sharing method according to supplementary note 13, wherein the voice acquisition terminal detects a situation representing at least one of the voice input environment, the task size, the situation of the own voice acquisition terminal, the situation of the voice recognition device, and the communication situation between the own voice acquisition terminal and the voice recognition device, and selects the device that performs the feature amount calculation processing and the device that performs voice recognition according to the detected situation.
  • (Supplementary note 15) A speech recognition program applied to a computer that receives a voice and acquires an input signal representing the voice, the program causing the computer to execute processing device determination processing that selects, according to the voice input status, at least one device that performs the feature amount calculation processing used for voice recognition from a voice recognition device that performs voice recognition based on information transmitted from the computer and the computer itself, and selects, according to the input status, at least one device that performs voice recognition based on the calculated feature amount from the voice recognition device and the computer, wherein, in the processing device determination processing, the device that performs the feature amount calculation processing and the device that performs voice recognition are selected according to information representing at least one of the voice input environment, the task size, the situation of the own voice acquisition terminal, the situation of the voice recognition device, and the communication situation between the own voice acquisition terminal and the voice recognition device.
  • (Supplementary note 16) The speech recognition program according to supplementary note 15, further causing the computer to execute situation detection processing that detects the voice input situation, wherein, in the situation detection processing, a situation representing at least one of the voice input environment, the task size, the situation of the own voice acquisition terminal, the situation of the voice recognition device, and the communication situation between the own voice acquisition terminal and the voice recognition device is detected, and, in the processing device determination processing, the device that performs the feature amount calculation processing and the device that performs voice recognition are selected according to the situation detected in the situation detection processing.
  • the present invention is preferably applied to a voice recognition system that shares voice recognition processing between a terminal and a server device.

Abstract

A processing device determination means selects in accordance with the voice input situation at least one device from a local-voice acquisition terminal and a voice recognition device to perform calculations for feature quantities used in voice recognition, and selects in accordance with the input situation at least one device from the local-voice acquisition terminal and the voice recognition device to perform voice recognition on the basis of the feature quantities calculated. Furthermore, the processing device determination means selects a device to perform feature quantity calculations and a device to perform voice recognition, in accordance with information representing at least one from among the voice input environment, task size, local-voice acquisition terminal situation, voice recognition device situation, and communication situation between the local-voice acquisition terminal and the voice recognition device.

Description

Speech recognition system, speech acquisition terminal, speech recognition sharing method, and speech recognition program
The present invention relates to a voice recognition system, a voice acquisition terminal, a voice recognition sharing method, and a voice recognition program that share voice recognition processing between a terminal and a server device.
When performing speech recognition on a terminal with relatively low CPU power, such as a portable terminal, a distributed speech recognition (hereinafter, DSR) system is widely used. In a general DSR system, the terminal first performs speech detection, noise suppression, and feature amount conversion on the speech input by the user, and transmits the compressed feature amounts to a server. Next, the server device performs speech recognition based on the feature amounts transmitted from the terminal and transmits the recognition result back to the terminal side. The terminal then notifies the user of the recognition result. With such a configuration, a large-scale dictionary that cannot be mounted on the terminal, and speech recognition that requires a large amount of CPU power, can be performed via the terminal.
Such a DSR system is described in Patent Document 1. FIG. 14 is an explanatory diagram showing the voice recognition system described in Patent Document 1. In that system, a plurality of client stations (terminals) 320, 330, and 340 are connected to a server 310 via a public Internet network 350.
The terminal 330 includes an interface (IF) 331 that acquires the user's voice signal, a communication interface (COM) 332 that communicates with the server 310, a spectrum analysis unit (SAS) 333 that obtains a feature vector from the acquired voice signal, a speech recognition unit (SR) 334 that performs speech recognition from the feature vector, a speech controller (SC) 335 that distributes a part of the feature vector to the server 310 depending on the speech recognition result, and a controllable switch (SW) 336 that determines whether the feature vector is transmitted to the server 310 via the communication interface 332. The server 310 includes a communication interface (COM) 312 that communicates with the terminals and a speech recognition unit (REC) 314 that performs speech recognition from the feature vector received from a terminal.
In the distributed speech recognition system described in Patent Document 1, the speech recognition unit 334 on the terminal side performs speech recognition with a relatively small vocabulary that requires little CPU power, while the speech recognition unit 314 on the server side performs speech recognition with a relatively large vocabulary that requires much CPU power. Responsive speech recognition is achieved by distributing the speech recognition processing efficiently in this way.
Patent Document 2 describes an information terminal that shares the execution of voice recognition processing. The information terminal described in Patent Document 2 extracts feature points of the captured voice data, determines the complexity of the voice data, and decides which device performs the voice recognition processing according to that complexity.
Patent Document 1: JP 2002-540479 A (published Japanese translation of a PCT application). Patent Document 2: JP 2007-41089 A.
In the speech recognition system described in Patent Document 1, the processing related to speech recognition is shared by changing the amount of the speech input signal that is transmitted according to the CPU load on the terminal side or the server side. However, because the processing is distributed in consideration of the CPU load alone, the processing performed in speech recognition cannot be shared appropriately. In other words, considering only the CPU load is not sufficient to appropriately share the speech recognition processing among a plurality of devices.
On the other hand, the information terminal described in Patent Document 2 extracts feature points of the captured voice data and uses those feature points as a factor for deciding the distribution destination. The information terminal described in Patent Document 2 might therefore be considered to share the processing performed in voice recognition more appropriately.
However, in the information terminal described in Patent Document 2, the feature point extraction processing, which is itself part of the speech recognition processing, is always performed by the information terminal. It is therefore difficult to say that the processing performed in speech recognition is appropriately shared.
Accordingly, an object of the present invention is to provide a voice recognition system, a voice acquisition terminal, a voice recognition sharing method, and a voice recognition program that can appropriately share the processing performed in voice recognition between a terminal and a server device.
The speech recognition system according to the present invention includes a voice acquisition terminal that receives a voice and acquires an input signal representing the voice, and a voice recognition device that performs voice recognition based on information transmitted from the voice acquisition terminal. The voice acquisition terminal includes processing device determination means that selects, according to the voice input status, at least one device that performs the calculation processing of the feature amount used for voice recognition from the own voice acquisition terminal and the voice recognition device, and selects, according to the input status, at least one device that performs voice recognition based on the calculated feature amount from the own voice acquisition terminal and the voice recognition device. The processing device determination means selects the device that performs the feature amount calculation processing and the device that performs voice recognition according to information representing at least one of the voice input environment, the task size, the situation of the own voice acquisition terminal, the situation of the voice recognition device, and the communication status between the own voice acquisition terminal and the voice recognition device.
The voice acquisition terminal according to the present invention is a voice acquisition terminal that receives a voice and acquires an input signal representing the voice, and includes processing device determination means that selects, according to the voice input status, at least one device that performs the calculation processing of the feature amount used for voice recognition from a voice recognition device that performs voice recognition based on information transmitted from the voice acquisition terminal and the own voice acquisition terminal, and selects, according to the input status, at least one device that performs voice recognition based on the calculated feature amount from the voice recognition device and the own voice acquisition terminal. The processing device determination means selects the device that performs the feature amount calculation processing and the device that performs voice recognition according to information representing at least one of the voice input environment, the task size, the situation of the own voice acquisition terminal, the situation of the voice recognition device, and the communication status between the own voice acquisition terminal and the voice recognition device.
In the voice recognition sharing method according to the present invention, a voice acquisition terminal that receives a voice and acquires an input signal representing the voice selects, according to the voice input status, at least one device that performs the calculation processing of the feature amount used for voice recognition from a voice recognition device that performs voice recognition based on information transmitted from the voice acquisition terminal and the own voice acquisition terminal; the voice acquisition terminal selects, according to the input status, at least one device that performs voice recognition based on the calculated feature amount from the voice recognition device and the own voice acquisition terminal; and the voice acquisition terminal selects the device that performs the feature amount calculation processing and the device that performs voice recognition according to information representing at least one of the voice input environment, the task size, the situation of the own voice acquisition terminal, the situation of the voice recognition device, and the communication status between the own voice acquisition terminal and the voice recognition device.
The speech recognition program according to the present invention is a speech recognition program applied to a computer that receives a voice and acquires an input signal representing the voice. The program causes the computer to execute processing device determination processing that selects, according to the voice input status, at least one device that performs the calculation processing of the feature amount used for voice recognition from a voice recognition device that performs voice recognition based on information transmitted from the computer and the computer itself, and selects, according to the input status, at least one device that performs voice recognition based on the calculated feature amount from the voice recognition device and the computer. In the processing device determination processing, the device that performs the feature amount calculation processing and the device that performs voice recognition are selected according to information representing at least one of the voice input environment, the task size, the situation of the own voice acquisition terminal, the situation of the voice recognition device, and the communication status between the own voice acquisition terminal and the voice recognition device.
According to the present invention, when voice recognition processing is shared between a terminal and a server device, the processing performed in voice recognition can be appropriately shared between the terminal and the server device.
FIG. 1 is a block diagram showing an example of the speech recognition system in the first embodiment of the present invention.
FIG. 2 is a flowchart showing an example of the operation on the terminal side in the first embodiment.
FIG. 3 is a flowchart showing an example of the operation on the server side in the first embodiment.
FIG. 4 is a block diagram showing an example of the speech recognition system in the second embodiment of the present invention.
FIG. 5 is a flowchart showing an example of the operation on the terminal side in the second embodiment.
FIG. 6 is a flowchart showing an example of the operation on the server side in the second embodiment.
FIG. 7 is a block diagram showing an example of the speech recognition system in the third embodiment of the present invention.
FIG. 8 is a flowchart showing an example of the operation on the terminal side in the third embodiment.
FIG. 9 is a flowchart showing an example of the operation on the server side in the third embodiment.
FIG. 10 is an explanatory diagram showing an example of the score table.
FIG. 11 is an explanatory diagram showing an example of the speech recognition system in the fifth embodiment.
FIG. 12 is a block diagram showing an example of the minimum configuration of the speech recognition system according to the present invention.
FIG. 13 is a block diagram showing an example of the minimum configuration of the voice acquisition terminal according to the present invention.
FIG. 14 is an explanatory diagram showing the speech recognition system described in Patent Document 1.
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
Embodiment 1.
FIG. 1 is a block diagram showing an example of the speech recognition system in the first embodiment of the present invention. The speech recognition system in this embodiment includes a terminal 100 and a server device 200. In the following description, the terminal 100 may be referred to as the terminal side, and the server device 200 as the server side. The terminal 100 and the server device 200 are connected via, for example, a public Internet network.
The terminal 100 includes an input signal acquisition unit 101, a feature amount conversion unit 102, a voice recognition unit 103, a situation detection unit 104, a processing control unit 105, a transmission / reception unit 106, a recognition result integration unit 107, and a recognition result display unit 108.
The input signal acquisition unit 101 converts the input voice into input sound data (hereinafter referred to as an input signal). Specifically, the input signal acquisition unit 101 cuts out the time-series input sound data collected by the microphone 99 or the like into frames of unit time.
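As an illustration, the sketch below cuts a waveform into fixed-length frames; the 25 ms frame length and 10 ms hop at 16 kHz are common choices assumed here, not values prescribed by the patent.

```python
import numpy as np

def frame_signal(x: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    # 400 samples = 25 ms and 160 samples = 10 ms at a 16 kHz sampling rate.
    n = 1 + max(0, len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

x = np.random.randn(16000)     # one second of audio at 16 kHz
print(frame_signal(x).shape)   # (98, 400)
```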
The feature amount conversion unit 102 converts the time series of the input signal into a time series of feature amounts used for speech recognition, using a method such as LPC (Linear Predictive Coding) cepstrum analysis or MFCC (Mel-Frequency Cepstrum Coefficient) analysis. The method by which the feature amount conversion unit 102 converts the input signal into feature amounts is not limited to these.
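For example, the MFCC conversion can be sketched as follows; the librosa library and the parameter values are illustrative assumptions, since the patent does not prescribe an implementation.

```python
import numpy as np
import librosa  # assumed third-party dependency, used only for illustration

def to_mfcc(x: np.ndarray, sr: int = 16000) -> np.ndarray:
    # 25 ms windows with a 10 ms hop, 13 coefficients per frame.
    return librosa.feature.mfcc(y=x, sr=sr, n_mfcc=13,
                                n_fft=400, hop_length=160).T

x = np.random.randn(16000).astype(np.float32)
print(to_mfcc(x).shape)   # roughly (101, 13): one feature vector per frame
```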
The voice recognition unit 103 performs voice recognition based on the time series of the converted feature amounts and at the same time calculates a score representing the recognition result. The score representing the recognition result (hereinafter sometimes simply referred to as the score) is an index representing the likelihood of the speech recognition. The voice recognition unit 103 may calculate, for example, the distance between the feature amount time series and an acoustic model, an acoustic score such as a likelihood, or a language score representing the degree of linguistic coincidence as the score. The voice recognition unit 103 may obtain a score for the entire recognition result, or in various units such as per frame, per word, or per utterance section. Since methods for performing speech recognition based on feature amounts and for calculating an index representing the likelihood of speech recognition are widely known, their description is omitted here.
The situation detection unit 104 detects the situation in which the voice is input. Specifically, the situation detection unit 104 detects various situations such as the environment in which the voice is input, the situations of the terminal 100 and the server device 200, the task size, and the line state between the terminal 100 and the server device 200 over which the input signal is transmitted. The situation detection unit 104 detects, for example, the CPU load and the memory usage status as the situations of the terminal 100 and the server device 200, although the detected situations are not limited to these.
The task size is an index representing the difficulty of recognizing an utterance. For example, the vocabulary size of the words that can be recognized may be used as a measure of the task size. Alternatively, the complexity of the utterances accepted by the speech recognition may be used; for example, the utterance complexity may be expressed by whether keyword recognition or natural language recognition is performed.
These measures are detected depending on the situation in which the user inputs the voice and on the application. For example, when the user speaks clearly and the utterance is relatively easy to distinguish, the situation detection unit 104 may judge the difficulty of speech recognition to be low even if the vocabulary is large or the utterance is complicated. On the other hand, when the utterance is hard to distinguish because the user speaks quickly or quietly, the vocabulary must be reduced or the utterances simplified in order to keep recognition errors small, so the situation detection unit 104 may judge the difficulty of speech recognition to be high. The situation detection unit 104 may also detect the difficulty of speech recognition due to vocabulary size or complexity from the specifications required by the application.
The environment in which the voice is input includes the noise level, which represents the degree of noise contained in the input voice. For example, when the part of the sound input through the microphone 99 other than the utterance targeted for recognition is regarded as noise, the magnitude (sound pressure) of that noise may be used as the noise level. The situation detection unit 104 may detect the noise level from, for example, the sound pressure of the input signal captured by the microphone 99 before the user speaks.
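A minimal sketch of such a noise-level estimate follows, assuming a signal normalized to [-1, 1] and a 500 ms pre-speech window.

```python
import numpy as np

def noise_level_db(pre_speech: np.ndarray) -> float:
    # RMS sound pressure of the pre-speech segment, in dB relative to full scale.
    rms = np.sqrt(np.mean(pre_speech ** 2) + 1e-12)
    return 20.0 * np.log10(rms)

pre = 0.01 * np.random.randn(8000)    # 500 ms of faint background noise at 16 kHz
print(round(noise_level_db(pre), 1))  # around -40 dBFS
```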
The situation detection unit 104 may detect the CPU loads of the terminal 100 and the server device 200 and the line state between the devices using, for example, an API (Application Program Interface) provided by the OS (Operating System). For example, the situation detection unit 104 may transmit a packet requesting the CPU usage status to the server side and calculate the CPU load from the information contained in the returned packet. The situation detection unit 104 may also transmit an ICMP (Internet Control Message Protocol) Echo request packet to the server side and measure the time until the ICMP packet returned from the server side is received, thereby detecting the line state. However, the methods by which the situation detection unit 104 detects the CPU load and the line state between devices are not limited to these.
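A minimal sketch of such a line-state probe follows. Raw ICMP sockets normally require elevated privileges, so this sketch times a TCP connection instead; the host, port, and threshold are assumptions.

```python
import socket
import time

def round_trip_ms(host: str, port: int = 80, timeout: float = 2.0) -> float:
    # Time a TCP connect as a rough stand-in for an ICMP Echo round trip.
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.monotonic() - start) * 1000.0

try:
    state = "slow" if round_trip_ms("example.com") > 200 else "ok"
except OSError:
    state = "disconnected"   # treat a failed probe as a disconnected line
```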
The processing control unit 105 includes a feature amount conversion device determination unit 105a and a voice recognition device determination unit 105b, and controls which device executes the subsequent processing based on the situation detected by the situation detection unit 104. Specifically, the feature amount conversion device determination unit 105a determines which device executes the feature amount calculation processing, and the voice recognition device determination unit 105b determines which device executes the voice recognition based on the feature amounts. In the example shown in FIG. 1, the feature amount conversion device determination unit 105a and the voice recognition device determination unit 105b select whether the terminal 100, the server device 200, or both execute the processing. Specific methods by which these determination units select the device that executes the subsequent processing are described later.
The transmission / reception unit 106 transmits the time series of the input signal or the time series of the feature amounts to the server device 200 according to the determinations of the processing control unit 105 (more specifically, the feature amount conversion device determination unit 105a and the voice recognition device determination unit 105b). The transmission / reception unit 106 also receives the result of the speech recognition performed by the server device 200.
For example, when it is determined that the server device 200 executes the feature amount calculation processing, the transmission / reception unit 106 transmits the time series of the input signal to the server device 200. When it is determined that the server device 200 executes the speech recognition processing based on the feature amounts, the transmission / reception unit 106 transmits the time series of the feature amounts to the server device 200.
The recognition result integration unit 107 compares and integrates the result of the speech recognition performed by the voice recognition unit 103 and the result of the speech recognition by the server device 200 received by the transmission / reception unit 106. That is, the recognition result integration unit 107 selects the more appropriate speech recognition result from the two and adopts it. For example, when the speech recognition processing is performed by only one of the terminal side and the server side, the recognition result integration unit 107 may adopt the result of that device as the speech recognition result. When the speech recognition processing is performed on both the terminal side and the server side, the recognition result integration unit 107 may select the more likely speech recognition result and adopt it as the final result.
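A minimal sketch of this likelihood-based integration follows, assuming a hypothetical result layout of (recognized text, likelihood score).

```python
from typing import Optional, Tuple

Result = Tuple[str, float]   # (recognized text, likelihood score)

def integrate(local: Optional[Result], remote: Optional[Result]) -> Optional[Result]:
    if local is None:
        return remote    # only the server side produced a result
    if remote is None:
        return local     # only the terminal side produced a result
    return local if local[1] >= remote[1] else remote  # keep the likelier one

print(integrate(("turn on", 0.62), ("turn off", 0.91)))  # ('turn off', 0.91)
```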
The recognition result display unit 108 displays the speech recognition result compared and integrated by the recognition result integration unit 107 to the user. The recognition result display unit 108 is realized by, for example, a display device.
The server device 200 includes a transmission / reception unit 201, a processing control unit 202, a feature amount conversion unit 203, and a voice recognition unit 204. The server device 200 performs voice recognition based on the information transmitted from the terminal 100.
 送受信部201は、端末100から送信されたデータを受信する。また、送受信部201は、音声認識部204による音声認識結果を端末100に送信する。 The transmission / reception unit 201 receives data transmitted from the terminal 100. Further, the transmission / reception unit 201 transmits the speech recognition result by the speech recognition unit 204 to the terminal 100.
 処理制御部202は、端末100から受信した情報をもとに、以降の処理内容を判定する。具体的には、処理制御部202は、端末100から受信したデータが入力信号の時系列か、特徴量の時系列かによって、以降の処理内容を判定する。例えば、端末100から受信したデータが入力信号の時系列である場合、処理制御部202は、特徴量の算出処理を特徴量変換部203に実行させる。一方、端末100から受信したデータが特徴量の時系列である場合、処理制御部202は、特徴量による音声認識処理を音声認識部204に実行させる。 The process control unit 202 determines subsequent process contents based on the information received from the terminal 100. Specifically, the processing control unit 202 determines the subsequent processing contents depending on whether the data received from the terminal 100 is a time series of input signals or a time series of feature amounts. For example, when the data received from the terminal 100 is a time series of input signals, the process control unit 202 causes the feature amount conversion unit 203 to execute a feature amount calculation process. On the other hand, when the data received from the terminal 100 is a time series of feature amounts, the process control unit 202 causes the speech recognition unit 204 to perform a speech recognition process using the feature amounts.
 特徴量変換部203は、受信した入力信号の時系列を、音声認識に用いられる特徴量の時系列に変換する。音声認識部204は、特徴量変換部203が変換した特徴量の時系列または端末100から受信した特徴量の時系列をもとに、音声認識を行う。また、音声認識部204は、認識結果を表すスコアを同時に算出する。なお、入力信号を特徴量に変換する方法、その特徴量を基に音声認識を行う方法、及び、音声認識の尤もらしさを表す指標の算出方法は広く知られているため、ここでは説明を省略する。 The feature value conversion unit 203 converts the time series of the received input signal into a time series of feature values used for speech recognition. The speech recognition unit 204 performs speech recognition based on the time series of feature amounts converted by the feature amount conversion unit 203 or the time series of feature amounts received from the terminal 100. In addition, the voice recognition unit 204 calculates a score representing the recognition result at the same time. Note that a method for converting an input signal into a feature value, a method for performing speech recognition based on the feature value, and a method for calculating an index representing the likelihood of speech recognition are widely known, and thus description thereof is omitted here. To do.
 The input signal acquisition unit 101, the feature amount conversion unit 102, the speech recognition unit 103, the situation detection unit 104, the process control unit 105 (more specifically, the feature amount conversion device determination unit 105a and the speech recognition device determination unit 105b), the transmission/reception unit 106, and the recognition result integration unit 107 are realized by the CPU of a computer operating according to a program (a speech recognition program). For example, the program may be stored in a storage unit (not shown) of the terminal 100, and the CPU may read the program and operate as the above units in accordance with it.
 Alternatively, each of the input signal acquisition unit 101, the feature amount conversion unit 102, the speech recognition unit 103, the situation detection unit 104, the process control unit 105, the transmission/reception unit 106, and the recognition result integration unit 107 may be realized by dedicated hardware.
 Next, the operation will be described. FIG. 2 is a flowchart showing an example of the operation on the terminal side, and FIG. 3 is a flowchart showing an example of the operation on the server side. The operation on the terminal side is described first with reference to FIG. 2.
 On the terminal side, when the input sound is collected using the microphone 99 or the like, the input signal acquisition unit 101 first cuts the collected time-series input sound data into frames of a unit time (step S101). For example, the input signal acquisition unit 101 may sequentially cut out waveform data of the unit time while shifting the portion to be cut out of the input sound data by a predetermined time. Hereinafter, this unit time is called the frame width, and this predetermined time is called the frame shift. For example, when the input sound data is 16-bit Linear-PCM (Pulse Code Modulation) with a sampling frequency of 8000 Hz, it contains 8000 sample points of waveform data per second. In this case, the input signal acquisition unit 101 may cut the waveform data out sequentially in time series with a frame width of 200 points (25 milliseconds) and a frame shift of 80 points (10 milliseconds), as sketched below.
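 As a concrete illustration, the following is a minimal Python sketch of this framing step, assuming the input samples are already available as a sequence of 16-bit PCM values; the function and constant names are illustrative and are not taken from the specification.

```python
# Minimal sketch of the framing step (step S101), assuming 16-bit linear PCM
# sampled at 8000 Hz. Names and structure are illustrative assumptions.
FRAME_WIDTH = 200  # samples per frame (25 ms at 8000 Hz)
FRAME_SHIFT = 80   # samples between frame starts (10 ms at 8000 Hz)

def cut_frames(samples):
    """Yield successive overlapping frames from a sequence of PCM samples."""
    start = 0
    while start + FRAME_WIDTH <= len(samples):
        yield samples[start:start + FRAME_WIDTH]
        start += FRAME_SHIFT
```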
 Next, the feature amount conversion device determination unit 105a determines, according to the situation detected by the situation detection unit 104, whether the time series of the input signal is converted into a time series of feature amounts by the feature amount conversion unit 102 on the terminal side, transmitted from the transmission/reception unit 106 to the server device 200 (that is, converted into a time series of feature amounts by the feature amount conversion unit 203 on the server side), or converted on both sides (step S102).
 The feature amount conversion device determination unit 105a determines which device converts the time series of the input signal into a time series of feature amounts based on, for example, the following conditions; a decision sketch follows the list.
 1. When the transmission line is disconnected, when the communication speed is very low, or when the CPU load of the server device 200 is high, the feature amount conversion device determination unit 105a determines that the input signal is not transmitted to the server device 200 and is converted into feature amounts on the terminal side.
 2. Otherwise, when the CPU load of the terminal 100 is high or the noise level is high, the feature amount conversion device determination unit 105a determines that the input signal is transmitted to the server side and that the terminal side does not perform the feature amount conversion; that is, the conversion is performed on the server side.
 3. In all other cases, the feature amount conversion device determination unit 105a determines that the input signal is transmitted to the server side and is also converted into feature amounts on the terminal side; that is, the conversion is performed on both the terminal side and the server side.
 Whether the communication speed is low, whether the CPU loads of the terminal 100 and the server device 200 are high, and whether the noise level is high are each judged by comparison with a predetermined threshold.
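 The following Python sketch shows how such a rule cascade might look, under the assumption of a simple situation record and illustrative threshold values; the specification fixes only the ordering of the three rules, not any concrete thresholds.

```python
# Hedged sketch of the placement decision in step S102. The Situation record
# and all threshold values are assumptions made for illustration.
from dataclasses import dataclass

@dataclass
class Situation:
    link_up: bool
    link_speed_bps: float
    server_cpu_load: float    # 0.0 - 1.0
    terminal_cpu_load: float  # 0.0 - 1.0
    noise_level_db: float

MIN_SPEED = 64_000   # assumed threshold: below this the link is "very slow"
HIGH_LOAD = 0.8      # assumed CPU-load threshold
HIGH_NOISE = 70.0    # assumed noise-level threshold (dB)

def where_to_convert(s: Situation) -> str:
    # Rule 1: no usable link or overloaded server -> convert on the terminal only.
    if not s.link_up or s.link_speed_bps < MIN_SPEED or s.server_cpu_load > HIGH_LOAD:
        return "terminal"
    # Rule 2: busy terminal or noisy input -> convert on the server only.
    if s.terminal_cpu_load > HIGH_LOAD or s.noise_level_db > HIGH_NOISE:
        return "server"
    # Rule 3: otherwise both sides convert in parallel.
    return "both"
```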
 When the feature amount conversion device determination unit 105a determines that the conversion is performed on the terminal side ("terminal side" in step S102), the feature amount conversion unit 102 converts the time series of the input signal cut out frame by frame into a time series of feature amounts (step S103).
 When the feature amount conversion device determination unit 105a determines that the conversion is performed on the server side ("server side" in step S102), it causes the transmission/reception unit 106 to transmit the input signal (step S104). The transmission/reception unit 106 may, for example, compress the time series of the input signal in units of a certain block before transmission, or may encode it before transmission. At this time, the transmission/reception unit 106 may attach header information or the like to the head of the data to indicate that the transmitted content is an input signal.
 The transmission/reception unit 106 may also notify the server side of the data format before transmitting the input signal. This allows the server side to judge the content of the data it subsequently receives. One possible framing is sketched below.
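 As one possible realization of such header information, the sketch below prefixes each payload with an assumed one-byte type code and a four-byte length field; the specification does not prescribe any concrete wire format, so this layout is purely an assumption.

```python
# Assumed framing for the terminal-to-server transmission: a one-byte type
# code distinguishing an input-signal payload from a feature payload,
# followed by a big-endian length and the (possibly compressed) body.
import struct

PAYLOAD_INPUT_SIGNAL = 0x01
PAYLOAD_FEATURES = 0x02

def pack_payload(kind: int, body: bytes) -> bytes:
    """Prefix the payload body with its type code and length."""
    return struct.pack("!BI", kind, len(body)) + body
```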
 Next, the speech recognition device determination unit 105b determines, according to the situation detected by the situation detection unit 104, whether the speech recognition unit 103 on the terminal side performs speech recognition based on the time series of feature amounts converted by the feature amount conversion unit 102, whether the time series of feature amounts is transmitted to the server device 200 (that is, whether the speech recognition unit 204 on the server side performs the speech recognition processing), or whether both sides perform speech recognition (step S105).
 When the feature amount conversion device determination unit 105a has determined that the input signal is not transmitted to the server device 200, the speech recognition device determination unit 105b determines which device performs the speech recognition processing based on the time series of feature amounts using, for example, the following conditions; a parallel decision sketch follows the list.
 1. When the transmission line is disconnected, when the communication speed is very low, or when the CPU load of the server device 200 is high, the speech recognition device determination unit 105b determines that the feature amounts are not transmitted to the server device 200 and that speech recognition is performed on the terminal side.
 2. Otherwise, when the task size is large (that is, the task is more difficult), the speech recognition device determination unit 105b determines that the feature amounts are transmitted to the server device 200 and that the terminal side does not perform speech recognition; that is, speech recognition is performed on the server side.
 3. In all other cases, the speech recognition device determination unit 105b determines that the feature amounts are transmitted to the server device 200 and that speech recognition is also performed on the terminal side; that is, speech recognition is performed on both the terminal side and the server side.
 Whether the communication speed is low, whether the CPU load of the server device 200 is high, and whether the difficulty is high are each judged by comparison with a predetermined threshold.
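 A parallel sketch of this decision is given below; it differs from the feature-conversion decision only in its second rule, which keys on the task size. The threshold values are again illustrative assumptions.

```python
# Sketch of the recognition placement decision (step S105); the rule
# ordering follows the list above, while all thresholds are assumptions.
MIN_SPEED = 64_000   # assumed "very slow" link threshold (bps)
HIGH_LOAD = 0.8      # assumed server CPU-load threshold
LARGE_TASK = 10_000  # assumed task-size (e.g., vocabulary) threshold

def where_to_recognize(link_up, link_speed_bps, server_cpu_load, task_size):
    if not link_up or link_speed_bps < MIN_SPEED or server_cpu_load > HIGH_LOAD:
        return "terminal"  # rule 1: keep recognition local
    if task_size > LARGE_TASK:
        return "server"    # rule 2: hard task -> server only
    return "both"          # rule 3: recognize on both sides
```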
 When the speech recognition device determination unit 105b determines that speech recognition is performed on the terminal side ("terminal side" in step S105), the speech recognition unit 103 performs speech recognition on the time series of feature amounts (step S106). Specifically, the speech recognition unit 103 searches a storage unit (not shown) of the terminal 100 for the corresponding word string and adopts the search result as the speech recognition result. At this time, the speech recognition unit 103 simultaneously calculates a score representing the recognition result.
 When the speech recognition device determination unit 105b determines that speech recognition is performed on the server side ("server side" in step S105), it causes the transmission/reception unit 106 to transmit the feature amounts (step S107). The transmission/reception unit 106 may, for example, compress the time series of feature amounts in units of a certain block before transmission, or may encode it before transmission. At this time, the transmission/reception unit 106 may attach header information or the like to the head of the data to indicate that the transmitted content is feature amounts.
 When speech recognition has been performed on the server side (Yes in step S108), the recognition result integration unit 107 selects between the terminal-side speech recognition result and the server-side speech recognition result and integrates the two results (step S109).
 When speech recognition was performed on only one of the terminal side and the server side, the recognition result integration unit 107 simply selects that result without integration. When speech recognition was performed on both sides, the recognition result integration unit 107 integrates the results; specifically, it selects one of the two recognition results and integrates the selected result.
 The recognition result integration unit 107 may, for example, select the recognition result with the higher of the scores calculated by the speech recognition unit 103 and the speech recognition unit 204. It may also compare the recognition results in division units such as words and select, for each compared unit, the result with the higher score.
 Furthermore, when the speech recognition result from the server device 200 is not obtained within a certain time because of the line speed or the like, the recognition result integration unit 107 may use only the terminal-side speech recognition result without waiting for the result from the server device 200. A sketch of this integration follows.
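 The following is a minimal sketch of this integration step, assuming each result arrives as a (text, score) pair and that an absent or timed-out server result is represented by None; the word-by-word comparison variant described above is omitted for brevity.

```python
# Sketch of the integration step (S109) under the assumptions stated above.
def integrate(terminal_result, server_result):
    """Return the more likely of the two recognition results."""
    if server_result is None:      # server absent, or timed out on a slow line
        return terminal_result
    if terminal_result is None:    # recognition ran only on the server
        return server_result
    text_t, score_t = terminal_result
    text_s, score_s = server_result
    # Keep the result whose recognizer reported the higher likelihood score.
    return (text_t, score_t) if score_t >= score_s else (text_s, score_s)
```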
 When speech recognition has not been performed on the server side (No in step S108), or after the recognition results have been integrated, the recognition result display unit 108 displays the speech recognition result (step S110). The recognition result display unit 108 may, for example, display the recognition result as a character string on a display device. It may additionally announce a speech-synthesized version of the recognition result through headphones or a speaker (not shown).
 Next, the operation on the server side is described with reference to FIG. 3. On the server side, the transmission/reception unit 201 first receives the data from the terminal side (step S201). When the received data has been compressed or encoded, the transmission/reception unit 201 decompresses or decodes it.
 Next, the process control unit 202 switches the subsequent processing according to the content of the received data (step S202). When the received data is a time series of the input signal ("input signal" in step S202), the feature amount conversion unit 203 converts the input signal into feature amounts (step S203). When the received data is a time series of feature amounts ("feature amounts" in step S202), the feature amount conversion unit 203 performs no conversion processing.
 When conversion is performed, the feature amount conversion unit 203 on the server side converts the time series of the input signal into a time series of feature amounts frame by frame. When the received data is a time series of feature amounts, or after the feature amount conversion unit 203 has converted the time series of the input signal, the speech recognition unit 204 performs speech recognition on the time series of feature amounts (step S204). Specifically, the speech recognition unit 204 searches a storage unit (not shown) of the server device 200 for the corresponding word string and adopts the search result as the speech recognition result, simultaneously calculating a score representing the recognition result. The transmission/reception unit 201 then transmits the speech recognition result to the terminal 100 (step S205). This server-side flow is sketched below.
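 Assuming the illustrative five-byte header sketched earlier (a one-byte type code plus a four-byte length), the server-side dispatch could look like the following; to_features and recognize are hypothetical callables standing in for the feature amount conversion unit 203 and the speech recognition unit 204.

```python
# Server-side dispatch sketch (steps S201-S205) under the assumed framing.
import struct

PAYLOAD_INPUT_SIGNAL = 0x01
PAYLOAD_FEATURES = 0x02

def handle_packet(packet: bytes, to_features, recognize):
    kind, length = struct.unpack("!BI", packet[:5])
    body = packet[5:5 + length]
    if kind == PAYLOAD_INPUT_SIGNAL:
        features = to_features(body)   # step S203: convert the raw signal first
    else:
        features = body                # already a feature time series
    return recognize(features)         # step S204: recognition result and score
```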
 Next, the effects of this embodiment are described. As described above, according to this embodiment, the feature amount conversion device determination unit 105a selects where the calculation of the feature amounts used for speech recognition is performed according to the speech input situation (for example, the environment in which the speech is input, the task size, the states of the terminal 100 and the server device 200, and the state of the communication line), and the speech recognition device determination unit 105b selects where speech recognition is performed according to the speech input situation. The processing involved in speech recognition can therefore be appropriately shared between the terminal 100 and the server device 200.
 Specifically, the situation detection unit 104 detects the terminal-side CPU load, the server-side CPU load, the terminal-side memory usage, the server-side memory usage, the task size, the noise level, the state of the transmission line, and so on, and the process control unit 105 (more specifically, the feature amount conversion device determination unit 105a and the speech recognition device determination unit 105b) controls which processing is performed on the terminal side and which on the server side. Efficient distribution of the processing and responsive speech recognition can thus be achieved.
 That is, in order for the devices to share the processing involved in speech recognition in an optimal ratio, the executing device must be determined based on various factors other than CPU load, such as the task size, the noise level, and the state of the line over which information is transmitted. In the speech recognition system described in Patent Document 1, for example, the executing device is determined based on CPU load, so a sufficient sharing effect can hardly be expected. According to this embodiment, by contrast, the executing device is determined based on various factors other than CPU load, so the processing involved in speech recognition can be appropriately shared between the terminal and the server device.
 Embodiment 2.
 FIG. 4 is a block diagram showing an example of a speech recognition system according to the second embodiment of the present invention. Components identical to those of the first embodiment are given the same reference signs as in FIG. 1, and their description is omitted.
 The speech recognition system in this embodiment includes a terminal 300 and a server device 400. In the following description, the terminal 300 may be referred to as the terminal side, and the server device 400 as the server side. The terminal 300 and the server device 400 are connected via, for example, a public Internet network.
 The terminal 300 includes an input signal acquisition unit 101, a feature amount conversion unit 102, a speech recognition unit 103, a situation detection unit 104, a speech detection unit 301, a process control unit 302, a transmission/reception unit 106, a recognition result integration unit 107, and a recognition result display unit 108. That is, the terminal 300 of the second embodiment differs from the terminal 100 of the first embodiment in that the speech detection unit 301 is added and the process control unit 105 of the first embodiment is replaced by the process control unit 302.
 The speech detection unit 301 determines the speech section to be recognized from the time series of the input signal input to the input signal acquisition unit 101 and cuts out the time series of that speech section. That is, the speech detection unit 301 extracts a speech section from the time series of the input signal. The speech detection unit 301 may, for example, detect an utterance section by measuring the energy of the framed speech data, as described in Reference 1, or may detect a speech section using a plurality of feature amounts extracted from the input signal, as described in Reference 2. The method by which the speech detection unit 301 detects a speech section is, however, not limited to these methods. A rough sketch of the energy-based approach is given after the reference list.
 [Reference 1] Japanese Patent Laid-Open No. 2005-31632
 [Reference 2] Japanese Patent Laid-Open No. 2007-17620
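 As a rough illustration of the energy-based approach attributed to Reference 1, the sketch below marks a frame as speech when its energy exceeds a threshold and widens each detected region by a few margin frames (a margin also mentioned for step S302 below); the threshold and margin values are assumptions, not values from either reference.

```python
# Minimal energy-based speech detection sketch. energy_threshold and margin
# are illustrative assumptions.
def detect_speech(frames, energy_threshold=1.0e6, margin=3):
    """Return (start, end) frame-index pairs judged to contain speech."""
    energies = [sum(s * s for s in f) for f in frames]
    voiced = [e > energy_threshold for e in energies]
    regions, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                      # region opens on the first voiced frame
        elif not v and start is not None:
            # Close the region, padded by a few margin frames on each side.
            regions.append((max(0, start - margin), min(len(frames), i + margin)))
            start = None
    if start is not None:                  # speech continued to the last frame
        regions.append((max(0, start - margin), len(frames)))
    return regions
```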
 The process control unit 302 includes a speech detection device determination unit 302a, a feature amount conversion device determination unit 302b, and a speech recognition device determination unit 302c. Based on the situation detected by the situation detection unit 104, the process control unit 302 controls which device executes the subsequent processing. Specifically, the speech detection device determination unit 302a determines which device executes the processing of extracting the speech section, the feature amount conversion device determination unit 302b determines which device executes the processing of converting the input signal from which the speech section has been extracted into feature amounts, and the speech recognition device determination unit 302c determines which device executes the speech recognition processing based on the feature amounts. The method by which the speech recognition device determination unit 302c selects a device is the same as that of the speech recognition device determination unit 105b in the first embodiment.
 In the example shown in FIG. 4, the speech detection device determination unit 302a, the feature amount conversion device determination unit 302b, and the speech recognition device determination unit 302c each select whether the processing is executed by the terminal 300, by the server device 400, or by both the terminal 300 and the server device 400. The methods by which the speech detection device determination unit 302a and the feature amount conversion device determination unit 302b select the executing device are described later.
 The server device 400 includes a transmission/reception unit 201, a process control unit 401, a speech detection unit 402, a feature amount conversion unit 203, and a speech recognition unit 204. That is, the server device 400 of the second embodiment differs from the server device 200 of the first embodiment in that the speech detection unit 402 is added and the process control unit 202 of the first embodiment is replaced by the process control unit 401.
 The process control unit 401 determines the subsequent processing based on the information received from the terminal 300, specifically on whether the received data is a time series of the input signal, a time series of an input signal from which a speech section has been cut out, or a time series of feature amounts.
 For example, when the data received from the terminal 300 is a time series of the input signal, the process control unit 401 causes the speech detection unit 402 to execute the processing of cutting out a speech section. When the received data is a time series of an input signal from which a speech section has been cut out, the process control unit 401 causes the feature amount conversion unit 203 to execute the feature amount calculation processing. When the received data is a time series of feature amounts, the process control unit 401 causes the speech recognition unit 204 to execute speech recognition processing based on those feature amounts.
 The speech detection unit 402 determines the speech section to be recognized from the time series of the input signal received from the terminal 300 and cuts out the time series of that speech section. That is, the speech detection unit 402 extracts a speech section from the time series of the input signal.
 The input signal acquisition unit 101, the feature amount conversion unit 102, the speech recognition unit 103, the situation detection unit 104, the speech detection unit 301, the process control unit 302 (more specifically, the speech detection device determination unit 302a, the feature amount conversion device determination unit 302b, and the speech recognition device determination unit 302c), the transmission/reception unit 106, and the recognition result integration unit 107 are realized by the CPU of a computer operating according to a program (a speech recognition program).
 Alternatively, each of these units may be realized by dedicated hardware.
 Next, the operation will be described. FIG. 5 is a flowchart showing an example of the operation on the terminal side, and FIG. 6 is a flowchart showing an example of the operation on the server side. The operation on the terminal side is described first with reference to FIG. 5.
 On the terminal side, as in the first embodiment, when the input sound is collected using the microphone 99 or the like, the input signal acquisition unit 101 first cuts the collected time-series input sound data into frames of a unit time (step S101). Next, the speech detection device determination unit 302a determines, according to the situation detected by the situation detection unit 104, whether the speech section to be recognized is determined and cut out of the time series of the input signal on the terminal side, whether the input signal is transmitted from the transmission/reception unit 106 to the server side (that is, determined and cut out on the server side), or whether the determination and cutting are performed on both sides (step S301).
 The speech detection device determination unit 302a determines which device determines and cuts the speech section out of the time series of the input signal based on, for example, the following conditions.
 1. When the transmission line is disconnected, when the communication speed is very low, or when the CPU load of the server device 400 is high, the speech detection device determination unit 302a determines that the input signal is not transmitted to the server device 400 and that the speech section is cut out on the terminal side.
 2. Otherwise, when the CPU load of the terminal 300 is high, when the memory usage on the terminal side is large, or when the noise level is high, the speech detection device determination unit 302a determines that the input signal is transmitted to the server side and that the speech section is not cut out on the terminal side; that is, the speech section is cut out on the server side.
 3. In all other cases, the speech detection device determination unit 302a determines that the input signal is transmitted to the server side and that the speech section is also cut out on the terminal side; that is, the speech section is cut out on both the terminal side and the server side.
 Whether the communication speed is low, whether the CPU loads of the terminal 300 and the server device 400 are high, whether the memory usage on the terminal side is large, and whether the noise level is high are each judged by comparison with a predetermined threshold.
 When the speech detection device determination unit 302a determines that the speech section is cut out on the terminal side ("terminal side" in step S301), the speech detection unit 301 determines the speech section to be recognized from the time series of the input signal and cuts out only that speech section (step S302). At this time, the speech detection unit 301 may cut the section out with a margin of several frames before and after it.
 When the speech detection device determination unit 302a determines that the speech section is cut out on the server side ("server side" in step S301), it causes the transmission/reception unit 106 to transmit the input signal (step S104). As in the first embodiment, the transmission/reception unit 106 may compress the time series of the input signal in units of a certain block before transmission, or may encode it before transmission.
 Next, the feature amount conversion device determination unit 302b determines, according to the situation detected by the situation detection unit 104, whether the time series of the speech section cut out on the terminal side is converted into a time series of feature amounts by the feature amount conversion unit 102 on the terminal side, transmitted from the transmission/reception unit 106 to the server device 400 (that is, converted into a time series of feature amounts by the feature amount conversion unit 203 on the server side), or converted on both sides (step S303).
 When the speech detection device determination unit 302a has determined that the input signal is not transmitted to the server device 400, the feature amount conversion device determination unit 302b determines which device converts the time series of the input signal into a time series of feature amounts based on, for example, the following conditions.
 1. When the transmission line is disconnected, when the communication speed is very low, or when the CPU load of the server device 400 is high, the feature amount conversion device determination unit 302b determines that the cut-out speech section is not transmitted to the server device 400 and is converted into feature amounts on the terminal side.
 2. Otherwise, when the CPU load on the terminal side is high, when the memory usage on the terminal side is large, or when the noise level is high, the feature amount conversion device determination unit 302b determines that the cut-out speech section is transmitted to the server side and that the terminal side does not perform the feature amount conversion; that is, the conversion is performed on the server side.
 3. In all other cases, the feature amount conversion device determination unit 302b determines that the cut-out speech section is transmitted to the server side and is also converted into feature amounts on the terminal side; that is, the conversion is performed on both the terminal side and the server side.
 Whether the communication speed is low, whether the CPU loads of the terminal 300 and the server device 400 are high, whether the memory usage on the terminal side is large, and whether the noise level is high are judged, as in the determinations by the speech detection device determination unit 302a, by comparison with predetermined thresholds.
 When the feature amount conversion device determination unit 302b determines that the conversion is performed on the terminal side ("terminal side" in step S303), the feature amount conversion unit 102 converts the time series of the input signal cut out frame by frame into a time series of feature amounts (step S103).
 When the feature amount conversion device determination unit 302b determines that the conversion is performed on the server side ("server side" in step S303), it causes the transmission/reception unit 106 to transmit the input signal from which the speech section has been cut out (step S304).
 Thereafter, the processing from the speech recognition device determination unit 302c determining the device that performs the speech recognition processing to the recognition result display unit 108 displaying the recognition result integrated by the recognition result integration unit 107 is the same as the processing of steps S105 to S110 illustrated in FIG. 2.
 Next, the operation on the server side is described with reference to FIG. 6. On the server side, as in the first embodiment, the transmission/reception unit 201 first receives the data from the terminal side (step S201). When the received data has been compressed or encoded, the transmission/reception unit 201 decompresses or decodes it.
 Next, the process control unit 401 switches the subsequent processing according to the content of the received data (step S401). When the received data is a time series of an input signal from which no speech section has been cut out ("input signal" in step S401), the speech detection unit 402 determines the speech section to be recognized from the time series of the input signal and cuts out the determined speech section (step S402). At this time, the speech detection unit 402 may cut the section out with a margin of several frames before and after it.
 After the speech detection unit 402 has cut out the speech section, or when the received data is a time series of an input signal from which a speech section has already been cut out ("speech-section input signal" in step S401), the feature amount conversion unit 203 converts that input signal into feature amounts (step S203).
 After the feature amount conversion unit 203 has converted the input signal into feature amounts, or when the received data is a time series of feature amounts ("feature amounts" in step S401), the speech recognition unit 204 performs speech recognition (step S204). The subsequent processing in which the transmission/reception unit 201 transmits the speech recognition result to the terminal 300 is the same as step S205 illustrated in FIG. 3.
 Next, the effects of this embodiment are described. In general, speech detection processing can raise the accuracy of speech recognition but requires considerable CPU power, and the server side can afford such processing more easily than the terminal side. According to this embodiment, in addition to the configuration of the first embodiment, the speech detection device determination unit 302a selects which device performs the speech section extraction processing according to the speech input situation, and when the terminal 300 is selected, the speech detection unit 301 extracts the speech section from the input signal. Therefore, when the noise level is high, for example, performing the speech detection processing on the server side yields a more accurate speech recognition result.
 Embodiment 3.
 FIG. 7 is a block diagram showing an example of a speech recognition system according to the third embodiment of the present invention. Components identical to those of the first embodiment are given the same reference signs as in FIG. 1, and their description is omitted.
 The speech recognition system in this embodiment includes a terminal 500 and a server device 600. The terminal 500 includes an input signal acquisition unit 101, a feature amount conversion unit 102, a speech recognition unit 103, a situation detection unit 104, a noise removal unit 501, a process control unit 502, a transmission/reception unit 106, a recognition result integration unit 107, and a recognition result display unit 108. That is, the terminal 500 of the third embodiment differs from the terminal 100 of the first embodiment in that the noise removal unit 501 is added and the process control unit 105 of the first embodiment is replaced by the process control unit 502.
 The noise removal unit 501 removes noise components from the input signal, using, for example, a method such as spectral subtraction or a Wiener filter. The method by which the noise removal unit 501 removes noise components is, however, not limited to these; any other widely known method may be used. A sketch of the spectral subtraction variant follows.
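 The following is a minimal sketch of spectral subtraction, one of the two noise-removal methods named above, assuming a stationary noise magnitude spectrum noise_mag estimated beforehand (for example, from non-speech frames); the flooring factor is an assumption, and this is an illustration rather than the implementation prescribed by the specification.

```python
# Sketch of spectral subtraction on one time-domain frame, using NumPy.
import numpy as np

def spectral_subtract(frame, noise_mag, floor=0.01):
    """Remove a stationary noise estimate from one frame of samples."""
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    # Subtract the noise magnitude and floor the result to avoid
    # negative magnitudes (a common source of "musical noise").
    cleaned = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(cleaned * np.exp(1j * phase), n=len(frame))
```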
 The process control unit 502 includes a noise removal device determination unit 502a, a feature amount conversion device determination unit 105a, and a speech recognition device determination unit 105b. As in the first embodiment, the process control unit 502 controls, based on the situation detected by the situation detection unit 104, which device executes the subsequent processing. Specifically, the noise removal device determination unit 502a determines which device executes the noise removal processing, and the feature amount conversion device determination unit 105a determines which device executes the processing of converting the noise-removed input signal into feature amounts. The speech recognition device determination unit 105b is the same as in the first embodiment.
 In the example shown in FIG. 7, the noise removal device determination unit 502a, the feature amount conversion device determination unit 105a, and the speech recognition device determination unit 105b each select whether the processing is executed by the terminal 500, by the server device 600, or by both the terminal 500 and the server device 600. The noise removal device determination unit 502a may decide which device executes the noise removal processing by, for example, a method similar to that by which the feature amount conversion device determination unit 105a and the speech recognition device determination unit 105b select a device. The methods by which the feature amount conversion device determination unit 105a and the speech recognition device determination unit 105b determine the executing device are the same as in the first embodiment.
 The server device 600 includes a transmission/reception unit 201, a process control unit 601, a noise removal unit 602, a feature amount conversion unit 203, and a speech recognition unit 204. That is, the server device 600 of the third embodiment differs from the server device 200 of the first embodiment in that the noise removal unit 602 is added and the process control unit 202 of the first embodiment is replaced by the process control unit 601.
 The process control unit 601 determines the subsequent processing based on the information received from the terminal 500, specifically on whether the received data is a time series of the input signal, a time series of a noise-removed input signal, or a time series of feature amounts.
 For example, when the data received from the terminal 500 is a time series of the input signal, the process control unit 601 causes the noise removal unit 602 to execute the processing of removing noise from the input signal. When the received data is a time series of a noise-removed input signal, the process control unit 601 causes the feature amount conversion unit 203 to execute the feature amount calculation processing. When the received data is a time series of feature amounts, the process control unit 601 causes the speech recognition unit 204 to execute speech recognition processing based on those feature amounts.
 The noise removal unit 602 removes noise from the input signal in the same manner as the noise removal unit 501; the method it uses may be the same as that of the noise removal unit 501 or different.
 The input signal acquisition unit 101, the feature amount conversion unit 102, the speech recognition unit 103, the situation detection unit 104, the noise removal unit 501, the process control unit 502 (more specifically, the noise removal device determination unit 502a, the feature amount conversion device determination unit 105a, and the speech recognition device determination unit 105b), the transmission/reception unit 106, and the recognition result integration unit 107 are realized by the CPU of a computer operating according to a program (a speech recognition program).
 This embodiment has been described as adding the noise removal unit 501 and the noise removal unit 602 to the speech recognition system of the first embodiment; the noise removal unit 501 and the noise removal unit 602 may also be included in the speech recognition system of the second embodiment.
 Next, the operation will be described. FIG. 8 is a flowchart showing an example of the operation on the terminal side, and FIG. 9 is a flowchart showing an example of the operation on the server side. The operation on the terminal side is described first with reference to FIG. 8.
 On the terminal side, as in the first embodiment, when the input sound is collected using the microphone 99 or the like, the input signal acquisition unit 101 first cuts the collected time-series input sound data into frames of a unit time (step S101). Next, the noise removal device determination unit 502a determines, according to the situation detected by the situation detection unit 104, whether the processing of removing noise from the input signal is performed on the terminal side, on the server side, or on both (step S501).
 When the noise removal device determination unit 502a determines that the noise of the input signal is removed on the terminal side ("terminal side" in step S501), the noise removal unit 501 removes the noise from the time series of the input signal (step S502). When the noise removal device determination unit 502a determines that the noise is removed on the server side ("server side" in step S501), it causes the transmission/reception unit 106 to transmit the input signal (step S503).
 Thereafter, the processing from the feature amount conversion device determination unit 105a determining the device that calculates the feature amounts to the recognition result display unit 108 displaying the recognition result integrated by the recognition result integration unit 107 is the same as the processing of steps S102 to S110 illustrated in FIG. 2.
 Next, the server-side operation is described with reference to FIG. 9. On the server side, as in the second embodiment, the transmission/reception unit 201 first receives data from the terminal side (step S201). The processing control unit 601 then changes the subsequent processing according to the content of the received data (step S601). When the received data is the time series of an input signal from which noise has not been removed ("input signal" in step S601), the noise removal unit 602 removes noise from that time series (step S602).
 After the noise removal unit 602 has removed the noise, or when the received data is the time series of an already denoised input signal ("denoised input signal" in step S601), the feature conversion unit 203 converts the input signal into features (step S203). The subsequent processing, from performing speech recognition on the features to transmitting the speech recognition result to the terminal 500, is the same as the processing of steps S204 to S205 illustrated in FIG. 6.
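 The server-side dispatch of steps S601, S602, and S203 amounts to normalizing whatever was received into features before recognition. The sketch below is a hedged illustration; the tags "input_signal", "denoised_signal", and "features" are assumed labels, not identifiers from the embodiment.

    def handle_received(kind, payload, denoise, to_features, recognize):
        # Step S601: branch on the content of the received data.
        if kind == "input_signal":
            payload = denoise(payload)        # step S602: noise removal unit 602
            kind = "denoised_signal"
        if kind == "denoised_signal":
            payload = to_features(payload)    # step S203: feature conversion unit 203
            kind = "features"
        return recognize(payload)             # steps S204-S205: recognize and reply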
 Next, the effect of this embodiment is described. In general, noise suppression processing can raise the accuracy of speech recognition but demands considerable CPU power, and the server side can afford such processing better than the terminal side. According to this embodiment, in addition to the configuration of the first embodiment, the noise removal device determination unit 502a selects, according to the voice input situation, the device that performs the noise removal processing, and when the terminal 500 is selected, the noise removal unit 501 removes the noise component from the input signal. Thus, for example, when the noise level is high, performing the noise suppression on the server side yields a more accurate speech recognition result.
Embodiment 4.
 Next, a speech recognition system according to the fourth embodiment of the present invention will be described. As described in the first to third embodiments, the processing control unit 105, the processing control unit 302, and the processing control unit 502 (hereinafter referred to as each processing control unit) control, based on the situation detected by the situation detection unit 104, which device executes the subsequent processing.
 In this embodiment, an index determined according to the situations detected by the situation detection unit 104 (hereinafter referred to as a score table) is set in advance, and each processing control unit decides, based on that score table, which device executes the subsequent processing. Specifically, each processing control unit calculates, from the scores defined for the situations detected by the situation detection unit 104, the total score corresponding to the voice input situation. Each processing control unit then compares the calculated total with predetermined conditions and selects the device that performs the feature calculation processing and the device that performs the speech recognition. The score table is stored in advance, for example, in a storage unit (not shown) on the terminal side.
 FIG. 10 is an explanatory diagram showing an example of a score table. The score table illustrated in FIG. 10 associates each situation detected by the situation detection unit 104 with a score representing the weight of that situation. For example, when the situation detection unit 104 detects that communication between the terminal side and the server side is disconnected, that situation is given a weight of 5 points.
 Each processing control unit calculates the total V of the scores corresponding to the situations detected by the situation detection unit 104 and then selects, based on predetermined conditions, which device executes the subsequent processing. For example, the conditions may be set so that when the total V is 4 or more, no information is transmitted to the server side (that is, the processing is performed on the terminal side); when V is 2 or more and less than 4, the features are transmitted to the server side; and when V is less than 2, the input signal is transmitted to the server side. Each processing control unit selects the device that executes the subsequent processing based on conditions set in this way. The conditions are not limited to the above.
 Defining such a score table in advance makes fine-grained decisions that reflect the environment possible.
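 As a concrete illustration, the following Python sketch computes the total V and applies the example thresholds given above (V of 4 or more: process on the terminal; 2 or more and less than 4: send features; less than 2: send the input signal). The 5-point weight for a disconnected link comes from the text; the other situations and weights are assumed for the example.

    SCORE_TABLE = {
        "link_down": 5,          # communication with the server is disconnected (from the text)
        "small_vocabulary": 2,   # assumed weight
        "high_noise": 1,         # assumed weight
        "server_overloaded": 1,  # assumed weight
    }

    def decide(detected_situations):
        v = sum(SCORE_TABLE.get(s, 0) for s in detected_situations)  # total V
        if v >= 4:
            return "process_on_terminal"   # send nothing to the server
        if v >= 2:
            return "send_features"         # terminal computes the features
        return "send_input_signal"         # server does the rest

    print(decide({"link_down"}))           # -> process_on_terminal
    print(decide({"small_vocabulary"}))    # -> send_features
    print(decide({"high_noise"}))          # -> send_input_signal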
Embodiment 5.
 Next, a speech recognition system according to the fifth embodiment of the present invention will be described. In the first to fourth embodiments, the terminal-side device and the server-side device were connected one to one. In the speech recognition system of this embodiment, the connection between terminal-side and server-side devices is not limited to one to one; two or more devices may be connected on each side.
 FIG. 11 is an explanatory diagram showing an example of the speech recognition system in this embodiment. The speech recognition system in this embodiment comprises terminals A to D, server devices A to D, and a connection state controller 700 connected between the terminals A to D and the server devices A to D. The terminals A to D are configured in the same way as the terminals 100, 300, and 500 in the first to fourth embodiments, and the server devices A to D in the same way as the server devices 200, 400, and 600 in those embodiments.
 The connection state controller 700 selects the server devices A to D to which the terminals A to D are connected. Specifically, the control unit 701 of the connection state controller 700 selects the server devices for the terminals A to D based on at least one of the data format transmitted from the terminal side, the server-side CPU load, and the server-side memory usage rate. The data transmitted from the terminal side may include information indicating whether it is a raw input signal, an input signal from which speech segments have been extracted, an input signal from which noise has been removed, or features. The connection state controller 700 is realized, for example, by a server device, and its control unit 701 is realized, for example, by a CPU included in the connection state controller 700.
 The operation of the connection state controller 700 is as follows. Upon receiving a connection request that includes the data format to be transmitted from the terminal side, the control unit 701 selects the server devices that can handle that data format. From the selected server devices, the control unit 701 may further narrow the choice by criteria such as a low CPU load or a small memory usage; the number of server devices selected by the control unit 701 is not limited to one and may be two or more. After selecting the server device(s), the control unit 701 sets up a connection between the terminal that issued the connection request and the selected server device(s), and data is subsequently exchanged over that connection.
 The above description assumed that the control unit 701 further narrows the selected server devices by criteria such as a low CPU load or a small memory usage, but the criteria by which the control unit 701 selects a server device are not limited to these. The control unit 701 may select the server device using any criterion defined in terms of the data format transmitted from the terminal side, the server-side CPU load, and the server-side memory usage rate.
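 A minimal sketch of this selection is shown below, assuming a simple Server record with the fields named here; the embodiment fixes only the selection criteria (supported data format, CPU load, memory usage rate), not this data model.

    from dataclasses import dataclass

    @dataclass
    class Server:
        name: str
        formats: set       # data formats this server device can accept
        cpu_load: float    # current CPU load, 0.0-1.0
        mem_usage: float   # current memory usage rate, 0.0-1.0

    def select_servers(servers, data_format, n=1):
        # Keep the servers that can handle the announced data format,
        # then prefer the least loaded ones; one or more may be selected.
        capable = [s for s in servers if data_format in s.formats]
        capable.sort(key=lambda s: (s.cpu_load, s.mem_usage))
        return capable[:n]

    servers = [
        Server("A", {"features"}, 0.7, 0.4),
        Server("B", {"features", "input_signal"}, 0.2, 0.3),
    ]
    print([s.name for s in select_servers(servers, "features")])  # -> ['B']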
 Since the control unit 701 selects the server devices to which the terminals A to D are connected in this way, the combinations of the terminals A to D and the server devices A to D are not limited to the combinations of configurations exemplified in the individual embodiments. The speech recognition system in this embodiment may include, for example, the terminal 100 of the first embodiment and the server device 600 of the third embodiment.
 In this case, the situation detection unit 104 of a terminal may detect, as the server-device situation and the line state, for example, the server device with the lowest CPU load, the server side with the smallest memory usage, or the server device with the fastest communication. The connection state controller 700 then connects the terminal to a server device in response to a connection request based on the information detected by the situation detection unit 104.
 Next, the minimum configuration of the present invention is described. FIG. 12 is a block diagram showing an example of the minimum configuration of the speech recognition system according to the present invention, and FIG. 13 is a block diagram showing an example of the minimum configuration of the voice acquisition terminal according to the present invention. The speech recognition system according to the present invention comprises a voice acquisition terminal 80 (for example, the terminal 100) to which speech is input and which acquires an input signal representing that speech (for example, input sound data), and a speech recognition device 90 (for example, the server device 200) that performs speech recognition based on information transmitted from the voice acquisition terminal 80.
 The voice acquisition terminal 80 comprises a processing device determination means 81 (for example, the processing control unit 105) that selects, according to the voice input situation, at least one device from the voice acquisition terminal 80 itself and the speech recognition device 90 to perform the calculation of the features used for speech recognition (for example, by the feature conversion device determination unit 105a), and selects, according to the input situation, at least one device from the voice acquisition terminal 80 itself and the speech recognition device 90 to perform speech recognition based on the calculated features (for example, by the speech recognition device determination unit 105b).
 The processing device determination means 81 selects the device that performs the feature calculation processing and the device that performs the speech recognition according to information representing at least one of the voice input environment (for example, the noise level), the task size (for example, the vocabulary size of recognizable words or the complexity of utterances), the situation of the voice acquisition terminal 80 itself (for example, the CPU load or memory usage of the terminal 100), the situation of the speech recognition device 90 (for example, the CPU load or memory usage of the server device 200), and the communication situation between the voice acquisition terminal 80 and the speech recognition device 90 (for example, the communication being disconnected or slow).
 Such a configuration makes it possible to apportion the processing performed for speech recognition appropriately between the terminal and the server device.
 Specifically, the voice acquisition terminal 80 may comprise a situation detection means (for example, the situation detection unit 104) that detects the voice input situation. In this case, the situation detection means detects a situation representing at least one of the voice input environment, the task size, the situation of the voice acquisition terminal 80 itself, the situation of the speech recognition device 90, and the communication situation between the voice acquisition terminal 80 and the speech recognition device 90, and the processing device determination means 81 may select the device that performs the feature calculation and the device that performs the speech recognition according to the situation detected by the situation detection means.
 The voice acquisition terminal 80 may also comprise a voice detection means (for example, the voice detection unit 301) that extracts speech segments from the input signal. In this case, the processing device determination means 81 selects, according to the voice input situation, at least one device from the voice acquisition terminal 80 itself and the speech recognition device 90 to perform the speech segment extraction, and the voice detection means extracts the speech segments from the input signal when the processing device determination means selects the voice acquisition terminal 80 itself. Apportioning the voice detection processing, which demands considerable CPU power, in this way makes it possible to obtain a more accurate speech recognition result.
 The voice acquisition terminal 80 may also comprise a noise removal means (for example, the noise removal unit 501) that removes the noise component from the input signal. In this case, the processing device determination means 81 selects, according to the voice input situation, at least one device from the voice acquisition terminal 80 itself and the speech recognition device 90 to perform the noise component removal, and the noise removal means removes the noise component from the input signal when the processing device determination means selects the voice acquisition terminal 80 itself. Apportioning the noise removal processing, which demands considerable CPU power, in this way makes it possible to obtain a more accurate speech recognition result.
 The processing device determination means 81 may also calculate, based on the scores that are indices predetermined for the situations detected by the situation detection means (for example, the score table illustrated in FIG. 10), the total score corresponding to the voice input situation (for example, the total V), and select the device that performs the feature calculation and the device that performs the speech recognition by comparing the calculated total with predetermined conditions. Defining such a score table in advance makes fine-grained, environment-dependent decisions possible.
 The voice acquisition terminal 80 may also comprise a communication means (for example, the transmission/reception unit 106) that transmits information representing the acquired input signal or information representing the calculated features to the speech recognition device. In this case, the communication means may notify the speech recognition device 90 of the data format of the information before transmitting the information to the speech recognition device 90, and may receive the speech recognition result for that information from the speech recognition device 90.
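 The notify-then-send exchange can be sketched as follows. The framing (a JSON header line followed by the raw payload) is purely an assumption for illustration; the embodiment fixes only the order of events: announce the data format, transmit the information, receive the recognition result.

    import json
    import socket

    def send_with_format(sock, data_format, payload):
        # Notify the speech recognition device of the data format first...
        header = json.dumps({"format": data_format, "length": len(payload)})
        sock.sendall(header.encode("utf-8") + b"\n")
        # ...then transmit the information itself...
        sock.sendall(payload)
        # ...and receive the speech recognition result for it.
        return sock.recv(65536)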
 The speech recognition system may also comprise at least one speech recognition device 90 (for example, the server devices A to D) and a connection destination control device (for example, the connection state controller 700) that is connected between the voice acquisition terminal 80 and the speech recognition devices 90 and selects the connection destination of the voice acquisition terminal 80 from among the speech recognition devices 90. The connection destination control device may comprise a selection means (for example, the control unit 701) that selects the connection destination of the voice acquisition terminal 80 based on at least one of the data format of the information transmitted from the voice acquisition terminal 80 to the speech recognition device 90 (for example, information indicating an input signal or information indicating features), the CPU load of each speech recognition device, and the memory usage rate of each speech recognition device.
 The voice acquisition terminal illustrated in FIG. 13 also comprises the processing device determination means 81 (for example, the processing control unit 105); its content is the same as that shown in FIG. 12.
 Part or all of the above embodiments can also be described as in the following supplementary notes, but they are not limited to the following.
(Supplementary note 1) A speech recognition system comprising: a voice acquisition terminal to which speech is input and which acquires an input signal representing the speech; and a speech recognition device that performs speech recognition based on information transmitted from the voice acquisition terminal, wherein the voice acquisition terminal comprises: a processing device determination means that selects, according to the voice input situation, at least one device from the voice acquisition terminal itself and the speech recognition device to perform the calculation of features used for speech recognition, and selects, according to the input situation, at least one device from the voice acquisition terminal itself and the speech recognition device to perform speech recognition based on the calculated features; and a speech recognition means that performs speech recognition of the input signal based on the selection result, and wherein the processing device determination means selects the device that performs the feature calculation and the device that performs the speech recognition according to information representing at least one of the voice input environment, the task size, the situation of the voice acquisition terminal itself, the situation of the speech recognition device, and the communication situation between the voice acquisition terminal and the speech recognition device.
(Supplementary note 2) The speech recognition system according to supplementary note 1, wherein the voice acquisition terminal comprises a situation detection means that detects the voice input situation, the situation detection means detects a situation representing at least one of the voice input environment, the task size, the situation of the voice acquisition terminal itself, the situation of the speech recognition device, and the communication situation between the voice acquisition terminal and the speech recognition device, and the processing device determination means selects the device that performs the feature calculation and the device that performs the speech recognition according to the situation detected by the situation detection means.
(Supplementary note 3) The speech recognition system according to supplementary note 1 or 2, wherein the voice acquisition terminal comprises a voice detection means that extracts speech segments from the input signal, the processing device determination means selects, according to the voice input situation, at least one device from the voice acquisition terminal itself and the speech recognition device to perform the speech segment extraction, and the voice detection means extracts the speech segments from the input signal when the processing device determination means selects the voice acquisition terminal itself.
(Supplementary note 4) The speech recognition system according to any one of supplementary notes 1 to 3, wherein the voice acquisition terminal comprises a noise removal means that removes a noise component from the input signal, the processing device determination means selects, according to the voice input situation, at least one device from the voice acquisition terminal itself and the speech recognition device to perform the noise component removal, and the noise removal means removes the noise component from the input signal when the processing device determination means selects the voice acquisition terminal itself.
(Supplementary note 5) The speech recognition system according to any one of supplementary notes 2 to 4, wherein the processing device determination means calculates, based on the scores that are indices predetermined for the situations detected by the situation detection means, the total score corresponding to the voice input situation, and selects the device that performs the feature calculation and the device that performs the speech recognition by comparing the calculated total with predetermined conditions.
(Supplementary note 6) The speech recognition system according to any one of supplementary notes 1 to 5, wherein the voice acquisition terminal comprises a communication means that transmits information representing the acquired input signal or information representing the calculated features to the speech recognition device, and the communication means notifies the speech recognition device of the data format of the information, then transmits the information to the speech recognition device and receives the speech recognition result for the information from the speech recognition device.
(Supplementary note 7) The speech recognition system according to any one of supplementary notes 1 to 6, comprising at least one speech recognition device and a connection destination control device that is connected between the voice acquisition terminal and the speech recognition devices and selects the connection destination of the voice acquisition terminal from among the speech recognition devices, wherein the connection destination control device comprises a selection means that selects the connection destination of the voice acquisition terminal based on at least one of the data format of the information transmitted from the voice acquisition terminal to the speech recognition device, the CPU load of each speech recognition device, and the memory usage rate of each speech recognition device.
(Supplementary note 8) The speech recognition system according to any one of supplementary notes 2 to 7, wherein the situation detection means detects a situation representing at least one of the noise level representing the voice input environment, the vocabulary size or complexity of the recognition target representing the task size, the CPU load or memory usage rate representing the situation of the voice acquisition terminal itself, the CPU load or memory usage rate representing the situation of the speech recognition device, and the line state between the voice acquisition terminal and the speech recognition device.
(Supplementary note 9) The speech recognition system according to any one of supplementary notes 1 to 8, wherein the voice acquisition terminal comprises a likelihood calculation means that calculates a likelihood representing the plausibility of a speech recognition result, and a speech recognition result selection means that selects one speech recognition result from a plurality of speech recognition results, and wherein, when the speech recognition processing of the input signal has been performed by both the voice acquisition terminal and the speech recognition device, the speech recognition result selection means selects whichever of the two recognition results has the higher likelihood.
(Supplementary note 10) The speech recognition system according to any one of supplementary notes 6 to 9, wherein the communication means transmits the input signal to the speech recognition device when the speech recognition device is selected as the device that performs the feature calculation, and transmits the features calculated on the terminal side to the speech recognition device when the speech recognition device is selected as the device that performs the speech recognition.
(Supplementary note 11) A voice acquisition terminal to which speech is input and which acquires an input signal representing the speech, comprising a processing device determination means that selects, according to the voice input situation, at least one device to perform the calculation of features used for speech recognition from among a speech recognition device, which performs speech recognition based on information transmitted from the voice acquisition terminal, and the voice acquisition terminal itself, and selects, according to the input situation, at least one device to perform speech recognition based on the calculated features from among the speech recognition device and the voice acquisition terminal itself, wherein the processing device determination means selects the device that performs the feature calculation and the device that performs the speech recognition according to information representing at least one of the voice input environment, the task size, the situation of the voice acquisition terminal itself, the situation of the speech recognition device, and the communication situation between the voice acquisition terminal and the speech recognition device.
(Supplementary note 12) The voice acquisition terminal according to supplementary note 11, comprising a situation detection means that detects the voice input situation, wherein the situation detection means detects a situation representing at least one of the voice input environment, the task size, the situation of the voice acquisition terminal itself, the situation of the speech recognition device, and the communication situation between the voice acquisition terminal and the speech recognition device, and the processing device determination means selects the device that performs the feature calculation and the device that performs the speech recognition according to the situation detected by the situation detection means.
(Supplementary note 13) A speech recognition sharing method, wherein a voice acquisition terminal to which speech is input and which acquires an input signal representing the speech selects, according to the voice input situation, at least one device to perform the calculation of features used for speech recognition from among a speech recognition device, which performs speech recognition based on information transmitted from the voice acquisition terminal, and the voice acquisition terminal itself; the voice acquisition terminal selects, according to the input situation, at least one device to perform speech recognition based on the calculated features from among the speech recognition device and the voice acquisition terminal itself; and the voice acquisition terminal selects the device that performs the feature calculation and the device that performs the speech recognition according to information representing at least one of the voice input environment, the task size, the situation of the voice acquisition terminal itself, the situation of the speech recognition device, and the communication situation between the voice acquisition terminal and the speech recognition device.
(Supplementary note 14) The speech recognition sharing method according to supplementary note 13, wherein the voice acquisition terminal detects, as the voice input situation, a situation representing at least one of the voice input environment, the task size, the situation of the voice acquisition terminal itself, the situation of the speech recognition device, and the communication situation between the voice acquisition terminal and the speech recognition device, and the voice acquisition terminal selects the device that performs the feature calculation and the device that performs the speech recognition according to the detected situation.
(Supplementary note 15) A speech recognition program applied to a computer to which speech is input and which acquires an input signal representing the speech, the program causing the computer to execute a processing device determination process of selecting, according to the voice input situation, at least one device to perform the calculation of features used for speech recognition from among a speech recognition device, which performs speech recognition based on information transmitted from the computer, and the computer itself, and of selecting, according to the input situation, at least one device to perform speech recognition based on the calculated features from among the speech recognition device and the computer itself, wherein, in the processing device determination process, the device that performs the feature calculation and the device that performs the speech recognition are selected according to information representing at least one of the voice input environment, the task size, the situation of the voice acquisition terminal itself, the situation of the speech recognition device, and the communication situation between the voice acquisition terminal and the speech recognition device.
(Supplementary note 16) The speech recognition program according to supplementary note 15, causing the computer to execute a situation detection process of detecting the voice input situation, wherein, in the situation detection process, a situation representing at least one of the voice input environment, the task size, the situation of the voice acquisition terminal itself, the situation of the speech recognition device, and the communication situation between the voice acquisition terminal and the speech recognition device is detected, and, in the processing device determination process, the device that performs the feature calculation and the device that performs the speech recognition are selected according to the situation detected in the situation detection process.
 Although the present invention has been described above with reference to embodiments and examples, the present invention is not limited to the above embodiments and examples. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.
 This application claims priority based on Japanese Patent Application No. 2010-121016 filed on May 26, 2010, the entire disclosure of which is incorporated herein.
 The present invention is suitably applied to a speech recognition system that apportions speech recognition processing between a terminal and a server device.
 99 microphone
 100, 300, 500 terminal
 101 input signal acquisition unit
 102 feature conversion unit
 103 speech recognition unit
 104 situation detection unit
 105, 202, 302, 401, 502, 601 processing control unit
 105a feature conversion device determination unit
 105b speech recognition device determination unit
 106 transmission/reception unit
 107 recognition result integration unit
 108 recognition result display unit
 200, 400, 600 server device
 301, 402 voice detection unit
 501, 602 noise removal unit
 302a voice detection device determination unit
 302b feature conversion device determination unit
 302c speech recognition device determination unit
 502a noise removal device determination unit
 700 connection state controller
 701 control unit

Claims (10)

  1.  A speech recognition system comprising:
     a voice acquisition terminal to which speech is input and which acquires an input signal representing the speech; and
     a speech recognition device that performs speech recognition based on information transmitted from the voice acquisition terminal,
     wherein the voice acquisition terminal comprises a processing device determination means that selects, according to a voice input situation, at least one device from the voice acquisition terminal itself and the speech recognition device to perform calculation of features used for speech recognition, and selects, according to the input situation, at least one device from the voice acquisition terminal itself and the speech recognition device to perform speech recognition based on the calculated features, and
     wherein the processing device determination means selects the device that performs the feature calculation and the device that performs the speech recognition according to information representing at least one of a voice input environment, a task size, a situation of the voice acquisition terminal itself, a situation of the speech recognition device, and a communication situation between the voice acquisition terminal and the speech recognition device.
  2.  The speech recognition system according to claim 1, wherein the voice acquisition terminal comprises a situation detection means that detects the voice input situation,
     the situation detection means detects a situation representing at least one of the voice input environment, the task size, the situation of the voice acquisition terminal itself, the situation of the speech recognition device, and the communication situation between the voice acquisition terminal and the speech recognition device, and
     the processing device determination means selects the device that performs the feature calculation and the device that performs the speech recognition according to the situation detected by the situation detection means.
  3.  The speech recognition system according to claim 1 or 2, wherein the voice acquisition terminal comprises a voice detection means that extracts speech segments from the input signal,
     the processing device determination means selects, according to the voice input situation, at least one device from the voice acquisition terminal itself and the speech recognition device to perform the speech segment extraction, and
     the voice detection means extracts the speech segments from the input signal when the processing device determination means selects the voice acquisition terminal itself.
  4.  The speech recognition system according to any one of claims 1 to 3, wherein the voice acquisition terminal comprises a noise removal means that removes a noise component from the input signal,
     the processing device determination means selects, according to the voice input situation, at least one device from the voice acquisition terminal itself and the speech recognition device to perform the noise component removal, and
     the noise removal means removes the noise component from the input signal when the processing device determination means selects the voice acquisition terminal itself.
  5.  The speech recognition system according to any one of claims 2 to 4, wherein the processing device determination means calculates, based on scores that are indices predetermined for the situations detected by the situation detection means, the total score corresponding to the voice input situation, and selects the device that performs the feature calculation and the device that performs the speech recognition by comparing the calculated total with predetermined conditions.
  6.  The speech recognition system according to any one of claims 1 to 5, wherein the voice acquisition terminal comprises a communication means that transmits information representing the acquired input signal or information representing the calculated features to the speech recognition device, and
     the communication means notifies the speech recognition device of the data format of the information, then transmits the information to the speech recognition device and receives a speech recognition result for the information from the speech recognition device.
  7.  The speech recognition system according to any one of claims 1 to 6, comprising:
     at least one speech recognition device; and
     a connection destination control device that is connected between the voice acquisition terminal and the speech recognition devices and selects a connection destination of the voice acquisition terminal from among the speech recognition devices,
     wherein the connection destination control device comprises a selection means that selects the connection destination of the voice acquisition terminal based on at least one of the data format of the information transmitted from the voice acquisition terminal to the speech recognition device, the CPU usage rate of each speech recognition device, and the memory usage rate of each speech recognition device.
  8.  A voice acquisition terminal to which speech is input and which acquires an input signal representing the speech, comprising
     a processing device determination means that selects, according to a voice input situation, at least one device to perform calculation of features used for speech recognition from among a speech recognition device, which performs speech recognition based on information transmitted from the voice acquisition terminal, and the voice acquisition terminal itself, and selects, according to the input situation, at least one device to perform speech recognition based on the calculated features from among the speech recognition device and the voice acquisition terminal itself,
     wherein the processing device determination means selects the device that performs the feature calculation and the device that performs the speech recognition according to information representing at least one of a voice input environment, a task size, a situation of the voice acquisition terminal itself, a situation of the speech recognition device, and a communication situation between the voice acquisition terminal and the speech recognition device.
  9.  A speech recognition sharing method, wherein:
     a voice acquisition terminal to which speech is input and which acquires an input signal representing the speech selects, according to a voice input situation, at least one device to perform calculation of features used for speech recognition from among a speech recognition device, which performs speech recognition based on information transmitted from the voice acquisition terminal, and the voice acquisition terminal itself;
     the voice acquisition terminal selects, according to the input situation, at least one device to perform speech recognition based on the calculated features from among the speech recognition device and the voice acquisition terminal itself; and
     the voice acquisition terminal selects the device that performs the feature calculation and the device that performs the speech recognition according to information representing at least one of a voice input environment, a task size, a situation of the voice acquisition terminal itself, a situation of the speech recognition device, and a communication situation between the voice acquisition terminal and the speech recognition device.
  10.  A speech recognition program applied to a computer to which speech is input and which acquires an input signal representing the speech, the program causing the computer to execute
     a processing device determination process of selecting, according to a voice input situation, at least one device to perform calculation of features used for speech recognition from among a speech recognition device, which performs speech recognition based on information transmitted from the computer, and the computer itself, and of selecting, according to the input situation, at least one device to perform speech recognition based on the calculated features from among the speech recognition device and the computer itself,
     wherein, in the processing device determination process, the device that performs the feature calculation and the device that performs the speech recognition are selected according to information representing at least one of a voice input environment, a task size, a situation of the voice acquisition terminal itself, a situation of the speech recognition device, and a communication situation between the voice acquisition terminal and the speech recognition device.
PCT/JP2011/002764 2010-05-26 2011-05-18 Voice recognition system, voice acquisition terminal, voice recognition distribution method and voice recognition program WO2011148594A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2010121016 2010-05-26
JP2010-121016 2010-05-26

Publications (1)

Publication Number Publication Date
WO2011148594A1 true WO2011148594A1 (en) 2011-12-01

Family

ID=45003595

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2011/002764 WO2011148594A1 (en) 2010-05-26 2011-05-18 Voice recognition system, voice acquisition terminal, voice recognition distribution method and voice recognition program

Country Status (1)

Country Link
WO (1) WO2011148594A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002041083A (en) * 2000-07-19 2002-02-08 Nec Corp Remote control system, remote control method and memory medium
JP2003044091A (en) * 2001-07-31 2003-02-14 Ntt Docomo Inc Voice recognition system, portable information terminal, device and method for processing audio information, and audio information processing program
JP2009520224A (en) * 2005-12-20 2009-05-21 インターナショナル・ビジネス・マシーンズ・コーポレーション Method for processing voice application, server, client device, computer-readable recording medium (sharing voice application processing via markup)
JP2007304505A (en) * 2006-05-15 2007-11-22 Nippon Telegr & Teleph Corp <Ntt> Server/client type speech recognition method, system and server/client type speech recognition program, and recording medium
WO2009019783A1 (en) * 2007-08-09 2009-02-12 Panasonic Corporation Voice recognition device and voice recognition method

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9741347B2 (en) 2011-01-05 2017-08-22 Interactions LLC Automated speech recognition proxy system for natural language understanding
US9245525B2 (en) 2011-01-05 2016-01-26 Interactions LLC Automated speech recognition proxy system for natural language understanding
US9472185B1 (en) 2011-01-05 2016-10-18 Interactions LLC Automated recognition system for natural language understanding
US10049676B2 (en) 2011-01-05 2018-08-14 Interactions LLC Automated speech recognition proxy system for natural language understanding
US10810997B2 (en) 2011-01-05 2020-10-20 Interactions LLC Automated recognition system for natural language understanding
US10147419B2 (en) 2011-01-05 2018-12-04 Interactions LLC Automated recognition system for natural language understanding
JP2014048507A (en) * 2012-08-31 2014-03-17 National Institute of Information and Communications Technology Local language resource reinforcement device, and service provision facility device
JP2015537237A (en) * 2012-10-12 2015-12-24 Tata Consultancy Services Limited Real-time traffic detection
JP2016507079A (en) * 2013-02-01 2016-03-07 Tencent Technology (Shenzhen) Company Limited System and method for load balancing in a speech recognition system
JP2015018238A (en) * 2013-07-08 2015-01-29 Interactions Corporation Automated speech recognition proxy system for natural language understanding
JP2016180915A (en) * 2015-03-25 2016-10-13 Nippon Telegraph and Telephone Corporation Voice recognition system, client device, voice recognition method, and program
JP2016180916A (en) * 2015-03-25 2016-10-13 Nippon Telegraph and Telephone Corporation Voice recognition system, voice recognition method, and program
JP2016180914A (en) * 2015-03-25 2016-10-13 Nippon Telegraph and Telephone Corporation Voice recognition system, voice recognition method, and program
JP2017151210A (en) * 2016-02-23 2017-08-31 NTT TechnoCross Corporation Information processing device, voice recognition method, and program
WO2018092786A1 (en) * 2016-11-15 2018-05-24 Clarion Co., Ltd. Speech recognition device and speech recognition system
JP2018081185A (en) * 2016-11-15 2018-05-24 Clarion Co., Ltd. Speech recognition device and speech recognition system
US11087764B2 (en) 2016-11-15 2021-08-10 Clarion Co., Ltd. Speech recognition apparatus and speech recognition system
WO2019016938A1 (en) * 2017-07-21 2019-01-24 Mitsubishi Electric Corporation Speech recognition device and speech recognition method
US20190228776A1 (en) * 2018-01-19 2019-07-25 Toyota Jidosha Kabushiki Kaisha Speech recognition device and speech recognition method
JP7470839B2 (en) 2019-02-06 2024-04-18 Google LLC Voice query quality of service (QoS) based on client-computed content metadata

Similar Documents

Publication Publication Date Title
WO2011148594A1 (en) Voice recognition system, voice acquisition terminal, voice recognition distribution method and voice recognition program
US9966077B2 (en) Speech recognition device and method
EP2560158B1 (en) Operating system and method of operating
US9767795B2 (en) Speech recognition processing device, speech recognition processing method and display device
US7941313B2 (en) System and method for transmitting speech activity information ahead of speech features in a distributed voice recognition system
US8050911B2 (en) Method and apparatus for transmitting speech activity in distributed voice recognition systems
JP3004883B2 (en) End call detection method and apparatus and continuous speech recognition method and apparatus
WO2015029304A1 (en) Speech recognition method and speech recognition device
US8831939B2 (en) Voice data transferring device, terminal device, voice data transferring method, and voice recognition system
JP5613335B2 (en) Speech recognition system, recognition dictionary registration system, and acoustic model identifier sequence generation device
JP6139598B2 (en) Speech recognition client system, speech recognition server system and speech recognition method for processing online speech recognition
KR20140058127A (en) Voice recognition apparatus and voice recognition method
JP2007264126A (en) Speech processing device, speech processing method and speech processing program
JPWO2007055181A1 (en) Dialogue support device
KR20060022156A (en) Distributed speech recognition system and method
JP2007033754A (en) Voice monitor system, method and program
KR101863097B1 (en) Apparatus and method for keyword recognition
CN107104994B (en) Voice recognition method, electronic device and voice recognition system
EP2504745B1 (en) Communication interface apparatus and method for multi-user
WO2019075829A1 (en) Voice translation method and apparatus, and translation device
US7177806B2 (en) Sound signal recognition system and sound signal recognition method, and dialog control system and dialog control method using sound signal recognition system
JP6549009B2 (en) Communication terminal and speech recognition system
JP2003241788A (en) Device and system for speech recognition
KR101165906B1 (en) Voice-text converting relay apparatus and control method thereof
WO2023047893A1 (en) Authentication device and authentication method

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 11786298; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 EP: PCT application non-entry in European phase
    Ref document number: 11786298; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: JP