WO2011148594A1

WO2011148594A1 - Voice recognition system, voice acquisition terminal, voice recognition distribution method and voice recognition program

Info

Publication number: WO2011148594A1
Application number: PCT/JP2011/002764
Authority: WO
Inventors: 荒川隆行; 越仲孝文
Original assignee: 日本電気株式会社
Priority date: 2010-05-26
Filing date: 2011-05-18
Publication date: 2011-12-01

Abstract

A processing device determination means selects in accordance with the voice input situation at least one device from a local-voice acquisition terminal and a voice recognition device to perform calculations for feature quantities used in voice recognition, and selects in accordance with the input situation at least one device from the local-voice acquisition terminal and the voice recognition device to perform voice recognition on the basis of the feature quantities calculated. Furthermore, the processing device determination means selects a device to perform feature quantity calculations and a device to perform voice recognition, in accordance with information representing at least one from among the voice input environment, task size, local-voice acquisition terminal situation, voice recognition device situation, and communication situation between the local-voice acquisition terminal and the voice recognition device.

Description

Speech recognition system, speech acquisition terminal, speech recognition sharing method, and speech recognition program

The present invention relates to a voice recognition system, a voice acquisition terminal, a voice recognition sharing method, and a voice recognition program that share voice recognition processing between a terminal and a server device.

When performing speech recognition with a terminal having relatively low CPU power such as a portable terminal, a distributed speech recognition (hereinafter referred to as DSR) system is widely used. In a general DSR system, a terminal first performs speech detection, noise suppression, and feature amount conversion processing on speech input by a user, and transmits the compressed feature amount to a server. Next, the server device performs speech recognition based on the feature amount transmitted from the terminal, and transmits the recognition result to the terminal side. Then, the terminal notifies the user of the recognition result. With such a configuration, a large-scale dictionary that cannot be mounted on the terminal and voice recognition that requires a large amount of CPU power can be performed via the terminal.

Such a DSR system is described in Patent Document 1. FIG. 14 is an explanatory diagram showing a voice recognition system described in Patent Document 1. As shown in FIG. In the speech recognition system described in Patent Document 1, a plurality of client stations (terminals) 320, 330, and 340 are connected to a server 310 via a public Internet network 350.

The terminal 330 includes an interface (IF) 331 that acquires a user's voice signal, a communication interface (COM) 332 that communicates with the server 310, and a spectrum analysis unit that obtains a feature vector from the acquired voice signal ( SAS) 333, a speech recognition unit (SR) 334 that performs speech recognition from the feature vector, a speech controller (SC) 335 that distributes a part of the feature vector to the server 310 depending on the speech recognition result, And a controllable switch (SW) 336 for determining whether or not a feature vector has been transmitted to the server 310 via the communication interface 332. In addition, the server 310 includes a communication interface (COM) 312 that communicates with a terminal, and a speech recognition unit (REC) 314 that performs speech recognition from a feature quantity vector received from the terminal.

In the distributed speech recognition system described in Patent Document 1, the speech recognition unit 334 on the terminal side performs speech recognition that has relatively little vocabulary and requires less CPU power. The server-side voice recognition unit 314 performs voice recognition that has a relatively large vocabulary and requires a lot of CPU power. In this way, voice recognition with good response is performed by efficiently distributing voice recognition processing.

In addition, Patent Document 2 describes an information terminal that performs voice recognition processing in a shared manner. The information terminal described in Patent Document 2 extracts feature points of the captured voice data, determines the complexity of the voice data, and determines a device that performs voice recognition processing according to the complexity.

Special Table 2002-540479 JP 2007-41089 A

In the speech recognition system described in Patent Document 1, processing related to speech recognition is shared by changing the amount of speech input signal transmitted according to the CPU load on the terminal side or the CPU load on the server side. However, the speech recognition system described in Patent Document 1 has a problem in that the processing performed in the speech recognition cannot be appropriately shared because the processing is distributed considering only the CPU load. That is, there is a problem that it is not sufficient to consider the CPU load in order to appropriately share the voice recognition processing among a plurality of devices.

On the other hand, in the information terminal described in Patent Document 2, feature points of the captured voice data are extracted, and the feature points are used as factors for determining the distribution destination. Therefore, it is considered that the information terminal described in Patent Document 2 can more appropriately share the processing performed by voice recognition.

However, in the information terminal described in Patent Document 2, feature point extraction processing, which is one of speech recognition processing, is performed in the information terminal. In other words, the feature point extraction process is always performed by the information terminal, and thus it is difficult to say that the process performed in the speech recognition is appropriately shared.

Therefore, the present invention relates to a voice recognition system, a voice acquisition terminal, a voice recognition sharing method, and a voice recognition program that share voice recognition processing between a terminal and a server device. An object of the present invention is to provide a voice recognition system, a voice acquisition terminal, a voice recognition sharing method, and a voice recognition program that can be appropriately shared.

The speech recognition system according to the present invention includes a speech acquisition terminal that receives speech and acquires an input signal representing the speech, and a speech recognition device that performs speech recognition based on information transmitted from the speech acquisition terminal. The acquisition terminal selects at least one of the own voice acquisition terminal and the voice recognition device as a feature amount calculation process used for voice recognition according to the voice input status, and calculates according to the input status. And a processing device determination unit that selects at least one device that performs voice recognition based on the feature amount, from the own voice acquisition terminal and the voice recognition device, and the processing device determination unit includes a voice input environment, a task size, A feature amount calculation process is performed according to information representing at least one of the status of the own voice acquisition terminal, the status of the voice recognition device, and the communication status between the own voice acquisition terminal and the voice recognition device. And selecting a device that performs location and voice recognition.

The voice acquisition terminal according to the present invention is a voice acquisition terminal that receives an input of voice and acquires an input signal representing the voice, and a voice recognition device that performs voice recognition based on information transmitted from the voice acquisition terminal and the own voice At least one device that performs processing for calculating feature values used for speech recognition is selected from the acquisition terminals according to the voice input status, and the feature calculated from the speech recognition device and the own speech acquisition terminal is selected. A processing device determination unit that selects at least one device that performs voice recognition based on the volume according to an input situation, and the processing device determination unit includes a voice input environment, a task size, a situation of the own voice acquisition terminal, a voice Select a device for performing feature amount calculation and a device for performing speech recognition according to information representing at least one of the status of the recognition device and the communication status between the own voice acquisition terminal and the speech recognition device Characterized in that it features a Rukoto.

The voice recognition sharing method according to the present invention includes a voice recognition device that receives a voice and obtains an input signal representing the voice, and performs voice recognition based on information transmitted from the voice acquisition terminal. From the terminal, at least one device that performs the calculation processing of the feature value used for voice recognition is selected according to the voice input status, and the voice acquisition terminal is selected from the voice recognition device and the own voice acquisition terminal. At least one device that performs speech recognition based on the calculated feature amount is selected according to the input situation, and the speech acquisition terminal determines whether the device that performs feature amount calculation processing and the device that performs speech recognition Selection is made according to information representing at least one of the environment, task size, status of the own voice acquisition terminal, status of the voice recognition device, and communication status between the own voice acquisition terminal and the voice recognition device. To.

A speech recognition program according to the present invention is a speech recognition program applied to a computer that receives speech and acquires an input signal representing the speech, and performs speech recognition on the computer based on information transmitted from the computer. Select at least one device for calculating the feature value used for speech recognition from the speech recognition device and the computer to be used, and calculate from the speech recognition device and the computer. A processing device determination process for selecting at least one device that performs voice recognition based on the input feature amount according to an input situation, and in the processing device determination process, a voice input environment, a task size, and a self-voice acquisition terminal Information representing at least one of the following situations, the status of the voice recognition device, and the communication status between the own voice acquisition terminal and the voice recognition device In response, characterized in that to select a device for performing device and speech recognition performs calculation processing of the feature.

According to the present invention, when the voice recognition process is shared between the terminal and the server apparatus, the process performed by the voice recognition can be appropriately shared between the terminal and the server apparatus.

It is a block diagram which shows the example of the speech recognition system in the 1st Embodiment of this invention. It is a flowchart which shows the example of operation | movement by the side of the terminal in 1st Embodiment. It is a flowchart which shows the example of operation | movement by the side of the server in 1st Embodiment. It is a block diagram which shows the example of the speech recognition system in the 2nd Embodiment of this invention. It is a flowchart which shows the example of operation | movement by the side of the terminal in 2nd Embodiment. It is a flowchart which shows the example of the operation | movement by the side of the server in 2nd Embodiment. It is a block diagram which shows the example of the speech recognition system in the 3rd Embodiment of this invention. It is a flowchart which shows the example of operation | movement by the side of the terminal in 3rd Embodiment. It is a flowchart which shows the example of operation | movement by the side of the server in 3rd Embodiment. It is explanatory drawing which shows the example of a score table. It is explanatory drawing which shows the example of the speech recognition system in 5th Embodiment. It is a block diagram which shows the example of the minimum structure of the speech recognition system by this invention. It is a block diagram which shows the example of the minimum structure of the audio | voice acquisition terminal by this invention. It is explanatory drawing which shows the speech recognition system described in patent document 1.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

Embodiment 1. FIG.
FIG. 1 is a block diagram showing an example of a speech recognition system according to the first embodiment of the present invention. The voice recognition system according to the present embodiment includes a terminal 100 and a server device 200. In the following description, the terminal 100 may be referred to as the terminal side, and the server device 200 may be referred to as the server side. The terminal 100 and the server device 200 are connected via, for example, a public Internet network.

The terminal 100 includes an input signal acquisition unit 101, a feature amount conversion unit 102, a voice recognition unit 103, a situation detection unit 104, a processing control unit 105, a transmission / reception unit 106, a recognition result integration unit 107, and a recognition result. And a display unit 108.

The input signal acquisition unit 101 converts the input voice into input sound data (hereinafter referred to as an input signal). Specifically, the input signal acquisition unit 101 cuts out time-series input sound data collected by the microphone 99 or the like for each frame of unit time.

The feature quantity conversion unit 102 converts the time series of the input signal into a time series of feature quantities used for speech recognition. The feature quantity conversion unit 102 converts an input signal into a feature quantity by using a method such as LPC (Linear Predictive Coding) cepstrum analysis or MFCC (Mel-Frequency Cepstrum Coefficient) analysis. However, the method by which the feature amount conversion unit 102 converts the input signal into the feature amount is not limited to the above method.

The voice recognition unit 103 performs voice recognition based on the time series of the converted feature values. In addition, the voice recognition unit 103 calculates a score representing the recognition result at the same time. Here, the score representing the recognition result (hereinafter sometimes simply referred to as a score) is an index representing the likelihood of speech recognition. The speech recognition unit 103 may calculate, for example, a distance between the feature amount time series and the acoustic model, an acoustic score such as likelihood, a language score representing linguistic coincidence, and the like as a score representing a recognition result. . At this time, the voice recognition unit 103 may obtain a score for the entire recognition result, or may obtain a score in various units such as for each frame, for each word, or for each utterance section. Note that a method for performing speech recognition based on a feature amount used for speech recognition and a method for calculating an index representing the likelihood of speech recognition are widely known, and thus description thereof is omitted here.

The situation detection unit 104 detects a situation where sound is input. Specifically, the status detection unit 104 includes the environment in which voice is input, the status of the terminal 100 and the server device 200, the task size, the line status between the terminal 100 and the server device 200 that transmits the input signal, and the like. Detect various situations. The status detection unit 104 detects, for example, the CPU load and the memory usage status as the status of the terminal 100 and the server device 200. However, the statuses of the terminal 100 and the server device 200 detected by the status detection unit 104 are not limited to the CPU load and the memory usage status.

Also, task size is an index that represents the difficulty in speech recognition of utterances. For example, the number of vocabulary words that can be recognized by speech may be used as a scale representing the task size. In addition, as a measure representing the task size, the complexity of utterance accepted by speech recognition may be used. For example, utterance complexity may be expressed by keyword recognition or natural language recognition.

Note that these measures are detected depending on the situation and application in which the user inputs voice. For example, when the user is speaking clearly and the utterance is relatively easy to distinguish, the situation detection unit 104 determines that the difficulty of speech recognition is low even if the vocabulary is large or the utterance is complicated. May be. On the other hand, if it is difficult to identify the utterance due to the user's quickness or low voice, it is necessary to reduce the number of vocabularies or simplify the utterance in order to minimize errors in speech recognition. . Therefore, in such a case, the situation detection unit 104 may determine that the difficulty in voice recognition is high. In addition, the situation detection unit 104 may detect difficulty in speech recognition due to the large number of words or complexity based on specifications required by the application.

Also, the environment where voice is input includes noise level and the like. The noise level represents the degree of noise included in the input voice. For example, when the part other than the utterance that is the target of voice recognition among the voices input through the microphone 99 is set as noise, the magnitude (sound pressure) of the noise may be set as the noise level. The situation detection unit 104 may detect the noise level from, for example, the sound pressure of the input signal input to the microphone 99 before the user speaks.

The status detection unit 104 may detect the CPU load of the terminal 100 and the server device 200 and the line status between the devices using, for example, an API (Application Program Interface) provided by an OS (Operating System). For example, the status detection unit 104 may transmit a packet requesting the CPU usage status to the server side and calculate the CPU load from information included in the returned packet. Further, for example, the status detection unit 104 transmits an ICMP (Internet Control Message Protocol) Echo request packet to the server side, measures the time until the ICMP packet returned from the server side is received, and determines the line state. It may be detected. However, the method by which the state detection unit 104 detects the CPU load and the line state between devices is not limited to the above method.

The process control unit 105 includes a feature amount conversion device determination unit 105a and a voice recognition device determination unit 105b. The process control unit 105 controls which apparatus is to execute the subsequent processing based on the situation detected by the situation detection unit 104. Specifically, the feature amount conversion device determination unit 105a determines which device is to execute the feature amount calculation process based on the detected situation. In addition, the voice recognition device determination unit 105b determines which device should perform voice recognition based on the feature amount based on the detected situation. In the example illustrated in FIG. 1, the feature amount conversion device determination unit 105 a and the speech recognition device determination unit 105 b cause the terminal 100 to execute processing, cause the server device 200 to execute processing, or the terminal 100 and the server device 200. Select whether or not to cause both to execute processing. A specific method by which the feature amount conversion device determination unit 105a and the speech recognition device determination unit 105b select which device to execute the subsequent processing will be described later.

The transmission / reception unit 106 determines the time series of the input signal and the time series of the feature amount according to the determination results of the processing control unit 105 (more specifically, the feature amount conversion device determination unit 105a and the speech recognition device determination unit 105b). It transmits to the server apparatus 200. In addition, the transmission / reception unit 106 receives the result of speech recognition performed by the server device 200.

For example, when it is determined that the server apparatus 200 is to execute the feature amount calculation process, the transmission / reception unit 106 transmits the time series of the input signal to the server apparatus 200. In addition, when it is determined that the server device 200 is to perform the voice recognition process based on the feature amount, the transmission / reception unit 106 transmits the time series of the feature amount to the server device 200.

The recognition result integration unit 107 compares and integrates the result of speech recognition performed by the speech recognition unit 103 and the result of speech recognition performed by the server device 200 received by the transmission / reception unit 106. That is, the recognition result integration unit 107 selects a more appropriate speech recognition result from the result of speech recognition performed by the speech recognition unit 103 and the result of speech recognition by the server device 200 received by the transmission / reception unit 106, Merge selected results. For example, when the speech recognition process is performed only in one of the terminal side and the server side, the recognition result integration unit 107 may use the speech recognition performed in each device as the speech recognition result. On the other hand, when the speech recognition process is performed on either the terminal side or the server side, the recognition result integration unit 107 may select a more likely speech recognition result and use the selected result as the speech recognition result. .

The recognition result display unit 108 displays the result of speech recognition compared and integrated by the recognition result integration unit 107 to the user. The recognition result display unit 108 is realized by, for example, a display device.

The server device 200 includes a transmission / reception unit 201, a processing control unit 202, a feature amount conversion unit 203, and a voice recognition unit 204. Server device 200 performs voice recognition based on information transmitted from terminal 100.

The transmission / reception unit 201 receives data transmitted from the terminal 100. Further, the transmission / reception unit 201 transmits the speech recognition result by the speech recognition unit 204 to the terminal 100.

The process control unit 202 determines subsequent process contents based on the information received from the terminal 100. Specifically, the processing control unit 202 determines the subsequent processing contents depending on whether the data received from the terminal 100 is a time series of input signals or a time series of feature amounts. For example, when the data received from the terminal 100 is a time series of input signals, the process control unit 202 causes the feature amount conversion unit 203 to execute a feature amount calculation process. On the other hand, when the data received from the terminal 100 is a time series of feature amounts, the process control unit 202 causes the speech recognition unit 204 to perform a speech recognition process using the feature amounts.

The feature value conversion unit 203 converts the time series of the received input signal into a time series of feature values used for speech recognition. The speech recognition unit 204 performs speech recognition based on the time series of feature amounts converted by the feature amount conversion unit 203 or the time series of feature amounts received from the terminal 100. In addition, the voice recognition unit 204 calculates a score representing the recognition result at the same time. Note that a method for converting an input signal into a feature value, a method for performing speech recognition based on the feature value, and a method for calculating an index representing the likelihood of speech recognition are widely known, and thus description thereof is omitted here. To do.

Input signal acquisition unit 101, feature amount conversion unit 102, speech recognition unit 103, situation detection unit 104, processing control unit 105 (more specifically, feature amount conversion device determination unit 105a, speech recognition device determination Unit 105b), transmission / reception unit 106, and recognition result integration unit 107 are realized by a CPU of a computer that operates according to a program (voice recognition program). For example, the program is stored in a storage unit (not shown) of the terminal 100, and the CPU reads the program, and in accordance with the program, the input signal acquisition unit 101, the feature amount conversion unit 102, the voice recognition unit 103, the situation detection unit. 104, the process control unit 105 (more specifically, the feature amount conversion device determination unit 105a and the speech recognition device determination unit 105b), the transmission / reception unit 106, and the recognition result integration unit 107 may be operated.

The input signal acquisition unit 101, the feature amount conversion unit 102, the speech recognition unit 103, the situation detection unit 104, the processing control unit 105, the transmission / reception unit 106, and the recognition result integration unit 107 are dedicated to each. It may be realized by hardware.

Next, the operation will be described. FIG. 2 is a flowchart showing an example of the operation on the terminal side. FIG. 3 is a flowchart showing an example of the operation on the server side. First, the operation on the terminal side will be described with reference to FIG.

On the terminal side, first, when an input sound is collected using the microphone 99 or the like, the input signal acquisition unit 101 cuts out the collected time-series input sound data for each unit time frame (step S101). . For example, the input signal acquisition unit 101 may sequentially cut out waveform data for a unit time while shifting a portion to be cut out from input sound data by a predetermined time. Hereinafter, this unit time is referred to as a frame width, and this predetermined time is referred to as a frame shift. For example, when the input sound data is 16-bit Linear-PCM (Pulse Code Modulation) with a sampling frequency of 8000 Hz, waveform data for 8000 points per second is included. In this case, the input signal acquisition unit 101 may extract the waveform data sequentially in time series at a frame width of 200 points (ie, 25 milliseconds) and a frame shift of 80 points (ie, 10 milliseconds).

Next, the feature quantity conversion device determination unit 105a converts the time series of the input signal into the feature quantity time series by the feature quantity conversion unit 102 on the terminal side according to the situation detected by the situation detection unit 104, or It is determined whether the transmission / reception unit 106 transmits to the server device 200 (that is, whether the server side feature amount conversion unit 203 converts the feature amount into a time series) or both of them convert into a feature amount time series. (Step S102).

For example, the feature amount conversion device determination unit 105a determines which device converts the time series of the input signal into the time series of the feature amount based on the following conditions.
1. When the transmission line is disconnected, when the communication speed is very slow, or when the CPU load of the server device 200 is high, the feature amount conversion device determination unit 105a does not transmit the input signal to the server device 200. Then, it is determined that the terminal side converts to a feature value.
2. Other than the above, when the CPU load of the terminal 100 is high or the noise level is high, the feature amount conversion apparatus determination unit 105a transmits the input signal to the server side, and does not perform the feature amount conversion on the terminal side. to decide. That is, the feature quantity conversion device determination unit 105a determines that the server performs the feature quantity conversion.
3. In cases other than the above, the feature quantity conversion device determination unit 105a transmits the input signal to the server side, and determines that the terminal side also converts to the feature quantity. That is, the feature quantity conversion device determination unit 105a determines that the feature quantity conversion is performed on both the terminal side and the server side.

Note that whether the communication speed is low, whether the CPU load on the terminal 100 and the server device 200 is high, and whether the noise level is high is determined by comparing with a predetermined threshold.

When the feature value conversion apparatus determination unit 105a determines that the feature value conversion is performed on the terminal side (“terminal side” in step S102), the feature value conversion unit 102 uses the time series of the input signal cut out for each frame. Conversion into a time series of feature amounts (step S103).

On the other hand, when the feature value conversion device determination unit 105a determines that the feature value conversion is performed on the server side (“server side” in step S102), the feature value conversion device determination unit 105a transmits the input signal to the transmission / reception unit 106. (Step S104). For example, the transmission / reception unit 106 may compress and transmit the time series of the input signal for each unit, or may encode and transmit the time series of the input signal. At this time, the transmission / reception unit 106 may add header information or the like to the head of the data to indicate that the content to be transmitted is an input signal.

Further, the transmission / reception unit 106 may transmit the input signal to the server side after notifying the server side of the data format. By doing in this way, it becomes possible for the server side to judge the content of the data received after that.

Next, the voice recognition device determination unit 105b causes the terminal-side voice recognition unit 103 to perform voice recognition based on the time series of the feature amounts converted by the feature amount conversion unit 102 in accordance with the situation detected by the situation detection unit 104. It is determined whether processing is performed, a time series of feature values is transmitted to the server device 200 (that is, whether the server-side speech recognition unit 204 performs speech recognition processing), or both are recognized (step S105).

If the feature amount conversion device determination unit 105a determines that the input signal is not transmitted to the server device 200, the voice recognition device determination unit 105b, for example, based on the following conditions, based on the time series of the feature amount Whether to perform voice recognition processing.
1. When the transmission line is disconnected, when the communication speed is very slow, or when the CPU load of the server apparatus 200 is high, the speech recognition apparatus determination unit 105b does not transmit the feature amount to the server apparatus 200, and the terminal It is determined that voice recognition is performed on the side.
2. Other than the above, when the task size is large (that is, more difficult), the speech recognition apparatus determination unit 105b transmits the feature amount to the server apparatus 200 and determines that the terminal side does not perform speech recognition. That is, the voice recognition device determination unit 105b determines that voice recognition is performed on the server side.
3. In cases other than the above, the speech recognition apparatus determination unit 105b transmits the feature amount to the server apparatus 200 and determines that the terminal side also performs speech recognition. That is, the speech recognition device determination unit 105b determines that speech recognition is performed on both the terminal side and the server side.

Note that whether or not the communication speed is low, whether or not the CPU load of the server device 200 is high, and whether or not the difficulty is high is determined by comparison with a predetermined threshold value.

When the speech recognition apparatus determination unit 105b determines that speech recognition is to be performed on the terminal side (“terminal side” in step S105), the speech recognition unit 103 performs speech recognition on the time series of feature amounts (step S106). . Specifically, the speech recognition unit 103 searches for a corresponding word string from a storage unit (not shown) provided in the terminal 100, and uses the search result as the speech recognition result. At this time, the voice recognition unit 103 simultaneously calculates a score representing the recognition result.

On the other hand, when the voice recognition device determination unit 105b determines that the server side performs voice recognition (“server side” in step S105), the voice recognition device determination unit 105b causes the transmission / reception unit 106 to transmit the feature amount (step S107). ). For example, the transmission / reception unit 106 may compress and transmit the time series of feature amounts for each unit, or may encode and transmit the time series of feature amounts. At this time, the transmission / reception unit 106 may add header information or the like to the head of the data to indicate that the content to be transmitted is a feature amount.

When the speech recognition is performed on the server side (Yes in step S108), the recognition result integration unit 107 selects either the terminal side speech recognition result or the server side speech recognition result, and The speech recognition results are integrated (step S109).

Here, when speech recognition is performed only on either the terminal side or the server side, the recognition result integration unit 107 may select one speech recognition result without integrating the speech recognition results. On the other hand, when speech recognition is performed on both the terminal side and the server side, the recognition result integration unit 107 integrates the speech recognition results. Specifically, the recognition result integration unit 107 selects one of the speech recognitions performed on both the terminal side and the server side, and integrates the selected results.

The recognition result integration unit 107 may select a speech recognition result having a higher score from the scores calculated by the speech recognition unit 103 and the speech recognition unit 204, for example. For example, the recognition result integration unit 107 may compare the speech recognition results in units of division such as words, and select a speech recognition result having a higher score in the compared division units.

Furthermore, when the voice recognition result from the server apparatus 200 cannot be obtained for a certain period of time due to the influence of the line speed or the like, the recognition result integration unit 107 does not use the voice recognition result from the server apparatus 200 and does not use the voice recognition result on the terminal side. Only the result may be used.

When voice recognition is not performed on the server side (No in step S108), or after integration of the voice recognition results, the recognition result display unit 108 displays the voice recognition results (step S110). For example, the recognition result display unit 108 may display the recognition result as a character string on a display device or the like. Furthermore, the recognition result display unit 108 may notify the result of voice synthesis from the voice recognition result using a headphone stereo or a speaker (not shown).

Next, the operation on the server side will be described with reference to FIG. On the server side, first, the transmission / reception unit 201 receives data from the terminal side (step S201). When the received data is compressed or encoded, the transmission / reception unit 201 decompresses and decodes the data.

Next, the processing control unit 202 changes the subsequent processing content according to the content of the received data (step S202). When the received data is a time series of the input signal (“input signal” in step S202), the feature amount conversion unit 203 converts the input signal into a feature amount (step S203). On the other hand, when the received data is a time series of feature amounts (“feature amount” in step S202), the feature amount conversion unit 203 does not perform a feature amount conversion process.

When performing feature value conversion, the server-side feature value conversion unit 203 converts the time series of the input signal into the time series of feature values for each frame. When the received data is a time series of feature quantities (“feature quantity” in step S202), or after the feature quantity conversion unit 203 converts the time series of the input signal to the time series of feature quantities, the speech recognition unit 204 Performs voice recognition on the time series of feature values (step S204). Specifically, the voice recognition unit 204 searches for a corresponding word string from a storage unit (not shown) provided in the server device 200, and uses the search result as the voice recognition result. At this time, the voice recognition unit 204 simultaneously calculates a score representing the recognition result. Then, the transmission / reception unit 201 transmits the voice recognition result to the terminal 100 (step S205).

Next, the effect of this embodiment will be described. As described above, according to the present embodiment, the feature quantity conversion device determination unit 105a performs the voice input status (for example, the environment in which voice is input, the task size, the status of the terminal 100 or the server device 200, the communication line Depending on the situation, it is selected where to calculate the feature value used for speech recognition. In addition, the voice recognition device determination unit 105b selects where to perform voice recognition according to the voice input status. Therefore, the process performed by voice recognition can be appropriately shared by the terminal 100 and the server apparatus 200.

Specifically, the status detection unit 104 determines the CPU load on the terminal side, the CPU load on the server side, the memory usage status on the terminal side, the memory usage status on the server side, the task size, the noise level, the status of the transmission line, etc. Detect. Then, the processing control unit 105 (more specifically, the feature amount conversion device determination unit 105a and the speech recognition device determination unit 105b) controls the processing content performed on the terminal side and the processing content performed on the server side. Therefore, efficient processing distribution can be performed and voice recognition with good response can be realized.

In other words, in order for each device to share processing related to speech recognition at an optimal rate, the sharing destination is determined based on various factors other than the CPU load, such as task size, noise level, and the state of the line for transmitting information. There is a need. For example, in the speech recognition system described in Patent Document 1, it is difficult to say that a sufficient effect due to sharing is obtained because the sharing destination is determined based on the CPU load. However, according to the present embodiment, the sharing destination is determined based on various factors other than the CPU load. Therefore, the process performed by voice recognition can be appropriately shared between the terminal and the server device.

Embodiment 2. FIG.
FIG. 4 is a block diagram illustrating an example of a speech recognition system according to the second embodiment of the present invention. In addition, about the structure similar to 1st Embodiment, the code | symbol same as FIG. 1 is attached | subjected and description is abbreviate | omitted.

The voice recognition system in this embodiment includes a terminal 300 and a server device 400. In the following description, the terminal 300 may be referred to as the terminal side, and the server device 400 may be referred to as the server side. The terminal 300 and the server device 400 are connected via, for example, a public Internet network.

The terminal 300 includes an input signal acquisition unit 101, a feature amount conversion unit 102, a voice recognition unit 103, a situation detection unit 104, a voice detection unit 301, a processing control unit 302, a transmission / reception unit 106, and a recognition result integration. Unit 107 and recognition result display unit 108. That is, the terminal 300 in the second embodiment is different from the terminal 100 in the first embodiment in that a voice detection unit 301 is added. Further, in the terminal 300 in the second embodiment, the process control unit 105 in the first embodiment is replaced with a process control unit 302.

The voice detection unit 301 determines a voice section to be recognized from the time series of the input signal input to the input signal acquisition unit 101, and cuts the time series of the voice section. That is, the voice detection unit 301 extracts a voice section from a time series of input signals. For example, as described in Reference Document 1, the voice detection unit 301 may detect an utterance section by measuring the energy of framed voice data. In addition, as described in Reference Document 2, the voice detection unit 301 may detect a voice section using a plurality of feature amounts extracted from the input signal. However, the method by which the voice detection unit 301 detects the voice section is not limited to the above method.
[Reference 1] Japanese Patent Laid-Open No. 2005-31632 [Reference 2] Japanese Patent Laid-Open No. 2007-17620

The process control unit 302 includes a voice detection device determination unit 302a, a feature amount conversion device determination unit 302b, and a voice recognition device determination unit 302c. The process control unit 302 controls which apparatus is to execute the subsequent processing based on the situation detected by the situation detection unit 104. Specifically, the voice detection device determination unit 302a determines which device is to execute the process of extracting the voice section based on the detected situation. Further, the feature amount conversion device determination unit 302b determines which device is to execute processing for converting the input signal from which the speech section is extracted into the feature amount based on the detected state. Furthermore, the voice recognition device determination unit 302c determines which device is to execute the voice recognition processing based on the feature amount based on the detected situation. Note that the method by which the speech recognition device determination unit 302c selects a device is the same as the method by which the speech recognition device determination unit 105b in the first embodiment selects a device.

In the example illustrated in FIG. 4, the voice detection device determination unit 302a, the feature amount conversion device determination unit 302b, and the voice recognition device determination unit 302c cause the terminal 300 to execute processing, or causes the server device 400 to execute processing, The terminal 300 and the server device 400 are selected to execute processing. Note that a method by which the voice detection device determination unit 302a and the feature amount conversion device determination unit 302b select which device to execute the subsequent processing will be described later.

The server apparatus 400 includes a transmission / reception unit 201, a processing control unit 401, a voice detection unit 402, a feature amount conversion unit 203, and a voice recognition unit 204. That is, the server apparatus 400 in the second embodiment is different from the server apparatus 200 in the first embodiment in that a voice detection unit 402 is added. In the server device 400 in the second embodiment, the process control unit 202 in the first embodiment is replaced with a process control unit 401.

The processing control unit 401 determines subsequent processing contents based on the information received from the terminal 300. Specifically, the processing control unit 401 determines the subsequent processing contents depending on whether the data received from the terminal 300 is a time series of an input signal, a time series of an input signal obtained by cutting a speech section, or a time series of feature amounts. judge.

For example, when the data received from the terminal 300 is a time series of the input signal, the process control unit 401 causes the voice detection unit 402 to execute a process of cutting out a voice section. In addition, when the data received from the terminal 300 is a time series of input signals obtained by cutting out voice segments, the processing control unit 401 causes the feature amount conversion unit 203 to execute a feature amount calculation process. Further, when the data received from the terminal 300 is a time series of feature amounts, the process control unit 401 causes the speech recognition unit 204 to perform a speech recognition process using the feature amounts.

The voice detection unit 402 determines a voice section to be recognized from the time series of the input signal received from the terminal 200, and cuts the time series of the voice section. That is, the voice detection unit 402 extracts a voice section from a time series of input signals.

The input signal acquisition unit 101, the feature amount conversion unit 102, the voice recognition unit 103, the situation detection unit 104, the voice detection unit 301, and the processing control unit 302 (more specifically, the voice detection device determination unit 302a) The feature amount conversion device determination unit 302b, the speech recognition device determination unit 302c), the transmission / reception unit 106, and the recognition result integration unit 107 are realized by a CPU of a computer that operates according to a program (speech recognition program).

In addition, the input signal acquisition unit 101, the feature amount conversion unit 102, the voice recognition unit 103, the situation detection unit 104, the voice detection unit 301, and the processing control unit 302 (more specifically, the voice detection device determination unit). 302a, the feature amount conversion device determination unit 302b, the speech recognition device determination unit 302c), the transmission / reception unit 106, and the recognition result integration unit 107 may each be realized by dedicated hardware.

Next, the operation will be described. FIG. 5 is a flowchart showing an example of the operation on the terminal side. FIG. 6 is a flowchart showing an example of the operation on the server side. First, the operation on the terminal side will be described with reference to FIG.

On the terminal side, as in the first embodiment, first, when an input sound is collected using the microphone 99 or the like, the input signal acquisition unit 101 converts the collected time-series input sound data into unit time. Are cut out for each frame (step S101). Next, the voice detection device determination unit 302a determines the voice section to be recognized from the time series of the input signal on the terminal side according to the situation detected by the situation detection unit 104, or extracts the voice section from the transmission / reception unit 106 to the server. It is determined whether to transmit to the side (that is, to determine and cut out on the server side) or to determine and cut out on both sides (step S301).

For example, based on the following conditions, the speech detection device determination unit 302a determines which device determines and extracts a speech section from the time series of input signals.
1. When the transmission line is disconnected, when the communication speed is very slow, or when the CPU load of the server device 400 is high, the voice detection device determination unit 302a does not transmit the input signal to the server device 400, It is determined that the voice section is cut out on the terminal side.
2. Other than the above, when the CPU load of the terminal 300 is high, when the terminal side uses a lot of memory, or when the noise level is high, the voice detection device determination unit 302a transmits the input signal to the server side, It is determined that the voice segment is not cut out. That is, the voice detection device determination unit 302a determines to cut out a voice section on the server side.
3. In cases other than the above, the voice detection device determination unit 302a transmits an input signal to the server side, and determines that the voice section is also cut out on the terminal side. That is, the voice detection device determination unit 302a determines that a voice section is cut out on both the terminal side and the server side.

Note that whether the communication speed is low, whether the CPU load on the terminal 300 and the server device 400 is high, whether the terminal side memory is large, and whether the noise level is high are determined in advance. It is judged by comparing with the threshold value.

When the voice detection device determination unit 302a determines to cut out a voice section on the terminal side (“terminal side” in step S301), the voice detection unit 301 determines a voice section to be recognized from the time series of the input signal. To do. And the audio | voice detection part 301 cuts out only an audio | voice area from the time series of the input signal (step S302). At this time, the voice detection unit 301 may cut out with a margin of several frames before and after the voice section.

On the other hand, when the voice detection device determination unit 302a determines to cut out a voice section on the server side (“server side” in step S301), the voice detection device determination unit 302a causes the transmission / reception unit 106 to transmit an input signal ( Step S104). Similar to the first embodiment, the transmission / reception unit 106 may compress and transmit the time series of the input signal for each unit, or may encode and transmit the time series of the input signal. .

Next, the feature quantity conversion device determination unit 302b converts the time series of the speech segment extracted on the terminal side according to the situation detected by the situation detection unit 104, and the feature quantity time series of the feature quantity on the terminal side. Or is transmitted from the transmission / reception unit 106 to the server device 400 (that is, the feature amount conversion unit 203 on the server side converts the feature amount into a time series), or both are converted into a feature amount time series. It is determined whether or not to perform (step S303).

When the speech detection device determination unit 302a determines that the input signal is not transmitted to the server device 400, the feature amount conversion device determination unit 302b uses the time series of the input signal as a feature amount based on the following conditions, for example. Judge whether to convert to time series.
1. When the transmission line is disconnected, when the communication speed is very slow, or when the CPU load of the server device 400 is high, the feature amount conversion device determination unit 302b stores the cut voice section in the server device 400. It is determined that the data is not transmitted but converted to a feature value on the terminal side.
2. Other than the above, when the CPU load on the terminal side is high, the memory usage on the terminal side is large, or the noise level is high, the feature amount conversion device determination unit 302b transmits the extracted speech section to the server side Then, the terminal side determines that the feature value is not converted. That is, the feature amount conversion device determination unit 302b determines that the feature amount is converted on the server side.
3. In cases other than the above, the feature amount conversion device determination unit 302b transmits the extracted speech section to the server side, and determines that the terminal side also converts to the feature amount. That is, the feature quantity conversion device determination unit 302b determines to convert the feature quantity on both the terminal side and the server side.

Note that whether the communication speed is low, whether the CPU load on the terminal 300 and the server device 400 is high, whether there is a large amount of used memory on the terminal side, and whether the noise level is high are as follows: As in the case of determination by the determination unit 302a, the determination is made by comparing with a predetermined threshold value.

When the feature amount conversion apparatus determination unit 302b determines that the feature amount conversion is to be performed on the terminal side (“terminal side” in step S303), the feature amount conversion unit 102 uses the time series of the input signal cut out for each frame. Conversion into a time series of feature amounts (step S103).

On the other hand, when the feature amount conversion device determination unit 302b determines that the feature amount conversion is to be performed on the server side (“server side” in step S303), the feature amount conversion device determination unit 302b receives the input signal obtained by cutting out the speech section. It is made to transmit to the transmission / reception part 106 (step S304).

Thereafter, the processing until the speech recognition device determination unit 302c determines a device that performs speech recognition processing and the recognition result display unit 108 displays the recognition result integrated by the recognition result integration unit 107 is illustrated in step S105 illustrated in FIG. ~ Similar to step S110.

Next, the operation on the server side will be described with reference to FIG. On the server side, as in the first embodiment, first, the transmission / reception unit 201 receives data from the terminal side (step S201). When the received data is compressed or encoded, the transmission / reception unit 201 decompresses and decodes the data.

Next, the processing control unit 401 changes the subsequent processing content according to the content of the received data (step S401). When the received data is a time series of an input signal from which a voice section is not cut out (“input signal” in step S401), the voice detection unit 402 uses the time series of the input signal as a voice section to be subjected to voice recognition. And the determined speech section is cut out (step S402). At this time, the voice detection unit 402 may cut out with a margin of several frames before and after the voice section.

After the voice detection unit 402 cuts out the voice section, or when the received data is a time series of the input signal cut out of the voice section (“voice section cut-out input signal” in step S401), the feature amount conversion unit 203 Converts the input signal into a feature value (step S203).

After the feature amount conversion unit 203 converts the input signal into a feature amount, or when the received data is a feature amount time series (“feature amount” in step S401), the speech recognition unit 204 performs speech recognition. (Step S204). Hereinafter, the process in which the transmission / reception unit 201 transmits the speech recognition result to the terminal 300 is the same as the process in step S205 illustrated in FIG.

Next, the effect of this embodiment will be described. In general, the voice detection process can increase the accuracy of voice recognition, but requires a lot of CPU power. On the server side, it is possible to perform such voice detection processing compared to the terminal side. According to the present embodiment, in addition to the configuration in the first embodiment, the voice detection device determination unit 302a selects which device performs voice segment extraction processing according to the voice input status. When the terminal 300 is selected, the voice detection unit 301 extracts a voice section from the input signal. Therefore, for example, when the noise level is high, a voice recognition result with higher accuracy can be obtained by performing voice detection processing on the server side.

Embodiment 3. FIG.
FIG. 7 is a block diagram showing an example of a speech recognition system according to the third embodiment of the present invention. In addition, about the structure similar to 1st Embodiment, the code | symbol same as FIG. 1 is attached | subjected and description is abbreviate | omitted.

The voice recognition system in this embodiment includes a terminal 500 and a server device 600. The terminal 500 includes an input signal acquisition unit 101, a feature amount conversion unit 102, a voice recognition unit 103, a situation detection unit 104, a noise removal unit 501, a processing control unit 502, a transmission / reception unit 106, and a recognition result integration. Unit 107 and recognition result display unit 108. That is, the terminal 500 in the third embodiment is different from the terminal 100 in the first embodiment in that a noise removing unit 501 is added. In the terminal 500 according to the third embodiment, the process control unit 105 according to the first embodiment is replaced with a process control unit 502.

The noise removing unit 501 removes noise components from the input signal. The noise removing unit 501 removes a noise component using a method such as a spectral subtraction method or a Wiener filter, for example. However, the method by which the noise removing unit 501 removes noise components is not limited to the above method. The noise removing unit 501 may remove the noise component using another widely known method.

The process control unit 502 includes a noise removal device determination unit 502a, a feature amount conversion device determination unit 105a, and a speech recognition device determination unit 105b. Similar to the first embodiment, the processing control unit 502 controls which device is to execute the subsequent processing based on the situation detected by the situation detection unit 104. Specifically, the noise removal device determination unit 502a determines which device is to execute the noise removal process based on the detected situation. Further, the feature amount conversion device determination unit 105a determines which device is to execute the process of converting the input signal from which noise has been removed into the feature amount based on the detected state. The voice recognition device determination unit 105b is the same as that in the first embodiment.

In the example illustrated in FIG. 7, the noise removal device determination unit 502 a, the feature amount conversion device determination unit 105 a, and the speech recognition device determination unit 105 b cause the terminal 500 to execute processing, or causes the server device 600 to execute processing, The terminal 500 and the server apparatus 600 are selected to execute processing. The noise removal device determination unit 502a determines which device is to execute the process of removing noise by, for example, a method similar to the method in which the feature amount conversion device determination unit 105a or the speech recognition device determination unit 105b selects a device. May be. Note that the method by which the feature amount conversion device determination unit 105a and the speech recognition device determination unit 105b determine which device to execute the subsequent processing is the same as in the first embodiment.

The server apparatus 600 includes a transmission / reception unit 201, a processing control unit 601, a noise removal unit 602, a feature amount conversion unit 203, and a voice recognition unit 204. That is, the server device 600 in the third embodiment is different from the server device 200 in the first embodiment in that a noise removing unit 602 is added. In the server device 600 according to the third embodiment, the process control unit 202 according to the first embodiment is replaced with a process control unit 601.

The processing control unit 601 determines the subsequent processing content based on the information received from the terminal 500. Specifically, the processing control unit 601 determines the subsequent processing contents depending on whether the data received from the terminal 500 is a time series of input signals, a time series of input signals from which noise is removed, or a time series of feature amounts. judge.

For example, when the data received from the terminal 500 is a time series of the input signal, the processing control unit 601 causes the noise removal unit 602 to execute processing for removing noise from the input signal. Also, when the data received from the terminal 500 is a time series of input signals from which noise has been removed, the processing control unit 601 causes the feature amount conversion unit 203 to execute a feature amount calculation process. Further, when the data received from the terminal 500 is a time series of feature amounts, the process control unit 601 causes the speech recognition unit 204 to perform a speech recognition process using the feature amounts.

The noise removing unit 602 removes noise from the input signal in the same manner as the noise removing unit 501. Note that the method by which the noise removing unit 602 removes noise may be the same method as the noise removing unit 501, or may be different.

The input signal acquisition unit 101, the feature amount conversion unit 102, the speech recognition unit 103, the situation detection unit 104, the noise removal unit 301, and the processing control unit 502 (more specifically, the noise removal device determination unit 502a) The feature amount conversion device determination unit 105a, the voice recognition device determination unit 105b), the transmission / reception unit 106, and the recognition result integration unit 107 are realized by a CPU of a computer that operates according to a program (voice recognition program).

In the present embodiment, the case where the noise removing unit 501 and the noise removing unit 602 are provided in addition to the voice recognition system in the first embodiment has been described. Note that the noise removing unit 501 and the noise removing unit 602 may be included in the speech recognition system in the second embodiment.

Next, the operation will be described. FIG. 8 is a flowchart showing an example of the operation on the terminal side. FIG. 9 is a flowchart showing an example of the operation on the server side. First, the operation on the terminal side will be described with reference to FIG.

On the terminal side, as in the first embodiment, first, when an input sound is collected using the microphone 99 or the like, the input signal acquisition unit 101 converts the collected time-series input sound data into unit time. Are cut out for each frame (step S101). Next, the noise removal device determination unit 502a performs a process of removing noise from the input signal on the terminal side, the server side, or both depending on the situation detected by the situation detection unit 104. Is determined (step S501).

When the noise removal device determination unit 502a determines to remove noise from the input signal on the terminal side (“terminal side” in step S501), the noise removal unit 501 removes noise from the time series of the input signal (step S502). . On the other hand, when the noise removal device determination unit 502a determines that the noise of the input signal is removed on the server side (“server side” in step S501), the noise removal device determination unit 502a causes the transmission / reception unit 106 to transmit the input signal ( Step S503).

Hereinafter, the process until the recognition result display unit 108 displays the recognition result integrated by the recognition result integration unit 107 after the feature amount conversion device determination unit 105a determines a device for calculating the feature amount is illustrated in FIG. This is the same as the processing from S102 to S110.

Next, the operation on the server side will be described with reference to FIG. On the server side, as in the second embodiment, first, the transmission / reception unit 201 receives data from the terminal side (step S201). Next, the processing control unit 601 changes the subsequent processing content according to the content of the received data (step S601). When the received data is a time series of an input signal from which noise has not been removed (“input signal” in step S601), the noise removal unit 602 removes noise from the time series of the input signal (step S602).

After the noise removal unit 602 removes noise or when the received data is a time series of the input signal from which noise has been removed (“noise-removed input signal” in step S601), the feature amount conversion unit 203 The input signal is converted into a feature value (step S203). Thereafter, the processing from performing speech recognition based on the feature amount to transmitting the speech recognition result to the terminal 500 is the same as the processing in steps S204 to S205 illustrated in FIG.

Next, the effect of this embodiment will be described. In general, noise suppression processing can increase the accuracy of speech recognition, but requires a large amount of CPU power. On the server side, it is possible to perform such noise suppression processing compared to the terminal side. According to the present embodiment, in addition to the configuration in the first embodiment, the noise removal device determination unit 502a selects which device to perform noise component removal processing according to the voice input status. When the terminal 500 is selected, the noise removing unit 501 removes a noise component from the input signal. Therefore, for example, when the noise level is high, a speech recognition result with higher accuracy can be obtained by performing noise suppression processing on the server side.

Embodiment 4 FIG.
Next, a speech recognition system according to the fourth embodiment of the present invention will be described. As described in the first to third embodiments, the process control unit 105, the process control unit 302, and the process control unit 502 (hereinafter referred to as each process control unit) are detected by the situation detection unit 104. Based on the situation, it is controlled which device performs the subsequent processing.

In the present embodiment, an index (hereinafter referred to as a score table) that is determined according to the situation detected by the situation detection unit 104 is set in advance, and each processing control unit thereafter performs the processing based on the score table. A case where it is determined which apparatus is to execute the process will be described. Specifically, each processing control unit calculates the total score according to the voice input status based on the score determined according to the status detected by the status detection unit 104. Then, each processing control unit compares the calculated total with a predetermined condition, and selects which device performs the feature amount calculation processing and voice recognition. The score table is stored in advance in a storage unit (not shown) on the terminal side, for example.

FIG. 10 is an explanatory diagram showing an example of a score table. The score table illustrated in FIG. 10 is set in association with the situation detected by the situation detection unit 104 and the score indicating the weight in the situation. For example, when the situation detection unit 104 detects a situation where communication between the terminal side and the server side is disconnected, the weight of the situation is set to 5 points.

Each processing control unit calculates the total number V of points corresponding to the situation detected by the situation detection unit 104. Each processing control unit selects which device is to execute the subsequent processing based on a predetermined condition. For example, when the total V is 4 or more, information is not transmitted to the server side (that is, processing is performed on the terminal side), and when the total V is 2 or more and less than 4, the feature amount is transmitted to the server side. If V is less than 2, a condition may be set such that an input signal is transmitted to the server side. Each processing control unit selects which device is to execute the subsequent processing based on the conditions set in this way. The set conditions are not limited to the above contents.

定める Predetermining such a score table makes it possible to make detailed judgments according to the environment.

Embodiment 5. FIG.
Next, a speech recognition system according to the fifth embodiment of the present invention will be described. In the first to fourth embodiments, the case has been described in which the terminal-side device and the server-side device are connected one-to-one. In the speech recognition system according to the present embodiment, the connection between the terminal-side device and the server-side device is not limited to one-to-one. Two or more terminal-side devices and server-side devices may be connected to each other.

FIG. 11 is an explanatory diagram showing an example of a voice recognition system in the present embodiment. The speech recognition system in this embodiment includes terminals A to D, server apparatuses A to D, and a connection state controller 700. The connection state controller 700 is connected between the terminals A to D and the server apparatuses A to D. The configurations of the terminals A to D are the same as those of the

terminals

100, 300, and 500 in the first to fourth embodiments. The configurations of the server apparatuses A to D are the same as those of the

server apparatuses

200, 400, and 600 in the first to fourth embodiments.

The connection state controller 700 selects the server devices A to D to which the terminals A to D are connected. Specifically, the control unit 701 of the connection state controller 700 determines the terminal based on at least one of the data format transmitted from the terminal side, the CPU load on the server side, and the memory usage rate on the server side. Server apparatuses A to D to which A to D are connected are selected. Here, the data transmitted from the terminal side may include information indicating whether it is an input signal, an input signal obtained by cutting out a voice section, an input signal from which noise has been removed, or a feature amount. The connection state controller 700 is realized by a server device, for example. Further, the control unit 701 of the connection state controller 700 is realized by a CPU included in the connection state controller 700, for example.

Hereinafter, the operation of the connection state controller 700 will be described in detail. Upon receiving a connection request including a data format to be transmitted from the terminal side, the control unit 701 selects a server device that can support the received data format. In addition, the control unit 701 may further select a server device from a plurality of selected server devices on the basis of a low CPU load or a small memory usage. At this time, the number of server devices selected by the control unit 701 is not limited to one, and may be two or more. After selecting the server device, the control unit 701 sets a connection between the terminal that has received the connection request and the selected server device. Thereafter, data transmission / reception is performed between the terminal to which the connection is set and the server device.

In the above description, a case has been described in which the control unit 701 further selects a server device based on a criterion such as a low CPU load or a small memory usage from among the selected server devices. However, the criteria for the control unit 701 to select a server device are not limited to the above. The control unit 701 may select the server device using a standard determined according to the data format transmitted from the terminal side, the CPU load on the server side, and the memory usage rate on the server side.

Thus, since the control unit 701 selects the server device to be connected to the terminals A to D, the combination of the terminals A to D and the server devices A to D is not limited to the combination of configurations exemplified in each embodiment. The voice recognition system in the present embodiment may include, for example, the terminal 100 in the first embodiment and the server device 600 in the third embodiment.

In this case, the terminal status detection unit 104 determines the status of the server device with the least CPU load, the status of the server with the least memory usage, the server device with the fastest communication speed, and the like. It may be detected as a line state. And the connection control controller 700 should just connect to a server apparatus according to the connection request | requirement based on the information which the condition detection part 104 detected.

Next, the minimum configuration of the present invention will be described. FIG. 12 is a block diagram showing an example of the minimum configuration of the speech recognition system according to the present invention. FIG. 13 is a block diagram showing an example of the minimum configuration of the voice acquisition terminal according to the present invention. The voice recognition system according to the present invention receives a voice and receives an input signal (for example, input sound data) representing the voice, and information transmitted from the voice acquisition terminal 80. And a speech recognition device 90 (for example, server device 200) that performs speech recognition based on the above.

The voice acquisition terminal 80 selects at least one device that performs the calculation processing of the feature amount used for voice recognition from the own voice acquisition terminal 80 and the voice recognition device 90 according to the voice input status (for example, feature The amount conversion device determination unit 105a selects, and selects at least one device that performs speech recognition based on the calculated feature amount from the own speech acquisition terminal 80 and the speech recognition device 90 according to the input state. A processing device determination unit 81 (for example, the processing control unit 105) (for example, selected by the speech recognition device determination unit 105b) is provided.

The processing device determination unit 81 includes a voice input environment (for example, a noise level), a task size (for example, the number of vocabulary of words that can be recognized by speech, the degree of utterance complexity), a situation of the own voice acquisition terminal 80 (for example, the terminal 100). CPU load and memory usage status), voice recognition device 90 status (for example, CPU load and memory usage status of server device 200), and communication status between own voice acquisition terminal 80 and voice recognition device 90 (for example, The device that performs the feature amount calculation processing and the device that performs the speech recognition are selected according to information representing at least one of communication disconnection and communication speed is low.

With such a configuration, processing performed by voice recognition can be appropriately shared between the terminal and the server device.

Specifically, the voice acquisition terminal 80 may include status detection means (for example, the status detection unit 104) that detects a voice input status. At this time, the situation detection means 104 includes the voice input environment, the task size, the situation of the own voice acquisition terminal 80, the situation of the voice recognition device 90, and the communication situation between the own voice acquisition terminal 80 and the voice recognition device 90. A situation representing at least one of the following is detected. The processing device determination unit 81 may select a device that performs a feature amount calculation process and a device that performs voice recognition according to the situation detected by the situation detection unit.

Further, the voice acquisition terminal 80 may include voice detection means (for example, a voice detection unit 301) that extracts a voice section from the input signal. At this time, the processing device determination unit 81 selects at least one device for performing speech segment extraction processing from the own speech acquisition terminal 80 and the speech recognition device 90 according to the input state of the speech, and the speech detection unit When the processing device determination unit selects the own voice acquisition terminal 80, the voice section may be extracted from the input signal. Thus, by sharing the voice detection process that requires a lot of CPU power, a voice recognition result with higher accuracy can be obtained.

Further, the voice acquisition terminal 80 may include noise removing means (for example, the noise removing unit 501) for removing a noise component from the input signal. At this time, the processing device determination unit 81 selects at least one device for performing noise component removal processing from the own speech acquisition terminal 80 and the speech recognition device 90 according to the voice input status, and the noise removal unit The noise component may be removed from the input signal when the processing device determination means selects the own voice acquisition terminal 80. Thus, by sharing the noise removal processing that requires a lot of CPU power, a more accurate speech recognition result can be obtained.

In addition, the processing device determination unit 81 responds to the voice input status based on a score (for example, a score table illustrated in FIG. 8) that is a predetermined index according to the status detected by the status detection unit. A device that performs a feature amount calculation process and a device that performs voice recognition may be selected by calculating a total score (for example, a total V) and comparing the calculated total with a predetermined condition. Predetermining such a score table makes it possible to make detailed judgments according to the environment.

Further, the voice acquisition terminal 80 may include a communication unit (for example, the transmission / reception unit 106) that transmits information representing the detected input signal or information representing the calculated feature value to the voice recognition device. At this time, the communication unit may notify the voice recognition device 90 of the data format of the information, and then transmit the information to the voice recognition device 90. The communication unit may receive a voice recognition result for the information from the voice recognition device 90.

A speech recognition system is connected between at least one speech recognition device 90 (for example, the server devices A to D) and the speech acquisition terminal 80 and the speech recognition device 90. A connection destination control device (for example, a connection state controller 700) that selects a connection destination of the voice acquisition terminal 80 may be provided. Then, the data format of information transmitted from the voice acquisition terminal 80 to the voice recognition device 90 (for example, information indicating an input signal, information indicating a feature amount) and the CPU of each voice recognition device. A selection unit (for example, the control unit 701) that selects a connection destination of the voice acquisition terminal 80 based on at least one of the load and the memory usage rate of each voice recognition device may be provided.

Further, the voice acquisition terminal illustrated in FIG. 13 also includes a processing device determination unit 81 (for example, a processing control unit 105). The contents of the processing device determination unit 81 are the same as the contents shown in FIG.

Note that a part or all of the above-described embodiment can be described as in the following supplementary notes, but is not limited thereto.

(Additional remark 1) It is provided with the audio | voice acquisition terminal which acquires the audio | voice and the input signal showing the said audio | voice, and the audio | voice recognition apparatus which performs audio | voice recognition based on the information transmitted from the said audio | voice acquisition terminal, The said audio | voice acquisition terminal Selects at least one device for performing calculation processing of a feature amount used for speech recognition from the own speech acquisition terminal and the speech recognition device according to the speech input status, and calculates according to the input status. And a processing device determination unit that selects at least one device that performs speech recognition based on the feature amount, from the own speech acquisition terminal and the speech recognition device, and speech that performs speech recognition of an input signal based on the selection result Recognizing means, and the processing device judging means includes a voice input environment, a task size, a situation of the own voice acquisition terminal, a situation of the voice recognition apparatus, and between the own voice acquisition terminal and the voice recognition apparatus. Speech recognition system, characterized by selecting the device in response to information representing at least one of signal status, performing apparatus and speech recognition performs calculation processing of the feature.

(Supplementary Note 2) The voice acquisition terminal includes a status detection unit that detects a voice input status, and the status detection unit includes a voice input environment, a task size, a status of the own voice acquisition terminal, a status of the voice recognition device, A situation representing at least one of communication situations between the own voice acquisition terminal and the voice recognition device is detected, and the processing device determination means calculates the feature amount according to the situation detected by the situation detection means. The speech recognition system according to supplementary note 1, wherein a device that performs processing and a device that performs speech recognition are selected.

(Supplementary Note 3) The voice acquisition terminal includes a voice detection unit that extracts a voice section from an input signal, and the processing device determination unit acquires a device that performs the voice section extraction process according to a voice input state. At least one of the terminal and the speech recognition device is selected, and the speech detection means extracts a speech section from the input signal when the processing device determination means selects the own speech acquisition terminal. The speech recognition system described.

(Supplementary Note 4) The voice acquisition terminal includes a noise removal unit that removes a noise component from the input signal, and the processing device determination unit acquires the device that performs the noise component removal process according to the voice input status. At least one of the terminal and the speech recognition device is selected, and the noise removal unit removes a noise component from the input signal when the processing device determination unit selects the own voice acquisition terminal. The speech recognition system according to any one of the above.

(Supplementary Note 5) The processing device determination means calculates the total score according to the voice input status based on the score which is a predetermined index according to the situation detected by the situation detection means, The speech recognition system according to any one of appendix 2 to appendix 4, wherein a device that performs a feature amount calculation process and a device that performs speech recognition is selected by comparing the total with a predetermined condition.

(Supplementary Note 6) The voice acquisition terminal includes a communication unit that transmits information representing the detected input signal or information representing the calculated feature amount to the voice recognition device, and the communication unit recognizes the data format of the information by voice recognition. 6. The speech recognition system according to any one of supplementary notes 1 to 5, wherein the information is transmitted to the speech recognition device after being notified to the device, and the speech recognition result for the information is received from the speech recognition device.

(Appendix 7) At least one or more voice recognition devices, a connection destination control device connected between the voice acquisition terminal and the voice recognition device, and selecting a connection destination of the voice acquisition terminal from the voice recognition devices; The connection destination control device includes at least one of a data format of information transmitted from the voice acquisition terminal to the voice recognition device, a CPU load of each voice recognition device, and a memory usage rate of each voice recognition device. The speech recognition system according to any one of supplementary notes 1 to 6, further comprising selection means for selecting a connection destination of the speech acquisition terminal based on one piece of information.

(Supplementary note 8) The situation detection means includes a noise level indicating a voice input environment, the number of words or complexity of a voice recognition target indicating a task size, a CPU load or a memory usage rate indicating a situation of the own voice acquisition terminal, a voice recognition device Any one of appendix 2 to appendix 7 for detecting a situation representing at least one of a CPU load or a memory usage rate indicating the status of the line, and a line status between the own voice acquisition terminal and the voice recognition device The speech recognition system described.

(Supplementary note 9) The voice acquisition terminal includes a likelihood calculation unit that calculates a likelihood representing the likelihood of the voice recognition result, a voice recognition result selection unit that selects one voice recognition result from a plurality of voice recognition results, and The speech recognition result selection means includes a speech recognition result obtained by the speech acquisition terminal and a speech recognition result obtained by the speech recognition device when speech recognition processing for the input signal is performed by both the speech acquisition terminal and the speech recognition device. Of these, the speech recognition system according to any one of appendix 1 to appendix 8, which selects a speech recognition result having a higher likelihood.

(Supplementary Note 10) When a speech recognition device is selected as a device that performs a feature amount calculation process, the communication unit transmits an input signal to the speech recognition device, and the speech recognition device is selected as a device that performs speech recognition. The speech recognition system according to any one of appendix 6 to appendix 9, wherein the feature amount calculated by the own speech recognizer is transmitted to the speech recognizer.

(Supplementary Note 11) A voice acquisition terminal that receives a voice and acquires an input signal representing the voice, and includes a voice recognition device that performs voice recognition based on information transmitted from the voice acquisition terminal and the own voice acquisition terminal. Then, at least one device that performs a feature amount calculation process used for speech recognition is selected according to the speech input status, and based on the feature amount calculated from the speech recognition device and the own speech acquisition terminal. A processing device determination unit that selects at least one device that performs voice recognition according to the input status, the processing device determination unit including a voice input environment, a task size, a situation of the own voice acquisition terminal, the voice A device for calculating a feature value and a device for performing speech recognition are selected according to information representing at least one of the status of the recognition device and the communication status between the own voice acquisition terminal and the speech recognition device. Audio acquiring terminal, characterized by.

(Additional remark 12) It is provided with the condition detection means which detects the input condition of an audio | voice, and the said condition detection means includes the input environment of voice, the task size, the situation of the own voice acquisition terminal, the situation of the voice recognition device, the own voice acquisition terminal, A device that detects a situation representing at least one of communication situations with the voice recognition device, and a processing device determination unit that performs a feature amount calculation process according to the situation detected by the situation detection unit; The voice acquisition terminal according to appendix 11, which selects a device that performs voice recognition.

(Additional remark 13) The audio | voice acquisition terminal which acquires the input signal which represents the audio | voice and the audio | voice is input from the audio | voice recognition apparatus which performs audio | voice recognition based on the information transmitted from an audio | voice acquisition terminal, and an own audio | voice acquisition terminal, At least one device that performs a feature amount calculation process used for speech recognition is selected according to a speech input state, and the speech acquisition terminal is calculated from the speech recognition device and the own speech acquisition terminal. At least one device that performs speech recognition based on a feature amount is selected according to the input situation, and the speech acquisition terminal determines a device that performs feature amount calculation processing and a device that performs speech recognition as a speech input environment. Selection based on information representing at least one of a task size, a situation of the own voice acquisition terminal, a situation of the voice recognition apparatus, and a communication situation between the own voice acquisition terminal and the voice recognition apparatus. Speech recognition sharing method to.

(Supplementary Note 14) The voice acquisition terminal is at least one of a voice input environment, a task size, a situation of the own voice acquisition terminal, a situation of the voice recognition apparatus, and a communication situation between the own voice acquisition terminal and the voice recognition apparatus. The voice recognition according to appendix 13, wherein a situation representing one is detected as a voice input situation, and the voice acquisition terminal selects a device for calculating a feature value and a device for performing voice recognition according to the detected status. Sharing method.

(Supplementary Note 15) A speech recognition program applied to a computer that receives speech and acquires an input signal representing the speech, and performs speech recognition on the computer based on information transmitted from the computer From the apparatus and the computer, at least one apparatus that performs processing for calculating a feature value used for speech recognition is selected according to the input state of the speech, and is calculated from the speech recognition apparatus and the computer. A processing device determination process for selecting at least one device that performs voice recognition based on a feature amount according to the input status is executed, and in the processing device determination process, a voice input environment, a task size, and a self-voice acquisition terminal Information representing at least one of the following situations: the status of the voice recognition device; and the communication status between the own voice acquisition terminal and the voice recognition device In response, the speech recognition program for selecting a device and apparatus for speech recognition performs calculation processing of the feature.

(Supplementary note 16) A computer is caused to execute a situation detection process for detecting a voice input situation. In the situation detection process, a voice input environment, a task size, a situation of the own voice acquisition terminal, a situation of the voice recognition device, A condition representing at least one of communication conditions between the voice acquisition terminal and the voice recognition device is detected, and a feature amount is calculated according to the status detected in the status detection process in the processing device determination process. The speech recognition program according to appendix 15, which selects a device that performs processing and a device that performs speech recognition.

As mentioned above, although this invention was demonstrated with reference to embodiment and an Example, this invention is not limited to the said embodiment and Example. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.

This application claims priority based on Japanese Patent Application 2010-121016 filed on May 26, 2010, the entire disclosure of which is incorporated herein.

The present invention is preferably applied to a voice recognition system that shares voice recognition processing between a terminal and a server device.

99

Microphone

100, 300, 500 Terminal 101 Input signal acquisition unit 102 Feature amount conversion unit 103 Speech recognition unit 104

Situation detection unit

105, 202, 302, 401, 502, 601 Processing control unit 105a Feature amount conversion device determination unit 105b Speech recognition Device determination unit 106 Transmission / reception unit 107 Recognition result integration unit 108 Recognition

result display unit

200, 400, 600

Server device

301, 402

Audio detection unit

501, 602 Noise removal unit 302a Audio detection device determination unit 302b Feature quantity conversion device determination unit 302c Audio Recognition device determination unit 502a Noise removal device determination unit 700 Connection state controller 701 Control unit

Claims

A voice acquisition terminal that receives voice and acquires an input signal representing the voice;
A voice recognition device that performs voice recognition based on information transmitted from the voice acquisition terminal;
The voice acquisition terminal is
According to the input state of the voice, at least one device that performs the calculation processing of the feature amount used for the speech recognition is selected from the own voice acquisition terminal and the voice recognition device, and is calculated according to the input state. A processing device determination unit that selects at least one device that performs speech recognition based on a feature amount from the own speech acquisition terminal and the speech recognition device;
The processing device determination means includes at least one of a voice input environment, a task size, a situation of the own voice acquisition terminal, a situation of the voice recognition apparatus, and a communication situation between the own voice acquisition terminal and the voice recognition apparatus. A speech recognition system characterized by selecting a device that performs a feature amount calculation process and a device that performs speech recognition in accordance with the information that represents.
The voice acquisition terminal includes a status detection unit that detects a voice input status,
The situation detection means includes at least one of a voice input environment, a task size, a situation of the own voice acquisition terminal, a situation of the voice recognition apparatus, and a communication situation between the own voice acquisition terminal and the voice recognition apparatus. Detect the situation you represent,
The speech recognition system according to claim 1, wherein the processing device determination unit selects a device that performs a feature amount calculation process and a device that performs speech recognition according to the situation detected by the situation detection unit.
The voice acquisition terminal includes voice detection means for extracting a voice section from an input signal,
The processing device determination means selects at least one device that performs the extraction processing of the speech section according to a speech input situation from the own speech acquisition terminal and the speech recognition device,
The voice recognition system according to claim 1, wherein the voice detection unit extracts a voice section from an input signal when the processing device determination unit selects the own voice acquisition terminal.
The voice acquisition terminal includes noise removal means for removing a noise component from the input signal,
The processing device determination means selects at least one device for performing the noise component removal processing from the own voice acquisition terminal and the voice recognition device according to the voice input status,
The speech recognition system according to any one of claims 1 to 3, wherein the noise removing unit removes a noise component from an input signal when the processing device determining unit selects a self-voice acquisition terminal. .
The processing device determination means calculates a total score according to the voice input status based on a score that is a predetermined index according to the situation detected by the status detection means, and determines the calculated total in advance. The speech recognition system according to any one of claims 2 to 4, wherein a device that performs a feature amount calculation process and a device that performs speech recognition are selected by comparing with a specified condition.
The voice acquisition terminal
Comprising communication means for transmitting information representing the detected input signal or information representing the calculated feature quantity to the speech recognition apparatus;
The communication means, after notifying the voice recognition device of the data format of the information, transmits the information to the voice recognition device, and receives a voice recognition result for the information from the voice recognition device. The speech recognition system according to any one of the above.
At least one speech recognition device;
A connection destination controller connected between the voice acquisition terminal and the voice recognition device, and selecting a connection destination of the voice acquisition terminal from the voice recognition device;
The connection destination control device
Based on at least one information of the data format of information transmitted from the voice acquisition terminal to the voice recognition device, the CPU usage rate of each voice recognition device, and the memory usage rate of each voice recognition device, the voice The voice recognition system according to any one of claims 1 to 6, further comprising selection means for selecting a connection destination of the acquisition terminal.
A voice acquisition terminal that receives voice and acquires an input signal representing the voice,
According to the input state of the voice, at least a voice recognition device that performs voice recognition based on information transmitted from the voice acquisition terminal and a device that performs calculation processing of a feature amount used for voice recognition are selected. A processing device determining unit that selects one device and selects at least one device that performs speech recognition based on the calculated feature amount from the speech recognition device and the own speech acquisition terminal according to the input status; ,
The processing device determination means includes at least one of a voice input environment, a task size, a situation of the own voice acquisition terminal, a situation of the voice recognition apparatus, and a communication situation between the own voice acquisition terminal and the voice recognition apparatus. A voice acquisition terminal characterized by selecting a device that performs a feature amount calculation process and a device that performs voice recognition in accordance with information that represents.
A voice acquisition terminal that receives a voice and acquires an input signal representing the voice is used for voice recognition among a voice recognition device that performs voice recognition based on information transmitted from the voice acquisition terminal and a self-voice acquisition terminal. Select at least one device that performs a feature amount calculation process according to the voice input status,
The voice acquisition terminal selects at least one device that performs voice recognition based on the calculated feature amount from the voice recognition device and the own voice acquisition terminal according to the input situation,
The voice acquisition terminal includes a device for performing feature amount calculation processing and a device for voice recognition. The voice input environment, the task size, the status of the own voice acquisition terminal, the status of the voice recognition device, the own voice acquisition terminal, A method for sharing voice recognition, comprising: selecting according to information representing at least one of communication states with a voice recognition device.
A speech recognition program applied to a computer that receives speech and acquires an input signal representing the speech,
In the computer,
Select at least one of a speech recognition device that performs speech recognition based on information transmitted from the computer and a device that performs processing for calculating a feature value used for speech recognition, from the computer, according to the input state of the speech Then, from among the voice recognition device and the computer, a device that performs voice recognition based on the calculated feature amount is selected according to the input status, and a processing device determination process is executed.
In the processing apparatus determination process, at least one of a voice input environment, a task size, a situation of the own voice acquisition terminal, a situation of the voice recognition apparatus, and a communication situation between the own voice acquisition terminal and the voice recognition apparatus A speech recognition program for selecting a device that performs a feature amount calculation process and a device that performs speech recognition in accordance with information that represents the information.