WO2005010868A1 - Voice recognition system and its terminal and server - Google Patents

Voice recognition system and its terminal and server Download PDF

Info

Publication number
WO2005010868A1
WO2005010868A1 (PCT/JP2003/009598)
Authority
WO
WIPO (PCT)
Prior art keywords
server
acoustic model
voice
acoustic
speech recognition
Prior art date
Application number
PCT/JP2003/009598
Other languages
French (fr)
Japanese (ja)
Inventor
Tomohiro Narita
Takashi Sudou
Toshiyuki Hanazawa
Original Assignee
Mitsubishi Denki Kabushiki Kaisha
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Denki Kabushiki Kaisha filed Critical Mitsubishi Denki Kabushiki Kaisha
Priority to PCT/JP2003/009598 priority Critical patent/WO2005010868A1/en
Priority to JP2005504586A priority patent/JPWO2005010868A1/en
Publication of WO2005010868A1 publication Critical patent/WO2005010868A1/en

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit

Definitions

  • The present invention relates to a speech recognition system and its terminal and server, and in particular to the art of selecting an appropriate acoustic model, according to the current usage conditions, from a plurality of acoustic models prepared for various environmental conditions, and performing speech recognition with it. Background Art
  • Speech recognition is performed by extracting a time series of speech feature quantities from the input speech and calculating a likelihood by matching this feature time series against an acoustic model prepared in advance.
  • In the prior art, values output from various in-vehicle sensors such as a speed sensor (sensor information, that is, data obtained by A/D conversion of the analog signal from each sensor) are collected together with the background noise, a noise spectrum is calculated from that noise, and the noise spectrum and the sensor information from the various in-vehicle sensors are stored in association with each other for the next time speech recognition is performed.
  • When speech recognition is next performed, the noise spectrum associated with the current sensor information is retrieved and removed from the time series of speech feature quantities.
  • However, this method has the problem that the accuracy of speech recognition cannot be improved until the device has actually been used long enough for the noise spectra to be accumulated.
  • Alternatively, some representative values are selected in advance from the output values of the various sensors, and an acoustic model is learned for each condition under which the sensors output those values. The sensor information obtained in the actual usage environment can then be compared with the learning conditions of the acoustic models and an appropriate acoustic model selected.
  • The data size of one acoustic model varies depending on how the speech recognition system is designed and implemented, but may reach several hundred kilobytes.
  • On a small terminal, the size and weight of the housing severely limit the storage capacity of the storage device that can be mounted, so it is not realistic to adopt a configuration in which such a terminal holds a plurality of acoustic models of this size.
  • The present invention has been made in order to solve the above-mentioned problem.
  • By transmitting the sensor information over a network from the speech recognition terminal to a speech recognition server that stores a plurality of acoustic models, the present invention makes it possible to select an acoustic model suited to the sound collection environment of the terminal and thereby achieve high-accuracy speech recognition. Disclosure of the Invention
  • The speech recognition system according to the present invention is
  • a speech recognition system in which a speech recognition server and a plurality of speech recognition terminals are connected via a network,
  • each speech recognition terminal comprising a client-side acoustic analysis unit that calculates a speech feature quantity from the speech signal input from the input terminal, client-side transmitting means for transmitting the sensor information to the speech recognition server,
  • client-side receiving means for receiving an acoustic model from the speech recognition server, and client-side matching means for matching the acoustic model against the speech feature quantity,
  • and the speech recognition server comprising server-side receiving means for receiving the sensor information transmitted by the client-side transmitting means,
  • server-side acoustic model storage means for storing a plurality of acoustic models,
  • server-side acoustic model selecting means for selecting an acoustic model that matches the received sensor information from the plurality of acoustic models, and
  • server-side transmitting means for transmitting the acoustic model selected by the server-side acoustic model selecting means to the speech recognition terminal.
  • With this configuration, a plurality of acoustic models corresponding to various sound collection environments are stored in a speech recognition server whose storage capacity is not restricted, and, based on the information from the sensor provided in each speech recognition terminal, an acoustic model suited to the sound collection environment of that speech recognition terminal is selected and sent to it. As a result, even though the speech recognition terminal is limited in its own storage capacity by constraints such as the size and weight of its case, it can acquire an appropriate acoustic model and perform speech recognition with it, so the accuracy of speech recognition can be improved.
  • FIG. 1 is a block diagram showing a configuration of a speech recognition terminal and a server according to Embodiment 1 of the present invention.
  • FIG. 2 is a flowchart showing the operation of the speech recognition terminal and the server according to Embodiment 1 of the present invention.
  • FIG. 3 is a block diagram showing a configuration of a speech recognition terminal and a server according to Embodiment 2 of the present invention.
  • FIG. 4 is a flowchart illustrating a clustering process of an acoustic model according to Embodiment 2 of the present invention.
  • FIG. 5 is a flowchart showing the operation of the speech recognition terminal and the server according to Embodiment 2 of the present invention.
  • FIG. 6 is a block diagram showing a configuration of a speech recognition terminal and a server according to Embodiment 3 of the present invention.
  • FIG. 7 is a flowchart showing the operation of the speech recognition terminal and the server according to Embodiment 3 of the present invention.
  • FIG. 8 is a block diagram showing a configuration of a speech recognition terminal and a server according to Embodiment 4 of the present invention.
  • FIG. 9 is a flowchart showing the operation of the speech recognition terminal and the server according to Embodiment 4 of the present invention.
  • FIG. 10 is a configuration diagram of a data format of sensor information and voice data transmitted from the voice recognition terminal to the voice recognition server according to Embodiment 4 of the present invention.
  • FIG. 11 is a block diagram showing a configuration of a speech recognition terminal and a server according to Embodiment 5 of the present invention.
  • FIG. 12 is a flowchart showing the operation of the speech recognition terminal and the server according to Embodiment 5 of the present invention.
  • FIG. 1 is a block diagram showing the configuration of a speech recognition terminal and a server according to Embodiment 1 of the present invention.
  • a microphone 1 is a device or component that collects voice
  • The voice recognition terminal 2 is a device that performs voice recognition of the voice collected by the microphone 1 via the input terminal 3 and outputs a recognition result 4.
  • The input terminal 3 is an audio input terminal or a microphone connection jack.
  • the speech recognition terminal 2 is connected to a speech recognition server 6 via a network 5.
  • Network 5 is a network that carries digital information, such as the Internet, a LAN (Local Area Network), a public line network, a mobile phone network, or a communication network using artificial satellites.
  • The network 5 only needs to be able to transmit and receive digital data between the devices connected to it; the format of the information transmitted on the network 5 does not matter. It may therefore also be, for example, a bus designed to connect a plurality of devices, such as USB (Universal Serial Bus) or SCSI (Small Computer System Interface).
  • When the network 5 uses a data communication service of a mobile communication network, the data to be transmitted is divided into units called packets, which are sent and received one by one; control information, such as position information indicating which part of the whole data each packet constitutes and error-correction information, is added to each packet.
  • The speech recognition server 6 is a server computer connected to the speech recognition terminal 2 via the network 5.
  • The speech recognition server 6 has a storage device, such as a hard disk drive or memory, with a larger storage capacity than the speech recognition terminal 2, and stores the standard patterns (acoustic models) required for speech recognition. A plurality of speech recognition terminals 2 are connected to the speech recognition server 6 via the network 5.
  • The speech recognition terminal 2 includes a terminal-side acoustic analysis unit 11, a sensor 12, a terminal-side transmission unit 13, a terminal-side reception unit 14, a terminal-side acoustic model storage unit 15, a terminal-side acoustic model selection unit 16, and a terminal-side matching unit 17.
  • the terminal-side acoustic analysis unit 11 performs acoustic analysis based on the audio signal input from the input terminal 3 and calculates an audio feature amount.
  • The sensor 12 is a sensor for detecting environmental conditions, provided near the microphone 1 in order to obtain information about the noise superimposed on the audio signal obtained by the microphone 1.
  • Any element or device that detects or acquires a physical quantity, or a change in a physical quantity, in a given environment may be used.
  • The physical quantities referred to here include speed, temperature, pressure, flow rate, light, magnetism, time, electromagnetic waves, and the like.
  • For example, a GPS antenna is a sensor for GPS signals. It is not always necessary to detect a physical quantity by acquiring a signal from the outside world; for example, a circuit that obtains the time at the place where the microphone is located from a built-in clock is also included among the sensors referred to here.
  • In many cases the sensor outputs an analog signal, and the usual configuration is to sample this analog output into a digital signal by means of an A/D conversion circuit or element. The sensor 12 may therefore include such an A/D conversion circuit or element.
  • When the speech recognition terminal 2 is a terminal of a car navigation system, a plurality of sensors may be combined, such as a speed sensor, a sensor that monitors the rotation of the engine, a sensor that monitors the operation of the wipers, a sensor that monitors the turn signals, a sensor that monitors the opening and closing of the door glass, and a sensor that monitors the car audio program.
  • The terminal-side transmission unit 13 is a unit that transmits the sensor information about the environment near the microphone 1, obtained by the sensor 12, to the speech recognition server 6.
  • the terminal-side receiving unit 14 is configured to receive information from the speech recognition server 6 and output the received information to the terminal-side acoustic model selecting unit 16.
  • The terminal-side transmission unit 13 and the terminal-side reception unit 14 are composed of circuits or elements that send signals to and receive signals from the network cable; a computer program used to control these circuits or elements may also be included in the terminal-side transmission unit 13 and the terminal-side reception unit 14.
  • When the network 5 is a wireless network, the terminal-side transmission unit 13 and the terminal-side reception unit 14 have antennas for transmitting and receiving radio waves.
  • The terminal-side transmission unit 13 and the terminal-side reception unit 14 may be configured as separate parts, or may be constituted by the same network input/output device.
  • The terminal-side acoustic model storage unit 15 is a storage device or circuit for storing acoustic models.
  • A plurality of acoustic models are prepared according to the learning environment, and only some of them are stored in the terminal-side acoustic model storage unit 15.
  • each acoustic model is associated with sensor information indicating an environmental condition in which the acoustic model has been learned, and an acoustic model suitable for the environmental condition can be specified from the numerical value of the sensor information.
  • Even when the speech recognition terminal 2 is an in-vehicle or other small speech recognition device, the acoustic models are also stored by the speech recognition server 6, so the storage capacity of the storage device that must be mounted on the speech recognition terminal 2 can be extremely small.
  • The terminal-side acoustic model selection unit 16 is a unit that calculates the likelihood between an acoustic model acquired by the terminal-side reception unit 14 (or an acoustic model stored in the terminal-side acoustic model storage unit 15) and the speech feature quantity output by the terminal-side acoustic analysis unit 11.
  • The terminal-side matching unit 17 is a unit that selects a vocabulary item based on the likelihood calculated by the terminal-side acoustic model selection unit 16 and outputs it as the recognition result 4.
  • The terminal-side acoustic analysis unit 11, the terminal-side transmission unit 13, the terminal-side reception unit 14, the terminal-side acoustic model storage unit 15, the terminal-side acoustic model selection unit 16, and the terminal-side matching unit 17 may each be configured as dedicated circuits, or they may be implemented as a computer program that performs the equivalent processing, executed by a central processing unit (CPU) together with a network I/O device (such as a network adapter) and a storage device.
  • The speech recognition server 6 includes a server-side receiving unit 21, a server-side acoustic model storage unit 22, a server-side acoustic model selection unit 23, and a server-side transmitting unit 24.
  • the server-side receiving unit 21 is a unit that receives the sensor information transmitted from the terminal-side transmitting unit 13 of the voice recognition terminal 2 via the network 5.
  • the server-side acoustic model storage unit 22 is a storage device for storing a plurality of acoustic models.
  • This server-side acoustic model storage unit 22 is configured as a large-capacity storage device, for example a hard disk drive, or a CD-ROM medium combined with a CD-ROM drive.
  • The server-side acoustic model storage unit 22 has a storage capacity large enough to store all of the acoustic models that may be required by this speech recognition system.
  • The server-side acoustic model selection unit 23 is a unit that selects, from among the acoustic models stored in the server-side acoustic model storage unit 22, an acoustic model suited to the sensor information received by the server-side reception unit 21.
  • the server-side transmitting unit 24 is a unit that transmits the acoustic model selected by the server-side acoustic model selecting unit 23 to the speech recognition terminal 2 via the network 5.
  • The server-side receiving unit 21, the server-side acoustic model storage unit 22, the server-side acoustic model selection unit 23, and the server-side transmitting unit 24 may each be configured as dedicated circuits, or they may be implemented as a computer program that performs the equivalent processing, executed by a central processing unit (CPU) together with a network I/O device (such as a network adapter) and a storage device.
  • FIG. 2 is a flowchart illustrating processing performed by the voice recognition terminal 2 and the voice recognition server 6 according to the first embodiment.
  • When the user speaks into the microphone 1, a voice signal is input to the terminal-side acoustic analysis unit 11 via the input terminal 3 (step S101).
  • The terminal-side acoustic analysis unit 11 converts the voice signal into a digital signal using an A/D converter and calculates a time series of speech feature quantities such as the LPC cepstrum (Linear Predictive Coding cepstrum) (step S102).
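  • As background for step S102, the following is a minimal sketch of how an LPC-cepstrum time series could be computed from a digitized signal. It uses the textbook autocorrelation/Levinson-Durbin method and the standard LPC-to-cepstrum recursion rather than the patent's own (unpublished) implementation, and the frame length, hop size, LPC order, and number of cepstral coefficients are illustrative assumptions.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Levinson-Durbin on the autocorrelation; returns alpha_1..alpha_p with x[n] ~ sum_k alpha_k x[n-k]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12                      # small epsilon avoids division by zero on silent frames
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] += k * a[i - 1::-1][:i]
        err *= (1.0 - k * k)
    return -a[1:]

def lpc_cepstrum(alpha, n_ceps):
    """Convert LPC coefficients to LPC-cepstral coefficients (standard recursion)."""
    p = len(alpha)
    c = np.zeros(n_ceps + 1)
    for m in range(1, n_ceps + 1):
        acc = alpha[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * alpha[m - k - 1]
        c[m] = acc
    return c[1:]

def feature_time_series(signal, frame_len=400, hop=160, order=12, n_ceps=12):
    """Frame the digitized speech and compute one LPC-cepstrum vector per frame (cf. step S102)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        frames.append(lpc_cepstrum(lpc_coefficients(frame, order), n_ceps))
    return np.array(frames)  # shape: (number of frames, n_ceps)
```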
  • the sensor 12 acquires a physical quantity around the microphone 1 (step S103).
  • For example, when the speech recognition terminal 2 is a car navigation system and the sensor 12 is a speed sensor that detects the speed of the vehicle equipped with the car navigation system, the vehicle speed corresponds to such a physical quantity.
  • In FIG. 2, the acquisition of the sensor information in step S103 is shown as being performed after the acoustic analysis in step S102.
  • However, the processing of step S103 may be performed before the processing of steps S101 to S102, or simultaneously with it, or in parallel with it; the order does not matter.
  • Next, the terminal-side acoustic model selection unit 16 selects the acoustic model learned under the condition closest to the sensor information obtained by the sensor 12, that is, closest to the current sound collection environment of the microphone 1.
  • Many learning conditions are assumed for the acoustic models, and the terminal-side acoustic model storage unit 15 does not necessarily store acoustic models for all of them. Therefore, when none of the acoustic models stored in the terminal-side acoustic model storage unit 15 was learned under environmental conditions close to the current sound collection environment of the microphone 1, an acoustic model is obtained from the speech recognition server 6.
  • Let S_m,k denote the sensor information of sensor k under the environmental conditions in which acoustic model m was learned,
  • and let x_k denote the current sensor information of sensor k.
  • The terminal-side acoustic model selection unit 16 calculates a distance value D(m) between the sensor information S_m,k of acoustic model m and the sensor information x_k obtained by the sensor 12 (step S104).
  • Let D_k(x_k, S_m,k) denote the distance value between the sensor information x_k of a certain sensor k and the sensor information S_m,k of acoustic model m.
  • For D_k(x_k, S_m,k), for example, the absolute value of the difference between the two values may be adopted.
  • The overall distance value D(m) is calculated from the per-sensor distance values D_k(x_k, S_m,k) as the weighted sum

    D(m) = Σ_k w_k D_k(x_k, S_m,k)    (1)

    where w_k is the weighting factor for sensor k.
  • The relationship between the sensor information as physical quantities and the distance value D(m) is as follows. If the sensor information is, for example, a position (it may be expressed by longitude and latitude, or by the distance from a specific place taken as the origin), different sensors have different dimensions as physical quantities. However, by adjusting the weighting factor w_k, the contribution of w_k D_k(x_k, S_m,k) to the distance value can be set appropriately, so this causes no problem. The same applies even if the unit systems differ: for example, if km/h is used as the unit of speed in one case and mph in another, different numerical values appear as sensor information even when the speed is physically the same, but this too can be absorbed by the weighting factors.
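  • As a minimal sketch of the equation (1) calculation, assuming each acoustic model carries the sensor values S_m,k recorded for its learning condition (the sensor names, values, and weights below are illustrative, not taken from the patent):

```python
def distance_to_model(x, model_sensors, weights):
    """Equation (1): D(m) = sum_k w_k * D_k(x_k, S_m,k), with the absolute difference as D_k."""
    return sum(weights[k] * abs(x[k] - model_sensors[k]) for k in x)

# Hypothetical current sensor readings and model learning conditions.
current = {"speed_kmh": 62.0, "wiper_state": 1.0}
models = {
    "m0": {"speed_kmh": 0.0,   "wiper_state": 0.0},
    "m1": {"speed_kmh": 60.0,  "wiper_state": 0.0},
    "m2": {"speed_kmh": 100.0, "wiper_state": 1.0},
}
w = {"speed_kmh": 0.01, "wiper_state": 1.0}  # weights absorb differences in units and dimensions

D = {m: distance_to_model(current, s, w) for m, s in models.items()}
best = min(D, key=D.get)  # the acoustic model with the smallest D(m)
```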
  • Next, the terminal-side acoustic model selection unit 16 obtains the minimum value min{D(m)} over all m of the distance values D(m) calculated by equation (1), and evaluates whether this min{D(m)} is smaller than a predetermined value T (step S105). In other words, it determines whether, among the learning conditions of the acoustic models stored in the terminal-side acoustic model storage unit 15, there is a condition sufficiently close to the current environmental conditions under which the microphone 1 is picking up sound.
  • the predetermined value T is a value set in advance to determine whether or not such a condition is satisfied.
  • If min{D(m)} is smaller than the predetermined value T (step S105: Yes), the process proceeds to step S106.
  • If min{D(m)} is equal to or larger than the predetermined value T (step S105: No), the process proceeds to step S107. In this case, none of the learning conditions of the acoustic models stored in the terminal-side acoustic model storage unit 15 is sufficiently close to the current environmental conditions in which the microphone 1 is collecting sound. Therefore, the terminal-side transmission unit 13 transmits the sensor information to the speech recognition server 6 (step S107).
  • If the value of T is made larger, min{D(m)} is more often determined to be smaller than T, and the number of times step S107 is executed decreases. That is, increasing T reduces the number of transmissions and receptions over the network 5, which has the effect of suppressing the amount of traffic on the network 5.
  • Conversely, if the value of T is made smaller, speech recognition is performed with an acoustic model whose learning conditions have a smaller distance value to the sensor information obtained by the sensor 12, so the accuracy of speech recognition can be improved. The value of T may therefore be determined in consideration of the traffic on the network 5 and the target speech recognition accuracy.
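  • The decision in steps S104 to S107 could be sketched as below: compute D(m) for the locally stored models, use the best one if min D(m) is below T, and otherwise send the sensor information to the server and wait for a model. The function names and the transport are placeholders, not interfaces defined by the patent.

```python
def choose_acoustic_model(local_distances, threshold_T, request_from_server):
    """local_distances: dict model_id -> D(m) for the models held in the terminal-side storage (unit 15).
    request_from_server: callable that sends the sensor information and returns a model obtained
    from the server (steps S107 to S111). Returns the model to use for matching (step S112)."""
    best_id = min(local_distances, key=local_distances.get)
    if local_distances[best_id] < threshold_T:        # step S105: Yes -> use the local model (step S106)
        return ("local", best_id)
    return ("server", request_from_server())          # step S105: No  -> fetch a model from the server

# Usage with the D(m) values of the previous sketch; T trades network traffic against model fit.
model = choose_acoustic_model({"m0": 1.62, "m1": 1.02, "m2": 0.38}, threshold_T=0.5,
                              request_from_server=lambda: "model_sent_by_server")
```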
  • The server-side receiving unit 21 of the speech recognition server 6 receives the sensor information via the network 5 (step S108).
  • the server-side acoustic model selection unit 23 calculates a distance value between the environmental condition in which the acoustic model stored in the server-side acoustic model storage unit 22 is learned and the sensor information received by the server-side reception unit 21. The calculation is performed in the same manner as in step S104, and the acoustic model with the smallest distance value is selected (step S109). Subsequently, the server-side transmitting unit 24 transmits the acoustic model selected by the server-side acoustic model selecting unit 23 to the speech recognition terminal 2 (Step S110).
  • the terminal-side receiving unit 14 of the voice recognition terminal 2 receives the acoustic model transmitted by the server-side transmitting unit 24 via the network 5 (Step S111).
  • Next, the terminal-side matching unit 17 performs the matching process between the speech feature quantity output by the terminal-side acoustic analysis unit 11 and the acoustic model (step S112).
  • For example, the vocabulary item whose standard pattern, stored as the acoustic model, gives the highest likelihood against the time series of speech feature quantities is output as the recognition result 4.
  • Alternatively, pattern matching by DP (Dynamic Programming) matching is performed, and the candidate with the smallest distance value is output as the recognition result 4.
  • As described above, with the speech recognition terminal 2 and the server 6 according to the first embodiment, even when only a small number of acoustic models can be stored in the speech recognition terminal 2, the sound collection environment of the microphone 1 is detected by the sensor 12,
  • and an acoustic model learned under environmental conditions close to this sound collection environment can be selected from the many acoustic models stored in the speech recognition server 6 and used to perform speech recognition.
  • The data size of one acoustic model can be several hundred kilobytes, depending on how the system is implemented. The effect of reducing the number of acoustic models that the speech recognition terminal needs to store is therefore significant.
  • The sensor information can take continuous values. Usually, however, several representative values are selected from the possible input values, and an acoustic model is learned for each of these values taken as its sensor information.
  • For example, suppose the sensor 12 is composed of two types of sensors (a first sensor and a second sensor), and the speech recognition terminal 2 and the speech recognition server 6 store an acoustic model for each combination of values. If the number of values selected as sensor information for the first sensor is N1 and the number of values selected as sensor information for the second sensor is N2, the total number of acoustic models stored by the speech recognition terminal 2 and the speech recognition server 6 is N1 × N2 (for instance, 4 speed ranges and 3 wiper states would give 12 acoustic models).
  • That is, when the number of values selected as sensor information of the first sensor is greater than the number of values selected as sensor information of the second sensor, by making the weighting factor for the sensor information of the first sensor smaller than the weighting factor for the sensor information of the second sensor, it is possible to select an acoustic model suited to the sound collection environment of the microphone 1.
  • In the first embodiment, the speech recognition terminal 2 includes the terminal-side acoustic model storage unit 15 and the terminal-side acoustic model selection unit 16, so that both the acoustic models stored in the speech recognition terminal 2 and the acoustic models stored in the speech recognition server 6 can be used,
  • an appropriate model being selected to perform speech recognition.
  • However, it is not essential that the speech recognition terminal 2 include the terminal-side acoustic model storage unit 15 and the terminal-side acoustic model selection unit 16. That is, it goes without saying that a configuration is also possible in which an acoustic model stored in the speech recognition server 6 is always used, unconditionally, based on the sensor information acquired by the sensor 12.
  • Also, the acoustic model received from the speech recognition server 6 may be newly stored in the terminal-side acoustic model storage unit 15, or may be stored in place of one of the acoustic models already held on the speech recognition terminal 2 side. By doing so, the next time speech recognition is performed using the same acoustic model there is no need to transfer the acoustic model again from the speech recognition server 6, so the transmission load on the network 5 can be reduced and the transmission and reception time can be shortened. Embodiment 2
  • In the speech recognition system of the first embodiment, an acoustic model suited to the sensor information is obtained from the speech recognition server.
  • However, the transfer of the acoustic model from the speech recognition server must not place a heavy load on the network, nor should the time required to transfer the acoustic model data affect the overall processing performance.
  • One way to avoid such problems is to design the speech recognition algorithm so that the data size of the acoustic model is as small as possible. If the size of the acoustic model is small, transferring the acoustic model from the speech recognition server to the speech recognition terminal does not add much load to the network.
  • As another approach, a plurality of mutually similar acoustic models are clustered and the differences between the acoustic models in the same cluster are determined in advance and stored in the speech recognition server;
  • only the difference from an acoustic model already stored in the speech recognition terminal is then transferred, and the speech recognition terminal synthesizes the acoustic model of the speech recognition server from the acoustic model it stores and the received difference.
  • Such a method is also conceivable.
  • The speech recognition terminal and the server according to the second embodiment operate based on this principle.
  • FIG. 3 is a block diagram illustrating a configuration of a speech recognition terminal and a server according to the second embodiment.
  • The acoustic model conversion unit 18 is a unit that synthesizes an acoustic model equivalent to one stored in the speech recognition server 6 from the contents received by the terminal-side receiving unit 14 and an acoustic model stored in the terminal-side acoustic model storage unit 15.
  • The acoustic model difference calculation unit 25 is a unit that calculates the difference between an acoustic model stored in the terminal-side acoustic model storage unit 15 and an acoustic model stored in the server-side acoustic model storage unit 22.
  • The remaining parts are the same as in the first embodiment, and their description is omitted.
  • The speech recognition terminal 2 and the server 6 of the second embodiment assume that the acoustic models have been clustered in advance. Therefore, the clustering method for the acoustic models is described first. Note that the clustering of the acoustic models is completed before speech recognition processing is performed by the speech recognition terminal 2 and the server 6.
  • In the following, the acoustic model of each phoneme is represented by the statistics (mean and variance) of its distributions over the dimensions of the speech feature.
  • FIG. 4 is a flowchart showing the clustering process of the acoustic model.
  • an initial cluster is created (step S201).
  • One initial cluster is created from all of the acoustic models that may be used by this speech recognition system. Equations (2) and (3) are used to calculate the statistics of the initial cluster r.
  • In equations (2) and (3), N represents the number of distributions belonging to the cluster
  • and K represents the number of dimensions of the speech feature.
  • step S202 it is determined whether or not the required number of clusters has already been obtained by the clustering process executed so far.
  • The required number of clusters is determined when the speech recognition system is designed. Generally speaking, the greater the number of clusters, the smaller the distance between acoustic models in the same cluster. As a result, the amount of information in the difference data is reduced, and the amount of difference data transmitted and received over the network 5 can be suppressed.
  • In particular, when the number of acoustic models stored by the speech recognition terminal 2 and the server 6 is large, the number of clusters should be increased.
  • The aim is to synthesize, by combining an acoustic model stored in the speech recognition terminal 2 (hereinafter called the "local acoustic model") with a difference, an acoustic model equivalent to one stored in the speech recognition server 6.
  • The difference used here is combined with the local acoustic model, and must therefore be determined between this local acoustic model and acoustic models belonging to the same cluster. Since the acoustic model synthesized from the difference is the one corresponding to the sensor information, the most efficient state is one in which the acoustic model corresponding to the sensor information and the local acoustic model are classified into the same cluster.
  • In step S203, a cluster division based on VQ distortion is performed (step S203).
  • The cluster with the largest VQ distortion, r_max (the initial cluster in the first loop), is divided into two clusters, r1 and r2, thereby increasing the number of clusters.
  • The statistics of the clusters after the division are calculated by the following equation, where δ is a small value predetermined for each dimension of the speech feature.
  • In step S204, the distance between the statistics of each acoustic model and the statistics of each cluster (all of the clusters produced by the divisions up to step S203)
  • is calculated, taking one acoustic model and one cluster at a time. However, the distance is not calculated again for a combination of acoustic model and cluster for which it has already been obtained.
  • The Bhattacharyya distance defined by equation (8) is used as the distance value between the statistics of an acoustic model and the statistics of a cluster.
  • In equation (8), the parameters with suffix 1 are the statistics of the acoustic model, and the parameters with suffix 2 are the statistics of the cluster.
  • Each acoustic model is then assigned to the cluster with the smallest distance value.
  • The distance value between the acoustic model statistics and the cluster statistics may also be calculated by a method other than equation (8). Even in such a case, it is desirable to adopt a formula under which acoustic models whose distance values calculated by equation (1) are close to each other end up belonging to the same cluster; however, this is not essential.
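  • Equation (8) itself is not reproduced in this text; the sketch below therefore uses the standard Bhattacharyya distance between two diagonal-covariance Gaussians, which is the usual form of that measure, as an assumption about what equation (8) computes.

```python
import numpy as np

def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two K-dimensional diagonal Gaussians.
    Suffix 1: statistics of an acoustic model, suffix 2: statistics of a cluster."""
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    avg_var = (var1 + var2) / 2.0
    mean_term = 0.125 * np.sum((mu1 - mu2) ** 2 / avg_var)
    var_term = 0.5 * np.sum(np.log(avg_var / np.sqrt(var1 * var2)))
    return mean_term + var_term

def assign_to_clusters(model_stats, cluster_stats):
    """Step S204: assign each acoustic model to the cluster whose statistics are nearest."""
    assignment = {}
    for m, (mu_m, var_m) in model_stats.items():
        d = {r: bhattacharyya_diag(mu_m, var_m, mu_r, var_r)
             for r, (mu_r, var_r) in cluster_stats.items()}
        assignment[m] = min(d, key=d.get)
    return assignment
```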
  • Next, the codebook of each cluster is updated (step S205).
  • Specifically, the representative values of the acoustic models belonging to each cluster are calculated using equations (2) and (3).
  • Further, the distances between the statistics of the acoustic models belonging to the cluster and the representative values are accumulated using equation (8), and this sum is taken as the VQ distortion of the cluster.
  • In step S206, an evaluation value of the clustering is calculated (step S206).
  • The sum of the VQ distortions over all clusters is taken as the evaluation value of the clustering.
  • Steps S204 to S207 constitute a loop that is executed a plurality of times.
  • The evaluation value calculated in step S206 is stored until the next execution of the loop.
  • The difference between this evaluation value and the evaluation value calculated in the previous execution of the loop is then obtained, and it is determined whether or not this difference is less than a predetermined threshold value (step S207). If the difference is less than the threshold, each acoustic model belongs to an appropriate cluster among the clusters obtained so far, and the process returns to step S202 (step S207: Yes).
  • Otherwise, step S204 is performed again (step S207: No).
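  • Putting steps S201 to S207 together, a compact sketch of the split-and-reassign loop (in the style of LBG clustering) might look as follows. The pooled mean/variance plays the role of equations (2) and (3), the diagonal-Gaussian Bhattacharyya distance plays the role of equation (8), and the ±delta split is an assumed stand-in for the split equation that is not reproduced here.

```python
import numpy as np

def _bhat(mu1, v1, mu2, v2):
    """Bhattacharyya distance between diagonal Gaussians (the role of equation (8))."""
    av = (v1 + v2) / 2.0
    return 0.125 * np.sum((mu1 - mu2) ** 2 / av) + 0.5 * np.sum(np.log(av / np.sqrt(v1 * v2)))

def _pool(stats):
    """Pooled mean and variance of a set of models (the role of equations (2) and (3))."""
    mus = np.array([m for m, _ in stats]); vs = np.array([v for _, v in stats])
    mu = mus.mean(axis=0)
    return mu, (vs + mus ** 2).mean(axis=0) - mu ** 2

def _assign(model_stats, reps):
    """Step S204: nearest-cluster assignment, accumulating per-cluster VQ distortion (step S205)."""
    members = [[] for _ in reps]; distortion = [0.0] * len(reps)
    for name, (mu, var) in model_stats.items():
        d = [_bhat(mu, var, mu_r, var_r) for mu_r, var_r in reps]
        r = int(np.argmin(d))
        members[r].append(name); distortion[r] += d[r]
    return members, distortion

def lbg_cluster(model_stats, n_clusters, delta=0.01, tol=1e-3):
    """model_stats: {model_id: (mean_vector, variance_vector)} -> {model_id: cluster_index}."""
    reps = [_pool(list(model_stats.values()))]                 # step S201: one initial cluster
    while len(reps) < n_clusters:                              # step S202: enough clusters yet?
        members, distortion = _assign(model_stats, reps)
        worst = int(np.argmax(distortion))                     # cluster with the largest VQ distortion
        mu, var = reps[worst]
        reps[worst:worst + 1] = [(mu - delta, var), (mu + delta, var)]   # step S203 (assumed split)
        prev = None
        while True:                                            # steps S204 to S207
            members, distortion = _assign(model_stats, reps)
            reps = [_pool([model_stats[m] for m in ms]) if ms else reps[r]
                    for r, ms in enumerate(members)]           # codebook update (step S205)
            total = sum(distortion)                            # evaluation value (step S206)
            if prev is not None and abs(prev - total) < tol:   # convergence check (step S207)
                break
            prev = total
    members, _ = _assign(model_stats, reps)
    return {m: r for r, ms in enumerate(members) for m in ms}
```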
  • FIG. 5 is a flowchart of the operation of the voice recognition device 2 and the server 6.
  • In steps S101 to S105, a voice is input from the microphone 1 as in the first embodiment, and after acoustic analysis and acquisition of the sensor information it is checked whether an acoustic model suited to the sensor information is held locally.
  • If it is not, the process goes to step S208 (step S105: No).
  • The terminal-side transmission unit 13 transmits to the speech recognition server 6 the sensor information together with information identifying the local acoustic model m that will serve as the basis of the difference (step S208).
  • The server-side receiving unit 21 receives the sensor information and the identifier of the local acoustic model m (step S209), and the server-side acoustic model selection unit 23 selects the acoustic model best suited to the received sensor information (step S109). It is then determined whether or not this acoustic model and the local acoustic model m belong to the same cluster (step S210). If they belong to the same cluster, the process goes to step S211 (step S210: Yes); the acoustic model difference calculation unit 25 calculates the difference between the selected acoustic model and the local acoustic model m (step S211), and the server-side transmitting unit 24 transmits the difference to the speech recognition terminal 2 (step S212).
  • The difference may be calculated, for example, from the differences between the values of corresponding components of the model parameters together with their offsets (the differences between the storage positions of the respective elements). Methods for finding a difference between arbitrary binary data (such as between binary files) are also known and may be used. Further, since the scheme of the second embodiment places no special requirements on the data structure of the acoustic model, a data structure designed so that the difference can be obtained easily is also conceivable.
  • In step S210, if they do not belong to the same cluster, the process goes directly to step S212 (step S210: No), and the selected acoustic model itself is transmitted.
  • In the above, the local acoustic model used as the basis for generating the difference is the one that the speech recognition terminal 2 determined to be most suitable for the sensor information (in step S105, the acoustic model with the smallest distance value to the sensor information). For this reason, information identifying that acoustic model m was transmitted in advance in step S208.
  • Alternatively, the speech recognition server 6 may keep track of (or manage) which acoustic models the speech recognition terminal 2 stores; after selecting the acoustic model closest to the sensor information, the server may then choose, from among the local acoustic models it manages, the one from which to calculate the difference.
  • In this case, the speech recognition server 6 must notify the speech recognition terminal 2 of which local acoustic model the calculated difference is based on, so the speech recognition server 6 transmits, together with the difference, information identifying the local acoustic model used as the basis of the calculation.
  • The terminal-side receiving unit 14 of the speech recognition terminal 2 receives the difference data or the acoustic model (step S213). If the received data is a difference, the acoustic model conversion unit 18 synthesizes an acoustic model from the local acoustic model m that served as the basis of the difference and the difference itself (step S214). Then, the terminal-side matching unit 17 performs pattern matching between the standard pattern of the acoustic model and the speech feature quantity, and outputs the vocabulary item with the highest likelihood as the recognition result 4.
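  • A minimal sketch of this exchange, treating an acoustic model as a flat parameter array: the server computes the element-wise difference between the selected model and the local model m (step S211), and the terminal reconstructs the selected model by adding that difference back (step S214). The flat-array representation and the parameter values are assumptions made for illustration.

```python
import numpy as np

def model_difference(selected_params, local_params):
    """Server side (unit 25, step S211): element-wise difference between two models of one cluster."""
    return selected_params - local_params      # small values when the clustered models are similar

def synthesize_model(local_params, difference):
    """Terminal side (unit 18, step S214): rebuild the server's model from the local model and the difference."""
    return local_params + difference

# Illustrative round trip: the terminal holds local model m, the server sends only the difference.
local_m  = np.array([0.12, -0.40, 1.05, 0.33])   # hypothetical parameters of the local acoustic model
selected = np.array([0.10, -0.38, 1.10, 0.30])   # model the server judged closest to the sensor information
delta    = model_difference(selected, local_m)   # transmitted instead of the full model (step S212)
rebuilt  = synthesize_model(local_m, delta)      # equals `selected` on the terminal side
assert np.allclose(rebuilt, selected)
```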
  • In the first and second embodiments, even when the speech recognition terminal does not store the acoustic model required for speech recognition, the acoustic model stored in the speech recognition server 6 is received via the network 5 so that speech recognition can be performed according to the sound collection environment of the microphone 1. However, instead of transmitting and receiving the acoustic model, the speech feature quantity may be transmitted and received.
  • the speech recognition terminal and server according to the third embodiment operate based on such a principle.
  • FIG. 6 is a block diagram illustrating a configuration of a speech recognition terminal and a server according to the third embodiment.
  • the parts denoted by the same reference numerals as those in FIG. 1 are the same as those in the first embodiment, and thus the description is omitted.
  • The speech recognition terminal 2 and the speech recognition server 6 are connected via the network 5.
  • In the third embodiment, the speech feature quantity and the sensor information are transmitted from the speech recognition terminal 2 to the speech recognition server 6, and the recognition result 7 is output from the speech recognition server 6.
  • The server-side matching unit 27 is a unit that performs matching between the speech feature quantity and the acoustic model, similarly to the terminal-side matching unit 17 of the first embodiment.
  • FIG. 7 is a flowchart showing the processing between the speech recognition terminal 2 and the speech recognition server 6 according to the third embodiment.
  • The processes denoted by the same reference numerals as in FIG. 2 are the same as in the first embodiment; the description below therefore focuses on the processing with reference numerals unique to this flowchart.
  • When the user performs voice input from the microphone 1, a voice signal is input to the speech recognition terminal 2 via the input terminal 3 (step S101); the terminal-side acoustic analysis unit 11 calculates the time series of speech feature quantities from the voice signal (step S102), and sensor information is collected by the sensor 12 (step S103).
  • The sensor information and the speech feature quantity are then transferred to the speech recognition server 6 via the network 5 by the terminal-side transmission unit 13 (step S301), and the server-side reception unit 21 takes them into the speech recognition server 6 (step S302).
  • The server-side acoustic model storage unit 22 of the speech recognition server 6 stores in advance acoustic models corresponding to a plurality of sensor information values; the server-side acoustic model selection unit 23 calculates, using equation (1), the distance value between the sensor information acquired by the server-side reception unit 21 and the sensor information of each acoustic model, and selects the acoustic model with the smallest distance value (step S109).
  • Next, the server-side matching unit 27 performs matching between the speech feature quantity and the selected acoustic model (step S303). This process is the same as the matching process (step S112) in the first embodiment, so a detailed description is omitted.
  • As described above, with the speech recognition terminal 2 and the server 6 according to the third embodiment, only the calculation of the speech feature quantity and the acquisition of the sensor information are performed by the speech recognition terminal 2; based on the sensor information, an appropriate acoustic model is selected from the acoustic models stored in the speech recognition server 6 and the speech feature quantity is matched against it to recognize the speech. This eliminates the need for a storage device or circuit for storing acoustic models in the speech recognition terminal 2, and the configuration of the speech recognition terminal 2 can be simplified.
  • Furthermore, speech recognition can be performed without imposing a large transmission load on the network 5.
  • As mentioned above, the data size of an acoustic model can be several hundred kilobytes. Therefore, if the bandwidth of the network is limited, the transmission capacity may reach its limit when the acoustic model itself is transmitted. The speech feature quantity, on the other hand, requires a bandwidth of at most about 20 kbps and can therefore be transferred in real time without difficulty. A client-server speech recognition system with an extremely low network load can thus be constructed, and highly accurate speech recognition can be performed according to the sound collection environment of the microphone 1.
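  • A back-of-the-envelope check of the bandwidth claim, under assumed but typical parameters (100 feature frames per second, 12 cepstral coefficients per frame, 16-bit values): the feature stream stays around 20 kbps, while pushing a several-hundred-kilobyte acoustic model over the same link takes on the order of minutes.

```python
frames_per_s = 100      # e.g. a 10 ms frame shift (assumed)
dims = 12               # cepstral coefficients per frame (assumed)
bits_per_value = 16     # 16-bit quantized feature values (assumed)

feature_kbps = frames_per_s * dims * bits_per_value / 1000.0
print(feature_kbps)     # 19.2 kbps, consistent with the "at most about 20 kbps" figure

model_kbytes = 300      # a several-hundred-kilobyte acoustic model
print(model_kbytes * 8 / 20.0)  # 120 s to transfer one such model over a 20 kbps link
```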
  • the third embodiment has a configuration in which the recognition result 7 is output from the voice recognition server 6 instead of being output from the voice recognition terminal 2.
  • For example, if the speech recognition terminal 2 is used for browsing the Internet, it is sufficient to adopt a configuration in which the user utters a URL (Uniform Resource Locator), the speech recognition server 6 recognizes it, obtains the Web page determined from the URL, and transmits that page to the speech recognition terminal 2.
  • the voice recognition terminal 2 may output a recognition result.
  • In that case, the speech recognition terminal 2 is provided with a terminal-side receiving unit, and the speech recognition server 6 is provided with a server-side transmitting unit;
  • the output of the server-side matching unit 27 is transmitted from the transmitting unit of the speech recognition server 6 via the network 5 to the receiving unit of the speech recognition terminal 2, which outputs it to the desired output destination.
  • Instead of the speech feature quantity, a method of transmitting and receiving the audio data itself may also be considered.
  • the voice recognition terminal and the server according to the fourth embodiment operate based on such a principle.
  • FIG. 8 is a block diagram illustrating a configuration of a speech recognition terminal and a server according to the fourth embodiment.
  • the parts denoted by the same symbols as those in FIG. 1 are the same as those in the first embodiment, and therefore the description is omitted.
  • The speech recognition terminal 2 and the speech recognition server 6 are connected via the network 5.
  • voice data and sensor information are transmitted from the voice recognition terminal 2 to the voice recognition server 6, and the recognition result 7 is output from the voice recognition server 6.
  • this is different from the first embodiment.
  • The audio digital processing unit 19 is a unit that converts the audio input from the input terminal 3 into digital data, and is composed of an A/D conversion device or circuit.
  • The server-side acoustic analysis unit 28 is a unit that calculates the speech feature quantity from the input speech on the speech recognition server 6, and has the same function as the terminal-side acoustic analysis unit 11 in the first and second embodiments.
  • FIG. 9 is a flowchart illustrating the processing performed by the speech recognition terminal 2 and the speech recognition server 6 according to the fourth embodiment.
  • The processes denoted by the same reference numerals as in FIG. 2 are the same as in the first embodiment; the description below therefore focuses on the processing with reference numerals unique to this flowchart.
  • When the user performs voice input from the microphone 1, a voice signal is input to the speech recognition terminal 2 via the input terminal 3 (step S101), and the audio digital processing unit 19 samples the audio signal input in step S101 by A/D conversion (step S401).
  • Voice compression (encoding) methods include the μ-law 64 kbps PCM method (Pulse Code Modulation, ITU-T G.711) used in the public telephone network (ISDN, etc.), the adaptive differential PCM method used in PHS (Adaptive Differential PCM, ADPCM, ITU-T G.726), and the VSELP method (Vector Sum Excited Linear Prediction) and CELP method (Code Excited Linear Prediction) used in mobile phones.
  • One of these methods should be selected according to the available bandwidth and traffic of the communication network: μ-law PCM where 64 kbps can be secured, ADPCM where 16-40 kbps can be secured, VSELP where about 11.2 kbps can be secured, and CELP where the available bandwidth is even lower.
  • The characteristics of the present invention are not lost even if another encoding method is applied.
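  • The codec choice described above amounts to a lookup on the available bandwidth; the thresholds in the sketch below follow the rates quoted in the text, and the function itself is an illustrative assumption rather than part of the patent.

```python
def choose_codec(available_kbps):
    """Pick a voice coding method from the available uplink bandwidth."""
    if available_kbps >= 64:
        return "mu-law PCM (ITU-T G.711, 64 kbps)"
    if available_kbps >= 16:
        return "ADPCM (ITU-T G.726, 16-40 kbps)"
    if available_kbps >= 11.2:
        return "VSELP (11.2 kbps)"
    return "CELP (lower-rate coding)"

print(choose_codec(32))  # -> "ADPCM (ITU-T G.726, 16-40 kbps)"
```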
  • Next, the sensor information is acquired by the sensor 12 (step S103), and the sensor information and the voice data are combined and arranged into, for example, the format shown in FIG. 10; the data is then transferred to the speech recognition server 6 via the network 5 by the terminal-side transmission unit 13 (step S402).
  • a frame number indicating the processing time of the audio data is stored in the area 701.
  • This frame number is uniquely determined based on, for example, the sampling time of the audio data.
  • Here, "uniquely determined" includes "determined based on a relative time agreed between the speech recognition terminal 2 and the speech recognition server 6", meaning that audio data with a different relative time is given a different frame number.
  • Alternatively, a common time may be supplied from a clock external to both the speech recognition terminal 2 and the speech recognition server 6, and the frame number may be uniquely determined based on this time.
  • The data size occupied by the sensor information is stored in area 702 of the data format of FIG. 10.
  • For example, if the sensor information is a 32-bit value,
  • the size of the area required to store the sensor information (4 bytes) is expressed in bytes, and 4 is stored.
  • If the sensor 12 is composed of a plurality of sensors, what is stored is the data size of the array area necessary to hold the sensor information of all of them.
  • an area 703 is an area in which the sensor information acquired by the sensor 12 in step S103 is stored.
  • If the sensor 12 is composed of several sensors, an array of their sensor information is stored in area 703.
  • The data size of area 703 matches the data size held in area 702.
  • The size of the audio data is stored in area 704.
  • When the transmission unit 13 divides the voice data into a plurality of packets for transmission (each packet is assumed to have the same structure as the data format shown in FIG. 10), what is stored in area 704 is the data size of the audio data included in each packet. The case of division into a plurality of packets is described later. The audio data itself is stored in area 705.
  • the terminal-side transmission unit 13 divides the voice data input via the input terminal 3 into a plurality of packets.
  • As described above, the frame number stored in area 701 is information indicating the processing time of the audio data, and the frame number of each packet is determined based on the sampling time of the audio data included in that packet.
  • Likewise, the data size of the audio data included in each packet is stored in area 704.
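  • One way to serialize the FIG. 10 layout (areas 701 to 705) is sketched below with Python's struct module; the field widths, byte order, and value encodings are assumptions, since the text names the fields but not their encodings.

```python
import struct

def pack_packet(frame_number, sensor_values, audio_bytes):
    """Area 701: frame number; 702: sensor-data size; 703: sensor values;
    704: audio-data size; 705: audio data. Little-endian 32-bit fields are assumed."""
    sensor_blob = struct.pack("<%df" % len(sensor_values), *sensor_values)            # area 703
    header = struct.pack("<II", frame_number, len(sensor_blob))                       # areas 701, 702
    return header + sensor_blob + struct.pack("<I", len(audio_bytes)) + audio_bytes   # areas 704, 705

def unpack_packet(packet):
    frame_number, sensor_size = struct.unpack_from("<II", packet, 0)
    sensor_values = struct.unpack_from("<%df" % (sensor_size // 4), packet, 8)
    audio_size, = struct.unpack_from("<I", packet, 8 + sensor_size)
    audio = packet[12 + sensor_size:12 + sensor_size + audio_size]
    return frame_number, list(sensor_values), audio

# Example: one packet carrying the speed (km/h) and a wiper flag alongside four encoded audio bytes.
pkt = pack_packet(frame_number=42, sensor_values=[62.0, 1.0], audio_bytes=b"\x01\x02\x03\x04")
assert unpack_packet(pkt) == (42, [62.0, 1.0], b"\x01\x02\x03\x04")
```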
  • When the output of the sensors constituting the sensor 12 changes from moment to moment within a short time, the sensor information stored in area 703 also differs from packet to packet.
  • For example, suppose the speech recognition terminal 2 is an in-vehicle speech recognition device and the sensor 12 measures the loudness of the background noise (for example, with a microphone different from the microphone 1); the loudness of the background noise may then vary significantly during a single utterance.
  • When the sensor information changes significantly during the utterance in this way, it is desirable for the terminal-side transmission unit 13 to split the voice data at that point, regardless of the packet size determined by the characteristics of the network 5, and to send packets containing the different sensor information.
  • the server side receiving unit 21 takes in the sensor information and the voice data to the voice recognition server 6 (step S403).
  • The server-side acoustic analysis unit 28 performs acoustic analysis of the acquired audio data and calculates the time series of speech feature quantities (step S404).
  • The server-side acoustic model selection unit 23 selects the most appropriate acoustic model based on the acquired sensor information (step S109), and the server-side matching unit 26 matches the standard pattern of this acoustic model against the speech feature quantity (step S405).
  • As described above, in the fourth embodiment the speech recognition terminal 2 transfers the sensor information and the voice data to the speech recognition server 6, so that, without the speech recognition terminal 2 performing acoustic analysis, highly accurate speech recognition can be performed based on an acoustic model suited to the sound collection environment.
  • The speech recognition function can thus be realized without providing a dedicated program on the terminal side.
  • Since the sensor information is transmitted for each frame, an appropriate acoustic model can be selected and speech recognition performed even when the environmental conditions under which the microphone 1 collects sound change rapidly during the utterance.
  • The method of dividing the transmission from the speech recognition terminal 2 into a plurality of frames can also be applied to the transmission of the speech feature quantity in the third embodiment. Since the speech feature quantity has a time-series structure, it is preferable, when dividing it into frames, to divide it in time-series order.
  • If the sensor information at the corresponding time in the time series is stored in each frame in the same manner as in the fourth embodiment, and the speech recognition server 6 selects a suitable acoustic model based on the latest sensor information included in each frame, the accuracy of speech recognition can be further improved.
  • Embodiment 5. In the speech recognition systems of the first to fourth embodiments, an acoustic model stored in the speech recognition terminal 2 or the server 6 was selected based on the environmental conditions acquired by the sensor 12 of the speech recognition terminal 2, and speech recognition was performed accordingly. However, instead of using only the environmental information obtained from the sensor 12, it is also conceivable to select an acoustic model by additionally combining it with additional information obtained from the Internet or the like.
  • the voice recognition system according to the fifth embodiment has such features.
  • The feature of the fifth embodiment is that the acoustic model is selected by combining additional information obtained from the Internet with the sensor information; this can be combined with the speech recognition system of any of the first to fourth embodiments, with the same effect in each case. Here, as an example, the case where the speech recognition system of the first embodiment is combined with additional information obtained from the Internet is described.
  • FIG. 11 is a block diagram illustrating the configuration of the speech recognition system according to the fifth embodiment.
  • The speech recognition system of the fifth embodiment is the same as the speech recognition system of the first embodiment, except that an Internet information acquisition unit 29 is added.
  • The components denoted by the same reference numerals as in FIG. 1 are the same as in the first embodiment, and their description is not repeated.
  • The Internet information acquisition unit 29 is a unit that acquires additional information via the Internet; more specifically, it acquires Web pages by HTTP (Hyper Text Transfer Protocol) and has functions equivalent to those of an Internet browser.
  • The additional information is, for example, weather information or traffic information.
  • There are Web sites on the Internet that provide weather information and traffic information, and from these Web sites it is possible to obtain the weather conditions, traffic congestion information, road construction status, and so on at various places. Therefore, in order to perform speech recognition with higher accuracy using such additional information, acoustic models matching the available additional information are prepared.
  • For example, when weather information is used as the additional information, an acoustic model is learned taking into account the effect of the background noise caused by rain and the like; when the additional information indicates road construction, an acoustic model is learned in consideration of the influence of the background noise generated by the road construction and the like.
  • FIG. 12 is a flowchart showing the operation of the speech recognition terminal 2 and the server 6 according to the fifth embodiment.
  • the only difference between the flowchart of FIG. 12 and the flowchart of FIG. 2 is the presence or absence of step S501. Therefore, hereinafter, the processing of step S501 will be mainly described.
  • After the speech recognition server 6 receives the sensor information (step S108), the Internet information acquisition unit 29 acquires from the Internet information that affects the sound collection environment of the microphone 1 connected to the speech recognition terminal 2 (step S501). For example, when the sensor 12 includes a GPS antenna, the sensor information contains the position of the speech recognition terminal 2 and the microphone 1; in step S501, based on this position information, additional information such as the weather information and traffic information for the location of the speech recognition terminal 2 and the microphone 1 is obtained from the Internet.
  • Next, the server-side acoustic model selection unit 23 selects an acoustic model based on the sensor information and the additional information. Specifically, it is first determined which acoustic models have additional information matching the additional information for the current location of the speech recognition terminal 2 and the microphone 1. Then, from among the acoustic models whose additional information matches, the acoustic model with the smallest distance value to the sensor information, calculated by equation (1) shown in the first embodiment, is selected.
  • Even if the conditions under which an acoustic model was learned cannot be completely expressed by the sensor information alone, they can be expressed with the help of the additional information, so an acoustic model more appropriate to the sound collection environment of the microphone 1 can be selected. As a result, the speech recognition accuracy can be improved.
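  • The two-stage selection of the fifth embodiment could be sketched as follows: first filter the candidate acoustic models by the additional information fetched from the Internet (for example weather or road construction), then apply the equation (1) distance to the sensor information among the remaining candidates. The field names and values are illustrative assumptions.

```python
def select_model(models, sensor_info, extra_info, weights):
    """models: {id: {"conditions": {...}, "extra": {...}}}; extra_info: e.g. weather/construction flags."""
    # Stage 1: keep only the models whose additional information matches the current additional information.
    candidates = {m: spec for m, spec in models.items() if spec["extra"] == extra_info}
    if not candidates:               # if nothing matches exactly, fall back to all models
        candidates = models
    # Stage 2: equation (1) distance between the sensor information and each model's learning conditions.
    def dist(spec):
        return sum(weights[k] * abs(sensor_info[k] - spec["conditions"][k]) for k in sensor_info)
    return min(candidates, key=lambda m: dist(candidates[m]))

models = {
    "clear_slow": {"conditions": {"speed_kmh": 20.0}, "extra": {"weather": "clear", "construction": False}},
    "rain_fast":  {"conditions": {"speed_kmh": 90.0}, "extra": {"weather": "rain",  "construction": False}},
    "rain_slow":  {"conditions": {"speed_kmh": 30.0}, "extra": {"weather": "rain",  "construction": False}},
}
best = select_model(models, {"speed_kmh": 35.0},
                    {"weather": "rain", "construction": False}, {"speed_kmh": 1.0})
# best == "rain_slow": it matches the weather obtained in step S501 and is closest in speed
```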
  • The significance of using the additional information lies in selecting acoustic models on the basis of environmental factors that degrade speech recognition accuracy but cannot be represented by the sensor information alone. The means of acquiring such additional information is therefore not limited to the Internet.
  • A dedicated system or a dedicated computer for providing the additional information may be prepared instead.
  • The voice recognition system, terminal, and server according to the present invention are useful for providing high-precision voice recognition in whatever environment they are used, and are suitable for providing a voice recognition function to devices, such as car navigation systems and mobile phones, in which the capacity of the storage device that can be mounted is limited by constraints on housing size, weight, and price.
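As an illustration of the selection described above, the following sketch, written in Python, combines the additional information with the sensor-information distance of equation (1). The data structures (AcousticModel, the keys of sensor_info and additional_info) and the fallback used when no additional information matches are assumptions introduced for this example, not part of the specification.

```python
# Minimal sketch of the Embodiment 5 selection step (hypothetical data structures).
from dataclasses import dataclass

@dataclass
class AcousticModel:
    name: str
    sensor_info: dict       # training-time sensor values, e.g. {"speed_kmh": 40}
    additional_info: dict   # training-time weather/traffic labels, e.g. {"weather": "rain"}

def distance(sensor_now: dict, model: AcousticModel, weights: dict) -> float:
    # Equation (1): weighted sum of per-sensor absolute differences.
    return sum(weights.get(k, 1.0) * abs(sensor_now[k] - model.sensor_info[k])
               for k in sensor_now)

def select_model(models, sensor_now, additional_now, weights):
    # First keep only the models whose additional (weather/traffic) information
    # matches the information fetched from the Internet for the terminal's location,
    # then pick the candidate with the smallest distance value of equation (1).
    candidates = [m for m in models if m.additional_info == additional_now] or models
    return min(candidates, key=lambda m: distance(sensor_now, m, weights))
```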

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice recognition system that performs high-accuracy voice recognition in a variety of usage environments. In a client-server voice recognition system, a voice recognition terminal (2) and a voice recognition server (6) connected by a network share the voice recognition processing of calculating a voice feature amount from the voice signal collected by an external microphone (1), storing a plurality of acoustic models, selecting from them an acoustic model suited to the sound collection environment of the external microphone (1), and outputting a recognition result by pattern matching between the standard patterns of the acoustic model and the voice feature amount. The voice recognition terminal (2) is provided with a sensor (12) for detecting the sound collection environment of the external microphone (1) and with a section (13) for transmitting the output of the sensor (12) to the voice recognition server (6).

Description

Specification
Voice recognition system and its terminal and server
Technical Field
The present invention relates to a speech recognition system and to its terminal and server, and in particular to a technique of selecting, from a plurality of acoustic models prepared under various assumed conditions of use, an acoustic model appropriate to the current conditions of use and performing speech recognition with it.
Background Art
Speech recognition is performed by extracting a time series of speech feature quantities from the input speech and matching this time series against acoustic models prepared in advance to compute recognition scores.
However, background noise is superimposed on speech uttered in a real usage environment, so the accuracy of speech recognition degrades. The kind of background noise and the way it is superimposed differ with the usage environment. Accurate speech recognition therefore requires preparing a plurality of acoustic models and selecting, from among them, the acoustic model suited to the current usage environment. As a method of selecting such an acoustic model there is, for example, Japanese Patent Application Laid-Open No. 2000-295500 (Patent Document 1).
In the acoustic model selection method of Patent Document 1, for example in an in-vehicle speech recognition apparatus, a noise spectrum is calculated from the noise corresponding to the values output by various in-vehicle sensors such as a speed sensor (these values are the data obtained by A/D conversion of the analog signal from the sensor, and are hereafter called sensor information), and the noise spectrum is stored in association with the sensor information from the various in-vehicle sensors. At the next speech recognition, if the sensor information obtained from the sensors and the sensor information of a stored noise spectrum are similar to within a predetermined value, the noise spectrum corresponding to that sensor information is removed from the time series of speech feature quantities. This method, however, has the problem that it cannot improve recognition accuracy in an environment that has never been encountered before. A conceivable remedy is, for example, to select in advance at factory shipment several values from the output values of the various sensors, to train acoustic models under the conditions in which the sensors output those values, and then to select an appropriate acoustic model by comparing the sensor information obtained in the actual usage environment with the training conditions of the acoustic models.
Meanwhile, the data size of a single acoustic model, although it depends on how the speech recognition system is designed and implemented, can reach several hundred kilobytes. In mobile devices such as car navigation systems and mobile phones, the capacity of the storage device that can be mounted is severely limited by constraints on housing size and weight. It is therefore not realistic to adopt a configuration in which a mobile device stores a plurality of acoustic models of such a data size.
In particular, when there are a plurality of sensors, selecting several sensor-information values for each sensor and preparing acoustic models corresponding to all of their combinations would require an enormous storage capacity.
The present invention has been made to solve the above problems, and its object is to realize highly accurate speech recognition processing by transmitting sensor information from a speech recognition terminal over a network to a speech recognition server that stores a plurality of acoustic models, so that an acoustic model suited to the actual usage environment can be selected.
Disclosure of the Invention
The speech recognition system according to the present invention is
a speech recognition system in which a speech recognition server and a plurality of speech recognition terminals are connected by a network, wherein
the speech recognition terminal comprises:
an input terminal to which an external microphone is connected and into which the speech signal collected by the external microphone is input;
client-side acoustic analysis means for calculating speech feature quantities from the speech signal input from the input terminal;
a sensor for detecting sensor information representing the environment of the noise superimposed on the speech signal, and client-side transmitting means for transmitting the sensor information to the speech recognition server via the network; and
client-side receiving means for receiving an acoustic model from the speech recognition server, and client-side matching means for matching the acoustic model against the speech feature quantities; and
the speech recognition server comprises:
server-side receiving means for receiving the sensor information transmitted by the client-side transmitting means;
server-side acoustic model storage means for storing a plurality of acoustic models;
server-side acoustic model selection means for selecting, from the plurality of acoustic models, an acoustic model that matches the sensor information; and
server-side transmitting means for transmitting the acoustic model selected by the server-side acoustic model selection means to the speech recognition terminal.
In this way, in this speech recognition system, a speech recognition server whose storage capacity is not restricted stores a plurality of acoustic models corresponding to various sound collection environments, and an acoustic model suited to the sound collection environment of each speech recognition terminal is selected on the basis of the information from the sensor provided in that terminal and transmitted to the terminal. As a result, even a speech recognition terminal whose own storage capacity is limited by constraints such as housing size and weight acquires an acoustic model suited to its sound collection environment and performs speech recognition with that model, so the accuracy of speech recognition can be improved.
Brief Description of the Drawings
FIG. 1 is a block diagram showing the configuration of the speech recognition terminal and server according to Embodiment 1 of the present invention;
FIG. 2 is a flowchart showing the operation of the speech recognition terminal and server according to Embodiment 1 of the present invention;
FIG. 3 is a block diagram showing the configuration of the speech recognition terminal and server according to Embodiment 2 of the present invention;
FIG. 4 is a flowchart showing the clustering processing of acoustic models according to Embodiment 2 of the present invention;
FIG. 5 is a flowchart showing the operation of the speech recognition terminal and server according to Embodiment 2 of the present invention;
FIG. 6 is a block diagram showing the configuration of the speech recognition terminal and server according to Embodiment 3 of the present invention;
FIG. 7 is a flowchart showing the operation of the speech recognition terminal and server according to Embodiment 3 of the present invention;
FIG. 8 is a block diagram showing the configuration of the speech recognition terminal and server according to Embodiment 4 of the present invention;
FIG. 9 is a flowchart showing the operation of the speech recognition terminal and server according to Embodiment 4 of the present invention;
FIG. 10 is a diagram showing the data format of the sensor information and voice data transmitted from the speech recognition terminal to the speech recognition server according to Embodiment 4 of the present invention;
FIG. 11 is a block diagram showing the configuration of the speech recognition terminal and server according to Embodiment 5 of the present invention; and
FIG. 12 is a flowchart showing the operation of the speech recognition terminal and server according to Embodiment 5 of the present invention.
Best Mode for Carrying Out the Invention
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a speech recognition terminal and server according to one embodiment of the present invention. In the figure, the microphone 1 is a device or component that collects speech, and the speech recognition terminal 2 is a device that recognizes the speech collected by the microphone 1 and input via the input terminal 3, and outputs a recognition result 4. The input terminal 3 is an audio terminal or a microphone connection jack.
The speech recognition terminal 2 is connected to a speech recognition server 6 via a network 5. The network 5 is a network that carries digital information, such as the Internet, a LAN (Local Area Network), a public telephone network, a mobile phone network, or a communication network using satellites. It is sufficient, however, that the network 5 allows digital data to be exchanged between the devices connected to it; the format of the information carried on the network 5 does not matter. It may therefore be, for example, a bus designed to connect a plurality of devices, such as USB (Universal Serial Bus) or SCSI (Small Computer System Interface). When the speech recognition terminal 2 is a vehicle-mounted speech recognition device, the network 5 uses a mobile data communication service. In a data communication service, the data to be transmitted are divided into units called packets, which are sent and received one by one. Besides the data that the sender intends to transmit to the receiver, each packet carries control information such as information identifying the receiver (the destination address), position information indicating which part of the whole data the packet constitutes, and error correction information. The speech recognition server 6 is a server computer configured to be connected to the speech recognition terminal 2 via the network 5. The speech recognition server 6 has a storage device, such as a hard disk drive or memory, of larger capacity than the speech recognition terminal 2, and stores the standard patterns required for speech recognition. A plurality of speech recognition terminals 2 are connected to the speech recognition server 6 via the network 5.
Next, the detailed configuration of the speech recognition terminal 2 will be described. The speech recognition terminal 2 comprises a terminal-side acoustic analysis unit 11, a sensor 12, a terminal-side transmitting unit 13, a terminal-side receiving unit 14, a terminal-side acoustic model storage unit 15, a terminal-side acoustic model selection unit 16, and a terminal-side matching unit 17.
The terminal-side acoustic analysis unit 11 is a unit that performs acoustic analysis on the speech signal input from the input terminal 3 and calculates speech feature quantities.
The sensor 12 is a sensor that detects environmental conditions in order to obtain information on the kind of noise superimposed on the speech signal picked up by the microphone 1; it is an element or device that detects or acquires physical quantities, or changes in them, in the environment in which the microphone 1 is installed. It may also include an element or device that converts the detected quantity into an appropriate signal and outputs it. The physical quantities referred to here include pressure, flow rate, light, and magnetism, as well as time, electromagnetic waves, and the like. Thus, for example, a GPS antenna is a sensor for GPS signals. The sensor need not necessarily detect a physical quantity by acquiring a signal from the outside world; for example, a circuit that obtains the time at the location of the microphone from a built-in clock is also included among the sensors referred to here.
In the following description, these physical quantities are collectively referred to as sensor information. In general, a sensor outputs an analog signal, and the usual configuration is to sample the output analog signal into a digital signal with an A/D converter or conversion element; the sensor 12 may therefore include such an A/D converter or element. Furthermore, a plurality of types of sensors may be combined; for example, when the speech recognition terminal 2 is a terminal of a car navigation system, a speed sensor, a sensor monitoring the engine speed, a sensor monitoring the operation of the wipers, a sensor monitoring whether the door windows are open or closed, a sensor monitoring the volume of the car audio system, and so on may be combined.
The terminal-side transmitting unit 13 is a unit that transmits the sensor information obtained by the sensor 12 in the vicinity of the microphone 1 to the speech recognition server 6.
The terminal-side receiving unit 14 is a unit that receives information from the speech recognition server 6 and outputs the received information to the terminal-side acoustic model selection unit 16. The terminal-side transmitting unit 13 and the terminal-side receiving unit 14 are composed of circuits or elements that send signals onto, and receive signals from, the network cable; a computer program for controlling these circuits or elements may also be included in the terminal-side transmitting unit 13 and the terminal-side receiving unit 14. When the network 5 is a wireless network, the terminal-side transmitting unit 13 and receiving unit 14 are provided with antennas for transmitting and receiving radio waves. The terminal-side transmitting unit 13 and the terminal-side receiving unit 14 may be configured as separate units, or may be configured as a single network input/output device.
The terminal-side acoustic model storage unit 15 is a storage element or circuit for storing acoustic models. A plurality of acoustic models may exist, one for each training environment, and only some of them are stored in the terminal-side acoustic model storage unit 15. Each acoustic model is associated with the sensor information representing the environmental conditions under which it was trained, so that the acoustic model suited to given environmental conditions can be identified from the numerical values of the sensor information. For example, when the speech recognition terminal 2 is a vehicle-mounted speech recognition device, acoustic models such as one created from samples uttered in the noise environment of a car traveling at 40 km/h and one created from samples uttered in the noise environment of a car traveling at 50 km/h are prepared. However, as described later, the speech recognition server 6 also stores acoustic models corresponding to various environmental conditions, so the terminal-side acoustic model storage unit 15 need not store acoustic models trained under all conditions. By adopting such a configuration, the storage capacity that the speech recognition terminal 2 must carry can be kept extremely small.
The terminal-side acoustic model selection unit 16 is a unit that calculates the likelihood between the acoustic model acquired by the terminal-side receiving unit 14 (or an acoustic model stored in the terminal-side acoustic model storage unit 15) and the speech feature quantities output by the terminal-side acoustic analysis unit 11. The terminal-side matching unit 17 is a unit that selects a vocabulary entry on the basis of the likelihood calculated by the terminal-side acoustic model selection unit 16 and outputs it as the recognition result 4.
Of the components of the speech recognition terminal 2, the terminal-side acoustic analysis unit 11, the terminal-side transmitting unit 13, the terminal-side receiving unit 14, the terminal-side acoustic model storage unit 15, the terminal-side acoustic model selection unit 16, and the terminal-side matching unit 17 may each be configured as a dedicated circuit, or they may be configured as computer programs that cause a central processing unit (CPU), a network I/O device (such as a network adapter), and a storage device to execute the processing corresponding to each function.
Next, the detailed configuration of the speech recognition server 6 will be described. The speech recognition server 6 comprises a server-side receiving unit 21, a server-side acoustic model storage unit 22, a server-side acoustic model selection unit 23, and a server-side transmitting unit 24. The server-side receiving unit 21 is a unit that receives the sensor information transmitted from the terminal-side transmitting unit 13 of the speech recognition terminal 2 via the network 5.
The server-side acoustic model storage unit 22 is a storage device for storing a plurality of acoustic models. It is configured as a large-capacity storage device such as a hard disk drive or a combination of a CD-ROM medium and a CD-ROM drive.
Unlike the terminal-side acoustic model storage unit 15, the server-side acoustic model storage unit 22 stores all the acoustic models that may be used in this speech recognition system, and has sufficient storage capacity to do so. The server-side acoustic model selection unit 23 is a unit that selects, from the acoustic models stored in the server-side acoustic model storage unit 22, the acoustic model suited to the sensor information received by the server-side receiving unit 21.
The server-side transmitting unit 24 is a unit that transmits the acoustic model selected by the server-side acoustic model selection unit 23 to the speech recognition terminal 2 via the network 5.
Of the components of the speech recognition server 6, the server-side receiving unit 21, the server-side acoustic model storage unit 22, the server-side acoustic model selection unit 23, and the server-side transmitting unit 24 may each be configured as a dedicated circuit, or they may be configured as computer programs that cause a CPU, a network I/O device (such as a network adapter), and a storage device to execute the processing corresponding to each function.
Next, the operation of the speech recognition terminal 2 and the speech recognition server 6 will be described with reference to the drawings. FIG. 2 is a flowchart showing the processing of the speech recognition terminal 2 and the speech recognition server 6 according to Embodiment 1. When the user inputs speech through the microphone 1 (step S101), a speech signal is input to the terminal-side acoustic analysis unit 11 via the input terminal 3. The terminal-side acoustic analysis unit 11 converts it into a digital signal with an A/D converter and calculates a time series of speech feature quantities such as the LPC cepstrum (Linear Predictive Coding Cepstrum) (step S102).
Next, the sensor 12 acquires physical quantities around the microphone 1 (step S103). For example, when the speech recognition terminal 2 is a car navigation system and the sensor 12 is a speed sensor that detects the speed of the vehicle on which the car navigation system is mounted, the vehicle speed corresponds to such a physical quantity. In FIG. 2 the acquisition of sensor information in step S103 is shown as following the acoustic analysis of step S102; needless to say, however, the processing of step S103 may be performed before steps S101 and S102, or simultaneously or in parallel with them. Subsequently, the terminal-side acoustic model selection unit 16 selects the acoustic model trained under the conditions closest to the sensor information obtained by the sensor 12, that is, closest to the environment in which the microphone 1 picks up the speech. Many training conditions are conceivable for acoustic models, and the terminal-side acoustic model storage unit 15 does not store models for all of them. Therefore, when none of the acoustic models stored in the terminal-side acoustic model storage unit 15 was trained under environmental conditions close to those of the microphone 1, an acoustic model is acquired from the speech recognition server 6. Before describing the processing, terms and notation are defined. The sensor information about sensor k under the conditions in which acoustic model m was trained is simply called the "sensor information of acoustic model m". The terminal-side acoustic model storage unit 15 stores M acoustic models, each denoted acoustic model m (m = 1, 2, ..., M). The sensor 12 consists of K sensors, each denoted sensor k (k = 1, 2, ..., K). The sensor information about sensor k under the environmental conditions in which acoustic model m was trained is denoted S_m,k, and the current sensor information of sensor k (the sensor information output in step S103) is denoted x_k. These processes are now described more specifically. First, the terminal-side acoustic model selection unit 16 calculates the distance value D(m) between the sensor information S_m,k of acoustic model m and the sensor information x_k acquired by the sensor 12 (step S104). Let D_k(x_k, S_m,k) denote the distance value between the sensor information x_k of sensor k and the sensor information S_m,k of acoustic model m. As a concrete value of D_k(x_k, S_m,k), the absolute value of the difference of the sensor information, for example, may be used. That is, if the sensor information is a speed, the difference (10 km/h) between the speed at training time (for example, S_m,k = 40 km/h) and the current speed (for example, x_k = 50 km/h) is taken as the distance value D_k(x_k, S_m,k).
The distance value D(m) is calculated from the per-sensor distance values D_k(x_k, S_m,k) as follows:

D(m) = Σ_{k=1}^{K} w_k · D_k(x_k, S_m,k)    (1)

where w_k is a weighting coefficient for each sensor.
Here, the relation between the sensor information as physical quantities and the distance value D(m) is explained. When one item of sensor information is a position (which may be defined by longitude and latitude, or by the distance from a specific place taken as the origin) and another is a speed, the two differ in physical dimension. However, since the contribution of w_k · D_k(x_k, S_m,k) to the distance value can be set appropriately by adjusting the weighting coefficient w_k, the difference in dimension can be ignored without any problem. The same applies when the unit systems differ. For example, when km/h is used as the unit of speed and when mph is used, the sensor information can take different values even for physically the same speed. In such a case, if a weighting coefficient of 1.6 is given to speed values calculated in km/h and a weighting coefficient of 1.0 to speed values calculated in mph, the effect of speed on the distance calculation can be made equal. Next, the terminal-side acoustic model selection unit 16 obtains the minimum value min{D(m)} of the distance values D(m) calculated by equation (1) over all m, and evaluates whether this min{D(m)} is smaller than a predetermined value T (step S105). That is, it determines whether the training conditions of the terminal-side acoustic models stored in the terminal-side acoustic model storage unit 15 include conditions sufficiently close to the current environmental conditions in which the microphone 1 collects sound. The predetermined value T is a value set in advance for making this determination.
When min{D(m)} is smaller than the predetermined value T (step S105: Yes), the process proceeds to step S106. The terminal-side acoustic model selection unit 16 selects the terminal-side acoustic model m as the acoustic model suited to the current environment in which the microphone 1 collects sound (step S106), and the process proceeds to the matching processing (step S112). The subsequent processing is described later.
When min{D(m)} is equal to or larger than the predetermined value T (step S105: No), the process proceeds to step S107. In this case, none of the training conditions of the acoustic models stored in the terminal-side acoustic model storage unit 15 is sufficiently close to the current environmental conditions in which the microphone 1 collects sound. The terminal-side transmitting unit 13 therefore transmits the sensor information to the speech recognition server 6 (step S107).
If the predetermined value T is made larger, min{D(m)} is judged to be smaller than T more often, and step S107 is executed less often. That is, a larger value of T reduces the number of transmissions and receptions over the network 5, which has the effect of suppressing the traffic on the network 5.
Conversely, if the value of T is made smaller, the number of transmissions and receptions over the network 5 increases. In this case, however, speech recognition is performed with an acoustic model whose training conditions have a smaller distance value from the sensor information acquired by the sensor 12, so the accuracy of speech recognition can be improved. The value of T should therefore be determined in consideration of the transmission capacity of the network 5 and the target speech recognition accuracy.
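The terminal-side decision of steps S104 to S107 can be sketched as follows, assuming that D_k is the absolute difference of sensor values as in the example above; the function names and the request_from_server callback are assumptions made only for this illustration.

```python
# Sketch of steps S104-S107: compute D(m) for every local model and compare min D(m) with T.
def weighted_distance(x, s, w):
    # Equation (1): D(m) = sum_k w_k * D_k(x_k, S_m,k), with D_k taken as |x_k - S_m,k|.
    return sum(w[k] * abs(x[k] - s[k]) for k in range(len(x)))

def choose_acoustic_model(local_models, x, w, T, request_from_server):
    # local_models: list of (model, S_m) pairs held by the terminal-side storage unit 15.
    best_model, best_d = None, float("inf")
    for model, s_m in local_models:
        d = weighted_distance(x, s_m, w)
        if d < best_d:
            best_model, best_d = model, d
    if best_d < T:                      # step S105: Yes -> use the local model (step S106)
        return best_model
    return request_from_server(x)       # step S107: send the sensor information to the server
```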
In the speech recognition server 6, the server-side receiving unit 21 receives the sensor information via the network 5 (step S108). The server-side acoustic model selection unit 23 then calculates, in the same way as in step S104, the distance values between the sensor information received by the server-side receiving unit 21 and the environmental conditions under which the acoustic models stored in the server-side acoustic model storage unit 22 were trained, and selects the acoustic model with the smallest distance value (step S109). The server-side transmitting unit 24 then transmits the acoustic model selected by the server-side acoustic model selection unit 23 to the speech recognition terminal 2 (step S110).
The terminal-side receiving unit 14 of the speech recognition terminal 2 receives, via the network 5, the acoustic model transmitted by the server-side transmitting unit 24 (step S111).
Next, the terminal-side matching unit 17 performs matching processing between the speech feature quantities output by the terminal-side acoustic analysis unit 11 and the acoustic model (step S112). Here, the vocabulary entry with the highest score between the standard patterns stored as the acoustic model and the time series of speech feature quantities is taken as the recognition result 4; for example, pattern matching by DP (Dynamic Programming) matching is performed and the candidate with the smallest distance value is output as the recognition result 4. As described above, according to the speech recognition terminal 2 and server 6 of Embodiment 1, even when the speech recognition terminal 2 can store only a small number of acoustic models, the sound collection environment of the microphone 1 is captured by the sensor 12, and speech recognition can be performed with an acoustic model, selected from the many acoustic models stored in the speech recognition server 6, that was trained under environmental conditions close to that sound collection environment.
The speech recognition terminal 2 therefore need not be equipped with a large-capacity storage element, circuit, or storage medium; its hardware configuration can be simplified, and a speech recognition terminal that performs highly accurate speech recognition can be provided. As noted above, the data size of one acoustic model can amount to several hundred kilobytes, depending on the implementation, so the effect of reducing the number of acoustic models that the speech recognition terminal must store is significant.
Sensor information can take continuous values, but normally several values are chosen from that continuum and acoustic models are trained with those values as their sensor information. Suppose now that the sensor 12 consists of a plurality of sensors (a first sensor and a second sensor), that M1 is the number of values chosen as sensor information of the first sensor, and that M2 is the number of values chosen as sensor information of the second sensor for the acoustic models stored by the speech recognition terminal 2 and the speech recognition server 6. The total number of acoustic models stored by the speech recognition terminal 2 and the speech recognition server 6 is then calculated as M1 × M2.
In this case, when M1 < M2, that is, when the number of values chosen as sensor information of the first sensor is smaller than the number chosen for the second sensor, an acoustic model suited to the sound collection environment of the microphone 1 can be selected by making the weighting coefficient for the sensor information of the first sensor smaller than that for the sensor information of the second sensor.
In the above, the speech recognition terminal 2 is provided with the terminal-side acoustic model storage unit 15 and the terminal-side acoustic model selection unit 16, and speech recognition is performed while appropriately selecting between the acoustic models stored by the speech recognition terminal 2 and those stored by the speech recognition server 6. It is not essential, however, for the speech recognition terminal 2 to include the terminal-side acoustic model storage unit 15 and the terminal-side acoustic model selection unit 16. That is, a configuration is of course also possible in which an acoustic model stored by the speech recognition server 6 is always used on the basis of the sensor information acquired by the sensor 12. Even with such a configuration, the feature of this invention is preserved: the storage capacity of the speech recognition terminal 2 is reduced while an acoustic model matched to the sound collection environment of the microphone 1, as captured by the sensor 12, is selected and highly accurate speech recognition is performed.
In addition to the configuration described above, the acoustic model received from the speech recognition server 6 may be newly stored in the terminal-side acoustic model storage unit 15, or stored in place of an acoustic model already held on the speech recognition terminal 2 side. In this way, when the same acoustic model is needed for speech recognition next time, it does not have to be transferred again from the speech recognition server 6, so the transmission load on the network 5 is reduced and the time required for transmission and reception is shortened.
Embodiment 2.
According to the speech recognition terminal of Embodiment 1, when the terminal does not hold an acoustic model corresponding to the sensor information, an acoustic model suited to the sensor information is obtained from the speech recognition server.
Considering the data size per acoustic model, however, transferring an entire acoustic model from the speech recognition server to the speech recognition terminal over the network places a heavy load on the network, and the effect of the time required for transferring the acoustic model data on the overall processing performance cannot be ignored either.
One way to avoid these problems is to design the speech recognition processing so that the data size of the acoustic models is as small as possible. If the acoustic models are small, transferring an acoustic model from the speech recognition server to the speech recognition terminal does not place much load on the network.
Another conceivable method is to cluster acoustic models that are similar to one another and to compute in advance the differences between acoustic models within the same cluster; then, when an acoustic model stored by the speech recognition server is needed, only its difference from an acoustic model stored by the speech recognition terminal is transferred, and the server's acoustic model is synthesized from the acoustic model stored by the terminal and the difference. The speech recognition terminal and server of Embodiment 2 operate on this principle.
FIG. 3 is a block diagram showing the configuration of the speech recognition terminal and server according to Embodiment 2. In the figure, the acoustic model synthesis unit 18 is a unit that synthesizes an acoustic model equivalent to the one stored by the speech recognition server 6 from the content received by the terminal-side receiving unit 14 and an acoustic model stored in the terminal-side acoustic model storage unit 15. The acoustic model difference calculation unit 25 is a unit that calculates the difference between an acoustic model stored in the terminal-side acoustic model storage unit 15 and an acoustic model stored in the server-side acoustic model storage unit 22. The other units given the same reference numerals as in FIG. 1 are the same as in Embodiment 1, and their description is omitted.
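The roles of the acoustic model difference calculation unit 25 and the acoustic model synthesis unit 18 can be sketched as follows, assuming that an acoustic model is represented by its mean and variance vectors and that the difference is taken element-wise; the actual encoding of the difference data is not specified by this sketch.

```python
import numpy as np

def model_difference(server_model, local_model):
    # Unit 25 (server side): difference between two models of the same cluster,
    # taken element-wise over the mean and variance vectors (an assumed encoding).
    return {"d_mu": server_model["mu"] - local_model["mu"],
            "d_var": server_model["var"] - local_model["var"]}

def synthesize_model(local_model, diff):
    # Unit 18 (terminal side): reconstruct the server's model from the local model
    # and the received difference data.
    return {"mu": local_model["mu"] + diff["d_mu"],
            "var": local_model["var"] + diff["d_var"]}
```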
As mentioned above, the speech recognition terminal 2 and server 6 of Embodiment 2 are characterized in that the acoustic models are clustered in advance, so the clustering method for the acoustic models is described first. The clustering of the acoustic models is completed before speech recognition processing is performed by the speech recognition terminal 2 and the server 6. An acoustic model represents the statistics of the speech feature quantities of each word (or phoneme or syllable), obtained from a large amount of speech uttered by many speakers. The statistics consist of a mean vector μ = {μ(1), μ(2), ..., μ(K)} and a diagonal covariance vector Σ = {σ(1)², σ(2)², ..., σ(K)²}. The acoustic model of phoneme p is therefore denoted N_p{μ_p, Σ_p}.
The clustering of the acoustic models is performed with an LBG algorithm modified so as to split the cluster with the largest VQ distortion successively, as described below. FIG. 4 is a flowchart showing the clustering processing of the acoustic models.
First, an initial cluster is created (step S201). Here, one initial cluster is created from all the acoustic models that may be used in this speech recognition system. Equations (2) and (3) are used to calculate the statistics of the initial cluster r, where N denotes the number of distributions belonging to the cluster and K the number of dimensions of the speech feature quantities.
[Equations (2) and (3), reproduced only as images in the original publication, give the mean vector and the diagonal covariance of the initial cluster r computed from the N distributions belonging to it.]
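Since equations (2) and (3) are reproduced only as images, the following sketch shows one standard way of pooling N diagonal-Gaussian distributions into a single cluster statistic; it is an assumption about the content of those equations rather than a reproduction of them.

```python
import numpy as np

def pool_cluster_statistics(mus, variances):
    # mus, variances: arrays of shape (N, K) for the N distributions in the cluster.
    # Cluster mean: average of the member means (assumed form of equation (2)).
    mu_r = np.mean(mus, axis=0)
    # Cluster variance: average second moment minus the squared cluster mean
    # (assumed form of equation (3)).
    var_r = np.mean(variances + mus ** 2, axis=0) - mu_r ** 2
    return mu_r, var_r
```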
Next, it is determined whether the clustering processing executed so far has already produced the required number of clusters (step S202). The required number of clusters is decided when the speech recognition system is designed. Generally speaking, the larger the number of clusters, the smaller the distance between acoustic models within the same cluster. As a result, the amount of information in the difference data becomes smaller, and the amount of difference data transmitted and received over the network 5 is also reduced. In particular, when the total number of acoustic models stored by the speech recognition terminal 2 and the server 6 is large, the number of clusters should be made large.
However, simply increasing the number of clusters is not always appropriate, for the following reason. In Embodiment 2, an acoustic model stored by the speech recognition terminal 2 (hereinafter called a local acoustic model) and a difference are combined to synthesize the acoustic model stored by the speech recognition server 6, that is, to obtain an acoustic model equivalent to the one stored by the speech recognition server 6.
The difference used here is combined with a local acoustic model, and must therefore have been computed between that local acoustic model and an acoustic model belonging to the same cluster. Since the acoustic model synthesized from the difference corresponds to the sensor information, the most efficient situation is one in which the acoustic model corresponding to the sensor information and the local acoustic model are classified into the same cluster.
As the number of clusters increases, however, the number of acoustic models belonging to each cluster decreases, and the acoustic models become fragmented over many clusters. In that case, the number of acoustic models belonging to the same cluster as a local acoustic model stored by the speech recognition terminal 2 also tends to decrease, and the probability that the acoustic model corresponding to the sensor information and a local acoustic model stored by the speech recognition terminal 2 belong to the same cluster becomes smaller.
As a result, situations arise in which a difference between acoustic models belonging to different clusters cannot be prepared, or in which, even if a difference is prepared, its data size is not sufficiently small.
For these reasons, when the number of local acoustic models cannot be made large, that is, when sufficient capacity cannot be secured in the storage device, such as the memory or hard disk, mounted on the speech recognition terminal 2, the number of clusters should not be made large.
If the required number of clusters is two or more, the number of clusters immediately after the initial cluster is created is one, so the process proceeds to step S203 (step S202: No). If the processing described below has already produced a plurality of clusters and their number is equal to or greater than the required number, the process ends (step S202: Yes). Next, the cluster with the maximum VQ distortion is split (step S203). Here the cluster r_max with the largest VQ distortion (the initial cluster in the first loop) is split into two clusters r1 and r2, which increases the number of clusters. The statistics of the clusters after the split are calculated by the following equations, where Δ(k) is a small value predetermined for each dimension of the speech feature quantities.
μ_r1(k) = μ_rmax(k) + Δ(k)   (k = 1, 2, ..., K)   (4)
μ_r2(k) = μ_rmax(k) − Δ(k)   (k = 1, 2, ..., K)   (5)
σ_r1(k)² = σ_rmax(k)²   (k = 1, 2, ..., K)   (6)
σ_r2(k)² = σ_rmax(k)²   (k = 1, 2, ..., K)   (7)

Next, a distance value is calculated between the statistics of each acoustic model and the statistics of each cluster, including the clusters produced by the split in step S203 (step S204). One acoustic model and one of the clusters obtained so far are selected in turn and the distance between them is calculated; however, the distance is not calculated again for a combination of acoustic model and cluster for which it has already been obtained. To implement this control, each cluster may hold a flag marking the acoustic models whose distance has already been calculated. As the distance between the statistics of an acoustic model and the statistics of a cluster, the Bhattacharyya distance defined by equation (8) can be used, for example. [Equation (8), the Bhattacharyya distance between the two sets of statistics, is given as an image in the original document.]
In equation (8), the parameters with suffix 1 are the statistics of the acoustic model and the parameters with suffix 2 are the statistics of the cluster. Based on the distance values obtained in this way, each acoustic model is assigned to the cluster with the smallest distance value. The distance between the statistics of an acoustic model and the statistics of a cluster may also be calculated by a method other than equation (8); even in that case, it is desirable to adopt a formula under which acoustic models whose distance values according to equation (1) are close tend to be assigned to the same cluster, although this is not essential.
Next, the codebook of each cluster is updated (step S205). For this purpose, the representative values of the statistics of the acoustic models belonging to the cluster are calculated using equations (2) and (3). In addition, the distances between the statistics of the acoustic models belonging to the cluster and the representative values are accumulated using equation (8), and this sum is defined as the current VQ distortion of the cluster.
Next, the evaluation value of the clustering is calculated (step S206). Here the sum of the VQ distortions of all clusters is taken as the evaluation value of the clustering. Steps S204 to S207 form a loop that is executed several times, and the evaluation value calculated in step S206 is kept until the next pass through the loop. The difference between this evaluation value and the one calculated in the previous pass is then computed, and it is judged whether this difference is below a predetermined threshold (step S207). If the difference is below the threshold, every acoustic model has been assigned to an appropriate cluster among those obtained so far, and the process returns to step S202 (step S207: Yes). If the difference is at or above the threshold, some acoustic models do not yet belong to an appropriate cluster, and the process returns to step S204 (step S207: No).
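As a rough illustration of the clustering procedure of steps S201 to S207, the following Python sketch clusters acoustic-model statistics (mean and variance vectors) by repeatedly splitting the cluster with the largest VQ distortion and reassigning models by Bhattacharyya distance. The diagonal-Gaussian form assumed for equation (8), the per-dimension averaging assumed for the representative values of equations (2) and (3), the offset DELTA and the convergence threshold EPS are illustrative assumptions, not values taken from the embodiment.

```python
import math

DELTA = 0.01   # assumed small per-dimension offset used when splitting a cluster
EPS = 1e-3     # assumed convergence threshold for the evaluation value (step S207)

def bhattacharyya(mu1, var1, mu2, var2):
    """Bhattacharyya distance between diagonal Gaussians (assumed form of equation (8))."""
    return sum(0.25 * (a - b) ** 2 / (va + vb)
               + 0.5 * math.log((va + vb) / (2.0 * math.sqrt(va * vb)))
               for a, va, b, vb in zip(mu1, var1, mu2, var2))

def centroid(members, models):
    """Cluster representative: per-dimension average of member means and variances."""
    K = len(models[0][0])
    n = len(members)
    mu = [sum(models[i][0][k] for i in members) / n for k in range(K)]
    var = [sum(models[i][1][k] for i in members) / n for k in range(K)]
    return mu, var

def cluster_acoustic_models(models, n_clusters):
    """models: list of (mean_vector, variance_vector) statistics; returns member index lists."""
    clusters = [list(range(len(models)))]                          # step S201: initial cluster
    while len(clusters) < n_clusters:                              # step S202
        # step S203: split the cluster with the largest VQ distortion into two seed centroids
        cents = [centroid(c, models) for c in clusters]
        distortions = [sum(bhattacharyya(*models[i], *cents[ci]) for i in c)
                       for ci, c in enumerate(clusters)]
        worst = distortions.index(max(distortions))
        mu, var = cents[worst]
        cents[worst:worst + 1] = [([m + DELTA for m in mu], var),  # equations (4), (6)
                                  ([m - DELTA for m in mu], var)]  # equations (5), (7)
        prev_eval = None
        while True:                                                # steps S204 to S207
            clusters = [[] for _ in cents]
            for i, model in enumerate(models):                     # assign to nearest centroid
                best = min(range(len(cents)),
                           key=lambda c: bhattacharyya(*model, *cents[c]))
                clusters[best].append(i)
            cents = [centroid(c, models) if c else cents[ci]       # step S205: codebook update
                     for ci, c in enumerate(clusters)]
            evaluation = sum(bhattacharyya(*models[i], *cents[ci]) # step S206: total VQ distortion
                             for ci, c in enumerate(clusters) for i in c)
            if prev_eval is not None and abs(prev_eval - evaluation) < EPS:
                break                                              # step S207: converged
            prev_eval = evaluation
    return clusters
```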
This completes the clustering process. Next, the speech recognition processing performed in the speech recognition terminal 2 and the server 6 of embodiment 2 on the basis of the acoustic models clustered in this way is described with reference to the drawings. Fig. 5 is a flowchart of the operation of the speech recognition terminal 2 and the server 6. In steps S101 to S105, as in embodiment 1, speech is input from the microphone 1, acoustic analysis and sensor-information acquisition are performed, and it is then judged whether a local acoustic model suited to the sensor information exists. If even the local acoustic model whose distance to the sensor information is smallest (the number or name identifying this local acoustic model is called m) does not bring that distance below the predetermined threshold T, the process proceeds to step S208 (step S105: No).
Next, the terminal-side transmission unit 13 transmits the sensor information and the information m identifying the local acoustic model to the speech recognition server 6 (step S208).
The server-side reception unit 21 receives the sensor information and m (step S209), and the server-side acoustic model selection unit 23 selects the acoustic model best suited to the received sensor information (step S109). It is then judged whether this acoustic model and the local acoustic model m belong to the same cluster (step S210). If they belong to the same cluster, the process proceeds to step S211 (step S210: Yes); the acoustic model difference calculation unit 25 calculates the difference between this acoustic model and the local acoustic model m (step S211), and the server-side transmission unit 24 transmits the difference to the speech recognition terminal 2 (step S212).
The difference can be obtained, for example, from the differences between the values of the individual components of the speech feature parameters and from offset shifts (differences in the storage positions of the respective elements). Techniques for computing a difference between two pieces of binary data (for example between binary files) are well known and may be used here. Moreover, the method of embodiment 2 places no special requirements on the data structure of the acoustic model, so it is also possible to design a data structure from which differences are easy to compute.
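The description leaves the concrete difference encoding open. The following sketch simply assumes that an acoustic model is a flat array of numeric parameters in a fixed layout and records only the positions whose values differ; the tolerance and the pair-list encoding are illustrative assumptions, not part of the embodiment.

```python
def compute_model_diff(local_params, selected_params, tol=0.0):
    """Record (index, new_value) pairs where the selected model differs from the local one.

    Assumes both models share the same parameter layout (same cluster), so a
    positional comparison is meaningful (server side, step S211)."""
    return [(i, b) for i, (a, b) in enumerate(zip(local_params, selected_params))
            if abs(a - b) > tol]

def apply_model_diff(local_params, diff):
    """Synthesize the server-selected model from the local model m and the received
    difference (terminal side, corresponds to step S214)."""
    params = list(local_params)
    for i, value in diff:
        params[i] = value
    return params
```

Under this assumption the server would send the output of compute_model_diff only when the two models fall in the same cluster (step S210: Yes), and the full parameter array otherwise.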
On the other hand, if they do not belong to the same cluster, the process proceeds directly to step S212 (step S210: No). In this case the server-side transmission unit 24 transmits the selected acoustic model itself rather than a difference (step S212).
In the processing described above, it is assumed that the difference is generated with respect to the local acoustic model that the speech recognition terminal 2 judged most suitable for the sensor information (the acoustic model judged in step S105 to have the smallest distance to the sensor information); that is why the information identifying this local acoustic model m was transmitted beforehand in step S208. Alternatively, the speech recognition server 6 may keep track of (or manage) which local acoustic models the speech recognition terminal 2 stores; after selecting the acoustic model close to the sensor information, the server then chooses, from the managed local acoustic models, one belonging to the same cluster as the selected acoustic model and calculates the difference between them. In this case the speech recognition terminal 2 must be told which local acoustic model the difference calculated by the speech recognition server 6 is based on, so in step S212 the server also transmits information identifying the local acoustic model used as the basis of the difference calculation.
Next, the terminal-side reception unit 14 of the speech recognition terminal 2 receives the difference data or the acoustic model (step S213). If the received data is a difference, the acoustic model synthesis unit 18 synthesizes an acoustic model from the local acoustic model m on which the difference is based and the difference itself (step S214). The terminal-side matching unit 17 then performs pattern matching between the standard patterns of the acoustic model and the speech feature quantities and outputs the vocabulary item with the highest likelihood as the recognition result 4.
As is clear from the above, in embodiment 2 only the difference between a local acoustic model stored in the speech recognition terminal 2 and the acoustic model stored in the speech recognition server 6 is transmitted and received over the network. Therefore, in addition to the effect of embodiment 1 that highly accurate speech recognition based on acoustic models matched to the sound collection environment of the microphone 1 can be performed even when the storage capacity of the speech recognition terminal 2 is small, the load placed on the network is reduced and the time required for data transfer is shortened, which improves processing performance.
Embodiment 3.
The speech recognition terminal 2 of embodiments 1 and 2 performs speech recognition matched to the sound collection environment of the microphone 1 by receiving, via the network 5, an acoustic model stored in the speech recognition server 6 even when the terminal does not hold the acoustic model needed for the recognition processing. Instead of transmitting and receiving acoustic models, however, the speech feature quantities may be transmitted and received. The speech recognition terminal and server of embodiment 3 operate on this principle.
Fig. 6 is a block diagram showing the configuration of the speech recognition terminal and server of embodiment 3. Parts given the same reference numerals as in Fig. 1 are the same as in embodiment 1, and their description is omitted. In embodiment 3 the speech recognition terminal 2 and the speech recognition server 6 are again connected via the network 5. Embodiment 3 differs from embodiment 1, however, in that the speech feature quantities and the sensor information are transmitted from the speech recognition terminal 2 to the speech recognition server 6 and in that the recognition result 7 is output from the speech recognition server 6. In the speech recognition server 6, the server-side matching unit 27 is a component that matches the speech feature quantities against the acoustic model in the same way as the terminal-side matching unit 17 of embodiment 1.
Next, the operation of the speech recognition terminal 2 and the speech recognition server 6 in embodiment 3 is described with reference to the drawings. Fig. 7 is a flowchart showing the processing of the speech recognition terminal 2 and the speech recognition server 6 according to embodiment 3. In this flowchart, the steps given the same reference numerals as in Fig. 2 are the same as in embodiment 1, so the description below concentrates on the steps that are specific to this flowchart.
First, when the user speaks into the microphone 1, a speech signal is input to the speech recognition terminal 2 via the input terminal 3 (step S101); the acoustic analysis unit 11 calculates the time series of speech feature quantities from the input speech signal (step S102), and the sensor 12 collects sensor information (step S103). Next, the terminal-side transmission unit 13 transfers the sensor information and the speech feature quantities to the speech recognition server 6 via the network 5 (step S301), and the server-side reception unit 21 takes them into the speech recognition server 6 (step S302). The server-side acoustic model storage unit 22 of the speech recognition server 6 holds acoustic models prepared in advance for a variety of sensor information; the server-side acoustic model selection unit 23 calculates, by equation (1), the distance between the sensor information acquired by the server-side reception unit 21 and the sensor information of each acoustic model, and selects the acoustic model with the smallest distance value (step S109).
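Equation (1) is defined in an earlier part of the description and is not reproduced here; the sketch below simply assumes it to be a (possibly weighted) Euclidean distance between the sensor vector received from the terminal and the sensor vector attached to each stored acoustic model, and shows the selection of step S109 under that assumption.

```python
def select_acoustic_model(sensor_info, stored_models, weights=None):
    """Return the stored acoustic model whose associated sensor information is closest
    to the received sensor information (step S109).

    stored_models: list of dicts such as {"name": ..., "sensor": [...], "params": ...}
    weights: optional per-dimension weights; equal weighting is assumed otherwise."""
    if weights is None:
        weights = [1.0] * len(sensor_info)

    def distance(model):  # assumed form of equation (1)
        return sum(w * (s - m) ** 2
                   for w, s, m in zip(weights, sensor_info, model["sensor"])) ** 0.5

    return min(stored_models, key=distance)
```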
The server-side matching unit 27 then performs pattern matching between the standard patterns of the selected acoustic model and the speech feature quantities acquired by the server-side reception unit 21, and outputs the vocabulary item with the highest likelihood as the recognition result 7 (step S303). This processing is the same as the matching processing of embodiment 1 (step S112), so a detailed description is omitted.
As described above, with the speech recognition terminal 2 and server 6 of embodiment 3, the speech recognition terminal 2 only calculates the speech feature quantities and acquires the sensor information; based on this sensor information, the speech recognition server 6 selects an appropriate acoustic model from the acoustic models it stores and performs the speech recognition. As a result, the speech recognition terminal 2 needs no storage device, program or circuit for holding acoustic models, which simplifies its configuration.
Furthermore, since only the speech feature quantities and the sensor information are transferred to the speech recognition server 6 via the network 5, speech recognition can be performed without placing a heavy transmission load on the network 5.
As mentioned earlier, the data size of an acoustic model can reach several hundred kilobytes. When the bandwidth of the network is limited, attempting to transmit the acoustic model itself may therefore exceed the available transmission capacity. The speech feature quantities, by contrast, can be transferred comfortably in real time as long as a bandwidth of at most about 20 kbps is available. A client-server speech recognition system with a very low network load can thus be built, while still performing highly accurate speech recognition matched to the sound collection environment of the microphone 1.
Unlike embodiment 1, embodiment 3 outputs the recognition result 7 from the speech recognition server 6 rather than from the speech recognition terminal 2. This configuration is sufficient, for example, when the speech recognition terminal 2 is browsing the Internet and a URL (Uniform Resource Locator) is entered by voice: the speech recognition server 6 obtains the Web page determined by this URL and sends it to the speech recognition terminal 2 for display.
It is also possible, as in embodiment 1, to configure the system so that the speech recognition terminal 2 outputs the recognition result. In that case the speech recognition terminal 2 is provided with a terminal-side reception unit and the speech recognition server 6 with a server-side transmission unit; the output of the matching unit 27 is sent from the transmission unit of the speech recognition server 6 to the reception unit of the speech recognition terminal 2 via the network 5, and from that reception unit to the desired output destination.
Embodiment 4.
Instead of transmitting and receiving acoustic models as in embodiments 1 and 2, or speech feature quantities as in embodiment 3, it is also possible to transmit and receive speech data. The speech recognition terminal and server of embodiment 4 operate on this principle.
Fig. 8 is a block diagram showing the configuration of the speech recognition terminal and server of embodiment 4. Parts given the same reference numerals as in Fig. 1 are the same as in embodiment 1, and their description is omitted. In embodiment 4 the speech recognition terminal 2 and the speech recognition server 6 are again connected via the network 5. Embodiment 4 differs from embodiment 1 in that speech data and sensor information are transmitted from the speech recognition terminal 2 to the speech recognition server 6 and in that the recognition result 7 is output from the speech recognition server 6.
The speech digital processing unit 19 is a component that converts the speech input from the input terminal 3 into digital data, and comprises an A/D converter or an equivalent program or circuit. It may additionally comprise a dedicated circuit that converts the A/D-converted samples into a format suitable for transmission over the network 5, or a computer program performing the equivalent processing together with a central processing unit that executes it. The server-side acoustic analysis unit 28 is a component that calculates the speech feature quantities from the input speech on the speech recognition server 6, and has the same function as the terminal-side acoustic analysis unit 11 of embodiments 1 and 2.
Next, the operation of the speech recognition terminal 2 and the speech recognition server 6 in embodiment 4 is described with reference to the drawings. Fig. 9 is a flowchart showing the processing of the speech recognition terminal 2 and the speech recognition server 6 according to embodiment 4. In this flowchart, the steps given the same reference numerals as in Fig. 2 are the same as in embodiment 1, so the description below concentrates on the steps that are specific to this flowchart.
First, when the user speaks into the microphone 1, a speech signal is input to the speech recognition terminal 2 via the input terminal 3 (step S101), and the speech digital processing unit 19 samples the speech signal input in step S101 by A/D conversion (step S401). It is preferable, though not essential, for the speech digital processing unit 19 to encode or compress the speech data rather than performing A/D conversion alone. Concrete speech compression schemes include u-law 64 kbps PCM (Pulse Coded Modulation, ITU-T G.711) used on digital public switched telephone networks (ISDN and the like), adaptive differential PCM (ADPCM, ITU-T G.726) used in PHS, and the VSELP (Vector Sum Excited Linear Prediction) and CELP (Code Excited Linear Prediction) schemes used in mobile phones. One of these schemes can be selected according to the available bandwidth and traffic of the communication network: for example, u-law PCM is considered suitable when the bandwidth is 64 kbps, ADPCM at 16 to 40 kbps, VSELP at 11.2 kbps, and CELP at 5.6 kbps. Applying other coding schemes, however, does not lose the features of this invention.
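A trivial selection rule along the lines of the bandwidth guideline quoted above might look as follows; the cut-off values come from the text, and the mapping is only a guideline, not a normative part of the embodiment.

```python
def choose_speech_codec(available_kbps):
    """Pick a speech coding scheme for the uplink based on the usable bandwidth,
    following the guideline figures quoted in the description."""
    if available_kbps >= 64:
        return "u-law 64 kbps PCM (ITU-T G.711)"
    if available_kbps >= 16:
        return "ADPCM (ITU-T G.726)"   # suited to the 16 to 40 kbps range
    if available_kbps >= 11.2:
        return "VSELP"
    return "CELP"                      # around 5.6 kbps
```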
Next, sensor information is collected by the sensor 12 (step S103); the collected sensor information and the encoded speech data are arranged into a data format such as the one shown in Fig. 10 and transferred by the terminal-side transmission unit 13 to the speech recognition server 6 via the network 5 (step S402).
In Fig. 10, a frame number representing the processing time of the speech data is stored in field 701. This frame number is determined uniquely, for example on the basis of the sampling time of the speech data. Here "determined uniquely" includes being determined on the basis of a relative time agreed between the speech recognition terminal 2 and the speech recognition server 6, meaning that different relative times are given different frame numbers. Alternatively, an absolute time may be supplied from a clock external to the speech recognition terminal 2 and the speech recognition server 6, and the frame number determined uniquely from that time. To calculate a frame number from a time, for example, the year (four digits are preferable), month (two digits for the range 1 to 12), day (two digits for the range 1 to 31), hour (two digits for the range 0 to 23), minute (two digits for the range 0 to 59), second (two digits for the range 0 to 59) and millisecond (three digits for the range 0 to 999) may each be padded to the stated number of digits and concatenated in this order as a digit string, or the year, month, day, hour, minute, second and millisecond values may be bit-packed to obtain a single value.
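The two variants described above can be sketched as follows; the digit widths follow the text, while the bit widths used in the packed variant are an assumption, since the description leaves the exact packing open.

```python
from datetime import datetime

def frame_number_decimal(t: datetime) -> int:
    """Concatenate year(4) month(2) day(2) hour(2) minute(2) second(2) millisecond(3)."""
    return int(f"{t.year:04d}{t.month:02d}{t.day:02d}"
               f"{t.hour:02d}{t.minute:02d}{t.second:02d}{t.microsecond // 1000:03d}")

def frame_number_packed(t: datetime) -> int:
    """Alternative: bit-pack the same fields into a single integer (assumed field widths)."""
    return (t.year << 36) | (t.month << 32) | (t.day << 27) | \
           (t.hour << 22) | (t.minute << 16) | (t.second << 10) | (t.microsecond // 1000)
```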
Field 702 of the data format of Fig. 10 stores the data size occupied by the sensor information. For example, if the sensor information is a 32-bit value, the size of the area needed to store it (4 bytes) is expressed in bytes and the value 4 is stored. When the sensor 12 consists of several sensors, the size of the array area needed to store all their readings is stored. Field 703 is the area in which the sensor information acquired by the sensor 12 in step S103 is stored; when the sensor 12 consists of several sensors, an array of sensor readings is stored in field 703, and the data size of field 703 matches the size held in field 702. Field 704 stores the speech data size. The transmission unit 13 may divide the speech data into a plurality of packets, each with the same structure as the data format of Fig. 10; in that case, field 704 stores the size of the speech data contained in the respective packet. Division into a plurality of packets is discussed again later. Finally, the speech data itself is stored in field 705.
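A packet in the format of Fig. 10 (fields 701 to 705) could be serialized and parsed as in the following sketch; the byte order and the integer widths are assumptions, since the description only specifies which fields exist and that the sizes are expressed in bytes.

```python
import struct

def pack_voice_packet(frame_number, sensor_bytes, voice_bytes):
    """Fields 701-705: frame number, sensor-data size, sensor data, voice-data size, voice data."""
    return (struct.pack(">Q", frame_number)        # 701: frame number (assumed 8 bytes)
            + struct.pack(">I", len(sensor_bytes)) # 702: sensor information size in bytes
            + sensor_bytes                          # 703: sensor information
            + struct.pack(">I", len(voice_bytes))   # 704: voice data size in bytes
            + voice_bytes)                           # 705: voice data

def unpack_voice_packet(packet):
    frame_number, = struct.unpack_from(">Q", packet, 0)
    sensor_len, = struct.unpack_from(">I", packet, 8)
    sensor_bytes = packet[12:12 + sensor_len]
    offset = 12 + sensor_len
    voice_len, = struct.unpack_from(">I", packet, offset)
    voice_bytes = packet[offset + 4:offset + 4 + voice_len]
    return frame_number, sensor_bytes, voice_bytes
```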
When an upper limit on the packet size is imposed by the characteristics of the network 5, the terminal-side transmission unit 13 divides the speech data input via the input terminal 3 into a plurality of packets. In the data format of Fig. 10, the frame number stored in field 701 is information representing the processing time of the speech data, and it is determined from the sampling time of the speech data contained in each packet. As already mentioned, field 704 stores the size of the speech data contained in each packet. When the outputs of the sensors making up the sensor 12 change from moment to moment within a short time, the sensor information stored in field 703 also differs between packets. For example, when the speech recognition terminal 2 is an in-vehicle speech recognition device and the sensor 12 is a sensor that measures the level of background noise (such as a microphone separate from the microphone 1), the level of the background noise changes markedly if the vehicle enters or leaves a tunnel in the middle of the user's utterance. In such a case, transmitting packets in the data format of Fig. 10 makes it possible to reflect the sensor information appropriately even in the middle of an utterance. For this purpose it is desirable that, when the sensor information changes greatly during an utterance, the terminal-side transmission unit 13 split the speech data at the moment the sensor information changes, irrespective of the characteristics of the network 5, and transmit packets containing the different sensor information.
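The splitting rule just described (start a new packet whenever the sensor reading changes appreciably during an utterance, in addition to any size limit imposed by the network) can be sketched as follows; the change threshold and the chunk size are illustrative assumptions.

```python
def split_into_packets(samples, sensor_readings, chunk, change_threshold):
    """Yield (sensor_value, voice_chunk) pairs, cutting a new packet whenever the sensor
    value moves by more than change_threshold (e.g. entering or leaving a tunnel) or the
    chunk reaches the network's size limit."""
    current_sensor = sensor_readings[0]
    buffer = []
    for sample, sensor in zip(samples, sensor_readings):
        if abs(sensor - current_sensor) > change_threshold or len(buffer) >= chunk:
            if buffer:
                yield current_sensor, buffer
            buffer = []
            current_sensor = sensor
        buffer.append(sample)
    if buffer:
        yield current_sensor, buffer
```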
Continuing with the operation of the speech recognition terminal 2 and the speech recognition server 6: the server-side reception unit 21 takes the sensor information and the speech data into the speech recognition server 6 (step S403). The server-side acoustic analysis unit 28 acoustically analyses the received speech data and calculates the time series of speech feature quantities (step S404). The server-side acoustic model selection unit 23 then selects the most appropriate acoustic model on the basis of the acquired sensor information (step S109), and the server-side matching unit 26 matches the standard patterns of this acoustic model against the speech feature quantities (step S405).
As is clear from the above, in embodiment 4 the speech recognition terminal 2 transfers the sensor information and the speech data to the speech recognition server 6, so highly accurate speech recognition based on an acoustic model suited to the sound collection environment can be performed without the speech recognition terminal 2 performing any acoustic analysis.
Consequently, the speech recognition function can be realized without providing the speech recognition terminal 2 with any special components, circuits or computer programs for speech recognition. Moreover, according to embodiment 4 the sensor information is transmitted for every frame, so even if the environmental conditions under which the microphone 1 collects sound change abruptly during an utterance, an appropriate acoustic model can be selected for each frame and speech recognition performed accordingly.
The method of dividing the transmission from the speech recognition terminal 2 into a plurality of frames can also be applied to the transmission of the speech feature quantities in embodiment 3. Since the speech feature quantities form a time series, they should be divided into frames in time-series order; if the sensor information at the corresponding time of the series is stored in each frame as in embodiment 4, and the speech recognition server 6 selects a suitable acoustic model on the basis of the latest sensor information contained in each frame, the accuracy of speech recognition can be improved further.
Embodiment 5.
In the speech recognition systems of embodiments 1 to 4, speech recognition adapted to the actual environment is performed by selecting the acoustic models stored in the speech recognition terminal 2 and the server 6 on the basis of the usage conditions acquired by the sensor 12 of the speech recognition terminal 2. It is also conceivable, however, to select the acoustic model by combining the conditions obtained by the sensor 12 with additional information obtained, for example, from the Internet. The speech recognition system of embodiment 5 has this feature.
As stated above, the feature of embodiment 5 is that the acoustic model is selected by combining additional information obtained from the Internet with the sensor information, so it can be combined with any of the speech recognition systems of embodiments 1 to 4, with the same effect in each case. Here, as an example, the case in which additional information obtained from the Internet is combined with the speech recognition system of embodiment 1 is described.
Fig. 11 is a block diagram showing the configuration of the speech recognition system of embodiment 5. As the figure shows, the speech recognition system of embodiment 5 is the speech recognition system of embodiment 1 with an Internet information acquisition unit 29 added; the components given the same reference numerals as in Fig. 1 are the same as in embodiment 1 and are not described again. The Internet information acquisition unit 29 is a component that acquires additional information via the Internet; concretely, it has functions equivalent to an Internet browser that fetches Web pages by HTTP (HyperText Transfer Protocol). Furthermore, for the acoustic models stored by the speech recognition server 6 of embodiment 5, the environmental conditions under which each acoustic model was trained are expressed by both sensor information and additional information.
The additional information is, for example, weather information or traffic information. Web sites that provide weather information and traffic information exist on the Internet, and from them the weather conditions, congestion information, road-work status and so on of each area can be obtained. To perform more accurate speech recognition using such additional information, acoustic models matched to the obtainable additional information are prepared. For example, when the additional information is weather information, acoustic models are trained taking into account the background noise caused by rain, wind and the like; when it is traffic information, acoustic models are trained taking into account the background noise caused by road works and the like.
Next, the operation of the speech recognition terminal 2 and the server 6 of embodiment 5 is described. Fig. 12 is a flowchart showing the operation of the speech recognition terminal 2 and the server 6 according to embodiment 5. The only difference between the flowchart of Fig. 12 and that of Fig. 2 is the presence of step S501, so the description below concentrates on the processing of step S501.
After the speech recognition server 6 receives the sensor information (step S108), the Internet information acquisition unit 29 obtains from the Internet information that affects the sound collected by the microphone 1 connected to the speech recognition terminal 2 (step S501). For example, when the sensor 12 includes a GPS antenna, the sensor information contains position information giving the location of the speech recognition terminal 2 and the microphone 1; based on this position information, the Internet information acquisition unit 29 obtains from the Internet additional information, such as weather information and traffic information, for the place where the speech recognition terminal 2 and the microphone 1 are located.
The server-side acoustic model selection unit 23 then selects an acoustic model on the basis of both the sensor information and the additional information. Concretely, it is first judged whether the additional information for the current location of the speech recognition terminal 2 and the microphone 1 matches the additional information of each acoustic model; from among the acoustic models whose additional information matches, the acoustic model whose distance value for the sensor information, calculated by equation (1) shown in embodiment 1, is smallest is then selected.
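A sketch of this two-stage selection, with the same assumed form of equation (1) as before and an assumed fallback when no stored model carries matching additional information, is:

```python
def select_model_with_additional_info(sensor_info, additional_info, stored_models):
    """Two-stage selection of embodiment 5: restrict the candidates to models whose attached
    additional information (weather, traffic, etc., step S501) matches, then pick the
    sensor-information nearest model among them (step S109)."""
    candidates = [m for m in stored_models if m["additional"] == additional_info]
    if not candidates:            # assumed fallback: ignore additional info if nothing matches
        candidates = stored_models

    def distance(model):          # assumed form of equation (1)
        return sum((s - v) ** 2 for s, v in zip(sensor_info, model["sensor"])) ** 0.5

    return min(candidates, key=distance)
```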
The subsequent processing is the same as in embodiment 1 and is not described again.
As is clear from the above, with the speech recognition system of embodiment 5, even when the conditions under which an acoustic model was trained cannot be fully expressed by sensor information alone, they can be expressed with the help of additional information, so a more appropriate acoustic model can be selected for the sound collection environment of the microphone 1. As a result, the accuracy of speech recognition can be improved.
In the above, acquiring the additional information via the Internet has been described, but the essential significance of using additional information is to select the acoustic model on the basis of those environmental factors degrading speech recognition accuracy that cannot be expressed by sensor information. The means of obtaining such additional information is therefore not limited to the Internet; for example, a dedicated system or a dedicated computer for providing the additional information may be prepared.
Industrial Applicability
As described above, the speech recognition system, terminal and server according to the present invention are useful for performing highly accurate speech recognition even when the place of use changes, and are particularly suitable for providing a speech recognition function in devices such as car navigation systems and mobile phones, in which the capacity of the storage devices that can be installed is limited by restrictions on housing size, weight and price.

Claims

1. A speech recognition system in which a speech recognition server and a plurality of speech recognition terminals are connected via a network, wherein
each of the speech recognition terminals comprises:
an input terminal to which an external microphone is connected and through which a speech signal collected by the external microphone is input;
client-side acoustic analysis means for calculating speech feature quantities from the speech signal input from the input terminal;
a sensor for detecting sensor information characterizing the noise superimposed on the speech signal;
client-side transmission means for transmitting the sensor information to the speech recognition server via the network;
client-side reception means for receiving an acoustic model from the speech recognition server; and
client-side matching means for matching the acoustic model against the speech feature quantities; and
the speech recognition server comprises:
server-side reception means for receiving the sensor information transmitted by the client-side transmission means;
server-side acoustic model storage means for storing a plurality of acoustic models;
server-side acoustic model selection means for selecting, from the plurality of acoustic models, an acoustic model matching the sensor information; and
server-side transmission means for transmitting the acoustic model selected by the server-side acoustic model selection means to the speech recognition terminal.
2. A speech recognition system in which a speech recognition server and a plurality of speech recognition terminals are connected via a network, wherein
each of the speech recognition terminals comprises:
an input terminal to which an external microphone is connected and through which a speech signal collected by the external microphone is input;
client-side acoustic analysis means for calculating speech feature quantities from the speech signal input from the input terminal;
a sensor for detecting sensor information characterizing the noise superimposed on the speech signal; and
client-side transmission means for transmitting the sensor information and the speech feature quantities to the speech recognition server via the network; and
the speech recognition server comprises:
server-side reception means for receiving the sensor information and the speech feature quantities;
server-side acoustic model storage means for storing a plurality of acoustic models;
server-side acoustic model selection means for selecting, from the plurality of acoustic models, an acoustic model matching the sensor information; and
server-side matching means for matching the acoustic model selected by the server-side acoustic model selection means against the speech feature quantities.
3. A speech recognition system in which a speech recognition server and a plurality of speech recognition terminals are connected via a network, wherein
each of the speech recognition terminals comprises:
an input terminal to which an external microphone is connected and through which a speech signal collected by the external microphone is input;
a sensor for detecting sensor information characterizing the noise superimposed on the speech signal; and
client-side transmission means for transmitting the sensor information and the speech signal to the speech recognition server via the network; and
the speech recognition server comprises:
server-side reception means for receiving the sensor information and the speech signal;
server-side acoustic analysis means for calculating speech feature quantities from the speech signal;
server-side acoustic model storage means for storing a plurality of acoustic models;
server-side acoustic model selection means for selecting, from the plurality of acoustic models, an acoustic model matching the sensor information; and
server-side matching means for matching the acoustic model selected by the server-side acoustic model selection means against the speech feature quantities.
4. The speech recognition system according to any one of claims 1 to 3, wherein the speech recognition server further comprises traffic information acquisition means for acquiring traffic information from the Internet, and the server-side acoustic model selection means selects, from the plurality of acoustic models, an acoustic model matching both the sensor information and the traffic information acquired by the traffic information acquisition means.
5. The speech recognition system according to any one of claims 1 to 3, wherein the speech recognition server further comprises weather information acquisition means for acquiring weather information from the Internet, and the server-side acoustic model selection means selects, from the plurality of acoustic models, an acoustic model matching both the sensor information and the weather information acquired by the weather information acquisition means.
6. A speech recognition terminal comprising:
an input terminal to which an external microphone is connected and through which a speech signal collected by the external microphone is input;
client-side acoustic analysis means for calculating speech feature quantities from the speech signal input from the input terminal;
a sensor for detecting sensor information characterizing the noise superimposed on the speech signal;
client-side transmission means for transmitting the sensor information to a speech recognition server that selects, from a plurality of acoustic models, an acoustic model matching the sensor information and transmits that acoustic model via a network;
client-side reception means for receiving the acoustic model transmitted by the speech recognition server; and
client-side matching means for matching the acoustic model against the speech feature quantities.
7. A speech recognition server that stores a plurality of acoustic models, selects from the plurality of acoustic models an acoustic model suited to the sound collection environment of each of a plurality of speech recognition terminals, and transmits that acoustic model to the respective speech recognition terminal via a network, the speech recognition server comprising:
server-side reception means for receiving, from each of the speech recognition terminals, sensor information representing the sound collection environment;
server-side acoustic model storage means for storing the plurality of acoustic models;
server-side acoustic model selection means for selecting an acoustic model matching the sensor information; and
server-side transmission means for transmitting the acoustic model selected by the server-side acoustic model selection means to the respective speech recognition terminal.
8. The speech recognition server according to claim 7, further comprising acoustic model difference calculation means for calculating a difference between an acoustic model stored in the speech recognition terminal and the acoustic model selected by the server-side acoustic model selection means, wherein the server-side transmission means transmits the difference instead of the acoustic model.
9. The speech recognition server according to claim 8, wherein the server-side acoustic model storage means further stores a plurality of acoustic models clustered in advance on the basis of the statistics of the acoustic models, and the acoustic model difference calculation means calculates the difference between the clustered acoustic models.
10. The speech recognition terminal according to claim 6, further comprising:
local acoustic model storage means for storing some of the plurality of acoustic models stored by the speech recognition server; and
acoustic model synthesis means for generating an acoustic model matching the sensor information by adding, to an acoustic model stored in the local acoustic model storage means, the difference between that acoustic model and the acoustic model selected by the speech recognition server as matching the sensor information,
wherein the client-side reception means receives, instead of the acoustic model, the difference transmitted from the speech recognition server.
11. A speech recognition server that stores a plurality of acoustic models, receives via a network speech feature quantities of input speech extracted by a plurality of speech recognition terminals, selects from the plurality of acoustic models an acoustic model suited to the sound collection environment of each speech recognition terminal, and recognizes the speech feature quantities using that acoustic model, the speech recognition server comprising:
server-side reception means for receiving, from each of the speech recognition terminals, sensor information representing the sound collection environment and the speech feature quantities;
server-side acoustic model storage means for storing the plurality of acoustic models;
server-side acoustic model selection means for selecting an acoustic model matching the sensor information; and
server-side matching means for matching the speech feature quantities against the acoustic model selected by the server-side acoustic model selection means.
12. A speech recognition terminal comprising:
an input terminal to which an external microphone is connected and through which a speech signal collected by the external microphone is input;
client-side acoustic analysis means for calculating speech feature quantities from the speech signal input from the input terminal;
a sensor for detecting sensor information characterizing the noise superimposed on the speech signal; and
client-side transmission means for transmitting the sensor information and the speech feature quantities to a speech recognition server that selects, from a plurality of acoustic models, an acoustic model matching the sensor information and performs speech recognition of the speech feature quantities received via a network on the basis of that acoustic model.
13. The speech recognition terminal according to claim 12, wherein the client-side transmission means divides the speech feature quantities into a plurality of frames in time-series order, and appends to each frame the sensor information detected by the sensor at the corresponding time of the time series before transmission.
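Claim 13 leaves the transmission format open; the following hypothetical packet layout simply pairs each time-ordered feature frame with the sensor reading taken at the same instant before it is sent to the server. Field names are invented for illustration.

```python
# Hypothetical packet layout for claim 13 (the patent does not define a wire
# format): each feature frame carries the sensor reading taken at its time.
import json
import numpy as np

def frame_and_tag(feature_frames: np.ndarray, sensor_readings: list) -> list:
    """Split features into time-ordered frames and attach sensor info."""
    packets = []
    for t, frame in enumerate(feature_frames):
        packets.append({
            "frame_index": t,
            "features": frame.tolist(),
            "sensor_info": sensor_readings[t],   # e.g. engine rpm at time t
        })
    return packets

features = np.random.randn(3, 13)
rpm = [900, 950, 3100]
payload = json.dumps(frame_and_tag(features, rpm))  # sent over the network
```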
14. The speech recognition server according to claim 11, wherein the server-side receiving means receives the sensor information and the speech feature quantities frame by frame, the server-side acoustic model selection means selects, for each frame, an acoustic model matching the sensor information, and the server-side matching means matches the acoustic model selected for each frame by the server-side acoustic model selection means against the speech feature quantities of that frame.
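Claim 14 then applies the selection on the server once per received frame, so the model can change in the middle of an utterance as the noise conditions change. A minimal, self-contained sketch with invented names and a toy scoring rule:

```python
# Sketch of the per-frame selection in claim 14 (illustrative only).
import numpy as np

MODELS = {"idle": np.zeros(13), "high_rpm": np.full(13, 0.5)}  # toy "models"

def recognize_per_frame(packets: list) -> float:
    """Select a model per frame from its sensor info and accumulate scores."""
    total = 0.0
    for packet in packets:
        label = "high_rpm" if packet["sensor_info"] > 2000 else "idle"
        mean = MODELS[label]
        frame = np.asarray(packet["features"])
        total += float(-0.5 * np.sum((frame - mean) ** 2))  # toy log-score
    return total

packets = [{"features": list(np.random.randn(13)), "sensor_info": rpm}
           for rpm in (900, 950, 3100)]
score = recognize_per_frame(packets)
```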
15. A speech recognition server that receives speech digital signals from a plurality of speech recognition terminals via a network, selects from a plurality of acoustic models an acoustic model adapted to the sound collection environment of each speech recognition terminal, and performs speech recognition of the speech digital signals using the selected acoustic model, the speech recognition server comprising: server-side receiving means for receiving, from each speech recognition terminal, sensor information representing the sound collection environment together with the speech digital signal; server-side acoustic analysis means for calculating speech feature quantities from the speech digital signal; server-side acoustic model storage means for storing the plurality of acoustic models; server-side acoustic model selection means for selecting an acoustic model matching the sensor information; and server-side matching means for matching the speech feature quantities against the acoustic model selected by the server-side acoustic model selection means.
16. A speech recognition terminal comprising: an input terminal to which an external microphone is connected and which inputs a speech signal collected by the external microphone; speech digital processing means for calculating a speech digital signal from the speech signal input from the input terminal; a sensor that detects sensor information representing the type of noise superimposed on the speech signal; and client-side transmission means for transmitting the sensor information and the speech digital signal to a speech recognition server that selects, from a plurality of acoustic models, an acoustic model matching the sensor information and performs, on the basis of that acoustic model, speech recognition of the speech digital signal received via a network.
17. The speech recognition terminal according to claim 16, wherein the client-side transmission means divides the speech digital signal into a plurality of frames in time-series order, and appends to each frame the sensor information detected by the sensor at the corresponding time of the time series before transmission.
18. The speech recognition server according to claim 15, wherein the server-side receiving means receives the speech digital signal and the sensor information frame by frame, the server-side acoustic analysis means calculates speech feature quantities from the speech digital signal for each frame, the server-side acoustic model selection means selects, for each frame, an acoustic model matching the sensor information, and the server-side matching means matches the acoustic model selected for each frame by the server-side acoustic model selection means against the speech feature quantities of that frame.
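In the variant of claims 15 to 18 the terminal sends the speech digital signal itself and the server performs the acoustic analysis before selection and matching. The patent does not fix a feature type, so the sketch below uses a toy per-frame log-energy purely as a stand-in for that analysis step; frame length, hop size, and names are assumptions.

```python
# Stand-in for the server-side acoustic analysis in claims 15-18: frame the
# received digital signal and compute one toy log-energy feature per frame.
import numpy as np

def analyze(signal: np.ndarray, frame_len: int = 400, hop: int = 160):
    """Split a speech digital signal into frames and compute log-energy."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        feats.append(np.log(np.sum(frame ** 2) + 1e-10))
    return np.asarray(feats)

pcm = np.random.randn(16000)   # one second of 16 kHz audio from a terminal
features = analyze(pcm)        # then passed to selection and matching
```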
19. The speech recognition server according to any one of claims 7 to 9, 11, 14, 15, and 18, further comprising traffic information acquisition means for acquiring traffic information from the Internet, wherein the server-side acoustic model selection means selects, from the plurality of acoustic models, an acoustic model matching both the sensor information and the traffic information acquired by the traffic information acquisition means.
20. The speech recognition server according to any one of claims 7 to 9, 11, 14, 15, and 18, further comprising weather information acquisition means for acquiring weather information from the Internet, wherein the server-side acoustic model selection means selects, from the plurality of acoustic models, an acoustic model matching both the sensor information and the weather information acquired by the weather information acquisition means.
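Claims 19 and 20 extend the selection key beyond the in-vehicle sensor information to Internet-sourced traffic and weather information. A minimal sketch, with all keys and model names invented for illustration:

```python
# Illustrative sketch of claims 19-20 (assumed keys and values): the acoustic
# model is indexed by the combination of in-vehicle sensor information and
# Internet-sourced traffic or weather information.
COMBINED_MODELS = {
    ("high_rpm", "congested", "rain"):  "model_noisy_wet_road",
    ("high_rpm", "free_flow", "clear"): "model_highway_cruise",
    ("idle",     "congested", "clear"): "model_stop_and_go",
}

def select_combined(sensor_info: str, traffic: str, weather: str) -> str:
    """Pick the acoustic model matching sensor, traffic and weather info."""
    return COMBINED_MODELS.get((sensor_info, traffic, weather),
                               "model_default")

chosen = select_combined("high_rpm", "congested", "rain")
```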
PCT/JP2003/009598 2003-07-29 2003-07-29 Voice recognition system and its terminal and server WO2005010868A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2003/009598 WO2005010868A1 (en) 2003-07-29 2003-07-29 Voice recognition system and its terminal and server
JP2005504586A JPWO2005010868A1 (en) 2003-07-29 2003-07-29 Speech recognition system and its terminal and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2003/009598 WO2005010868A1 (en) 2003-07-29 2003-07-29 Voice recognition system and its terminal and server

Publications (1)

Publication Number Publication Date
WO2005010868A1 true WO2005010868A1 (en) 2005-02-03

Family

ID=34090568

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2003/009598 WO2005010868A1 (en) 2003-07-29 2003-07-29 Voice recognition system and its terminal and server

Country Status (2)

Country Link
JP (1) JPWO2005010868A1 (en)
WO (1) WO2005010868A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002091477A (en) * 2000-09-14 2002-03-27 Mitsubishi Electric Corp Voice recognition system, voice recognition device, acoustic model control server, language model control server, voice recognition method and computer readable recording medium which records voice recognition program
JP2003122395A (en) * 2001-10-19 2003-04-25 Asahi Kasei Corp Voice recognition system, terminal and program, and voice recognition method
JP2003140691A (en) * 2001-11-07 2003-05-16 Hitachi Ltd Voice recognition device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KOSAKA ET AL.: "Scalar Ryushi-ka o Riyo shita Client-Server-gata Onsei Ninshiki no Jitsugen to Server-bu no Kosoku-ka no Kento", THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS GIJUTSU KENKYU HOKOKU (ONSEI), 21 December 1999 (1999-12-21), pages 31-36, XP002984744 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008108232A1 (en) * 2007-02-28 2008-09-12 Nec Corporation Audio recognition device, audio recognition method, and audio recognition program
JP5229216B2 (en) * 2007-02-28 2013-07-03 日本電気株式会社 Speech recognition apparatus, speech recognition method, and speech recognition program
US8612225B2 (en) 2007-02-28 2013-12-17 Nec Corporation Voice recognition device, voice recognition method, and voice recognition program
JP2011118124A (en) * 2009-12-02 2011-06-16 Murata Machinery Ltd Speech recognition system and recognition method
CN109213970A (en) * 2017-06-30 2019-01-15 北京国双科技有限公司 Put down generation method and device
CN109213970B (en) * 2017-06-30 2022-07-29 北京国双科技有限公司 Method and device for generating notes
US11367449B2 (en) 2017-08-09 2022-06-21 Lg Electronics Inc. Method and apparatus for calling voice recognition service by using Bluetooth low energy technology
WO2019031870A1 (en) * 2017-08-09 2019-02-14 엘지전자 주식회사 Method and apparatus for calling voice recognition service by using bluetooth low energy technology
JP2019211752A (en) * 2018-06-01 2019-12-12 サウンドハウンド,インコーポレイテッド Custom acoustic models
US11011162B2 (en) 2018-06-01 2021-05-18 Soundhound, Inc. Custom acoustic models
US11367448B2 (en) 2018-06-01 2022-06-21 Soundhound, Inc. Providing a platform for configuring device-specific speech recognition and using a platform for configuring device-specific speech recognition
CN110556097A (en) * 2018-06-01 2019-12-10 声音猎手公司 Customizing acoustic models
CN110556097B (en) * 2018-06-01 2023-10-13 声音猎手公司 Custom acoustic models
US11830472B2 (en) 2018-06-01 2023-11-28 Soundhound Ai Ip, Llc Training a device specific acoustic model
WO2020096172A1 (en) * 2018-11-07 2020-05-14 Samsung Electronics Co., Ltd. Electronic device for processing user utterance and controlling method thereof
US10699704B2 (en) 2018-11-07 2020-06-30 Samsung Electronics Co., Ltd. Electronic device for processing user utterance and controlling method thereof
US11538470B2 (en) 2018-11-07 2022-12-27 Samsung Electronics Co., Ltd. Electronic device for processing user utterance and controlling method thereof

Also Published As

Publication number Publication date
JPWO2005010868A1 (en) 2006-09-14

Similar Documents

Publication Publication Date Title
EP2538404B1 (en) Voice data transferring device, terminal device, voice data transferring method, and voice recognition system
US7451085B2 (en) System and method for providing a compensated speech recognition model for speech recognition
TWI508057B (en) Speech recognition system and method
US20020138274A1 (en) Server based adaption of acoustic models for client-based speech systems
US20100185446A1 (en) Speech recognition system and data updating method
EP2956939B1 (en) Personalized bandwidth extension
CN104347067A (en) Audio signal classification method and device
JP2002091477A (en) Voice recognition system, voice recognition device, acoustic model control server, language model control server, voice recognition method and computer readable recording medium which records voice recognition program
JP6466334B2 (en) Real-time traffic detection
CN104040626A (en) Multiple coding mode signal classification
CN101345819A (en) Speech control system used for set-top box
CN1171201C (en) Speech distinguishing system and method thereof
WO2005010868A1 (en) Voice recognition system and its terminal and server
JP2008026489A (en) Voice signal conversion apparatus
JP2003241788A (en) Device and system for speech recognition
JP3477432B2 (en) Speech recognition method and server and speech recognition system
CN113724698B (en) Training method, device, equipment and storage medium of voice recognition model
EP1810277A1 (en) Method for the distributed construction of a voice recognition model, and device, server and computer programs used to implement same
CN103474063A (en) Voice recognition system and method
CN1062365C (en) A method of transmitting and receiving coded speech
JPH10254473A (en) Method and device for voice conversion
JP2003122395A (en) Voice recognition system, terminal and program, and voice recognition method
JP2006106300A (en) Speech recognition device and program therefor
FI115275B (en) Speech identification application for wireless terminals
CN116052720A (en) Voice error detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2005504586

Country of ref document: JP

122 Ep: pct application non-entry in european phase