WO2005010868A1 - Voice recognition system and its terminal and server - Google Patents

Voice recognition system and its terminal and server Download PDF

Info

Publication number
WO2005010868A1
WO2005010868A1 (PCT/JP2003/009598)
Authority
WO
WIPO (PCT)
Prior art keywords
server
acoustic model
voice
acoustic
speech recognition
Prior art date
Application number
PCT/JP2003/009598
Other languages
French (fr)
Japanese (ja)
Inventor
Tomohiro Narita
Takashi Sudou
Toshiyuki Hanazawa
Original Assignee
Mitsubishi Denki Kabushiki Kaisha
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Denki Kabushiki Kaisha filed Critical Mitsubishi Denki Kabushiki Kaisha
Priority to PCT/JP2003/009598 priority Critical patent/WO2005010868A1/en
Priority to JP2005504586A priority patent/JPWO2005010868A1/en
Publication of WO2005010868A1 publication Critical patent/WO2005010868A1/en

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit

Definitions

  • The present invention relates to a speech recognition system and its terminal and server, and in particular to the art of selecting an appropriate acoustic model, according to the current usage conditions, from a plurality of acoustic models prepared for various environmental conditions, and performing speech recognition with it. Background Art
  • Speech recognition is performed by extracting a time series of speech feature quantities from the input speech and calculating a likelihood by matching this feature time series against an acoustic model prepared in advance.
  • In the prior art, values output from various in-vehicle sensors such as a speed sensor (sensor information, that is, data obtained by A/D conversion of the analog signal from each sensor) are collected together with the background noise, a noise spectrum is calculated from that noise, and the noise spectrum and the sensor information from the various in-vehicle sensors are stored in association with each other for the next time speech recognition is performed.
  • When speech recognition is next performed, the noise spectrum associated with the current sensor information is retrieved and removed from the time series of speech feature quantities.
  • However, this method has the problem that the accuracy of speech recognition cannot be improved until the device has actually been used long enough for the noise spectra to be accumulated.
  • Alternatively, some representative values are selected in advance from the output values of the various sensors, and an acoustic model is learned for each condition under which the sensors output those values. The sensor information obtained in the actual usage environment can then be compared with the learning conditions of the acoustic models and an appropriate acoustic model selected.
  • The data size of one acoustic model varies depending on how the speech recognition system is designed and implemented, but may reach several hundred kilobytes.
  • On a small terminal, the size and weight of the housing severely limit the storage capacity of the storage device that can be mounted, so it is not realistic to adopt a configuration in which such a terminal holds a plurality of acoustic models of this size.
  • The present invention has been made in order to solve the above-mentioned problem.
  • By transmitting the sensor information over a network from the speech recognition terminal to a speech recognition server that stores a plurality of acoustic models, the present invention makes it possible to select an acoustic model suited to the sound collection environment of the terminal and thereby achieve high-accuracy speech recognition. Disclosure of the Invention
  • The speech recognition system according to the present invention is
  • a speech recognition system in which a speech recognition server and a plurality of speech recognition terminals are connected via a network,
  • each speech recognition terminal comprising a client-side acoustic analysis unit that calculates a speech feature quantity from the speech signal input from the input terminal, client-side transmitting means for transmitting the sensor information to the speech recognition server,
  • client-side receiving means for receiving an acoustic model from the speech recognition server, and client-side matching means for matching the acoustic model against the speech feature quantity,
  • and the speech recognition server comprising server-side receiving means for receiving the sensor information transmitted by the client-side transmitting means,
  • server-side acoustic model storage means for storing a plurality of acoustic models,
  • server-side acoustic model selecting means for selecting an acoustic model that matches the received sensor information from the plurality of acoustic models, and
  • server-side transmitting means for transmitting the acoustic model selected by the server-side acoustic model selecting means to the speech recognition terminal.
  • With this configuration, a plurality of acoustic models corresponding to various sound collection environments are stored in a speech recognition server whose storage capacity is not restricted, and, based on the information from the sensor provided in each speech recognition terminal, an acoustic model suited to the sound collection environment of that speech recognition terminal is selected and sent to it. As a result, even though the speech recognition terminal is limited in its own storage capacity by constraints such as the size and weight of its case, it can acquire an appropriate acoustic model and perform speech recognition with it, so the accuracy of speech recognition can be improved.
  • FIG. 1 is a block diagram showing a configuration of a speech recognition terminal and a server according to Embodiment 1 of the present invention.
  • FIG. 2 is a flowchart showing the operation of the speech recognition terminal and the server according to Embodiment 1 of the present invention.
  • FIG. 3 is a block diagram showing a configuration of a speech recognition terminal and a server according to Embodiment 2 of the present invention.
  • FIG. 4 is a flowchart illustrating a clustering process of an acoustic model according to Embodiment 2 of the present invention.
  • FIG. 5 is a flowchart showing the operation of the speech recognition terminal and the server according to Embodiment 2 of the present invention.
  • FIG. 6 is a block diagram showing a configuration of a speech recognition terminal and a server according to Embodiment 3 of the present invention.
  • FIG. 7 is a flowchart showing the operation of the speech recognition terminal and the server according to Embodiment 3 of the present invention.
  • FIG. 8 is a block diagram showing a configuration of a speech recognition terminal and a server according to Embodiment 4 of the present invention.
  • FIG. 9 is a flowchart showing the operation of the speech recognition terminal and the server according to Embodiment 4 of the present invention.
  • FIG. 10 is a configuration diagram of a data format of sensor information and voice data transmitted from the voice recognition terminal to the voice recognition server according to Embodiment 4 of the present invention.
  • FIG. 11 is a block diagram showing a configuration of a speech recognition terminal and a server according to Embodiment 5 of the present invention.
  • FIG. 12 is a flowchart showing the operation of the speech recognition terminal and the server according to Embodiment 5 of the present invention.
  • FIG. 1 is a block diagram showing the configuration of a speech recognition terminal and a server according to Embodiment 1 of the present invention.
  • a microphone 1 is a device or component that collects voice
  • The voice recognition terminal 2 is a device that performs voice recognition of the voice collected by the microphone 1 via the input terminal 3 and outputs a recognition result 4.
  • The input terminal 3 is an audio input terminal or a microphone connection jack.
  • the speech recognition terminal 2 is connected to a speech recognition server 6 via a network 5.
  • Network 5 is a network that carries digital information, such as the Internet, a LAN (Local Area Network), a public line network, a mobile phone network, or a communication network using artificial satellites.
  • The network 5 only needs to be able to transmit and receive digital data between the devices connected to it; the format of the information transmitted on the network 5 does not matter. It may therefore also be, for example, a bus designed to connect a plurality of devices, such as USB (Universal Serial Bus) or SCSI (Small Computer System Interface).
  • When the network 5 uses a data communication service of a mobile communication network, the data to be transmitted is divided into units called packets, which are sent and received one by one; control information, such as position information indicating which part of the whole data each packet constitutes and error-correction information, is added to each packet.
  • The speech recognition server 6 is a server computer connected to the speech recognition terminal 2 via the network 5.
  • The speech recognition server 6 has a storage device, such as a hard disk drive or memory, with a larger storage capacity than the speech recognition terminal 2, and stores the standard patterns (acoustic models) required for speech recognition. A plurality of speech recognition terminals 2 are connected to the speech recognition server 6 via the network 5.
  • The speech recognition terminal 2 includes a terminal-side acoustic analysis unit 11, a sensor 12, a terminal-side transmission unit 13, a terminal-side reception unit 14, a terminal-side acoustic model storage unit 15, a terminal-side acoustic model selection unit 16, and a terminal-side matching unit 17.
  • the terminal-side acoustic analysis unit 11 performs acoustic analysis based on the audio signal input from the input terminal 3 and calculates an audio feature amount.
  • The sensor 12 is a sensor for detecting environmental conditions, provided near the microphone 1 in order to obtain information about the noise superimposed on the audio signal obtained by the microphone 1.
  • Any element or device that detects or acquires a physical quantity, or a change in a physical quantity, in a given environment may be used.
  • The physical quantities referred to here include speed, temperature, pressure, flow rate, light, magnetism, time, electromagnetic waves, and the like.
  • For example, a GPS antenna is a sensor for GPS signals. It is not always necessary to detect a physical quantity by acquiring a signal from the outside world; for example, a circuit that obtains the time at the place where the microphone is located from a built-in clock is also included among the sensors referred to here.
  • In many cases the sensor outputs an analog signal, and the usual configuration is to sample this analog output into a digital signal by means of an A/D conversion circuit or element. The sensor 12 may therefore include such an A/D conversion circuit or element.
  • When the speech recognition terminal 2 is a terminal of a car navigation system, a plurality of sensors may be combined, such as a speed sensor, a sensor that monitors the rotation of the engine, a sensor that monitors the operation of the wipers, a sensor that monitors the turn signals, a sensor that monitors the opening and closing of the door glass, and a sensor that monitors the car audio program.
  • The terminal-side transmission unit 13 is a unit that transmits the sensor information about the environment near the microphone 1, obtained by the sensor 12, to the speech recognition server 6.
  • the terminal-side receiving unit 14 is configured to receive information from the speech recognition server 6 and output the received information to the terminal-side acoustic model selecting unit 16.
  • The terminal-side transmission unit 13 and the terminal-side reception unit 14 are composed of circuits or elements that send signals to and receive signals from the network cable; a computer program used to control these circuits or elements may also be included in the terminal-side transmission unit 13 and the terminal-side reception unit 14.
  • When the network 5 is a wireless network, the terminal-side transmission unit 13 and the terminal-side reception unit 14 have antennas for transmitting and receiving radio waves.
  • The terminal-side transmission unit 13 and the terminal-side reception unit 14 may be configured as separate parts, or may be constituted by the same network input/output device.
  • The terminal-side acoustic model storage unit 15 is a storage device or circuit for storing acoustic models.
  • A plurality of acoustic models are prepared according to the learning environment, and only some of them are stored in the terminal-side acoustic model storage unit 15.
  • each acoustic model is associated with sensor information indicating an environmental condition in which the acoustic model has been learned, and an acoustic model suitable for the environmental condition can be specified from the numerical value of the sensor information.
  • Even when the speech recognition terminal 2 is an in-vehicle or other small speech recognition device, the acoustic models are also stored by the speech recognition server 6, so the storage capacity of the storage device that must be mounted on the speech recognition terminal 2 can be extremely small.
  • The terminal-side acoustic model selection unit 16 is a unit that calculates the likelihood between an acoustic model acquired by the terminal-side reception unit 14 (or an acoustic model stored in the terminal-side acoustic model storage unit 15) and the speech feature quantity output by the terminal-side acoustic analysis unit 11.
  • The terminal-side matching unit 17 is a unit that selects a vocabulary item based on the likelihood calculated by the terminal-side acoustic model selection unit 16 and outputs it as the recognition result 4.
  • The terminal-side acoustic analysis unit 11, the terminal-side transmission unit 13, the terminal-side reception unit 14, the terminal-side acoustic model storage unit 15, the terminal-side acoustic model selection unit 16, and the terminal-side matching unit 17 may each be configured as dedicated circuits, or they may be implemented as a computer program that performs the equivalent processing, executed by a central processing unit (CPU) together with a network I/O device (such as a network adapter) and a storage device.
  • The speech recognition server 6 includes a server-side receiving unit 21, a server-side acoustic model storage unit 22, a server-side acoustic model selection unit 23, and a server-side transmitting unit 24.
  • the server-side receiving unit 21 is a unit that receives the sensor information transmitted from the terminal-side transmitting unit 13 of the voice recognition terminal 2 via the network 5.
  • the server-side acoustic model storage unit 22 is a storage device for storing a plurality of acoustic models.
  • This server-side acoustic model storage unit 22 is configured as a large-capacity storage device, for example a hard disk drive, or a CD-ROM medium combined with a CD-ROM drive.
  • The server-side acoustic model storage unit 22 has a storage capacity large enough to store all of the acoustic models that may be required by this speech recognition system.
  • The server-side acoustic model selection unit 23 is a unit that selects, from among the acoustic models stored in the server-side acoustic model storage unit 22, an acoustic model suited to the sensor information received by the server-side reception unit 21.
  • the server-side transmitting unit 24 is a unit that transmits the acoustic model selected by the server-side acoustic model selecting unit 23 to the speech recognition terminal 2 via the network 5.
  • The server-side receiving unit 21, the server-side acoustic model storage unit 22, the server-side acoustic model selection unit 23, and the server-side transmitting unit 24 may each be configured as dedicated circuits, or they may be implemented as a computer program that performs the equivalent processing, executed by a central processing unit (CPU) together with a network I/O device (such as a network adapter) and a storage device.
  • FIG. 2 is a flowchart illustrating processing performed by the voice recognition terminal 2 and the voice recognition server 6 according to the first embodiment.
  • When the user speaks into the microphone 1, a voice signal is input to the terminal-side acoustic analysis unit 11 via the input terminal 3 (step S101).
  • The terminal-side acoustic analysis unit 11 converts the voice signal into a digital signal using an A/D converter and calculates a time series of speech feature quantities such as the LPC cepstrum (Linear Predictive Coding cepstrum) (step S102).
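  • As background for step S102, the following is a minimal sketch of how an LPC-cepstrum time series could be computed from a digitized signal. It uses the textbook autocorrelation/Levinson-Durbin method and the standard LPC-to-cepstrum recursion rather than the patent's own (unpublished) implementation, and the frame length, hop size, LPC order, and number of cepstral coefficients are illustrative assumptions.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Levinson-Durbin on the autocorrelation; returns alpha_1..alpha_p with x[n] ~ sum_k alpha_k x[n-k]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12                      # small epsilon avoids division by zero on silent frames
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i + 1] += k * a[i - 1::-1][:i]
        err *= (1.0 - k * k)
    return -a[1:]

def lpc_cepstrum(alpha, n_ceps):
    """Convert LPC coefficients to LPC-cepstral coefficients (standard recursion)."""
    p = len(alpha)
    c = np.zeros(n_ceps + 1)
    for m in range(1, n_ceps + 1):
        acc = alpha[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * alpha[m - k - 1]
        c[m] = acc
    return c[1:]

def feature_time_series(signal, frame_len=400, hop=160, order=12, n_ceps=12):
    """Frame the digitized speech and compute one LPC-cepstrum vector per frame (cf. step S102)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hamming(frame_len)
        frames.append(lpc_cepstrum(lpc_coefficients(frame, order), n_ceps))
    return np.array(frames)  # shape: (number of frames, n_ceps)
```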
  • the sensor 12 acquires a physical quantity around the microphone 1 (step S103).
  • For example, when the speech recognition terminal 2 is a car navigation system and the sensor 12 is a speed sensor that detects the speed of the vehicle equipped with the car navigation system, the vehicle speed corresponds to such a physical quantity.
  • In FIG. 2, the acquisition of the sensor information in step S103 is shown as being performed after the acoustic analysis in step S102.
  • However, the processing of step S103 may be performed before the processing of steps S101 to S102, or simultaneously with it, or in parallel with it; the order does not matter.
  • Next, the terminal-side acoustic model selection unit 16 selects the acoustic model learned under the condition closest to the sensor information obtained by the sensor 12, that is, closest to the current sound collection environment of the microphone 1.
  • Many learning conditions are assumed for the acoustic models, and the terminal-side acoustic model storage unit 15 does not necessarily store acoustic models for all of them. Therefore, when none of the acoustic models stored in the terminal-side acoustic model storage unit 15 was learned under environmental conditions close to the current sound collection environment of the microphone 1, an acoustic model is obtained from the speech recognition server 6.
  • Let S_m,k denote the sensor information of sensor k under the environmental conditions in which acoustic model m was learned,
  • and let x_k denote the current sensor information of sensor k.
  • The terminal-side acoustic model selection unit 16 calculates a distance value D(m) between the sensor information S_m,k of acoustic model m and the sensor information x_k obtained by the sensor 12 (step S104).
  • Let D_k(x_k, S_m,k) denote the distance value between the sensor information x_k of a certain sensor k and the sensor information S_m,k of acoustic model m.
  • For D_k(x_k, S_m,k), for example, the absolute value of the difference between the two values may be adopted.
  • The overall distance value D(m) is calculated from the per-sensor distance values D_k(x_k, S_m,k) as the weighted sum

    D(m) = Σ_k w_k D_k(x_k, S_m,k)    (1)

    where w_k is the weighting factor for sensor k.
  • The relationship between the sensor information as physical quantities and the distance value D(m) is as follows. If the sensor information is, for example, a position (it may be expressed by longitude and latitude, or by the distance from a specific place taken as the origin), different sensors have different dimensions as physical quantities. However, by adjusting the weighting factor w_k, the contribution of w_k D_k(x_k, S_m,k) to the distance value can be set appropriately, so this causes no problem. The same applies even if the unit systems differ: for example, if km/h is used as the unit of speed in one case and mph in another, different numerical values appear as sensor information even when the speed is physically the same, but this too can be absorbed by the weighting factors.
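  • As a minimal sketch of the equation (1) calculation, assuming each acoustic model carries the sensor values S_m,k recorded for its learning condition (the sensor names, values, and weights below are illustrative, not taken from the patent):

```python
def distance_to_model(x, model_sensors, weights):
    """Equation (1): D(m) = sum_k w_k * D_k(x_k, S_m,k), with the absolute difference as D_k."""
    return sum(weights[k] * abs(x[k] - model_sensors[k]) for k in x)

# Hypothetical current sensor readings and model learning conditions.
current = {"speed_kmh": 62.0, "wiper_state": 1.0}
models = {
    "m0": {"speed_kmh": 0.0,   "wiper_state": 0.0},
    "m1": {"speed_kmh": 60.0,  "wiper_state": 0.0},
    "m2": {"speed_kmh": 100.0, "wiper_state": 1.0},
}
w = {"speed_kmh": 0.01, "wiper_state": 1.0}  # weights absorb differences in units and dimensions

D = {m: distance_to_model(current, s, w) for m, s in models.items()}
best = min(D, key=D.get)  # the acoustic model with the smallest D(m)
```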
  • Next, the terminal-side acoustic model selection unit 16 obtains the minimum value min{D(m)} over all m of the distance values D(m) calculated by equation (1), and evaluates whether this min{D(m)} is smaller than a predetermined value T (step S105). In other words, it determines whether, among the learning conditions of the acoustic models stored in the terminal-side acoustic model storage unit 15, there is a condition sufficiently close to the current environmental conditions under which the microphone 1 is picking up sound.
  • the predetermined value T is a value set in advance to determine whether or not such a condition is satisfied.
  • If min{D(m)} is smaller than the predetermined value T (step S105: Yes), the process proceeds to step S106.
  • If min{D(m)} is equal to or larger than the predetermined value T (step S105: No), the process proceeds to step S107. In this case, none of the learning conditions of the acoustic models stored in the terminal-side acoustic model storage unit 15 is sufficiently close to the current environmental conditions in which the microphone 1 is collecting sound. Therefore, the terminal-side transmission unit 13 transmits the sensor information to the speech recognition server 6 (step S107).
  • If the value of T is made larger, min{D(m)} is more often determined to be smaller than T, and the number of times step S107 is executed decreases. That is, increasing T reduces the number of transmissions and receptions over the network 5, which has the effect of suppressing the amount of traffic on the network 5.
  • Conversely, if the value of T is made smaller, speech recognition is performed with an acoustic model whose learning conditions have a smaller distance value to the sensor information obtained by the sensor 12, so the accuracy of speech recognition can be improved. The value of T may therefore be determined in consideration of the traffic on the network 5 and the target speech recognition accuracy.
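  • The decision in steps S104 to S107 could be sketched as below: compute D(m) for the locally stored models, use the best one if min D(m) is below T, and otherwise send the sensor information to the server and wait for a model. The function names and the transport are placeholders, not interfaces defined by the patent.

```python
def choose_acoustic_model(local_distances, threshold_T, request_from_server):
    """local_distances: dict model_id -> D(m) for the models held in the terminal-side storage (unit 15).
    request_from_server: callable that sends the sensor information and returns a model obtained
    from the server (steps S107 to S111). Returns the model to use for matching (step S112)."""
    best_id = min(local_distances, key=local_distances.get)
    if local_distances[best_id] < threshold_T:        # step S105: Yes -> use the local model (step S106)
        return ("local", best_id)
    return ("server", request_from_server())          # step S105: No  -> fetch a model from the server

# Usage with the D(m) values of the previous sketch; T trades network traffic against model fit.
model = choose_acoustic_model({"m0": 1.62, "m1": 1.02, "m2": 0.38}, threshold_T=0.5,
                              request_from_server=lambda: "model_sent_by_server")
```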
  • The server-side receiving unit 21 of the speech recognition server 6 receives the sensor information via the network 5 (step S108).
  • the server-side acoustic model selection unit 23 calculates a distance value between the environmental condition in which the acoustic model stored in the server-side acoustic model storage unit 22 is learned and the sensor information received by the server-side reception unit 21. The calculation is performed in the same manner as in step S104, and the acoustic model with the smallest distance value is selected (step S109). Subsequently, the server-side transmitting unit 24 transmits the acoustic model selected by the server-side acoustic model selecting unit 23 to the speech recognition terminal 2 (Step S110).
  • the terminal-side receiving unit 14 of the voice recognition terminal 2 receives the acoustic model transmitted by the server-side transmitting unit 24 via the network 5 (Step S111).
  • Next, the terminal-side matching unit 17 performs the matching process between the speech feature quantity output by the terminal-side acoustic analysis unit 11 and the acoustic model (step S112).
  • For example, the vocabulary item whose standard pattern, stored as the acoustic model, gives the highest likelihood against the time series of speech feature quantities is output as the recognition result 4.
  • Alternatively, pattern matching by DP (Dynamic Programming) matching is performed, and the candidate with the smallest distance value is output as the recognition result 4.
  • As described above, with the speech recognition terminal 2 and the server 6 according to the first embodiment, even when only a small number of acoustic models can be stored in the speech recognition terminal 2, the sound collection environment of the microphone 1 is detected by the sensor 12,
  • and an acoustic model learned under environmental conditions close to this sound collection environment can be selected from the many acoustic models stored in the speech recognition server 6 and used to perform speech recognition.
  • The data size of one acoustic model can be several hundred kilobytes, depending on how the system is implemented. The effect of reducing the number of acoustic models that the speech recognition terminal needs to store is therefore significant.
  • The sensor information can take continuous values. Usually, however, several representative values are selected from the possible input values, and an acoustic model is learned for each of these values taken as its sensor information.
  • For example, suppose the sensor 12 is composed of two types of sensors (a first sensor and a second sensor), and the speech recognition terminal 2 and the speech recognition server 6 store an acoustic model for each combination of values. If the number of values selected as sensor information for the first sensor is N1 and the number of values selected as sensor information for the second sensor is N2, the total number of acoustic models stored by the speech recognition terminal 2 and the speech recognition server 6 is N1 × N2 (for instance, 4 speed ranges and 3 wiper states would give 12 acoustic models).
  • That is, when the number of values selected as sensor information of the first sensor is greater than the number of values selected as sensor information of the second sensor, by making the weighting factor for the sensor information of the first sensor smaller than the weighting factor for the sensor information of the second sensor, it is possible to select an acoustic model suited to the sound collection environment of the microphone 1.
  • In the first embodiment, the speech recognition terminal 2 includes the terminal-side acoustic model storage unit 15 and the terminal-side acoustic model selection unit 16, so that both the acoustic models stored in the speech recognition terminal 2 and the acoustic models stored in the speech recognition server 6 can be used,
  • an appropriate model being selected to perform speech recognition.
  • However, it is not essential that the speech recognition terminal 2 include the terminal-side acoustic model storage unit 15 and the terminal-side acoustic model selection unit 16. That is, it goes without saying that a configuration is also possible in which an acoustic model stored in the speech recognition server 6 is always used, unconditionally, based on the sensor information acquired by the sensor 12.
  • Also, the acoustic model received from the speech recognition server 6 may be newly stored in the terminal-side acoustic model storage unit 15, or may be stored in place of one of the acoustic models already held on the speech recognition terminal 2 side. By doing so, the next time speech recognition is performed using the same acoustic model there is no need to transfer the acoustic model again from the speech recognition server 6, so the transmission load on the network 5 can be reduced and the transmission and reception time can be shortened. Embodiment 2
  • In the speech recognition system of the first embodiment, an acoustic model suited to the sensor information is obtained from the speech recognition server.
  • However, the transfer of the acoustic model from the speech recognition server must not place a heavy load on the network, nor should the time required to transfer the acoustic model data affect the overall processing performance.
  • One way to avoid such problems is to design the speech recognition algorithm so that the data size of the acoustic model is as small as possible. If the size of the acoustic model is small, transferring the acoustic model from the speech recognition server to the speech recognition terminal does not add much load to the network.
  • As another approach, a plurality of mutually similar acoustic models are clustered and the differences between the acoustic models in the same cluster are determined in advance and stored in the speech recognition server;
  • only the difference from an acoustic model already stored in the speech recognition terminal is then transferred, and the speech recognition terminal synthesizes the acoustic model of the speech recognition server from the acoustic model it stores and the received difference.
  • Such a method is also conceivable.
  • The speech recognition terminal and the server according to the second embodiment operate based on this principle.
  • FIG. 3 is a block diagram illustrating a configuration of a speech recognition terminal and a server according to the second embodiment.
  • The acoustic model conversion unit 18 is a unit that synthesizes an acoustic model equivalent to one stored in the speech recognition server 6 from the contents received by the terminal-side receiving unit 14 and an acoustic model stored in the terminal-side acoustic model storage unit 15.
  • The acoustic model difference calculation unit 25 is a unit that calculates the difference between an acoustic model stored in the terminal-side acoustic model storage unit 15 and an acoustic model stored in the server-side acoustic model storage unit 22.
  • The remaining parts are the same as in the first embodiment, and their description is omitted.
  • The speech recognition terminal 2 and the server 6 of the second embodiment assume that the acoustic models have been clustered in advance. Therefore, the clustering method for the acoustic models is described first. Note that the clustering of the acoustic models is completed before speech recognition processing is performed by the speech recognition terminal 2 and the server 6.
  • In the following, the acoustic model of each phoneme is represented by the statistics (mean and variance) of its distributions over the dimensions of the speech feature.
  • FIG. 4 is a flowchart showing the clustering process of the acoustic model.
  • an initial cluster is created (step S201).
  • One initial cluster is created from all of the acoustic models that may be used by this speech recognition system. Equations (2) and (3) are used to calculate the statistics of the initial cluster r.
  • In equations (2) and (3), N represents the number of distributions belonging to the cluster
  • and K represents the number of dimensions of the speech feature.
  • step S202 it is determined whether or not the required number of clusters has already been obtained by the clustering process executed so far.
  • The required number of clusters is determined when the speech recognition system is designed. Generally speaking, the greater the number of clusters, the smaller the distance between acoustic models in the same cluster. As a result, the amount of information in the difference data is reduced, and the amount of difference data transmitted and received over the network 5 can be suppressed.
  • In particular, when the number of acoustic models stored by the speech recognition terminal 2 and the server 6 is large, the number of clusters should be increased.
  • The aim is to synthesize, by combining an acoustic model stored in the speech recognition terminal 2 (hereinafter called the "local acoustic model") with a difference, an acoustic model equivalent to one stored in the speech recognition server 6.
  • The difference used here is combined with the local acoustic model, and must therefore be determined between this local acoustic model and acoustic models belonging to the same cluster. Since the acoustic model synthesized from the difference is the one corresponding to the sensor information, the most efficient state is one in which the acoustic model corresponding to the sensor information and the local acoustic model are classified into the same cluster.
  • In step S203, a cluster division based on VQ distortion is performed (step S203).
  • The cluster with the largest VQ distortion, r_max (the initial cluster in the first loop), is divided into two clusters, r1 and r2, thereby increasing the number of clusters.
  • The statistics of the clusters after the division are calculated by the following equation, where δ is a small value predetermined for each dimension of the speech feature.
  • In step S204, the distance between the statistics of each acoustic model and the statistics of each cluster (all of the clusters produced by the divisions up to step S203)
  • is calculated, taking one acoustic model and one cluster at a time. However, the distance is not calculated again for a combination of acoustic model and cluster for which it has already been obtained.
  • The Bhattacharyya distance defined by equation (8) is used as the distance value between the statistics of an acoustic model and the statistics of a cluster.
  • In equation (8), the parameters with suffix 1 are the statistics of the acoustic model, and the parameters with suffix 2 are the statistics of the cluster.
  • Each acoustic model is then assigned to the cluster with the smallest distance value.
  • The distance value between the acoustic model statistics and the cluster statistics may also be calculated by a method other than equation (8). Even in such a case, it is desirable to adopt a formula under which acoustic models whose distance values calculated by equation (1) are close to each other end up belonging to the same cluster; however, this is not essential.
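  • Equation (8) itself is not reproduced in this text; the sketch below therefore uses the standard Bhattacharyya distance between two diagonal-covariance Gaussians, which is the usual form of that measure, as an assumption about what equation (8) computes.

```python
import numpy as np

def bhattacharyya_diag(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two K-dimensional diagonal Gaussians.
    Suffix 1: statistics of an acoustic model, suffix 2: statistics of a cluster."""
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    avg_var = (var1 + var2) / 2.0
    mean_term = 0.125 * np.sum((mu1 - mu2) ** 2 / avg_var)
    var_term = 0.5 * np.sum(np.log(avg_var / np.sqrt(var1 * var2)))
    return mean_term + var_term

def assign_to_clusters(model_stats, cluster_stats):
    """Step S204: assign each acoustic model to the cluster whose statistics are nearest."""
    assignment = {}
    for m, (mu_m, var_m) in model_stats.items():
        d = {r: bhattacharyya_diag(mu_m, var_m, mu_r, var_r)
             for r, (mu_r, var_r) in cluster_stats.items()}
        assignment[m] = min(d, key=d.get)
    return assignment
```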
  • Next, the codebook of each cluster is updated (step S205).
  • Specifically, the representative values of the acoustic models belonging to each cluster are calculated using equations (2) and (3).
  • Further, the distances between the statistics of the acoustic models belonging to the cluster and the representative values are accumulated using equation (8), and this sum is taken as the VQ distortion of the cluster.
  • In step S206, an evaluation value of the clustering is calculated (step S206).
  • The sum of the VQ distortions over all clusters is taken as the evaluation value of the clustering.
  • Steps S204 to S207 constitute a loop that is executed a plurality of times.
  • The evaluation value calculated in step S206 is stored until the next execution of the loop.
  • The difference between this evaluation value and the evaluation value calculated in the previous execution of the loop is then obtained, and it is determined whether or not this difference is less than a predetermined threshold value (step S207). If the difference is less than the threshold, each acoustic model belongs to an appropriate cluster among the clusters obtained so far, and the process returns to step S202 (step S207: Yes).
  • Otherwise, step S204 is performed again (step S207: No).
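  • Putting steps S201 to S207 together, a compact sketch of the split-and-reassign loop (in the style of LBG clustering) might look as follows. The pooled mean/variance plays the role of equations (2) and (3), the diagonal-Gaussian Bhattacharyya distance plays the role of equation (8), and the ±delta split is an assumed stand-in for the split equation that is not reproduced here.

```python
import numpy as np

def _bhat(mu1, v1, mu2, v2):
    """Bhattacharyya distance between diagonal Gaussians (the role of equation (8))."""
    av = (v1 + v2) / 2.0
    return 0.125 * np.sum((mu1 - mu2) ** 2 / av) + 0.5 * np.sum(np.log(av / np.sqrt(v1 * v2)))

def _pool(stats):
    """Pooled mean and variance of a set of models (the role of equations (2) and (3))."""
    mus = np.array([m for m, _ in stats]); vs = np.array([v for _, v in stats])
    mu = mus.mean(axis=0)
    return mu, (vs + mus ** 2).mean(axis=0) - mu ** 2

def _assign(model_stats, reps):
    """Step S204: nearest-cluster assignment, accumulating per-cluster VQ distortion (step S205)."""
    members = [[] for _ in reps]; distortion = [0.0] * len(reps)
    for name, (mu, var) in model_stats.items():
        d = [_bhat(mu, var, mu_r, var_r) for mu_r, var_r in reps]
        r = int(np.argmin(d))
        members[r].append(name); distortion[r] += d[r]
    return members, distortion

def lbg_cluster(model_stats, n_clusters, delta=0.01, tol=1e-3):
    """model_stats: {model_id: (mean_vector, variance_vector)} -> {model_id: cluster_index}."""
    reps = [_pool(list(model_stats.values()))]                 # step S201: one initial cluster
    while len(reps) < n_clusters:                              # step S202: enough clusters yet?
        members, distortion = _assign(model_stats, reps)
        worst = int(np.argmax(distortion))                     # cluster with the largest VQ distortion
        mu, var = reps[worst]
        reps[worst:worst + 1] = [(mu - delta, var), (mu + delta, var)]   # step S203 (assumed split)
        prev = None
        while True:                                            # steps S204 to S207
            members, distortion = _assign(model_stats, reps)
            reps = [_pool([model_stats[m] for m in ms]) if ms else reps[r]
                    for r, ms in enumerate(members)]           # codebook update (step S205)
            total = sum(distortion)                            # evaluation value (step S206)
            if prev is not None and abs(prev - total) < tol:   # convergence check (step S207)
                break
            prev = total
    members, _ = _assign(model_stats, reps)
    return {m: r for r, ms in enumerate(members) for m in ms}
```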
  • FIG. 5 is a flowchart of the operation of the voice recognition device 2 and the server 6.
  • In steps S101 to S105, a voice is input from the microphone 1 as in the first embodiment, and after acoustic analysis and acquisition of the sensor information it is checked whether an acoustic model suited to the sensor information is held locally.
  • If it is not, the process goes to step S208 (step S105: No).
  • The terminal-side transmission unit 13 transmits to the speech recognition server 6 the sensor information together with information identifying the local acoustic model m that will serve as the basis of the difference (step S208).
  • The server-side receiving unit 21 receives the sensor information and the identifier of the local acoustic model m (step S209), and the server-side acoustic model selection unit 23 selects the acoustic model best suited to the received sensor information (step S109). It is then determined whether or not this acoustic model and the local acoustic model m belong to the same cluster (step S210). If they belong to the same cluster, the process goes to step S211 (step S210: Yes); the acoustic model difference calculation unit 25 calculates the difference between the selected acoustic model and the local acoustic model m (step S211), and the server-side transmitting unit 24 transmits the difference to the speech recognition terminal 2 (step S212).
  • The difference may be calculated, for example, from the differences between the values of corresponding components of the model parameters together with their offsets (the differences between the storage positions of the respective elements). Methods for finding a difference between arbitrary binary data (such as between binary files) are also known and may be used. Further, since the scheme of the second embodiment places no special requirements on the data structure of the acoustic model, a data structure designed so that the difference can be obtained easily is also conceivable.
  • In step S210, if they do not belong to the same cluster, the process goes directly to step S212 (step S210: No), and the selected acoustic model itself is transmitted.
  • In the above, the local acoustic model used as the basis for generating the difference is the one that the speech recognition terminal 2 determined to be most suitable for the sensor information (in step S105, the acoustic model with the smallest distance value to the sensor information). For this reason, information identifying that acoustic model m was transmitted in advance in step S208.
  • Alternatively, the speech recognition server 6 may keep track of (or manage) which acoustic models the speech recognition terminal 2 stores; after selecting the acoustic model closest to the sensor information, the server may then choose, from among the local acoustic models it manages, the one from which to calculate the difference.
  • In this case, the speech recognition server 6 must notify the speech recognition terminal 2 of which local acoustic model the calculated difference is based on, so the speech recognition server 6 transmits, together with the difference, information identifying the local acoustic model used as the basis of the calculation.
  • The terminal-side receiving unit 14 of the speech recognition terminal 2 receives the difference data or the acoustic model (step S213). If the received data is a difference, the acoustic model conversion unit 18 synthesizes an acoustic model from the local acoustic model m that served as the basis of the difference and the difference itself (step S214). Then, the terminal-side matching unit 17 performs pattern matching between the standard pattern of the acoustic model and the speech feature quantity, and outputs the vocabulary item with the highest likelihood as the recognition result 4.
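  • A minimal sketch of this exchange, treating an acoustic model as a flat parameter array: the server computes the element-wise difference between the selected model and the local model m (step S211), and the terminal reconstructs the selected model by adding that difference back (step S214). The flat-array representation and the parameter values are assumptions made for illustration.

```python
import numpy as np

def model_difference(selected_params, local_params):
    """Server side (unit 25, step S211): element-wise difference between two models of one cluster."""
    return selected_params - local_params      # small values when the clustered models are similar

def synthesize_model(local_params, difference):
    """Terminal side (unit 18, step S214): rebuild the server's model from the local model and the difference."""
    return local_params + difference

# Illustrative round trip: the terminal holds local model m, the server sends only the difference.
local_m  = np.array([0.12, -0.40, 1.05, 0.33])   # hypothetical parameters of the local acoustic model
selected = np.array([0.10, -0.38, 1.10, 0.30])   # model the server judged closest to the sensor information
delta    = model_difference(selected, local_m)   # transmitted instead of the full model (step S212)
rebuilt  = synthesize_model(local_m, delta)      # equals `selected` on the terminal side
assert np.allclose(rebuilt, selected)
```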
  • In the first and second embodiments, even when the speech recognition terminal does not store the acoustic model required for speech recognition, the acoustic model stored in the speech recognition server 6 is received via the network 5 so that speech recognition can be performed according to the sound collection environment of the microphone 1. However, instead of transmitting and receiving the acoustic model, the speech feature quantity may be transmitted and received.
  • the speech recognition terminal and server according to the third embodiment operate based on such a principle.
  • FIG. 6 is a block diagram illustrating a configuration of a speech recognition terminal and a server according to the third embodiment.
  • the parts denoted by the same reference numerals as those in FIG. 1 are the same as those in the first embodiment, and thus the description is omitted.
  • The speech recognition terminal 2 and the speech recognition server 6 are connected via the network 5.
  • In the third embodiment, the speech feature quantity and the sensor information are transmitted from the speech recognition terminal 2 to the speech recognition server 6, and the recognition result 7 is output from the speech recognition server 6.
  • The server-side matching unit 27 is a unit that performs matching between the speech feature quantity and the acoustic model, similarly to the terminal-side matching unit 17 of the first embodiment.
  • FIG. 7 is a flowchart showing the processing between the speech recognition terminal 2 and the speech recognition server 6 according to the third embodiment.
  • The processes denoted by the same reference numerals as in FIG. 2 are the same as in the first embodiment; the description below therefore focuses on the processing with reference numerals unique to this flowchart.
  • When the user performs voice input from the microphone 1, a voice signal is input to the speech recognition terminal 2 via the input terminal 3 (step S101); the terminal-side acoustic analysis unit 11 calculates the time series of speech feature quantities from the voice signal (step S102), and sensor information is collected by the sensor 12 (step S103).
  • The sensor information and the speech feature quantity are then transferred to the speech recognition server 6 via the network 5 by the terminal-side transmission unit 13 (step S301), and the server-side reception unit 21 takes them into the speech recognition server 6 (step S302).
  • The server-side acoustic model storage unit 22 of the speech recognition server 6 stores in advance acoustic models corresponding to a plurality of sensor information values; the server-side acoustic model selection unit 23 calculates, using equation (1), the distance value between the sensor information acquired by the server-side reception unit 21 and the sensor information of each acoustic model, and selects the acoustic model with the smallest distance value (step S109).
  • Next, the server-side matching unit 27 performs matching between the speech feature quantity and the selected acoustic model (step S303). This process is the same as the matching process (step S112) in the first embodiment, so a detailed description is omitted.
  • As described above, with the speech recognition terminal 2 and the server 6 according to the third embodiment, only the calculation of the speech feature quantity and the acquisition of the sensor information are performed by the speech recognition terminal 2; based on the sensor information, an appropriate acoustic model is selected from the acoustic models stored in the speech recognition server 6 and the speech feature quantity is matched against it to recognize the speech. This eliminates the need for a storage device or circuit for storing acoustic models in the speech recognition terminal 2, and the configuration of the speech recognition terminal 2 can be simplified.
  • Furthermore, speech recognition can be performed without imposing a large transmission load on the network 5.
  • As mentioned above, the data size of an acoustic model can be several hundred kilobytes. Therefore, if the bandwidth of the network is limited, the transmission capacity may reach its limit when the acoustic model itself is transmitted. The speech feature quantity, on the other hand, requires a bandwidth of at most about 20 kbps and can therefore be transferred in real time without difficulty. A client-server speech recognition system with an extremely low network load can thus be constructed, and highly accurate speech recognition can be performed according to the sound collection environment of the microphone 1.
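  • A back-of-the-envelope check of the bandwidth claim, under assumed but typical parameters (100 feature frames per second, 12 cepstral coefficients per frame, 16-bit values): the feature stream stays around 20 kbps, while pushing a several-hundred-kilobyte acoustic model over the same link takes on the order of minutes.

```python
frames_per_s = 100      # e.g. a 10 ms frame shift (assumed)
dims = 12               # cepstral coefficients per frame (assumed)
bits_per_value = 16     # 16-bit quantized feature values (assumed)

feature_kbps = frames_per_s * dims * bits_per_value / 1000.0
print(feature_kbps)     # 19.2 kbps, consistent with the "at most about 20 kbps" figure

model_kbytes = 300      # a several-hundred-kilobyte acoustic model
print(model_kbytes * 8 / 20.0)  # 120 s to transfer one such model over a 20 kbps link
```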
  • the third embodiment has a configuration in which the recognition result 7 is output from the voice recognition server 6 instead of being output from the voice recognition terminal 2.
  • For example, if the speech recognition terminal 2 is used for browsing the Internet, it is sufficient to adopt a configuration in which the user utters a URL (Uniform Resource Locator), the speech recognition server 6 recognizes it, obtains the Web page determined from the URL, and transmits that page to the speech recognition terminal 2.
  • the voice recognition terminal 2 may output a recognition result.
  • In that case, the speech recognition terminal 2 is provided with a terminal-side receiving unit, and the speech recognition server 6 is provided with a server-side transmitting unit;
  • the output of the server-side matching unit 27 is transmitted from the transmitting unit of the speech recognition server 6 via the network 5 to the receiving unit of the speech recognition terminal 2, which outputs it to the desired output destination.
  • Instead of the speech feature quantity, a method of transmitting and receiving the audio data itself may also be considered.
  • the voice recognition terminal and the server according to the fourth embodiment operate based on such a principle.
  • FIG. 8 is a block diagram illustrating a configuration of a speech recognition terminal and a server according to the fourth embodiment.
  • the parts denoted by the same symbols as those in FIG. 1 are the same as those in the first embodiment, and therefore the description is omitted.
  • The speech recognition terminal 2 and the speech recognition server 6 are connected via the network 5.
  • voice data and sensor information are transmitted from the voice recognition terminal 2 to the voice recognition server 6, and the recognition result 7 is output from the voice recognition server 6.
  • this is different from the first embodiment.
  • The audio digital processing unit 19 is a unit that converts the audio input from the input terminal 3 into digital data, and is composed of an A/D conversion device or circuit.
  • The server-side acoustic analysis unit 28 is a unit that calculates the speech feature quantity from the input speech on the speech recognition server 6, and has the same function as the terminal-side acoustic analysis unit 11 in the first and second embodiments.
  • FIG. 9 is a flowchart illustrating the processing performed by the speech recognition terminal 2 and the speech recognition server 6 according to the fourth embodiment.
  • The processes denoted by the same reference numerals as in FIG. 2 are the same as in the first embodiment; the description below therefore focuses on the processing with reference numerals unique to this flowchart.
  • When the user performs voice input from the microphone 1, a voice signal is input to the speech recognition terminal 2 via the input terminal 3 (step S101), and the audio digital processing unit 19 samples the audio signal input in step S101 by A/D conversion (step S401).
  • Voice compression (encoding) methods include the μ-law 64 kbps PCM method (Pulse Code Modulation, ITU-T G.711) used in the public telephone network (ISDN, etc.), the adaptive differential PCM method used in PHS (Adaptive Differential PCM, ADPCM, ITU-T G.726), and the VSELP method (Vector Sum Excited Linear Prediction) and CELP method (Code Excited Linear Prediction) used in mobile phones.
  • One of these methods should be selected according to the available bandwidth and traffic of the communication network: μ-law PCM where 64 kbps can be secured, ADPCM where 16-40 kbps can be secured, VSELP where about 11.2 kbps can be secured, and CELP where the available bandwidth is even lower.
  • The characteristics of the present invention are not lost even if another encoding method is applied.
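  • The codec choice described above amounts to a lookup on the available bandwidth; the thresholds in the sketch below follow the rates quoted in the text, and the function itself is an illustrative assumption rather than part of the patent.

```python
def choose_codec(available_kbps):
    """Pick a voice coding method from the available uplink bandwidth."""
    if available_kbps >= 64:
        return "mu-law PCM (ITU-T G.711, 64 kbps)"
    if available_kbps >= 16:
        return "ADPCM (ITU-T G.726, 16-40 kbps)"
    if available_kbps >= 11.2:
        return "VSELP (11.2 kbps)"
    return "CELP (lower-rate coding)"

print(choose_codec(32))  # -> "ADPCM (ITU-T G.726, 16-40 kbps)"
```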
  • Next, the sensor information is acquired by the sensor 12 (step S103), and the sensor information and the voice data are combined and arranged into, for example, the format shown in FIG. 10; the data is then transferred to the speech recognition server 6 via the network 5 by the terminal-side transmission unit 13 (step S402).
  • a frame number indicating the processing time of the audio data is stored in the area 701.
  • This frame number is uniquely determined based on, for example, the sampling time of the audio data.
  • Here, "uniquely determined" includes "determined based on a relative time agreed between the speech recognition terminal 2 and the speech recognition server 6", meaning that audio data with a different relative time is given a different frame number.
  • Alternatively, a common time may be supplied from a clock external to both the speech recognition terminal 2 and the speech recognition server 6, and the frame number may be uniquely determined based on this time.
  • The data size occupied by the sensor information is stored in area 702 of the data format of FIG. 10.
  • For example, if the sensor information is a 32-bit value,
  • the size of the area required to store the sensor information (4 bytes) is expressed in bytes, and 4 is stored.
  • If the sensor 12 is composed of a plurality of sensors, what is stored is the data size of the array area necessary to hold the sensor information of all of them.
  • an area 703 is an area in which the sensor information acquired by the sensor 12 in step S103 is stored.
  • If the sensor 12 is composed of several sensors, an array of their sensor information is stored in area 703.
  • The data size of area 703 matches the data size held in area 702.
  • The size of the audio data is stored in area 704.
  • When the transmission unit 13 divides the voice data into a plurality of packets for transmission (each packet is assumed to have the same structure as the data format shown in FIG. 10), what is stored in area 704 is the data size of the audio data included in each packet. The case of division into a plurality of packets is described later. The audio data itself is stored in area 705.
  • the terminal-side transmission unit 13 divides the voice data input via the input terminal 3 into a plurality of packets.
  • As described above, the frame number stored in area 701 is information indicating the processing time of the audio data, and the frame number of each packet is determined based on the sampling time of the audio data included in that packet.
  • Likewise, the data size of the audio data included in each packet is stored in area 704.
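  • One way to serialize the FIG. 10 layout (areas 701 to 705) is sketched below with Python's struct module; the field widths, byte order, and value encodings are assumptions, since the text names the fields but not their encodings.

```python
import struct

def pack_packet(frame_number, sensor_values, audio_bytes):
    """Area 701: frame number; 702: sensor-data size; 703: sensor values;
    704: audio-data size; 705: audio data. Little-endian 32-bit fields are assumed."""
    sensor_blob = struct.pack("<%df" % len(sensor_values), *sensor_values)            # area 703
    header = struct.pack("<II", frame_number, len(sensor_blob))                       # areas 701, 702
    return header + sensor_blob + struct.pack("<I", len(audio_bytes)) + audio_bytes   # areas 704, 705

def unpack_packet(packet):
    frame_number, sensor_size = struct.unpack_from("<II", packet, 0)
    sensor_values = struct.unpack_from("<%df" % (sensor_size // 4), packet, 8)
    audio_size, = struct.unpack_from("<I", packet, 8 + sensor_size)
    audio = packet[12 + sensor_size:12 + sensor_size + audio_size]
    return frame_number, list(sensor_values), audio

# Example: one packet carrying the speed (km/h) and a wiper flag alongside four encoded audio bytes.
pkt = pack_packet(frame_number=42, sensor_values=[62.0, 1.0], audio_bytes=b"\x01\x02\x03\x04")
assert unpack_packet(pkt) == (42, [62.0, 1.0], b"\x01\x02\x03\x04")
```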
  • When the output of the sensors constituting the sensor 12 changes from moment to moment within a short time, the sensor information stored in area 703 also differs from packet to packet.
  • For example, suppose the speech recognition terminal 2 is an in-vehicle speech recognition device and the sensor 12 measures the loudness of the background noise (for example, with a microphone different from the microphone 1); the loudness of the background noise may then vary significantly during a single utterance.
  • When the sensor information changes significantly during the utterance in this way, it is desirable for the terminal-side transmission unit 13 to split the voice data at that point, regardless of the packet size determined by the characteristics of the network 5, and to send packets containing the different sensor information.
  • the server side receiving unit 21 takes in the sensor information and the voice data to the voice recognition server 6 (step S403).
  • The server-side acoustic analysis unit 28 performs acoustic analysis of the acquired audio data and calculates the time series of speech feature quantities (step S404).
  • The server-side acoustic model selection unit 23 selects the most appropriate acoustic model based on the acquired sensor information (step S109), and the server-side matching unit 26 matches the standard pattern of this acoustic model against the speech feature quantity (step S405).
  • As described above, in the fourth embodiment the speech recognition terminal 2 transfers the sensor information and the voice data to the speech recognition server 6, so that, without the speech recognition terminal 2 performing acoustic analysis, highly accurate speech recognition can be performed based on an acoustic model suited to the sound collection environment.
  • The speech recognition function can thus be realized without providing a dedicated program on the terminal side.
  • Since the sensor information is transmitted for each frame, an appropriate acoustic model can be selected and speech recognition performed even when the environmental conditions under which the microphone 1 collects sound change rapidly during the utterance.
  • The method of dividing the transmission from the speech recognition terminal 2 into a plurality of frames can also be applied to the transmission of the speech feature quantity in the third embodiment. Since the speech feature quantity has a time-series structure, it is preferable, when dividing it into frames, to divide it in time-series order.
  • If the sensor information at the corresponding time in the time series is stored in each frame in the same manner as in the fourth embodiment, and the speech recognition server 6 selects a suitable acoustic model based on the latest sensor information included in each frame, the accuracy of speech recognition can be further improved.
  • Embodiment 5. In the speech recognition systems of the first to fourth embodiments, an acoustic model stored in the speech recognition terminal 2 or the server 6 was selected based on the environmental conditions acquired by the sensor 12 of the speech recognition terminal 2, and speech recognition was performed accordingly. However, instead of using only the environmental information obtained from the sensor 12, it is also conceivable to select an acoustic model by additionally combining it with additional information obtained from the Internet or the like.
  • the voice recognition system according to the fifth embodiment has such features.
  • The feature of the fifth embodiment is that the acoustic model is selected by combining additional information obtained from the Internet with the sensor information; this can be combined with the speech recognition system of any of the first to fourth embodiments, with the same effect in each case. Here, as an example, the case where the speech recognition system of the first embodiment is combined with additional information obtained from the Internet is described.
  • FIG. 11 is a block diagram illustrating the configuration of the speech recognition system according to the fifth embodiment.
  • The speech recognition system of the fifth embodiment is the same as the speech recognition system of the first embodiment, except that an Internet information acquisition unit 29 is added.
  • The components denoted by the same reference numerals as in FIG. 1 are the same as in the first embodiment, and their description is not repeated.
  • The Internet information acquisition unit 29 is a unit that acquires additional information via the Internet; more specifically, it acquires Web pages by HTTP (Hyper Text Transfer Protocol) and has functions equivalent to those of an Internet browser.
  • The additional information is, for example, weather information or traffic information.
  • There are Web sites on the Internet that provide weather information and traffic information, and from these Web sites it is possible to obtain the weather conditions, traffic congestion information, road construction status, and so on at various places. Therefore, in order to perform speech recognition with higher accuracy using such additional information, acoustic models matching the available additional information are prepared.
  • For example, when weather information is used as the additional information, an acoustic model is learned taking into account the effect of the background noise caused by rain and the like; when the additional information indicates road construction, an acoustic model is learned in consideration of the influence of the background noise generated by the road construction and the like.
  • FIG. 12 is a flowchart showing the operation of the speech recognition terminal 2 and the server 6 according to the fifth embodiment.
  • the only difference between the flowchart of FIG. 12 and the flowchart of FIG. 2 is the presence or absence of step S501. Therefore, hereinafter, the processing of step S501 will be mainly described.
  • After the speech recognition server 6 receives the sensor information (step S108), the Internet information acquisition unit 29 acquires from the Internet information that affects the sound collection environment of the microphone 1 connected to the speech recognition terminal 2 (step S501). For example, when the sensor 12 includes a GPS antenna, the sensor information contains the position of the speech recognition terminal 2 and the microphone 1; in step S501, based on this position information, additional information such as the weather information and traffic information for the location of the speech recognition terminal 2 and the microphone 1 is obtained from the Internet.
  • Next, the server-side acoustic model selection unit 23 selects an acoustic model based on the sensor information and the additional information. Specifically, it is first determined which acoustic models have additional information matching the additional information for the current location of the speech recognition terminal 2 and the microphone 1. Then, from among the acoustic models whose additional information matches, the acoustic model with the smallest distance value to the sensor information, calculated by equation (1) shown in the first embodiment, is selected.
  • Even if the conditions under which an acoustic model was learned cannot be completely expressed by the sensor information alone, they can be expressed with the help of the additional information, so an acoustic model more appropriate to the sound collection environment of the microphone 1 can be selected. As a result, the speech recognition accuracy can be improved.
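  • The two-stage selection of the fifth embodiment could be sketched as follows: first filter the candidate acoustic models by the additional information fetched from the Internet (for example weather or road construction), then apply the equation (1) distance to the sensor information among the remaining candidates. The field names and values are illustrative assumptions.

```python
def select_model(models, sensor_info, extra_info, weights):
    """models: {id: {"conditions": {...}, "extra": {...}}}; extra_info: e.g. weather/construction flags."""
    # Stage 1: keep only the models whose additional information matches the current additional information.
    candidates = {m: spec for m, spec in models.items() if spec["extra"] == extra_info}
    if not candidates:               # if nothing matches exactly, fall back to all models
        candidates = models
    # Stage 2: equation (1) distance between the sensor information and each model's learning conditions.
    def dist(spec):
        return sum(weights[k] * abs(sensor_info[k] - spec["conditions"][k]) for k in sensor_info)
    return min(candidates, key=lambda m: dist(candidates[m]))

models = {
    "clear_slow": {"conditions": {"speed_kmh": 20.0}, "extra": {"weather": "clear", "construction": False}},
    "rain_fast":  {"conditions": {"speed_kmh": 90.0}, "extra": {"weather": "rain",  "construction": False}},
    "rain_slow":  {"conditions": {"speed_kmh": 30.0}, "extra": {"weather": "rain",  "construction": False}},
}
best = select_model(models, {"speed_kmh": 35.0},
                    {"weather": "rain", "construction": False}, {"speed_kmh": 1.0})
# best == "rain_slow": it matches the weather obtained in step S501 and is closest in speed
```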
  • The significance of using the additional information lies in selecting acoustic models on the basis of environmental factors that degrade speech recognition accuracy but cannot be represented by the sensor information alone. The means of acquiring such additional information is therefore not limited to the Internet.
  • A dedicated system or a dedicated computer for providing the additional information may be prepared instead.
  • The voice recognition system, terminal, and server according to the present invention are useful for providing high-precision voice recognition in whatever environment they are used, and are suitable for providing a voice recognition function to devices, such as car navigation systems and mobile phones, in which the capacity of the storage device that can be mounted is limited by constraints on housing size, weight, and price.
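As an illustration of the selection described above, the following sketch, written in Python, combines the additional information with the sensor-information distance of equation (1). The data structures (AcousticModel, the keys of sensor_info and additional_info) and the fallback used when no additional information matches are assumptions introduced for this example, not part of the specification.

```python
# Minimal sketch of the Embodiment 5 selection step (hypothetical data structures).
from dataclasses import dataclass

@dataclass
class AcousticModel:
    name: str
    sensor_info: dict       # training-time sensor values, e.g. {"speed_kmh": 40}
    additional_info: dict   # training-time weather/traffic labels, e.g. {"weather": "rain"}

def distance(sensor_now: dict, model: AcousticModel, weights: dict) -> float:
    # Equation (1): weighted sum of per-sensor absolute differences.
    return sum(weights.get(k, 1.0) * abs(sensor_now[k] - model.sensor_info[k])
               for k in sensor_now)

def select_model(models, sensor_now, additional_now, weights):
    # First keep only the models whose additional (weather/traffic) information
    # matches the information fetched from the Internet for the terminal's location,
    # then pick the candidate with the smallest distance value of equation (1).
    candidates = [m for m in models if m.additional_info == additional_now] or models
    return min(candidates, key=lambda m: distance(sensor_now, m, weights))
```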

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice recognition system that performs high-accuracy voice recognition in a variety of usage environments. In a client-server voice recognition system, a voice recognition terminal (2) and a voice recognition server (6) connected by a network share the voice recognition processing of calculating a voice feature amount from the voice signal collected by an external microphone (1), storing a plurality of acoustic models, selecting from them an acoustic model suited to the sound collection environment of the external microphone (1), and outputting a recognition result by pattern matching between the standard patterns of the acoustic model and the voice feature amount. The voice recognition terminal (2) is provided with a sensor (12) for detecting the sound collection environment of the external microphone (1) and with a section (13) for transmitting the output of the sensor (12) to the voice recognition server (6).

Description

Specification
Voice recognition system and its terminal and server
Technical Field
The present invention relates to a speech recognition system and to its terminal and server, and in particular to a technique of selecting, from a plurality of acoustic models prepared under various assumed conditions of use, an acoustic model appropriate to the current conditions of use and performing speech recognition with it.
Background Art
Speech recognition is performed by extracting a time series of speech feature quantities from the input speech and matching this time series against acoustic models prepared in advance to compute recognition scores.
However, background noise is superimposed on speech uttered in a real usage environment, so the accuracy of speech recognition degrades. The kind of background noise and the way it is superimposed differ with the usage environment. Accurate speech recognition therefore requires preparing a plurality of acoustic models and selecting, from among them, the acoustic model suited to the current usage environment. As a method of selecting such an acoustic model there is, for example, Japanese Patent Application Laid-Open No. 2000-295500 (Patent Document 1).
In the acoustic model selection method of Patent Document 1, for example in an in-vehicle speech recognition apparatus, a noise spectrum is calculated from the noise corresponding to the values output by various in-vehicle sensors such as a speed sensor (these values are the data obtained by A/D conversion of the analog signal from the sensor, and are hereafter called sensor information), and the noise spectrum is stored in association with the sensor information from the various in-vehicle sensors. At the next speech recognition, if the sensor information obtained from the sensors and the sensor information of a stored noise spectrum are similar to within a predetermined value, the noise spectrum corresponding to that sensor information is removed from the time series of speech feature quantities. This method, however, has the problem that it cannot improve recognition accuracy in an environment that has never been encountered before. A conceivable remedy is, for example, to select in advance at factory shipment several values from the output values of the various sensors, to train acoustic models under the conditions in which the sensors output those values, and then to select an appropriate acoustic model by comparing the sensor information obtained in the actual usage environment with the training conditions of the acoustic models.
Meanwhile, the data size of a single acoustic model, although it depends on how the speech recognition system is designed and implemented, can reach several hundred kilobytes. In mobile devices such as car navigation systems and mobile phones, the capacity of the storage device that can be mounted is severely limited by constraints on housing size and weight. It is therefore not realistic to adopt a configuration in which a mobile device stores a plurality of acoustic models of such a data size.
In particular, when there are a plurality of sensors, selecting several sensor-information values for each sensor and preparing acoustic models corresponding to all of their combinations would require an enormous storage capacity.
The present invention has been made to solve the above problems, and its object is to realize highly accurate speech recognition processing by transmitting sensor information from a speech recognition terminal over a network to a speech recognition server that stores a plurality of acoustic models, so that an acoustic model suited to the actual usage environment can be selected.
Disclosure of the Invention
The speech recognition system according to the present invention is
a speech recognition system in which a speech recognition server and a plurality of speech recognition terminals are connected by a network, wherein
the speech recognition terminal comprises:
an input terminal to which an external microphone is connected and into which the speech signal collected by the external microphone is input;
client-side acoustic analysis means for calculating speech feature quantities from the speech signal input from the input terminal;
a sensor for detecting sensor information representing the environment of the noise superimposed on the speech signal, and client-side transmitting means for transmitting the sensor information to the speech recognition server via the network; and
client-side receiving means for receiving an acoustic model from the speech recognition server, and client-side matching means for matching the acoustic model against the speech feature quantities; and
the speech recognition server comprises:
server-side receiving means for receiving the sensor information transmitted by the client-side transmitting means;
server-side acoustic model storage means for storing a plurality of acoustic models;
server-side acoustic model selection means for selecting, from the plurality of acoustic models, an acoustic model that matches the sensor information; and
server-side transmitting means for transmitting the acoustic model selected by the server-side acoustic model selection means to the speech recognition terminal.
In this way, in this speech recognition system, a speech recognition server whose storage capacity is not restricted stores a plurality of acoustic models corresponding to various sound collection environments, and an acoustic model suited to the sound collection environment of each speech recognition terminal is selected on the basis of the information from the sensor provided in that terminal and transmitted to the terminal. As a result, even a speech recognition terminal whose own storage capacity is limited by constraints such as housing size and weight acquires an acoustic model suited to its sound collection environment and performs speech recognition with that model, so the accuracy of speech recognition can be improved.
Brief Description of the Drawings
FIG. 1 is a block diagram showing the configuration of the speech recognition terminal and server according to Embodiment 1 of the present invention;
FIG. 2 is a flowchart showing the operation of the speech recognition terminal and server according to Embodiment 1 of the present invention;
FIG. 3 is a block diagram showing the configuration of the speech recognition terminal and server according to Embodiment 2 of the present invention;
FIG. 4 is a flowchart showing the clustering processing of acoustic models according to Embodiment 2 of the present invention;
FIG. 5 is a flowchart showing the operation of the speech recognition terminal and server according to Embodiment 2 of the present invention;
FIG. 6 is a block diagram showing the configuration of the speech recognition terminal and server according to Embodiment 3 of the present invention;
FIG. 7 is a flowchart showing the operation of the speech recognition terminal and server according to Embodiment 3 of the present invention;
FIG. 8 is a block diagram showing the configuration of the speech recognition terminal and server according to Embodiment 4 of the present invention;
FIG. 9 is a flowchart showing the operation of the speech recognition terminal and server according to Embodiment 4 of the present invention;
FIG. 10 is a diagram showing the data format of the sensor information and voice data transmitted from the speech recognition terminal to the speech recognition server according to Embodiment 4 of the present invention;
FIG. 11 is a block diagram showing the configuration of the speech recognition terminal and server according to Embodiment 5 of the present invention; and
FIG. 12 is a flowchart showing the operation of the speech recognition terminal and server according to Embodiment 5 of the present invention.
Best Mode for Carrying Out the Invention
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a speech recognition terminal and server according to one embodiment of the present invention. In the figure, the microphone 1 is a device or component that collects speech, and the speech recognition terminal 2 is a device that recognizes the speech collected by the microphone 1 and input via the input terminal 3, and outputs a recognition result 4. The input terminal 3 is an audio terminal or a microphone connection jack.
The speech recognition terminal 2 is connected to a speech recognition server 6 via a network 5. The network 5 is a network that carries digital information, such as the Internet, a LAN (Local Area Network), a public telephone network, a mobile phone network, or a communication network using satellites. It is sufficient, however, that the network 5 allows digital data to be exchanged between the devices connected to it; the format of the information carried on the network 5 does not matter. It may therefore be, for example, a bus designed to connect a plurality of devices, such as USB (Universal Serial Bus) or SCSI (Small Computer System Interface). When the speech recognition terminal 2 is a vehicle-mounted speech recognition device, the network 5 uses a mobile data communication service. In a data communication service, the data to be transmitted are divided into units called packets, which are sent and received one by one. Besides the data that the sender intends to transmit to the receiver, each packet carries control information such as information identifying the receiver (the destination address), position information indicating which part of the whole data the packet constitutes, and error correction information. The speech recognition server 6 is a server computer configured to be connected to the speech recognition terminal 2 via the network 5. The speech recognition server 6 has a storage device, such as a hard disk drive or memory, of larger capacity than the speech recognition terminal 2, and stores the standard patterns required for speech recognition. A plurality of speech recognition terminals 2 are connected to the speech recognition server 6 via the network 5.
Next, the detailed configuration of the speech recognition terminal 2 will be described. The speech recognition terminal 2 comprises a terminal-side acoustic analysis unit 11, a sensor 12, a terminal-side transmitting unit 13, a terminal-side receiving unit 14, a terminal-side acoustic model storage unit 15, a terminal-side acoustic model selection unit 16, and a terminal-side matching unit 17.
The terminal-side acoustic analysis unit 11 is a unit that performs acoustic analysis on the speech signal input from the input terminal 3 and calculates speech feature quantities.
The sensor 12 is a sensor that detects environmental conditions in order to obtain information on the kind of noise superimposed on the speech signal picked up by the microphone 1; it is an element or device that detects or acquires physical quantities, or changes in them, in the environment in which the microphone 1 is installed. It may also include an element or device that converts the detected quantity into an appropriate signal and outputs it. The physical quantities referred to here include pressure, flow rate, light, and magnetism, as well as time, electromagnetic waves, and the like. Thus, for example, a GPS antenna is a sensor for GPS signals. The sensor need not necessarily detect a physical quantity by acquiring a signal from the outside world; for example, a circuit that obtains the time at the location of the microphone from a built-in clock is also included among the sensors referred to here.
In the following description, these physical quantities are collectively referred to as sensor information. In general, a sensor outputs an analog signal, and the usual configuration is to sample the output analog signal into a digital signal with an A/D converter or conversion element; the sensor 12 may therefore include such an A/D converter or element. Furthermore, a plurality of types of sensors may be combined; for example, when the speech recognition terminal 2 is a terminal of a car navigation system, a speed sensor, a sensor monitoring the engine speed, a sensor monitoring the operation of the wipers, a sensor monitoring whether the door windows are open or closed, a sensor monitoring the volume of the car audio system, and so on may be combined.
The terminal-side transmitting unit 13 is a unit that transmits the sensor information obtained by the sensor 12 in the vicinity of the microphone 1 to the speech recognition server 6.
The terminal-side receiving unit 14 is a unit that receives information from the speech recognition server 6 and outputs the received information to the terminal-side acoustic model selection unit 16. The terminal-side transmitting unit 13 and the terminal-side receiving unit 14 are composed of circuits or elements that send signals onto, and receive signals from, the network cable; a computer program for controlling these circuits or elements may also be included in the terminal-side transmitting unit 13 and the terminal-side receiving unit 14. When the network 5 is a wireless network, the terminal-side transmitting unit 13 and receiving unit 14 are provided with antennas for transmitting and receiving radio waves. The terminal-side transmitting unit 13 and the terminal-side receiving unit 14 may be configured as separate units, or may be configured as a single network input/output device.
The terminal-side acoustic model storage unit 15 is a storage element or circuit for storing acoustic models. A plurality of acoustic models may exist, one for each training environment, and only some of them are stored in the terminal-side acoustic model storage unit 15. Each acoustic model is associated with the sensor information representing the environmental conditions under which it was trained, so that the acoustic model suited to given environmental conditions can be identified from the numerical values of the sensor information. For example, when the speech recognition terminal 2 is a vehicle-mounted speech recognition device, acoustic models such as one created from samples uttered in the noise environment of a car traveling at 40 km/h and one created from samples uttered in the noise environment of a car traveling at 50 km/h are prepared. However, as described later, the speech recognition server 6 also stores acoustic models corresponding to various environmental conditions, so the terminal-side acoustic model storage unit 15 need not store acoustic models trained under all conditions. By adopting such a configuration, the storage capacity that the speech recognition terminal 2 must carry can be kept extremely small.
The terminal-side acoustic model selection unit 16 is a unit that calculates the likelihood between the acoustic model acquired by the terminal-side receiving unit 14 (or an acoustic model stored in the terminal-side acoustic model storage unit 15) and the speech feature quantities output by the terminal-side acoustic analysis unit 11. The terminal-side matching unit 17 is a unit that selects a vocabulary entry on the basis of the likelihood calculated by the terminal-side acoustic model selection unit 16 and outputs it as the recognition result 4.
Of the components of the speech recognition terminal 2, the terminal-side acoustic analysis unit 11, the terminal-side transmitting unit 13, the terminal-side receiving unit 14, the terminal-side acoustic model storage unit 15, the terminal-side acoustic model selection unit 16, and the terminal-side matching unit 17 may each be configured as a dedicated circuit, or they may be configured as computer programs that cause a central processing unit (CPU), a network I/O device (such as a network adapter), and a storage device to execute the processing corresponding to each function.
Next, the detailed configuration of the speech recognition server 6 will be described. The speech recognition server 6 comprises a server-side receiving unit 21, a server-side acoustic model storage unit 22, a server-side acoustic model selection unit 23, and a server-side transmitting unit 24. The server-side receiving unit 21 is a unit that receives the sensor information transmitted from the terminal-side transmitting unit 13 of the speech recognition terminal 2 via the network 5.
The server-side acoustic model storage unit 22 is a storage device for storing a plurality of acoustic models. It is configured as a large-capacity storage device such as a hard disk drive or a combination of a CD-ROM medium and a CD-ROM drive.
Unlike the terminal-side acoustic model storage unit 15, the server-side acoustic model storage unit 22 stores all the acoustic models that may be used in this speech recognition system, and has sufficient storage capacity to do so. The server-side acoustic model selection unit 23 is a unit that selects, from the acoustic models stored in the server-side acoustic model storage unit 22, the acoustic model suited to the sensor information received by the server-side receiving unit 21.
The server-side transmitting unit 24 is a unit that transmits the acoustic model selected by the server-side acoustic model selection unit 23 to the speech recognition terminal 2 via the network 5.
Of the components of the speech recognition server 6, the server-side receiving unit 21, the server-side acoustic model storage unit 22, the server-side acoustic model selection unit 23, and the server-side transmitting unit 24 may each be configured as a dedicated circuit, or they may be configured as computer programs that cause a CPU, a network I/O device (such as a network adapter), and a storage device to execute the processing corresponding to each function.
Next, the operation of the speech recognition terminal 2 and the speech recognition server 6 will be described with reference to the drawings. FIG. 2 is a flowchart showing the processing of the speech recognition terminal 2 and the speech recognition server 6 according to Embodiment 1. When the user inputs speech through the microphone 1 (step S101), a speech signal is input to the terminal-side acoustic analysis unit 11 via the input terminal 3. The terminal-side acoustic analysis unit 11 converts it into a digital signal with an A/D converter and calculates a time series of speech feature quantities such as the LPC cepstrum (Linear Predictive Coding Cepstrum) (step S102).
Next, the sensor 12 acquires physical quantities around the microphone 1 (step S103). For example, when the speech recognition terminal 2 is a car navigation system and the sensor 12 is a speed sensor that detects the speed of the vehicle on which the car navigation system is mounted, the vehicle speed corresponds to such a physical quantity. In FIG. 2 the acquisition of sensor information in step S103 is shown as following the acoustic analysis of step S102; needless to say, however, the processing of step S103 may be performed before steps S101 and S102, or simultaneously or in parallel with them. Subsequently, the terminal-side acoustic model selection unit 16 selects the acoustic model trained under the conditions closest to the sensor information obtained by the sensor 12, that is, closest to the environment in which the microphone 1 picks up the speech. Many training conditions are conceivable for acoustic models, and the terminal-side acoustic model storage unit 15 does not store models for all of them. Therefore, when none of the acoustic models stored in the terminal-side acoustic model storage unit 15 was trained under environmental conditions close to those of the microphone 1, an acoustic model is acquired from the speech recognition server 6. Before describing the processing, terms and notation are defined. The sensor information about sensor k under the conditions in which acoustic model m was trained is simply called the "sensor information of acoustic model m". The terminal-side acoustic model storage unit 15 stores M acoustic models, each denoted acoustic model m (m = 1, 2, ..., M). The sensor 12 consists of K sensors, each denoted sensor k (k = 1, 2, ..., K). The sensor information about sensor k under the environmental conditions in which acoustic model m was trained is denoted S_m,k, and the current sensor information of sensor k (the sensor information output in step S103) is denoted x_k. These processes are now described more specifically. First, the terminal-side acoustic model selection unit 16 calculates the distance value D(m) between the sensor information S_m,k of acoustic model m and the sensor information x_k acquired by the sensor 12 (step S104). Let D_k(x_k, S_m,k) denote the distance value between the sensor information x_k of sensor k and the sensor information S_m,k of acoustic model m. As a concrete value of D_k(x_k, S_m,k), the absolute value of the difference of the sensor information, for example, may be used. That is, if the sensor information is a speed, the difference (10 km/h) between the speed at training time (for example, S_m,k = 40 km/h) and the current speed (for example, x_k = 50 km/h) is taken as the distance value D_k(x_k, S_m,k).
The distance value D(m) is calculated from the per-sensor distance values D_k(x_k, S_m,k) as follows:

D(m) = Σ_{k=1}^{K} w_k · D_k(x_k, S_m,k)    (1)

where w_k is a weighting coefficient for each sensor.
Here, the relation between the sensor information as physical quantities and the distance value D(m) is explained. When one item of sensor information is a position (which may be defined by longitude and latitude, or by the distance from a specific place taken as the origin) and another is a speed, the two differ in physical dimension. However, since the contribution of w_k · D_k(x_k, S_m,k) to the distance value can be set appropriately by adjusting the weighting coefficient w_k, the difference in dimension can be ignored without any problem. The same applies when the unit systems differ. For example, when km/h is used as the unit of speed and when mph is used, the sensor information can take different values even for physically the same speed. In such a case, if a weighting coefficient of 1.6 is given to speed values calculated in km/h and a weighting coefficient of 1.0 to speed values calculated in mph, the effect of speed on the distance calculation can be made equal. Next, the terminal-side acoustic model selection unit 16 obtains the minimum value min{D(m)} of the distance values D(m) calculated by equation (1) over all m, and evaluates whether this min{D(m)} is smaller than a predetermined value T (step S105). That is, it determines whether the training conditions of the terminal-side acoustic models stored in the terminal-side acoustic model storage unit 15 include conditions sufficiently close to the current environmental conditions in which the microphone 1 collects sound. The predetermined value T is a value set in advance for making this determination.
When min{D(m)} is smaller than the predetermined value T (step S105: Yes), the process proceeds to step S106. The terminal-side acoustic model selection unit 16 selects the terminal-side acoustic model m as the acoustic model suited to the current environment in which the microphone 1 collects sound (step S106), and the process proceeds to the matching processing (step S112). The subsequent processing is described later.
When min{D(m)} is equal to or larger than the predetermined value T (step S105: No), the process proceeds to step S107. In this case, none of the training conditions of the acoustic models stored in the terminal-side acoustic model storage unit 15 is sufficiently close to the current environmental conditions in which the microphone 1 collects sound. The terminal-side transmitting unit 13 therefore transmits the sensor information to the speech recognition server 6 (step S107).
If the predetermined value T is made larger, min{D(m)} is judged to be smaller than T more often, and step S107 is executed less often. That is, a larger value of T reduces the number of transmissions and receptions over the network 5, which has the effect of suppressing the traffic on the network 5.
Conversely, if the value of T is made smaller, the number of transmissions and receptions over the network 5 increases. In this case, however, speech recognition is performed with an acoustic model whose training conditions have a smaller distance value from the sensor information acquired by the sensor 12, so the accuracy of speech recognition can be improved. The value of T should therefore be determined in consideration of the transmission capacity of the network 5 and the target speech recognition accuracy.
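The terminal-side decision of steps S104 to S107 can be sketched as follows, assuming that D_k is the absolute difference of sensor values as in the example above; the function names and the request_from_server callback are assumptions made only for this illustration.

```python
# Sketch of steps S104-S107: compute D(m) for every local model and compare min D(m) with T.
def weighted_distance(x, s, w):
    # Equation (1): D(m) = sum_k w_k * D_k(x_k, S_m,k), with D_k taken as |x_k - S_m,k|.
    return sum(w[k] * abs(x[k] - s[k]) for k in range(len(x)))

def choose_acoustic_model(local_models, x, w, T, request_from_server):
    # local_models: list of (model, S_m) pairs held by the terminal-side storage unit 15.
    best_model, best_d = None, float("inf")
    for model, s_m in local_models:
        d = weighted_distance(x, s_m, w)
        if d < best_d:
            best_model, best_d = model, d
    if best_d < T:                      # step S105: Yes -> use the local model (step S106)
        return best_model
    return request_from_server(x)       # step S107: send the sensor information to the server
```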
In the speech recognition server 6, the server-side receiving unit 21 receives the sensor information via the network 5 (step S108). The server-side acoustic model selection unit 23 then calculates, in the same way as in step S104, the distance values between the sensor information received by the server-side receiving unit 21 and the environmental conditions under which the acoustic models stored in the server-side acoustic model storage unit 22 were trained, and selects the acoustic model with the smallest distance value (step S109). The server-side transmitting unit 24 then transmits the acoustic model selected by the server-side acoustic model selection unit 23 to the speech recognition terminal 2 (step S110).
The terminal-side receiving unit 14 of the speech recognition terminal 2 receives, via the network 5, the acoustic model transmitted by the server-side transmitting unit 24 (step S111).
Next, the terminal-side matching unit 17 performs matching processing between the speech feature quantities output by the terminal-side acoustic analysis unit 11 and the acoustic model (step S112). Here, the vocabulary entry with the highest score between the standard patterns stored as the acoustic model and the time series of speech feature quantities is taken as the recognition result 4; for example, pattern matching by DP (Dynamic Programming) matching is performed and the candidate with the smallest distance value is output as the recognition result 4. As described above, according to the speech recognition terminal 2 and server 6 of Embodiment 1, even when the speech recognition terminal 2 can store only a small number of acoustic models, the sound collection environment of the microphone 1 is captured by the sensor 12, and speech recognition can be performed with an acoustic model, selected from the many acoustic models stored in the speech recognition server 6, that was trained under environmental conditions close to that sound collection environment.
The speech recognition terminal 2 therefore need not be equipped with a large-capacity storage element, circuit, or storage medium; its hardware configuration can be simplified, and a speech recognition terminal that performs highly accurate speech recognition can be provided. As noted above, the data size of one acoustic model can amount to several hundred kilobytes, depending on the implementation, so the effect of reducing the number of acoustic models that the speech recognition terminal must store is significant.
Sensor information can take continuous values, but normally several values are chosen from that continuum and acoustic models are trained with those values as their sensor information. Suppose now that the sensor 12 consists of a plurality of sensors (a first sensor and a second sensor), that M1 is the number of values chosen as sensor information of the first sensor, and that M2 is the number of values chosen as sensor information of the second sensor for the acoustic models stored by the speech recognition terminal 2 and the speech recognition server 6. The total number of acoustic models stored by the speech recognition terminal 2 and the speech recognition server 6 is then calculated as M1 × M2.
In this case, when M1 < M2, that is, when the number of values chosen as sensor information of the first sensor is smaller than the number chosen for the second sensor, an acoustic model suited to the sound collection environment of the microphone 1 can be selected by making the weighting coefficient for the sensor information of the first sensor smaller than that for the sensor information of the second sensor.
In the above, the speech recognition terminal 2 is provided with the terminal-side acoustic model storage unit 15 and the terminal-side acoustic model selection unit 16, and speech recognition is performed while appropriately selecting between the acoustic models stored by the speech recognition terminal 2 and those stored by the speech recognition server 6. It is not essential, however, for the speech recognition terminal 2 to include the terminal-side acoustic model storage unit 15 and the terminal-side acoustic model selection unit 16. That is, a configuration is of course also possible in which an acoustic model stored by the speech recognition server 6 is always used on the basis of the sensor information acquired by the sensor 12. Even with such a configuration, the feature of this invention is preserved: the storage capacity of the speech recognition terminal 2 is reduced while an acoustic model matched to the sound collection environment of the microphone 1, as captured by the sensor 12, is selected and highly accurate speech recognition is performed.
In addition to the configuration described above, the acoustic model received from the speech recognition server 6 may be newly stored in the terminal-side acoustic model storage unit 15, or stored in place of an acoustic model already held on the speech recognition terminal 2 side. In this way, when the same acoustic model is needed for speech recognition next time, it does not have to be transferred again from the speech recognition server 6, so the transmission load on the network 5 is reduced and the time required for transmission and reception is shortened.
Embodiment 2.
According to the speech recognition terminal of Embodiment 1, when the terminal does not hold an acoustic model corresponding to the sensor information, an acoustic model suited to the sensor information is obtained from the speech recognition server.
Considering the data size per acoustic model, however, transferring an entire acoustic model from the speech recognition server to the speech recognition terminal over the network places a heavy load on the network, and the effect of the time required for transferring the acoustic model data on the overall processing performance cannot be ignored either.
One way to avoid these problems is to design the speech recognition processing so that the data size of the acoustic models is as small as possible. If the acoustic models are small, transferring an acoustic model from the speech recognition server to the speech recognition terminal does not place much load on the network.
Another conceivable method is to cluster acoustic models that are similar to one another and to compute in advance the differences between acoustic models within the same cluster; then, when an acoustic model stored by the speech recognition server is needed, only its difference from an acoustic model stored by the speech recognition terminal is transferred, and the server's acoustic model is synthesized from the acoustic model stored by the terminal and the difference. The speech recognition terminal and server of Embodiment 2 operate on this principle.
FIG. 3 is a block diagram showing the configuration of the speech recognition terminal and server according to Embodiment 2. In the figure, the acoustic model synthesis unit 18 is a unit that synthesizes an acoustic model equivalent to the one stored by the speech recognition server 6 from the content received by the terminal-side receiving unit 14 and an acoustic model stored in the terminal-side acoustic model storage unit 15. The acoustic model difference calculation unit 25 is a unit that calculates the difference between an acoustic model stored in the terminal-side acoustic model storage unit 15 and an acoustic model stored in the server-side acoustic model storage unit 22. The other units given the same reference numerals as in FIG. 1 are the same as in Embodiment 1, and their description is omitted.
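The roles of the acoustic model difference calculation unit 25 and the acoustic model synthesis unit 18 can be sketched as follows, assuming that an acoustic model is represented by its mean and variance vectors and that the difference is taken element-wise; the actual encoding of the difference data is not specified by this sketch.

```python
import numpy as np

def model_difference(server_model, local_model):
    # Unit 25 (server side): difference between two models of the same cluster,
    # taken element-wise over the mean and variance vectors (an assumed encoding).
    return {"d_mu": server_model["mu"] - local_model["mu"],
            "d_var": server_model["var"] - local_model["var"]}

def synthesize_model(local_model, diff):
    # Unit 18 (terminal side): reconstruct the server's model from the local model
    # and the received difference data.
    return {"mu": local_model["mu"] + diff["d_mu"],
            "var": local_model["var"] + diff["d_var"]}
```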
As mentioned above, the speech recognition terminal 2 and server 6 of Embodiment 2 are characterized in that the acoustic models are clustered in advance, so the clustering method for the acoustic models is described first. The clustering of the acoustic models is completed before speech recognition processing is performed by the speech recognition terminal 2 and the server 6. An acoustic model represents the statistics of the speech feature quantities of each word (or phoneme or syllable), obtained from a large amount of speech uttered by many speakers. The statistics consist of a mean vector μ = {μ(1), μ(2), ..., μ(K)} and a diagonal covariance vector Σ = {σ(1)², σ(2)², ..., σ(K)²}. The acoustic model of phoneme p is therefore denoted N_p{μ_p, Σ_p}.
The clustering of the acoustic models is performed with an LBG algorithm modified so as to split the cluster with the largest VQ distortion successively, as described below. FIG. 4 is a flowchart showing the clustering processing of the acoustic models.
First, an initial cluster is created (step S201). Here, one initial cluster is created from all the acoustic models that may be used in this speech recognition system. Equations (2) and (3) are used to calculate the statistics of the initial cluster r, where N denotes the number of distributions belonging to the cluster and K the number of dimensions of the speech feature quantities.
[Equations (2) and (3), reproduced only as images in the original publication, give the mean vector and the diagonal covariance of the initial cluster r computed from the N distributions belonging to it.]
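Since equations (2) and (3) are reproduced only as images, the following sketch shows one standard way of pooling N diagonal-Gaussian distributions into a single cluster statistic; it is an assumption about the content of those equations rather than a reproduction of them.

```python
import numpy as np

def pool_cluster_statistics(mus, variances):
    # mus, variances: arrays of shape (N, K) for the N distributions in the cluster.
    # Cluster mean: average of the member means (assumed form of equation (2)).
    mu_r = np.mean(mus, axis=0)
    # Cluster variance: average second moment minus the squared cluster mean
    # (assumed form of equation (3)).
    var_r = np.mean(variances + mus ** 2, axis=0) - mu_r ** 2
    return mu_r, var_r
```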
Next, it is determined whether the clustering processing executed so far has already produced the required number of clusters (step S202). The required number of clusters is decided when the speech recognition system is designed. Generally speaking, the larger the number of clusters, the smaller the distance between acoustic models within the same cluster. As a result, the amount of information in the difference data becomes smaller, and the amount of difference data transmitted and received over the network 5 is also reduced. In particular, when the total number of acoustic models stored by the speech recognition terminal 2 and the server 6 is large, the number of clusters should be made large.
However, simply increasing the number of clusters is not always appropriate, for the following reason. In Embodiment 2, an acoustic model stored by the speech recognition terminal 2 (hereinafter called a local acoustic model) and a difference are combined to synthesize the acoustic model stored by the speech recognition server 6, that is, to obtain an acoustic model equivalent to the one stored by the speech recognition server 6.
The difference used here is combined with a local acoustic model, and must therefore have been computed between that local acoustic model and an acoustic model belonging to the same cluster. Since the acoustic model synthesized from the difference corresponds to the sensor information, the most efficient situation is one in which the acoustic model corresponding to the sensor information and the local acoustic model are classified into the same cluster.
As the number of clusters increases, however, the number of acoustic models belonging to each cluster decreases, and the acoustic models become fragmented over many clusters. In that case, the number of acoustic models belonging to the same cluster as a local acoustic model stored by the speech recognition terminal 2 also tends to decrease, and the probability that the acoustic model corresponding to the sensor information and a local acoustic model stored by the speech recognition terminal 2 belong to the same cluster becomes smaller.
As a result, situations arise in which a difference between acoustic models belonging to different clusters cannot be prepared, or in which, even if a difference is prepared, its data size is not sufficiently small.
For these reasons, when the number of local acoustic models cannot be made large, that is, when sufficient capacity cannot be secured in the storage device, such as the memory or hard disk, mounted on the speech recognition terminal 2, the number of clusters should not be made large.
If the required number of clusters is two or more, the number of clusters immediately after the initial cluster is created is one, so the process proceeds to step S203 (step S202: No). If the processing described below has already produced a plurality of clusters and their number is equal to or greater than the required number, the process ends (step S202: Yes). Next, the cluster with the maximum VQ distortion is split (step S203). Here the cluster r_max with the largest VQ distortion (the initial cluster in the first loop) is split into two clusters r1 and r2, which increases the number of clusters. The statistics of the clusters after the split are calculated by the following equations, where Δ(k) is a small value predetermined for each dimension of the speech feature quantities.
μ_r1(k) = μ_rmax(k) + Δ(k)   (k = 1, 2, ..., K)   (4)
μ_r2(k) = μ_rmax(k) − Δ(k)   (k = 1, 2, ..., K)   (5)
σ_r1(k)² = σ_rmax(k)²   (k = 1, 2, ..., K)   (6)
σ_r2(k)² = σ_rmax(k)²   (k = 1, 2, ..., K)   (7)

Next, a distance value is calculated between the statistics of each acoustic model and the statistics of each cluster, including the clusters produced by the split in step S203 (step S204). One acoustic model and one of the clusters obtained so far are selected in turn and the distance between them is calculated; however, the distance is not calculated again for a combination of acoustic model and cluster for which it has already been obtained. To implement this control, each cluster may hold a flag marking the acoustic models whose distance has already been calculated. As the distance between the statistics of an acoustic model and the statistics of a cluster, the Bhattacharyya distance defined by equation (8) can be used, for example. [Equation (8), the Bhattacharyya distance between the two sets of statistics, is given as an image in the original document.]
In equation (8), the parameters with suffix 1 are the statistics of the acoustic model and the parameters with suffix 2 are the statistics of the cluster. Based on the distance values obtained in this way, each acoustic model is assigned to the cluster with the smallest distance value. The distance between the statistics of an acoustic model and the statistics of a cluster may also be calculated by a method other than equation (8); even in that case, it is desirable to adopt a formula under which acoustic models whose distance values according to equation (1) are close tend to be assigned to the same cluster, although this is not essential.
Next, the codebook of each cluster is updated (step S205). For this purpose, the representative values of the statistics of the acoustic models belonging to the cluster are calculated using equations (2) and (3). In addition, the distances between the statistics of the acoustic models belonging to the cluster and the representative values are accumulated using equation (8), and this sum is defined as the current VQ distortion of the cluster.
Next, the evaluation value of the clustering is calculated (step S206). Here the sum of the VQ distortions of all clusters is taken as the evaluation value of the clustering. Steps S204 to S207 form a loop that is executed several times, and the evaluation value calculated in step S206 is kept until the next pass through the loop. The difference between this evaluation value and the one calculated in the previous pass is then computed, and it is judged whether this difference is below a predetermined threshold (step S207). If the difference is below the threshold, every acoustic model has been assigned to an appropriate cluster among those obtained so far, and the process returns to step S202 (step S207: Yes). If the difference is at or above the threshold, some acoustic models do not yet belong to an appropriate cluster, and the process returns to step S204 (step S207: No).
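As a rough illustration of the clustering procedure of steps S201 to S207, the following Python sketch clusters acoustic-model statistics (mean and variance vectors) by repeatedly splitting the cluster with the largest VQ distortion and reassigning models by Bhattacharyya distance. The diagonal-Gaussian form assumed for equation (8), the per-dimension averaging assumed for the representative values of equations (2) and (3), the offset DELTA and the convergence threshold EPS are illustrative assumptions, not values taken from the embodiment.

```python
import math

DELTA = 0.01   # assumed small per-dimension offset used when splitting a cluster
EPS = 1e-3     # assumed convergence threshold for the evaluation value (step S207)

def bhattacharyya(mu1, var1, mu2, var2):
    """Bhattacharyya distance between diagonal Gaussians (assumed form of equation (8))."""
    return sum(0.25 * (a - b) ** 2 / (va + vb)
               + 0.5 * math.log((va + vb) / (2.0 * math.sqrt(va * vb)))
               for a, va, b, vb in zip(mu1, var1, mu2, var2))

def centroid(members, models):
    """Cluster representative: per-dimension average of member means and variances."""
    K = len(models[0][0])
    n = len(members)
    mu = [sum(models[i][0][k] for i in members) / n for k in range(K)]
    var = [sum(models[i][1][k] for i in members) / n for k in range(K)]
    return mu, var

def cluster_acoustic_models(models, n_clusters):
    """models: list of (mean_vector, variance_vector) statistics; returns member index lists."""
    clusters = [list(range(len(models)))]                          # step S201: initial cluster
    while len(clusters) < n_clusters:                              # step S202
        # step S203: split the cluster with the largest VQ distortion into two seed centroids
        cents = [centroid(c, models) for c in clusters]
        distortions = [sum(bhattacharyya(*models[i], *cents[ci]) for i in c)
                       for ci, c in enumerate(clusters)]
        worst = distortions.index(max(distortions))
        mu, var = cents[worst]
        cents[worst:worst + 1] = [([m + DELTA for m in mu], var),  # equations (4), (6)
                                  ([m - DELTA for m in mu], var)]  # equations (5), (7)
        prev_eval = None
        while True:                                                # steps S204 to S207
            clusters = [[] for _ in cents]
            for i, model in enumerate(models):                     # assign to nearest centroid
                best = min(range(len(cents)),
                           key=lambda c: bhattacharyya(*model, *cents[c]))
                clusters[best].append(i)
            cents = [centroid(c, models) if c else cents[ci]       # step S205: codebook update
                     for ci, c in enumerate(clusters)]
            evaluation = sum(bhattacharyya(*models[i], *cents[ci]) # step S206: total VQ distortion
                             for ci, c in enumerate(clusters) for i in c)
            if prev_eval is not None and abs(prev_eval - evaluation) < EPS:
                break                                              # step S207: converged
            prev_eval = evaluation
    return clusters
```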
This completes the clustering process. Next, the speech recognition processing performed in the speech recognition terminal 2 and the server 6 of embodiment 2 on the basis of the acoustic models clustered in this way is described with reference to the drawings. Fig. 5 is a flowchart of the operation of the speech recognition terminal 2 and the server 6. In steps S101 to S105, as in embodiment 1, speech is input from the microphone 1, acoustic analysis and sensor-information acquisition are performed, and it is then judged whether a local acoustic model suited to the sensor information exists. If even the local acoustic model whose distance to the sensor information is smallest (the number or name identifying this local acoustic model is called m) does not bring that distance below the predetermined threshold T, the process proceeds to step S208 (step S105: No).
Next, the terminal-side transmission unit 13 transmits the sensor information and the information m identifying the local acoustic model to the speech recognition server 6 (step S208).
The server-side reception unit 21 receives the sensor information and m (step S209), and the server-side acoustic model selection unit 23 selects the acoustic model best suited to the received sensor information (step S109). It is then judged whether this acoustic model and the local acoustic model m belong to the same cluster (step S210). If they belong to the same cluster, the process proceeds to step S211 (step S210: Yes); the acoustic model difference calculation unit 25 calculates the difference between this acoustic model and the local acoustic model m (step S211), and the server-side transmission unit 24 transmits the difference to the speech recognition terminal 2 (step S212).
The difference can be obtained, for example, from the differences between the values of the individual components of the speech feature parameters and from offset shifts (differences in the storage positions of the respective elements). Techniques for computing a difference between two pieces of binary data (for example between binary files) are well known and may be used here. Moreover, the method of embodiment 2 places no special requirements on the data structure of the acoustic model, so it is also possible to design a data structure from which differences are easy to compute.
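The description leaves the concrete difference encoding open. The following sketch simply assumes that an acoustic model is a flat array of numeric parameters in a fixed layout and records only the positions whose values differ; the tolerance and the pair-list encoding are illustrative assumptions, not part of the embodiment.

```python
def compute_model_diff(local_params, selected_params, tol=0.0):
    """Record (index, new_value) pairs where the selected model differs from the local one.

    Assumes both models share the same parameter layout (same cluster), so a
    positional comparison is meaningful (server side, step S211)."""
    return [(i, b) for i, (a, b) in enumerate(zip(local_params, selected_params))
            if abs(a - b) > tol]

def apply_model_diff(local_params, diff):
    """Synthesize the server-selected model from the local model m and the received
    difference (terminal side, corresponds to step S214)."""
    params = list(local_params)
    for i, value in diff:
        params[i] = value
    return params
```

Under this assumption the server would send the output of compute_model_diff only when the two models fall in the same cluster (step S210: Yes), and the full parameter array otherwise.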
On the other hand, if they do not belong to the same cluster, the process proceeds directly to step S212 (step S210: No). In this case the server-side transmission unit 24 transmits the selected acoustic model itself rather than a difference (step S212).
In the processing described above, it is assumed that the difference is generated with respect to the local acoustic model that the speech recognition terminal 2 judged most suitable for the sensor information (the acoustic model judged in step S105 to have the smallest distance to the sensor information); that is why the information identifying this local acoustic model m was transmitted beforehand in step S208. Alternatively, the speech recognition server 6 may keep track of (or manage) which local acoustic models the speech recognition terminal 2 stores; after selecting the acoustic model close to the sensor information, the server then chooses, from the managed local acoustic models, one belonging to the same cluster as the selected acoustic model and calculates the difference between them. In this case the speech recognition terminal 2 must be told which local acoustic model the difference calculated by the speech recognition server 6 is based on, so in step S212 the server also transmits information identifying the local acoustic model used as the basis of the difference calculation.
Next, the terminal-side reception unit 14 of the speech recognition terminal 2 receives the difference data or the acoustic model (step S213). If the received data is a difference, the acoustic model synthesis unit 18 synthesizes an acoustic model from the local acoustic model m on which the difference is based and the difference itself (step S214). The terminal-side matching unit 17 then performs pattern matching between the standard patterns of the acoustic model and the speech feature quantities and outputs the vocabulary item with the highest likelihood as the recognition result 4.
As is clear from the above, in embodiment 2 only the difference between a local acoustic model stored in the speech recognition terminal 2 and the acoustic model stored in the speech recognition server 6 is transmitted and received over the network. Therefore, in addition to the effect of embodiment 1 that highly accurate speech recognition based on acoustic models matched to the sound collection environment of the microphone 1 can be performed even when the storage capacity of the speech recognition terminal 2 is small, the load placed on the network is reduced and the time required for data transfer is shortened, which improves processing performance.
Embodiment 3.
The speech recognition terminal 2 of embodiments 1 and 2 performs speech recognition matched to the sound collection environment of the microphone 1 by receiving, via the network 5, an acoustic model stored in the speech recognition server 6 even when the terminal does not hold the acoustic model needed for the recognition processing. Instead of transmitting and receiving acoustic models, however, the speech feature quantities may be transmitted and received. The speech recognition terminal and server of embodiment 3 operate on this principle.
Fig. 6 is a block diagram showing the configuration of the speech recognition terminal and server of embodiment 3. Parts given the same reference numerals as in Fig. 1 are the same as in embodiment 1, and their description is omitted. In embodiment 3 the speech recognition terminal 2 and the speech recognition server 6 are again connected via the network 5. Embodiment 3 differs from embodiment 1, however, in that the speech feature quantities and the sensor information are transmitted from the speech recognition terminal 2 to the speech recognition server 6 and in that the recognition result 7 is output from the speech recognition server 6. In the speech recognition server 6, the server-side matching unit 27 is a component that matches the speech feature quantities against the acoustic model in the same way as the terminal-side matching unit 17 of embodiment 1.
Next, the operation of the speech recognition terminal 2 and the speech recognition server 6 in embodiment 3 is described with reference to the drawings. Fig. 7 is a flowchart showing the processing of the speech recognition terminal 2 and the speech recognition server 6 according to embodiment 3. In this flowchart, the steps given the same reference numerals as in Fig. 2 are the same as in embodiment 1, so the description below concentrates on the steps that are specific to this flowchart.
First, when the user speaks into the microphone 1, a speech signal is input to the speech recognition terminal 2 via the input terminal 3 (step S101); the acoustic analysis unit 11 calculates the time series of speech feature quantities from the input speech signal (step S102), and the sensor 12 collects sensor information (step S103). Next, the terminal-side transmission unit 13 transfers the sensor information and the speech feature quantities to the speech recognition server 6 via the network 5 (step S301), and the server-side reception unit 21 takes them into the speech recognition server 6 (step S302). The server-side acoustic model storage unit 22 of the speech recognition server 6 holds acoustic models prepared in advance for a variety of sensor information; the server-side acoustic model selection unit 23 calculates, by equation (1), the distance between the sensor information acquired by the server-side reception unit 21 and the sensor information of each acoustic model, and selects the acoustic model with the smallest distance value (step S109).
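Equation (1) is defined in an earlier part of the description and is not reproduced here; the sketch below simply assumes it to be a (possibly weighted) Euclidean distance between the sensor vector received from the terminal and the sensor vector attached to each stored acoustic model, and shows the selection of step S109 under that assumption.

```python
def select_acoustic_model(sensor_info, stored_models, weights=None):
    """Return the stored acoustic model whose associated sensor information is closest
    to the received sensor information (step S109).

    stored_models: list of dicts such as {"name": ..., "sensor": [...], "params": ...}
    weights: optional per-dimension weights; equal weighting is assumed otherwise."""
    if weights is None:
        weights = [1.0] * len(sensor_info)

    def distance(model):  # assumed form of equation (1)
        return sum(w * (s - m) ** 2
                   for w, s, m in zip(weights, sensor_info, model["sensor"])) ** 0.5

    return min(stored_models, key=distance)
```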
The server-side matching unit 27 then performs pattern matching between the standard patterns of the selected acoustic model and the speech feature quantities acquired by the server-side reception unit 21, and outputs the vocabulary item with the highest likelihood as the recognition result 7 (step S303). This processing is the same as the matching processing of embodiment 1 (step S112), so a detailed description is omitted.
As described above, with the speech recognition terminal 2 and server 6 of embodiment 3, the speech recognition terminal 2 only calculates the speech feature quantities and acquires the sensor information; based on this sensor information, the speech recognition server 6 selects an appropriate acoustic model from the acoustic models it stores and performs the speech recognition. As a result, the speech recognition terminal 2 needs no storage device, program or circuit for holding acoustic models, which simplifies its configuration.
Furthermore, since only the speech feature quantities and the sensor information are transferred to the speech recognition server 6 via the network 5, speech recognition can be performed without placing a heavy transmission load on the network 5.
As mentioned earlier, the data size of an acoustic model can reach several hundred kilobytes. When the bandwidth of the network is limited, attempting to transmit the acoustic model itself may therefore exceed the available transmission capacity. The speech feature quantities, by contrast, can be transferred comfortably in real time as long as a bandwidth of at most about 20 kbps is available. A client-server speech recognition system with a very low network load can thus be built, while still performing highly accurate speech recognition matched to the sound collection environment of the microphone 1.
Unlike embodiment 1, embodiment 3 outputs the recognition result 7 from the speech recognition server 6 rather than from the speech recognition terminal 2. This configuration is sufficient, for example, when the speech recognition terminal 2 is browsing the Internet and a URL (Uniform Resource Locator) is entered by voice: the speech recognition server 6 obtains the Web page determined by this URL and sends it to the speech recognition terminal 2 for display.
It is also possible, as in embodiment 1, to configure the system so that the speech recognition terminal 2 outputs the recognition result. In that case the speech recognition terminal 2 is provided with a terminal-side reception unit and the speech recognition server 6 with a server-side transmission unit; the output of the matching unit 27 is sent from the transmission unit of the speech recognition server 6 to the reception unit of the speech recognition terminal 2 via the network 5, and from that reception unit to the desired output destination.
Embodiment 4.
Instead of transmitting and receiving acoustic models as in embodiments 1 and 2, or speech feature quantities as in embodiment 3, it is also possible to transmit and receive speech data. The speech recognition terminal and server of embodiment 4 operate on this principle.
Fig. 8 is a block diagram showing the configuration of the speech recognition terminal and server of embodiment 4. Parts given the same reference numerals as in Fig. 1 are the same as in embodiment 1, and their description is omitted. In embodiment 4 the speech recognition terminal 2 and the speech recognition server 6 are again connected via the network 5. Embodiment 4 differs from embodiment 1 in that speech data and sensor information are transmitted from the speech recognition terminal 2 to the speech recognition server 6 and in that the recognition result 7 is output from the speech recognition server 6.
The speech digital processing unit 19 is a component that converts the speech input from the input terminal 3 into digital data, and comprises an A/D converter or an equivalent program or circuit. It may additionally comprise a dedicated circuit that converts the A/D-converted samples into a format suitable for transmission over the network 5, or a computer program performing the equivalent processing together with a central processing unit that executes it. The server-side acoustic analysis unit 28 is a component that calculates the speech feature quantities from the input speech on the speech recognition server 6, and has the same function as the terminal-side acoustic analysis unit 11 of embodiments 1 and 2.
Next, the operation of the speech recognition terminal 2 and the speech recognition server 6 in embodiment 4 is described with reference to the drawings. Fig. 9 is a flowchart showing the processing of the speech recognition terminal 2 and the speech recognition server 6 according to embodiment 4. In this flowchart, the steps given the same reference numerals as in Fig. 2 are the same as in embodiment 1, so the description below concentrates on the steps that are specific to this flowchart.
First, when the user speaks into the microphone 1, a speech signal is input to the speech recognition terminal 2 via the input terminal 3 (step S101), and the speech digital processing unit 19 samples the speech signal input in step S101 by A/D conversion (step S401). It is preferable, though not essential, for the speech digital processing unit 19 to encode or compress the speech data rather than performing A/D conversion alone. Concrete speech compression schemes include u-law 64 kbps PCM (Pulse Coded Modulation, ITU-T G.711) used on digital public switched telephone networks (ISDN and the like), adaptive differential PCM (ADPCM, ITU-T G.726) used in PHS, and the VSELP (Vector Sum Excited Linear Prediction) and CELP (Code Excited Linear Prediction) schemes used in mobile phones. One of these schemes can be selected according to the available bandwidth and traffic of the communication network: for example, u-law PCM is considered suitable when the bandwidth is 64 kbps, ADPCM at 16 to 40 kbps, VSELP at 11.2 kbps, and CELP at 5.6 kbps. Applying other coding schemes, however, does not lose the features of this invention.
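A trivial selection rule along the lines of the bandwidth guideline quoted above might look as follows; the cut-off values come from the text, and the mapping is only a guideline, not a normative part of the embodiment.

```python
def choose_speech_codec(available_kbps):
    """Pick a speech coding scheme for the uplink based on the usable bandwidth,
    following the guideline figures quoted in the description."""
    if available_kbps >= 64:
        return "u-law 64 kbps PCM (ITU-T G.711)"
    if available_kbps >= 16:
        return "ADPCM (ITU-T G.726)"   # suited to the 16 to 40 kbps range
    if available_kbps >= 11.2:
        return "VSELP"
    return "CELP"                      # around 5.6 kbps
```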
Next, sensor information is collected by the sensor 12 (step S103); the collected sensor information and the encoded speech data are arranged into a data format such as the one shown in Fig. 10 and transferred by the terminal-side transmission unit 13 to the speech recognition server 6 via the network 5 (step S402).
In Fig. 10, a frame number representing the processing time of the speech data is stored in field 701. This frame number is determined uniquely, for example on the basis of the sampling time of the speech data. Here "determined uniquely" includes being determined on the basis of a relative time agreed between the speech recognition terminal 2 and the speech recognition server 6, meaning that different relative times are given different frame numbers. Alternatively, an absolute time may be supplied from a clock external to the speech recognition terminal 2 and the speech recognition server 6, and the frame number determined uniquely from that time. To calculate a frame number from a time, for example, the year (four digits are preferable), month (two digits for the range 1 to 12), day (two digits for the range 1 to 31), hour (two digits for the range 0 to 23), minute (two digits for the range 0 to 59), second (two digits for the range 0 to 59) and millisecond (three digits for the range 0 to 999) may each be padded to the stated number of digits and concatenated in this order as a digit string, or the year, month, day, hour, minute, second and millisecond values may be bit-packed to obtain a single value.
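The two variants described above can be sketched as follows; the digit widths follow the text, while the bit widths used in the packed variant are an assumption, since the description leaves the exact packing open.

```python
from datetime import datetime

def frame_number_decimal(t: datetime) -> int:
    """Concatenate year(4) month(2) day(2) hour(2) minute(2) second(2) millisecond(3)."""
    return int(f"{t.year:04d}{t.month:02d}{t.day:02d}"
               f"{t.hour:02d}{t.minute:02d}{t.second:02d}{t.microsecond // 1000:03d}")

def frame_number_packed(t: datetime) -> int:
    """Alternative: bit-pack the same fields into a single integer (assumed field widths)."""
    return (t.year << 36) | (t.month << 32) | (t.day << 27) | \
           (t.hour << 22) | (t.minute << 16) | (t.second << 10) | (t.microsecond // 1000)
```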
Field 702 of the data format of Fig. 10 stores the data size occupied by the sensor information. For example, if the sensor information is a 32-bit value, the size of the area needed to store it (4 bytes) is expressed in bytes and the value 4 is stored. When the sensor 12 consists of several sensors, the size of the array area needed to store all their readings is stored. Field 703 is the area in which the sensor information acquired by the sensor 12 in step S103 is stored; when the sensor 12 consists of several sensors, an array of sensor readings is stored in field 703, and the data size of field 703 matches the size held in field 702. Field 704 stores the speech data size. The transmission unit 13 may divide the speech data into a plurality of packets, each with the same structure as the data format of Fig. 10; in that case, field 704 stores the size of the speech data contained in the respective packet. Division into a plurality of packets is discussed again later. Finally, the speech data itself is stored in field 705.
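A packet in the format of Fig. 10 (fields 701 to 705) could be serialized and parsed as in the following sketch; the byte order and the integer widths are assumptions, since the description only specifies which fields exist and that the sizes are expressed in bytes.

```python
import struct

def pack_voice_packet(frame_number, sensor_bytes, voice_bytes):
    """Fields 701-705: frame number, sensor-data size, sensor data, voice-data size, voice data."""
    return (struct.pack(">Q", frame_number)        # 701: frame number (assumed 8 bytes)
            + struct.pack(">I", len(sensor_bytes)) # 702: sensor information size in bytes
            + sensor_bytes                          # 703: sensor information
            + struct.pack(">I", len(voice_bytes))   # 704: voice data size in bytes
            + voice_bytes)                           # 705: voice data

def unpack_voice_packet(packet):
    frame_number, = struct.unpack_from(">Q", packet, 0)
    sensor_len, = struct.unpack_from(">I", packet, 8)
    sensor_bytes = packet[12:12 + sensor_len]
    offset = 12 + sensor_len
    voice_len, = struct.unpack_from(">I", packet, offset)
    voice_bytes = packet[offset + 4:offset + 4 + voice_len]
    return frame_number, sensor_bytes, voice_bytes
```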
When an upper limit on the packet size is imposed by the characteristics of the network 5, the terminal-side transmission unit 13 divides the speech data input via the input terminal 3 into a plurality of packets. In the data format of Fig. 10, the frame number stored in field 701 is information representing the processing time of the speech data, and it is determined from the sampling time of the speech data contained in each packet. As already mentioned, field 704 stores the size of the speech data contained in each packet. When the outputs of the sensors making up the sensor 12 change from moment to moment within a short time, the sensor information stored in field 703 also differs between packets. For example, when the speech recognition terminal 2 is an in-vehicle speech recognition device and the sensor 12 is a sensor that measures the level of background noise (such as a microphone separate from the microphone 1), the level of the background noise changes markedly if the vehicle enters or leaves a tunnel in the middle of the user's utterance. In such a case, transmitting packets in the data format of Fig. 10 makes it possible to reflect the sensor information appropriately even in the middle of an utterance. For this purpose it is desirable that, when the sensor information changes greatly during an utterance, the terminal-side transmission unit 13 split the speech data at the moment the sensor information changes, irrespective of the characteristics of the network 5, and transmit packets containing the different sensor information.
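The splitting rule just described (start a new packet whenever the sensor reading changes appreciably during an utterance, in addition to any size limit imposed by the network) can be sketched as follows; the change threshold and the chunk size are illustrative assumptions.

```python
def split_into_packets(samples, sensor_readings, chunk, change_threshold):
    """Yield (sensor_value, voice_chunk) pairs, cutting a new packet whenever the sensor
    value moves by more than change_threshold (e.g. entering or leaving a tunnel) or the
    chunk reaches the network's size limit."""
    current_sensor = sensor_readings[0]
    buffer = []
    for sample, sensor in zip(samples, sensor_readings):
        if abs(sensor - current_sensor) > change_threshold or len(buffer) >= chunk:
            if buffer:
                yield current_sensor, buffer
            buffer = []
            current_sensor = sensor
        buffer.append(sample)
    if buffer:
        yield current_sensor, buffer
```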
Continuing with the operation of the speech recognition terminal 2 and the speech recognition server 6: the server-side reception unit 21 takes the sensor information and the speech data into the speech recognition server 6 (step S403). The server-side acoustic analysis unit 28 acoustically analyses the received speech data and calculates the time series of speech feature quantities (step S404). The server-side acoustic model selection unit 23 then selects the most appropriate acoustic model on the basis of the acquired sensor information (step S109), and the server-side matching unit 26 matches the standard patterns of this acoustic model against the speech feature quantities (step S405).
As is clear from the above, in embodiment 4 the speech recognition terminal 2 transfers the sensor information and the speech data to the speech recognition server 6, so highly accurate speech recognition based on an acoustic model suited to the sound collection environment can be performed without the speech recognition terminal 2 performing any acoustic analysis.
Consequently, the speech recognition function can be realized without providing the speech recognition terminal 2 with any special components, circuits or computer programs for speech recognition. Moreover, according to embodiment 4 the sensor information is transmitted for every frame, so even if the environmental conditions under which the microphone 1 collects sound change abruptly during an utterance, an appropriate acoustic model can be selected for each frame and speech recognition performed accordingly.
The method of dividing the transmission from the speech recognition terminal 2 into a plurality of frames can also be applied to the transmission of the speech feature quantities in embodiment 3. Since the speech feature quantities form a time series, they should be divided into frames in time-series order; if the sensor information at the corresponding time of the series is stored in each frame as in embodiment 4, and the speech recognition server 6 selects a suitable acoustic model on the basis of the latest sensor information contained in each frame, the accuracy of speech recognition can be improved further.
Embodiment 5.
In the speech recognition systems of embodiments 1 to 4, speech recognition adapted to the actual environment is performed by selecting the acoustic models stored in the speech recognition terminal 2 and the server 6 on the basis of the usage conditions acquired by the sensor 12 of the speech recognition terminal 2. It is also conceivable, however, to select the acoustic model by combining the conditions obtained by the sensor 12 with additional information obtained, for example, from the Internet. The speech recognition system of embodiment 5 has this feature.
As stated above, the feature of embodiment 5 is that the acoustic model is selected by combining additional information obtained from the Internet with the sensor information, so it can be combined with any of the speech recognition systems of embodiments 1 to 4, with the same effect in each case. Here, as an example, the case in which additional information obtained from the Internet is combined with the speech recognition system of embodiment 1 is described.
Fig. 11 is a block diagram showing the configuration of the speech recognition system of embodiment 5. As the figure shows, the speech recognition system of embodiment 5 is the speech recognition system of embodiment 1 with an Internet information acquisition unit 29 added; the components given the same reference numerals as in Fig. 1 are the same as in embodiment 1 and are not described again. The Internet information acquisition unit 29 is a component that acquires additional information via the Internet; concretely, it has functions equivalent to an Internet browser that fetches Web pages by HTTP (HyperText Transfer Protocol). Furthermore, for the acoustic models stored by the speech recognition server 6 of embodiment 5, the environmental conditions under which each acoustic model was trained are expressed by both sensor information and additional information.
The additional information is, for example, weather information or traffic information. Web sites that provide weather information and traffic information exist on the Internet, and from them the weather conditions, congestion information, road-work status and so on of each area can be obtained. To perform more accurate speech recognition using such additional information, acoustic models matched to the obtainable additional information are prepared. For example, when the additional information is weather information, acoustic models are trained taking into account the background noise caused by rain, wind and the like; when it is traffic information, acoustic models are trained taking into account the background noise caused by road works and the like.
Next, the operation of the speech recognition terminal 2 and the server 6 of embodiment 5 is described. Fig. 12 is a flowchart showing the operation of the speech recognition terminal 2 and the server 6 according to embodiment 5. The only difference between the flowchart of Fig. 12 and that of Fig. 2 is the presence of step S501, so the description below concentrates on the processing of step S501.
After the speech recognition server 6 receives the sensor information (step S108), the Internet information acquisition unit 29 obtains from the Internet information that affects the sound collected by the microphone 1 connected to the speech recognition terminal 2 (step S501). For example, when the sensor 12 includes a GPS antenna, the sensor information contains position information giving the location of the speech recognition terminal 2 and the microphone 1; based on this position information, the Internet information acquisition unit 29 obtains from the Internet additional information, such as weather information and traffic information, for the place where the speech recognition terminal 2 and the microphone 1 are located.
The server-side acoustic model selection unit 23 then selects an acoustic model on the basis of both the sensor information and the additional information. Concretely, it is first judged whether the additional information for the current location of the speech recognition terminal 2 and the microphone 1 matches the additional information of each acoustic model; from among the acoustic models whose additional information matches, the acoustic model whose distance value for the sensor information, calculated by equation (1) shown in embodiment 1, is smallest is then selected.
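A sketch of this two-stage selection, with the same assumed form of equation (1) as before and an assumed fallback when no stored model carries matching additional information, is:

```python
def select_model_with_additional_info(sensor_info, additional_info, stored_models):
    """Two-stage selection of embodiment 5: restrict the candidates to models whose attached
    additional information (weather, traffic, etc., step S501) matches, then pick the
    sensor-information nearest model among them (step S109)."""
    candidates = [m for m in stored_models if m["additional"] == additional_info]
    if not candidates:            # assumed fallback: ignore additional info if nothing matches
        candidates = stored_models

    def distance(model):          # assumed form of equation (1)
        return sum((s - v) ** 2 for s, v in zip(sensor_info, model["sensor"])) ** 0.5

    return min(candidates, key=distance)
```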
The subsequent processing is the same as in embodiment 1 and is not described again.
As is clear from the above, with the speech recognition system of embodiment 5, even when the conditions under which an acoustic model was trained cannot be fully expressed by sensor information alone, they can be expressed with the help of additional information, so a more appropriate acoustic model can be selected for the sound collection environment of the microphone 1. As a result, the accuracy of speech recognition can be improved.
In the above, acquiring the additional information via the Internet has been described, but the essential significance of using additional information is to select the acoustic model on the basis of those environmental factors degrading speech recognition accuracy that cannot be expressed by sensor information. The means of obtaining such additional information is therefore not limited to the Internet; for example, a dedicated system or a dedicated computer for providing the additional information may be prepared.
Industrial Applicability
As described above, the speech recognition system, terminal and server according to the present invention are useful for performing highly accurate speech recognition even when the place of use changes, and are particularly suitable for providing a speech recognition function in devices such as car navigation systems and mobile phones, in which the capacity of the storage devices that can be installed is limited by restrictions on housing size, weight and price.

Claims

1. A speech recognition system in which a speech recognition server and a plurality of speech recognition terminals are connected via a network, wherein
each of the speech recognition terminals comprises:
an input terminal to which an external microphone is connected and through which a speech signal collected by the external microphone is input;
client-side acoustic analysis means for calculating speech feature quantities from the speech signal input from the input terminal;
a sensor for detecting sensor information characterizing the noise superimposed on the speech signal;
client-side transmission means for transmitting the sensor information to the speech recognition server via the network;
client-side reception means for receiving an acoustic model from the speech recognition server; and
client-side matching means for matching the acoustic model against the speech feature quantities; and
the speech recognition server comprises:
server-side reception means for receiving the sensor information transmitted by the client-side transmission means;
server-side acoustic model storage means for storing a plurality of acoustic models;
server-side acoustic model selection means for selecting, from the plurality of acoustic models, an acoustic model matching the sensor information; and
server-side transmission means for transmitting the acoustic model selected by the server-side acoustic model selection means to the speech recognition terminal.
2. A speech recognition system in which a speech recognition server and a plurality of speech recognition terminals are connected via a network, wherein
each of the speech recognition terminals comprises:
an input terminal to which an external microphone is connected and through which a speech signal collected by the external microphone is input;
client-side acoustic analysis means for calculating speech feature quantities from the speech signal input from the input terminal;
a sensor for detecting sensor information characterizing the noise superimposed on the speech signal; and
client-side transmission means for transmitting the sensor information and the speech feature quantities to the speech recognition server via the network; and
the speech recognition server comprises:
server-side reception means for receiving the sensor information and the speech feature quantities;
server-side acoustic model storage means for storing a plurality of acoustic models;
server-side acoustic model selection means for selecting, from the plurality of acoustic models, an acoustic model matching the sensor information; and
server-side matching means for matching the acoustic model selected by the server-side acoustic model selection means against the speech feature quantities.
3. A speech recognition system in which a speech recognition server and a plurality of speech recognition terminals are connected via a network, wherein
each of the speech recognition terminals comprises:
an input terminal to which an external microphone is connected and through which a speech signal collected by the external microphone is input;
a sensor for detecting sensor information characterizing the noise superimposed on the speech signal; and
client-side transmission means for transmitting the sensor information and the speech signal to the speech recognition server via the network; and
the speech recognition server comprises:
server-side reception means for receiving the sensor information and the speech signal;
server-side acoustic analysis means for calculating speech feature quantities from the speech signal;
server-side acoustic model storage means for storing a plurality of acoustic models;
server-side acoustic model selection means for selecting, from the plurality of acoustic models, an acoustic model matching the sensor information; and
server-side matching means for matching the acoustic model selected by the server-side acoustic model selection means against the speech feature quantities.
4. The speech recognition system according to any one of claims 1 to 3, wherein the speech recognition server further comprises traffic information acquisition means for acquiring traffic information from the Internet, and the server-side acoustic model selection means selects, from the plurality of acoustic models, an acoustic model matching both the sensor information and the traffic information acquired by the traffic information acquisition means.
5. The speech recognition system according to any one of claims 1 to 3, wherein the speech recognition server further comprises weather information acquisition means for acquiring weather information from the Internet, and the server-side acoustic model selection means selects, from the plurality of acoustic models, an acoustic model matching both the sensor information and the weather information acquired by the weather information acquisition means.
6. A speech recognition terminal comprising:
an input terminal to which an external microphone is connected and through which a speech signal collected by the external microphone is input;
client-side acoustic analysis means for calculating speech feature quantities from the speech signal input from the input terminal;
a sensor for detecting sensor information characterizing the noise superimposed on the speech signal;
client-side transmission means for transmitting the sensor information to a speech recognition server that selects, from a plurality of acoustic models, an acoustic model matching the sensor information and transmits that acoustic model via a network;
client-side reception means for receiving the acoustic model transmitted by the speech recognition server; and
client-side matching means for matching the acoustic model against the speech feature quantities.
7. A speech recognition server that stores a plurality of acoustic models, selects from the plurality of acoustic models an acoustic model suited to the sound collection environment of each of a plurality of speech recognition terminals, and transmits that acoustic model to the respective speech recognition terminal via a network, the speech recognition server comprising:
server-side reception means for receiving, from each of the speech recognition terminals, sensor information representing the sound collection environment;
server-side acoustic model storage means for storing the plurality of acoustic models;
server-side acoustic model selection means for selecting an acoustic model matching the sensor information; and
server-side transmission means for transmitting the acoustic model selected by the server-side acoustic model selection means to the respective speech recognition terminal.
8. The speech recognition server according to claim 7, further comprising acoustic model difference calculation means for calculating a difference between an acoustic model stored in the speech recognition terminal and the acoustic model selected by the server-side acoustic model selection means, wherein the server-side transmission means transmits the difference instead of the acoustic model.
9. The speech recognition server according to claim 8, wherein the server-side acoustic model storage means further stores a plurality of acoustic models clustered in advance on the basis of the statistics of the acoustic models, and the acoustic model difference calculation means calculates the difference between the clustered acoustic models.
10. The speech recognition terminal according to claim 6, further comprising:
local acoustic model storage means for storing some of the plurality of acoustic models stored by the speech recognition server; and
acoustic model synthesis means for generating an acoustic model matching the sensor information by adding, to an acoustic model stored in the local acoustic model storage means, the difference between that acoustic model and the acoustic model selected by the speech recognition server as matching the sensor information,
wherein the client-side reception means receives, instead of the acoustic model, the difference transmitted from the speech recognition server.
11. A speech recognition server that stores a plurality of acoustic models, receives via a network speech feature quantities of input speech extracted by a plurality of speech recognition terminals, selects from the plurality of acoustic models an acoustic model suited to the sound collection environment of each speech recognition terminal, and recognizes the speech feature quantities using that acoustic model, the speech recognition server comprising:
server-side reception means for receiving, from each of the speech recognition terminals, sensor information representing the sound collection environment and the speech feature quantities;
server-side acoustic model storage means for storing the plurality of acoustic models;
server-side acoustic model selection means for selecting an acoustic model matching the sensor information; and
server-side matching means for matching the speech feature quantities against the acoustic model selected by the server-side acoustic model selection means.
12. A speech recognition terminal comprising:
an input terminal to which an external microphone is connected and through which a speech signal collected by the external microphone is input;
client-side acoustic analysis means for calculating speech feature quantities from the speech signal input from the input terminal;
a sensor for detecting sensor information characterizing the noise superimposed on the speech signal; and
client-side transmission means for transmitting the sensor information and the speech feature quantities to a speech recognition server that selects, from a plurality of acoustic models, an acoustic model matching the sensor information and performs speech recognition of the speech feature quantities received via a network on the basis of that acoustic model.
13. The speech recognition terminal according to claim 12, wherein the client-side transmission means divides the speech feature quantities into a plurality of frames in time-series order, and appends to each frame the sensor information detected by the sensor at the corresponding time of the time series before transmission.
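Claim 13 leaves the transmission format open; the following hypothetical packet layout simply pairs each time-ordered feature frame with the sensor reading taken at the same instant before it is sent to the server. Field names are invented for illustration.

```python
# Hypothetical packet layout for claim 13 (the patent does not define a wire
# format): each feature frame carries the sensor reading taken at its time.
import json
import numpy as np

def frame_and_tag(feature_frames: np.ndarray, sensor_readings: list) -> list:
    """Split features into time-ordered frames and attach sensor info."""
    packets = []
    for t, frame in enumerate(feature_frames):
        packets.append({
            "frame_index": t,
            "features": frame.tolist(),
            "sensor_info": sensor_readings[t],   # e.g. engine rpm at time t
        })
    return packets

features = np.random.randn(3, 13)
rpm = [900, 950, 3100]
payload = json.dumps(frame_and_tag(features, rpm))  # sent over the network
```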
14. The speech recognition server according to claim 11, wherein the server-side receiving means receives the sensor information and the speech feature quantities frame by frame, the server-side acoustic model selection means selects, for each frame, an acoustic model matching the sensor information, and the server-side matching means matches the acoustic model selected for each frame by the server-side acoustic model selection means against the speech feature quantities of that frame.
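Claim 14 then applies the selection on the server once per received frame, so the model can change in the middle of an utterance as the noise conditions change. A minimal, self-contained sketch with invented names and a toy scoring rule:

```python
# Sketch of the per-frame selection in claim 14 (illustrative only).
import numpy as np

MODELS = {"idle": np.zeros(13), "high_rpm": np.full(13, 0.5)}  # toy "models"

def recognize_per_frame(packets: list) -> float:
    """Select a model per frame from its sensor info and accumulate scores."""
    total = 0.0
    for packet in packets:
        label = "high_rpm" if packet["sensor_info"] > 2000 else "idle"
        mean = MODELS[label]
        frame = np.asarray(packet["features"])
        total += float(-0.5 * np.sum((frame - mean) ** 2))  # toy log-score
    return total

packets = [{"features": list(np.random.randn(13)), "sensor_info": rpm}
           for rpm in (900, 950, 3100)]
score = recognize_per_frame(packets)
```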
15. A speech recognition server that receives speech digital signals from a plurality of speech recognition terminals via a network, selects from a plurality of acoustic models an acoustic model adapted to the sound collection environment of each speech recognition terminal, and performs speech recognition of the speech digital signals using the selected acoustic model, the speech recognition server comprising: server-side receiving means for receiving, from each speech recognition terminal, sensor information representing the sound collection environment together with the speech digital signal; server-side acoustic analysis means for calculating speech feature quantities from the speech digital signal; server-side acoustic model storage means for storing the plurality of acoustic models; server-side acoustic model selection means for selecting an acoustic model matching the sensor information; and server-side matching means for matching the speech feature quantities against the acoustic model selected by the server-side acoustic model selection means.
16. A speech recognition terminal comprising: an input terminal to which an external microphone is connected and which inputs a speech signal collected by the external microphone; speech digital processing means for calculating a speech digital signal from the speech signal input from the input terminal; a sensor that detects sensor information representing the type of noise superimposed on the speech signal; and client-side transmission means for transmitting the sensor information and the speech digital signal to a speech recognition server that selects, from a plurality of acoustic models, an acoustic model matching the sensor information and performs, on the basis of that acoustic model, speech recognition of the speech digital signal received via a network.
17. The speech recognition terminal according to claim 16, wherein the client-side transmission means divides the speech digital signal into a plurality of frames in time-series order, and appends to each frame the sensor information detected by the sensor at the corresponding time of the time series before transmission.
18. The speech recognition server according to claim 15, wherein the server-side receiving means receives the speech digital signal and the sensor information frame by frame, the server-side acoustic analysis means calculates speech feature quantities from the speech digital signal for each frame, the server-side acoustic model selection means selects, for each frame, an acoustic model matching the sensor information, and the server-side matching means matches the acoustic model selected for each frame by the server-side acoustic model selection means against the speech feature quantities of that frame.
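In the variant of claims 15 to 18 the terminal sends the speech digital signal itself and the server performs the acoustic analysis before selection and matching. The patent does not fix a feature type, so the sketch below uses a toy per-frame log-energy purely as a stand-in for that analysis step; frame length, hop size, and names are assumptions.

```python
# Stand-in for the server-side acoustic analysis in claims 15-18: frame the
# received digital signal and compute one toy log-energy feature per frame.
import numpy as np

def analyze(signal: np.ndarray, frame_len: int = 400, hop: int = 160):
    """Split a speech digital signal into frames and compute log-energy."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        feats.append(np.log(np.sum(frame ** 2) + 1e-10))
    return np.asarray(feats)

pcm = np.random.randn(16000)   # one second of 16 kHz audio from a terminal
features = analyze(pcm)        # then passed to selection and matching
```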
19. The speech recognition server according to any one of claims 7 to 9, 11, 14, 15, and 18, further comprising traffic information acquisition means for acquiring traffic information from the Internet, wherein the server-side acoustic model selection means selects, from the plurality of acoustic models, an acoustic model matching both the sensor information and the traffic information acquired by the traffic information acquisition means.
20. The speech recognition server according to any one of claims 7 to 9, 11, 14, 15, and 18, further comprising weather information acquisition means for acquiring weather information from the Internet, wherein the server-side acoustic model selection means selects, from the plurality of acoustic models, an acoustic model matching both the sensor information and the weather information acquired by the weather information acquisition means.
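Claims 19 and 20 extend the selection key beyond the in-vehicle sensor information to Internet-sourced traffic and weather information. A minimal sketch, with all keys and model names invented for illustration:

```python
# Illustrative sketch of claims 19-20 (assumed keys and values): the acoustic
# model is indexed by the combination of in-vehicle sensor information and
# Internet-sourced traffic or weather information.
COMBINED_MODELS = {
    ("high_rpm", "congested", "rain"):  "model_noisy_wet_road",
    ("high_rpm", "free_flow", "clear"): "model_highway_cruise",
    ("idle",     "congested", "clear"): "model_stop_and_go",
}

def select_combined(sensor_info: str, traffic: str, weather: str) -> str:
    """Pick the acoustic model matching sensor, traffic and weather info."""
    return COMBINED_MODELS.get((sensor_info, traffic, weather),
                               "model_default")

chosen = select_combined("high_rpm", "congested", "rain")
```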
PCT/JP2003/009598 2003-07-29 2003-07-29 Voice recognition system and its terminal and server WO2005010868A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2003/009598 WO2005010868A1 (en) 2003-07-29 2003-07-29 Voice recognition system and its terminal and server
JP2005504586A JPWO2005010868A1 (en) 2003-07-29 2003-07-29 Speech recognition system and its terminal and server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2003/009598 WO2005010868A1 (en) 2003-07-29 2003-07-29 Voice recognition system and its terminal and server

Publications (1)

Publication Number Publication Date
WO2005010868A1 true WO2005010868A1 (en) 2005-02-03

Family

ID=34090568

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2003/009598 WO2005010868A1 (en) 2003-07-29 2003-07-29 Voice recognition system and its terminal and server

Country Status (2)

Country Link
JP (1) JPWO2005010868A1 (en)
WO (1) WO2005010868A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002091477A (en) * 2000-09-14 2002-03-27 Mitsubishi Electric Corp Voice recognition system, voice recognition device, acoustic model control server, language model control server, voice recognition method and computer readable recording medium which records voice recognition program
JP2003122395A (en) * 2001-10-19 2003-04-25 Asahi Kasei Corp Voice recognition system, terminal and program, and voice recognition method
JP2003140691A (en) * 2001-11-07 2003-05-16 Hitachi Ltd Voice recognition device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KOSAKA ET AL.: "Scalar Ryushi-ka o Riyo shita Client-Server-gata Onsei Ninshiki no Jitsugen to Server-bu no Kosoku-ka no Kento", THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS GIJUTSU KENKYU HOKOKU (ONSEI), 21 December 1999 (1999-12-21), pages 31-36, XP002984744 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008108232A1 (en) * 2007-02-28 2008-09-12 Nec Corporation Audio recognition device, audio recognition method, and audio recognition program
JP5229216B2 (en) * 2007-02-28 2013-07-03 日本電気株式会社 Speech recognition apparatus, speech recognition method, and speech recognition program
US8612225B2 (en) 2007-02-28 2013-12-17 Nec Corporation Voice recognition device, voice recognition method, and voice recognition program
JP2011118124A (en) * 2009-12-02 2011-06-16 Murata Machinery Ltd Speech recognition system and recognition method
CN109213970A (en) * 2017-06-30 2019-01-15 北京国双科技有限公司 Put down generation method and device
CN109213970B (en) * 2017-06-30 2022-07-29 北京国双科技有限公司 Method and device for generating notes
US11367449B2 (en) 2017-08-09 2022-06-21 Lg Electronics Inc. Method and apparatus for calling voice recognition service by using Bluetooth low energy technology
WO2019031870A1 (en) * 2017-08-09 2019-02-14 엘지전자 주식회사 Method and apparatus for calling voice recognition service by using bluetooth low energy technology
JP2019211752A (en) * 2018-06-01 2019-12-12 サウンドハウンド,インコーポレイテッド Custom acoustic models
US11011162B2 (en) 2018-06-01 2021-05-18 Soundhound, Inc. Custom acoustic models
US11367448B2 (en) 2018-06-01 2022-06-21 Soundhound, Inc. Providing a platform for configuring device-specific speech recognition and using a platform for configuring device-specific speech recognition
CN110556097A (en) * 2018-06-01 2019-12-10 声音猎手公司 Customizing acoustic models
CN110556097B (en) * 2018-06-01 2023-10-13 声音猎手公司 Custom acoustic models
US11830472B2 (en) 2018-06-01 2023-11-28 Soundhound Ai Ip, Llc Training a device specific acoustic model
WO2020096172A1 (en) * 2018-11-07 2020-05-14 Samsung Electronics Co., Ltd. Electronic device for processing user utterance and controlling method thereof
US10699704B2 (en) 2018-11-07 2020-06-30 Samsung Electronics Co., Ltd. Electronic device for processing user utterance and controlling method thereof
US11538470B2 (en) 2018-11-07 2022-12-27 Samsung Electronics Co., Ltd. Electronic device for processing user utterance and controlling method thereof

Also Published As

Publication number Publication date
JPWO2005010868A1 (en) 2006-09-14

Similar Documents

Publication Publication Date Title
EP2538404B1 (en) Voice data transferring device, terminal device, voice data transferring method, and voice recognition system
US7451085B2 (en) System and method for providing a compensated speech recognition model for speech recognition
TWI508057B (en) Speech recognition system and method
US20020138274A1 (en) Server based adaption of acoustic models for client-based speech systems
US20100185446A1 (en) Speech recognition system and data updating method
EP2956939B1 (en) Personalized bandwidth extension
CN104347067A (en) Audio signal classification method and device
JP2002091477A (en) Voice recognition system, voice recognition device, acoustic model control server, language model control server, voice recognition method and computer readable recording medium which records voice recognition program
JP6466334B2 (en) Real-time traffic detection
CN104040626A (en) Multiple coding mode signal classification
CN101345819A (en) Speech control system used for set-top box
CN1171201C (en) Speech distinguishing system and method thereof
WO2005010868A1 (en) Voice recognition system and its terminal and server
JP2008026489A (en) Voice signal conversion apparatus
JP2003241788A (en) Device and system for speech recognition
JP3477432B2 (en) Speech recognition method and server and speech recognition system
CN113724698B (en) Training method, device, equipment and storage medium of voice recognition model
EP1810277A1 (en) Method for the distributed construction of a voice recognition model, and device, server and computer programs used to implement same
CN103474063A (en) Voice recognition system and method
CN1062365C (en) A method of transmitting and receiving coded speech
JPH10254473A (en) Method and device for voice conversion
JP2003122395A (en) Voice recognition system, terminal and program, and voice recognition method
JP2006106300A (en) Speech recognition device and program therefor
FI115275B (en) Speech identification application for wireless terminals
CN116052720A (en) Voice error detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2005504586

Country of ref document: JP

122 Ep: pct application non-entry in european phase