WO2002080141A1 - Sound processing apparatus - Google Patents
Sound processing apparatus
- Publication number
- WO2002080141A1 (PCT/JP2002/003248)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cluster
- voice
- unit
- dictionary
- processing device
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Definitions
- the present invention relates to a voice processing apparatus, and more particularly to a voice processing apparatus that can easily update a dictionary for registering a phrase such as a word to be subjected to voice recognition.
- a user's utterance is speech-recognized by referring to a dictionary in which words to be subjected to speech recognition are registered.
- misrecognition may affect recognition of words before and after the unregistered word.
- Words before and after the unregistered word will also be misrecognized.
- Japanese Unexamined Patent Publication No. Hei 9-181181 discloses a speech recognition device that simultaneously uses a garbage model for detecting unregistered words and HMMs (Hidden Markov Models) clustered for several phonemes such as vowels. By restricting the phoneme sequences allowed for unregistered words, the device detects unregistered words while reducing the amount of calculation they require.
- Japanese Patent Application No. 11-124, 4 5 4 6 1 discloses an information processing device that calculates the similarity between a word and the words in a database based on the concept of the word, and constructs and outputs an appropriately ordered word string for a set of words that includes unregistered words.
- One typical countermeasure for unregistered words is to register an unregistered word in the dictionary when the input speech contains it, and thereafter to recognize it as a registered word.
- To register an unregistered word in the dictionary, it is necessary first to detect the voice section of the unregistered word and to recognize the phoneme sequence of the voice in that section.
- One method of recognizing the phoneme sequence of a voice is, for example, the so-called phoneme typewriter.
- The phoneme typewriter basically uses a garbage model that allows free transitions among all phonemes, and outputs a phoneme sequence for the input speech.
- In one method, the user inputs a heading representing the unregistered word (for example, the reading of the unregistered word), and the phoneme sequence of the unregistered word is registered in the cluster represented by that heading. However, this method is troublesome because the user must input the heading.
- The present invention has been made in view of this situation, and aims to make it easy to register an unregistered word in a dictionary while keeping the dictionary from becoming large.
- The speech processing apparatus according to the present invention is characterized by including: cluster detection means for detecting, from clusters obtained by clustering voices, the cluster to which an input voice is to be added as a new member; cluster dividing means for making the input voice a new member of the detected cluster and dividing that cluster based on its members; and updating means for updating a dictionary based on the result of the division by the cluster dividing means.
- The voice processing method according to the present invention is characterized by including: a cluster detection step of detecting, from clusters obtained by clustering voices, the cluster to which an input voice is to be added as a new member; a cluster dividing step of making the input voice a new member of the cluster detected in the cluster detection step and dividing that cluster based on its members; and an updating step of updating a dictionary based on the result of the division in the cluster dividing step.
- The program according to the present invention is characterized by including the same cluster detection step, cluster dividing step, and updating step.
- The recording medium according to the present invention is characterized by having recorded on it a program that includes the same cluster detection step, cluster dividing step, and updating step.
- a cluster to which the input voice is added as a new member is detected from the clusters obtained by clustering the voices which have already been obtained. Furthermore, the input voice is made a new member of the detected cluster, and the cluster is divided based on the members of the cluster. Then, the dictionary is updated based on the division result.
- FIG. 1 is a perspective view showing an external configuration example of an embodiment of a robot to which the present invention is applied.
- FIG. 2 is a block diagram showing an example of the internal configuration of the robot.
- FIG. 3 is a block diagram showing a functional configuration example of the controller of the robot in FIG.
- FIG. 4 is a block diagram showing a configuration example of a speech recognition unit of the robot of FIG. 1 as a speech recognition device to which the first embodiment of the present invention is applied.
- FIG. 5 is a diagram showing a word dictionary.
- FIG. 6 is a diagram showing grammar rules.
- FIG. 7 is a diagram showing the contents stored in the feature vector buffer of the speech recognition unit in Fig. 4.
- FIG. 8 is a diagram showing a score sheet.
- FIG. 9 is a flowchart illustrating the voice recognition processing of the voice recognition unit in FIG.
- FIG. 10 is a flowchart illustrating details of the unregistered word processing in FIG.
- FIG. 11 is a flowchart illustrating details of the cluster division processing in FIG.
- FIG. 12 is a diagram showing a simulation result.
- FIG. 13 is a diagram illustrating a configuration example of hardware of a speech recognition device to which the second embodiment of the present invention has been applied.
- FIG. 14 is a block diagram showing an example of the software configuration of the speech recognition device in FIG.
- FIG. 15 is a diagram showing the storage contents of the characteristic vector buffer of the speech recognition device of FIG.
- FIG. 16 is a flowchart illustrating the speech recognition processing of the speech recognition device in FIG.
- FIG. 17 is a flowchart for explaining the details of the unregistered word erasure process in FIG. 16.
BEST MODE FOR CARRYING OUT THE INVENTION
- FIG. 1 shows an example of an external configuration of a robot according to an embodiment of the present invention.
- FIG. 2 shows an example of an electrical configuration of the robot.
- The robot has the shape of a four-legged animal such as a dog. Leg units 3A, 3B, 3C, and 3D are connected to the body unit 2, and a head unit 4 and a tail unit 5 are connected to the front and rear ends of the body unit 2, respectively.
- The tail unit 5 is drawn out from a base portion 5B provided on the upper surface of the body unit 2 so that it can bend and swing with two degrees of freedom.
- The body unit 2 houses a controller 10 that controls the entire robot, a battery 11 that powers the robot, and an internal sensor unit 14 consisting of a battery sensor 12 and a heat sensor 13.
- In the head unit 4, a microphone 15 corresponding to the “ear,” a CCD (Charge Coupled Device) camera 16 corresponding to the “eye,” a touch sensor 17 corresponding to the sense of touch, and a loudspeaker 18 corresponding to the “mouth” are arranged at predetermined positions.
- A lower jaw 4A, corresponding to the lower jaw of the mouth, is movably attached to the head unit 4 with one degree of freedom. Moving the lower jaw 4A opens and closes the robot's mouth.
- the microphone 15 in the head unit 4 collects surrounding sounds (sounds) including utterances from the user, and sends the obtained sound signals to the controller 10.
- the CCD camera 16 captures an image of the surroundings and sends the obtained image signal to the controller 10.
- The touch sensor 17 is provided, for example, on the upper part of the head unit 4. It detects pressure applied by a physical action from the user such as “stroking” or “slapping,” and sends the detection result to the controller 10 as a pressure detection signal.
- The battery sensor 12 in the body unit 2 detects the remaining charge of the battery 11 and sends the detection result to the controller 10 as a battery remaining amount detection signal. The heat sensor 13 likewise sends its detection result to the controller 10 as a heat detection signal.
- The controller 10 includes a CPU (Central Processing Unit) 10A, a memory 10B, and the like. The CPU 10A performs various processes by executing a control program stored in the memory 10B.
- Based on the voice signal, image signal, and pressure detection signal given from the microphone 15, the CCD camera 16, and the touch sensor 17, and on the battery remaining amount detection signal and heat detection signal given from the battery sensor 12 and the heat sensor 13, the controller 10 determines the surrounding conditions, whether there is a command from the user, and whether there is any action from the user.
- Based on these determination results and the like, the controller 10 decides the subsequent action and, according to that decision, drives whichever of the actuators 3AA_1 to 3AA_K, 3BA_1 to 3BA_K, 3CA_1 to 3CA_K, 3DA_1 to 3DA_K, 4A_1 to 4A_L, 5A_1, and 5A_2 are needed.
- As a result, the head unit 4 is swung up, down, left, and right, and the lower jaw 4A is opened and closed.
- the tail unit 5 can be moved, and the leg units 3A to 3D are driven to perform actions such as walking the robot.
- The controller 10 also generates a synthesized sound as necessary and supplies it to the speaker 18 for output, or turns on, turns off, or blinks an LED (Light Emitting Diode), not shown, provided at the position of the robot's “eye.”
- the robot takes an autonomous action based on the surrounding situation and the like.
- FIG. 3 shows a functional configuration example of the controller 10 of FIG. Note that the functional configuration shown in FIG. 3 is realized by the CPU 10A executing a control program stored in the memory 10B.
- The controller 10 comprises: a sensor input processing unit 50 that recognizes specific external states; a model storage unit 51 that accumulates the recognition results of the sensor input processing unit 50 and expresses the emotion, instinct, and growth states; an action determination mechanism unit 52 that determines the next action based on the recognition results of the sensor input processing unit 50 and the like; a posture transition mechanism unit 53 that actually causes the robot to act based on the determination result of the action determination mechanism unit 52; a control mechanism unit 54 that drives and controls the actuators 3AA_1 to 5A_1 and 5A_2; and a speech synthesis unit 55 that generates synthesized sound.
- The sensor input processing unit 50 recognizes specific external states, specific actions from the user, instructions from the user, and the like based on the voice signal, image signal, pressure detection signal, and so on given from the microphone 15, the CCD camera 16, the touch sensor 17, and the like. It notifies the model storage unit 51 and the action determination mechanism unit 52 of state recognition information indicating the recognition result.
- The sensor input processing unit 50 has a voice recognition unit 50A, which performs voice recognition on the voice signal given from the microphone 15. The voice recognition unit 50A notifies the model storage unit 51 and the action determination mechanism unit 52 of the voice recognition result, for example a command such as “walk,” “down,” or “chase the ball,” as state recognition information.
- The sensor input processing unit 50 also has an image recognition unit 50B, which performs image recognition processing using the image signal given from the CCD camera 16. When this processing detects, for example, “a red round object” or “a plane perpendicular to the ground and at least a predetermined height,” the image recognition unit 50B notifies the model storage unit 51 and the action determination mechanism unit 52 of an image recognition result such as “there is a ball” or “there is a wall” as state recognition information.
- The sensor input processing unit 50 further has a pressure processing unit 50C, which processes the pressure detection signal given from the touch sensor 17. When this processing detects a short-duration pressure at or above a predetermined threshold, the pressure processing unit 50C recognizes that the robot has been “hit”; when it detects a long-duration pressure below the predetermined threshold, it recognizes that the robot has been “stroked (praised).” It notifies the model storage unit 51 and the action determination mechanism unit 52 of the recognition result as state recognition information.
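The hit/stroke distinction made by the pressure processing unit 50C can be illustrated with a minimal sketch. The threshold values, the `PressureEvent` type, and the function name are assumptions introduced for illustration; the text only says the thresholds are "predetermined".

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; the source only calls them "predetermined".
PRESSURE_THRESHOLD = 0.5  # normalized pressure
SHORT_DURATION_S = 0.3    # contacts at most this long count as "short"


@dataclass
class PressureEvent:
    pressure: float    # pressure of the contact (normalized)
    duration_s: float  # how long the contact lasted, in seconds


def classify_pressure(event: PressureEvent) -> Optional[str]:
    """Map a touch-sensor event to state recognition information."""
    is_short = event.duration_s <= SHORT_DURATION_S
    if event.pressure >= PRESSURE_THRESHOLD and is_short:
        return "hit"
    if event.pressure < PRESSURE_THRESHOLD and not is_short:
        return "stroked"
    return None  # neither pattern described in the text applies
```

Returning `None` for the remaining combinations is a design choice of this sketch; the text does not say how such contacts are treated.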
- the model storage unit 51 stores and manages an emotion model, an instinct model, and a growth model expressing the emotion, instinct, and growth state of the robot.
- The emotion model represents, for example, the state (degree) of emotions such as “joy,” “sadness,” “anger,” and “fun” by values in a predetermined range (for example, -1.0 to 1.0), and changes those values based on the state recognition information from the sensor input processing unit 50, the passage of time, and the like.
- The instinct model represents, for example, the state (degree) of instinctive desires such as “appetite,” “sleep desire,” and “exercise desire” by values in a predetermined range, and changes those values based on the state recognition information from the sensor input processing unit 50, the passage of time, and the like.
- The growth model represents, for example, the state (degree) of growth such as “childhood,” “adolescence,” “maturity,” and “old age” by values in a predetermined range, and changes those values based on the state recognition information from the sensor input processing unit 50, the passage of time, and the like.
- the model storage unit 51 sends the emotion, instinct, and growth state represented by the values of the emotion model, instinct model, and growth model as described above to the behavior determination mechanism unit 52 as state information.
- The model storage unit 51 is supplied with state recognition information from the sensor input processing unit 50 and also with behavior information from the action determination mechanism unit 52. The behavior information indicates the robot's current or past behavior, for example “walked for a long time.” Even when given the same state recognition information, the model storage unit 51 generates different state information depending on the robot's behavior indicated by the behavior information.
- For example, when the robot greets the user and is stroked on the head, the behavior information indicating that the robot greeted the user and the state recognition information indicating that it was stroked on the head are given to the model storage unit 51. In this case, the model storage unit 51 increases the value of the emotion model representing “joy.”
- On the other hand, when the robot is stroked on the head while performing some task, the behavior information indicating that the robot is performing the task and the state recognition information indicating that it was stroked on the head are given to the model storage unit 51. In this case, the model storage unit 51 does not change the value of the emotion model representing “joy.”
- In this way, the model storage unit 51 sets the values of the emotion model by referring not only to the state recognition information but also to the behavior information indicating the robot's current or past behavior. This avoids unnatural emotional changes, such as the value of the emotion model representing “joy” increasing when the robot is stroked on the head while performing a task.
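The dependence of the emotion model on both kinds of information can be sketched as follows. This is an illustrative sketch only: the class name, the string labels, the step size 0.2, and the default bounds of -1.0 and 1.0 are all assumptions, not values given by the source.

```python
class EmotionModel:
    """Emotion values kept inside a fixed range (bounds are assumed defaults)."""

    def __init__(self, low=-1.0, high=1.0):
        self.low, self.high = low, high
        self.values = {"joy": 0.0, "sadness": 0.0, "anger": 0.0, "fun": 0.0}

    def _add(self, name, delta):
        # Clamp the updated value to the predetermined range.
        value = self.values[name] + delta
        self.values[name] = max(self.low, min(self.high, value))

    def update(self, state_recognition, behavior):
        # The same recognition result has a different effect
        # depending on what the robot is (or was) doing.
        if state_recognition == "stroked_on_head":
            if behavior == "greeting_user":
                self._add("joy", 0.2)  # hypothetical step size
            elif behavior == "performing_task":
                pass  # "joy" is deliberately left unchanged
```

With this sketch, `update("stroked_on_head", "greeting_user")` raises "joy", while the same recognition result during `"performing_task"` leaves it untouched.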
- As with the emotion model, the model storage unit 51 increases and decreases the values of the instinct model and the growth model based on both the state recognition information and the behavior information. It also increases and decreases the values of the emotion model, the instinct model, and the growth model based on the values of the other models.
- The action determination mechanism unit 52 determines the next action based on the state recognition information from the sensor input processing unit 50, the state information from the model storage unit 51, the passage of time, and the like, and sends the content of the determined action to the posture transition mechanism unit 53 as action command information.
- The action determination mechanism unit 52 manages, as an action model that defines the robot's actions, a finite automaton in which the actions the robot can take are associated with states. It transitions the state in this finite automaton based on the state recognition information from the sensor input processing unit 50, the values of the emotion model, instinct model, or growth model in the model storage unit 51, the elapsed time, and the like, and determines the action corresponding to the state after the transition as the action to take next.
- The action determination mechanism unit 52 transitions the state when it detects that a predetermined trigger has occurred. That is, it transitions the state when, for example, the time for which the action corresponding to the current state has been executed reaches a predetermined time, when specific state recognition information is received, or when the value of the emotion, instinct, or growth state indicated by the state information supplied from the model storage unit 51 falls below or rises above a predetermined threshold.
- Because the action determination mechanism unit 52 transitions the state in the action model based not only on the state recognition information from the sensor input processing unit 50 but also on the values of the emotion model, instinct model, and growth model in the model storage unit 51, the destination of the state transition differs depending on those values (the state information) even when the same state recognition information is input.
- For example, when the state information indicates “not angry” and the state recognition information indicates “a palm is held out in front of the eyes,” the action determination mechanism unit 52 generates action command information for taking the action “hand” in response and sends it to the posture transition mechanism unit 53.
- When the state information indicates “not angry” and “hungry” and the state recognition information indicates “a palm is held out in front of the eyes,” the action determination mechanism unit 52 likewise generates action command information for the corresponding action and sends it to the posture transition mechanism unit 53.
- When the state recognition information indicates “a palm is held out in front of the eyes,” there are also cases in which the action determination mechanism unit 52 generates action command information whether the state information indicates “hungry” or “not hungry,” and sends it to the posture transition mechanism unit 53.
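A finite automaton whose transition destination depends on the state information, as described above, might look like the following sketch. Only the “hand” action and the palm trigger come from the text; every state name, the other actions, and the exact branching are hypothetical.

```python
# Action associated with each state of a toy action model.
ACTIONS = {
    "idle": "stand",
    "offering_hand": "hand",      # the "hand" action is named in the text
    "begging": "beg_for_food",    # hypothetical
    "looking_away": "look_away",  # hypothetical
}


def next_state(state, recognition, info):
    """Choose the transition destination from recognition AND state info."""
    if state == "idle" and recognition == "palm_in_front_of_eyes":
        if info["angry"]:
            return "looking_away"  # same result whether hungry or not
        return "begging" if info["hungry"] else "offering_hand"
    return state  # no trigger: stay in the current state


def decide(state, recognition, info):
    """Return the state after the transition and the action tied to it."""
    new_state = next_state(state, recognition, info)
    return new_state, ACTIONS[new_state]
```

The same recognition input, `"palm_in_front_of_eyes"`, therefore leads to different actions depending on the `angry` and `hungry` flags.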
- In addition to action command information for moving the robot's head, limbs, and so on as described above, the action determination mechanism unit 52 also generates action command information for causing the robot to speak.
- The action command information for causing the robot to speak is supplied to the speech synthesis unit 55, and it includes text corresponding to the synthesized sound to be generated by the speech synthesis unit 55.
- Upon receiving action command information from the action determination mechanism unit 52, the speech synthesis unit 55 generates a synthesized sound based on the text included in it and supplies the synthesized sound to the speaker 18 for output.
- As a result, the speaker 18 outputs, for example, the robot's cry, various requests to the user such as “I am hungry,” and responses to the user's call such as “What?”
- When a synthesized sound is output, the action determination mechanism unit 52 generates, as necessary, action command information for opening and closing the lower jaw 4A and outputs it to the posture transition mechanism unit 53. The lower jaw 4A then opens and closes, giving the user the impression that the robot is talking.
- The posture transition mechanism unit 53 generates posture transition information for moving the robot from its current posture to the next posture based on the action command information supplied from the action determination mechanism unit 52, and sends it to the control mechanism unit 54.
- The control mechanism unit 54 generates control signals for driving the actuators 3AA_1 to 5A_1 and 5A_2 in accordance with the posture transition information from the posture transition mechanism unit 53, and sends them to the actuators 3AA_1 to 5A_1 and 5A_2.
- The actuators 3AA_1 to 5A_1 and 5A_2 are thereby driven according to the control signals, and the robot acts autonomously.
- FIG. 4 illustrates a configuration example of the voice recognition unit 50A.
- This audio data is supplied to the feature extraction unit 22.
- The feature extraction unit 22 performs, for example, MFCC (Mel Frequency Cepstrum Coefficient) analysis on the audio data input to it for each appropriate frame, and outputs the resulting MFCCs as a feature vector (feature parameters) to the matching unit 23 and the unregistered word section processing unit 27.
- The feature extraction unit 22 can alternatively extract, for example, linear prediction coefficients, cepstrum coefficients, line spectrum pairs, or the power in each predetermined frequency band (filter bank output) as the feature vector.
- The matching unit 23 uses the feature vectors from the feature extraction unit 22 and, referring as necessary to the acoustic model storage unit 24, the dictionary storage unit 25, and the grammar storage unit 26, recognizes the voice input to the microphone 15 (input voice) based on, for example, the continuous distribution HMM (Hidden Markov Model) method.
- The acoustic model storage unit 24 stores acoustic models representing the acoustic features of sub-words, such as individual phonemes and syllables, in the language of the speech to be recognized (for example, HMMs, as well as standard patterns used for DP (Dynamic Programming) matching).
- The dictionary storage unit 25 stores a word dictionary in which pronunciation information (phoneme information), clustered for each word to be recognized, is associated with the heading of that word.
- FIG. 5 shows a word dictionary stored in the dictionary storage unit 25.
- In the word dictionary, the heading of each word is associated with its phoneme sequence, and the phoneme sequences are clustered for each corresponding word.
- One entry (one line in FIG. 5) corresponds to one cluster.
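As a rough illustration of a dictionary in which one entry corresponds to one cluster of phoneme sequences, consider the sketch below. The headings, phoneme symbols, and helper names are hypothetical and are not taken from FIG. 5.

```python
# heading -> cluster of phoneme sequences (one entry = one cluster).
# All entries are illustrative only.
word_dictionary = {
    "hello": [["h", "e", "l", "o"]],
    "robot": [["r", "o", "b", "o", "t"], ["r", "o", "b", "a", "t"]],
}


def register(dictionary, heading, phonemes):
    """Add a phoneme sequence to the cluster stored under a heading."""
    dictionary.setdefault(heading, []).append(list(phonemes))


def lookup(dictionary, phonemes):
    """Return the heading whose cluster contains this phoneme sequence."""
    for heading, cluster in dictionary.items():
        if list(phonemes) in cluster:
            return heading
    return None
```

Keeping a list of sequences per heading lets several pronunciation variants share one entry.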
- the grammar storage unit 26 stores grammar rules that describe how each word registered in the word dictionary of the dictionary storage unit 25 is linked (connected).
- FIG. 6 shows the grammar rules stored in the grammar storage unit 26.
- The grammar rules in FIG. 6 are written in EBNF (Extended Backus Naur Form).
- One grammar rule runs from the beginning of a line to the first occurrence of “;”.
- An alphabetic string preceded by “$” represents a variable, and an alphabetic string not preceded by “$” represents the heading of a word (the romanized headings shown in FIG. 5).
- A part enclosed in [] can be omitted.
- The variables $sil and $garbage are not defined in the grammar rules; $sil represents a silence acoustic model (silence model), and $garbage represents a garbage model that basically allows free transitions between phonemes.
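The conventions above (a rule ends at the first “;”, and “$” marks a variable) can be sketched with a small tokenizer. The rule text used in the usage note, the `name = body` layout, and the function name are assumptions for illustration; the actual rules of FIG. 6 are not reproduced here.

```python
import re


def parse_rule(line):
    """Split one grammar-rule line into its name, variables, and headings."""
    rule = line.split(";", 1)[0]  # the rule ends at the first ";"
    name, body = (part.strip() for part in rule.split("=", 1))
    tokens = re.findall(r"\$?\w+", body)  # brackets are skipped in this sketch
    variables = [t for t in tokens if t.startswith("$")]
    headings = [t for t in tokens if not t.startswith("$")]
    return {"name": name, "variables": variables, "headings": headings}
```

For a hypothetical line such as `$pat = [$sil] kore wa $garbage;`, the variables are `$sil` and `$garbage` and the word headings are `kore` and `wa`. The sketch ignores the optional-part brackets rather than modeling them.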
- The matching unit 23 constructs the acoustic model of a word (word model) by connecting acoustic models stored in the acoustic model storage unit 24 with reference to the word dictionary in the dictionary storage unit 25. It further connects several word models by referring to the grammar rules stored in the grammar storage unit 26, and uses the connected word models to recognize the speech input to the microphone 15 from the feature vectors by the continuous distribution HMM method.
- That is, the matching unit 23 detects the sequence of word models with the highest score (likelihood) of observing the time-series feature vectors output by the feature extraction unit 22, and outputs the headings of the word string corresponding to that sequence as the speech recognition result.
- More specifically, the matching unit 23 accumulates the appearance probabilities (output probabilities) of the feature vectors for the word string corresponding to the connected word models, takes the accumulated value as the score, and outputs the headings of the word string with the highest score as the speech recognition result.
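The score accumulation and selection of the best word string can be sketched as follows. Accumulating in the log domain is an assumption of this sketch (the text only says the probabilities are accumulated), and the per-frame probabilities are taken as given inputs rather than computed from an HMM.

```python
import math


def sequence_score(output_probs):
    """Accumulate per-frame output probabilities as a sum of logs."""
    return sum(math.log(p) for p in output_probs)


def best_word_string(hypotheses):
    """hypotheses: heading string -> list of per-frame output probabilities.

    Returns the heading string with the highest accumulated score.
    """
    return max(hypotheses, key=lambda h: sequence_score(hypotheses[h]))
```

Summing logarithms ranks hypotheses the same way as multiplying the probabilities, while avoiding numerical underflow on long utterances.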
- the recognition result of the voice input to the microphone 15 output as described above is output to the model storage unit 51 and the action determination mechanism unit 52 as state recognition information.
- When the rule for unregistered words is applied, the matching unit 23 detects the phoneme sequence given by the phoneme transitions in the garbage model represented by the variable $garbage as the phoneme sequence of the unregistered word. The matching unit 23 then supplies the speech section and phoneme sequence of the unregistered word, detected when a speech recognition result applying the unregistered-word rule is obtained, to the unregistered word section processing unit 27.
- The unregistered word section processing unit 27 temporarily stores the sequence of feature vectors (feature vector sequence) supplied from the feature extraction unit 22. When it receives the speech section and phoneme sequence of an unregistered word from the matching unit 23, it detects the feature vector sequence of the speech in that section from the temporarily stored feature vector sequence. It then attaches a unique ID (identification) to the phoneme sequence (unregistered word) from the matching unit 23 and supplies the ID to the feature vector buffer 28 together with the phoneme sequence of the unregistered word and the feature vector sequence of its speech section.
- The feature vector buffer 28 temporarily stores the ID, phoneme sequence, and feature vector sequence of each unregistered word supplied from the unregistered word section processing unit 27 in association with one another.
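A buffer that keeps ID, phoneme sequence, and feature vector sequence together could be sketched as below. For brevity this sketch lets the buffer assign the unique ID itself, whereas in the text the unregistered word section processing unit 27 assigns it; all class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class UnregisteredWord:
    word_id: int
    phonemes: list  # phoneme sequence from the garbage model
    features: list  # one feature vector per frame of the speech section


@dataclass
class FeatureVectorBuffer:
    entries: dict = field(default_factory=dict)
    _next_id: count = field(default_factory=lambda: count(1))

    def store(self, phonemes, features):
        """Store one unregistered word under a fresh unique ID."""
        word_id = next(self._next_id)
        self.entries[word_id] = UnregisteredWord(
            word_id, list(phonemes), list(features)
        )
        return word_id
```

Each call to `store` returns the ID under which the phoneme sequence and its feature vectors can later be retrieved.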
- The clustering unit 29 calculates, for an unregistered word newly stored in the feature vector buffer 28 (hereinafter, a new unregistered word), a score against each of the other unregistered words already stored in the feature vector buffer 28 (hereinafter, stored unregistered words).
- That is, the clustering unit 29 treats the new unregistered word as input voice and the stored unregistered words as words registered in the word dictionary, and calculates the score of the new unregistered word against each stored unregistered word in the same way as the matching unit 23. Specifically, it obtains the feature vector sequence of the new unregistered word from the feature vector buffer 28, connects acoustic models according to the phoneme sequence of a stored unregistered word, and calculates the score as the likelihood that the feature vector sequence of the new unregistered word is observed from the connected acoustic models.
- The acoustic models used here are those stored in the acoustic model storage unit 24.
- Likewise, the clustering unit 29 calculates the score of each stored unregistered word against the new unregistered word, and updates the score sheet stored in the score sheet storage unit 30 with these scores.
- The clustering unit 29 then refers to the updated score sheet and detects, from the clusters obtained by clustering the unregistered words already obtained (stored unregistered words), the cluster to which the new unregistered word is to be added as a new member. It makes the new unregistered word a new member of the detected cluster, divides that cluster based on its members, and updates the score sheet stored in the score sheet storage unit 30 based on the result of the division.
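The detection and membership steps might be sketched as follows. The selection rule used here (join the cluster of the stored word against which the new word scores highest) is an assumption of this sketch, not a rule stated in this passage, and the division step is omitted because its criterion is not given here.

```python
def detect_cluster(scores_for_new, cluster_of):
    """Pick the cluster for a new unregistered word.

    scores_for_new: stored word ID -> score of the new word against it.
    cluster_of:     stored word ID -> cluster number.
    Assumed rule: use the cluster of the best-scoring stored word.
    """
    best_id = max(scores_for_new, key=scores_for_new.get)
    return cluster_of[best_id]


def add_member(clusters, cluster_number, new_id):
    """Make the new word a member of the detected cluster."""
    clusters.setdefault(cluster_number, []).append(new_id)
    return clusters
```

After `add_member`, a real implementation would go on to decide whether the enlarged cluster should be divided.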
- the score sheet storage unit 30 stores a score sheet in which a score for a newly-registered unregistered word, a score for a newly-registered unregistered word, and the like for a newly-registered unregistered word are registered.
- FIG. 8 shows the score sheet.
- The score sheet is composed of entries, each describing the “ID”, “phoneme sequence”, “cluster number”, “representative member ID”, and “score” of an unregistered word.
- The “cluster number” is a number identifying the cluster of which the unregistered word of the entry is a member; it is assigned by the clustering unit 29 and registered in the score sheet.
- The “representative member ID” is the ID of the unregistered word serving as the representative member of the cluster of which the unregistered word of the entry is a member. The representative member of that cluster can be recognized from this representative member ID.
- The representative member of a cluster is obtained by the clustering unit 29, and its ID is registered as the representative member ID in the score sheet. The “score” is the score of the unregistered word of the entry for each of the other unregistered words, calculated by the clustering unit 29 as described above.
- For example, when N unregistered words are stored, the score sheet contains the ID, phoneme sequence, cluster number, representative member ID, and scores of each of those N unregistered words.
- When a new unregistered word is stored, the ID, phoneme sequence, cluster number, and representative member ID of the new unregistered word, together with its scores for each of the N stored unregistered words, s(N+1, 1), s(N+1, 2), ..., s(N+1, N), are added to the score sheet.
- In addition, the scores of each of the N stored unregistered words for the new unregistered word, s(1, N+1), s(2, N+1), ..., s(N, N+1), are added to the score sheet.
- The cluster numbers and representative member IDs of the unregistered words in the score sheet are changed as necessary.
- Here, the score of the unregistered word (utterance) with ID i for the unregistered word (phoneme sequence) with ID j is expressed as s(i, j).
- The score s(i, i) of the unregistered word (utterance) with ID i for the unregistered word (phoneme sequence) with the same ID i is also registered in the score sheet.
- However, since the score s(i, i) is calculated when the matching unit 23 detects the phoneme sequence of the unregistered word, it does not need to be calculated in the clustering unit 29.
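The score sheet described above can be sketched as a small data structure. The following is a minimal illustration, not the patent's implementation: the names `Entry`, `ScoreSheet`, `add_word`, and `score_fn` are hypothetical, and `score_fn(i, j)` stands in for the acoustic scoring of utterance i against phoneme sequence j.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Entry:
    """One row of the score sheet (field names are illustrative)."""
    word_id: int
    phonemes: str
    cluster: Optional[int] = None          # cluster number
    representative: Optional[int] = None   # representative member ID
    scores: Dict[int, float] = field(default_factory=dict)  # scores[j] = s(word_id, j)


class ScoreSheet:
    def __init__(self) -> None:
        self.entries: Dict[int, Entry] = {}

    def s(self, i: int, j: int) -> float:
        """Score of utterance i for phoneme sequence j."""
        return self.entries[i].scores[j]

    def add_word(self, word_id: int, phonemes: str, self_score: float,
                 score_fn: Callable[[int, int], float]) -> Entry:
        """Add word N+1: fill the new row s(N+1, j) and the new column s(i, N+1)."""
        new = Entry(word_id, phonemes)
        new.scores[word_id] = self_score  # s(i, i) is already known from matching
        for other in self.entries.values():
            new.scores[other.word_id] = score_fn(word_id, other.word_id)  # s(N+1, j)
            other.scores[word_id] = score_fn(other.word_id, word_id)      # s(i, N+1)
        self.entries[word_id] = new
        return new
```

Keeping each row as a dictionary keyed by word ID makes adding the (N+1)-th row and column a single pass over the existing entries.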
- The maintenance unit 31 updates the word dictionary stored in the dictionary storage unit 25 based on the updated score sheet in the score sheet storage unit 30.
- The representative member of a cluster is determined, for example, as follows. Among the unregistered words that are members of the cluster, the one that maximizes the sum of the scores for the other unregistered words (or, alternatively, the average obtained by dividing that sum by the number of other unregistered words) becomes the representative member of the cluster. In this case, if the ID of a member belonging to the cluster is denoted by k, the member whose ID is the value K (∈ k) given by the following equation becomes the representative member.
  $$K = \operatorname*{arg\,max}_{k} \Bigl\{ \sum_{k' \neq k} s(k', k) \Bigr\} \qquad (1)$$
- Here, k' denotes, like k, the ID of a member of the cluster, and the sum is taken over the other members k' of the cluster.
- When the cluster has one or two unregistered words as members, no score calculation is needed to determine the representative member. That is, if the cluster has a single member, that unregistered word becomes the representative member; if it has two members, either of the two may be the representative member.
- The method of determining the representative member is not limited to the one described above.
- For example, the member that minimizes the sum of its distances, in the feature vector space, to each of the other unregistered words in the cluster can also be used as the representative member of the cluster.
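The score-based selection of the representative member can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `representative` is hypothetical, `s[i][j]` holds the score s(i, j), and the sum runs over the other members only, following the wording above.

```python
def representative(members, s):
    """Return the ID k that maximizes the sum of s[k2][k] over the other members k2.

    members: IDs of the unregistered words in one cluster.
    s: nested mapping with s[i][j] = score of utterance i for phoneme sequence j.
    """
    members = list(members)
    if len(members) <= 2:
        # With one or two members no score calculation is needed;
        # here the first member is picked arbitrarily.
        return members[0]
    return max(members, key=lambda k: sum(s[k2][k] for k2 in members if k2 != k))
```

Dividing each sum by the number of other members, as the text also allows, would not change which member wins within one cluster, because every member has the same number of other members.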
- In the configuration described above, speech recognition processing for recognizing speech input to the microphone 15 and unregistered word processing for unregistered words are performed.
- The uttered voice is converted into digital voice data via the microphone 15 and the AD converter 21 and supplied to the feature extraction unit 22.
- The feature extraction unit 22 extracts feature vectors by acoustically analyzing the voice data in units of a predetermined frame, and supplies the resulting feature vector sequence to the matching unit 23 and the unregistered word section processing unit 27.
- In step S2, the matching unit 23 performs the score calculation described above on the feature vector sequence from the feature extraction unit 22, and the process proceeds to step S3.
- In step S3, the matching unit 23 obtains and outputs the headings of the word string that is the speech recognition result, based on the scores obtained in the score calculation.
- In step S4, it is determined whether or not the user's voice includes an unregistered word.
- The unregistered word section processing unit 27 temporarily stores the feature vector sequence supplied from the feature extraction unit 22. When the voice section and phoneme sequence of an unregistered word are supplied from the matching unit 23, it detects the feature vector sequence of the voice in that voice section. Further, the unregistered word section processing unit 27 assigns an ID to the unregistered word (its phoneme sequence) from the matching unit 23, and supplies the ID to the feature vector buffer 28 together with the phoneme sequence of the unregistered word and the feature vector sequence of its voice section.
- The unregistered word processing is then performed.
- FIG. 10 shows a flowchart for explaining the unregistered word processing.
- First, the clustering unit 29 reads out the ID and phoneme sequence of the new unregistered word from the feature vector buffer 28, and the process proceeds to step S12.
- In step S12, the clustering unit 29 refers to the score sheet in the score sheet storage unit 30 to determine whether a cluster that has already been obtained (generated) exists.
- If it is determined in step S12 that no cluster has yet been obtained, that is, if the new unregistered word is the first unregistered word and the score sheet contains no entry for a stored unregistered word, the process proceeds to step S13. There, the clustering unit 29 newly generates a cluster having the new unregistered word as its representative member, and updates the score sheet in the score sheet storage unit 30 by registering information on the new cluster and on the new unregistered word.
- That is, the clustering unit 29 registers the ID and phoneme sequence of the new unregistered word read from the feature vector buffer 28 in the score sheet (FIG. 8). It also generates a unique cluster number and registers it in the score sheet as the cluster number of the new unregistered word. In addition, the clustering unit 29 registers the ID of the new unregistered word in the score sheet as the representative member ID of the new unregistered word. In this case, therefore, the new unregistered word becomes the representative member of the new cluster.
- In this case, no score is calculated, because there is no stored unregistered word against which to score the new unregistered word.
- After the processing in step S13, the process proceeds to step S22, where the maintenance unit 31 updates the word dictionary in the dictionary storage unit 25 based on the score sheet updated in step S13, and the processing ends.
- That is, the maintenance unit 31 adds an entry corresponding to the new cluster to the word dictionary in the dictionary storage unit 25, and registers, as the phoneme sequence of that entry, the phoneme sequence of the representative member of the new cluster, which in this case is the phoneme sequence of the new unregistered word.
- On the other hand, if it is determined in step S12 that a cluster has already been obtained, that is, if the new unregistered word is not the first unregistered word and the score sheet (FIG. 8) therefore contains entries (rows) for stored unregistered words, the process proceeds to step S14. There, the clustering unit 29 calculates the score of the new unregistered word for each stored unregistered word, and also the score of each stored unregistered word for the new unregistered word.
- That is, the clustering unit 29 calculates the part of the score sheet shown by a dotted line in FIG. 8: the scores s(N+1, 1), s(N+1, 2), ..., s(N+1, N) of the new unregistered word for each of the N stored unregistered words, and the scores s(1, N+1), s(2, N+1), ..., s(N, N+1) of each of the N stored unregistered words for the new unregistered word. Calculating these scores requires the feature vector sequences of the new unregistered word and of the N stored unregistered words; the clustering unit 29 recognizes these by referring to the feature vector buffer 28.
- The clustering unit 29 then adds the calculated scores to the score sheet (FIG. 8) together with the ID and phoneme sequence of the new unregistered word, and the process proceeds to step S15.
- In step S15, the clustering unit 29 recognizes the stored unregistered words that are representative members by referring to the representative member IDs in the score sheet and, by further referring to the scores in the score sheet, detects the representative member that maximizes the score for the new unregistered word. The clustering unit 29 then detects the cluster identified by the cluster number of that representative member.
- In step S16, the clustering unit 29 adds the new unregistered word to the members of the cluster detected in step S15 (hereinafter referred to as the detected cluster, as appropriate). That is, the clustering unit 29 writes the cluster number of the representative member of the detected cluster into the score sheet as the cluster number of the new unregistered word.
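Steps S15 and S16 can be sketched as below. This is only an illustration: the function and parameter names are hypothetical, and it assumes that "the score for the new unregistered word" means s(new, representative), that is, the new utterance scored against the representative's phoneme sequence.

```python
def assign_to_cluster(new_id, cluster_of, representative_of, s):
    """Detect the cluster whose representative scores highest and join it.

    cluster_of: word ID -> cluster number (updated in place).
    representative_of: cluster number -> ID of that cluster's representative member.
    s: nested mapping with s[i][j] = score of utterance i for phoneme sequence j.
    """
    # Step S15: find the cluster whose representative maximizes the score.
    detected = max(representative_of,
                   key=lambda c: s[new_id][representative_of[c]])
    # Step S16: write the detected cluster's number as the new word's cluster number.
    cluster_of[new_id] = detected
    return detected
```

Only the representatives are compared, so the cost of this step grows with the number of clusters rather than with the number of stored utterances.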
- In step S17, the clustering unit 29 performs cluster division processing that divides the detected cluster into, for example, two clusters, and the process proceeds to step S18.
- In step S18, the clustering unit 29 determines whether the detected cluster was divided into two clusters by the cluster division processing in step S17. If it determines that the cluster was divided, the process proceeds to step S19.
- In step S19, the clustering unit 29 obtains the intercluster distance between the two clusters into which the detected cluster was divided (hereinafter referred to as the first child cluster and the second child cluster, as appropriate).
- The intercluster distance between the first and second child clusters is defined, for example, as follows.
- If the ID of an arbitrary member (unregistered word) of either the first or the second child cluster is denoted by k, and the IDs of the representative members (unregistered words) of the first and second child clusters are denoted by k1 and k2, respectively, the value D(k1, k2) given by the following equation is defined as the intercluster distance between the first and second child clusters.
  $$D(k1, k2) = \operatorname{maxval}_{k} \bigl\{ \operatorname{abs}\bigl( \log s(k, k1) - \log s(k, k2) \bigr) \bigr\} \qquad (2)$$
- Here, abs() represents the absolute value of the value in the parentheses.
- maxval_k{} represents the maximum of the value in the braces obtained by varying k.
- log represents the natural logarithm or the common logarithm.
- If the member with ID k is denoted as member #k, the reciprocal 1/s(k, k1) of the score in equation (2) corresponds to the distance between member #k and the representative member #k1.
- Likewise, the reciprocal 1/s(k, k2) of the score corresponds to the distance between member #k and the representative member #k2. Therefore, according to equation (2), the intercluster distance between the first and second child clusters is the maximum, over the members of the first and second child clusters, of the difference between a member's distance from the representative member #k1 of the first child cluster and its distance from the representative member #k2 of the second child cluster.
- The intercluster distance is not limited to the one described above.
- For example, an integrated value of distances can also be used as the intercluster distance.
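A minimal sketch of the intercluster distance built from the pieces the text defines (abs, maxval over k, and log), assuming all scores are positive so that the logarithm is defined. The names `intercluster_distance` and `should_split` are hypothetical.

```python
import math


def intercluster_distance(members, k1, k2, s):
    """Largest absolute difference of log scores against the two representatives.

    members: IDs of all members of both child clusters.
    k1, k2: IDs of the representatives of the first and second child clusters.
    s: nested mapping with s[i][j] = score of utterance i for phoneme sequence j (> 0).
    """
    return max(abs(math.log(s[k][k1]) - math.log(s[k][k2])) for k in members)


def should_split(distance, threshold):
    """Threshold test: keep the split only if the child clusters are far enough apart."""
    return distance > threshold
```

Because abs(log a − log b) equals abs(log(1/a) − log(1/b)), the same value results whether the scores or their reciprocals (the distances mentioned in the text) are used.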
- After the processing in step S19, the process proceeds to step S20, where the clustering unit 29 determines whether the intercluster distance between the first and second child clusters is larger than a predetermined threshold (or is equal to or larger than the threshold).
- If it is determined in step S20 that the intercluster distance is larger than the predetermined threshold, that is, if the unregistered words that are members of the detected cluster are considered, on the basis of their acoustic characteristics, to belong in two separate clusters, the process proceeds to step S21, where the clustering unit 29 registers the first and second child clusters in the score sheet of the score sheet storage unit 30.
- That is, the clustering unit 29 assigns a unique cluster number to each of the first and second child clusters, and updates the score sheet so that, among the members of the detected cluster, the cluster number of each member clustered into the first child cluster is changed to the cluster number of the first child cluster, and the cluster number of each member clustered into the second child cluster is changed to the cluster number of the second child cluster.
- Further, the clustering unit 29 updates the score sheet so that the representative member ID of each member clustered into the first child cluster is the ID of the representative member of the first child cluster, and the representative member ID of each member clustered into the second child cluster is the ID of the representative member of the second child cluster.
- When the clustering unit 29 has registered the first and second child clusters in the score sheet as described above, the process proceeds from step S21 to step S22, where the maintenance unit 31 updates the word dictionary in the dictionary storage unit 25 based on the score sheet, and the processing ends.
- That is, in this case, since the detected cluster has been divided into the first and second child clusters, the maintenance unit 31 first deletes the entry corresponding to the detected cluster from the word dictionary. It then adds to the word dictionary two entries corresponding to the first and second child clusters, registering the phoneme sequence of the representative member of the first child cluster as the phoneme sequence of the entry for the first child cluster, and the phoneme sequence of the representative member of the second child cluster as the phoneme sequence of the entry for the second child cluster.
- On the other hand, if it is determined in step S18 that the detected cluster could not be divided into two clusters by the cluster division processing in step S17, or if it is determined in step S20 that the intercluster distance between the first and second child clusters is not larger than the predetermined threshold (that is, the acoustic characteristics of the members of the detected cluster are not dissimilar enough to warrant clustering them into the first and second child clusters), the process proceeds to step S23, where the clustering unit 29 obtains a new representative member of the detected cluster and updates the score sheet.
- That is, for each member of the detected cluster to which the new unregistered word has been added, the clustering unit 29 refers to the score sheet in the score sheet storage unit 30 to recognize the scores s(k', k) needed to evaluate equation (1). Using the recognized scores s(k', k), the clustering unit 29 obtains, based on equation (1), the ID of the member that is to be the new representative member of the detected cluster. The clustering unit 29 then rewrites the representative member ID of each member of the detected cluster in the score sheet (FIG. 8) to the ID of the new representative member of the detected cluster.
- The process then proceeds to step S22, where the maintenance unit 31 updates the word dictionary in the dictionary storage unit 25 based on the score sheet, and the processing ends.
- That is, in this case, the maintenance unit 31 recognizes the new representative member of the detected cluster by referring to the score sheet, and further recognizes the phoneme sequence of that representative member. The maintenance unit 31 then changes the phoneme sequence of the entry corresponding to the detected cluster in the word dictionary to the phoneme sequence of the new representative member of the detected cluster.
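The three dictionary updates performed in step S22 (a new cluster, a divided cluster, and a changed representative) can be sketched with a single helper. This is an illustrative assumption about how the word dictionary might be keyed; `update_dictionary` and its parameters are hypothetical names.

```python
def update_dictionary(dictionary, phonemes, *, removed=(), representatives=None):
    """Bring the word dictionary in line with the clustering result.

    dictionary: cluster number -> phoneme sequence registered for that cluster.
    phonemes: word ID -> phoneme sequence of that unregistered word.
    removed: cluster numbers whose entries are deleted (e.g. a divided cluster).
    representatives: cluster number -> representative member ID to (re)register.
    """
    for cluster in removed:
        dictionary.pop(cluster, None)
    for cluster, rep_id in (representatives or {}).items():
        # Each cluster is represented in the dictionary by the phoneme
        # sequence of its representative member.
        dictionary[cluster] = phonemes[rep_id]
    return dictionary
```

For a division, the call would pass the detected cluster in `removed` and both child clusters in `representatives`; for a changed representative, only `representatives` is needed. The dictionary thus holds one entry per cluster rather than one per utterance.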
- In the cluster division processing, in step S31, the clustering unit 29 selects, from the detected cluster to which the new unregistered word has been added as a member, a combination of two arbitrary members that has not yet been selected, and makes each of them a temporary representative member.
- These two temporary representative members are hereinafter referred to as the first temporary representative member and the second temporary representative member, as appropriate.
- In step S32, the clustering unit 29 determines whether the members of the detected cluster can be divided into two clusters such that the first and second temporary representative members are the respective representative members.
- Determining whether a member can serve as a representative member requires evaluating equation (1); the scores s(k', k) used in this calculation are recognized by referring to the score sheet.
- If it is determined in step S32 that the members of the detected cluster cannot be divided into two clusters such that the first and second temporary representative members are the respective representative members, step S33 is skipped and the process proceeds to step S34.
- If it is determined in step S32 that the members of the detected cluster can be divided in this way, the process proceeds to step S33, where the clustering unit 29 divides the members of the detected cluster into two clusters such that the first and second temporary representative members are the respective representative members, and makes the pair of clusters resulting from the division a candidate for the first and second child clusters (hereinafter referred to as a pair of candidate clusters, as appropriate). The process then proceeds to step S34.
- In step S34, the clustering unit 29 determines whether the detected cluster still contains a pair (combination) of two members that has not yet been selected as the pair of first and second temporary representative members. If such a pair exists, the process returns to step S31, where a pair of two members of the detected cluster not yet selected as the first and second temporary representative members is selected, and the same processing is repeated.
- If it is determined in step S34 that no pair of members of the detected cluster remains unselected as the first and second temporary representative members, the process proceeds to step S35, where the clustering unit 29 determines whether a pair of candidate clusters exists.
- If it is determined in step S35 that no pair of candidate clusters exists, step S36 is skipped and the process returns. In this case, it is determined in step S18 of FIG. 10 that the detected cluster could not be divided. On the other hand, if it is determined in step S35 that a pair of candidate clusters exists, the process proceeds to step S36. When there are a plurality of pairs of candidate clusters, the clustering unit 29 obtains the intercluster distance between the two clusters of each pair, selects the pair of candidate clusters with the smallest intercluster distance, and returns that pair as the division result of the detected cluster, that is, as the first and second child clusters. If there is only one pair of candidate clusters, that pair is used as the first and second child clusters.
- In this case, it is determined in step S18 of FIG. 10 that the detected cluster could be divided.
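The loop over steps S31 to S36 can be sketched as follows. The text does not spell out how members are assigned to the two temporary representatives, so this sketch assumes each member joins the representative for which its score is higher, and treats a split as valid only if equation (1) confirms both temporary representatives. All names are hypothetical, and scores are assumed positive.

```python
import math
from itertools import combinations


def score_sum(k, members, s):
    """Sum of the scores of the other members for member k, as in equation (1)."""
    return sum(s[k2][k] for k2 in members if k2 != k)


def can_represent(r, members, s):
    """True if r may be the representative member of `members`."""
    if len(members) <= 2:
        return True  # any member may represent a cluster of one or two
    best = max(score_sum(k, members, s) for k in members)
    return score_sum(r, members, s) == best


def distance(members, k1, k2, s):
    """Intercluster distance in the sense of equation (2)."""
    return max(abs(math.log(s[k][k1]) - math.log(s[k][k2])) for k in members)


def split_cluster(members, s):
    """Return (first_child, second_child), or None if no valid split exists."""
    candidates = []
    for r1, r2 in combinations(members, 2):            # S31 and S34: every pair once
        # Assumed assignment rule: a member joins the representative it scores higher for.
        c1 = [k for k in members
              if k == r1 or (k != r2 and s[k][r1] >= s[k][r2])]
        c2 = [k for k in members if k not in c1]
        if can_represent(r1, c1, s) and can_represent(r2, c2, s):   # S32
            candidates.append((distance(members, r1, r2, s), c1, c2))  # S33
    if not candidates:                                  # S35: nothing to return
        return None
    _, c1, c2 = min(candidates, key=lambda c: c[0])     # S36: smallest distance
    return c1, c2
```

Trying every pair is quadratic in the cluster size, which is one reason the size of a cluster matters for the running time of this step.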
- As described above, the clustering unit 29 detects, from among the clusters already obtained by clustering unregistered words, the cluster to which the new unregistered word is to be added as a new member (the detected cluster), makes the new unregistered word a new member of the detected cluster, and divides the detected cluster based on its members, so that unregistered words can be clustered easily. Furthermore, since the maintenance unit 31 updates the word dictionary based on these clustering results, unregistered words can be registered in the word dictionary easily while keeping the word dictionary from growing large.
- Even if, for example, the matching unit 23 erroneously detects the voice section of an unregistered word, the division of the detected cluster places such an unregistered word in a cluster separate from the unregistered words whose voice sections were detected correctly. An entry corresponding to such a cluster is then registered in the word dictionary, but because the phoneme sequence of this entry corresponds to a voice section that was not detected correctly, it does not yield a large score in subsequent speech recognition. Therefore, even if the voice section of an unregistered word is detected erroneously, the error has almost no effect on subsequent speech recognition.
- FIG. 12 shows a simulation result of clustering obtained by uttering an unregistered word.
- In FIG. 12, each entry (row) represents one cluster.
- The left column of FIG. 12 shows the phoneme sequence of the representative member (unregistered word) of each cluster, and the right column shows the utterance content and the number of the unregistered words that are members of each cluster.
- For example, the entry on the first line represents a cluster whose only member is one utterance of the unregistered word “bath”, and the phoneme sequence of its representative member is “doroa:”.
- The entry on the second line represents a cluster whose members are three utterances of the unregistered word “bath”, and the phoneme sequence of its representative member is “kuro”.
- The entry on the seventh line represents a cluster whose members are four utterances of the unregistered word “book”, and the phoneme sequence of its representative member is “NhoNde:su”.
- The entry on the eighth line represents a cluster whose members are one utterance of the unregistered word “orange” and 19 utterances of the unregistered word “book”, and the phoneme sequence of its representative member is “ohoN”.
- The other entries are to be read in the same way. FIG. 12 shows that utterances of the same unregistered word are clustered well.
- In the entry on the eighth line, however, one utterance of the unregistered word “orange” and 19 utterances of the unregistered word “book” are clustered into the same cluster. Judging from the utterances that are its members, this cluster should be a cluster of the unregistered word “book”, yet an utterance of the unregistered word “orange” is also a member.
- It is considered, however, that when further utterances of the unregistered word “book” are input, this cluster will also be divided into a cluster whose members are only utterances of “book” and a cluster whose only member is the utterance of “orange”.
- Although the present invention has been described above as applied to a robot, it is not limited to this and can be widely applied, for example, to a voice interaction system equipped with a voice recognition device. Further, the present invention is applicable not only to a robot in the real world but also to a virtual robot displayed on a display device such as a liquid crystal display.
- In the embodiment described above, the series of processing is performed by causing the CPU 10A to execute a program, but the series of processing may also be performed by dedicated hardware.
- The program can be stored in advance in the memory 10B (FIG. 2), or stored (recorded) temporarily or permanently on a removable recording medium such as a flexible disk, CD-ROM (Compact Disc Read Only Memory), MO (Magneto-Optical) disk, DVD (Digital Versatile Disc), magnetic disk, or semiconductor memory. Such a removable recording medium can be provided as so-called package software and installed in the robot (memory 10B).
- The program can also be transferred wirelessly from a download site via an artificial satellite for digital satellite broadcasting, or transferred by wire through a network such as a LAN (Local Area Network) or the Internet, and installed in the memory 10B.
- In this case, when the program is upgraded, the upgraded program can be easily installed in the memory 10B.
- The processing steps describing the program that causes the CPU 10A to perform various kinds of processing need not necessarily be executed in time series in the order described in the flowcharts; they also include processing executed in parallel or individually (for example, parallel processing or object-based processing).
- The program may be processed by a single CPU, or may be processed in a distributed manner by a plurality of CPUs.
- The voice recognition unit 50A in FIG. 4 can be realized by dedicated hardware or by software.
- When it is realized by software, a program constituting the software is installed in a general-purpose computer or the like.
- FIG. 13 shows a configuration example of an embodiment of a computer on which a program for realizing the voice recognition unit 50A is installed.
- FIG. 13 shows another example of the speech recognition device 91 to which the present invention is applied.
- The program can be recorded in advance on a hard disk 105 or a ROM 103 serving as a recording medium built into the computer.
- Alternatively, the program can be stored (recorded) temporarily or permanently on a removable recording medium 111 such as a flexible disk, CD-ROM, MO disk, DVD, magnetic disk, or semiconductor memory.
- Such a removable recording medium 111 can be provided as so-called package software.
- Besides being installed on the computer from the removable recording medium 111 described above, the program can be transferred to the computer wirelessly from a download site via an artificial satellite for digital satellite broadcasting, or by wire through a network such as a LAN or the Internet. The computer can receive the program thus transferred in the communication unit 108 and install it on the built-in hard disk 105.
- The speech recognition device 91 has a built-in CPU (Central Processing Unit) 102.
- An input/output interface 110 is connected to the CPU 102 via a bus 101. When a command is input by the user via the input/output interface 110, for example by operating a keyboard or the like, the CPU 102 executes the program stored in the ROM (Read Only Memory) 103 in accordance with the command.
- Alternatively, the CPU 102 loads into a RAM (Random Access Memory) 104 and executes a program stored on the hard disk 105, a program transferred from a satellite or a network, received by the communication unit 108, and installed on the hard disk 105, or a program read from the removable recording medium 111 mounted on the drive 109 and installed on the hard disk 105. The CPU 102 thereby performs the processing according to the flowcharts described above or the processing performed by the configurations of the block diagrams described above. Then, as necessary, the CPU 102 outputs the processing result, for example via the input/output interface 110, from the output unit 106 composed of a display such as an LCD (Liquid Crystal Display), a speaker, a DA (Digital Analog) converter, and the like, transmits it from the communication unit 108, or records it on the hard disk 105.
- FIG. 14 illustrates a configuration example of a software program of the speech recognition device 91.
- The software program is configured by a plurality of modules. Each module has its own independent algorithm and performs its own operations according to that algorithm. That is, each module is stored in RAM and is read out and executed as appropriate.
- Each module shown in FIG. 14 corresponds to one of the blocks described above. That is, the acoustic model buffer 133 corresponds to the acoustic model storage unit 24, the dictionary buffer 134 to the dictionary storage unit 25, the grammar buffer 135 to the grammar storage unit 26, the feature extraction module 131 to the feature extraction unit 22, the matching module 132 to the matching unit 23, the unregistered word section processing module 136 to the unregistered word section processing unit 27, the feature vector buffer 137 to the feature vector buffer 28, the clustering module 138 to the clustering unit 29, the score sheet buffer 139 to the score sheet storage unit 30, and the maintenance module 140 to the maintenance unit 31.
- The analog audio signal input from the microphone is sampled and quantized by the AD conversion unit, thereby being A/D-converted (analog-to-digital converted) into digital audio data, which is supplied to the feature extraction module 131.
- The feature vector buffer 137 stores, in association with one another, the ID, phoneme sequence, feature vector sequence, and recording time of each unregistered word supplied from the unregistered word section processing module 136.
- That is, the feature vector buffer 137 stores a data group composed of a plurality of unregistered word entries (rows).
- Here, sequential numbers starting from 1 are assigned to the unregistered words as IDs. Therefore, for example, when the IDs, phoneme sequences, feature vector sequences, and recording times of N unregistered words are stored in the feature vector buffer 137 and the matching module 132 newly detects the voice section and phoneme sequence of an unregistered word, the unregistered word section processing module 136 assigns N+1 to that unregistered word as its ID, and the ID (N+1), phoneme sequence, feature vector sequence, and recording time of the unregistered word are stored in the feature vector buffer 137, as shown by the dotted line in FIG. 15.
- Each entry in FIG. 15 is obtained by adding a recording time to the entry described earlier. The recording time indicates the time at which the entry was stored (recorded) in the feature vector buffer 137; its use will be described later.
- When clustering, the clustering module 138 refers to the “feature vector sequence” stored in the feature vector buffer 137.
- Such speech information referred to when clustering is performed is hereinafter referred to as “utterance information”.
- The “utterance information” is not limited to the “feature vector sequence”; it may be, for example, a “PCM (Pulse Code Modulation) signal” such as the audio data supplied to the feature extraction module 131.
- In that case, the “PCM signal” is stored in the feature vector buffer 137 instead of the “feature vector sequence”.
- With the modules described above, the voice recognition device 91 can perform the same operation as the voice recognition unit 50A.
- Accordingly, a detailed description of these modules and of the operation corresponding to the voice recognition unit 50A is omitted.
- the speech recognition unit 5 OA performs MFCC (Mel Frequency Cepstrura Coefficient) analysis on the speech waveform (for example, digital speech data, etc.) or feature vector (for example, digital speech MFCC, etc. obtained in the case of being applied is stored in the feature vector buffer 28 as a predetermined storage area or memory as utterance information for clustering unregistered words that are newly input in the future.
- Among the processes described above, when the speech recognition unit 50A executes the process of detecting, from among the clusters obtained by clustering speech, the cluster to which an unregistered word is to be added as a new member, it refers to the past utterance information stored in the storage area or memory that functions as the feature vector buffer 28.
- However, because the speech recognition unit 50A stores all the utterance information corresponding to the unregistered words, a large amount of storage area or memory is consumed as the input amount or number of inputs of unregistered words increases (as more unregistered words are acquired).
- For this reason, a feature vector erasing module 141, which erases such data, is further provided.
- The feature vector erasing module 141 refers to the score sheet and, if it determines that the number of members belonging to a predetermined cluster has exceeded a first predetermined number, deletes, from the data stored in the feature vector buffer 137, the utterance information of a second predetermined number of the members belonging to that cluster, together with the various data related to them.
- The various data related to a member include not only the ID and the phoneme sequence of the member, but also the data on the score sheet relating to the member.
- In this way, the feature vector erasing module 141 can prevent a cluster from growing beyond a certain size, thereby suppressing the consumption of memory (such as the RAM 103). It can also prevent the operation speed of the voice recognition device 91 from slowing, that is, prevent the performance of the voice recognition device 91 from deteriorating.
- The first and second numbers are related such that, for example, the first number is greater than or equal to the second number.
- The second number of members to be erased can be selected, for example, in order from the oldest recording time shown in FIG.
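The size-based rule above (once a cluster holds more than a first number of members, erase a second number of them, oldest recording time first) can be sketched as follows. The function name `members_to_erase` and the dictionary layout are hypothetical, chosen only for illustration.

```python
def members_to_erase(recorded_at_by_id, first_number, second_number):
    """Pick the members of one cluster that should be erased.

    recorded_at_by_id maps member ID -> recording time for a single cluster.
    Nothing is erased until the cluster holds more than `first_number`
    members; then the `second_number` oldest members are returned.
    """
    if len(recorded_at_by_id) <= first_number:
        return []
    oldest_first = sorted(recorded_at_by_id, key=recorded_at_by_id.get)
    return oldest_first[:second_number]
```

Choosing the oldest entries is only one possible selection policy; the text leaves the choice of which members to erase open.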
- When the feature vector erasing module 141 determines that the unreferenced time of a given cluster, supplied from the unreferenced time calculation module 142, has exceeded a given time, it deletes the utterance information of the members belonging to that cluster and the various data related to them.
- The unreferenced time calculation module 142 obtains from the feature vector buffer 137, for example, the latest recording time among the members belonging to a predetermined cluster (the time at which the entry was stored in the feature vector buffer 137) as the last reference time of that cluster.
- The unreferenced time calculation module 142 then subtracts the obtained last reference time from the current time to calculate the unreferenced time, during which the cluster has not been referred to, and supplies it to the feature vector erasing module 141.
- In this example, the unreferenced time calculation module 142 calculates the unreferenced time of all clusters at predetermined time intervals, but the clusters for which it is calculated are not particularly limited. For example, the unreferenced time calculation module 142 may calculate the unreferenced time only for clusters specified by the user or the like. The calculation method is likewise not limited. In this example, the unreferenced time is calculated from the recording times stored in the feature vector buffer 137, but the unreferenced time calculation module 142 may instead calculate it by directly monitoring and storing the last reference time of a predetermined cluster.
- In this example, the feature vector erasing module 141 refers to the unreferenced time supplied from the unreferenced time calculation module 142 and, among the data stored in the feature vector buffer 137, deletes all the utterance information of the members belonging to a cluster for which no new member has been registered for a long time, together with the various related data. Instead, it may delete the utterance information of only some of the members and the data related to them.
- In this example, the recording time of the last member (unregistered word) registered in the cluster is used as the last reference time of the cluster.
- However, the last reference time of the cluster may be any other time at which the cluster was referenced in some processing, such as the time at which it was detected as the detected cluster in step S15 of FIG. The time at which the cluster was registered as a child cluster in 21 is another example.
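The unreferenced-time calculation described above (current time minus the latest recording time among a cluster's members) can be sketched as below. The names `unreferenced_time` and `stale_clusters`, and the use of plain numbers for times, are assumptions made for illustration.

```python
def unreferenced_time(recording_times, now):
    """Time elapsed since the newest member of a cluster was recorded."""
    last_reference = max(recording_times)
    return now - last_reference


def stale_clusters(clusters, now, limit):
    """Return the clusters whose unreferenced time exceeds `limit`.

    clusters maps a cluster label -> list of member recording times.
    """
    return [
        label
        for label, times in clusters.items()
        if unreferenced_time(times, now) > limit
    ]
```

If a different event is used as the last reference time, for example the moment a cluster was last detected, only the input to `unreferenced_time` changes.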
- The feature vector erasing module 141 can also, for example, when an erasing instruction (trigger signal) for a predetermined cluster is supplied from the input unit 107 (for example, a keyboard), delete all or part of the utterance information of the members belonging to that cluster, and the various data related to them, stored in the feature vector buffer 137.
- When the feature vector erasing module 141 is made to erase a predetermined feature vector sequence by an external stimulus, regardless of the internal state of the voice recognition device 91, mounting the voice recognition device 91 on, for example, the pet robot shown in FIG. 1 described above makes it possible to realize in the robot memory loss caused by a strong stimulus.
- The feature vector erasing module 141 can also, for example, perform the erasing when the value of an emotion parameter (the amount of emotion) supplied from the emotion control module 143 exceeds a predetermined value (amount).
- The emotion control module 143 can be realized by the model storage unit 51 of FIG. That is, in this case, as described above, the model storage unit 51 supplies the state information indicating the states of emotion, instinct, and growth, represented by the values of the emotion model, the instinct model, and the growth model, to the feature vector erasing module 141 as the emotion amount.
- Since the feature vector erasing module 141 can refer to the amount of emotion (the values of the emotion parameters, that is, the values of the models) supplied from the emotion control module 143 and erase predetermined utterance information stored in the feature vector buffer 137, so-called "forgetting" can be realized in the robot of FIG. 1, for example when strong anger or the like arises in it (when the value of the "anger" parameter exceeds a predetermined value).
- The feature vector erasing module 141 can also, for example, when the total memory usage supplied from the memory usage calculation module 144 (for example, the total usage of the RAM 103, which includes the feature vector buffer 137 and the score sheet buffer 139) exceeds a predetermined amount, erase the utterance information of all or some of the members belonging to a predetermined cluster stored in the feature vector buffer 137, together with the various data related to them.
- The memory usage calculation module 144 constantly calculates the total usage (consumption) of the memory and supplies it to the feature vector erasing module 141 at predetermined intervals.
- In this way, the feature vector erasing module 141 constantly monitors the consumption of memory (the RAM 103, etc.) and, if the consumption exceeds a certain amount, reduces it.
- In the examples above, the feature vector erasing module 141 determines whether the value of a parameter exceeds a predetermined threshold set in advance. The parameter is the number of cluster members (the number of entries for members of the same cluster stored in the feature vector buffer 137), the unreferenced time supplied from the unreferenced time calculation module 142, the amount of emotion supplied from the emotion control module 143, or the amount of memory consumed supplied from the memory usage calculation module 144. When the module determines that the value exceeds the threshold, it judges that the predetermined condition is satisfied and deletes all or some of the members. However, the method of deleting members (the utterance information of the members) is not limited to this.
- For example, a configuration may be adopted in which the feature vector erasing module 141 does not perform such a determination process itself and simply, when a trigger signal (such as the erasing instruction supplied from the input unit 107 described above) is input, determines that the predetermined condition is satisfied and deletes the predetermined utterance information.
- Alternatively, the emotion control module 143, the unreferenced time calculation module 142, and the memory usage calculation module 144 may each perform the above-described determination process and, if it is determined that the value of the corresponding parameter (emotion amount, unreferenced time, or total memory usage) exceeds a predetermined threshold, supply a predetermined trigger signal to the feature vector erasing module 141.
- The trigger signal supplied to the feature vector erasing module 141 is not limited to those described above; it may be a trigger signal generated under other conditions, for example any condition set later by a user or the like.
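The threshold checks and the external trigger discussed above can be combined into one predicate, sketched here. The `Limits` fields, the function name `erase_condition_met`, and the units are all hypothetical; the specification does not fix concrete values or a combined form.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    members: int         # first predetermined number of cluster members
    unreferenced: float  # seconds a cluster may go unreferenced
    emotion: float       # threshold for an emotion parameter such as anger
    memory: int          # bytes of total memory usage


def erase_condition_met(members, unreferenced, emotion, memory, limits,
                        external_trigger=False):
    """Return True if any monitored value exceeds its threshold,
    or if an external erase instruction (trigger signal) arrived."""
    return (
        external_trigger
        or members > limits.members
        or unreferenced > limits.unreferenced
        or emotion > limits.emotion
        or memory > limits.memory
    )
```

Whether the comparison runs inside the erasing module or inside each monitoring module, which then only sends a trigger, is a design choice the text leaves open; in the second case only `external_trigger` would be used.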
- When the feature vector erasing module 141 determines that the predetermined condition is satisfied, which of the utterance information of the members stored in the feature vector buffer 137 is to be erased can be selected (set) arbitrarily, as can the number of pieces of utterance information to be erased.
- The user or the manufacturer can set the utterance information to be deleted individually for each of the conditions described above.
- However, members of a cluster with few members, members at a large distance from the representative member, members of a cluster for which no new member has been registered for a long time, and the like are not considered to greatly affect speech recognition accuracy, so it is desirable to delete them preferentially.
- As described above, the various data related to a member that the feature vector erasing module 141 erases along with the utterance information stored in the feature vector buffer 137 also include the data in the score sheet stored in the score sheet buffer 139.
- That is, when the feature vector erasing module 141 erases the utterance information of a member stored in the feature vector buffer 137, it also deletes the various data in the score sheet that relate to the erased member.
- For example, suppose that the data (ID, phoneme sequence, feature vector sequence (utterance information), and recording time) of the entry (row) whose ID is 3 in FIG. 15 is erased by the feature vector erasing module 141. In that case, the feature vector erasing module 141 further deletes the data (ID, phoneme sequence, and cluster number) of the entry (row) whose ID is 3 in the score sheet of FIG.
- The clustering module 138 then re-selects (re-determines) the representative member of the cluster to which the deleted member belonged, that is, in the above example, the cluster to which the member whose ID is 3 belonged (the cluster whose cluster number is 1). If the representative member has changed (if a member other than the member whose ID is 1 is selected as the representative member), the configuration of all clusters may change, so re-clustering is performed for all unregistered words.
- the method of re-clustering is not particularly limited, but, for example, a k-means method can be adopted.
- The clustering module 138 executes the following processes (1) to (3), where it is assumed that N unregistered words are registered in the score sheet of the score sheet buffer 139 and that these unregistered words are to be divided into k clusters.
- Let k arbitrary ones of the N unregistered words be the initial cluster centers, and generate k clusters with these initial cluster centers as temporary representative members.
- The score can be obtained without performing an actual calculation, by referring to the score sheet.
- Alternatively, the score may actually be calculated in process (2) described above.
- In that case, the utterance information of the N unregistered words is required, and this utterance information is obtained by referring to the feature vector buffer 137.
- When a PCM signal (voice data) is stored in the feature vector buffer 137 as the utterance information instead of the feature vector sequence, the clustering module 138 calculates the score based on this PCM signal.
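A k-means-style re-clustering that works from a precomputed score sheet can be sketched as follows. Only the initialization step is spelled out in the text above, so the assignment and representative re-selection steps here are assumptions in the usual k-means (k-medoids-like) spirit: each word joins the representative it scores highest against, and each cluster's new representative is the member with the largest total score against its fellow members. The function name `recluster` is hypothetical, and the sketch assumes each word scores highest against itself so that no cluster becomes empty.

```python
def recluster(score, initial_centers, max_iter=100):
    """Cluster items 0..N-1 given a square score matrix (higher = more similar).

    Returns {representative_index: [member indices]}.
    """
    centers = list(initial_centers)
    clusters = {}
    for _ in range(max_iter):
        # Assign every word to the representative it scores highest against.
        clusters = {c: [] for c in centers}
        for i in range(len(score)):
            best = max(centers, key=lambda c: score[i][c])
            clusters[best].append(i)
        # Re-select each representative: the member with the largest
        # total score against the members of its own cluster.
        new_centers = [
            max(members, key=lambda m: sum(score[m][j] for j in members))
            if members else c
            for c, members in clusters.items()
        ]
        if new_centers == centers:
            break  # representatives are stable, so the clustering converged
        centers = new_centers
    return clusters
```

Because the scores are looked up rather than recomputed, this variant needs no utterance information at clustering time, which corresponds to the score-sheet option mentioned above.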
- The feature vector buffer 137 stores the data shown in FIG. 15, and the score sheet buffer 139 stores the score sheet shown in FIG.
- The utterance information is a feature vector sequence.
- In step S101, the feature vector erasing module 141 determines whether or not an instruction to erase an unregistered word has been issued.
- The feature vector erasing module 141 determines that erasure of an unregistered word has been instructed if, for example, any of the following conditions (1) to (5) is satisfied.
- When the number of members belonging to any one of the clusters registered in the score sheet of the score sheet buffer 139 exceeds a certain number
- If the feature vector erasing module 141 determines in step S101 that erasure of an unregistered word has been instructed, then in step S102 it executes the "unregistered word deletion process" for the designated unregistered word (hereinafter referred to as the unregistered word to be erased), and returns to step S101 to determine again whether deletion of an unregistered word has been instructed.
- In step S121, the feature vector erasing module 141 erases the data corresponding to the unregistered word to be erased from the data stored in the feature vector buffer 137.
- For example, if the unregistered word to be erased is the unregistered word whose ID is 3, the data of the entry (row) whose ID is 3 in the data shown in FIG. 15 (ID, phoneme sequence, feature vector sequence (utterance information), and recording time) are deleted.
- In step S122, the feature vector erasing module 141 corrects the score sheet in the score sheet buffer 139.
- step S122 the ID of the data in the score sheet of FIG.
- In step S123, the clustering module 138 re-selects (re-determines) the representative member of the cluster to which the unregistered word to be deleted belongs.
- In this example, the unregistered word to be deleted is the unregistered word whose ID is 3, so the representative member of the cluster whose cluster number is 1 in the score sheet of FIG. 8 (the cluster to which the unregistered word whose ID is 3 belongs) is re-selected by the method described above.
- In step S124, the clustering module 138 determines whether or not the representative member has been changed (whether or not the representative member re-selected in step S123 differs from the representative member immediately before that processing). If it determines that the representative member has not been changed, the process returns. That is, the process of step S102 in FIG. 16 is terminated, the process returns to step S101, and the subsequent processes are repeated. For example, if the member whose ID is 1 is re-selected as the representative member in step S123, it is determined that the representative member has not been changed, whereas if the re-selected representative member is a member with another ID, it is determined that the representative member has been changed.
- In step S125, re-clustering is performed on all unregistered words (in this example, all the unregistered words registered in the score sheet of FIG. 8 except the one whose ID is 3). That is, the clustering module 138 re-clusters all unregistered words by, for example, the k-means method described above.
- In step S126, the clustering module 138 determines whether the configuration of any cluster other than the cluster to which the unregistered word to be deleted belonged has been changed (for example, whether the members belonging to the cluster have changed, or whether the representative member of the cluster has been changed to another member). If it is determined that the configuration has not been changed, the process proceeds to step S128, where the maintenance module 140 updates the word dictionary in the dictionary buffer 134 based on the score sheet updated (corrected) in step S122, and the process returns.
- In this case, a new representative member of the cluster to which the unregistered word to be erased belonged has been re-selected (step S123), and the new representative member differs from the original one. The maintenance module 140 therefore refers to the score sheet and identifies the cluster for which a new representative member has been determined. It then registers the phoneme sequence of the new representative member as the phoneme sequence of the entry corresponding to that cluster in the word dictionary of the dictionary buffer 134.
- If the clustering module 138 determines in step S126 that the configuration of a cluster has been changed, then in step S127 the clustering module 138 and the feature vector erasing module 141 return the stored contents of the feature vector buffer 137 and the score sheet buffer 139 to the state before erasing (the state before the processing of step S121 was executed). That is, the clustering module 138 and the feature vector erasing module 141 execute an undo process that goes back to the state immediately before the unregistered word to be erased was erased, and the process returns.
- Steps S126 and S127 may be omitted. That is, the voice recognition device 91 may allow the clusters to change and need not execute the undo process.
- The voice recognition device 91 may also be configured so that whether or not to execute the processes of steps S126 and S127 can be selected from outside the voice recognition device 91 (by a user or the like).
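The erase-then-verify flow with its undo step can be sketched as below. This is a simplified illustration under stated assumptions: the buffer and the score sheet are reduced to two dictionaries, the re-clustering is a caller-supplied function, and the representative re-selection of steps S123 and S124 is omitted. All names (`groups`, `erase_with_undo`, `recluster`) are hypothetical.

```python
import copy


def groups(assignment):
    """Turn {member_id: cluster_label} into {cluster_label: set of ids}."""
    out = {}
    for member, label in assignment.items():
        out.setdefault(label, set()).add(member)
    return out


def erase_with_undo(buffer, assignment, target_id, recluster):
    """Erase one member; roll back if any other cluster's membership changes.

    `recluster` is a caller-supplied function (hypothetical) mapping a list
    of remaining ids to a new {member_id: cluster_label} assignment.
    Returns True if the erasure was kept, False if it was undone.
    """
    saved_buffer = copy.deepcopy(buffer)
    saved_assignment = dict(assignment)
    target_cluster = assignment[target_id]

    # Steps S121/S122: erase the entry and its score-sheet data.
    del buffer[target_id]
    del assignment[target_id]

    # Step S125: re-cluster the remaining unregistered words.
    new_assignment = recluster(sorted(buffer))

    # Step S126: were the clusters other than the target's left intact?
    before = {frozenset(m) for label, m in groups(assignment).items()
              if label != target_cluster}
    after = {frozenset(m) for m in groups(new_assignment).values()}
    if not before <= after:
        # Step S127: undo, restoring the state before step S121.
        buffer.clear()
        buffer.update(saved_buffer)
        assignment.clear()
        assignment.update(saved_assignment)
        return False

    assignment.clear()
    assignment.update(new_assignment)
    return True
```

Dropping the check and the rollback corresponds to the variant in which steps S126 and S127 are omitted and cluster changes are simply accepted.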
- After the processing of step S122, the processing of steps S123 and S124 is skipped, and the processing of steps S125 and S126 is performed in order. If it is determined in step S126 that the cluster configuration has not been changed, the process proceeds to step S128, where the maintenance module 140 updates the word dictionary in the dictionary buffer 134 based on the score sheet updated (corrected) in step S122, and the process returns.
- In this case, the cluster itself has been erased because all of its members were erased, so the maintenance module 140 identifies the deleted cluster by referring to the score sheet. The maintenance module 140 then deletes the entry corresponding to the deleted cluster from the word dictionary of the dictionary buffer 134.
- In step S102, the feature extraction module 131 determines whether or not a voice has been input.
- If it is determined in step S102 that no voice has been input, the process returns to step S101, and the subsequent processing is repeated.
- That is, the feature vector erasing module 141 always determines whether erasure of an unregistered word (erasure of the utterance information corresponding to the unregistered word stored in the feature vector buffer 137) has been instructed, and the feature extraction module 131 always determines whether a voice has been input, independently of the feature vector erasing module 141.
- The spoken voice is converted into digital voice data through the microphone of the input unit 107 and the AD conversion unit, and is supplied to the feature extraction module 131.
- In step S103, the feature extraction module 131 determines whether or not a voice has been input. If a voice has been input, the process proceeds to step S104, where feature vectors are extracted by acoustically analyzing the voice data in units of a predetermined frame, and the sequence of feature vectors is supplied to the matching module 132 and the unregistered word section processing module 136.
- The processing of steps S104 to S108 is the same as the processing of steps S1 to S5 in FIG. 9 described above, so its description is omitted.
- As described above, in the voice recognition device 91, the feature vector erasing module 141 deletes, from the data stored in the feature vector buffer 137, the utterance information (the feature vector sequence in the example of FIG. 15) of members judged to have little effect on clustering, together with the related data (ID, phoneme sequence, and recording time in the example of FIG. 15). The consumption of storage area can therefore be suppressed without impairing the function of automatically acquiring unregistered words.
- In addition, the feature vector erasing module 141 corrects the score sheet stored in the score sheet buffer 139 as data related to the member (deletes unnecessary data), which makes it possible to further reduce the consumption of storage area.
- Furthermore, since the maintenance module 140 updates the word dictionary based on the corrected score sheet, "memory loss" or "forgetting" can be realized in, for example, a robot, which can improve its entertainment value.
- Note that, in the example described above as well, the steps describing the program recorded on the recording medium include not only processes carried out in time series in the described order, but also processes that are executed in parallel or individually without necessarily being processed in time series.
- The form of each module in FIG. 14 is not limited as long as it fulfills its function. That is, a module may be configured by hardware or the like. In that case, the manufacturer or the like may connect these modules as shown in FIG. In other words, hardware corresponding to FIG. 14 may be used as the voice recognition unit instead of the voice recognition unit 50A in FIG.
- the speech recognition is performed by the HMM method.
- the present invention is also applicable to a case in which the speech recognition is performed by, for example, the DP matching method.
- In that case, the above-mentioned score corresponds to, for example, the reciprocal of the distance between the input speech and the standard pattern.
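The relationship between a DP matching distance and the score can be sketched with a basic dynamic time warping recurrence. This is a generic textbook formulation, not the matcher in the specification: the frames are reduced to single numbers for brevity, the local cost is an absolute difference, and the small constant `eps` is an added guard (an assumption) so that identical patterns do not cause a division by zero.

```python
def dp_distance(a, b):
    """Dynamic-programming (DTW) distance between two 1-D sequences."""
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed alignment moves.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]


def score(a, b, eps=1e-9):
    """Score as the reciprocal of the DP distance (larger = closer)."""
    return 1.0 / (dp_distance(a, b) + eps)
```

With real feature vectors the local cost would be a vector distance, but the reciprocal relationship between distance and score stays the same.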
- In the above, the unregistered words are clustered, and the unregistered words are registered in the word dictionary based on the clustering result.
- However, the present invention is also applicable to registered words, that is, words already registered in the word dictionary.
- The state recognition information output by the image recognition unit 50B and the pressure processing unit 50C is supplied to the voice recognition unit 50A, as shown by a dotted line in FIG. The state recognition information is then received by the maintenance unit 31 (FIG. 4) of the voice recognition unit 50A.
- The feature vector buffer 28, and consequently the score sheet storage unit 30, also stores the absolute time at which each unregistered word was input. By referring to the absolute time in the score sheet of the score sheet storage unit 30, the maintenance unit 31 recognizes the state recognition information supplied from the action determination mechanism unit 52 at the time the unregistered word was input as the heading of that unregistered word.
- The maintenance unit 31 then registers, in the entry of the word dictionary corresponding to the cluster of the unregistered word, the state recognition information as the heading together with the phoneme sequence of the representative member of the cluster.
- The matching unit 23 can then output the state recognition information serving as the heading of the unregistered word as the speech recognition result for an unregistered word registered in the word dictionary. Based on this state recognition information, the robot can take a predetermined action.
- The state recognition information "red" is supplied from the image recognition unit 50B to the voice recognition unit 50A through the action determination mechanism unit 52.
- The voice recognition unit 50A obtains the phoneme sequence of the unregistered word "red".
- In the voice recognition unit 50A, the phoneme sequence of the unregistered word "red" and the state recognition information "red" as its heading are added to the word dictionary as the entry for the unregistered word "red".
- This speech recognition result is supplied from the voice recognition unit 50A to the action determination mechanism unit 52.
- Based on the output of the image recognition unit 50B, the action determination mechanism unit 52 can make the robot search its surroundings for a red object and head toward it.
- That is, the robot cannot recognize the utterance "red" at first, but if the user utters "red" while the robot is imaging a red object, the robot associates the utterance "red" with the red object being imaged. After that, when the user utters "red", the robot recognizes the utterance and starts walking toward a red object in its surroundings. In this case, the robot can give the user the impression that it learns what the user says and is growing. The same applies to the voice recognition device 91 in FIG.
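The association between an unregistered word and the state recognition information observed when it was spoken can be sketched as a dictionary entry that carries a heading. The functions `register` and `recognize`, the entry layout, and the phoneme string "r e d" are hypothetical; exact string matching stands in for the real acoustic matching only to keep the example short.

```python
def register(word_dictionary, cluster_id, representative_phonemes, heading):
    """Store a cluster's representative phoneme sequence with its heading,
    i.e. the state recognition information seen when the word was spoken."""
    word_dictionary[cluster_id] = {
        "phonemes": representative_phonemes,
        "heading": heading,
    }


def recognize(word_dictionary, phonemes):
    """Return the heading of the matching entry, or None if nothing matches.

    Exact matching is a simplification made for this sketch.
    """
    for entry in word_dictionary.values():
        if entry["phonemes"] == phonemes:
            return entry["heading"]
    return None
```

The returned heading is what an action-selection component could act on, for example by looking for an object whose recognized state matches it.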
- the score is stored in the score sheet. However, the score may be recalculated as needed.
- the detected cluster is divided into two clusters.
- However, the detected cluster can also be divided into three or more clusters. It is also possible to divide it into an arbitrary number of clusters determined by the distance between clusters.
- The score sheet (FIG. 8) registers, in addition to the scores, the phoneme sequence of each unregistered word, the cluster number, the representative member ID, and the like. Information other than the scores need not be registered in the score sheet and can be managed separately from the scores.
- As described above, a cluster to which the input voice is to be added as a new member is detected from among the clusters obtained by clustering the voices already obtained. The input voice is then made a new member of the detected cluster, and the cluster is divided based on its members. The dictionary is then updated based on the division result. Therefore, for example, an unregistered word that is not registered in the dictionary can easily be registered in the dictionary while preventing the dictionary from becoming large.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
- Manipulator (AREA)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP02708744A EP1376536A1 (en) | 2001-03-30 | 2002-04-01 | Sound processing apparatus |
| US10/296,797 US7228276B2 (en) | 2001-03-30 | 2002-04-01 | Sound processing registering a word in a dictionary |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2001097843 | 2001-03-30 | ||
| JP2001-97843 | 2001-03-30 | ||
| JP2002-69603 | 2002-03-14 | ||
| JP2002069603A JP2002358095A (ja) | 2001-03-30 | 2002-03-14 | 音声処理装置および音声処理方法、並びにプログラムおよび記録媒体 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2002080141A1 (en) | 2002-10-10 |
Family
ID=26612647
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2002/003248 Ceased WO2002080141A1 (en) | 2001-03-30 | 2002-04-01 | Sound processing apparatus |
Country Status (6)
| Country | Link |
|---|---|
| US (1) | US7228276B2 |
| EP (1) | EP1376536A1 |
| JP (1) | JP2002358095A |
| KR (1) | KR20030007793A |
| CN (1) | CN1462428A |
| WO (1) | WO2002080141A1 |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7813928B2 (en) | 2004-06-10 | 2010-10-12 | Panasonic Corporation | Speech recognition device, speech recognition method, and program |
Families Citing this family (57)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20070265834A1 (en) * | 2001-09-06 | 2007-11-15 | Einat Melnick | In-context analysis |
| US7398209B2 (en) | 2002-06-03 | 2008-07-08 | Voicebox Technologies, Inc. | Systems and methods for responding to natural language speech utterance |
| US7693720B2 (en) | 2002-07-15 | 2010-04-06 | Voicebox Technologies, Inc. | Mobile systems and methods for responding to natural language speech utterance |
| JP4392581B2 (ja) * | 2003-02-20 | 2010-01-06 | ソニー株式会社 | 言語処理装置および言語処理方法、並びにプログラムおよび記録媒体 |
| US7110949B2 (en) * | 2004-09-13 | 2006-09-19 | At&T Knowledge Ventures, L.P. | System and method for analysis and adjustment of speech-enabled systems |
| US7634406B2 (en) * | 2004-12-10 | 2009-12-15 | Microsoft Corporation | System and method for identifying semantic intent from acoustic information |
| US7729478B1 (en) * | 2005-04-12 | 2010-06-01 | Avaya Inc. | Change speed of voicemail playback depending on context |
| US8438027B2 (en) * | 2005-05-27 | 2013-05-07 | Panasonic Corporation | Updating standard patterns of words in a voice recognition dictionary |
| US7640160B2 (en) | 2005-08-05 | 2009-12-29 | Voicebox Technologies, Inc. | Systems and methods for responding to natural language speech utterance |
| US7620549B2 (en) | 2005-08-10 | 2009-11-17 | Voicebox Technologies, Inc. | System and method of supporting adaptive misrecognition in conversational speech |
| US7949529B2 (en) | 2005-08-29 | 2011-05-24 | Voicebox Technologies, Inc. | Mobile systems and methods of supporting natural language human-machine interactions |
| WO2007027989A2 (en) | 2005-08-31 | 2007-03-08 | Voicebox Technologies, Inc. | Dynamic speech sharpening |
| KR100717385B1 (ko) * | 2006-02-09 | 2007-05-11 | 삼성전자주식회사 | 인식 후보의 사전적 거리를 이용한 인식 신뢰도 측정 방법및 인식 신뢰도 측정 시스템 |
| JP2007286356A (ja) * | 2006-04-17 | 2007-11-01 | Funai Electric Co Ltd | 電子機器 |
| WO2007138875A1 (ja) * | 2006-05-31 | 2007-12-06 | Nec Corporation | 音声認識用単語辞書・言語モデル作成システム、方法、プログラムおよび音声認識システム |
| JP4181590B2 (ja) * | 2006-08-30 | 2008-11-19 | 株式会社東芝 | インタフェース装置及びインタフェース処理方法 |
| US8073681B2 (en) | 2006-10-16 | 2011-12-06 | Voicebox Technologies, Inc. | System and method for a cooperative conversational voice user interface |
| US7818176B2 (en) | 2007-02-06 | 2010-10-19 | Voicebox Technologies, Inc. | System and method for selecting and presenting advertisements based on natural language processing of voice-based input |
| DE102007033472A1 (de) * | 2007-07-18 | 2009-01-29 | Siemens Ag | Verfahren zur Spracherkennung |
| JP5386692B2 (ja) * | 2007-08-31 | 2014-01-15 | 独立行政法人情報通信研究機構 | 対話型学習装置 |
| US8140335B2 (en) | 2007-12-11 | 2012-03-20 | Voicebox Technologies, Inc. | System and method for providing a natural language voice user interface in an integrated voice navigation services environment |
| JP2009157119A (ja) * | 2007-12-27 | 2009-07-16 | Univ Of Ryukyus | 音声単語自動獲得方法 |
| GB2471811B (en) | 2008-05-09 | 2012-05-16 | Fujitsu Ltd | Speech recognition dictionary creating support device,computer readable medium storing processing program, and processing method |
| US8589161B2 (en) | 2008-05-27 | 2013-11-19 | Voicebox Technologies, Inc. | System and method for an integrated, multi-modal, multi-device natural language voice services environment |
| US9305548B2 (en) | 2008-05-27 | 2016-04-05 | Voicebox Technologies Corporation | System and method for an integrated, multi-modal, multi-device natural language voice services environment |
| US8326637B2 (en) | 2009-02-20 | 2012-12-04 | Voicebox Technologies, Inc. | System and method for processing multi-modal device interactions in a natural language voice services environment |
| US8064290B2 (en) * | 2009-04-28 | 2011-11-22 | Luidia, Inc. | Digital transcription system utilizing small aperture acoustical sensors |
| US9171541B2 (en) | 2009-11-10 | 2015-10-27 | Voicebox Technologies Corporation | System and method for hybrid processing in a natural language voice services environment |
| WO2011059997A1 (en) | 2009-11-10 | 2011-05-19 | Voicebox Technologies, Inc. | System and method for providing a natural language content dedication service |
| US8645136B2 (en) * | 2010-07-20 | 2014-02-04 | Intellisist, Inc. | System and method for efficiently reducing transcription error using hybrid voice transcription |
| US9595260B2 (en) * | 2010-12-10 | 2017-03-14 | Panasonic Intellectual Property Corporation Of America | Modeling device and method for speaker recognition, and speaker recognition system |
| US9117444B2 (en) | 2012-05-29 | 2015-08-25 | Nuance Communications, Inc. | Methods and apparatus for performing transformation techniques for data clustering and/or classification |
| CN103219007A (zh) * | 2013-03-27 | 2013-07-24 | 谢东来 | Speech recognition method and device |
| US9697828B1 (en) * | 2014-06-20 | 2017-07-04 | Amazon Technologies, Inc. | Keyword detection modeling using contextual and environmental information |
| KR102246900B1 (ko) * | 2014-07-29 | 2021-04-30 | 삼성전자주식회사 | Electronic device and speech recognition method thereof |
| US9898459B2 (en) | 2014-09-16 | 2018-02-20 | Voicebox Technologies Corporation | Integration of domain information into state transitions of a finite state transducer for natural language processing |
| WO2016044290A1 (en) | 2014-09-16 | 2016-03-24 | Kennewick Michael R | Voice commerce |
| EP3207467A4 (en) | 2014-10-15 | 2018-05-23 | VoiceBox Technologies Corporation | System and method for providing follow-up responses to prior natural language inputs of a user |
| US10614799B2 (en) | 2014-11-26 | 2020-04-07 | Voicebox Technologies Corporation | System and method of providing intent predictions for an utterance prior to a system detection of an end of the utterance |
| US10431214B2 (en) | 2014-11-26 | 2019-10-01 | Voicebox Technologies Corporation | System and method of determining a domain and/or an action related to a natural language input |
| DE112014007287B4 (de) * | 2014-12-24 | 2019-10-31 | Mitsubishi Electric Corporation | Speech recognition device and speech recognition method |
| US10515150B2 (en) * | 2015-07-14 | 2019-12-24 | Genesys Telecommunications Laboratories, Inc. | Data driven speech enabled self-help systems and methods of operating thereof |
| US10455088B2 (en) | 2015-10-21 | 2019-10-22 | Genesys Telecommunications Laboratories, Inc. | Dialogue flow optimization and personalization |
| US10382623B2 (en) | 2015-10-21 | 2019-08-13 | Genesys Telecommunications Laboratories, Inc. | Data-driven dialogue enabled self-help systems |
| CN106935239A (zh) * | 2015-12-29 | 2017-07-07 | 阿里巴巴集团控股有限公司 | Method and device for constructing a pronunciation dictionary |
| US10331784B2 (en) | 2016-07-29 | 2019-06-25 | Voicebox Technologies Corporation | System and method of disambiguating natural language processing requests |
| US20180254054A1 (en) * | 2017-03-02 | 2018-09-06 | Otosense Inc. | Sound-recognition system based on a sound language and associated annotations |
| US20180268844A1 (en) * | 2017-03-14 | 2018-09-20 | Otosense Inc. | Syntactic system for sound recognition |
| JP6711343B2 (ja) * | 2017-12-05 | 2020-06-17 | カシオ計算機株式会社 | Speech processing device, speech processing method, and program |
| JP7000268B2 (ja) * | 2018-07-18 | 2022-01-19 | 株式会社東芝 | Information processing device, information processing method, and program |
| US11636673B2 (en) | 2018-10-31 | 2023-04-25 | Sony Interactive Entertainment Inc. | Scene annotation using machine learning |
| US10854109B2 (en) | 2018-10-31 | 2020-12-01 | Sony Interactive Entertainment Inc. | Color accommodation for on-demand accessibility |
| US10977872B2 (en) | 2018-10-31 | 2021-04-13 | Sony Interactive Entertainment Inc. | Graphical style modification for video games using machine learning |
| US11375293B2 (en) | 2018-10-31 | 2022-06-28 | Sony Interactive Entertainment Inc. | Textual annotation of acoustic effects |
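| KR20220094400A (ko) * | 2020-12-29 | 2022-07-06 | 현대자동차주식회사 | Dialogue system, vehicle having the same, and method for controlling the dialogue system |
| CN115171702B (zh) * | 2022-05-30 | 2024-09-24 | 青岛海尔科技有限公司 | Digital twin voiceprint feature processing method, storage medium, and electronic device |
| CN119495304B (zh) * | 2024-11-06 | 2025-12-12 | 深圳前海微众银行股份有限公司 | Speech recognition model fine-tuning method, electronic device, storage medium, and program product |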
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPS5745680A (en) * | 1980-08-30 | 1982-03-15 | Fujitsu Ltd | Pattern recognition device |
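| JPS6125199A (ja) * | 1984-07-14 | 1986-02-04 | 日本電気株式会社 | Speech recognition method |
| JP2002160185A (ja) * | 2000-03-31 | 2002-06-04 | Sony Corp | Robot device, behavior control method for robot device, external force detection device, and external force detection method |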
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6243680B1 (en) * | 1998-06-15 | 2001-06-05 | Nortel Networks Limited | Method and apparatus for obtaining a transcription of phrases through text and spoken utterances |
| KR100277694B1 (ko) * | 1998-11-11 | 2001-01-15 | 정선종 | Method for automatically generating a pronunciation dictionary in a speech recognition system |
- 2002
  - 2002-03-14 JP: application JP2002069603A, publication JP2002358095A, not active (Abandoned)
  - 2002-04-01 EP: application EP02708744A, publication EP1376536A1, not active (Withdrawn)
  - 2002-04-01 WO: application PCT/JP2002/003248, publication WO2002080141A1, not active (Ceased)
  - 2002-04-01 CN: application CN02801646A, publication CN1462428A, active (Pending)
  - 2002-04-01 US: application US10/296,797, publication US7228276B2, not active (Expired - Fee Related)
  - 2002-04-01 KR: application KR1020027016297A, publication KR20030007793A, not active (Withdrawn)
Non-Patent Citations (3)
| Title |
|---|
| IWAHASHI NAOTO, TAMURA MASANORI: "Chikaku joho kara no gainen kozo no chushutsu ni motoduki onsei nyuryoku ni yoru gengo kakutoku", INFORMATION PROCESSING SOCIETY OF JAPAN KENKYU HOKOKU (ONSEI GENGO JOHO SHORI), 28-1, vol. 99, no. 91, 29 October 1999 (1999-10-29), pages 1 - 8, XP002953834 * |
| NAKAMURA ATSUSHI: "Gijiteki gakushu deta o mochiita tango spotting yo gabeji model gakushu no", THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS KENKYU HOKOKU (ONSEI), SP95-107, vol. 95, no. 431, 15 December 1995 (1995-12-15), pages 99 - 104, XP002953835 * |
| RABINER LAWRENCE, JUANG BIING HWANG: "Fundamentals of speech recognition", PRENTICE HALL PTR, 1993, pages 267 - 274, XP002953836 * |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7813928B2 (en) | 2004-06-10 | 2010-10-12 | Panasonic Corporation | Speech recognition device, speech recognition method, and program |
Also Published As
| Publication number | Publication date |
|---|---|
| EP1376536A1 (en) | 2004-01-02 |
| CN1462428A (zh) | 2003-12-17 |
| US7228276B2 (en) | 2007-06-05 |
| US20040030552A1 (en) | 2004-02-12 |
| KR20030007793A (ko) | 2003-01-23 |
| JP2002358095A (ja) | 2002-12-13 |
Similar Documents
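| Publication | Publication Date | Title |
|---|---|---|
| WO2002080141A1 (en) | | Sound processing apparatus |
| JP4296714B2 (ja) | | Robot control device, robot control method, recording medium, and program |
| US7065490B1 (en) | | Voice processing method based on the emotion and instinct states of a robot |
| JP6550068B2 (ja) | | Pronunciation prediction in speech recognition |
| JP4510953B2 (ja) | | Non-interactive enrollment in speech recognition |
| JP2001188555A (ja) | | Information processing device and method, and recording medium |
| JP2002268699A (ja) | | Speech synthesis device, speech synthesis method, program, and recording medium |
| JP2001188553A (ja) | | Speech synthesis device and method, and recording medium |
| JP6031316B2 (ja) | | Speech recognition device, error correction model learning method, and program |
| JP2001154685A (ja) | | Speech recognition device, speech recognition method, and recording medium |
| JP2001188779A (ja) | | Information processing device and method, and recording medium |
| JP2002116792A (ja) | | Robot control device, robot control method, and recording medium |
| WO2002082423A1 (en) | | Word sequence output device |
| JP4600736B2 (ja) | | Robot control device and method, recording medium, and program |
| JP4587009B2 (ja) | | Robot control device, robot control method, and recording medium |
| JP2001154693A (ja) | | Robot control device, robot control method, and recording medium |
| JP2002268663A (ja) | | Speech synthesis device, speech synthesis method, program, and recording medium |
| JP2003271172A (ja) | | Speech synthesis method, speech synthesis device, program, recording medium, and robot device |
| JP4706893B2 (ja) | | Speech recognition device and method, program, and recording medium |
| JP2002307349A (ja) | | Robot device, information learning method, program, and recording medium |
| JP2002258886A (ja) | | Speech synthesis device, speech synthesis method, program, and recording medium |
| JP2004309523A (ja) | | Motion pattern sharing system for robot devices, motion pattern sharing method for robot devices, and robot device |
| JP2004170756A (ja) | | Robot control device and method, recording medium, and program |
| JP4639533B2 (ja) | | Speech recognition device, speech recognition method, program, and recording medium |
| JP4178777B2 (ja) | | Robot device, recording medium, and program |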
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AK | Designated states | Kind code of ref document: A1; Designated state(s): CN KR US |
| | AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR |
| | WWE | WIPO information: entry into national phase | Ref document number: 2002708744; Country of ref document: EP |
| | WWE | WIPO information: entry into national phase | Ref document number: 1020027016297; Country of ref document: KR |
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
| | WWE | WIPO information: entry into national phase | Ref document number: 028016467; Country of ref document: CN |
| | WWP | WIPO information: published in national office | Ref document number: 1020027016297; Country of ref document: KR |
| | WWE | WIPO information: entry into national phase | Ref document number: 10296797; Country of ref document: US |
| | WWP | WIPO information: published in national office | Ref document number: 2002708744; Country of ref document: EP |
| | WWW | WIPO information: withdrawn in national office | Ref document number: 2002708744; Country of ref document: EP |