CN1735924A - Standard model creating device and standard model creating method - Google Patents
- Publication number
- CN1735924A (application numbers CNA200380103867XA, CN200380103867A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The standard model creating apparatus provides a high-precision standard model used for: pattern recognition such as speech recognition, character recognition, or image recognition using a probability model based on a hidden Markov model, Bayesian theory, or linear discriminant analysis; intention interpretation using a probability model such as a Bayesian net; data mining performed using a probability model; and so forth. The apparatus comprises: a reference model preparing unit (102) operable to prepare at least one reference model; a reference model storing unit (103) operable to store the reference model (121) prepared by the reference model preparing unit (102); and a standard model creating unit (104) operable to create a standard model (122) by calculating statistics of the standard model so as to maximize or locally maximize the probability or likelihood with respect to the at least one reference model stored in the reference model storing unit (103).
Description
Technical Field
The present invention relates to a device and a method for creating a standard model used for: pattern recognition such as voice recognition, character recognition, and image recognition based on a probability model such as a hidden Markov model, Bayesian theory, or linear discriminant analysis; intention understanding (recognition of intention) based on a probability model such as a Bayesian network; data mining (recognition of data characteristics) based on a probability model; person detection, fingerprint authentication, face authentication, and iris authentication based on a probability model (determining whether an object is a specific object); prediction such as stock prediction and weather prediction based on a probability model (judgment after the situation is recognized); synthesis such as the synthesis of the voices of a plurality of speakers or the synthesis of a plurality of face images (creation of a synthesized model); and the like.
Background
In recent years, with the spread of the Internet and the like, the capacity of networks has increased and the cost of communication has decreased. Therefore, a large number of recognition models (reference models) can be collected by using a network. For example, in speech recognition, a large number of models for speech recognition (models for children, models for adults, models for elderly people, models for automobiles, models for mobile phones, and the like) distributed by various research institutions can be downloaded via the Internet. Further, a model for voice recognition used by a car navigation system or the like can be downloaded to a television, a computer, or the like by using connections between devices based on a network. Further, with respect to intention understanding, recognition models in which the experiences of various persons in various regions have been learned can be collected through a network.
With the development of recognition technology, recognition models have come to be used in a large number of devices with different specifications of CPU power, memory capacity, and the like, for example in remote controllers for personal computers and television sets, mobile phones, and car navigation systems. Further, recognition models are used in a variety of applications requiring different specifications, such as applications requiring high recognition accuracy, for example a security system, or applications requiring a short time until the recognition result is output, for example remote controller operation of a television.
In addition, recognition techniques can be utilized in many environments where recognition objects are different. For example, speech recognition is used in various environments such as recognition of voices of children, adults, and the elderly, recognition of voices in automobiles, and recognition of voices of mobile phones.
In view of these changes in the social environment, it is considered that, by effectively using a large number of recognition models (reference models), a high-precision recognition model (standard model) suited to the specifications of devices and applications and to the usage environment can be created in a short time and provided to users.
In the field of pattern recognition such as speech recognition, methods using a probabilistic model as the standard model for recognition have attracted attention in recent years; in particular, the hidden Markov model (hereinafter, HMM) and the mixture Gaussian distribution model (hereinafter, GMM) are widely used. For intention understanding, methods using a probabilistic model as the standard model representing intention, knowledge, taste, and the like have attracted attention in recent years; in particular, Bayesian networks and the like are widely used. In the field of data mining, methods using a probabilistic model as the representative model of each class for classifying data have recently attracted attention, and GMMs and the like are widely used. In the field of authentication such as speaker (voice) authentication, fingerprint authentication, face authentication, and iris authentication, methods using a probabilistic model as the standard model for authentication have attracted attention, and GMMs and the like are being applied. As a learning algorithm for a standard model expressed by an HMM, the Baum-Welch re-estimation method (see, for example, "Speech Recognition", pp. 150-152, published November 25, 1995) is widely used. As a learning algorithm for a standard model expressed by a GMM, the EM (Expectation-Maximization) algorithm (see, for example, "Speech Information Processing", pp. 100-104, published June 30, 1998) is widely used. In the EM algorithm, learning proceeds as follows.
(formula 1)
$$f(x) = \sum_{m=1}^{M_f} \omega_f(m)\, N\!\left(x;\ \mu_f(m), \sigma_f^2(m)\right)$$
where (formula 2) $N\!\left(x;\ \mu, \sigma^2\right)$ represents a Gaussian distribution, and, for input data (formula 3)
$$x = (x(1), x(2), \ldots, x(J)) \in R^J$$
of dimension J (≥ 1), the statistics are the mixture weight coefficients (formula 4)
$$\omega_f(m) \quad (m = 1, 2, \ldots, M_f),$$
the J (≥ 1)-dimensional mean values (formula 5)
$$\mu_f(m) = (\mu_f(m,1), \mu_f(m,2), \ldots, \mu_f(m,J)) \in R^J \quad (m = 1, 2, \ldots, M_f,\ j = 1, 2, \ldots, J),$$
and the J (≥ 1)-dimensional variance values (the J diagonal components of the covariance matrix) (formula 6)
$$\sigma_f^2(m, j) \quad (m = 1, 2, \ldots, M_f,\ j = 1, 2, \ldots, J).$$
Using the learning data (formula 7)
$$x[i] = (x(1)[i], x(2)[i], \ldots, x(J)[i]) \in R^J \quad (i = 1, 2, \ldots, N),$$
the likelihood for the learning data (formula 8)
$$\log P = \sum_{i=1}^{N} \log f(x[i])$$
is maximized or locally maximized by repeating, one or more times, the updates (formula 9)
$$\omega_f(m) \leftarrow \frac{1}{N} \sum_{i=1}^{N} \gamma(m \mid x[i]) \quad (m = 1, 2, \ldots, M_f),$$
(formula 10)
$$\mu_f(m, j) \leftarrow \frac{\sum_{i=1}^{N} \gamma(m \mid x[i])\, x(j)[i]}{\sum_{i=1}^{N} \gamma(m \mid x[i])} \quad (m = 1, 2, \ldots, M_f,\ j = 1, 2, \ldots, J),$$
(formula 11)
$$\sigma_f^2(m, j) \leftarrow \frac{\sum_{i=1}^{N} \gamma(m \mid x[i])\, \bigl(x(j)[i] - \mu_f(m, j)\bigr)^2}{\sum_{i=1}^{N} \gamma(m \mid x[i])} \quad (m = 1, 2, \ldots, M_f,\ j = 1, 2, \ldots, J),$$
where (formula 12)
$$\gamma(m \mid x[i]) = \frac{\omega_f(m)\, N\!\left(x[i];\ \mu_f(m), \sigma_f^2(m)\right)}{\sum_{m'=1}^{M_f} \omega_f(m')\, N\!\left(x[i];\ \mu_f(m'), \sigma_f^2(m')\right)}$$
is the posterior weight (responsibility) of mixture m for x[i]; the calculation is repeated and the learning is thereby carried out.
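For illustration only, a minimal sketch of these EM updates for a diagonal-covariance GMM is shown below (Python with NumPy); the function and variable names are chosen for this example and are not taken from the patent.

```python
import numpy as np

def gaussian_diag(x, mean, var):
    """Diagonal-covariance Gaussian density N(x; mean, var) for rows of x."""
    return np.exp(-0.5 * np.sum((x - mean) ** 2 / var, axis=-1)) / \
           np.sqrt(np.prod(2.0 * np.pi * var))

def em_step(data, weights, means, variances):
    """One EM update of a GMM (mixture weights, means, diagonal variances)."""
    N, J = data.shape
    M = weights.shape[0]
    # E-step: posterior weight (responsibility) of each mixture for each sample
    gamma = np.empty((N, M))
    for m in range(M):
        gamma[:, m] = weights[m] * gaussian_diag(data, means[m], variances[m])
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: re-estimate the statistics of the GMM
    nm = gamma.sum(axis=0)                        # effective sample count per mixture
    new_weights = nm / N
    new_means = (gamma.T @ data) / nm[:, None]
    new_vars = (gamma.T @ data ** 2) / nm[:, None] - new_means ** 2
    return new_weights, new_means, new_vars
```

Repeating such a step one or more times drives the likelihood log P toward a maximum or local maximum, as described above.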
Further, Bayesian estimation methods (for example, a Bayesian statistics text, pp. 42-53, University of Tokyo Press, April 30, 1985) and the like have also been proposed. Every such learning algorithm, whether the Baum-Welch re-estimation method, the EM algorithm, or Bayesian estimation, calculates the parameters (statistics) of the standard model so as to maximize or locally maximize the probability (likelihood) of the learning data, and thereby creates the standard model. In these learning methods, a mathematical optimization that maximizes or locally maximizes the probability (likelihood) is achieved.
When the above learning methods are used to create a standard model for speech recognition, it is desirable to learn the standard model from a large amount of speech data so as to cope with variations in the speech feature quantities of various speakers, noises, and the like. Likewise, when they are used for intention understanding, it is desirable to learn the standard model from a large amount of data so as to cope with variations among speakers, situations, and the like. In the case of iris authentication, it is desirable to learn the standard model from a large amount of iris image data in order to cope with variations in sunlight, camera position, rotation, and the like. However, when such a large amount of data is processed, learning takes much time, so a standard model cannot be provided to the user in a short time. In addition, the cost of storing a large amount of data becomes large, and when data are collected over a network, the communication cost also becomes large.
On the other hand, methods of creating a standard model by synthesizing a plurality of models (hereinafter, a model prepared to be referred to in order to create the standard model is called a "reference model") have been proposed. A reference model is a probability distribution model in which a large amount of learning data is expressed by the population parameters of a probability distribution (mean, variance, etc.); that is, the features of the learning data are condensed into a small number of parameters (population parameters). In the prior art described below, the models are expressed by Gaussian distributions.
In the conventional method 1, a reference model is expressed by a GMM, and a GMM of a plurality of reference models is synthesized by weighting to create a standard model (for example, a technique disclosed in japanese unexamined patent application publication No. 4-125599).
In the conventional method 2, in addition to the synthesis of the conventional method 1, the mixture weights of the linear combination are learned so as to maximize or locally maximize the probability (likelihood) of the learning data, thereby creating a standard model (for example, the technique disclosed in Japanese patent application laid-open No. 10-268893).
In the conventional method 3, the mean values of the standard model are expressed by a linear combination of the mean values of the reference models, and the linear combination coefficients are learned so as to maximize or locally maximize the probability (likelihood) of the input data, thereby creating the standard model. Here, speech data of a specific speaker is used as the learning data, and the standard model is used as a speaker-adapted model for speech recognition (for example, M.J.F. Gales, "Cluster Adaptive Training For Speech Recognition", ICSLP 1998, pp. 1783-1786).
In the conventional method 4, a reference model is expressed by a single gaussian distribution, and in the gaussian distribution in which a plurality of reference models are synthesized, gaussian distributions belonging to the same class are unified by clustering (clustering), thereby creating a standard model (for example, a technique disclosed in japanese unexamined patent publication No. 9-81178).
In the 5th conventional method, a plurality of reference models are expressed by mixture Gaussian distributions having the same number of mixtures, and each Gaussian distribution is assigned a serial number in one-to-one correspondence. A standard model is created by synthesizing the Gaussian distributions having the same serial numbers. The plurality of synthesized reference models are models created from speakers acoustically close to the user, and the created standard model is a speaker-adapted model (for example, "Unsupervised learning method of phoneme models using sufficient statistics and speaker distance", March 1, 2002, IEICE Transactions, Vol. J85-D-II, No. 3, pp. 382-389).
However, in the method of the conventional method 1, as the number of reference models to be synthesized increases, the number of mixtures of the standard model increases, so that the storage capacity and the amount of recognition processing required for the standard model become large, which is not practical. In addition, the number of mixtures of the standard model cannot be controlled in accordance with the specifications. These problems are considered to become more significant as the number of synthesized reference models increases.
In the method of the conventional method 2, likewise, as the number of reference models to be synthesized increases, the number of mixtures of the standard model increases, so that the storage capacity and the amount of recognition processing required for the standard model become large, which is not practical. In addition, the number of mixtures of the standard model cannot be controlled in accordance with the specifications. Further, since the standard model is a simple mixture of the reference models and the learned parameters are limited to the mixture weights, a highly accurate standard model cannot be created. In addition, when the standard model is created, learning is performed using a large amount of learning data, so learning takes time. These problems are considered to become more significant as the number of synthesized reference models increases.
In the 3 rd conventional method, since the learned parameter is limited to the linear combination coefficient of the average value of the reference model, a standard model with high accuracy cannot be created. In addition, in the case of creating the standard model, since learning is performed using a plurality of learning data, it takes a learning time.
In the 4 th conventional method, clustering is performed in a search, and therefore, it is difficult to create a standard model with high accuracy. In addition, since the reference model has a single gaussian distribution, the accuracy is low, and the accuracy of the standard model in which the reference model is unified is low. The problem relating to the recognition accuracy is considered to be significant with an increase in the number of reference models to be synthesized.
In the 5th conventional method, a standard model is created by synthesizing the Gaussian distributions having the same serial numbers; however, the Gaussian distributions that should be synthesized to obtain an optimal standard model are generally not in one-to-one correspondence, so the recognition accuracy is low. In addition, when the plurality of reference models have different numbers of mixtures, a standard model cannot be created. Moreover, serial numbers are not generally assigned to the Gaussian distributions of a reference model, and in such a case a standard model cannot be created either. In addition, the number of mixtures of the standard model cannot be controlled in accordance with the specifications.
Disclosure of Invention
The present invention has been made in view of the above problems, and an object thereof is to provide a standard model creation device and the like for creating a high-precision standard model used for: pattern recognition such as voice recognition, character recognition, and image recognition based on a probabilistic model such as a hidden Markov model, Bayesian theory, or linear discriminant analysis; intention understanding (recognition of intention) based on a probabilistic model such as a Bayesian network; data mining (recognition of data characteristics) based on a probabilistic model; and prediction based on a probabilistic model, such as stock prediction and weather prediction (judgment after the state is recognized).
It is another object of the present invention to provide a standard model creation device or the like that can easily create a standard model without requiring learning data or teacher data.
It is another object of the present invention to provide a standard model creation apparatus with high versatility and flexibility, which creates a standard model adapted to the object to be recognized using the standard model, or a standard model adapted to the specifications or environment of the apparatus that performs recognition processing using the standard model.
The term "recognition" used in the present invention means not only recognition in the narrow sense such as speech recognition, but all processing that uses a standard model expressed by probabilities, such as pattern matching, recognition, authentication, Bayesian estimation, and prediction.
In order to achieve the above object, a standard model creation device according to the present invention is a device for creating a standard model defined by a set of events and the output probabilities of the events or of transitions between events, the standard model creation device including: a reference model storage unit that stores one or more reference models, which are models created in advance for recognizing specific objects; and a standard model creation unit that creates a standard model by calculating statistics of the standard model so as to maximize or locally maximize the probability or likelihood of the standard model with respect to the one or more reference models stored in the reference model storage unit.
For example, when applied to speech recognition, a standard model creation device creates a standard model for speech recognition that represents speech features having a specific attribute, using a probability model in which the frequency of appearance of the speech features is expressed by output probabilities. The device includes: a reference model storage unit for storing one or more reference models, which are probability models representing speech features having predetermined attributes; and a standard model creation unit that creates a standard model by calculating statistics of the standard model using the statistics of the one or more reference models stored in the reference model storage unit, the standard model creation unit having a standard model structure determination unit that determines the structure of the standard model to be created, an initial standard model creation unit that determines initial values of the statistics of the standard model having the determined structure, and a statistic estimation unit that estimates the statistics of the standard model so as to maximize or locally maximize the probability or likelihood, with respect to the reference models, of the standard model having the initial values.
Thus, the standard model is created by calculating its statistics so that the probability or likelihood of the standard model with respect to the one or more reference models is maximized or locally maximized. The standard model can therefore be created easily, without learning data such as speech data or teacher data, and a high-precision standard model that collectively covers the plurality of prepared reference models can be created.
Preferably, the standard model creation device further includes a reference model preparation unit that performs at least one of acquiring a reference model from outside and storing it in the reference model storage unit, and creating a reference model and storing it in the reference model storage unit. For example, when applied to speech recognition, such a standard model creation device, which creates a standard model for speech recognition representing speech features having a specific attribute using a probability model in which the frequency of appearance of the speech features is expressed by output probabilities, includes: a reference model storage unit for storing one or more reference models, which are probability models representing speech features having predetermined attributes; a reference model preparation unit that performs at least one of acquiring a reference model from outside and storing it in the reference model storage unit, and creating a new reference model and storing it in the reference model storage unit; and a standard model creation unit that creates a standard model by preparing initial values of the statistics of a standard model having a predetermined structure and then calculating the statistics of the standard model using the statistics of the reference models, so as to maximize or locally maximize the probability or likelihood of the standard model with respect to the one or more reference models stored in the reference model storage unit.
Thus, a new reference model can be acquired from the outside of the standard model creation device, and the standard model can be created from the acquired reference model, so that a highly versatile standard model creation device can be realized that can cope with various recognition targets.
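As a rough structural sketch of the units described above (reference model storage, reference model preparation, and standard model creation), the following Python outline may help; all class and method names are illustrative assumptions and do not appear in the patent, and the estimation details are described in the embodiments below.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReferenceModel:
    """A probability model (e.g. an HMM/GMM) prepared in advance for one attribute."""
    attribute: str      # e.g. "child", "adult:male", "elderly:female"
    statistics: dict    # mixture weights, mean values, variance values per state

@dataclass
class ReferenceModelStorage:
    """Holds the reference models that the standard model is computed against."""
    models: List[ReferenceModel] = field(default_factory=list)

    def store(self, model: ReferenceModel) -> None:
        self.models.append(model)

class StandardModelCreator:
    """Creates a standard model whose probability or likelihood with respect to
    the stored reference models is maximized or locally maximized."""
    def __init__(self, storage: ReferenceModelStorage):
        self.storage = storage

    def create(self, num_mixtures: int) -> dict:
        structure = self.determine_structure(num_mixtures)  # standard model structure determination
        initial = self.create_initial_model(structure)      # initial values of the statistics
        return self.estimate_statistics(initial)            # iterative statistic estimation

    # The three steps below are placeholders; their contents correspond to the
    # structure determination, initial model creation, and statistic estimation
    # units described in the embodiments.
    def determine_structure(self, num_mixtures: int) -> dict: ...
    def create_initial_model(self, structure: dict) -> dict: ...
    def estimate_statistics(self, initial: dict) -> dict: ...
```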
The standard model creation device may further include a use information creation unit that creates utilization information, which is information related to the recognition target, and a reference model selection unit that selects one or more reference models from the reference models stored in the reference model storage unit based on the created utilization information, wherein the standard model creation unit calculates the statistics of the standard model so as to maximize or locally maximize the probability or likelihood of the standard model with respect to the reference models selected by the reference model selection unit.
In this way, based on utilization information such as the characteristics of the user, the user's age, sex, and usage environment, only the reference models suited to the recognition target are selected from the plurality of prepared reference models, and a standard model unifying these reference models is created.
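As an illustration of this selection step, the following hypothetical sketch filters stored reference models by utilization information (age group, gender, usage environment). The attribute scheme and field names are assumptions made for this example, not the patent's.

```python
def select_reference_models(stored_models, usage_info):
    """Select the reference models whose attributes match the utilization information.

    stored_models: list of dicts such as
        {"attributes": {"age": "child", "gender": "male"}, "statistics": ...}
    usage_info: dict such as {"age": "child", "environment": "car"}
    A key that a model does not declare is not used to reject that model.
    """
    selected = []
    for model in stored_models:
        attrs = model["attributes"]
        if all(attrs.get(key) == value
               for key, value in usage_info.items() if key in attrs):
            selected.append(model)
    return selected
```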
Here, the standard model creation device further includes a similarity degree determination unit that calculates a similarity degree between the usage information and information related to the selected reference model, determines whether or not the similarity degree is equal to or greater than a predetermined threshold, and creates a determination signal.
Thus, when a reference model close to the utilization information does not exist in the reference model storage unit, a request to prepare such a reference model can be issued.
A standard model creation device connected to a terminal device via a communication path may further include a use information receiving unit that receives, from the terminal device, utilization information as information related to the recognition target, and a reference model selection unit that selects one or more reference models from the reference models stored in the reference model storage unit based on the received utilization information, wherein the standard model creation unit calculates the statistics of the standard model so as to maximize or locally maximize the probability or likelihood of the standard model with respect to the reference models selected by the reference model selection unit.
Thus, the standard model is created based on the use information transmittable via the communication path, so that the standard model can be created by remote control and the identification system based on the communication system can be constructed.
The standard model creation device may further include a specification information creation unit that creates specification information, which is information relating to the specifications of the standard model to be created, and the standard model creation unit calculates the statistics of the standard model, based on the specification information created by the specification information creation unit, so as to maximize or locally maximize the probability or likelihood of the standard model with respect to the reference models.
Thus, the standard model is created based on the specification information such as the CPU power, the memory capacity, the required recognition accuracy, and the required recognition processing time of the device using the standard model, so that the standard model satisfying the specific specification conditions can be created, and the standard model suitable for the resource environment required for the recognition processing such as the calculation engine can be created.
Here, the specification information may be, for example, information indicating the specification corresponding to the type of application that uses the standard model. The standard model creation device may further include a specification information holding unit that holds an application-specification correspondence database indicating the correspondence between applications using the standard model and specifications of the standard model, wherein the standard model structure determination unit reads, as the specification information, the specification corresponding to the application to be started from the application-specification correspondence database held in the specification information holding unit, and the statistics of the standard model are calculated based on the read specification so as to maximize or locally maximize the probability or likelihood of the standard model with respect to the reference models.
Thus, the standard model is created according to the specification corresponding to each application program, so that the standard model optimal for each application program can be created, and the recognition accuracy of a recognition system or the like using the standard model can be improved.
The standard model creation device may further include a specification information receiving unit that receives, from a terminal device, specification information as information relating to the specifications of the standard model to be created, and the standard model creation unit calculates the statistics of the standard model, based on the specification information received by the specification information receiving unit, so as to maximize or locally maximize the probability or likelihood of the standard model with respect to the reference models.
Thus, the standard model is created based on the specification information that can be transmitted via the communication path, so that the standard model can be created by remote control, and the construction of the identification system based on the communication system can be realized.
For example, the reference model and the standard model may be expressed by 1 or more gaussian distributions, and the standard model creating unit may determine the number of mixed distributions (the number of gaussian distributions) of the standard model based on the specification information.
Thus, the number of mixture distributions of the Gaussian distributions included in the created standard model is dynamically determined, and the structure of the standard model can be controlled in accordance with the environment in which the recognition processing is executed, the required specifications, and the like. For example, when the CPU power of the recognition device using the standard model is small, when the storage capacity is small, or when the required recognition processing time is short, the number of mixture distributions of the standard model can be set small so as to meet the specification; on the other hand, when the required recognition accuracy is high, the number of mixture distributions can be set larger to improve the recognition accuracy.
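To make this concrete, a purely illustrative sketch of such a specification-driven choice follows; the field names and thresholds are assumptions made for this example and are not taken from the patent.

```python
def decide_num_mixtures(spec):
    """Choose the number of Gaussian mixture distributions of the standard model
    from specification information. All field names and thresholds are illustrative."""
    if spec.get("required_accuracy") == "high":
        return 64                                   # favor accuracy: many mixtures
    if spec.get("memory_mb", 0) < 8 or spec.get("cpu_mips", 0) < 50:
        return 3                                    # weak device: few mixtures
    if spec.get("max_response_ms", 1000) < 100:
        return 10                                   # short recognition time required
    return 20                                       # default

# Example: a device with little memory and low CPU power
print(decide_num_mixtures({"memory_mb": 4, "cpu_mips": 30}))   # -> 3
```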
In addition, when the standard model is created using the utilization information or the specification information, the reference model preparation unit is not necessarily required. This is because, for example, the standard model creation device can be shipped from the factory with reference models stored in it in advance, either in response to or regardless of a request from the user, and the standard model can then be created using the utilization information or the specification information.
The reference model storage unit may store at least one pair of reference models having mutually different numbers of mixture distributions (numbers of Gaussian distributions), and the standard model creation unit calculates the statistics of the standard model so as to maximize or locally maximize the probability or likelihood of the standard model with respect to the at least one pair of reference models having different numbers of mixture distributions (numbers of Gaussian distributions).
Thus, the standard model is created from the reference models having different numbers of mixed distributions, so that the standard model can be created from the reference models having various structures prepared in advance, and the creation of a high-precision standard model more suitable for the recognition target can be realized.
The standard model creation device further includes a standard model storage unit that stores the standard model created by the standard model creation unit.
Thus, the standard model temporarily buffered can be immediately output in response to the transmission request, and can function as a data server provided to another device.
The standard model creation device is connected to a terminal device via a communication path, and the standard model creation device further includes a standard model transmission unit that transmits the standard model created by the standard model creation unit to the terminal device.
In this way, the created standard model is transmitted to an external device provided at a spatially separated location, and therefore, the standard model creation device can be used as a standard model creation engine by itself or as a server in a communication system.
The standard model creation device may be connected to a terminal device via a communication path and further include a reference model receiving unit that receives a reference model transmitted from the terminal device, and the standard model creation unit calculates the statistics of the standard model so as to maximize or locally maximize the probability or likelihood of the standard model with respect to at least the reference model received by the reference model receiving unit.
Thus, a reference model held by the terminal device at hand and reflecting its usage environment can be transmitted via the communication path, and a standard model can be created using the transmitted reference model, so that a high-precision standard model better suited to the recognition target can be created. For example, when a reference model A used by a user A in an environment A is held in the terminal device and the user A wants to use it in an environment B, a high-precision standard model reflecting the characteristics of the user A can be created by creating the standard model using the reference model A.
The reference model preparation unit further performs at least one of updating and adding of the reference model stored in the reference model storage unit. For example, the standard model creation device may be connected to a terminal device via a communication path, and the standard model creation device may further include a reference model receiving unit configured to receive a reference model transmitted from the terminal device, and the reference model preparation unit may be configured to perform at least one of updating and adding of the reference model stored in the reference model storage unit, using the reference model received by the reference model receiving unit.
Thus, since addition, updating, and the like of the prepared reference models are executed, it is possible to add models for various recognition targets as reference models, to replace them with higher-accuracy reference models, and to perform feedback-based learning, such as re-creating the standard model from the updated reference models or using a created standard model as a reference model.
The standard model creation unit may be configured to include: a standard model structure determination unit that determines the structure of the standard model to be created; an initial standard model creation unit that determines initial values of the statistics of the standard model having the determined structure; and a statistic estimation unit that estimates and calculates the statistics of the standard model so as to maximize or locally maximize the probability or likelihood of the standard model with respect to the reference models. In this case, the initial standard model creation unit may determine the initial values of the statistics that specify the standard model using the one or more reference models that the statistic estimation unit uses to calculate the statistics of the standard model. For example, the initial standard model creation unit may determine the initial values based on a classification ID for identifying the type of standard model. Specifically, the initial standard model creation unit may hold a correspondence table indicating the correspondence among the classification IDs, the initial values, and the reference models, and determine the initial values from the correspondence table.
Thus, by providing a classification ID for each type of recognition target that uses the standard model, an initial standard model having properties in common with the finally required standard model can be used, and therefore a highly accurate standard model can be created.
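A minimal sketch of such a correspondence table as a simple lookup is shown below; the classification IDs, keys, and entries are invented for illustration and do not reproduce the tables of the embodiments (Fig. 52 and following).

```python
# Hypothetical correspondence table: classification ID -> (initial standard model, reference models).
CORRESPONDENCE_TABLE = {
    "speech:adult": ("initial_adult_8mix", ["adult_male", "adult_female"]),
    "speech:child": ("initial_child_8mix", ["child_male", "child_female"]),
}

def lookup_initial_model(classification_id, table=CORRESPONDENCE_TABLE):
    """Return the initial standard model and the reference models to be used for
    statistic estimation, given a classification ID."""
    initial_model, reference_models = table[classification_id]
    return initial_model, reference_models
```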
As described above, the present invention can provide a highly accurate standard model used for: voice recognition, character recognition, image recognition, and other pattern recognition based on a probability model such as a hidden Markov model, Bayesian theory, or linear discriminant analysis; intention understanding (recognition of intention) based on a probability model such as a Bayesian network; data mining (recognition of data characteristics) based on a probability model; person detection, fingerprint authentication, face authentication, and iris authentication based on a probability model (determining whether an object is a specific object); prediction based on a probability model, such as stock prediction and weather prediction (judgment after the situation is recognized); and the like, and therefore has very high practical value.
The present invention can be realized not only as such a standard model creation apparatus, but also as a standard model creation method having the characteristic units included in the standard model creation apparatus as its steps, and as a program causing a computer to execute those steps. Such a program can, of course, be distributed via a recording medium such as a CD-ROM or via a transmission medium such as the Internet.
Drawings
Fig. 1 is a block diagram showing the overall server configuration of a standard modeling apparatus according to embodiment 1 of the present invention.
Fig. 2 is a flowchart showing the operation procedure of the server.
Fig. 3 is a diagram showing an example of the reference model stored in the reference model storage unit of fig. 1.
Fig. 4 is a flowchart showing the detailed procedure of step S101 (creation of the standard model) in fig. 2.
Fig. 5 is a diagram illustrating the approximation calculation performed by the 1 st approximation unit 104e in fig. 1.
Fig. 6 is a diagram showing an example of screen display when the reference model is selected.
Fig. 7(a) is a diagram showing an example of screen display when a standard model structure (number of mixed distributions) to be created is designated, and fig. 7(b) is a diagram showing an example of screen display when specification information is selected.
Fig. 8 is a diagram showing an example of screen display showing a progress status when a standard model is created.
Fig. 9 is a block diagram showing the entire configuration of an STB as a standard modeling apparatus according to embodiment 2 of the present invention.
Fig. 10 is a flowchart showing the operation procedure of the STB.
Fig. 11 is a diagram showing an example of the reference model stored in the reference model storage unit of fig. 10.
Fig. 12 is a diagram for explaining approximation calculation performed by the 2 nd approximation unit in fig. 10.
Fig. 13 is a block diagram showing the overall configuration of a PDA according to the standard modeling apparatus of embodiment 3 of the present invention.
Fig. 14 is a flowchart showing the operation procedure of the PDA.
Fig. 15 is a diagram showing an example of the reference model stored in the reference model storage unit of fig. 13.
Fig. 16 shows an example of the selection screen of the PDA.
Fig. 17 is a schematic diagram showing a statistic estimation procedure of the statistic estimation unit in fig. 13.
Fig. 18 is a diagram for explaining approximation calculation performed by the 3 rd approximation unit in fig. 13.
Fig. 19 is a block diagram showing the entire server configuration of the standard modeling apparatus according to embodiment 4 of the present invention.
Fig. 20 is a flowchart showing the operation procedure of the server.
Fig. 21 is a diagram showing an example of a reference model and a standard model for explaining the operation procedure of the server.
Fig. 22 is a diagram showing an example of screen display when personal information as usage information is input.
Fig. 23 is a block diagram showing the entire server configuration of the standard modeling apparatus according to embodiment 5 of the present invention.
Fig. 24 is a flowchart showing the operation procedure of the server.
Fig. 25 is a diagram showing an example of a reference model and a standard model for explaining the operation procedure of the server.
Fig. 26 is a block diagram showing the entire server configuration of the standard modeling apparatus according to embodiment 6 of the present invention.
Fig. 27 is a flowchart showing the operation procedure of the server.
Fig. 28 is a diagram showing an example of a reference model and a standard model for explaining the operation procedure of the server.
Fig. 29 is a block diagram showing the entire server configuration of the standard modeling apparatus according to embodiment 7 of the present invention.
Fig. 30 is a flowchart showing the operation procedure of the server.
Fig. 31 is a diagram showing an example of a reference model and a standard model for explaining the operation procedure of the server.
Fig. 32 is a block diagram showing the entire configuration of a standard model creation apparatus according to embodiment 8 of the present invention.
Fig. 33 is a flowchart showing the operation procedure of mobile phone 901.
Fig. 34 is a diagram showing an example of the reference model stored in the reference model storage unit.
Fig. 35 is a diagram showing an example of the reference model newly stored in the reference model storage unit.
Fig. 36 is a diagram showing an example of screen display in creating the usage information.
Fig. 37 is a diagram showing an example of screen display when preparing a reference model.
Fig. 38 is a graph showing the results of a recognition experiment using the standard model created using the 3 rd approximation unit.
Fig. 39 is a graph showing the result of an experiment for identifying a standard model created by the 2 nd approximation unit according to embodiment 3.
Fig. 40 is a block diagram showing the entire configuration of a standard model creation apparatus according to embodiment 9 of the present invention.
Fig. 41 is a diagram showing an example of data in the application program and specification information association database.
Fig. 42 is a flowchart showing the operation procedure of PDA 1001.
Fig. 43 is a diagram showing an example of the reference model stored in the reference model storage unit.
Fig. 44 is a flowchart showing a method of determining an initial value of a cluster by the initial standard model creation unit.
Fig. 45 is a diagram showing a specific example of step S1004 in fig. 44.
Fig. 46 is a diagram showing a specific example of step S1005 in fig. 44.
Fig. 47 is a diagram showing a specific example of step S1006 in fig. 44.
Fig. 48 is a diagram showing a specific example of step S1008 in fig. 44.
Fig. 49 is a block diagram showing the entire server configuration of the standard modeling apparatus according to embodiment 10 of the present invention.
Fig. 50 is a flowchart showing the operation procedure of the server.
Fig. 51 is a diagram showing an example of a system to which the standard modeling apparatus of the present invention is specifically applied.
Fig. 52 is a diagram showing an example of the classification ID, the initial standard model, and the reference model correspondence table.
Fig. 53 is a diagram showing examples of the reference models 8AA to AZ in the classification ID, initial standard model, and reference model correspondence table in fig. 52.
Fig. 54 is a diagram showing examples of the reference models 64ZA to ZZ in the classification ID, initial standard model, and reference model correspondence table in fig. 52.
Fig. 55 is a diagram showing examples of the initial standard models 8A to 64Z in the classification ID, initial standard model, and reference model correspondence table in fig. 52.
Fig. 56 is a flowchart showing a method of creating the classification ID, initial standard model, and reference model correspondence table.
Fig. 57 is a diagram showing a specific example of step S1100 in fig. 56.
Fig. 58 is a diagram showing a specific example of step S1102 in fig. 56.
Fig. 59 is a diagram showing a specific example of step S1103 in fig. 56.
Fig. 60 is a diagram showing a specific example of step S1104 in fig. 56.
Fig. 61 is a diagram showing a procedure for completing the classification ID, initial standard model, and reference model correspondence table through communication between a terminal and a server.
Fig. 62 is a flowchart showing an initial standard model determination method using the classification ID, the initial standard model, and the reference model correspondence table.
Fig. 63 is a diagram showing a specific example of step S1105 in fig. 62.
Fig. 64 is a graph showing the results of a recognition experiment using the standard model created using the 3 rd approximation unit.
Fig. 65(a) - (j) are diagrams showing examples of the relationship between the attribute of the speech recognition target and the standard model structure (the number of mixture of gaussian distributions).
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the drawings. In the figures, the same or corresponding portions are denoted by the same reference numerals, and description thereof will not be repeated.
(embodiment 1)
Fig. 1 is a block diagram showing the overall configuration of a standard model creation device according to embodiment 1 of the present invention. Here, an example in which the standard modeling apparatus according to the present invention is incorporated in the server 101 of the computer system is shown. In the present embodiment, a case where a standard model for speech recognition representing a speech feature having a specific attribute is created will be described as an example.
The server 101 is a computer device in a communication system or the like, and is a standard model creation device that creates a standard model for speech recognition defined by a hidden markov model expressed by an output probability of a set of events and an event or a transition between events, and includes a reading unit 111, a reference model preparation unit 102, a reference model storage unit 103, a standard model creation unit 104, and a writing unit 112.
The reading unit 111 reads a child reference model, an adult reference model, and an elderly reference model written in a memory such as a CD-ROM. The reference model preparation unit 102 transmits the read reference model 121 to the reference model storage unit 103. The reference model storage unit 103 stores 3 reference models 121. Here, the reference model is a model created in advance to be referred to when creating the standard model (here, a model for speech recognition, that is, a probabilistic model indicating speech features having predetermined attributes).
The standard model creation unit 104 is a processing unit that creates a standard model 122 whose probability or likelihood with respect to the 3 (Ng = 3) reference models 121 stored in the reference model storage unit 103 is maximized or locally maximized. The standard model creation unit 104 includes: a standard model structure determination unit 104a that determines the structure of the standard model (such as the number of Gaussian mixture distributions); an initial standard model creation unit 104b that creates an initial standard model by determining initial values of the statistics used to calculate the standard model; a statistic storage unit 104c that stores the determined initial standard model; and a statistic estimation unit 104d that, by performing approximation calculation using the 1st approximation unit 104e and the like on the initial standard model stored in the statistic storage unit 104c, calculates the statistics that maximize or locally maximize the probability or likelihood with respect to the 3 (Ng = 3) reference models 121 stored in the reference model storage unit 103 (thereby generating the final standard model). The statistics are the parameters that specify the standard model; here they are the mixture weight coefficients, the mean values, and the variance values.
The writing unit 112 writes the standard model 122 created by the standard model creating unit 104 into a memory device such as a CD-ROM.
Next, the operation of the server 101 configured as described above will be described.
Fig. 2 is a flowchart showing the operation procedure of the server 101.
First, before a standard model is created, a reference model to be a reference of the standard model is prepared (step S100). That is, the reading unit 111 reads the child reference model, the adult reference model, and the elderly reference model written in the memory such as a CD-ROM, the reference model preparation unit 102 transmits the read reference models 121 to the reference model storage unit 103, and the reference model storage unit 103 stores 3 reference models 121.
The reference model 121 is composed of HMMs of the respective phonemes. Fig. 3 shows an example of the reference model 121, with schematic diagrams of the reference model for children and the reference model for adults (the schematic diagram of the reference model for elderly people is omitted in the figure). All 3 reference models have 3 states, and the output distribution of the HMM in each state is a mixture Gaussian distribution with 3 mixture distributions. As the feature values, 12-dimensional (J = 12) cepstrum coefficients are used.
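For illustration, the reference models just described (one 3-state HMM per phoneme, each state having a 3-mixture diagonal-Gaussian output distribution over 12-dimensional cepstrum coefficients) could be held in containers such as the following; the type names and layout are assumptions made for this sketch.

```python
import numpy as np
from dataclasses import dataclass
from typing import List

J = 12          # feature dimension (cepstrum coefficients)
MIXTURES = 3    # Gaussian mixture distributions per state
STATES = 3      # HMM states per phoneme

@dataclass
class GaussianMixtureState:
    weights: np.ndarray     # shape (MIXTURES,), sums to 1
    means: np.ndarray       # shape (MIXTURES, J)
    variances: np.ndarray   # shape (MIXTURES, J), diagonal covariances

@dataclass
class PhonemeHMM:
    phoneme: str
    transitions: np.ndarray              # shape (STATES, STATES), rows sum to 1
    states: List[GaussianMixtureState]   # one output distribution per state

# A reference model (e.g. "for children") is then a collection of such HMMs, one per phoneme.
```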
Next, the standard model creation unit 104 creates the standard model 122 so as to maximize or locally maximize the probability or likelihood with respect to the 3 reference models 121 stored in the reference model storage unit 103 (step S101).
Finally, the writing unit 112 writes the standard model 122 created by the standard model creating unit 104 into a memory device such as a CD-ROM (step S102). The standard model written in a storage device such as a CD-ROM is used as a standard model for speech recognition in consideration of children, adults, and the elderly.
Fig. 4 is a flowchart showing the detailed procedure of step S101 (creation of the standard model) in fig. 2.
First, the standard model structure determination unit 104a determines the structure of the standard model (step S102a). Here, as the structure of the standard model, an HMM is configured for each phoneme, with 3 states, and the number of mixtures of the output distribution in each state is determined to be 3 (Mf = 3).
Next, the initial standard model creation unit 104b specifies a statistic initial value for calculating the standard model (step S102 b). Here, a model obtained by integrating 3 reference models stored in the reference model storage unit 103 into one gaussian distribution using statistical processing calculation is set as a statistic initial value, and the initial value is stored as an initial standard model in the statistic storage unit 104 c.
Specifically, the initial standard model creation unit 104b generates, for each of the 3 states i (i = 1, 2, 3), the output distribution shown in expression 13 below. Here, Mf (the number of Gaussian mixture distributions) in the expression is 3.

(formula 13)
$$f(x) = \sum_{m=1}^{M_f} \omega_f(m)\, N\!\left(x;\ \mu_f(m), \sigma_f^2(m)\right)$$

where (formula 14) $N\!\left(x;\ \mu, \sigma^2\right)$ represents a Gaussian distribution,

(formula 15) $x = (x(1), x(2), \ldots, x(J)) \in R^J$ represents the 12-dimensional (J = 12) LPC cepstrum coefficients,

(formula 16) $\omega_f(m) \ (m = 1, 2, \ldots, M_f)$ represents the mixture weight coefficient of each Gaussian distribution,

(formula 17) $\mu_f(m) = (\mu_f(m,1), \mu_f(m,2), \ldots, \mu_f(m,J)) \in R^J \ (m = 1, 2, \ldots, M_f)$ represents the mean value of each Gaussian distribution, and

(formula 18) $\sigma_f^2(m, j) \ (m = 1, 2, \ldots, M_f,\ j = 1, 2, \ldots, J)$ represents the variance value of each Gaussian distribution.
The statistic amount estimating unit 104d estimates the statistic amount of the standard model stored in the statistic amount storage unit 104c, using the 3 reference models 121 stored in the reference model storage unit 103 (step S102 c).
Specifically, the statistics of the standard model (the mixture weight coefficients shown in expression 16, the mean values shown in expression 17, and the variance values shown in expression 18 above) are estimated so that the probability or likelihood (the likelihood log P shown in expression 25 below) of the standard model with respect to the output distribution, shown in expression 19 below, of each state (i = 1, 2, 3) of the 3 (Ng = 3) reference models 121 is maximized or locally maximized.

(formula 19)

where (formula 20) represents a Gaussian distribution,

(formula 21) $L_g(i) \ (i = 1, 2, \ldots, N_g)$ represents the number of mixture distributions of each reference model (here, 3),

(formula 22) $\upsilon_g(i, l) \ (l = 1, 2, \ldots, L_g(i))$ represents the mixture weight coefficient of each Gaussian distribution,

(formula 23) $\mu_g(i, l) \ (l = 1, 2, \ldots, L_g(i))$ represents the mean value of each Gaussian distribution, and

(formula 24) represents the variance value of each Gaussian distribution. The likelihood of the standard model with respect to these output distributions is the likelihood log P shown in expression 25:

(formula 25)
Further, a mixture weight coefficient, an average value, and a variance value of the standard model are calculated from the following expressions 26, 27, and 28, respectively.
(formula 26)
(m=1,2,...,Mf)
(formula 27)
(m=1,2,...,Mf,j=1,2,...,J)
(formula 28)
(m=1,2,...,Mf,j=1,2,...,J)
At this time, the 1st approximation unit 104e of the statistic estimation unit 104d uses the approximation formula shown in expression 29 below.

(formula 29)
(m = 1, 2, ..., Mf)

Here, (formula 30) denotes a single Gaussian distribution having, as its weight, (formula 31) $u_h(m) \ (m = 1, 2, \ldots, M_f)$, as its mean value, (formula 32) $\mu_h(m) = (\mu_h(m,1), \mu_h(m,2), \ldots, \mu_h(m,J)) \in R^J$, and as its variance value, (formula 33).

The 1st approximation unit 104e calculates the weight (expression 31), the mean value (expression 32), and the variance value (expression 33) of the single Gaussian distribution shown in expression 30 from the equations shown in expressions 34, 35, and 36, respectively.

(formula 34)

(formula 35)
(m = 1, 2, ..., Mf, j = 1, 2, ..., J)

(formula 36)
(m = 1, 2, ..., Mf, j = 1, 2, ..., J)
Fig. 5 is a diagram illustrating the approximation calculation by the 1 st approximation unit 104 e. As shown in the figure, the 1 st approximation unit 104e determines a single gaussian distribution (equation 30) in the approximation equation shown in equation 29 above, using all mixed gaussian distributions constituting the standard model.
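The exact forms of expressions 34 to 36 are not reproduced in this text. As a generic, assumed illustration of how a set of weighted Gaussian components can be summarized by a single Gaussian (a weight, a mean value, and a variance value), a standard moment-matching computation is sketched below; it is not presented as the patent's own formula.

```python
import numpy as np

def collapse_to_single_gaussian(weights, means, variances):
    """Summarize a diagonal-covariance Gaussian mixture by a single Gaussian
    using moment matching (an assumed illustration, not expressions 34-36)."""
    w = weights / weights.sum()                            # normalized component weights
    mean_h = (w[:, None] * means).sum(axis=0)              # matched mean
    second_moment = (w[:, None] * (variances + means ** 2)).sum(axis=0)
    var_h = second_moment - mean_h ** 2                    # matched (diagonal) variance
    return weights.sum(), mean_h, var_h                    # total weight, mean, variance
```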
When the above approximation formulae of the 1 st approximation unit 104e are taken into consideration, the calculation formula of the statistic estimation unit 104d is as follows. That is, the statistic estimation unit 104d calculates the mixing weight coefficient, the average value, and the variance value from the following expressions 37, 38, and 39, and stores them in the statistic storage unit 104 c. Thereafter, the estimation of the statistic and the storage into the statistic storage unit 104c are repeated R (≧ 1) times. As a result, the obtained statistic is output as the statistic of the standard model 122 to be finally generated.
(formula 37)
(formula 38)
(formula 39)
Further, the state transition probabilities are normalized so that the sum of the state transition probabilities corresponding to the HMM of the reference model 121 becomes 1.
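The outer loop of this step can be pictured as follows; R, the storage dictionary, the initial arrays and the helper merge_reference_gaussians sketched above are illustrative assumptions, not the patent's identifiers.

```python
# Repeat the estimation and the storage into the statistic storage unit R times;
# the statistics obtained last define the generated standard model 122.
R = 3                                                   # R >= 1
statistic_storage = {}
stats = (init_weights, init_means, init_vars)           # initial standard model
for _ in range(R):
    stats = merge_reference_gaussians(ref_weights, ref_means, ref_vars,
                                      *stats, n_iter=1)
    statistic_storage["standard_model"] = stats         # statistic storage unit 104c
final_standard_model = statistic_storage["standard_model"]
```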
Next, a specific example of applying the present embodiment to speech recognition in a computer will be described. Here, a specific method of using a standard model will be mainly described, with a computer (PC) used as the server 101 and a CD-ROM drive device used as the reading unit 111.
First, the user inserts a CD-ROM storing a plurality of acoustic models serving as reference models into the CD-ROM drive device (reading unit 111) of the PC (server 101). This CD-ROM stores, for example, individual acoustic models for 'infant', 'child: male', 'child: female', 'adult: male', 'adult: female', 'elderly: male', and 'elderly: female'.
Next, as shown in the screen display examples of fig. 6(a) and (b), the user selects, on a display connected to the PC (server 101), the acoustic models matching the family members (the persons who will use speech recognition). Fig. 6 shows how the acoustic models stored in the CD-ROM are displayed in the box labeled 'CD-ROM' and how the acoustic models selected from them are copied into the box labeled 'user'. Here, it is assumed that the user's family consists of 3 persons, a 10-year-old son, a 50-year-old father, and a 40-year-old mother, and that the user (the father) drags the 3 models 'child: male', 'adult: male', and 'adult: female' into the box labeled 'user'. By this operation, the reference model preparation section 102 prepares the reference models: the 3 reference models are read by the reading unit 111 and stored in the reference model storage unit 103 via the reference model preparation unit 102.
Next, as shown in the screen display example of fig. 7(a), the user specifies the structure (the number of mixed distributions) of the standard model to be created. In fig. 7(a), '3', '10', and '20' are shown as candidates for the 'mixed distribution number', and the user selects the desired number from them. By this operation, the structure of the standard model to be created is specified through the standard model structure specifying unit 104a.
The method of determining the number of mixed distributions is not limited to such direct specification, and the number of mixed distributions may be determined based on specification information selected by the user, as shown in the screen display example shown in fig. 7(b), for example. Fig. 7(b) shows a state in which the target device for performing speech recognition using the standard model is selected from 3 types of "use devices", i.e., "television set", "car navigation system", and "mobile phone set". At this time, for example, the number of mixed distributions is determined to be 3 in the case of selecting 'television set', 20 in the case of selecting 'car navigation system', and 10 in the case of selecting 'mobile phone set', based on the correspondence table stored in advance.
Further, the number of mixed distributions may be determined by having the user choose among options reflecting recognition speed and recognition accuracy, that is, 'early recognition', 'normal', and 'high-accuracy recognition', and using the value associated with the selected item (for example, 20 in the case of 'high-accuracy recognition') as the number of mixed distributions. A lookup of this kind is sketched below.
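Whichever screen is used, the selection ultimately maps to a mixture count; a small lookup table of the kind sketched below would suffice. The device values follow the example above, while the 'early recognition' and 'normal' values are assumptions made for illustration.

```python
# Illustrative mapping from the user's selection to the number of mixed distributions.
MIXTURES_BY_DEVICE = {"television set": 3, "car navigation system": 20, "mobile phone": 10}
MIXTURES_BY_MODE = {"early recognition": 3, "normal": 10, "high-accuracy recognition": 20}

def decide_mixture_count(selection, default=10):
    """Return the mixture count for a device or recognition-mode selection."""
    if selection in MIXTURES_BY_DEVICE:
        return MIXTURES_BY_DEVICE[selection]
    return MIXTURES_BY_MODE.get(selection, default)
```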
When such input operations are completed, the initial standard model creation unit 104b creates an initial standard model, and the statistic estimation unit 104d then performs the repeated calculation (learning) to create the standard model. At this time, as shown in the screen display example of fig. 8, the progress of learning is displayed via the standard model structure determination unit 104a. The user can thus see the progress and the expected end of learning and wait for the standard model to be completed. Examples of the progress display include the bar display of the degree of learning shown in fig. 8(a), the display of the number of learning iterations shown in fig. 8(b), and a display based on a likelihood criterion. Further, a generic face image may be displayed while learning has not progressed, and the progress display may change to the user's face image or the like as learning nears its end; similarly, a child's face may be displayed while learning has not progressed, and a fairy or the like may be displayed as learning nears its end.
When the creation of the standard model is completed in this manner, the created standard model is recorded in the memory card (writing unit 112) by the standard model creation unit 104. The user pulls out the memory card from the PC (the writing unit 112 of the server 101) and inserts the memory card into a memory card slot of a user device, for example, a television. Thus, the created standard model is moved from the PC (server 101) to the utilization device (television). The television performs voice recognition for a user (here, family member using the television) using a standard model recorded in the attached memory card. For example, by recognizing a voice input to a microphone attached to a television, an instruction for operating the television is determined, and the instruction (for example, switching of a channel, program search of an EPG, or the like) is executed. In this way, a television operation by voice using the standard model created by the standard model creation device of the present embodiment is realized.
As described above, according to embodiment 1 of the present invention, the standard model is created by calculating statistics of the standard model that maximize, or locally maximize, the probability or likelihood with respect to the reference models prepared in advance; therefore, the standard model can be created easily without learning data or teacher data, and a high-precision standard model can be created by comprehensively taking the plurality of prepared reference models into account.
The standard model 122 is not limited to the HMM for each phoneme, and may be configured by a context-dependent HMM.
The standard model creation unit 104 may create the model only for the output probabilities of events in some of the states of some of the phonemes.
The HMM constituting the standard model 122 may be constituted by a different number of states for each phoneme, or may be constituted by a mixture gaussian distribution having a different number of distributions for each state.
The reference model 121 may be composed of different numbers of states or a mixture of different numbers of gaussian distributions for a child reference model, an adult reference model, and an elderly reference model.
In addition, speech recognition may also be performed in the server 101 using the standard model 122.
Instead of reading the reference model 121 from a memory such as a CD-ROM or a DVD-RAM, the server 101 may create the reference model 121 from the voice data.
The reference model preparation unit 102 may add and update a new reference model read from a memory device such as a CD-ROM or a DVD-RAM to the reference model storage unit 103 as necessary. That is, the reference model preparation unit 102 may update the reference model by replacing the reference model with the new reference model when the reference model for the same recognition target is stored in the reference model storage unit 103, or may delete the unnecessary reference model stored in the reference model storage unit 103, in addition to storing the new reference model in the reference model storage unit 103.
Further, the reference model preparation unit 102 may add and update a new reference model to the reference model storage unit 103 via the communication path, if necessary.
In addition, after the standard model is made, the learning can be carried out by utilizing the voice data.
The standard model structure determination unit may determine the HMM structure, such as monophone, triphone, or state-sharing type, and the number of states.
(embodiment 2)
Fig. 9 is a block diagram showing the entire configuration of the standard model creation apparatus according to embodiment 2 of the present invention. Here, an example in which the standard modeling apparatus of the present invention is incorporated in a set-top box 201 (hereinafter referred to as STB) is shown. In the present embodiment, a case where a standard model for speech recognition (speaker adaptive model) is created will be described as an example. Specifically, a case will be described as an example where EPG search, program switching, recording reservation, and the like of a television are executed by the voice recognition function of the STB.
The STB201 is a digital broadcasting receiver that recognizes a user's speech and performs automatic switching of TV programs, and includes a microphone 211, a speech data storage unit 212, a reference model preparation unit 202, a reference model storage unit 203, a used information creation unit 204, a reference model selection unit 205, a standard model creation unit 206, and a speech recognition unit 213 as a standard model creation device that creates a speech recognition standard model defined by a set of events and an output probability of an event or transition between events.
The voice data collected by the microphone 211 is stored in the voice data storage 212. The reference model preparation unit 202 creates a reference model 221 for each speaker using the speech data stored in the speech data storage unit 212, and stores the reference model in the reference model storage unit 203.
The usage information creating unit 204 collects the user's voice as the usage information 224 with the microphone 211. Here, the usage information is information related to the object (person or thing) to be recognized (recognition in a narrow sense, discrimination, authentication, and the like), and here it is the voice of the user who is the target of speech recognition. Based on the usage information 224 created by the usage information creating unit 204, the reference model selecting unit 205 selects, from the reference models 221 stored in the reference model storage unit 203, the reference models 223 that are acoustically close to the user's voice indicated by the usage information 224.
The standard model creation unit 206 is a processing unit that creates the standard model 222 so as to maximize, or locally maximize, the probability or likelihood with respect to the reference models 223 of the speakers selected by the reference model selection unit 205, and includes: a standard model structure determination unit 206a that determines the structure of the standard model (the number of mixture distributions of the Gaussian distributions, etc.); an initial standard model creation unit 206b that creates an initial standard model by determining initial values of the statistics for calculating the standard model; a statistic storage unit 206c that stores the determined initial standard model; and a statistic estimation unit 206d that, for the initial standard model stored in the statistic storage unit 206c, calculates the statistics that maximize, or locally maximize, the probability or likelihood with respect to the reference models 223 selected by the reference model selection unit 205, by an approximation calculation or the like using the general approximation unit 206e, thereby generating the final standard model.
The speech recognition unit 213 recognizes the speech of the user using the standard model 222 created by the standard model creation unit 206.
Next, the operation of the STB201 configured as described above will be described.
Fig. 10 is a flowchart showing the operation procedure of the STB 201.
First, before a standard model is created, reference models serving as the basis of the standard model are prepared (step S200). That is, the microphone 211 collects voice data from speakers A through Z and stores it in the voice data storage unit 212. For example, a plurality of microphones installed indoors, a microphone built into the television remote controller, a telephone set, and the like are connected to the voice data storage unit 212 of the STB201, and voice data input from these microphones or the telephone is stored in the voice data storage unit 212. For example, the voices of a brother, a sister, a father, a mother, a grandfather, a neighbor, and a friend are stored.
The reference model preparation unit 202 creates a reference model 221 for each speaker from the speech data stored in the speech data storage unit 212 by using the Baum-Welch re-estimation method. This processing is performed before a request to create a standard model is made.
The reference model storage unit 203 stores the reference models 221 created by the reference model preparation unit 202. Each reference model 221 is constituted by HMMs for each phoneme. Fig. 11 shows an example of the reference models 221. Here, the number of states of all the reference models of speakers A through Z is 3, and the output distribution of each HMM is a mixture Gaussian distribution with 5 mixture distributions per state. As the feature amount, 25-dimensional (J = 25) cepstrum coefficients are used.
Here, creation of a standard model is requested. For example, the user requests creation of a standard model by pressing a 'user confirmation' button. The 'user confirmation' button may be displayed and selected on the television screen, or a 'user confirmation' switch may be provided on the television remote controller. The button may be pressed, for example, when the television is turned on, when a command operation using speech recognition is executed, or whenever adaptation of the standard model to the user is considered necessary.
Next, the usage information creation unit 204 collects the user's voice as the usage information 224 through the microphone 211 (step S201). For example, when creation of a standard model is requested, 'input your name' is displayed on the screen. The user inputs his or her name (the user's voice) through a microphone built into the television remote controller. This voice of the user is the usage information. The input voice is not limited to a name; for example, a message such as "please say 'fit'" may be displayed, and the user utters 'fit'.
The reference model selecting unit 205 selects reference models 223 that are acoustically close to the user's voice from the reference models 221 stored in the reference model storage unit 203 (step S202). Specifically, the user's voice is input to the reference models of speakers A through Z, and the reference models of the 10 speakers (Ng = 10) giving the highest likelihood for the uttered word are selected. A sketch of such a likelihood-based selection is shown below.
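The selection in step S202 can be sketched as a simple top-Ng search over per-speaker likelihoods; the dictionary of models and the score method (assumed to return the log-likelihood of the feature sequence under a speaker's HMMs, e.g. by forward or Viterbi scoring) are assumptions made for illustration.

```python
def select_reference_models(user_features, reference_models, ng=10):
    """Pick the ng speakers whose reference models give the highest likelihood
    for the user's utterance (the acoustically closest speakers)."""
    scored = sorted(((model.score(user_features), name)
                     for name, model in reference_models.items()), reverse=True)
    return [name for _, name in scored[:ng]]
```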
The standard model creation unit 206 creates the standard model 222 so as to maximize, or locally maximize, the probability or likelihood with respect to the 10 reference models 223 selected by the reference model selection unit 205 (step S203). In this case, as in embodiment 1, the progress status of learning may be displayed. The user can then judge the progress and the expected end of learning, and can wait for the standard model with peace of mind. A progress-status non-display unit may also be provided to hide the progress of learning; with this function, the screen can be used effectively, and hiding the display for users accustomed to the system avoids annoyance.
Finally, the speech recognition unit 213 receives the user's speech transmitted from the microphone 211 as input, and performs speech recognition using the standard model 222 created by the standard model creation unit 206 (S204). For example, acoustic analysis of the speech uttered by the user yields 25-dimensional cepstral coefficients, which are input to the standard model 222 of each phoneme to determine a phoneme sequence with high likelihood. The phoneme sequence is then compared with the program names in electronic program data received in advance, and when a program name with a likelihood equal to or higher than a predetermined value is found, automatic program switching control for switching to that program is executed.
Next, the detailed procedure of step S203 (creation of the standard model) in fig. 10 will be described. The flow of steps is the same as the flow chart shown in fig. 4. However, the structure of the standard model used, the specific approximation calculation, and the like are different.
First, the standard model configuration determining unit 206a determines the configuration of the standard model (step S102a of fig. 4). Here, as the structure of the standard model, an HMM for each phoneme is used, and the number of mixture distributions of the output distribution of each state is determined to be 16 (Mf = 16).
Next, the initial standard model creation unit 206b determines initial values of the statistics for calculating the standard model (step S102b in fig. 4). Here, a model obtained by integrating the 10 reference models 223 selected by the reference model selection unit 205 into one Gaussian distribution by statistical calculation is used as the initial values of the statistics and is stored as the initial standard model in the statistic storage unit 206c. In other words, a high-precision standard model (speaker-adapted model) with 16 mixture distributions (16 mixtures) is created from reference models with 5 mixture distributions learned for each speaker.
Specifically, the initial standard model creation unit 206b generates the output distribution shown by the above formula 13 for each of the 3 states (i = 1, 2, 3).
However, in the present embodiment, in the output distribution shown in the above equation 13, (formula 40) x=(x(1),x(2),...,x(J)) ∈ R^J represents a 25-dimensional (J = 25) cepstral coefficient vector.
Then, the statistic estimation unit 206d estimates the statistic of the standard model stored in the statistic storage unit 206c using the 10 reference models 223 selected by the reference model selection unit 205 (step S102c in fig. 4).
That is, the statistics of the standard model (the mixture weight coefficients shown in the above equation 16, the mean values shown in the above equation 17, and the variance values shown in the above equation 18) are estimated so that the probability or likelihood (here, the likelihood logP shown in the above equation 25) of the output distributions, shown in the above equation 19, in each state (i = 1, 2, 3) of the 10 (Ng = 10) reference models 223 is maximized or locally maximized.
However, in the present embodiment, in the output distribution shown in the above equation 19, (formula 41) Lg(i) (i=1,2,...,Ng) is 5 (the number of mixture distributions of each reference model).
Specifically, the mixture weight coefficient, the average value, and the variance value of the standard model are calculated from equations 26, 27, and 28, respectively.
In this case, the general approximation unit 206e of the statistic estimation unit 206d uses the approximation formula shown in the above formula 29.
Here, unlike embodiment 1, the general approximation unit 206e selects, from the output distributions (formula 42) appearing in the denominator of the approximation formula of expression 29 above, the 3 (ph(m) = 3) output distributions (formula 44) that are closest in distance to the output distribution (formula 43) represented by the numerator of expression 29, and, using the selected 3 output distributions, calculates the weight (equation 31), the mean value (equation 32), and the variance value (equation 33) of the single Gaussian distribution shown in equation 30 according to equations 45, 46, and 47, respectively.
(formula 45)
(formula 46)
(formula 47)
(m=1,2,...,Mf,j=1,2,...,J)
Fig. 12 is a diagram illustrating the approximation calculation by the general approximation unit 206e. As shown in the figure, the general approximation unit 206e determines the single Gaussian distribution (expression 30) in the approximate expression shown in expression 29 above, using only the ph(m) Gaussian distributions close to the Gaussian distribution being calculated, out of the Mf Gaussian distributions constituting the standard model. Therefore, the amount of computation in the approximation is reduced compared to embodiment 1, which uses all Mf mixture Gaussian distributions. A sketch restricted to the nearest components follows.
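The difference from embodiment 1 can be pictured by restricting the moment matching to the ph(m) nearest components. The sketch below reuses the illustrative collapse_to_single_gaussian helper from embodiment 1 and measures closeness by the Euclidean distance between means, which is an assumption rather than the patent's exact equations 45 to 47.

```python
import numpy as np

def collapse_nearest(weights, means, variances, m, ph=3):
    """Single-Gaussian approximation for component m using only the ph
    components of the standard model closest to it."""
    d = np.linalg.norm(means - means[m], axis=1)      # distance between means
    idx = np.argsort(d)[:ph]                          # ph(m) nearest components
    return collapse_to_single_gaussian(weights[idx], means[idx], variances[idx])
```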
When the above approximation formulas of the general approximation unit 206e are taken into account, the calculation formulas of the statistic estimation unit 206d are as follows. That is, the statistic estimation unit 206d calculates the mixture weight coefficients, the mean values, and the variance values from the following expressions 48, 49, and 50, and stores them in the statistic storage unit 206c. Thereafter, the estimation of the statistics and their storage in the statistic storage unit 206c are repeated R (≧ 1) times, and the statistics finally obtained are output as the statistics of the generated standard model 222. In the iterative calculation, the number ph(m) of output distributions selected in the above approximation is reduced as the iterations proceed, and the final iteration is calculated with ph(m) = 1.
(formula 48)
(formula 49)
(formula 50)
Further, the state transition probabilities are normalized so that the sum of the state transition probabilities corresponding to the HMM of the reference model 223 becomes 1.
As described above, according to embodiment 2 of the present invention, the standard model is created by calculating statistics of the standard model that maximize, or locally maximize, the probability or likelihood with respect to the plurality of reference models selected based on the usage information, so that a suitable, high-precision standard model can be provided according to the situation of use.
The timing of creating the standard model is not limited to the instruction of the user in the present embodiment, and the standard model may be created at another timing. For example, the STB201 is further provided with a user change determination unit that automatically determines whether or not a user has changed. The user change determination unit determines whether or not the user is changed, that is, whether or not the current user and the previously recognized user are the same person, using the recognition voice input to the television remote controller. When it is determined that the user has changed, the speech sound is used as the use information to create the standard model. Thus, the user can unconsciously perform speech recognition using a standard model suitable for the user.
The standard model 222 is not limited to the HMM for each phoneme, and may be configured by a context-dependent HMM.
The standard model creation unit 206 may create the model only for the output probabilities of events in some of the states of some of the phonemes.
The HMM constituting the standard model 222 may be constituted by a different number of states for each phoneme, or may be constituted by a mixture gaussian distribution having a different number of distributions for each state.
The HMM for each speaker may be composed of different numbers of states or a mixture gaussian distribution of different numbers of mixtures.
The reference model 221 is not limited to the HMM for each speaker, and may be created for each speaker, noise, and tone.
The standard model 222 may be recorded in a memory such as a CD-ROM, hard disk, or DVD-RAM.
Alternatively, the reference model 221 may be read from a memory such as a CD-ROM or DVD-RAM instead of being created.
The reference model selecting unit 205 may change the number of reference models selected for each user based on the usage information 224.
The reference model preparation unit 202 may create a new reference model as necessary, add the new reference model to the reference model storage unit 203, update the new reference model, and delete the unnecessary reference model stored in the reference model storage unit 203.
The reference model preparation unit 202 may add and update a new reference model to the reference model storage unit 203 via the communication path as necessary.
The number ph (m) of output distributions selected in the approximation calculation may be different depending on the output distributions of the target events and the standard model, or may be determined depending on the inter-distribution distance.
In addition, after the standard model is made, the learning can be performed by using the voice data.
The standard model structure determination unit may determine the HMM structure, such as monophone, triphone, or state-sharing type, the number of states, and the like.
The number of mixed distributions may be set to a predetermined value at the time of shipment from the factory, or may be determined in accordance with specifications such as the CPU power of the device that uses the standard model or the specifications of the application program to be started.
(embodiment 3)
Fig. 13 is a block diagram showing the overall configuration of a standard model creation device according to embodiment 3 of the present invention. Here, an example in which the standard modeling apparatus of the present invention is incorporated in a PDA (personal digital Assistant)301 is shown. In the present embodiment, a case where a standard model for noise recognition (noise model) is created will be described as an example.
The PDA301 is a portable information terminal, and includes a reading unit 311, a reference model preparation unit 302, a reference model storage unit 303, a use information creation unit 304, a reference model selection unit 305, a standard model creation unit 306, a specification information creation unit 307, a microphone 312, and a noise recognition unit 313 as a standard model creation device for creating a standard model for noise recognition defined by the output probability of an event.
The reading unit 311 reads noise reference patterns such as a car a reference pattern, a car B reference pattern, a bus a reference pattern, a light rain reference pattern, and a heavy rain reference pattern, which are written in a memory device such as a CD-ROM. The reference model preparation unit 302 transmits the read reference model 321 to the reference model storage unit 303. The reference model storage unit 303 stores a reference model 321.
The usage information creating unit 304 creates the type of noise as the usage information using the screen and keys of the PDA 301. The reference model selecting unit 305 selects, from the reference models 321 stored in the reference model storage unit 303, the reference models that are acoustically close to the type of noise given as the usage information 324. The specification information creating unit 307 creates specification information 325 in accordance with the specification of the PDA 301. Here, the specification information is information relating to the specification of the standard model to be created, and in this case is information relating to the processing capability of the CPU included in the PDA 301.
The standard model creation unit 306 is a processing unit that creates the standard model 322, based on the specification information 325 created by the specification information creation unit 307, so as to maximize, or locally maximize, the probability or likelihood with respect to the reference models for noise selected by the reference model selection unit 305, and includes: a standard model structure determination unit 306a that determines the structure of the standard model (the number of mixture distributions of Gaussian distributions, etc.); an initial standard model creation unit 306b that creates an initial standard model by determining initial values of the statistics for calculating the standard model; a statistic storage unit 306c that stores the determined initial standard model; and a statistic estimation unit 306d that, for the initial standard model stored in the statistic storage unit 306c, calculates the statistics that maximize, or locally maximize, the probability or likelihood with respect to the reference models 323 selected by the reference model selection unit 305, by an approximation calculation or the like using the 2nd approximation unit 306e, thereby generating the final standard model.
The noise recognition unit 313 recognizes the type of noise input from the microphone 312 using the standard model 322 created by the standard model creation unit 306.
Next, the operation of PDA301 configured as described above will be described.
Fig. 14 is a flowchart showing the operation procedure of the PDA 301.
First, before a standard model is created, a reference model to be a reference of the standard model is prepared (step S300). That is, the reading unit 311 reads the reference model of the noise written in the memory device, the reference model preparation unit 302 transmits the read reference model 321 to the reference model storage unit 303, and the reference model storage unit 303 stores the reference model 321.
The reference model 321 is made up of GMMs. Fig. 15 shows an example of the reference model 321. Here, each noise model is composed of a GMM with 3 mixture distributions. As the feature amount, 5-dimensional (J = 5) LPC cepstrum coefficients are used.
Next, the usage information creating unit 304 creates the usage information 324, that is, the type of noise to be recognized (step S301). Fig. 16 shows an example of the selection screen of the PDA 301. Here, the noise of a car is selected. The reference model selecting unit 305 selects the reference model of car A and the reference model of car B, which are acoustically close to the noise of a car indicated by the selected usage information 324, from the reference models 321 stored in the reference model storage unit 303 (step S302).
Then, the specification information creating unit 307 creates the specification information 325 based on the specification of the PDA301 (step S303). Here, specification information 325 indicating that the CPU power is small is created based on the CPU specification of the PDA301. The standard model creation unit 306 creates the standard model 322, based on the created specification information 325, so as to maximize, or locally maximize, the probability or likelihood with respect to the reference models 323 selected by the reference model selection unit 305 (step S304).
Finally, the noise recognition unit 313 performs noise recognition on the noise input from the microphone 312 by the user using the standard model 322 (step S305).
Next, the detailed procedure of step S304 (creation of the standard model) in fig. 14 will be described. The flow of steps is the same as the flow chart shown in fig. 4. However, the structure of the standard model used, the specific approximate calculation, and the like are different.
First, the standard model structure determination unit 306a determines the structure of the standard model (step S102a in fig. 4). Here, based on the information in the specification information 325 that the CPU power is small, it is determined that the standard model 322 is configured as a single-mixture GMM (Mf = 1).
Next, the initial standard model creation unit 306b determines initial values of the statistics for calculating the standard model (step S102b in fig. 4). Here, a model obtained by integrating the selected reference model 323, that is, the 3-mixture reference model of car A, into one Gaussian distribution by statistical calculation is stored as the initial values of the statistics in the statistic storage unit 306c.
Specifically, the initial standard model creation unit 306b generates an output distribution shown in equation 13.
In the present embodiment, in the output distribution shown by the above equation 13, (formula 51) x=(x(1),x(2),...,x(J)) ∈ R^J denotes a 5-dimensional (J = 5) LPC cepstral coefficient vector.
Then, the statistic estimation unit 306d estimates the statistic of the standard model stored in the statistic storage unit 306c using the 2 reference models 323 selected by the reference model selection unit 305 (step S102c in fig. 4).
That is, the statistics of the standard model (the mixture weight coefficients shown in equation 16, the mean values shown in equation 17, and the variance values shown in equation 18) are estimated so that the probability (here, the likelihood logP shown in equation 25) of the output distributions shown in equation 19, that is, the output distributions of the 2 (Ng = 2) reference models 323, is maximized or locally maximized.
In the present embodiment, in the output distribution shown by the above equation 19, (formula 52) Lg(i) (i=1,2,...,Ng) is 3 (the number of mixture distributions of each reference model).
Specifically, the mixture weight coefficient, the average value, and the variance value of the standard model are calculated from equations 26, 27, and 28, respectively.
In this case, the 2 nd approximation unit 306e of the statistic estimation unit 306d uses the following approximation formula, assuming that the gaussian distributions of the standard model do not affect each other.
(formula 53)
(m=1,2,...,Mf)
In addition, the neighborhood (formula 55) X of a Gaussian distribution (formula 54) of the standard model is the space in which there exist the Qg(m,i) Gaussian distributions (formula 56) of the reference models 323 whose inter-distribution distance to the output distribution shown in the above formula 54 (for example, the Euclidean distance between mean values, the Mahalanobis distance, or the KL (Kullback-Leibler) distance) is small, and the Qg(m,i) (1 ≤ Qg(m,i) ≤ Lg(i)) output distributions (formula 57) of the reference models with a small inter-distribution distance are approximated by the Gaussian distribution (formula 58) of the standard model that has the closest inter-distribution distance to the output distribution of formula 57 (here, the vicinity indication parameter G is 1).
Fig. 17 is a schematic diagram showing the statistic estimation procedure of the statistic estimation unit 306d. The statistics are estimated by associating each Gaussian distribution of each reference model with the Gaussian distribution m of the standard model whose inter-distribution distance, such as the Euclidean distance between mean values or the Mahalanobis distance, is the smallest.
Fig. 18 is a diagram illustrating the approximation calculation by the 2nd approximation unit 306e. As shown in the figure, the 2nd approximation unit 306e applies the approximation formula shown in the above formula 53 by identifying, for each Gaussian distribution of each reference model, the closest Gaussian distribution m of the standard model. A sketch of this nearest-Gaussian assignment follows.
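A minimal sketch of that nearest-Gaussian (hard-assignment) update is given below, assuming diagonal covariances and Euclidean distance between means; the function name is an illustrative assumption, and the case of a component that receives no reference Gaussian is left to methods 1 to 3 described next.

```python
import numpy as np

def hard_assign_update(ref_weights, ref_means, ref_vars, std_means, Mf):
    """Each reference-model Gaussian contributes only to the standard-model
    Gaussian whose mean is closest; returns re-estimated weights, means and
    variances (components receiving nothing are left untouched)."""
    assign = np.argmin(np.linalg.norm(ref_means[:, None, :] - std_means[None, :, :],
                                      axis=2), axis=1)
    new_w = np.zeros(Mf)
    new_mean, new_var = np.zeros_like(std_means), np.zeros_like(std_means)
    for m in range(Mf):
        sel = assign == m
        if not sel.any():
            continue                                   # handled by methods 1-3 below
        w = ref_weights[sel]
        new_w[m] = w.sum()
        new_mean[m] = (w[:, None] * ref_means[sel]).sum(0) / w.sum()
        second = (w[:, None] * (ref_vars[sel] + ref_means[sel] ** 2)).sum(0) / w.sum()
        new_var[m] = second - new_mean[m] ** 2
    return new_w / new_w.sum(), new_mean, new_var
```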
When the above approximation formulas of the 2 nd approximation unit 306e are taken into consideration, the calculation formula of the statistic estimation unit 306d is as follows. That is, the statistic estimator 306d calculates the mixture weight coefficient, the average value, and the variance value from the following expressions 59, 60, and 61, and generates a standard model specified by these parameters as the final standard model 322.
(formula 59)
(m=1,2,...,Mf)
(formula 60)
(m=1,2,...,Mf,j=1,2,...,J)
(formula 61)
(m=1,2,...,Mf,j=1,2,...,J)
(In each of equations 59 to 61, the sums in the numerator and the denominator are taken over those Gaussian distributions of the reference models for which the Gaussian distribution m of the standard model is the closest in inter-distribution distance, for example the Euclidean distance between mean values or the Mahalanobis distance.)
However, in the case of (formula 62), one of the following methods is used:
(Method 1) the mixture weight coefficient, the mean value, and the variance value are not updated;
(Method 2) the mixture weight coefficient is set to zero, and the mean value and the variance value are set to predetermined values;
(Method 3) the mixture weight coefficient is set to a predetermined value, and the mean value and the variance value are set to the mean value and the variance value obtained when the output distribution of the standard model is expressed as a single distribution.
The value of the statistic is determined by one of these methods, and the method used may differ for each repetition count R, each HMM, and each HMM state. Here, method 1 is used.
The statistic estimator 306d stores the statistic of the standard model estimated in this way in the statistic storage 306 c. Thereafter, the estimation of the statistic and the storage in the statistic storage unit 306c are repeated R (≧ 1) times. As a result, the obtained statistic is output as the statistic of the standard model 322 to be finally generated.
Next, a specific example of the application of the present embodiment to the environmental sound recognition of the PDA will be described.
First, the reference model preparation unit 302 reads out a reference model necessary for recognizing an environmental sound from the CD-ROM. The user selects an environmental sound to be recognized from the screen in consideration of the environment (use information) in which the recognition is performed. For example, select 'car' and then select 'alarm sound', 'baby sound', 'sound of tram', etc. In response to this selection, the reference model selecting unit 305 selects a corresponding reference model from the reference models stored in the reference model storage unit 303. Then, the standard model creation unit 306 creates standard models for the selected reference models 323, one by one.
Next, the user starts an application such as 'provide information' (which determines the situation from environmental sounds and provides information) on the PDA301. This application program determines the situation based on environmental sounds and provides appropriate information to the user. Once it is started, options such as 'accurate determination' and 'quick determination' are displayed on the screen of the PDA301, and the user selects one of them.
Then, the specification information creating unit 307 creates specification information based on the selection result. For example, when 'accurate determination' is selected, specification information in which the number of mixing distributions is 10 is created to improve accuracy. On the other hand, when the 'quick determination' is selected, the specification information in which the number of mixing distributions is 1 is created for high-speed processing. In the case where a plurality of PDAs can perform the processing in conjunction with each other, the CPU power currently available may be determined, and the specification information may be created using the CPU power.
Based on the specification information, a single-mixture standard model such as 'car', 'alarm sound', 'baby' and 'train sound' is created. Then, the PDA301 recognizes the environment using the created standard model, and displays various information on the PDA screen based on the recognition result. For example, a road map is displayed in the case where 'car' is recognized to be in the vicinity, or an advertisement of a toy shop is displayed in the case where 'sound of baby' is recognized. In this way, provision of information based on environmental sound recognition using the standard model created by the standard model creation device according to the present embodiment is achieved. In addition, the standard model may be adjusted in complexity to the specifications of the application.
As described above, according to embodiment 3 of the present invention, the standard model is created by calculating statistics of the standard model that maximize, or locally maximize, the probability or likelihood with respect to the plurality of reference models selected based on the usage information, so that a suitable, high-precision standard model can be provided according to the situation of use.
Further, since the standard model is created based on the specification information, it is possible to prepare a standard model suitable for the equipment using the standard model.
The processing by the statistic estimation unit 306d may be repeated not a fixed number of times but until the likelihood shown in equation 25 becomes equal to or greater than a predetermined threshold value.
The GMM constituting the standard model 322 may be configured by a gaussian mixture distribution having a different number of mixture distributions for each type of noise.
The recognition model is not limited to the noise model, and may be a speaker recognition model, an age recognition model, or the like.
The standard model 322 may be recorded in a storage device such as a CD-ROM, a DVD-RAM, or a hard disk.
Instead of reading the reference model 321 from a memory such as a CD-ROM, the PDA301 may create the reference model 321 from the noise data.
The reference model preparation unit 302 may add and update a new reference model read from a memory device such as a CD-ROM to the reference model storage unit 303 as necessary, and delete an unnecessary reference model stored in the reference model storage unit 303.
The reference model preparation unit 302 may add and update a new reference model to the reference model storage unit 303 via the communication path as necessary.
In addition, after the standard model is manufactured, the data can be further utilized for learning.
The standard model structure determination unit may determine the structure, the number of states, and the like of the standard model.
The vicinity indication reference G may be different depending on the output distribution of the target event or the standard model, or may be changed depending on the number of repetitions R.
(embodiment 4)
Fig. 19 is a block diagram showing the entire configuration of a standard model creation device according to embodiment 4 of the present invention. Here, an example in which the standard modeling apparatus of the present invention is incorporated in a server 401 of a computer system is shown. In the present embodiment, a case where a standard model for face recognition is created will be described as an example.
The server 401 is a computer device in a communication system or the like, and includes, as a standard model creation device for creating a standard model for face recognition defined by the output probability of an event, a camera 411, an image data storage unit 412, a reference model preparation unit 402, a reference model storage unit 403, a usage information receiving unit 404, a reference model selecting unit 405, a standard model creating unit 406, and a writing unit 413.
The image data of the face is collected by the camera 411, and the image data of the face is stored in the image data storage unit 412. The reference model preparation unit 402 creates a reference model 421 for each speaker using the face image data stored in the image data storage unit 412, and stores the reference model in the reference model storage unit 403.
The usage information receiving unit 404 receives, by telephone 414, information on the age and sex of the persons the user wishes to recognize by face recognition, as the usage information 424. Based on the usage information 424 received by the usage information receiving unit 404, the reference model selecting unit 405 selects, from the reference models 421 stored in the reference model storage unit 403, the reference models 423 corresponding to persons of the sex and age indicated by the usage information 424.
The standard model creation unit 406 is a processing unit that creates the standard model 422 so as to maximize, or locally maximize, the probability or likelihood with respect to the reference models 423 of the face images selected by the reference model selection unit 405; it has the same functions as the standard model creation unit 206 of embodiment 2, and additionally has the functions of the 1st approximation unit 104e of embodiment 1 and the 2nd approximation unit 306e of embodiment 3. That is, it performs a calculation combining the 3 kinds of approximation calculation shown in embodiments 1 to 3.
The writing unit 413 writes the standard model 422 created by the standard model creating unit 406 into a storage device such as a CD-ROM.
Next, the operation of the server 401 configured as described above will be described.
Fig. 20 is a flowchart showing the operation procedure of the server 401. Fig. 21 is a diagram showing an example of a reference model and a standard model for explaining the operation procedure of the server 401.
First, before the standard model is created, reference models serving as the basis of the standard model are prepared (step S400 in fig. 20). That is, face image data of persons A through Z is collected by the camera 411 and stored in the image data storage unit 412. The reference model preparation unit 402 creates a reference model 421 for each person by the EM algorithm using the face image data stored in the image data storage unit 412. Here, each reference model 421 is constituted by a GMM.
The reference model storage unit 403 stores the reference models 421 created by the reference model preparation unit 402. Here, as shown in the reference models 421 of fig. 21, all the reference models of persons A through Z are configured as GMMs with 5 mixture distributions. As the feature amount, 100-dimensional (J = 100) pixel density values are used.
Next, the usage information receiving unit 404 receives information on the age and sex as the usage information 424 by the telephone 414 (step S401 in fig. 20). Here, the usage information 424 includes a male aged 11 to 15 years and a female aged 22 to 26 years. The reference model selection unit 405 selects the reference model 423 corresponding to the usage information 424 from the reference models 421 stored in the reference model storage unit 403 based on the usage information 424 (step S402 in fig. 20). Specifically, as shown in 'selected reference model 423' of fig. 21, here, reference models of a male aged from 11 to 15 years and a female aged from 22 to 26 years are selected.
Then, the standard model creation unit 406 creates the standard models 422 so as to maximize, or locally maximize, the probability or likelihood with respect to the reference models 423 selected by the reference model selection unit 405 (step S403 in fig. 20). Here, as shown in the standard models 422 of fig. 21, each of the two standard models 422 is configured as a GMM with 3 mixture distributions.
The standard models 422 are created in substantially the same manner as in embodiment 2; however, the approximation calculation in the statistic estimation of the standard model 422 is performed as follows. Using an internal memory unit or the like, the standard model creation unit 406 first creates a model by an approximation calculation similar to that performed by the 1st approximation unit 104e in embodiment 1 and uses it as an initial value, then performs a calculation based on an approximation similar to that performed by the general approximation unit 206e in embodiment 2, and finally, using that result as an initial value, performs an approximation calculation similar to that performed by the 2nd approximation unit 306e in embodiment 3. A rough composition sketch follows.
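As a rough illustration of chaining such calculations, the sketch below seeds a model, refines it with the EM-style update from the earlier sketches, and finishes with the nearest-Gaussian pass from the embodiment 3 sketch. It is an assumption-laden composition using the illustrative helpers defined earlier, not the patent's exact procedure; the crude seeding stands in for the initial model rather than the 1st approximation itself.

```python
import numpy as np

def create_face_standard_model(ref_weights, ref_means, ref_vars, Mf, R=3):
    """Illustrative chain of approximation styles for embodiment 4."""
    # crude initial model: spread Mf of the reference Gaussians as seeds
    idx = np.linspace(0, len(ref_weights) - 1, Mf).astype(int)
    std_w = np.full(Mf, 1.0 / Mf)
    std_mu, std_var = ref_means[idx].copy(), ref_vars[idx].copy()
    # EM-style refinement over all reference Gaussians (cf. embodiment 2 sketch)
    for _ in range(R):
        std_w, std_mu, std_var = merge_reference_gaussians(
            ref_weights, ref_means, ref_vars, std_w, std_mu, std_var, n_iter=1)
    # final nearest-Gaussian pass (cf. embodiment 3 sketch)
    return hard_assign_update(ref_weights, ref_means, ref_vars, std_mu, Mf)
```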
The writing unit 413 writes the two standard models 422 created by the standard model creating unit 406 into a storage device such as a CD-ROM (step S404 in fig. 20).
The user receives by mail a storage device in which a standard model of a male aged 11 to 15 years old and a standard model of a female aged 22 to 26 years old are written.
Next, a specific example in which the present embodiment is applied to an information providing system that introduces stores and the like based on the prediction of behavior will be described. The information providing system is composed of a car navigation device and an information providing server device connected by a communication network. The car navigation device has the following functions: by using the standard model created in advance by the standard model creation device 401 of the present embodiment as the action prediction model, the action of the person (i.e., the destination of the vehicle, etc.) is predicted, and information related to the action (store information such as restaurants located near the destination, etc.) is provided.
First, the user creates an action prediction model for himself/herself by requesting the server 401 connected via the telephone line 414 using the car navigation system.
Specifically, the user presses a button of the 'recommendation function' on the item selection screen displayed on the car navigation device. At this time, the user's residence (place of use), age, sex, interest, and the like are input.
Here, the users are a father and a mother. First, the father's personal information is input while interacting with the screen of the car navigation device. The residence is converted automatically from an input telephone number; alternatively, when the current position is displayed on the car navigation device, it can be input as the place of use by pressing the 'place of use' button. Here, the address information is address A. For age and gender, '50' and 'male' are selected and input. For interests, selectable items are displayed in advance and the user selects among them; here, the father's interest information is taken to be interest information A.
Next, the mother's personal information is also input, and personal information including address B, an age in the 40s, and interest information B is created. The result of these inputs is shown in the screen display example of fig. 22.
Finally, the car navigation device transmits the personal information thus created as the use information to the server 401 as the information providing server device using the attached telephone line 414.
Next, the server 401 creates two action prediction models, one for the father and one for the mother, based on the transmitted personal information (usage information). Here, an action prediction model is a probability model that takes the day of the week, the time of day, the current location, and the like as inputs, and outputs the probability of presenting store A information, the probability of presenting store B information, the probability of presenting store C information, the probability of presenting parking lot information, and so on.
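As an illustration only, such a probability model could be as simple as a conditional frequency table over discretised context features; every name and value below is an assumption made for the sketch, not taken from the patent.

```python
from collections import Counter, defaultdict

class ActionPredictionModel:
    """Toy action prediction model: context -> distribution over items to present."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, context, presented_item):
        # record one observed presentation for this context
        self.counts[context][presented_item] += 1

    def predict(self, context):
        # return the empirical probability of each item for this context
        c = self.counts[context]
        total = sum(c.values()) or 1
        return {item: n / total for item, n in c.items()}

model = ActionPredictionModel()
model.update(("Sunday", "12:00", "near address A"), "store A information")
print(model.predict(("Sunday", "12:00", "near address A")))
```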
The plurality of reference models stored in the reference model storage unit 403 of the server 401 are action prediction models created from the age, sex, representative residence, and interest tendency. The server 401 stores various personal information (the input and output information) in the image data storage unit 412 by inputting various personal information using an input button of the car navigation device or the like in advance instead of the camera 411, and the reference model preparation unit 402 creates a reference model 421 for each of a plurality of typical users based on the personal information stored in the image data storage unit 412 and stores the reference model in the reference model storage unit 403.
The reference model selection unit 405 selects a reference model suitable for the personal information using the personal information (use information). For example, eight reference models with the same street, the same age and gender, and the same interest are selected. The standard model creation unit 406 of the server 401 creates a standard model in which the selected reference model is integrated. The writing unit 413 stores the created standard model in the memory card. Here, standard models of both the father and mother are stored. The memory card is delivered to the user by post.
The user inserts the received memory card into the car navigation device, and selects the 'father' and 'mother' displayed on the screen, thereby setting the user. Thus, the car navigation device uses the standard model stored in the attached memory card as the action prediction model, and presents store information and the like at a necessary timing based on the current week, time, place, and the like. In this way, an information providing system is realized that predicts the action of a person (i.e., the destination of a vehicle) and provides information related to the action by using the standard model created by the standard model creating device of the present embodiment as an action prediction model.
As described above, according to embodiment 4 of the present invention, the standard model is created by calculating statistics of the standard model that maximize, or locally maximize, the probability or likelihood with respect to the plurality of reference models selected based on the usage information, so that a suitable, high-precision standard model can be provided according to the situation of use.
In addition, the GMM constituting the standard model 422 may be constituted by a mixture gaussian distribution having a different number of distributions for each speaker.
The reference model preparation unit 402 may create a new reference model, add the new reference model to the reference model storage unit 403, update the new reference model, and delete the unnecessary reference model stored in the reference model storage unit 403, if necessary.
In addition, after the standard model is manufactured, the data can be further utilized for learning.
The standard model structure determination unit may determine the structure, the number of states, and the like of the standard model.
(embodiment 5)
Fig. 23 is a block diagram showing the entire configuration of a standard model creation device according to embodiment 5 of the present invention. Here, an example in which the standard modeling apparatus of the present invention is incorporated in a server 501 in a computer system is shown. In the present embodiment, a case where a standard model (adaptive model) for speech recognition is created will be described as an example.
The server 501 is a computer device in a communication system or the like, and is a standard model creation device that creates a standard model for speech recognition defined by a set of events and an output probability of an event or a transition between events, and includes a reading unit 511, a speech data storage unit 512, a reference model preparation unit 502, a reference model storage unit 503, a used information receiving unit 504, a reference model selection unit 505, a standard model creation unit 506, a specification information receiving unit 507, and a writing unit 513.
The reading unit 511 reads the voice data of children, adults, and elderly persons written in a memory such as a CD-ROM, and stores the voice data in the voice data storage unit 512. The reference model preparation unit 502 creates a reference model 521 for each speaker using the speech data stored in the speech data storage unit 512. The reference model storage unit 503 stores the reference model 521 created by the reference model preparation unit 502.
The specification information receiving unit 507 receives the specification information 525. The usage information receiving unit 504 receives a user's voice as the usage information 524. The reference model selecting unit 505 selects a reference model of a speaker that is close to the user speech, which is the usage information 524, from the reference models 521 stored in the reference model storage unit 503.
The standard model creation unit 506 is a processing unit that creates the standard model 522, based on the specification information 525, so as to maximize, or locally maximize, the probability or likelihood with respect to the reference models 523 of the speakers selected by the reference model selection unit 505, and has the same functions as the standard model creation unit 104 of embodiment 1. The writing unit 513 writes the standard model 522 created by the standard model creating unit 506 into a storage device such as a CD-ROM.
Next, the operation of the server 501 configured as described above will be described.
Fig. 24 is a flowchart showing the operation procedure of server 501. Fig. 25 is a diagram showing an example of a reference model and a standard model for explaining the operation procedure of the server 501.
First, before the standard model is created, reference models to serve as the basis of the standard model are prepared (step S500 in fig. 24). That is, the reading unit 511 reads the voice data written in a storage device such as a CD-ROM and stores the voice data in the voice data storage unit 512. The reference model preparation unit 502 creates a reference model 521 for each speaker from the speech data stored in the speech data storage unit 512, using the Baum-Welch re-estimation method. The reference model storage unit 503 stores the reference models 521 created by the reference model preparation unit 502.
The reference model 521 is constituted by HMMs for each phoneme. Here, as shown in the reference model 521 of fig. 25, the reference model of each child speaker constitutes the output distribution of the HMM by a mixture gaussian distribution having 3 states and 3 mixture distributions in each state; the reference model of each adult speaker constitutes the output distribution of the HMM by a mixture gaussian distribution having 3 states and 64 mixture distributions in each state; and the reference model of each elderly speaker constitutes the output distribution of the HMM by a mixture gaussian distribution having 3 states and 16 mixture distributions in each state. This is because the amount of speech data for children is small while that for adults is large. As the feature quantity, 25-dimensional (J = 25) mel-frequency cepstrum coefficients are used.
Next, the use information receiving unit 504 receives the user's voice from the terminal device 514 as the use information 524 (step S501 in fig. 24). The reference model selecting unit 505 selects, from the reference models 521 stored in the reference model storage unit 503, the reference models 523 that are acoustically close to the user's voice, which is the use information 524 (step S502 in fig. 24). Specifically, as shown in 'selected reference model 523' of fig. 25, the reference models of the 10 closest speakers (Ng = 10) are selected here.
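The selection in step S502 can be pictured as scoring the user's feature frames against each speaker's reference model and keeping the Ng highest-scoring speakers. The sketch below is only illustrative and assumes each speaker is summarized by a single GMM (via scikit-learn) rather than a full set of per-phoneme HMMs; all identifiers and data are made up.

```python
# Sketch: pick the Ng speaker reference models that are acoustically closest
# to the user's speech, measured by average log-likelihood per frame.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(speaker_features, n_mix):
    """speaker_features: dict name -> (T, J) array of feature frames."""
    return {name: GaussianMixture(n_components=n_mix, covariance_type="diag",
                                  random_state=0).fit(feats)
            for name, feats in speaker_features.items()}

def select_reference_models(user_frames, speaker_gmms, ng=10):
    """Return the ng speaker names whose models score the user's frames highest."""
    scores = {name: gmm.score(user_frames)      # mean log-likelihood per frame
              for name, gmm in speaker_gmms.items()}
    return sorted(scores, key=scores.get, reverse=True)[:ng]

# toy usage: 20 speakers, 25-dimensional features, keep the 10 closest
rng = np.random.default_rng(0)
speakers = {f"spk{i:02d}": rng.normal(i * 0.1, 1.0, size=(200, 25)) for i in range(20)}
gmms = train_speaker_gmms(speakers, n_mix=3)
user_frames = rng.normal(0.55, 1.0, size=(50, 25))
print(select_reference_models(user_frames, gmms, ng=10))
```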
Then, the specification information receiving unit 507 receives the specification information 525 from the terminal device 514 in response to the request of the user (step S503 in fig. 24). Here, specification information 525 indicating 'quick recognition processing' is received. The standard model creation unit 506 creates the standard model 522 based on the specification information 525 received by the specification information reception unit 507, so as to maximize or locally maximize the probability or likelihood with respect to the speaker reference models 523 selected by the reference model selection unit 505 (step S504 in fig. 24). Specifically, based on the information 'quick recognition processing' as the specification information 525, the standard model 522 is composed of a 3-state HMM with 2 mixture distributions (Mf = 2) per state, as shown in the standard model 522 of fig. 25. The HMM is configured for each phoneme.
The method for creating the standard model 522 is the same as in embodiment 1.
The writing unit 513 writes the standard model 522 created by the standard model creating unit 506 into a storage device such as a CD-ROM (step S505 in fig. 24).
Next, a specific example in which the present embodiment is applied to a game based on voice recognition using a communication network will be described. Here, the server 501 includes a speech recognition unit that performs speech recognition using the created standard model.
In addition, a PDA is used as the terminal device 514. They are connected by a communication network.
The server 501 sequentially prepares a reference model at the timing when voice data is acquired from a CD, a DVD, or the like, by the reading unit 511, the voice data storage unit 512, and the reference model preparation unit 502.
The user starts a game program using voice recognition, here an 'action game', on the PDA (terminal device 514). At this time, 'please say "action"' is displayed, so the user utters 'action'. The voice is transmitted from the PDA (terminal device 514) to the server 501 as the use information, and the use information receiving unit 504 and the reference model selecting unit 505 of the server 501 select reference models matching the user from the plurality of reference models stored in the reference model storage unit 503.
In addition, since the user desires a quick response, the user sets 'high-speed recognition' on the setting screen of the PDA (terminal device 514). The setting contents are transmitted from the PDA (terminal device 514) to the server 501 as specification information, and the standard model creating unit 506 of the server 501 creates a 2-mixture standard model based on the specification information and the selected reference models.
The user gives instructions such as 'move right', 'move left' to the microphone of the PDA by voice during the action game. The input speech is transmitted to a server, and speech recognition using the created standard model is performed. The recognition result is transmitted from the server 501 to the PDA (terminal device 514), and the PDA (terminal device 514) operates the character in the action game based on the transmitted recognition result. In this way, the standard model created by the standard model creation device according to the present embodiment is used for speech recognition, thereby realizing a motion game based on speech.
In addition, the present embodiment can be similarly applied to other applications, for example, a translation system using a communication network. For example, a user starts an application called 'speech translation' on a PDA (terminal device 514). At this time, 'please say "translation"' is displayed, so the user utters 'translation'. The voice is transmitted from the PDA (terminal device 514) to the server 501 as the use information. In addition, since the user wants accurate recognition, the user indicates 'recognize accurately' in the application. The instruction is transmitted from the PDA (terminal device 514) to the server 501 as specification information. The server 501 creates, for example, a 100-mixture standard model based on the transmitted use information and specification information.
The user utters 'good morning' into the microphone of the PDA (terminal device 514). The input voice is transmitted from the PDA (terminal device 514) to the server 501, and after the server 501 recognizes 'good morning', the recognition result is returned to the PDA (terminal device 514). The PDA (terminal device 514) translates the recognition result received from the server 501 into English, and displays the result 'GOOD MORNING' on the screen. In this way, by using the standard model created by the standard model creation device of the present embodiment for speech recognition, a speech-based translation device can be realized.
As described above, according to embodiment 5 of the present invention, the standard model is created by calculating the statistics of the standard model so that the probability or likelihood with respect to the plurality of reference models selected based on the usage information is maximized or locally maximized; it is therefore possible to provide a suitable, high-precision standard model according to the usage situation.
In addition, since the standard model is created based on the specification information, a standard model suitable for a device using the standard model is prepared.
The reference model preparation unit 502 may prepare, for each reference model, a high-accuracy reference model whose number of mixture distributions corresponds to the amount of data, and the standard model may be created using these high-accuracy reference models. Therefore, a high-precision standard model can be utilized.
The standard model 522 is not limited to the HMM for each phoneme, and may be an HMM depending on the context.
In addition, the HMM constituting the standard model 522 may be constituted by a mixture gaussian distribution having a different number of distributions for each state.
In addition, server 501 may also perform speech recognition using standard model 522.
The reference model preparation unit 502 may create a new reference model and add it to the reference model storage unit 503, update reference models, or delete unnecessary reference models stored in the reference model storage unit 503.
In addition, after the standard model is created, data can be further used to train it.
The standard model structure specifying unit may specify the structure, the number of states, and the like of the standard model.
(embodiment 6)
Fig. 26 is a block diagram showing the entire configuration of the standard model creation apparatus according to embodiment 6 of the present invention. Here, an example in which the standard modeling apparatus of the present invention is incorporated in a server 601 in a computer system is shown. In the present embodiment, a case where a standard model (preference model) for intention understanding is created will be described as an example.
The server 601 is a computer device or the like in the communication system, and includes a reading unit 611, a reference model preparation unit 602, a reference model storage unit 603, a use information receiving unit 604, a reference model selecting unit 605, a standard model creating unit 606, and a specification information creating unit 607 as a standard model creating device for creating a standard model for intention recognition defined by an output probability of an event.
The reading unit 611 reads preference models for speakers A to Z of different ages, written in a memory such as a CD-ROM; the reference model preparation unit 602 transmits the read reference models 621 to the reference model storage unit 603, and the reference model storage unit 603 stores the reference models 621.
The specification information creation unit 607 creates specification information 625 in accordance with the CPU power of a computer that is currently used. The usage information receiving unit 604 receives the usage information 624 from the terminal device 614. The reference model selecting unit 605 selects the reference model 623 corresponding to the usage information 624 from the reference models 621 stored in the reference model storage unit 603, based on the usage information 624 received by the usage information receiving unit 604.
The standard model creation unit 606 is a processing unit that creates a standard model 622 based on the specification information 625 created by the specification information creation unit 607, so as to maximize or locally maximize the probability or likelihood with respect to the reference models 623 selected by the reference model selection unit 605. It has the same function as the standard model creation unit 206 in embodiment 2, and also has the function of the 2nd approximation unit 306e in embodiment 3. That is, the calculation is performed by combining the two approximation calculations shown in embodiments 2 and 3.
Next, the operation of the server 601 configured as described above will be described.
Fig. 27 is a flowchart showing the operation procedure of the server 601. Fig. 28 is a diagram showing an example of a reference model and a standard model for explaining the operation procedure of the server 601.
First, before a standard model is created, reference models to serve as the basis of the standard model are prepared (step S600 in fig. 27). That is, the reading unit 611 reads preference models for speakers A to Z of different ages, written in a memory such as a CD-ROM; the reference model preparing unit 602 transmits the read reference models 621 to the reference model storage unit 603, and the reference model storage unit 603 stores the reference models 621.
The reference models 621 are made up of GMMs. Here, as shown in the reference model 621 of fig. 28, each GMM has 3 mixture distributions. As the learning data, 5-dimensional (J = 5) feature values obtained by numericizing interests, personality, and the like are used. The preparation of the reference models is performed before the creation of the standard model is requested.
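The preference reference models can be thought of as per-speaker GMMs fitted over the 5-dimensional numericized feature vectors. The following is a minimal sketch; the feature encoding, speaker labels, and data are illustrative assumptions.

```python
# Sketch: build a 3-mixture GMM preference model per speaker from
# 5-dimensional feature vectors (interests, personality, etc. numericized).
import numpy as np
from sklearn.mixture import GaussianMixture

def build_preference_model(feature_rows, n_mix=3):
    """feature_rows: (N, 5) array of numericized preference features."""
    return GaussianMixture(n_components=n_mix, covariance_type="diag",
                           random_state=0).fit(feature_rows)

rng = np.random.default_rng(1)
speaker_models = {
    f"speaker_{c}": build_preference_model(rng.normal(loc=i, size=(100, 5)))
    for i, c in enumerate("ABC")   # stand-in for speakers A to Z
}
print(sorted(speaker_models))
```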
Next, the usage information receiving unit 604 receives the usage information 624 indicating the age bracket for which the preference model is to be created (step S601 in fig. 27). Here, the usage information 624 specifies an age bracket, such as the 20s, 30s, or 40s, in which the preference model will be used. As shown in 'selected reference model 623' of fig. 28, the reference model selecting unit 605 selects, from the reference models 621 stored in the reference model storage unit 603, the preference models of speakers in the age bracket indicated by the usage information 624 received by the usage information receiving unit 604 (step S602 of fig. 27).
Then, the specification information creation unit 607 creates specification information 625 based on the CPU power, the memory capacity, and the like of the computer currently in use (step S603 in fig. 27). Here, specification information 625 indicating 'normal-speed recognition processing' is created.
The standard model creation unit 606 creates the standard model 622 based on the specification information 625 created by the specification information creation unit 607, so as to maximize or locally maximize the probability or likelihood with respect to the speaker reference models 623 selected by the reference model selection unit 605 (step S604 in fig. 27). Here, based on the information 'normal-speed recognition processing' as the specification information 625, the standard model 622 is composed of a GMM with 3 mixture distributions (Mf = 3), as shown in the standard model 622 of fig. 28.
The method of creating the standard model 622 is basically the same as that of embodiment 2. However, the approximation calculation in the statistic estimation of the standard model 622 is specifically performed as follows. That is, the standard model creation unit 606 first performs the same approximation calculation as the approximation unit 206e in embodiment 2 (for example, using a built-in memory unit), and then, using the result as an initial value, performs the same approximation calculation as the approximation unit 306e in embodiment 3.
Next, a specific example in which the present embodiment is applied to an information search device will be described. Here, for an input search key, the model outputs the probability of using search path A, search path B, and so on. If different search paths are used, the displayed search results differ. The reference models prepared in the reference model storage unit 603 of the server 601 are models of speakers having representative characteristics.
First, the user inputs the usage information using a remote controller (terminal device 614) attached to the server 601. The usage information is age, personality, sex, interests, and the like. Further, the information may identify a predetermined group such as 'child', 'actor', or 'college student'.
Next, the user selects one of the devices to be used, such as 'car navigation device', 'cellular phone', 'computer', and 'television', on the selection screen. The specification information creation unit 607 of the server 601 creates specification information based on the CPU power and the storage capacity of the selected device. Here, if 'television' is selected, specification information 625 indicating that the CPU power and the memory capacity are small is created, and the standard model creating unit 606 creates, based on the specification information 625, a 3-mixture standard model that operates even with small CPU power. The created standard model is stored in a memory card, and the user inserts the memory card into the television.
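Deriving the specification information from the selected device can be pictured as a simple lookup from device to resource budget to mixture count. The device table, numbers, and threshold below are illustrative assumptions, not values from the specification.

```python
# Sketch: derive specification information (here just a mixture count) from
# the selected device's CPU power and memory.
DEVICE_SPECS = {               # device -> (relative CPU power, memory in MB)
    "car navigation": (0.4, 128),
    "cellular phone": (0.2, 64),
    "computer":       (1.0, 2048),
    "television":     (0.3, 128),
}

def make_specification_info(device):
    cpu, mem = DEVICE_SPECS[device]
    if cpu < 0.5 or mem <= 128:        # weak device -> small standard model
        return {"device": device, "n_mix": 3}
    return {"device": device, "n_mix": 64}

print(make_specification_info("television"))   # {'device': 'television', 'n_mix': 3}
```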
The user inputs a search keyword for searching for a recommended program using an EPG or the like displayed on the television set. At this time, the television specifies a search path corresponding to the search keyword using the standard model recorded in the memory card, searches for a program along the search path, and displays the program as a program corresponding to the preference of the user. In this way, a convenient search device using the standard model created by the standard model creation device according to the present embodiment is realized.
As described above, according to embodiment 6 of the present invention, the standard model is created by calculating the statistics of the standard model so that the probability or likelihood with respect to the plurality of reference models selected based on the usage information is maximized or locally maximized; it is therefore possible to provide a suitable, high-precision standard model according to the usage situation.
In addition, since the standard model is created based on the specification information, a standard model suitable for the equipment using the standard model is prepared.
In addition, the GMM constituting the standard model 622 may also be constituted by a mixture gaussian distribution having a different number of distributions for each speaker.
The reference model preparation unit 602 may add and update a new reference model read from a memory device such as a CD-ROM to the reference model storage unit 603 as necessary, and delete an unnecessary reference model stored in the reference model storage unit 603.
In addition, the GMMs of the reference models and the standard model may also represent a part of a Bayesian network.
In addition, after the standard model is created, data can be further used to train it.
Further, the standard model structure determination section may determine an HMM structure such as a monophone, a triphone, or a state-sharing type, the number of states, and the like.
(7 th embodiment)
Fig. 29 is a block diagram showing the entire configuration of a standard model creation device according to embodiment 7 of the present invention. Here, an example in which the standard modeling apparatus of the present invention is incorporated in a server 701 in a computer system is shown. In the present embodiment, a case where a standard model (adaptive model) for speech recognition is created will be described as an example.
The server 701 is a computer device in a communication system or the like, and includes a reading unit 711, a reference model preparation unit 702, a reference model storage unit 703, a usage information receiving unit 704, a reference model selecting unit 705, a standard model creation unit 706, a specification information receiving unit 707, a standard model storage unit 708, and a standard model transmitting unit 709 as a standard model creating device that creates a standard model for speech recognition defined by a set of events and output probabilities of events or transitions between events.
The reference model preparation unit 702 transmits the reference models for speech recognition, classified by speaker, noise, and tone, which are read by the reading unit 711 from a memory such as a CD-ROM, to the reference model storage unit 703, and the reference model storage unit 703 stores the transmitted reference models 721.
The specification information receiving unit 707 receives specification information 725 from the terminal device 712. The usage information receiving unit 704 receives the voice of the user uttered under a certain noise from the terminal device 712. The reference model selecting unit 705 selects a reference model 723 of a speaker, noise, or tone that is close in audio to the user speech as the usage information 724 from the reference models 721 stored in the reference model storage unit 703.
The standard model creation unit 706 is a processing unit that creates the standard model 722 based on the specification information 725 received by the specification information reception unit 707, so as to maximize or locally maximize the probability or likelihood with respect to the reference models 723 selected by the reference model selection unit 705, and has the same function as the standard model creation unit 206 of embodiment 2. The standard model storage unit 708 stores one or more standard models based on the specification information 725. When the standard model transmitting unit 709 receives specification information and a standard model request signal from the user's terminal device 712, it transmits a standard model suitable for the specification information to the terminal device 712.
Next, the operation of the server 701 configured as described above will be described.
Fig. 30 is a flowchart showing the operation procedure of the server 701. Fig. 31 is a diagram showing an example of a reference model and a standard model for explaining an operation procedure of the server 701.
First, before the standard model is created, reference models to serve as the basis of the standard model are prepared (step S700 in fig. 30). That is, the reference model preparation unit 702 transmits the reference models for speech recognition, classified by speaker, noise, and tone, which are read by the reading unit 711 from a memory such as a CD-ROM, to the reference model storage unit 703, and the reference model storage unit 703 stores the transmitted reference models 721. Here, the reference models 721 are composed of HMMs for each speaker, noise, and tone. As shown in the reference model 721 of fig. 31, each reference model constitutes the output distribution of the HMM by using a gaussian mixture distribution having 3 states and 128 distributions in each state. As the feature amount, 25-dimensional (J = 25) cepstrum coefficients are used.
Next, the use information receiving unit 704 receives the voice of user A under noise from the terminal device 712 as the use information 724 (step S701 in fig. 30). The reference model selecting unit 705 selects, from the reference models 721 stored in the reference model storage unit 703, the reference models 723 that are acoustically close to the voice of user A, which is the use information 724 (step S702 in fig. 30). Specifically, as shown in 'selected reference model 723' of fig. 31, the reference models of the 100 closest speakers (Ng = 100) are selected here.
Then, the specification information receiving unit 707 receives the specification information 725 from the terminal device 712 in response to the request of user A (step S703 in fig. 30). Here, specification information 725 indicating 'high recognition accuracy' is received. The standard model creation unit 706 creates the standard model 722 based on the specification information 725, so as to maximize or locally maximize the probability or likelihood with respect to the reference models 723 selected by the reference model selection unit 705 (step S704 in fig. 30). Specifically, based on the information 'high recognition accuracy' as the specification information 725, the standard model 722 is composed of a 3-state HMM with 64 mixture distributions (Mf = 64), as shown in the standard model 722 of fig. 31. The HMM is configured for each phoneme.
The method of producing the standard model 722 is the same as in embodiment 2.
The standard model storage unit 708 stores one or more standard models 722 based on the specification information 725. Here, the 16-mixture HMM of user B, which is a previously created standard model, is already stored, and the 64-mixture HMM of user A is newly stored.
User A transmits, from the terminal device 712 to the standard model transmitting unit 709 of the server 701, the user name 'user A' and the type of noise as specification information, together with a standard model request signal (step S706 in fig. 30). Upon receiving the specification information and the standard model request signal transmitted by user A, the standard model transmitting unit 709 transmits a standard model suitable for the specification to the terminal device 712 (step S707 in fig. 30). Here, the previously created standard model 722 of user A is transmitted to the terminal device 712.
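The standard model storage unit 708 and transmitting unit 709 can be pictured as a keyed store that returns the stored model matching the requested specification. The following sketch is illustrative only; a real system would serialize HMM parameters rather than plain dictionaries, and the keys shown are assumptions.

```python
# Sketch: standard model storage keyed by (user, noise type), served on request.
class StandardModelStore:
    def __init__(self):
        self._models = {}                      # (user, noise_type) -> model

    def put(self, user, noise_type, model):
        self._models[(user, noise_type)] = model

    def request(self, user, noise_type):
        """Return the stored model matching the specification, if any."""
        return self._models.get((user, noise_type))

store = StandardModelStore()
store.put("user_B", "car", {"n_mix": 16})      # previously created model
store.put("user_A", "car", {"n_mix": 64})      # newly created model
print(store.request("user_A", "car"))          # -> {'n_mix': 64}
```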
User a performs speech recognition using standard model 722 received at terminal device 712 (step S708 in fig. 30).
Next, a specific example will be described in which the present embodiment is applied to a speech recognition system including a car navigation device (terminal device 712) and a server device (server 701; standard model creation device) connected via a communication network.
First, the user selects a button labeled 'obtain own voice model' on the screen of the car navigation device (terminal device 712). At this time, 'please input name' is displayed, so the user inputs his or her name by button operation. Next, 'please utter "voice"' is displayed, so the user utters 'voice' into the microphone attached to the car navigation device. These pieces of information (the name of the user and the voice under noise) are transmitted from the car navigation device (terminal device 712) to the server 701 as use information.
Similarly, the user selects a button for 'high-accuracy speech recognition' on the screen of the car navigation device (terminal device 712). At this time, the selection information is transmitted from the car navigation device (terminal device 712) to the server 701 as specification information.
The server 701 creates a standard model suitable for speech recognition of the user based on the use information and the specification information, associates the created standard model with the name of the user, and stores the standard model in the standard model storage unit 708.
When the car navigation device (terminal device 712) is started the next time, 'please input name' is displayed, so the user inputs the name. At this time, the name is transmitted to the server 701, and the corresponding standard model 722 stored in the standard model storage unit 708 is transmitted from the server 701 to the terminal device 712 by the standard model transmitting unit 709. The terminal device 712 downloads the standard model corresponding to the name (user) from the server 701, performs voice recognition of the user using the standard model, and performs destination setting by voice. In this way, by using the standard model created by the standard model creation device of the present embodiment for voice recognition, the car navigation device can be operated by voice.
As described above, according to embodiment 7 of the present invention, the standard model is created by calculating the statistics of the standard model so that the probability or likelihood with respect to the plurality of reference models selected based on the usage information is maximized or locally maximized; it is therefore possible to provide a suitable, high-precision standard model according to the usage situation.
In addition, since the standard model is created based on the specification information, a standard model suitable for a device using the standard model is prepared.
In addition, since the standard model storage unit 708 can store a plurality of standard models, the standard models can be immediately provided when necessary.
Further, since the standard model is transmitted to the terminal device 712 by the standard model transmitting unit 709, when the terminal device 712 is installed at a spatially distant location from the server, the terminal device 712 can easily use the standard model created by the server 701.
The standard model 722 is not limited to the HMM for each phoneme, and may be formed of a context-dependent HMM.
In addition, the HMM constituting the standard model 722 may be constituted by a mixture gaussian distribution having a different number of mixtures for each state.
Alternatively, speech recognition may be performed in the server 701 using the standard model 722, and the recognition result may be transmitted to the terminal device 712.
The reference model preparation unit 702 may, as necessary, create a new reference model and add it to the reference model storage unit 703, update reference models, or delete unnecessary reference models stored in the reference model storage unit 703.
The reference model preparation unit 702 may add and update a new reference model to the reference model storage unit 703 via a communication path as necessary.
In addition, after the standard model is created, data can be further used to train it.
The standard model structure determination unit may determine an HMM structure such as a monophone, a triphone, or a state-sharing type, the number of states, and the like.
(embodiment 8)
Fig. 32 is a block diagram showing the entire configuration of a standard model creation apparatus according to embodiment 8 of the present invention. Here, an example in which the standard modeling apparatus of the present invention is incorporated in a mobile phone 901 is shown. In the present embodiment, a case where a standard model for speech recognition is created will be described as an example.
The mobile phone 901 is a mobile information terminal, and is a standard model creation device for creating a standard model for speech recognition defined by a hidden markov model expressed by a set of events and output probabilities of events or transitions between events, and includes a reference model receiving unit 909, a reference model preparing unit 902, a reference model storing unit 903, a use information creation unit 904, a reference model selecting unit 905, a similarity information creation unit 908, a standard model creation unit 906, a specification information creation unit 907, a microphone 912, and a speech recognition unit 913.
The use information creation unit 904 creates the use information 924 using the screen and keys of the mobile phone 901.
The specification information creation unit 907 creates specification information 925 according to the specification of the mobile phone 901. The specification information is information related to the specification of the standard model to be created; in this case, it is information related to the CPU processing capability of mobile phone 901.
The similarity information creating unit 908 creates similarity information 926 based on the use information 924, the specification information 925, and the reference model 921 stored in the reference model storage unit 903, and transmits the similarity information 926 to the reference model preparation unit 902.
The reference model preparation unit 902 determines whether or not to prepare a reference model based on the similarity information 926. When the reference model preparation unit 902 determines that the reference model is to be prepared, it transmits the use information 924 and the specification information 925 to the reference model reception unit 909.
The reference model receiving unit 909 receives the reference model corresponding to the use information 924 and the specification information 925 from the server device 910, and transmits the reference model to the reference model preparing unit 902.
The reference model preparation unit 902 stores the reference model transmitted by the reference model reception unit 909 in the reference model storage unit 903.
The reference model selecting unit 905 selects the reference model 923 corresponding to the use information 924 from the reference models 921 stored in the reference model storage unit 903.
The standard model creation unit 906 is a processing unit that creates a standard model 922 based on the specification information 925 created by the specification information creation unit 907, so as to maximize or locally maximize the probability or likelihood with respect to the reference models 923 selected by the reference model selection unit 905, and includes: a standard model structure determination unit 906a that determines the structure of the standard model (the number of mixture distributions of gaussian distributions, etc.); an initial standard model creation unit 906b that determines initial values of the statistics for calculating the standard model, thereby creating an initial standard model; a statistic storage unit 906c that stores the determined initial standard model; and a statistic estimation unit 906d that generates the final standard model by estimating, from the initial standard model stored in the statistic storage unit 906c, statistics that maximize or locally maximize the probability or likelihood with respect to the reference models 923 selected by the reference model selection unit 905, using approximation calculations performed by the 3rd approximation unit 906e and the like.
The speech recognition unit 913 recognizes the speech of the user input from the microphone 912 using the standard model 922 created by the standard model creation unit 906.
Next, the operation of mobile phone 901 configured as described above will be described.
Fig. 33 is a flowchart showing the operation procedure of mobile phone 901.
Now, the reference model storage unit 903 stores a child model as a reference model 921. The reference model 921 is composed of HMMs for each phoneme. Fig. 34 shows an example of the reference model 921; here, a schematic diagram of a reference model for children is shown. These reference models constitute the output distribution of the HMM from a mixture gaussian distribution having 3 states and 16 distributions in each state. As the feature amount, a 25-dimensional (J = 25) feature consisting of 12-dimensional mel-frequency cepstrum coefficients, 12-dimensional δ mel-frequency cepstrum coefficients, and δ power is used.
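The 25-dimensional feature vector (12 MFCC, 12 ΔMFCC, and Δpower per frame) can be computed roughly as sketched below. The sketch assumes librosa is available, and using the 0th cepstral coefficient as the power term is an illustrative simplification, not something stated in the specification.

```python
# Sketch: 25-dimensional frame features = 12 MFCC + 12 delta-MFCC + delta-power.
import numpy as np
import librosa

def extract_features(y, sr):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, T): c0..c12
    delta = librosa.feature.delta(mfcc)                  # frame-wise deltas
    feats = np.vstack([mfcc[1:13],        # 12 MFCC
                       delta[1:13],       # 12 delta-MFCC
                       delta[0:1]])       # delta-power (delta of c0, assumed proxy)
    return feats.T                        # (T, 25)

sr = 16000
y = np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)  # toy signal
print(extract_features(y, sr).shape)      # (T, 25)
```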
First, the use information creating unit 904 creates use information 924 representing the category to which the user belongs (step S900). Fig. 36 shows examples of creating the use information 924. Fig. 36(a) shows an example of a selection screen of mobile phone 901. Here, by pressing the '4: adult' button, 'adult' (covering both adult women and adult men) is selected on mobile phone 901. Fig. 36(b) shows another example. Here, voice is input while the 'menu' button is pressed, and 'the user's voice data' is created as the use information 924 by converting the user's voice into feature values.
On the other hand, the specification information creating unit 907 creates specification information 925 according to the specification of the mobile phone 901 (step S901). Here, specification information 925 of the "number of mixed distributions 16" is created according to the size of the memory capacity of mobile phone 901.
Next, the similarity information creating unit 908 creates similarity information 926 based on the use information 924, the specification information 925, and the reference model 921 stored in the reference model storage unit 903 (step S902), and transmits the similarity information 926 to the reference model preparation unit 902. Here, the only reference model 921 existing in the reference model storage unit 903 is the child model (see fig. 34); since no reference model corresponding to 'adult' (corresponding to fig. 36(a)) as the use information 924 and 'number of mixture distributions 16' as the specification information 925 exists in the reference model storage unit 903, similarity information 926 indicating 'no similar reference model exists' is created and transmitted to the reference model preparation unit 902. In another example, the use information 924 is 'the user's voice data' (corresponding to fig. 36(b)), and the similarity information 926 is created by inputting the user's voice data to the child model stored in the reference model storage unit 903. Here, since the likelihood with respect to the child model is equal to or less than a predetermined threshold, similarity information 926 indicating 'no similar reference model exists' is created and sent to the reference model preparation unit 902.
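The likelihood-based branch of this step can be sketched as follows: score the user's voice data against the stored reference model and compare with a threshold. The GMM stand-in for the per-phoneme HMMs, the threshold value, and the data are illustrative assumptions.

```python
# Sketch: create similarity information by scoring the user's voice data
# against the stored reference model(s) and thresholding the result.
import numpy as np
from sklearn.mixture import GaussianMixture

def make_similarity_info(user_frames, stored_models, threshold=-40.0):
    """stored_models: dict name -> fitted GaussianMixture.
    Returns ('similar', name) or ('no similar reference model', None)."""
    best_name, best_score = None, -np.inf
    for name, model in stored_models.items():
        score = model.score(user_frames)          # mean log-likelihood per frame
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return ("similar", best_name)
    return ("no similar reference model", None)

rng = np.random.default_rng(2)
child_model = GaussianMixture(3, covariance_type="diag",
                              random_state=0).fit(rng.normal(0, 1, (300, 25)))
adult_voice = rng.normal(5, 1, (80, 25))          # far from the child model
print(make_similarity_info(adult_voice, {"child": child_model}))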
Next, the reference model preparation unit 902 determines whether or not to prepare a reference model based on the similarity information 926 (step S903). Here, since 'no similar reference model exists', the user is urged to prepare a reference model, as shown in the example screen display of mobile phone 901 in fig. 37(a). When the user presses the 'memo' button to request preparation of the reference model, the reference model preparation unit 902 determines that the reference model is to be prepared and transmits the use information 924 and the specification information 925 to the reference model reception unit 909. In another example, since 'no similar reference model exists', the reference model preparation unit 902 determines to prepare a reference model automatically and transmits the use information 924 and the specification information 925 to the reference model reception unit 909. Fig. 37(b) shows an example of the screen of mobile phone 901 at this time.
On the other hand, the reference model receiving unit 909 receives the reference models corresponding to the use information 924 and the specification information 925 from the server device 910, and then transmits them to the reference model preparation unit 902. Here, the reference model receiving unit 909 receives from the server device 910 two reference models, a 'model for adult females with 16 mixture distributions' and a 'model for adult males with 16 mixture distributions', as the reference models corresponding to 'adult' (corresponding to fig. 36(a)) of the use information 924 and 'number of mixture distributions 16' of the specification information 925.
Thereafter, the reference model preparation unit 902 stores the reference model transmitted from the reference model reception unit 909 in the reference model storage unit 903, thereby preparing a reference model (step S904). Fig. 35 shows an example of the reference model. Here, image diagrams of reference models for adult male, adult female, and child are shown.
Next, the reference model selecting unit 905 selects, from the reference models 921 stored in the reference model storage unit 903, the two reference models belonging to the category 'adult' of the use information 924, namely the 'model for adult females with 16 mixture distributions' and the 'model for adult males with 16 mixture distributions' (step S905). In another example, the reference model selecting unit 905 selects, from the reference models 921 stored in the reference model storage unit 903, the two reference models that are acoustically close (i.e., have a high likelihood) to 'the user's voice data' as the use information 924, namely the 'model for adult females with 16 mixture distributions' and the 'model for adult males with 16 mixture distributions'.
Next, the standard model creation unit 906 creates the standard model 922 based on the created specification information 925, so as to maximize or locally maximize the probability or likelihood with respect to the reference models 923 selected by the reference model selection unit 905 (step S906).
Finally, the speech recognition unit 913 recognizes the speech of the user input from the microphone 912 based on the standard model 922 created by the standard model creation unit 906 (step S907).
Next, the detailed procedure of step S906 (creation of the standard model) in fig. 33 will be described. The flow of steps is the same as the flow chart shown in fig. 4. However, the structure of the standard model used, the specific approximation calculation, and the like are different.
First, the standard model structure determination unit 906a determines the structure of the standard model (step S102a of fig. 4). Here, based on 'number of mixture distributions 16' as the specification information 925, the structure of the standard model is determined to be an HMM for each phoneme, with 3 states and 16 mixture distributions (Mf = 16) in the output distribution of each state.
Next, the initial standard model creation unit 906b specifies the statistic initial value for calculating the standard model (step S102b in fig. 4). Here, the "model for adult female with the mixed distribution number of 16" as the reference model 923 selected is stored in the statistic storage unit 906c as the initial value of the statistic. In another example, the "model for adult male with the mixed distribution number 16" as the selected reference model 923 is stored in the statistic storage unit 906c as an initial value of the statistic. Specifically, the initial standard model creation unit 906b generates an output distribution shown in equation 13.
Then, the statistic estimation unit 906d estimates the statistics of the standard model stored in the statistic storage unit 906c using the 2 reference models 923 selected by the reference model selection unit 905 (step S102c in fig. 4). That is, the statistics of the standard model (the mixture weighting coefficients shown in expression 16, the means shown in expression 17, and the variance values shown in expression 18) that maximize or locally maximize the probability of the output distributions of the reference models 923 with respect to the output distribution of the standard model shown in expression 19 (here, the likelihood logP shown in expression 25) are estimated. However, in the present embodiment, the number of mixture distributions appearing in expression 21 of the output distribution shown in expression 19 is 16 (the number of mixture distributions of each reference model).
Specifically, the mixture weight coefficient, the average value, and the variance value of the standard model are calculated from equations 26, 27, and 28, respectively.
At this time, the 3rd approximation unit 906e of the statistic estimation unit 906d uses the approximation formula of expression 53, assuming that the gaussian distributions of the standard model do not affect each other. When the repetition number R is 1, the neighborhood (expression 55) of each gaussian distribution of the standard model shown in expression 54 is approximated by the space occupied by the two gaussian distributions of the reference models 923 shown in expression 56 that are nearest to the output distribution of expression 54 in terms of an inter-distribution distance such as the mahalanobis distance or the KL (Kullback-Leibler) distance (vicinity indication parameter G = 2). On the other hand, when the repetition number R is 2 or more, the neighborhood (expression 55) of each gaussian distribution of the standard model shown in expression 54 is approximated by the space occupied by the single nearest gaussian distribution of the reference models 923 shown in expression 56 in terms of the same inter-distribution distance (vicinity indication parameter G = 1).
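The neighborhood selection itself can be sketched as follows: for each Gaussian of the standard model, keep only the G reference-model Gaussians nearest in KL distance (G = 2 on the first iteration, G = 1 afterwards). This sketch covers only the neighborhood selection, not the full statistic update of expressions 59-61; diagonal covariances and all variable names are assumptions.

```python
# Sketch: neighborhood approximation of the 3rd approximation unit.
import numpy as np

def kl_diag_gauss(mu0, var0, mu1, var1):
    """KL( N(mu0,var0) || N(mu1,var1) ) for diagonal-covariance Gaussians."""
    return 0.5 * np.sum(var0 / var1 + (mu1 - mu0) ** 2 / var1
                        - 1.0 + np.log(var1 / var0))

def nearest_reference_gaussians(std_mu, std_var, ref_mus, ref_vars, g):
    """Indices of the g reference Gaussians closest to one standard-model Gaussian."""
    dists = np.array([kl_diag_gauss(std_mu, std_var, m, v)
                      for m, v in zip(ref_mus, ref_vars)])
    return np.argsort(dists)[:g]

def neighborhoods(std_mus, std_vars, ref_mus, ref_vars, iteration):
    g = 2 if iteration == 1 else 1            # vicinity indication parameter G
    return [nearest_reference_gaussians(m, v, ref_mus, ref_vars, g)
            for m, v in zip(std_mus, std_vars)]

rng = np.random.default_rng(3)
ref_mus, ref_vars = rng.normal(size=(16, 25)), rng.uniform(0.5, 2.0, (16, 25))
std_mus, std_vars = rng.normal(size=(16, 25)), rng.uniform(0.5, 2.0, (16, 25))
print(neighborhoods(std_mus, std_vars, ref_mus, ref_vars, iteration=1)[0])
```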
When the above approximation formulas of the 3rd approximation unit 906e are taken together, the calculation formulas of the statistic estimation unit 906d are as follows. That is, the statistic estimation unit 906d calculates the mixture weighting coefficients, the means, and the variance values from expressions 59, 60, and 61, respectively, and generates the standard model specified by these parameters as the final standard model 922. As in the 2nd method of embodiment 3, a method is used in which the value of the mixture weighting coefficient is set to zero, the mean is set to zero, and the variance value is set to 1. The value of the vicinity indication parameter G differs depending on the number of repetitions. In addition, one of the 1st to 3rd methods in embodiment 3 may be chosen depending on the value of the vicinity indication parameter G.
The statistic amount estimating unit 906d stores the statistic amount of the standard model estimated in this manner in the statistic amount storage unit 906 c. Thereafter, the estimation of the statistic and the storage into the statistic storage unit 906c are repeated R (≧ 1) times. As a result, the obtained statistic is output as the statistic of the standard model 922 finally generated.
Fig. 38 shows the result of a recognition experiment using the standard model 922 created using the 3 rd approximation unit 906 e. The vertical axis shows the recognition rate (%) of adults (male and female), and the horizontal axis shows the number of repetitions R. The repetition number R of 0 is a result of recognition by the initial model created by the initial standard model creation unit 906b before learning. In addition, when the repetition number R is 1, the vicinity indication parameter G is 2, and when the repetition number R is 2 to 5, the vicinity indication parameter G is 1.
The curve 'data' indicates the result of learning with voice data over several days, and the curves 'female' and 'male' indicate the results when the initial model is set to the adult female model and the adult male model, respectively. The learning time of the present invention based on the reference models is of the order of tens of seconds. The experimental results show that a high-precision standard model can be produced in a short time.
Here, for reference, fig. 39 shows the recognition rate based on the standard model created by the 2nd approximation unit 306e in embodiment 3. The difference from the 3rd approximation unit 906e in the present embodiment is that the vicinity indication parameter G is 1 regardless of the repetition number R. From the experimental results, it can be seen that good results are obtained when the adult female model is selected as the initial model, whereas the accuracy is slightly degraded when the adult male model is selected as the initial model. Combining this with the results of fig. 38, it can be seen that the standard model based on the 3rd approximation unit 906e does not depend on the initial model, and a high-precision standard model can be created.
As described above, according to embodiment 8 of the present invention, since the reference model is prepared based on the similarity information, it is possible to prepare the reference model suitable for the utilization information and the specification information at a necessary timing. In addition, by changing the vicinity indication parameter G by the repetition number R, a highly accurate standard model can be provided regardless of the initial model.
The number of repetitions of the process performed by the statistic estimation unit 906d may be the number of times until the likelihood represented by expression 25 becomes equal to or greater than a predetermined threshold value.
The standard model 922 may be formed of a context-dependent HMM, not limited to the HMM for each phoneme.
The standard model creation unit 906 may create models only for the event output probabilities of some states of some phonemes.
The HMM constituting the standard model 922 may be configured with a different number of states for each phoneme, or may be configured with a gaussian mixture distribution having a different number of distributions for each state.
In addition, after the standard model is created, speech data can be further used to train it.
Further, the standard model structure determination section may determine an HMM structure such as a monophone, a triphone, or a state-sharing type, the number of states, and the like.
(embodiment 9)
Fig. 40 is a block diagram showing the entire configuration of a standard model creation apparatus according to embodiment 9 of the present invention. Here, an example in which the standard modeling apparatus of the present invention is incorporated in a PDA (Personal digital assistant) 1001 is shown. Next, in the present embodiment, a case where a standard model for speech recognition is created will be described as an example.
The PDA1001 is a portable information terminal, and is a standard model creation device for creating a standard model for speech recognition defined by a hidden markov model expressed by a set of events and output probabilities of events or transitions between events, and includes a reference model storage unit 1003, a standard model creation unit 1006, an application and specification information correspondence database 1014, a microphone 1012, and a speech recognition unit 1013. The standard model creation unit 1006 includes a standard model structure determination unit 1006a, an initial standard model creation unit 1006b, a statistic storage unit 306c, and a statistic estimation unit 306d.
The standard model creation unit 1006 acquires the specification information 1025 from the application and specification information correspondence database 1014, based on the transmitted application startup information 1027 (here, the ID number of the application to be started). Fig. 41 shows an example of the data in the application and specification information correspondence database 1014. The database registers the specification information (here, the number of mixture distributions) corresponding to each application program (ID number and name).
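The database lookup can be pictured as a simple table mapping application IDs to specification information. Only the 'stock trading' entry (ID 3, 126 mixture distributions) is taken from the text; the other entries are illustrative assumptions.

```python
# Sketch: application / specification-information correspondence database.
APP_SPEC_DB = {
    1: {"name": "schedule",       "n_mix": 16},
    2: {"name": "mail dictation", "n_mix": 64},
    3: {"name": "stock trading",  "n_mix": 126},
}

def specification_for(app_id):
    return APP_SPEC_DB[app_id]

print(specification_for(3))   # {'name': 'stock trading', 'n_mix': 126}
```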
The standard model creation unit 1006 is a processing unit that creates a standard model 1022 based on the acquired specification information 1025, so as to maximize or locally maximize the probability or likelihood with respect to the single reference model 1021 stored in the reference model storage unit 1003, and has the function of the 2nd approximation unit 306e according to embodiment 3.
The speech recognition unit 1013 recognizes the speech of the user input from the microphone 1012 using the standard model 1022 created by the standard model creation unit 1006.
Next, the operation of PDA1001 configured as described above will be described.
Fig. 42 is a flowchart showing the operation procedure of PDA 1001.
Here, one user model having a large number of mixture distributions is stored in the reference model storage unit 1003 as the reference model 1021. The reference model 1021 is composed of HMMs for each phoneme. Fig. 43 shows an example of the reference model 1021. The reference model constitutes the output distribution of the HMM from a mixture gaussian distribution having 3 states and 300 distributions in each state. As the feature amount, a 25-dimensional (J = 25) feature consisting of 12-dimensional mel-frequency cepstrum coefficients, 12-dimensional δ mel-frequency cepstrum coefficients, and δ power is used.
First, the user starts an application program, for example, so-called 'stock trading' (step S1000).
On the other hand, the standard model creation unit 1006 receives the ID '3' of the started application as the application startup information 1027 (step S1001). Then, using the application and specification information correspondence database 1014, it creates the standard model 1022 based on 'number of mixture distributions 126' as the specification information 1025 corresponding to the ID '3' (step S1002). Specifically, the standard model 1022 is composed of a 3-state context-dependent HMM with 126 mixture distributions (Mf = 126).
In other words, the standard model creation unit 1006 acquires the specification information 1025 (step S1001) and creates the standard model 1022 based on the specification information 1025 (step S1002).
Finally, the speech recognition unit 1013 recognizes the speech of the user input from the microphone 1012 based on the standard model 1022 created by the standard model creation unit 1006 (step S1003).
Next, the detailed procedure of step S1002 (creation of the standard model) in fig. 42 will be described. The flow of steps is the same as the flow chart shown in fig. 4. However, the structure of the standard model used, the specific approximation calculation, and the like are different.
First, after receiving the application ID '3' as the application startup information 1027, the standard model structure determination unit 1006a looks up the specification information 1025 ('number of mixture distributions 126') corresponding to the ID '3' in the application and specification information correspondence database 1014, and determines the structure of the standard model to be a 3-state context-dependent HMM with 126 mixture distributions (Mf = 126) (step S102a in fig. 4).
Next, the initial standard model creation unit 1006b determines the initial values of the statistics for calculating the standard model, based on the structure of the standard model determined by the standard model structure determination unit 1006a (step S102b in fig. 4). Here, the values obtained by the clustering described later, which uses the k-means method with the Mahalanobis distance, are stored as the initial values of the statistics in the statistic storage unit 306c.
Then, the statistic estimation unit 306d estimates the statistic of the standard model stored in the statistic storage unit 306c using the reference model 1021 stored in the reference model storage unit 1003 (step S102c in fig. 4). The estimation process of the statistic estimator 306d is the same as that of embodiment 3.
Next, the initial value determination method used by the initial standard model creation unit 1006b, namely clustering using the k-means method with the Mahalanobis distance, will be described. Fig. 44 shows a flowchart of the clustering, and schematic diagrams of the clustering are shown in figs. 45-48.
First, in step S1004 of fig. 44, 126 representative points, equal to the number of mixture distributions of the standard model, are prepared (fig. 45). Here, 126 output distributions are selected from the 300 output distributions of the reference model, and the mean of each selected distribution is used as a representative point.
Then, in step S1005 in fig. 44, the output distributions of the reference model that are close in Mahalanobis distance to each representative point are determined (fig. 46). Thereafter, in step S1006 of fig. 44, the nearby distributions determined in step S1005 are expressed by one gaussian distribution, and its mean is set as the new representative point (fig. 47).
Thereafter, in step S1007 in fig. 44, it is determined whether or not to stop the clustering. Here, the operation is stopped when the rate of change of the Mahalanobis distance between each representative point and its assigned reference distributions (the difference from the distance at the previous iteration) becomes equal to or less than a threshold value. If the stop condition is not satisfied, the process returns to step S1005 in fig. 44, the nearby distributions are determined again, and the same operation is repeated.
On the other hand, when the stop condition is satisfied, the process proceeds to step S1008 in fig. 44, and the initial value of the statistic is determined and stored in the statistic storage unit 306 c. In this way, cluster-based initial value determination is performed.
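The clustering of steps S1004-S1008 can be sketched as follows: reduce the 300 output distributions of the reference model to 126 representative Gaussians by repeatedly assigning each reference distribution to its nearest representative (Mahalanobis distance to the representative's mean) and re-estimating each representative as a single moment-matched Gaussian. Diagonal covariances, the stopping tolerance, and the random data are illustrative assumptions.

```python
# Sketch: k-means-style clustering of reference-model Gaussians for the
# initial values of the standard model statistics.
import numpy as np

def mahalanobis(x, mu, var):
    return np.sqrt(np.sum((x - mu) ** 2 / var))

def merge_gaussians(mus, variances):
    """Moment-match equally weighted diagonal Gaussians into one Gaussian."""
    mu = mus.mean(axis=0)
    var = (variances + mus ** 2).mean(axis=0) - mu ** 2
    return mu, var

def cluster_reference(ref_mus, ref_vars, n_rep=126, tol=1e-3, max_iter=50):
    rng = np.random.default_rng(0)
    idx = rng.choice(len(ref_mus), n_rep, replace=False)    # step S1004
    rep_mus, rep_vars = ref_mus[idx].copy(), ref_vars[idx].copy()
    prev_dist = np.inf
    for _ in range(max_iter):
        # step S1005: nearest representative for every reference distribution
        assign = np.array([np.argmin([mahalanobis(m, rm, rv)
                                      for rm, rv in zip(rep_mus, rep_vars)])
                           for m in ref_mus])
        # step S1006: re-express each cluster as a single Gaussian
        for k in range(n_rep):
            members = assign == k
            if members.any():
                rep_mus[k], rep_vars[k] = merge_gaussians(ref_mus[members],
                                                          ref_vars[members])
        # step S1007: stop when the mean distance hardly changes any more
        dist = np.mean([mahalanobis(ref_mus[i], rep_mus[assign[i]], rep_vars[assign[i]])
                        for i in range(len(ref_mus))])
        if abs(prev_dist - dist) <= tol:
            break
        prev_dist = dist
    return rep_mus, rep_vars                                 # step S1008

rng = np.random.default_rng(4)
mus, variances = rng.normal(size=(300, 25)), rng.uniform(0.5, 2.0, (300, 25))
rep_mus, rep_vars = cluster_reference(mus, variances)
print(rep_mus.shape)   # (126, 25)
```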
As described above, according to embodiment 9 of the present invention, a standard model suitable for specification information can be automatically obtained in conjunction with an application program.
The standard model 1022 may form an HMM for each phoneme.
The standard model creation unit 1006 may create models only for the event output probabilities of some states of some phonemes.
The HMM constituting the standard model 1022 may be constituted by a different number of states for each phoneme, or may be constituted by a gaussian mixture distribution having a different number of distributions for each state.
In addition, after the standard model is created, speech data can be further used to train it.
Further, the standard model structure determination section may determine an HMM structure such as a monophone, a triphone, or a state-sharing type, the number of states, and the like.
(embodiment 10)
Fig. 49 is a block diagram showing the entire configuration of a standard model creation apparatus according to embodiment 10 of the present invention. Here, an example in which the standard modeling apparatus of the present invention is incorporated in a server 801 in a computer system is shown. In the present embodiment, a case where a standard model (adaptive model) for speech recognition is created will be described as an example.
The server 801 is a computer device in a communication system or the like, and includes a reading unit 711, a reference model preparation unit 702, a reference model storage unit 703, a usage information receiving unit 704, a reference model selecting unit 705, a standard model creation unit 706, a specification information receiving unit 707, a standard model storage unit 708, a standard model transmitting unit 709, and a reference model receiving unit 810 as a standard model creation device that creates a standard model for speech recognition defined by a set of events and output probabilities of events or transitions between events.
The reference model preparation unit 702 sends the reference model for speech recognition, recorded on a medium such as a CD-ROM and read by the reading unit 711, to the reference model storage unit 703, and the reference model storage unit 703 stores the transmitted reference model 721. The reference model preparation unit 702 also sends the reference model for speech recognition received by the reference model receiving unit 810, in response to transmission from the terminal device 712, to the reference model storage unit 703, and the reference model storage unit 703 stores it as a reference model 721.
The specification information receiving unit 707 receives specification information 725 from the terminal device 712. The use information receiving unit 704 receives, as the use information 724, the user's voice uttered under noise from the terminal device 712. The reference model selecting unit 705 selects, from the reference models 721 stored in the reference model storage unit 703, reference models 723 whose speaker, noise, and tone are acoustically close to the user's voice received as the use information 724 by the use information receiving unit 704.
The standard model creation unit 706 is a processing unit that creates the standard model 722 based on the specification information 725 so as to maximize or locally maximize the probability or likelihood with respect to the reference models 723 selected by the reference model selection unit 705, and has the same function as the standard model creation unit 206 of embodiment 2. The standard model storage unit 708 stores one or more standard models created according to the specification information 725. Upon receiving the specification information 725 and a request signal for a standard model from the user's terminal device 712, the standard model transmitting unit 709 transmits a standard model suited to that specification to the terminal device 712.
Next, the operation of the server 801 configured as described above will be described.
Fig. 50 is a flowchart showing the operation procedure of server 801. An example of a reference model and a standard model for describing the operation procedure of the server 801 is the same as that shown in fig. 31 in embodiment 7.
First, before a standard model is created, reference models to serve as the basis of the standard model are prepared (steps S800 and S801 in fig. 50). That is, the reference model preparation unit 702 sends the reference models for speech recognition classified by speaker, noise, and tone, recorded on a medium such as a CD-ROM and read by the reading unit 711, to the reference model storage unit 703, and the reference model storage unit 703 stores the transmitted reference models 721 (step S800 in fig. 50). Here, the reference models 721 are composed of HMMs for each speaker, noise, and tone. The reference model preparation unit 702 also sends the reference model for speech recognition suited to the user and the terminal device 712, which is received by the reference model receiving unit 810 after being transmitted from the terminal device 712, to the reference model storage unit 703, and the reference model storage unit 703 stores it (step S801 in fig. 50). Here, as shown by the reference models 721 in fig. 31, each reference model is an HMM with 3 states, and the output distribution of each state is a Gaussian mixture distribution with 128 mixture components. Mel-frequency cepstrum coefficients of 25 dimensions (J = 25) are used as the feature amount.
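For reference, 25-dimensional mel-frequency cepstrum coefficients of the kind mentioned above can be computed with, for example, the librosa package; the choice of toolkit and sampling rate here is an assumption of this sketch, not something the patent prescribes.

```python
import numpy as np
import librosa

def extract_mfcc(wav_path: str, n_mfcc: int = 25) -> np.ndarray:
    """Return a (frames, 25) matrix of MFCC feature vectors (J = 25)."""
    y, sr = librosa.load(wav_path, sr=16000)           # speech sampled at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                       # one 25-dim vector per frame
```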
Next, the operations of creating the standard model 722 using the reference models 721 and transmitting it to the terminal device 712 (steps S802 to S809) are the same as those in embodiment 7 (steps S701 to S708 in fig. 30).
Since the terminal device 712 can upload its own model to the server as material for creating the standard model in this way, the server 801 can integrate the uploaded reference model with other reference models it already holds, create a high-precision standard model having a larger number of mixtures, and download it to the terminal device 712 for use. Therefore, by adding only a simple adaptation function to the terminal device 712 and uploading the adapted model, a standard model with higher accuracy can easily be created.
Fig. 51 is a diagram showing an example of a system specifically applied to the standard modeling apparatus according to the present embodiment. Here, a server 701 and a terminal device 712 (a mobile phone 712a, a car navigation device 712b) which communicate via the internet, wireless communication, or the like are shown.
For example, the mobile phone 712a requests the creation of a standard model by transmitting to the server 701 the user's voice as the use information, information indicating use on a mobile phone (low CPU processing capability) as the specification information, and a model stored in advance in the device as the reference model. When the server 701 creates a standard model in response to the request, the mobile phone 712a downloads the standard model and recognizes the user's voice with it. For example, when the user's voice matches a name in an internally held address book, the telephone number corresponding to that name is automatically dialed.
Similarly, the car navigation device 712b requests the creation of a standard model by transmitting to the server 701 the user's voice as the use information, information indicating use on a car navigation device (average CPU processing capability) as the specification information, and a model stored in advance in the device as the reference model. When the server 701 creates a standard model in response to the request, the car navigation device 712b downloads the standard model and recognizes the user's voice with it. For example, when the user's voice matches an internally held place name, a map showing the route from the current location to the destination having that place name is automatically displayed on the screen.
In this way, the mobile phone 712a and the car navigation device 712b can request the server 701 to create standard models suited to their own devices, and can obtain standard models for various recognition targets at the necessary timing without incorporating the circuits or processing programs required for standard model creation in the devices themselves.
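The terminal-side exchange just described can be pictured with the following sketch. The endpoint URL, field names, and payload format are hypothetical assumptions for illustration; the patent only requires that use information, specification information, and a reference model be transmitted over a communication path and that a standard model be returned.

```python
import requests  # hypothetical HTTP transport; any communication path would do

SERVER_URL = "http://example.com/standard-model"  # placeholder address

def request_standard_model(user_voice_path, spec_info, reference_model_path):
    """Upload use information (user voice), specification information and the
    terminal's own reference model, then download the created standard model."""
    with open(user_voice_path, "rb") as voice, open(reference_model_path, "rb") as ref:
        response = requests.post(
            SERVER_URL,
            data={"specification": spec_info},           # e.g. "cpu=low,memory=small"
            files={"use_information": voice, "reference_model": ref},
            timeout=60,
        )
    response.raise_for_status()
    return response.content  # serialized standard model, stored on the terminal

# Example: a mobile phone requesting a model suited to a low-power CPU.
# model_bytes = request_standard_model("user_voice.wav", "cpu=low", "own_model.bin")
```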
As described above, according to embodiment 10 of the present invention, since the standard model is created using the reference model received by the reference model receiving unit 810, a standard model with high accuracy can be provided. That is, as reference models are added by uploading from the terminal device 712, the variety of reference models held on the server 801 side increases, and a standard model with higher accuracy can be provided also when the device is used by another person.
In addition, since the standard model is created based on the specification information, a standard model suitable for the equipment that will use it is prepared.
The reference model receiving unit 810 may receive a reference model from a terminal device other than the terminal device 712.
The application example shown in fig. 51 is not limited to this embodiment, and can be applied to other embodiments. That is, by distributing the standard models created in embodiments 1 to 9 to various electronic devices via various recording media or communication, these electronic devices can perform highly accurate voice recognition, image recognition, intention understanding, and the like. Further, by incorporating the standard model creation device in the above-described embodiment in various electronic devices, it is possible to realize an independent electronic device equipped with recognition and authentication functions such as voice recognition, image recognition, and intention understanding.
The standard modeling apparatus according to the present invention has been described above with reference to the embodiments, but the present invention is not limited to these embodiments.
For example, the approximate calculation of the statistics of the standard model in embodiments 1 to 10 is not limited to the approximate calculation described in each embodiment; any of the four kinds of approximate calculation described in embodiments 1 to 4 may be used, either singly or in a combination of two or more.
In embodiment 2, the general approximation unit 206e of the statistic estimation unit 206d calculates the mixture weight coefficient, the average value, and the variance value of the standard model from the approximate expressions shown in expressions 45, 46, and 47, respectively, but may calculate the mixture weight coefficient, the average value, and the variance value using the approximate expressions shown in expressions 63, 64, and 65, instead of these approximate expressions.
(Formula 63) — approximate expression for the mixture weight coefficients (m = 1, 2, ..., M_f)
(Formula 64) — approximate expression for the mean values (m = 1, 2, ..., M_f; j = 1, 2, ..., J)
(Formula 65) — approximate expression for the variance values (m = 1, 2, ..., M_f; j = 1, 2, ..., J)
The inventors have confirmed that high recognition performance can be obtained with a standard model created using these approximate expressions. For example, when the number of mixtures of both the reference model and the standard model is 16, the recognition rate is 82.2% before adaptation and 85.0% with the method based on sufficient statistics shown in non-patent document 2, and improves to 85.5% with the method based on the above approximate expressions. That is, higher recognition performance is obtained than with the method based on sufficient statistics. In addition, when the number of mixtures of the reference models is 64 and the number of mixtures of the standard model is 16, the method based on the above approximate expressions achieves a recognition rate as high as 85.7%.
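Although the bodies of formulas 63 to 65 are not reproduced in this text, a computation of this general kind — collapsing several weighted reference Gaussians into one standard-model Gaussian by matching the zeroth, first, and second moments — can be sketched as follows. This is an illustrative assumption about the approach, not the patent's exact approximate expressions.

```python
import numpy as np

def merge_gaussians(weights, means, variances):
    """Collapse weighted diagonal Gaussians (weights of shape (N,), means and
    variances of shape (N, J)) into a single Gaussian by moment matching."""
    w = np.asarray(weights, dtype=float)
    w_sum = w.sum()                                    # mixture weight of the merged component
    w = w / w_sum
    mu = (w[:, None] * means).sum(axis=0)              # weighted mean
    # Weighted second moment E[x^2] minus the squared merged mean.
    var = (w[:, None] * (variances + means ** 2)).sum(axis=0) - mu ** 2
    return w_sum, mu, np.maximum(var, 1e-6)
```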
In addition, the initial standard model creating unit may create the initial standard model by preparing a classification ID-initial standard model-reference model correspondence table shown in fig. 52 and specifying the initial standard model from the table. Next, a method of specifying the initial standard model using the classification ID-initial standard model-reference model correspondence table will be described. The classification ID is an ID for identifying a type of an identification target using a standard model, and corresponds to the type of the standard model.
The classification ID-initial standard model-reference model correspondence table shown in fig. 52 associates a plurality of reference models having predetermined common properties with one classification ID that identifies them, and with an initial standard model, created in advance, that has the properties common to those reference models. In this table, the reference models 8AA to 8AZ are associated with one classification ID and the initial standard model 8A, and the reference models 64ZA to 64ZZ are associated with another classification ID and the initial standard model 64Z. The standard model creation unit can generate a high-precision standard model by using an initial standard model having the same properties as the reference models used.
Here, in symbols such as 8A and 8AA attached to the classification IDs, initial standard models, and reference models, the leading number ('8' and so on) indicates the number of mixture distributions; the next symbol ('A' and so on) indicates a large classification, for example the type of noise environment in the case of speech recognition under noise (A for home noise, B for in-car noise, and so on); and the last symbol ('A' and so on) indicates a small classification, for example the attribute of the person whose speech is recognized (A for elementary school pupils in the lower grades, B for pupils in the higher grades, and so on). Accordingly, the reference models 8AA to 8AZ in the classification ID-initial standard model-reference model correspondence table of fig. 52 are the models with 8 mixture distributions shown in fig. 53, the reference models 64ZA to 64ZZ are the models with 64 mixture distributions shown in fig. 54, and the initial standard models 8A to 64Z are the models shown in fig. 55, whose numbers of mixture distributions are given by their leading numbers.
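Under the naming scheme just described, a classification ID such as '8AA' or '64Z' can be decomposed programmatically. The sketch below assumes the convention as stated — leading digits for the mixture count, one large-classification letter, and an optional small-classification letter — and is an illustration of that convention only, not code from the patent.

```python
import re

def parse_classification_id(cid: str):
    """Split e.g. '64ZA' into (mixture count, large class, small class)."""
    m = re.fullmatch(r"(\d+)([A-Z])([A-Z]?)", cid)
    if m is None:
        raise ValueError(f"unrecognized classification ID: {cid}")
    mixtures, large, small = m.groups()
    return int(mixtures), large, small or None

# parse_classification_id("8AA")  -> (8, 'A', 'A')    reference model ID
# parse_classification_id("64Z")  -> (64, 'Z', None)  initial standard model ID
```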
Next, a method of creating the classification ID-initial standard model-reference model correspondence table will be described. Fig. 56 is a flowchart showing the steps, and figs. 57 to 60 show specific examples of the steps. Here, the steps are described taking speech recognition in a noisy environment as an example, and not only the table but also the classification IDs, the initial standard models, and the reference models are created together.
First, the speech data are classified into acoustically close groups (step S1100 of fig. 56). For example, as shown in fig. 57, the speech data are classified according to the noise environment as the use information. Environment A (speech data under home noise) contains the voices of elementary school pupils in the lower grades, pupils in the higher grades, adult women, and so on recorded under home noise, and environment B (speech data in a train) contains the voices of pupils in the lower grades, pupils in the higher grades, adult women, and so on recorded in a train. The data may also be classified by use information such as the gender or age group of the speaker, the nature of the voice such as laughing or angry voices, the tone such as reading tone or conversational tone, or the language such as English or Chinese.
Next, one or more model structures of the reference models to be prepared are determined based on the specification information and the like (step S1101 in fig. 56). For example, 8-mixture, 16-mixture, 32-mixture, and 64-mixture structures are determined as targets. In determining the model structure, not only the number of mixture distributions but also the number of HMM states and the kind of HMM, such as monophone or triphone, can be determined.
Next, initial standard models are created (step S1102 in fig. 56). That is, an initial standard model for each model structure determined in step S1101 is created for each classification (environment A, environment B, ...) determined in step S1100. For example, as shown in fig. 58, the initial standard model 8A, an 8-mixture initial standard model, is created by learning with the Baum-Welch algorithm or the like using the speech data under home noise (environment A), that is, the speech data of lower-grade pupils, higher-grade pupils, adult men, adult women, and so on.
Next, reference models are created (step S1103 in fig. 56). That is, the reference models are created using the initial standard models created in step S1102. Specifically, each reference model is trained using as its initial value the initial standard model that has the same number of mixture distributions and was learned in the same noise environment as the speech data of that reference model. For example, as shown in fig. 59, the reference model 8AA is trained with 8 mixture distributions using the voices of lower-grade elementary school pupils recorded under home noise, and the initial standard model learned under the same environment (using speech under home noise, including the voices of lower-grade pupils, higher-grade pupils, adult women, and adult men) is used as the initial value for this training. The Baum-Welch algorithm is used as the learning method.
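Step S1103 amounts to running Baum-Welch re-estimation on the group's speech data, starting from the matching initial standard model. With the hmmlearn package (an assumed toolkit for this sketch; the patent does not prescribe any particular implementation), that could look roughly as follows.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_reference_model(initial_model: GMMHMM, features, lengths, n_iter=20):
    """Baum-Welch training of a reference model (e.g. 8AA) starting from an
    initial standard model (e.g. 8A) learned in the same noise environment.
    features: (total_frames, 25) MFCC matrix, lengths: frames per utterance."""
    ref = GMMHMM(
        n_components=initial_model.n_components,   # e.g. 3 HMM states
        n_mix=initial_model.n_mix,                  # e.g. 8 mixture components
        covariance_type="diag",
        n_iter=n_iter,
        init_params="",                             # keep the supplied initial values
    )
    # Copy the initial standard model's parameters as the starting point.
    ref.startprob_ = initial_model.startprob_.copy()
    ref.transmat_ = initial_model.transmat_.copy()
    ref.weights_ = initial_model.weights_.copy()
    ref.means_ = initial_model.means_.copy()
    ref.covars_ = initial_model.covars_.copy()
    ref.fit(np.asarray(features), lengths)          # Baum-Welch re-estimation
    return ref
```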
Finally, the classification IDs are assigned (step S1104 of fig. 56). For example, by assigning a classification ID to each noise environment, the classification ID-initial standard model-reference model correspondence table shown in fig. 61, that is, the "initial standard models with classification IDs" and the "reference models with classification IDs", can be created.
The classification ID-initial standard model-reference model correspondence table need not be held in advance by the terminal (standard model creation device) as a completed table. As shown in fig. 61, the terminal (standard model creation device) may communicate with other devices (servers) to complete the table. That is, the standard model creation device (terminal) can acquire the "initial standard models with classification IDs" and the "reference models with classification IDs" via a communication network or the like. The terminal may also, of course, store them in advance and be shipped from the factory with them, without acquiring them over the network.
As shown in fig. 61, the terminal can acquire the "initial standard models with classification IDs" and the "reference models with classification IDs" by the following methods. In the first method, the terminal stores the "initial standard models with classification IDs" (for example, initial standard models complying with a classification ID assignment scheme defined in advance by a standardization association or the like). The terminal then downloads "reference models with classification IDs" (for example, reference models complying with the same classification ID assignment scheme) from one or more servers. Alternatively, the terminal may store the "reference models with classification IDs" at the time of shipment.
In the second method, the terminal does not store the "initial standard models with classification IDs". In this case, the terminal downloads the "initial standard models with classification IDs" from a server (server 1 in fig. 61), and thereafter downloads the "reference models with classification IDs" from one or more servers (server 2 in fig. 61). With this method, the definition of the classification IDs can be added to and changed as necessary, and the memory of the terminal can be saved.
In the third method, the terminal stores a "classification ID-initial standard model-reference model correspondence table" describing the correspondence between the classification IDs, the initial standard models, and the reference models. The terminal uploads this correspondence table to a server that does not store it (server 3 in fig. 61). The server prepares "reference models with classification IDs" based on the transmitted correspondence table, and the terminal downloads the prepared "reference models with classification IDs".
Next, a method of specifying the initial standard model by the initial standard model creating unit using the classification ID-initial standard model-reference model correspondence table will be described. FIG. 62 is a flowchart showing the procedure. Fig. 63 and 64 are diagrams showing specific examples of the steps.
First, classification IDs are extracted from the reference models used for creating the standard model (step S1105 in fig. 62). For example, the corresponding classification IDs are extracted from the selected reference models based on the table shown in fig. 63. Here, the extracted classification IDs are one 8A, three 16A, one 16B, and one 64B.
Next, the initial standard model used for creating the standard model is determined using the extracted classification IDs (step S1106 of fig. 62). Specifically, the initial standard model is determined according to the following steps (a small sketch summarizing the selection rule follows the list).
(1) Attention is paid to the classification IDs (16A, 16B) extracted from reference models whose mixture distribution number matches that of the standard model to be created (16 mixtures, i.e., IDs of the form 16*), and the initial standard model corresponding to the most frequently extracted classification ID is determined as the final initial standard model. For example, when the standard model is configured with 16 mixtures, three 16A and one 16B are extracted as classification IDs relating to 16 mixtures, so the initial standard model with classification ID 16A is used.
(2) Similarly, when the standard model is configured with 8 mixtures, attention is paid to the classification ID (8A) extracted from the reference model whose mixture distribution number matches (IDs of the form 8*), and the initial standard model with that classification ID is determined as the final initial standard model. In this example, one 8A is extracted as the classification ID relating to 8 mixtures, so the initial standard model with classification ID 8A is used.
(3) When no classification ID matching the mixture distribution number of the standard model to be created (32 mixtures, i.e., IDs of the form 32*) has been extracted, the initial standard model corresponding to the most frequently extracted classification ID is used and clustered into 32 mixtures to obtain the final initial standard model (see fig. 44). For example, when the standard model is configured with 32 mixtures, no classification ID relating to 32 mixtures is extracted, so the most frequently extracted classification ID (16A) is used, and its initial standard model is clustered into 32 mixtures and then set as the initial standard model.
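The selection rules (1) to (3) above can be summarized in a short sketch. The helper below is hypothetical and illustrates only the majority-vote logic; the actual clustering to the target mixture count (rule (3)) is the procedure of fig. 44.

```python
from collections import Counter

def choose_initial_standard_model(extracted_ids, target_mixtures):
    """extracted_ids: classification IDs taken from the selected reference
    models, e.g. ['8A', '16A', '16A', '16A', '16B', '64B'].
    Returns (classification ID of the initial model, needs_clustering flag)."""
    def mixtures(cid):                        # leading digits give the mixture count
        return int("".join(ch for ch in cid if ch.isdigit()))

    counts = Counter(extracted_ids)
    # Rules (1) and (2): among IDs whose mixture count equals that of the
    # standard model to be created, take the most frequently extracted one.
    matching = {cid: n for cid, n in counts.items() if mixtures(cid) == target_mixtures}
    if matching:
        return max(matching, key=matching.get), False
    # Rule (3): no matching mixture count, so take the most frequently
    # extracted ID overall and cluster its model to the target size (cf. fig. 44).
    return counts.most_common(1)[0][0], True

# choose_initial_standard_model(['8A', '16A', '16A', '16A', '16B', '64B'], 16)
#   -> ('16A', False)
# choose_initial_standard_model(['8A', '16A', '16A', '16A', '16B', '64B'], 32)
#   -> ('16A', True)   # cluster the 16A initial model into 32 mixtures
```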
Further, the initial value may be determined by focusing on the use information (the type of noise, etc.) rather than on the specification information (the number of mixture distributions, etc.) of the standard model to be created.
Fig. 64 shows the results of a recognition experiment using a standard model with 64 mixture distributions created using the 3rd approximation unit. The vertical axis is the recognition rate (%) for adults (male and female), and the horizontal axis is the number of repetitions R. R = 0 corresponds to the recognition result of the initial model determined by the initial standard model creation unit before learning; for R = 1 to 5, the vicinity indication parameter G is set to 1.
The curve 'data' shows the result of learning directly from the speech data over several days, and the curves 'female' and 'male' show the results when the initial model is set to the reference model of an adult female and an adult male, respectively. The learning time of the present invention, which is based on reference models, is on the order of minutes. The experimental results show that when the reference model of the adult female is chosen as the initial standard model, a standard model with higher accuracy than the result of learning directly from the speech data can be created.
This suggests that dividing the speech data, rigorously training each division as a reference model, and then integrating the reference models may overcome the problem inherent in learning directly from speech data, namely falling into a local solution (as seen in the comparison of recognition accuracy with learning from the speech data).
Further, since the speech data of children, for which it is difficult to record large amounts, can be rigorously learned in a reference model with a small number of mixture distributions suited to the amount of data, while the speech data of adults, for which large amounts can be recorded, can be rigorously learned in a reference model with a large number of mixture distributions, it is expected that a standard model with extremely high accuracy can be created by synthesizing these reference models with the present invention.
In the recognition experiment in which the number of mixture distributions of the standard model was 16 (fig. 39), the method of the present invention did not exceed the recognition rate of the standard model learned directly from the speech data. This is considered to be because information in the speech data is lost when the speech data are compressed into 16-mixture reference models. By creating reference models with 64 mixtures and thereby sufficiently preserving the characteristics of the speech data, a more accurate standard model can be created. For this reason, in embodiment 9 the number of mixture distributions of the reference model is set to the large value of 300.
The recognition experiments shown in figs. 39 and 64 also show the influence of the initial standard model on recognition accuracy and underline the importance of the method for specifying the initial standard model (fig. 64 shows that a standard model with higher accuracy can be created when the reference model of the adult female, rather than that of the adult male, is used as the initial standard model).
As described above, by using the initial standard model having the same property as the reference model from the classification ID-initial standard model-reference model correspondence table, a highly accurate standard model can be created.
The identification of the initial standard model using the classification ID-initial standard model-reference model correspondence table can be used in any of embodiments 1 to 10.
In the above-described embodiments, formula 25 is used as the likelihood of the standard model with respect to the reference models when estimating the statistics of the standard model, but the present invention is not limited to this likelihood function; for example, the likelihood function shown in formula 66 below may be used.
(formula 66)
Here, α (i) is a weight indicating the importance corresponding to each integrated reference model i. For example, if a speaker in speech recognition is applied, importance is determined by the similarity between the user's speech and the speech for creating the integrated model. That is, when the reference model is close to the user's voice (importance is high), α (i) is set to a large value (weight is high). Preferably, the similarity between the integrated model and the user's voice is determined by using the degree of likelihood when the user's voice is input to the integrated model. Thus, when a standard model is created by integrating a plurality of reference models, the closer the reference model is to the user's voice, the larger the weighting is, the more the statistics of the standard model are affected, and a highly accurate standard model that further reflects the characteristics of the user can be created.
The standard model structure determining unit in each embodiment determines the structure of the standard model based on various factors such as the use information and the specification information, but the present invention is not limited to this. For example, in the case of speech recognition, the structure of the standard model may be determined based on various attributes of the person to be recognized and of the recognition conditions, such as age, gender, speaker characteristics of voice quality, intonation based on emotion or health state, speaking speed, politeness of speech, dialect, type of background noise, loudness of background noise, signal-to-noise ratio of the speech and the background noise, microphone characteristics, and complexity of the recognition vocabulary.
Specifically, as shown in figs. 65(a) to (j), the number of Gaussian distributions (mixture number) constituting the standard model is, for example, set as follows: it is increased as the age of the person to be recognized increases (fig. 65(a)); it is set larger for male speakers than for female speakers (fig. 65(b)); it is increased as the voice quality of the person to be recognized changes from 'usual' to 'loud' and further to 'hoarse' (fig. 65(c)); it is increased as the intonation, reflecting the speaker's emotion, changes from 'usual' to 'lively' and further to 'crying or laughing' (fig. 65(d)); it is increased as the speaking speed departs from the usual speed, whether faster or slower (fig. 65(e)); it is increased as the speaking style shifts from 'reading tone' toward 'conversational tone' (fig. 65(f)); it is increased as the dialect of the person to be recognized shifts from the 'standard language' toward the 'Osaka accent' and further toward the 'kanehai' accent (fig. 65(g)); it is decreased as the background noise becomes louder (fig. 65(h)); it is increased as the performance of the microphone used for speech recognition improves (fig. 65(i)); and it is increased as the vocabulary to be recognized grows (fig. 65(j)). In most of these examples, the mixture number is determined from the viewpoint of securing accuracy by increasing it as the variation of the speech to be recognized becomes larger.
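The tendencies of figs. 65(a) to (j) amount to a heuristic mapping from attributes of the recognition target to a mixture count. The following toy function (the attribute set, thresholds, and base value are all hypothetical) illustrates how a standard model structure determining unit might encode a few of those rules:

```python
def decide_mixture_count(age: int, gender: str, snr_db: float,
                         vocabulary_size: int, base: int = 16) -> int:
    """Toy heuristic in the spirit of fig. 65: raise the mixture count when the
    speech to be recognized is expected to vary more, lower it when background
    noise dominates.  All thresholds are illustrative assumptions."""
    mixtures = base
    if age >= 60:                 # (a) older speakers -> more mixtures
        mixtures *= 2
    if gender == "male":          # (b) example tendency from the figure
        mixtures *= 2
    if snr_db < 10:               # (h) heavy background noise -> fewer mixtures
        mixtures //= 2
    if vocabulary_size > 10000:   # (j) large recognition vocabulary -> more mixtures
        mixtures *= 2
    return max(1, mixtures)

# decide_mixture_count(age=70, gender="male", snr_db=20, vocabulary_size=50000) -> 128
```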
Industrial applicability
The standard model creation device of the present invention is useful as a device for recognizing an object such as a voice, a character, or an image using a probabilistic model, and the like, and is useful as, for example, a television receiver device that performs various kinds of processing using a voice, a car navigation device, a translation device that translates a voice into another language, a game device operated by a voice, a search device that searches for information using a search keyword based on a voice, an authentication device that performs person detection, fingerprint authentication, face authentication, iris authentication, and the like, and an information processing device that performs prediction such as stock prediction, weather forecast, and the like.
Claims (25)
1. A standard model creation device for creating a standard model for speech recognition that represents a speech feature having a specific attribute, using a probability model in which a frequency parameter representing the speech feature is expressed by an output probability, the standard model creation device being characterized in that:
a reference model storage unit for storing one or more reference models that are probability models representing speech features having a certain attribute; and
a standard model creation unit that creates a standard model by calculating statistics of the standard model using statistics of one or more reference models stored in the reference model storage unit;
the standard modeling unit includes:
a standard model structure determination unit for determining the structure of the created standard model;
an initial standard model creation unit that specifies a statistical amount initial value of a specific standard model for which a structure is specified; and
and a statistic estimation unit configured to estimate and calculate statistics of the standard model so as to maximize or maximize a probability or likelihood of the standard model having the initial value with respect to the reference model.
2. The standard modeling apparatus of claim 1, wherein:
the standard model creation device further includes a reference model selection unit configured to select one or more reference models from the reference models stored in the reference model storage unit, based on use information that is information on an attribute to be a speech recognition target;
the standard model creating means creates a standard model using the statistics of the reference model selected by the reference model selecting means.
3. The standard modeling apparatus of claim 2, wherein:
the standard model creation device further includes a utilization information creation unit that creates the utilization information;
the reference model selecting means selects one or more reference models from the reference models stored in the reference model storage means, based on the created use information.
4. The standard modeling apparatus of claim 2, wherein:
the standard model creation device is connected to a terminal device via a communication path,
the standard model creation device further includes a usage information receiving unit that receives the usage information from the terminal device;
the reference model selecting means selects one or more reference models from the reference models stored in the reference model storage means, based on the received usage information.
5. The standard modeling apparatus for speech recognition according to claim 1, wherein:
the standard model structure specifying unit specifies the structure of the standard model based on at least one of specification information, which is information on a specification of the created standard model, and usage information, which is information on an attribute to be a speech recognition target.
6. The standard modeling apparatus for speech recognition according to claim 5, wherein:
the specification information indicates at least one specification of a kind of an application using the standard model and a specification of a device using the standard model.
7. The standard modeling apparatus for speech recognition according to claim 5, wherein:
the attributes include information related to at least one of age, gender, speaker properties of timbre, intonation based on emotional or health status, speaking speed, closeness of speaking, dialect, kind of background noise, size of background noise, signal-to-noise ratio of speech and background noise, microphone characteristics, and complexity of recognized vocabulary.
8. The standard modeling apparatus of claim 5, wherein:
the standard model creation device further includes a specification information holding unit that holds, as the specification information, an application specification correspondence database indicating correspondence between an application using a standard model and a specification of the standard model;
the standard model configuration determining section reads a specification corresponding to the started application from an application specification correspondence database held in the specification information holding unit, and determines the configuration of the standard model based on the read specification.
9. The standard modeling apparatus of claim 5, wherein:
the standard model creation device further includes a specification information creation unit for creating the specification information,
the standard model structure specifying unit specifies the structure of the standard model based on the created specification information.
10. The standard modeling apparatus of claim 5, wherein:
the standard model creation device is connected to a terminal device via a communication path,
the standard model creation device further includes a specification information receiving unit that receives the specification information from the terminal device,
the standard model structure determination unit determines the structure of the standard model based on the received specification information.
11. The standard modeling apparatus of claim 5, wherein:
representing the reference model and the standard model with more than one gaussian distribution;
the standard model structure determination unit determines at least the mixture number of Gaussian distributions as the structure of the standard model.
12. The standard modeling apparatus of claim 1, wherein:
a standard model creation device connected to a terminal device via a communication path;
the standard model creation device further includes a standard model transmission unit that transmits the standard model created by the standard model creation unit to the terminal device.
13. The standard modeling apparatus of claim 1, wherein:
representing the reference model and the standard model with more than one gaussian distribution;
the reference model storage means stores at least a pair of reference models having different numbers of mixture of gaussian distributions;
the statistic estimation unit calculates the statistics of the standard model so as to maximize or locally maximize the probability or likelihood of the standard model with respect to the pair of reference models.
14. The standard modeling apparatus of claim 1, wherein:
the standard model creation means further includes reference model preparation means for executing at least one of a work of acquiring a reference model from the outside and storing the reference model in the reference model storage means, and a work of creating a new reference model and storing the reference model in the reference model storage means.
15. The standard modeling apparatus of claim 14, wherein:
the reference model preparation unit further performs at least one of updating and adding of the reference model stored by the reference model storage unit.
16. The standard modeling apparatus of claim 15, wherein:
the reference model preparation means performs at least one of updating and adding of the reference model stored in the reference model storage means, based on at least one of the use information, which is information on the identification target, and the specification information, which is information on the specification of the created standard model.
17. The standard modeling apparatus of claim 15, wherein:
the standard model creation device further includes a similarity degree information creation unit that creates similarity degree information indicating a similarity degree between at least one of the usage information and the specification information and the reference model, based on at least one of specification information, which is information relating to a specification of the created standard model, and usage information, which is information relating to an attribute to be a speech recognition target, and the reference model stored in the reference model storage unit;
the reference model preparation unit determines whether or not to perform at least one of updating and adding of the reference model stored in the reference model storage unit, based on the similarity information created by the similarity information creation unit.
18. The standard modeling apparatus of claim 1, wherein:
the initial standard model creating unit specifies the initial values of the statistics of the standard model by using the one or more reference models that the statistic estimating unit uses to calculate the statistics of the standard model.
19. The standard modeling apparatus of claim 1, wherein:
the initial standard model creation unit specifies the initial value based on a classification ID for identifying a standard model type.
20. The standard modeling apparatus of claim 19, wherein:
the initial standard model creating unit specifies the classification ID from the reference model, and specifies an initial value associated with the specified classification ID as the initial value.
21. The standard modeling apparatus of claim 20, wherein:
the initial standard model creating unit holds a correspondence table indicating a correspondence between the classification ID, the initial value, and the reference model, and specifies the initial value based on the correspondence table.
22. The standard modeling apparatus of claim 21, wherein:
the initial standard model creating unit creates or obtains from the outside an initial standard model with a classification ID as an initial value corresponding to the classification ID or a reference model with a classification ID as a reference model corresponding to the classification ID, thereby creating the correspondence table.
23. The standard modeling apparatus of claim 1, wherein:
the reference model storage unit stores a plurality of reference models;
the statistic estimation unit calculates the statistics so as to maximize or locally maximize the weighted probability or likelihood with respect to the plurality of reference models stored in the reference model storage unit.
24. A method for creating a standard model for speech recognition, which uses a probability model in which a frequency parameter representing a speech feature is expressed by an output probability, and which represents the speech feature having a specific attribute, the method comprising:
the method comprises the following steps: a reference model reading step of reading one or more reference models from a reference model storage unit that stores one or more reference models that are probabilistic models representing speech characteristics having a certain attribute; and
a standard model creation step of creating a standard model by calculating statistics of the standard model using the statistics of the read reference model;
the standard modeling step comprises:
a standard model structure determining substep of determining the structure of the manufactured standard model;
an initial standard model making sub-step, determining the statistic initial value of a specific standard model, wherein the standard model is determined to be constructed; and
and a statistic estimation substep of estimating and calculating statistics of the standard model so as to maximize or maximize the probability or likelihood of the standard model for which the initial value is determined with respect to the reference model.
25. A program for creating a standard model for speech recognition representing a speech feature having a specific attribute by using a probability model in which a frequency parameter representing the speech feature is expressed by an output probability, the program comprising:
the method comprises the following steps: a reference model reading step of reading one or more reference models from a reference model storage unit that stores one or more reference models that are probabilistic models representing speech characteristics having a certain attribute; and
a standard model creation step of creating a standard model by calculating statistics of the standard model using the statistics of the read reference model;
the standard modeling step comprises:
a standard model structure determining substep of determining the structure of the manufactured standard model;
an initial standard model making sub-step, determining the statistic initial value of a specific standard model, wherein the standard model is determined to be constructed; and
and a statistic estimation substep of estimating and calculating statistics of the standard model so as to maximize or maximize the probability or likelihood of the standard model for which the initial value is determined with respect to the reference model.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP338652/2002 | 2002-11-21 | ||
JP2002338652 | 2002-11-21 | ||
JP89179/2003 | 2003-03-27 | ||
JP284489/2003 | 2003-07-31 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN1735924A true CN1735924A (en) | 2006-02-15 |
Family
ID=36077595
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA200380103867XA Pending CN1735924A (en) | 2002-11-21 | 2003-11-18 | Standard model creating device and standard model creating method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1735924A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104221054A (en) * | 2012-05-23 | 2014-12-17 | 松下电器产业株式会社 | Person attribute estimation system and learning-use data generation device |
CN103278777A (en) * | 2013-05-24 | 2013-09-04 | 杭州电子科技大学 | Method for estimating health status of lithium battery on basis of dynamic Bayesian network |
CN103278777B (en) * | 2013-05-24 | 2015-08-19 | 杭州电子科技大学 | A kind of lithium battery health condition estimation method based on dynamic bayesian network |
CN104766607A (en) * | 2015-03-05 | 2015-07-08 | 广州视源电子科技股份有限公司 | Television program recommendation method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN1867966A (en) | Data processing device and data processing device control program | |
CN1159704C (en) | Signal analyzer | |
CN1734445A (en) | Method, apparatus, and program for dialogue, and storage medium including a program stored therein | |
CN1462428A (en) | Sound processing apparatus | |
CN1237502C (en) | Method, apparatus and computer program for preparing an acoustic model | |
CN1808414A (en) | Method and apparatus for learning data, method and apparatus for recognizing data, method and apparatus for generating data and computer program | |
CN1409527A (en) | Terminal device, server and voice identification method | |
CN1324556C (en) | Pitch waveform signal generation apparatus, pitch waveform signal generation method, and program | |
CN1864204A (en) | Methods, systems and programming for performing speech recognition | |
CN1105464A (en) | Interactive computer system recognizing spoken commands | |
CN1303581C (en) | Information processing apparatus with speech-sound synthesizing function and method thereof | |
CN1242376C (en) | Sound recognition system, device, sound recognition method and sound recognition program | |
CN1234109C (en) | Intonation generating method, speech synthesizing device by the method, and voice server | |
CN1162838C (en) | Speech intensifying-characteristic weighing-logrithmic spectrum addition method for anti-noise speech recognization | |
CN1328321A (en) | Apparatus and method for providing information by speech | |
CN100339907C (en) | Sunchronous playback system and recorder and player with good intrumental ensemble reproducing music | |
CN1855224A (en) | Information processing apparatus, information processing method, and program | |
CN1898721A (en) | Device control device, speech recognition device, agent device, on-vehicle device control device, navigation device, audio device, device control method, speech recognition method, agent processing me | |
WO2004047076A1 (en) | Standard model creating device and standard model creating method | |
CN101046960A (en) | Apparatus and method for processing voice in speech | |
CN1967695A (en) | Information processing apparatus, reproduction apparatus, communication method, reproduction method and computer program | |
CN1941077A (en) | Apparatus and method speech recognition of character string in speech input | |
CN1755663A (en) | Information-processing apparatus, information-processing methods and programs | |
CN1420444A (en) | Transmitting and receiving apparatus and method, program and recording medium and transmitting/receiving system | |
CN1553845A (en) | Robot system and robot apparatus control method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |