CN105869641A - Speech recognition device and speech recognition method - Google Patents
- Publication number
- CN105869641A CN105869641A CN201510032839.8A CN201510032839A CN105869641A CN 105869641 A CN105869641 A CN 105869641A CN 201510032839 A CN201510032839 A CN 201510032839A CN 105869641 A CN105869641 A CN 105869641A
- Authority
- CN
- China
- Prior art keywords
- voice command
- acoustic model
- command word
- speech recognition
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a speech recognition device and a speech recognition method. The speech recognition device comprises: a unit configured to obtain speech input by a current user; a unit configured to split the obtained speech and output at least two voice command segments; a unit configured to recognize a predefined first voice command from the voice command segments by using a speaker-independent acoustic model; a unit configured to calculate a transformation matrix for the current user based on the voice command segment recognized as the predefined first voice command; a unit configured to select an acoustic model for the current user from the acoustic models registered in the speech recognition device based on the calculated transformation matrix; and a unit configured to recognize a second voice command from the voice command segments by using the selected acoustic model. With the speech recognition device and the speech recognition method of the invention, speech recognition performance can be improved by using the selected acoustic model (AM).
Description
Technical field
The present invention relates to a speech recognition device and a speech recognition method.
Background art
A speech recognition system provides users with a convenient interface through which they can interact with any electronic device that has a speech recognition function. For operating electronic devices such as multi-function printers (MFPs), cameras, personal digital assistants (PDAs), and mobile phones, voice command recognition is the most convenient approach. A user can input his or her own voice directly into the electronic device via a voice input device such as a microphone; voice command recognition technology then converts the user's speech into a voice command for operating the electronic device.
In general, an acoustic model (AM) for voice command recognition should be registered or trained in advance. However, registering or training an AM is time-consuming, and many users are unwilling to perform these operations. As a countermeasure against this problem, the following technique is available: selecting a set of AMs from the existing AMs of other users and/or other electronic devices. For example, U.S. Patent No. 7,103,549 discloses a method of selecting an AM from existing AMs based on the personal attributes of the user and/or the attributes of the communication channel, and performing speech recognition with the selected AM so as to improve the user's speech recognition performance. The personal attributes of the user include gender, native language, age, race, locality, and so on. The channel attributes include connection type, telephone model, network identifier, network attributes, background noise level, and so on.
However, in the above U.S. Patent No. 7,103,549, the personal attributes and channel attributes of the user are obtained when the user initially sets up an account on the electronic device, or when the user holds an initial session with the electronic device. Therefore, the personal attributes and channel attributes cannot reflect the user's instantaneous speech attributes and instantaneous environment attributes, such as a voice change caused by coughing or a car suddenly passing by; as a result, an AM selected based on these attributes can degrade the user's speech recognition performance.
Summary of the invention
Therefore, in view of the statements above, the problem to be solved by the present invention is to select a set of AMs for the current user from the existing AMs of other users and/or other electronic devices, where the selected AMs match well with the instantaneous speech attributes and instantaneous environment attributes of the current user, so that speech recognition performance for the current user can be improved by using the selected AMs.
According to one aspect of the present invention, there is provided a speech recognition device comprising: a voice input unit configured to obtain speech input by a current user; a voice splitting unit configured to split the obtained speech and output at least two voice command segments; a predefined-first-voice-command recognition unit configured to recognize a predefined first voice command from the voice command segments by using a speaker-independent acoustic model; a transformation matrix calculation unit configured to calculate a transformation matrix for the current user based on the voice command segment recognized as the predefined first voice command, wherein the calculated transformation matrix enables the speaker-independent acoustic model to match the voice command segment recognized as the predefined first voice command; a model selection unit configured to select, based on the calculated transformation matrix, an acoustic model for the current user from the acoustic models registered in the speech recognition device; and a second-voice-command recognition unit configured to recognize a second voice command from the voice command segments by using the selected acoustic model.
According to another aspect of the present invention, there is provided a speech recognition method comprising: a voice input step of obtaining speech input into a speech recognition device by a current user; a voice splitting step of splitting the obtained speech and outputting at least two voice command segments; a predefined-first-voice-command recognition step of recognizing a predefined first voice command from the voice command segments by using a speaker-independent acoustic model; a transformation matrix calculation step of calculating a transformation matrix for the current user based on the voice command segment recognized as the predefined first voice command, wherein the calculated transformation matrix enables the speaker-independent acoustic model to match the voice command segment recognized as the predefined first voice command; a model selection step of selecting, based on the calculated transformation matrix, an acoustic model for the current user from the acoustic models registered in the speech recognition device; and a second-voice-command recognition step of recognizing a second voice command from the voice command segments by using the selected acoustic model.
As described above, since the transformation matrix is calculated based on a part of the voice command segments split from the speech instantaneously input by the current user (i.e., the above voice command segment recognized as the predefined first voice command), the calculated transformation matrix can represent the attributes of the current user and of the current environment, where the attributes of the current user may be his or her pronunciation attributes, voice-change attributes, and the like, and the attributes of the current environment may be the noise attributes and communication channel attributes of the current speech. Thus, an AM selected based on the transformation matrix can match well with the instantaneous speech attributes and instantaneous environment attributes of the current user, and the speech recognition performance of the current user can be improved by using the selected AM.
Further aspects and advantages of the present invention will become apparent from the following description with reference to the accompanying drawings.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
Fig. 1 is a block diagram illustrating the overall structure of a speech recognition system according to the present invention, which includes several electronic devices with a speech recognition function.
Fig. 2 is a first block diagram illustrating an example of the internal structure of an electronic device with a speech recognition function, according to an exemplary embodiment of the present invention.
Fig. 3 is a block diagram of the internal functions of speech recognition related to the voice input unit of the electronic device, according to the first exemplary embodiment of the present invention.
Fig. 4 is a block diagram of the internal functions of the model selection unit shown in Fig. 3, according to the first exemplary embodiment of the present invention.
Fig. 5 is a block diagram of the internal functions of speech recognition related to the voice input unit of the electronic device, according to the second exemplary embodiment of the present invention.
Fig. 6 is a second block diagram illustrating another example of the internal structure of an electronic device with a speech recognition function, according to an exemplary embodiment of the present invention.
Fig. 7 is a flowchart of the speech recognition operation related to the voice input unit of the electronic device, according to the first exemplary embodiment of the present invention.
Fig. 8 schematically shows a flowchart of the steps for selecting a phoneme-based AM, according to the first exemplary embodiment of the present invention.
Fig. 9 schematically shows a flowchart of the steps for selecting a command-word-based AM, according to the first exemplary embodiment of the present invention.
Fig. 10 is a flowchart of the speech recognition operation related to the voice input unit of the electronic device, according to the second exemplary embodiment of the present invention.
Detailed description of the invention
Exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that the following description is merely illustrative and exemplary in nature, and is in no way intended to limit the present invention or its applications or uses. The relative arrangement of elements and steps, the numerical expressions, and the numerical values set forth in the embodiments do not limit the scope of the invention unless otherwise stated. In addition, techniques, methods, and devices known to those skilled in the art may not be discussed in detail, but are regarded as part of this specification where appropriate.
Note that similar reference numerals and letters refer to similar items in the figures; thus, once an item is defined in one figure, it need not be discussed again for subsequent figures.
Fig. 1 is a block diagram illustrating the overall structure of a speech recognition system according to the present invention, which includes several electronic devices with a speech recognition function.
As shown in Fig. 1, the speech recognition system 100 is equipped with electronic devices of any kind that have a speech recognition function, such as an MFP 1, a camera 2, a PDA 3, a mobile phone 4, a personal computer (PC) 5, and any other kind of electronic device 6, and these electronic devices are communicatively coupled to each other via a network 7. The type and number of electronic devices to be connected to the network 7 are not limited to the situation shown in Fig. 1. Any electronic device in the speech recognition system 100 is configured to receive a user's speech and, based on the speech recognition function, recognize from this speech the corresponding voice command for operating the electronic device.
The above speech recognition function can be realized by hardware and/or software. In one implementation, a functional module or functional device capable of performing speech recognition can be incorporated into the electronic device, so that the electronic device has the corresponding speech recognition function. In another implementation, a software program capable of performing speech recognition can be stored in the storage device of the electronic device, so that the electronic device likewise has the corresponding speech recognition function. These two implementations will be described in detail below with reference to the accompanying drawings.
(Speech recognition device incorporated in an electronic device)
Fig. 2 is a first block diagram illustrating an example of the internal structure of an electronic device 1 with a speech recognition function, such as the MFP 1 in Fig. 1, according to an exemplary embodiment of the present invention, in which the electronic device 1 (i.e., the MFP 1) incorporates the speech recognition device that will be described in detail below with reference to Figs. 3 to 5. The electronic device 1 can include a central processing unit (CPU) 101, a random access memory (RAM) 102, a read-only memory (ROM) 103, a hard disk 104, an input device 105, a speech recognition device 106, an operating unit 107, an output device 108, and a network interface 109, and these components are communicatively coupled to each other via a system bus 110.
The CPU 101 can be any suitable programmable control device, and can perform the various functions described later by executing various application programs stored in the ROM 103 or the hard disk 104. The RAM 102 is used to temporarily store programs or data loaded from the ROM 103 or the hard disk 104, and also serves as the workspace in which the CPU 101 executes various programs. The hard disk 104 can store various kinds of information, such as an operating system (OS), various applications, control programs, data, a speaker-independent acoustic model (SI-AM), and the AMs registered or trained by users. In addition, AMs registered or trained in advance by the manufacturer can be stored in the ROM 103 or the hard disk 104.
The input device 105 can include an operation input device 115 and a voice input unit 125, and enables the user to interact with the electronic device 1 based on operations input through the operation input device 115 or speech input through the voice input unit 125. The operation input device 115 can take various forms, such as buttons, a keypad, a dial, a touch wheel, or a touch screen. Further, the voice input unit 125 can be a microphone.
According to the embodiments of the present invention described in detail below, the speech recognition device 106 receives the user's speech from the voice input unit 125 and recognizes, from the received speech, the corresponding voice command or the corresponding speech content.
The operating unit 107 performs the operation corresponding to the recognized voice command. The output device 108 can include a display device 118 and a voice output unit 128, and can display or output the recognized voice command and/or the recognized speech content.
For example, when the speech recognition device 106 recognizes a voice command from the received speech, the operating unit 107 performs the corresponding operation of the electronic device 1, such as printing, copying, scanning, or sending an e-mail. Further, as an optional operation, before the operating unit 107 performs the corresponding operation, the display device 118 can display the recognized voice command to obtain the user's confirmation, and/or the voice output unit 128 can output the recognized voice command to obtain the user's confirmation.
The display device 118 can include a cathode ray tube (CRT) or a liquid crystal display, and the voice output unit 128 can be equipped with an audio output device such as a speaker. In addition, the operation input device 115 and the display device 118 can be integrated together or incorporated separately.
The network interface 109 provides an interface for connecting the electronic device 1 to the network 7 shown in Fig. 1. Via the network interface 109, the electronic device 1 performs data communication (e.g., sharing AMs) with other electronic devices (such as the camera 2 and the PDA 3) connected via the network 7. Alternatively, a wireless interface can be provided for the electronic device 1 to perform radio frequency data communication.
The system bus 110 can provide a data transfer path for mutually transmitting data between components such as the CPU 101, the RAM 102, the ROM 103, the hard disk 104, the input device 105, the speech recognition device 106, the operating unit 107, the output device 108, and the network interface 109. Although referred to as a bus, the system bus 110 is not limited to any specific data transfer technology.
As for the speech recognition device 106, a first example of its internal functional units is shown in Fig. 3. Fig. 3 is a block diagram of the internal functions of speech recognition related to the voice input unit 125 in Fig. 2 of the electronic device 1, according to the first exemplary embodiment of the present invention. When the CPU 101 executes the programs stored in the ROM 103 and/or the hard disk 104, the following functional units are realized.
Before operating the electronic device 1, the user needs to log in to the electronic device 1 using any kind of login method, such as logging in by an IC card or by fingerprint recognition. Then, the CPU 101 loads an AM set and a command word list from the ROM 103 and/or the hard disk 104 into the RAM 102, where the command word list can be set automatically by the electronic device 1 based on its own operations, or can be set by the user or the manufacturer based on the operations of the electronic device 1. For example, if the electronic device 1 can perform printing, copying, and scanning, the command word list may contain command words such as "printing", "copying", "scanning", "two copies", and "double-sided". If the user has used the electronic device 1 before and has registered some AMs stored in advance in the hard disk 104, the CPU 101 loads the user's registered AMs as the above loaded AM set; otherwise, the loaded AM set is empty.
As is apparent to those skilled in the art, there are two types of AM: one type is the phoneme-based AM, and the other type is the command-word-based AM. On the one hand, when the AM used in the speech recognition device 106 is a phoneme-based AM, the user or the CPU 101 verifies whether the phoneme-based AMs in the loaded AM set cover all the phonemes of the command words in the command word list. If the phoneme-based AMs in the loaded AM set cover all the phonemes of the command words in the command word list, then after the user inputs speech containing the predefined first voice command and thereby activates the speech recognition device 106, the speech recognition device 106 performs speech recognition using the loaded AM set. Otherwise, if the phoneme-based AMs in the loaded AM set do not cover all the phonemes of the command words in the command word list, the speech recognition device 106 selects, according to the present invention, phoneme-based AMs of a phone set that covers all the phonemes of the command words in the command word list, adds the selected phoneme-based AMs of the phone set to the loaded AM set, and then performs speech recognition using the phoneme-based AMs of the selected phone set.
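The coverage check described above can be sketched as a simple set comparison. The lexicon, command words, and phoneme inventory below are illustrative assumptions, not part of the patent:

```python
# Hypothetical sketch of the phoneme-coverage check: does the loaded AM set
# cover every phoneme needed by the command word list?

def phonemes_needed(command_words, lexicon):
    """Collect every phoneme used by any command word in the list."""
    needed = set()
    for word in command_words:
        needed.update(lexicon[word])
    return needed

def covers_all_phonemes(loaded_am_phonemes, command_words, lexicon):
    """True if the loaded phoneme-based AM set covers all command-word phonemes."""
    return phonemes_needed(command_words, lexicon) <= set(loaded_am_phonemes)

# Toy lexicon mapping command words to phoneme sequences (assumed for illustration).
lexicon = {
    "print": ["p", "r", "ih", "n", "t"],
    "copy":  ["k", "aa", "p", "iy"],
    "scan":  ["s", "k", "ae", "n"],
}
command_words = ["print", "copy", "scan"]

loaded = ["p", "r", "ih", "n", "t", "k", "aa", "iy"]  # AM set missing "s", "ae"
print(covers_all_phonemes(loaded, command_words, lexicon))            # False
print(sorted(phonemes_needed(command_words, lexicon) - set(loaded)))  # ['ae', 's']
```

When the check fails, the missing phonemes tell the device which phoneme-based AMs of a phone set must be selected and added to the loaded AM set.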
On the other hand, when the AM used in the speech recognition device 106 is a command-word-based AM, the user or the CPU 101 verifies whether the command-word-based AMs in the loaded AM set cover all the command words in the command word list. If the command-word-based AMs in the loaded AM set cover all the command words in the command word list, then after the user activates the speech recognition device 106, the speech recognition device 106 performs speech recognition using the loaded AM set. Otherwise, if the command-word-based AMs in the loaded AM set do not cover all the command words in the command word list, the speech recognition device 106 selects, according to the present invention, a command-word-based AM for each of, or some of, the command words in the command word list, adds the selected command-word-based AMs to the loaded AM set, and then performs speech recognition using the selected command-word-based AMs. As can be understood by those skilled in the art, the speech recognition device 106 performs speech recognition only after the predefined first voice command appears in the speech input by the user. The predefined first voice command can be predefined automatically by the electronic device 1, or can be set by the user. For example, the predefined first voice command can be a predefined introducer composed of any word or any group of words, such as "start" or "start ... end".
A first example of the internal functional units of the speech recognition device 106 capable of carrying out the present invention will now be described. As shown in Fig. 3, the speech recognition device 106 includes a voice splitting unit 302, a predefined-first-voice-command recognition unit 303, a transformation matrix calculation unit 304, a model selection unit 305, and a second-voice-command recognition unit 306.
Specifically, the voice input unit 125 of the electronic device 1 obtains the speech input by the current user.
The voice splitting unit 302 receives the obtained speech from the voice input unit 125, and then splits the obtained speech and outputs at least two voice command segments, using any kind of voice activity detection (VAD) technique well known in the art, such as a time-domain method based on short-time energy or a transform-domain method based on frequency-domain parameters. As described above, the speech recognition device 106 performs speech recognition only if the obtained speech contains the predefined first voice command; therefore, in order to make the electronic device 1 perform a corresponding operation such as printing or copying, the obtained speech must contain at least two voice commands, one of which is used to identify the predefined first voice command and the other of which is used to operate the electronic device 1. For example, when the current user wants to print a document with the electronic device 1, the obtained speech can be "start print".
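The time-domain short-time-energy VAD mentioned above can be sketched as follows. The frame size, energy threshold, and synthetic waveform are illustrative assumptions; a real implementation would tune these and add smoothing:

```python
import numpy as np

def split_by_energy(samples, rate=16000, frame_ms=20, threshold=0.01):
    """Split a waveform into voiced segments using short-time energy.

    Frames whose mean energy exceeds `threshold` are treated as speech;
    consecutive speech frames are merged into one (start, end) segment in
    sample indices. Frame size and threshold are illustrative only.
    """
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    segments, start = [], None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        voiced = np.mean(frame ** 2) > threshold
        if voiced and start is None:
            start = i * frame_len
        elif not voiced and start is not None:
            segments.append((start, i * frame_len))
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments

# Synthetic input: tone - silence - tone, standing in for "start" ... "print".
rate = 16000
t = np.arange(rate // 2) / rate
tone = 0.5 * np.sin(2 * np.pi * 440 * t)
silence = np.zeros(rate // 2)
speech = np.concatenate([tone, silence, tone])
print(len(split_by_energy(speech, rate)))  # 2 voiced segments
```

The two voiced segments would correspond to the two voice command segments, while low-energy stretches between them can be kept as background sound segments.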
As is apparent to those skilled in the art, the current environment of the speech is unlikely to be absolutely quiet; in other words, the input speech may contain sounds from the surroundings of the current environment in which the current user inputs the speech. Therefore, in addition to the above at least two voice command segments, the output of the voice splitting unit 302 will also include at least one background sound segment, where the background sound segment can reflect the current environment of the input speech, such as the sounds around an office, around a kindergarten, or around a street.
The predefined-first-voice-command recognition unit 303 recognizes the predefined first voice command from the voice command segments output from the voice splitting unit 302 by using a speaker-independent acoustic model (SI-AM) 310, where the SI-AM can be stored, for example, in the hard disk 104 of the electronic device 1 in Fig. 2.
The transformation matrix calculation unit 304 calculates a transformation matrix for the current user based on the voice command segment recognized as the predefined first voice command by the predefined-first-voice-command recognition unit 303. In addition, since the current environment of the speech is unlikely to be absolutely quiet, as described above, the transformation matrix calculation unit 304 can calculate the transformation matrix for the current user based on both the background sound segment and the voice command segment recognized as the predefined first voice command.
As described above, since the transformation matrix is calculated based on a part of the voice command segments split from the speech instantaneously input by the current user (i.e., the above voice command segment recognized as the predefined first voice command), and may also be calculated based on that part of the voice command segments together with the background sound segment, the calculated transformation matrix can represent the attributes of the current user and of the current environment, where the attributes of the current user may be his or her pronunciation attributes, voice-change attributes, and the like, and the attributes of the current environment may be the noise attributes and communication channel attributes of the current speech.
For example, in one implementation, the transformation matrix can be calculated by using the maximum likelihood linear regression (MLLR) method well known in the art, and can be represented, for example, by the following formula:

W* = argmax_W P(O | W, λ)

where O represents the observations, such as the above background sound segment and the above voice command segment recognized as the predefined first voice command; W represents the above transformation matrix; and λ represents the parameters of the above SI-AM.

The meaning of the above formula is that the W maximizing P(O | W, λ) is taken as the output. In other words, the MLLR method can adjust the parameters of the SI-AM by using the transformation matrix, so that the SI-AM can match the observations, such as the above background sound segment and the above voice command segment recognized as the predefined first voice command. In addition, as an optional solution, the calculated transformation matrix of the current user can be stored in the hard disk 104 of the electronic device 1 for follow-up work, such as selecting AMs for other users who use the present invention.
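In MLLR, the transformation matrix acts on extended Gaussian mean vectors of the acoustic model, μ' = W·[1, μ]. The sketch below estimates W by least squares, which coincides with the ML solution only under simplifying assumptions (identity covariances, equal occupancy counts); the toy means and the bias shift are invented for illustration:

```python
import numpy as np

def extend(mu):
    """Extended mean vector xi = [1, mu] used by MLLR."""
    return np.concatenate(([1.0], mu))

def estimate_mllr_transform(means, adapted_means):
    """Least-squares estimate of W such that W @ xi_i ~ adapted_means[i].

    A simplification of the full ML estimate: with identity covariances and
    equal occupancy counts, the MLLR solution reduces to this regression.
    """
    Xi = np.stack([extend(m) for m in means])   # (n, d+1)
    Y = np.asarray(adapted_means)               # (n, d)
    W, *_ = np.linalg.lstsq(Xi, Y, rcond=None)
    return W.T                                  # (d, d+1)

def apply_mllr(W, mu):
    """Adapt one SI-AM Gaussian mean: mu' = W @ [1, mu]."""
    return W @ extend(mu)

# Toy SI-AM means and "observed" user means shifted by a bias (assumed data).
si_means = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
user_means = [m + np.array([0.5, -0.2]) for m in si_means]

W = estimate_mllr_transform(si_means, user_means)
print(np.allclose(apply_mllr(W, si_means[0]), user_means[0]))  # True
```

Here the estimated W recovers the bias exactly, illustrating how the transformation makes the SI-AM match the current user's observations.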
The model selection unit 305 selects an AM for the current user from the AMs 320 registered in the speech recognition device 106, based on the calculated transformation matrix output from the transformation matrix calculation unit 304, where the AMs 320 can be registered or trained in advance by the manufacturer or the user based on speech samples of the above command word list, and can be stored in the ROM 103 or the hard disk 104 of the electronic device 1. In addition, the electronic device 1 can be communicatively connected with other electronic devices via the network 7, as shown in Fig. 1; therefore, the model selection unit 305 in the electronic device 1 can also select, based on the calculated transformation matrix, an AM for the current user from the AMs registered in the other electronic devices (i.e., other speech recognition devices). As described above, since there are phoneme-based AMs and command-word-based AMs, the model selection unit 305 will be described in further detail below with reference to Fig. 4.
The second-voice-command recognition unit 306 recognizes the second voice command from the voice command segments by using the AM selected and output from the model selection unit 305, where those voice command segments do not include the voice command segment recognized as the predefined first voice command by the above predefined-first-voice-command recognition unit 303. Then, as described above, the operating unit 107 performs the operation corresponding to the second voice command output from the second-voice-command recognition unit 306, or the output device 108 can display or output the recognized second voice command.
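The flow through the five units of Fig. 3 can be summarized in a high-level sketch. The callables below are illustrative stand-ins for the units, not the patent's implementation; "segments" are pre-labeled strings rather than audio:

```python
# A high-level sketch of the Fig. 3 recognition flow, with the five units
# reduced to placeholder callables; all names and behaviors are assumed.

def recognize(speech, split, recognize_first, calc_transform, select_am,
              recognize_second, predefined_first="start"):
    segments = split(speech)                              # voice splitting unit 302
    first = [s for s in segments
             if recognize_first(s) == predefined_first]   # unit 303 (SI-AM)
    if not first:
        return None                  # no predefined first command: do nothing
    W = calc_transform(first[0])     # unit 304 (e.g. MLLR)
    am = select_am(W)                # unit 305
    rest = [s for s in segments if s not in first]
    return [recognize_second(am, s) for s in rest]        # unit 306

# Toy stand-ins: the "speech" is a list of already-labeled segments.
result = recognize(
    ["start", "print"],
    split=lambda sp: list(sp),
    recognize_first=lambda seg: seg,          # SI-AM stand-in
    calc_transform=lambda seg: "W",
    select_am=lambda W: "selected-AM",
    recognize_second=lambda am, seg: seg.upper(),
)
print(result)  # ['PRINT']
```

Note how the segment recognized as the predefined first command is consumed by the adaptation step and excluded from second-command recognition, mirroring the description above.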
An example of the internal functional units of the model selection unit 305 in the speech recognition device 106 will now be described. Fig. 4 is a block diagram of the internal functions of the model selection unit 305 shown in Fig. 3, according to an exemplary embodiment of the present invention. As shown in Fig. 4, the model selection unit 305 includes a phoneme-based acoustic model selection unit 315 and/or a command-word-based acoustic model selection unit 325, so that the model selection unit 305 can handle any type of AM.
As described above, if, when the current user logs in to the electronic device 1, the phoneme-based AMs in the loaded AM set cannot cover all the phonemes of the command words in the command word list, the speech recognition device 106 selects, according to the present invention, phoneme-based AMs of a phone set that covers all the phonemes of the command words in the command word list. In other words, the phoneme-based acoustic model selection unit 315 is configured to select, based on the calculated transformation matrix, phoneme-based AMs of a phone set for the current user from the phoneme-based AMs of phone sets (i.e., the AMs 320 shown in Fig. 3) registered in the speech recognition device 106 or in other speech recognition devices (i.e., other electronic devices) interconnected via the network. In one implementation, the phoneme-based acoustic model selection unit 315 includes a first transformation matrix acquisition unit 3151, a first distance calculation unit 3152, and a phoneme-based acoustic model determination unit 3153.
Specifically, the first transformation matrix acquisition unit 3151 acquires a transformation matrix for each phoneme-based AM of a phone set registered in the speech recognition device 106 or in other speech recognition devices (i.e., other electronic devices) interconnected via the network. As is apparent to those skilled in the art, phoneme-based AMs of different phone sets can be registered or trained for different accents, genders, age groups, and the like. Further, for the phoneme-based AM of a phone set, the transformation matrix for that AM can be calculated by using the above MLLR method when the phoneme-based AM of the phone set is registered or trained, and can be stored in the hard disk of the electronic device together with the phoneme-based AM of the phone set.
The first distance calculation unit 3152 calculates the distance between the following two transformation matrices: one is the transformation matrix for the current user calculated by the transformation matrix calculation unit 304 in Fig. 3, and the other is a transformation matrix, acquired by the first transformation matrix acquiring unit 3151, for the phoneme-based AM of a phone set. As those skilled in the art will understand, the above distance can be any known distance, such as the Euclidean distance, the K-L distance, or the Mahalanobis distance, and the detailed calculation methods will not be repeated here.
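As an illustration of the distances mentioned above, a minimal sketch follows. It assumes the transformation matrices are compared element-wise (one plausible reading of the text; the patent does not fix this detail), with the Euclidean distance computed as the Frobenius norm of the difference:

```python
import numpy as np

def euclidean_distance(w_user, w_model):
    """Euclidean (Frobenius) distance between two transformation matrices."""
    return float(np.linalg.norm(w_user - w_model))

def mahalanobis_distance(w_user, w_model, cov):
    """Mahalanobis distance between flattened matrices, given a covariance
    matrix estimated over a population of transformation matrices."""
    d = (w_user - w_model).ravel()
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

# Example: a 2x3 MLLR-style transform [A | b] for 2-dimensional features.
w_user = np.array([[1.0, 0.1, 0.2],
                   [0.0, 0.9, -0.1]])
w_model = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
print(euclidean_distance(w_user, w_model))
```

With an identity covariance, the Mahalanobis distance reduces to the Euclidean distance, which is a quick sanity check on both functions.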
The phoneme-based acoustic model determination unit 3153 determines the phoneme-based AM of the phone set with the smallest distance, as the selected phoneme-based AM for the current user.
In addition, as described above, if the command-word-based AMs in the AM set loaded when the current user logs into the electronic device 1 cannot cover all of the command words in the command word list, the speech recognition device 106 will, according to the present invention, select a command-word-based AM for each command word, or for some command words, in the command word list. In other words, the command-word-based acoustic model selection unit 325 is configured to select, based on the calculated transformation matrix, the command-word-based AM for the current user from among the command-word-based AMs (i.e., the AMs 320 shown in Fig. 3) registered in the speech recognition device 106 or in other speech recognition devices (i.e., other electronic devices) interconnected via the network. In one implementation, the command-word-based acoustic model selection unit 325 includes a second transformation matrix acquiring unit 3251, a second distance calculation unit 3252, and a command-word-based acoustic model determination unit 3253.
Specifically, for each command word in the above-mentioned command word list, the second transformation matrix acquiring unit 3251 acquires a transformation matrix for each command-word-based AM corresponding to that command word registered in the speech recognition device 106 or in other speech recognition devices (i.e., other electronic devices) interconnected via the network. As will be apparent to those skilled in the art, for one command word, one or more command-word-based AMs can be registered or trained for different accents, genders, age groups, and the like. Furthermore, for a command-word-based AM, a transformation matrix can be calculated by using the above-mentioned MLLR method when that AM is registered or trained, and can be stored together with that AM in the hard disk of the electronic device.
For each command word in the above-mentioned command word list, the second distance calculation unit 3252 calculates the distance between the following two transformation matrices: one is the transformation matrix for the current user calculated by the transformation matrix calculation unit 304 in Fig. 3, and the other is a transformation matrix, acquired by the second transformation matrix acquiring unit 3251, for a command-word-based AM corresponding to that command word. As described above, the calculated distance can likewise be any known distance, such as the Euclidean distance, the K-L distance, or the Mahalanobis distance.
For each command word in the above-mentioned command word list, the command-word-based acoustic model determination unit 3253 determines the command-word-based AM corresponding to that command word with the smallest distance, as the selected command-word-based AM for the current user.
In addition, the command-word-based acoustic model selection unit 325 can also include a recommendation unit 3254. When the command-word-based acoustic model determination unit 3253 cannot determine a corresponding command-word-based AM for some command word in the command word list, the recommendation unit 3254 recommends that the current user register a command-word-based AM for that command word, as the selected command-word-based AM.
As described above, in one implementation, if the command-word-based AMs in the loaded AM set cannot cover all of the command words in the command word list, the command-word-based acoustic model selection unit 325 can select a command-word-based AM for each command word in the command word list. In another implementation, the command-word-based acoustic model selection unit 325 can select command-word-based AMs only for those command words in the command word list for which the current user has not registered a corresponding command-word-based AM in the speech recognition device 106.
As for the speech recognition device 106, Fig. 5 illustrates a second example of its internal functional units. Fig. 5 is a block diagram of the internal functions, related to speech recognition, of the electronic device 1 associated with the voice input unit 125 in Fig. 2, according to the second exemplary embodiment of the present invention. In the second embodiment, the speech recognition device 106 will first recognize the second voice command by using the SI-AM 310. The functional units below are realized when the CPU 101 executes the programs stored in the ROM 103 and/or the hard disk 104.
Comparing Fig. 5 with Fig. 3, the speech recognition device 106 shown in Fig. 5 differs in the following respects:
First, the speech recognition device 106 further includes a third voice command recognition unit 501. The third voice command recognition unit 501 recognizes, by using the SI-AM 310, the second voice command from the voice command segments output from the model selection unit 305, wherein the voice command segments do not include the voice command segment recognized by the predefined first voice command recognition unit 303 as the predefined first voice command.
Second, the third voice command recognition unit 501 also judges whether the recognition confidence of the recognized second voice command is less than a predefined threshold, wherein the predefined threshold can, for example, be predefined by the user or the manufacturer according to the actual application. If the recognition confidence of the recognized second voice command is greater than or equal to the predefined threshold, the operation unit 107 will directly perform the operation corresponding to the recognized second voice command output from the third voice command recognition unit 501, or the output device 108 can display or output the recognized second voice command. Otherwise, if the recognition confidence of the recognized second voice command is less than the predefined threshold, the second voice command recognition unit 306 will recognize, by using the selected AM, the second voice command from the voice command segments output from the third voice command recognition unit 501, wherein the voice command segments do not include the voice command segment recognized by the predefined first voice command recognition unit 303 as the predefined first voice command. The other detailed descriptions of the voice input unit 125, the voice segmentation unit 302, the predefined first voice command recognition unit 303, the transformation matrix calculation unit 304, the model selection unit 305, the SI-AM 310, and the AMs 320 shown in Fig. 5 are similar to those of the corresponding units shown in Fig. 3, and thus their detailed explanations will not be repeated here.
(Speech recognition method stored in the storage device of the electronic device)
Fig. 6 is a second block diagram illustrating another example of the internal structure of an electronic device 1 with a speech recognition function, such as the MFP 1 in Fig. 1, according to an exemplary embodiment of the present invention, wherein the speech recognition methods described in detail below with reference to Figs. 7 to 10 are stored in the storage device of the electronic device 1 (i.e., the MFP 1). The electronic device 1 can include a central processing unit (CPU) 101, a random access memory (RAM) 102, a read-only memory (ROM) 103, a hard disk 104, an input device 105, an operation unit 107, an output device 108, and a network interface 109, and these components are communicably coupled to each other via a system bus 110.
As shown in Fig. 6, except for the speech recognition device 106, the internal structure of the electronic device 1 is substantially identical to that of the electronic device 1 shown in Fig. 2; thus, the detailed descriptions of the CPU 101, the RAM 102, the ROM 103, the hard disk 104, the input device 105, the operation unit 107, the output device 108, the network interface 109, and the system bus 110 will not be repeated here. In addition, the hard disk 104 of the electronic device 1 shown in Fig. 6 stores a speech recognition method capable of realizing the same functions as the speech recognition device 106 shown in Fig. 2.
Fig. 7 is a flowchart of the speech recognition operation, related to the voice input unit 125 of the electronic device 1 in Fig. 6, according to the above-mentioned first exemplary embodiment of the present invention. The operations of the corresponding steps below are realized when the CPU 101 loads the programs stored in the ROM 103 and/or the hard disk 104 into the RAM 102 and executes the corresponding programs.
As shown in Fig. 7, in the voice input step S701, the voice input unit 125 of the electronic device 1 shown in Fig. 6 obtains the voice input by the current user (corresponding to the voice input unit 125 in Fig. 3).
In the voice segmentation step S702, the CPU 101 of the electronic device 1 shown in Fig. 6 receives the obtained voice from the voice input unit 125, splits the obtained voice, and outputs at least two voice command segments (corresponding to the voice segmentation unit 302 in Fig. 3).
As described above, the current environment of the voice is unlikely to be absolutely quiet; therefore, in the voice segmentation step S702, in addition to the above-mentioned at least two voice command segments, the CPU 101 will also output at least one background sound segment.
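The patent leaves the segmentation criterion of step S702 unspecified. One common technique that would yield both voice command segments and background segments is a per-frame energy threshold; the sketch below is a simplified illustration under that assumption, not the patent's method:

```python
import numpy as np

def split_voice(samples, rate, frame_ms=30, threshold=0.01):
    """Split audio into voice command segments and background segments
    using a simple per-frame energy threshold (hypothetical criterion;
    the patent leaves the segmentation method unspecified)."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    flags = [np.mean(samples[i*frame_len:(i+1)*frame_len] ** 2) > threshold
             for i in range(n_frames)]
    segments, start = [], 0
    for i in range(1, n_frames + 1):
        # Close a segment at the end of the signal or where the flag flips.
        if i == n_frames or flags[i] != flags[i-1]:
            kind = "voice" if flags[start] else "background"
            segments.append((kind, start * frame_len, i * frame_len))
            start = i
    return segments

# Synthetic input: silence, a loud burst, silence, another burst.
rate = 8000
quiet = np.zeros(rate // 2)
loud = 0.5 * np.sin(2 * np.pi * 440 * np.arange(rate // 2) / rate)
audio = np.concatenate([quiet, loud, quiet, loud])
for kind, s, e in split_voice(audio, rate):
    print(kind, s, e)
```

On this synthetic input the function returns alternating background and voice segments, matching the step's requirement of at least two voice command segments plus at least one background sound segment.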
In the predefined first voice command recognition step S703, the CPU 101 of the electronic device 1 shown in Fig. 6 recognizes, by using the SI-AM stored in the electronic device 1 shown in Fig. 6, the predefined first voice command from the voice command segments output from the voice segmentation step S702 (corresponding to the predefined first voice command recognition unit 303 in Fig. 3). As described above, the predefined first voice command can, for example, be a predefined introductory word.
In the transformation matrix calculation step S704, the CPU 101 of the electronic device 1 shown in Fig. 6 calculates, based on the voice command segment recognized as the predefined first voice command, the transformation matrix for the current user, wherein the calculated transformation matrix can make the SI-AM match the voice command segment recognized as the predefined first voice command (corresponding to the transformation matrix calculation unit 304 in Fig. 3).
In addition, as described above, since the current environment of the voice is unlikely to be absolutely quiet, in the transformation matrix calculation step S704 the CPU 101 can calculate the transformation matrix for the current user based on both the background sound segment and the voice command segment recognized as the predefined first voice command. As a preferred solution, the transformation matrix can be calculated by using the above-mentioned MLLR method. In addition, as an optional solution, the CPU 101 can store the calculated transformation matrix for the current user in the hard disk 104 of the electronic device 1 shown in Fig. 6 for subsequent work, such as selecting AMs for other users using the present invention.
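MLLR adapts the Gaussian means of the SI-AM via an affine transform, mu' = A*mu + b. In the simplest single-regression-class case with identity covariances, estimating W = [A | b] from adaptation data reduces to a least-squares fit between the adaptation frames and the means of the Gaussians they align to. The sketch below shows that simplified case only (the frame-to-Gaussian alignment is assumed given; production MLLR weights terms by occupancies and covariances):

```python
import numpy as np

def estimate_mllr_transform(frames, means):
    """Least-squares estimate of a global MLLR mean transform W = [A | b]
    such that frames[t] ~= A @ means[t] + b. This is the single-
    regression-class, identity-covariance simplification of MLLR.

    frames: (T, d) adaptation feature vectors
    means:  (T, d) mean of the SI-AM Gaussian each frame aligns to
    """
    t, d = means.shape
    ext = np.hstack([means, np.ones((t, 1))])   # extended means [mu; 1]
    # Solve ext @ W.T ~= frames, so W has shape (d, d+1).
    w, *_ = np.linalg.lstsq(ext, frames, rcond=None)
    return w.T                                  # rows of [A | b]

# Toy check: frames generated by a known affine transform are recovered.
rng = np.random.default_rng(0)
means = rng.normal(size=(50, 2))
a_true = np.array([[1.1, 0.0], [0.2, 0.9]])
b_true = np.array([0.5, -0.3])
frames = means @ a_true.T + b_true
w = estimate_mllr_transform(frames, means)
print(np.round(w, 3))
```

Because the toy frames are generated by an exact affine map, the least-squares solution recovers A and b, which makes the sketch easy to verify.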
In the model selection step S705, the CPU 101 of the electronic device 1 shown in Fig. 6 selects, based on the calculated transformation matrix, the AM for the current user from among the AMs registered in the electronic device 1 shown in Fig. 6 (corresponding to the model selection unit 305 in Fig. 3).
In addition, the electronic device 1 shown in Fig. 6 can be communicably connected with other electronic devices via the network 7, as shown in Fig. 1; therefore, in the model selection step S705, the CPU 101 of the electronic device 1 shown in Fig. 6 can also select, based on the calculated transformation matrix, the AM for the current user from among the AMs registered in those other electronic devices.
As described above, since there are phoneme-based AMs and command-word-based AMs, the model selection step S705 also includes: a phoneme-based acoustic model selection step of selecting, based on the calculated transformation matrix, the phoneme-based AM of the phone set for the current user from among the phoneme-based AMs of phone sets registered in the electronic device 1 shown in Fig. 6 or in other electronic devices interconnected via the network; and/or a command-word-based acoustic model selection step of selecting, based on the calculated transformation matrix, the command-word-based AM for the current user from among the command-word-based AMs registered in the electronic device 1 shown in Fig. 6 or in other electronic devices interconnected via the network.
Fig. 8 shows an exemplary method for selecting a phoneme-based AM. Fig. 8 is a flowchart schematically showing the steps, performed in the model selection step S705 illustrated in Fig. 7, for selecting a phoneme-based AM.
As shown in Fig. 8, in the first transformation matrix acquiring step S7051, the CPU 101 of the electronic device 1 shown in Fig. 6 acquires a transformation matrix for each phoneme-based AM of a phone set registered in the electronic device 1 shown in Fig. 6 or in other electronic devices interconnected via the network (corresponding to the first transformation matrix acquiring unit 3151 in Fig. 4).
In the first distance calculation step S7052, the CPU 101 of the electronic device 1 shown in Fig. 6 calculates the distance between the following two transformation matrices: one is the calculated transformation matrix for the current user output from the transformation matrix calculation step S704 in Fig. 7, and the other is an acquired transformation matrix, output from the first transformation matrix acquiring step S7051, for the phoneme-based AM of a phone set (corresponding to the first distance calculation unit 3152 in Fig. 4).
In the phoneme-based acoustic model determination step S7053, the CPU 101 of the electronic device 1 shown in Fig. 6 determines the phoneme-based AM of the phone set with the smallest distance, as the selected phoneme-based AM for the current user (corresponding to the phoneme-based acoustic model determination unit 3153 in Fig. 4).
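Taken together, steps S7051 to S7053 amount to a nearest-neighbour search over the transforms stored with the registered AMs. A minimal sketch (the model names and the choice of Frobenius distance are illustrative assumptions, not details fixed by the patent):

```python
import numpy as np

def select_phoneme_am(user_transform, registered_ams):
    """Return the name of the registered phoneme-based AM whose stored
    transformation matrix is closest to the current user's transform
    (steps S7051-S7053). Distance metric here: Euclidean/Frobenius."""
    best_name, best_dist = None, float("inf")
    for name, am_transform in registered_ams.items():        # S7051: acquire
        dist = np.linalg.norm(user_transform - am_transform)  # S7052: distance
        if dist < best_dist:                                  # S7053: minimum
            best_name, best_dist = name, dist
    return best_name

registered = {
    "phone_set_accent_a": np.array([[1.0, 0.0], [0.0, 1.0]]),
    "phone_set_accent_b": np.array([[1.2, 0.1], [0.1, 0.8]]),
}
user = np.array([[1.15, 0.1], [0.05, 0.85]])
print(select_phoneme_am(user, registered))
```

Here the user's transform lies closer to the second registered transform, so that phone set's AM would be selected for the current user.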
Fig. 9 shows an exemplary method for selecting a command-word-based AM for one command word in the above-mentioned command word list. Fig. 9 is a flowchart schematically showing the steps, performed in the model selection step S705 illustrated in Fig. 7, for selecting a command-word-based AM.
As shown in Fig. 9, in the second transformation matrix acquiring step S7151, the CPU 101 of the electronic device 1 shown in Fig. 6 acquires a transformation matrix for each command-word-based AM corresponding to the command word registered in the electronic device 1 shown in Fig. 6 or in other electronic devices interconnected via the network (corresponding to the second transformation matrix acquiring unit 3251 in Fig. 4).
In the second distance calculation step S7152, the CPU 101 of the electronic device 1 shown in Fig. 6 calculates the distance between the following two transformation matrices: one is the calculated transformation matrix for the current user output from the transformation matrix calculation step S704 in Fig. 7, and the other is an acquired transformation matrix, output from the second transformation matrix acquiring step S7151, for a command-word-based AM corresponding to the command word (corresponding to the second distance calculation unit 3252 in Fig. 4).
In the command-word-based acoustic model determination step S7153, the CPU 101 of the electronic device 1 shown in Fig. 6 determines the command-word-based AM corresponding to the command word with the smallest distance, as the selected command-word-based AM for the current user (corresponding to the command-word-based acoustic model determination unit 3253 in Fig. 4).
In step S7154, the CPU 101 of the electronic device 1 shown in Fig. 6 judges whether it can obtain a minimum distance for each command word in the command word list. If the CPU 101 cannot obtain a minimum distance for some command word in the command word list, then in the recommendation step S7155 the CPU 101 recommends that the current user register a command-word-based AM for each command word for which no corresponding command-word-based AM could be determined in the command-word-based acoustic model determination step S7153, as the selected command-word-based AM (corresponding to the recommendation unit 3254 in Fig. 4).
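Steps S7151 to S7155 can be sketched as one nearest-transform lookup per command word, with a registration recommendation whenever no candidate AM exists for a word. The data layout below (a dict of command word to candidate AMs) is an illustrative assumption:

```python
import numpy as np

def select_command_word_ams(user_transform, command_words, registered):
    """For each command word, pick the registered command-word-based AM
    whose transform is closest to the user's (S7151-S7153); if none is
    registered, recommend that the user register one (S7154-S7155).

    registered: dict mapping command word -> list of (am_name, transform)
    """
    selected, to_recommend = {}, []
    for word in command_words:
        candidates = registered.get(word, [])
        if not candidates:                        # S7154: no minimum exists
            to_recommend.append(word)             # S7155: recommend registering
            continue
        selected[word] = min(
            candidates,
            key=lambda c: np.linalg.norm(user_transform - c[1]),
        )[0]
    return selected, to_recommend

user = np.array([[1.0, 0.1], [0.0, 0.9]])
registered = {
    "print": [("print_male", np.eye(2)),
              ("print_female", np.array([[1.0, 0.1], [0.0, 0.95]]))],
}
sel, rec = select_command_word_ams(user, ["print", "scan"], registered)
print(sel, rec)
```

In this toy run, "print" gets the closest of its two candidate AMs, while "scan" has no registered AM and ends up on the recommendation list.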
Although in the above-mentioned command-word-based acoustic model selection step the CPU 101 of the electronic device 1 shown in Fig. 6 selects a command-word-based AM for each command word in the command word list, the CPU 101 can also select command-word-based AMs only for those command words in the command word list for which the current user has not registered a corresponding command-word-based AM in the electronic device 1 shown in Fig. 6.
Now, returning to Fig. 7: in the second voice command recognition step S706, the CPU 101 of the electronic device 1 shown in Fig. 6 recognizes, by using the selected AM, the second voice command from the voice command segments, wherein the voice command segments do not include the voice command segment recognized as the predefined first voice command in the predefined first voice command recognition step S703 (corresponding to the second voice command recognition unit 306 in Fig. 3).
It should be pointed out that the units of the speech recognition device 106 shown in Figs. 3 and 4 can be constructed to perform the respective steps of the speech recognition methods shown in the flowcharts in Figs. 7 to 9.
Fig. 10 is a flowchart of the speech recognition operation, related to the voice input unit 125 of the electronic device 1 in Fig. 6, according to the above-mentioned second exemplary embodiment of the present invention. The operations of the corresponding steps below are realized when the CPU 101 loads the programs stored in the ROM 103 and/or the hard disk 104 into the RAM 102 and executes the corresponding programs.
Comparing Fig. 10 with Fig. 7, the speech recognition method shown in Fig. 10 differs in the following respects:
First, the speech recognition method further includes a third voice command recognition step S1001. In the third voice command recognition step S1001, the CPU 101 of the electronic device 1 shown in Fig. 6 recognizes, by using the SI-AM stored in the electronic device 1 shown in Fig. 6, the second voice command from the voice command segments output from the model selection step S705, wherein the voice command segments do not include the voice command segment recognized as the predefined first voice command in the predefined first voice command recognition step S703.
Second, in step S1002, the CPU 101 of the electronic device 1 shown in Fig. 6 also judges whether the recognition confidence of the recognized second voice command output from the third voice command recognition step S1001 is less than a predefined threshold, wherein the predefined threshold can, for example, be predefined by the user or the manufacturer according to the actual application. If the recognition confidence of the recognized second voice command is less than the predefined threshold, then in the second voice command recognition step S706 the CPU 101 of the electronic device 1 shown in Fig. 6 recognizes, by using the selected AM, the second voice command from the voice command segments output from step S1002, wherein the voice command segments do not include the voice command segment recognized as the predefined first voice command in the predefined first voice command recognition step S703.
Otherwise, if the recognition confidence of the recognized second voice command is greater than or equal to the predefined threshold, the CPU 101 of the electronic device 1 shown in Fig. 6 outputs the recognized second voice command to the operation unit 107 or the output device 108 of the electronic device 1 shown in Fig. 6.
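The control flow of steps S1001, S1002, and S706 is a two-pass scheme: a speaker-independent pass first, then a fall-back pass with the user-selected AM only when confidence is low. A minimal sketch, in which the recognizer interface (a callable returning a command and a confidence) and the threshold value are illustrative assumptions:

```python
def recognize_second_command(segments, si_am, selected_am, threshold=0.8):
    """Two-pass recognition per Fig. 10: first decode with the
    speaker-independent AM (S1001); if the confidence is below the
    predefined threshold (S1002), re-decode with the AM selected for
    the current user (S706). Each AM is modelled as a callable
    returning (command, confidence) -- a placeholder interface."""
    command, confidence = si_am(segments)            # S1001
    if confidence < threshold:                       # S1002
        command, confidence = selected_am(segments)  # S706
    return command, confidence

# Stub AMs: the SI-AM is unsure, the user-adapted AM is confident.
si_am = lambda segs: ("copy", 0.55)
selected_am = lambda segs: ("copy A4", 0.92)
print(recognize_second_command(["seg1"], si_am, selected_am))
```

The design point is that the second, adapted-model pass only runs when needed, so a confident speaker-independent result is returned without the extra decoding cost.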
The other detailed descriptions of the voice input step S701, the voice segmentation step S702, the predefined first voice command recognition step S703, the transformation matrix calculation step S704, and the model selection step S705 shown in Fig. 10 are similar to those of the corresponding steps shown in Fig. 7, and thus the detailed descriptions will not be repeated here.
It should be pointed out that the units of the speech recognition device 106 shown in Fig. 5 can be constructed to perform the respective steps of the speech recognition method shown in the flowchart in Fig. 10.
With the above-mentioned exemplary speech recognition devices and speech recognition methods, the transformation matrix is calculated based on the voice instantaneously input by the current user, namely a part of the split voice command segments (i.e., the voice command segment recognized as the predefined first voice command), and can also be calculated based on that part of the voice command segments together with the background sound segment. Therefore, the AM selected based on the calculated transformation matrix can well match the current user's instantaneous speech characteristics and instantaneous environment characteristics, and the speech recognition performance for the current user can be improved by using the selected AM.
All of the above-mentioned units are exemplary and/or preferred modules for realizing the processes described in the present disclosure. These units can be hardware units (such as field-programmable gate arrays (FPGAs), digital signal processors, or application-specific integrated circuits) and/or software modules (such as computer-readable programs). The units for realizing the various steps are not described exhaustively above. However, where there is a step of performing a certain process, there can be a corresponding functional module or unit (realized by hardware and/or software) for realizing the same process. Technical solutions formed by all combinations of the described steps and the units corresponding to these steps are included in the disclosure of the present application, as long as the technical solutions they constitute are complete and applicable.
The methods and devices of the present invention can be implemented in many ways. For example, the methods and devices of the present invention can be implemented by software, hardware, firmware, or any combination thereof. The above-described order of the steps of the methods is intended to be illustrative only, and the steps of the methods of the present invention are not limited to the order specifically described above, unless otherwise specified. In addition, in some embodiments, the present invention can also be embodied as programs recorded in a recording medium, including machine-readable instructions for realizing the methods according to the present invention. Therefore, the present invention also covers a recording medium storing programs for realizing the methods according to the present invention.
Although some specific embodiments of the present invention have been described in detail above with examples, those skilled in the art should understand that the above examples are intended to be illustrative only and do not limit the scope of the present invention. Those skilled in the art will appreciate that the above embodiments can be modified without departing from the scope and spirit of the present invention. The scope of the present invention is defined by the appended claims.
Claims (16)
1. A speech recognition device, the speech recognition device comprising:
a voice input unit configured to obtain voice input by a current user;
a voice segmentation unit configured to split the obtained voice and output at least two voice command segments;
a predefined first voice command recognition unit configured to recognize a predefined first voice command from the voice command segments by using a speaker-independent acoustic model;
a transformation matrix calculation unit configured to calculate a transformation matrix for the current user based on the voice command segment recognized as the predefined first voice command, wherein the calculated transformation matrix can make the speaker-independent acoustic model match the voice command segment recognized as the predefined first voice command;
a model selection unit configured to select, based on the calculated transformation matrix, an acoustic model for the current user from among acoustic models registered in the speech recognition device; and
a second voice command recognition unit configured to recognize a second voice command from the voice command segments by using the selected acoustic model.
2. The speech recognition device according to claim 1, wherein
the output of the voice segmentation unit further includes at least one background sound segment,
the transformation matrix calculation unit calculates the transformation matrix based on the background sound segment and the voice command segment recognized as the predefined first voice command, and
the calculated transformation matrix can make the speaker-independent acoustic model match the background sound segment and the voice command segment recognized as the predefined first voice command.
3. The speech recognition device according to claim 1 or claim 2, wherein the model selection unit includes:
a phoneme-based acoustic model selection unit configured to select, based on the calculated transformation matrix, a phoneme-based acoustic model of a phone set for the current user from among phoneme-based acoustic models of phone sets registered in the speech recognition device; and/or
a command-word-based acoustic model selection unit configured to select, based on the calculated transformation matrix, a command-word-based acoustic model for the current user from among command-word-based acoustic models registered in the speech recognition device.
4. The speech recognition device according to claim 3, wherein the phoneme-based acoustic model selection unit includes:
a first transformation matrix acquiring unit configured to acquire transformation matrices for the phoneme-based acoustic models of phone sets registered in the speech recognition device;
a first distance calculation unit configured to calculate the distance between the calculated transformation matrix for the current user and an acquired transformation matrix for a phoneme-based acoustic model of a phone set; and
a phoneme-based acoustic model determination unit configured to determine the phoneme-based acoustic model of the phone set with the smallest distance, as the selected phoneme-based acoustic model for the current user.
5. The speech recognition device according to claim 3, wherein the command-word-based acoustic model selection unit includes:
a second transformation matrix acquiring unit configured to acquire, for each command word in a predefined command word list, transformation matrices for the command-word-based acoustic models corresponding to that command word registered in the speech recognition device;
a second distance calculation unit configured to calculate, for each command word in the predefined command word list, the distance between the calculated transformation matrix for the current user and an acquired transformation matrix for a command-word-based acoustic model corresponding to that command word; and
a command-word-based acoustic model determination unit configured to determine, for each command word in the predefined command word list, the command-word-based acoustic model corresponding to that command word with the smallest distance, as the selected command-word-based acoustic model for the current user.
6. The speech recognition device according to claim 5, wherein the command-word-based acoustic model selection unit further includes:
a recommendation unit configured to recommend that the current user register a command-word-based acoustic model for each command word for which the command-word-based acoustic model determination unit cannot determine a corresponding command-word-based acoustic model, as the selected command-word-based acoustic model.
7. The speech recognition device according to claim 5, wherein the command-word-based acoustic model selection unit selects command-word-based acoustic models only for those command words in the predefined command word list for which the current user has not registered a corresponding command-word-based acoustic model in the speech recognition device.
8. The speech recognition device according to claim 1 or claim 2, wherein the speech recognition device verifies whether the acoustic models in an acoustic model set cover the complete command words in a predefined command word list,
if the acoustic models in the acoustic model set cover the complete command words in the predefined command word list, the speech recognition device recognizes the second voice command from the voice command segments by using the acoustic models in the acoustic model set; otherwise, the speech recognition device recognizes the second voice command from the voice command segments by using the selected acoustic model, and
wherein the acoustic model set and the predefined command word list are loaded when the current user logs into the speech recognition device.
9. The speech recognition device according to claim 1 or claim 2, wherein the model selection unit selects the acoustic model for the current user from among acoustic models registered in the speech recognition device and/or in other speech recognition devices interconnected via a network.
10. The speech recognition device according to claim 1 or claim 2, the speech recognition device further comprising:
a third voice command recognition unit configured to recognize the second voice command from the voice command segments by using the speaker-independent acoustic model, wherein
when the recognition confidence output from the third voice command recognition unit is less than a predefined threshold, the second voice command recognition unit recognizes the second voice command from the voice command segments by using the selected acoustic model.
11. A speech recognition method, comprising:
a voice input step of obtaining voice input to a speech recognition device by a current user;
a voice segmentation step of splitting the obtained voice and outputting at least two voice command segments;
a predefined first voice command recognition step of identifying a predefined first voice command from the voice command segments by using a speaker-independent acoustic model;
a transformation matrix calculation step of calculating a transformation matrix for the current user based on the voice command segment identified as the predefined first voice command, wherein the calculated transformation matrix makes the speaker-independent acoustic model match the voice command segment identified as the predefined first voice command;
a model selection step of selecting, based on the calculated transformation matrix, an acoustic model for the current user from acoustic models registered in the speech recognition device; and
a second voice command recognition step of identifying a second voice command from the voice command segments by using the selected acoustic model.
12. The speech recognition method according to claim 11, wherein:
the output of the voice segmentation step further includes at least one background sound segment;
the transformation matrix calculation step calculates the transformation matrix based on the background sound segment and the voice command segment identified as the predefined first voice command; and
the calculated transformation matrix makes the speaker-independent acoustic model match the background sound segment and the voice command segment identified as the predefined first voice command.
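Claim 12 pools the background sound segment with the matched command segment when estimating the transform. As a drastically simplified, one-dimensional stand-in for an MLLR-style estimate, one can fit a single affine map y ≈ a·x + b between model-side and observed feature values pooled from both segment types. The least-squares formulation here is an assumption; the claims do not specify the estimation method:

```python
def fit_affine(model_frames, observed_frames):
    # Least-squares fit of observed ≈ a * model + b over frames pooled
    # from the background segment and the first-command segment.
    n = len(model_frames)
    mx = sum(model_frames) / n
    my = sum(observed_frames) / n
    sxx = sum((x - mx) ** 2 for x in model_frames)
    sxy = sum((x - mx) * (y - my)
              for x, y in zip(model_frames, observed_frames))
    a = sxy / sxx
    b = my - a * mx
    return a, b
```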
13. The speech recognition method according to claim 11 or claim 12, wherein the model selection step includes:
a phoneme-based acoustic model selection step of selecting, based on the calculated transformation matrix, a phoneme-based acoustic model of a phone set for the current user from phoneme-based acoustic models of phone sets registered in the speech recognition device; and/or
a command-word-based acoustic model selection step of selecting, based on the calculated transformation matrix, a command-word-based acoustic model for the current user from command-word-based acoustic models registered in the speech recognition device.
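One way to make selection "based on the calculated transformation matrix" concrete, for either the phoneme-based or the command-word-based model pool, is a nearest-neighbour search over transforms stored when each model was registered. The Frobenius-distance criterion below is an assumption, not something the claims state:

```python
def frobenius_distance(a, b):
    # Root of summed squared element-wise differences between two
    # matrices given as nested lists.
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b)) ** 0.5

def select_acoustic_model(user_transform, registered_transforms):
    # Pick the registered model (keyed by id) whose stored transform is
    # closest to the transform just estimated for the current user.
    return min(registered_transforms,
               key=lambda model_id: frobenius_distance(
                   user_transform, registered_transforms[model_id]))
```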
14. The speech recognition method according to claim 11 or claim 12, wherein the speech recognition method verifies whether an acoustic model in an acoustic model set covers the complete command words in a predefined command word list;
if an acoustic model in the acoustic model set covers the complete command words in the predefined command word list, the second voice command is identified from the voice command segments by using that acoustic model in the acoustic model set; otherwise, the second voice command is identified from the voice command segments by using the selected acoustic model; and
wherein the acoustic model set and the predefined command word list are loaded when the current user logs into the speech recognition device.
15. The speech recognition method according to claim 11 or claim 12, wherein the model selection step selects the acoustic model for the current user from acoustic models registered in the speech recognition device and/or in other speech recognition devices interconnected via a network.
16. The speech recognition method according to claim 11 or claim 12, further comprising:
a third voice command recognition step of identifying the second voice command from the voice command segments by using the speaker-independent acoustic model,
wherein, when the recognition confidence output by the third voice command recognition step is lower than a predefined threshold, the second voice command recognition step identifies the second voice command from the voice command segments by using the selected acoustic model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510032839.8A CN105869641A (en) | 2015-01-22 | 2015-01-22 | Speech recognition device and speech recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510032839.8A CN105869641A (en) | 2015-01-22 | 2015-01-22 | Speech recognition device and speech recognition method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105869641A true CN105869641A (en) | 2016-08-17 |
Family
ID=56623369
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510032839.8A Pending CN105869641A (en) | 2015-01-22 | 2015-01-22 | Speech recognition device and speech recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105869641A (en) |
Cited By (66)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109257942A (en) * | 2017-05-12 | 2019-01-22 | 苹果公司 | The specific acoustic model of user |
US10410637B2 (en) | 2017-05-12 | 2019-09-10 | Apple Inc. | User-specific acoustic models |
CN110706710A (en) * | 2018-06-25 | 2020-01-17 | 普天信息技术有限公司 | Voice recognition method and device, electronic equipment and storage medium |
US10720160B2 (en) | 2018-06-01 | 2020-07-21 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US10741181B2 (en) | 2017-05-09 | 2020-08-11 | Apple Inc. | User interface for correcting recognition errors |
US10878809B2 (en) | 2014-05-30 | 2020-12-29 | Apple Inc. | Multi-command single utterance input method |
US10930282B2 (en) | 2015-03-08 | 2021-02-23 | Apple Inc. | Competing devices responding to voice triggers |
US10978090B2 (en) | 2013-02-07 | 2021-04-13 | Apple Inc. | Voice trigger for a digital assistant |
US11009970B2 (en) | 2018-06-01 | 2021-05-18 | Apple Inc. | Attention aware virtual assistant dismissal |
US11037565B2 (en) | 2016-06-10 | 2021-06-15 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US11070949B2 (en) | 2015-05-27 | 2021-07-20 | Apple Inc. | Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display |
US11087759B2 (en) | 2015-03-08 | 2021-08-10 | Apple Inc. | Virtual assistant activation |
US11120372B2 (en) | 2011-06-03 | 2021-09-14 | Apple Inc. | Performing actions associated with task items that represent tasks to perform |
US11126400B2 (en) | 2015-09-08 | 2021-09-21 | Apple Inc. | Zero latency digital assistant |
US11133008B2 (en) | 2014-05-30 | 2021-09-28 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
US11152002B2 (en) | 2016-06-11 | 2021-10-19 | Apple Inc. | Application integration with a digital assistant |
US11169616B2 (en) | 2018-05-07 | 2021-11-09 | Apple Inc. | Raise to speak |
US11217251B2 (en) | 2019-05-06 | 2022-01-04 | Apple Inc. | Spoken notifications |
US11237797B2 (en) | 2019-05-31 | 2022-02-01 | Apple Inc. | User activity shortcut suggestions |
US11257504B2 (en) | 2014-05-30 | 2022-02-22 | Apple Inc. | Intelligent assistant for home automation |
US11269678B2 (en) | 2012-05-15 | 2022-03-08 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US11289073B2 (en) | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
US11348582B2 (en) | 2008-10-02 | 2022-05-31 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US11348573B2 (en) | 2019-03-18 | 2022-05-31 | Apple Inc. | Multimodality in digital assistant systems |
US11360641B2 (en) | 2019-06-01 | 2022-06-14 | Apple Inc. | Increasing the relevance of new available information |
US11380310B2 (en) | 2017-05-12 | 2022-07-05 | Apple Inc. | Low-latency intelligent automated assistant |
US11388291B2 (en) | 2013-03-14 | 2022-07-12 | Apple Inc. | System and method for processing voicemail |
US11405466B2 (en) | 2017-05-12 | 2022-08-02 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US11423886B2 (en) | 2010-01-18 | 2022-08-23 | Apple Inc. | Task flow identification based on user intent |
US11423908B2 (en) | 2019-05-06 | 2022-08-23 | Apple Inc. | Interpreting spoken requests |
US11431642B2 (en) | 2018-06-01 | 2022-08-30 | Apple Inc. | Variable latency device coordination |
US11467802B2 (en) | 2017-05-11 | 2022-10-11 | Apple Inc. | Maintaining privacy of personal information |
US11468282B2 (en) | 2015-05-15 | 2022-10-11 | Apple Inc. | Virtual assistant in a communication session |
US11475898B2 (en) | 2018-10-26 | 2022-10-18 | Apple Inc. | Low-latency multi-speaker speech recognition |
US11475884B2 (en) | 2019-05-06 | 2022-10-18 | Apple Inc. | Reducing digital assistant latency when a language is incorrectly determined |
US11488406B2 (en) | 2019-09-25 | 2022-11-01 | Apple Inc. | Text detection using global geometry estimators |
US11496600B2 (en) | 2019-05-31 | 2022-11-08 | Apple Inc. | Remote execution of machine-learned models |
US11500672B2 (en) | 2015-09-08 | 2022-11-15 | Apple Inc. | Distributed personal assistant |
US11516537B2 (en) | 2014-06-30 | 2022-11-29 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US11526368B2 (en) | 2015-11-06 | 2022-12-13 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US11532306B2 (en) | 2017-05-16 | 2022-12-20 | Apple Inc. | Detecting a trigger of a digital assistant |
US11599331B2 (en) | 2017-05-11 | 2023-03-07 | Apple Inc. | Maintaining privacy of personal information |
US11656884B2 (en) | 2017-01-09 | 2023-05-23 | Apple Inc. | Application integration with a digital assistant |
US11657813B2 (en) | 2019-05-31 | 2023-05-23 | Apple Inc. | Voice identification in digital assistant systems |
US11671920B2 (en) | 2007-04-03 | 2023-06-06 | Apple Inc. | Method and system for operating a multifunction portable electronic device using voice-activation |
US11675829B2 (en) | 2017-05-16 | 2023-06-13 | Apple Inc. | Intelligent automated assistant for media exploration |
US11696060B2 (en) | 2020-07-21 | 2023-07-04 | Apple Inc. | User identification using headphones |
US11710482B2 (en) | 2018-03-26 | 2023-07-25 | Apple Inc. | Natural assistant interaction |
US11727219B2 (en) | 2013-06-09 | 2023-08-15 | Apple Inc. | System and method for inferring user intent from speech inputs |
US11755276B2 (en) | 2020-05-12 | 2023-09-12 | Apple Inc. | Reducing description length based on confidence |
US11765209B2 (en) | 2020-05-11 | 2023-09-19 | Apple Inc. | Digital assistant hardware abstraction |
US11790914B2 (en) | 2019-06-01 | 2023-10-17 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
US11798547B2 (en) | 2013-03-15 | 2023-10-24 | Apple Inc. | Voice activated device for use with a voice-based digital assistant |
US11809483B2 (en) | 2015-09-08 | 2023-11-07 | Apple Inc. | Intelligent automated assistant for media search and playback |
US11810578B2 (en) | 2020-05-11 | 2023-11-07 | Apple Inc. | Device arbitration for digital assistant-based intercom systems |
US11809783B2 (en) | 2016-06-11 | 2023-11-07 | Apple Inc. | Intelligent device arbitration and control |
US11838734B2 (en) | 2020-07-20 | 2023-12-05 | Apple Inc. | Multi-device audio adjustment coordination |
US11853647B2 (en) | 2015-12-23 | 2023-12-26 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US11854539B2 (en) | 2018-05-07 | 2023-12-26 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US11853536B2 (en) | 2015-09-08 | 2023-12-26 | Apple Inc. | Intelligent automated assistant in a media environment |
US11886805B2 (en) | 2015-11-09 | 2024-01-30 | Apple Inc. | Unconventional virtual assistant interactions |
US11893992B2 (en) | 2018-09-28 | 2024-02-06 | Apple Inc. | Multi-modal inputs for voice commands |
US11914848B2 (en) | 2020-05-11 | 2024-02-27 | Apple Inc. | Providing relevant data items based on context |
US11947873B2 (en) | 2015-06-29 | 2024-04-02 | Apple Inc. | Virtual assistant for media playback |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101622660A (en) * | 2007-02-28 | 2010-01-06 | 日本电气株式会社 | Audio recognition device, audio recognition method, and audio recognition program |
CN103221996A (en) * | 2010-12-10 | 2013-07-24 | 松下电器产业株式会社 | Device and method for pass-hrase modeling for speaker verification, and verification system |
CN103236260A (en) * | 2013-03-29 | 2013-08-07 | 京东方科技集团股份有限公司 | Voice recognition system |
CN103578462A (en) * | 2012-07-18 | 2014-02-12 | 株式会社东芝 | Speech processing system |
CN104137178A (en) * | 2011-12-19 | 2014-11-05 | 斯班逊有限公司 | Acoustic processing unit interface |
CN104143332A (en) * | 2013-05-08 | 2014-11-12 | 卡西欧计算机株式会社 | VOICE PROCESSING DEVICE, and VOICE PROCESSING METHOD |
Cited By (99)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11671920B2 (en) | 2007-04-03 | 2023-06-06 | Apple Inc. | Method and system for operating a multifunction portable electronic device using voice-activation |
US11348582B2 (en) | 2008-10-02 | 2022-05-31 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US11900936B2 (en) | 2008-10-02 | 2024-02-13 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US11423886B2 (en) | 2010-01-18 | 2022-08-23 | Apple Inc. | Task flow identification based on user intent |
US11120372B2 (en) | 2011-06-03 | 2021-09-14 | Apple Inc. | Performing actions associated with task items that represent tasks to perform |
US11269678B2 (en) | 2012-05-15 | 2022-03-08 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US11321116B2 (en) | 2012-05-15 | 2022-05-03 | Apple Inc. | Systems and methods for integrating third party services with a digital assistant |
US11862186B2 (en) | 2013-02-07 | 2024-01-02 | Apple Inc. | Voice trigger for a digital assistant |
US10978090B2 (en) | 2013-02-07 | 2021-04-13 | Apple Inc. | Voice trigger for a digital assistant |
US11557310B2 (en) | 2013-02-07 | 2023-01-17 | Apple Inc. | Voice trigger for a digital assistant |
US11636869B2 (en) | 2013-02-07 | 2023-04-25 | Apple Inc. | Voice trigger for a digital assistant |
US11388291B2 (en) | 2013-03-14 | 2022-07-12 | Apple Inc. | System and method for processing voicemail |
US11798547B2 (en) | 2013-03-15 | 2023-10-24 | Apple Inc. | Voice activated device for use with a voice-based digital assistant |
US11727219B2 (en) | 2013-06-09 | 2023-08-15 | Apple Inc. | System and method for inferring user intent from speech inputs |
US10878809B2 (en) | 2014-05-30 | 2020-12-29 | Apple Inc. | Multi-command single utterance input method |
US11133008B2 (en) | 2014-05-30 | 2021-09-28 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US11257504B2 (en) | 2014-05-30 | 2022-02-22 | Apple Inc. | Intelligent assistant for home automation |
US11810562B2 (en) | 2014-05-30 | 2023-11-07 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US11699448B2 (en) | 2014-05-30 | 2023-07-11 | Apple Inc. | Intelligent assistant for home automation |
US11670289B2 (en) | 2014-05-30 | 2023-06-06 | Apple Inc. | Multi-command single utterance input method |
US11838579B2 (en) | 2014-06-30 | 2023-12-05 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US11516537B2 (en) | 2014-06-30 | 2022-11-29 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US11842734B2 (en) | 2015-03-08 | 2023-12-12 | Apple Inc. | Virtual assistant activation |
US10930282B2 (en) | 2015-03-08 | 2021-02-23 | Apple Inc. | Competing devices responding to voice triggers |
US11087759B2 (en) | 2015-03-08 | 2021-08-10 | Apple Inc. | Virtual assistant activation |
US11468282B2 (en) | 2015-05-15 | 2022-10-11 | Apple Inc. | Virtual assistant in a communication session |
US11070949B2 (en) | 2015-05-27 | 2021-07-20 | Apple Inc. | Systems and methods for proactively identifying and surfacing relevant content on an electronic device with a touch-sensitive display |
US11947873B2 (en) | 2015-06-29 | 2024-04-02 | Apple Inc. | Virtual assistant for media playback |
US11550542B2 (en) | 2015-09-08 | 2023-01-10 | Apple Inc. | Zero latency digital assistant |
US11126400B2 (en) | 2015-09-08 | 2021-09-21 | Apple Inc. | Zero latency digital assistant |
US11954405B2 (en) | 2015-09-08 | 2024-04-09 | Apple Inc. | Zero latency digital assistant |
US11809483B2 (en) | 2015-09-08 | 2023-11-07 | Apple Inc. | Intelligent automated assistant for media search and playback |
US11500672B2 (en) | 2015-09-08 | 2022-11-15 | Apple Inc. | Distributed personal assistant |
US11853536B2 (en) | 2015-09-08 | 2023-12-26 | Apple Inc. | Intelligent automated assistant in a media environment |
US11526368B2 (en) | 2015-11-06 | 2022-12-13 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US11809886B2 (en) | 2015-11-06 | 2023-11-07 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US11886805B2 (en) | 2015-11-09 | 2024-01-30 | Apple Inc. | Unconventional virtual assistant interactions |
US11853647B2 (en) | 2015-12-23 | 2023-12-26 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US11657820B2 (en) | 2016-06-10 | 2023-05-23 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US11037565B2 (en) | 2016-06-10 | 2021-06-15 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US11152002B2 (en) | 2016-06-11 | 2021-10-19 | Apple Inc. | Application integration with a digital assistant |
US11809783B2 (en) | 2016-06-11 | 2023-11-07 | Apple Inc. | Intelligent device arbitration and control |
US11749275B2 (en) | 2016-06-11 | 2023-09-05 | Apple Inc. | Application integration with a digital assistant |
US11656884B2 (en) | 2017-01-09 | 2023-05-23 | Apple Inc. | Application integration with a digital assistant |
US10741181B2 (en) | 2017-05-09 | 2020-08-11 | Apple Inc. | User interface for correcting recognition errors |
US11599331B2 (en) | 2017-05-11 | 2023-03-07 | Apple Inc. | Maintaining privacy of personal information |
US11467802B2 (en) | 2017-05-11 | 2022-10-11 | Apple Inc. | Maintaining privacy of personal information |
US11538469B2 (en) | 2017-05-12 | 2022-12-27 | Apple Inc. | Low-latency intelligent automated assistant |
CN109257942A (en) * | 2017-05-12 | 2019-01-22 | 苹果公司 | The specific acoustic model of user |
US10410637B2 (en) | 2017-05-12 | 2019-09-10 | Apple Inc. | User-specific acoustic models |
CN109257942B (en) * | 2017-05-12 | 2020-01-14 | 苹果公司 | User-specific acoustic models |
US11862151B2 (en) | 2017-05-12 | 2024-01-02 | Apple Inc. | Low-latency intelligent automated assistant |
US11580990B2 (en) | 2017-05-12 | 2023-02-14 | Apple Inc. | User-specific acoustic models |
US11380310B2 (en) | 2017-05-12 | 2022-07-05 | Apple Inc. | Low-latency intelligent automated assistant |
US11837237B2 (en) | 2017-05-12 | 2023-12-05 | Apple Inc. | User-specific acoustic models |
US11405466B2 (en) | 2017-05-12 | 2022-08-02 | Apple Inc. | Synchronization and task delegation of a digital assistant |
US11675829B2 (en) | 2017-05-16 | 2023-06-13 | Apple Inc. | Intelligent automated assistant for media exploration |
US11532306B2 (en) | 2017-05-16 | 2022-12-20 | Apple Inc. | Detecting a trigger of a digital assistant |
US11710482B2 (en) | 2018-03-26 | 2023-07-25 | Apple Inc. | Natural assistant interaction |
US11907436B2 (en) | 2018-05-07 | 2024-02-20 | Apple Inc. | Raise to speak |
US11854539B2 (en) | 2018-05-07 | 2023-12-26 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US11900923B2 (en) | 2018-05-07 | 2024-02-13 | Apple Inc. | Intelligent automated assistant for delivering content from user experiences |
US11169616B2 (en) | 2018-05-07 | 2021-11-09 | Apple Inc. | Raise to speak |
US11487364B2 (en) | 2018-05-07 | 2022-11-01 | Apple Inc. | Raise to speak |
US11360577B2 (en) | 2018-06-01 | 2022-06-14 | Apple Inc. | Attention aware virtual assistant dismissal |
US11431642B2 (en) | 2018-06-01 | 2022-08-30 | Apple Inc. | Variable latency device coordination |
US10984798B2 (en) | 2018-06-01 | 2021-04-20 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US10720160B2 (en) | 2018-06-01 | 2020-07-21 | Apple Inc. | Voice interaction at a primary device to access call functionality of a companion device |
US11009970B2 (en) | 2018-06-01 | 2021-05-18 | Apple Inc. | Attention aware virtual assistant dismissal |
US11630525B2 (en) | 2018-06-01 | 2023-04-18 | Apple Inc. | Attention aware virtual assistant dismissal |
CN110706710A (en) * | 2018-06-25 | 2020-01-17 | 普天信息技术有限公司 | Voice recognition method and device, electronic equipment and storage medium |
US11893992B2 (en) | 2018-09-28 | 2024-02-06 | Apple Inc. | Multi-modal inputs for voice commands |
US11475898B2 (en) | 2018-10-26 | 2022-10-18 | Apple Inc. | Low-latency multi-speaker speech recognition |
US11348573B2 (en) | 2019-03-18 | 2022-05-31 | Apple Inc. | Multimodality in digital assistant systems |
US11783815B2 (en) | 2019-03-18 | 2023-10-10 | Apple Inc. | Multimodality in digital assistant systems |
US11307752B2 (en) | 2019-05-06 | 2022-04-19 | Apple Inc. | User configurable task triggers |
US11475884B2 (en) | 2019-05-06 | 2022-10-18 | Apple Inc. | Reducing digital assistant latency when a language is incorrectly determined |
US11217251B2 (en) | 2019-05-06 | 2022-01-04 | Apple Inc. | Spoken notifications |
US11705130B2 (en) | 2019-05-06 | 2023-07-18 | Apple Inc. | Spoken notifications |
US11675491B2 (en) | 2019-05-06 | 2023-06-13 | Apple Inc. | User configurable task triggers |
US11423908B2 (en) | 2019-05-06 | 2022-08-23 | Apple Inc. | Interpreting spoken requests |
US11140099B2 (en) | 2019-05-21 | 2021-10-05 | Apple Inc. | Providing message response suggestions |
US11888791B2 (en) | 2019-05-21 | 2024-01-30 | Apple Inc. | Providing message response suggestions |
US11237797B2 (en) | 2019-05-31 | 2022-02-01 | Apple Inc. | User activity shortcut suggestions |
US11289073B2 (en) | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
US11360739B2 (en) | 2019-05-31 | 2022-06-14 | Apple Inc. | User activity shortcut suggestions |
US11657813B2 (en) | 2019-05-31 | 2023-05-23 | Apple Inc. | Voice identification in digital assistant systems |
US11496600B2 (en) | 2019-05-31 | 2022-11-08 | Apple Inc. | Remote execution of machine-learned models |
US11360641B2 (en) | 2019-06-01 | 2022-06-14 | Apple Inc. | Increasing the relevance of new available information |
US11790914B2 (en) | 2019-06-01 | 2023-10-17 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
US11488406B2 (en) | 2019-09-25 | 2022-11-01 | Apple Inc. | Text detection using global geometry estimators |
US11914848B2 (en) | 2020-05-11 | 2024-02-27 | Apple Inc. | Providing relevant data items based on context |
US11924254B2 (en) | 2020-05-11 | 2024-03-05 | Apple Inc. | Digital assistant hardware abstraction |
US11765209B2 (en) | 2020-05-11 | 2023-09-19 | Apple Inc. | Digital assistant hardware abstraction |
US11810578B2 (en) | 2020-05-11 | 2023-11-07 | Apple Inc. | Device arbitration for digital assistant-based intercom systems |
US11755276B2 (en) | 2020-05-12 | 2023-09-12 | Apple Inc. | Reducing description length based on confidence |
US11838734B2 (en) | 2020-07-20 | 2023-12-05 | Apple Inc. | Multi-device audio adjustment coordination |
US11696060B2 (en) | 2020-07-21 | 2023-07-04 | Apple Inc. | User identification using headphones |
US11750962B2 (en) | 2020-07-21 | 2023-09-05 | Apple Inc. | User identification using headphones |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105869641A (en) | Speech recognition device and speech recognition method | |
CN103680497B (en) | Video-based speech recognition system and method | |
CN103635962B (en) | Speech recognition system, recognition dictionary registration system, and acoustic model identifier sequence generation device | |
US20210397797A1 (en) | Method and apparatus for training dialog generation model, dialog generation method and apparatus, and medium | |
CN110489538A (en) | Artificial-intelligence-based sentence answering method, device, and electronic equipment | |
CN103280216B (en) | Context-dependent speech recognition device with improved robustness to environmental change | |
CN110534099A (en) | Voice wake-up processing method, device, storage medium, and electronic equipment | |
CN110265040A (en) | Voiceprint model training method, device, storage medium, and electronic equipment | |
CN105940407A (en) | Systems and methods for evaluating strength of an audio password | |
CN108447471A (en) | Speech recognition method and speech recognition device | |
CN102404278A (en) | Voiceprint-recognition-based song request system and application method thereof | |
JP2018522303A (en) | Account addition method, terminal, server, and computer storage medium | |
US20230252019A1 (en) | Assigning a single new entigen to a word set | |
CN110136721A (en) | Score generation method, device, storage medium, and electronic equipment | |
JP6927318B2 (en) | Information processing equipment, information processing methods, and programs | |
US9613616B2 (en) | Synthesizing an aggregate voice | |
KR20190104280A (en) | Intelligent voice recognizing method, apparatus, and intelligent computing device | |
CN110136689 (en) | Transfer-learning-based song synthesis method, device, and storage medium | |
CN110147936 (en) | Emotion-recognition-based service evaluation method, device, and storage medium | |
KR20210016829 (en) | Intelligent voice recognizing method, apparatus, and intelligent computing device | |
CN109947971 (en) | Image retrieval method, device, electronic equipment, and storage medium | |
CN108322770 (en) | Video program identification method, related device, equipment, and system | |
CN113051384 (en) | Dialogue-based user portrait extraction method and related device | |
CN111833907 (en) | Human-computer interaction method, terminal, and computer-readable storage medium | |
CN113763925B (en) | Speech recognition method, device, computer equipment, and storage medium
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
AD01 | Patent right deemed abandoned | ||
AD01 | Patent right deemed abandoned |
Effective date of abandoning: 20200310 |