CN110197658A

CN110197658A - Method of speech processing, device and electronic equipment

Info

Publication number: CN110197658A
Application number: CN201910463203.7A
Authority: CN
Inventors: 孙建伟
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Baidu Online Network Technology Beijing Co Ltd; Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2019-05-30
Filing date: 2019-05-30
Publication date: 2019-09-03
Anticipated expiration: 2039-05-30
Also published as: CN110197658B

Abstract

The application proposes a kind of method of speech processing, device and electronic equipment, wherein method includes: to be trained using training sample set to the first acoustic model, wherein the first acoustic model includes coding layer, decoding layer and output layer；Decoding layer and output layer are replicated, multiple branches are obtained；Each branch includes a decoding layer and a corresponding output layer, according to the coding layer of multiple branches and the first acoustic model, generate the second acoustic model, all types of training samples are concentrated using training sample, the branch that respective type is matched in the second acoustic model is trained respectively, to carry out speech recognition using trained second acoustic model.Since each branch of trained second acoustic model is trained using the training sample of respective type, therefore, the branch of different types of voice input respective type is subjected to speech recognition, different types of voice can be recognized accurately, to improve the accuracy of speech recognition.

Description

Method of speech processing, device and electronic equipment

Technical field

This application involves technical field of voice recognition more particularly to a kind of method of speech processing, device and electronic equipment.

Background technique

With the rapid development of speech recognition technology, speech recognition technology is widely applied, wherein speech recognition system Application of the system in intelligent terminal and smart home increasingly changes the life and production method of the mankind.For example, User can control smart phone by mobile phone assistant.

But existing speech recognition system is all made of single head acoustic model identification voice, leads to unused hardware device When the voice of acquisition has differences, the lower technical problem of the accuracy rate of speech recognition.

Summary of the invention

The application is intended to solve at least some of the technical problems in related technologies.

The embodiment of the present application proposes a kind of method of speech processing, by using all types of training samples to the second acoustic mode Each branch of type is trained, and the branch of different types of voice input respective type is carried out speech recognition, can accurately be known Not Chu different types of voice solve different hardware equipment in the prior art to improve the accuracy of speech recognition and adopt In the case that the audio of collection has differences, the technical problem of identification inaccuracy is caused using the identification of the same single head model.

The application first aspect embodiment proposes a kind of method of speech processing, comprising:

Using training sample set, the first acoustic model is trained, wherein first acoustic model includes coding Layer, decoding layer and output layer；

The decoding layer and the output layer are replicated, multiple branches are obtained；Each branch includes a decoding Layer and a corresponding output layer；

According to the coding layer of the multiple branch and first acoustic model, the second acoustic model is generated；

All types of training samples are concentrated using the training sample, respectively to matching respective class in second acoustic model The branch of type is trained, to carry out speech recognition using trained second acoustic model.

The first possible implementation as the application, the training sample include the original audio feature and language of voice Sound corresponds to the reference pronunciation information of text marking, described to concentrate all types of training samples using the training sample, respectively to institute The branch for stating matching respective type in the second acoustic model is trained, comprising:

The original audio feature in the training sample is encoded using the coding layer of second acoustic model；

According to the type of the training sample, corresponding coding input is matched to the branch of the type, obtains output hair Message breath；

According to the difference with reference to pronunciation information between the output pronunciation information, to the matching type Branch carries out parameter adjustment, so that the difference minimizes.

As second of the application possible implementation, the original audio feature, including filter FBank feature.

The third possible implementation as the application, it is described that all types of trained samples are concentrated using the training sample This, before being trained respectively to the branch for matching respective type in second acoustic model, further includes:

Classification is divided according to the source of the training sample and/or applicable business scenario.

As the 4th kind of possible implementation of the application, first acoustic model further includes attention layer；

Second acoustic model accordingly includes the attention layer.

It is described that voice is carried out using trained second acoustic model as the 5th kind of possible implementation of the application Identification, comprising:

Target voice to be identified is encoded using the coding layer of second acoustic model；

According to the type of the target voice, intended branch is determined from multiple branches of second acoustic model；

By the coding of the target voice, inputs the intended branch acoustics and obtain corresponding pronunciation information.

The method of speech processing of the embodiment of the present application is trained the first acoustic model using training sample set, In, the first acoustic model includes coding layer, decoding layer and output layer, replicates to decoding layer and output layer, obtains multiple branches, often One branch includes a decoding layer and a corresponding output layer, according to the coding layer of multiple branches and the first acoustic model, The second acoustic model is generated, all types of training samples are concentrated using training sample, it is corresponding to being matched in the second acoustic model respectively The branch of type is trained, to carry out speech recognition using trained second acoustic model.Due to trained Each branch of two acoustic models is trained using the training sample of respective type, therefore, different types of voice is inputted phase It answers the branch of type to carry out speech recognition, different types of voice can be recognized accurately, to improve the standard of speech recognition Exactness.

The application second aspect embodiment proposes a kind of voice processing apparatus, comprising:

First training module is trained the first acoustic model, wherein described first for using training sample set Acoustic model includes coding layer, decoding layer and output layer；

Processing module obtains multiple branches for replicating to the decoding layer and the output layer；Each branch includes One decoding layer and a corresponding output layer；

Generation module generates the second acoustics for the coding layer according to the multiple branch and first acoustic model Model；

Second training module, for concentrating all types of training samples using the training sample, respectively to the rising tone The branch for learning matching respective type in model is trained, to carry out speech recognition using trained second acoustic model.

The voice processing apparatus of the embodiment of the present application is trained the first acoustic model using training sample set, In, the first acoustic model includes coding layer, decoding layer and output layer, replicates to decoding layer and output layer, obtains multiple branches, often One branch includes a decoding layer and a corresponding output layer, according to the coding layer of multiple branches and the first acoustic model, The second acoustic model is generated, all types of training samples are concentrated using training sample, it is corresponding to being matched in the second acoustic model respectively The branch of type is trained, to carry out speech recognition using trained second acoustic model.Due to trained Each branch of two acoustic models is trained using the training sample of respective type, therefore, can be accurately identified different types of Voice, to improve the accuracy of speech recognition.

The application third aspect embodiment proposes a kind of electronic equipment, comprising: including memory, processor and is stored in On memory and the computer program that can run on a processor, when the processor executes described program, such as above-mentioned reality is realized Apply method of speech processing described in example.

The application fourth aspect embodiment proposes a kind of non-transitorycomputer readable storage medium, is stored thereon with meter Calculation machine program realizes such as above-mentioned method of speech processing as described in the examples when the program is executed by processor.

The additional aspect of the application and advantage will be set forth in part in the description, and will partially become from the following description It obtains obviously, or recognized by the practice of the application.

Detailed description of the invention

The application is above-mentioned and/or additional aspect and advantage will become from the following description of the accompanying drawings of embodiments Obviously and it is readily appreciated that, in which:

Fig. 1 is a kind of flow diagram of method of speech processing provided by the embodiment of the present application；

Fig. 2 is the structural schematic diagram of the first acoustic model of one kind provided by the embodiment of the present application；

Fig. 3 is the structural schematic diagram of the second acoustic model of one kind provided by the embodiment of the present application；

Fig. 4 is a kind of flow diagram of model training method provided by the embodiment of the present application；

Fig. 5 is the flow diagram of another kind method of speech processing provided by the embodiment of the present application；

Fig. 6 is a kind of structural schematic diagram of voice processing apparatus provided by the embodiment of the present application；And

Fig. 7 shows the block diagram for being suitable for the exemplary computer device for being used to realize the application embodiment.

Specific embodiment

Embodiments herein is described below in detail, examples of the embodiments are shown in the accompanying drawings, wherein from beginning to end Same or similar label indicates same or similar element or element with the same or similar functions.Below with reference to attached The embodiment of figure description is exemplary, it is intended to for explaining the application, and should not be understood as the limitation to the application.

The application is lower for accuracy rate when the same single head acoustic model identifies a plurality of types of audios in the prior art The technical issues of, propose a kind of method of speech processing.

The method of speech processing of the embodiment of the present application is trained the first acoustic model using training sample set, In, the first acoustic model includes coding layer, decoding layer and output layer；Decoding layer and output layer are replicated, multiple branches are obtained；Often One branch includes a decoding layer and a corresponding output layer, according to the coding layer of multiple branches and the first acoustic model, The second acoustic model is generated, all types of training samples are concentrated using training sample, it is corresponding to being matched in the second acoustic model respectively The branch of type is trained, to carry out speech recognition using trained second acoustic model.

Below with reference to the accompanying drawings the method for speech processing, device and electronic equipment of the embodiment of the present application are described.

Fig. 1 is a kind of flow diagram of method of speech processing provided by the embodiment of the present application.

The embodiment of the present application is configured in voice processing apparatus with the method for speech processing to be come for example, at the voice Reason device can be applied in any electronic equipment, so that the electronic equipment can execute language process function.

Wherein, electronic equipment can set for PC (Personal Computer, abbreviation PC), cloud device, movement Standby etc., mobile device can for example have each for mobile phone, tablet computer, personal digital assistant, wearable device, mobile unit etc. The hardware device of kind operating system, touch screen and/or display screen.

As shown in Figure 1, the method for speech processing the following steps are included:

Step 101, using training sample set, the first acoustic model is trained, wherein the first acoustic model includes compiling Code layer, decoding layer and output layer.

Wherein, acoustic model is one of part mostly important in speech processes, in the present embodiment for the ease of with hereafter The acoustic model distinguishes, therefore referred to as the first acoustic model.First acoustic model includes coding layer, decoding layer and output Layer.

In the embodiment of the present application, training sample set can be from server end downloading, be also possible to user's designed, designed Training sample set, it is not limited here.

It should be noted that the training sample set of user's designed, designed, may include being acquired using different hardware devices The voice data of different people, for example, the object of acquisition voice can be with old man, adult men and women, children；It also may include same hardware The voice data of the same person of equipment acquisition；It also may include making an uproar of being acquired under different acoustic environments of same hardware device The horizontal different voice data of sound；It also may include the voice data, etc. downloaded from server.In short, making as far as possible Training sample set includes a variety of sample types.

Specifically, first by the first acoustic model of training sample set input random initializtion parameter, the first acoustic model It is that two-way whole sentence is trained, that is to say, that in the training process, the first acoustic model can learn the letter to voice context Breath, therefore, the generalization ability of the first acoustic model after training is stronger, can receive a plurality of types of training samples.

As an example, referring to fig. 2, Fig. 2 is that a kind of structure of first acoustic model provided by the embodiments of the present application is shown It is intended to.As shown in Fig. 2, the first acoustic model includes: input layer, coding layer, attention layer, decoding layer and output layer.

As a kind of possible situation, the first acoustic model can be transformer model, compared to the prior art in The deep learning acoustic model based on loss function speech recognition, transformer model do not have convolutional neural networks and The network structures such as length memory network, therefore, under identical training environment, in the identical situation of training sample, transformer The training speed of model is faster.It certainly, may be other acoustic models in the present embodiment, it is not limited here.

Step 102, decoding layer and output layer are replicated, obtains multiple branches；Each branch include decoding layer and A corresponding output layer.

In the embodiment of the present application, the decoding layer and output layer of the first acoustic model after training are replicated, obtained To multiple decoding layers and output layer corresponding with each decoding layer.Wherein, each decoding layer and a corresponding output layer Constitute a branch.

Step 103, according to the coding layer of multiple branches and the first acoustic model, the second acoustic model is generated.

In the present embodiment, by the first acoustic model after training decoding layer and output layer replicated it is multiple Branch combines with the coding layer of the first acoustic model, generates the second acoustic model.

As an example, referring to Fig. 3, from the figure 3, it may be seen that the second acoustic model in Fig. 3 is to the first acoustics in Fig. 2 Solution to model code layer and output layer replicate to obtain multiple branches, and multiple branches share the attention layer and coding of the first acoustic model Layer, the second obtained acoustic model.

Step 104, all types of training samples are concentrated using training sample, respectively to matching respective class in the second acoustic model The branch of type is trained, to carry out speech recognition using trained second acoustic model.

In the embodiment of the present application, training sample can be concentrated according to the source of training sample and/or using business scenario Training sample divide classification.

As a kind of possible situation, if the training sample that training sample is concentrated is the language that different hardware devices acquires When sound data, the structure of different hardware equipment and performance are different, and the voice data of acquisition is caused to have differences, therefore, according to adopting Training sample set can be divided into a plurality of types of training samples by the hardware device of collection voice data.

As alternatively possible situation, if it is same hardware device different that training sample, which concentrates each training sample, When the voice data acquired under voice environment, the noise level of training sample is caused not pass through, therefore, is deposited between each training sample In difference, training sample set can be divided by a plurality of types of training samples according to the noise level of training sample.

The situation possible as another, if training sample set be same hardware device to all ages and classes and property others When the voice data collected, since children are different with adult tongue, there is also differences for the voice data collected It is different, therefore, training sample set can be divided by a plurality of types of trained samples according to the corresponding gender of training sample and age This.

The situation possible as another can carry out classification to training sample according to the applicable business scenario of training sample It divides.Training sample for different business scene is divided into different classifications.

It should be noted that above-mentioned carry out training sample to divide class method for distinguishing only as an example, also deposit certainly In remaining possible situation, it is not limited here.

In the embodiment of the present application, all types of training samples is concentrated using training sample, and the second acoustic model is instructed When practicing, different types of training sample shares one and same coding layer, to encode to the original audio feature in training sample. In turn, by after coding different types of training sample input with the branch of the decoding layer of its type matching and corresponding output layer, To be trained to each branch in the second acoustic model, to carry out speech recognition using trained second acoustic model.

It is to be understood that during to the second acoustic training model, in order to guarantee that all types of training samples obtain It sufficiently uses, the training sample of same type can be all made of during primary training.The training sample of different training process This type, can be identical or not identical, but to reach better training effect, it can be in different training process A plurality of types of training samples are used as far as possible.

In the embodiment of the present application, all types of training samples are concentrated using training sample, respectively in the second acoustic model After branch with respective type is trained, carried out using training effect of the test sample collection to the second acoustic model after training Test.

It can basis when testing the second acoustic model after training as a kind of possible implementation Second acoustic model is split as the first acoustic model of corresponding business scenario by different business scenarios, then using the scene Test sample tests model.In order to guarantee the accuracy of model measurement, test sample quantity can be 3000-10000 It is a, and each audio sample needs corresponding text marking, and test result generally uses word standard and sentence standard to count, and then realizes Test to the second acoustic training model result.

The method of speech processing of the embodiment of the present application is trained the first acoustic model using training sample set, In, the first acoustic model includes coding layer, decoding layer and output layer, replicates to decoding layer and output layer, obtains multiple branches, often One branch includes a decoding layer and a corresponding output layer, according to the coding layer of multiple branches and the first acoustic model, The second acoustic model is generated, all types of training samples are concentrated using training sample, it is corresponding to being matched in the second acoustic model respectively The branch of type is trained, to carry out speech recognition using trained second acoustic model.Due to trained Each branch of two acoustic models is trained using the training sample of respective type, therefore, can be accurately identified different types of Voice, to improve the accuracy of speech recognition.

In a kind of possible implementation of the embodiment of the present application, the original audio feature and voice of voice can be used The reference pronunciation information of corresponding text marking is trained the second acoustic model as training sample.Specific model training Referring to fig. 4, Fig. 4 is a kind of flow diagram of the training method of second acoustic model provided by the embodiments of the present application to process.

As shown in figure 4, the model training method may comprise steps of:

Step 201, the original audio feature in training sample is encoded using the coding layer of the second acoustic model.

Wherein, training sample, original audio feature and voice including voice correspond to the reference pronunciation information of text marking.

Since the most information of voice signal is included in low frequency component and by a narrow margin in part, still, human ear is to sound audio The response of spectrum is nonlinear, it has been experienced that: it, can be with if we can be handled audio in a manner of being similar to human ear Improve the performance of speech recognition.In the present embodiment, feature extraction is carried out to the original audio of voice in training sample, obtains voice Original audio feature.It include filter FBank feature in original audio feature.

In the present embodiment, MFCC feature is extracted to the original audio of voice, and pass through gauss hybrid models (Gaussian Mixture Model, abbreviation GMM) the text justification audio fragment that will manually mark, and then convert the text to the corresponding text of voice The reference pronunciation information of this mark.

It should be noted that original audio feature extraction methods in the present embodiment, can refer to the prior art, herein no longer It repeats.

In the present embodiment, training sample can be and acquire what voice data obtained by hardware device.Different hardware are set The standby voice data collected has differences, the voice data that same hardware device collects under different language environments There is also differences, therefore training sample can be divided into different types.

Specifically, using the coding layer of the second acoustic model to the original audio in the training sample inputted by input layer Feature is encoded.

Step 202, the branch of corresponding coding input match-type is obtained by output hair according to the type of training sample Message breath.

Specifically, according to the type of training sample, by the corresponding coding input of all types of corresponding original audio features The branch of match-type in two acoustic models, the pronunciation information exported.

For example, the training sample of a certain type and second branch of the second acoustic model match if it exists, then After original audio feature in training sample to the type is encoded, corresponding the second branch of coding input obtains The pronunciation information of output.

Step 203, according to the difference between reference pronunciation information and output pronunciation information, the branch of match-type is carried out Parameter adjustment, so that difference minimizes.

In the embodiment of the present application, according to the type of training sample, the branch of corresponding coding input match-type is obtained After exporting pronunciation information, output pronunciation is pronounced to be compared with reference of corresponding text marking, obtain referring to pronunciation information and Export the difference between pronunciation information.In turn, according to the difference between reference pronunciation information and output pronunciation information, to matching class Type branch carries out parameter adjustment, to optimize training to the second acoustic model, until with reference to pronunciation information and output pronunciation letter When difference between breath minimizes, the training to the second acoustic model is completed.

Various types of training samples instructing with all types of matched each branches to the second acoustic model is used as a result, Practice, so that the second acoustic model after training can satisfy the demand of different scenes, various types of languages can be recognized accurately The corresponding pronunciation information of sound.

In the embodiment of the present application, the original audio feature in training sample is carried out using the coding layer of the second acoustic model The branch of corresponding coding input match-type is obtained output pronunciation information according to the type of training sample by coding, according to With reference to the difference between pronunciation information and output pronunciation information, parameter adjustment is carried out to the branch of match-type, so that difference is most Smallization.As a result, by the reference pronunciation information of the original audio feature of voice and corresponding text marking in the second acoustic model The branch of match-type is trained, and can be realized after a plurality of types of voices are inputted the second acoustic model, accurately exports Corresponding pronunciation information, to improve the accuracy of speech recognition.

As an example, target voice to be identified can be inputted into trained second acoustic model, to obtain The corresponding pronunciation information of target voice.It describes in detail below with reference to Fig. 5 to the above process, Fig. 5 mentions for the embodiment of the present application The flow diagram of another method of speech processing supplied.

As shown in figure 5, the method for speech processing the following steps are included:

Step 301, target voice to be identified is encoded using the coding layer of the second acoustic model.

In the embodiment of the present application, after target voice to be identified is inputted the second acoustic model, the volume of the second acoustic model Code layer encodes target voice to be identified, and target voice is converted to the encoded signal that computer can identify.

Step 302, according to the type of target voice, intended branch is determined from multiple branches of the second acoustic model.

In the embodiment of the present application, due to the second acoustic model each branch training process using with its type matching What training sample was trained, therefore, in the present embodiment, according to the type of object language, from multiple points of the second acoustic model The determining intended branch with the type matching of object language in branch.

Step 303, it by the coding of target voice, inputs intended branch and obtains corresponding pronunciation information.

In the embodiment of the present application, according to the type of target voice, the determining and mesh from multiple branches of the second acoustic model After the intended branch of the type matching of poster speech, by the coding input intended branch of object language, corresponding pronunciation information is obtained.

The method of speech processing of the embodiment of the present application, by using the coding layer of the second acoustic model to target to be identified Voice is encoded, and according to the type of target voice, intended branch is determined from multiple branches of the second acoustic model, by target The coding of voice, input intended branch obtain corresponding pronunciation information.Pass through as a result, by different types of target language to be identified Sound, the branch of input and its type matching, to improve the accuracy rate of speech recognition, are solved with obtaining corresponding pronunciation information It has determined acoustic model in the prior art lower technical problem of accuracy rate when identifying multiple types voice messaging.

In order to realize above-described embodiment, the application also proposes a kind of voice processing apparatus.

Fig. 6 is a kind of structural schematic diagram of voice processing apparatus provided by the embodiments of the present application.

As shown in fig. 6, the voice processing apparatus 100 includes: the first training module 110, processing module 120, generation module 130 and second training module 140.

First training module 110 is trained the first acoustic model, wherein the first sound for using training sample set Learning model includes coding layer, decoding layer and output layer.

Processing module 120 obtains multiple branches for replicating to decoding layer and output layer；Each branch includes one Decoding layer and a corresponding output layer.

Generation module 130 generates the second acoustic model for the coding layer according to multiple branches and the first acoustic model.

Second training module 140, for concentrating all types of training samples using training sample, respectively to the second acoustic model The branch of middle matching respective type is trained, to carry out speech recognition using trained second acoustic model.

As a kind of possible implementation, training sample includes that the original audio feature of voice and voice correspond to text mark The reference pronunciation information of note, the second training module 140, is used for:

The original audio feature in training sample is encoded using the coding layer of the second acoustic model；

The branch of corresponding coding input match-type is obtained into output pronunciation information according to the type of training sample；

According to the difference between reference pronunciation information and output pronunciation information, parameter tune is carried out to the branch of match-type It is whole, so that difference minimizes.

As alternatively possible implementation, original audio feature, including filter FBank feature.

As alternatively possible implementation, voice processing apparatus 100, further includes:

Division module, for dividing classification according to the source of training sample and/or applicable business scenario.

As alternatively possible implementation, the first acoustic model further includes attention layer；Second acoustic model is corresponding Including attention layer.

As alternatively possible implementation, the second training module 140 is used for:

Target voice to be identified is encoded using the coding layer of the second acoustic model；

According to the type of target voice, intended branch is determined from multiple branches of the second acoustic model；

By the coding of target voice, inputs intended branch and obtain corresponding pronunciation information.

It should be noted that the aforementioned voice for being also applied for the embodiment to the explanation of method of speech processing embodiment Processing unit, details are not described herein again.

In order to realize above-described embodiment, the application also proposes a kind of electronic equipment, comprising: including memory, processor and The computer program that can be run on a memory and on a processor is stored, when the processor executes described program, is realized such as Method of speech processing described in above-described embodiment.

In order to realize above-described embodiment, the application also proposes a kind of non-transitorycomputer readable storage medium, deposits thereon Computer program is contained, such as above-mentioned method of speech processing as described in the examples is realized when which is executed by processor.

Fig. 7 shows the block diagram for being suitable for the exemplary computer device for being used to realize the application embodiment.What Fig. 7 was shown Computer equipment 12 is only an example, should not function to the embodiment of the present application and use scope bring any restrictions.

As shown in fig. 7, computer equipment 12 is showed in the form of universal computing device.The component of computer equipment 12 can be with Including but not limited to: one or more processor or processing unit 16, system storage 28 connect different system components The bus 18 of (including system storage 28 and processing unit 16).

Bus 18 indicates one of a few class bus structures or a variety of, including memory bus or Memory Controller, Peripheral bus, graphics acceleration port, processor or the local bus using any bus structures in a variety of bus structures.It lifts For example, these architectures include but is not limited to industry standard architecture (Industry Standard Architecture；Hereinafter referred to as: ISA) bus, microchannel architecture (Micro Channel Architecture；Below Referred to as: MAC) bus, enhanced isa bus, Video Electronics Standards Association (Video Electronics Standards Association；Hereinafter referred to as: VESA) local bus and peripheral component interconnection (Peripheral Component Interconnection；Hereinafter referred to as: PCI) bus.

Computer equipment 12 typically comprises a variety of computer system readable media.These media can be it is any can be by The usable medium that computer equipment 12 accesses, including volatile and non-volatile media, moveable and immovable medium.

Memory 28 may include the computer system readable media of form of volatile memory, such as random access memory Device (Random Access Memory；Hereinafter referred to as: RAM) 30 and/or cache memory 32.Computer equipment 12 can be with It further comprise other removable/nonremovable, volatile/non-volatile computer system storage mediums.Only as an example, Storage system 34 can be used for reading and writing immovable, non-volatile magnetic media, and (Fig. 7 do not show, commonly referred to as " hard drive Device ").Although being not shown in Fig. 7, the disk for reading and writing to removable non-volatile magnetic disk (such as " floppy disk ") can be provided and driven Dynamic device, and to removable anonvolatile optical disk (such as: compact disc read-only memory (Compact Disc Read Only Memory；Hereinafter referred to as: CD-ROM), digital multi CD-ROM (Digital Video Disc Read Only Memory；Hereinafter referred to as: DVD-ROM) or other optical mediums) read-write CD drive.In these cases, each driving Device can be connected by one or more data media interfaces with bus 18.Memory 28 may include that at least one program produces Product, the program product have one group of (for example, at least one) program module, and it is each that these program modules are configured to perform the application The function of embodiment.

Program/utility 40 with one group of (at least one) program module 42 can store in such as memory 28 In, such program module 42 include but is not limited to operating system, one or more application program, other program modules and It may include the realization of network environment in program data, each of these examples or certain combination.Program module 42 is usual Execute the function and/or method in embodiments described herein.

Computer equipment 12 can also be with one or more external equipments 14 (such as keyboard, sensing equipment, display 24 Deng) communication, the equipment interacted with the computer system/server 12 can be also enabled a user to one or more to be communicated, and/ Or with the computer equipment 12 is communicated with one or more of the other calculating equipment any equipment (such as network interface card, Modem etc.) communication.This communication can be carried out by input/output (I/O) interface 22.Also, computer equipment 12 can also pass through network adapter 20 and one or more network (such as local area network (Local Area Network；Below Referred to as: LAN), wide area network (Wide Area Network；Hereinafter referred to as: WAN) and/or public network, for example, internet) it is logical Letter.As shown, network adapter 20 is communicated by bus 18 with other modules of computer equipment 12.Although should be understood that It is not shown in the figure, other hardware and/or software module can be used in conjunction with computer equipment 12, including but not limited to: microcode, Device driver, redundant processing unit, external disk drive array, RAID system, tape drive and data backup storage System etc..

Processing unit 16 by the program that is stored in system storage 28 of operation, thereby executing various function application and Data processing, such as realize the method for speech processing referred in previous embodiment.

In the description of this specification, reference term " one embodiment ", " some embodiments ", " example ", " specifically show The description of example " or " some examples " etc. means specific features, structure, material or spy described in conjunction with this embodiment or example Point is contained at least one embodiment or example of the application.In the present specification, schematic expression of the above terms are not It must be directed to identical embodiment or example.Moreover, particular features, structures, materials, or characteristics described can be in office It can be combined in any suitable manner in one or more embodiment or examples.In addition, without conflicting with each other, the skill of this field Art personnel can tie the feature of different embodiments or examples described in this specification and different embodiments or examples It closes and combines.

In addition, term " first ", " second " are used for descriptive purposes only and cannot be understood as indicating or suggesting relative importance Or implicitly indicate the quantity of indicated technical characteristic.Define " first " as a result, the feature of " second " can be expressed or Implicitly include at least one this feature.In the description of the present application, the meaning of " plurality " is at least two, such as two, three It is a etc., unless otherwise specifically defined.

Any process described otherwise above or method description are construed as in flow chart or herein, and expression includes It is one or more for realizing custom logic function or process the step of executable instruction code module, segment or portion Point, and the range of the preferred embodiment of the application includes other realization, wherein can not press shown or discussed suitable Sequence, including according to related function by it is basic simultaneously in the way of or in the opposite order, Lai Zhihang function, this should be by the application Embodiment person of ordinary skill in the field understood.

Expression or logic and/or step described otherwise above herein in flow charts, for example, being considered use In the order list for the executable instruction for realizing logic function, may be embodied in any computer-readable medium, for Instruction execution system, device or equipment (such as computer based system, including the system of processor or other can be held from instruction The instruction fetch of row system, device or equipment and the system executed instruction) it uses, or combine these instruction execution systems, device or set It is standby and use.For the purpose of this specification, " computer-readable medium ", which can be, any may include, stores, communicates, propagates or pass Defeated program is for instruction execution system, device or equipment or the dress used in conjunction with these instruction execution systems, device or equipment It sets.The more specific example (non-exhaustive list) of computer-readable medium include the following: there is the electricity of one or more wirings Interconnecting piece (electronic device), portable computer diskette box (magnetic device), random access memory (RAM), read-only memory (ROM), erasable edit read-only storage (EPROM or flash memory), fiber device and portable optic disk is read-only deposits Reservoir (CDROM).In addition, computer-readable medium can even is that the paper that can print described program on it or other are suitable Medium, because can then be edited, be interpreted or when necessary with it for example by carrying out optical scanner to paper or other media His suitable method is handled electronically to obtain described program, is then stored in computer storage.

It should be appreciated that each section of the application can be realized with hardware, software, firmware or their combination.Above-mentioned In embodiment, software that multiple steps or method can be executed in memory and by suitable instruction execution system with storage Or firmware is realized.Such as, if realized with hardware in another embodiment, following skill well known in the art can be used Any one of art or their combination are realized: have for data-signal is realized the logic gates of logic function from Logic circuit is dissipated, the specific integrated circuit with suitable combinational logic gate circuit, programmable gate array (PGA), scene can compile Journey gate array (FPGA) etc..

Those skilled in the art are understood that realize all or part of step that above-described embodiment method carries It suddenly is that relevant hardware can be instructed to complete by program, the program can store in a kind of computer-readable storage medium In matter, which when being executed, includes the steps that one or a combination set of embodiment of the method.

It, can also be in addition, can integrate in a processing module in each functional unit in each embodiment of the application It is that each unit physically exists alone, can also be integrated in two or more units in a module.Above-mentioned integrated mould Block both can take the form of hardware realization, can also be realized in the form of software function module.The integrated module is such as Fruit is realized and when sold or used as an independent product in the form of software function module, also can store in a computer In read/write memory medium.

Storage medium mentioned above can be read-only memory, disk or CD etc..Although having been shown and retouching above Embodiments herein is stated, it is to be understood that above-described embodiment is exemplary, and should not be understood as the limit to the application System, those skilled in the art can be changed above-described embodiment, modify, replace and become within the scope of application Type.

Claims

1. a kind of method of speech processing, which is characterized in that the described method comprises the following steps:

Using training sample set, the first acoustic model is trained, wherein first acoustic model includes coding layer, solution Code layer and output layer；

The decoding layer and the output layer are replicated, multiple branches are obtained；Each branch include decoding layer and A corresponding output layer；

All types of training samples are concentrated using the training sample, respectively to matching respective type in second acoustic model Branch is trained, to carry out speech recognition using trained second acoustic model.

2. the method according to claim 1, wherein the training sample include voice original audio feature and Voice corresponds to the reference pronunciation information of text marking, described to concentrate all types of training samples using the training sample, right respectively The branch that respective type is matched in second acoustic model is trained, comprising:

According to the type of the training sample, corresponding coding input is matched to the branch of the type, obtains output pronunciation letter Breath；

According to the difference with reference to pronunciation information between the output pronunciation information, to the branch of the matching type Parameter adjustment is carried out, so that the difference minimizes.

3. according to the method described in claim 2, it is characterized in that,

The original audio feature, including filter FBank feature.

4. the method according to claim 1, wherein described concentrate all types of trained samples using the training sample This, before being trained respectively to the branch for matching respective type in second acoustic model, further includes:

5. method according to claim 1-4, which is characterized in that first acoustic model further includes attention Layer；

Second acoustic model accordingly includes the attention layer.

6. method according to claim 1-4, which is characterized in that described to use trained second acoustic mode Type carries out speech recognition, comprising:

By the coding of the target voice, inputs the intended branch and obtain corresponding pronunciation information.

7. a kind of voice processing apparatus, which is characterized in that described device includes:

First training module is trained the first acoustic model, wherein first acoustics for using training sample set Model includes coding layer, decoding layer and output layer；

Processing module obtains multiple branches for replicating to the decoding layer and the output layer；Each branch includes one The decoding layer and a corresponding output layer；

Generation module generates the second acoustic model for the coding layer according to the multiple branch and first acoustic model；

Second training module, for concentrating all types of training samples using the training sample, respectively to second acoustic mode The branch that respective type is matched in type is trained, to carry out speech recognition using trained second acoustic model.

8. device according to claim 7, which is characterized in that the training sample include voice original audio feature and Voice corresponds to the reference pronunciation information of text marking, and second training module is used for:

9. device according to claim 8, which is characterized in that the original audio feature, including filter FBank are special Sign.

10. device according to claim 7, which is characterized in that described device, further includes:

Division module, for dividing classification according to the source of the training sample and/or applicable business scenario.

11. according to the described in any item devices of claim 7-10, which is characterized in that first acoustic model further includes paying attention to Power layer；

Second acoustic model accordingly includes the attention layer.

12. according to the described in any item devices of claim 7-10, which is characterized in that second training module is used for:

13. a kind of electronic equipment, which is characterized in that on a memory and can be in processor including memory, processor and storage The computer program of upper operation when the processor executes described program, realizes such as voice as claimed in any one of claims 1 to 6 Processing method.

14. a kind of non-transitorycomputer readable storage medium, is stored thereon with computer program, which is characterized in that the program Such as method of speech processing as claimed in any one of claims 1 to 6 is realized when being executed by processor.