CN110197658B - Voice processing method and device and electronic equipment - Google Patents


Info

Publication number
CN110197658B
Authority
CN
China
Prior art keywords
acoustic model
layer
training
training sample
branches
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910463203.7A
Other languages
Chinese (zh)
Other versions
CN110197658A (en)
Inventor
孙建伟 (Sun Jianwei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910463203.7A priority Critical patent/CN110197658B/en
Publication of CN110197658A publication Critical patent/CN110197658A/en
Application granted granted Critical
Publication of CN110197658B publication Critical patent/CN110197658B/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training

Abstract

The application provides a voice processing method, a voice processing device, and electronic equipment. The method includes the following steps: training a first acoustic model with a training sample set, the first acoustic model including an encoding layer, a decoding layer, and an output layer; copying the decoding layer and the output layer to obtain a plurality of branches, each branch including one decoding layer and its corresponding output layer; generating a second acoustic model from the plurality of branches and the encoding layer of the first acoustic model; and training the branch of the second acoustic model that matches each type with the training samples of that type from the training sample set, so that the trained second acoustic model can be used for voice recognition. Because each branch of the trained second acoustic model is trained on training samples of the corresponding type, voices of different types are routed to the branch of the matching type for recognition; the different types of voices can therefore be recognized accurately, which improves the accuracy of voice recognition.

Description

Voice processing method and device and electronic equipment
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech processing method and apparatus, and an electronic device.
Background
With the rapid development of speech recognition technology, speech recognition systems are widely used; their application in intelligent terminal devices and smart homes is increasingly changing the way people live and work. For example, a user may control a smartphone through a voice assistant on the phone.
However, existing voice recognition systems all use a single-head acoustic model to recognize voice. Because the audio collected by different hardware devices differs, the accuracy of voice recognition with such a model is low.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
The embodiment of the application provides a voice processing method in which each branch of a second acoustic model is trained with training samples of a corresponding type, and voices of different types are input into the branch of the matching type for voice recognition. Voices of different types can thus be recognized accurately, which improves the accuracy of voice recognition and solves the technical problem in the prior art that recognition is inaccurate when the differing audio collected by different hardware devices is recognized with the same single-head model.
An embodiment of a first aspect of the present application provides a speech processing method, including:
training a first acoustic model by adopting a training sample set, wherein the first acoustic model comprises an encoding layer, a decoding layer and an output layer;
copying the decoding layer and the output layer to obtain a plurality of branches; each branch comprises one decoding layer and one output layer;
generating a second acoustic model from the plurality of branches and an encoding layer of the first acoustic model;
and training branches matched with corresponding types in the second acoustic model by adopting various types of training samples in the training sample set respectively so as to adopt the trained second acoustic model to perform voice recognition.
As a first possible implementation manner of the present application, the training samples include original audio features of speech and reference pronunciation information of text labels corresponding to the speech, and the training samples of different types in the training sample set are adopted to respectively train branches of the second acoustic model matching corresponding types, including:
encoding the original audio features in the training sample by adopting an encoding layer of the second acoustic model;
inputting, according to the type of the training sample, the corresponding encoding into the branch matching that type to obtain output pronunciation information;
and according to the difference between the reference pronunciation information and the output pronunciation information, performing parameter adjustment on the branch matched with the type so as to minimize the difference.
As a second possible implementation manner of the present application, the original audio features include filter bank (FBank) features.
As a third possible implementation manner of the present application, before the applying each type of training sample in the training sample set to respectively train the branches of the second acoustic model that match the corresponding type, the method further includes:
and classifying according to the source of the training sample and/or the applicable service scene.
As a fourth possible implementation manner of the present application, the first acoustic model further includes an attention layer;
the second acoustic model correspondingly comprises the attention layer.
As a fifth possible implementation manner of the present application, the performing speech recognition by using the trained second acoustic model includes:
coding the target voice to be recognized by adopting the coding layer of the second acoustic model;
determining a target branch from the plurality of branches of the second acoustic model according to the type of the target voice;
and inputting the encoding of the target voice into the target branch to obtain the corresponding pronunciation information.
In the speech processing method of the embodiment of the application, a training sample set is used to train a first acoustic model comprising an encoding layer, a decoding layer and an output layer; the decoding layer and the output layer are copied to obtain a plurality of branches, each branch comprising one decoding layer and its corresponding output layer; a second acoustic model is generated from the plurality of branches and the encoding layer of the first acoustic model; and each type of training sample in the training sample set is used to train the branch of the second acoustic model that matches that type, so that the trained second acoustic model can be used for speech recognition. Because each branch of the trained second acoustic model is trained with training samples of the corresponding type, voices of different types are input into the branch of the matching type for recognition; the different types of voices can therefore be recognized accurately, and the accuracy of speech recognition is improved.
An embodiment of a second aspect of the present application provides a speech processing apparatus, including:
the first training module is used for training a first acoustic model by adopting a training sample set, wherein the first acoustic model comprises an encoding layer, a decoding layer and an output layer;
the processing module is used for copying the decoding layer and the output layer to obtain a plurality of branches; each branch comprises one decoding layer and one output layer;
a generating module, configured to generate a second acoustic model according to the plurality of branches and an encoding layer of the first acoustic model;
and the second training module is used for adopting each type of training sample in the training sample set to respectively train the branches matched with the corresponding type in the second acoustic model so as to adopt the trained second acoustic model to perform voice recognition.
In the speech processing device of the embodiment of the application, a training sample set is used to train a first acoustic model comprising an encoding layer, a decoding layer and an output layer; the decoding layer and the output layer are copied to obtain a plurality of branches, each branch comprising one decoding layer and its corresponding output layer; a second acoustic model is generated from the plurality of branches and the encoding layer of the first acoustic model; and each type of training sample in the training sample set is used to train the branch of the second acoustic model that matches that type, so that the trained second acoustic model can be used for speech recognition. Because each branch of the trained second acoustic model is trained with training samples of the corresponding type, different types of voices can be recognized accurately, and the accuracy of speech recognition is improved.
An embodiment of a third aspect of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the program, the speech processing method described in the above embodiments is implemented.
A fourth aspect of the present application is directed to a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program, when executed by a processor, implementing the speech processing method as described in the above embodiments.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart of a speech processing method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a first acoustic model provided in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a second acoustic model according to an embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating a model training method according to an embodiment of the present disclosure;
FIG. 5 is a flowchart illustrating another speech processing method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application; and
FIG. 7 illustrates a block diagram of an exemplary computer device suitable for implementing embodiments of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The application provides a voice processing method aimed at the technical problem in the prior art that accuracy is low when the same single-head acoustic model recognizes multiple types of audio.
In the voice processing method of the embodiment of the application, a training sample set is used to train a first acoustic model comprising an encoding layer, a decoding layer and an output layer; the decoding layer and the output layer are copied to obtain a plurality of branches, each branch comprising one decoding layer and its corresponding output layer; a second acoustic model is generated from the plurality of branches and the encoding layer of the first acoustic model; and each type of training sample in the training sample set is used to train the branch of the second acoustic model that matches that type, so that the trained second acoustic model can be used for voice recognition.
The following describes a speech processing method, apparatus, and electronic device according to an embodiment of the present application with reference to the drawings.
Fig. 1 is a flowchart illustrating a speech processing method according to an embodiment of the present application.
The embodiment of the present application is described with the voice processing method configured in a voice processing apparatus, and the voice processing apparatus can be applied to any electronic device so that the electronic device can perform the voice processing function.
The electronic device may be a personal computer (PC), a cloud device, a mobile device, or the like, and the mobile device may be a hardware device with an operating system, a touch screen and/or a display screen, such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device or an in-vehicle device.
As shown in fig. 1, the speech processing method includes the steps of:
step 101, training a first acoustic model by using a training sample set, wherein the first acoustic model comprises an encoding layer, a decoding layer and an output layer.
The acoustic model is one of the most important parts in speech processing, and is referred to as a first acoustic model in this embodiment for convenience of distinction from the acoustic model described below. The first acoustic model includes an encoding layer, a decoding layer, and an output layer.
In the embodiment of the present application, the training sample set may be downloaded from a server, or may be a training sample set designed by a user, which is not limited herein.
It should be noted that a training sample set designed by the user may include voice data of different people collected by different hardware devices (for example, the recorded speakers may be elderly people, adults, men, women, or children); it may include voice data of the same person collected by the same hardware device; it may include voice data collected by the same hardware device in different sound environments and therefore with different noise levels; and it may also include voice data downloaded from a server, and so on. In short, the training sample set should contain as many sample types as possible.
Specifically, the training sample set is first fed into a first acoustic model with randomly initialized parameters, and the first acoustic model is trained bidirectionally on whole sentences; that is, during training, the first acoustic model can learn from context both before and after a given point in the speech, so the trained first acoustic model has stronger generalization capability and can accept multiple types of training samples.
As an example, referring to fig. 2, fig. 2 is a schematic structural diagram of a first acoustic model provided in an embodiment of the present application. As shown in fig. 2, the first acoustic model includes: an input layer, an encoding layer, an attention layer, a decoding layer, and an output layer.
As a possible scenario, the first acoustic model may be a Transformer model. Compared with prior-art deep-learning acoustic models for speech recognition, the Transformer model contains no network structures such as convolutional neural networks or long short-term memory networks, so in the same training environment and with the same training samples its training speed is faster. Of course, other acoustic models may also be used in the present embodiment, and no limitation is imposed here.
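The patent gives no code, but the layered structure in fig. 2 can be sketched roughly as follows. This is a minimal, illustrative PyTorch sketch: the layer counts, dimensions, and the use of torch.nn Transformer modules are assumptions of the sketch, not something the patent specifies.

```python
import torch
import torch.nn as nn

class FirstAcousticModel(nn.Module):
    """Single-branch acoustic model: input -> encoding -> attention -> decoding -> output."""

    def __init__(self, feat_dim=80, d_model=256, n_heads=4, n_outputs=1000):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)                       # input layer
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)        # encoding layer
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # attention layer
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)        # decoding layer
        self.output = nn.Linear(d_model, n_outputs)                          # output layer

    def forward(self, feats, target_emb):
        # feats: (batch, frames, feat_dim) FBank features; target_emb: embedded label sequence
        x = self.input_proj(feats)
        enc = self.encoder(x)
        ctx, _ = self.attention(enc, enc, enc)
        dec = self.decoder(target_emb, ctx)
        return self.output(dec)                                              # pronunciation logits
```

Running the encoder over the complete utterance with no causal mask, as above, corresponds to the whole-sentence bidirectional training described in this embodiment.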
Step 102, copying the decoding layer and the output layer to obtain a plurality of branches; each branch comprises one decoding layer and a corresponding output layer.
In the embodiment of the application, the decoding layers and the output layers of the trained first acoustic model are copied to obtain a plurality of decoding layers and output layers corresponding to the decoding layers. Wherein each decoding layer and a corresponding output layer form a branch.
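A hedged sketch of this copying step, continuing the illustrative model above (copy.deepcopy is one straightforward way to duplicate trained layers; the patent does not prescribe a particular mechanism):

```python
import copy

def make_branches(trained_model: FirstAcousticModel, num_types: int) -> nn.ModuleList:
    """Duplicate the trained decoding layer and output layer once per sample type."""
    return nn.ModuleList(
        nn.ModuleDict({
            "decoder": copy.deepcopy(trained_model.decoder),
            "output": copy.deepcopy(trained_model.output),
        })
        for _ in range(num_types)
    )
```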
Step 103, generating a second acoustic model according to the plurality of branches and the coding layer of the first acoustic model.
In this embodiment, a plurality of branches obtained by copying the decoding layer and the output layer of the trained first acoustic model are combined with the coding layer of the first acoustic model to generate the second acoustic model.
As an example, referring to fig. 3, the second acoustic model in fig. 3 is obtained by copying the decoding layer and the output layer of the first acoustic model in fig. 2 into a plurality of branches, where the plurality of branches share the attention layer and the encoding layer of the first acoustic model.
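Continuing the sketch, the second acoustic model can be assembled by reusing (not copying) the shared layers and routing each forward pass through one branch; the exact wiring below is an assumption consistent with fig. 3, not a prescribed implementation.

```python
class SecondAcousticModel(nn.Module):
    """Shared input/encoding/attention layers plus one (decoder, output) branch per sample type."""

    def __init__(self, first_model: FirstAcousticModel, num_types: int):
        super().__init__()
        self.input_proj = first_model.input_proj        # shared with the first model
        self.encoder = first_model.encoder              # shared encoding layer
        self.attention = first_model.attention          # shared attention layer
        self.branches = make_branches(first_model, num_types)

    def forward(self, feats, target_emb, branch_id: int):
        x = self.input_proj(feats)
        enc = self.encoder(x)
        ctx, _ = self.attention(enc, enc, enc)
        branch = self.branches[branch_id]               # route through the matching branch
        dec = branch["decoder"](target_emb, ctx)
        return branch["output"](dec)
```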
Step 104, adopting various types of training samples in the training sample set to respectively train the branches matched with the corresponding types in the second acoustic model, so as to perform voice recognition with the trained second acoustic model.
In the embodiment of the application, the training samples in the training sample set can be classified according to the source and/or the service scene of the training samples.
As a possible case, if the training samples in the training sample set are voice data collected by different hardware devices, the structure and performance of the different hardware devices are different, resulting in a difference in collected voice data, and therefore, the training sample set can be divided into multiple types of training samples according to the hardware device collecting the voice data.
As another possible scenario, if the training samples in the training sample set are voice data collected by the same hardware device in different sound environments, the noise levels of the training samples differ, so there are differences between the training samples, and the training sample set may be divided into multiple types of training samples according to the noise level of the training samples.
As another possible scenario, if the training sample set is voice data collected by the same hardware device for people of different ages and sexes, the collected voice data may also be different due to different speaking manners of children and adults, and therefore, the training sample set may be divided into multiple types of training samples according to the sexes and ages corresponding to the training samples.
As yet another possible scenario, the training samples may be classified according to their applicable service scenarios. Training samples for different traffic scenarios are divided into different classes.
It should be noted that the above method for classifying training samples is only an example, and there are other possible situations, which are not limited herein.
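As a small illustration of such classification, each training sample can simply be tagged with a branch index based on its metadata. The metadata keys, categories, and threshold below are hypothetical placeholders, not values from the patent:

```python
def branch_type(meta: dict) -> int:
    """Map sample metadata (source device, noise, scenario) to a branch index.
    Keys and thresholds are placeholders for whatever classification criterion is chosen."""
    if meta.get("scenario") == "in_vehicle":
        return 0
    if meta.get("device") == "smart_speaker":
        return 1
    if meta.get("noise_level", 0.0) > 0.5:      # far-field or noisy recordings
        return 2
    return 3                                     # default: e.g. near-field phone audio
```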
In the embodiment of the application, when the training samples of various types in the training sample set are adopted to train the second acoustic model, the training samples of different types share the same coding layer so as to code the original audio features in the training samples. And inputting the different types of encoded training samples into branches of a decoding layer matched with the types of the encoded training samples and a corresponding output layer so as to train each branch in the second acoustic model and perform voice recognition by adopting the trained second acoustic model.
It should be explained that, in the process of training the second acoustic model, in order to ensure that each type of training sample is fully used, the same type of training sample may be used in one training process. The types of training samples in different training processes may be the same or different, but in order to achieve better training effect, multiple types of training samples may be used in different training processes as much as possible.
In the embodiment of the application, after training the branches matched with the corresponding types in the second acoustic model by adopting various types of training samples in the training sample set, the training effect of the trained second acoustic model is tested by adopting the testing sample set.
As a possible implementation manner, when testing the trained second acoustic model, the second acoustic model may be split, according to the different service scenarios, into single-branch models of the form of the first acoustic model, each corresponding to one service scenario, and each such model is then tested with test samples from that scenario. To ensure the accuracy of the model test, the number of test samples can be 3000 to 10000, each audio sample needs a corresponding text label, and the test results are generally reported as character-level accuracy and sentence-level accuracy, thereby evaluating the training result of the second acoustic model.
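A rough evaluation sketch along these lines is given below, continuing the earlier sketches: one branch is scored on test samples from its own scenario, and character accuracy and sentence accuracy are reported. The decode_fn argument stands in for whatever decoding routine (e.g. beam search) is used; it is not defined by the patent.

```python
def edit_distance(hyp: str, ref: str) -> int:
    """Character-level Levenshtein distance."""
    prev = list(range(len(ref) + 1))
    for i, ch in enumerate(hyp, 1):
        cur = [i]
        for j, cr in enumerate(ref, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ch != cr)))
        prev = cur
    return prev[-1]

def evaluate_branch(model, branch_id, test_samples, decode_fn):
    """Character and sentence accuracy of one branch on its matching test set."""
    char_errors, char_total, sent_correct = 0, 0, 0
    with torch.no_grad():
        for feats, ref_text in test_samples:
            hyp_text = decode_fn(model, feats, branch_id)   # caller-supplied decoder
            char_errors += edit_distance(hyp_text, ref_text)
            char_total += len(ref_text)
            sent_correct += int(hyp_text == ref_text)
    char_acc = 1.0 - char_errors / max(char_total, 1)
    sent_acc = sent_correct / max(len(test_samples), 1)
    return char_acc, sent_acc
```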
In the speech processing method of the embodiment of the application, a training sample set is used to train a first acoustic model comprising an encoding layer, a decoding layer and an output layer; the decoding layer and the output layer are copied to obtain a plurality of branches, each branch comprising one decoding layer and its corresponding output layer; a second acoustic model is generated from the plurality of branches and the encoding layer of the first acoustic model; and each type of training sample in the training sample set is used to train the branch of the second acoustic model that matches that type, so that the trained second acoustic model can be used for speech recognition. Because each branch of the trained second acoustic model is trained with training samples of the corresponding type, different types of voices can be recognized accurately, and the accuracy of speech recognition is improved.
In a possible implementation manner of the embodiment of the present application, the original audio features of the speech and the reference pronunciation information of the text label corresponding to the speech may be used as training samples to train the second acoustic model. For the specific model training process, refer to fig. 4; fig. 4 is a flowchart illustrating a training method for a second acoustic model according to an embodiment of the present disclosure.
As shown in fig. 4, the model training method may include the steps of:
and step 201, encoding the original audio features in the training sample by using the encoding layer of the second acoustic model.
The training sample comprises original audio features of voice and reference pronunciation information of text labels corresponding to the voice.
Most of the information in a speech signal is contained in its low-frequency components and low-amplitude portions, and the response of the human ear to the sound spectrum is nonlinear. Experience has shown that the performance of speech recognition can be improved if the audio is processed in a manner similar to the human ear. In this embodiment, feature extraction is therefore performed on the original audio of the speech in the training sample to obtain the original audio features of the speech. The original audio features include filter bank (FBank) features.
In this embodiment, MFCC features are extracted from the original audio of the speech, and the manually labeled text is aligned to the audio segments by a Gaussian mixture model (GMM), so as to convert the text into the reference pronunciation information of the text label corresponding to the speech.
It should be noted that, in the present embodiment, the original audio feature extraction method may refer to the prior art, and is not described herein again.
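For concreteness only, one common way to obtain FBank features is shown below; the use of torchaudio and the 80-bin setting are assumptions of this sketch, not requirements of the patent.

```python
import torchaudio

def extract_fbank(wav_path: str, num_mel_bins: int = 80) -> torch.Tensor:
    """Load a recording and compute log-Mel filter bank (FBank) features."""
    waveform, sample_rate = torchaudio.load(wav_path)
    feats = torchaudio.compliance.kaldi.fbank(
        waveform,
        num_mel_bins=num_mel_bins,
        sample_frequency=sample_rate,
    )
    return feats        # shape: (num_frames, num_mel_bins)
```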
In this embodiment, the training samples may be obtained by collecting voice data with hardware devices. The voice data collected by different hardware devices differ, and the voice data collected by the same hardware device in different sound environments also differ, so the training samples can be divided into different types.
Specifically, the encoding layer of the second acoustic model is employed to encode the original audio features in the training samples input through the input layer.
Step 202, inputting the corresponding codes into the branches of the matching type according to the type of the training sample to obtain the output pronunciation information.
Specifically, according to the type of the training sample, the codes corresponding to the original audio features corresponding to each type are input into the branches of the second acoustic model, which are matched with the type, so that the output pronunciation information is obtained.
For example, if there is a type of training sample matching the second branch of the second acoustic model, the original audio features in the type of training sample will be encoded, and then the corresponding codes will be input into the second branch to obtain the output pronunciation information.
In step 203, the parameters of the branches of the matching type are adjusted according to the difference between the reference pronunciation information and the output pronunciation information, so as to minimize the difference.
In the embodiment of the application, after the corresponding encoding is input, according to the type of the training sample, into the branch matching that type to obtain the output pronunciation information, the output pronunciation information is compared with the reference pronunciation information of the corresponding text label to obtain the difference between the reference pronunciation information and the output pronunciation information. The parameters of the branch matching that type are then adjusted according to this difference so as to optimize the second acoustic model; when the difference between the reference pronunciation information and the output pronunciation information is minimized, the training of the second acoustic model is completed.
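Steps 201 to 203 can be summarized in the following hedged training sketch, continuing the earlier classes. Cross-entropy is used here as one concrete measure of the difference between the output pronunciation information and the reference pronunciation information, and only the matching branch's parameters are handed to the optimizer; whether the shared encoding and attention layers are also fine-tuned is a design choice the patent leaves open.

```python
criterion = nn.CrossEntropyLoss()

def train_second_model(model: SecondAcousticModel, batches_by_type, epochs=10, lr=1e-4):
    """batches_by_type[t] yields (feats, target_emb, ref_ids) batches of sample type t."""
    optimizers = [torch.optim.Adam(branch.parameters(), lr=lr) for branch in model.branches]
    for _ in range(epochs):
        for type_id, batches in enumerate(batches_by_type):
            for feats, target_emb, ref_ids in batches:
                logits = model(feats, target_emb, branch_id=type_id)   # steps 201 and 202
                loss = criterion(logits.transpose(1, 2), ref_ids)      # (N, C, T) vs (N, T)
                optimizers[type_id].zero_grad()
                loss.backward()                                        # step 203: adjust the
                optimizers[type_id].step()                             # matching branch only
```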
Therefore, various types of training samples are adopted to train the branches, matched with various types, of the second acoustic model, so that the trained second acoustic model can meet the requirements of different scenes, and the pronunciation information corresponding to various types of voices can be accurately identified.
In the embodiment of the application, an encoding layer of a second acoustic model is adopted to encode original audio features in a training sample, corresponding codes are input into branches of a matching type according to the type of the training sample to obtain output pronunciation information, and parameters of the branches of the matching type are adjusted according to the difference between reference pronunciation information and the output pronunciation information to minimize the difference. Therefore, the branches of the matched types in the second acoustic model are trained through the original audio features of the voice and the reference pronunciation information of the corresponding text labels, so that the corresponding pronunciation information can be accurately output after various types of voice are input into the second acoustic model, and the accuracy of voice recognition is improved.
As an example, a target speech to be recognized may be input into the trained second acoustic model to obtain pronunciation information corresponding to the target speech. The above process is described in detail with reference to fig. 5, and fig. 5 is a flowchart illustrating another speech processing method according to an embodiment of the present application.
As shown in fig. 5, the speech processing method includes the steps of:
and step 301, encoding the target voice to be recognized by adopting the encoding layer of the second acoustic model.
In the embodiment of the application, after the target voice to be recognized is input into the second acoustic model, the encoding layer of the second acoustic model encodes the target voice to be recognized, so as to convert the target voice into an encoded representation that the model can process.
Step 302, determining a target branch from the plurality of branches of the second acoustic model according to the type of the target voice.
In the embodiment of the present application, since each branch of the second acoustic model is trained with training samples matching the type of that branch, in this embodiment the target branch matching the type of the target voice is determined from the multiple branches of the second acoustic model according to the type of the target voice.
And step 303, coding the target voice, and inputting the target branch to obtain corresponding pronunciation information.
In the embodiment of the application, after the target branch matching the type of the target voice is determined from the plurality of branches of the second acoustic model, the encoding of the target voice is input into the target branch to obtain the corresponding pronunciation information.
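Steps 301 to 303 amount to routing by type at inference time. A sketch under the same assumptions as above is given below; the decode_fn argument stands in for the autoregressive or beam-search decoding loop over the branch's decoder and output layer, which the patent does not detail.

```python
def recognize(model: SecondAcousticModel, feats, voice_type: int, decode_fn):
    """Encode the target voice, pick the branch for its type, and decode pronunciation info."""
    model.eval()
    with torch.no_grad():
        x = model.input_proj(feats)                 # step 301: encode the target voice
        enc = model.encoder(x)
        ctx, _ = model.attention(enc, enc, enc)
        branch = model.branches[voice_type]         # step 302: pick the matching target branch
        return decode_fn(branch, ctx)               # step 303: branch decoder + output layer
```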
In the voice processing method described above, the target voice to be recognized is encoded by the encoding layer of the second acoustic model, a target branch is determined from the multiple branches of the second acoustic model according to the type of the target voice, and the encoding of the target voice is input into the target branch to obtain the corresponding pronunciation information. Different types of target voice to be recognized are thus input into the branch matching their type to obtain the corresponding pronunciation information, which improves the accuracy of voice recognition and solves the technical problem in the prior art that an acoustic model has low accuracy when recognizing multiple types of voice information.
In order to implement the above embodiments, the present application further provides a speech processing apparatus.
Fig. 6 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application.
As shown in fig. 6, the speech processing apparatus 100 includes: a first training module 110, a processing module 120, a generating module 130, and a second training module 140.
The first training module 110 is configured to train a first acoustic model using a training sample set, where the first acoustic model includes an encoding layer, a decoding layer, and an output layer.
A processing module 120, configured to copy the decoding layer and the output layer to obtain a plurality of branches; each branch comprises a decoding layer and a corresponding output layer.
A generating module 130 is configured to generate a second acoustic model according to the plurality of branches and the coding layer of the first acoustic model.
The second training module 140 is configured to use each type of training sample in the training sample set to respectively train branches of the second acoustic model matching the corresponding type, so as to use the trained second acoustic model to perform speech recognition.
As a possible implementation manner, the training sample includes original audio features of speech and reference pronunciation information of a text label corresponding to the speech, and the second training module 140 is configured to:
encoding the original audio features in the training sample by adopting an encoding layer of the second acoustic model;
inputting the corresponding codes into branches of the matching types according to the types of the training samples to obtain output pronunciation information;
according to the difference between the reference pronunciation information and the output pronunciation information, the parameters of the branches of the matching types are adjusted to minimize the difference.
As another possible implementation, the original audio features include filter bank (FBank) features.
As another possible implementation manner, the speech processing apparatus 100 further includes:
and the classification module is used for classifying the classes according to the source and/or the applicable service scene of the training sample.
As another possible implementation, the first acoustic model further includes an attention layer; the second acoustic model accordingly includes an attention layer.
As another possible implementation, the second training module 140 is configured to:
coding the target voice to be recognized by adopting a coding layer of the second acoustic model;
determining a target branch from the plurality of branches of the second acoustic model according to the type of the target voice;
and inputting the encoding of the target voice into the target branch to obtain the corresponding pronunciation information.
It should be noted that the foregoing explanation of the embodiment of the speech processing method is also applicable to the speech processing apparatus of the embodiment, and is not repeated here.
In the speech processing device of the embodiment of the application, a training sample set is used to train a first acoustic model comprising an encoding layer, a decoding layer and an output layer; the decoding layer and the output layer are copied to obtain a plurality of branches, each branch comprising one decoding layer and its corresponding output layer; a second acoustic model is generated from the plurality of branches and the encoding layer of the first acoustic model; and each type of training sample in the training sample set is used to train the branch of the second acoustic model that matches that type, so that the trained second acoustic model can be used for speech recognition. Because each branch of the trained second acoustic model is trained with training samples of the corresponding type, different types of voices can be recognized accurately, and the accuracy of speech recognition is improved.
In order to implement the above embodiments, the present application also provides an electronic device, including: comprising a memory, a processor and a computer program stored on the memory and executable on the processor, which when executed by the processor implements the speech processing method as described in the above embodiments.
In order to implement the above embodiments, the present application also proposes a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech processing method as described in the above embodiments.
FIG. 7 illustrates a block diagram of an exemplary computer device suitable for implementing embodiments of the present application. The computer device 12 shown in fig. 7 is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present application.
As shown in FIG. 7, computer device 12 is in the form of a general purpose computing device. The components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that couples various system components including the system memory 28 and the processing unit 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. These architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus, to name a few.
Computer device 12 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile Memory, such as Random Access Memory (RAM) 30 and/or cache Memory 32. Computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 7, and commonly referred to as a "hard drive"). Although not shown in FIG. 7, a disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact disk Read Only Memory (CD-ROM), a Digital versatile disk Read Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 by one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. Program modules 42 generally perform the functions and/or methodologies of the embodiments described herein.
Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the computer system/server 12, and/or with any devices (e.g., network card, modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public Network such as the Internet) via Network adapter 20. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processing unit 16 executes various functional applications and data processing, such as implementing the voice processing method mentioned in the foregoing embodiments, by executing programs stored in the system memory 28.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (14)

1. A method of speech processing, the method comprising the steps of:
training a first acoustic model by adopting a training sample set, wherein the first acoustic model comprises an encoding layer, a decoding layer and an output layer; each type of training sample in the training sample set is voice data collected by different hardware equipment;
copying the decoding layer and the output layer to obtain a plurality of branches; each branch comprises one decoding layer and one output layer;
generating a second acoustic model from the plurality of branches and an encoding layer of the first acoustic model;
and training branches matched with corresponding types in the second acoustic model by adopting various types of training samples in the training sample set respectively, so as to perform voice recognition on the target voice to be recognized by adopting the trained second acoustic model and obtain pronunciation information corresponding to the target voice.
2. The method according to claim 1, wherein the training samples include original audio features of speech and reference pronunciation information of text labels corresponding to the speech, and the training using each type of training sample in the set of training samples to respectively train branches of the second acoustic model matching the corresponding type includes:
encoding the original audio features in the training sample by adopting an encoding layer of the second acoustic model;
inputting, according to the type of the training sample, the corresponding encoding into the branch matching that type to obtain output pronunciation information;
and according to the difference between the reference pronunciation information and the output pronunciation information, performing parameter adjustment on the branch matched with the type so as to minimize the difference.
3. The method of claim 2,
the original audio features comprise filter bank (FBank) features.
4. The method of claim 1, wherein before the training of the branches of the second acoustic model that match the corresponding type using the training samples of the respective types in the set of training samples, the method further comprises:
and classifying according to the source of the training sample and/or the applicable service scene.
5. The method of any of claims 1-4, wherein the first acoustic model further comprises an attention layer;
the second acoustic model correspondingly comprises the attention layer.
6. The method according to any one of claims 1-4, wherein the performing speech recognition using the trained second acoustic model comprises:
coding the target voice to be recognized by adopting the coding layer of the second acoustic model;
determining a target branch from the plurality of branches of the second acoustic model according to the type of the target voice;
and inputting the encoding of the target voice into the target branch to obtain the corresponding pronunciation information.
7. A speech processing apparatus, characterized in that the apparatus comprises:
the first training module is used for training a first acoustic model by adopting a training sample set, wherein the first acoustic model comprises an encoding layer, a decoding layer and an output layer; each type of training sample in the training sample set is voice data collected by different hardware equipment;
the processing module is used for copying the decoding layer and the output layer to obtain a plurality of branches; each branch comprises one decoding layer and one output layer;
a generating module, configured to generate a second acoustic model according to the plurality of branches and an encoding layer of the first acoustic model;
and the second training module is used for adopting each type of training sample in the training sample set to respectively train the branches matched with the corresponding type in the second acoustic model so as to adopt the trained second acoustic model to perform voice recognition on the target voice to be recognized and obtain the pronunciation information corresponding to the target voice.
8. The apparatus of claim 7, wherein the training samples comprise original audio features of speech and reference pronunciation information of corresponding text labels of the speech, and wherein the second training module is configured to:
encoding the original audio features in the training sample by adopting an encoding layer of the second acoustic model;
inputting, according to the type of the training sample, the corresponding encoding into the branch matching that type to obtain output pronunciation information;
and according to the difference between the reference pronunciation information and the output pronunciation information, performing parameter adjustment on the branch matched with the type so as to minimize the difference.
9. The apparatus of claim 8, wherein the original audio features comprise filter bank (FBank) features.
10. The apparatus of claim 7, further comprising:
and the classification module is used for classifying the classes according to the source and/or the applicable service scene of the training sample.
11. The apparatus of any of claims 7-10, wherein the first acoustic model further comprises an attention layer;
the second acoustic model correspondingly comprises the attention layer.
12. The apparatus of any of claims 7-10, wherein the second training module is configured to:
coding the target voice to be recognized by adopting the coding layer of the second acoustic model;
determining a target branch from the plurality of branches of the second acoustic model according to the type of the target voice;
and inputting the encoding of the target voice into the target branch to obtain the corresponding pronunciation information.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the speech processing method according to any of claims 1-6 when executing the program.
14. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the speech processing method according to any one of claims 1 to 6.
CN201910463203.7A 2019-05-30 2019-05-30 Voice processing method and device and electronic equipment Active CN110197658B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910463203.7A CN110197658B (en) 2019-05-30 2019-05-30 Voice processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910463203.7A CN110197658B (en) 2019-05-30 2019-05-30 Voice processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN110197658A CN110197658A (en) 2019-09-03
CN110197658B true CN110197658B (en) 2021-01-26

Family

ID=67753539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910463203.7A Active CN110197658B (en) 2019-05-30 2019-05-30 Voice processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110197658B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178253B (en) * 2019-12-27 2024-02-27 佑驾创新(北京)技术有限公司 Visual perception method and device for automatic driving, computer equipment and storage medium
CN111261144B (en) * 2019-12-31 2023-03-03 华为技术有限公司 Voice recognition method, device, terminal and storage medium
CN113112993B (en) * 2020-01-10 2024-04-02 阿里巴巴集团控股有限公司 Audio information processing method and device, electronic equipment and storage medium
CN111354345B (en) * 2020-03-11 2021-08-31 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating speech model and speech recognition
CN111653271B (en) * 2020-05-26 2023-09-05 大众问问(北京)信息科技有限公司 Sample data acquisition and model training method and device and computer equipment
CN111768763A (en) * 2020-06-12 2020-10-13 北京三快在线科技有限公司 Acoustic model training method and device, electronic equipment and storage medium
CN111862949B (en) * 2020-07-30 2024-04-02 北京小米松果电子有限公司 Natural language processing method and device, electronic equipment and storage medium
CN111899729B (en) * 2020-08-17 2023-11-21 广州市百果园信息技术有限公司 Training method and device for voice model, server and storage medium
CN112489637B (en) * 2020-11-03 2024-03-26 北京百度网讯科技有限公司 Speech recognition method and device
CN114998881B (en) * 2022-05-27 2023-11-07 北京百度网讯科技有限公司 Training method of deep learning model, text recognition method, device and equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011053312A (en) * 2009-08-31 2011-03-17 Nippon Hoso Kyokai <Nhk> Adaptive acoustic model generating device and program
CN103400577A (en) * 2013-08-01 2013-11-20 百度在线网络技术(北京)有限公司 Acoustic model building method and device for multi-language voice identification
CN106228980A (en) * 2016-07-21 2016-12-14 百度在线网络技术(北京)有限公司 Data processing method and device
CN107481717A (en) * 2017-08-01 2017-12-15 百度在线网络技术(北京)有限公司 A kind of acoustic training model method and system
US10013973B2 (en) * 2016-01-18 2018-07-03 Kabushiki Kaisha Toshiba Speaker-adaptive speech recognition
CN109616102A (en) * 2019-01-09 2019-04-12 百度在线网络技术(北京)有限公司 Training method, device and the storage medium of acoustic model
CN109616103A (en) * 2019-01-09 2019-04-12 百度在线网络技术(北京)有限公司 Training method, device and the storage medium of acoustic model
CN109697977A (en) * 2017-10-23 2019-04-30 三星电子株式会社 Audio recognition method and equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9336770B2 (en) * 2013-08-13 2016-05-10 Mitsubishi Electric Corporation Pattern recognition apparatus for creating multiple systems and combining the multiple systems to improve recognition performance and pattern recognition method
CN107785015A (en) * 2016-08-26 2018-03-09 阿里巴巴集团控股有限公司 A kind of audio recognition method and device
CN109272988B (en) * 2018-09-30 2022-05-24 江南大学 Voice recognition method based on multi-path convolution neural network

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011053312A (en) * 2009-08-31 2011-03-17 Nippon Hoso Kyokai <Nhk> Adaptive acoustic model generating device and program
CN103400577A (en) * 2013-08-01 2013-11-20 百度在线网络技术(北京)有限公司 Acoustic model building method and device for multi-language voice identification
US10013973B2 (en) * 2016-01-18 2018-07-03 Kabushiki Kaisha Toshiba Speaker-adaptive speech recognition
CN106228980A (en) * 2016-07-21 2016-12-14 百度在线网络技术(北京)有限公司 Data processing method and device
CN107481717A (en) * 2017-08-01 2017-12-15 百度在线网络技术(北京)有限公司 A kind of acoustic training model method and system
CN109697977A (en) * 2017-10-23 2019-04-30 三星电子株式会社 Audio recognition method and equipment
CN109616102A (en) * 2019-01-09 2019-04-12 百度在线网络技术(北京)有限公司 Training method, device and the storage medium of acoustic model
CN109616103A (en) * 2019-01-09 2019-04-12 百度在线网络技术(北京)有限公司 Training method, device and the storage medium of acoustic model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Sequence Training of Multi-Task Acoustic Models Using Meta-State Labels; Olivier Siohan; 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2016-05-19; full text *
连续语音识别特征提取与声学模型训练区分性技术研究 (Research on discriminative techniques for feature extraction and acoustic model training in continuous speech recognition); 陈斌 (Chen Bin); 中国博士学位论文全文数据库 (China Doctoral Dissertations Full-text Database), Information Science and Technology series; 2016-07-15, No. 07; full text *

Also Published As

Publication number Publication date
CN110197658A (en) 2019-09-03

Similar Documents

Publication Publication Date Title
CN110197658B (en) Voice processing method and device and electronic equipment
CN110709924B (en) Audio-visual speech separation
Adeel et al. Contextual deep learning-based audio-visual switching for speech enhancement in real-world environments
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN106887225B (en) Acoustic feature extraction method and device based on convolutional neural network and terminal equipment
CN110457432B (en) Interview scoring method, interview scoring device, interview scoring equipment and interview scoring storage medium
Adeel et al. Lip-reading driven deep learning approach for speech enhancement
US11120802B2 (en) Diarization driven by the ASR based segmentation
CN112581938B (en) Speech breakpoint detection method, device and equipment based on artificial intelligence
CN109791616A (en) Automatic speech recognition
CN111899758A (en) Voice processing method, device, equipment and storage medium
CN111883107B (en) Speech synthesis and feature extraction model training method, device, medium and equipment
CN113205793A (en) Audio generation method and device, storage medium and electronic equipment
CN113923521B (en) Video scripting method
CN113160855B (en) Method and apparatus for improving on-line voice activity detection system
US10910000B2 (en) Method and device for audio recognition using a voting matrix
US10468031B2 (en) Diarization driven by meta-information identified in discussion content
CN111833847A (en) Speech processing model training method and device
CN113314099B (en) Method and device for determining confidence coefficient of speech recognition
CN113139561A (en) Garbage classification method and device, terminal equipment and storage medium
CN111951786A (en) Training method and device of voice recognition model, terminal equipment and medium
CN115240696B (en) Speech recognition method and readable storage medium
CN116825092B (en) Speech recognition method, training method and device of speech recognition model
CN116612747B (en) Speech phoneme recognition method, device, equipment and storage medium
CN111027667A (en) Intention category identification method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant