Method, apparatus, device, and medium for generating a speech model and for speech recognition
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular, to methods, apparatuses, devices, and media for generating a speech model and for speech recognition.
Background
With advances in data processing technology and the rapid spread of the mobile internet, computer technology has come to be applied throughout society, and massive amounts of data are being generated. Among these data, speech data is receiving more and more attention.
Generally, the recognition accuracy of a speech model is closely related to the number of samples used in its training, and reducing the number of samples or simplifying the training process often lowers recognition accuracy. As a result, a speech model requires many samples during training, the training process is complicated, and the recognition accuracy of the speech model is difficult to improve.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose methods, apparatuses, devices, and media for generating a speech model and for speech recognition, to solve the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a method for generating a speech model, the method comprising: acquiring a training sample set, wherein training samples in the training sample set comprise speech samples and recognition result samples corresponding to the speech samples; and performing joint learning training on an initial model according to the training sample set to obtain a speech model, wherein the initial model comprises a plurality of output layers.
In a second aspect, some embodiments of the present disclosure provide a speech recognition method, the method comprising: acquiring target speech; and inputting the target speech into a pre-trained speech model to obtain a recognition result of the target speech, wherein the speech model is generated by the method of one of the above embodiments.
In a third aspect, some embodiments of the present disclosure provide an apparatus for generating a speech model, the apparatus comprising: an acquisition unit configured to acquire a training sample set, wherein training samples in the training sample set comprise speech samples and recognition result samples corresponding to the speech samples; and a training unit configured to perform joint learning training on an initial model according to the training sample set to obtain a speech model, wherein the initial model comprises a plurality of output layers.
In a fourth aspect, some embodiments of the present disclosure provide a speech recognition apparatus comprising: a target speech acquisition unit configured to acquire target speech; and a recognition unit configured to input the target speech into a pre-trained speech model to obtain a recognition result of the target speech, wherein the speech model is generated by the method according to one of the above embodiments.
In a fifth aspect, some embodiments of the present disclosure provide an electronic device, including: one or more processors; and a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any implementation of the first or second aspect.
In a sixth aspect, some embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the method as described in any implementation of the first or second aspect.
One of the above embodiments of the present disclosure has the following advantageous effects: a training sample set is first acquired, and joint learning training is then performed on an initial model according to the training sample set to obtain a speech model. Because each speech sample in the training sample set is matched with a corresponding recognition result sample, the trained speech model achieves higher accuracy. Training the initial model through joint learning makes the training process simpler, requires fewer samples, and offers greater flexibility, so that a small number of samples can be used effectively, simplifying the training process while improving the recognition accuracy of the speech model.
One of the above embodiments of the present disclosure has the following advantageous effects: target speech is first acquired and then input into a pre-trained speech model to obtain a recognition result of the target speech. Using the trained speech model makes the recognition result of the target speech more accurate, improving both the accuracy of the recognition result and the user experience.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and features are not necessarily drawn to scale.
FIG. 1 is a schematic diagram of one application scenario of a method of generating a speech model according to some embodiments of the present disclosure;
FIG. 2 is a flow diagram of some embodiments of a method of generating a speech model according to the present disclosure;
FIG. 3 is a schematic structural diagram of an initial model according to some embodiments of the present disclosure;
FIG. 4 is a flow diagram of some embodiments of a speech recognition method according to the present disclosure;
FIG. 5 is a schematic block diagram illustration of some embodiments of an apparatus for generating speech models according to the present disclosure;
FIG. 6 is a schematic block diagram of some embodiments of a speech recognition apparatus according to the present disclosure;
FIG. 7 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings. The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that references to "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting; those skilled in the art will understand that these terms mean "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
FIG. 1 is a schematic diagram of one application scenario of a method of generating a speech model according to some embodiments of the present disclosure.
As shown in fig. 1, the server 101 may first obtain a training sample set 102 from locally pre-stored data or by downloading over a network. Here, training samples in the training sample set 102 include a speech sample 1021 and a recognition result sample 1022 corresponding to the speech sample. The server 101 may then perform joint learning training on an initial model 103 according to the training sample set 102 to obtain a speech model 104.
It is understood that the method for generating a speech model may be executed by the server 101 or by a terminal device; the execution subject of the method may also include a device formed by integrating the server 101 and a terminal device through a network, or the method may be executed by various software programs. The terminal device may be any of various electronic devices with information processing capability, including but not limited to a smartphone, a tablet computer, an e-book reader, a laptop computer, a desktop computer, and the like. When the execution subject is software, it may be installed in the electronic devices listed above. It may be implemented, for example, as multiple pieces of software or software modules providing distributed services, or as a single piece of software or software module. No specific limitation is imposed here.
It should be understood that the number of servers in fig. 1 is merely illustrative. There may be any number of servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of some embodiments of a method for generating a speech model according to the present disclosure is shown. The method for generating a speech model comprises the following steps:
Step 201, acquiring a training sample set.
In some embodiments, the execution subject of the method for generating a speech model (e.g., the server shown in fig. 1) may obtain the training sample set by downloading it from a network or reading it locally, through a wired or wireless connection. Here, the training samples in the training sample set include speech samples and recognition result samples corresponding to the speech samples.
And 202, performing joint learning training on the initial model according to the training sample set to obtain a voice model.
In some embodiments, based on the training sample set obtained in step 201, the execution subject may perform joint learning training on the initial model according to the training sample set to obtain a speech model. The initial model includes a plurality of output layers.
Here, joint learning in the above joint learning training generally refers to a form of multi-task learning (MTL).
In the present embodiment, the initial model may be any of various existing neural network models created based on machine learning techniques, and may have any of various existing neural network structures. As an example, the plurality of output layers may be two output layers: the first branch is formed by connecting a long short-term memory network (LSTM), an attention model, and the first output layer in sequence, while the second branch is formed by connecting the LSTM directly to the second output layer. Here, the long short-term memory network may have multiple layers.
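For illustration only, the two-branch structure described above might be sketched as follows. This is a minimal sketch assuming a PyTorch implementation; every name in it (InitialModel, input_dim, hidden_dim, vocab_size, the use of nn.MultiheadAttention as the attention model, and the layer counts) is an assumption of the sketch, not part of the disclosure.

```python
import torch
import torch.nn as nn


class InitialModel(nn.Module):
    """Shared LSTM trunk with two output branches, as in the example above."""

    def __init__(self, input_dim: int = 80, hidden_dim: int = 256, vocab_size: int = 5000):
        super().__init__()
        # Shared trunk; the LSTM may have multiple layers, as noted above.
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=2, batch_first=True)
        # First branch: LSTM -> attention model -> first output layer.
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.head1 = nn.Linear(hidden_dim, vocab_size)
        # Second branch: LSTM connected directly to the second output layer.
        self.head2 = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features: torch.Tensor):
        # features: (batch, time, input_dim) acoustic feature frames.
        encoded, _ = self.lstm(features)
        attended, _ = self.attention(encoded, encoded, encoded)
        return self.head1(attended), self.head2(encoded)
```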
For example, the initial model shown in fig. 3 is composed of a long short-term memory network (LSTM) connected to an attention model and a first output layer, with the LSTM also connected to a second output layer and a third output layer. Here, the long short-term memory network may have multiple layers.

In some optional implementations of some embodiments, the execution subject may input the speech sample of a training sample into the initial model to obtain an output result from each output layer of the initial model. Then, based on preset loss functions, the difference between the recognition result sample and each output result is determined respectively, yielding a plurality of loss values corresponding to the plurality of output layers. The initial model is then optimized according to the plurality of loss values to obtain the speech model.
Specifically, the execution subject may compare the recognition result sample with each of the above output results to determine the loss values. For example, the recognition result sample and each output result may be passed as arguments to a specified loss function, and the loss value between the two computed. The relevant parameters of the model are then adjusted according to the determined loss values. Here, a loss function is generally used to measure the degree of discrepancy between a predicted value of the model (e.g., an output result) and the true value (e.g., the expected output result). It is a non-negative real-valued function; in general, the smaller the loss, the more robust the model. The loss function may be set according to actual requirements. As an example, the loss function may be a cross-entropy loss function. Note that the loss functions of the plurality of output layers usually differ from one another.

In some optional implementations of some embodiments, the execution subject may dynamically synchronize the plurality of loss values and then optimize the initial model with respect to a first loss value of the plurality of loss values. Specifically, the optimization is usually performed by adjusting the relevant parameters of the model based on the plurality of loss values.
In response to a preset training end condition being met, the output layers corresponding to the remaining loss values are discarded to obtain the speech model. Here, the first loss value generally refers to the loss value corresponding to the output of the output layer that is to be retained. The output layer to be retained may be preset or manually selected. Discarding the output layers corresponding to the remaining loss values generally means discarding the hidden-layer parameters and/or output-layer parameters adjusted according to the remaining output layers.
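Continuing the hedged sketch above, discarding the remaining output layers might look like the following: only the shared trunk and the retained branch are carried over into the final speech model, so the parameters belonging to the other branches are dropped. The SpeechModel class and the choice of the first branch as the retained one are illustrative assumptions.

```python
class SpeechModel(nn.Module):
    """Final model keeping only the shared trunk and the retained branch."""

    def __init__(self, trained: InitialModel):
        super().__init__()
        self.lstm = trained.lstm            # shared hidden layers are kept
        self.attention = trained.attention  # part of the retained first branch
        self.head1 = trained.head1          # the retained output layer
        # trained.head2 (and any further branch) is simply not carried over,
        # i.e., the output-layer parameters tied to the remaining loss
        # values are discarded.

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        encoded, _ = self.lstm(features)
        attended, _ = self.attention(encoded, encoded, encoded)
        return self.head1(attended)
```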
Specifically, dynamic synchronization means that the obtained plurality of loss values are dynamically maintained in fixed proportion to one another. As an example, suppose there are three loss values: loss value A starts at 10, loss value B starts at 5, and loss value C starts at 1, with loss value B taken as the first loss value. After some training, the three loss values change. The changed loss value A' is multiplied by the ratio of the initial loss value B to the initial loss value A, that is, A' is multiplied by 5/10; the changed loss value C' is multiplied by the ratio of the initial loss value B to the initial loss value C, that is, C' is multiplied by 5/1.
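The worked example above might be implemented as in the following sketch. It assumes a three-branch variant of the InitialModel sketched earlier (returning three outputs), uses cross-entropy for every branch even though the text notes the loss functions usually differ, and reads "optimizing with respect to the first loss value" as summing the rescaled losses before backpropagation; the initial values A0, B0, C0 mirror the example (10, 5, 1).

```python
criterion = nn.CrossEntropyLoss()
A0, B0, C0 = 10.0, 5.0, 1.0  # initial loss values from the example above


def train_step(model, optimizer, features, targets):
    # model is assumed to be a three-branch variant returning three outputs.
    out1, out2, out3 = model(features)
    # Per-branch losses against the recognition result sample; CrossEntropyLoss
    # expects (batch, classes, time), hence the transpose.
    loss_a = criterion(out1.transpose(1, 2), targets)
    loss_b = criterion(out2.transpose(1, 2), targets)  # the "first" loss value
    loss_c = criterion(out3.transpose(1, 2), targets)
    # Dynamic synchronization: rescale the changed losses A' and C' by the
    # initial ratios so all three stay in proportion to loss B, i.e.
    # A' * (5 / 10) and C' * (5 / 1).
    total = loss_a * (B0 / A0) + loss_b + loss_c * (B0 / C0)
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```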
In some optional implementations of some embodiments, the execution subject may further perform random inactivation during the training process in response to the number of training samples in the training sample set being below a predetermined threshold. Here, random inactivation (dropout) generally refers to a method for optimizing an artificial neural network having a deep structure. By randomly zeroing some of the hidden-layer weights or outputs during learning, the mutual dependency (co-dependency) between nodes is reduced, thereby regularizing the neural network and lowering its structural risk.
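As a hedged sketch, the conditional use of dropout might look like this; the threshold value and dropout probability are assumptions, since the disclosure specifies neither.

```python
def maybe_dropout(num_samples: int, p: float = 0.5) -> nn.Module:
    """Return a dropout layer only when the training set is small."""
    SMALL_SAMPLE_THRESHOLD = 10_000  # assumed; the source gives no value
    # nn.Identity() is a no-op, so larger sample sets train without dropout.
    return nn.Dropout(p) if num_samples < SMALL_SAMPLE_THRESHOLD else nn.Identity()
```

In the sketches above, such a layer would typically be applied to the LSTM output before the output layers during training.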
In some optional implementations of some embodiments, the plurality of output layers is three output layers. For example, in the initial model shown in fig. 3, a long short-term memory network (LSTM), an attention model, and the first output layer are connected in sequence to form the first branch; the LSTM connected to the second output layer forms the second branch; and the LSTM connected to the third output layer forms the third branch. Here, the long short-term memory network may have multiple layers.
One of the above embodiments of the present disclosure has the following advantageous effects: a training sample set is first acquired, and joint learning training is then performed on an initial model according to the training sample set to obtain a speech model. Because each speech sample in the training sample set is matched with a corresponding recognition result sample, the trained speech model achieves higher accuracy. Training the initial model through joint learning makes the training process simpler, requires fewer samples, and offers greater flexibility. Compared with common methods that must train a model multiple times, this method makes effective use of fewer samples, simplifying the training process while improving the recognition accuracy of the speech model.
With continued reference to fig. 4, a flow 400 of some embodiments of a speech recognition method according to the present disclosure is shown. The speech recognition method 400 includes the steps of:
Step 401, acquiring target speech.
In some embodiments, the execution subject of the speech recognition method may acquire the target speech from locally pre-stored data, a network download, or the like. Here, the target speech generally refers to speech that the user needs to have recognized.
It is understood that the speech recognition method may be executed by a server or by a terminal device; the execution subject of the method may also include a device formed by integrating the server and a terminal device through a network, or the method may be executed by various software programs. The terminal device may be any of various electronic devices with information processing capability, including but not limited to a smartphone, a tablet computer, an e-book reader, a laptop computer, a desktop computer, and the like. When the execution subject is software, it may be installed in the electronic devices listed above. It may be implemented, for example, as multiple pieces of software or software modules providing distributed services, or as a single piece of software or software module. No specific limitation is imposed here.
Step 402, inputting the target speech into a pre-trained speech model to obtain a recognition result of the target speech.
In some embodiments, the execution subject inputs the target speech into a pre-trained speech model to obtain a recognition result of the target speech. Here, the speech model is generated by the method of one of the above embodiments.
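As an illustrative sketch only, inference with the trained model might proceed as below; the feature extraction step, the greedy argmax decoding, and the id2token mapping are assumptions, not taken from the disclosure.

```python
speech_model.eval()  # the model exported after training, as sketched earlier
with torch.no_grad():
    # target_features: (1, time, input_dim) features extracted from the
    # target speech; feature extraction itself is outside this sketch.
    logits = speech_model(target_features)
    token_ids = logits.argmax(dim=-1)  # greedy decoding over the vocabulary
    recognition = [id2token[i] for i in token_ids[0].tolist()]
```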
One of the above embodiments of the present disclosure has the following advantageous effects: target speech is first acquired and then input into a pre-trained speech model to obtain a recognition result of the target speech. Using the trained speech model makes the recognition result of the target speech more accurate, improving both the accuracy of the recognition result and the user experience.
With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of an apparatus for generating a speech model, which correspond to the method embodiments illustrated in fig. 2 and which may be applied in various electronic devices.
As shown in FIG. 5, the apparatus 500 for generating a speech model of some embodiments comprises: an acquisition unit 501 and a training unit 502. The acquisition unit 501 is configured to acquire a training sample set, where training samples in the training sample set include speech samples and recognition result samples corresponding to the speech samples; the training unit 502 is configured to perform joint learning training on an initial model according to the training sample set to obtain a speech model, where the initial model includes a plurality of output layers.
In an optional implementation of some embodiments, the training unit 502 of the apparatus 500 for generating a speech model is further configured to: input the speech samples of the training samples into the initial model to obtain an output result from each output layer of the initial model; determine, based on preset loss functions, the difference between the recognition result sample and each output result respectively to obtain a plurality of loss values corresponding to the plurality of output layers; and optimize the initial model according to the plurality of loss values to obtain the speech model.
In an optional implementation of some embodiments, the training unit 502 may dynamically synchronize the plurality of loss values; optimize the initial model with respect to a first loss value of the plurality of loss values; and, in response to a preset training end condition being met, discard the output layers corresponding to the remaining loss values to obtain the speech model.
In an optional implementation manner of some embodiments, the apparatus 500 for generating a speech model further includes an inactivation unit configured to perform random inactivation during the training process in response to the number of training samples in the training sample set being lower than a predetermined threshold.
In an optional implementation of some embodiments, the plurality of output layers is three output layers.
It will be understood that the elements described in the apparatus 500 correspond to various steps in the method described with reference to fig. 2. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 500 and the units included therein, and are not described herein again.
One of the above embodiments of the present disclosure has the following advantageous effects: a training sample set is first acquired, and joint learning training is then performed on an initial model according to the training sample set to obtain a speech model. Because each speech sample in the training sample set is matched with a corresponding recognition result sample, the trained speech model achieves higher accuracy. Training the initial model through joint learning makes the training process simpler, requires fewer samples, and offers greater flexibility. Compared with common methods that must train a model multiple times, this method makes effective use of fewer samples, simplifying the training process while improving the recognition accuracy of the speech model.
With further reference to fig. 6, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of a speech recognition apparatus, which correspond to the method embodiments shown in fig. 4 and which may be applied in various electronic devices.
As shown in FIG. 6, the speech recognition apparatus 600 of some embodiments comprises: a target speech acquisition unit 601 and a recognition unit 602. The target speech acquisition unit 601 is configured to acquire target speech; the recognition unit 602 is configured to input the target speech into a pre-trained speech model to obtain a recognition result of the target speech, where the speech model is generated by the method according to one of the above embodiments.
It will be understood that the elements described in the apparatus 600 correspond to various steps in the method described with reference to fig. 4. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 600 and the units included therein, and are not described herein again.
One of the above embodiments of the present disclosure has the following advantageous effects: target speech is first acquired and then input into a pre-trained speech model to obtain a recognition result of the target speech. Using the trained speech model makes the recognition result of the target speech more accurate, improving both the accuracy of the recognition result and the user experience.
Referring now to fig. 7, a schematic diagram of an electronic device (e.g., the server of fig. 1) 700 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, the electronic device 700 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data necessary for the operation of the electronic device 700. The processing device 701, the ROM 702, and the RAM 703 are connected to one another by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 7 may represent one device or may represent multiple devices as desired.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network via communications means 709, or may be installed from storage 708, or may be installed from ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of some embodiments of the present disclosure.
It should be noted that the computer readable medium described above in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a training sample set, wherein training samples in the training sample set comprise speech samples and recognition result samples corresponding to the speech samples; and perform joint learning training on an initial model according to the training sample set to obtain a speech model, wherein the initial model comprises a plurality of output layers. Alternatively, when the one or more programs are executed by the electronic device, they cause the electronic device to: acquire target speech; and input the target speech into a pre-trained speech model to obtain a recognition result of the target speech, wherein the speech model is generated by the method according to one of the above embodiments.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by software, and may also be implemented by hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit and a training unit. Where the names of these units do not in some cases constitute a limitation of the unit itself, for example, the acquisition unit may also be described as a "unit that acquires a training sample set".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In accordance with one or more embodiments of the present disclosure, there is provided a method of generating a speech model, including: acquiring a training sample set, wherein training samples in the training sample set comprise speech samples and recognition result samples corresponding to the speech samples; and performing joint learning training on an initial model according to the training sample set to obtain a speech model, wherein the initial model comprises a plurality of output layers.
According to one or more embodiments of the present disclosure, the performing joint learning training on the initial model according to the training sample set to obtain a speech model includes: inputting the speech samples of the training samples into the initial model to obtain an output result from each output layer of the initial model; determining, based on preset loss functions, the difference between the recognition result sample and each output result respectively to obtain a plurality of loss values corresponding to the plurality of output layers; and optimizing the initial model according to the plurality of loss values to obtain the speech model.
According to one or more embodiments of the present disclosure, the optimizing the initial model according to the plurality of loss values to obtain a speech model includes: dynamically synchronizing the plurality of loss values; optimizing the initial model with respect to a first loss value of the plurality of loss values; and, in response to a preset training end condition being met, discarding the output layers corresponding to the remaining loss values to obtain the speech model.
According to one or more embodiments of the present disclosure, the method further includes: in response to the number of training samples in the set of training samples being below a predetermined threshold, random inactivation is performed during the training process.
According to one or more embodiments of the present disclosure, the above plurality of output layers is three output layers.
According to one or more embodiments of the present disclosure, there is provided a method of speech recognition, including: acquiring target speech; and inputting the target speech into a pre-trained speech model to obtain a recognition result of the target speech, wherein the speech model is generated by the method of one of the above embodiments.
According to one or more embodiments of the present disclosure, there is provided an apparatus for generating a speech model, including: an acquisition unit configured to acquire a training sample set, wherein training samples in the training sample set comprise speech samples and recognition result samples corresponding to the speech samples; and a training unit configured to perform joint learning training on an initial model according to the training sample set to obtain a speech model, wherein the initial model comprises a plurality of output layers.
According to one or more embodiments of the present disclosure, the training unit is further configured to: input the speech samples of the training samples into the initial model to obtain an output result from each output layer of the initial model; determine, based on preset loss functions, the difference between the recognition result sample and each output result respectively to obtain a plurality of loss values corresponding to the plurality of output layers; and optimize the initial model according to the plurality of loss values to obtain the speech model.
According to one or more embodiments of the present disclosure, the training unit is further configured to: dynamically synchronize the plurality of loss values; optimize the initial model with respect to a first loss value of the plurality of loss values; and, in response to a preset training end condition being met, discard the output layers corresponding to the remaining loss values to obtain the speech model.
According to one or more embodiments of the present disclosure, the apparatus for generating a speech model further includes a random inactivation unit configured to: in response to the number of training samples in the set of training samples being below a predetermined threshold, random inactivation is performed during the training process.
According to one or more embodiments of the present disclosure, the above plurality of output layers is three output layers.
According to one or more embodiments of the present disclosure, there is provided an apparatus for speech recognition, including: a target speech acquisition unit configured to acquire target speech; and a recognition unit configured to input the target speech into a pre-trained speech model to obtain a recognition result of the target speech, wherein the speech model is generated by the method according to one of the above embodiments.
According to one or more embodiments of the present disclosure, there is provided an electronic device including: one or more processors; a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the method as in any of the various embodiments described above.
According to one or more embodiments of the present disclosure, a computer-readable medium is provided on which a computer program is stored, wherein the program, when executed by a processor, implements the method as described in any of the various embodiments above.

The foregoing description covers merely preferred embodiments of the present disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the embodiments of the present disclosure is not limited to technical solutions formed by the specific combinations of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the embodiments of the present disclosure.