WO2023245869A1 - Training method and apparatus for a speech recognition model, electronic device, and storage medium - Google Patents

Training method and apparatus for a speech recognition model, electronic device, and storage medium

Info

Publication number
WO2023245869A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech recognition
speech
recognition model
model
decoding
Prior art date
Application number
PCT/CN2022/116552
Other languages
English (en)
French (fr)
Inventor
游岚华
贾磊
张奇
蒋正翔
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司
Publication of WO2023245869A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Definitions

  • The present disclosure relates to the field of artificial intelligence technology, especially to deep learning, speech recognition, and related fields.
  • Speech recognition technology is a technology that allows machines to convert speech signals into corresponding text or commands through the process of recognition and understanding.
  • Speech recognition technology mainly includes feature extraction technology, pattern matching, etc.
  • Current speech recognition is not accurate enough, which is a problem that needs to be solved.
  • the present disclosure provides a speech recognition model training method, a speech recognition method, a device, an electronic device, and a storage medium.
  • A training method for a speech recognition model, including: constructing a negative sample based on a positive sample to obtain a target negative sample for constraining the speech decoding path; obtaining training data based on the positive sample and the target negative sample; and training a first speech recognition model according to the training data to obtain a second speech recognition model.
  • A speech recognition method, including: in the case of decoding speech data to be recognized, constraining the speech decoding path corresponding to the speech data to be recognized according to a second speech recognition model, where the second speech recognition model is a model trained according to the speech recognition model training method provided by the embodiments of the present disclosure; and obtaining the speech recognition result according to the constraints of the speech decoding path, where the speech recognition result is a text object that matches the expected text.
  • A training device for a speech recognition model, including: a first processing module configured to construct a negative sample based on the positive sample to obtain a target negative sample for constraining the speech decoding path; a second processing module configured to obtain training data based on the positive sample and the target negative sample; and a training module configured to train a first speech recognition model based on the training data to obtain a second speech recognition model.
  • A speech recognition device, including: a third processing module configured to, in the case of decoding speech data to be recognized, constrain the speech decoding path corresponding to the speech data to be recognized according to a second speech recognition model, where the second speech recognition model is a model trained according to the speech recognition model training method provided by the embodiments of the present disclosure; and a fourth processing module configured to obtain the speech recognition result according to the constraints of the speech decoding path, where the speech recognition result is a text object that matches the expected text.
  • An electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute any method provided by the embodiments of the present disclosure.
  • a non-transitory computer-readable storage medium storing computer instructions.
  • the computer instructions are used to cause a computer to execute any method provided by the embodiments of the present disclosure.
  • A computer program product, which includes computer instructions.
  • When the computer instructions are executed by a processor, any one of the methods provided by the embodiments of the present disclosure is implemented.
  • Negative samples can be constructed based on positive samples to obtain target negative samples used to constrain the speech decoding path, and training data can be obtained based on the positive samples and target negative samples.
  • The first speech recognition model is trained according to the training data to obtain the second speech recognition model. Since the second speech recognition model is trained under the constraints of the speech decoding path, the accuracy of speech recognition is improved.
  • Figure 1 is a schematic diagram of a distributed cluster processing scenario according to an embodiment of the present disclosure.
  • Figure 2 is a schematic flowchart of a speech recognition model training method according to an embodiment of the present disclosure.
  • Figure 3 is a schematic diagram of speech recognition path expansion according to an embodiment of the present disclosure.
  • Figure 4 is a schematic diagram of speech recognition path constraints according to an embodiment of the present disclosure.
  • Figure 5 is a schematic diagram of prefix tree and sample generation according to an embodiment of the present disclosure.
  • Figure 6 is a schematic flowchart of a speech recognition method according to an embodiment of the present disclosure.
  • Figure 7 is a schematic diagram of the network structure of the first composition model according to an embodiment of the present disclosure.
  • Figure 8 is a schematic network structure diagram of a speech recognition model according to an embodiment of the present disclosure.
  • Figure 9 is a schematic diagram of a speech recognition framework according to an embodiment of the present disclosure.
  • Figure 10 is a schematic structural diagram of a speech recognition model training device according to an embodiment of the present disclosure.
  • Figure 11 is a schematic structural diagram of a speech recognition device according to an embodiment of the present disclosure.
  • FIG. 12 is a block diagram of an electronic device used to implement the training method/speech recognition method of the speech recognition model according to the embodiment of the present disclosure.
  • "A and/or B" can mean three situations: A exists alone, A and B exist simultaneously, or B exists alone.
  • "At least one" herein refers to any one of a plurality, or any combination of at least two of a plurality; for example, "including at least one of A, B, and C" may include any one or more elements selected from the set composed of A, B, and C.
  • "First" and "second" in this document distinguish multiple similar technical terms; they do not limit the order of those terms or imply that there are only two. For example, "first feature" and "second feature" refer to two kinds of features.
  • The first feature can be one or more, and the second feature can also be one or more.
  • NNLM: neural network language model
  • N-gram language model, which has word-graph constraints
  • All possible decoding paths are expanded and there is no mandatory constraint on the path, so many unexpected texts appear, causing great interference to recognition.
  • Although an NNLM trained on a specific corpus can obtain higher language scores in speech recognition for related application scenarios, it often also gives many unexpected recognition results (such as text outside the address book) high language scores.
  • The neural-network-based speech recognition composition method constrains the expansion paths of the NNLM decoding space by training a speech recognition model (such as a neural-network-based composition model), thereby suppressing the output of unexpected recognition results and effectively improving the accuracy of speech recognition.
  • FIG. 1 is a schematic diagram of a distributed cluster processing scenario.
  • the distributed cluster system is an example of a cluster system.
  • Figure 1 illustrates how the distributed cluster system can be used for speech recognition.
  • the present disclosure is not limited to speech recognition on a single machine or multiple machines. Distributed processing can further improve the accuracy of speech recognition.
  • The distributed cluster system 100 includes multiple nodes (such as server cluster 101, server 102, server cluster 103, server 104, and server 105).
  • The server 105 can also be connected to electronic devices such as mobile phones 1051 and desktop computers 1052; the multiple nodes, together with the connected electronic devices, can jointly perform one or more speech recognition tasks.
  • Multiple nodes in the distributed cluster system can use data parallelism to perform speech recognition, in which case they perform speech recognition training tasks based on the same training method; if the nodes adopt a model-parallel training method, they can perform speech recognition training tasks based on different training methods, so as to better train the above speech recognition model.
  • data exchange (such as data synchronization) can be performed between multiple nodes.
  • FIG. 2 is a schematic flowchart of a method for training a speech recognition model according to an embodiment of the present disclosure.
  • This method can be applied to a speech recognition device; for example, the device can be deployed on a terminal, a server, or other processing equipment in a single-machine, multi-machine, or cluster system to implement speech recognition and other processing.
  • the terminal can be a user equipment (UE, User Equipment), a mobile device, a personal digital assistant (PDA, Personal Digital Assistant), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc.
  • the method can also be implemented by the processor calling computer-readable instructions stored in the memory.
  • this method is applied to any node or electronic device (mobile phone or desktop computer, etc.) in the cluster system shown in Figure 1, including S201 to S203.
  • Samples other than the positive samples can be used as the target negative samples, and the positive samples and target negative samples can be used as training data. Because the positive samples and target negative samples have corresponding data labels, supervised learning based on these labels can be performed on the first speech recognition model, and the second speech recognition model is obtained after training.
  • The training data used for model training includes target negative samples that constrain the speech decoding path; that is, constraints are made in advance to avoid unexpected results, which suppresses unexpected speech recognition outputs during model training and model use (for example, in an address-book recognition speech scene, the output of text outside the address book is suppressed), thereby effectively improving the accuracy of speech recognition.
  • Constructing a negative sample based on the positive sample to obtain the target negative sample used to constrain the speech decoding path includes: determining the text characters in the matching library as positive samples, and determining samples other than the positive samples as target negative samples.
  • Positive samples are shown in Table 1.
  • Negative samples are shown in Table 2.
  • The training samples include several types of data, such as the original text, the multiple text characters that constitute the text, the identifiers (tokens) corresponding to those characters, and the labels corresponding to each of those characters.
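As a hypothetical illustration of this sample layout (the real token ids and label scheme are defined by Tables 1 and 2, which are not reproduced here; the per-character labels below are an assumption based on the positive/negative path labeling described later in the text):

```python
# Hypothetical training-sample layout: original text, its characters
# (prefixed with the <sos> start symbol), per-character token ids, and
# per-character labels. All ids here are invented for illustration.
positive_sample = {
    "text": "张三",                   # original text ("Zhang San", in the address book)
    "chars": ["<sos>", "张", "三"],
    "tokens": [0, 101, 102],
    "labels": [1, 1, 1],             # steps on the positive path are labeled 1
}

negative_sample = {
    "text": "张丹",                   # "Zhang Dan", not in the address book
    "chars": ["<sos>", "张", "丹"],
    "tokens": [0, 101, 103],
    "labels": [1, 1, 0],             # the step leaving the positive path is labeled 0
}
```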
  • the matching library can be a specified address book.
  • the user name "Zhang San” in the address book is used as a positive sample.
  • When voice decoding is performed, the correct speech recognition result should be the text corresponding to the user name "Zhang San".
  • Positive samples can be determined based on the matching library (such as a specified address book), and negative samples can then be constructed based on the positive samples; samples other than the positive samples serve as the target negative samples and form a constraint on the speech decoding path, thereby suppressing the output of unexpected speech recognition results (such as incorrect results like "Zhang Dan" or "Zhang Han").
  • The output of unexpected speech recognition results can be suppressed based on the constraints of the speech decoding path, thereby effectively improving speech recognition accuracy.
  • Using samples other than the positive samples as the target negative samples includes: obtaining a data structure in the form of a node tree based on the positive samples, where each node in the node tree is an identifier corresponding to a text character that constitutes a positive sample; traversing the positive paths formed by the positive samples in the node tree to obtain a first path set; and determining the paths in the node tree other than the first path set as a second path set (the second path set includes the target negative samples).
  • Without constraints, the expanded speech recognition paths can include the results Zhang San, Zhang Dan, or Zhang Han, but the text "Zhang Dan" or "Zhang Han" does not exist in the matching library (such as the specified address book), resulting in inaccurate speech recognition results.
  • The data structure in the form of a node tree (also called a prefix tree based on the positive paths) covers both positive and negative samples. When traversing it, the paths composed of the tokens corresponding to positive samples are called positive paths (denoted as the first path set), and the paths composed of the tokens corresponding to negative samples are called negative paths (denoted as the second path set); that is, the paths other than the first path set constitute the second path set.
  • The tokens in the first path set are shown as bold, underlined numbers in Figure 5; all other tokens belong to the second path set, so the full set of positive samples can be generated directly from the positive paths.
  • All expandable paths (shown as dotted lines in Figure 5) other than the positive paths are negative example paths, from which the target negative samples are finally obtained.
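A minimal sketch of this prefix-tree construction and negative-path enumeration, assuming a tiny made-up matching library and character vocabulary (the real tree in Figure 5 is over token ids; in practice an end-of-entry symbol would also mark where positive paths terminate):

```python
def build_prefix_tree(positive_texts):
    """Build a prefix tree: each node maps a character to its child node."""
    root = {}
    for text in positive_texts:
        node = root
        for ch in text:
            node = node.setdefault(ch, {})
    return root

def negative_paths(tree, vocab, prefix=()):
    """Yield every one-step expansion that leaves the positive prefix tree."""
    for ch in vocab:
        if ch in tree:
            # still on a positive path: keep expanding deeper
            yield from negative_paths(tree[ch], vocab, prefix + (ch,))
        else:
            # this expansion is not on any positive path -> negative example
            yield prefix + (ch,)

# Hypothetical address book and vocabulary:
tree = build_prefix_tree(["张三", "张伟"])
vocab = ["张", "三", "伟", "丹", "晗"]
negs = set(negative_paths(tree, vocab))
# ("张", "丹") and ("张", "晗") are negative paths; ("张", "三") is not.
```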
  • data dimensionality can be reduced through the screening strategy of effective negative samples.
  • An acoustic confusion matrix and language scores are used to screen effective negative samples; that is, negative example paths with lower acoustic or language scores are screened out and deleted, the remaining negative samples are used as the target negative samples, and the training data for model training is formed on this basis.
  • the negative example paths other than the positive example paths can be obtained, thereby obtaining the target negative example sample.
  • the negative samples in the negative path can also be further filtered to obtain more accurate negative samples with less data.
  • the target negative samples and positive samples obtained by screening constitute the training data for model training, which improves the accuracy of the model.
  • Training the first speech recognition model based on the training data to obtain the second speech recognition model includes: inputting the training data into the embedding layer of the first speech recognition model, and converting the training data into corresponding feature vectors through the embedding layer; associating the feature vectors with the historical vectors in the association layer of the first speech recognition model to obtain association features for speech recognition prediction; inputting the association features into the fully connected layer of the first speech recognition model and performing binary classification with the activation function; obtaining the loss function from the output value of the binary classification and the target value; and training the first speech recognition model by back-propagating the loss function to obtain the second speech recognition model (which can be a neural-network-based composition model).
  • The structure of the first speech recognition model may include: an embedding layer, an association layer, a fully connected layer, and an activation function following the fully connected layer.
  • the output of the activation function can be processed into two categories.
  • the embedding layer can be a word embedding layer (embedding layer);
  • The association layer can be applied to scenes with spatiotemporal correlation and has a time-loop structure, so it can describe well sequence data with spatiotemporal correlation (such as temperature, traffic volume, or sales volume), text (such as a notepad or address book), and events (shopping lists, personal behavior).
  • The association layer may be, but is not limited to, a Long Short-Term Memory network (LSTM); the activation function may be, but is not limited to, the softmax function.
  • In this way, the association features for speech recognition prediction can be obtained, so that binary classification can be performed better and more accurate speech recognition results can be predicted when the model is used.
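The forward pass described above (embedding, association layer carrying a historical vector, fully connected layer, softmax binary classification) can be sketched in plain numpy. This is only an illustration: a simple tanh recurrence stands in for the LSTM (the text itself allows any streaming network with historical memory), all sizes and weights are invented, and the loss/back-propagation step is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, hid_dim = 50, 8, 16

E = rng.normal(size=(vocab_size, emb_dim))   # embedding layer: token id -> vector
Wx = rng.normal(size=(emb_dim, hid_dim))     # association layer (stand-in for LSTM)
Wh = rng.normal(size=(hid_dim, hid_dim))
Wo = rng.normal(size=(hid_dim, 2))           # fully connected layer, 2 classes

def forward(token_ids):
    """Per-step binary classification: is the path so far a positive path?"""
    h = np.zeros(hid_dim)                    # historical vector carried across steps
    probs = []
    for t in token_ids:
        x = E[t]                             # embedding lookup
        h = np.tanh(x @ Wx + h @ Wh)         # associate feature with history
        logits = h @ Wo                      # fully connected layer
        p = np.exp(logits - logits.max())    # numerically stable softmax
        probs.append(p / p.sum())
    return np.array(probs)                   # shape (steps, 2)

p = forward([3, 17, 42])
```

In training, the cross-entropy between these per-step probabilities and the 0/1 path labels would give the loss that is back-propagated.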
  • FIG. 6 is a schematic flowchart of a speech recognition method according to an embodiment of the present disclosure.
  • the method can be applied to a speech recognition device.
  • The device can be deployed on a terminal, a server, or other processing equipment in a single-machine, multi-machine, or cluster system to realize speech recognition and other processing.
  • the terminal can be a user equipment (UE, User Equipment), a mobile device, a personal digital assistant (PDA, Personal Digital Assistant), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc.
  • the method can also be implemented by the processor calling computer-readable instructions stored in the memory.
  • this method is applied to any node or electronic device (mobile phone or desktop computer, etc.) in the cluster system shown in Figure 1, including S601 and S602.
  • the second speech recognition model is a model trained according to the embodiment.
  • The correct speech recognition result can be obtained according to the constraints of the speech decoding path. For example, suppose "Zhang San" is in the address book. Since the positive samples are obtained by matching against the address book, the negative samples are constructed from the positive samples, and the second speech recognition model is trained on both, the model satisfies the constraints of the speech decoding path; under these constraints, the output of speech recognition is text that meets expectations. For example, by matching the text in the address book, the unique speech recognition result "Zhang San" is obtained, rather than "Zhang Dan" or "Zhang Han".
  • obtaining the speech recognition result includes: according to the second speech recognition model, obtaining the corresponding language score under the condition that the speech data to be recognized satisfies the decoding path constraint; determining the target decoding path according to the language score; And, obtain the speech recognition result according to the target decoding path.
  • the speech recognition method further includes: obtaining the acoustic score corresponding to the speech data to be recognized according to the acoustic model.
  • Determining the target decoding path based on the language score may include: obtaining an evaluation value based on the language score and the acoustic score; obtaining the decoding space produced when the speech data to be recognized is decoded (the decoding space includes multiple decoding paths); and determining the decoding path with the highest evaluation value among the multiple decoding paths as the target decoding path.
  • the second speech recognition model is a composition model based on a neural network (NN).
  • The second speech recognition model can be combined with an existing language model, or replace the existing language model and combine with the acoustic model, so that the language score and the acoustic score are computed together.
  • the corresponding language score of the speech data to be recognized is obtained under the constraints of the decoding path.
  • The target decoding path can be determined based on the language score and the acoustic score; that is, in a decoding space that includes multiple decoding paths, the path with the highest total of language score and acoustic score is taken as the target decoding path, which greatly improves the accuracy of the output speech recognition result.
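The selection rule above reduces to an argmax over the candidate paths' combined scores. A tiny sketch with invented scores (in a real decoder the candidates come from the search over the decoding space):

```python
# Each candidate decoding path carries a language score and an acoustic score;
# the evaluation value is their sum, and the highest total wins.
candidates = [
    {"text": "张三", "lang": -1.2, "acoustic": -0.8},   # total -2.0
    {"text": "张丹", "lang": -1.3, "acoustic": -0.9},   # total -2.2
    {"text": "张晗", "lang": -1.5, "acoustic": -1.1},   # total -2.6
]

def target_path(paths):
    """Return the decoding path with the highest evaluation value."""
    return max(paths, key=lambda p: p["lang"] + p["acoustic"])

best = target_path(candidates)   # the "张三" path, total score -2.0
```

Under the path constraints described later, suppressed paths receive a large negative language score, so they can never win this argmax.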
  • The second speech recognition model shown in Figure 7 is combined with the language model in the speech recognition framework shown in Figure 9, or directly replaces that language model, to decode the speech data to be recognized and obtain the speech recognition result.
  • the second speech recognition model can be a NN composition model.
  • A combined language model (i.e., a constrained language model, or a composition-based language model; shown in Figure 8 as the "NNLM with composition") can be obtained. Not only can it perform speech recognition accurately, but the NNLM-with-composition model obtained by combining the NN composition model with the language model also takes up less storage space and is more flexible in scenarios such as offline recognition.
  • the speech recognition framework may also include: a decoder and an acoustic model.
  • The decoder performs a path search in the decoding space and, under the constraints of the speech decoding path, converts the input speech data to be recognized (i.e., the audio signal) into the correct speech recognition result (such as the text content corresponding to the speech, matching text in the specified address book).
  • The acoustic model and the language model are treated as two independent parts and can be optimized separately. The language model is better suited to optimization for different business scenarios; for example, a text corpus from a certain field can be used to train a corresponding language model to enhance recognition performance for that scenario.
  • The multiple decoding paths are decoded under the constraints of the speech decoding path, and the decoding path with the highest evaluation value is used as the target decoding path, thus improving the accuracy of speech recognition.
  • As a language model, the NNLM often has better modeling capability for text and is more suitable for parallel computing.
  • the path expansion during NNLM decoding is shown in Figure 3.
  • The decoder expands every possible path during decoding and finally selects the path with the highest total of acoustic and language scores as the target decoding path; the speech recognition results obtained this way are not unique and are inaccurate.
  • While training raises the language score of the intended text, similarity between texts, imbalance in the training data, and insufficient model complexity often also raise the language scores of other, unwanted texts.
  • the correct speech recognition result is Zhang San, but Zhang Dan and Zhang Han are also recognized.
  • the decoder expands all possible decoding paths during decoding, and the lack of constraints on the paths leads to the output of unwanted speech recognition results. Therefore, it is not enough to simply train a language model to improve the language score in the corresponding field. Other methods need to be used to limit the path. This application example is different from the above-mentioned direct training of the NNLM model.
  • The speech recognition results can be constrained to the text in the matching library (such as a specified address book) or to an industry field (such as the communication field).
  • a mandatory constraint is provided for the speech recognition decoding path.
  • When the decoding path is expanded in a streaming manner, unintended paths are suppressed and the decoding path is limited to a feasible set, thereby obtaining the expected recognition results and greatly improving the recognition rate.
  • the path constraint is shown below.
  • the neural network-based composition model judges the expanded path, and determines whether the expanded path is a valid expected path through a given threshold to achieve decoding path constraints.
  • the solution mainly includes the following three parts: training sample generation, model training and model use.
  • Training samples can be divided into two types: positive samples and negative samples.
  • The positive samples are the set of feasible paths, and the negative samples are the set of paths that need to be suppressed, that is, the set of all paths except the positive ones.
  • Each sample starts with the start symbol <sos>.
  • the token identifier corresponding to the decoding path can be used as the input for model training.
  • The label corresponding to a positive example path can be set to 1, and the label corresponding to a negative example path can be set to 0.
  • All negative samples on non-positive paths can be generated by traversing each layer of the prefix tree.
  • Generating composition samples under large amounts of data - effective negative example selection strategy:
  • The above full composition-sample generation strategy is better suited to feasible path sets with relatively few samples. If a given path set contains a large number of positive samples, say several million, it becomes difficult to traverse all negative samples, causing storage explosion and excessive computation.
  • Use the language score for further filtering: use a pre-trained language model to calculate the language scores of the positive example and candidate negative example paths respectively. Negative examples for which "negative example language score - positive example language score < threshold" are filtered out.
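A sketch of this filter, with an invented score table standing in for the pre-trained language model (the threshold value is likewise made up; negatives the language model already rates as far less likely than the positive are dropped):

```python
def filter_negatives(neg_paths, pos_score, lm_score, threshold=-5.0):
    """Keep negatives whose LM-score gap to the positive is not below threshold."""
    return [p for p in neg_paths if lm_score(p) - pos_score >= threshold]

# Hypothetical language scores (log-probabilities) from a pre-trained LM:
scores = {"张丹": -3.0, "张晗": -9.0}
kept = filter_negatives(["张丹", "张晗"], pos_score=-2.0,
                        lm_score=scores.__getitem__)
# "张晗": -9.0 - (-2.0) = -7.0 < -5.0, so it is filtered out; "张丹" is kept.
```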
  • This second speech recognition model can be called an NN composition model, and is referred to below simply as the NN composition model.
  • Its network structure is shown in Figure 7. The token identifiers of the input training samples first pass through the embedding layer to obtain the corresponding embedding representations; several LSTM layers then produce an abstract vector representation with historical memory; finally, binary classification is performed through the fully connected layer and the softmax function to predict the sample labels for training. The LSTM layers can also be replaced by another RNN (recurrent neural network), or any streaming neural network with a historical-memory function.
  • The weights can be shared with the underlying neural network of a language model with the same structure, or the NN composition model can be trained by extending several layers of the language model with fixed weights, which helps reduce model size and computation.
  • After training the NN composition model, combine it with the language model as shown in Figure 9 to obtain the NNLM with composition (shown in Figure 8), which can replace the original NNLM for decoding and thereby achieve the decoding path constraints. Specifically, the combination is performed by a merge operation, including the following i to ii.
  • i. Set a threshold, which can be obtained by counting the accuracy over positive and negative samples. If the composition score is greater than the threshold, the path is judged a positive sample; otherwise it is judged a negative sample.
  • ii. If the path is judged a positive sample, the language score of the decoding path remains unchanged (+0 points); if it is judged a negative sample, a large negative score is added to the language score of the corresponding decoding path (for example, -10000 points), thereby suppressing that decoding path. In this way, there is no need to change the decoder or the acoustic parts: one only needs to train an NN composition model on a given set and combine it with the existing language model (or directly replace the language model) to impose mandatory constraints on the decoding path, which greatly improves the accuracy of speech recognition results.
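The merge rule in i and ii can be sketched in a few lines. The threshold and composition scores below are invented stand-ins (the composition score would come from the trained NN composition model); the +0 / -10000 values follow the text:

```python
def constrained_lang_score(lang_score, comp_score, threshold, penalty=-10000.0):
    """Merge rule: leave positive paths unchanged, heavily penalize negatives."""
    if comp_score > threshold:
        return lang_score            # judged positive: +0 points
    return lang_score + penalty     # judged negative: add large negative score

# Hypothetical composition scores for two expanded paths:
s_pos = constrained_lang_score(-1.2, comp_score=0.9, threshold=0.5)  # unchanged
s_neg = constrained_lang_score(-1.3, comp_score=0.1, threshold=0.5)  # suppressed
```

Because the penalized score can never win the argmax over total path scores, the suppressed path is effectively removed from the feasible set without touching the decoder or the acoustic model.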
  • This application example provides a universal mandatory constraint method for the decoding path of an NNLM. It makes up for the original NNLM's lack of path constraints during decoding, avoids unexpected results during speech recognition, and restricts the decoding path to a preset feasible set, thereby greatly improving recognition performance. It supports not only positive-example sets with small amounts of data but also, through an effective negative-sample screening strategy, composition over large amounts of data, which greatly broadens the model's application scenarios.
  • The model adopts the NN composition model structure and shares its underlying neural network, via weight sharing, with an NN language model of similar structure, effectively saving storage space and computation. When using the model, there is no need to modify the decoder, the acoustic model, or other components: training an NN composition model on a given set and combining it with the existing language model suffices to enforce decoding path constraints, which greatly improves the convenience and practicality of using the model.
  • According to an embodiment of the present disclosure, a training device for a speech recognition model is provided.
  • Figure 10 is a schematic structural diagram of a speech recognition model training device according to an embodiment of the present disclosure.
  • As shown in Figure 10, the speech recognition model training device includes: a first processing module 1001, configured to construct negative samples from positive samples to obtain target negative samples used to constrain speech decoding paths; a second processing module 1002, configured to obtain training data from the positive samples and the target negative samples; and a training module 1003, configured to train a first speech recognition model with the training data to obtain a second speech recognition model.
  • In one embodiment, the first processing module 1001 is configured to determine text characters in a matching library as the positive samples, and to determine samples other than the positive samples as the target negative samples.
  • In one embodiment, the first processing module 1001 is configured to: obtain a data structure in the form of a node tree from the positive samples, where each node in the node tree is an identifier corresponding to a text character constituting a positive sample; traverse the positive paths formed by the positive samples in the node tree to obtain a first path set; and determine the paths in the node tree other than the first path set as a second path set, the second path set including the target negative samples.
  • In one embodiment, the training module 1003 is configured to: input the training data into the embedding layer of the first speech recognition model, which converts the training data into corresponding feature vectors; associate, in the association layer of the first speech recognition model, the feature vectors with historical vectors to obtain associated features for speech recognition prediction; feed the associated features into the fully connected layer of the first speech recognition model and then perform binary classification with an activation function; obtain a loss function from the output value of the binary classification and the target value; and train the first speech recognition model by backpropagating the loss function to obtain the second speech recognition model.
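The loss used in these training steps can be illustrated with the standard cross-entropy between the two-class softmax output and the 0/1 label — a minimal sketch under the assumption that the binary classification produces `(p0, p1)` probability pairs; the helper name `binary_ce_loss` is hypothetical:

```python
import math

def binary_ce_loss(probs, labels):
    """Mean cross-entropy between two-class softmax outputs (p0, p1) and 0/1
    labels; backpropagating this loss is what trains the first model."""
    total = 0.0
    for (p0, p1), y in zip(probs, labels):
        total -= math.log(p1 if y == 1 else p0)
    return total / len(labels)
```

A perfectly confident, correct prediction contributes zero loss; uncertain predictions contribute positive loss that gradient descent then reduces.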
  • In one embodiment, the second speech recognition model is a composition model based on a neural network.
  • FIG. 11 is a schematic structural diagram of a speech recognition device according to an embodiment of the present disclosure.
  • As shown in Figure 11, the speech recognition device includes: a third processing module 1101, configured to, when speech data to be recognized is decoded, constrain the speech decoding path corresponding to the speech data to be recognized according to a second speech recognition model, the second speech recognition model being a model trained according to the embodiments; and a fourth processing module 1102, configured to obtain a speech recognition result in response to the constraint on the speech decoding path, where the speech recognition result is a text object that matches the expected text.
  • In one embodiment, the fourth processing module 1102 is configured to: obtain, according to the second speech recognition model, the language score of the speech data to be recognized under the decoding path constraint; determine a target decoding path according to the language score; and obtain the speech recognition result according to the target decoding path.
  • In one embodiment, the device further includes a recognition module configured to obtain, according to an acoustic model, the acoustic score corresponding to the speech data to be recognized.
  • In one embodiment, the fourth processing module 1102 is configured to: obtain an evaluation value from the language score and the acoustic score; obtain the decoding space produced when the speech data to be recognized is decoded, the decoding space including multiple decoding paths; and determine the decoding path with the highest evaluation value among the multiple decoding paths as the target decoding path.
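The evaluation-value selection described above can be sketched as a one-liner: the evaluation value is the sum of the language and acoustic scores, and the path with the highest value wins (the helper name and sample scores are illustrative assumptions):

```python
def pick_target_path(candidates):
    """candidates: iterable of (path_tokens, language_score, acoustic_score).
    The evaluation value is language + acoustic; the decoding path with the
    highest evaluation value is the target decoding path."""
    return max(candidates, key=lambda c: c[1] + c[2])[0]

decoding_space = [
    (("zhang", "san"), -1.0, -2.0),  # total -3.0  (best)
    (("zhang", "dan"), -5.0, -2.5),  # total -7.5
    (("zhang", "han"), -6.0, -1.0),  # total -7.0
]
best = pick_target_path(decoding_space)
```

With the composition model's -10000 penalty folded into the language score, suppressed paths can never win this maximisation.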
  • the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • FIG. 12 illustrates a schematic block diagram of an example electronic device 1200 that may be used to implement embodiments of the present disclosure.
  • Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • As shown in Figure 12, the electronic device 1200 includes a computing unit 1201, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1202 or loaded from a storage unit 1208 into a random access memory (RAM) 1203.
  • The RAM 1203 can also store various programs and data required for the operation of the electronic device 1200.
  • Computing unit 1201, ROM 1202 and RAM 1203 are connected to each other via bus 1204.
  • An input/output (I/O) interface 1205 is also connected to bus 1204.
  • Multiple components are connected to the I/O interface 1205, including: an input unit 1206, such as a keyboard or a mouse; an output unit 1207, such as various types of displays and speakers; a storage unit 1208, such as a magnetic disk or an optical disc; and a communication unit 1209, such as a network card, a modem, or a wireless communication transceiver.
  • the communication unit 1209 allows the electronic device 1200 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunications networks.
  • The computing unit 1201 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc.
  • the computing unit 1201 performs various methods and processes described above, such as the training/speech recognition method of the speech recognition model.
  • the speech recognition model training/speech recognition method may be implemented as a computer software program that is tangibly embodied in a machine-readable medium, such as storage unit 1208.
  • part or all of the computer program may be loaded and/or installed onto electronic device 1200 via ROM 1202 and/or communication unit 1209.
  • When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the speech recognition model training/speech recognition method described above may be performed.
  • the computing unit 1201 may be configured to perform the training of the speech recognition model/speech recognition method in any other suitable manner (eg, by means of firmware).
  • Various implementations of the systems and techniques described above may be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • These various implementations may include implementation in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices or devices, or any suitable combination of the foregoing.
  • More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer.
  • Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual, auditory, or tactile feedback), and input from the user may be received in any form (including acoustic, speech, or tactile input).
  • The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or middleware components (e.g., an application server), or front-end components (e.g., a user computer with a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or any combination of such back-end, middleware, or front-end components.
  • the components of the system may be interconnected by any form or medium of digital data communication (eg, a communications network). Examples of communication networks include: local area network (LAN), wide area network (WAN), and the Internet.
  • Computer systems may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact over a communications network.
  • the relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.
  • the server can be a cloud server, a distributed system server, or a server combined with a blockchain.

Abstract

A training method for a speech recognition model, an apparatus, an electronic device (1200), and a storage medium, relating to the field of artificial intelligence technology, and in particular to deep learning, speech recognition, and related fields. The method comprises: S201, constructing negative samples from positive samples to obtain target negative samples used to constrain speech decoding paths; S202, obtaining training data from the positive samples and the target negative samples; and S203, training a first speech recognition model with the training data to obtain a second speech recognition model.

Description

Training method and apparatus for a speech recognition model, electronic device, and storage medium — Technical Field
The present disclosure relates to the field of artificial intelligence technology, and in particular to deep learning, speech recognition, and related fields.
Background Art
Speech recognition technology enables machines to convert speech signals into corresponding text or commands through recognition and understanding. It mainly involves feature extraction, pattern matching, and related techniques. Current speech recognition is not accurate enough, which is the problem to be solved.
Summary
The present disclosure provides a training method for a speech recognition model, a speech recognition method, apparatuses, an electronic device, and a storage medium.
According to one aspect of the present disclosure, a training method for a speech recognition model is provided, comprising: constructing negative samples from positive samples to obtain target negative samples used to constrain speech decoding paths; obtaining training data from the positive samples and the target negative samples; and training a first speech recognition model with the training data to obtain a second speech recognition model.
According to another aspect of the present disclosure, a speech recognition method is provided, comprising: when speech data to be recognized is decoded, constraining the speech decoding path corresponding to the speech data to be recognized according to a second speech recognition model, the second speech recognition model being a model trained by the training method provided in embodiments of the present disclosure; and obtaining a speech recognition result according to the constraint on the speech decoding path, where the speech recognition result is a text object that matches the expected text.
According to another aspect of the present disclosure, a training apparatus for a speech recognition model is provided, comprising: a first processing module configured to construct negative samples from positive samples to obtain target negative samples used to constrain speech decoding paths; a second processing module configured to obtain training data from the positive samples and the target negative samples; and a training module configured to train a first speech recognition model with the training data to obtain a second speech recognition model.
According to another aspect of the present disclosure, a speech recognition apparatus is provided, comprising: a third processing module configured to, when speech data to be recognized is decoded, constrain the speech decoding path corresponding to the speech data to be recognized according to a second speech recognition model, the second speech recognition model being a model trained by the training method provided in embodiments of the present disclosure; and a fourth processing module configured to obtain a speech recognition result according to the constraint on the speech decoding path, where the speech recognition result is a text object that matches the expected text.
According to another aspect of the present disclosure, an electronic device is provided, comprising: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform any one of the methods provided in embodiments of the present disclosure.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, the computer instructions being used to cause a computer to perform any one of the methods provided in embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising computer instructions that, when executed by a processor, implement any one of the methods provided in embodiments of the present disclosure.
With the present disclosure, negative samples can be constructed from positive samples to obtain target negative samples used to constrain speech decoding paths, and training data can be obtained from the positive samples and the target negative samples. Training a first speech recognition model with the training data yields a second speech recognition model; because the second speech recognition model is trained under speech decoding path constraints, the accuracy of speech recognition is improved.
It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor to limit the scope of the present disclosure. Other features of the present disclosure will become easy to understand from the following description.
Brief Description of the Drawings
The drawings are provided for better understanding of the solution and do not constitute a limitation of the present disclosure.
Figure 1 is a schematic diagram of a distributed cluster processing scenario according to an embodiment of the present disclosure.
Figure 2 is a schematic flowchart of a training method for a speech recognition model according to an embodiment of the present disclosure.
Figure 3 is a schematic diagram of speech recognition path expansion according to an embodiment of the present disclosure.
Figure 4 is a schematic diagram of speech recognition path constraints according to an embodiment of the present disclosure.
Figure 5 is a schematic diagram of a prefix tree and sample generation according to an embodiment of the present disclosure.
Figure 6 is a schematic flowchart of a speech recognition method according to an embodiment of the present disclosure.
Figure 7 is a schematic diagram of the network structure of a first composition model according to an embodiment of the present disclosure.
Figure 8 is a schematic diagram of the network structure of a speech recognition model according to an embodiment of the present disclosure.
Figure 9 is a schematic diagram of a speech recognition framework according to an embodiment of the present disclosure.
Figure 10 is a schematic structural diagram of a training apparatus for a speech recognition model according to an embodiment of the present disclosure.
Figure 11 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present disclosure.
Figure 12 is a block diagram of an electronic device used to implement the training method for a speech recognition model/speech recognition method of embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the drawings, including various details of the embodiments to aid understanding; they should be considered merely exemplary. Accordingly, those of ordinary skill in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.
As used herein, the term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The term "at least one" means any one of multiple items or any combination of at least two of multiple items; for example, "including at least one of A, B, and C" may mean including any one or more elements selected from the set consisting of A, B, and C. The terms "first" and "second" refer to and distinguish multiple similar technical terms; they do not imply an order or limit the count to two. For example, "first feature" and "second feature" refer to two classes of features, where there may be one or more first features and one or more second features.
In addition, numerous specific details are given in the following detailed description to better explain the present disclosure. Those skilled in the art should understand that the present disclosure can be practiced without certain specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.
Speech recognition technology can convert speech signals into text output. With the continued development of deep learning in acoustic models and language models, speech recognition has made great progress. On the acoustic side, modeling has evolved from the Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) approach to the Streaming Multi-Layer Truncated Attention (SMLTA) model based on Connectionist Temporal Classification (CTC) spike information; the SMLTA model can provide online attention-based speech recognition services, improving recognition performance. On the language side, compared with algorithms based on statistical language models (such as N-GRAM language models), the Neural Network Language Model (NNLM), being based on deep learning, usually models text better, is better suited to parallel computation, and has a smaller stored model size, making it especially suitable for offline speech scenarios such as contact-list recognition. NNLM therefore has advantages in recognition performance, computation speed, and application scenarios.
Unlike N-GRAM language models, which have word-graph constraints, decoding directly with an NNLM expands all possible decoding paths without enforcing any path constraints, so many unexpected texts appear and cause serious recognition interference. Although an NNLM trained on a specific corpus can obtain high language scores in the speech recognition of related application scenarios, it often also gives high language scores to many unexpected recognition results (for example, text outside the contact list).
In summary, because the decoding paths have no mandatory constraints, unexpected recognition results appear and speech recognition is not accurate enough.
According to embodiments of the present disclosure, a neural-network-based speech recognition composition method trains a speech recognition model (such as a neural-network-based composition model) to constrain the expansion paths of the NNLM decoding space, thereby suppressing the output of unexpected recognition results and effectively improving the accuracy of speech recognition.
According to an embodiment of the present disclosure, Figure 1 is a schematic diagram of a distributed cluster processing scenario. The distributed cluster system is one example of a cluster system; Figure 1 exemplarily shows that speech recognition can be performed with this distributed cluster system. The present disclosure is not limited to speech recognition on a single machine or multiple machines; distributed processing can further improve the accuracy of speech recognition. As shown in Figure 1, the distributed cluster system 100 includes multiple nodes (such as server cluster 101, server 102, server cluster 103, server 104, and server 105; server 105 may also be connected to electronic devices such as mobile phone 1051 and desktop computer 1052). The multiple nodes, and the nodes together with the connected electronic devices, can jointly perform one or more speech recognition tasks. Optionally, the multiple nodes in the distributed cluster system may perform speech recognition in a data-parallel manner, in which case they can perform speech recognition training tasks based on the same training method; if the multiple nodes adopt a model-parallel training approach, they can perform the training tasks based on different training methods, so as to better train the speech recognition model. Optionally, after each round of model training is completed, data exchange (such as data synchronization) can be performed among the nodes.
According to an embodiment of the present disclosure, a training method for a speech recognition model is provided. Figure 2 is a schematic flowchart of the training method according to an embodiment of the present disclosure. The method can be applied to a speech recognition apparatus; for example, when the apparatus is deployed on and executed by a terminal, a server, or other processing device in a single-machine, multi-machine, or cluster system, it can implement speech recognition and other processing. The terminal may be user equipment (UE), a mobile device, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc. In some possible implementations, the method may also be implemented by a processor invoking computer-readable instructions stored in a memory. As shown in Figure 2, the method is applied to any node or electronic device (mobile phone, desktop computer, etc.) in the cluster system shown in Figure 1, and includes S201 to S203.
S201: Construct negative samples from positive samples to obtain target negative samples used to constrain speech decoding paths.
S202: Obtain training data from the positive samples and the target negative samples.
S203: Train a first speech recognition model with the training data to obtain a second speech recognition model.
In one example of S201-S203, samples other than the positive samples can be taken as the target negative samples, and the positive samples together with the target negative samples serve as the training data. Because the positive samples and the target negative samples carry data labels, the first speech recognition model can undergo supervised learning based on these labels, and the second speech recognition model is obtained when training finishes. The training data used for model training includes target negative samples that constrain the speech decoding paths; that is, constraints against unexpected results are imposed in advance, so that during both model training and model use the output of unexpected speech recognition results can be suppressed (e.g., text outside the contact list is suppressed in a contact-list recognition scenario), thereby effectively improving the accuracy of speech recognition.
With the present disclosure, negative samples can be constructed from positive samples to obtain target negative samples used to constrain speech decoding paths, and training data can be obtained from the positive samples and the target negative samples. Training a first speech recognition model with the training data yields a second speech recognition model; because the second speech recognition model is trained under speech decoding path constraints, the accuracy of speech recognition is improved.
In one embodiment, constructing negative samples from positive samples to obtain target negative samples used to constrain speech decoding paths includes: determining text characters in a matching library as the positive samples, and determining samples other than the positive samples as the target negative samples.
In some examples, positive samples (as shown in Table 1) and negative samples (as shown in Table 2) can constitute the training samples. A training sample includes several kinds of data, such as the original text, the text characters constituting the text, the identifiers (tokens) corresponding to the text characters, and the labels corresponding to the text characters.
Figure PCTCN2022116552-appb-000001
Table 1
Text                start symbol
                    <SOS>
Identifier (token)  3617    23    66
Label               1       1     0
Table 2
In some examples, the matching library may be a specified contact list. For example, the user name 张三 (Zhang San) in the contact list is taken as a positive sample; when a voice call involves the user name 张三 and speech decoding is performed, the correct speech recognition result should be the text corresponding to 张三. Accordingly, when designing the training data, positive samples can be determined based on the matching library (e.g., the specified contact list), and negative samples are then constructed from the positive samples: all samples other than the positive samples can be taken as the target negative samples. This forms the constraint on the speech decoding paths and suppresses the output of unexpected speech recognition results (such as the incorrect results 张丹 (Zhang Dan) or 张涵 (Zhang Han)).
With this embodiment, because the constraint on the speech decoding paths can be applied while constructing negative samples from positive samples, the output of unexpected (i.e., incorrect) speech recognition results is suppressed, effectively improving the accuracy of speech recognition.
In one embodiment, taking samples other than the positive samples as the target negative samples includes: obtaining a data structure in the form of a node tree from the positive samples, where each node in the node tree is an identifier corresponding to a text character constituting a positive sample; traversing the positive paths formed by the positive samples in the node tree to obtain a first path set; and determining the paths in the node tree other than the first path set as a second path set (the second path set includes the target negative samples).
In some examples, as shown in Figure 3, unconstrained speech recognition path expansion can produce recognition results such as 张三, 张丹, or 张涵, while the matching library (e.g., the specified contact list) contains no text for 张丹 or 张涵, making the recognition results inaccurate.
In some examples, as shown in Figure 4, with path constraints applied to the candidate results 张三, 张丹, and 张涵, only the expected result 张三 is obtained, and the recognition result is highly accurate. Here, 三 is marked 1, the data label of a positive sample; 丹 and 涵 are marked 0, the data label of negative samples.
In some examples, as shown in Figure 5, the data structure in the form of a node tree (also called a prefix tree built from the positive paths) includes positive samples and negative samples. Traversing this data structure, the paths composed of tokens corresponding to positive samples are called positive paths (denoted the first path set), and the paths composed of tokens corresponding to negative samples are called negative paths (denoted the second path set); that is, the paths other than the first path set form the second path set. See Table 1 and Table 2 above for examples of tokens. The tokens of the first path set are shown as the bold underlined numbers in Figure 5; all other tokens belong to the second path set. Thus, the full set of positive samples can be generated directly from the positive paths, and in the node-tree data structure of the positive samples, all expandable paths other than the positive paths (shown as dashed lines in Figure 5) are negative paths, finally yielding the target negative samples.
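The prefix-tree sample generation just described can be sketched as follows. This is a minimal illustration: the <sos>/<eos> identifiers follow the Figure 5 example, the tiny vocabulary is a toy assumption, and `build_samples` is a hypothetical helper, not code from the disclosure. Negatives are only one token longer than their positive prefix, because a negative verdict stops streaming expansion.

```python
SOS, EOS = 3617, 3618  # identifiers used in the Figure 5 example

def build_samples(positive_paths, vocab):
    """Generate (token_sequence, label) pairs from the prefix tree of the
    positive paths: every positive prefix gets label 1, and every one-token
    extension that falls off the tree gets label 0."""
    prefixes = set()
    for path in positive_paths:
        seq = (SOS,) + tuple(path)
        for i in range(1, len(seq) + 1):
            prefixes.add(seq[:i])
    positives, negatives = [], set()
    for path in positive_paths:
        seq = (SOS,) + tuple(path)
        for i in range(2, len(seq) + 1):
            positives.append((seq[:i], 1))          # positive path prefix
            parent = seq[:i - 1]
            for tok in vocab:                       # expandable extensions
                if parent + (tok,) not in prefixes:
                    negatives.add((parent + (tok,), 0))
    return positives, sorted(negatives)

# Positive path (3617 23 52 3618) with a toy 4-token vocabulary.
pos, neg = build_samples([(23, 52, EOS)], vocab=[23, 52, 66, EOS])
```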
In some examples, to improve speed and accuracy, data dimensionality can be reduced through a screening strategy for effective negative samples. For example, an acoustic confusion matrix and language scores are used to screen the negative samples: negative paths with low acoustic or language scores are screened out and deleted, the remaining negative samples are taken as the target negative samples, and the training data for model training is constructed accordingly.
With this embodiment, by traversing the node-tree data structure, the negative paths other than the positive paths constituting the positive samples can be obtained, and thus the target negative samples. The negative samples on the negative paths can further be screened to obtain more accurate negative samples with less data. The target negative samples obtained by screening, together with the positive samples, constitute the training data for model training, improving model accuracy.
In one embodiment, training the first speech recognition model with the training data to obtain the second speech recognition model includes: inputting the training data into the embedding layer of the first speech recognition model, which converts the training data into corresponding feature vectors; associating, in the association layer of the first speech recognition model, the feature vectors with historical vectors to obtain associated features for speech recognition prediction; feeding the associated features into the fully connected layer of the first speech recognition model and then performing binary classification with an activation function; obtaining a loss function from the output value of the binary classification and the target value; and training the first speech recognition model by backpropagating the loss function to obtain the second speech recognition model (which may be a neural-network-based composition model).
In some examples, the structure of the first speech recognition model may include an embedding layer, an association layer, a fully connected layer, and an activation function connected to the fully connected layer; the output of the activation function undergoes binary classification. The embedding layer may be a word embedding layer. The association layer can be applied to scenarios with spatio-temporal correlation and has a temporal recurrent structure, so it can well characterize sequence data with spatio-temporal correlation (such as temperature, traffic flow, or sales), text (such as notes or contact lists), and events (shopping lists, personal behavior); the association layer is not limited to a Long Short-Term Memory network (LSTM), and the activation function is not limited to the softmax function. The second speech recognition model is obtained by training with binary classification based on this first speech recognition model.
With this embodiment, based on the structure of the first speech recognition model described above, converting the training data into corresponding feature vectors and associating the feature vectors with historical vectors yields associated features for speech recognition prediction, so binary classification can be performed better and more accurate speech recognition results can be predicted when the model is used.
According to an embodiment of the present disclosure, a speech recognition method is provided. Figure 6 is a schematic flowchart of the speech recognition method according to an embodiment of the present disclosure. The method can be applied to a speech recognition apparatus; for example, when the apparatus is deployed on and executed by a terminal, a server, or other processing device in a single-machine, multi-machine, or cluster system, it can implement speech recognition and other processing. The terminal may be user equipment (UE), a mobile device, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc. In some possible implementations, the method may also be implemented by a processor invoking computer-readable instructions stored in a memory. As shown in Figure 6, the method is applied to any node or electronic device (mobile phone, desktop computer, etc.) in the cluster system shown in Figure 1, and includes S601 and S602.
S601: When speech data to be recognized is decoded, constrain the speech decoding path corresponding to the speech data according to a second speech recognition model, the second speech recognition model being a model trained according to the embodiments.
S602: Obtain a speech recognition result in response to the constraint on the speech decoding path.
In one example of S601-S602, during use of the second speech recognition model, the correct speech recognition result can be obtained under the constraint on the speech decoding path. For example, suppose the contact list contains 张三: because the positive samples were obtained by matching against the contact list, and the negative samples were obtained from those positive samples, the second speech recognition model, trained on the positive and negative samples, satisfies the speech decoding path constraint. Under this constraint, the speech recognition output is text that meets expectations; for example, it matches the text in the contact list and yields the unique recognition result 张三, rather than 张丹 or 张涵.
With embodiments of the present disclosure, when speech data to be recognized is decoded, speech recognition is performed with the speech decoding path constrained according to the second speech recognition model, so, in response to the constraint on the decoding path, a more accurate speech recognition result can be obtained, improving the accuracy of speech recognition.
In one embodiment, obtaining the speech recognition result in response to the constraint on the speech decoding path includes: obtaining, according to the second speech recognition model, the language score of the speech data to be recognized under the decoding path constraint; determining a target decoding path according to the language score; and obtaining the speech recognition result according to the target decoding path.
In some examples, the speech recognition method further includes: obtaining, according to an acoustic model, the acoustic score corresponding to the speech data to be recognized.
In some examples, determining the target decoding path according to the language score may specifically include: obtaining an evaluation value from the language score and the acoustic score; obtaining the decoding space produced when the speech data to be recognized is decoded (the decoding space includes multiple decoding paths); and determining the decoding path with the highest evaluation value among the multiple decoding paths as the target decoding path.
In some examples, the second speech recognition model is a neural network (NN)-based composition model; it can be combined with an existing language model, or replace that language model, so that the language score and the acoustic score are computed together with the acoustic model.
With this embodiment, the language score of the speech data to be recognized under the decoding path constraint is obtained; further, the target decoding path can be determined from the language score and the acoustic score, i.e., in a decoding space containing multiple decoding paths, the path with the highest total of language and acoustic scores is taken as the target decoding path, so the accuracy of the output speech recognition result is greatly improved.
In one application example, the second speech recognition model shown in Figure 7 is combined with the language model in the speech recognition framework shown in Figure 9, or directly replaces that language model, to decode the speech data to be recognized and obtain the speech recognition result. The second speech recognition model may be an NN composition model; combining the NN composition model with the language model yields the combined language model shown in Figure 8 (i.e., a constrained language model, also called an "NNLM with composition"). It not only performs speech recognition accurately, but the NNLM with composition obtained from the combination also occupies little storage space, allowing more flexible application in scenarios such as offline recognition.
In addition to the language model, the speech recognition framework may include a decoder and an acoustic model. Combining the acoustic and language scores, the decoder searches for paths in the decoding space and converts the input speech data to be recognized (i.e., the audio signal), under the constraint on the speech decoding path, into the correct speech recognition result (e.g., the text corresponding to the speech, matching the text in the specified contact list). Treating the acoustic model and the language model as two independent parts allows them to be optimized separately; the language model is better suited to optimization for different business scenarios, for example by training a language model on text corpora from a certain domain to enhance recognition in that scenario. During decoding with the decoder, the decoding space produced when the speech data is decoded (which includes multiple decoding paths) is obtained, and under the speech decoding path constraint, the decoding path with the highest evaluation value among them is taken as the target decoding path, thereby improving the accuracy of speech recognition.
Taking the NNLM as an example for analysis: as a language model, the NNLM usually models text better and is better suited to parallel computation. However, when a specific corpus is collected to train the NNLM, there are no constraints; the path expansion during NNLM decoding is as shown in Figure 3. The decoder expands every possible path during decoding and finally selects the path with the highest total of acoustic and language scores as the target decoding path, so the speech recognition result is neither unique nor accurate. While raising the language score of the relevant text, the language scores of other unwanted texts are often raised as well, due to similarity between texts, imbalanced training data, insufficient model complexity, and other reasons; for example, the correct result is 张三, but 张丹 and 张涵 are recognized too.
It can be seen that without constraints, the decoder expands all possible decoding paths during decoding; the lack of path constraints causes unwanted speech recognition results to be output. Therefore, merely training a language model to raise the language scores of the relevant domain is not enough; paths must also be restricted by other means. This application example, unlike directly training an NNLM as above, constrains the speech decoding paths so that the speech recognition results are confined to the text of the matching library (e.g., a specified contact list) or of a certain industry domain (e.g., the communications field).
This application example provides a mandatory constraint for speech recognition decoding paths: during streaming expansion of decoding paths, unexpected paths are suppressed so that the decoding paths are restricted to a feasible set, thereby obtaining the expected recognition results and greatly improving the recognition rate. The path constraint works as follows: during decoding, the neural-network-based composition model scores the expanded paths, and a given threshold determines whether an expanded path is a valid expected path, implementing the decoding path constraint. The scheme mainly comprises three parts: training sample generation, model training, and model use, as follows.
1) Training sample generation
a. Training sample construction: the training samples can be divided into positive and negative samples, where the positive samples form the configured feasible path set and the negative samples form the path set to be suppressed, i.e., all paths outside the positive examples. As shown in Table 3, each sample starts with the start symbol <sos>; the token identifiers corresponding to a decoding path serve as model training inputs; the label of a positive path can be set to 1, and the label of a negative path to 0.
Figure PCTCN2022116552-appb-000002
Table 3
b. Full composition sample generation: for a given feasible path set, all positive and negative samples can be generated by building a prefix tree of the positive paths. Suppose the input token identifiers range over [0, 3619], the start symbol <sos> has identifier 3617, and the end symbol is 3618; Figure 5 then shows prefix tree construction and sample generation for the positive path (3617 23 52 3618). For a positive path, the full set of positive samples can be generated directly by constructing token-label data pairs, and all expandable paths other than the positive paths are the negative paths. Since a path judged negative during streaming decoding will not be expanded further, it suffices to train negative samples of the same length as the positive samples; traversing the non-positive paths at each layer of the prefix tree can generate all negative samples.
c. Composition sample generation for large data volumes — effective negative selection strategy: the full-sample generation strategy above is suitable for feasible path sets with relatively few samples. If the given path set contains a large number of positive samples, e.g., millions or tens of millions, it becomes infeasible to traverse all negative samples, causing storage explosion and excessive computation.
Since paths with low acoustic or language scores would be pruned during actual decoding anyway, this data need not be considered when training the composition model; it suffices to select the effective negatives. Accordingly, a screening strategy for effective negative samples is further proposed, which uses the acoustic confusion matrix and language scores to select negatives, solving the composition problem for large sample sets and greatly reducing storage space and training time. It comprises steps i to iii below.
i. Select only confusable-syllable negative paths: using the acoustic confusion matrix, take the top-N confusable negative tokens for each token of a positive path as negative candidates.
ii. Filter further with language scores: using a pre-trained language model, compute the language scores of the positive path and the candidate negative paths separately; negatives satisfying "negative language score - positive language score < threshold" are filtered out.
iii. The remaining negative samples constitute the training set.
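The language-score filter of step ii can be sketched as follows. This is a minimal illustration under assumed values: the helper name, the scoring function, and the concrete scores and threshold are all hypothetical, and step i (confusion-matrix candidate selection) is assumed to have produced the candidate list already.

```python
def select_effective_negatives(candidates, positive_lm_score, lm_score_of, threshold):
    """Keep only 'effective' negative candidates. A candidate whose language
    score is too far below the positive path's, i.e. with
    (negative score - positive score) < threshold, is filtered out, since
    real decoding would prune such a path anyway."""
    return [c for c in candidates
            if lm_score_of(c) - positive_lm_score >= threshold]

# Toy candidate paths with pre-computed (log) language scores.
scores = {"a": -2.5, "b": -9.0, "c": -4.0}
kept = select_effective_negatives(["a", "b", "c"], positive_lm_score=-2.0,
                                  lm_score_of=scores.get, threshold=-3.0)
```

Candidate "b" is dropped because its score is far below the positive path's; "a" and "c" are close enough to be confusable and are kept as hard negatives.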
2) Model training
The second speech recognition model may be called an NN composition model with composition, referred to below simply as the NN composition model; its network structure is shown in Figure 7. The token identifiers of the input training samples first pass through an embedding layer to obtain the corresponding embedding representations; several LSTM layers then produce an abstract vector representation with historical memory; finally, a fully connected layer and a softmax function perform binary classification to predict the sample label for training. The LSTM layers can also be replaced by other RNNs (Recurrent Neural Networks), or by any streaming neural network with a historical-memory function. During training, the weights can be shared with the underlying neural network of a language model of the same structure, or the NN composition model can be trained by extending several weight-frozen layers of that language model, which helps reduce model size and computation.
3) Model use
After the NN composition model has been trained, it is combined with the language model shown in Figure 9 to obtain an NNLM with composition (shown in Figure 8), which replaces the original NNLM for decoding and thereby enforces decoding path constraints. Specifically, the combination is performed by implementing a merge operation comprising steps i and ii below.
i. Set a threshold (obtainable by measuring the accuracy on positive and negative samples); if the composition score exceeds the threshold, the path is judged a positive sample, otherwise a negative sample.
ii. If judged positive, the language score of the decoding path remains unchanged (+0 points); if judged negative, a large negative score (e.g., -10000 points) is added to the language score of the corresponding decoding path, thereby suppressing that path. In this way, neither the decoder nor the acoustic part needs to be modified: training an NN composition model on a given set and combining it with the existing language model (or directly replacing the language model) suffices to enforce mandatory decoding path constraints and greatly improve the accuracy of speech recognition results.
With this application example, a universal mandatory constraint means is provided for the decoding paths of the NNLM, making up for the original NNLM's lack of path constraints during decoding; unexpected results are avoided during speech recognition, and the decoding paths are restricted to a preset feasible set, greatly improving recognition performance. It supports not only positive sets with small data volumes but also, via the effective negative-sample screening strategy, composition with large data volumes, greatly broadening the model's application scenarios. The model adopts the NN composition model structure and shares the underlying neural network, through weight sharing, with an NN language model of similar structure, effectively saving storage space and computation. When using the model, no changes are needed to the decoder, the acoustic model, or other components: training an NN composition model on a given set and combining it with the existing language model suffices to enforce mandatory decoding path constraints, greatly improving the convenience and practicality of using the model.
According to an embodiment of the present disclosure, a training apparatus for a speech recognition model is provided. Figure 10 is a schematic structural diagram of the training apparatus according to an embodiment of the present disclosure. As shown in Figure 10, the training apparatus includes: a first processing module 1001, configured to construct negative samples from positive samples to obtain target negative samples used to constrain speech decoding paths; a second processing module 1002, configured to obtain training data from the positive samples and the target negative samples; and a training module 1003, configured to train a first speech recognition model with the training data to obtain a second speech recognition model.
In one embodiment, the first processing module 1001 is configured to determine text characters in a matching library as the positive samples, and to determine samples other than the positive samples as the target negative samples.
In one embodiment, the first processing module 1001 is configured to: obtain, from the positive samples, a data structure in the form of a node tree, where each node in the node tree is an identifier corresponding to a text character constituting a positive sample; traverse the positive paths formed by the positive samples in the node tree to obtain a first path set; and determine the paths in the node tree other than the first path set as a second path set, the second path set including the target negative samples.
In one embodiment, the training module 1003 is configured to: input the training data into the embedding layer of the first speech recognition model, which converts the training data into corresponding feature vectors; associate, in the association layer of the first speech recognition model, the feature vectors with historical vectors to obtain associated features for speech recognition prediction; feed the associated features into the fully connected layer of the first speech recognition model and then perform binary classification with an activation function; obtain a loss function from the output value of the binary classification and the target value; and train the first speech recognition model by backpropagating the loss function to obtain the second speech recognition model.
In one embodiment, the second speech recognition model is a neural-network-based composition model.
According to an embodiment of the present disclosure, a speech recognition apparatus is provided. Figure 11 is a schematic structural diagram of the speech recognition apparatus according to an embodiment of the present disclosure. As shown in Figure 11, the speech recognition apparatus includes: a third processing module 1101, configured to, when speech data to be recognized is decoded, constrain the speech decoding path corresponding to the speech data according to a second speech recognition model, the second speech recognition model being a model trained according to the embodiments; and a fourth processing module 1102, configured to obtain a speech recognition result in response to the constraint on the speech decoding path, where the speech recognition result is a text object that matches the expected text.
In one embodiment, the fourth processing module 1102 is configured to: obtain, according to the second speech recognition model, the language score of the speech data to be recognized under the decoding path constraint; determine a target decoding path according to the language score; and obtain the speech recognition result according to the target decoding path.
In one embodiment, the apparatus further includes a recognition module configured to obtain, according to an acoustic model, the acoustic score corresponding to the speech data to be recognized.
In one embodiment, the fourth processing module 1102 is configured to: obtain an evaluation value from the language score and the acoustic score; obtain the decoding space produced when the speech data to be recognized is decoded, the decoding space including multiple decoding paths; and determine the decoding path with the highest evaluation value among the multiple decoding paths as the target decoding path.
In the technical solutions of the present disclosure, the acquisition, storage, and application of any user personal information involved comply with relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
Figure 12 shows a schematic block diagram of an example electronic device 1200 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in Figure 12, the electronic device 1200 includes a computing unit 1201, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1202 or loaded from a storage unit 1208 into a random access memory (RAM) 1203. The RAM 1203 can also store various programs and data required for the operation of the electronic device 1200. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to one another via a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
Multiple components of the electronic device 1200 are connected to the I/O interface 1205, including: an input unit 1206, such as a keyboard or a mouse; an output unit 1207, such as various types of displays and speakers; a storage unit 1208, such as a magnetic disk or an optical disc; and a communication unit 1209, such as a network card, a modem, or a wireless communication transceiver. The communication unit 1209 allows the electronic device 1200 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The computing unit 1201 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc. The computing unit 1201 performs the methods and processes described above, such as the training method for the speech recognition model/speech recognition method. For example, in some examples, the training method for the speech recognition model/speech recognition method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1208. In some examples, part or all of the computer program may be loaded and/or installed onto the electronic device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the training method for the speech recognition model/speech recognition method described above may be performed. Alternatively, in other examples, the computing unit 1201 may be configured to perform the training method for the speech recognition model/speech recognition method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described herein may be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a standalone software package partly on the machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual, auditory, or tactile feedback), and input from the user may be received in any form (including acoustic, speech, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or middleware components (e.g., an application server), or front-end components (e.g., a user computer with a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
A computer system may include clients and servers. Clients and servers are generally remote from each other and typically interact through a communication network. The relationship of client and server arises from computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the results desired by the technical solution of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description does not limit the scope of protection of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present disclosure shall be included within the scope of protection of the present disclosure.

Claims (21)

  1. A training method for a speech recognition model, comprising:
    constructing negative samples from positive samples, so as to obtain target negative samples used to constrain speech decoding paths;
    obtaining training data from the positive samples and the target negative samples; and
    training a first speech recognition model with the training data, so as to obtain a second speech recognition model.
  2. The method according to claim 1, wherein constructing the negative samples from the positive samples, so as to obtain the target negative samples used to constrain speech decoding paths, comprises:
    determining text characters in a matching library as the positive samples; and
    determining samples other than the positive samples as the target negative samples.
  3. The method according to claim 2, wherein determining the samples other than the positive samples as the target negative samples comprises:
    obtaining, from the positive samples, a data structure in the form of a node tree, wherein each node in the node tree is an identifier corresponding to a text character constituting the positive samples;
    traversing positive paths formed by the positive samples in the node tree, so as to obtain a first path set; and
    determining paths in the node tree other than the first path set as a second path set, the second path set comprising the target negative samples.
  4. The method according to any one of claims 1-3, wherein training the first speech recognition model with the training data, so as to obtain the second speech recognition model, comprises:
    inputting the training data into an embedding layer of the first speech recognition model, so as to convert the training data into corresponding feature vectors through the embedding layer;
    associating, in an association layer of the first speech recognition model, the feature vectors with historical vectors, so as to obtain associated features for speech recognition prediction;
    inputting the associated features into a fully connected layer of the first speech recognition model and then performing binary classification with an activation function;
    obtaining a loss function from an output value obtained after the binary classification and a target value; and
    training the first speech recognition model through backpropagation of the loss function, so as to obtain the second speech recognition model.
  5. The method according to claim 4, wherein the second speech recognition model is a neural-network-based composition model.
  6. A speech recognition method, comprising:
    when speech data to be recognized is decoded, constraining a speech decoding path corresponding to the speech data to be recognized according to a second speech recognition model, the second speech recognition model being a model trained by the method according to any one of claims 1-5; and
    obtaining a speech recognition result according to the constraint on the speech decoding path;
    wherein the speech recognition result is a text object that matches an expected text.
  7. The method according to claim 6, wherein obtaining the speech recognition result according to the constraint on the speech decoding path comprises:
    obtaining, according to the second speech recognition model, a language score of the speech data to be recognized under the decoding path constraint;
    determining a target decoding path according to the language score; and
    obtaining the speech recognition result according to the target decoding path.
  8. The method according to claim 7, further comprising:
    obtaining, according to an acoustic model, an acoustic score corresponding to the speech data to be recognized.
  9. The method according to claim 8, wherein determining the target decoding path according to the language score comprises:
    obtaining an evaluation value from the language score and the acoustic score;
    obtaining a decoding space produced when the speech data to be recognized is decoded, wherein the decoding space comprises multiple decoding paths; and
    determining the decoding path with the highest evaluation value among the multiple decoding paths as the target decoding path.
  10. A training apparatus for a speech recognition model, comprising:
    a first processing module, configured to construct negative samples from positive samples, so as to obtain target negative samples used to constrain speech decoding paths;
    a second processing module, configured to obtain training data from the positive samples and the target negative samples; and
    a training module, configured to train a first speech recognition model with the training data, so as to obtain a second speech recognition model.
  11. The apparatus according to claim 10, wherein the first processing module is configured to:
    determine text characters in a matching library as the positive samples; and
    determine samples other than the positive samples as the target negative samples.
  12. The apparatus according to claim 11, wherein the first processing module is configured to:
    obtain, from the positive samples, a data structure in the form of a node tree, wherein each node in the node tree is an identifier corresponding to a text character constituting the positive samples;
    traverse positive paths formed by the positive samples in the node tree, so as to obtain a first path set; and
    determine paths in the node tree other than the first path set as a second path set, the second path set comprising the target negative samples.
  13. The apparatus according to any one of claims 10-12, wherein the training module is configured to:
    input the training data into an embedding layer of the first speech recognition model, so as to convert the training data into corresponding feature vectors through the embedding layer;
    associate, in an association layer of the first speech recognition model, the feature vectors with historical vectors, so as to obtain associated features for speech recognition prediction;
    input the associated features into a fully connected layer of the first speech recognition model and then perform binary classification with an activation function;
    obtain a loss function from an output value obtained after the binary classification and a target value; and
    train the first speech recognition model through backpropagation of the loss function, so as to obtain the second speech recognition model.
  14. The apparatus according to claim 13, wherein the second speech recognition model is a neural-network-based composition model.
  15. A speech recognition apparatus, comprising:
    a third processing module configured to, in a case of decoding speech data to be recognized, constrain, according to a second speech recognition model, a speech decoding path corresponding to the speech data to be recognized, wherein the second speech recognition model is a model trained by the method according to any one of claims 1-5; and
    a fourth processing module configured to obtain a speech recognition result according to the constraint on the speech decoding path;
    wherein the speech recognition result is a text object matching an expected text.
  16. The apparatus according to claim 15, wherein the fourth processing module is configured to:
    obtain, according to the second speech recognition model, a language score corresponding to the speech data to be recognized under the decoding path constraint;
    determine a target decoding path according to the language score; and
    obtain the speech recognition result according to the target decoding path.
  17. The apparatus according to claim 16, further comprising a recognition module configured to:
    obtain, according to an acoustic model, an acoustic score corresponding to the speech data to be recognized.
  18. The apparatus according to claim 17, wherein the fourth processing module is configured to:
    obtain an evaluation value according to the language score and the acoustic score;
    obtain a decoding space obtained in the case of decoding the speech data to be recognized, wherein the decoding space comprises a plurality of decoding paths; and
    determine a decoding path with the highest evaluation value among the plurality of decoding paths as the target decoding path.
  19. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor;
    wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-9.
  20. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause a computer to perform the method according to any one of claims 1-9.
  21. A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-9.
PCT/CN2022/116552 2022-06-23 2022-09-01 Training method and apparatus for speech recognition model, electronic device, and storage medium WO2023245869A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210719500.5A CN115035890B (zh) 2022-06-23 2022-06-23 Training method and apparatus for speech recognition model, electronic device, and storage medium
CN202210719500.5 2022-06-23

Publications (1)

Publication Number Publication Date
WO2023245869A1 true WO2023245869A1 (zh) 2023-12-28

Family

ID=83127459

Country Status (2)

Country Link
CN (1) CN115035890B (zh)
WO (1) WO2023245869A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690434A (zh) * 2024-02-04 2024-03-12 深圳市友杰智新科技有限公司 Speech decoding and recognition method, apparatus, device and storage medium for multiple command words

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200043483A1 (en) * 2018-08-01 2020-02-06 Google Llc Minimum word error rate training for attention-based sequence-to-sequence models
CN111418009A (zh) * 2019-10-31 2020-07-14 支付宝(杭州)信息技术有限公司 Personalized speaker verification system and method
CN113420121A (zh) * 2021-06-24 2021-09-21 中国科学院声学研究所 Text processing model training method, and speech-text processing method and apparatus
CN113646833A (zh) * 2021-07-14 2021-11-12 东莞理工学院 Speech adversarial sample detection method, apparatus, device and computer-readable storage medium
CN114299927A (zh) * 2021-12-20 2022-04-08 北京声智科技有限公司 Wake-up word recognition method, apparatus, electronic device and storage medium
CN114299933A (zh) * 2021-12-28 2022-04-08 北京声智科技有限公司 Speech recognition model training method, apparatus, device, storage medium and product

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017010249A (ja) * 2015-06-22 2017-01-12 日本電信電話株式会社 Parameter learning device, sentence similarity calculation device, method, and program
CN110598224A (zh) * 2019-09-23 2019-12-20 腾讯科技(深圳)有限公司 Translation model training method, text processing method, apparatus and storage medium
CN111160552B (zh) * 2019-12-17 2023-09-26 北京百度网讯科技有限公司 News information recommendation processing method, apparatus, device and computer storage medium
CN112001190A (zh) * 2020-07-20 2020-11-27 北京百度网讯科技有限公司 Natural language processing model training method, apparatus, device and storage medium
US11775778B2 (en) * 2020-11-05 2023-10-03 Microsoft Technology Licensing, Llc Machine translation of entities
CN113095901B (zh) * 2021-02-20 2024-02-20 科大讯飞股份有限公司 Recommendation method, related model training method, electronic device, and storage apparatus
CN113658586B (zh) * 2021-08-13 2024-04-09 北京百度网讯科技有限公司 Speech recognition model training method, speech interaction method and apparatus
CN114444462B (zh) * 2022-01-26 2022-11-29 北京百度网讯科技有限公司 Model training method, and human-computer interaction method and apparatus

Also Published As

Publication number Publication date
CN115035890B (zh) 2023-12-05
CN115035890A (zh) 2022-09-09

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase (Ref document number: 18266432; Country of ref document: US)
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22947609; Country of ref document: EP; Kind code of ref document: A1)