WO2021217619A1 - Speech recognition method, terminal, and medium based on label smoothing - Google Patents

Speech recognition method, terminal, and medium based on label smoothing

Info

Publication number
WO2021217619A1
Authority
WO
WIPO (PCT)
Prior art keywords
label
sample
preset
training
speech recognition
Prior art date
Application number
PCT/CN2020/088422
Other languages
English (en)
French (fr)
Inventor
郑诣
杨显杰
熊友军
Original Assignee
深圳市优必选科技股份有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司
Priority to PCT/CN2020/088422
Publication of WO2021217619A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method of speech recognition based on label smoothing, an intelligent terminal, and a computer-readable storage medium.
  • The training methods for speech recognition models in related technical solutions suffer from insufficient accuracy in subsequent speech recognition.
  • a method of speech recognition based on label smoothing including:
  • the training data includes a plurality of training samples, and each of the training samples includes a sample voice and a sample recognition label corresponding to the sample voice;
  • An intelligent terminal includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor executes the following steps:
  • the training data includes a plurality of training samples, and each of the training samples includes a sample voice and a sample recognition label corresponding to the sample voice;
  • a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, the processor executes the following steps:
  • the training data includes a plurality of training samples, and each of the training samples includes a sample voice and a sample recognition label corresponding to the sample voice;
  • The sample recognition label corresponding to each training sample is subjected to label smoothing processing based on the preset homophone dictionary to obtain the corresponding sample smoothing label; the speech recognition model is then trained with the training sample and the sample smoothing label.
  • The corresponding loss value is calculated based on the preset loss function, and backpropagation is performed based on the loss value to complete the training of the speech recognition model.
  • Label smoothing of the training samples takes homophones into account: through the homophone dictionary, homophone labels are given a higher probability than other, non-homophone labels, thereby improving the accuracy of recognizing Chinese speech that contains homophones and, in turn, the overall accuracy of speech recognition.
  • In addition, a term that measures the gap between the test recognition label produced by the speech recognition model and the sample smoothing label is added: the KL distance between the two is used as a penalty. The loss value obtained through this calculation method better guides the training of the speech recognition model, improving the speech recognition effect and the accuracy of subsequent speech recognition.
  • FIG. 1 is an application environment diagram of a speech recognition method based on label smoothing according to an embodiment of the application
  • FIG. 2 is a schematic flowchart of a method for speech recognition based on label smoothing according to an embodiment of the application
  • FIG. 3 is a schematic flowchart of a process of label smoothing processing on a sample identification label according to an embodiment of the application
  • Figure 4 is a schematic structural diagram of a speech recognition model in an embodiment of the application.
  • FIG. 5 is a schematic flowchart of a loss value calculation process in an embodiment of the application.
  • FIG. 6 is a schematic structural diagram of a speech recognition method based on label smoothing in an embodiment of the application.
  • FIG. 7 is a schematic structural diagram of a loss value calculation module in an embodiment of the application.
  • FIG. 8 is a schematic structural diagram of a label smoothing processing module in an embodiment of the application.
  • FIG. 9 is a schematic structural diagram of a computer device running the above-mentioned label smoothing-based speech recognition method according to an embodiment of the application.
  • FIG. 10 is a schematic structural diagram of a smart terminal in an embodiment of this application.
  • FIG. 11 is a schematic structural diagram of a non-transitory computer-readable storage medium in an embodiment of this application.
  • Fig. 1 is an application environment diagram of a method of speech recognition based on label smoothing in an embodiment.
  • the voice recognition system includes a terminal 110 and a server 120.
  • the terminal 110 and the server 120 are connected through a network.
  • the terminal 110 may be a smart robot, a desktop terminal, or a mobile terminal.
  • the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, etc.
  • The terminal 110 is not limited to any particular type of smart terminal.
  • the server 120 may be implemented as an independent server or a server cluster composed of multiple servers. Among them, the terminal 110 is used to perform recognition processing on the speech segment to be recognized, and the server 120 is used to train and predict the model.
  • The voice recognition system to which the above-mentioned label smoothing-based speech recognition method is applied may also be implemented based on the terminal 110 alone.
  • the terminal 110 is used for model training and prediction, and converts speech fragments to be recognized into text.
  • a method for speech recognition based on label smoothing is provided.
  • the method can be applied to a terminal or a server, and this embodiment is applied to a terminal as an example.
  • the label smoothing-based speech recognition method specifically includes the following steps:
  • Step S102 Obtain training data, where the training data includes a plurality of training samples, and each of the training samples includes a sample voice and a sample recognition label corresponding to the sample voice.
  • The training data is a training database composed of multiple training samples, where each training sample includes a sample voice corresponding to a voice segment and the sample recognition label corresponding to that sample voice.
  • the sample speech may be a speech fragment, or a speech feature corresponding to the speech fragment.
  • the sample recognition tag is a text recognition tag corresponding to the speech segment, and is a text sequence.
  • The sample recognition label may be a manually annotated label corresponding to the speech segment.
  • In this embodiment, the sample voice is a speech feature corresponding to a speech segment.
  • The speech segment corresponding to each training sample in the training data is a sentence. That is to say, in this embodiment, the training of the speech recognition model and subsequent recognition are performed on sentences, not on individual characters or words. Therefore, the sample speech corresponding to a training sample is the speech feature vector corresponding to a sentence.
  • each word vector corresponds to one speech feature vector. In other embodiments, multiple word vectors may correspond to one speech feature vector.
  • Step S104 Perform label smoothing processing on the sample identification label based on a preset homophone dictionary, and obtain sample smoothing labels after label smoothing processing.
  • Because there are many homophones in Chinese, homophones have a considerable influence on speech recognition; therefore, they need to be considered when training the speech recognition model.
  • In the preset homophone dictionary, tags with the same pinyin are grouped in advance according to the pinyin of each tag (i.e., each character).
  • For example, for the pinyin chu2, the characters 厨 (kitchen), 除 (besides), and 橱 (cupboard) are homophones.
  • the preset homophone dictionary contains all tags of the same pinyin.
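As a minimal illustration, the preset homophone dictionary can be sketched as a mapping from pinyin to the character labels that share it. The dictionary contents below are hypothetical stand-ins, not the actual dictionary of the embodiment:

```python
# Sketch of a preset homophone dictionary, keyed by toneful pinyin.
# The character sets are illustrative only.
HOMOPHONE_DICT = {
    "chu2": ["厨", "除", "橱", "锄"],
    "shan4": ["扇", "善", "膳"],
}

def homophone_labels(char, pinyin):
    """Return the homophone labels of `char`: all characters sharing its
    pinyin, excluding the character itself."""
    return [c for c in HOMOPHONE_DICT.get(pinyin, []) if c != char]

print(homophone_labels("厨", "chu2"))
```

A lookup for an unknown pinyin simply yields no homophone labels, which matches the intent that only characters recorded under the same pinyin are treated as homophones.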
  • Label Smoothing is a regularization method that smoothes the labels to prevent overfitting.
  • Label smoothing selectively adds a certain probability mass of other labels to the annotated label (such as the aforementioned sample recognition label).
  • a corresponding label database is constructed for Chinese characters commonly used in Chinese, that is, a character database.
  • For example, the national standard GB2312 character set (levels 1 and 2) can be used, which contains 6763 Chinese characters.
  • When the label database corresponds to the GB2312 character set, the label database includes 6763 labels.
  • the sample recognition label corresponding to the sample speech corresponding to the training sample includes labels corresponding to multiple characters.
  • The sample recognition label needs label smoothing processing; that is, for each label in the sample recognition label, other labels in the label database are added to the speech recognition label with a certain probability.
  • Specifically, based on the homophone dictionary, each label in the sample recognition label is smoothed with respect to its homophone characters, so as to determine the corresponding sample smoothing label after the smoothing process.
  • the process of performing label smoothing processing on the sample identification label specifically includes steps S402-S406 as shown in FIG. 3:
  • Step S402 Based on the preset homophone dictionary, determine at least one homophone tag corresponding to each sample identification tag.
  • The homophone labels corresponding to each label are determined according to the pinyin of the character, wherein each label may correspond to one or more homophone labels.
  • For example, when the sample recognition label includes 厨 (chu2), the corresponding homophone labels are the other characters that share the pinyin chu2.
  • the corresponding non-homonymous character tags are other tags in the tag database except for the above-mentioned sample identification tags and homophone tags.
  • the sample recognition label is the speech recognition label corresponding to the training sample, which is a text sequence.
  • First, the pinyin corresponding to the sample recognition label is determined, and the corresponding pinyin sequence, namely the sample pinyin sequence, is constructed.
  • the sample pinyin sequence also includes multiple characters or words corresponding to the sentence.
  • For each pinyin label, the homophone characters corresponding to that pinyin label are determined according to the preset homophone dictionary and used as its homophone labels.
  • the training sample as the sample voice corresponding to the speech segment corresponding to “turn on the kitchen ventilation fan” and the corresponding sample identification tag (the text sequence corresponding to “turn on the kitchen ventilation fan”) as an example for description.
  • the process of constructing the sample pinyin sequence is the process of determining the pinyin sequence corresponding to "turn on the kitchen ventilation fan", and the constructed sample pinyin sequence is: "da kai chu fang huan qi shan”.
  • Note that the pinyin of a single character may be ambiguous, so the corresponding pinyin sequence needs to be generated by combining words in the context of the entire sentence.
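The construction of the sample pinyin sequence can be sketched as follows. The character-to-pinyin mapping is a hypothetical stand-in; as the text notes, a real system must resolve ambiguous pinyin using whole-sentence context rather than a per-character table:

```python
# Sketch of building the sample pinyin sequence for a sample recognition label.
# The character-to-pinyin map below is illustrative; polyphonic characters
# would need context-aware resolution in practice.
CHAR_TO_PINYIN = {
    "打": "da", "开": "kai", "厨": "chu", "房": "fang",
    "换": "huan", "气": "qi", "扇": "shan",
}

def sample_pinyin_sequence(label_text):
    """Map each character label of the text sequence to its pinyin label."""
    return [CHAR_TO_PINYIN[ch] for ch in label_text]

print(" ".join(sample_pinyin_sequence("打开厨房换气扇")))
```

For the example sentence this reproduces the sample pinyin sequence "da kai chu fang huan qi shan" given above.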
  • Step S404 Perform label smoothing processing on the sample identification label based on the determined homophone label, and determine first distribution information corresponding to the sample identification label.
  • Step S406 Use the first distribution information as a sample smoothing label.
  • The probabilities of the tags in the tag database are neither based on word-frequency statistics alone nor randomly distributed, but are determined based on homophones. That is to say, in the process of label smoothing, the probability of each label is determined according to whether it is a homophone label of the sample recognition label.
  • label smoothing processing is performed on the sample identification label, thereby determining the sample smoothing label.
  • That is, the label probabilities corresponding to the sample recognition label, its homophone labels, and the non-homophone labels are determined.
  • The probability of each label is determined according to the preset probability distribution, depending on whether that label is a homophone label.
  • a preset probability coefficient is obtained, and the label probability of one or more of the sample identification tag, multiple homophone tags, and/or multiple non-homonymous word tags is determined according to the preset probability coefficient.
  • the preset probability coefficient includes the tag probability of one or more of the sample identification tag, the homophone tag and/or the multiple non-homonymous word tags, and the tag probability corresponding to each tag can be determined accordingly.
  • the corresponding probability distribution is determined according to the tag probability of one or more of the sample identification tag, the multiple homophone tags and/or the multiple non-homonym tags, that is, the first distribution information.
  • For example, the constructed first distribution information assigns a probability of 0.6 to the sample recognition label, a total probability of 0.3 shared among the homophone labels, and a total probability of 0.1 shared among the non-homophone labels.
  • The probability coefficients can also be set according to requirements, for example, 0.7, 0.2, and 0.1.
  • the details can be determined according to the requirements of the model design.
  • Besides assigning the same probability to each homophone label, the probability of each label can also be determined individually, for example, according to the word frequency of each label in speech recognition as given by statistical results.
  • For instance, the sum of the probabilities of the 9 homophone labels is 0.3, but the probability assigned to each individual homophone label need not be equal. Instead of the aforementioned equalized division, the probability of each homophone label can be determined based on other factors: first determine the overall probability coefficient of the homophone labels (such as the aforementioned 0.3), then determine the probability coefficient of each individual homophone label by a preset method (for example, according to word frequency statistics), where the probability coefficients of all homophone labels sum to the previously determined overall coefficient (such as 0.3).
  • The probability coefficients of the non-homophone labels can be determined in the same manner.
  • In this way, word frequency statistics and other factors are also considered, which further improves the soundness of label smoothing, the effectiveness of speech recognition model training, and the accuracy of subsequent speech recognition.
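The construction of the first distribution information for a single character label can be sketched as follows, assuming illustrative probability coefficients of 0.6 / 0.3 / 0.1 and a tiny label database (the real database would contain thousands of labels):

```python
# Sketch of constructing the first distribution information (sample smoothing
# label) for one character label. The coefficients and vocabulary are
# illustrative assumptions.
def smooth_label(true_label, homophones, vocab,
                 p_true=0.6, p_homo=0.3, p_other=0.1):
    homophones = [c for c in homophones if c != true_label]
    others = [c for c in vocab if c != true_label and c not in homophones]
    dist = {}
    for c in vocab:
        if c == true_label:
            dist[c] = p_true
        elif c in homophones:
            dist[c] = p_homo / len(homophones)   # equalized split; a word-frequency
        else:                                    # weighted split is also possible
            dist[c] = p_other / len(others)
    return dist

vocab = ["厨", "除", "橱", "房", "开"]
d = smooth_label("厨", ["除", "橱"], vocab)
print(d)
```

The resulting distribution sums to 1, with the homophone labels sharing 0.3 and the non-homophone labels sharing 0.1, as described above.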
  • Step S106 Train a preset speech recognition model according to the training sample and the sample smoothing label, and calculate a loss value corresponding to the training sample based on the preset loss function.
  • a speech recognition model can be constructed first according to the requirements of speech recognition, for example, a neural network model.
  • the constructed neural network model is an end-to-end neural network model.
  • Figure 4 shows a specific model structure diagram of a speech recognition model.
  • x_1, ..., x_T are the inputs of the speech recognition model, for example, the speech features extracted from the speech segment;
  • y_1, y_2, ... are the outputs of the speech recognition model, for example, the recognition labels (characters or character vectors) of the speech segment.
  • 301 is the first sub-network module, which processes the input features, and h_2, ..., h_T are intermediate variables of the first sub-network module.
  • 302 and 303 are the second and third sub-network modules, respectively, used to calculate the corresponding speech recognition labels from the output of the first sub-network module 301; z_2, z_4, ... are intermediate variables of the second sub-network module 302, and r_0, ..., r_L, q_0, ..., q_L, c_1, c_2, ... are intermediate variables of the third sub-network module 303.
  • From these, the final outputs of the speech recognition model, namely y_1, y_2, ..., are determined.
  • the output result of the aforementioned speech recognition model includes the recognized characters and the confidence level corresponding to each character, and the sum of the confidence levels of all characters is 100%.
  • the constructed neural network model may also be a BP neural network, a Hopfield network, an ART network, or a Kohonen network.
  • With the training samples and sample smoothing labels, the preset speech recognition model is trained, and the corresponding loss value is then calculated to evaluate the training effect of the speech recognition model.
  • In one embodiment, the loss value is calculated with a cross-entropy loss function, which here refers to the cross-entropy between the predicted label probability and the constructed label probability; that is, in this step, the cross-entropy loss function calculates the distance between the label probability predicted by the current speech recognition model and the sample smoothing label obtained after label smoothing. It should be noted that, in other embodiments, other loss functions may also be used to calculate the loss value.
  • the process of training the preset speech recognition model based on the training samples may be inputting the sample speech corresponding to the training sample into the speech recognition model to obtain the output result of the speech recognition model.
  • The output result is the recognition label calculated by the speech recognition model, namely the test recognition label.
  • the corresponding loss value is calculated according to the test identification label, the sample identification label and the sample smoothing label, where the calculation of the loss value is calculated according to the preset loss function.
  • the process of calculating the loss value includes steps S202-S204 as shown in FIG. 5:
  • Step S202 Input the sample voice into the preset voice recognition model, and obtain a test recognition label output by the preset voice recognition model.
  • the sample speech corresponding to the training sample is used as the input, and the sample recognition label is used as the output.
  • the sample voice is input into the voice recognition model, and then the corresponding test identification tag is obtained through the voice recognition model.
  • the test recognition label is a voice recognition label corresponding to the sample voice output by the voice recognition model, and is a corresponding text sequence.
  • Step S204 Calculate the loss value between the test identification label and the sample smoothing label according to a preset loss function.
  • the sample identification label is subjected to label smoothing processing to obtain the corresponding sample smoothing label, and the corresponding first distribution information is constructed at the same time.
  • step S202 after the sample voice is input into the speech recognition model through the training sample, the test identification label output by the model is obtained, and the second distribution information is determined based on the test identification label, and the second distribution information is the probability determined according to the test identification label distributed.
  • the second distribution information is determined according to the test identification tags of the current speech recognition model, and identifies the distribution situation corresponding to each tag in the test identification tags.
  • The difference between the second distribution information (the distribution of the predicted output) and the first distribution information (the distribution of the expected output) is calculated, that is, the loss between the first distribution information constructed by label smoothing and the distribution predicted by the speech recognition model (the second distribution information).
  • the calculation of the above loss function is divided into two parts, one part is cross entropy, and the other part is KL divergence.
  • For the cross-entropy part, the cross-entropy between the second distribution information (the distribution of the predicted output) and the first distribution information (the distribution of the expected output) is calculated through the preset cross-entropy loss function as the cross-entropy term, which is used to measure the closeness between the two.
  • That is, the cross-entropy value between the first distribution information d_y and the second distribution information p(y|x; θ) is calculated as the cross-entropy term, and the loss value is determined in the form of negative entropy: L(θ) = −Σ_y d_y log p(y|x; θ), where L(θ) is the loss value.
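The cross-entropy term can be illustrated with a small numeric sketch; the three-label distributions below are illustrative only:

```python
import math

# Sketch of the cross-entropy term H(d_y, p) = -sum_y d_y[y] * log p[y]
# between the first distribution information d_y (expected output) and the
# second distribution information p (predicted output).
def cross_entropy(d_y, p_pred, eps=1e-12):
    return -sum(d * math.log(p + eps) for d, p in zip(d_y, p_pred))

d_y = [0.6, 0.3, 0.1]      # first distribution (sample smoothing label)
p_pred = [0.7, 0.2, 0.1]   # second distribution (model prediction)
print(cross_entropy(d_y, p_pred))
```

By Gibbs' inequality, the cross-entropy is minimized when the predicted distribution matches the smoothed label distribution exactly.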
  • KL divergence For the calculation of KL divergence, the KL distance between the second distribution information (the distribution of the predicted output) and the first distribution information (the distribution of the expected output) is calculated as the KL penalty according to the preset KL distance calculation formula.
  • KL divergence, also known as KL distance (Kullback–Leibler divergence) or relative entropy, describes the difference between two probability distributions. That is, in this embodiment, the difference between the first distribution information and the second distribution information can be calculated by the preset KL distance calculation formula (also called the KL divergence calculation formula) and used to compute the corresponding loss value.
  • a KL distance value between the first distribution information and the second distribution information is calculated as the loss value.
  • The KL distance calculation formula is as follows: D_KL(p(d_y|x) ∥ p(y|x; θ)) = Σ_y p(d_y|x) log( p(d_y|x) / p(y|x; θ) ), where y is a character (that is, the test recognition result and the sample recognition result), x denotes the speech feature of the sample, θ denotes the parameters of the speech recognition model, and d_y is the above-mentioned first distribution information.
  • With the KL distance D_KL(p(d_y|x) ∥ p(y|x; θ)) constructed above, the loss value is determined in the form of a negative KL distance: L(θ) = −D_KL(p(d_y|x) ∥ p(y|x; θ)), where L(θ) is the loss value.
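A corresponding numeric sketch of the KL distance between the first and second distribution information (illustrative values only):

```python
import math

# Sketch of the KL penalty D_KL(d_y || p) = sum_y d_y[y] * log(d_y[y] / p[y])
# between the first distribution information d_y and the predicted
# distribution p.
def kl_divergence(d_y, p_pred, eps=1e-12):
    return sum(d * math.log((d + eps) / (p + eps))
               for d, p in zip(d_y, p_pred) if d > 0)

d_y = [0.6, 0.3, 0.1]
p_pred = [0.7, 0.2, 0.1]
print(kl_divergence(d_y, p_pred))
```

The KL distance is zero when the two distributions coincide and strictly positive otherwise, which is what makes it usable as a penalty on over-confident predictions.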
  • Combining the two parts, the corresponding loss function can be constructed. That is, in this step, the loss function includes both the cross-entropy term and the KL penalty: L(θ) = H(d_y, p(y|x; θ)) + D_KL(p(d_y|x) ∥ p(y|x; θ)), where L(θ) is the loss value, H(d_y, p(y|x; θ)) is the cross-entropy term, and D_KL(p(d_y|x) ∥ p(y|x; θ)) is the KL penalty term.
  • The first distribution information is constructed by label smoothing the sample recognition labels based on the homophone dictionary. That is to say, in the first distribution information, homophones have a higher probability than other characters, which matches Chinese speech recognition well.
  • In conventional label smoothing, the smoothing distribution u is a fixed uniform distribution or unigram distribution; here, the first distribution information d_y constructed from the homophone dictionary takes the place of u in the KL penalty D_KL(p(d_y|x) ∥ p(y|x; θ)).
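Putting the two parts together, the combined loss can be sketched as below; the unit weight on the KL penalty is an assumption, since the text does not specify a weighting:

```python
import math

# Sketch of a combined loss: cross-entropy term plus KL penalty between the
# smoothed label distribution d_y and the predicted distribution p. The
# kl_weight of 1.0 is an assumption.
def cross_entropy(d_y, p, eps=1e-12):
    return -sum(d * math.log(q + eps) for d, q in zip(d_y, p))

def kl_divergence(d_y, p, eps=1e-12):
    return sum(d * math.log((d + eps) / (q + eps))
               for d, q in zip(d_y, p) if d > 0)

def loss(d_y, p, kl_weight=1.0):
    return cross_entropy(d_y, p) + kl_weight * kl_divergence(d_y, p)

d_y = [0.6, 0.3, 0.1]      # sample smoothing label (first distribution)
p_good = [0.6, 0.3, 0.1]   # prediction matching the smoothed label
p_bad = [0.1, 0.1, 0.8]    # prediction far from it
print(loss(d_y, p_good), loss(d_y, p_bad))
```

A prediction that matches the smoothed label yields a strictly smaller loss than one that concentrates probability on the wrong character, so minimizing this loss pushes the model toward the homophone-aware target distribution.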
  • the speech recognition model can better perform speech recognition for Chinese that contains homophones.
  • In this way, the influence of Chinese homophone information is considered in the training of the speech recognition model: not only the frequency with which each character appears during speech recognition, but also the influence of homophones on the recognition process is taken into account, with homophones used as prior knowledge to construct the prior distribution, thereby improving the accuracy of Chinese speech recognition, where homophones are common.
  • the loss function for calculating the loss value may also be other functions, and is not limited to the calculation method of the above loss function.
  • Step S108 Perform back propagation according to the loss value to complete the training of the preset speech recognition model.
  • the speech recognition model can be back propagated according to the loss value to complete the training of the speech recognition model.
  • the stochastic gradient descent method can be used to perform back propagation to train the speech recognition model; in other embodiments, other algorithms can also be used to perform back propagation.
  • In one embodiment, a preset optimizer is used to perform backpropagation and train the speech recognition model, where the preset optimizer can be one of AdagradOptimizer, MomentumOptimizer, or AdamOptimizer, or another optimizer.
  • the specific method for backpropagation is not limited in this embodiment, and it can be implemented by the backpropagation method in related technical solutions.
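As a toy illustration of loss-driven backpropagation, the gradient-descent update below stands in for the optimizer-based training described above; the single softmax layer, initial logits, and learning rate are assumptions:

```python
import math

# Toy sketch: one-layer softmax model trained by gradient descent on the
# cross-entropy against a smoothed label. For softmax + cross-entropy,
# the gradient w.r.t. each logit is simply p_i - d_y_i.
def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def sgd_step(logits, d_y, lr=0.5):
    p = softmax(logits)
    grad = [pi - di for pi, di in zip(p, d_y)]
    return [z - lr * g for z, g in zip(logits, grad)]

def ce(d_y, p, eps=1e-12):
    return -sum(d * math.log(q + eps) for d, q in zip(d_y, p))

d_y = [0.6, 0.3, 0.1]       # sample smoothing label
logits = [0.0, 0.0, 0.0]
before = ce(d_y, softmax(logits))
for _ in range(100):
    logits = sgd_step(logits, d_y)
after = ce(d_y, softmax(logits))
print(before, after)
```

After repeated updates, the loss decreases toward the entropy of the smoothed label distribution, mirroring how backpropagation drives the model output toward the first distribution information.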
  • the influence of the existence of homophones on the speech recognition effect is considered
  • the probability of homophones is set to be higher than the probability of other characters, which ensures that homophones have a higher probability than other characters, thereby improving the label smoothing effect of speech recognition model training.
  • the performance of the speech recognition model is improved, and the accuracy of speech recognition is improved.
  • Moreover, the excessive confidence of the network can be alleviated, thereby reducing over-fitting of the network, which yields a better label smoothing effect and improves the performance of the neural network model.
  • Excessive confidence of the neural network model corresponds to a low-entropy output distribution. In this embodiment, the KL distance between the distribution of the test recognition label and the distribution corresponding to the sample smoothing label is added on top of the cross-entropy, and optimizing this loss value during training further strengthens the label smoothing effect, improves the recognition performance of the speech recognition model, and reduces over-fitting of the speech recognition network.
  • the loss value calculated by the above loss function trains the speech recognition model through the backpropagation algorithm, so that the trained speech recognition model can better consider the impact of homophones on Chinese speech recognition, and achieve better The effect of speech recognition.
  • In one embodiment, the step of performing backpropagation according to the loss value to complete the training of the preset speech recognition model further includes: judging whether the loss value is less than a preset loss threshold, and when the loss value is less than the preset loss threshold, determining that the training of the preset speech recognition model is completed.
  • the loss value represents the distance or difference between the test identification label and the sample identification label. If the test identification label and the sample identification label are close enough or equal to each other, it means that the accuracy of the trained speech recognition model has reached the requirements. End the training of the corresponding speech recognition model.
  • In one embodiment, a loss threshold is set, such as 0.05. When the loss value is less than the loss threshold, the training of the speech recognition model is determined to be completed; otherwise, training continues with the training samples included in the training data.
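The threshold-based stopping rule can be sketched as a simple loop; the geometric loss decay below is a placeholder for actual training passes over the training data:

```python
# Sketch of the stopping rule: keep training until the loss value falls below
# a preset loss threshold (0.05 in the example above). The decaying-loss model
# is a stand-in for real backpropagation iterations.
def train_until_threshold(initial_loss, threshold=0.05, decay=0.8,
                          max_epochs=1000):
    loss, epochs = initial_loss, 0
    while loss >= threshold and epochs < max_epochs:
        loss *= decay       # placeholder for one training pass
        epochs += 1
    return loss, epochs

final_loss, epochs = train_until_threshold(2.0)
print(final_loss, epochs)
```

The `max_epochs` cap is an added safeguard (an assumption, not in the text) so training terminates even if the loss never reaches the threshold.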
  • a speech recognition device based on label smoothing is also proposed.
  • the above-mentioned label smoothing-based speech recognition device includes:
  • the training data acquisition module 102 is configured to acquire training data.
  • the training data includes a plurality of training samples, and each of the training samples includes a sample voice and a sample recognition label corresponding to the sample voice;
  • the label smoothing processing module 104 is configured to perform label smoothing processing on the sample identification label based on a preset homophone dictionary, and obtain the sample smoothing label after label smoothing processing;
  • the loss value calculation module 106 is configured to train a preset speech recognition model according to the training sample and the sample smoothing label, and calculate the loss value corresponding to the training sample based on the preset loss function;
  • the back-propagation training module 108 is configured to perform back-propagation according to the loss value to complete the training of the preset speech recognition model.
  • the preset speech recognition model is an end-to-end neural network model.
  • the loss value calculation module 106 further includes:
  • the test recognition label obtaining unit 602 is configured to input the sample voice into the preset voice recognition model, and obtain the test recognition label output by the preset voice recognition model;
  • the loss value calculation subunit 604 is configured to calculate the loss value between the test identification label and the sample smoothing label according to a preset loss function.
  • the label smoothing processing module 104 further includes:
  • The homophone tag determination subunit 402 is configured to determine at least one homophone tag corresponding to each sample identification tag based on the preset homophone dictionary.
  • The first distribution determining subunit 404 is configured to perform label smoothing processing on the sample identification label based on the determined homophone labels, and determine the first distribution information corresponding to the sample identification label; the first distribution information serves as the sample smoothing label.
  • each of the sample identification tags includes a text sequence corresponding to the sample identification tag; the homophone tag determination subunit 402 is used to determine the sample pinyin sequence corresponding to the sample identification tag, and the sample The pinyin sequence includes several pinyin tags corresponding to the character sequence corresponding to the sample identification tag; based on the preset homophone dictionary, the at least one homophone tag corresponding to each pinyin tag is determined respectively.
  • the homophone label determining subunit 402 is configured to determine, based on the preset homophone dictionary, at least one homophone label and at least one non-homophone label corresponding to each of the pinyin labels;
  • the first distribution determining subunit 404 is further configured to obtain a preset probability coefficient and, according to the preset probability coefficient, determine the label probability of one or more of the sample recognition label, the homophone labels, and/or the non-homophone labels; the first distribution information is determined according to the label probabilities of one or more of the sample recognition label, the homophone labels, and/or the non-homophone labels.
  • the test recognition label acquisition unit 602 is further configured to determine second distribution information according to the test recognition label; the loss value calculation subunit 604 is further configured to calculate, based on the preset cross-entropy loss function, the cross-entropy term corresponding to the training samples; calculate, based on the preset KL distance formula, the KL distance between the first distribution information and the second distribution information as a KL penalty term; and calculate the loss value according to the KL penalty term and the cross-entropy term.
  • Fig. 9 shows an internal structure diagram of a computer device in an embodiment.
  • the computer device is not limited to any intelligent terminal, and may also be a server. In this embodiment, it is preferably an intelligent robot.
  • the computer device 90 includes a processor 901, a non-transitory memory 902, and a network interface 903 connected through a system bus.
  • the non-transitory memory 902 includes a non-volatile storage medium 9021 and an internal memory 9022.
  • the non-volatile storage medium of the computer device stores an operating system 9023 and may also store a computer program 9024 which, when executed by the processor, causes the processor to implement the label-smoothing-based speech recognition method.
  • a computer program 9025 may also be stored in the internal memory; when executed by the processor, it causes the processor to perform the label-smoothing-based speech recognition method.
  • FIG. 9 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution is applied.
  • a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • an intelligent terminal is proposed. As shown in FIG. 10, the intelligent terminal 1000 includes a non-transitory memory 1001 and a processor 1002.
  • the non-transitory memory 1001 stores a computer program 1003.
  • the processor 1002 is caused to execute the steps of: acquiring training data, the training data including a plurality of training samples, each of the training samples including a sample speech and a sample recognition label corresponding to the sample speech.
  • a non-transitory computer-readable storage medium 1100 which stores a computer program 1101.
  • the processor executes the steps of: acquiring training data, the training data including a plurality of training samples, each of the training samples including a sample speech and a sample recognition label corresponding to the sample speech.
  • the sample recognition label corresponding to the training sample is label-smoothed based on the preset homophone dictionary to obtain the corresponding sample smoothed label; the speech recognition model is then trained with the training samples and the sample smoothed labels.
  • the corresponding loss value is calculated based on the preset loss function, and back-propagation is performed based on the loss value to complete the training of the speech recognition model.
  • the label smoothing of the training samples takes homophones into account, giving homophone labels a higher probability than other, non-homophone labels, thereby improving the accuracy of recognizing Chinese speech containing homophones and the overall accuracy of speech recognition.
  • in addition, the KL distance, which measures the difference between the speech recognition model's test recognition label and the sample smoothed label, is added as a penalty term; the loss value obtained by this calculation improves the speech recognition effect and the accuracy of subsequent speech recognition.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A method for training a speech recognition model based on label smoothing, comprising: acquiring training data, the training data comprising a plurality of training samples, each training sample comprising a sample speech and a sample recognition label corresponding to the sample speech (S102); performing label smoothing on the sample recognition label based on a preset homophone dictionary to obtain a sample smoothed label (S104); training a preset speech recognition model using the training samples and the sample smoothed labels, and computing a loss value corresponding to the training samples based on a preset loss function (S106); and performing back-propagation according to the loss value to complete the training of the preset speech recognition model (S108). A speech recognition model trained with this method achieves higher recognition accuracy.

Description

Speech recognition method, terminal and medium based on label smoothing

Technical Field

This application relates to the field of artificial intelligence, and in particular to a label-smoothing-based speech recognition method, an intelligent terminal, and a computer-readable storage medium.

Background

With the rapid development of the mobile internet and artificial intelligence, speech recognition is applied ever more widely in AI and many other fields, and improving its accuracy has become a very important task. However, models trained with the related training methods still fall short in accuracy; in Chinese speech recognition in particular, the probability of producing wrong characters is relatively high.

In other words, the speech recognition model training methods in the related art suffer from insufficient accuracy in subsequent speech recognition.

Summary

In view of the above problems, this application proposes a label-smoothing-based speech recognition method, apparatus, intelligent terminal, and computer-readable storage medium.
In a first aspect of this application, a label-smoothing-based speech recognition method is proposed.

A label-smoothing-based speech recognition method, comprising:

acquiring training data, the training data comprising a plurality of training samples, each of the training samples comprising a sample speech and a sample recognition label corresponding to the sample speech;

performing label smoothing on the sample recognition label based on a preset homophone dictionary to obtain a sample smoothed label;

training a preset speech recognition model using the training samples and the sample smoothed label, and computing a loss value corresponding to the training samples based on a preset loss function;

performing back-propagation according to the loss value to complete the training of the preset speech recognition model.
In a second aspect of this application, an intelligent terminal is proposed.

An intelligent terminal, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:

acquiring training data, the training data comprising a plurality of training samples, each of the training samples comprising a sample speech and a sample recognition label corresponding to the sample speech;

performing label smoothing on the sample recognition label based on a preset homophone dictionary to obtain a sample smoothed label;

training a preset speech recognition model using the training samples and the sample smoothed label, and computing a loss value corresponding to the training samples based on a preset loss function;

performing back-propagation according to the loss value to complete the training of the preset speech recognition model.
In a third aspect of this application, a computer-readable storage medium is proposed.

A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:

acquiring training data, the training data comprising a plurality of training samples, each of the training samples comprising a sample speech and a sample recognition label corresponding to the sample speech;

performing label smoothing on the sample recognition label based on a preset homophone dictionary to obtain a sample smoothed label;

training a preset speech recognition model using the training samples and the sample smoothed label, and computing a loss value corresponding to the training samples based on a preset loss function;

performing back-propagation according to the loss value to complete the training of the preset speech recognition model.
Implementing the embodiments of this application has the following beneficial effects:

With the above label-smoothing-based speech recognition method, intelligent terminal, and computer-readable storage medium, during training the sample recognition labels of the training samples are label-smoothed based on the preset homophone dictionary to obtain the corresponding sample smoothed labels; the speech recognition model is then trained with the training samples and the sample smoothed labels, the corresponding loss value is computed with the preset loss function, and back-propagation is performed based on the loss value to complete the training of the model. Because the label smoothing of the training samples takes homophones into account, giving homophone labels a higher probability than other, non-homophone labels, the accuracy of recognizing Chinese speech containing homophones, and hence the overall accuracy of speech recognition, is improved.

Further, in this embodiment, besides measuring the loss with cross-entropy, the loss function adds the KL distance, which measures the difference between the model's test recognition label and the sample smoothed label, as a penalty term. The loss value obtained in this way trains the speech recognition model better, improving the recognition effect and the accuracy of subsequent speech recognition.
Brief Description of the Drawings

To describe the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

In the drawings:

Fig. 1 is a diagram of the application environment of a label-smoothing-based speech recognition method according to an embodiment of this application;

Fig. 2 is a schematic flowchart of a label-smoothing-based speech recognition method according to an embodiment of this application;

Fig. 3 is a schematic flowchart of the label smoothing process applied to sample recognition labels according to an embodiment of this application;

Fig. 4 is a schematic structural diagram of the speech recognition model in an embodiment of this application;

Fig. 5 is a schematic flowchart of the loss value computation in an embodiment of this application;

Fig. 6 is a schematic structural diagram of a label-smoothing-based speech recognition apparatus in an embodiment of this application;

Fig. 7 is a schematic structural diagram of the loss value calculation module in an embodiment of this application;

Fig. 8 is a schematic structural diagram of the label smoothing processing module in an embodiment of this application;

Fig. 9 is a schematic structural diagram of a computer device running the above label-smoothing-based speech recognition method in an embodiment of this application;

Fig. 10 is a schematic structural diagram of an intelligent terminal in an embodiment of this application;

Fig. 11 is a schematic structural diagram of a non-transitory computer-readable storage medium in an embodiment of this application.
Detailed Description

The technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art without creative effort based on the embodiments in this application fall within the protection scope of this application.

Fig. 1 is a diagram of the application environment of a label-smoothing-based speech recognition method in an embodiment. Referring to Fig. 1, the method can be applied to a speech recognition system comprising a terminal 110 and a server 120 connected via a network. The terminal 110 may specifically be an intelligent robot, a desktop terminal, or a mobile terminal, the mobile terminal being at least one of a mobile phone, a tablet computer, a laptop, etc.; in this embodiment the terminal 110 is not limited to any particular intelligent terminal. The server 120 may be implemented as an independent server or as a cluster of servers. The terminal 110 performs recognition on the speech segments to be recognized, while the server 120 handles model training and prediction.

In another embodiment, the speech recognition system to which the above method is applied may also be implemented on the terminal 110 alone: the terminal 110 performs model training and prediction and converts the speech segments to be recognized into text.
As shown in Fig. 2, in an embodiment a label-smoothing-based speech recognition method is provided. The method can be applied to either a terminal or a server; this embodiment takes application to a terminal as an example. The method specifically comprises the following steps:

Step S102: acquire training data, the training data comprising a plurality of training samples, each of the training samples comprising a sample speech and a sample recognition label corresponding to the sample speech.

In this embodiment, to train the speech recognition model, the training data must first be constructed. The training data is a training database composed of multiple training samples, each comprising the sample speech corresponding to a speech segment (the sample speech may be the speech segment itself or speech features extracted from it) and the sample recognition label corresponding to that sample speech. The sample recognition label is the text recognition label corresponding to the speech segment, i.e. a text sequence. In a specific embodiment, the sample recognition label may be a manually annotated label for the speech segment.

Note that, when the sample speech is the speech features corresponding to a speech segment, the corresponding speech feature vectors must first be extracted from the segment with a preset feature extraction algorithm, and the extracted feature vectors are then used as the aforementioned sample speech.

In this embodiment, the speech segment corresponding to each training sample is a sentence; that is, both the training of the speech recognition model and the subsequent recognition operate on sentences rather than on individual words or characters. The sample speech of a training sample is therefore the speech feature vectors corresponding to the sentence — for example, one speech feature vector per character vector; in other embodiments, multiple character vectors may correspond to one speech feature vector.
Step S104: perform label smoothing on the sample recognition label based on a preset homophone dictionary to obtain a sample smoothed label.

Chinese has many homophones, and homophones affect speech recognition; they must therefore also be considered when training the speech recognition model.

In this embodiment, for the label database used in speech recognition, the labels (i.e. characters) sharing the same pinyin are determined in advance from the pinyin of each label. For example, chu2: 厨除橱雏滁锄躇刍蜍蹰. That is, for a given pinyin, the Chinese corpus and dictionary information at hand can be used to enumerate which labels (characters) that pinyin maps to. In this embodiment, the preset homophone dictionary contains all labels sharing the same pinyin.
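The homophone dictionary described above can be sketched as a simple mapping from a pinyin syllable to every character that shares it. The following is a minimal illustration; the tiny character list and the `build_homophone_dict` helper are hypothetical stand-ins for a real corpus and dictionary:

```python
from collections import defaultdict

def build_homophone_dict(char_to_pinyin):
    """Invert a character -> pinyin mapping into pinyin -> list of characters."""
    homophones = defaultdict(list)
    for char, pinyin in char_to_pinyin.items():
        homophones[pinyin].append(char)
    return dict(homophones)

# Hypothetical miniature dictionary; a real system would cover the full character set.
char_to_pinyin = {"厨": "chu2", "除": "chu2", "橱": "chu2", "打": "da3", "开": "kai1"}
homophone_dict = build_homophone_dict(char_to_pinyin)
print(homophone_dict["chu2"])  # ['厨', '除', '橱']
```

A lookup such as `homophone_dict["chu2"]` then yields all labels sharing that pinyin, which is exactly the grouping the smoothing step below relies on.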
Label smoothing is a regularization method that smooths the labels to prevent over-fitting. In concrete terms, label smoothing assigns a certain selectable probability of other labels to the annotated label (such as the aforementioned sample recognition label).

In this embodiment, a label database — i.e. a character database — is constructed for commonly used Chinese characters. For example, the two-level character set of the national standard GB2312, which contains 6763 Chinese characters, may be used; in that case the label database contains 6763 labels.

The sample recognition label corresponding to the sample speech of a training sample comprises the labels of multiple characters. In this embodiment, each label of the sample speech must be smoothed: for each label in the sample recognition label, a probability is added that other labels in the label database are the recognition label. Specifically, based on the homophone dictionary, the labels in the sample recognition label are smoothed over their homophones to determine the corresponding sample smoothed label.
In a specific embodiment, the label smoothing of the sample recognition label in this step specifically comprises steps S402 to S406 shown in Fig. 3:

Step S402: based on the preset homophone dictionary, determine at least one homophone label corresponding to each sample recognition label.

For each character (i.e. each label) contained in the sample recognition label, the homophone labels corresponding to that label are determined from its pinyin; each label may correspond to multiple homophone labels. For example, if the sample recognition label contains 厨 (chu2), the corresponding homophone labels may be: 除、橱、雏、滁、锄、躇、刍、蜍、蹰. Further, in this embodiment, the non-homophone labels corresponding to the sample recognition label must also be determined. Specifically, when the sample recognition label contains 厨 (chu2), the non-homophone labels are all labels in the label database other than the sample recognition label itself and its homophone labels.

The sample recognition label is the recognition label corresponding to the training sample, a text sequence. In this step, the pinyin corresponding to the sample recognition result is determined according to Chinese pinyin, and the corresponding pinyin sequence — the sample pinyin sequence — is constructed. Because the sample recognition label corresponds to a sentence, the sample pinyin sequence also contains the multiple pinyin entries corresponding to the sentence's characters and words. For a training sample, the pinyin of each of the characters (labels) in the sample recognition label must be determined, yielding the corresponding sample pinyin sequence, which comprises the pinyin labels corresponding to the text sequence formed by the characters of the sample recognition label. Then, for each pinyin label, the homophones corresponding to that pinyin are determined from the preset homophone dictionary and used as homophone labels.

Take as an example the training sample whose sample speech corresponds to the speech segment for 打开厨房换气扇 ("turn on the kitchen ventilation fan"), together with the corresponding sample recognition label (the text sequence of 打开厨房换气扇).

Constructing the sample pinyin sequence means determining the pinyin sequence corresponding to 打开厨房换气扇, which is: "da kai chu fang huan qi shan".

In this embodiment, because Chinese contains many homophones, the character cannot be determined from a single character's pinyin alone; the corresponding pinyin sequence must be generated in the context of the words of the whole sentence.
Step S404: based on the determined homophone labels, perform label smoothing on the sample recognition label and determine first distribution information corresponding to the sample recognition label.

Step S406: use the first distribution information as the sample smoothed label.

In this embodiment, during label smoothing the probabilities over all labels in the label database are determined neither from character-frequency statistics nor as a random distribution, but according to homophony. That is, in the smoothing process, a label's probability is determined by whether it is a homophone label of the sample recognition label.

Specifically, the sample recognition label is smoothed based on the previously determined homophone labels and/or the corresponding non-homophone labels, thereby determining the sample smoothed label.

In this embodiment, the label probabilities of the sample recognition label and of its corresponding homophone and non-homophone labels are determined from a preset probability distribution, each label's probability being determined according to whether it is a homophone label. In a specific implementation, a preset probability coefficient is obtained, and the label probability of one or more of the sample recognition label, the homophone labels, and/or the non-homophone labels is determined from it. The preset probability coefficient contains the label probabilities of one or more of the sample recognition label, the homophone labels, and/or the non-homophone labels, from which the probability of each label can be determined. The corresponding probability distribution — the first distribution information — is then determined from the label probabilities of one or more of the sample recognition label, the homophone labels, and/or the non-homophone labels.

For example, for the pinyin label chu2 (whose character in the sample recognition label is 厨), with N homophone labels and a label database of M characters in total, the constructed first distribution information is:

[P(厨) = 0.6, each homophone label = 0.3/N, every other label = (1-0.6-0.3)/(M-N-1)]

That is, 厨 has 9 homophone labels, so the M-dimensional vector [厨 除 橱 雏 滁 锄 躇 刍 蜍 蹰 ...] = [... 厨 = 0.6, ... each homophone = 0.3/9, ... each other label = (1-0.6-0.3)/(M-10), ...]. In other words, 厨 itself is assigned a label probability of 0.6, the homophone labels together share a probability of 0.3, and all remaining non-homophone labels together share the remaining probability of 0.1.

Note that other probability coefficients, e.g. 0.7, 0.2, 0.1, may also be set as needed, depending on the requirements of the model design.

In another embodiment, in determining the probability of each homophone label and non-homophone label above, instead of using equal probabilities, each label's probability can be determined individually — for example, from character-frequency statistics observed during speech recognition.

For example, for the homophone labels 除橱雏滁锄躇刍蜍蹰 of 厨, the probabilities of these 9 homophone labels sum to 0.3, but the probability of each may be distributed either uniformly, as above, or according to other factors. Specifically, for the homophone labels, the probability coefficient corresponding to the group is determined (e.g. the aforementioned 0.3), and a probability coefficient for each homophone label is then determined in a preset way (e.g. from character-frequency statistics), such that the per-label coefficients sum to the group coefficient (e.g. 0.3). The probability coefficients of the non-homophone labels can be determined in the same way. Determining the probabilities in this manner — considering character-frequency statistics and other factors in addition to homophony — makes the label smoothing more principled, improves the effectiveness of model training, and improves the accuracy of subsequent speech recognition.
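The 0.6 / 0.3 / 0.1 split described above can be sketched as follows. This is a minimal illustration: the six-character label list and the default coefficients are illustrative assumptions, not the patent's fixed values.

```python
def smooth_label(target, homophones, all_labels, p_target=0.6, p_homo=0.3):
    """Build the first distribution: the target keeps p_target, its homophones
    share p_homo equally, and all remaining labels share the leftover mass."""
    homophones = [h for h in homophones if h != target]
    others = [c for c in all_labels if c != target and c not in homophones]
    p_other = 1.0 - p_target - p_homo
    dist = {}
    for c in all_labels:
        if c == target:
            dist[c] = p_target
        elif c in homophones:
            dist[c] = p_homo / len(homophones)
        else:
            dist[c] = p_other / len(others)
    return dist

labels = ["厨", "除", "橱", "打", "开", "房"]
d = smooth_label("厨", ["除", "橱"], labels)
print(round(sum(d.values()), 6))  # 1.0
```

The resulting dictionary is a valid probability distribution over the label set, with homophones ranked above all non-homophone labels — the property the loss function below depends on.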
Step S106: train a preset speech recognition model using the training samples and the sample smoothed labels, and compute a loss value corresponding to the training samples based on a preset loss function.

In this step, the model to be trained — the preset speech recognition model — must first be determined. In a specific implementation, a speech recognition model, e.g. a neural network model, may first be constructed according to the requirements of the recognition task. In a specific embodiment, the constructed neural network model is an end-to-end neural network model.

Referring to Fig. 4, which shows a specific model structure of a speech recognition model: x_1, ..., x_T are the inputs of the model, e.g. speech features extracted from the speech segment; y_1, y_2, ... are its outputs, e.g. the recognition labels (characters or character vectors) of the segment. 301 is a first sub-network module that processes the input features, with intermediate variables h_2, ..., h_T. 302 and 303 are a second and a third sub-network module, respectively, which compute the corresponding recognition labels from the output of the first sub-network module 301; z_2, z_4, ... are intermediate variables of the second sub-network module 302, and r_0, ..., r_L, q_0, ..., q_L, c_1, c_2, ... are intermediate variables of the third sub-network module 303. The final output of the model, y_1, y_2, ..., is then determined from the recognition labels of the second sub-network module 302 and the third sub-network module 303.

Note that the output of the above speech recognition model comprises the recognized characters together with a confidence for each character, the confidences of all characters summing to 100%.

In other embodiments, the constructed neural network model may also be a BP neural network, a Hopfield network, an ART network, a Kohonen network, etc.

In this step, the preset speech recognition model is trained with the training samples determined in step S102 and the sample smoothed labels obtained from label smoothing, and the corresponding loss value is computed to evaluate the training effect of the speech recognition model.

In this embodiment, the loss value is computed in one embodiment with a cross-entropy loss function — the cross-entropy between the predicted label probabilities and the constructed label probabilities. That is, in this step, the cross-entropy loss function computes the distance between the label probabilities predicted by the current speech recognition model and the sample smoothed labels obtained after label smoothing. Note that in other embodiments, other loss function calculation methods may also be used to compute the loss value.

In a specific embodiment, training the preset speech recognition model on the training samples may proceed by feeding the sample speech of a training sample into the model and obtaining the model's output — the recognition label computed by the speech recognition model, i.e. the test recognition label. The corresponding loss value is then computed from the test recognition label, the sample recognition label, and the sample smoothed label, according to the preset loss function.
Specifically, as shown in Fig. 5, the loss computation in the above step S106 comprises steps S202 to S204:

Step S202: input the sample speech into the preset speech recognition model and obtain the test recognition label output by the preset speech recognition model.

In this embodiment, during training, for each training sample the sample speech is used as input and the sample recognition label as output to train the constructed speech recognition model (i.e. the preset speech recognition model), so that the trained model acquires speech recognition capability.

Specifically, in this step, the sample speech is input into the speech recognition model, and the corresponding test recognition label is obtained through the model. The test recognition label is the recognition label output by the model for the sample speech, a corresponding text sequence.

Step S204: compute the loss value between the test recognition label and the sample smoothed label according to the preset loss function.

The loss computation is described in detail below.

As described above, label smoothing is applied to the sample recognition label based on the homophone dictionary to obtain the corresponding sample smoothed label, and the corresponding first distribution information is constructed at the same time.

In step S202, after the sample speech of the training sample is fed into the speech recognition model, the test recognition label output by the model is obtained, and second distribution information — the probability distribution determined from the test recognition label — is determined. The second distribution information is determined from the test recognition label of the current speech recognition model and describes the distribution of each label in the test recognition label.

In computing the loss value, the preset loss function measures the difference between the second distribution information (the predicted output distribution) and the first distribution information (the desired output distribution), thereby estimating the loss between the first distribution information constructed by label smoothing and the distribution predicted by the speech recognition model (i.e. the second distribution information).
In this embodiment, the above loss function is computed in two parts: a cross-entropy part and a KL divergence part.

Specifically, for the cross-entropy computation, a preset cross-entropy loss function computes the cross-entropy between the second distribution information (the predicted output distribution) and the first distribution information (the desired output distribution) as the cross-entropy term, which measures how close the two are.

In a specific implementation, the cross-entropy value between the first distribution information and the second distribution information is computed with the preset cross-entropy loss formula as the cross-entropy term:

Σ log p_θ(y|x),

and in the concrete loss computation the loss value takes the form of the negative entropy:

L(θ) = -Σ log p_θ(y|x),

where L(θ) is the loss value.

For the KL divergence computation, a preset KL distance formula computes the KL distance between the second distribution information (the predicted output distribution) and the first distribution information (the desired output distribution) as the KL penalty term. The KL divergence (also called the KL distance, Kullback–Leibler divergence, or relative entropy) describes the difference between two probability distributions. That is, in this embodiment, the preset KL distance formula (also called the KL divergence formula) can compute the difference between the first and second distribution information, which is used in computing the corresponding loss value.

In a specific implementation, the KL distance between the first distribution information and the second distribution information is computed with the preset KL distance formula as the loss value:

D_KL(p(d_y|y) || p_θ(y|x)),

where y is the character (i.e. the test recognition result and the sample recognition result), x denotes the sample speech features, θ denotes the parameters of the speech recognition model, and d_y is the first distribution information described above. The constructed KL distance D_KL(p(d_y|y) || p_θ(y|x)) thus measures the distance between the second distribution information and the first distribution information.

In the concrete loss computation, the loss value takes the form of the negative KL distance:

L(θ) = -D_KL(p(d_y|y) || p_θ(y|x)),

where L(θ) is the loss value.

Further, based on the above KL distance D_KL(p(d_y|y) || p_θ(y|x)), the corresponding loss function can be constructed. That is, in this step, the loss function comprises both the cross-entropy term and the KL penalty term:

L(θ) = -Σ log p_θ(y|x) - D_KL(p(d_y|y) || p_θ(y|x)),

where L(θ) is the loss value, Σ log p_θ(y|x) is the cross-entropy term, and D_KL(p(d_y|y) || p_θ(y|x)) is the KL penalty term.

As described above, the first distribution information is constructed by label-smoothing the sample recognition label based on the homophone dictionary; in it, homophones have a higher probability than other characters, so Chinese speech recognition can properly account for homophones being more probable than other characters.

Compared with the usual KL computation D_KL(u || p_θ(y|x)), in which u is a fixed uniform or unigram distribution, this embodiment replaces u with the homophone-aware smoothed distribution, giving D_KL(p(d_y|y) || p_θ(y|x)), so that homophones are properly accounted for in the loss computation; the speech recognition model subsequently trained with this loss value can better recognize Chinese speech containing homophones.

That is, in this embodiment, by constructing the homophone dictionary, the influence of Chinese homophone information on the training of the speech recognition model is taken into account: not only the frequency of the character corresponding to each Chinese character during recognition, but also the effect of the occurrence of homophones, using homophony as one piece of knowledge for constructing the prior distribution, thereby improving the recognition accuracy for Chinese speech, which is rich in homophones.

Note that in this embodiment the loss function used to compute the loss value may also be another function and is not limited to the above computation.
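The combination of a cross-entropy term and a KL penalty pulling the predicted distribution toward the homophone-aware prior can be sketched per token with plain Python. This is a simplified illustration in the standard minimization form (both terms as positive penalties), with a four-label toy distribution; it is not the patent's exact implementation.

```python
import math

def smoothed_loss(pred, prior, target_idx):
    """Per-token loss: -log p_theta(y|x) + KL(prior || pred).

    pred:       model's predicted probabilities over the label set
    prior:      first distribution (homophone-smoothed label)
    target_idx: index of the ground-truth character in the label set
    """
    ce = -math.log(pred[target_idx])                                  # cross-entropy term
    kl = sum(p * math.log(p / q) for p, q in zip(prior, pred) if p > 0)  # KL penalty term
    return ce + kl

pred = [0.7, 0.2, 0.05, 0.05]   # model output (softmax probabilities)
prior = [0.6, 0.3, 0.05, 0.05]  # target, one homophone, two other labels
loss = smoothed_loss(pred, prior, 0)
print(loss > 0)  # True
```

When the prediction matches the prior exactly, the KL term vanishes and only the negative log-likelihood of the target remains, which is the intended behavior of a penalty term.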
Step S108: perform back-propagation according to the loss value to complete the training of the preset speech recognition model.

After the loss value is computed, back-propagation can be performed on the speech recognition model according to the loss value to complete its training. In a specific embodiment, back-propagation may be performed with stochastic gradient descent to train the model; in other embodiments, other algorithms may be used for back-propagation. Alternatively, a preset optimizer is used for back-propagation to train the speech recognition model, where the preset optimizer may be one of AdagradOptimizer, MomentumOptimizer, or AdamOptimizer, or another optimizer. The specific method of back-propagation is not limited in this embodiment and can be implemented with the back-propagation methods of the related art.

Unlike English speech recognition, Chinese speech recognition falls short because of homophones. The label smoothing of related speech recognition schemes considers only the probability distribution of characters; such smoothing can improve the performance of a neural network model to some extent, but remains insufficient for Chinese with its many homophones. The homophone-dictionary-based label smoothing proposed in this embodiment, together with the loss function comprising cross-entropy and KL distance built on the prior distribution constructed from the smoothing result, accounts for the effect of homophones on the recognition result: in constructing the prior distribution, the probability of homophones is set higher than that of other characters, ensuring homophones have a higher probability than other characters, which improves the label-smoothing effect of training, the performance of the speech recognition model, and the recognition accuracy.

That is, in this embodiment, adding to the loss function the cross-entropy between the test-recognition-label distribution and the distribution corresponding to the sample smoothed label, together with the penalty term corresponding to the KL distance, mitigates the network's over-confidence, thereby reducing over-fitting and giving a better label-smoothing effect, which improves the performance of the neural network model. Moreover, over-confidence of the neural network model implies a low-entropy distribution; by adding, on top of the cross-entropy, the KL distance between the test-recognition-label distribution and the distribution corresponding to the sample smoothed label, and optimizing the training of the speech recognition model with the loss value, label smoothing works still better, the recognition effect of the model improves, and over-fitting of the network is reduced.

Training the speech recognition model by the back-propagation algorithm with the loss value computed by the above loss function allows the trained model to properly account for the influence of homophones on Chinese speech recognition, achieving a good recognition effect.

Further, during training of the speech recognition model by the back-propagation algorithm, the stopping condition of training must also be considered. Specifically, the step of performing back-propagation according to the loss value to complete the training of the preset speech recognition model further comprises: judging whether the loss value is less than a preset loss threshold; and, when the loss value is less than the preset loss threshold, determining that the training of the preset speech recognition model is complete. The loss value represents the distance or difference between the test recognition label and the sample recognition label; if the two are sufficiently close or equal, the accuracy of the trained speech recognition model meets the requirement and training can end. In this embodiment, a loss threshold is set, e.g. 0.05: when the loss value is below this threshold, the speech recognition model is deemed trained; otherwise, training continues with the training samples in the training data.

That is, in this embodiment, setting a loss threshold both prevents the training of the speech recognition model from running without end and further improves the accuracy of training.
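The threshold-based stopping rule just described can be sketched as a small wrapper around the training step. Here `train_step` is a hypothetical stand-in for one forward/backward pass that returns the current loss; the threshold 0.05 follows the example above, and the epoch cap is an added safeguard, not part of the patent's description.

```python
def train_until_converged(train_step, loss_threshold=0.05, max_epochs=1000):
    """Run training steps until the loss drops below the threshold
    (training complete) or the epoch budget is exhausted."""
    loss = float("inf")
    for epoch in range(max_epochs):
        loss = train_step()
        if loss < loss_threshold:
            return epoch, loss  # training complete
    return max_epochs, loss

# Toy stand-in: a fixed sequence of shrinking losses, mimicking a converging model.
losses = iter([0.8, 0.4, 0.2, 0.1, 0.04])
epoch, final = train_until_converged(lambda: next(losses))
print(epoch, final)  # 4 0.04
```

The loop stops at the first epoch whose loss falls below the threshold, matching the "continue training otherwise" behavior described above.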
In this embodiment, a label-smoothing-based speech recognition apparatus is also proposed.

Specifically, referring to Fig. 6, the above label-smoothing-based speech recognition apparatus comprises:

a training data acquisition module 102, configured to acquire training data, the training data comprising a plurality of training samples, each of the training samples comprising a sample speech and a sample recognition label corresponding to the sample speech;

a label smoothing processing module 104, configured to perform label smoothing on the sample recognition label based on a preset homophone dictionary to obtain a sample smoothed label;

a loss value calculation module 106, configured to train a preset speech recognition model using the training samples and the sample smoothed label and compute a loss value corresponding to the training samples based on a preset loss function;

a back-propagation training module 108, configured to perform back-propagation according to the loss value to complete the training of the preset speech recognition model.

In one embodiment, the preset speech recognition model is an end-to-end neural network model.

In one embodiment, as shown in Fig. 7, the loss value calculation module 106 further comprises:

a test recognition label acquisition unit 602, configured to input the sample speech into the preset speech recognition model and obtain the test recognition label output by the preset speech recognition model;

a loss value calculation subunit 604, configured to compute the loss value between the test recognition label and the sample smoothed label according to the preset loss function.

In one embodiment, as shown in Fig. 8, the label smoothing processing module 104 further comprises:

a homophone label determining subunit 402, configured to determine, based on the preset homophone dictionary, at least one homophone label corresponding to each sample recognition label;

a first distribution determining subunit 404, configured to perform label smoothing on the sample recognition label based on the determined homophone labels, determine first distribution information corresponding to the sample recognition label, and use the first distribution information as the sample smoothed label.

In one embodiment, each of the sample recognition labels comprises a text sequence corresponding to that sample recognition label; the homophone label determining subunit 402 is configured to determine the sample pinyin sequence corresponding to the sample recognition label, the sample pinyin sequence comprising several pinyin labels corresponding to the text sequence of the sample recognition label, and to determine, based on the preset homophone dictionary, the at least one homophone label corresponding to each pinyin label.

In one embodiment, the homophone label determining subunit 402 is configured to determine, based on the preset homophone dictionary, at least one homophone label and at least one non-homophone label corresponding to each of the pinyin labels.

In one embodiment, the first distribution determining subunit 404 is further configured to obtain a preset probability coefficient; determine, according to the preset probability coefficient, the label probability of one or more of the sample recognition label, the homophone labels, and/or the non-homophone labels; and determine the first distribution information according to the label probabilities of one or more of the sample recognition label, the homophone labels, and/or the non-homophone labels.

In one embodiment, the test recognition label acquisition unit 602 is further configured to determine second distribution information according to the test recognition label; the loss value calculation subunit 604 is further configured to compute, based on the preset cross-entropy loss function, the cross-entropy term corresponding to the training samples; compute, based on the preset KL distance formula, the KL distance between the first distribution information and the second distribution information as a KL penalty term; and compute the loss value according to the KL penalty term and the cross-entropy term.
Fig. 9 shows the internal structure of a computer device in an embodiment. The computer device is not limited to any particular intelligent terminal and may also be a server; in this embodiment it is preferably an intelligent robot. As shown in Fig. 9, the computer device 90 comprises a processor 901, a non-transitory memory 902, and a network interface 903 connected via a system bus. The non-transitory memory 902 comprises a non-volatile storage medium 9021 and an internal memory 9022. The non-volatile storage medium of the computer device stores an operating system 9023 and may also store a computer program 9024 which, when executed by the processor, causes the processor to implement the label-smoothing-based speech recognition method. The internal memory may also store a computer program 9025 which, when executed by the processor, causes the processor to perform the label-smoothing-based speech recognition method. Those skilled in the art will understand that the structure shown in Fig. 9 is only a block diagram of part of the structure related to the solution of this application and does not limit the computer devices to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, an intelligent terminal is proposed. As shown in Fig. 10, the intelligent terminal 1000 comprises a non-transitory memory 1001 and a processor 1002, the non-transitory memory 1001 storing a computer program 1003 which, when executed by the processor 1002, causes the processor 1002 to perform the following steps:

acquiring training data, the training data comprising a plurality of training samples, each of the training samples comprising a sample speech and a sample recognition label corresponding to the sample speech;

performing label smoothing on the sample recognition label based on a preset homophone dictionary to obtain a sample smoothed label;

training a preset speech recognition model using the training samples and the sample smoothed label, and computing a loss value corresponding to the training samples based on a preset loss function;

performing back-propagation according to the loss value to complete the training of the preset speech recognition model.
In one embodiment, as shown in Fig. 11, a non-transitory computer-readable storage medium 1100 is proposed, storing a computer program 1101 which, when executed by a processor, causes the processor to perform the following steps:

acquiring training data, the training data comprising a plurality of training samples, each of the training samples comprising a sample speech and a sample recognition label corresponding to the sample speech;

performing label smoothing on the sample recognition label based on a preset homophone dictionary to obtain a sample smoothed label;

training a preset speech recognition model using the training samples and the sample smoothed label, and computing a loss value corresponding to the training samples based on a preset loss function;

performing back-propagation according to the loss value to complete the training of the preset speech recognition model.
With the above label-smoothing-based speech recognition method, apparatus, intelligent terminal, and computer-readable storage medium, during training the sample recognition labels of the training samples are label-smoothed based on the preset homophone dictionary to obtain the corresponding sample smoothed labels; the speech recognition model is then trained with the training samples and the sample smoothed labels, the corresponding loss value is computed with the preset loss function, and back-propagation is performed based on the loss value to complete the training of the speech recognition model. Because the label smoothing of the training samples takes homophones into account, giving homophone labels a higher probability than other, non-homophone labels, the accuracy of recognizing Chinese speech containing homophones, and hence the accuracy of speech recognition, is improved.

Further, in this embodiment, besides measuring the loss with cross-entropy, the loss function adds the KL distance, which measures the difference between the model's test recognition label and the sample smoothed label, as a penalty term. The loss value obtained by this calculation improves the speech recognition effect and the accuracy of subsequent speech recognition.
Those of ordinary skill in the art will understand that all or part of the flows in the methods of the above embodiments can be accomplished by instructing the relevant hardware through a computer program; the program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the flows of the embodiments of the above methods. Any reference to memory, storage, database, or other media used in the embodiments provided by this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), Rambus dynamic RAM (RDRAM), etc.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered within the scope of this specification.

The above embodiments express only several implementations of this application, and their description is specific and detailed, but they should not be construed as limiting the patent scope of this application. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of this application, all of which fall within the protection scope of this application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

  1. A label-smoothing-based speech recognition method, characterized by comprising:
    acquiring training data, the training data comprising a plurality of training samples, each of the training samples comprising a sample speech and a sample recognition label corresponding to the sample speech;
    performing label smoothing on the sample recognition label based on a preset homophone dictionary to obtain a sample smoothed label;
    training a preset speech recognition model using the training samples and the sample smoothed label, and computing a loss value corresponding to the training samples based on a preset loss function;
    performing back-propagation according to the loss value to complete the training of the preset speech recognition model.
  2. The label-smoothing-based speech recognition method according to claim 1, characterized in that the step of training a preset speech recognition model using the training samples and the sample smoothed label and computing a loss value corresponding to the training samples based on a preset loss function further comprises:
    inputting the sample speech into the preset speech recognition model and obtaining the test recognition label output by the preset speech recognition model;
    computing the loss value between the test recognition label and the sample smoothed label according to the preset loss function.
  3. The label-smoothing-based speech recognition method according to claim 2, characterized in that the step of performing label smoothing on the sample recognition label based on a preset homophone dictionary to obtain a sample smoothed label further comprises:
    determining, based on the preset homophone dictionary, at least one homophone label corresponding to each sample recognition label;
    performing label smoothing on the sample recognition label based on the determined homophone labels, and determining first distribution information corresponding to the sample recognition label;
    using the first distribution information as the sample smoothed label.
  4. The label-smoothing-based speech recognition method according to claim 3, characterized in that each of the sample recognition labels comprises a text sequence corresponding to that sample recognition label;
    the step of determining, based on the preset homophone dictionary, at least one homophone label corresponding to each sample recognition label further comprises:
    determining a sample pinyin sequence corresponding to the sample recognition label, the sample pinyin sequence comprising several pinyin labels corresponding to the text sequence of the sample recognition label;
    determining, based on the preset homophone dictionary, the at least one homophone label corresponding to each pinyin label.
  5. The label-smoothing-based speech recognition method according to claim 3, characterized in that the step of determining, based on the preset homophone dictionary, a plurality of homophone labels corresponding to each sample recognition label comprises:
    determining, based on the preset homophone dictionary, at least one homophone label and at least one non-homophone label corresponding to each of the pinyin labels;
    the step of performing label smoothing on the sample recognition label based on the determined homophone labels and determining first distribution information corresponding to the sample recognition label further comprises:
    obtaining a preset probability coefficient, and determining, according to the preset probability coefficient, the label probability of one or more of the sample recognition label, the homophone labels, and/or the non-homophone labels;
    determining the first distribution information according to the label probability of one or more of the sample recognition label, the homophone labels, and/or the non-homophone labels.
  6. The label-smoothing-based speech recognition method according to claim 3, characterized in that the step of inputting the sample speech into the preset speech recognition model and obtaining the test recognition label output by the preset speech recognition model further comprises:
    determining second distribution information according to the test recognition label;
    the step of computing the loss value corresponding to the training samples based on the preset loss function further comprises:
    computing, based on a preset cross-entropy loss function, a cross-entropy term corresponding to the training samples;
    computing, based on a preset KL distance formula, the KL distance between the first distribution information and the second distribution information as a KL penalty term;
    computing the loss value according to the KL penalty term and the cross-entropy term.
  7. The label-smoothing-based speech recognition method according to claim 1, characterized in that the preset speech recognition model is an end-to-end neural network model.
  8. The label-smoothing-based speech recognition method according to claim 1, characterized in that the step of performing back-propagation according to the loss value to complete the training of the preset speech recognition model further comprises:
    judging whether the loss value is less than a preset loss threshold;
    when the loss value is less than the preset loss threshold, determining that the training of the preset speech recognition model is complete.
  9. A non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
    acquiring training data, the training data comprising a plurality of training samples, each of the training samples comprising a sample speech and a sample recognition label corresponding to the sample speech;
    performing label smoothing on the sample recognition label based on a preset homophone dictionary to obtain a sample smoothed label;
    training a preset speech recognition model using the training samples and the sample smoothed label, and computing a loss value corresponding to the training samples based on a preset loss function;
    performing back-propagation according to the loss value to complete the training of the preset speech recognition model.
  10. An intelligent terminal, comprising a non-transitory memory and a processor, the non-transitory memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
    acquiring training data, the training data comprising a plurality of training samples, each of the training samples comprising a sample speech and a sample recognition label corresponding to the sample speech;
    performing label smoothing on the sample recognition label based on a preset homophone dictionary to obtain a sample smoothed label;
    training a preset speech recognition model using the training samples and the sample smoothed label, and computing a loss value corresponding to the training samples based on a preset loss function;
    performing back-propagation according to the loss value to complete the training of the preset speech recognition model.
PCT/CN2020/088422 2020-04-30 2020-04-30 Speech recognition method, terminal and medium based on label smoothing WO2021217619A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/088422 WO2021217619A1 (zh) 2020-04-30 2020-04-30 Speech recognition method, terminal and medium based on label smoothing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/088422 WO2021217619A1 (zh) 2020-04-30 2020-04-30 Speech recognition method, terminal and medium based on label smoothing

Publications (1)

Publication Number Publication Date
WO2021217619A1 true WO2021217619A1 (zh) 2021-11-04

Family

ID=78373171

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/088422 WO2021217619A1 (zh) 2020-04-30 2020-04-30 基于标签平滑的语音识别方法、终端及介质

Country Status (1)

Country Link
WO (1) WO2021217619A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114511848A (zh) * 2021-12-30 2022-05-17 广西慧云信息技术有限公司 Grape phenological period identification method and system based on an improved label smoothing algorithm

Citations (3)

Publication number Priority date Publication date Assignee Title
CN109545190A (zh) * 2018-12-29 2019-03-29 联动优势科技有限公司 A keyword-based speech recognition method
CN110738997A (zh) * 2019-10-25 2020-01-31 百度在线网络技术(北京)有限公司 Information correction method and apparatus, electronic device, and storage medium
CN111066082A (zh) * 2018-05-25 2020-04-24 北京嘀嘀无限科技发展有限公司 Speech recognition system and method

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN111066082A (zh) * 2018-05-25 2020-04-24 北京嘀嘀无限科技发展有限公司 Speech recognition system and method
CN109545190A (zh) * 2018-12-29 2019-03-29 联动优势科技有限公司 A keyword-based speech recognition method
CN110738997A (zh) * 2019-10-25 2020-01-31 百度在线网络技术(北京)有限公司 Information correction method and apparatus, electronic device, and storage medium

Non-Patent Citations (1)

Title
YI ZHENG; XIANJIE YANG; XUYONG DANG: "Homophone-based Label Smoothing in End-to-End Automatic Speech Recognition", ARXIV.ORG, 7 April 2020 (2020-04-07), pages 1 - 4, XP081639388 *

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN114511848A (zh) * 2021-12-30 2022-05-17 广西慧云信息技术有限公司 Grape phenological period identification method and system based on an improved label smoothing algorithm
CN114511848B (zh) * 2021-12-30 2024-05-14 广西慧云信息技术有限公司 Grape phenological period identification method and system based on an improved label smoothing algorithm


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20933276

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20933276

Country of ref document: EP

Kind code of ref document: A1