CN114520001A - Voice recognition method, device, equipment and storage medium - Google Patents

Voice recognition method, device, equipment and storage medium

Info

Publication number
CN114520001A
CN114520001A
Authority
CN
China
Prior art keywords
speech
training
recognition result
voice
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210281930.3A
Other languages
Chinese (zh)
Inventor
万根顺
王磊奇
潘嘉
高建清
刘聪
胡国平
刘庆峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202210281930.3A
Publication of CN114520001A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/197 — Probabilistic grammars, e.g. word n-grams
    • G10L15/063 — Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/1822 — Parsing for meaning understanding
    • G10L15/26 — Speech to text systems
    • G10L15/30 — Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L2015/0631 — Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a voice recognition method, a voice recognition device, voice recognition equipment and a storage medium. The method comprises: obtaining a voice to be recognized, and recognizing it based on a voice recognition model obtained through pre-training. The voice recognition model is obtained through two stages of training: the first stage trains with the goal of making the recognition result of training speech consistent with the labeled text of the training speech, and the second stage trains with the goal of balancing the text unit error rate and the semantic acceptability of the speech recognition result of the training speech. The voice recognition method provided by the invention can obtain voice recognition results with higher user acceptability.

Description

Voice recognition method, device, equipment and storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, device, and storage medium.
Background
Speech recognition technology recognizes speech as text. Current speech recognition schemes are based on a speech recognition model: the speech to be recognized is recognized by a speech recognition model obtained through pre-training so as to obtain a recognition result.
The speech recognition model used in current speech recognition schemes is usually trained based on the cross entropy criterion alone; however, such a model has limited recognition performance, so it is difficult to obtain a good recognition effect when performing speech recognition with it.
Disclosure of Invention
In view of this, the present invention provides a voice recognition method, apparatus, device and storage medium to solve the problem that the recognition effect of current voice recognition schemes is not good. The technical scheme is as follows:
a speech recognition method, comprising:
acquiring a voice to be recognized;
recognizing the voice to be recognized based on a voice recognition model obtained by pre-training;
the speech recognition model is obtained through two stages of training: the first stage trains with the goal of making the recognition result of the training speech consistent with the labeled text of the training speech, and the second stage trains with the goal of balancing the text unit error rate and the semantic acceptability of the speech recognition result of the training speech.
Optionally, a speech recognition baseline model is obtained in the first stage of training, and the speech recognition baseline model is trained in the second stage;
training the speech recognition baseline model includes:
recognizing the training voice based on the voice recognition baseline model to obtain a plurality of candidate recognition results of the training voice;
determining a text unit error rate and a semantic change evaluation index corresponding to each candidate recognition result, wherein the semantic change evaluation index can reflect semantic change of the corresponding candidate recognition result relative to a labeled text of training voice;
determining the prediction loss of the speech recognition baseline model on each candidate recognition result by combining the text unit error rate and the semantic change evaluation index corresponding to each candidate recognition result;
and updating parameters of the voice recognition baseline model according to the determined prediction loss.
Optionally, the determining, by combining the text unit error rate and the semantic change evaluation index corresponding to each candidate recognition result, a prediction loss of the speech recognition baseline model on each candidate recognition result includes:
for each candidate recognition result:
determining the weight corresponding to the candidate recognition result according to the text unit error rate and the semantic change evaluation index corresponding to the candidate recognition result;
And determining the prediction loss of the speech recognition baseline model on the candidate recognition result according to the weight corresponding to the candidate recognition result and the prediction probability corresponding to the candidate recognition result.
Optionally, the semantic change evaluation index includes: part-of-speech and/or syntactic category deviation;
the part-of-speech deviation degree can reflect the deviation degree of the corresponding candidate recognition result in part-of-speech relative to the labeled text of the training speech;
the syntactic category deviation degree can reflect the deviation degree of the corresponding candidate recognition result relative to the labeled text of the training voice in the syntactic category.
Optionally, the determining the weight corresponding to the candidate recognition result according to the text unit error rate and the semantic change evaluation index corresponding to the candidate recognition result includes:
calculating the difference between the text unit error rate corresponding to the candidate recognition result and the average text unit error rate, the difference between the part-of-speech deviation degree corresponding to the candidate recognition result and the average part-of-speech deviation degree, and the difference between the syntactic category deviation degree corresponding to the candidate recognition result and the average syntactic category deviation degree, wherein the average text unit error rate is the average of the text unit error rates corresponding to the candidate recognition results, the average part-of-speech deviation degree is the average of the part-of-speech deviation degrees corresponding to the candidate recognition results, and the average syntactic category deviation degree is the average of the syntactic category deviation degrees corresponding to the candidate recognition results;
and fusing the calculated difference values, and taking the fusion result as the weight corresponding to the candidate recognition result.
Optionally, determining a part-of-speech deviation degree corresponding to one candidate recognition result includes:
determining the part of speech of each word contained in the labeled text of the training voice, and determining the part of speech of the word contained in the unaligned part of the candidate recognition result and the labeled text of the training voice;
and determining the part-of-speech deviation degree corresponding to the candidate recognition result according to the part-of-speech of each word contained in the tagged text of the training voice and the part-of-speech of the word contained in the unaligned part of the candidate recognition result and the tagged text of the training voice.
Optionally, the determining, according to the part of speech of each word included in the tagged text of the training speech and the part of speech of the word included in the unaligned portion of the candidate recognition result and the tagged text of the training speech, a part of speech deviation degree corresponding to the candidate recognition result includes:
determining the part-of-speech deviation weight of the candidate recognition result relative to the part-of-speech of the labeled text of the training voice according to the part-of-speech weight of the word contained in the unaligned part of the candidate recognition result and the labeled text of the training voice, wherein the part-of-speech weight of a word can represent the importance degree of the part-of-speech of the word;
Summing the weights of the parts of speech of all words contained in the labeled text of the training voice to obtain a part of speech weight sum;
and determining the part-of-speech deviation degree corresponding to the candidate recognition result according to the part-of-speech weight and the part-of-speech deviation weight of the candidate recognition result relative to the labeled text of the training voice.
Optionally, determining a syntax category deviation corresponding to a candidate recognition result includes:
determining the dependency relationship corresponding to each word in the labeled text of the training voice, and determining the dependency relationship changed by the candidate recognition result and the word of the unaligned part of the labeled text;
and determining the syntactic category deviation degree corresponding to the candidate recognition result according to the dependency relationship corresponding to each word in the labeled text of the training voice and the dependency relationship changed by the words of the unaligned part of the candidate recognition result and the labeled text.
Optionally, the determining, according to the dependency relationship corresponding to each word in the labeled text of the training speech and the dependency relationship changed by the word of the unaligned portion of the candidate recognition result and the labeled text, the syntax category deviation degree corresponding to the candidate recognition result includes:
determining the deviation weight of the candidate recognition result relative to the dependency relationship of the labeled text of the training speech according to the weight of the dependency relationship changed by the candidate recognition result and the words of the unaligned part of the labeled text;
Determining the sum of the weights of the dependency relationship corresponding to each word in the labeled text of the training speech to obtain the sum of the weights of the dependency relationship;
and determining the syntax category deviation degree corresponding to the candidate recognition result according to the dependency relationship weight and the dependency relationship deviation weight of the candidate recognition result relative to the labeled text of the training voice.
A speech recognition apparatus, comprising: a voice acquisition module and a voice recognition module;
the voice acquisition module is used for acquiring the voice to be recognized;
the voice recognition module is used for recognizing the voice to be recognized based on a voice recognition model obtained by pre-training;
the speech recognition model is obtained through two stages of training: the first stage trains with the goal of making the recognition result of the training speech consistent with the labeled text of the training speech, and the second stage trains with the goal of balancing the text unit error rate and the semantic acceptability of the speech recognition result of the training speech.
A speech recognition device comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the speech recognition method described in any one of the above.
A readable storage medium, having stored thereon a computer program which, when executed by a processor, carries out the steps of the speech recognition method of any of the above.
The invention provides a voice recognition method, a device, equipment and a storage medium. A voice to be recognized is first obtained and then recognized based on a voice recognition model obtained through pre-training. The voice recognition model is obtained through two stages of training: the first stage trains with the goal of making the recognition result of the training speech consistent with the labeled text of the training speech, and the second stage trains with the goal of balancing the text unit error rate and the semantic acceptability of the speech recognition result of the training speech. Because the recognition performance of a model trained only toward consistency between the recognition result and the labeled text is limited (the user acceptability of the recognition results it produces is not high), the invention further trains that model with the text unit error rate and the semantic acceptability of the speech recognition result of the training speech as targets. The speech recognition model obtained through this training has better performance: when the voice to be recognized is recognized with it, recognition results with higher user acceptability can be obtained, and the user experience is better.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or in the description of the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present invention; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a process for training a speech recognition baseline model by balancing text unit error rate and semantic acceptability of a speech recognition result of a training speech according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a speech recognition device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
In view of the fact that a speech recognition model trained based on the cross entropy criterion has poor recognition performance, which makes it difficult to obtain a good recognition effect when performing speech recognition with such a model, the inventors of the present invention studied the problem. The initial idea was:
Obtain the speech recognition model through two training stages: in the first stage, train based on the cross entropy criterion to obtain a speech recognition baseline model; in the second stage, further train the speech recognition baseline model with the text unit error rate as the constraint criterion. The model obtained in the second stage serves as the final speech recognition model, and the speech to be recognized is recognized based on it. On the basis of the speech recognition baseline model obtained through cross-entropy training, further training with the text unit error rate as the constraint criterion can improve the performance of the speech recognition model to a certain extent.
Studying this two-stage scheme, the inventors found that, compared with training based on the cross entropy criterion alone, it does improve the effect, but problems remain. Specifically, the text unit error rate only reflects statistics of the erroneously recognized text units in the recognition result and cannot directly reflect semantic-level information about the recognition result. Consequently, when speech recognition is performed with a speech recognition model trained with the text unit error rate as the constraint criterion, the acceptability of the obtained recognition results is still not high.
Illustratively, the recognition result obtained by recognizing a piece of speech by using the speech recognition model obtained by the two stages of training and the labeled text of the piece of speech are as follows:
Labeled text: if the foot soaking disease does not happen, the foot soaking disease is normal later.
Recognition result: if the foot is not soaked by the foot, and the foot is normally soaked by the foot.
Comparing the recognition result with the labeled text, the recognition result is wrong in only one word ("and" -> "sick"); although only one word is wrong, the influence on the semantics is very large, so the acceptability of the recognition result is not high.
Aiming at the problem of low acceptability of the recognition result, the inventors first considered increasing the proportion of the language model and selecting the result with the more reasonable linguistic distribution; however, because no unified fusion parameter for the acoustic and language components can be found, this scheme usually causes fluctuation of the recognition results. The inventors also considered jointly tuning the recognition result with upstream and downstream tasks, for example combining the speech recognition task with a recognition-result translation task so that the loss function of the translation task further reinforces a reasonable distribution of the speech recognition results; however, such joint training requires parallel data, and its requirements on data and tasks are high.
Since the above ideas for solving the problem of low acceptability of the recognition result are not feasible, the inventors continued intensive research and finally proposed a speech recognition method with a good effect, by which speech recognition results with high acceptability can be obtained. In one possible implementation, the hardware architecture involved in the voice recognition method provided by the present invention may include an electronic device with data processing capability. The electronic device may be any electronic product that can perform human-computer interaction with a user through voice, for example a smart phone, a smart speaker, a notebook computer, a tablet computer, a palm computer, a wearable device, a smart television or a vehicle-mounted terminal; the electronic device obtains the voice to be recognized and recognizes it according to the voice recognition method provided by the present invention. In another possible implementation, the hardware architecture may include an electronic device and a server. The electronic device may be any electronic product that can perform human-computer interaction with the user through voice; it connects to and communicates with the server through a wired or wireless communication network, obtains the voice to be recognized and sends it to the server through the communication network, and the server recognizes the voice to be recognized according to the voice recognition method provided by the present invention.
It should be understood by those skilled in the art that the above-described electronic devices and servers are merely exemplary, and that other existing or future electronic devices or servers, as may be suitable for use with the present invention, are intended to be included within the scope of the present invention and are hereby incorporated by reference.
Next, a speech recognition method provided by the present invention will be described by the following embodiments.
Referring to fig. 1, a schematic flow chart of a speech recognition method provided in an embodiment of the present invention is shown, where the method may include:
step S101: and acquiring the voice to be recognized.
Step S102: and recognizing the speech to be recognized based on the speech recognition model obtained by pre-training.
The speech recognition model is obtained through two stages of training, the first stage is trained by taking the recognition result of the training speech and the text marked by the training speech as targets, and the second stage is trained by taking the text unit error rate and the semantic acceptability of the speech recognition result of the training speech as targets.
It should be noted that, in the first stage, training may be performed based on the cross entropy criterion, and the trained model is used as the speech recognition baseline model. After the speech recognition baseline model is obtained, in order to improve the performance of the speech recognition model and enable it to output recognition results with high user acceptability, the present invention further trains the speech recognition baseline model with the goal of balancing the text unit error rate and the semantic acceptability of the speech recognition result of the training speech.
Training based on the cross entropy criterion is prior art and is not repeated here; this embodiment focuses on the process of training the speech recognition baseline model obtained in the first stage with the goal of balancing the text unit error rate and the semantic acceptability of the speech recognition result of the training speech.
Referring to fig. 2, a schematic flow chart of training a speech recognition baseline model with the goal of balancing text unit error rate and semantic acceptability of speech recognition results of a training speech may include:
step S201: and recognizing the training voice based on the voice recognition baseline model to obtain a plurality of candidate recognition results of the training voice.
The candidate recognition results of the training speech may be Nbest candidate recognition results.
Step S202: determine the text unit error rate and the semantic change evaluation index corresponding to each candidate recognition result.
The text unit error rate is a statistical index of the erroneously recognized text units in the corresponding candidate recognition result; it is not directly related to semantics. The semantic change evaluation index, in contrast, is an index that can reflect the semantic change of the corresponding candidate recognition result relative to the labeled text of the training speech.
The text unit error rate corresponding to a candidate recognition result is determined according to the candidate recognition result and the labeled text of the training speech, and more specifically, according to the unaligned portion of the candidate recognition result and the labeled text of the training speech.
Alternatively, the text unit error rate can be a word error rate WER, which can be calculated by the following formula:
WER = (S + D + I) / M    (1)
wherein S represents the number of words replaced, D represents the number of words deleted, I represents the number of words inserted, and M represents the total number of words of the labeled text of the training speech.
When determining the word error rate corresponding to a candidate recognition result, the portion of the candidate recognition result that is not aligned with the labeled text of the training speech may be determined first; then, for the unaligned portion, the number of substituted words, the number of deleted words and the number of inserted words are counted, together with the total number of words in the labeled text of the training speech; finally, the word error rate corresponding to the candidate recognition result is calculated by formula (1).
It should be noted that the text unit error rate in this embodiment is not limited to the word error rate; other text unit error rates, such as the character error rate, may also be used.
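For illustration, formula (1) can be computed with a standard Levenshtein alignment; a minimal sketch in Python follows (the names are illustrative, and the patent does not prescribe an implementation):

```python
# Minimal sketch: word error rate via Levenshtein edit distance.
# The minimum number of edits equals S + D + I for an optimal alignment.

def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """WER = (S + D + I) / M, where M = len(reference)."""
    m, n = len(reference), len(hypothesis)
    # dp[i][j] = minimum edits to turn reference[:i] into hypothesis[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i              # i deletions
    for j in range(n + 1):
        dp[0][j] = j              # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[m][n] / m if m else 0.0

# Example: one substitution in a four-word reference -> WER = 0.25.
print(word_error_rate("a b c d".split(), "a b x d".split()))
```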
Optionally, the semantic change evaluation index may include a part-of-speech deviation degree and/or a syntactic category deviation degree. The part-of-speech deviation degree can reflect the deviation, in part of speech, of the corresponding candidate recognition result relative to the labeled text of the training speech, and the syntactic category deviation degree can reflect the deviation, in syntactic category, of the corresponding candidate recognition result relative to the labeled text of the training speech. The specific determination of the part-of-speech deviation degree and the syntactic category deviation degree is described in the following embodiments.
It should be noted that the above-mentioned part-of-speech deviation and syntax category deviation are merely examples, and the semantic change evaluation index is not limited to the part-of-speech deviation and syntax category deviation, and other indexes that can reflect semantic changes of the corresponding candidate recognition result with respect to the labeled text of the training speech may be used.
Step S203: and determining the prediction loss of the speech recognition baseline model on each candidate recognition result by combining the text unit error rate and the semantic change evaluation index corresponding to each candidate recognition result.
The specific implementation process of step S203 includes performing, for each candidate recognition result:
Step a1: determine the weight corresponding to the candidate recognition result according to the text unit error rate and the semantic change evaluation index corresponding to the candidate recognition result.
Specifically, determining the weight corresponding to the candidate recognition result according to the text unit error rate and the semantic change evaluation index corresponding to the candidate recognition result includes:
Step a11: calculate the difference between the text unit error rate corresponding to the candidate recognition result and the average text unit error rate, the difference between the part-of-speech deviation degree corresponding to the candidate recognition result and the average part-of-speech deviation degree, and the difference between the syntactic category deviation degree corresponding to the candidate recognition result and the average syntactic category deviation degree.
The average text unit error rate is the average of the text unit error rates corresponding to the candidate recognition results (for example, the N-best candidate recognition results), the average part-of-speech deviation degree is the average of the part-of-speech deviation degrees corresponding to the candidate recognition results, and the average syntactic category deviation degree is the average of the syntactic category deviation degrees corresponding to the candidate recognition results.
For the jth candidate recognition result among the candidate recognition results, denote the corresponding text unit error rate by W(y_j, y), the corresponding part-of-speech deviation degree by P(y_j, y) and the corresponding syntactic category deviation degree by S(y_j, y), and denote the average text unit error rate, the average part-of-speech deviation degree and the average syntactic category deviation degree by W_avg, P_avg and S_avg, respectively. Step a11 then yields the three difference values W(y_j, y) − W_avg, P(y_j, y) − P_avg and S(y_j, y) − S_avg, wherein y_j represents the jth candidate recognition result and y represents the labeled text of the training speech.
Step a12: fuse the calculated difference values, and take the fusion result as the weight corresponding to the candidate recognition result.
Optionally, for the jth candidate recognition result among the candidate recognition results, the three difference values may be fused by the following formula:

weight_j = (W(y_j, y) − W_avg) + α · (P(y_j, y) − P_avg) + β · (S(y_j, y) − S_avg)    (2)

wherein α and β are balance coefficients.
It should be noted that, in this embodiment, the method of using equation (2) to fuse the difference values is not limited, and other methods may also be used, for example, directly summing the difference values.
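For illustration only, under the assumption that the fusion in formula (2) is the weighted sum written above, the weights of an N-best list could be computed as follows (the function name and the α/β values are illustrative, not prescribed by the patent):

```python
# Sketch of formula (2): deviations of each candidate's text unit error
# rate (WER), part-of-speech deviation degree (POSD) and syntactic
# category deviation degree (SCD) from their N-best averages, fused with
# balance coefficients alpha and beta.

def candidate_weights(wers, posds, scds, alpha=0.5, beta=0.5):
    n = len(wers)
    w_avg, p_avg, s_avg = sum(wers) / n, sum(posds) / n, sum(scds) / n
    return [(w - w_avg) + alpha * (p - p_avg) + beta * (s - s_avg)
            for w, p, s in zip(wers, posds, scds)]

# Candidates worse than the N-best average get positive weights,
# better-than-average candidates get negative weights.
print(candidate_weights([0.2, 0.4], [0.1, 0.3], [0.2, 0.2]))  # [-0.15, 0.15]
```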
Step a2: determine the prediction loss of the speech recognition baseline model on the candidate recognition result according to the weight corresponding to the candidate recognition result and the prediction probability corresponding to the candidate recognition result.
Specifically, the prediction probability corresponding to the candidate recognition result may be weighted by the weight corresponding to the candidate recognition result (i.e., the weight corresponding to the candidate recognition result is multiplied by the prediction probability corresponding to the candidate recognition result), and the obtained result is used as the prediction loss of the speech recognition baseline model on the candidate recognition result.
The prediction loss of the speech recognition baseline model on the jth candidate recognition result among the candidate recognition results can be expressed as:

loss_j = weight_j · P(y_j | x)    (3)

wherein P(y_j | x) is the prediction probability of the jth candidate recognition result y_j given the training speech x.
step S204: and updating parameters of the voice recognition baseline model according to the prediction loss of the voice recognition baseline model on each candidate recognition result.
Specifically, the predicted loss of the speech recognition baseline model on each candidate recognition result can be summed, the summed loss is used as the predicted loss of the speech recognition baseline model, and the speech recognition baseline model is subjected to parameter updating according to the predicted loss of the speech recognition baseline model. The prediction penalty of the speech recognition baseline model can be expressed as:
Figure BDA0003558079830000104
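For illustration, a minimal sketch of formulas (3) and (4) using the weights from formula (2); in a real trainer the prediction probabilities would be differentiable model outputs rather than plain floats:

```python
# Sketch of formulas (3)-(4): each candidate's weight multiplies its
# prediction probability, and the per-candidate losses are summed over
# the N-best list. Because the weights are centered on the N-best
# averages, minimizing this loss pushes probability mass away from
# worse-than-average candidates and toward better-than-average ones.

def nbest_loss(weights: list[float], probabilities: list[float]) -> float:
    return sum(w * p for w, p in zip(weights, probabilities))

print(nbest_loss([-0.15, 0.15], [0.7, 0.3]))  # -0.06
```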
The speech recognition model adopted in the speech recognition method provided by the embodiment of the present invention is obtained by two-stage training: first, a speech recognition baseline model is trained based on the cross entropy criterion; then the baseline model is further trained with the text unit error rate and the semantic acceptability of the speech recognition result of the training speech as targets. When training the speech recognition baseline model toward these targets, the embodiment of the present invention introduces, on top of the text unit error rate, the part-of-speech deviation degree and the syntactic category deviation degree, which reflect semantic changes, and performs discriminative training on the decoding results of speech recognition by combining the three, so the acceptability of the speech recognition result can be improved.
Next, the determination of the degree of part-of-speech deviation and the degree of syntactic category deviation will be described. Since the determination manners of the part-of-speech deviation and the syntax category deviation corresponding to each candidate recognition result are the same, this embodiment takes one candidate recognition result as an example, and introduces the determination manners of the part-of-speech deviation and the syntax category deviation.
First, the process of determining the part-of-speech deviation degree corresponding to one candidate recognition result is introduced.
The part-of-speech deviation degree is a statistical index of part-of-speech deviation. Considering that different parts of speech differ in their importance to semantics, the part-of-speech deviation degree in the present invention focuses on the portions whose part of speech changes, due to substitution errors, deletion errors and insertion errors, after the candidate recognition result of the training speech is aligned with the labeled text of the training speech. Note that if a recognition error occurs but causes no part-of-speech change, it is considered, from the viewpoint of part-of-speech deviation, that no deviation has occurred.
The process of determining the part-of-speech deviation degree corresponding to one candidate recognition result may include:
Step b1: determine the part of speech of each word contained in the labeled text of the training speech, and determine the parts of speech of the words contained in the unaligned portion of the candidate recognition result and the labeled text of the training speech.
When determining the part of speech of each word contained in the labeled text of the training speech, word segmentation is first performed on the labeled text to obtain the words it contains, and the part of speech of each segmented word is then determined.
When determining the parts of speech of the words in the unaligned portion of the candidate recognition result and the labeled text of the training speech, word segmentation is performed on the candidate recognition result, the word segmentation result of the labeled text is aligned with that of the candidate recognition result, the words in the unaligned portion are thereby determined, and their parts of speech are then determined.
In this embodiment, the parts of speech of the labeled text of the training speech and of the candidate recognition results of the training speech may be obtained by using a common word segmentation and part-of-speech tagging tool.
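For illustration, the alignment in step b1 can be sketched as follows, with difflib.SequenceMatcher standing in for whatever word-level aligner is actually used (the patent does not prescribe one):

```python
# Sketch of step b1: align the segmented labeled text with the segmented
# candidate recognition result and collect the words of the unaligned
# portions on both sides.
import difflib

def unaligned_words(reference: list[str], hypothesis: list[str]):
    matcher = difflib.SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    ref_words, hyp_words = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":                  # replace / delete / insert
            ref_words += reference[i1:i2]   # deleted or substituted words
            hyp_words += hypothesis[j1:j2]  # inserted or substituted words
    return ref_words, hyp_words

# Example: one substitution is found in the unaligned portion.
print(unaligned_words(["he", "was", "drowned"], ["he", "was", "smoked"]))
```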
Step b2: determine the part-of-speech deviation degree corresponding to the candidate recognition result according to the part of speech of each word contained in the labeled text of the training speech and the parts of speech of the words contained in the unaligned portion of the candidate recognition result and the labeled text of the training speech.
Specifically, the implementation process of step b2 may include: determining the part-of-speech deviation weight of the candidate recognition result relative to the labeled text of the training speech according to the part-of-speech weights of the words contained in the unaligned portion of the candidate recognition result and the labeled text, wherein the part-of-speech weight of a word represents the importance of that word's part of speech; summing the part-of-speech weights of the words contained in the labeled text of the training speech to obtain a part-of-speech weight sum; and determining the part-of-speech deviation degree corresponding to the candidate recognition result according to the part-of-speech weight sum and the part-of-speech deviation weight. Specifically, the ratio of the part-of-speech deviation weight of the candidate recognition result relative to the labeled text of the training speech to the part-of-speech weight sum may be calculated as the part-of-speech deviation degree corresponding to the candidate recognition result.
The part-of-speech deviation degree in this embodiment emphasizes, through the weights defined for the importance of parts of speech, the influence of part-of-speech changes on the rationality of the grammatical structure. The part-of-speech deviation degree POSD is calculated as:

POSD = ( Σ_j W_j ) / ( Σ_{i=1}^{N} W_i )    (5)

wherein W_j represents the part-of-speech weight of the jth word in the unaligned portion of the candidate recognition result and the labeled text of the training speech, N represents the total number of words contained in the labeled text of the training speech, and W_i represents the part-of-speech weight of the ith word contained in the labeled text of the training speech.
In order to determine the part-of-speech deviation degree, the present invention defines weights for the various parts of speech in advance. When defining these weights, the importance of the different parts of speech is defined based on the 863 part-of-speech tagging set, mainly according to a combination principle (a higher-level grammatical structure is formed by combining several lower-level grammatical structures in a certain hierarchy, and the overall fluency of a sentence is the sum of its local fluencies) and a predicate-center principle (the more likely a word can serve as the predicate of an independent sentence, the greater its influence on the legality of the grammatical structure):
(The full part-of-speech weight table is given as a figure in the original document; the examples below use, e.g., weight 6 for verbs (v) and nouns (n), 5 for aspect particles (asp) and 4 for pronouns (r).)
the following describes the process of determining the degree of part-of-speech bias with reference to two specific examples:
the first example is as follows: the part of speech of each word contained in the labeled text of the training speech and the candidate recognition result of the training speech are as follows:
Figure BDA0003558079830000132
Figure BDA0003558079830000141
Since the part of the candidate recognition result of the training speech that is not aligned with the labeled text of the training speech is drown (v) -> smoke (n), the numerator in formula (5) is the part-of-speech weight 6, and the denominator is the sum of the part-of-speech weights of the words contained in the labeled text of the training speech (6+4+1+2+3+4+6+6); that is, the part-of-speech deviation degree POSD corresponding to this candidate recognition result is 6/(6+4+1+2+3+4+6+6) = 0.1875.
The second example: the words and parts of speech of the labeled text of the training speech and of a candidate recognition result of the training speech are as follows:
(Word-segmentation and part-of-speech table given as a figure in the original document.)
The portions of this candidate recognition result that are not aligned with the labeled text of the training speech include: I (r) -> NULL, drown (v) -> smoke (n), and NULL -> as (asp). The numerator in formula (5) is therefore the sum of the part-of-speech weights 4, 6 and 5, and the denominator is the sum of the part-of-speech weights of the words contained in the labeled text of the training speech (6+4+1+2+3+4+6+6); that is, the part-of-speech deviation degree POSD corresponding to this candidate recognition result is (4+6+5)/(6+4+1+2+3+4+6+6) = 0.46875.
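For illustration, a minimal sketch of formula (5) that reproduces both worked examples; the partial weight table and the convention of counting one weight per changed position are read off the examples, and the default weight 1 for unlisted tags is an assumption:

```python
# Minimal sketch of formula (5). Only the part-of-speech weights that
# appear in the examples are known here (v = 6, n = 6, asp = 5, r = 4);
# the full table is in the patent's figure.

POS_WEIGHTS = {"v": 6, "n": 6, "asp": 5, "r": 4}

def posd(reference_weights: list[int], misaligned_tags: list[str]) -> float:
    """reference_weights: part-of-speech weight of every word in the
    labeled text; misaligned_tags: one tag per changed position in the
    unaligned portions (substitution, deletion or insertion)."""
    numerator = sum(POS_WEIGHTS.get(t, 1) for t in misaligned_tags)
    return numerator / sum(reference_weights)

ref = [6, 4, 1, 2, 3, 4, 6, 6]        # weights of the labeled text's words
print(posd(ref, ["v"]))               # example 1: 6/32 = 0.1875
print(posd(ref, ["r", "v", "asp"]))   # example 2: 15/32 = 0.46875
```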
Finally, the method for determining the syntactic category deviation degree is introduced.
The syntactic category deviation degree is a statistical index of syntactic category deviation. A sentence is composed of syntactic components arranged in a certain hierarchical structure, and the closeness of the relations between components is not uniform; once a close syntactic relation is destroyed, the sentence becomes harder for the user to understand. When a substitution error, a deletion error or an insertion error changes the syntactic relationship between components, a syntactic category deviation is said to have occurred. If a recognition error occurs but the dependency relationship is not changed, it is considered that no syntactic category deviation has occurred.
The process of determining the syntactic category deviation degree corresponding to one candidate recognition result may include:
Step c1: determine the dependency relationship corresponding to each word in the labeled text of the training speech, and determine the dependency relationships changed by the words of the unaligned portion of the candidate recognition result and the labeled text.
In this embodiment, the dependency relationships corresponding to the words in the labeled text of the training speech and in the candidate recognition result may be obtained by using a common syntactic (dependency) analysis tool.
And c2, determining the syntactic category deviation degree corresponding to the candidate recognition result according to the dependency relationship corresponding to each word in the labeled text of the training voice and the dependency relationship changed by the word of the unaligned part of the candidate recognition result and the labeled text.
Specifically, the specific implementation process of step c2 includes: determining the deviation weight of the candidate recognition result relative to the dependency relationship of the labeled text of the training speech according to the weight of the dependency relationship changed by the candidate recognition result and the words of the unaligned part of the labeled text; determining the sum of the weights of the dependency relationship corresponding to each word in the labeled text of the training speech to obtain the sum of the weights of the dependency relationship; and specifically, the ratio of the dependency relationship deviation weight of the candidate recognition result relative to the labeled text of the training speech to the dependency relationship weight sum can be calculated to be used as the syntax category deviation degree corresponding to the candidate recognition result. Wherein the weight of the dependency relationship can represent the importance degree of the dependency relationship.
The syntactic category deviation degree in this embodiment emphasizes the influence of changes in dependency relationships on the rationality of the syntactic structure. The syntactic category deviation degree SCD is calculated as:

SCD = ( Σ_j S_j ) / ( Σ_{i=1}^{N} S_i )    (6)

wherein S_j represents the weight of the jth dependency relationship changed by the words of the unaligned portion of the candidate recognition result and the labeled text of the training speech, N represents the total number of words contained in the labeled text of the training speech, and S_i represents the weight of the dependency relationship corresponding to the ith word in the labeled text of the training speech.
In order to determine the syntactic category deviation degree, the present invention defines the weight of each dependency relationship in advance. When defining these weights, the importance of the various dependency relationships is defined mainly according to a combination principle (a higher-level grammatical structure is formed by combining several lower-level grammatical structures in a certain hierarchy, and the overall fluency of a sentence is the sum of its local fluencies), a syntactic hierarchy principle (the lower the level at which components combine, the tighter the grammatical relation and the greater its influence on the acceptability of the grammatical structure) and a predicate-center principle (the stronger the predicate function of the dependency head, the greater the influence of the relation on the acceptability of the grammatical structure):
(The full dependency-relationship weight table is given as a figure in the original document; the example below uses, e.g., FOB = 5, VOB = 9, RAD = 6, ADV = 7, POB = 6, HED = 2 and CMP = 11.)
The following describes a method for determining the syntax category deviation with reference to a specific example:
Illustratively, the labeled text of the training speech is "the expletive me are all flooded by them" and a candidate recognition result of the training speech is "the expletive me are all smoked by them". The dependency relationship corresponding to each word in the labeled text of the training speech is as follows:
the (3,1,'RAD') of an expletive (1,7,'FOB') me (2,1,'VOB') are all (4,7,'ADV') flooded (7,0,'HED') by (5,7,'ADV') they (6,5,'POB') (8,7,'CMP')
wherein, in (i, j, dependency label), i represents the position of the corresponding word in the sentence, j represents the position of that word's dependency head, and the dependency label represents the dependency type.
Comparing the candidate recognition result of the training speech with the labeled text of the training speech, the word of the unaligned portion is "flooded -> smoke", and the dependency changes involved include "HED -> Root" and "CMP -> HED"; that is, among the dependency relationships corresponding to the words of the labeled text, "HED" and "CMP" are changed. Thus, the numerator in formula (6) is the sum of the weight 2 of "HED" and the weight 11 of "CMP", and the denominator is the sum of the weights (5+9+6+7+7+6+2+11) of the dependency relationships "FOB", "VOB", "RAD", "ADV" (twice), "POB", "HED" and "CMP" corresponding to the words contained in the labeled text of the training speech; that is, the syntactic category deviation degree SCD corresponding to this candidate recognition result is (2+11)/(5+9+6+7+7+6+2+11) ≈ 0.245.
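For illustration, a minimal sketch of formula (6) that reproduces the worked example; the dependency weights listed are only those recoverable from this example, the full table being in the patent's figure:

```python
# Minimal sketch of formula (6), mirroring the POSD computation but over
# dependency-relationship weights.

DEP_WEIGHTS = {"FOB": 5, "VOB": 9, "RAD": 6, "ADV": 7,
               "POB": 6, "HED": 2, "CMP": 11}

def scd(reference_deps: list[str], changed_deps: list[str]) -> float:
    """reference_deps: dependency label of every word in the labeled
    text; changed_deps: labels of the relations changed by the words of
    the unaligned portion."""
    numerator = sum(DEP_WEIGHTS[d] for d in changed_deps)
    denominator = sum(DEP_WEIGHTS[d] for d in reference_deps)
    return numerator / denominator

# Worked example from the text: the changed relations are HED and CMP.
ref = ["FOB", "VOB", "RAD", "ADV", "ADV", "POB", "HED", "CMP"]
print(scd(ref, ["HED", "CMP"]))  # (2 + 11) / 53 ≈ 0.245
```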
The part-of-speech deviation degree and the syntactic category deviation degree in this embodiment rely on part of speech and syntactic (dependency) category to emphasize the semantic changes of the candidate recognition results of the training speech relative to the labeled text of the training speech.
The following describes the voice recognition apparatus provided in the embodiment of the present invention, and the voice recognition apparatus described below and the voice recognition method described above may be referred to in correspondence with each other.
Referring to fig. 3, a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention is shown, which may include: a speech acquisition module 301 and a speech recognition module 302.
A voice obtaining module 301, configured to obtain a voice to be recognized.
And the voice recognition module 302 is configured to recognize the voice to be recognized based on a pre-trained voice recognition model.
The speech recognition model is obtained through two stages of training: the first stage trains with the goal of making the recognition result of the training speech consistent with the labeled text of the training speech, and the second stage trains with the goal of balancing the text unit error rate and the semantic acceptability of the speech recognition result of the training speech.
Optionally, the speech recognition apparatus provided in this embodiment may further include: and a model training module.
The model training module comprises: a first training submodule and a second training submodule.
And the first training submodule is used for training by taking the recognition result of the training voice and the text marked by the training voice as a target to obtain a voice recognition baseline model.
And the second training submodule is used for training the voice recognition baseline model by taking the text unit error rate and the semantic acceptability of the voice recognition result of the training voice as targets so as to obtain a final voice recognition model.
Optionally, when training the speech recognition baseline model with the text unit error rate and the semantic acceptability of the speech recognition result of the training speech as a target, the second training submodule is specifically configured to:
recognizing the training voice based on the voice recognition baseline model to obtain a plurality of candidate recognition results of the training voice;
determining a text unit error rate and a semantic change evaluation index corresponding to each candidate recognition result, wherein the semantic change evaluation index can reflect semantic change of the corresponding candidate recognition result relative to a labeled text of training voice;
Determining the prediction loss of the speech recognition baseline model on each candidate recognition result by combining the text unit error rate and the semantic change evaluation index corresponding to each candidate recognition result;
and updating parameters of the voice recognition baseline model according to the determined prediction loss.
Optionally, when determining the prediction loss of the speech recognition baseline model on each candidate recognition result by combining the text unit error rate and the semantic change evaluation index corresponding to each candidate recognition result, the second training sub-module is specifically configured to:
for each candidate recognition result:
determining the weight corresponding to the candidate recognition result according to the text unit error rate and the semantic change evaluation index corresponding to the candidate recognition result;
and determining the prediction loss of the speech recognition baseline model on the candidate recognition result according to the weight corresponding to the candidate recognition result and the prediction probability corresponding to the candidate recognition result.
Optionally, the second training submodule determines, in the semantic change evaluation index, that: part-of-speech and/or syntactic category deviations.
The part-of-speech deviation degree can reflect the deviation degree of the part-of-speech of the corresponding candidate recognition result relative to the labeled text of the training speech, and the syntax category deviation degree can reflect the deviation degree of the syntax category of the corresponding candidate recognition result relative to the labeled text of the training speech.
Optionally, when determining the weight corresponding to the candidate recognition result according to the text unit error rate and the semantic change evaluation index corresponding to the candidate recognition result, the second training sub-module is specifically configured to:
calculating the difference between the text unit error rate corresponding to the candidate recognition result and the average text unit error rate, the difference between the part-of-speech deviation degree corresponding to the candidate recognition result and the average part-of-speech deviation degree, and the difference between the syntactic category deviation degree corresponding to the candidate recognition result and the average syntactic category deviation degree, wherein the average text unit error rate is the average of the text unit error rates corresponding to the candidate recognition results, the average part-of-speech deviation degree is the average of the part-of-speech deviation degrees corresponding to the candidate recognition results, and the average syntactic category deviation degree is the average of the syntactic category deviation degrees corresponding to the candidate recognition results;
and fusing the calculated difference values, and taking the fusion result as the weight corresponding to the candidate recognition result.
Optionally, when determining the part-of-speech deviation degree corresponding to one candidate recognition result, the second training sub-module is specifically configured to:
determine the part-of-speech of each word contained in the labeled text of the training speech, and determine the part-of-speech of the words contained in the unaligned part of the candidate recognition result and the labeled text of the training speech;
and determine the part-of-speech deviation degree corresponding to the candidate recognition result according to the part-of-speech of each word contained in the labeled text of the training speech and the part-of-speech of the words contained in the unaligned part of the candidate recognition result and the labeled text of the training speech.
Optionally, when determining the part-of-speech deviation degree corresponding to the candidate recognition result according to the part-of-speech of each word contained in the labeled text of the training speech and the part-of-speech of the words contained in the unaligned part of the candidate recognition result and the labeled text, the second training sub-module is specifically configured to:
determine the part-of-speech deviation weight of the candidate recognition result relative to the labeled text of the training speech according to the part-of-speech weights of the words contained in the unaligned part of the candidate recognition result and the labeled text, wherein the part-of-speech weight of a word represents the importance of that word's part of speech;
sum the part-of-speech weights of all words contained in the labeled text of the training speech to obtain a part-of-speech weight sum;
and determine the part-of-speech deviation degree corresponding to the candidate recognition result according to the part-of-speech weight sum and the part-of-speech deviation weight of the candidate recognition result relative to the labeled text of the training speech.
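The following sketch mirrors these three steps: sum the part-of-speech weights of the words in the unaligned part, sum the part-of-speech weights over the whole labeled text, and take their ratio as the deviation degree. The POS_WEIGHT table, the tagger and the aligner are assumptions; the disclosure only requires that a word's part-of-speech weight reflect the importance of its part of speech:

```python
# Assumed importance table for parts of speech; content words weigh more.
POS_WEIGHT = {"NOUN": 1.0, "VERB": 1.0, "ADJ": 0.8, "ADV": 0.6, "AUX": 0.2}

def pos_deviation(candidate, reference, tagger, aligner):
    # Part-of-speech of every word in the labeled text of the training speech.
    ref_tags = tagger(reference)               # -> [(word, pos), ...]
    # Words in the part of the candidate that does not align with the labels.
    unaligned = aligner(candidate, reference)  # -> [word, ...]
    una_tags = tagger(" ".join(unaligned))
    # Part-of-speech deviation weight: summed weights of the unaligned words.
    deviation_weight = sum(POS_WEIGHT.get(pos, 0.5) for _, pos in una_tags)
    # Part-of-speech weight sum over the whole labeled text.
    weight_sum = sum(POS_WEIGHT.get(pos, 0.5) for _, pos in ref_tags)
    return deviation_weight / weight_sum if weight_sum else 0.0
```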
Optionally, when determining the syntax category deviation degree corresponding to a candidate recognition result, the second training sub-module is specifically configured to:
determine the dependency relationship corresponding to each word in the labeled text of the training speech, and determine the dependency relationships changed by the words of the unaligned part of the candidate recognition result and the labeled text;
and determine the syntax category deviation degree corresponding to the candidate recognition result according to the dependency relationship corresponding to each word in the labeled text of the training speech and the dependency relationships changed by the words of the unaligned part of the candidate recognition result and the labeled text.
Optionally, when determining the syntax category deviation degree corresponding to the candidate recognition result according to the dependency relationship corresponding to each word in the labeled text of the training speech and the dependency relationships changed by the words of the unaligned part of the candidate recognition result and the labeled text, the second training sub-module is specifically configured to:
determine the dependency relationship deviation weight of the candidate recognition result relative to the labeled text of the training speech according to the weights of the dependency relationships changed by the words of the unaligned part of the candidate recognition result and the labeled text;
sum the weights of the dependency relationships corresponding to the words in the labeled text of the training speech to obtain a dependency relationship weight sum;
and determine the syntax category deviation degree corresponding to the candidate recognition result according to the dependency relationship weight sum and the dependency relationship deviation weight of the candidate recognition result relative to the labeled text of the training speech.
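Analogously, a sketch of the syntax category deviation degree: the weights of the dependency relationships changed by the unaligned words are summed and normalized by the dependency relationship weight sum of the labeled text. The DEP_WEIGHT table, the dependency parser and the aligner are again assumptions:

```python
# Assumed importance table for dependency-relation labels.
DEP_WEIGHT = {"root": 1.0, "nsubj": 0.9, "obj": 0.9, "amod": 0.5, "punct": 0.1}

def syntax_deviation(candidate, reference, parser, aligner):
    # Dependency relation corresponding to each word in the labeled text.
    ref_deps = parser(reference)   # -> [(word, relation), ...]
    cand_deps = parser(candidate)
    unaligned = set(aligner(candidate, reference))
    # Dependencies changed by the words of the unaligned part.
    changed = [rel for word, rel in cand_deps
               if word in unaligned and (word, rel) not in ref_deps]
    deviation_weight = sum(DEP_WEIGHT.get(rel, 0.5) for rel in changed)
    # Dependency relationship weight sum over the labeled text.
    weight_sum = sum(DEP_WEIGHT.get(rel, 0.5) for _, rel in ref_deps)
    return deviation_weight / weight_sum if weight_sum else 0.0
```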
The speech recognition model adopted in the speech recognition apparatus provided by the embodiment of the invention is obtained by two-stage training: first, a speech recognition baseline model is obtained by training with the goal of making the recognition result of the training speech consistent with the labeled text of the training speech; then, the baseline model is further trained with the goal of balancing the text unit error rate and the semantic acceptability of the speech recognition result of the training speech. In this second stage, the embodiment of the invention introduces, on top of the text unit error rate, the part-of-speech deviation degree and the syntax category deviation degree, which reflect semantic changes, and performs discriminative training on the decoding results of speech recognition by combining the text unit error rate, the part-of-speech deviation degree and the syntax category deviation degree, so that the acceptability of the speech recognition result can be improved.
An embodiment of the present invention further provides a speech recognition device. Please refer to fig. 4, which shows a schematic structural diagram of the speech recognition device. The speech recognition device may include: at least one processor 401, at least one communication interface 402, at least one memory 403 and at least one communication bus 404;
in the embodiment of the present invention, there is at least one of each of the processor 401, the communication interface 402, the memory 403 and the communication bus 404, and the processor 401, the communication interface 402 and the memory 403 communicate with one another through the communication bus 404;
the processor 401 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits configured to implement embodiments of the present invention, or the like;
the memory 403 may include a high-speed RAM memory, and may further include a non-volatile memory, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program being configured for:
acquiring speech to be recognized;
recognizing the speech to be recognized based on a speech recognition model obtained by pre-training;
wherein the speech recognition model is obtained through two stages of training: the first stage is trained with the goal of making the recognition result of the training speech consistent with the labeled text of the training speech, and the second stage is trained with the goal of balancing the text unit error rate and the semantic acceptability of the speech recognition result of the training speech.
Optionally, the detailed functions and extended functions of the program may be as described above.
An embodiment of the present invention further provides a readable storage medium storing a program adapted to be executed by a processor, the program being configured for:
acquiring speech to be recognized;
recognizing the speech to be recognized based on a speech recognition model obtained by pre-training;
wherein the speech recognition model is obtained through two stages of training: the first stage is trained with the goal of making the recognition result of the training speech consistent with the labeled text of the training speech, and the second stage is trained with the goal of balancing the text unit error rate and the semantic acceptability of the speech recognition result of the training speech.
Optionally, the detailed functions and extended functions of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A speech recognition method, comprising:
acquiring speech to be recognized;
recognizing the speech to be recognized based on a speech recognition model obtained by pre-training;
wherein the speech recognition model is obtained through two stages of training: the first stage is trained with the goal of making the recognition result of the training speech consistent with the labeled text of the training speech, and the second stage is trained with the goal of balancing the text unit error rate and the semantic acceptability of the speech recognition result of the training speech.
2. The speech recognition method of claim 1, wherein the first stage of training obtains a speech recognition baseline model, and the second stage trains the speech recognition baseline model;
training the speech recognition baseline model comprises:
recognizing the training speech based on the speech recognition baseline model to obtain a plurality of candidate recognition results of the training speech;
determining a text unit error rate and a semantic change evaluation index corresponding to each candidate recognition result, wherein the semantic change evaluation index reflects the semantic change of the corresponding candidate recognition result relative to the labeled text of the training speech;
determining the prediction loss of the speech recognition baseline model on each candidate recognition result by combining the text unit error rate and the semantic change evaluation index corresponding to each candidate recognition result;
and updating the parameters of the speech recognition baseline model according to the determined prediction loss.
3. The speech recognition method of claim 2, wherein determining the prediction loss of the speech recognition baseline model on each candidate recognition result by combining the text unit error rate and the semantic change evaluation index corresponding to each candidate recognition result comprises:
For each candidate recognition result:
determining the weight corresponding to the candidate recognition result according to the text unit error rate and the semantic change evaluation index corresponding to the candidate recognition result;
and determining the prediction loss of the speech recognition baseline model on the candidate recognition result according to the weight corresponding to the candidate recognition result and the prediction probability corresponding to the candidate recognition result.
4. The speech recognition method of claim 2, wherein the semantic change evaluation index comprises: a part-of-speech deviation degree and/or a syntax category deviation degree;
the part-of-speech deviation degree reflects how far the corresponding candidate recognition result deviates in part-of-speech from the labeled text of the training speech;
and the syntax category deviation degree reflects how far the corresponding candidate recognition result deviates in syntax category from the labeled text of the training speech.
5. The speech recognition method of claim 3, wherein determining the weight corresponding to the candidate recognition result according to the text unit error rate and the semantic change evaluation index corresponding to the candidate recognition result comprises:
calculating the difference between the text unit error rate corresponding to the candidate recognition result and the average text unit error rate, the difference between the part-of-speech deviation degree corresponding to the candidate recognition result and the average part-of-speech deviation degree, and the difference between the syntax category deviation degree corresponding to the candidate recognition result and the average syntax category deviation degree, wherein the average text unit error rate is the average of the text unit error rates corresponding to the candidate recognition results, the average part-of-speech deviation degree is the average of the part-of-speech deviation degrees corresponding to the candidate recognition results, and the average syntax category deviation degree is the average of the syntax category deviation degrees corresponding to the candidate recognition results;
and fusing the calculated differences, and taking the fusion result as the weight corresponding to the candidate recognition result.
6. The speech recognition method of claim 4, wherein determining the part-of-speech deviation degree corresponding to a candidate recognition result comprises:
determining the part-of-speech of each word contained in the labeled text of the training speech, and determining the part-of-speech of the words contained in the unaligned part of the candidate recognition result and the labeled text of the training speech;
and determining the part-of-speech deviation degree corresponding to the candidate recognition result according to the part-of-speech of each word contained in the labeled text of the training speech and the part-of-speech of the words contained in the unaligned part of the candidate recognition result and the labeled text of the training speech.
7. The speech recognition method of claim 6, wherein determining the part-of-speech deviation degree corresponding to the candidate recognition result according to the part-of-speech of each word contained in the labeled text of the training speech and the part-of-speech of the words contained in the unaligned part of the candidate recognition result and the labeled text of the training speech comprises:
determining the part-of-speech deviation weight of the candidate recognition result relative to the labeled text of the training speech according to the part-of-speech weights of the words contained in the unaligned part of the candidate recognition result and the labeled text, wherein the part-of-speech weight of a word represents the importance of that word's part of speech;
summing the part-of-speech weights of all words contained in the labeled text of the training speech to obtain a part-of-speech weight sum;
and determining the part-of-speech deviation degree corresponding to the candidate recognition result according to the part-of-speech weight sum and the part-of-speech deviation weight of the candidate recognition result relative to the labeled text of the training speech.
8. The speech recognition method of claim 4, wherein determining the syntax category deviation degree corresponding to a candidate recognition result comprises:
determining the dependency relationship corresponding to each word in the labeled text of the training speech, and determining the dependency relationships changed by the words of the unaligned part of the candidate recognition result and the labeled text;
and determining the syntax category deviation degree corresponding to the candidate recognition result according to the dependency relationship corresponding to each word in the labeled text of the training speech and the dependency relationships changed by the words of the unaligned part of the candidate recognition result and the labeled text.
9. The speech recognition method of claim 8, wherein determining the syntax category deviation degree corresponding to the candidate recognition result according to the dependency relationship corresponding to each word in the labeled text of the training speech and the dependency relationships changed by the words of the unaligned part of the candidate recognition result and the labeled text comprises:
determining the dependency relationship deviation weight of the candidate recognition result relative to the labeled text of the training speech according to the weights of the dependency relationships changed by the words of the unaligned part of the candidate recognition result and the labeled text;
summing the weights of the dependency relationships corresponding to the words in the labeled text of the training speech to obtain a dependency relationship weight sum;
and determining the syntax category deviation degree corresponding to the candidate recognition result according to the dependency relationship weight sum and the dependency relationship deviation weight of the candidate recognition result relative to the labeled text of the training speech.
10. A speech recognition apparatus, comprising: a speech acquisition module and a speech recognition module;
the speech acquisition module is configured to acquire speech to be recognized;
the speech recognition module is configured to recognize the speech to be recognized based on a speech recognition model obtained by pre-training;
wherein the speech recognition model is obtained through two stages of training: the first stage is trained with the goal of making the recognition result of the training speech consistent with the labeled text of the training speech, and the second stage is trained with the goal of balancing the text unit error rate and the semantic acceptability of the speech recognition result of the training speech.
11. A speech recognition device, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the speech recognition method according to any one of claims 1 to 9.
12. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the speech recognition method according to any one of claims 1 to 9.
CN202210281930.3A 2022-03-22 2022-03-22 Voice recognition method, device, equipment and storage medium Pending CN114520001A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210281930.3A CN114520001A (en) 2022-03-22 2022-03-22 Voice recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114520001A true CN114520001A (en) 2022-05-20

Family

ID=81600231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210281930.3A Pending CN114520001A (en) 2022-03-22 2022-03-22 Voice recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114520001A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116153337A (en) * 2023-04-20 2023-05-23 北京中电慧声科技有限公司 Synthetic voice tracing evidence obtaining method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination