CN113012701A - Identification method, identification device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113012701A
CN113012701A (application CN202110281812.8A)
Authority
CN
China
Prior art keywords
word
context information
error correction
feature
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110281812.8A
Other languages
Chinese (zh)
Other versions
CN113012701B (en)
Inventor
刘俊帅
夏光敏
王进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN202110281812.8A
Publication of CN113012701A
Application granted
Publication of CN113012701B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Abstract

The application provides a recognition method, a recognition device, an electronic device, and a storage medium. A speech recognition error correction model is trained with training data that comprises both data for correcting a training text and context information obtained by a punctuation prediction model from the training samples. This makes the training data of the speech recognition error correction model richer, ensures that the model learns richer context information, and improves the precision of the speech recognition error correction model. On this basis, third context information of the word features is determined from the first context information and the second context information of the word features, and the word features together with their third context information are input into the speech recognition error correction model, which improves the accuracy of error correction. Further, because the punctuation prediction model performs punctuation prediction on a recognition result that has already been corrected to higher accuracy, the accuracy of punctuation prediction can also be improved.

Description

Identification method, identification device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a recognition method, a recognition device, an electronic device, and a storage medium.
Background
At present, the recognition result of the speech recognition system may contain some errors, and in order to improve the accuracy of the recognition result, the error correction module may be used to correct the recognition result of the speech recognition system.
However, existing error correction modules have limited precision, which results in low error correction accuracy.
Disclosure of Invention
The application provides the following technical scheme:
one aspect of the present application provides an identification method, including:
acquiring word features of each word in a text to be processed, which is recognized by a speech recognition system;
inputting the word features into a punctuation prediction model to obtain first context information of the word features obtained by the punctuation prediction model;
inputting the word features into a speech recognition error correction model to obtain second context information of the word features obtained by the speech recognition error correction model, wherein the speech recognition error correction model is obtained by training with training data, and the training data comprises data for correcting the training text and context information obtained by the punctuation prediction model based on the training samples;
determining third context information of the word feature based on the first context information and the second context information of the word feature;
and inputting the word characteristics and the third context information of the word characteristics into the voice recognition error correction model to obtain a text obtained by performing error correction processing on the text to be processed by the voice recognition error correction model.
Determining third context information of the word feature based on the first context information and the second context information of the word feature, including:
and splicing the first context information and the second context information of the word features to obtain third context information.
Obtaining third context information of the word feature based on the first context information and the second context information of the word feature, including:
and performing dot product operation processing on the first context information and the second context information of the word features to obtain third context information.
Obtaining third context information of the word feature based on the first context information and the second context information of the word feature, including:
and inputting the first context information and the second context information of the word features into a first machine learning model for feature fusion to obtain third context information output by the first machine learning model.
The punctuation prediction model comprises a punctuation prediction submodel and an autoencoder;
the inputting the word features into a punctuation prediction model to obtain first context information of the word features obtained by the punctuation prediction model includes:
inputting the word features into the autoencoder, and obtaining parameters used when an intermediate layer of the autoencoder processes first sub-context information of a word feature to be processed, wherein the word feature to be processed is the first word feature arranged before the word features in the text to be processed;
obtaining a feature to be used based on the word features and the parameters used when the intermediate layer of the autoencoder processes the first sub-context information of the word feature to be processed;
inputting the feature to be used into an intermediate layer of the punctuation prediction submodel, and obtaining the first context information of the word features by processing the feature to be used with the intermediate layer of the punctuation prediction submodel.
The obtaining of the feature to be used based on the word features and the parameters used when the intermediate layer of the autoencoder processes the first sub-context information of the word feature to be processed includes:
multiplying the word features by the parameters used when the intermediate layer of the autoencoder processes the first sub-context information of the word feature to be processed, to obtain the feature to be used.
The obtaining of the feature to be used based on the word features and the parameters used when the intermediate layer of the autoencoder processes the first sub-context information of the word feature to be processed includes:
inputting the word features and the parameters used when the intermediate layer of the autoencoder processes the first sub-context information of the word feature to be processed into a second machine learning model for feature fusion, and obtaining the feature to be used output by the second machine learning model.
Another aspect of the present application provides an identification apparatus, including:
the acquisition module is used for acquiring the word characteristics of each word in the text to be processed, which is identified by the voice recognition system;
the first obtaining module is used for inputting the word features into a punctuation prediction model and obtaining first context information of the word features obtained by the punctuation prediction model;
a second obtaining module, configured to input the word feature into a speech recognition error correction model, and obtain second context information of the word feature, where the second context information is obtained by the speech recognition error correction model, the speech recognition error correction model is obtained by training with training data, and the training data includes data for correcting a training text and context information obtained by the punctuation prediction model based on the training sample;
a determining module, configured to determine third context information of the word feature based on the first context information and the second context information of the word feature;
and the third obtaining module is used for inputting the word characteristics and third context information of the word characteristics into the voice recognition error correction model to obtain a text obtained after the voice recognition error correction model performs error correction processing on the text to be processed.
A third aspect of the present application provides an electronic device comprising:
a memory and a processor.
A memory for storing at least one set of instructions;
a processor for calling and executing the set of instructions in the memory, by executing the set of instructions:
acquiring word features of each word in a text to be processed, which is recognized by a speech recognition system;
inputting the word features into a punctuation prediction model to obtain first context information of the word features obtained by the punctuation prediction model;
inputting the word features into a speech recognition error correction model to obtain second context information of the word features obtained by the speech recognition error correction model, wherein the speech recognition error correction model is obtained by training with training data, and the training data comprises data for correcting the training text and context information obtained by the punctuation prediction model based on the training samples;
determining third context information of the word feature based on the first context information and the second context information of the word feature;
and inputting the word characteristics and the third context information of the word characteristics into the voice recognition error correction model to obtain a text obtained by performing error correction processing on the text to be processed by the voice recognition error correction model.
A fourth aspect of the present application provides a storage medium storing a computer program which, when executed by a processor, implements the steps of the identification method according to any one of the above.
Compared with the prior art, the beneficial effects of the present application are:
in the present application, the speech recognition error correction model is trained with training data that comprises both data for correcting the training text and context information obtained by the punctuation prediction model from the training samples. The training data of the speech recognition error correction model is therefore richer, the model can learn richer context information, and the precision of the speech recognition error correction model is improved. On this basis, the third context information of the word features is determined from the first context information and the second context information of the word features, and the word features together with their third context information are input into the speech recognition error correction model, so that the accuracy of error correction of the speech recognition error correction model can be improved.
And on the basis of improving the accuracy of error correction of the identification result, the punctuation prediction model carries out punctuation prediction on the identification result with higher accuracy, so that the accuracy of punctuation prediction can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive labor.
Fig. 1 is a schematic flow chart of an identification method provided in embodiment 1 of the present application;
fig. 2 is a schematic flowchart of an identification method provided in embodiment 2 of the present application;
fig. 3 is a schematic flowchart of an identification method provided in embodiment 3 of the present application;
fig. 4 is a schematic flowchart of an identification method provided in embodiment 4 of the present application;
fig. 5 is a schematic flowchart of an identification method provided in embodiment 5 of the present application;
fig. 6 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to solve the above problem, the present application provides an identification method, and the identification method provided by the present application is described next.
Referring to fig. 1, a schematic flow chart of an identification method provided in embodiment 1 of the present application is shown, where the identification method provided in the present application may be applied to an electronic device, and the present application does not limit a product type of the electronic device, and as shown in fig. 1, the method may include, but is not limited to, the following steps:
step S101, obtaining word characteristics of each word in the text to be processed, which is recognized by the voice recognition system.
The process of obtaining the word feature of each word in the text to be processed recognized by the speech recognition system may include, but is not limited to:
when the speech recognition system recognizes the text to be processed, the word features of each word in the text to be processed are extracted.
Of course, the process of obtaining the word feature of each word in the text to be processed, which is recognized by the speech recognition system, may also include:
and searching word characteristics corresponding to each word in the text to be processed identified by the voice recognition system from a pre-constructed word characteristic database.
The word feature database is constructed in advance by extracting word features from a large number of texts and recording the mapping relations between the extracted word features and the words. This embodiment does not limit how the large number of texts is obtained: the texts may be downloaded from the network, or the texts recognized by the speech recognition system may be used as a specific implementation of obtaining a large amount of text.
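As a minimal sketch of the database lookup described above, the word feature database can be viewed as a mapping from words to feature vectors; every name and value below is illustrative, not taken from the patent:

```python
# Hypothetical word feature database: a mapping from words to feature vectors.
word_feature_db = {
    "speech": [0.12, 0.85, 0.33],
    "recognition": [0.41, 0.07, 0.66],
}

def lookup_features(text_words, db, dim=3):
    """Return the stored feature vector for each word; fall back to a
    zero vector for words missing from the database."""
    return [db.get(w, [0.0] * dim) for w in text_words]

features = lookup_features(["speech", "recognition", "error"], word_feature_db)
```

In practice the fallback for out-of-database words would more likely be a learned unknown-word embedding than a zero vector; the zero vector simply keeps the sketch self-contained.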
Step S102, inputting the word features into the punctuation prediction model, and obtaining first context information of the word features obtained by the punctuation prediction model.
The punctuation prediction model has the capability of punctuation prediction and of obtaining context information of word features. Specifically, when the word features are input into the punctuation prediction model, the punctuation prediction model can obtain both a punctuation prediction result and the first context information of the word features.
The punctuation prediction model may be, but is not limited to: a unidirectional long short-term memory (LSTM) recurrent neural network model or a bidirectional long short-term memory (BiLSTM) recurrent neural network model.
If the punctuation prediction model is a bidirectional long short-term memory (BiLSTM) recurrent neural network model, the word features are input into the punctuation prediction model, and the first context information of the word features obtained by the punctuation prediction model can be determined by, but is not limited to, the following formula:

h_t^punc = BiLSTM(x_{t+1}, h_{t+1}^punc)

In the above formula, h_t^punc represents the first context information of the word feature of the t-th word in the text to be processed, BiLSTM represents the bidirectional long short-term memory recurrent neural network model, x_{t+1} represents the word feature of the (t+1)-th word in the text to be processed, and h_{t+1}^punc represents the first context information of the word feature of the (t+1)-th word in the text to be processed.
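The recurrence above runs right to left: the context of word t is built from the next word's feature and the next word's context. A minimal numeric sketch, using a plain tanh cell in place of a full BiLSTM backward pass (weights, dimensions, and the zero padding at the sequence end are all illustrative assumptions):

```python
import numpy as np

def backward_context(xs, W_x, W_h):
    """Compute h_t from x_{t+1} and h_{t+1}, right to left, mirroring
    h_t = BiLSTM(x_{t+1}, h_{t+1}); zero vectors stand in for the
    out-of-range x and h past the last word."""
    T = len(xs)
    hidden = W_h.shape[0]
    h_next = np.zeros(hidden)
    hs = [None] * T
    for t in range(T - 1, -1, -1):
        x_next = xs[t + 1] if t + 1 < T else np.zeros(W_x.shape[1])
        h_t = np.tanh(W_x @ x_next + W_h @ h_next)
        hs[t] = h_t
        h_next = h_t
    return hs
```

A real BiLSTM cell also carries gates and a cell state; the sketch only shows the direction of information flow that the formula specifies.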
The word features are input into the punctuation prediction model, and in addition to the first context information of the word features, the punctuation prediction result obtained by the punctuation prediction model can also be obtained. The punctuation prediction result can be determined by the following formula:

y^punc = softmax(h^punc)

In the above formula, y^punc represents the punctuation prediction result, softmax() represents a probability normalization function, and h^punc represents the first context information of the word features in the text to be processed.
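A small sketch of this normalization step; the three punctuation classes and the logit values are hypothetical, not from the patent:

```python
import numpy as np

def softmax(h):
    # Subtract the max before exponentiating, for numerical stability.
    e = np.exp(h - np.max(h))
    return e / e.sum()

# Hypothetical logits over three punctuation classes: none, comma, period.
h_punc = np.array([2.0, 0.5, -1.0])
y_punc = softmax(h_punc)
predicted_class = int(np.argmax(y_punc))  # index of the most likely punctuation
```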
And step S103, inputting the word characteristics into the voice recognition error correction model, and obtaining second context information of the word characteristics obtained by the voice recognition error correction model.
In this embodiment, the speech recognition error correction model may be, but is not limited to: a unidirectional long short-term memory (LSTM) recurrent neural network model or a bidirectional long short-term memory (BiLSTM) recurrent neural network model.
In the case that the speech recognition error correction model is a bidirectional long short-term memory (BiLSTM) recurrent neural network model, the word features are input into the speech recognition error correction model, and the second context information of the word features obtained by the speech recognition error correction model can be determined by the following formula:

h_t^ec = BiLSTM(x_{t+1}, h_{t+1}^ec)

In the above formula, h_t^ec represents the second context information corresponding to the word feature of the t-th word in the text to be processed, BiLSTM represents the bidirectional long short-term memory recurrent neural network model, x_{t+1} represents the word feature of the (t+1)-th word in the text to be processed, and h_{t+1}^ec represents the second context information corresponding to the word feature of the (t+1)-th word in the text to be processed.
The speech recognition error correction model is obtained by training with training data, where the training data comprises data for correcting the training text and context information obtained by the punctuation prediction model based on the training samples.
Specifically, the training process of the speech recognition error correction model may include:
and S1031, obtaining word features of each word in the training text, punctuation marks of the training text and error correction labels labeled for each word.
S1032, inputting the plurality of word features into the punctuation prediction model, and obtaining punctuation prediction results obtained by the punctuation prediction model and first context information of the word features.
In this embodiment, the parameters of the punctuation prediction model may be obtained by training on a plurality of complete training texts in advance. Of course, the parameters of the punctuation prediction model may also be initially set parameters that have not yet been trained on complete training samples.
S1033, inputting the plurality of word features into the speech recognition error correction model, and obtaining second context information of each word feature obtained by the speech recognition error correction model.
S1034, determining third context information of the word features based on the first context information and the second context information of the word features.
If the parameters of the punctuation prediction model are obtained by training a plurality of complete training texts in advance, the punctuation prediction model is proved to have learned richer context information, so that a plurality of word features are input into the punctuation prediction model, the accuracy of the first context information of each word feature obtained by the obtained punctuation prediction model is higher, and the richness and the accuracy of the third context information of the word features can be further ensured.
The process of determining the third context information of the word feature based on the first context information and the second context information of the word feature may include, but is not limited to:
s10341, the first context information and the second context information of the word features are spliced to obtain third context information.
For example, step S10341 is described, where for example, if the first context information of the word feature is [ p1, p2, p3, …, pn ], the second context information of the word feature is [ e1, e2, e3, …, en ], and the first context information and the second context information of the word feature are subjected to a concatenation process to obtain third context information [ p1, p2, p3, …, pn, e1, e2, e3, …, en ].
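The splicing example above can be reproduced directly; the vector values are the illustrative placeholders p1..pn and e1..en from the text, made concrete:

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3])  # first context information (illustrative values)
e = np.array([0.4, 0.5, 0.6])  # second context information (illustrative values)
third = np.concatenate([p, e])  # splicing keeps every component of both vectors
```

Note that the result has the combined dimensionality of both inputs, which is why, as the next paragraph says, no information from either context vector is lost.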
And splicing the first context information and the second context information of the word characteristics to obtain third context information, wherein the first context information and the second context information are not lost, and the training precision of the speech recognition error correction model is further ensured to be improved.
The process of determining the third context information of the word feature based on the first context information and the second context information of the word feature may also include, but is not limited to:
s10342, performing dot product operation processing on the first context information and the second context information of the word features to obtain third context information.
And performing dot product operation processing on the first context information and the second context of the word features to obtain third context information, so that the operation time can be saved, the efficiency of obtaining the third context information is improved, and the training efficiency is improved while the training precision of the speech recognition error correction model is improved.
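The patent does not spell out the exact operation, but for the result to remain a context vector, the "dot product operation" is most naturally read as an element-wise (Hadamard) product; that reading is an assumption in this sketch:

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3])  # first context information (illustrative)
e = np.array([0.4, 0.5, 0.6])  # second context information (illustrative)
# Element-wise product: same dimensionality as each input, so the fused
# vector is half the size of the concatenation and cheaper to process.
third = p * e
```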
Alternatively, the process of determining the third context information of the word feature based on the first context information and the second context information of the word feature may also include, but is not limited to:
s10343, inputting the first context information and the second context information of the word feature into the first machine learning model for feature fusion, and obtaining third context information output by the first machine learning model.
When the first context information and the second context information of the word features are input into the first machine learning model for feature fusion, the first machine learning model and the voice recognition error correction model are trained together, the accuracy of the training of the first machine learning model is ensured, the accuracy of the third context information output by the first machine learning model is further ensured, and the training accuracy of the voice recognition error correction model is improved on the basis of ensuring the richness and the accuracy of training data of the voice recognition error correction model.
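The patent does not fix the architecture of the first machine learning model; as one hypothetical instance, a single dense layer over the concatenated contexts would suffice (all weights and dimensions below are illustrative):

```python
import numpy as np

def fuse(p, e, W, b):
    """One hypothetical fusion layer standing in for the first machine
    learning model: third = tanh(W @ [p; e] + b)."""
    return np.tanh(W @ np.concatenate([p, e]) + b)

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 6))  # fuses two 3-d contexts into one 3-d vector
b = np.zeros(3)
third = fuse(np.ones(3), np.ones(3), W, b)
```

Unlike splicing or the element-wise product, the weights W and b here are trainable, which is why the text trains this model jointly with the speech recognition error correction model.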
And S1035, inputting the third context information and the word characteristics into the voice recognition error correction model, and obtaining an error correction result output by the voice recognition error correction model, wherein the error correction result is a result of correcting the word characteristics.
S1036, judging whether the punctuation prediction model and the voice recognition error correction model meet the training end conditions or not based on the punctuation prediction result, the error correction result of each word feature, punctuation symbols of the training text and the error correction label marked for each word.
If not, step S1037 is performed.
In this embodiment, the training end condition may be set as needed, and is not limited in this application. For example, the end-of-training condition may be, but is not limited to: the loss function value of the punctuation prediction model is converged and the loss function value of the speech recognition error correction model is converged; or, the obtained comprehensive loss function value is converged based on the loss function value of the punctuation prediction model and the loss function value of the speech recognition error correction model.
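One hedged sketch of such a convergence test, treating the loss as converged when its change stays small for several consecutive steps (the threshold and patience values are illustrative choices, not from the patent):

```python
def has_converged(loss_history, eps=1e-4, patience=3):
    """Illustrative end-of-training test: converged when the loss changes
    by less than eps over each of the last `patience` steps."""
    if len(loss_history) <= patience:
        return False
    recent = loss_history[-(patience + 1):]
    return all(abs(recent[i + 1] - recent[i]) < eps for i in range(patience))
```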
Steps S1031 to S1037 may be understood as training the punctuation prediction model while training the speech recognition error correction model, so as to implement joint learning of the speech recognition error correction model and the punctuation prediction model.
Based on the punctuation prediction result, the error correction result of each word feature, the punctuation marks of the training text and the error correction label labeled for each word, a specific implementation process for judging whether the punctuation prediction model and the speech recognition error correction model meet the training end condition can be as follows:
s10361, determining an error correction loss function value based on the error correction result of the word features and the difference between error correction labels marked for the words;
s10362, determining a punctuation loss function value based on the difference between punctuation prediction results and punctuation symbols of the training text;
s10363, obtaining a comprehensive loss function value based on the error correction loss function value and the punctuation loss function value.
Obtaining the comprehensive loss function value based on the error correction loss function value and the punctuation loss function value may include, but is not limited to:
and adding the error correction loss function value and the punctuation loss function value to obtain a comprehensive loss function value.
Of course, the synthetic loss function value is obtained based on the error correction loss function value and the punctuation loss function value, and may include, but is not limited to:
and calculating to obtain a comprehensive loss function value by using the following formula:
loss_cp = a × loss_ec + b × loss_punc

In the above formula, loss_cp represents the comprehensive loss function value, loss_ec represents the error correction loss function value, loss_punc represents the punctuation loss function value, and a and b are different weights that can be set as needed; the application does not limit the values of a and b.
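The weighted combination is direct to implement; the default weights below are illustrative, since the patent deliberately leaves a and b unspecified:

```python
def combined_loss(loss_ec, loss_punc, a=0.5, b=0.5):
    """loss_cp = a * loss_ec + b * loss_punc; the weights a and b are
    tunable and the defaults here are illustrative, not from the patent."""
    return a * loss_ec + b * loss_punc

loss_cp = combined_loss(1.2, 0.8)
```

Setting a = b recovers the plain sum mentioned earlier (up to a constant factor), while unequal weights let the joint training emphasize one task over the other.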
S10364, judging whether the comprehensive loss function value is converged.
And S1037, updating the parameters of the punctuation prediction model and the parameters of the speech recognition error correction model, and returning to execute the step S1031 until the training end condition is met.
And step S104, determining third context information of the word features based on the first context information and the second context information of the word features.
And determining third context information of the word features based on the first context information and the second context information of the word features, so that the third context information contains more context information than the second context information.
And S105, inputting the word characteristics and the third context information of the word characteristics into the voice recognition error correction model, and obtaining a text obtained after the voice recognition error correction model performs error correction on the text to be processed.
The process of inputting the word features and the third context information of the word features into the speech recognition error correction model to obtain a text obtained by the speech recognition error correction model after performing error correction processing on the text to be processed may include:
S1051, inputting the word features and the third context information of the word features into the speech recognition error correction model, and obtaining the context information of the word feature of the (t+1)-th word in the text to be processed by the following formula:

h_{t+1}^ec = BiLSTM(x_{t+1}, c_t)

In the above formula, h_{t+1}^ec represents the context information of the word feature of the (t+1)-th word in the text to be processed, BiLSTM represents the bidirectional long short-term memory recurrent neural network model, x_{t+1} represents the word feature of the (t+1)-th word in the text to be processed, and c_t represents the third context information of the word feature of the t-th word in the text to be processed.
S1052, obtaining the word characteristics of each word in the text to be processed after error correction processing by adopting the following formula:
yec = softmax(hec)

In the above formula, yec represents the word features after error correction processing, softmax() represents the probability normalization function, and hec represents the context information of the word features of the words in the text to be processed.
S1053, based on the word characteristics of each word in the text to be processed after error correction, obtaining the text obtained after error correction of the text to be processed.
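Steps S1051 and S1052 can be sketched as follows. The recurrent step here is a toy stand-in for one BiLSTM step, and the feature values, weights, and dimensions are all hypothetical; a real model uses learned weight matrices and full vocabulary-sized outputs:

```python
import math

def softmax(scores):
    # Probability normalization, as in yec = softmax(hec).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_recurrent_step(x_t1, h3_t, w_x=0.5, w_h=0.5):
    # Stand-in for one BiLSTM step: combines the word feature of the
    # (t+1)-th word with the third context information of the t-th word.
    return [w_x * x + w_h * h for x, h in zip(x_t1, h3_t)]

# Toy word feature and third context information (hypothetical values).
x_next = [1.0, 0.0, 2.0]
h3_prev = [0.0, 1.0, 0.0]
h_ec = toy_recurrent_step(x_next, h3_prev)
y_ec = softmax(h_ec)  # distribution over corrected words; argmax picks one
```

The argmax over `y_ec` corresponds to step S1053: selecting the corrected word feature for each position of the text to be processed.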
In the present application, the speech recognition error correction model is trained with training data that includes data for correcting errors in the training text and context information obtained by the punctuation prediction model based on the training samples. The training data is therefore richer, the speech recognition error correction model can learn richer context information, and the accuracy of the speech recognition error correction model is improved. On this basis, the third context information of the word features is determined based on the first context information and the second context information of the word features, and the word features and the third context information are input into the speech recognition error correction model, which can improve the error correction accuracy of the speech recognition error correction model.
On the basis of the improved error correction accuracy of the recognition result, the punctuation prediction model performs punctuation prediction on a more accurate recognition result, so the accuracy of punctuation prediction can also be improved.
As another alternative embodiment of the present application, referring to fig. 2, a schematic flow chart of an identification method provided in embodiment 2 of the present application is provided, and this embodiment mainly relates to a refinement scheme of the identification method described in embodiment 1 above, as shown in fig. 2, the method may include, but is not limited to, the following steps:
step S201, acquiring word characteristics of each word in the text to be processed, which is recognized by the voice recognition system.
Step S202, inputting the word features into the punctuation prediction model, and obtaining first context information of the word features obtained by the punctuation prediction model.
Step S203, inputting the word characteristics into the speech recognition error correction model, and obtaining second context information of the word characteristics obtained by the speech recognition error correction model.
The speech recognition error correction model is trained with training data; the training data includes data for correcting errors in the training text and context information obtained by the punctuation prediction model based on the training samples.
The detailed processes of steps S201 to S203 can refer to the related descriptions of steps S101 to S103 in embodiment 1, and are not described herein again.
And S204, splicing the first context information and the second context information of the word features to obtain third context information.
For example, if the first context information of the word feature is [ p1, p2, p3, …, pn ], the second context information of the word feature is [ e1, e2, e3, …, en ], the first context information and the second context information of the word feature are spliced to obtain third context information [ p1, p2, p3, …, pn, e1, e2, e3, …, en ].
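The splicing of step S204 is plain vector concatenation, as in the example above. A minimal sketch (the context values are hypothetical):

```python
def splice(first_ctx, second_ctx):
    # Concatenate ("splice") the two context vectors so that neither
    # source of context information is lost.
    return list(first_ctx) + list(second_ctx)

p = [0.1, 0.2, 0.3]   # first context information (punctuation prediction model)
e = [0.4, 0.5, 0.6]   # second context information (error correction model)
third = splice(p, e)  # [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
```

Note that the spliced vector is twice as long as either input, so the downstream model must accept the larger dimensionality.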
Step S204 is a specific implementation manner of step S104 in embodiment 1.
Step S205, inputting the word characteristics and the third context information of the word characteristics into the speech recognition error correction model, and obtaining a text obtained after the speech recognition error correction model performs error correction processing on the text to be processed.
The detailed process of step S205 can refer to the related description of step S105 in embodiment 1, and is not described herein again.
In this embodiment, the first context information and the second context information of the word feature are spliced to obtain the third context information, so that the first context information and the second context information are not lost, and the accuracy of error correction of the speech recognition error correction model is further ensured.
As another alternative embodiment of the present application, referring to fig. 3, a schematic flow chart of an identification method provided in embodiment 3 of the present application is provided, and this embodiment mainly relates to a refinement scheme of the identification method described in the foregoing embodiment 1, as shown in fig. 3, the method may include, but is not limited to, the following steps:
step S301, word characteristics of each word in the text to be processed, which is recognized by the voice recognition system, are obtained.
Step S302, inputting the word features into the punctuation prediction model, and obtaining first context information of the word features obtained by the punctuation prediction model.
Step S303, inputting the word characteristics into the speech recognition error correction model, and obtaining second context information of the word characteristics obtained by the speech recognition error correction model.
The speech recognition error correction model is trained with training data; the training data includes data for correcting errors in the training text and context information obtained by the punctuation prediction model based on the training samples.
The detailed processes of steps S301 to S303 can refer to the related descriptions of steps S101 to S103 in embodiment 1, and are not described herein again.
Step S304, performing dot product operation processing on the first context information and the second context information of the word features to obtain third context information.
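One reading of the "dot product operation" of step S304 is an element-wise (Hadamard) product, which keeps the third context information the same length as the inputs; this interpretation and the values below are assumptions for illustration:

```python
def elementwise_product(first_ctx, second_ctx):
    # Element-wise multiplication of the two context vectors: cheap to
    # compute and dimension-preserving, unlike concatenation.
    return [a * b for a, b in zip(first_ctx, second_ctx)]

p = [1.0, 2.0, 3.0]   # first context information (hypothetical)
e = [0.5, 0.5, 2.0]   # second context information (hypothetical)
third = elementwise_product(p, e)  # [0.5, 1.0, 6.0]
```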
Step S304 is a specific implementation manner of step S104 in embodiment 1.
Step S305, inputting the word characteristics and the third context information of the word characteristics into the speech recognition error correction model, and obtaining a text obtained after the speech recognition error correction model performs error correction processing on the text to be processed.
The detailed process of step S305 can refer to the related description of step S105 in embodiment 1, and is not described herein again.
Performing a dot product operation on the first context information and the second context information of the word features to obtain the third context information saves computation time and improves the efficiency of obtaining the third context information. The error correction efficiency is thus improved while the accuracy of the text obtained after the speech recognition error correction model performs error correction processing on the text to be processed is preserved.
As another alternative embodiment of the present application, referring to fig. 4, a schematic flow chart of an identification method provided in embodiment 4 of the present application is provided, and this embodiment mainly relates to a refinement scheme of the identification method described in the foregoing embodiment 1, as shown in fig. 4, the method may include, but is not limited to, the following steps:
step S401, word characteristics of each word in the text to be processed identified by the voice recognition system are obtained.
Step S402, inputting the word features into the punctuation prediction model, and obtaining first context information of the word features obtained by the punctuation prediction model.
And S403, inputting the word features into the speech recognition error correction model, and obtaining second context information of the word features obtained by the speech recognition error correction model.
The speech recognition error correction model is trained with training data; the training data includes data for correcting errors in the training text and context information obtained by the punctuation prediction model based on the training samples.
The detailed processes of steps S401 to S403 can refer to the related descriptions of steps S101 to S103 in embodiment 1, and are not described herein again.
Step S404, inputting the first context information and the second context information of the word features into a first machine learning model for feature fusion to obtain third context information output by the first machine learning model.
Step S404 is a specific implementation manner of step S104 in embodiment 1.
Step S405, inputting the word characteristics and the third context information of the word characteristics into the speech recognition error correction model, and obtaining a text obtained after the speech recognition error correction model performs error correction processing on the text to be processed.
The detailed process of step S405 can refer to the related description of step S105 in embodiment 1, and is not described herein again.
The first context information and the second context information of the word features are input into the first machine learning model for feature fusion, and the first machine learning model outputs the third context information. This guarantees the accuracy of the third context information, and inputting accurate third context information into the speech recognition error correction model improves the error correction accuracy of the speech recognition error correction model.
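A minimal "first machine learning model" for the feature fusion of step S404 can be sketched as a single linear layer over the two context vectors. The weights and bias below are fixed toy values; in practice they would be learned during training:

```python
def learned_fusion(first_ctx, second_ctx, w1, w2, bias):
    # A single linear layer that fuses the two context vectors into
    # the third context information.
    return [w1 * a + w2 * b + bias for a, b in zip(first_ctx, second_ctx)]

third = learned_fusion([1.0, 0.0], [0.0, 1.0], w1=0.7, w2=0.3, bias=0.0)
# third == [0.7, 0.3]
```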
As another alternative embodiment of the present application, which is mainly a refinement of the recognition method described in embodiment 1 above, in this embodiment the punctuation prediction model may include a punctuation prediction submodel and a self-encoder. In the case that the punctuation prediction model includes a punctuation prediction submodel and a self-encoder, the training process of the speech recognition error correction model may include the following steps:
s2001, acquiring word features of each word in the training text, punctuation marks of the training text and an error correction label labeled for each word.
The detailed process of step S2001 may be referred to the related description of step S1031 in embodiment 1, and is not described herein again.
And S2002, inputting the word features into the self-encoder to obtain parameters used when the middle layer of the self-encoder processes the first subcontext information of the word features to be processed, wherein the word features to be processed are the first word features arranged in front of the word features in the training text.
A self-encoder can be understood as a machine learning model that learns characteristic information (e.g., punctuation distribution information) of an input object within its feature space.
The self-encoder may be, but is not limited to, a unidirectional long short-term memory (LSTM) recurrent neural network model or a bidirectional LSTM (BiLSTM) recurrent neural network model.

When the self-encoder is a unidirectional or bidirectional LSTM recurrent neural network model, the intermediate layer of the self-encoder can be understood as the hidden layer of that model.
In this embodiment, the intermediate layer of the self-encoder may process the word features by using the following formula to obtain the first subcontext information of the word features:

hae(t+1) = Wae · x(t+1) + Uae · hae(t)

In the above formula, Wae represents the parameters used when the intermediate layer of the self-encoder processes the word features, x(t+1) represents the word feature of the (t+1)-th word in the training text, Uae represents the parameters used when the intermediate layer of the self-encoder processes the first subcontext information of the word features, hae(t+1) represents the first subcontext information of the word feature of the (t+1)-th word in the training text, and hae(t) represents the first subcontext information of the word feature of the t-th word in the training text.
It can be understood that the word feature to be processed is one of the word features in the training text, and the first subcontext information of the word feature to be processed is also calculated by using the above formula.
And step S2003, obtaining the to-be-used characteristics based on parameters and word characteristics used when the middle layer of the self-encoder processes the first subcontext information of the to-be-processed word characteristics.
Obtaining the feature to be used based on the parameters used when the intermediate layer of the self-encoder processes the first subcontext information of the word feature to be processed, together with the word features, can be understood as follows: those parameters are used to map the word features into a feature space that meets the requirements of the punctuation prediction submodel. The resulting feature to be used then conforms to the feature space of the punctuation prediction submodel while still containing the characteristic information (such as punctuation distribution information) of the word features in the original feature space.
The feature to be used is obtained based on the parameters used when the intermediate layer of the self-encoder processes the word feature and the word feature, and may include but is not limited to:
and multiplying the parameters used when the middle layer of the self-encoder processes the first subcontext information of the word feature to be processed by the word feature to obtain the feature to be used.
Of course, the obtaining of the feature to be used based on the parameter and the word feature used when the middle layer of the self-encoder processes the first subcontext information of the feature to be processed may also include:
and inputting parameters and word characteristics used when the first subcontext information of the word characteristics to be processed in the middle layer of the self-encoder is processed into a second machine learning model for characteristic fusion to obtain the characteristics to be used output by the second machine learning model.
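Of the two variants above, the direct multiplication case can be sketched as a matrix-vector product. The parameter matrix and word feature below are toy values standing in for the real self-encoder parameters:

```python
def matvec(W, x):
    # Multiply the intermediate-layer parameters W (a matrix here) by the
    # word feature x to map it into the punctuation submodel's feature space.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W_ae = [[1.0, 0.0], [0.0, 2.0]]   # hypothetical self-encoder parameters
x = [3.0, 4.0]                    # hypothetical word feature
feature_to_use = matvec(W_ae, x)  # [3.0, 8.0]
```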
And step S2004, inputting the features to be used into the middle layer of the punctuation prediction submodel, and processing the features to be used by the middle layer of the punctuation prediction submodel to obtain first context information of the word features.
Inputting the feature to be used into the intermediate layer of the punctuation prediction submodel, and processing the feature to be used by the intermediate layer of the punctuation prediction submodel to obtain the first context information of the word feature, which can be understood as:
inputting the feature to be used into the intermediate layer of the punctuation prediction submodel, and processing the feature to be used by the intermediate layer of the punctuation prediction submodel by using the following formula to obtain the first context information of the word features:

hpunc(t+1) = sigmoid(Wpunc · xu(t+1) + Upunc · hpunc(t))

In the above formula, Wpunc represents the parameters used when the intermediate layer of the punctuation prediction submodel processes the word features, xu(t+1) represents the feature to be used, hpunc(t+1) represents the first context information of the word feature of the (t+1)-th word in the training text, hpunc(t) represents the first context information of the word feature of the t-th word in the training text, Upunc represents the parameters used when the intermediate layer of the punctuation prediction submodel processes the first context information of the word features, and sigmoid() represents the sigmoid activation function.
Since the feature to be used conforms to the feature space required by the punctuation prediction submodel and contains the characteristic information of the word features in the original feature space, inputting it into the intermediate layer of the punctuation prediction submodel ensures that the intermediate layer can process it without losing that characteristic information, which ensures the accuracy of the first context information of the word features.
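The recurrence of step S2004 can be sketched with scalar toy parameters. The weights `w` and `u` and the input sequence are hypothetical; a real submodel uses learned matrices over vector-valued features:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def punct_step(w, xu, u, h_prev):
    # One recurrent step of the punctuation prediction submodel:
    # h(t+1) = sigmoid(W * xu(t+1) + U * h(t)), scalar toy version.
    return sigmoid(w * xu + u * h_prev)

h = 0.0
for xu in [1.0, -1.0, 2.0]:  # features to be used, one per word (toy values)
    h = punct_step(w=1.0, xu=xu, u=0.5, h_prev=h)
```

The sigmoid keeps each step's output in (0, 1), so the first context information stays bounded as it is threaded through the sequence.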
Steps S2002 to S2004 are a specific implementation of step S1032 in embodiment 1.
In this embodiment, the parameters of the punctuation prediction submodel may be obtained by training on a plurality of complete training texts in advance. Of course, the parameters of the punctuation prediction submodel may also be initially set parameters that have not yet been trained on a complete training sample.
And S2005, inputting the plurality of word features into the voice recognition error correction model, and obtaining second context information of each word feature obtained by the voice recognition error correction model.
And S2006, determining third context information of the word features based on the first context information and the second context information of the word features.
If the parameters of the punctuation prediction submodel were obtained by training on a plurality of complete training texts in advance, the punctuation prediction submodel has already learned richer context information. Inputting the features to be used into the punctuation prediction submodel then yields more accurate first context information, which further ensures the richness and accuracy of the third context information of the word features.
And S2007, inputting the third context information and the word characteristics into the voice recognition error correction model, and obtaining an error correction result output by the voice recognition error correction model, wherein the error correction result is a result of correcting the word characteristics.
And S2008, determining a loss function value of the self-encoder based on the first subcontext information of each word feature obtained by the self-encoder.
The process of determining the self-encoder loss function value based on the first subcontext information of each word feature obtained by the self-encoder may include, but is not limited to:
the self-encoder loss function value is calculated using the following formula:

Lae = MLE(hae)

In the above formula, Lae represents the self-encoder loss function value, hae represents the first subcontext information of the word features, and MLE() represents the maximum likelihood estimation function.
And S2009, determining a punctuation prediction sub-model loss function value based on the first context information of each word feature.
The process of determining a punctuation predictor sub-model loss function value based on the first context information of each word feature may include:
calculating the loss function value of the punctuation predictor model by using the following formula:
Lpunc = MLE(hpunc)

In the above formula, Lpunc represents the loss function value of the punctuation prediction submodel, hpunc represents the first context information of the word features, and MLE() represents the maximum likelihood estimation function.
And S2010, obtaining a punctuation prediction loss function value based on the self-encoder loss function value and the punctuation prediction sub-model loss function value.
The punctuation prediction loss function value is obtained based on the self-encoder loss function value and the punctuation predictor model loss function value, which may include but is not limited to:
and adding the loss function value of the self-encoder and the loss function value of the punctuation prediction submodel to obtain a punctuation prediction loss function value.
Another embodiment of obtaining the punctuation prediction loss function value based on the self-encoder loss function value and the punctuation prediction submodel loss function value may be:
the punctuation prediction loss function value is calculated using the following formula:

L = γ·Lpunc + (1 − γ)·Lae

In the above formula, Lpunc represents the loss function value of the punctuation prediction submodel, Lae represents the loss function value of the self-encoder, γ represents a hyperparameter with a value range of 0 to 1, and L represents the punctuation prediction loss function value.
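The weighted combination of the two loss values is a one-line computation; the loss values and γ below are hypothetical:

```python
def punctuation_prediction_loss(l_punc, l_ae, gamma):
    # L = γ·Lpunc + (1 − γ)·Lae, with the hyperparameter γ in [0, 1]
    # trading off the submodel loss against the self-encoder loss.
    assert 0.0 <= gamma <= 1.0
    return gamma * l_punc + (1.0 - gamma) * l_ae

loss = punctuation_prediction_loss(l_punc=0.8, l_ae=0.4, gamma=0.5)
```

Setting γ = 1 reduces L to the punctuation submodel loss alone; γ = 0 trains only against the self-encoder loss.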
S2011, determining an error correction loss function value based on the error correction result of the word feature and the difference between the error correction labels labeled for the words.
And S2012, obtaining a comprehensive loss function value based on the error correction loss function value and the punctuation loss function value.
The detailed process of step S2012 can be referred to the related description of step S10363 in embodiment 1, and is not described herein again.
And S2013, judging whether the comprehensive loss function value is converged.
If not, go to step S2014.
Steps S2008 to S2013 are a specific implementation of step S1036 in embodiment 1.
And S2014, updating the parameters of the punctuation prediction model and the parameters of the speech recognition error correction model, and returning to execute the step S2001 until the training end condition is met.
In this embodiment, the training end condition may be set as needed and is not limited in this application. For example, the training end condition may be, but is not limited to: the loss function value of the punctuation prediction model converges and the loss function value of the speech recognition error correction model converges; or, the comprehensive loss function value, obtained based on the loss function value of the punctuation prediction model and the loss function value of the speech recognition error correction model, converges.
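The loop of steps S2001 through S2014 can be sketched as follows. The quadratic loss and the gradient step are placeholders for the real comprehensive loss and the real joint update of the punctuation prediction model and the speech recognition error correction model:

```python
def train(initial_params, step_fn, loss_fn, tol=1e-6, max_epochs=1000):
    # Skeleton of the joint training loop: compute the comprehensive
    # loss, test convergence, otherwise update the parameters of both
    # models and repeat until the end condition is met.
    params = initial_params
    prev_loss = float("inf")
    loss = loss_fn(params)
    for _ in range(max_epochs):
        loss = loss_fn(params)
        if abs(prev_loss - loss) < tol:  # comprehensive loss converged
            break
        prev_loss = loss
        params = step_fn(params)  # joint parameter update
    return params, loss

# Toy quadratic "loss" and gradient step, for illustration only.
params, final_loss = train(
    initial_params=4.0,
    step_fn=lambda p: p - 0.5 * 2.0 * p,  # gradient step on p**2, lr = 0.5
    loss_fn=lambda p: p * p,
)
```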
In this embodiment, on the basis of ensuring the accuracy of the first context information of the word feature, the accuracy of the third context information of the word feature can be ensured, so as to ensure the training accuracy of the speech recognition error correction model.
Corresponding to the training process of the above self-encoder, punctuation prediction submodel and speech recognition error correction model, referring to fig. 5, a flow chart of a recognition method provided in embodiment 5 of the present application is shown, and this embodiment mainly describes a refinement scheme of the recognition method described in embodiment 1, as shown in fig. 5, the method may include, but is not limited to, the following steps:
step S501, word characteristics of each word in the text to be processed, which is recognized by the voice recognition system, are obtained.
The detailed process of step S501 can refer to the related description of step S101 in embodiment 1, and is not described herein again.
Step S502, inputting the word characteristics into the self-encoder, and obtaining parameters used when the middle layer of the self-encoder processes the first subcontext information of the word characteristics to be processed, wherein the word characteristics to be processed is the first word characteristics arranged in front of the word characteristics in the text to be processed.
Step S503, parameters and word characteristics used when the first subcontext information of the word characteristics to be processed is processed based on the middle layer of the self-encoder are obtained, and the characteristics to be used are obtained.
The method for obtaining the feature to be used based on the parameters used when the middle layer of the self-encoder processes the first subcontext information of the feature to be processed and the word feature comprises the following steps:
and multiplying the parameters used when the middle layer of the self-encoder processes the first subcontext information of the word feature to be processed by the word feature to obtain the feature to be used.
Of course, the obtaining of the feature to be used based on the parameter and the word feature used when the middle layer of the self-encoder processes the first subcontext information of the feature to be processed may also include:
and inputting parameters and word characteristics used when the first subcontext information of the word characteristics to be processed in the middle layer of the self-encoder is processed into a second machine learning model for characteristic fusion to obtain the characteristics to be used output by the second machine learning model.
Step S504, inputting the characteristics to be used into the middle layer of the punctuation prediction submodel, and processing the characteristics to be used by the middle layer of the punctuation prediction submodel to obtain first context information of the word characteristics.
Steps S502 to S504 are a specific implementation of step S102 in embodiment 1.
And step S505, inputting the word characteristics into the voice recognition error correction model, and obtaining second context information of the word characteristics obtained by the voice recognition error correction model.
The speech recognition error correction model is trained with training data; the training data includes data for correcting errors in the training text and context information obtained by the punctuation prediction model based on the training samples.
Step S506, third context information of the word features is determined based on the first context information and the second context information of the word features.
Step S507, inputting the word characteristics and the third context information of the word characteristics into the voice recognition error correction model, and obtaining a text obtained after the voice recognition error correction model performs error correction processing on the text to be processed.
The detailed processes of steps S505 to S507 can be referred to the related descriptions of steps S103 to S105 in embodiment 1, and are not described herein again.
In this embodiment, since the feature to be used conforms to the feature space required by the punctuation prediction submodel and contains the characteristic information of the word features in the original feature space, inputting the feature to be used into the intermediate layer of the punctuation prediction submodel ensures that the intermediate layer can process it without losing that characteristic information. The accuracy of the first context information of the word features is thereby ensured, which further improves the error correction accuracy of the speech recognition error correction model.
Corresponding to the embodiment of the identification method provided by the application, the application also provides an embodiment of the electronic equipment applying the identification method.
As shown in fig. 6, which is a schematic structural diagram of an embodiment 1 of an electronic device provided in the present application, the electronic device may include the following structures:
a memory 100 and a processor 200.
A memory 100 for storing at least one set of instructions;
a processor 200 for calling and executing the set of instructions in the memory 100, and executing the set of instructions to:
acquiring word characteristics of each word in a text to be processed, which is identified by a voice identification system;
inputting the word features into a punctuation prediction model to obtain first context information of the word features obtained by the punctuation prediction model;
inputting the word features into a speech recognition error correction model to obtain second context information of the word features obtained by the speech recognition error correction model, wherein the speech recognition error correction model is obtained by training with training data, and the training data comprises data for correcting the training text and context information obtained by the punctuation prediction model based on the training samples;
determining third context information of the word feature based on the first context information and the second context information of the word feature;
and inputting the word characteristics and the third context information of the word characteristics into the voice recognition error correction model to obtain a text obtained by performing error correction processing on the text to be processed by the voice recognition error correction model.
Corresponding to the embodiment of the identification method provided by the application, the application also provides an embodiment of an identification device.
In this embodiment, the identification device may include:
the acquisition module is used for acquiring the word characteristics of each word in the text to be processed, which is identified by the voice recognition system;
the first obtaining module is used for inputting the word features into a punctuation prediction model and obtaining first context information of the word features obtained by the punctuation prediction model;
a second obtaining module, configured to input the word feature into a speech recognition error correction model, and obtain second context information of the word feature, where the second context information is obtained by the speech recognition error correction model, the speech recognition error correction model is obtained by training with training data, and the training data includes data for correcting a training text and context information obtained by the punctuation prediction model based on the training sample;
a determining module, configured to determine third context information of the word feature based on the first context information and the second context information of the word feature;
and the third obtaining module is used for inputting the word characteristics and third context information of the word characteristics into the voice recognition error correction model to obtain a text obtained after the voice recognition error correction model performs error correction processing on the text to be processed.
In this embodiment, the determining module may be specifically configured to:
splicing the first context information and the second context information of the word characteristics to obtain third context information;
or, performing dot product operation processing on the first context information and the second context information of the word features to obtain third context information;
or inputting the first context information and the second context information of the word features into a first machine learning model for feature fusion to obtain third context information output by the first machine learning model.
In this embodiment, the punctuation prediction model may include a punctuation prediction sub-model and a self-encoder;
accordingly, the first obtaining module may be specifically configured to:
inputting the word feature into the self-encoder, and acquiring parameters used when the middle layer of the self-encoder processes first sub-context information of a word feature to be processed, where the word feature to be processed is the first word feature arranged before the current word feature in the text to be processed;
obtaining a feature to be used based on the word feature and the parameters used when the middle layer of the self-encoder processes the first sub-context information of the word feature to be processed;
and inputting the feature to be used into the middle layer of the punctuation prediction submodel, where the middle layer of the punctuation prediction submodel processes the feature to be used to obtain the first context information of the word feature.
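The three steps just described can be wired together as in the sketch below. Every function here is an invented stand-in: the real self-encoder and punctuation prediction submodel are trained models, not these toy formulas.

```python
# Hypothetical sketch of the first obtaining module's flow: middle-layer
# parameters from the self-encoder (for the preceding word feature) are
# combined with the current word feature, and the punctuation prediction
# submodel's middle layer produces the first context information.

def self_encoder_middle_layer_params(prev_word_feature):
    # Stand-in: parameters the self-encoder's middle layer used when
    # processing the first sub-context information of the preceding word.
    return [0.5 for _ in prev_word_feature]

def build_feature_to_use(params, word_feature):
    # Option 1 from the description: multiplication of parameters
    # and the current word feature.
    return [p * w for p, w in zip(params, word_feature)]

def punctuation_submodel_middle_layer(feature_to_use):
    # Stand-in for the submodel's middle layer producing first context info.
    return [f + 1.0 for f in feature_to_use]

prev_word_feature = [1.0, 2.0]  # word feature arranged before the current one
word_feature = [2.0, 4.0]       # current word feature

params = self_encoder_middle_layer_params(prev_word_feature)
feature_to_use = build_feature_to_use(params, word_feature)
first_context = punctuation_submodel_middle_layer(feature_to_use)
```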
In this embodiment, the process in which the first obtaining module obtains the feature to be used, based on the word feature and the parameters used when the middle layer of the self-encoder processes the first sub-context information of the word feature to be processed, may specifically be:
multiplying the parameters used when the middle layer of the self-encoder processes the first sub-context information of the word feature to be processed by the word feature to obtain the feature to be used;
or inputting the parameters used when the middle layer of the self-encoder processes the first sub-context information of the word feature to be processed, together with the word feature, into a second machine learning model for feature fusion to obtain the feature to be used output by the second machine learning model.
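The two alternatives above can be contrasted directly. Both variants below are invented illustrations: the element-wise product is one plausible reading of the multiplication described, and a simple averaging formula stands in for the second machine learning model.

```python
# Hypothetical sketch of the two ways the feature to be used may be built
# from the self-encoder's middle-layer parameters and the word feature.

middle_layer_params = [0.5, 2.0, 1.0]  # params used for the preceding word
word_feature = [2.0, 3.0, 4.0]         # current word feature

# Option 1: multiply the parameters by the word feature (element-wise).
feature_to_use_mul = [p * w for p, w in zip(middle_layer_params, word_feature)]

# Option 2: a second machine learning model performing feature fusion;
# an element-wise average stands in for that model here.
def second_fusion_model(params, feature):
    return [0.5 * (p + w) for p, w in zip(params, feature)]

feature_to_use_fused = second_fusion_model(middle_layer_params, word_feature)
```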
Corresponding to the foregoing embodiments of the identification method, the present application further provides an embodiment of a storage medium.
In this embodiment, the storage medium stores a computer program which, when executed by a processor, implements the steps of the identification method according to any one of the foregoing embodiments.
It should be noted that the embodiments in this specification are described with emphasis on their differences from one another; for identical or similar parts among the embodiments, reference may be made to each other. Since the device embodiments are substantially similar to the method embodiments, they are described briefly, and for relevant details, reference may be made to the description of the method embodiments.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
For convenience of description, the above device is described as being divided into various units by function, which are described separately. Of course, when the present application is implemented, the functionality of the units may be realized in one or more pieces of software and/or hardware.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the present application may, in essence or in the part contributing to the prior art, be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in certain parts of the embodiments, of the present application.
The foregoing is a detailed description of the identification method, identification device, electronic equipment, and storage medium provided by the present application. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help understand the method and core idea of the present application. Meanwhile, for those skilled in the art, changes may be made to the specific implementations and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. An identification method, comprising:
acquiring word features of each word in a text to be processed that is recognized by a speech recognition system;
inputting the word features into a punctuation prediction model to obtain first context information of the word features obtained by the punctuation prediction model;
inputting the word features into a speech recognition error correction model to obtain second context information of the word features obtained by the speech recognition error correction model, wherein the speech recognition error correction model is trained with training data, and the training data comprises data for correcting the training text and context information obtained by the punctuation prediction model based on the training samples;
determining third context information of the word feature based on the first context information and the second context information of the word feature;
and inputting the word features and the third context information of the word features into the speech recognition error correction model to obtain a text resulting from error correction processing performed by the speech recognition error correction model on the text to be processed.
2. The method of claim 1, wherein the determining third context information of the word feature based on the first context information and the second context information of the word feature comprises:
splicing the first context information and the second context information of the word features to obtain the third context information.
3. The method of claim 1, wherein the determining third context information of the word feature based on the first context information and the second context information of the word feature comprises:
performing a dot product operation on the first context information and the second context information of the word features to obtain the third context information.
4. The method of claim 1, wherein the determining third context information of the word feature based on the first context information and the second context information of the word feature comprises:
inputting the first context information and the second context information of the word features into a first machine learning model for feature fusion to obtain the third context information output by the first machine learning model.
5. The method of claim 1, wherein the punctuation prediction model comprises a punctuation prediction submodel and a self-encoder;
the inputting the word features into a punctuation prediction model to obtain first context information of the word features obtained by the punctuation prediction model includes:
inputting the word features into the self-encoder, and obtaining parameters used when a middle layer of the self-encoder processes first sub-context information of word features to be processed, wherein the word features to be processed are the first word features arranged before the word features in the text to be processed;
obtaining a feature to be used based on the word features and the parameters used when the middle layer of the self-encoder processes the first sub-context information of the word features to be processed;
and inputting the feature to be used into a middle layer of the punctuation prediction submodel, wherein the middle layer of the punctuation prediction submodel processes the feature to be used to obtain the first context information of the word features.
6. The method according to claim 5, wherein the obtaining a feature to be used based on the word features and the parameters used when the middle layer of the self-encoder processes the first sub-context information of the word features to be processed comprises:
multiplying the parameters used when the middle layer of the self-encoder processes the first sub-context information of the word features to be processed by the word features to obtain the feature to be used.
7. The method according to claim 5, wherein the obtaining a feature to be used based on the word features and the parameters used when the middle layer of the self-encoder processes the first sub-context information of the word features to be processed comprises:
inputting the parameters used when the middle layer of the self-encoder processes the first sub-context information of the word features to be processed, together with the word features, into a second machine learning model for feature fusion to obtain the feature to be used output by the second machine learning model.
8. An identification device comprising:
an acquiring module, configured to acquire word features of each word in a text to be processed that is recognized by a speech recognition system;
a first obtaining module, configured to input the word features into a punctuation prediction model and obtain first context information of the word features obtained by the punctuation prediction model;
a second obtaining module, configured to input the word features into a speech recognition error correction model and obtain second context information of the word features obtained by the speech recognition error correction model, wherein the speech recognition error correction model is trained with training data, and the training data includes data for correcting a training text and context information obtained by the punctuation prediction model based on the training samples;
a determining module, configured to determine third context information of the word feature based on the first context information and the second context information of the word feature;
and a third obtaining module, configured to input the word features and the third context information of the word features into the speech recognition error correction model to obtain a text resulting from error correction processing performed by the speech recognition error correction model on the text to be processed.
9. An electronic device, comprising:
a memory and a processor;
a memory for storing at least one set of instructions;
a processor for calling and executing the set of instructions in the memory, by executing the set of instructions:
acquiring word features of each word in a text to be processed that is recognized by a speech recognition system;
inputting the word features into a punctuation prediction model to obtain first context information of the word features obtained by the punctuation prediction model;
inputting the word features into a speech recognition error correction model to obtain second context information of the word features obtained by the speech recognition error correction model, wherein the speech recognition error correction model is trained with training data, and the training data comprises data for correcting the training text and context information obtained by the punctuation prediction model based on the training samples;
determining third context information of the word feature based on the first context information and the second context information of the word feature;
and inputting the word features and the third context information of the word features into the speech recognition error correction model to obtain a text resulting from error correction processing performed by the speech recognition error correction model on the text to be processed.
10. A storage medium storing a computer program which, when executed by a processor, implements the steps of the identification method according to any one of claims 1 to 7.
CN202110281812.8A 2021-03-16 2021-03-16 Identification method, identification device, electronic equipment and storage medium Active CN113012701B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110281812.8A CN113012701B (en) 2021-03-16 2021-03-16 Identification method, identification device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110281812.8A CN113012701B (en) 2021-03-16 2021-03-16 Identification method, identification device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113012701A true CN113012701A (en) 2021-06-22
CN113012701B CN113012701B (en) 2024-03-22

Family

ID=76408405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110281812.8A Active CN113012701B (en) 2021-03-16 2021-03-16 Identification method, identification device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113012701B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009276495A (en) * 2008-05-14 2009-11-26 Nippon Telegr & Teleph Corp <Ntt> Incorrect speech recognition correction support device, its method, program and its recording medium
CN104484322A (en) * 2010-09-24 2015-04-01 新加坡国立大学 Methods and systems for automated text correction
CN105869634A (en) * 2016-03-31 2016-08-17 重庆大学 Field-based method and system for feeding back text error correction after speech recognition
CN108766437A (en) * 2018-05-31 2018-11-06 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN110069143A (en) * 2018-01-22 2019-07-30 北京搜狗科技发展有限公司 A kind of information is anti-error to entangle method, apparatus and electronic equipment
CN110705264A (en) * 2019-09-27 2020-01-17 上海智臻智能网络科技股份有限公司 Punctuation correction method, punctuation correction apparatus, and punctuation correction medium
CN110765772A (en) * 2019-10-12 2020-02-07 北京工商大学 Text neural network error correction model after Chinese speech recognition with pinyin as characteristic
WO2020186778A1 (en) * 2019-03-15 2020-09-24 平安科技(深圳)有限公司 Error word correction method and device, computer device, and storage medium
US20200327284A1 (en) * 2018-03-23 2020-10-15 Servicenow, Inc. Hybrid learning system for natural language understanding
CN112101032A (en) * 2020-08-31 2020-12-18 广州探迹科技有限公司 Named entity identification and error correction method based on self-distillation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SVETLANA STOYANCHEV et al.: "Localized detection of speech recognition errors", 2012 IEEE Spoken Language Technology Workshop (SLT), pages 25-30 *
JING YAN'E: "Analysis of Grammar Error Correction Algorithm Model Construction Based on Deep Learning Technology", Information Technology, vol. 44, no. 9, pages 143-147 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023273612A1 (en) * 2021-06-30 2023-01-05 北京有竹居网络技术有限公司 Training method and apparatus for speech recognition model, speech recognition method and apparatus, medium, and device
CN114049885A (en) * 2022-01-12 2022-02-15 阿里巴巴达摩院(杭州)科技有限公司 Punctuation mark recognition model construction method and punctuation mark recognition model construction device
CN114049885B (en) * 2022-01-12 2022-04-22 阿里巴巴达摩院(杭州)科技有限公司 Punctuation mark recognition model construction method and punctuation mark recognition model construction device

Also Published As

Publication number Publication date
CN113012701B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
CN109471915B (en) Text evaluation method, device and equipment and readable storage medium
CN105068998B (en) Interpretation method and device based on neural network model
CN113012701B (en) Identification method, identification device, electronic equipment and storage medium
CN111951780B (en) Multitasking model training method for speech synthesis and related equipment
CN113297366B (en) Emotion recognition model training method, device, equipment and medium for multi-round dialogue
CN111291187B (en) Emotion analysis method and device, electronic equipment and storage medium
CN114492363B (en) Small sample fine adjustment method, system and related device
CN112434142B (en) Method for marking training sample, server, computing equipment and storage medium
CN111859967B (en) Entity identification method and device and electronic equipment
CN111858854A (en) Question-answer matching method based on historical dialogue information and related device
CN111724766B (en) Language identification method, related equipment and readable storage medium
JP2021530066A (en) Problem correction methods, devices, electronic devices and storage media for mental arithmetic problems
CN113642652A (en) Method, device and equipment for generating fusion model
CN112767921A (en) Voice recognition self-adaption method and system based on cache language model
CN111291552A (en) Method and system for correcting text content
CN113657098B (en) Text error correction method, device, equipment and storage medium
CN112966476B (en) Text processing method and device, electronic equipment and storage medium
CN113486174B (en) Model training, reading understanding method and device, electronic equipment and storage medium
CN111859948A (en) Language identification, language model training and character prediction method and device
CN113488023A (en) Language identification model construction method and language identification method
CN112183060A (en) Reference resolution method of multi-round dialogue system
CN115147849A (en) Training method of character coding model, character matching method and device
CN115454788A (en) Log anomaly detection method, device, equipment and storage medium
CN112989040B (en) Dialogue text labeling method and device, electronic equipment and storage medium
CN114896966A (en) Method, system, equipment and medium for positioning grammar error of Chinese text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant