CN112347789B - Punctuation prediction method, punctuation prediction device, punctuation prediction equipment and storage medium

Info

Publication number: CN112347789B
Application number: CN202011230897.9A
Authority: CN
Inventors: 李小喜; 李亚; 张为泰; 刘俊华
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2020-11-06
Filing date: 2020-11-06
Publication date: 2024-04-12
Anticipated expiration: 2040-11-06
Also published as: CN112347789A

Abstract

The application provides a punctuation prediction method, a punctuation prediction device, punctuation prediction equipment and a storage medium. The punctuation prediction method comprises the following steps: obtaining a text to be predicted, wherein the text to be predicted is a current recognition result of a current voice fragment; acquiring historical prediction information based on whether the text to be predicted is the first intermediate recognition result of the current voice fragment, wherein the historical prediction information is intermediate information that is generated in the process of punctuation prediction on the historical recognition result and is used for determining the punctuation prediction result; and predicting punctuation information of words in the text to be predicted according to the historical prediction information and the text to be predicted. The punctuation prediction method provided by the application has higher prediction accuracy and prediction efficiency, which makes it suitable for machine simultaneous interpretation scenarios.

Description

Punctuation prediction method, punctuation prediction device, punctuation prediction equipment and storage medium

Technical Field

The present disclosure relates to the field of natural language processing technologies, and in particular, to a punctuation prediction method, apparatus, device, and storage medium.

Background

In recent years, with the application of deep learning in fields such as speech and natural language processing, the accuracy of speech recognition has continuously improved, and the quality of machine translation has also continuously improved, with machine translation basically reaching the level of human translation. At the same time, advances in automatic speech recognition and machine translation have driven the development of machine simultaneous interpretation.

Standard automatic speech recognition systems typically convert audio into text without any punctuation. Such text is hard to read and hampers subsequent tasks (e.g., machine simultaneous interpretation); inserting appropriate punctuation into the recognized text solves this problem.

It will be appreciated that, to insert appropriate punctuation into the recognized text, it is first necessary to predict the punctuation information of each word in the recognized text, that is, whether a punctuation mark needs to be inserted after the word and, if so, which punctuation mark. How to predict the punctuation information of each word in the recognized text is therefore an urgent problem to be solved.

Disclosure of Invention

In view of this, the present application provides a punctuation prediction method, apparatus, device and storage medium, which are used for predicting punctuation information of each word in a speech recognition text, and the technical scheme is as follows:

a punctuation prediction method, comprising:

obtaining a text to be predicted, wherein the text to be predicted is a current recognition result of a current voice fragment, and the recognition result of a voice fragment comprises a plurality of intermediate recognition results and a final recognition result;

acquiring historical prediction information based on whether the text to be predicted is the first recognition result of the current voice fragment, wherein the historical prediction information is intermediate information which is generated in the process of punctuation prediction of the historical recognition result and is used for determining the punctuation prediction result;

and predicting punctuation information of words in the text to be predicted according to the historical prediction information and the text to be predicted.

Optionally, the punctuation prediction method further includes:

determining the update type corresponding to the text to be predicted according to the update condition of the text to be predicted relative to the previous recognition result;

if the update type corresponding to the text to be predicted is modified, executing the step of acquiring historical prediction information based on whether the text to be predicted is the first recognition result of the current voice segment;

if the update type corresponding to the text to be predicted is increased, counting the number of words added in the text to be predicted compared with the previous recognition result, and if the number of added words is greater than a first preset number, executing the step of acquiring historical prediction information based on whether the text to be predicted is the first recognition result of the current voice segment.

Optionally, the obtaining the history prediction information based on whether the text to be predicted is the first recognition result of the current speech segment includes:

if the text to be predicted is the first recognition result of the current voice segment, punctuation prediction information corresponding to the final recognition result of the voice segment before the current voice segment is obtained and used as historical prediction information;

if the text to be predicted is a non-first recognition result of the current voice segment, punctuation prediction information corresponding to the final recognition result of the voice segment before the current voice segment and punctuation prediction information corresponding to the previous recognition result of the current voice segment are obtained and used as historical prediction information;

the punctuation prediction information is intermediate information which is generated in the process of carrying out punctuation prediction on the corresponding recognition result and is used for determining the punctuation prediction result.

Optionally, the predicting punctuation information of the word in the text to be predicted according to the historical prediction information and the text to be predicted includes:

predicting punctuation information of words in the text to be predicted by using a pre-established punctuation prediction model based on the historical prediction information and the text to be predicted;

the punctuation prediction model is obtained by training with a punctuated training text, the training text is formed by splicing texts representing recognition results of a plurality of voice fragments, and when the punctuation prediction model is trained with the training text, for each word in the training text, the punctuation prediction model predicts the punctuation information after the word according to the words before the word and a second preset number of words after the word.

Optionally, the predicting punctuation information of the word in the text to be predicted by using a pre-established punctuation prediction model based on the historical prediction information and the text to be predicted includes:

removing punctuation prediction information of the last second preset number of words in the history recognition result from the history prediction information, wherein the information remaining after the removal is used as prediction reference information;

splicing the last second preset number of words in the history recognition result with the part which does not participate in punctuation prediction in the text to be predicted, and taking the spliced text as an input text;

and inputting the prediction reference information and the input text into the punctuation prediction model to perform punctuation prediction so as to obtain punctuation information of words in the text to be predicted.

Optionally, the removing punctuation prediction information of the last second preset number of words in the history recognition result from the history prediction information includes:

if the text to be predicted is the first recognition result of the current voice segment, punctuation prediction information of the last second preset number of words in the final recognition result of the forward adjacent voice segment of the current voice segment is removed from the historical prediction information;

and if the text to be predicted is a non-first recognition result of the current voice fragment, removing punctuation prediction information of the last second preset number of words in the previous recognition result of the current voice fragment from the historical prediction information.

Optionally, the splicing the last second preset number of words in the history recognition result with the portion of the text to be predicted, which does not participate in punctuation prediction, includes:

if the text to be predicted is the first recognition result of the current voice fragment, splicing the last second preset number of words in the historical recognition result with the whole text to be predicted;

if the text to be predicted is a non-first recognition result of the current voice segment, splicing the last second preset number of words in the historical recognition result with the part of the text to be predicted that is newly added compared with the previous intermediate recognition result of the current voice segment.
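The removal and splicing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: recognition results are assumed to be lists of words, the historical prediction information is assumed to be a flat list holding one cached entry per history word, and a non-first result is assumed to extend the previous result without modifying it. The function name `build_model_input` and the parameter `lookahead` (standing in for the second preset number) are hypothetical.

```python
from typing import Any, List, Tuple


def build_model_input(
    history_words: List[str],
    history_info: List[Any],
    current_words: List[str],
    is_first: bool,
    lookahead: int,
) -> Tuple[List[Any], List[str]]:
    """Return (prediction reference information, input text).

    history_words: the most recent history recognition result, i.e. the final
        result of the preceding segment (first result) or the previous result
        of the current segment (non-first result).
    history_info: cached per-word intermediate information (assumed layout).
    lookahead: the "second preset number" of words.
    """
    if lookahead <= 0:
        raise ValueError("lookahead must be positive")
    # Drop the cached information of the last `lookahead` history words;
    # what remains is the prediction reference information.
    reference_info = history_info[:-lookahead]
    # Those same words are fed to the model again, followed by the part of
    # the text to be predicted that has not yet taken part in prediction.
    tail_words = history_words[-lookahead:]
    if is_first:
        new_words = current_words
    else:
        new_words = current_words[len(history_words):]
    return reference_info, tail_words + new_words
```

For a first recognition result the whole text to be predicted is appended; for a non-first result only the newly added words are appended, so the model never re-reads words whose intermediate information is already cached.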

Optionally, the inputting the prediction reference information and the input text into the punctuation prediction model to perform punctuation prediction, so as to obtain punctuation information of the word in the text to be predicted, includes:

determining a characterization vector of each word in the input text by using the punctuation prediction model;

determining a target vector corresponding to each word in the input text by using the punctuation prediction model, the characterization vector of each word in the input text and the prediction reference information, wherein the target vector corresponding to a word in the input text characterizes the relevance between that word and, respectively, the words before it and the second preset number of words after it in the input text;

and determining punctuation information of the words in the text to be predicted by using the punctuation prediction model, the characterization vector of each word in the input text and the target vector corresponding to each word in the input text.

Optionally, the determining punctuation information of the words in the text to be predicted by using the punctuation prediction model, the characterization vector of each word in the input text and the target vector corresponding to each word in the input text includes:

for each word in the input text, predicting punctuation information after each word in a second preset number of words before the word by using the punctuation prediction model, the characterization vector of the word and the target vector corresponding to the word;

for each word in the input text, determining punctuation information after the word by using the punctuation prediction model and punctuation information predicted for the word based on a second preset number of words after the word;

and acquiring punctuation information of the words in the text to be predicted from the punctuation information of each word in the input text.

A punctuation prediction apparatus, comprising: a to-be-predicted text acquisition module, a history prediction information acquisition module and a punctuation prediction module;

the to-be-predicted text acquisition module is used for obtaining a text to be predicted, wherein the text to be predicted is a current recognition result of a current voice fragment, and the recognition result of a voice fragment comprises a plurality of intermediate recognition results and a final recognition result;

the history prediction information acquisition module is used for acquiring history prediction information based on whether the text to be predicted is the first recognition result of the current voice segment, wherein the history prediction information is intermediate information which is generated in the process of punctuation prediction on the history recognition result and is used for determining the punctuation prediction result;

and the punctuation prediction module is used for predicting the punctuation information of words in the text to be predicted according to the historical prediction information and the text to be predicted.

Optionally, the punctuation prediction apparatus further includes: an update type determining module and a quantity counting module;

the update type determining module is used for determining the update type corresponding to the text to be predicted according to the update condition of the text to be predicted compared with the previous recognition result;

the history prediction information acquisition module is specifically configured to obtain history prediction information based on whether the text to be predicted is the first recognition result of the current voice segment when the update type corresponding to the text to be predicted is modified;

the quantity counting module is used for counting the number of words added in the text to be predicted compared with the previous recognition result when the update type corresponding to the text to be predicted is increased;

the history prediction information acquisition module is further specifically configured to obtain history prediction information based on whether the text to be predicted is the first recognition result of the current voice segment when the number of added words is greater than a first preset number.

Optionally, the punctuation prediction module is specifically configured to predict punctuation information of words in the text to be predicted by using a pre-established punctuation prediction model based on the historical prediction information and the text to be predicted.

A punctuation prediction device, comprising: a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to implement each step of the punctuation prediction method described in any one of the above.

A readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the punctuation prediction method of any of the above.

According to the above scheme, after the text to be predicted is obtained, historical prediction information is obtained based on whether the text to be predicted is the first recognition result of the current voice segment, and the punctuation information of words in the text to be predicted is then predicted according to the historical prediction information and the text to be predicted. Because the prediction combines the historical prediction information with the text to be predicted, more semantic information is available and a more accurate punctuation prediction result can be obtained. In addition, the historical prediction information in the application is the intermediate information that is generated in the process of punctuation prediction on the historical recognition result and is used for determining the punctuation prediction result, rather than the historical recognition result itself; compared with predicting directly from the historical recognition result, this greatly reduces the amount of calculation and thus improves punctuation prediction efficiency. Furthermore, the historical prediction information is combined for prediction whether the text to be predicted is an intermediate recognition result or the final recognition result of a voice segment, so an accurate prediction result can be obtained in both cases.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a punctuation prediction method according to an embodiment of the present application;

FIG. 2 is another schematic flow chart of the punctuation prediction method according to an embodiment of the present application;

FIG. 3 is a schematic flow chart of predicting punctuation information of words in a text to be predicted by using a pre-established punctuation prediction model based on historical prediction information and the text to be predicted according to an embodiment of the present application;

FIG. 4 is a schematic flow chart of punctuation prediction performed by inputting prediction reference information and an input text into a punctuation prediction model according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a punctuation prediction apparatus according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a punctuation prediction device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In order to realize punctuation prediction, especially punctuation prediction in a machine simultaneous interpretation scenario, the inventors conducted research. The initial idea was to perform punctuation prediction with a scheme based on a statistical language model.

A statistical language model aims to obtain, for a given word sequence, the probability of that word sequence in the corresponding corpus. In the basic framework of the statistical language model, the probability of a segment of text can be expressed as:
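The expression itself is not reproduced in this text. Assuming the standard chain-rule factorization is what is meant, and writing $w_1, \dots, w_T$ for the words of the text (notation introduced here for illustration only), it would read:

$$
P(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
$$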

According to the Markov assumption, the probability of a word appearing depends only on the N words that precede it, where N is an integer greater than 1. The smaller N is, the more often each word combination occurs in the training set, and the more reliable the resulting statistics are. The larger N is, the more information is available for predicting the next word and the higher the accuracy, but the number of parameters and the computation time grow accordingly. It is generally considered that a larger N gives better model performance; in practice, however, N is usually set to an integer no greater than 4, a value arrived at through extensive practice as the best balance between efficiency and accuracy. An N-gram language model nevertheless has difficulty using history information, because it attends only to the N words preceding the current word. In most cases a language model with N=3 is used, that is, the model attends only to the preceding 3 words, which is far from sufficient for punctuation prediction. The accuracy of a punctuation prediction scheme based on a statistical language model is therefore low.
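The trade-off between context length and reliable counts can be illustrated with a small sketch. The toy corpus and the helper names `ngram_counts` and `mean_occurrences` are invented for illustration; `order` here is the total length of the counted word run, so it corresponds to the context size N plus the predicted word.

```python
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], order: int) -> Counter:
    """Count every run of `order` consecutive tokens."""
    return Counter(
        tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)
    )


def mean_occurrences(tokens: List[str], order: int) -> float:
    """Average number of times each distinct n-gram of this order occurs."""
    counts = ngram_counts(tokens, order)
    return sum(counts.values()) / len(counts)


# Toy corpus, invented for illustration only.
corpus = "the cat sat on the mat the cat ate".split()
averages = {order: mean_occurrences(corpus, order) for order in (1, 2, 3)}
```

On this corpus the average count per distinct n-gram falls as the order grows, which is the sparsity effect the paragraph describes: longer contexts carry more information but are each observed fewer times.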

In view of the above problems with punctuation prediction schemes based on statistical language models, the inventors studied further and conceived a punctuation prediction scheme based on a bidirectional long short-term memory network. The general idea of this scheme is as follows: when punctuation prediction is performed on the current recognition result of the current voice segment, the final recognition results of the m voice segments preceding the current voice segment are spliced with the current recognition result of the current voice segment, and the spliced text is input into the bidirectional long short-term memory network for punctuation prediction.

However, for one speech segment, n intermediate recognition results are usually obtained before the final recognition result. If the recognition results of the previous m speech segments are spliced every time, they are recomputed n times, so punctuation prediction efficiency is low. To improve prediction efficiency, the recognition results of the previous m speech segments may be spliced only when punctuation prediction is performed on the final recognition result of a speech segment; the intermediate recognition results are then predicted without that context, which lowers their prediction accuracy. In machine simultaneous interpretation, however, the translation system can only translate from intermediate recognition results, so this scenario places high demands on the punctuation prediction accuracy of intermediate recognition results.

In view of the above problems of the punctuation prediction scheme based on the bidirectional long short-term memory network, the inventors carried out further in-depth research and finally arrived at a punctuation prediction method that has a good prediction effect and can be applied to machine simultaneous interpretation scenarios.

First embodiment

Referring to fig. 1, a flowchart of a punctuation prediction method provided in an embodiment of the present application is shown, where the method may include:

Step S101: obtaining the text to be predicted.

The text to be predicted is the current recognition result of the current voice fragment.

It should be noted that, in the process of recognizing a speech segment to obtain a final recognition result, several intermediate recognition results are usually obtained. That is, the recognition result of a speech segment includes several intermediate recognition results and a final recognition result, and each recognition result at least adds content compared with the previous recognition result.

Illustratively, the current speech segment is the second speech segment VAD2, and in the process of recognizing VAD2, the following recognition results are generated:

Recognition result 1: at this point

Recognition result 2: in this spring day

Recognition result 3: flying in the spring

Recognition result 4: during the spring, the fragrant festival is good

Recognition result 5: in the spring, the fragrant flowers gather in the good time

Recognition result 6: in the spring, the fragrant festival is concentrated in Beijing Tiananmen

Recognition results 1 to 5 are intermediate recognition results of the voice segment VAD2, and recognition result 6 is the final recognition result of the voice segment VAD2.

Step S102: acquiring historical prediction information based on whether the text to be predicted is the first recognition result of the current voice fragment.

The history prediction information is intermediate information which is generated in the process of punctuation prediction on the history recognition result and is used for determining the punctuation prediction result. It should be noted that the history recognition result is a recognition result obtained before the current recognition result.

Specifically, based on whether the text to be predicted is the first recognition result of the current speech segment, the process of obtaining the history prediction information may include:

Step S102a, if the text to be predicted is the first recognition result of the current voice segment, punctuation prediction information corresponding to the final recognition result of the voice segment before the current voice segment is obtained and used as history prediction information.

Assuming that the current speech segment is VAD2 in the above example and the text to be predicted is "at this point", the text to be predicted is the first recognition result of the current speech segment.

It should be noted that the first recognition result of a speech segment usually carries an identifier indicating that it is the first recognition result. If an obtained recognition result carries this identifier, it can be determined to be the first recognition result of a speech segment.

In this embodiment, the punctuation prediction information corresponding to the final recognition result of the speech segment before the current speech segment is the intermediate information that is generated in the process of punctuation prediction on that final recognition result and is used for determining the punctuation prediction result.

The punctuation prediction information corresponding to the final recognition result of the voice segment before the current voice segment can be obtained in several ways. In one possible implementation, the punctuation prediction information corresponding to the final recognition results of all voice segments before the current voice segment is obtained as historical prediction information. In another possible implementation, considering that the semantics of the current voice segment are generally more strongly correlated with the one or more voice segments closest to it, the punctuation prediction information corresponding to the final recognition results of a preset number (such as 3) of voice segments before the current voice segment is obtained as historical prediction information. For example, if the current voice segment is the 5th voice segment VAD5, the punctuation prediction information corresponding to the final recognition results of VAD2 to VAD4 can be obtained as historical prediction information.

Step S102b, if the text to be predicted is the non-first recognition result of the current speech segment, the punctuation prediction information corresponding to the final recognition result of the speech segment before the current speech segment and the punctuation prediction information corresponding to the previous recognition result of the current speech segment are obtained and used as the historical prediction information.

The punctuation prediction information corresponding to the previous recognition result of the current voice segment is intermediate information used for determining the punctuation prediction result and is generated in the punctuation prediction process of the previous recognition result of the current voice segment.

It should be noted that the previous recognition result of the current speech segment refers to the recognition result that immediately precedes the current recognition result among the recognition results of the current speech segment. Assuming that the current speech segment is VAD2 in the above example and the text to be predicted is recognition result 5 ("in the spring, the fragrant flowers gather in the good time"), the punctuation prediction information corresponding to the final recognition result of VAD1 and the punctuation prediction information corresponding to recognition result 4 of VAD2 ("during the spring, the fragrant festival is good") are obtained as history prediction information.
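Steps S102a and S102b can be sketched as a small cache lookup. This is only an illustration under assumptions: the class `HistoryStore`, its field names and the dictionary layout of the returned history are hypothetical, the cached entries are opaque placeholders for whatever intermediate information the model produces, and the variant shown keeps a preset number of earlier segments (3 in the example above).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class HistoryStore:
    """Hypothetical cache of intermediate punctuation prediction information."""

    # Preset number of earlier segments to use; must be positive (assumed).
    max_segments: int = 3
    # One cached entry per finished segment, oldest first.
    final_info: List[Any] = field(default_factory=list)
    # Cached entry for the previous recognition result of the current segment.
    previous_info: Optional[Any] = None

    def get_history(self, is_first: bool) -> Dict[str, Any]:
        # S102a: information from the final results of the preceding segments.
        history: Dict[str, Any] = {
            "previous_segments": self.final_info[-self.max_segments:]
        }
        # S102b: a non-first result additionally uses the information cached
        # for the previous recognition result of the current segment.
        if not is_first:
            history["previous_result"] = self.previous_info
        return history
```

With four finished segments cached and VAD5 as the current segment, the lookup returns the entries for VAD2 to VAD4, matching the example in the text.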

Step S103: predicting punctuation information of words in the text to be predicted according to the historical prediction information and the text to be predicted.

When punctuation prediction is performed on the text to be predicted, this embodiment combines the historical prediction information, which provides more semantic information and therefore yields a more accurate punctuation prediction result.

According to the punctuation prediction method provided by this embodiment, after the text to be predicted is obtained, the history prediction information is obtained, and the punctuation information of words in the text to be predicted is then predicted according to the history prediction information and the text to be predicted. Because the prediction combines the historical prediction information with the text to be predicted, more semantic information is available and a more accurate punctuation prediction result can be obtained. The historical prediction information in this embodiment is the intermediate information that is generated in the process of punctuation prediction on the historical recognition result and is used for determining the punctuation prediction result, rather than the historical recognition result itself; compared with predicting directly from the historical recognition result, this greatly reduces the amount of calculation and thus improves punctuation prediction efficiency. In addition, the method combines the historical prediction information whether it is predicting an intermediate recognition result or the final recognition result of a voice segment, so a more accurate prediction result can be obtained in both cases.

Second embodiment

In order to improve punctuation prediction efficiency, this embodiment provides another punctuation prediction method. Referring to FIG. 2, which shows a flow chart of this punctuation prediction method, the method may include:

Step S201: obtaining the text to be predicted.

The text to be predicted is a current recognition result of the current speech segment, which is a recognition result obtained in the process of recognizing the current speech segment, and may be an intermediate recognition result of the current speech segment or a final recognition result of the current speech segment.

Step S202: determining the update type corresponding to the text to be predicted according to how the text to be predicted has been updated relative to the previous recognition result; executing step S203 if the update type corresponding to the text to be predicted is increased, and executing step S205 if the update type corresponding to the text to be predicted is modified.

The text to be predicted can be updated relative to the previous recognition result in two ways: words are only added, or words are both added and modified. If the text to be predicted only adds words relative to the previous recognition result, the update type corresponding to the text to be predicted is determined to be increased, and step S203 is executed. If the text to be predicted not only adds words but also modifies words relative to the previous recognition result, the update type corresponding to the text to be predicted is determined to be modified, and step S205 is executed.

Assuming that the current speech segment is VAD2 mentioned in the above embodiment and the text to be predicted is recognition result 3 ("in this spring"), since recognition result 3 adds only the one word "spring" relative to recognition result 1 ("in this spring"), the update type corresponding to the text to be predicted can be determined to be increase. Assuming that the current speech segment is VAD2 mentioned in the above embodiment and the text to be predicted is recognition result 4 ("good time of the fragrant in the spring"), since recognition result 4 not only adds "good time" relative to recognition result 3 ("flying in the spring") but also modifies "flying" to "fragrant", the update type corresponding to the text to be predicted can be determined to be modification.

In addition, it should be noted that, if the text to be predicted is the first recognition result of the current speech segment, it is determined that the update type corresponding to the text to be predicted is increased.
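The decision in step S202 can be sketched as a prefix check. This is only an illustrative sketch: the function name `classify_update`, the representation of a recognition result as a list of words, and the treatment of any non-append change as "modification" are assumptions that are not stated in this embodiment.

```python
from typing import Optional, Sequence


def classify_update(prev_words: Optional[Sequence[str]],
                    cur_words: Sequence[str]) -> str:
    """Return "increase" or "modification" for the current recognition result.

    prev_words is None when the current result is the first recognition
    result of the speech segment (hypothetical convention for this sketch).
    """
    # A first recognition result is treated as "increase".
    if prev_words is None:
        return "increase"
    n = len(prev_words)
    # Pure append: the previous result is an unchanged prefix of the current one.
    if len(cur_words) >= n and list(cur_words[:n]) == list(prev_words):
        return "increase"
    # Assumption: every other kind of change counts as "modification".
    return "modification"
```

With this sketch, a result that only appends words to the previous result yields "increase", and a result that rewrites an earlier word yields "modification".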

Step S203: count the number of words that the text to be predicted adds relative to the previous recognition result.

Optionally, the number of words that the text to be predicted adds relative to the previous recognition result can be determined by using the Levenshtein edit distance algorithm with words as units.

Step S204: judge whether the number of added words is greater than or equal to a first preset number (such as 2); if the number of added words is greater than or equal to the first preset number, execute step S205, and if the number of added words is less than the first preset number, acquire the next recognition result of the current voice segment as the text to be predicted.
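Steps S203 and S204 can be sketched as follows. The sketch uses a standard word-level Levenshtein distance; when the update type is increase (the previous result is a prefix of the current one), this distance equals the number of added words. The function names and the default value of `first_preset_number` are illustrative assumptions.

```python
from typing import Sequence


def word_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein edit distance between two word sequences (words as units)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        cur = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            cur.append(min(prev[j] + 1,          # delete a word of a
                           cur[j - 1] + 1,       # insert a word of b
                           prev[j - 1] + cost))  # keep or substitute
        prev = cur
    return prev[-1]


def should_predict_now(prev_words: Sequence[str],
                       cur_words: Sequence[str],
                       first_preset_number: int = 2) -> bool:
    """Step S204 gate for the "increase" update type.

    Returns True when enough words were added to run punctuation prediction,
    and False when the next recognition result should be awaited instead.
    """
    added = word_edit_distance(prev_words, cur_words)
    return added >= first_preset_number
```

The gate avoids running the model for every single-word increment, which is the efficiency gain this embodiment describes.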

Step S205: judge whether the text to be predicted is the first recognition result of the current voice fragment; if so, execute step S206a, and if not, execute step S206b.

Specifically, whether the text to be predicted has the indication mark of the first recognition result or not can be judged, if the text to be predicted has the indication mark of the first recognition result, the text to be predicted is judged to be the first recognition result of the current voice segment, and if the text to be predicted does not have the indication mark of the first recognition result, the text to be predicted is judged to be the non-first recognition result of the current voice segment.

Step S206a: acquire the punctuation prediction information corresponding to the final recognition result of the voice fragment before the current voice fragment as the historical prediction information.

The punctuation prediction information corresponding to the final recognition result of the voice segment before the current voice segment is intermediate information that is generated in the process of punctuation prediction on that final recognition result and is used for determining the punctuation prediction result.

Step S206b: acquire the punctuation prediction information corresponding to the final recognition result of the voice segment before the current voice segment and the punctuation prediction information corresponding to the previous recognition result of the current voice segment as the historical prediction information.

The specific implementation process and the related explanation of the step S206a and the step S206b can be referred to the specific implementation process and the related explanation of the step S102a and the step S102b in the first embodiment, and the description of this embodiment is omitted here.
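The branch between step S206a and step S206b can be sketched as below. The intermediate information is treated as an opaque object, since its concrete form is described in the first embodiment; the function and parameter names are hypothetical.

```python
from typing import Any, List, Optional


def select_history_info(is_first_result: bool,
                        prev_segment_final_info: Any,
                        prev_result_info: Optional[Any] = None) -> List[Any]:
    """Collect the historical prediction information for the current text.

    prev_segment_final_info: intermediate information cached while predicting
        the final recognition result of the preceding speech segment.
    prev_result_info: intermediate information cached while predicting the
        previous recognition result of the current speech segment.
    """
    if is_first_result:
        # Step S206a: only the preceding segment's information is available.
        return [prev_segment_final_info]
    # Step S206b: also reuse the information of the previous recognition result.
    return [prev_segment_final_info, prev_result_info]
```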

Step S207: predict the punctuation information of the words in the text to be predicted according to the historical prediction information and the text to be predicted.

The punctuation prediction method provided by this embodiment has the following advantages: (1) when the punctuation information of the text to be predicted is predicted, the historical prediction information is combined with the information of the text to be predicted, so that more semantic information is available and a more accurate punctuation prediction result can be obtained; (2) the historical prediction information in this embodiment is intermediate information that is generated in the process of punctuation prediction on the historical recognition result and is used for determining the punctuation prediction result, rather than the historical recognition result itself, so that, compared with predicting directly from the historical recognition result, the amount of calculation is greatly reduced and the punctuation prediction efficiency is improved; (3) the historical prediction information is combined whether an intermediate recognition result or a final recognition result of the voice fragment is being predicted, so that a more accurate prediction result can be obtained in both cases; (4) when the update type corresponding to the text to be predicted is increase, punctuation prediction is performed on the text to be predicted only when the number of added words is greater than or equal to the first preset number, which improves the efficiency of punctuation prediction on the recognition results of the voice fragment.

Third embodiment

The present embodiment describes a specific implementation procedure of "predicting punctuation information of a word in a text to be predicted according to historical prediction information and the text to be predicted" in the above embodiment.

The process of predicting the punctuation information of the words in the text to be predicted based on the historical prediction information and the text to be predicted may include: predicting the punctuation information of the words in the text to be predicted by using a pre-established punctuation prediction model based on the historical prediction information and the text to be predicted.

The punctuation prediction model is trained with punctuated training text, and the training text is formed by splicing texts that represent the recognition results of a plurality of voice fragments. When the punctuation prediction model is trained with the training text, it predicts the punctuation information of each word in the training text according to the words before the word and a second preset number of words after the word.

Next, a description will be given of a process of predicting punctuation information of words in a text to be predicted by using a punctuation prediction model established in advance based on the history prediction information and the text to be predicted.

Referring to fig. 3, a flow chart of predicting punctuation information of words in a text to be predicted based on historical prediction information and the text to be predicted by using a pre-established punctuation prediction model may include:

Step S301: remove the punctuation prediction information of the last second preset number of words in the history recognition result from the historical prediction information, and use the information obtained after removal as the prediction reference information.

Specifically, if the text to be predicted is the first recognition result of the current voice segment, punctuation prediction information of the last second preset number of words in the final recognition result of the forward adjacent voice segment of the current voice segment is removed from the historical prediction information; if the text to be predicted is a non-first recognition result of the current voice segment, punctuation prediction information of a last second preset number of words in a previous recognition result of the current voice segment is removed from the historical prediction information.

It should be noted that, the forward adjacent speech segment of the current speech segment refers to a speech segment located before and adjacent to the current speech segment, and the previous intermediate recognition result of the current speech segment is preferably a recognition result located before and adjacent to the text to be predicted in the recognition results of the current speech segment.

Illustratively, the second preset number is 4 and the current speech segment is the 2nd speech segment VAD2. If the text to be predicted is the first recognition result of VAD2, the historical prediction information is the punctuation prediction information corresponding to the final recognition result of VAD1, and step S301 removes the punctuation prediction information of the last 4 words in the final recognition result of VAD1 from the historical prediction information; for example, if the final recognition result of VAD1 is "honored mr women are good in afternoon", the punctuation prediction information of the 4 words "women are good in afternoon" is removed. If the text to be predicted is a non-first recognition result of VAD2, for example the 3rd intermediate recognition result of VAD2, the historical prediction information is the punctuation prediction information corresponding to the final recognition result of VAD1 together with the punctuation prediction information of the 2nd recognition result of VAD2, and step S301 removes the punctuation prediction information of the last 4 words of the 2nd intermediate recognition result of VAD2 from the historical prediction information.

As mentioned above, when the punctuation prediction model is trained with the training text, it predicts the punctuation after each word according to the words before that word and the second preset number of words after it. Likewise, when the punctuation prediction model predicts on the history recognition result, the punctuation after each word is predicted according to the words before that word and the second preset number of words after it. However, each of the last second preset number of words in the history recognition result is followed by fewer than the second preset number of words, so the punctuation prediction information of these words may be inaccurate. To prevent this information from adversely affecting the punctuation prediction for the text to be predicted, this embodiment removes the punctuation prediction information of the last second preset number of words in the history recognition result from the historical prediction information.
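Step S301 can be sketched as trimming the tail of a cache. The sketch assumes, purely for illustration, that the historical prediction information is stored as one entry per word in word order; the embodiment does not fix this layout, and the name `trim_history` is hypothetical.

```python
from typing import Any, List, Sequence


def trim_history(history_info: Sequence[Any],
                 second_preset_number: int) -> List[Any]:
    """Drop the entries of the last `second_preset_number` words.

    The remaining entries play the role of the prediction reference
    information; the dropped words are predicted again in the next round.
    """
    if second_preset_number <= 0:
        return list(history_info)
    # Slicing with a negative stop removes the unreliable tail entries.
    return list(history_info[:-second_preset_number])
```

If the cache holds fewer entries than the second preset number, the sketch returns an empty list, which corresponds to having no reliable reference information.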

Step S302: splice the last second preset number of words in the history recognition result with the part of the text to be predicted that has not participated in punctuation prediction, and use the spliced text as the input text.

Because step S301 removes the prediction information of the last second preset number of words in the history recognition result, the information of these words is missing. For this reason, this step splices the last second preset number of words in the history recognition result with the part of the text to be predicted that has not participated in punctuation prediction, and uses the spliced text as the input text of the punctuation prediction model; that is, the last second preset number of words in the history recognition result participate in punctuation prediction again.

It should be noted that if the text to be predicted is the first recognition result of the current speech segment, none of the words in the text to be predicted has participated in punctuation prediction, and in this case the last second preset number of words in the history recognition result are spliced with the whole text to be predicted. If the text to be predicted is a non-first recognition result of the current speech segment, the part of the text to be predicted that has not participated in punctuation prediction consists of the words that the text to be predicted adds relative to the previous recognition result of the current speech segment, and in this case the last second preset number of words in the history recognition result are spliced with those added words.

Illustratively, the second preset number is 4 and the current speech segment is the 2nd speech segment VAD2. If the text to be predicted is the first recognition result of VAD2 and the final recognition result of the speech segment forward adjacent to the current speech segment is "honored women are good afternoon", step S302 splices the last 4 words of that final recognition result ("women are good afternoon") with the text to be predicted, giving "lady good afternoon <SEP> is in this", and the spliced text is used as the input text of the punctuation prediction model. If the text to be predicted is a non-first recognition result of VAD2, for example the 5th recognition result of VAD2, which adds "the user focused on" relative to the 4th recognition result of VAD2 ("the favorite time of arfein this spring"), step S302 splices the last 4 words of the 4th recognition result of VAD2 ("the favorite time of arfein") with the added words "the user focused on", and the spliced text is used as the input text of the punctuation prediction model.

Note that the symbol <SEP> is used to distinguish the recognition results of different speech segments; that is, the text before the symbol <SEP> and the text after it are recognition results of different speech segments.
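The splicing in step S302 can be sketched as follows. The function name, the word-list representation and the `crosses_segment` flag are illustrative assumptions; the flag is True when the history recognition result belongs to the preceding speech segment (first recognition result of the current segment), in which case the symbol <SEP> is placed between the two parts.

```python
from typing import List, Sequence

SEP = "<SEP>"


def build_input_text(history_words: Sequence[str],
                     new_words: Sequence[str],
                     second_preset_number: int,
                     crosses_segment: bool) -> List[str]:
    """Splice the last K history words with the not-yet-predicted words.

    history_words: the history recognition result (word list).
    new_words: the whole text to be predicted (first result) or only the
        words added relative to the previous recognition result.
    """
    k = second_preset_number
    tail = list(history_words[-k:]) if k > 0 else []
    if crosses_segment:
        # The two parts come from different speech segments.
        return tail + [SEP] + list(new_words)
    return tail + list(new_words)
```

For example, with a second preset number of 4, a six-word history result and a two-word first recognition result, the input text consists of the last four history words, the symbol <SEP>, and the two new words.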

Step S303: input the prediction reference information and the input text into the punctuation prediction model for punctuation prediction, so as to obtain the punctuation information of the words in the text to be predicted.

Referring to fig. 4, which shows a schematic flow chart of inputting the prediction reference information and the input text into the punctuation prediction model for punctuation prediction, the process may include:

Step S401: determine a characterization vector of each word in the input text by using the punctuation prediction model.

Wherein, the characterization vector of a word is a vector for representing the semantic meaning of the word.

Specifically, each word in the input text is input into a word vector determining module of the punctuation prediction model, and a characterization vector of each word in the input text is obtained.

Step S402: determine a target vector corresponding to each word in the input text by using the punctuation prediction model, the prediction reference information and the characterization vector of each word in the input text.

The target vector corresponding to a word in the input text represents the relevance to that word of the words before it and of the second preset number of words after it in the input text.

Specifically, for each word in the input text, the target vector corresponding to the word may be determined by using the attention module of the punctuation prediction model, the prediction reference information, the characterization vector of the word, the characterization vectors of a third preset number of words before the word and the characterization vectors of a second preset number of words after the word; the target vector represents the relevance to the word of each of the third preset number of words before it and each of the second preset number of words after it.

When the target vector corresponding to each word in the input text is determined, a mask may be used to mask the information of all words other than the word itself, the third preset number of words before it and the second preset number of words after it.
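The masking described above can be sketched as a boolean window mask over the positions of the input text. This is only an illustration of the window shape: how the attention module consumes the mask and how the prediction reference information enters the attention computation are not specified here, and the function name is hypothetical.

```python
from typing import List


def build_window_mask(num_words: int,
                      third_preset_number: int,
                      second_preset_number: int) -> List[List[bool]]:
    """mask[i][j] is True when word i is allowed to see word j.

    Word i sees itself, up to `third_preset_number` words before it and
    up to `second_preset_number` words after it; everything else is masked.
    """
    return [
        [i - third_preset_number <= j <= i + second_preset_number
         for j in range(num_words)]
        for i in range(num_words)
    ]
```

Limiting the look-ahead to the second preset number of words is what lets the model be applied to streaming recognition results without waiting for the whole sentence.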

Step S403: determine the punctuation information of the words in the text to be predicted by using the punctuation prediction model, the characterization vector of each word in the input text and the target vector corresponding to each word in the input text.

Specifically, the process of determining the punctuation information of the words in the text to be predicted by using the punctuation prediction model, the characterization vector of each word in the input text and the target vector corresponding to each word in the input text includes:

Step S4031: for each word in the input text, predict the punctuation information after each of the second preset number of words before the word by using the punctuation information determining module of the punctuation prediction model, the characterization vector of the word and the target vector corresponding to the word.

Specifically, for each word in the input text, the punctuation prediction model, the characterization vector of the word and the target vector corresponding to the word are used to determine, for each of the second preset number of words before the word, the probability that the punctuation after it belongs to each of a preset plurality of punctuation categories, and the punctuation information after each of the second preset number of words before the word is determined according to the determined probabilities. The preset plurality of punctuation categories may include: no punctuation, stop mark, comma, period, question mark and exclamation mark.

For example, the second preset number is 4 and the input text includes the word "good", the 4 words before which are "women of mr. Each woman". Step S4031 then predicts the punctuation information after each of these four words by using the punctuation prediction model, the characterization vector of the word "good" and the target vector corresponding to the word "good".
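The step from per-category scores to punctuation information can be sketched as below. The use of a softmax to obtain probabilities and of the highest-probability category as the decision are assumptions for illustration; the embodiment only states that the punctuation information is determined according to the probabilities. The category labels are illustrative names for the six categories listed above.

```python
import math
from typing import List, Sequence, Tuple

# Illustrative labels for the preset punctuation categories.
CATEGORIES = ["none", "stop_mark", "comma", "period", "question", "exclamation"]


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities (numerically stable form)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_category(scores: Sequence[float]) -> Tuple[str, float]:
    """Return the most probable category and its probability (assumed rule)."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CATEGORIES[best], probs[best]
```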

Step S4032: for each word in the input text, determine the punctuation information after the word by using the punctuation information determining module of the punctuation prediction model and the punctuation information predicted for the word on the basis of the second preset number of words after the word.

The punctuation information of a word in the input text may be determined from the punctuation information predicted for the word by the second preset number of words following it. Optionally, it may be determined as the punctuation information predicted for the word by the word that is farthest from it among the second preset number of words following it.

For example, if the second preset number is 4 and the input text includes the word "mr", the 4 words after which are "each woman is in afternoon", the punctuation information predicted for "mr" by the word "afternoon" is determined as the punctuation information of "mr".

Note that, for each of the last second preset number of words of the input text, fewer than the second preset number of words follow it, so this embodiment determines the punctuation information predicted for the word by the farthest word after it as the punctuation information of the word. For example, the second preset number is 4 and the last 4 words in the input text are "this spring is fragrant"; only 3 words follow "this", and this embodiment determines the punctuation information predicted for "this" by the farthest of these words as the punctuation information of "this". It should also be noted that the last word in the input text has no prediction result in the current round, because no word follows it; the punctuation information of the last word in the input text is predicted together with the next text to be predicted.
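The selection rule of steps S4031 and S4032 can be sketched as follows. The data layout is an assumption made for illustration: `lookback_preds[j]` is a dictionary that maps an offset d (1 to the second preset number) to the punctuation category that word j predicted for the word at position j - d. The function name is hypothetical.

```python
from typing import Dict, List, Optional


def resolve_punctuation(lookback_preds: List[Dict[int, str]],
                        second_preset_number: int) -> List[Optional[str]]:
    """Pick, for each word, the prediction made by the farthest following word.

    For word i the deciding word is i + K when it exists, otherwise the last
    word of the input text. The last word itself gets None, because no word
    follows it in the current round.
    """
    n = len(lookback_preds)
    resolved: List[Optional[str]] = []
    for i in range(n):
        j = min(i + second_preset_number, n - 1)  # farthest available word
        d = j - i
        if d == 0:
            resolved.append(None)  # deferred to the next prediction round
        else:
            resolved.append(lookback_preds[j][d])
    return resolved
```

Taking the prediction of the farthest word means each decision is made with the largest available right-hand context.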

Step S4033: acquire the punctuation information of the words in the text to be predicted from the punctuation information of the words in the input text.

In this embodiment, punctuation information of words in the text to be predicted, which does not have punctuation information, may be obtained from the punctuation information of each word in the input text. Specifically, if the text to be predicted is the first recognition result of the current voice segment, punctuation information of each word in the text to be predicted is obtained from punctuation information of each word in the input text; if the text to be predicted is the non-first recognition result of the current voice segment, the punctuation information of the word, which is added in the text to be predicted compared with the previous recognition result of the current voice segment, is obtained from the punctuation information of each word in the input text.
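Step S4033 can be sketched as taking the trailing part of the per-word results, under the assumption that the input text was built by placing the not-yet-predicted words (the whole text to be predicted, or only the added words) at the end of the input text. The function name and arguments are hypothetical.

```python
from typing import List, Optional, Sequence


def labels_for_text(input_labels: Sequence[Optional[str]],
                    num_target_words: int) -> List[Optional[str]]:
    """Return the punctuation information of the last `num_target_words` words.

    num_target_words is the length of the text to be predicted for a first
    recognition result, or the number of added words otherwise.
    """
    if num_target_words <= 0:
        return []
    return list(input_labels[-num_target_words:])
```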

Since punctuation information of the text to be predicted is determined using a punctuation prediction model established in advance, a process of establishing the punctuation prediction model is described next.

The process of establishing the punctuation prediction model comprises the following steps:

Step a1: pre-train an initial punctuation prediction model by using the training data in a first training data set to obtain a pre-trained punctuation prediction model.

The first training data set comprises a plurality of pieces (typically hundreds of millions) of training data, each piece of training data being sentence-level monolingual text data with punctuation.

Step a2: screen out training data of better quality from the first training data set.

Specifically, a part of the data with better quality and a part of the data with worse quality can be manually screened out from the first training data set to obtain two classes of training data. A binary classification model is trained with the two classes of screened training data, and the training data in the first training data set is then classified with the binary classification model, so that training data of better quality (for example, twenty million pieces) can be obtained from the first training data set according to the classification result.

Step a3: construct new training data by using the screened training data, form a second training data set from the constructed training data, and fine-tune the pre-trained punctuation prediction model by using the training data in the second training data set; the fine-tuned punctuation prediction model is the final punctuation prediction model.

Specifically, the new training data may be constructed using the selected training data in three ways:

mode one, randomly inserting the symbol "<SEP>" into a sentence after word segmentation;

mode two, inserting the symbol "<SEP>" before the comma and the stop of a sentence;

mode three, splicing two sentences and inserting the symbol "<SEP>" between the two sentences.

The ratio of the number of training data constructed in the three ways is 1:1:1.
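The three construction modes and the 1:1:1 ratio can be sketched as below. The sketch assumes that each sentence is already a list of tokens with punctuation marks as separate tokens, and that the set of marks used in mode two is supplied by the caller; all function names and the way sentences are allotted to the modes are illustrative assumptions.

```python
import random
from typing import Collection, List, Sequence

SEP = "<SEP>"


def insert_sep_randomly(tokens: Sequence[str], rng: random.Random) -> List[str]:
    """Mode one: insert <SEP> at a random position of a segmented sentence."""
    pos = rng.randint(0, len(tokens))  # both bounds are inclusive
    return list(tokens[:pos]) + [SEP] + list(tokens[pos:])


def insert_sep_before_marks(tokens: Sequence[str],
                            marks: Collection[str]) -> List[str]:
    """Mode two: insert <SEP> before every token contained in `marks`."""
    out: List[str] = []
    for tok in tokens:
        if tok in marks:
            out.append(SEP)
        out.append(tok)
    return out


def join_with_sep(first: Sequence[str], second: Sequence[str]) -> List[str]:
    """Mode three: splice two sentences with <SEP> between them."""
    return list(first) + [SEP] + list(second)


def build_second_dataset(sentences: Sequence[Sequence[str]],
                         marks: Collection[str],
                         seed: int = 0) -> List[List[str]]:
    """Build equal numbers of examples for the three modes (ratio 1:1:1)."""
    rng = random.Random(seed)
    out: List[List[str]] = []
    # Each group of four sentences yields one example per mode,
    # because mode three consumes two sentences.
    for g in range(len(sentences) // 4):
        s = sentences[4 * g: 4 * g + 4]
        out.append(insert_sep_randomly(s[0], rng))
        out.append(insert_sep_before_marks(s[1], marks))
        out.append(join_with_sep(s[2], s[3]))
    return out
```

Exposing the model to <SEP> at sentence-internal, pre-punctuation and sentence-boundary positions matches the places where speech segment boundaries can fall at prediction time.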

The pre-trained punctuation prediction model is trained with the training data constructed in the above modes until the model converges, and the model obtained after training is the final punctuation prediction model.

In order to improve training efficiency of the model, multiple pieces of training data can be input into the model for parallel training each time during training, and punctuation prediction can be performed on multiple texts simultaneously by the punctuation prediction model obtained through training.

Fourth embodiment

The embodiment of the application further provides a punctuation prediction device, which is described below, and the punctuation prediction device described below and the punctuation prediction method described above can be referred to correspondingly.

Referring to fig. 5, a schematic structural diagram of a punctuation prediction apparatus provided in an embodiment of the present application may include: a text to be predicted acquisition module 501, a history prediction information acquisition module 502 and a punctuation prediction module 503.

The text to be predicted obtaining module 501 is configured to obtain a text to be predicted, where the text to be predicted is a current recognition result of a current speech segment, and the recognition result of a speech segment includes a plurality of intermediate recognition results and a final recognition result;

The historical prediction information obtaining module 502 is configured to obtain historical prediction information based on whether the text to be predicted is a first recognition result of the current speech segment, where the historical prediction information is intermediate information that is generated in a punctuation prediction process of the historical recognition result and is used for determining a punctuation prediction result.

And the punctuation prediction module 503 is configured to predict punctuation information of words in the text to be predicted according to the historical prediction information and the text to be predicted.

Optionally, the punctuation prediction apparatus provided in the embodiment of the present application may further include: an update type determining module and a quantity counting module.

And the update type determining module is used for determining the update type corresponding to the text to be predicted according to the update condition of the text to be predicted compared with the previous recognition result.

The history prediction information obtaining module is specifically configured to obtain history prediction information based on whether the text to be predicted is a first recognition result of the current speech segment when the update type corresponding to the text to be predicted is modification.

And the quantity counting module is used for counting the quantity of words of the text to be predicted, which is increased compared with the previous recognition result, when the corresponding update type of the text to be predicted is increased.

Optionally, the punctuation prediction module is specifically configured to predict punctuation information of the words in the text to be predicted by using a pre-established punctuation prediction model based on the historical prediction information and the text to be predicted.

Optionally, the history prediction information obtaining module 502 is specifically configured to obtain punctuation prediction information corresponding to a final recognition result of a speech segment before the current speech segment as history prediction information if the text to be predicted is a first recognition result of the current speech segment; if the text to be predicted is the non-first recognition result of the current voice segment, punctuation prediction information corresponding to the final recognition result of the voice segment before the current voice segment and punctuation prediction information corresponding to the previous recognition result of the current voice segment are obtained and used as historical prediction information. The punctuation prediction information is intermediate information which is generated in the process of carrying out punctuation prediction on the corresponding identification result and is used for determining the punctuation prediction result.

Optionally, the punctuation prediction module 503 includes: the system comprises a prediction reference information acquisition sub-module, an input text acquisition sub-module and a punctuation prediction sub-module.

The prediction reference information acquisition sub-module is used for removing the punctuation prediction information of the last second preset number of words in the history recognition result from the historical prediction information, and the information obtained after removal is used as the prediction reference information.

And the input text acquisition sub-module is used for splicing the last second preset number of words in the history recognition result with the part which does not participate in punctuation prediction in the text to be predicted, and the spliced text is used as an input text.

And the punctuation prediction sub-module is used for inputting the prediction reference information and the input text into the punctuation prediction model to conduct punctuation prediction so as to obtain the punctuation information of the words in the text to be predicted.

Optionally, the prediction reference information obtaining sub-module is specifically configured to remove punctuation prediction information of a last second preset number of words in a final recognition result of a forward adjacent speech segment of the current speech segment from the historical prediction information if the text to be predicted is a first recognition result of the current speech segment; and if the text to be predicted is a non-first recognition result of the current voice fragment, removing punctuation prediction information of a last second preset number of words in a previous recognition result of the current voice fragment from the historical prediction information.

Optionally, the input text obtaining sub-module is specifically configured to splice a last second preset number of words in the history recognition result with the entire text to be predicted if the text to be predicted is a first recognition result of the current speech segment; if the text to be predicted is a non-first recognition result of the current voice segment, splicing the last second preset number of words in the historical recognition result with the part, which is increased compared with the previous middle recognition result of the current voice segment, of the text to be predicted.

Optionally, the punctuation prediction sub-module is specifically configured to determine a characterization vector of each word in the input text by using the punctuation prediction model, and determine a target vector corresponding to each word in the input text by using the punctuation prediction model, the characterization vector of each word in the input text and the prediction reference information, where the target vector corresponding to a word in the input text represents the relevance to that word of the words located before it and of the second preset number of words located after it in the input text; and determine punctuation information of the words in the text to be predicted by using the punctuation prediction model, the characterization vector of each word in the input text and the target vector corresponding to each word in the input text.

Optionally, the punctuation prediction submodule is specifically configured to, when determining the punctuation information of the word in the text to be predicted by using the punctuation prediction model, the characterization vector of each word in the input text and the target vector corresponding to each word in the input text, predict, for each word in the input text, the punctuation information after each word in a second preset number of words before the word by using the punctuation prediction model, the characterization vector of the word and the target vector corresponding to the word; for each word in the input text, determining punctuation information of the word by using the punctuation prediction model and punctuation information predicted for the word based on a second preset number of words after the word; and acquiring punctuation information of the words in the text to be predicted from the punctuation information of the words in the input text.

The punctuation prediction device provided by the embodiment can accurately and efficiently predict the punctuation information of the recognition result of the voice fragment.

Fifth embodiment

An embodiment of the present application further provides a punctuation prediction apparatus. Referring to Fig. 6, which shows a schematic structural diagram of the punctuation prediction apparatus, the apparatus may include: at least one processor 601, at least one communication interface 602, at least one memory 603 and at least one communication bus 604;

in the embodiment of the present application, there is at least one each of the processor 601, the communication interface 602, the memory 603 and the communication bus 604, and the processor 601, the communication interface 602 and the memory 603 communicate with one another through the communication bus 604;

the processor 601 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention, etc.;

the memory 603 may include high-speed RAM, and may further include non-volatile memory, such as at least one disk memory;

wherein the memory stores a program, and the processor may invoke the program stored in the memory, the program being configured to:

Optionally, for the refined and extended functions of the program, reference may be made to the description above.

Sixth embodiment

The embodiment of the application also provides a readable storage medium, which can store a program suitable for being executed by a processor, the program being configured to:

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

In the present specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A punctuation prediction method, comprising:

acquiring historical prediction information based on whether the text to be predicted is the first recognition result of the current voice fragment, wherein the historical prediction information is different from the historical recognition result, and the historical prediction information is intermediate information which is generated in the process of performing punctuation prediction on the historical recognition result and is used for determining the punctuation prediction result;

2. The punctuation prediction method of claim 1, further comprising:

3. The punctuation prediction method according to claim 1, wherein the obtaining historical prediction information based on whether the text to be predicted is the first recognition result of the current speech segment comprises:

4. The punctuation prediction method according to claim 1, wherein the predicting punctuation information of words in the text to be predicted according to the history prediction information and the text to be predicted comprises:

predicting punctuation information of words in the text to be predicted by using a pre-established punctuation prediction model according to the historical prediction information and the text to be predicted;

the punctuation prediction model is obtained by training with a punctuated training text, the training text is formed by splicing texts representing recognition results of a plurality of voice fragments, and when the punctuation prediction model is trained by using the training text, the punctuation prediction model predicts the punctuation information of each word in the training text according to the words before the word and a second preset number of words after the word.

5. The punctuation prediction method according to claim 4, wherein the predicting the punctuation information of the word in the text to be predicted by using a pre-established punctuation prediction model based on the history prediction information and the text to be predicted comprises:

6. The punctuation prediction method of claim 5, wherein the removing punctuation prediction information of a last second preset number of words in a history recognition result from the history prediction information comprises:

7. The punctuation prediction method according to claim 5, wherein the splicing the last second preset number of words in the history recognition result with the portion of the text to be predicted, which does not participate in punctuation prediction, comprises:

8. The punctuation prediction method according to claim 5, wherein the inputting the prediction reference information and the input text into the punctuation prediction model for punctuation prediction to obtain the punctuation information of the word in the text to be predicted comprises:

9. The punctuation prediction method according to claim 8, wherein the determining punctuation information of words in the text to be predicted by using the punctuation prediction model, a characterization vector of each word in the input text, and a target vector corresponding to each word in the input text comprises:

for each word in the input text, determining punctuation information of the word by using the punctuation prediction model and punctuation information predicted for the word based on a second preset number of words after the word;

10. A punctuation prediction apparatus, comprising: a to-be-predicted text acquisition module, a history prediction information acquisition module and a punctuation prediction module;

the history prediction information acquisition module is used for acquiring history prediction information based on whether the text to be predicted is the first recognition result of the current voice segment, wherein the history prediction information is different from the history recognition result, and the history prediction information is intermediate information which is generated in the process of punctuation prediction on the history recognition result and is used for determining the punctuation prediction result;

11. A punctuation prediction apparatus, comprising: a memory and a processor;

the memory is used for storing programs;

the processor is configured to execute the program to implement the respective steps of the punctuation prediction method according to any one of claims 1 to 9.

12. A readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the punctuation prediction method according to any one of claims 1 to 9.