CN115394298A

CN115394298A - Training method and prediction method of speech recognition text punctuation prediction model

Info

Publication number: CN115394298A
Application number: CN202211034353.4A
Authority: CN
Inventors: 雷金博
Original assignee: Sipic Technology Co Ltd
Current assignee: Sipic Technology Co Ltd
Priority date: 2022-08-26
Filing date: 2022-08-26
Publication date: 2022-11-25
Anticipated expiration: 2042-08-26
Also published as: CN115394298B

Abstract

The invention discloses a training method for a speech recognition text punctuation prediction model, a speech recognition text punctuation prediction method, an electronic device, and a storage medium. The training method comprises the following steps: dividing the audio into a plurality of sub-audios according to the voice continuity of the audio; performing speech recognition on each sub-audio, and manually labeling punctuation on the resulting recognized text, which contains a plurality of words; determining the pause duration of each word in each sub-audio; fusing the word embedding vector of each word with the pause duration vector corresponding to its pause duration to serve as the representation vector of that word; and training a punctuation prediction model using the representation vectors of all the words and the manually punctuated recognized text. By segmenting the audio, the invention can effectively avoid word segmentation errors, and because punctuation prediction takes the speaker's speech characteristics into account, the resulting prediction model can infer the punctuation insertion probability for each word more accurately.

Description

Training method and prediction method of speech recognition text punctuation prediction model

Technical Field

The invention relates to the technical field of speech recognition, and in particular to a training method for a speech recognition text punctuation prediction model, a speech recognition text punctuation prediction method, an electronic device, and a storage medium.

Background

Most intelligent voice devices currently on the market need a punctuation prediction function for speech recognition text, for example speech-recognition-related products or software such as smart speakers, devices that transcribe meeting recordings, intelligent dialogue robots, and video subtitle generation software.

In the prior art, methods that predict punctuation for speech recognition text take the recognized text of the target audio as input, and must first segment that text into words according to its contextual semantics before feeding it to the prediction model. When the usage scenario is complex, the noise is high, or the speaker misspeaks or pronounces words in a nonstandard way, the recognized text contains more errors. Its semantics easily become confused, which causes word segmentation errors and greatly degrades punctuation prediction.

To address these drawbacks, the common solution at present is to collect audio data from scenarios with high noise and a high recognition word error rate, and to manually add punctuation to these semantically unclear or error-prone data sets. The collected data is then added to the training set of the neural network to improve the model's punctuation prediction performance in scenarios with a high recognition word error rate.

However, this solution is still incomplete. In real recognition environments, the recognized text is usually generated in real time from spoken human communication, so it inevitably carries strong personal habits or spoken-language features, such as pauses and their durations, or multiple speakers talking in turn. If only the accuracy of the recognized text is considered and the recognized text is the only input, the feature information carried by pauses and speaker turns is lost, and punctuation prediction suffers.

For example, suppose the recognized text obtained from the user's speech is, without punctuation, "the down payment on the house is about fifteen percent one hundred and twenty thousand". A prior-art punctuation prediction method may output "the down payment on the house is about ten percent, five hundred and twenty thousand", whereas the speaker actually meant "the down payment on the house is about fifteen percent, one hundred and twenty thousand". In the original Chinese the number words run together, so the same sequence can be segmented either way. This is a word segmentation error caused by semantic confusion in the recognized text: the intended "fifteen percent" and "one hundred and twenty thousand" are split into "ten percent" and "five hundred and twenty thousand", which corrupts the final punctuation prediction result.

Disclosure of Invention

Embodiments of the present invention provide a speech recognition text punctuation prediction model training method, a speech recognition text punctuation prediction method, an electronic device, and a storage medium, so as to solve at least one technical problem in the prior art, effectively alleviate a situation that a speech recognition text punctuation prediction method in the prior art is inaccurate in prediction, and improve accuracy of speech recognition text punctuation prediction.

In a first aspect, an embodiment of the present invention provides a method for training a speech recognition text punctuation prediction model, including:

dividing the training audio into a plurality of sub-audios according to the voice continuity of the training audio;

performing speech recognition on each sub-audio, and manually labeling punctuation on the resulting recognized text containing a plurality of words;

determining the pause duration of each word in each sub-audio;

fusing the word embedding vector of each word with the pause duration vector corresponding to its pause duration to serve as the representation vector of each word;

and training a punctuation prediction model using the representation vectors of all the words and the manually punctuated recognized text.

In a second aspect, an embodiment of the present invention provides a method for predicting a text punctuation in speech recognition, including:

dividing the target audio into a plurality of sub-audios according to the voice continuity of the target audio;

performing voice recognition on each sub-audio to obtain a recognition text containing a plurality of words;

determining the pause duration of each word in each sub-audio;

and splicing the representation vectors of the words in the sub-audios in the chronological order of the plurality of sub-audios, inputting them into a speech recognition text punctuation prediction model, and inferring the punctuation of the words in the recognized text.

In a third aspect, an embodiment of the present invention provides an electronic device, including: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for training a speech recognition text punctuation prediction model or the method for speech recognition text punctuation prediction described above.

In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the above-mentioned training method for the speech recognition text punctuation prediction model or the speech recognition text punctuation prediction method.

The beneficial effects of the embodiments of the invention are as follows: the method combines the pause characteristics of the speaker's spoken expression with the recognized text, so that the prediction performance of the model is markedly better than that of a text-only model. At the same time, the method introduces no significant extra cost: the memory cost and latency of the whole module do not differ noticeably from those of the prior art.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a method for training a speech recognition text punctuation prediction model according to an embodiment of the present invention;

FIG. 2 is a flowchart of a training method of a speech recognition text punctuation prediction model for segmenting a training audio by using a speech endpoint detection model according to an embodiment of the present invention;

FIG. 3 is a flowchart of step S13 of the training method of the speech recognition text punctuation prediction model for segmenting the training audio by using the speech endpoint detection model according to an embodiment of the present invention;

FIG. 4 is a flowchart of the method of step S14 in the training method of the speech recognition text punctuation prediction model according to an embodiment of the present invention;

FIG. 5 is a flowchart of a training method of a speech recognition text punctuation prediction model for segmenting a training audio using an alignment model according to an embodiment of the present invention;

FIG. 6 is a flowchart illustrating a method for predicting punctuation in a speech recognition text according to an embodiment of the present invention;

FIG. 7 is an overall flowchart of the speech recognition text punctuation prediction model training method and prediction method of the present invention;

FIG. 8 is a schematic block diagram of a training apparatus for a speech recognition text punctuation prediction model according to an embodiment of the present invention;

FIG. 9 is a schematic block diagram of a speech recognition text punctuation prediction apparatus according to an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of an embodiment of an electronic device of the invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, embodiments of the present invention. All other embodiments that a person skilled in the art can obtain from these embodiments without creative effort fall within the scope of protection of the present invention.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, may be an element. One or more elements may be in a process and/or thread of execution and an element may be localized on one computer and/or distributed between two or more computers and may be operated by various computer-readable media. The elements may also communicate by way of local and/or remote processes based on a signal having one or more data packets, e.g., from a data packet interacting with another element in a local system, distributed system, and/or across a network in the internet with other systems by way of the signal.

Finally, it should also be noted that, in this document, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.

Fig. 1 schematically shows the flow of a training method for a speech recognition text punctuation prediction model according to an embodiment of the present invention. The method can be applied to a server, a computer, or a similar device to train the punctuation prediction model, and can also be applied to a client device such as a smartphone or a personal computer. As shown in fig. 1, the method includes the following steps:

step S11: dividing the training audio into a plurality of sub-audios according to the voice continuity of the training audio;

step S12: performing speech recognition on each sub-audio, and manually labeling punctuation on the resulting recognized text containing a plurality of words;

step S13: determining the pause duration of each word in each sub-audio;

step S14: fusing the word embedding vector of each word with the pause duration vector corresponding to its pause duration to serve as the representation vector of each word;

step S15: training a punctuation prediction model using the representation vectors of all the words and the manually punctuated recognized text.

In step S11, the training audio is first divided according to its speech continuity, so that continuous parts of the audio fall into the same sub-audio. This effectively alleviates the segmentation errors that arise in the prior art when segmentation relies only on the semantics of the recognized text. The training audio may be audio prepared specifically for training the punctuation prediction model and may cover various error-prone situations; it is not described further here.

The training audio can be divided using either a speech endpoint detection model or an alignment model. The subsequent processing steps differ somewhat between these two modes, and the speech endpoint detection model is described in detail first. A speech endpoint detection model, also called Voice Activity Detection (VAD), can identify the start and end positions of speech within a whole audio segment. Fig. 2 schematically shows a flowchart of a training method of a speech recognition text punctuation prediction model for dividing a training audio by using a speech endpoint detection model according to an embodiment of the present invention, and as shown in fig. 2, step S11 may be implemented as:

step S111A: using a speech endpoint detection model, dividing the training audio into a plurality of sub-audios in chronological order with the non-speech segments as separators, wherein each sub-audio is a speech segment;

step S112A: the start time and the end time of each sub audio in the training audio are recorded.

In step S111A, the training audio is processed by the speech endpoint detection model to obtain the speech segments and non-speech segments (e.g., silence segments) in the training audio. A speech segment is a segment in which the model detects speech, and a non-speech segment is any remaining segment in which it detects none. The non-speech segments can then be used as separators to divide the training audio, in chronological order, into a plurality of sub-audios. Step S112A then records the start time and end time, within the training audio, of each sub-audio obtained by the division, which completes step S11 for the variant that divides the training audio with a speech endpoint detection model.
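Steps S111A and S112A can be sketched as follows. This is a minimal illustration, assuming the speech endpoint detection model has already produced one speech/non-speech flag per fixed-length frame; the function name `split_by_voice_activity`, the `frame_ms` parameter, and the frame-flag input format are hypothetical and not taken from the source.

```python
from typing import List, Tuple


def split_by_voice_activity(is_speech: List[bool], frame_ms: int = 10) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) for each maximal run of speech frames.

    `is_speech` is an assumed per-frame output of a speech endpoint
    detection (VAD) model; non-speech frames act as separators (S111A),
    and the returned times are the recorded boundaries (S112A).
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index of the first frame of the current speech run
    for i, flag in enumerate(is_speech):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start * frame_ms, i * frame_ms))
            start = None
    if start is not None:  # audio ends while speech is still active
        segments.append((start * frame_ms, len(is_speech) * frame_ms))
    return segments


frames = [False, True, True, True, False, False, True, True, False]
sub_audios = split_by_voice_activity(frames, frame_ms=10)
print(sub_audios)  # [(10, 40), (60, 80)]
```

Each tuple is one sub-audio; the gap between consecutive tuples is a non-speech segment.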

Step S12 performs speech recognition on each sub-audio obtained in step S11 and manually labels punctuation on the recognized text, forming the target output for training the punctuation prediction model. Because each sub-audio is a speech portion of the training audio, speech recognition in step S12 yields the recognized text of the training audio, which contains a plurality of words. After punctuation is manually labeled on this text, it serves as the target output for training, so that the trained punctuation prediction model produces highly accurate inference results.

Step S13 determines the pause duration corresponding to each word in the recognized text. The pause duration may be the pause before the word or the pause after the word. If the pre-word pause duration is used, the punctuation prediction model predicts whether punctuation should be inserted before the word and which punctuation it should be; in that case an additional method or model is needed to decide which punctuation mark belongs at the end of the last sub-audio. If the post-word pause duration is used, the punctuation prediction model predicts whether punctuation should be inserted after the word and which punctuation it should be; the same model then also covers the end of the last sub-audio, and no additional method or model is needed. In the embodiments of the invention, the pause duration determined in step S13 is therefore preferably the post-word pause duration. The pause duration of each word can be determined from the start time and end time of each sub-audio in the training audio recorded in step S11.

Specifically, in the variant that divides the training audio with a speech endpoint detection model, each sub-audio obtained in step S11 is a speech segment, so the recorded start time and end time of each sub-audio are also the start time and end time of the corresponding speech segment. Fig. 3 schematically shows the flow of step S13 in this variant according to an embodiment of the present invention. As shown in fig. 3, step S13 may be implemented to include the following steps:

step S131: taking the time difference between the ending time of the ith sub-audio and the starting time of the (i + 1) th sub-audio as the pause duration of the last word in the ith sub-audio;

step S132: setting the pause duration of the last word in the last sub-audio in the training audio to be infinite;

step S133: and setting the pause duration of other words except the last word in each sub-audio to be 0.

In step S131, it is understood that, in this embodiment, the recognized text corresponding to the sub-audio includes a plurality of words, and the start time and the end time of the recorded speech segment correspond to the start time of the first word and the end time of the last word in a certain sub-audio i, so that the time difference between the end time of the ith sub-audio and the start time of the (i + 1) th sub-audio can be determined as the pause duration of the last word in the ith sub-audio.

In step S132, the last word of the last sub-audio is followed by the end of the training audio, where a punctuation mark must be added to mark the end of the text. To distinguish this word's pause duration from those in the preceding sub-audios and to highlight the special status of the last sub-audio, the pause duration of the last word in the last sub-audio is set to infinity. This avoids introducing additional punctuation decisions and weight design into the punctuation prediction model and helps ensure the accuracy of punctuation prediction.

In step S133, when the speech endpoint detection model is used, only the start time and end time of each sub-audio in the training audio are actually recorded, so for the words other than the last word in a sub-audio there are no start and end times from which to calculate a pause duration. In addition, the inventor uses the speech endpoint detection model precisely to avoid having to determine a duration for every word. Because each sub-audio is produced by the speech endpoint detection model, the words within it are spoken continuously and any pauses between them are very short. In step S133, the pause duration of every word other than the last word in each sub-audio can therefore be set to 0. In other words, even if slight pauses exist between these words in the real sub-audio, they can be treated as continuous and uninterrupted, which reduces the complexity of the punctuation decision steps and the weight design.
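The three rules of steps S131 to S133 can be sketched as follows. This is a minimal illustration, assuming each sub-audio is given as a (start_ms, end_ms) pair together with the list of words recognized in it; the function name `pause_durations_vad` and the data layout are hypothetical.

```python
import math
from typing import List, Tuple


def pause_durations_vad(
    segments: List[Tuple[int, int]],
    words_per_segment: List[List[str]],
) -> List[Tuple[str, float]]:
    """Assign a post-word pause duration (ms) to every recognized word."""
    pauses: List[Tuple[str, float]] = []
    last_segment = len(segments) - 1
    for i, words in enumerate(words_per_segment):
        for j, word in enumerate(words):
            if j < len(words) - 1:
                pause = 0.0  # S133: words inside a sub-audio count as continuous
            elif i == last_segment:
                pause = math.inf  # S132: last word of the last sub-audio
            else:
                # S131: gap between the end of sub-audio i and the start of i+1
                pause = float(segments[i + 1][0] - segments[i][1])
            pauses.append((word, pause))
    return pauses


segments = [(0, 1200), (1700, 2600)]
words = [["good", "morning"], ["how", "are", "you"]]
print(pause_durations_vad(segments, words))
# [('good', 0.0), ('morning', 500.0), ('how', 0.0), ('are', 0.0), ('you', inf)]
```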

Step S14 determines the representation vector of each word from the pause duration obtained for that word in step S13. The representation vector is obtained by fusing the word embedding vector of the word with the corresponding pause duration vector. Fig. 4 schematically illustrates a method flow of step S14 in the training method of the speech recognition text punctuation prediction model in an embodiment of the present invention, and referring to fig. 4, step S14 may be implemented to include the following steps:

step S141: searching a word embedding vector corresponding to each word according to a preset word embedding matrix;

step S142: determining a discrete feature value reflecting the pause duration of each word according to a preset pause duration mapping function, and looking up the pause duration vector corresponding to each discrete feature value in a preset pause duration embedding matrix;

step S143: splicing the word embedding vector and the pause duration vector of each word to obtain the representation vector of each word.

In step S141, the preset word embedding matrix stores each word together with its word embedding vector, so the word embedding vector of each word can be looked up in the matrix. Word embedding vectors correspond one-to-one with words, and whether punctuation should be placed after a word, and which punctuation, can be computed from them.

In step S142, the preset pause duration mapping function maps the pause duration of each word to a discrete feature value that reflects it: the function assigns the discrete feature value according to the pause duration interval in which the pause duration falls. The preset pause duration embedding matrix stores each discrete feature value together with its pause duration embedding vector. The pause duration vector of each word can therefore be determined from the word's pause duration.

In the variant that divides the training audio with a speech endpoint detection model, the sub-audios are separated by non-speech segments and the pause durations of all words other than the last word in each sub-audio are set to 0. Since non-speech segments that mark pauses in the user's speech are generally not very short, the preset pause duration mapping function can use fewer pause duration intervals. This helps improve the robustness and stability of the trained punctuation prediction model and reduces the effect that errors in the speech endpoint detection model's output have on punctuation prediction. In this variant, the preset pause duration mapping function can therefore be set as the following expression, which effectively improves robustness and makes the prediction results more stable.

where x is the pause duration in milliseconds.

In step S143, a representation vector corresponding to each word can be obtained by only concatenating the word embedding vector corresponding to each word obtained in step S141 and the pause duration vector corresponding to each word obtained in step S142.
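Steps S141 to S143 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the interval boundaries in `PAUSE_BOUNDARIES_MS` are hypothetical placeholders and are not the values of the patent's mapping function, the embedding tables are filled with random numbers in place of trained parameters, and all names (`pause_bucket`, `representation_vector`, `word_table`, `pause_table`) are invented for the example.

```python
import bisect
import math
import random
from typing import Dict, List

# Hypothetical interval boundaries in milliseconds (NOT from the source).
# They define four intervals: [0, 1), [1, 300), [300, 1000), [1000, inf].
PAUSE_BOUNDARIES_MS = [1.0, 300.0, 1000.0]


def pause_bucket(pause_ms: float) -> int:
    """Map a pause duration to a discrete feature value (interval index)."""
    return bisect.bisect_right(PAUSE_BOUNDARIES_MS, pause_ms)


def make_table(rows: int, dim: int, rng: random.Random) -> List[List[float]]:
    """Build a stand-in embedding matrix; a real model would learn these values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


rng = random.Random(0)
vocab: Dict[str, int] = {"good": 0, "morning": 1}
word_table = make_table(len(vocab), 4, rng)                     # word embedding matrix
pause_table = make_table(len(PAUSE_BOUNDARIES_MS) + 1, 2, rng)  # pause duration embedding matrix


def representation_vector(word: str, pause_ms: float) -> List[float]:
    word_vec = word_table[vocab[word]]               # S141: look up word embedding
    pause_vec = pause_table[pause_bucket(pause_ms)]  # S142: map pause, look up vector
    return word_vec + pause_vec                      # S143: concatenate the two lists


vec = representation_vector("morning", 500.0)
print(len(vec))  # 6
```

Because an infinite pause compares greater than every boundary, the last word of the last sub-audio always falls into the final interval.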

In step S15, with the representation vectors of the words obtained in step S14, the representation vectors of the words in the sub-audios are spliced together in the chronological order of the sub-audios to form the input of the punctuation prediction model, and the manually punctuated recognized text obtained in step S12 is used as the target output, in order to train the punctuation prediction model.
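One way to prepare the input order and the training targets for step S15 is sketched below. The source only states that the manually punctuated text is the target output; encoding it as one label per word (the punctuation mark that follows the word, or "O" for none) is an assumption made for this example, as are the function names and the token format.

```python
from typing import List, Tuple

PUNCTUATION = {",", ".", "?", "!"}


def splice_in_time_order(sub_audios: List[Tuple[int, list]]) -> list:
    """Concatenate per-sub-audio items (words or vectors) by sub-audio start time."""
    ordered = sorted(sub_audios, key=lambda item: item[0])
    return [x for _, items in ordered for x in items]


def labels_from_punctuated(tokens: List[str]) -> List[Tuple[str, str]]:
    """Turn manually punctuated tokens into (word, label) pairs.

    The label is the punctuation mark that follows the word, or "O" if
    none follows (assumed label scheme, post-word punctuation).
    """
    pairs: List[Tuple[str, str]] = []
    for tok in tokens:
        if tok in PUNCTUATION:
            if pairs:  # attach the mark to the preceding word
                word, _ = pairs[-1]
                pairs[-1] = (word, tok)
        else:
            pairs.append((tok, "O"))
    return pairs


words = splice_in_time_order([(1700, ["how", "are", "you"]), (0, ["good", "morning"])])
print(words)  # ['good', 'morning', 'how', 'are', 'you']

targets = labels_from_punctuated(["good", "morning", ",", "how", "are", "you", "?"])
print(targets)
# [('good', 'O'), ('morning', ','), ('how', 'O'), ('are', 'O'), ('you', '?')]
```

The spliced sequence and the label sequence have the same length, so each representation vector is paired with exactly one target label.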

In other embodiments, the training audio may instead be divided using an alignment model. Given the audio and its corresponding annotation text, an alignment model aligns the two to obtain the pronunciation timestamp of each word in the whole audio, including its start time and end time. Fig. 5 schematically illustrates a flowchart of a training method of a speech recognition text punctuation prediction model for dividing training audio by using an alignment model according to an embodiment of the present invention, and referring to fig. 5, step S11 may be implemented as:

step S111B: dividing the training audio into a plurality of sub-audios according to the time sequence by using an alignment model, wherein each sub-audio is a word;

step S112B: the start time and the end time of each sub-audio in the training audio are recorded.

In step S111B, the alignment model processes the training audio to obtain the pronunciation timestamp of each word in the whole audio, and the training audio can be divided directly into a plurality of sub-audios according to these timestamps. Each sub-audio obtained this way is therefore a single word. The start time and end time of each sub-audio recorded in step S112B are taken from the same per-word pronunciation timestamps, which completes step S11 for the variant that divides the training audio with an alignment model.

Step S12 is implemented in the same way as in the variant that divides the training audio with a speech endpoint detection model: speech recognition is performed on each sub-audio, and punctuation is manually labeled on the resulting recognized text to obtain the target output for training the punctuation prediction model. The description is not repeated here.

Step S13 differs from its counterpart in the speech endpoint detection variant. Because each sub-audio obtained in step S11 with the alignment model is a single word, the pause duration of each word can be obtained directly from the start times and end times of the words in the training audio, which completes step S13. As in step S132, the pause duration of the last word (i.e. the last sub-audio) may also be set to infinity in this variant, which avoids introducing additional punctuation decisions and weight design into the punctuation prediction model and helps ensure the accuracy of punctuation prediction.
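For the alignment-model variant, step S13 can be sketched as follows. The sketch assumes the alignment model returns one (word, start_ms, end_ms) triple per word and that the post-word pause is the gap between a word's end time and the next word's start time; clamping negative gaps to 0 is an added assumption, and the function name `pause_durations_aligned` is hypothetical.

```python
import math
from typing import List, Tuple


def pause_durations_aligned(
    word_times: List[Tuple[str, int, int]],
) -> List[Tuple[str, float]]:
    """Compute the post-word pause (ms) of every word from its timestamps."""
    pauses: List[Tuple[str, float]] = []
    for i, (word, _start, end) in enumerate(word_times):
        if i + 1 < len(word_times):
            next_start = word_times[i + 1][1]
            # Assumption: overlapping timestamps are treated as no pause.
            pause = float(max(0, next_start - end))
        else:
            pause = math.inf  # last word, handled as in step S132
        pauses.append((word, pause))
    return pauses


aligned = [("good", 0, 300), ("morning", 320, 800), ("how", 1300, 1500)]
print(pause_durations_aligned(aligned))
# [('good', 20.0), ('morning', 500.0), ('how', inf)]
```

Unlike the speech endpoint detection variant, words in the middle of an utterance receive their measured pause rather than 0.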

Step S14 is implemented similarly to its counterpart in the speech endpoint detection variant. The difference is that, with the alignment model, the pause duration of each word is derived from the alignment result, so the pause durations vary in length and are not set to 0. The preset pause duration mapping function therefore uses more pause duration intervals. Specifically, in the variant that divides the training audio with an alignment model, the preset pause duration mapping function can be set as the following expression:

where x is the pause duration in milliseconds.

The training method of the speech recognition text punctuation prediction model first segments the training audio based on the speech continuity of the audio to obtain continuous sub-audios, and then performs speech recognition on each sub-audio. This effectively reduces the punctuation prediction errors caused in the prior art by segmenting words only according to the semantics of the recognized text. Method flows are provided for two different audio division modes, one using an alignment model and one using a speech endpoint detection model. When determining the representation vector of each word, the method as a whole adds the pause characteristics of the speaker's spoken expression as an influencing factor by incorporating the pause duration of the word, so that when the trained prediction model infers the punctuation insertion probability of each word, punctuation prediction for text that is semantically difficult to understand becomes more accurate to a certain extent.

Compared with the background art, the method that divides the audio using the alignment model not only adds pause duration as a speaker expression characteristic and thereby increases the accuracy of punctuation prediction, but also, owing to the fine granularity of its timestamps, obtains a timestamp for each word, so that the pronunciation time and pause time of each word can be detected and punctuation prediction is sensitive.

In the method that divides the audio using the speech endpoint detection model, when the pause duration of each sub-audio is determined, the pause duration of the last word of the last sub-audio is set to infinity, which ensures that a punctuation mark is added after that word, and the pause durations of all words other than the last word in each sub-audio are set to 0. Compared with the alignment model, this seemingly sacrifices sensitivity and accuracy. However, the final experimental results show that combining this method flow with a preset pause duration mapping function that has few pause duration intervals maintains accuracy while avoiding the situation in which slight changes in pause duration map to different discrete feature values and thereby disturb the output of the punctuation prediction model to some degree, so the robustness and stability of the trained prediction model are improved more effectively. Finally, the punctuation prediction model is trained using the representation vectors of the words and the manually punctuated recognized text corresponding to the training audio, yielding a punctuation prediction model with higher accuracy and stability.

Fig. 6 schematically shows the flow of a speech recognition text punctuation prediction method according to an embodiment of the present invention. The method can be applied in a server, a computer, or other devices to perform punctuation prediction on speech recognition text, and in particular in client devices of products or software related to speech recognition, such as smart phones and personal computers. Referring to fig. 6, the method includes the following steps:

step S21: dividing the target audio into a plurality of sub-audios according to the voice continuity of the target audio;

step S22: performing voice recognition on each sub-audio to obtain a recognition text containing a plurality of words;

step S23: determining the pause duration of each word in each sub-audio;

step S24: embedding the words of each word into a vector and fusing the pause duration vector corresponding to the pause duration to serve as a representation vector of each word;

step S25: splicing the representation vectors of the words in the sub-audios according to the time sequence of the plurality of sub-audios, inputting the spliced representation vectors into the speech recognition text punctuation prediction model, and inferring the punctuation of the words in the recognition text.

In step S21, similarly to step S11, the target audio must first be divided. The target audio is the audio on which speech recognition text punctuation prediction is to be performed, and is input by the user when using the product or software. Specifically, the target audio may be divided using either a speech endpoint detection model or an alignment model; for both division methods, reference may be made to the descriptions of the corresponding parts of the training method of the speech recognition text punctuation prediction model, which are not repeated here. Because the target audio is divided based on its speech continuity, continuously spoken parts are placed in the same sub-audio, which effectively alleviates the erroneous segmentation caused in the prior art by segmenting words only according to the semantics of the recognized text.

Step S22, similarly to step S12, performs speech recognition on each sub-audio to obtain a recognition text that corresponds to the target audio and contains a plurality of words. It differs from step S12 in that, because this is a punctuation prediction method rather than a training method of the prediction model, the speech recognition text does not need to be manually punctuated.

Steps S23 and S24 are similar to steps S13 and S14. For the method of determining the pause duration of each word in each sub-audio in step S23 and for the determination of the representation vector of each word in step S24, reference may be made to the detailed descriptions of the relevant parts of steps S13 and S14, which are not repeated here.

In step S25, the punctuation of the words in the recognition text corresponding to the target audio is inferred from the representation vectors of the words obtained in step S24 and the punctuation prediction model. The representation vectors of the words in the sub-audios are spliced according to the time sequence of the plurality of sub-audios and used as the input of the punctuation prediction model, which outputs the punctuated speech recognition text corresponding to the target audio, thereby completing the punctuation prediction of the speech recognition text.
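Step S25 can be pictured with the short sketch below, which orders the sub-audios by start time, splices their per-word vectors into one sequence, and attaches the predicted marks to the words. Everything named here is an assumption made for illustration: the dictionary layout of a sub-audio, the helper names, and in particular `toy_model`, which is only a placeholder rule standing in for a trained punctuation prediction model.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def splice_in_time_order(sub_audios: List[Dict]) -> Tuple[List[str], List[Vector]]:
    """Concatenate words and their vectors across sub-audios in time order."""
    words: List[str] = []
    vectors: List[Vector] = []
    for sub in sorted(sub_audios, key=lambda s: s["start_ms"]):
        words.extend(sub["words"])
        vectors.extend(sub["vectors"])
    return words, vectors


def punctuate(
    words: List[str],
    vectors: List[Vector],
    model: Callable[[List[Vector]], List[str]],
) -> str:
    """Run the model once on the spliced sequence; one mark per word."""
    marks = model(vectors)
    return " ".join(word + mark for word, mark in zip(words, marks))


def toy_model(vectors: List[Vector]) -> List[str]:
    # Placeholder rule, NOT a trained model: reads the last feature only.
    return ["." if v[-1] >= 1.0 else "," if v[-1] >= 0.5 else "" for v in vectors]


# Hypothetical sub-audios, deliberately given out of order.
sub_audios = [
    {"start_ms": 1500.0, "words": ["fine", "thanks"],
     "vectors": [[0.2, 0.0], [0.4, 1.0]]},
    {"start_ms": 0.0, "words": ["how", "are", "you"],
     "vectors": [[0.1, 0.0], [0.3, 0.0], [0.5, 0.5]]},
]
words, vectors = splice_in_time_order(sub_audios)
text = punctuate(words, vectors, toy_model)
print(text)  # how are you, fine thanks.
```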

The speech recognition text punctuation prediction method first segments the target audio based on the speech continuity of the audio to obtain continuous sub-audios, and then performs speech recognition on each sub-audio, which effectively reduces the punctuation prediction errors caused in the prior art by segmenting words only according to the semantics of the recognized text. When the representation vector of each word is determined, the pause duration is added as an influencing factor so that the speech characteristics of the speaker are taken into account, and punctuation prediction for text that is semantically difficult to understand becomes more accurate when the prediction model infers the punctuation insertion probability of each word. In addition, reducing the number of pause duration intervals when setting the preset pause duration mapping function effectively improves the robustness and stability of punctuation prediction for speech recognition text.

In summary, the speech recognition text punctuation prediction model training method and the speech recognition text punctuation prediction method of the present invention may be integrated into the flow shown in fig. 7. Referring to fig. 7, after a target audio or a training audio is received, a VAD model (i.e., a speech endpoint detection model) may first be used to segment the audio into a plurality of sub-audios. Speech recognition is then performed on each sub-audio to obtain a speech recognition text; if model training is to be performed, the speech recognition text is manually punctuated to obtain the output part used for training the punctuation prediction model. Next, the pause duration of each word in the speech recognition text is calculated. The pause duration vector to which the pause duration of each word is mapped and the word embedding vector of each word are determined, and the two are fused to obtain the representation vector of each word. The representation vectors of the words are then spliced in sequence to form the input of the punctuation prediction model. Finally, depending on the purpose: if the punctuation prediction model is to be trained, the manually punctuated speech recognition text is used as the output part and the spliced representation vectors as the input to train the punctuation prediction model; if punctuation prediction is to be performed, the spliced representation vectors are input into the punctuation prediction model to obtain the punctuated speech recognition text corresponding to the target audio, completing the punctuation prediction.
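For the training branch of this flow, the manually punctuated text has to be turned into a target that lines up with the per-word input vectors. The sketch below shows one possible convention, assumed here for illustration only: each word receives the punctuation mark that follows it, or the label `"O"` when none follows. The punctuation set, the label `"O"`, the whitespace tokenization, and the function name `word_labels` are all hypothetical choices; text without whitespace between words would need a tokenizer instead of `split()`.

```python
from typing import List, Tuple

PUNCT = set(",.?!")  # assumed label set for this sketch
NO_PUNCT = "O"       # assumed label meaning "no punctuation after this word"


def word_labels(annotated: str) -> Tuple[List[str], List[str]]:
    """Split manually punctuated text into words and per-word labels."""
    words: List[str] = []
    labels: List[str] = []
    for token in annotated.split():
        if token[-1] in PUNCT:
            words.append(token[:-1])
            labels.append(token[-1])
        else:
            words.append(token)
            labels.append(NO_PUNCT)
    return words, labels


words, labels = word_labels("how are you, fine thanks.")
print(words)   # ['how', 'are', 'you', 'fine', 'thanks']
print(labels)  # ['O', 'O', ',', 'O', '.']
```

With labels in this form, the i-th spliced representation vector and the i-th label form one training pair, and the inference branch simply omits the labels.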

Fig. 8 schematically illustrates a speech recognition text punctuation prediction model training apparatus according to an embodiment of the present invention, and as shown in fig. 8, the apparatus includes:

the audio dividing module 1 is used for dividing the training audio into a plurality of sub-audios according to the voice continuity of the training audio;

the voice recognition module 2 is used for performing voice recognition on each sub-audio to obtain a recognition text containing a plurality of words;

the manual marking punctuation module 3 is used for manually marking punctuation on the recognition text recognized by the voice recognition module 2;

the pause duration determining module 4 is used for determining the pause duration of each word in each sub-audio;

the representation vector obtaining module 5 is used for fusing the word embedding vector of each word with the pause duration vector corresponding to its pause duration to serve as the representation vector of each word;

and the model training module 6 is used for training the punctuation prediction model by using the representation vectors of the words obtained by the representation vector obtaining module 5 and the manually punctuated recognition text obtained by the manual marking punctuation module 3.

In some embodiments, the audio partitioning module 1 may specifically be configured to:

using a voice endpoint detection model, and dividing the training audio into a plurality of sub-audios according to a time sequence by taking the non-voice segment as a partition, wherein each sub-audio is a voice segment;

the start time and the end time of each sub audio in the training audio are recorded.
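The division performed by the audio dividing module in this embodiment can be illustrated with a small sketch that starts from frame-level speech/non-speech decisions. It assumes a speech endpoint detection model has already produced one boolean per fixed-length frame; the 10 ms frame length, the function name `segments_from_frames`, and the sample flags are hypothetical, and no particular VAD library is implied.

```python
from typing import List, Optional, Tuple


def segments_from_frames(
    is_speech: List[bool], frame_ms: float = 10.0
) -> List[Tuple[float, float]]:
    """Group consecutive speech frames into (start_ms, end_ms) sub-audios.

    Non-speech frames act as the partition between sub-audios.
    """
    segments: List[Tuple[float, float]] = []
    start: Optional[int] = None
    for i, flag in enumerate(is_speech):
        if flag and start is None:
            start = i  # a speech segment begins
        elif not flag and start is not None:
            segments.append((start * frame_ms, i * frame_ms))
            start = None  # the segment ended at the first non-speech frame
    if start is not None:
        # Audio ended while still inside a speech segment.
        segments.append((start * frame_ms, len(is_speech) * frame_ms))
    return segments


# Hypothetical frame decisions: silence, 3 speech frames, silence, 2 speech frames.
flags = [False, True, True, True, False, False, True, True]
segments = segments_from_frames(flags)
print(segments)  # [(10.0, 40.0), (60.0, 80.0)]
```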

In other embodiments, the audio dividing module 1 may specifically be configured to:

dividing the training audio into a plurality of sub-audios according to the time sequence by using an alignment model, wherein each sub-audio is a word;

In some embodiments, the pause duration determining module 4 may be specifically configured to:

taking the time difference between the ending time of the ith sub-audio and the starting time of the (i + 1) th sub-audio as the pause duration of the last word in the ith sub-audio;

setting the pause duration of the last word in the last sub-audio in the training audio to be infinite;

and setting the pause duration of other words except the last word in each sub-audio to be 0.
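The three rules above can be sketched as a single function. This is an illustrative reading rather than a definitive implementation; the input layout (one `(start_ms, end_ms)` pair and one word list per sub-audio), the millisecond unit, and the name `pause_durations_from_vad` are assumptions.

```python
import math
from typing import List, Tuple


def pause_durations_from_vad(
    segments: List[Tuple[float, float]],
    words_per_segment: List[List[str]],
) -> List[List[float]]:
    """Return pause durations (ms), one list per sub-audio, one value per word."""
    last_index = len(segments) - 1
    result: List[List[float]] = []
    for i, words in enumerate(words_per_segment):
        # Every word other than the last one in a sub-audio gets 0.
        pauses = [0.0] * len(words)
        if words:
            if i == last_index:
                # Last word of the last sub-audio: infinite pause.
                pauses[-1] = math.inf
            else:
                # Gap between the end of sub-audio i and the start of sub-audio i+1.
                pauses[-1] = segments[i + 1][0] - segments[i][1]
        result.append(pauses)
    return result


# Hypothetical sub-audio boundaries and recognized words.
segments = [(0.0, 1200.0), (1500.0, 2600.0)]
words_per_segment = [["how", "are", "you"], ["fine", "thanks"]]
pauses = pause_durations_from_vad(segments, words_per_segment)
print(pauses)  # [[0.0, 0.0, 300.0], [0.0, inf]]
```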

In some embodiments, the representation vector obtaining module 5 may be specifically configured to:

searching a word embedding vector corresponding to each word according to a preset word embedding matrix;

determining discrete characteristic values reflecting the pause durations of the words according to a preset pause duration mapping function, and searching pause duration vectors corresponding to the discrete characteristic values in a pause duration embedding matrix, wherein the expression of the preset pause duration function is

Wherein x is the pause duration in milliseconds;

and splicing the word embedding vector and the pause duration vector of each word to obtain the expression vector of each word.
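The lookup-and-splice procedure above can be illustrated as follows. The interval boundaries in `BOUNDARIES_MS` are hypothetical placeholders and are not the preset pause duration mapping function of the embodiment; the toy vocabulary, the embedding values, and the function names are likewise assumptions made only so that the sketch runs.

```python
import bisect
import math
from typing import List

# Hypothetical interval boundaries in milliseconds (NOT the patent's function).
BOUNDARIES_MS = [1.0, 200.0, 600.0]

# Toy tables standing in for the word embedding matrix and the
# pause duration embedding matrix; values are arbitrary.
VOCAB = {"<unk>": 0, "hello": 1, "world": 2}
WORD_EMB = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]
PAUSE_EMB = [[0.0], [0.25], [0.5], [1.0]]  # one row per discrete value


def pause_bucket(x_ms: float) -> int:
    """Map a pause duration to a discrete feature value (interval index)."""
    return bisect.bisect_right(BOUNDARIES_MS, x_ms)


def representation_vector(word: str, pause_ms: float) -> List[float]:
    """Splice (concatenate) the word vector and the pause duration vector."""
    word_vec = WORD_EMB[VOCAB.get(word, VOCAB["<unk>"])]
    pause_vec = PAUSE_EMB[pause_bucket(pause_ms)]
    return word_vec + pause_vec


print(pause_bucket(0.0), pause_bucket(150.0), pause_bucket(math.inf))  # 0 1 3
print(representation_vector("hello", 0.0))       # [0.1, 0.2, 0.0]
print(representation_vector("world", math.inf))  # [0.3, 0.4, 1.0]
```

Because an infinite pause always falls into the last interval, the word that ends the audio always receives the same pause duration vector regardless of the boundaries chosen.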

It should be noted that, for the implementation process and implementation principle of the speech recognition text punctuation prediction model training apparatus of the embodiment of the present invention, reference may be made to the corresponding descriptions of the above method embodiments, which are not repeated here. For example, the training apparatus of the speech recognition text punctuation prediction model of the embodiment of the present invention may be any intelligent device, including but not limited to a server, a computer, a smart phone, a personal computer, or a robot, used to train the punctuation prediction model.

Fig. 9 schematically illustrates a speech recognition text punctuation prediction apparatus according to an embodiment of the present invention, and as shown in fig. 9, the apparatus includes:

and the punctuation prediction module 7 is configured to splice the representation vectors of the words in the sub-audios obtained by the representation vector obtaining module 5 according to the time sequence of the plurality of sub-audios, input the representation vectors to the punctuation prediction model, and perform inference on punctuations of the words in the recognition text.

In some embodiments, the pause duration determining module 4 may specifically be configured to:

Wherein x is the pause duration in milliseconds;

It should be noted that, for the implementation process and implementation principle of the speech recognition text punctuation prediction apparatus of the embodiment of the present invention, reference may be made to the corresponding descriptions of the above method embodiments, which are not repeated here. The speech recognition text punctuation prediction apparatus of the embodiment of the present invention may be any intelligent device, including but not limited to a server or a computer, and in particular may be a client device of a product or software related to speech recognition, such as a smart phone, a personal computer, or a robot, used to perform punctuation prediction on speech recognition text.

In some embodiments, the present invention provides a non-transitory computer readable storage medium, in which one or more programs including executable instructions are stored, where the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform a speech recognition text punctuation prediction model training method or a speech recognition text punctuation prediction method according to any one of the above embodiments of the present invention.

In some embodiments, the present invention further provides a computer program product including a computer program stored on a non-transitory computer-readable storage medium, the computer program including program instructions that, when executed by a computer, cause the computer to perform the speech recognition text punctuation prediction model training method or the speech recognition text punctuation prediction method of any one of the above embodiments.

In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a speech recognition text punctuation prediction model training method or a speech recognition text punctuation prediction method of any of the above embodiments.

In some embodiments, the present invention further provides a storage medium having a computer program stored thereon, where the computer program is used to implement the method for training a speech recognition text punctuation prediction model or the method for predicting speech recognition text punctuation, according to any one of the above embodiments.

Fig. 10 is a schematic hardware structure diagram of an electronic device for performing a speech recognition text punctuation prediction model training method or a speech recognition text punctuation prediction method according to another embodiment of the present application, as shown in fig. 10, the device includes:

one or more processors 910 and memory 920, one processor 910 being exemplified in fig. 10.

The apparatus for performing a speech recognition text punctuation prediction model training method or a speech recognition text punctuation prediction method may further include: an input device 930 and an output device 940.

The processor 910, the memory 920, the input device 930, and the output device 940 may be connected by a bus or other means, and the bus connection is exemplified in fig. 10.

The memory 920 is a non-volatile computer-readable storage medium and can be used for storing non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the training method of the speech recognition text punctuation prediction model or the speech recognition text punctuation prediction method in the embodiments of the present application. The processor 910 executes various functional applications of the server and data processing, i.e. implementing the training method for the speech recognition text punctuation prediction model or the speech recognition text punctuation prediction method of the above-mentioned method embodiment, by running the non-volatile software program, instructions and modules stored in the memory 920.

The memory 920 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to a speech recognition text punctuation prediction model training method or use of a speech recognition text punctuation prediction method, or the like. Further, the memory 920 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 920 may optionally include memory located remotely from the processor 910, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 930 may receive input numerical or character information and generate signals related to user settings and function control of the electronic device. The output device 940 may include a display device such as a display screen.

The one or more modules are stored in the memory 920 and, when executed by the one or more processors 910, perform a speech recognition text punctuation prediction model training method or a speech recognition text punctuation prediction method in any of the method embodiments described above.

The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.

The electronic device of the embodiments of the present application exists in various forms, including but not limited to:

(1) Mobile communication devices: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communications. Such terminals include smart phones (e.g., iPhone), multimedia phones, feature phones, and low-end phones, among others.

(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.

(3) Portable entertainment devices: such devices can display and play multimedia content. They include audio and video players (e.g., iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.

(4) Servers: a server has an architecture similar to that of a general-purpose computer, but has higher requirements for processing capacity, stability, reliability, security, scalability, manageability, and the like, because it needs to provide highly reliable services.

(5) Other electronic devices with data interaction functions.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on such understanding, the essence of the above technical solutions, or the part that contributes to the related art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods of the embodiments or of some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. A training method of a speech recognition text punctuation prediction model is characterized by comprising the following steps:

determining the pause duration of each word in each sub-audio;

2. The method of claim 1, wherein dividing the training audio into a plurality of sub-audios according to the speech continuity of the training audio comprises:

3. The method of claim 2, wherein determining a pause duration for each word in each sub-audio comprises:

setting a pause duration of a last word in a last sub-audio in the training audio to infinity.

4. The method of claim 3, wherein determining a pause duration for each word in each sub-audio further comprises:

5. The method of claim 1, wherein dividing the training audio into a plurality of sub-audios according to the speech continuity of the training audio comprises:

the start time and the end time of each sub-audio in the training audio are recorded.

6. The method of claim 1, wherein the fusing the word embedding vector for each word and the pause duration vector corresponding to the pause duration as the representation vector for each word comprises:

determining discrete characteristic values reflecting the pause durations of the words according to a preset pause duration mapping function, and searching pause duration vectors corresponding to the discrete characteristic values in a pause duration embedded matrix;

7. The method of claim 6, wherein the predetermined pause duration mapping function is expressed by:

where x is the pause duration in milliseconds.

8. The method of claim 1, wherein training a punctuation prediction model using the representation vectors of the words and the recognized text after the manual punctuation comprises:

splicing the representation vectors of the words in the sub-audios according to the time sequence of the plurality of sub-audios to serve as the input of the punctuation prediction model, taking the manually punctuated recognized text as the target output, and training the punctuation prediction model.

9. A method for predicting punctuation of a speech recognition text, comprising:

determining the pause duration of each word in each sub-audio;

and splicing the representation vectors of the words in the sub-audios according to the time sequence of the plurality of sub-audios, inputting the spliced representation vectors into a punctuation prediction model, and inferring the punctuation of the words in the recognition text.

10. The method of claim 9, wherein dividing the target audio into a plurality of sub-audio according to the voice continuity of the target audio comprises:

using a voice endpoint detection model, and dividing a target audio into a plurality of sub-audios according to a time sequence by taking a non-voice segment as a partition, wherein each sub-audio is a voice segment;

the start time and the end time of each sub audio in the target audio are recorded.

11. The method of claim 10, wherein determining a pause duration for each word in each sub-audio comprises:

setting a pause duration of a last word in a last sub-audio in the target audio to infinity.

12. The method of claim 11, wherein determining a pause duration for each word in each sub-audio further comprises:

13. The method of claim 9, wherein dividing the target audio into a plurality of sub-audio according to the voice continuity of the target audio comprises:

dividing the target audio into a plurality of sub-audios according to a time sequence by using an alignment model, wherein each sub-audio is a word;

14. The method of claim 9, wherein the fusing the word embedding vector for each word and the pause duration vector corresponding to the pause duration as the representation vector for each word comprises:

15. The method of claim 14, wherein the expression of the predetermined pause duration mapping function is:

where x is the pause duration in milliseconds.

16. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-15.

17. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of a method according to any one of claims 1 to 15.