CN115687935A - Post-processing method, device, and equipment for speech recognition, and storage medium
- Publication number: CN115687935A (application number CN202310010363.2A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention belongs to the technical field of computers and discloses a post-processing method, device, and equipment for speech recognition, and a storage medium. The method comprises the following steps: constructing data from text data in a corpus in multiple ways to generate diversified training data, wherein the multiple ways comprise at least two of a punctuation deletion way, a normalization way, a text content correction way, and a text segment truncation way; training an initial model on the diversified training data to obtain a post-processing model; acquiring initial text data output by speech recognition; and post-processing the initial text data with the trained post-processing model to obtain target text data. In this way, post-processing steps such as punctuation restoration, inverse text normalization, and error word correction can be performed by a single model, which removes the time cost of a multi-step pipeline; and because no data needs to be labeled, labeling cost is reduced.
Description
Technical Field
The present invention relates to the field of computer technology, and in particular to a post-processing method, apparatus, device, and storage medium for speech recognition.
Background
Speech recognition is a cross-disciplinary sub-field of computer science and computational linguistics that converts speech into text using technologies such as deep learning, and it is widely applied in scenarios such as intelligent customer service, vehicle navigation, smart home, and simultaneous interpretation. However, most existing speech recognition models output only a bare character sequence. To improve readability, optimize user experience, and raise the accuracy of downstream tasks, the text output by speech recognition needs post-processing such as spoken-language smoothing, punctuation restoration, inverse text normalization, and error word correction. Current post-processing has the following problems: it is carried out in multiple steps, which is time-consuming and propagates errors; current inverse text normalization and error correction models require collected or labeled training data, and labeling is expensive and hard to extend to new domains; and most current post-processing methods model whole sentences, so they perform poorly on the incomplete sentences produced by streaming recognition, harming user experience and the accuracy of downstream tasks.
The above is provided only to assist understanding of the technical solution of the present invention and is not an admission that it constitutes prior art.
Disclosure of Invention
The main object of the present invention is to provide a post-processing method, device, and equipment for speech recognition and a storage medium, aiming to solve the technical problem that current post-processing is carried out in multiple steps and is time-consuming.
In order to achieve the above object, the present invention provides a post-processing method for speech recognition, comprising the following steps:
constructing data from text data in a corpus in multiple ways to generate diversified training data, wherein the multiple ways comprise at least two of a punctuation deletion way, a normalization way, a text content correction way, and a text segment truncation way;
training an initial model on the diversified training data to obtain a post-processing model;
acquiring initial text data output by speech recognition;
and post-processing the initial text data with the trained post-processing model to obtain target text data.
Optionally, constructing data from the text data in the corpus in multiple ways to generate diversified training data includes:
performing punctuation deletion on text data in the corpus to obtain first text data;
generating first training data from the first text data and the text data in the corpus;
and mixing the first training data with training data generated in other ways to obtain the diversified training data.
Optionally, constructing data from the text data in the corpus in multiple ways to generate diversified training data includes:
performing normalization on text data in the corpus to obtain second text data;
generating second training data from the second text data and the text data in the corpus;
and mixing the second training data with training data generated in other ways to obtain the diversified training data.
Optionally, constructing data from the text data in the corpus in multiple ways to generate diversified training data includes:
performing speech synthesis on text data in the corpus to obtain speech data;
recognizing the speech data with a speech recognition model to obtain third text data;
generating third training data from the third text data and the text data in the corpus;
and mixing the third training data with training data generated in other ways to obtain the diversified training data.
Optionally, after generating the third training data from the third text data and the text data in the corpus, the method further includes:
truncating the third text data to obtain fourth text data;
truncating the text data in the corpus to obtain fifth text data;
generating fourth training data from the fourth text data and the fifth text data;
and mixing the third training data, the fourth training data, and training data generated in other ways to obtain the diversified training data.
Optionally, truncating the text data in the corpus to obtain fifth text data includes:
applying multiple truncations to the text data in the corpus to obtain multiple candidate text data;
determining the similarity between each candidate text data and the fourth text data;
and selecting the candidate text data with the highest similarity to the fourth text data as the fifth text data.
Optionally, before constructing data from the text data in the corpus in multiple ways to generate diversified training data, the method further includes:
collecting a large amount of known text content;
splitting the known text content into sentences to obtain a plurality of sentence texts;
and storing the plurality of sentence texts as the corpus.
In addition, to achieve the above object, the present invention further provides a post-processing apparatus for speech recognition, including:
a data construction module, configured to construct data from text data in a corpus in multiple ways to generate diversified training data, wherein the multiple ways comprise at least two of a punctuation deletion way, a normalization way, a text content correction way, and a text segment truncation way;
a training module, configured to train an initial model on the diversified training data to obtain a post-processing model;
an acquisition module, configured to acquire initial text data output by speech recognition;
and a post-processing module, configured to post-process the initial text data with the trained post-processing model to obtain target text data.
In addition, to achieve the above object, the present invention further provides a post-processing device for speech recognition, including: a memory, a processor, and a post-processing program for speech recognition stored in the memory and executable on the processor, the post-processing program being configured to implement the post-processing method for speech recognition described above.
In addition, to achieve the above object, the present invention further provides a storage medium on which a post-processing program for speech recognition is stored; when executed by a processor, the program implements the post-processing method for speech recognition described above.
In the present invention, data is constructed from text data in a corpus in multiple ways to generate diversified training data, wherein the multiple ways comprise at least two of a punctuation deletion way, a normalization way, a text content correction way, and a text segment truncation way; an initial model is trained on the diversified training data to obtain a post-processing model; initial text data output by speech recognition is acquired; and the initial text data is post-processed with the trained post-processing model to obtain target text data. In this way, post-processing steps such as punctuation restoration, inverse text normalization, and error word correction can be performed by a single model, without separate handling of tasks such as spoken-language smoothing, punctuation restoration, inverse text normalization, and error word correction; the time cost of a multi-step pipeline is removed, no data needs to be labeled because diversified training data is constructed automatically, and labeling cost is reduced.
Drawings
FIG. 1 is a schematic diagram of a post-processing device for speech recognition in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of a speech recognition post-processing method according to the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of a post-processing method for speech recognition according to the present invention;
FIG. 4 is a schematic view of an exemplary process of the present invention;
FIG. 5 is an exemplary diagram of an end-to-end post-processing flow of an embodiment of the present invention;
FIG. 6 is a block diagram of a first embodiment of a post-processing apparatus for speech recognition according to the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a post-processing device for speech recognition in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the post-processing device for speech recognition may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to implement connection communication among these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a Wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The Memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk Memory. The memory 1005 may alternatively be a storage device separate from the processor 1001 described previously.
Those skilled in the art will appreciate that the architecture shown in fig. 1 does not constitute a limitation of a post-processing device for speech recognition, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include therein an operating system, a network communication module, a user interface module, and a post-processing program for voice recognition.
In the post-processing device for speech recognition shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The device calls, through the processor 1001, the post-processing program for speech recognition stored in the memory 1005 and executes the post-processing method for speech recognition provided by the embodiments of the present invention.
An embodiment of the present invention provides a post-processing method for speech recognition, and referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of the post-processing method for speech recognition according to the present invention.
In this embodiment, the post-processing method of speech recognition includes the following steps:
step S10: and respectively carrying out data construction on the text data in the corpus by multiple modes to generate diversified training data, wherein the multiple modes comprise at least two modes of a punctuation deletion mode, a standardization processing mode, a text content correction mode and a text fragment truncation mode.
It can be understood that the main execution body of this embodiment is a post-processing device for speech recognition, and the post-processing device for speech recognition may be a computer, a server, or other devices with artificial intelligence reasoning capability, which is not limited in this embodiment.
It should be noted that post-processing of speech recognition covers at least one of spoken-language smoothing, punctuation restoration, inverse text normalization, error word correction, and the like. Spoken-language smoothing converts spoken text into written text, including removal of filler words, repeated content, and so on; punctuation restoration splits the recognized text into sentences and adds the corresponding punctuation marks; inverse text normalization converts dates, numbers, and the like into their written form, for example "thirty-three percent" into "33%"; error word correction fixes wrong words in the recognition result, such as homophone errors. In the prior art, each of these points is mostly handled with a dedicated step. For example, for inverse text normalization, the grammar-rule-based WFST approach of WeTextProcessing implements three parts, Tagger, Reorder, and Verbalizer, which parse the text into structured information, reorder it, and splice it back into text; alternatively, large amounts of data are collected and labeled, and an inverse text normalization model is trained with a neural network. Punctuation restoration is converted into a sequence labeling task, as in distilbert-punctuator, which labels the text with a BERT-style pre-trained model and predicts whether punctuation marks such as periods should be added. Current approaches split post-processing into multiple steps, which is time-consuming and propagates errors.
It should be understood that the corpus contains a large amount of written text data, and the text data in the corpus is used to construct training data. Specifically, each processed text and its original text are stored together as a training sample. For example, if the text data c_1 stored in the corpus becomes c_11 after punctuation deletion, the training sample is constructed as (c_11, c_1), where c_11 is the sample data and c_1 is the label.
The punctuation deletion way deletes the punctuation in the text data of the corpus; the normalization way applies normalization such as spelling correction, conversion of numbers and abbreviations, and case conversion to the text data of the corpus; the text content correction way constructs error-containing text from the text data of the corpus so that the model learns to correct it; and the text segment truncation way truncates the text data of the corpus to construct training data for incomplete sentences. Diversified training data is constructed with at least two of the punctuation deletion way, the normalization way, the text content correction way, and the text segment truncation way, so that the post-processing model learns punctuation restoration and/or inverse text normalization and/or error word correction from the training data and can apply them jointly to the text data output by speech recognition.
Further, before step S10, the method further includes: collecting a large amount of known text content; splitting the known text content into sentences to obtain a plurality of sentence texts; and storing the plurality of sentence texts as the corpus.
In a specific implementation, a large amount of known text content is collected from news, novels, web pages, and the like. It is written-style text, which makes it suitable for training spoken-language smoothing of speech recognition output. Optionally, the Stanza open-source library is used to split the text into sentences, which are stored one sentence per line as a txt file and recorded as the corpus. Denote the text data in the corpus as c_i; the corpus then contains c_1, c_2, …, c_i, …, c_n.
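As an illustration of this corpus-construction step, the sketch below splits collected written text into one sentence per line with Stanza and stores it as a txt corpus; the file name and the raw_texts variable are placeholders rather than values specified by this embodiment.

```python
import stanza

stanza.download("zh")                              # one-time model download
nlp = stanza.Pipeline("zh", processors="tokenize")

def build_corpus(raw_texts, path="corpus.txt"):
    """Split collected written text (news, novels, web pages) into
    sentences and store them one per line as the corpus."""
    with open(path, "w", encoding="utf-8") as f:
        for text in raw_texts:
            for sentence in nlp(text).sentences:
                line = sentence.text.strip()
                if line:
                    f.write(line + "\n")
```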
Step S20: training the initial model on the diversified training data to obtain a post-processing model.
It should be understood that, to address the complexity of the multi-step pipeline, the initial model in this embodiment optionally uses an end-to-end generative pre-trained model as its backbone network. It learns the information carried by the training data to become the post-processing model, and the post-processing model generates text with punctuation marks restored and inverse text normalization applied.
Optionally, the end-to-end generative pre-trained model used in this embodiment is T5. Each block of the T5 model consists of Multi-Head Attention and a Feed-Forward Network, and stacked blocks form the encoder and the decoder; in this embodiment, 6 layers are used in each of the encoder and the decoder. Multi-Head Attention is equivalent to the combination of h different self-attention heads, and 8 heads are used in this embodiment. The T5 model is pre-trained with a self-supervised masked-language objective, so it has relatively strong language modeling capability. The other training parameters of the post-processing model are a maximum text length of 256, the Adam optimization algorithm, a learning rate of 1e-5, a batch size of 16, and 2 training epochs. In a specific implementation, the diversified training data contains the processed text data and the text data before processing; the processed text data is the input of the initial model, and the text data before processing is the prediction target during training.
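A minimal fine-tuning sketch matching the hyper-parameters above (maximum length 256, an Adam-style optimizer, learning rate 1e-5, batch size 16, 2 epochs) is shown below; it assumes the Hugging Face transformers library, and the checkpoint name and the pairs list of (processed text, original text) tuples are illustrative placeholders rather than values fixed by this embodiment.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, T5ForConditionalGeneration

MODEL_NAME = "t5-base"                      # assumption: any T5-style checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Diversified training data: (processed text c_ix, original text c_i) pairs.
pairs = [
    ("next the results show that the post-70s account for 33%",
     "Next, the results show that the post-70s account for 33%."),
]

def collate(batch):
    sources, targets = zip(*batch)
    enc = tokenizer(list(sources), max_length=256, truncation=True,
                    padding=True, return_tensors="pt")
    labels = tokenizer(list(targets), max_length=256, truncation=True,
                       padding=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100   # mask padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(pairs, batch_size=16, shuffle=True, collate_fn=collate)
model.train()
for epoch in range(2):                      # 2 training epochs
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```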
Step S30: acquiring initial text data output by speech recognition.
It should be noted that the initial text data may be the rough text output by speech recognition; it may contain wrong words and lack punctuation marks, which results in poor readability.
Step S40: post-processing the initial text data with the trained post-processing model to obtain target text data.
It should be understood that the trained post-processing model has learned punctuation restoration and/or inverse text normalization and/or error word correction. Post-processing the initial text data output by speech recognition with this model therefore realizes punctuation restoration and/or inverse text normalization and/or error word correction in a single pass, replacing the traditional multi-step post-processing and improving processing efficiency.
It should be noted that this embodiment uses the end-to-end generative pre-trained model T5 as the backbone network: the rough speech recognition text (without punctuation marks and the like) is the input, and a readable, error-corrected text is output end to end, completing the various post-processing tasks at the same time. Training such a model normally requires training data whose labeling is costly. In this embodiment, data is instead constructed in at least two ways from written text such as news and novels, generating a large amount of diversified training data without any labeling. The construction ways are: (1) deleting punctuation from the collected text, which solves the labeling problem of punctuation restoration data; (2) normalizing the collected text, which solves the labeling problem of inverse text normalization; (3) synthesizing speech from the collected text and then running speech recognition on it to simulate real recognition output, which solves the labeling problem of error word correction in speech recognition; and (4) randomly truncating the output text obtained in (3) to simulate streaming recognition, which solves the labeling problem of incomplete sentences in streaming recognition.
In this embodiment, data is constructed from text data in a corpus in multiple ways to generate diversified training data, wherein the multiple ways comprise at least two of a punctuation deletion way, a normalization way, a text content correction way, and a text segment truncation way; an initial model is trained on the diversified training data to obtain a post-processing model; initial text data output by speech recognition is acquired; and the initial text data is post-processed with the trained post-processing model to obtain target text data. In this way, post-processing steps such as punctuation restoration, inverse text normalization, and error word correction are performed by a single model, without separate handling of spoken-language smoothing, punctuation restoration, inverse text normalization, and error word correction; the time cost of a multi-step pipeline is removed, no data needs to be labeled because diversified training data is constructed automatically, and labeling cost is reduced. In addition, the intermediate texts produced during streaming speech recognition are modeled and optimized, which improves the robustness of post-processing on incomplete sentences.
Referring to fig. 3, fig. 3 is a flowchart illustrating a post-processing method for speech recognition according to a second embodiment of the present invention.
Based on the first embodiment, in a first implementation manner, the step S10 of the post-processing method for speech recognition in this embodiment includes:
step S101: and performing punctuation deletion processing on the text data in the corpus to obtain first text data.
It should be understood that, for each piece of text data c _ i in the collected corpus, punctuation symbols therein are deleted, and the deleted punctuation symbols are, for example, "$% & +, -/: (ii) a < = [ { a' [/is _ { _ a { } and "\8230;). "and so on, and may further include, for example, question mark, exclamation mark, special symbol, and so on, to obtain the text data c _ i1 without punctuation mark, referring to fig. 4, fig. 4 is a schematic diagram of a processing manner of an example of the present invention, for example, regarding the text data" second, the result display is 33% after 70. "get after punctuation deletion" next result shows 33% after 70 ".
Step S102: and generating first training data according to the first text data and the text data in the corpus.
First training data (c _ i1, c _ i) are generated according to the first text data c _ i1 and the text data c _ i in the corpus, wherein c _ i1 is sample data of the first training data, and c _ i is labeled data of the first training data. A training set T1 is further formed.
And mixing the first training data with training data generated by other modes to obtain diversified training data.
It should be understood that the other mode is at least one of a standardization processing mode, a text content correction mode and a text segment truncation mode, and training data generated by at least two modes are mixed to obtain diversified training data.
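A sketch of this first construction way is given below, assuming the one-sentence-per-line corpus file from the earlier step; the punctuation set is illustrative only.

```python
import re

# Illustrative punctuation set (Chinese and ASCII marks); percent signs are kept,
# matching the example above.
PUNCT_RE = re.compile(r"[，。！？；：、,.!?;:（）()\[\]{}<>《》“”‘’'…—-]")

def build_punct_pairs(corpus_path="corpus.txt"):
    """Way 1: pair punctuation-free text c_i1 with the original c_i (training set T1)."""
    pairs = []
    with open(corpus_path, encoding="utf-8") as f:
        for c_i in (line.strip() for line in f):
            if not c_i:
                continue
            c_i1 = PUNCT_RE.sub("", c_i)     # first text data: punctuation deleted
            pairs.append((c_i1, c_i))        # (sample, label)
    return pairs
```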
In a second implementation manner, the step S10 of the post-processing method for speech recognition in this embodiment includes:
step S103: and carrying out standardization processing on the text data in the corpus to obtain second text data.
In a specific implementation, spelling correction processing, number and abbreviation conversion processing, and case conversion processing are optionally performed on the text data in the corpus to obtain second text data. Alternatively, sentences are normalized using the Normalizer class of WeTextProcessing (normaize). Referring to fig. 4, for example, for the text data "second, the result display is 33% after 70. Second, the results show a percentage of thirty-three percent after seventy-zero. ".
Step S104: and generating second training data according to the second text data and the text data in the corpus.
It should be understood that the second training data (c _ i1, c _ i) is generated according to the second text data c _ i2 and the text data c _ i in the corpus, wherein c _ i2 is sample data of the second training data, and c _ i is labeled data of the second training data. A training set T2 is further formed.
And mixing the second training data with training data generated by other modes to obtain diversified training data.
It should be noted that the other method is at least one of a method for deleting a landmark, a method for correcting text content, and a method for truncating a text segment, and training data generated in at least two methods are mixed to obtain diversified training data.
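A corresponding sketch for this second construction way is shown below; the import path for the WeTextProcessing Normalizer is an assumption about that package's layout, since the embodiment only names the Normalizer class and its normalize method.

```python
from tn.chinese.normalizer import Normalizer   # assumed WeTextProcessing module path

normalizer = Normalizer()

def build_norm_pairs(corpus_lines):
    """Way 2: pair normalized (spoken-form) text c_i2 with the written c_i (set T2)."""
    pairs = []
    for c_i in corpus_lines:
        c_i2 = normalizer.normalize(c_i)   # e.g. "33%" -> "thirty-three percent"
        pairs.append((c_i2, c_i))
    return pairs
```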
In a third implementation manner, the step S10 of the post-processing method for speech recognition in this embodiment includes:
step S105: and carrying out voice synthesis on the text data in the corpus to obtain voice data.
Step S106: and recognizing the voice data by using a voice recognition model to obtain third text data.
It should be understood that the present embodiment does not limit the speech synthesis method and the speech synthesis method. Illustratively, a fastspeech framework is adopted to train a speech synthesis model based on a data set such as Aishell, and c _ i is synthesized into audio speech _ i based on the trained speech synthesis model. Illustratively, a WeNet framework is adopted to train a voice recognition model based on Wenetspeed open source data, and voice data speed _ i is recognized as text c _ i3 based on the trained voice recognition model. Referring to fig. 4, for example, text data "next, the result is displayed to be 33% after 70. "synthesized into speech data, and then speech recognized as" the next result shows that seven zeros later accounts for thirty-three percent ".
Step S107: and generating third training data according to the third text data and the text data in the corpus.
Third training data (c _ i3, c _ i) are generated according to the third text data c _ i3 and the text data c _ i in the corpus, wherein c _ i3 is sample data of the third training data, and c _ i is labeled data of the third training data. A training set T3 is further formed.
And mixing the third training data with training data generated by other modes to obtain diversified training data.
It should be understood that the other method is at least one of a method of deleting a landmark, a method of standardizing, and a method of truncating a text segment, and training data generated by at least two methods are mixed to obtain diversified training data.
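The sketch below illustrates this third construction way as a TTS-to-ASR round trip; synthesize and recognize are hypothetical wrappers standing in for a FastSpeech-style synthesis model and a WeNet-style recognition model, whose concrete APIs are not specified in this embodiment.

```python
def synthesize(text: str) -> bytes:
    """Hypothetical wrapper: return audio speech_i synthesized from the sentence."""
    raise NotImplementedError("plug in a trained FastSpeech-style TTS model here")

def recognize(audio: bytes) -> str:
    """Hypothetical wrapper: return the ASR hypothesis c_i3 for the audio."""
    raise NotImplementedError("plug in a trained WeNet-style ASR model here")

def build_asr_pairs(corpus_lines):
    """Way 3: pair the errorful ASR output c_i3 with the original c_i (training set T3)."""
    pairs = []
    for c_i in corpus_lines:
        speech_i = synthesize(c_i)     # text-to-speech
        c_i3 = recognize(speech_i)     # recognition output, possibly with wrong words
        pairs.append((c_i3, c_i))
    return pairs
```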
In a fourth implementation manner, after step S107, the method for post-processing speech recognition according to this embodiment further includes:
step S108: and performing truncation processing on the third text data to obtain fourth text data.
Step S109: and performing truncation processing on the text data in the corpus to obtain fifth text data.
It should be noted that, in the above three implementation manners, all the training data are whole-sentence text data, and an incomplete sentence is generated in the speech stream recognition process, for example, "the result is obvious" to solve the post-processing problem of the incomplete sentence, in this embodiment, the third text data c _ i3 generated by text error correction and the text data c _ i in the corpus are cut to obtain the fourth text data c _ i3 'and the fifth text data c _ i', and the incomplete sentence is simulated, so that the post-processing model learns information carried by the incomplete sentence, and the post-processing efficiency of the incomplete sentence is improved. Alternatively, the truncation process of the third text data and the truncation process of the text data in the corpus in the present embodiment are the same fixed truncation manner, for example, truncating k characters from the head, truncating p characters from the tail, and the like. Referring to fig. 4, for example, let "next, the result shows that the ratio is 33% after 70. "the truncation processing is" next, the result display is 70 th-order percentage ", and" next, the result display is seventy-zero-order percentage "truncation processing is" next, the result display is seventy-zero-order percentage ".
Step S110: and generating fourth training data according to the fourth text data and the fifth text data.
It should be understood that fourth training data (c _ i3 ', c _ i') is generated according to the fourth text data c _ i3 'and the fifth text data c _ i', wherein c _ i3 'is sample data of the fourth training data, and c _ i' is labeled data of the fourth training data. A training set T4 is further formed.
And mixing the third training data, the fourth training data and training data generated in other modes to obtain diversified training data.
The other method is at least one of a point deletion method, a normalization method, and a text content correction method, and the training data generated in at least two methods are mixed to obtain diversified training data.
Further, step S109 includes: applying multiple truncations to the text data in the corpus to obtain multiple candidate text data; determining the similarity between each candidate text data and the fourth text data; and selecting the candidate text data with the highest similarity to the fourth text data as the fifth text data.
It should be understood that the multiple truncations enumerate all possible truncation results of the text data in the corpus. Optionally, the third text data is truncated randomly; if the corpus text data were also truncated randomly, the resulting fourth text data c_i3' and fifth text data c_i' would very likely not match. Therefore, in this embodiment the similarity between c_i3' and every truncation result of c_i is computed with a similarity algorithm, and the most similar truncation result is selected. Optionally, the similarity is computed from the character error rate (CER) as similarity = 1 - CER, where CER = (number of substituted characters + number of deleted characters + number of inserted characters) / total number of characters.
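A sketch of this matching step under the CER-based similarity above is given below; the edit-distance routine is written out rather than taken from a library, and candidate truncations are illustrated as prefixes of c_i.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(hyp: str, ref: str) -> float:
    """similarity = 1 - CER, with CER = edit operations / total reference characters."""
    return 1.0 - edit_distance(hyp, ref) / max(len(ref), 1)

def best_truncation(c_i: str, c_i3_trunc: str) -> str:
    """Pick the truncation of c_i (fifth text data c_i') that is closest to the
    randomly truncated ASR text c_i3' (fourth text data)."""
    candidates = [c_i[:k] for k in range(1, len(c_i) + 1)]
    return max(candidates, key=lambda cand: similarity(c_i3_trunc, cand))
```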
Optionally, any two of the training sets T1, T2, T3, and T4 are mixed to form the training set T, which contains the diversified training data. Preferably, all of T1, T2, T3, and T4 are mixed to form T, which guarantees the diversity of the training samples; meanwhile, all of the above steps can be completed with scripts, saving a large amount of manpower and material resources.
Referring to FIG. 5, FIG. 5 is an exemplary diagram of an end-to-end post-processing flow of an embodiment of the present invention. The post-processing model of this embodiment can add punctuation and apply inverse text normalization: for example, speech recognition outputs "the results show that the post-seven-zeros account for thirty-three percent", and the post-processing model outputs "The results show that the post-70s account for 33%.". The post-processing model can also post-process an incomplete sentence: for example, speech recognition outputs "next the results show that the post-seven-zeros", and the post-processing model outputs "Next, the results show that the post-70s". The post-processing model can further correct recognition errors: for example, speech recognition outputs "the structural area is the area occupied by the wall body only", which contains a misrecognized word, and the post-processing model outputs the corrected, punctuated sentence "The structural area is the area occupied by the wall body."
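For the inference side of this end-to-end flow, a minimal sketch is shown below; the checkpoint path is a placeholder for the post-processing model trained above, and the commented input/output mirrors the FIG. 5 example.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

CKPT = "path/to/postprocessing-model"      # placeholder for the trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = T5ForConditionalGeneration.from_pretrained(CKPT)

def post_process(asr_text: str) -> str:
    """Turn raw ASR output into punctuated, inverse-normalized, corrected text."""
    inputs = tokenizer(asr_text, max_length=256, truncation=True, return_tensors="pt")
    output_ids = model.generate(**inputs, max_length=256)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Expected behaviour, following the FIG. 5 example:
#   post_process("the results show that the post-seven-zeros account for thirty-three percent")
#   -> "The results show that the post-70s account for 33%."
```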
In this embodiment, data is constructed from the text data in the corpus with at least two of the punctuation deletion way, the normalization way, the text content correction way, and the text segment truncation way, generating diversified training data. The post-processing model thus learns the information carried in the training data and performs punctuation restoration and/or inverse text normalization and/or error word correction jointly on the text data output by speech recognition, and it can also post-process incomplete sentences. This removes the time cost of a multi-step pipeline, improves the efficiency of speech recognition post-processing, requires no labeled data because the diversified training data is constructed automatically, and reduces labeling cost.
In addition, an embodiment of the present invention further provides a storage medium, where the storage medium stores a post-processing program for speech recognition, and the post-processing program for speech recognition, when executed by a processor, implements the post-processing method for speech recognition as described above.
Since the storage medium adopts all the technical solutions of all the above embodiments, it has at least all the beneficial effects brought by those technical solutions, which are not repeated here.
Referring to fig. 6, fig. 6 is a block diagram illustrating a first embodiment of a post-processing apparatus for speech recognition according to the present invention.
As shown in fig. 6, the post-processing apparatus for speech recognition according to the embodiment of the present invention includes:
the data construction module 10 is configured to perform data construction on text data in the corpus respectively through multiple modes to generate diversified training data, where the multiple modes include at least two modes of a punctuation deletion mode, a standardization processing mode, a text content correction mode, and a text segment truncation mode.
And the training module 20 is configured to train the initial model according to the diversified training data to obtain a post-processing model.
And an obtaining module 30, configured to obtain initial text data after voice recognition.
And the post-processing module 40 is configured to perform post-processing on the initial text data by using the trained post-processing model to obtain target text data.
It should be understood that the above is only an example, and the technical solution of the present invention is not limited in any way, and in a specific application, a person skilled in the art may set the technical solution as needed, and the present invention is not limited thereto.
In this embodiment, data is constructed from text data in a corpus in multiple ways to generate diversified training data, wherein the multiple ways comprise at least two of a punctuation deletion way, a normalization way, a text content correction way, and a text segment truncation way; an initial model is trained on the diversified training data to obtain a post-processing model; initial text data output by speech recognition is acquired; and the initial text data is post-processed with the trained post-processing model to obtain target text data. In this way, post-processing steps such as punctuation restoration, inverse text normalization, and error word correction are performed by a single model, without separate handling of spoken-language smoothing, punctuation restoration, inverse text normalization, and error word correction; the time cost of a multi-step pipeline is removed, no data needs to be labeled because diversified training data is constructed automatically, and labeling cost is reduced. In addition, the intermediate texts produced during streaming speech recognition are modeled and optimized, which improves the robustness of post-processing on incomplete sentences.
It should be noted that the above-mentioned work flows are only illustrative and do not limit the scope of the present invention, and in practical applications, those skilled in the art may select some or all of them according to actual needs to implement the purpose of the solution of the present embodiment, and the present invention is not limited herein.
In addition, the technical details that are not described in detail in this embodiment may refer to the post-processing method for speech recognition provided in any embodiment of the present invention, and are not described herein again.
In an embodiment, the data construction module 10 is further configured to perform punctuation deletion on text data in the corpus to obtain first text data; generate first training data from the first text data and the text data in the corpus; and mix the first training data with training data generated in other ways to obtain the diversified training data.
In an embodiment, the data construction module 10 is further configured to perform normalization on text data in the corpus to obtain second text data; generate second training data from the second text data and the text data in the corpus; and mix the second training data with training data generated in other ways to obtain the diversified training data.
In an embodiment, the data construction module 10 is further configured to perform speech synthesis on text data in the corpus to obtain speech data; recognize the speech data with a speech recognition model to obtain third text data; generate third training data from the third text data and the text data in the corpus; and mix the third training data with training data generated in other ways to obtain the diversified training data.
In an embodiment, the data construction module 10 is further configured to truncate the third text data to obtain fourth text data; truncate the text data in the corpus to obtain fifth text data; generate fourth training data from the fourth text data and the fifth text data; and mix the third training data, the fourth training data, and training data generated in other ways to obtain the diversified training data.
In an embodiment, the data construction module 10 is further configured to apply multiple truncations to the text data in the corpus to obtain multiple candidate text data; determine the similarity between each candidate text data and the fourth text data; and select the candidate text data with the highest similarity to the fourth text data as the fifth text data.
In an embodiment, the post-processing apparatus for speech recognition further comprises a collection module;
the collection module is configured to collect a large amount of known text content; split the known text content into sentences to obtain a plurality of sentence texts; and store the plurality of sentence texts as the corpus.
Further, it should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element preceded by "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are only for description, and do not represent the advantages and disadvantages of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g. Read Only Memory (ROM)/RAM, magnetic disk, optical disk), and includes several instructions for enabling a terminal device (e.g. a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (10)
1. A post-processing method for speech recognition, the post-processing method comprising:
constructing data from text data in a corpus in multiple ways to generate diversified training data, wherein the multiple ways comprise at least two of a punctuation deletion way, a normalization way, a text content correction way, and a text segment truncation way;
training an initial model on the diversified training data to obtain a post-processing model;
acquiring initial text data output by speech recognition;
and post-processing the initial text data with the trained post-processing model to obtain target text data.
2. The post-processing method for speech recognition according to claim 1, wherein constructing data from the text data in the corpus in multiple ways to generate diversified training data comprises:
performing punctuation deletion on text data in the corpus to obtain first text data;
generating first training data from the first text data and the text data in the corpus;
and mixing the first training data with training data generated in other ways to obtain the diversified training data.
3. The post-processing method for speech recognition according to claim 1, wherein constructing data from the text data in the corpus in multiple ways to generate diversified training data comprises:
performing normalization on text data in the corpus to obtain second text data;
generating second training data from the second text data and the text data in the corpus;
and mixing the second training data with training data generated in other ways to obtain the diversified training data.
4. The post-processing method for speech recognition according to claim 1, wherein constructing data from the text data in the corpus in multiple ways to generate diversified training data comprises:
performing speech synthesis on text data in the corpus to obtain speech data;
recognizing the speech data with a speech recognition model to obtain third text data;
generating third training data from the third text data and the text data in the corpus;
and mixing the third training data with training data generated in other ways to obtain the diversified training data.
5. The post-processing method for speech recognition according to claim 4, wherein after generating the third training data from the third text data and the text data in the corpus, the method further comprises:
truncating the third text data to obtain fourth text data;
truncating the text data in the corpus to obtain fifth text data;
generating fourth training data from the fourth text data and the fifth text data;
and mixing the third training data, the fourth training data, and training data generated in other ways to obtain the diversified training data.
6. The post-processing method for speech recognition according to claim 5, wherein truncating the text data in the corpus to obtain fifth text data comprises:
applying multiple truncations to the text data in the corpus to obtain multiple candidate text data;
determining the similarity between each candidate text data and the fourth text data;
and selecting the candidate text data with the highest similarity to the fourth text data as the fifth text data.
7. The post-processing method for speech recognition according to any one of claims 1-6, wherein before constructing data from the text data in the corpus in multiple ways to generate diversified training data, the method further comprises:
collecting a large amount of known text content;
splitting the known text content into sentences to obtain a plurality of sentence texts;
and storing the plurality of sentence texts as the corpus.
8. A post-processing apparatus for speech recognition, the post-processing apparatus comprising:
a data construction module, configured to construct data from text data in a corpus in multiple ways to generate diversified training data, wherein the multiple ways comprise at least two of a punctuation deletion way, a normalization way, a text content correction way, and a text segment truncation way;
a training module, configured to train an initial model on the diversified training data to obtain a post-processing model;
an acquisition module, configured to acquire initial text data output by speech recognition;
and a post-processing module, configured to post-process the initial text data with the trained post-processing model to obtain target text data.
9. A post-processing device for speech recognition, the device comprising: a memory, a processor and a post-processing program of speech recognition stored on the memory and executable on the processor, the post-processing program of speech recognition being configured to implement the post-processing method of speech recognition according to any one of claims 1 to 7.
10. A storage medium having stored thereon a post-processing program for speech recognition, which when executed by a processor implements a post-processing method for speech recognition according to any one of claims 1 to 7.
Priority Application (1)

- CN202310010363.2A, filed 2023-01-05: Post-processing method, device and equipment for speech recognition and storage medium

Publication (1)

- CN115687935A, published 2023-02-03
Family

- Family ID: 85057031
- Family application: CN202310010363.2A, filed 2023-01-05, published as CN115687935A (CN), status Pending
Cited By (1)

- CN116403559A (priority 2023-03-30, published 2023-07-07): Implementation method of text-driven video generation system

Patent Citations (8)

- US20200294489A1 (priority 2019-03-11, published 2020-09-17): Methods, computing devices, and storage media for generating training corpus
- CN112509562A (priority 2020-11-09, published 2021-03-16): Method, apparatus, electronic device and medium for text post-processing
- CN112712794A (priority 2020-12-25, published 2021-04-27): Speech recognition marking training combined system and device
- CN114637843A (priority 2020-12-15, published 2022-06-17): Data processing method and device, electronic equipment and storage medium
- US20220208176A1 (priority 2020-12-28, published 2022-06-30): Punctuation and capitalization of speech recognition transcripts
- US20220366894A1 (priority 2021-05-11, published 2022-11-17): Method for generating training data and method for post-processing of speech recognition using the same
- CN113807098A (priority 2021-08-26, published 2021-12-17): Model training method and device, electronic equipment and storage medium
- CN113889092A (priority 2021-10-29, published 2022-01-04): Training method, processing method and device of post-processing model of voice recognition result
Legal Events

- PB01: Publication (application publication date: 2023-02-03)
- SE01: Entry into force of request for substantive examination
- RJ01: Rejection of invention patent application after publication