CN117789705A - Data processing method, device, equipment and storage medium

Data processing method, device, equipment and storage medium

Info

Publication number
CN117789705A
CN117789705A
Authority
CN
China
Prior art keywords
text
audio
aligned
word
recognition
Prior art date
Legal status
Pending
Application number
CN202311871407.7A
Other languages
Chinese (zh)
Inventor
李方祝
付立
邓丽萍
范璐
吴友政
何晓冬
Current Assignee
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202311871407.7A
Publication of CN117789705A


Abstract

The disclosure provides a data processing method, a device, equipment and a storage medium, and relates to the technical field of voice recognition. The method includes the following steps: obtaining audio data of a video to be processed and image recognition text obtained by performing subtitle text recognition on the corresponding video frames to be processed; performing voice recognition on the audio data of the video to be processed and performing forced alignment processing to obtain aligned text; performing error correction processing on the aligned text to obtain error corrected text; and screening the error corrected text with reference to the corresponding image recognition text to obtain training data for training a voice recognition model. The method expands the training dataset of the speech recognition model.

Description

Data processing method, device, equipment and storage medium
Technical Field
The disclosure relates to the technical field of voice recognition, and in particular relates to a data processing method, a data processing device, electronic equipment and a readable storage medium.
Background
Speech recognition technology, also known as automatic speech recognition (Automatic Speech Recognition, ASR), aims to convert the lexical content in speech into computer-readable input. In the related art, machine learning models based on artificial intelligence technology are adopted for voice recognition. To improve the accuracy of a speech recognition model, a large amount of training data with accurate labels is generally required; therefore, how to provide rich and effective training data for a speech recognition model is a problem to be solved.
The above information disclosed in the background section is only for enhancement of understanding of the background of the disclosure and therefore it may include information that does not form the prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
It is an object of the present disclosure to provide a data processing method, apparatus, electronic device and readable storage medium that extend, at least to some extent, the data set that can be used to train a speech recognition model.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to an aspect of the present disclosure, there is provided a data processing method including: acquiring audio data of a video to be processed and a corresponding image recognition text, wherein the image recognition text is obtained by performing subtitle text recognition on a video frame to be processed; performing voice recognition and forced alignment processing on the audio data of the video to be processed to obtain aligned texts; performing error correction processing on the aligned text to obtain an error corrected text; and screening the corrected text by referring to the corresponding image recognition text to obtain training data for training a voice recognition model.
According to an embodiment of the disclosure, the audio data of the video to be processed and the corresponding image recognition text include a plurality of audio text pairs, where the audio text pairs include an image recognition text obtained by performing subtitle text recognition on a video frame of a subtitle and audio data of a time interval corresponding to the subtitle; performing error correction processing on the aligned text to obtain error corrected text, including: word segmentation processing is carried out on the aligned texts of the audio text pairs, and a word list of the aligned texts is obtained; judging whether the aligned text of the audio text pair is greater than three words after word segmentation according to the word list; if the aligned text of the audio text pair is more than three words after word segmentation, scoring the aligned text corresponding to the audio text pair by using a ternary language model to obtain the probability that the aligned text corresponding to the audio text pair is a sentence; and carrying out error correction processing on the aligned texts corresponding to the audio text pairs based on the probability to obtain error corrected texts of the audio text pairs.
According to an embodiment of the present disclosure, performing error correction processing on the aligned text to obtain an error corrected text, and further including: if the aligned text of the audio text pair is not more than three words after word segmentation, scoring the aligned text corresponding to the audio text pair by using the binary language model to obtain the probability that the aligned text corresponding to the audio text pair is a sentence; and carrying out error correction processing on the aligned texts corresponding to the audio text pairs based on the probability to obtain error corrected texts of the audio text pairs.
According to an embodiment of the present disclosure, performing error correction processing on the aligned text corresponding to the audio text pair based on the probability to obtain an error corrected text of the audio text pair, including: if the probability is not greater than a preset probability threshold, candidate replacement words of each word in the word list are obtained, wherein the candidate replacement words are obtained according to pinyin of the corresponding word; and scoring the text in the word list replaced by the candidate replacement words by using a language model, and determining the word list with the highest probability of being a sentence after replacement as the corrected text of the audio text pair.
According to an embodiment of the present disclosure, obtaining candidate replacement words for each word in the word list includes: the pinyin of each word in the word list is obtained; obtaining similar sounds, the similarity of which to the pinyin of each word in the word list is greater than a preset similarity threshold value; homophones of similar sounds of the words in the word list are obtained and used as candidate replacement words of the corresponding words.
According to an embodiment of the present disclosure, the training data includes a supervised learning data set and a semi-supervised learning data set; screening the corrected text with reference to the corresponding image recognition text to obtain training data for training the speech recognition model, including: obtaining the confidence coefficient of the text after error correction compared with the corresponding image recognition text; adding the corrected text with the confidence coefficient larger than a preset confidence coefficient threshold value and corresponding audio data into the supervised learning data set; and adding the corrected text with the confidence coefficient not larger than a preset confidence coefficient threshold value and corresponding audio data into the semi-supervised learning data set.
According to an embodiment of the present disclosure, the method further comprises: detecting a region where a text appears in a current video frame of the video to be processed, and obtaining a text region of the current video frame; judging whether the text area of the current video frame is a preset subtitle area or not; if the text region of the current video frame is determined to be the preset caption region, detecting continuous video frames by taking the timestamp of the current video frame as a time starting point, and obtaining a time interval corresponding to the caption of the text region of the current video frame; performing optical character recognition processing on the text region of the current video frame to obtain an image recognition text of the subtitle of the text region of the current video frame; and dividing the audio data of the video to be processed according to the time interval corresponding to each caption so as to obtain a plurality of audio text pairs.
According to an embodiment of the present disclosure, performing speech recognition on the audio data of the video to be processed and performing forced alignment processing to obtain aligned text, including: inputting the audio data in the audio text pair into the voice recognition model in a voice recognition and forced alignment tool to obtain a corresponding voice recognition text; inputting the audio data in the audio text pair and the corresponding voice recognition text into a forced alignment model in a voice recognition and forced alignment tool to obtain aligned text of the audio text pair.
According to still another aspect of the present disclosure, there is provided a data processing apparatus including: the acquisition module is used for acquiring audio data of the video to be processed and corresponding image recognition texts, wherein the image recognition texts are obtained by performing subtitle text recognition on the video frames to be processed; the voice transcription module is used for carrying out voice recognition on the audio data of the video to be processed and carrying out forced alignment processing to obtain aligned texts; the text error correction module is used for carrying out error correction processing on the aligned text to obtain an error corrected text; and the processing module is used for filtering the corrected text by referring to the corresponding image recognition text so as to obtain training data for training the voice recognition model.
According to still another aspect of the present disclosure, there is provided an electronic apparatus including: a memory, a processor, and executable instructions stored in the memory and executable in the processor, the processor implementing any of the methods described above when executing the executable instructions.
According to yet another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement any of the methods described above.
According to the data processing method provided by the embodiment of the disclosure, the audio data of the video to be processed and the image recognition text obtained by performing subtitle text recognition on the corresponding video frame to be processed are obtained, voice recognition is performed on the audio data of the video to be processed and forced alignment processing is performed to obtain the aligned text, error correction processing is performed on the aligned text to obtain the error corrected text, and the error corrected text is screened with reference to the corresponding image recognition text to obtain training data for training a voice recognition model, so that the training data set of the voice recognition model can be expanded.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
Fig. 1 is a schematic diagram showing a system configuration in an embodiment of the present disclosure.
Fig. 2 shows a flow chart of a data processing method in an embodiment of the present disclosure.
Fig. 3 is a flow chart illustrating a method of obtaining audio text pairs according to an exemplary embodiment.
Fig. 4 is a schematic diagram of an OCR text recognition flow shown in accordance with fig. 3.
Fig. 5 shows a schematic diagram of the processing procedure of step S204 shown in fig. 2 in an embodiment.
Fig. 6 illustrates a forced alignment diagram of speech recognition text.
Fig. 7 is a schematic diagram showing the processing procedure of step S206 shown in fig. 2 in an embodiment.
Fig. 8 shows a schematic diagram of the processing procedure of step S206 shown in fig. 2 in another embodiment.
Fig. 9 shows a schematic diagram of the processing procedure of step S804 shown in fig. 8 in an embodiment.
Fig. 10 is a schematic flow chart of error correction of speech recognition text according to the embodiment shown in fig. 7 to 9.
Fig. 11 is a schematic diagram showing the processing procedure of step S208 shown in fig. 2 in an embodiment.
Fig. 12 is a schematic diagram of a training data screening flow with a language model introduced, according to the embodiments shown in fig. 2 to 11.
Fig. 13 shows a block diagram of a data processing apparatus in an embodiment of the present disclosure.
Fig. 14 shows a block diagram of another data processing apparatus in an embodiment of the present disclosure.
Fig. 15 shows a schematic structural diagram of an electronic device in an embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, apparatus, steps, etc. In other instances, well-known structures, methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present disclosure, the meaning of "a plurality" is at least two, such as two, three, etc., unless explicitly specified otherwise. The symbol "/" generally indicates that the context-dependent object is an "or" relationship.
In the present disclosure, unless explicitly specified and limited otherwise, terms such as "connected" and the like are to be construed broadly and, for example, may be electrically connected or may communicate with each other; can be directly connected or indirectly connected through an intermediate medium. The specific meaning of the terms in this disclosure will be understood by those of ordinary skill in the art as the case may be.
The following is a description of terms involved in the present disclosure.
Optical character recognition (Optical Character Recognition, OCR): a process of analyzing, recognizing and processing an image file of text data to obtain the text and layout information, i.e., recognizing the text in the image. For example, the PaddleOCR model is a PaddlePaddle-based OCR tool library that supports recognition of mixed Chinese, English and numeric text, vertical text and long text, and supports a plurality of training algorithms for text detection and text recognition. In OCR recognition, text lines in an image are first detected using a detection algorithm, and then the detected text lines are recognized into specific text using a recognition algorithm.
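As an illustration of the two-stage OCR flow described above, the following is a minimal sketch of invoking PaddleOCR on a single image. It assumes a PaddleOCR 2.x-style Python interface (constructor arguments and the result layout differ between versions), so the call names and result unpacking are assumptions rather than the implementation used in this disclosure.

```python
# Minimal sketch; assumes a PaddleOCR 2.x-style interface, which may differ by version.
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="ch")  # detection + recognition pipeline for Chinese

def recognize_text_lines(image_path):
    """Return (box, text, score) for every text line detected in the image."""
    result = ocr.ocr(image_path)       # stage 1: detect text lines; stage 2: recognize them
    lines = []
    for page in result:                # one entry per input image
        for box, (text, score) in page:
            lines.append((box, text, score))
    return lines
```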
n-gram: an algorithm based on a statistical language model can be used to predict the probability of the next word or the whole sentence. The n-gram model is a common probability model, i.e. assuming that a word or word occurrence is only related to the first n-1 words (n is a human given), the probability of the sentence as a whole is equal to the product of the probabilities of all word collocations. Commonly used are a binary language model 2-gram (Bi-gram) and a ternary language model 3-gram (Tri-gram), and the term probability calculation method uses the conditional probability in the probability theory.
As described above, in order to improve the accuracy of a speech recognition model, a large amount of training data with accurate labels is generally required. For example, in a project that needs to recognize dialect speech, initial audio data is crawled from the internet to obtain training data for training the ASR model; if the labels annotated for the initial audio data have low accuracy on accented keywords, it is difficult to screen out high-confidence audio data containing domain keywords, so the expanded training data set does little to improve the accuracy of the ASR model.
Therefore, the present disclosure provides a data processing method, by acquiring audio data of a video to be processed and an image recognition text obtained by performing subtitle text recognition on a corresponding video frame to be processed, performing voice recognition on the audio data of the video to be processed and performing forced alignment processing to obtain an aligned text, performing error correction processing on the aligned text to obtain an error corrected text, and screening the error corrected text with reference to the corresponding image recognition text to obtain training data for training a voice recognition model. According to the method, the text transcribed by the audio data is subjected to error correction, the corrected text is screened by referring to the image recognition text obtained by subtitle text recognition, the accuracy of labeling of the initial audio data is improved, more accurate training data can be obtained, and the training data set of the voice recognition model is expanded.
FIG. 1 illustrates an exemplary system architecture 10 in which the data processing methods or data processing apparatus of the present disclosure may be applied.
As shown in fig. 1, system architecture 10 may include a terminal device 102, a network 104, and a server 106. The terminal device 102 may be a variety of electronic devices having a display screen and supporting inputs, outputs, including but not limited to smartphones, tablets, laptop portable computers, desktop computers, wearable devices, virtual reality devices, smart homes, and the like. The network 104 is the medium used to provide communication links between the terminal devices 102 and the server 106. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others. The server 106 may be a server or cluster of servers, etc. that provide various services, such as a processing server for running a machine learning model, a database server for storing data, etc.
A user may interact with a server 106 via a network 104 using a terminal device 102 to receive or transmit data, etc. For example, the user downloads the video to be processed from the server 106 to the terminal device 102 via the network 104, and then processes the video by the processing software on the terminal device 102 to obtain a plurality of audio text pairs of the video to be processed. For another example, a user may run a speech recognition and forced alignment tool on the server 106 via the network 104, perform speech recognition on audio data of the video to be processed, and perform forced alignment processing. For another example, the user may run the language model on the server 106 through the network 104, perform error correction processing on the aligned text to obtain an error corrected text, and then screen the error corrected text on the terminal device 102 with reference to the corresponding image recognition text to obtain training data for training the speech recognition model.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
FIG. 2 is a flow chart illustrating a method of data processing according to an exemplary embodiment. The method shown in fig. 2 can be applied to a server side of the above system, or to a terminal device of the above system, for example.
Referring to fig. 2, a method 20 provided by an embodiment of the present disclosure may include the following steps.
In step S202, audio data of the video to be processed and the corresponding image recognition text are acquired, where the image recognition text is obtained by performing subtitle text recognition on the video frames to be processed.
In some embodiments, the video to be processed may be a published video with subtitles collected on the internet, for example, a video with subtitles crawled on the network by a crawler program according to project requirements. For example, the video can be an item for expanding a dialect data set, and the video with audio as dialects of different areas in China and the caption embedded in a picture can be crawled to be used as a dialect corpus.
In some embodiments, an OCR model may be used to detect subtitles of a video to be processed, and obtain audio data of the video to be processed including a plurality of audio text pairs and corresponding image recognition text, where the audio text pairs include the image recognition text obtained by performing subtitle text recognition on a video frame of a subtitle and audio data of a time interval corresponding to the subtitle.
For example, the crawled video data to be processed may be uploaded to a task queue of the PaddleOCR open source model to identify the subtitles in the video; the model can output the subtitle text and the audio timestamps of the corresponding text, and the audio of the video to be processed is then sliced using these timestamps to obtain audio text pairs and the corresponding image recognition text. The stronger the OCR text recognition capability of the PaddleOCR model, the more candidate OCR text-audio pairs can be obtained; and the larger the amount of raw data before the OCR text is screened, the more screened OCR text annotation data will be obtained. The specific embodiments can be seen with reference to fig. 3 and 4.
In step S204, voice recognition is performed on the audio data of the video to be processed, and forced alignment processing is performed, so as to obtain the aligned text.
In some embodiments, speech recognition and forced alignment of the audio data of the video to be processed may be performed by a speech recognition and forced alignment tool. The speech recognition and forced alignment tool may further include a speech recognition model and a forced alignment model: after the speech recognition model performs speech recognition on the audio data of the video to be processed, the forced alignment model may be used to perform forced alignment. For example, the speech recognition and forced alignment tool may be labelcheck main, which combines a WeNet speech recognition model with a CTC-based forced alignment model; the detailed description refers to fig. 5.
In step S206, the aligned text is subjected to correction processing, and corrected text is obtained.
In some embodiments, the aligned text may be subjected to error correction processing using a language model. The language model may be a model for calculating the probability that a group of word sequences forms a sentence; for example, a statistical language model such as an n-gram may be adopted, since the n-gram model has low training cost, good universality and is easy to deploy. Machine learning models such as Long Short-Term Memory (LSTM) networks or the FastCorrect model can also be used.
The n-gram model is a statistical probability model: it assumes that the occurrence of a word depends only on the preceding n-1 words (n is given manually), and the probability of a sentence as a whole is equal to the product of the probabilities of all word collocations. Commonly used are the 2-gram (Bi-gram) and the 3-gram (Tri-gram), and the term probabilities are calculated using conditional probability from probability theory. For a given Chinese string $C = c_1, c_2, \ldots, c_i$, the probability of a character in the Bi-gram model depends only on the word preceding it, and the probability of the string $C$ is expressed as in equation (1):

$$P_C = P(c_1 c_2 \cdots c_i) = P(c_1)\,P(c_2 \mid c_1)\cdots P(c_i \mid c_{i-1}) \qquad (1)$$

In the Tri-gram model, the occurrence of the current word depends on the two preceding words, and the probability of the string $C$ is calculated by equation (2):

$$P_C = P(c_1 c_2 \cdots c_i) = P(c_1)\,P(c_2 \mid c_1)\,P(c_3 \mid c_1 c_2)\cdots P(c_i \mid c_{i-2} c_{i-1}) \qquad (2)$$
The N-gram model may be used to evaluate whether a sentence is reasonable, and may evaluate whether a word is an erroneous word by calculating the N-gram score of the word.
In some embodiments, the language models may include a binary language model and a ternary language model, for example, bi-gram and Tri-gram may be described above, and the binary language model or the ternary language model may be selected for error correction according to the number of words after word segmentation of the aligned text, and the specific embodiment may refer to fig. 7 to 10.
In other embodiments, the aligned text may be scored by using a binary language model and a ternary language model, and the scores of the two models are combined to obtain the probability that the text is a sentence, and then the subsequent steps are performed.
In some embodiments, after training the language model through data in large-scale related fields, the language model is used for error correction processing, so that the accuracy of error correction text can be effectively improved. For example, if the method provided by embodiments of the present disclosure is used to train a speech recognition model for recognizing dialect speech, a large number of dialect materials may be used to train the language model. The dialect corpus collection channel can be derived from: purchased dialect annotation text, web page crawling disclosure dialect text, text derived from OCR recognition of a dialect video, and the like.
In step S208, the corrected text is filtered with reference to the corresponding image recognition text to obtain training data for training the speech recognition model.
In some embodiments, by calculating the confidence between the image recognition text and the error corrected text, text with confidence higher than a preset confidence threshold and the corresponding audio data can be selected and stored as training data with accurate labels, for example added into a supervised learning data set; the specific implementation can refer to fig. 11 and fig. 12.
According to the data processing method provided by the embodiment of the disclosure, the text transcribed by the audio data is corrected by the voice recognition and forced alignment tool, and the corrected text is screened by referring to the image recognition text obtained by the caption text recognition, so that the accuracy of labeling of the initial audio data is improved, more accurate training data can be obtained, the training data set of the voice recognition model is expanded, and the accuracy and generalization capability of the voice recognition model are improved.
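The four steps of the method 20 can be viewed as a small pipeline. The sketch below only outlines how steps S202 to S208 chain together; extract_audio_text_pairs, transcribe_and_align, correct_text and ocr_confidence are hypothetical helper names standing in for the components detailed in the following figures.

```python
def build_training_data(video_path, conf_threshold=0.9):
    """Outline of method 20: OCR pairs -> ASR + forced alignment -> correction -> screening."""
    supervised, semi_supervised = [], []
    for audio, ocr_text in extract_audio_text_pairs(video_path):   # S202 (hypothetical helper)
        aligned = transcribe_and_align(audio)                      # S204 (hypothetical helper)
        corrected = correct_text(aligned)                          # S206 (hypothetical helper)
        confidence = ocr_confidence(corrected, ocr_text)           # S208 screening metric
        if confidence > conf_threshold:
            supervised.append((audio, corrected))       # accurate annotation data
        else:
            semi_supervised.append((audio, corrected))  # pseudo-labeled data
    return supervised, semi_supervised
```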
Fig. 3 is a flow chart illustrating a method of obtaining audio text pairs according to an exemplary embodiment. The method shown in fig. 3 may be used, for example, to perform subtitle detection on the video to be processed in fig. 2, to obtain an audio text pair.
Referring to fig. 3, a method 30 provided by an embodiment of the present disclosure may include the following steps.
In step S302, a region in which text appears in a current video frame of a video to be processed is detected, and a text region of the current video frame is obtained.
In some embodiments, starting from the first frame of the video to be processed, the current frame may be acquired and the region in which text occurs detected.
In step S304, it is determined whether the text region of the current video frame is a subtitle region.
In some embodiments, the text appearing in a video frame may be subtitles, or other text in the picture, such as a television station logo, a watermark, text in the scene, and so on. The subtitle region may be preset as a fixed region of the video frame, for example at the bottom (or top, etc.) of the frame; for example, the bottom 30% (or 25%, 35%, etc.) of the video frame is taken as the subtitle region.
In step S3062, if it is determined that the text region of the current video frame is the preset subtitle region, continuous video frames are detected with the timestamp of the current video frame as the time starting point, and the time interval corresponding to the subtitle of the text region of the current video frame is obtained.
In some embodiments, the continuous video frames may be detected by detecting changes in the text region of the video frames, for example, sequentially detecting the preset subtitle region of each frame subsequent to the current video frame and checking whether its text content has changed; the frames before a change is detected are all continuous video frames of the current video frame, i.e., they correspond to the same caption (a sketch of this region check and interval detection is given after step S312 below).
In step S3064, if it is determined that the text region of the current video frame is not the subtitle region, it is determined whether the current video frame further includes other text regions. If the current video frame also includes other text regions, the process returns to step S304.
In step S3066, if the current video frame does not include other text regions, it is determined whether the current video frame is the last frame of the video to be processed; if not, step S3068 is performed; if so, step S312 is performed.
In step S3068, the next video frame is acquired, and the process returns to step S302.
In step S308, an optical character recognition process is performed on the text region of the current video frame, and an image recognition text of the subtitle of the text region of the current video frame is obtained.
In step S310, it is determined whether the end of the time interval corresponding to the subtitle of the text region of the current video frame is the end time of the video to be processed.
In step S312, the audio data of the video to be processed is segmented according to the time interval corresponding to each caption, so as to obtain a plurality of audio text pairs.
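The region check of step S304 and the interval detection of step S3062 can be sketched as follows (as noted in step S3062 above). The bottom-30% heuristic and the detect_caption_text helper, which returns the caption text found in the preset subtitle region of a frame, are assumptions for illustration.

```python
def in_subtitle_region(box, frame_height, ratio=0.30):
    """Check whether a detected text box (list of (x, y) points) lies in the preset
    caption region, assumed here to be the bottom 30% of the frame."""
    top = min(y for _, y in box)
    return top >= frame_height * (1 - ratio)

def caption_interval(frames, start_idx, detect_caption_text):
    """Scan consecutive frames from start_idx until the caption text changes,
    returning the frame-index interval [t, t+k] of the current caption and its text."""
    current = detect_caption_text(frames[start_idx])
    end_idx = start_idx
    for i in range(start_idx + 1, len(frames)):
        if detect_caption_text(frames[i]) != current:
            break                # subtitle switched: the previous frame closes the interval
        end_idx = i
    return start_idx, end_idx, current
```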
Fig. 4 is a schematic diagram of an OCR text recognition flow shown in accordance with fig. 3. The flow as shown in fig. 4 may include the following steps S402 to S410.
And step S402, text detection. And performing text detection on the current video frame, and detecting a text region appearing in the video picture.
Step S404, checking the subtitle position. It is judged whether the text region is the preset caption region; if so, the timestamp of the current video frame is taken as the start time node t of the caption.
Step S406, subtitle switching detection. And detecting continuous video frames in a preset caption area until the caption of the area changes, and obtaining an end time node t+k of the caption, wherein the corresponding timestamp of the caption text is [ t: t+k ].
And step S408, identifying the text. OCR recognition is carried out on the preset caption region, and the text is obtained.
Step S410, segmenting the long audio according to the timestamp [t: t+k] detected by subtitle switching to obtain a text-audio pair, namely candidate text and speech data for training the speech recognition model provided by the embodiments of the disclosure.
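Step S410's slicing of the long audio by the caption timestamps can be illustrated with plain waveform indexing; the sketch below assumes the audio has already been decoded into a mono sample array, and is not tied to any particular audio library.

```python
import numpy as np

def slice_audio_by_captions(waveform, sample_rate, caption_spans):
    """Cut a long mono waveform (numpy array of samples) into audio-text pairs.
    caption_spans: (start_sec, end_sec, ocr_text) tuples from subtitle-switching
    detection, i.e. the [t: t+k] timestamps described above."""
    pairs = []
    for start_sec, end_sec, text in caption_spans:
        start = int(start_sec * sample_rate)
        end = int(end_sec * sample_rate)
        segment = waveform[start:end]
        if segment.size > 0 and text.strip():   # discard empty segments or empty OCR text
            pairs.append((segment, text))
    return pairs
```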
According to the method provided by the embodiment of the disclosure, the caption text and the timestamp information are obtained according to the recognition of the OCR model, the original long audio is segmented and aligned according to the timestamp information, the initial data of the text audio pair is obtained, the video corpus without the externally hung caption can be effectively utilized, and therefore the training data set of the voice recognition model is expanded.
Fig. 5 shows a schematic diagram of the processing procedure of step S204 shown in fig. 2 in an embodiment. As shown in fig. 5, in the embodiment of the present disclosure, the step S204 may further include the following steps.
Step S502, inputting the audio data in the audio text pair into a speech recognition model in a speech recognition and forced alignment tool to obtain a corresponding speech recognition text.
In some embodiments, the speech recognition model may be a WeNet model for implementing ASR transcription. WeNet is an open source end-to-end ASR toolkit that provides a whole toolchain from training to deployment and makes industrial deployment of ASR services simpler.
Step S504, inputting the audio data in the audio text pair and the corresponding speech recognition text into a forced alignment model in a speech recognition and forced alignment tool to obtain aligned text of the audio text pair.
In some embodiments, the forced alignment model may utilize an end-to-end forced alignment method based on the connectionist temporal classification (Connectionist Temporal Classification, CTC) algorithm to construct a forced alignment graph for the candidate words obtained by segmenting the ASR transcribed text, and correct substitution, insertion, and deletion errors in the ASR transcribed text accordingly, so as to improve the accuracy of the transcribed text. Fig. 6 shows an example of a forced alignment graph of speech recognition text.
In other embodiments, the ASR transcribed text may also be forcibly aligned using the forced alignment model with reference to the OCR recognized text. For example, the OCR recognized text may be used to limit the delay of the paths in the CTC model loss computation, with a penalty imposed on paths with high delay, so that a reduction in the delay of the forced alignment model can be achieved.
Fig. 6 illustrates a forced alignment diagram of speech recognition text. The forced alignment model may include a CTC unit, where the CTC unit segments the text transcribed by the ASR model and constructs a forced alignment graph for each candidate word obtained by the segmentation. As shown in fig. 6, fig. 6 shows an alignment diagram of the candidate word "loved in the art", wherein the key features include: the correct alignment path (6008-6012-6014-6016-6018-6020-6022), the delete-character tag <del> with its corresponding penalty p1, and arbitrary insertions or substitutions.
As shown in fig. 6, the start tag 6004 and end tag 6006 of the alignment graph are connected to the global filler state 6002 via self-loop arcs with penalty p2, represented by the tag <gbg>. The alignment path includes three correction operations: deletion, insertion, and replacement. Deletion: there are two paths from state 0 to state 1; the path through the "love" character keeps the character unchanged, while the other path executes the <del> tag instruction and incurs a penalty of p1 after deleting the "love" character. Insertion: the <gbg> tag instruction is executed at the "sentry" and "Jing" characters, and a penalty of p2 is incurred after a new character is inserted. Replacement: instead of following the "industry" character path, the Filler fill state is selected and the <gbg> tag instruction is executed, indicating that the "industry" character is deleted and a new character replaces it.
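For reference, a minimal, self-contained sketch of standard CTC Viterbi forced alignment is given below: the target characters are blank-extended, a dynamic program is run over the acoustic frames, and the best path is backtracked. It illustrates the general forced-alignment technique only; it does not reproduce the specific <del>/<gbg> penalty arcs of the graph in fig. 6.

```python
import numpy as np

def ctc_forced_align(log_probs, tokens, blank=0):
    """Standard CTC Viterbi forced alignment (illustrative sketch).
    log_probs: [T, V] per-frame log posteriors from the acoustic model.
    tokens:    target token ids (the transcribed characters), without blanks.
    Returns one extended-sequence token (blank or character id) per frame."""
    T = log_probs.shape[0]
    ext = [blank]
    for tok in tokens:                  # blank-extended target: blank t1 blank t2 ... blank
        ext.extend([tok, blank])
    S = len(ext)
    NEG = -1e30
    dp = np.full((T, S), NEG)
    back = np.zeros((T, S), dtype=np.int64)   # 0: stay, 1: advance one state, 2: skip a blank
    dp[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        dp[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [dp[t - 1, s]]
            if s >= 1:
                cands.append(dp[t - 1, s - 1])
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(dp[t - 1, s - 2])
            best = int(np.argmax(cands))
            dp[t, s] = cands[best] + log_probs[t, ext[s]]
            back[t, s] = best
    # end in the final blank or the final token, whichever scores higher
    s = S - 1 if S < 2 or dp[T - 1, S - 1] >= dp[T - 1, S - 2] else S - 2
    path = []
    for t in range(T - 1, 0, -1):       # backtrack frame by frame
        path.append(ext[s])
        s -= back[t, s]
    path.append(ext[s])
    path.reverse()
    return path
```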
According to the method provided by the embodiment of the disclosure, after the WeNet voice recognition model is used for carrying out voice recognition on the audio data of the video to be processed, the end-to-end forced alignment model based on CTC is used for carrying out forced alignment, the WeNet voice recognition model and the forced alignment model are respectively trained, so that the accuracy of the aligned text obtained by the voice recognition and forced alignment tool can be comprehensively improved, and then the data quantity after screening is improved.
Fig. 7 is a schematic diagram showing the processing procedure of step S206 shown in fig. 2 in an embodiment. As shown in fig. 7, in the embodiment of the present disclosure, the above step S206 may further include the following steps S702 to S708.
Step S702, word segmentation processing is carried out on the aligned texts of the audio text pairs, and a word list of the aligned texts is obtained.
In some embodiments, if applied to a Chinese scene, the aligned text is segmented using Chinese word segmentation techniques to obtain the word list of the aligned text for the audio text pair. For example, if the aligned text is "today weather is really good", the word list obtained by word segmentation may include the three words "today", "weather" and "really good"; if the aligned text is "I don't know what to eat", the word list obtained by word segmentation may include the five words "I", "don't", "know", "eat" and "what".
The meaning of the word in the embodiment of the present disclosure is a language unit with semantic meaning, for example, may represent one to four chinese characters or may represent one word in english, and the present disclosure is illustrated by taking a chinese application scenario as an example and not limited thereto.
Step S704, judging whether the aligned text word of the audio text pair is greater than three words according to the word list.
Taking the word list obtained by word segmentation above as an example, which includes the three words "today", "weather" and "really good", the corresponding text is not more than three words after word segmentation. Taking the word list obtained by word segmentation above as another example, which includes the five words "I", "don't", "know", "eat" and "what", the corresponding text "I don't know what to eat" is more than three words after word segmentation.
Step S706, if the aligned text of the audio text pair is more than three words after word segmentation, scoring the aligned text corresponding to the audio text pair by using a ternary language model, and obtaining the probability that the aligned text corresponding to the audio text pair is a sentence.
In some embodiments, if the aligned text of the audio text pair is segmented into more than three words, the word list obtained by segmentation may be subjected to ternary segmentation processing to obtain a ternary pair list. And then inputting the content of the ternary pair list into a ternary language model to score, and obtaining the probability of the aligned text corresponding to the audio text pair as a sentence.
For example, taking the word list obtained by word segmentation above as an example, which includes the five words "I", "don't", "know", "eat" and "what", the text corresponding to the word list is more than three words after word segmentation, so ternary grouping processing is performed on the text "I don't know what to eat" to obtain the ternary pair list: [I don't know] [don't know eat] [know eat what].
Step S708, if the aligned text of the audio text pair is not more than three words after word segmentation, scoring the aligned text corresponding to the audio text pair by using the binary language model, and obtaining the probability that the aligned text corresponding to the audio text pair is a sentence.
In some embodiments, if the aligned text of the audio text pair is not greater than three words after word segmentation, the word list obtained by word segmentation may be subjected to binary grouping processing to obtain a binary pair list. And then inputting the content of the binary pair list into a binary language model to score, and obtaining the probability of the aligned text corresponding to the audio text pair as a sentence.
For example, the word list obtained by word segmentation above includes the three words "today", "weather" and "really good", so the text corresponding to the word list is not more than three words after word segmentation, and binary grouping is performed on the text to obtain the binary pair list: [today weather] [weather really good].
Step S710, performing error correction processing on the aligned texts corresponding to the audio text pairs based on the probabilities to obtain error corrected texts of the audio text pairs.
In some embodiments, text whose probability is below the preset probability threshold may be considered to require error correction. When the present disclosure is applied to Chinese, the text may be corrected using pinyin, and a specific embodiment may refer to fig. 8.
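Putting steps S702 to S710 together, the correction flow can be sketched as follows. This is a sketch under assumed helpers: segment performs Chinese word segmentation, candidate_words returns the pinyin-based candidate replacements of fig. 9, and the bigram/trigram objects follow the NGramModel interface sketched earlier.

```python
def correct_aligned_text(text, bigram, trigram, segment, candidate_words, threshold=0.6):
    """Sketch of the correction flow of figs. 7 and 8 under assumed helpers."""
    words = segment(text)
    model = trigram if len(words) > 3 else bigram    # more than three words: Tri-gram, else Bi-gram
    score = model.sentence_prob(words)
    if score > threshold:                            # already sentence-like, keep the text as-is
        return text
    best_words, best_score = words, score
    for i, word in enumerate(words):                 # try pinyin-similar replacements word by word
        for cand in candidate_words(word):
            trial = words[:i] + [cand] + words[i + 1:]
            trial_score = model.sentence_prob(trial)
            if trial_score > best_score:
                best_words, best_score = trial, trial_score
    return "".join(best_words)                       # Chinese text is re-joined without spaces
```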
According to the method provided by the embodiment of the disclosure, the correction is performed on the caption text obtained by the voice recognition and forced alignment tool by selecting the binary language model or the ternary language model according to the word number after word segmentation of the text, so that the correction accuracy of texts with different lengths can be effectively improved, and the universality of the method is improved.
Fig. 8 shows a schematic diagram of the processing procedure of step S710 shown in fig. 7 in an embodiment. As shown in fig. 8, in the embodiment of the present disclosure, the above step S710 may further include the following steps.
Step S802, judging whether the probability of the aligned text corresponding to the audio text pair being a sentence is larger than a preset probability threshold.
In step S804, if the probability is not greater than the preset probability threshold, candidate replacement words of each word in the word list are obtained, wherein the candidate replacement words are obtained according to pinyin of the corresponding word.
In some embodiments, if the probability is not greater than the preset probability threshold, the aligned text of the pair of audio text may be considered to require error correction. Words with the same or similar pinyin as each word of the text to be corrected can be obtained to serve as candidate replacement words, and scoring is carried out through a corresponding binary language model or a ternary language model after the candidate replacement words are replaced. An embodiment of obtaining candidate replacement words from pinyin for each word may be referred to in fig. 9.
Step S806, scoring the text in the word list replaced by the candidate replacement words by using the language model, and determining that the word list with the highest probability of being a sentence after replacement is the corrected text of the audio text pair.
Step S808, if the probability is greater than the preset probability threshold, the aligned text corresponding to the audio text pair is determined to be the corrected text. The "corrected text" in this step means the text that proceeds directly to the screening process of step S208 after the judgment of step S802 (regardless of whether any replacement was made).
According to the method provided by the embodiments of the disclosure, after speech recognition and forced alignment processing are performed on the Chinese audio text pairs, when the scoring of the corresponding language model indicates that the aligned text requires correction, candidate words with pinyin similar to the words of the text are obtained for replacement, and the replaced text with the highest score is selected as the corrected text, so that the accuracy of the corrected Chinese text can be effectively improved.
Fig. 9 is a schematic diagram showing the processing procedure of step S804 shown in fig. 8 in an embodiment. As shown in fig. 9, in the embodiment of the present disclosure, the above step S804 may further include the following steps.
Step S902, pinyin of each word in the word list is acquired.
Step S904, obtaining the similar sound with the pinyin similarity of each word in the word list being larger than a preset similarity threshold.
In some embodiments, the similarity of pinyin may be measured by using the edit distance; for example, pinyin with an edit distance of less than 2 may be regarded as having similarity greater than the preset similarity threshold, that is, pinyin whose edit distance from the pinyin of the binary word or the ternary word is 0 or 1 is taken as a similar sound.
Step S906, homophones of similar sounds of the words in the word list are obtained and used as candidate replacement words of the corresponding words.
In some embodiments, a plurality of candidate replacement words corresponding to the similar sounds may be obtained from the homophone table.
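The pinyin-based candidate generation of steps S902 to S906 can be sketched with a plain Levenshtein distance over pinyin strings; the pinyin lookup function and the homophone table (a mapping from a pinyin string to the words pronounced that way) are assumed inputs, not components defined by this disclosure.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (used here on pinyin)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def candidate_replacements(word, pinyin, homophone_table, max_dist=1):
    """Candidate replacement words for one word: collect the homophones of every
    pinyin that is within edit distance max_dist of the word's pinyin.
    pinyin(word) and homophone_table {pinyin: [words]} are assumed lookups."""
    target = pinyin(word)
    candidates = []
    for py, words in homophone_table.items():
        if edit_distance(target, py) <= max_dist:    # "similar sound"
            candidates.extend(w for w in words if w != word)
    return candidates
```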
Fig. 8 and 9 are described in their entirety below with an example. For example, the aligned text of an audio text pair is segmented into the three words "today", "Tianqi" (a mis-transcription of "weather") and "very good", and binary grouping yields the binary pair list: [today Tianqi] [Tianqi very good]. The Bi-gram model is used to score the sentence, and the probability of it being a sentence is p = 55% < 60% (the preset probability threshold). Then, similar sounds of the three words are obtained based on the edit distance, and candidate replacement words, including "weather", are selected from the homophone table. The words in the original sentence are replaced with the candidate words, for example replacing "Tianqi" to obtain "today the weather is very good"; the probability score of this replaced text being a sentence is calculated to be the highest, namely, the corrected text is "today the weather is very good".
Fig. 10 is a schematic flow chart of error correction of speech recognition text according to the embodiment shown in fig. 7 to 9. The error correction flow as shown in fig. 10 may include the following steps S1002 to S1016.
In step S1002, the aligned text (hereinafter referred to as ASR text) is obtained by performing speech recognition on the audio data of the video to be processed and performing forced alignment processing with the speech recognition and forced alignment tool. For the specific embodiment, reference is made to step S204.
Step S1004, segmentation is carried out on the given ASR text by utilizing a Chinese word segmentation technology, and a word list is obtained. If the method is applied to a Chinese dialect scene, a word segmentation model can be trained by using dialect corpus.
Step S1006, it is determined whether the number of words L in the word list is not less than 3.
Step S10082, if L is greater than or equal to 3, grouping the words in the word list into a triple pair list, and inputting each triple pair in the triple pair list into a Tri-gram model to calculate the probability S of being a sentence. The Tri-gram model can be obtained through training by using a ternary word list obtained through segmentation of the collected dialect corpus.
In step S10102, it is determined whether the probability S of the output of the Tri-gram is greater than a preset probability threshold S0.
Step S10084, if L is less than 3, grouping the words in the word list into a binary pair list, and inputting each binary pair in the binary pair list into the Bi-gram model to calculate the probability S of being a sentence. The Bi-gram model can be obtained through training by using a binary word list obtained through segmentation of the collected dialect corpus.
In step S10104, it is determined whether the probability S of the Bi-gram output is greater than the preset probability threshold S0.
In step S1012, if the score output by the Bi-gram or Tri-gram model is lower than the set threshold S0, the corresponding word is highly likely to be erroneous, and the word can be corrected according to word collocation and pinyin similarity; the specific embodiments can refer to fig. 8 and 9.
Step S1014, replacing the corresponding word in the original sentence with the candidate replacement word obtained by the pinyin similarity plus word collocation, recalculating the sentence score, and selecting the sentence with the highest score as the corrected text 10002 to be output.
According to the method provided by the embodiments of the disclosure, after speech recognition and forced alignment processing are performed on the Chinese audio text pairs, word segmentation is performed on the aligned text, a binary or ternary language model is selected according to the number of words after segmentation for scoring, and when it is judged that the aligned text requires correction, candidate words with pinyin similar to each word are obtained for replacement, the replaced text is scored again with the corresponding language model, and the replaced text with the highest score is selected as the corrected text, so that the accuracy of the corrected Chinese text can be effectively improved.
Fig. 11 is a schematic diagram showing the processing procedure of step S208 shown in fig. 2 in an embodiment. As shown in fig. 11, in the embodiment of the present disclosure, the above step S208 may further include the following steps.
In step S1102, a confidence level of the text after error correction compared with the corresponding image recognition text is obtained.
In some embodiments, confidence may be measured using the edit distance between the error corrected text and the corresponding image recognition text. For example, let $text_{label}$ denote the corrected text, $text_{ocr}$ denote the image recognition text obtained by OCR model recognition, and $\mathrm{conf}_l$ denote the confidence of the corrected text compared with the corresponding image recognition text; the confidence may be calculated using equation (3):

$$\mathrm{conf}_l = 1 - \frac{\mathrm{distance}(text_{label}, text_{ocr})}{\max\left(\mathrm{len}(text_{label}), \mathrm{len}(text_{ocr})\right)} \qquad (3)$$

where $\mathrm{distance}(text_{label}, text_{ocr})$ denotes the edit distance between $text_{label}$ and $text_{ocr}$, and $\mathrm{len}(text_{label})$ and $\mathrm{len}(text_{ocr})$ denote the lengths (e.g., numbers of characters) of $text_{label}$ and $text_{ocr}$, respectively.
Step S1104, determining whether the confidence level of the corrected text is greater than a preset confidence threshold.
Step S1106, adding the corrected text with the confidence coefficient greater than the preset confidence coefficient threshold and the corresponding audio data to the supervised learning data set.
In some embodiments, a confidence threshold of 90%, or 85%, or 95%, etc. may be set, and text and corresponding audio with a calculated confidence above the threshold may be stored as accurate annotation data, added to the supervised learning dataset.
In step S1108, the corrected text with the confidence level not greater than the preset confidence level threshold and the corresponding audio data are added to the semi-supervised learning data set.
In some embodiments, the audio corresponding to the text with the confidence below the preset confidence threshold may also be used as unlabeled data or pseudo tag data, wherein the unlabeled data may be added to an unsupervised data set and the pseudo tag data may be added to a semi-supervised learning data set.
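Steps S1102 to S1108 amount to computing the edit-distance confidence and routing each pair into one of the two data sets. The sketch below reuses the edit_distance function from the earlier sketch and normalizes by the longer text length, matching the form assumed for equation (3).

```python
def ocr_confidence(corrected_text, ocr_text):
    """Edit-distance-based confidence of the corrected text against the OCR text."""
    if not corrected_text and not ocr_text:
        return 1.0
    dist = edit_distance(corrected_text, ocr_text)   # from the earlier sketch
    return 1.0 - dist / max(len(corrected_text), len(ocr_text))

def route_sample(audio, corrected_text, ocr_text, supervised, semi_supervised, threshold=0.9):
    """Add the pair to the supervised set when confidence exceeds the threshold,
    otherwise keep it as pseudo-labeled data for the semi-supervised set."""
    if ocr_confidence(corrected_text, ocr_text) > threshold:
        supervised.append((audio, corrected_text))
    else:
        semi_supervised.append((audio, corrected_text))
```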
The data that is not screened out in the related art is only used for pre-training, and the larger amount of unscreened pseudo-labeled data is not fully utilized. According to the method provided by the embodiments of the disclosure, the corrected text is screened with reference to the OCR recognition text, the text with confidence higher than the preset confidence threshold and the corresponding audio are selected as accurate annotation data, and the text with confidence not higher than the preset confidence threshold and the corresponding audio are used as pseudo annotation data for semi-supervised learning training, so that the data usage modes are expanded and the diversity of use of the dialect data is increased.
Fig. 12 is a schematic diagram of a training data screening flow with a language model introduced, according to the embodiments shown in fig. 2 to 11. In fig. 12, the case where the speech in the application scene is a Chinese dialect is described. The flow shown in fig. 12 may include the following steps S1201 to S1212.
Step S1201 trains an n-gram model 12003 using the dialect corpus 12001. Dialect corpus 12001 is obtained by collecting dialect texts of different domestic areas disclosed on the internet.
Step S1202, the video subtitle text is detected and recognized by using the OCR model, the text and timestamp information are obtained from the OCR model recognition, and the original long audio is cut and aligned to preliminarily obtain audio text pairs and the corresponding OCR recognition text. Videos with subtitles 12002 containing accented or dialect speech, published on the network, may be collected, and the crawled data is then uploaded into the task queue of the OCR model that recognizes video subtitles; for example, the OCR model may use the PaddleOCR open source model.
In step S1203, ASR transcription is performed on the OCR text-audio pairs using the speech recognition and forced alignment tool (e.g., labelcheck main) 12005 to obtain ASR transcribed text. The OCR text can also be aggregated with the collected dialect corpus to train the n-gram model 12003.
In step S1204, semantic errors of the ASR transcribed text are corrected by using the trained n-gram model 12003, so as to obtain an error corrected text 12004, and the quality of the transcribed text is improved, so that the screened and accurately labeled data amount is improved.
In step S1206, it is determined whether the confidence of the corrected text 12004 compared with the OCR text is greater than a confidence threshold (for example, 90%). By measuring the confidence of each label when making the dataset and filtering out text with low confidence, high-quality audio text pairs can be obtained.
In step S1208, when the confidence level of the corrected text 12004 compared to the OCR text is greater than the confidence threshold, the corrected text 12004 and the corresponding audio (OCR labeling data) 12006 are produced as the supervised learning dataset 12007, and the supervised learning dataset 12007 can be used to optimally train the ASR model in the speech recognition and forced alignment tool 12005.
In step S1209, when the confidence level of the corrected text 12004 compared with the OCR text is not greater than the confidence level threshold, the corrected text 12004 and the corresponding audio (pseudo labeling data) 12008 are created as the semi-supervised learning data set 12009.
In step S1210, before optimizing and training the ASR model, it is evaluated whether an optimization condition is satisfied, for example, by determining whether the data increase rate of the supervised learning data set 12007 is greater than a preset increase rate threshold (which may be, for example, 10%, 15%, or 20%). If not, model optimization is not performed (the process stops), so as to avoid over-fitting when optimizing the ASR model due to too little data in the collected supervised learning data set. The data increase rate I_l can be calculated using formula (4):

I_l = I_a / I_o    (4)

where I_o indicates the total accumulated amount of collected supervised learning data, in hours, and I_a is the amount of data collected for the a-th time, in hours. For example, the increase rate of the data collected for the 1st time is 100%.
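For illustration, the gating check of step S1210 could be written as follows, assuming formula (4) is the ratio of the newly collected hours to the accumulated total; the function names are hypothetical.

```python
def increase_rate(collected_hours: float, accumulated_hours: float) -> float:
    """I_l = I_a / I_o, expressed as a fraction (1.0 corresponds to 100%)."""
    return collected_hours / accumulated_hours if accumulated_hours else 0.0

def should_optimize(collected_hours: float, accumulated_hours: float, threshold: float = 0.10) -> bool:
    """Only optimize the ASR model when the increase rate exceeds the preset threshold."""
    return increase_rate(collected_hours, accumulated_hours) > threshold
```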
In step S1212, if the data increase rate of the supervised learning data set 12007 is greater than the preset increase rate threshold, the supervised learning data set 12007 is used as a training set to optimize the ASR model. The optimized ASR model can further improve the quality of the transcribed ASR text.
The label_check_main alignment tool is limited by the accuracy and generalization capability of its ASR model: in domains the model has not seen, it is difficult to transcribe high-quality text, and the confidence of the transcribed text compared with the OCR text is low, so the amount of data screened by using label_check_main alone does not increase much and more accented audio cannot be screened out. The embodiments of the present disclosure further use an n-gram text error correction model trained on large-scale dialect corpus data to learn the sentence dependency relationships of the dialect corpus, detect erroneous words in the ASR dialect text, and correct them by combining pinyin similarity with word matching, which can effectively increase the amount of screened data; meanwhile, the unselected pseudo labeling data can be used for semi-supervised learning training, which adds another way of using the data.
Fig. 13 is a block diagram of a data processing apparatus according to an exemplary embodiment. The apparatus shown in fig. 13 may be applied to, for example, a server side of the above system or a terminal device of the above system.
Referring to fig. 13, an apparatus 130 provided by an embodiment of the present disclosure may include an acquisition module 1302, a speech transcription module 1304, a text error correction module 1306, and a processing module 1308.
The obtaining module 1302 may be configured to obtain audio data of a video to be processed and a corresponding image recognition text, where the image recognition text is obtained by performing subtitle text recognition on a video frame to be processed.
The voice transcription module 1304 may be configured to perform voice recognition on audio data of the video to be processed and perform forced alignment processing to obtain aligned text.
The text error correction module 1306 may be configured to perform error correction processing on the aligned text to obtain an error corrected text.
The processing module 1308 may be configured to filter the error corrected text with reference to the corresponding image recognition text to obtain training data for training the speech recognition model.
Fig. 14 is a block diagram of another data processing apparatus according to an exemplary embodiment. The apparatus shown in fig. 14 may be applied to, for example, a server side of the above system or a terminal device of the above system.
Referring to fig. 14, an apparatus 140 provided by an embodiment of the present disclosure may include an OCR recognition module 1410, an acquisition module 1402, a speech transcription module 1404, a text correction module 1406, and a processing module 1408.
The OCR recognition module 1410 may be configured to detect an area in which text appears in a current video frame of a video to be processed, and obtain a text area of the current video frame; judging whether a text area of a current video frame is a preset subtitle area or not; if the text region of the current video frame is determined to be the preset caption region, detecting continuous video frames by taking the timestamp of the current video frame as a time starting point, and obtaining a time interval corresponding to the caption of the text region of the current video frame; performing optical character recognition processing on the text region of the current video frame to obtain an image recognition text of the subtitle of the text region of the current video frame; and cutting the audio data of the video to be processed according to the time interval corresponding to each caption so as to obtain a plurality of audio text pairs.
The acquiring module 1402 may be configured to acquire audio data of a video to be processed and a corresponding image recognition text, where the image recognition text is obtained by performing subtitle text recognition on a video frame to be processed. The audio data of the video to be processed and the corresponding image recognition text may include a plurality of audio text pairs, where the audio text pairs include the image recognition text obtained by performing text recognition on the caption of the video frame of a caption and the audio data of the time interval corresponding to the caption.
The voice transcription module 1404 may be configured to obtain aligned text by performing voice recognition and forced alignment processing on the audio data of the video to be processed using the voice recognition and forced alignment tool. The speech recognition and forced alignment tool may include a speech recognition model and a forced alignment model.
The voice transcription module 1404 is further configured to input the audio data in the audio text pair into a voice recognition model to obtain a corresponding voice recognition text; inputting the audio data in the audio text pair and the corresponding voice recognition text into a forced alignment model to obtain aligned text of the audio text pair.
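Illustrative only: the disclosure treats the speech recognition model and the forced alignment model as components of a tool such as label_check_main, and the recognize and align interfaces below are hypothetical placeholders for whichever models are actually used.

```python
def transcribe_and_align(audio_path, asr_model, alignment_model):
    """Two-stage pipeline: recognize speech, then force-align the text to the audio."""
    asr_text = asr_model.recognize(audio_path)             # speech recognition text
    aligned = alignment_model.align(audio_path, asr_text)  # text with word/character timestamps
    return aligned
```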
The text correction module 1406 may be configured to perform correction processing on the aligned text using the language model to obtain corrected text.
The language model may include a binary language model and a ternary language model.
The text correction module 1406 may be further configured to perform word segmentation on the aligned text of the audio text pair to obtain a word list of the aligned text; judging whether the aligned text of the audio text pair is more than three words after word segmentation according to the word list; if the aligned text of the audio text pair is greater than three words after word segmentation, scoring the aligned text corresponding to the audio text pair by using a ternary language model to obtain the probability that the aligned text corresponding to the audio text pair is a sentence; and carrying out error correction processing on the aligned texts corresponding to the audio text pairs based on the probability to obtain error corrected texts of the audio text pairs.
The text correction module 1406 may be further configured to score the aligned text corresponding to the audio text pair by using the binary language model if the aligned text of the audio text pair is not more than three words after word segmentation, so as to obtain a probability that the aligned text corresponding to the audio text pair is a sentence; and carrying out error correction processing on the aligned texts corresponding to the audio text pairs based on the probability to obtain error corrected texts of the audio text pairs.
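A hedged sketch of the two branches described above: segment the aligned text, then score it with a ternary (trigram) model when it has more than three words and with a binary (bigram) model otherwise; jieba segmentation, kenlm models, and the file paths are assumptions.

```python
import jieba
import kenlm

def sentence_score(aligned_text: str, bigram_lm: kenlm.Model, trigram_lm: kenlm.Model) -> float:
    """Return the sentence log10 probability from the appropriate n-gram model."""
    words = [w for w in jieba.cut(aligned_text) if w.strip()]
    lm = trigram_lm if len(words) > 3 else bigram_lm
    return lm.score(" ".join(words), bos=True, eos=True)

# Example usage (paths assumed):
# score = sentence_score(text, kenlm.Model("dialect_bigram.arpa"), kenlm.Model("dialect_trigram.arpa"))
```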
The text error correction module 1406 may be further configured to obtain candidate replacement words of each word in the word list if the probability is not greater than a preset probability threshold, where the candidate replacement words are obtained according to pinyin of the corresponding word; and scoring texts in the word list replaced by the candidate replacement words by using the language model, and determining that the word list with the highest probability of being a sentence after replacement is the corrected text of the audio text pair.
The text correction module 1406 may also be used to obtain pinyin for each word in the word list; obtaining a similar sound with the similarity of the pinyin of each word in the word list being greater than a preset similarity threshold; homophones of similar sounds of the words in the word list are obtained and used as candidate replacement words of the corresponding words.
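One way the pinyin-based candidate generation just described might look, assuming a prebuilt pinyin-to-characters dictionary (homophone_dict) is available; pypinyin supplies the pinyin of each word, and the similarity metric and threshold are illustrative.

```python
from difflib import SequenceMatcher
from pypinyin import lazy_pinyin

def candidate_words(word, homophone_dict, sim_threshold=0.8):
    """Return characters whose pinyin is similar to the pinyin of the given word."""
    word_py = " ".join(lazy_pinyin(word))
    candidates = []
    for py, chars in homophone_dict.items():   # e.g. {"shi": ["是", "事", ...]}, assumed prebuilt
        if SequenceMatcher(None, word_py, py).ratio() > sim_threshold:
            candidates.extend(chars)
    return candidates
```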
The processing module 1408 may be configured to filter the error corrected text with reference to the corresponding image recognition text to obtain training data for training the speech recognition model.
The training data may include a supervised learning data set and a semi-supervised learning data set.
The processing module 1408 may be configured to obtain a confidence level of the error corrected text as compared to the corresponding image recognition text; adding the corrected text with the confidence coefficient larger than the preset confidence coefficient threshold value and corresponding audio data into a supervised learning data set; and adding the corrected text with the confidence coefficient not larger than the preset confidence coefficient threshold value and corresponding audio data into the semi-supervised learning data set.
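Putting the screening rule together, a brief sketch of the split into the two data sets; the 0.9 threshold and the pair structure are illustrative assumptions.

```python
def split_datasets(pairs, threshold=0.9):
    """Pairs above the confidence threshold become supervised data; the rest become pseudo-labeled data."""
    supervised, semi_supervised = [], []
    for p in pairs:  # p: {"audio": ..., "text": ..., "conf": ...}
        (supervised if p["conf"] > threshold else semi_supervised).append(p)
    return supervised, semi_supervised
```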
Specific implementation of each module in the apparatus provided in the embodiments of the present disclosure may refer to the content in the foregoing method, which is not described herein again.
Fig. 15 shows a schematic structural diagram of an electronic device in an embodiment of the disclosure. It should be noted that the apparatus shown in fig. 15 is only an example of a computer system, and should not impose any limitation on the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 15, the apparatus 1500 includes a Central Processing Unit (CPU) 1501, which can execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1502 or a program loaded from a storage section 1508 into a Random Access Memory (RAM) 1503. In the RAM 1503, various programs and data required for the operation of the device 1500 are also stored. The CPU 1501, ROM 1502, and RAM 1503 are connected to each other through a bus 1504. An input/output (I/O) interface 1505 is also connected to bus 1504.
The following components are connected to the I/O interface 1505: an input section 1506 including a keyboard, a mouse, and the like; an output section 1507 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), a speaker, and the like; a storage section 1508 including a hard disk and the like; and a communication section 1509 including a network interface card such as a LAN card or a modem. The communication section 1509 performs communication processing via a network such as the internet. A drive 1510 is also connected to the I/O interface 1505 as needed. A removable medium 1511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is installed on the drive 1510 as needed, so that a computer program read therefrom can be installed into the storage section 1508.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1509, and/or installed from the removable medium 1511. The above-described functions defined in the system of the present disclosure are performed when the computer program is executed by a Central Processing Unit (CPU) 1501.
It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or hardware. The described modules may also be provided in a processor, for example: a processor includes an acquisition module, a speech transcription module, a text error correction module, and a processing module. In some cases, the names of these modules do not limit the modules themselves; for example, the acquisition module may also be described as "a module that acquires initial data from a connected server side".
As another aspect, the present disclosure also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments, or may exist alone without being assembled into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to:
acquiring audio data of a video to be processed and a corresponding image recognition text, wherein the image recognition text is obtained by performing subtitle text recognition on a video frame to be processed; performing voice recognition on audio data of the video to be processed and performing forced alignment processing to obtain aligned texts; performing error correction processing on the aligned text to obtain an error corrected text; the corrected text is filtered with reference to the corresponding image recognition text to obtain training data for training the speech recognition model.
Exemplary embodiments of the present disclosure are specifically illustrated and described above. It is to be understood that this disclosure is not limited to the particular arrangements, instrumentalities and methods of implementation described herein; on the contrary, the disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (11)

1. A method of data processing, comprising:
acquiring audio data of a video to be processed and a corresponding image recognition text, wherein the image recognition text is obtained by performing subtitle text recognition on a video frame to be processed;
performing voice recognition and forced alignment processing on the audio data of the video to be processed to obtain aligned text;
performing error correction processing on the aligned text to obtain an error corrected text;
and screening the corrected text by referring to the corresponding image recognition text to obtain training data for training a voice recognition model.
2. The method according to claim 1, wherein the audio data of the video to be processed and the corresponding image recognition text include a plurality of audio text pairs, and the audio text pairs include the image recognition text obtained by performing caption text recognition on the video frame of a caption and the audio data of the time interval corresponding to the caption;
performing error correction processing on the aligned text to obtain error corrected text, including:
word segmentation processing is carried out on the aligned texts of the audio text pairs, and a word list of the aligned texts is obtained;
judging whether the aligned text of the audio text pair is greater than three words after word segmentation according to the word list;
if the aligned text of the audio text pair is more than three words after word segmentation, scoring the aligned text corresponding to the audio text pair by using a ternary language model to obtain the probability that the aligned text corresponding to the audio text pair is a sentence;
and carrying out error correction processing on the aligned texts corresponding to the audio text pairs based on the probability to obtain error corrected texts of the audio text pairs.
3. The method of claim 2, wherein performing error correction processing on the aligned text to obtain error corrected text, further comprises:
if the aligned text of the audio text pair is not more than three words after word segmentation, scoring the aligned text corresponding to the audio text pair by using a binary language model to obtain the probability that the aligned text corresponding to the audio text pair is a sentence;
and carrying out error correction processing on the aligned texts corresponding to the audio text pairs based on the probability to obtain error corrected texts of the audio text pairs.
4. A method according to claim 2 or 3, wherein performing error correction processing on the aligned text corresponding to the audio text pair based on the probability to obtain error corrected text for the audio text pair, comprises:
if the probability is not greater than a preset probability threshold, candidate replacement words of each word in the word list are obtained, wherein the candidate replacement words are obtained according to pinyin of the corresponding word;
and scoring the text in the word list replaced by the candidate replacement words by using a language model, and determining the word list with the highest probability of being a sentence after replacement as the corrected text of the audio text pair.
5. The method of claim 4, wherein obtaining candidate replacement words for each word in the word list comprises:
the pinyin of each word in the word list is obtained;
obtaining similar sounds, the similarity of which to the pinyin of each word in the word list is greater than a preset similarity threshold value;
homophones of similar sounds of the words in the word list are obtained and used as candidate replacement words of the corresponding words.
6. A method according to any one of claims 1 to 3, wherein the training data comprises a supervised learning data set and a semi-supervised learning data set;
screening the corrected text with reference to the corresponding image recognition text to obtain training data for training the speech recognition model, including:
obtaining the confidence coefficient of the text after error correction compared with the corresponding image recognition text;
adding the corrected text with the confidence coefficient larger than a preset confidence coefficient threshold value and corresponding audio data into the supervised learning data set;
and adding the corrected text with the confidence coefficient not larger than a preset confidence coefficient threshold value and corresponding audio data into the semi-supervised learning data set.
7. A method according to claim 2 or 3, further comprising:
detecting a region where a text appears in a current video frame of the video to be processed, and obtaining a text region of the current video frame;
judging whether the text area of the current video frame is a preset subtitle area or not;
if the text region of the current video frame is determined to be the preset caption region, detecting continuous video frames by taking the timestamp of the current video frame as a time starting point, and obtaining a time interval corresponding to the caption of the text region of the current video frame;
performing optical character recognition processing on the text region of the current video frame to obtain an image recognition text of the subtitle of the text region of the current video frame;
and dividing the audio data of the video to be processed according to the time interval corresponding to each caption so as to obtain a plurality of audio text pairs.
8. A method according to claim 2 or 3, characterized in that,
performing voice recognition and forced alignment processing on the audio data of the video to be processed to obtain aligned text, wherein the method comprises the following steps:
inputting the audio data in the audio text pair into the voice recognition model in a voice recognition and forced alignment tool to obtain a corresponding voice recognition text;
inputting the audio data in the audio text pair and the corresponding voice recognition text into a forced alignment model in the voice recognition and forced alignment tool to obtain aligned text of the audio text pair.
9. A data processing apparatus, comprising:
the acquisition module is used for acquiring audio data of the video to be processed and corresponding image recognition texts, wherein the image recognition texts are obtained by performing subtitle text recognition on the video frames to be processed;
the voice transcription module is used for carrying out voice recognition on the audio data of the video to be processed and carrying out forced alignment processing to obtain aligned texts;
the text error correction module is used for carrying out error correction processing on the aligned text to obtain an error corrected text;
and the processing module is used for filtering the corrected text by referring to the corresponding image recognition text so as to obtain training data for training the voice recognition model.
10. An electronic device, comprising: memory, a processor and executable instructions stored in the memory and executable in the processor, wherein the processor implements the method of any of claims 1-8 when executing the executable instructions.
11. A computer readable storage medium having stored thereon computer executable instructions which when executed by a processor implement the method of any of claims 1-8.