CN113705300A - Method, device and equipment for acquiring phonetic-to-text training corpus and storage medium

Method, device and equipment for acquiring phonetic-to-text training corpus and storage medium

Info

Publication number
CN113705300A
Authority
CN
China
Prior art keywords
video
subtitle
target video
character
target
Prior art date
Legal status
Pending
Application number
CN202110282332.3A
Other languages
Chinese (zh)
Inventor
王书培
刘攀
邓理英
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110282332.3A priority Critical patent/CN113705300A/en
Publication of CN113705300A publication Critical patent/CN113705300A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06F ELECTRIC DIGITAL DATA PROCESSING → G06F40/00 Handling natural language data
        • G06F40/10 Text processing → G06F40/12 Use of codes for handling textual entities → G06F40/151 Transformation
        • G06F40/10 Text processing → G06F40/194 Calculation of difference between files
        • G06F40/20 Natural language analysis → G06F40/258 Heading extraction; Automatic titling; Numbering
        • G06F40/20 Natural language analysis → G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a method, a device, equipment and a storage medium for acquiring a phonetic-to-text training corpus, wherein the method comprises the following steps: acquiring a plurality of target video key frames of a target video, and determining character positions and character contents from each target video key frame; determining a subtitle recognition interval of the target video according to the character position and the character content in each target video key frame, wherein different target video key frames show different character contents at the same position in the subtitle recognition interval; and recognizing the subtitle of the target video according to the subtitle recognition interval to obtain a subtitle to be processed, performing character processing on the subtitle to be processed according to a preset corpus acquisition rule to obtain a target subtitle of the target video, and generating a phonetic-to-text training corpus for video voice recognition according to the target video and the target subtitle. By adopting the embodiment of the application, the extraction efficiency of video subtitles and the convenience of acquiring the phonetic-to-text training corpus can be improved; the operation is simple and the applicability is high.

Description

Method, device and equipment for acquiring phonetic-to-text training corpus and storage medium
Technical Field
The present application relates to the field of computer software technologies, and in particular, to a method, an apparatus, a device, and a storage medium for obtaining a phonetic-to-text corpus.
Background
With the development of computer software technology, video resources on the Internet have increased greatly, and subtitle extraction from video is needed in many scenarios. For example, in the training of a voice-to-text model, subtitles in a video need to be extracted in order to obtain training corpora. The inventor of the present application found, in research and practice, that the subtitle extraction technology in the prior art requires manually labeling a subtitle interval in the video and then performing character recognition within that interval, for example recognizing the characters in the manually framed subtitle interval by Optical Character Recognition (OCR); this consumes considerable labor and leads to low subtitle recognition efficiency. In addition, subtitles recognized by OCR and similar methods are not further processed in the prior art, so the extracted subtitles are coarse and not suitable as training corpora for a voice-to-text model.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for acquiring an audio-to-text training corpus, which can improve the extraction efficiency of video subtitles and improve the acquisition convenience of the audio-to-text training corpus, and is simple to operate and high in applicability.
In a first aspect, an embodiment of the present application provides a method for obtaining a phonetic-to-text corpus, where the method includes:
acquiring a plurality of target video key frames of a target video, and determining character positions and character contents from each target video key frame;
determining a subtitle recognition interval of the target video according to the character position and the character content in each target video key frame, wherein the character content of different target video key frames corresponding to the same position in the subtitle recognition interval is different;
identifying the subtitle of the target video according to the subtitle identification interval so as to obtain the subtitle to be processed from the target video, and performing character processing on the subtitle to be processed according to a preset corpus obtaining rule so as to obtain the target subtitle of the target video;
and generating an audio-to-text training corpus for video voice recognition according to the target video and the target subtitle.
With reference to the first aspect, in one possible implementation, the acquiring a plurality of target video key frames of a target video includes:
acquiring a video to be processed, and determining a video key frame of the video to be processed and the frame number of the video key frame;
when the frame number of the video key frame is larger than or equal to the frame number threshold value, determining the video to be processed as a target video, and determining a plurality of video key frames of the video to be processed as a plurality of target video key frames.
With reference to the first aspect, in one possible implementation, the acquiring a plurality of target video key frames of a target video includes:
acquiring a video to be processed, and determining a video key frame of the video to be processed and the Chinese character occurrence rate of the video key frame;
when the Chinese character occurrence rate of any video key frame in the video to be processed is greater than or equal to the occurrence rate threshold value, determining the video to be processed as a target video, and determining a plurality of video key frames of the video to be processed as a plurality of target video key frames.
With reference to the first aspect, in a possible implementation manner, determining a subtitle recognition interval of a target video according to a text position and text content in each target video key frame includes:
determining at least one character position from the character positions of each target video key frame as at least one character recognition interval to be selected, wherein the repeated occurrence frequency of the character content of the character recognition interval to be selected is less than a frequency threshold value;
and determining the jitter degree of each character recognition interval to be selected, and determining the character recognition interval to be selected with the jitter degree smaller than or equal to the jitter degree threshold value as the subtitle recognition interval of the target video.
With reference to the first aspect, in one possible implementation, the method further includes:
determining text similarity of text contents appearing in the text positions of the key frames of each target video;
and when the text similarity of the text contents appearing in any two times in any text position is greater than a threshold value, determining the text contents appearing in any two times as the repeatedly appearing text contents, and determining that any text position is not used as a text identification interval to be selected.
With reference to the first aspect, in a possible implementation manner, performing character processing on a subtitle to be processed according to a preset corpus obtaining rule includes:
dividing the subtitle to be processed into a plurality of subtitle clauses according to a preset time interval, removing the duplication of each subtitle clause, and combining the subtitle clauses with the character length smaller than a character length threshold value in the duplicated subtitle clauses to obtain a combined subtitle to be processed;
and performing sentence-by-sentence screening on the subtitles based on the characters in the combined subtitles to be processed to determine the target subtitles of the target video.
With reference to the first aspect, in a possible implementation manner, performing caption clause screening based on characters in the merged to-be-processed caption includes:
and eliminating the caption clauses containing the uncommon characters in the merged captions to be processed to screen out the caption clauses not containing the uncommon characters, wherein the uncommon characters at least comprise one of letters, numbers and radicals of the uncommon components.
In the embodiment of the application, by acquiring a plurality of target video key frames of a target video, further, determining the character position and the character content from each target video key frame, the subtitle identification interval of the target video can be determined according to the character position and the character content in each target video key frame. It can be understood that different target video key frames correspond to different text contents at the same position in the subtitle identification interval. And performing character recognition on the subtitle of the target video according to the subtitle recognition interval, and obtaining the subtitle to be processed from the target video. And performing post-processing such as dividing, de-duplicating and merging on the to-be-processed subtitles according to the corpus acquisition rule, further removing the subtitles containing the uncommon characters in the merged to-be-processed subtitles in a sentence-by-sentence manner to obtain target subtitles, so that the word error rate of the target subtitles is reduced to be within a standard range, and further generating a phonetic transcription text training corpus for video voice recognition according to the target videos and the target subtitles. Therefore, automatic video acquisition can be realized, the subtitles of the video are extracted and screened, and then the training corpus suitable for the voice-to-text is obtained, the acquisition efficiency of the voice-to-text training corpus is improved, and meanwhile, the corpus quality of the voice-to-text training corpus is improved.
In a second aspect, an embodiment of the present application provides an apparatus for obtaining a phonetic-to-text corpus, where the apparatus includes:
the video acquisition module is used for acquiring a plurality of target video key frames of a target video and determining character positions and character contents from each target video key frame;
the interval division module is used for determining a subtitle recognition interval of the target video according to the character position and the character content in each target video key frame, wherein the character content of different target video key frames corresponding to the same position in the subtitle recognition interval is different;
the subtitle extraction module is used for identifying the subtitles of the target video according to the subtitle identification interval so as to obtain the subtitles to be processed from the target video, and performing character processing on the subtitles to be processed according to a preset corpus acquisition rule so as to obtain the target subtitles of the target video;
and the corpus generating module is used for generating an audio-to-text training corpus for video voice recognition according to the target video and the target subtitle.
With reference to the second aspect, in a possible implementation manner, the video obtaining module includes:
the device comprises a frame number determining unit and a processing unit, wherein the frame number determining unit is used for acquiring a video to be processed, determining a video key frame of the video to be processed and the frame number of the video key frame, determining the video to be processed as a target video when the frame number of the video key frame is greater than or equal to a frame number threshold value, and determining a plurality of video key frames of the video to be processed as a plurality of target video key frames.
With reference to the second aspect, in a possible implementation manner, the video obtaining module includes:
the Chinese character determining unit is used for acquiring a video to be processed, determining a video key frame of the video to be processed and the Chinese character occurrence rate of the video key frame, determining the video to be processed as a target video when the Chinese character occurrence rate of any video key frame in the video to be processed is greater than or equal to an occurrence rate threshold value, and determining a plurality of video key frames of the video to be processed as a plurality of target video key frames.
With reference to the second aspect, in a possible implementation manner, the interval dividing module includes:
the interval duplication removing unit is used for determining at least one character position from the character positions of each target video key frame as at least one character recognition interval to be selected, and the repeated occurrence frequency of the character content of the character recognition interval to be selected is smaller than a frequency threshold value;
and the interval determining unit is used for determining the jitter degree of each character recognition interval to be selected, and determining the character recognition interval to be selected with the jitter degree smaller than or equal to the jitter degree threshold value as the subtitle recognition interval of the target video.
With reference to the second aspect, in a possible implementation manner, the interval dividing module further includes:
and the character recognition module is used for determining the text similarity of the character contents appearing in the character positions of each target video key frame, determining the character contents appearing twice as the character contents appearing repeatedly when the text similarity of the character contents appearing twice in any character position is greater than a threshold value, and determining that any character position is not used as a character recognition interval to be selected.
With reference to the second aspect, in a possible implementation manner, the subtitle extraction module includes:
the subtitle dividing unit is used for dividing the subtitle to be processed into a plurality of subtitle clauses according to a preset time interval, removing the duplication of each subtitle clause, and combining the subtitle clauses with the character length smaller than the character length threshold value in the duplicated subtitle clauses to obtain the combined subtitle to be processed;
and the caption screening unit is used for performing caption clause screening based on the characters in the merged captions to be processed so as to determine the target captions of the target video.
With reference to the second aspect, in a possible implementation manner, the subtitle filtering unit includes:
and the clause removing subunit is used for removing the caption clauses containing uncommon characters from the merged subtitles to be processed, so as to screen out the caption clauses not containing uncommon characters, wherein the uncommon characters at least comprise one of letters, numbers and rare radicals or components.
In the embodiment of the application, by acquiring a plurality of target video key frames of a target video, further, determining the character position and the character content from each target video key frame, the subtitle identification interval of the target video can be determined according to the character position and the character content in each target video key frame. It can be understood that different target video key frames correspond to different text contents at the same position in the subtitle identification interval. And performing character recognition on the subtitle of the target video according to the subtitle recognition interval, and obtaining the subtitle to be processed from the target video. And performing post-processing such as dividing, de-duplicating and merging on the to-be-processed subtitles according to the corpus acquisition rule, further removing the subtitles containing the uncommon characters in the merged to-be-processed subtitles in a sentence-by-sentence manner to obtain target subtitles, so that the word error rate of the target subtitles is reduced to be within a standard range, and further generating a phonetic transcription text training corpus for video voice recognition according to the target videos and the target subtitles. Therefore, automatic video acquisition can be realized, the subtitles of the video are extracted and screened, and then the training corpus suitable for the voice-to-text is obtained, the acquisition efficiency of the voice-to-text training corpus is improved, and meanwhile, the corpus quality of the voice-to-text training corpus is improved.
In a third aspect, an embodiment of the present application provides a terminal device, where the terminal device includes a processor and a memory, and the processor and the memory are connected to each other. The memory is used for storing a computer program supporting the terminal device to execute the method provided by the first aspect and/or any possible implementation manner of the first aspect, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method provided by the first aspect and/or any possible implementation manner of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, where the computer program is executed by a processor to implement the method provided by the first aspect and/or any possible implementation manner of the first aspect.
In the embodiment of the application, by acquiring a plurality of target video key frames of a target video, further, determining the character position and the character content from each target video key frame, the subtitle identification interval of the target video can be determined according to the character position and the character content in each target video key frame. It can be understood that different target video key frames correspond to different text contents at the same position in the subtitle identification interval. And performing character recognition on the subtitle of the target video according to the subtitle recognition interval, and obtaining the subtitle to be processed from the target video. And performing post-processing such as dividing, de-duplicating and merging on the to-be-processed subtitles according to the corpus acquisition rule, further, removing the subtitles containing uncommon characters in the merged to-be-processed subtitles in a sentence-by-sentence manner to obtain target subtitles, reducing the wrong character rate of the target subtitles to be within a standard range, and further generating a phonetic transcription text training corpus for video voice recognition according to the target video and the target subtitles. Therefore, automatic video acquisition can be realized, the subtitles of the video are extracted and screened, and then the training corpus suitable for the voice-to-text is obtained, the acquisition efficiency of the voice-to-text training corpus is improved, and meanwhile, the corpus quality of the voice-to-text training corpus is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a schematic diagram of a network architecture provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a method for obtaining phonetic transcription training corpora according to an embodiment of the present application;
fig. 3 is a schematic view of a scene of target video key frame extraction provided by an embodiment of the present application;
fig. 4 is a schematic structural diagram of a method for obtaining phonetic transcription text corpora according to an embodiment of the present application;
fig. 5 is a scene schematic diagram of a video subtitle interval provided by an embodiment of the present application;
fig. 6 is a scene schematic diagram for determining a subtitle recognition interval according to an embodiment of the present application;
fig. 7 is a schematic flowchart of subtitle post-processing according to an embodiment of the present disclosure;
FIG. 8 is another flowchart illustrating a method for obtaining phonetic transcription training corpora according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an apparatus for obtaining phonetic transcription training corpora according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present invention. As shown in fig. 1, the network architecture may include a cloud server 2000 and a user terminal cluster; the user terminal cluster may include a plurality of user terminals, as shown in fig. 1, specifically including a user terminal 3000a, user terminals 3000b, …, and a user terminal 3000 n; as shown in fig. 1, the user terminal 3000a, the user terminals 3000b and …, and the user terminal 3000n may respectively establish a data connection relationship with the cloud server 2000 under a certain data interaction condition, so as to perform data interaction with the cloud server 2000.
For convenience of understanding, in the embodiment of the present application, one user terminal may be selected as a target user terminal from the plurality of user terminals shown in fig. 1. The target user terminal may be a smart terminal that can perform a subtitle extraction function on a video (e.g., within a social application, a short video application, a video application, etc.) to generate the voice-to-text training corpus. For example, the user terminal 3000a shown in fig. 1 may be used as the target user terminal, and one or more applications may be integrated in the target user terminal. It should be understood that the applications integrated in the target user terminal may be collectively referred to as target application clients. The target applications may include social applications (e.g., WeChat and QQ), short video applications (e.g., micro-video), movie and television applications (e.g., Tencent video), and other applications that can be used for extracting video subtitles; a target application may extract subtitles from videos obtained by different applications, filter the subtitles, and generate an audio-to-text corpus.
It can be understood that the method for obtaining the phonetic transcription corpus described in the embodiment of the present application may be applied to all application scenarios in which the phonetic transcription corpus is obtained in a web page or an application client (i.e., the aforementioned target application). When a target application with an audio-to-text corpus acquiring function runs in the target user terminal to extract subtitles from a video, the video extracted by the target user terminal may include a video previously built in the target application, and may also include a video currently acquired from the server 2000 through a network.
It should be understood that, in the embodiment of the present application, the application video embedded in the target application in advance and the video currently acquired from the server may be collectively referred to as a target video, and the subtitle extraction is required to generate the audio-to-text training corpus. Therefore, the subtitle extraction can be performed on the target video during the running period of the webpage or the target application to obtain the target video and the target subtitle of the target video, so that the generation quality of the training corpus can be improved when the voice-to-text corpus is generated in the webpage or the application client, and the occupation of the system memory in the subtitle extraction process of the video is reduced.
Optionally, in this embodiment of the application, before the target user terminal runs the target application, the to-be-processed video acquired from the server 2000 shown in fig. 1 may be filtered in advance to obtain the target video, and the target video is stored in a specified storage space. When the target application is run by the target user terminal, the target video can be directly loaded from the specified storage space, so that the system performance loss brought to the target user terminal during the running of the target application is reduced (for example, the target user terminal can reduce the occupation of the system memory by video data). Optionally, in this embodiment of the application, before the target user terminal runs the target application, the server 2000 may also be used to extract key frames from the video to be processed in advance, and filter the video to be processed to obtain the target video and the key frames of the target video. The target user terminal may send a data downloading instruction (i.e., a data loading instruction) to the server 2000 through the network when running the target application, so that the server may determine whether the target user terminal satisfies the key frame extraction condition based on the terminal identifier carried in the downloading instruction. If the server 2000 determines that the target user terminal satisfies the key frame extraction condition, the target user terminal may obtain, through the server 2000, the target video and the target video key frames that were stored after the key frame extraction was performed in advance, so that when the target application is running in the target user terminal, the system performance loss may be reduced, and the efficiency of obtaining the voice-to-text corpus may be improved. Therefore, before the target application is run, the target user terminal in the embodiment of the present application may further extract and filter the key frames of the video to be processed through the server 2000, so as to obtain the target video and the key frames of the target video.
Optionally, before the target user terminal runs the target application, the target video acquired from the server 2000 shown in fig. 1 may be subjected to subtitle extraction in advance to obtain the target video and the target subtitle. For example, taking the target application as a short video application (Tencent micro-video) as an example, the target user terminal may load and display the target video and the target subtitle through the short video application, and generate the text-to-speech corpus according to the target video and the target subtitle.
The video to be processed described in the embodiments of the present application may include short videos, movies, television and music greeting cards, and may also include audio containing subtitle information. For example, taking the target application as a short video application as an example, the target user terminal may capture videos uploaded, downloaded or browsed by a user through the short video application and extract their subtitles to obtain a target video and a target subtitle, thereby generating an audio-to-text corpus.
Referring to fig. 2, fig. 2 is a flow chart illustrating a method for obtaining phonetic transcription training corpus according to an embodiment of the present application. For convenience of description, in the embodiments of the present application, a terminal is used as an execution subject of the method for obtaining the phonetic transcription training corpus. As shown in fig. 2, the method for obtaining phonetic transcription text training corpus provided in the present application includes:
s101: the method comprises the steps of obtaining a plurality of target video key frames of a target video, and determining character positions and character contents from the target video key frames.
In some feasible embodiments, when the terminal acquires the target video, the terminal may extract key frames of the target video to obtain the target video key frames. Specifically, referring to fig. 3, fig. 3 is a scene schematic diagram of target video key frame extraction provided in the embodiment of the present application. As shown in fig. 3, a video key frame is a frame in which a key action of a character or an object in the video changes during motion, and is equivalent to an original drawing in two-dimensional animation. When the puppy and the subtitle in the video change continuously, the key information in the video can be displayed completely through the video key frames, and the characters in the video can then be recognized from the key frames to obtain the character positions and the character contents. For example, by recognizing the characters in the key frames, it is possible to distinguish text that is irrelevant to the subtitle, such as "waning" appearing in the middle of each target video key frame, from subtitle text appearing in the lower part of each target video key frame, such as "transparent is a dog", "very loved" and "you want to like it". Frames between key frames can be generated by software and are called transition frames or intermediate frames. A frame is the smallest unit of a single image in a video, corresponding to each frame of a shot on motion picture film; one frame represents one picture, or one mark on the time axis of the video. After the terminal acquires the video to be processed, the terminal may extract key frames from the video to be processed, and the key frame extraction methods include, but are not limited to: shot-based key frame extraction, motion-analysis-based key frame extraction and video-clustering-based key frame extraction. The shot-based method was the first developed in the field of video processing and is currently the most mature general method; its general procedure is as follows: the video file is first divided according to shot changes, and then the first frame and the last frame of each shot are selected as key frames. This method is simple to implement and requires little computation, but it has great limitations: when the content in the video changes violently and the scene is very complex, the first and last frames of a shot cannot represent the change of the whole content of the video, so the method cannot meet current standards and requirements for key frame extraction. The motion-analysis-based method extracts key frames based on the attributes of object motion characteristics; its procedure is as follows: the optical flow of object motion in a video shot is analyzed, and each time the video frame with the minimum optical flow movement is selected as the extracted key frame. This method can extract a proper number of key frames from most video shots, and the extracted key frames can effectively express the motion characteristics of the video.
However, this method has poor robustness, because it depends on the local characteristics of object motion, the calculation process is complex, and the algorithm has a large time overhead. The video-clustering-based method divides the video frames into several clusters during key frame extraction and then selects a representative frame in each cluster as a key frame. The key frames extracted by this algorithm have low redundancy and can accurately reflect the content of the video. However, the clustering-based method does not fully consider the temporal order of changes between frames when dividing the clusters, and a certain number of clusters needs to be preset before clustering, so the applicability of the method is limited to a certain extent. In a specific embodiment, there are many methods for extracting key frames from a video; a specific key frame extraction method may be selected according to the characteristics of the video itself, or several key frame extraction methods may be combined, which may be determined according to the actual application scene and is not limited herein. It is understood that any process of extracting key frames from a video falls within the protection scope of the present application.
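For illustration only, the following is a minimal Python sketch of the shot-based key frame extraction described above, assuming OpenCV is available; the histogram-correlation threshold and the choice of taking the first frame of each detected shot are assumptions of this sketch and are not specified by the application.

    import cv2

    def shot_based_key_frames(video_path, hist_diff_threshold=0.5):
        """Pick one key frame per detected shot (illustrative sketch, not the claimed method)."""
        cap = cv2.VideoCapture(video_path)
        key_frames, prev_hist = [], None
        frame_index = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
            cv2.normalize(hist, hist)
            # A large drop in histogram correlation is treated as a shot change.
            if prev_hist is None or cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < hist_diff_threshold:
                key_frames.append((frame_index, frame))
            prev_hist = hist
            frame_index += 1
        cap.release()
        return key_frames

A motion-analysis or clustering variant would replace the histogram comparison with optical flow statistics or frame clustering, as discussed above.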
In some feasible embodiments, in order to ensure the quality of the subsequent voice-to-text training corpus, the terminal may obtain videos to be processed in batch before obtaining a plurality of target video key frames of the target videos, extract key frames of the videos to be processed, and screen the videos through the video key frames to obtain the target videos and the target video key frames suitable for being used as the voice-to-text training corpus. Referring to fig. 4, fig. 4 is a data processing diagram of a method for obtaining phonetic transcription training corpus according to an embodiment of the present application. As shown in a part a in fig. 4, the terminal may download videos to be processed in batches during the video acquisition process, acquire and manage the videos to be processed acquired in batches (for example, classify the videos according to video type and application source), and further perform video preprocessing on the videos (for example, image grayscale processing, image stretching (for example, stretching a video image to a fixed pixel size), image sharpening, and the like), so as to facilitate subsequent positioning of the subtitle interval of the videos and subtitle extraction.
In some possible embodiments, the terminal may determine the video key frames of the video to be processed and the frame number of the video key frames; when the frame number of the video key frames is greater than or equal to a frame number threshold (e.g., 5 frames), the video to be processed is suitable as the voice-to-text training corpus, and the terminal may determine the video to be processed as the target video and determine a plurality of video key frames of the video to be processed as a plurality of target video key frames.
In some feasible embodiments, the terminal may further perform character recognition on the video key frames of the video to be processed, so as to determine the Chinese character occurrence rate of the video key frames of the video to be processed. When the Chinese character occurrence rate of any video key frame in the video to be processed is greater than or equal to the occurrence rate threshold value, it indicates that the video to be processed includes Chinese and is suitable as an audio-to-text training corpus; the terminal can determine the video to be processed as the target video and determine a plurality of video key frames of the video to be processed as a plurality of target video key frames. After the target video key frames are acquired, the terminal may perform character recognition on the target video key frames (for example, recognize characters in the target video key frames by using an OCR technology) to obtain the character positions and character contents in the target video key frames.
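As an illustration of this screening step, the sketch below combines the frame-number check and the Chinese character occurrence rate check. The 5-frame threshold follows the example above; the 0.5 occurrence-rate value and the ocr_lines(frame) helper (returning (text, bounding box) pairs from some OCR engine) are assumptions of the sketch.

    import re

    CJK = re.compile(r'[\u4e00-\u9fff]')
    FRAME_THRESHOLD = 5            # example value from the description
    CHINESE_RATE_THRESHOLD = 0.5   # assumed value; the application only names a threshold

    def chinese_rate(text):
        """Share of CJK characters among all non-space characters."""
        chars = [c for c in text if not c.isspace()]
        return len(CJK.findall(text)) / len(chars) if chars else 0.0

    def is_target_video(key_frames, ocr_lines):
        """key_frames: list of frames; ocr_lines: frame -> [(text, box), ...] (assumed OCR helper)."""
        if len(key_frames) < FRAME_THRESHOLD:
            return False
        for frame in key_frames:
            joined = ''.join(text for text, _ in ocr_lines(frame))
            if joined and chinese_rate(joined) >= CHINESE_RATE_THRESHOLD:
                return True   # at least one key frame is predominantly Chinese
        return False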
S102: and determining a subtitle identification interval of the target video according to the character position and the character content in each target video key frame.
In some possible embodiments, the position of the subtitle appearing in the video is not fixed, and it is required to determine the subtitle recognition interval of the video in order to accurately recognize the subtitle of the video and further obtain the corpus suitable for the phonetic transcription training. Specifically, referring to fig. 5, fig. 5 is a scene schematic diagram of a video subtitle interval according to an embodiment of the present application. As shown in fig. 5, 200a is a video key frame in which no subtitle appears, 200b is a video key frame in which a subtitle appears in the middle of a picture, 200c is a video key frame in which a subtitle appears above a picture, and 200d is a video key frame in which a subtitle appears with a size change and is vertically arranged. However, the ideal video caption interval is shown as 200e in fig. 5, and is a relatively fixed position (for example, caption in movie and television drama video) where caption appears in each video key frame, so in order to better identify caption in video key frame, the caption identification interval may be an area where coordinates are fixed at the same position but the text content appearing at the position changes continuously. In other words, in the embodiment of the present application, when the text in each video key frame of the target video is identified through the subtitle identification interval, the text content displayed in the same subtitle identification interval by different video key frames of the target video is different, that is, the text content of different video key frames corresponding to the same position in the subtitle identification interval is different.
In some possible embodiments, the target video may further include some non-subtitle text, such as a logo, a watermark, and so on. However, the positions and the content of the characters in the video are usually unchanged or slightly changed, and the terminal can eliminate the positions of the characters irrelevant to the subtitles by using the characteristic, so that the accuracy of extracting the subtitles is further improved. That is, after the terminal determines the character position and the character content in each target video key frame, the terminal screens the character position to determine the character identification interval to be selected of the target video.
In some possible embodiments, the terminal may determine at least one candidate text recognition interval from the text positions of the target video key frames, where the number of times of repeated occurrences of text content in the candidate text recognition interval is less than a threshold number of times, so as to exclude text positions unrelated to subtitles.
In particular, in some possible implementations, the terminal may determine the text similarity of the text content appearing in the text position of each target video key frame. When the text similarity of the character contents appearing at any two times in any character position is greater than a threshold value (for example, 85%), the character contents appearing at any two times are determined as the character contents appearing repeatedly, and any character position is determined not to be used as a character recognition interval to be selected.
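A possible way to realize this exclusion of static text (logos, watermarks) is sketched below; the 0.85 similarity threshold comes from the example above, while grouping detections by rounded box coordinates and the repeat_limit parameter are assumptions of the sketch.

    from difflib import SequenceMatcher
    from collections import defaultdict

    SIMILARITY_THRESHOLD = 0.85   # "greater than 85%" in the example above

    def candidate_positions(frame_ocr_results, repeat_limit=1):
        """frame_ocr_results: per key frame, a list of (text, (x, y, w, h)) tuples.
        Positions whose text keeps repeating across frames are excluded as non-subtitle text."""
        texts_by_pos = defaultdict(list)
        for lines in frame_ocr_results:
            for text, (x, y, w, h) in lines:
                # Round the box so slightly shifted detections share one position key.
                texts_by_pos[(round(x, -1), round(y, -1))].append(text)
        candidates = []
        for pos, texts in texts_by_pos.items():
            repeats = sum(
                1
                for i in range(len(texts))
                for j in range(i + 1, len(texts))
                if SequenceMatcher(None, texts[i], texts[j]).ratio() > SIMILARITY_THRESHOLD
            )
            if repeats < repeat_limit:   # content changes between frames, so it may be a subtitle area
                candidates.append(pos)
        return candidates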
In some possible embodiments, after the text position is eliminated according to the text content in each target key frame, the terminal may determine the subtitle recognition interval of the target video. Specifically, referring to fig. 6, fig. 6 is a scene schematic diagram for determining a subtitle recognition interval according to an embodiment of the present application. As shown in fig. 6, the terminal may obtain the text recognition interval to be selected according to the coordinates of the text position, and further obtain the subtitle recognition interval. For example, the terminal may represent the subtitle recognition interval using the abscissa and the ordinate of the boundary pixel point of the subtitle recognition interval, may represent the subtitle recognition interval using only the abscissa or the ordinate of the boundary pixel point of the subtitle recognition interval, and may represent the subtitle recognition interval using the abscissa and the ordinate of the diagonal pixel point of the boundary of the subtitle recognition interval.
In some possible embodiments, before determining the subtitle recognition interval of the target video, in order to further improve the accuracy of subsequent subtitle extraction, the terminal may remove some text positions with a large jitter degree in the video. That is, the terminal may determine the jitter degree of each candidate character recognition interval according to at least one candidate character recognition interval, and determine the subtitle recognition interval of the target video according to the candidate character recognition interval in which the jitter degree is less than or equal to the jitter degree threshold (for example, the vertical coordinate jitter of the upper and lower boundaries of the character position does not exceed 6 pixels).
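The jitter check and the corner-coordinate representation of the interval can be sketched as follows; the 6-pixel bound is the example given above, and the box format (x, y, w, h) is an assumption of the sketch.

    JITTER_PX = 6   # example bound from the description

    def vertical_jitter(boxes):
        """Spread of the top and bottom edges of a candidate area across key frames."""
        tops = [y for (_, y, _, _) in boxes]
        bottoms = [y + h for (_, y, _, h) in boxes]
        return max(max(tops) - min(tops), max(bottoms) - min(bottoms))

    def pick_subtitle_interval(candidate_boxes_by_pos):
        """candidate_boxes_by_pos: position key -> list of (x, y, w, h) boxes over key frames.
        Returns stable candidate areas as (x_min, y_min, x_max, y_max) corner coordinates."""
        intervals = []
        for boxes in candidate_boxes_by_pos.values():
            if vertical_jitter(boxes) <= JITTER_PX:
                x0 = min(x for (x, _, _, _) in boxes)
                y0 = min(y for (_, y, _, _) in boxes)
                x1 = max(x + w for (x, _, w, _) in boxes)
                y1 = max(y + h for (_, y, _, h) in boxes)
                intervals.append((x0, y0, x1, y1))
        return intervals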
In particular, if the jitter degree of a text position in the key frames of the target video is too large (exceeds the jitter degree threshold), the subtitle quality in the target video is poor and the video is not suitable as a phonetic transcription training corpus. The terminal can then discard the existing target video, determine a target video from the videos to be processed again, and perform the above operations again until a target subtitle recognition interval suitable for generating the phonetic transcription corpus is obtained.
S103: and recognizing the subtitle of the target video according to the subtitle recognition interval so as to obtain the subtitle to be processed from the target video, and performing character processing on the subtitle to be processed according to a preset corpus obtaining rule so as to obtain the target subtitle of the target video.
In some possible embodiments, as shown in part b of fig. 4, after obtaining the subtitle recognition interval, the terminal may recognize (e.g., by OCR) the subtitle of the target video according to the subtitle recognition interval to obtain the subtitle to be processed from the target video, and perform subtitle post-processing on the subtitle to be processed. The subtitle post-processing includes, but is not limited to, operations such as deduplication, short-phrase merging, word-count checking and character filtering. The subtitle post-processing procedure for the subtitle to be processed will be described below with reference to fig. 7.
Specifically, referring to fig. 7, fig. 7 is a schematic flowchart of subtitle post-processing according to an embodiment of the present disclosure. As shown in fig. 7, a method for subtitle post-processing provided by an embodiment of the present application includes the steps of:
s201: dividing the subtitle to be processed into a plurality of subtitle clauses according to a preset time interval, removing the duplication of each subtitle clause, and combining the subtitle clauses with the character length smaller than the character length threshold value in the duplicated subtitle clauses to obtain the combined subtitle to be processed.
In some possible embodiments, the terminal may obtain corresponding time axis information of the to-be-processed subtitles in the target video while identifying the target video to obtain the to-be-processed subtitles, for example, a first subtitle appears in n second to m second (or j frame to k frame) of the target video. The terminal can divide the subtitle to be processed into a plurality of subtitle clauses according to a preset time interval (for example, 0.2 second), and duplicate removal is performed on each subtitle clause to prevent repeated subtitles from appearing continuously for a plurality of times, so that subtitle clauses with the character length smaller than a character length threshold (for example, 2 characters) in the duplicate-removed subtitle clauses are merged to prevent the subtitle clauses from being too short, and the merged subtitle to be processed is obtained.
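The clause splitting, deduplication and merging described above might look like the following sketch; the 0.2-second sampling interval and the 2-character minimum follow the examples in this step, while reusing a text-similarity test to detect a subtitle that is still on screen is an assumption of the sketch.

    from difflib import SequenceMatcher

    TIME_STEP = 0.2        # example sampling interval (seconds)
    MIN_CLAUSE_CHARS = 2   # example minimum clause length

    def postprocess_clauses(raw_clauses):
        """raw_clauses: [(start_time, end_time, text), ...] read every TIME_STEP seconds.
        Collapses repeated lines, then merges clauses that are too short into the next one."""
        deduped = []
        for start, end, text in raw_clauses:
            if deduped and SequenceMatcher(None, deduped[-1][2], text).ratio() > 0.85:
                # Same subtitle still on screen: extend the previous clause instead of repeating it.
                deduped[-1] = (deduped[-1][0], end, deduped[-1][2])
            else:
                deduped.append((start, end, text))
        merged = []
        for start, end, text in deduped:
            if merged and len(merged[-1][2]) < MIN_CLAUSE_CHARS:
                prev_start, _, prev_text = merged[-1]
                merged[-1] = (prev_start, end, prev_text + text)
            else:
                merged.append((start, end, text))
        return merged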
In particular, if the clauses of the subtitle to be processed are too long or too short (e.g., outside 4 to 25 characters), or the speech rate in the target video is too fast or too slow (e.g., more than 6 words per second or fewer than 3 words per second), it indicates that the target video is not suitable as the phonetic transcription corpus. The terminal can then discard the existing target video, determine a target video from the videos to be processed again, and perform the above operations again until a subtitle to be processed suitable for generating the phonetic transcription corpus is obtained.
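Using the example numbers above, this screening rule reduces to a small check; the epsilon guard against zero-length durations is an addition of the sketch.

    MIN_CHARS, MAX_CHARS = 4, 25    # example clause-length window
    MIN_RATE, MAX_RATE = 3.0, 6.0   # example speaking rate (characters per second)

    def clause_is_usable(start, end, text):
        """Reject clauses that are too short/long or spoken too slowly/quickly."""
        duration = max(end - start, 1e-6)
        rate = len(text) / duration
        return MIN_CHARS <= len(text) <= MAX_CHARS and MIN_RATE <= rate <= MAX_RATE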
S202: and eliminating the caption clauses containing the uncommon characters in the merged captions to be processed to screen out the caption clauses not containing the uncommon characters, wherein the uncommon characters at least comprise one of letters, numbers and radicals of the uncommon components.
In some feasible implementation manners, due to certain technical defects of the character recognition technology, some rarely-used characters may not be accurately recognized, the rarely-used characters generally have a large error with characters in a target video, and if the rarely-used characters are used as target subtitles to generate a phonetic transcription training corpus, the subsequent training effect will be affected. Therefore, the terminal can remove the caption clauses containing the uncommon characters from the merged captions to be processed so as to screen out the caption clauses not containing the uncommon characters. Wherein the uncommon character at least comprises one of letters, numbers and radicals of the uncommon part.
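The rare-character screen can be approximated with a single pattern, as sketched below; treating everything outside the basic CJK block and common punctuation as "uncommon" is an assumption standing in for a curated rare-character list, which the application does not spell out.

    import re

    # Letters and digits, plus any character outside the basic CJK block and common
    # punctuation, are treated as "uncommon" in this sketch.
    UNCOMMON = re.compile(r'[A-Za-z0-9]|[^\u4e00-\u9fff，。！？、：；]')

    def keep_clause(text):
        """True if the clause contains no uncommon characters."""
        return UNCOMMON.search(text) is None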
S104: and generating an audio-text training corpus for voice recognition according to the target video and the target subtitle.
In some possible embodiments, after obtaining the target caption, the terminal may generate a corpus of the voice-to-text according to the target video and the target caption, or may extract an audio in the target video and generate a corpus of the voice-to-text according to the extracted audio and the target caption.
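One way to pair the extracted audio with the target subtitles, as described here, is sketched below; it assumes the ffmpeg command-line tool is available, and the JSON-lines manifest layout (audio path, transcript, timing) is an assumed format rather than one defined by the application.

    import json
    import subprocess

    def export_corpus(video_path, clauses, out_prefix):
        """Cut one audio snippet per clause with ffmpeg and write a JSON-lines manifest."""
        entries = []
        for idx, (start, end, text) in enumerate(clauses):
            audio_path = f"{out_prefix}_{idx:05d}.wav"
            subprocess.run(
                ["ffmpeg", "-y", "-i", video_path, "-ss", str(start), "-to", str(end),
                 "-vn", "-ac", "1", "-ar", "16000", audio_path],
                check=True,
            )
            entries.append({"audio": audio_path, "text": text, "start": start, "end": end})
        with open(f"{out_prefix}_manifest.jsonl", "w", encoding="utf-8") as f:
            for entry in entries:
                f.write(json.dumps(entry, ensure_ascii=False) + "\n")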
In the embodiment of the application, by acquiring a plurality of target video key frames of a target video, further, determining the character position and the character content from each target video key frame, the subtitle identification interval of the target video can be determined according to the character position and the character content in each target video key frame. It can be understood that different target video key frames correspond to different text contents at the same position in the subtitle identification interval. And performing character recognition on the subtitle of the target video according to the subtitle recognition interval, and obtaining the subtitle to be processed from the target video. And performing post-processing such as dividing, de-duplicating and merging on the to-be-processed subtitles according to the corpus acquisition rule, further removing the subtitles containing the uncommon characters in the merged to-be-processed subtitles in a sentence-by-sentence manner to obtain target subtitles, so that the word error rate of the target subtitles is reduced to be within a standard range, and further generating a phonetic transcription text training corpus for video voice recognition according to the target videos and the target subtitles. Therefore, automatic video acquisition can be realized, the subtitles of the video are extracted and screened, and then the training corpus suitable for the voice-to-text is obtained, the acquisition efficiency of the voice-to-text training corpus is improved, and meanwhile, the corpus quality of the voice-to-text training corpus is improved.
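Putting steps S101 to S104 together, a high-level sketch of the whole pipeline could look like the following; every helper is illustrative, and ocr_lines(frame) and read_subtitles(video_path, intervals) stand for an OCR engine that is assumed rather than specified here.

    def group_boxes_by_position(per_frame, positions):
        """Collect, for each kept position, its detected box in every key frame."""
        from collections import defaultdict
        by_pos = defaultdict(list)
        for lines in per_frame:
            for _text, (x, y, w, h) in lines:
                key = (round(x, -1), round(y, -1))
                if key in positions:
                    by_pos[key].append((x, y, w, h))
        return by_pos

    def build_corpus(video_path, ocr_lines, read_subtitles, out_prefix):
        """End-to-end sketch of S101-S104 built from the sketches above; ocr_lines(frame)
        and read_subtitles(video_path, intervals) are assumed OCR helpers, not defined here."""
        key_frames = shot_based_key_frames(video_path)
        frames = [frame for _, frame in key_frames]
        if not is_target_video(frames, ocr_lines):
            return None                                     # S101: screen out unsuitable videos
        per_frame = [ocr_lines(frame) for frame in frames]
        positions = candidate_positions(per_frame)          # S102: drop static text (logos, watermarks)
        intervals = pick_subtitle_interval(group_boxes_by_position(per_frame, positions))
        raw = read_subtitles(video_path, intervals)         # OCR over the subtitle interval, with timing
        clauses = [c for c in postprocess_clauses(raw)      # S103: split, dedupe, merge, filter
                   if clause_is_usable(*c) and keep_clause(c[2])]
        export_corpus(video_path, clauses, out_prefix)      # S104: pair audio with the target subtitles
        return clauses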
Referring to fig. 8, fig. 8 is another flow chart illustrating a method for obtaining phonetic transcription training corpus according to an embodiment of the present application. As shown in fig. 8, another method for obtaining a phonetic transcription text corpus provided in the present application includes:
s301: the method comprises the steps of obtaining a video to be processed, and determining a video key frame of the video to be processed, the frame number of the video key frame and the Chinese character occurrence rate of the video key frame.
S302: when the frame number of the video key frames is larger than or equal to the frame number threshold value, and the Chinese character occurrence rate of any video key frame in the video to be processed is larger than or equal to the occurrence rate threshold value, determining the video to be processed as a target video, and determining a plurality of video key frames of the video to be processed as a plurality of target video key frames.
In some feasible embodiments, in order to ensure the quality of the subsequent voice-to-text training corpus, the terminal may obtain videos to be processed in batch, extract key frames of the videos to be processed, and filter the videos through the key frames of the videos to obtain target videos and key frames of the target videos suitable for being used as the voice-to-text training corpus. As shown in a part a in fig. 4, the terminal may download videos to be processed in batches during the video acquisition process, acquire and manage the videos to be processed acquired in batches (for example, classify the videos according to video type and application source), and further perform video preprocessing on the videos (for example, image grayscale processing, image stretching (for example, stretching a video image to a fixed pixel size), image sharpening, and the like), so as to facilitate subsequent positioning of the subtitle interval of the videos and subtitle extraction.
In some possible embodiments, after acquiring the video to be processed, the terminal may extract key frames from the video to be processed. The key frame extraction methods include, but are not limited to: shot-based key frame extraction, motion-analysis-based key frame extraction and video-clustering-based key frame extraction. The shot-based method was the first developed in the field of video processing and is currently the most mature general method; its general procedure is as follows: the video file is first divided according to shot changes, and then the first frame and the last frame of each shot are selected as key frames. This method is simple to implement and requires little computation, but it has great limitations: when the content in the video changes violently and the scene is very complex, the first and last frames of a shot cannot represent the change of the whole content of the video, so the method cannot meet current standards and requirements for key frame extraction. The motion-analysis-based method extracts key frames based on the attributes of object motion characteristics; its procedure is as follows: the optical flow of object motion in a video shot is analyzed, and each time the video frame with the minimum optical flow movement is selected as the extracted key frame. This method can extract a proper number of key frames from most video shots, and the extracted key frames can effectively express the motion characteristics of the video. However, this method has poor robustness, because it depends on the local characteristics of object motion, the calculation process is complex, and the algorithm has a large time overhead. The video-clustering-based method divides the video frames into several clusters during key frame extraction and then selects a representative frame in each cluster as a key frame. The key frames extracted by this algorithm have low redundancy and can accurately reflect the content of the video. However, the clustering-based method does not fully consider the temporal order of changes between frames when dividing the clusters, and a certain number of clusters needs to be preset before clustering, so the applicability of the method is limited to a certain extent. In a specific embodiment, there are many methods for extracting key frames from a video; a specific key frame extraction method may be selected according to the characteristics of the video itself, or several key frame extraction methods may be combined. It is understood that any process of extracting key frames from a video falls within the protection scope of the present application.
In some possible embodiments, the terminal may determine the frame number of video key frames of the video to be processed; when the number of video key frames is greater than or equal to a frame number threshold (e.g., 5 frames), the video to be processed is suitable as a voice-to-text training corpus, and the terminal may determine the video to be processed as the target video and determine the plurality of video key frames of the video to be processed as the plurality of target video key frames. After the target video key frames are acquired, the terminal may perform character recognition (e.g., OCR) on the target video key frames to obtain the character positions and character contents in each target video key frame.
In some possible embodiments, before determining the video to be processed as the target video, the terminal may perform text recognition on the video key frames of the video to be processed, so as to determine the Chinese character occurrence rate of each video key frame. When the Chinese character occurrence rate of any video key frame in the video to be processed is greater than or equal to an occurrence rate threshold, it indicates that the video to be processed includes Chinese and is suitable as a voice-to-text training corpus; the terminal may then determine the video to be processed as the target video and determine the plurality of video key frames of the video to be processed as the plurality of target video key frames.
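The two screening conditions above could be combined roughly as in the following sketch; the OCR step is assumed to have already produced one text string per key frame, and the 0.6 occurrence-rate threshold is an illustrative assumption (only the 5-frame threshold is taken from the example above).

```python
import re

CJK_CHARS = re.compile(r'[\u4e00-\u9fff]')

def is_target_video(keyframe_texts, frame_threshold=5, occurrence_threshold=0.6):
    """keyframe_texts: one OCR result string per extracted video key frame."""
    if len(keyframe_texts) < frame_threshold:        # not enough key frames
        return False
    for text in keyframe_texts:
        if text and len(CJK_CHARS.findall(text)) / len(text) >= occurrence_threshold:
            return True                              # at least one key frame is mostly Chinese
    return False
```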
S303: and determining at least one character position from the character positions of the key frames of the target videos as at least one character recognition interval to be selected, wherein the repeated occurrence frequency of the character content of the character recognition interval to be selected is less than a frequency threshold value.
In some possible embodiments, the position at which subtitles appear in a video is not fixed, and the subtitle recognition interval of the video needs to be determined in order to accurately recognize the subtitles and obtain a corpus suitable for voice-to-text training. As shown in fig. 5, 200a is a video key frame in which no subtitle appears, 200b is a video key frame in which a subtitle appears in the middle of the picture, 200c is a video key frame in which a subtitle appears above the picture, and 200d is a video key frame in which the subtitle changes size and is vertically arranged. An ideal video subtitle interval is shown at 200e in fig. 5: a relatively fixed position at which subtitles appear in the respective video key frames (e.g., subtitles in movie and television videos).
In some possible embodiments, the target video may further include some non-subtitle text, such as a logo, a watermark, and so on. The positions and content of such text, however, are usually unchanged or change only slightly across the video, and the terminal can use this characteristic to eliminate text positions irrelevant to the subtitles, further improving the accuracy of subsequent subtitle extraction. That is, after determining the character position and character content in each target video key frame, the terminal screens the character positions to determine the candidate character recognition intervals of the target video.
In some possible embodiments, the terminal may determine the text similarity of the text content appearing at the text positions of the target video key frames. When the text similarity of the text content appearing at any two times at any text position is greater than a threshold (for example, 85%), that text content is determined to be repeatedly appearing text content, and that text position is determined not to be used as a candidate character recognition interval.
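One possible reading of this screening rule, using a generic string-similarity measure (an assumption; the application does not prescribe the metric), is sketched below.

```python
from difflib import SequenceMatcher
from itertools import combinations

def is_candidate_interval(texts_at_position, similarity_threshold=0.85):
    """texts_at_position: the text content recognized at one text position across
    the target video key frames. Positions whose content repeats (likely logos or
    watermarks) are rejected as candidate character recognition intervals."""
    for a, b in combinations(texts_at_position, 2):
        if SequenceMatcher(None, a, b).ratio() > similarity_threshold:
            return False      # repeatedly appearing text content
    return True
```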
In some possible embodiments, after eliminating text positions according to the text content in each target key frame, the terminal may determine the subtitle recognition interval of the target video. As shown in fig. 6, the terminal may obtain the candidate character recognition intervals from the coordinates of the text positions, and further obtain the subtitle recognition interval. For example, the terminal may represent the subtitle recognition interval by the abscissas and ordinates of its boundary pixel points, by only the abscissas or ordinates of its boundary pixel points, or by the abscissas and ordinates of the diagonal corner pixel points of its boundary.
S304: and determining the jitter degree of each character recognition interval to be selected, and determining the character recognition interval to be selected with the jitter degree smaller than or equal to the jitter degree threshold value as the subtitle recognition interval of the target video.
In some possible embodiments, before determining the subtitle recognition interval of the target video, in order to further improve the accuracy of subsequent subtitle extraction, the terminal may remove some text positions with a large jitter degree in the video. That is, the terminal may determine the jitter degree of each candidate character recognition interval according to at least one candidate character recognition interval, and determine the subtitle recognition interval of the target video according to the candidate character recognition interval in which the jitter degree is less than or equal to the jitter degree threshold (for example, the vertical coordinate jitter of the upper and lower boundaries of the character position does not exceed 6 pixels).
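Measured this way, the jitter check might be implemented as below; here jitter is taken as the spread of the upper and lower boundary ordinates across key frames, which is one plausible interpretation of the example above.

```python
def passes_jitter_check(boundary_boxes, jitter_threshold=6):
    """boundary_boxes: per-key-frame (top_y, bottom_y) of one candidate interval.
    The interval is kept only if both boundaries move within the threshold (pixels)."""
    tops = [top for top, _ in boundary_boxes]
    bottoms = [bottom for _, bottom in boundary_boxes]
    return (max(tops) - min(tops) <= jitter_threshold and
            max(bottoms) - min(bottoms) <= jitter_threshold)
```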
In particular, if the jitter degree of a text position in the target video key frames is too large (exceeds the jitter degree threshold), the subtitle quality in the target video is poor and the video is not suitable as a voice-to-text training corpus. The terminal may then discard the existing target video, determine a new target video from the videos to be processed, and perform the above operations again until a target video with a subtitle recognition interval suitable for producing a voice-to-text training corpus is obtained.
S305: and identifying the subtitle of the target video according to the subtitle identification interval so as to acquire the subtitle to be processed from the target video.
In some possible embodiments, as shown in part b of fig. 4, after obtaining the subtitle recognition interval, the terminal may recognize (e.g., with OCR) the subtitle of the target video within the subtitle recognition interval to obtain the subtitle to be processed from the target video, and perform subtitle post-processing on the subtitle to be processed. The subtitle post-processing includes, but is not limited to, processing operations such as deduplication, short-phrase merging, word-count checking, and character filtering.
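A minimal sketch of recognizing subtitles only inside the recognition interval is given below; pytesseract is used as a stand-in OCR engine and the (x1, y1, x2, y2) interval format is an assumption made for the example.

```python
import pytesseract  # stand-in OCR engine; any OCR backend playing the same role works

def recognize_subtitle(frame, interval):
    """interval: (x1, y1, x2, y2) pixel coordinates of the subtitle recognition interval."""
    x1, y1, x2, y2 = interval
    crop = frame[y1:y2, x1:x2]        # restrict OCR to the subtitle recognition interval
    return pytesseract.image_to_string(crop, lang="chi_sim").strip()
```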
S306: and performing character processing on the subtitle to be processed according to a preset corpus acquisition rule to obtain the target subtitle of the target video.
In some possible embodiments, while identifying the target video to obtain the subtitle to be processed, the terminal may also obtain the corresponding time axis information of the subtitle to be processed in the target video, for example, that the first subtitle appears from second n to second m (or from frame j to frame k) of the target video. The terminal can divide the subtitle to be processed into a plurality of subtitle clauses according to a preset time interval (for example, 0.2 second), deduplicate each subtitle clause to prevent repeated subtitles from appearing consecutively, and then merge the subtitle clauses whose character length is smaller than a character length threshold (for example, 2 characters) among the deduplicated subtitle clauses to prevent clauses from being too short, thereby obtaining the merged subtitle to be processed.
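One way to read the divide / de-duplicate / merge rule is sketched below; representing the OCR output as (start_time, text) pairs is an assumption made for the example.

```python
def postprocess_subtitles(timed_lines, gap_seconds=0.2, min_chars=2):
    """timed_lines: list of (start_time_in_seconds, text) read from the subtitle interval.
    Split into clauses on time gaps, drop consecutive duplicates, and merge clauses
    shorter than min_chars into the previous clause."""
    clauses, current, last_time = [], [], None
    for start_time, text in timed_lines:
        if last_time is not None and start_time - last_time >= gap_seconds and current:
            clauses.append("".join(current))
            current = []
        current.append(text)
        last_time = start_time
    if current:
        clauses.append("".join(current))
    deduped = [c for i, c in enumerate(clauses) if i == 0 or c != clauses[i - 1]]
    merged = []
    for clause in deduped:
        if merged and len(clause) < min_chars:
            merged[-1] += clause      # merge a too-short clause into the previous one
        else:
            merged.append(clause)
    return merged
```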
In particular, if the subtitle clauses of the subtitle to be processed are too long or too short (e.g., outside the range of 4 to 25 characters), that is, the speech rate in the target video is too fast or too slow (e.g., more than 6 words per second or fewer than 3 words per second), the target video is not suitable as a voice-to-text training corpus. The terminal may then discard the existing target video, determine a new target video from the videos to be processed, and perform the above operations again until a subtitle to be processed suitable for producing a voice-to-text training corpus is obtained.
In some feasible implementation manners, because character recognition technology has certain technical limitations, some uncommon characters may not be recognized accurately; such characters generally deviate significantly from the characters actually shown in the target video, and if they were kept in the target subtitle used to generate the voice-to-text training corpus, the subsequent training effect would be affected. Therefore, the terminal can remove the subtitle clauses containing uncommon characters from the merged subtitle to be processed, so as to screen out the subtitle clauses that do not contain uncommon characters. The uncommon characters include at least one of letters, numbers and uncommon radicals.
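An illustrative filter for this step is shown below; the character whitelist is an assumption (the application does not define the exact set of uncommon characters).

```python
import re

# A clause is kept only if every character is a common Chinese character or basic
# punctuation; letters, digits, and rarer symbols cause the clause to be dropped.
COMMON_CLAUSE = re.compile(r'^[\u4e00-\u9fa5，。！？、]+$')

def remove_rare_character_clauses(clauses):
    return [clause for clause in clauses if COMMON_CLAUSE.match(clause)]
```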
S307: and generating an audio-text training corpus for video voice recognition according to the target video and the target subtitle.
In some possible embodiments, after obtaining the target caption, the terminal may generate a corpus of the voice-to-text according to the target video and the target caption, or may extract an audio in the target video and generate a corpus of the voice-to-text according to the extracted audio and the target caption.
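Extracting the audio track for the corpus pair can be done with an external tool; a sketch using ffmpeg (assumed to be installed) is shown below, with the mono 16 kHz WAV format chosen only as a common speech-recognition convention, not a requirement of this application.

```python
import subprocess

def extract_audio(video_path, audio_path="audio.wav", sample_rate=16000):
    """Extract a mono WAV track from the target video to pair with the target subtitle."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn",
         "-ac", "1", "-ar", str(sample_rate), audio_path],
        check=True,
    )
    return audio_path
```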
In some possible embodiments, after obtaining the target subtitle, the terminal may verify the character error rate of the target subtitle, and generate the voice-to-text training corpus according to the target video and the target subtitle when the character error rate of the target subtitle is less than an error threshold (e.g., 5%). The character error rate can be calculated by formula 1, which is as follows:
CER = (S + D + I) / N
wherein CER is the character error rate, S is the number of substituted characters, D is the number of deleted characters, I is the number of inserted characters, and N is the total number of characters.
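For reference, S + D + I in formula 1 equals the character-level edit distance between the reference text and the recognized subtitle, so the rate can be computed as in the sketch below (illustrative only; not a verbatim part of this application).

```python
def character_error_rate(reference, hypothesis):
    """CER = (S + D + I) / N via a standard edit-distance table."""
    n, m = len(reference), len(hypothesis)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        table[i][0] = i                                   # i deletions
    for j in range(m + 1):
        table[0][j] = j                                   # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            table[i][j] = min(table[i - 1][j] + 1,        # deletion
                              table[i][j - 1] + 1,        # insertion
                              table[i - 1][j - 1] + substitution)
    return table[n][m] / n if n else 0.0
```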
In the embodiment of the application, by acquiring a plurality of target video key frames of a target video, further, determining the character position and the character content from each target video key frame, the subtitle identification interval of the target video can be determined according to the character position and the character content in each target video key frame. It can be understood that different target video key frames correspond to different text contents at the same position in the subtitle identification interval. And performing character recognition on the subtitle of the target video according to the subtitle recognition interval, and obtaining the subtitle to be processed from the target video. And performing post-processing such as dividing, de-duplicating and merging on the to-be-processed subtitles according to the corpus acquisition rule, further removing the subtitles containing the uncommon characters in the merged to-be-processed subtitles in a sentence-by-sentence manner to obtain target subtitles, so that the word error rate of the target subtitles is reduced to be within a standard range, and further generating a phonetic transcription text training corpus for video voice recognition according to the target videos and the target subtitles. Therefore, automatic video acquisition can be realized, the subtitles of the video are extracted and screened, and then the training corpus suitable for the voice-to-text is obtained, the acquisition efficiency of the voice-to-text training corpus is improved, and meanwhile, the corpus quality of the voice-to-text training corpus is improved.
Further, please refer to fig. 9, fig. 9 is a schematic structural diagram of an apparatus for obtaining phonetic transcription training corpus according to an embodiment of the present application. As shown in fig. 9, the above apparatus may include:
the video obtaining module 601 is configured to obtain a plurality of target video key frames of a target video, and determine text positions and text contents from each target video key frame.
In some possible embodiments, in order to ensure the quality of the subsequent voice-to-text corpus, the video acquisition module 601 may acquire videos to be processed in batch, extract key frames of the videos to be processed, and filter the key frames of the videos to obtain target videos and target video key frames suitable for being used as the voice-to-text corpus. As shown in part a of fig. 4, the video obtaining module 601 may download videos to be processed in batch in the process of obtaining the videos, collect and manage the videos to be processed obtained in batch (for example, classify the videos according to video types and application sources), further perform video preprocessing on the videos (for example, image gray processing, image stretching (for example, stretching a video image into a fixed pixel size), and image sharpening), so as to facilitate subsequent positioning and subtitle extraction on a subtitle interval of the videos.
In some possible embodiments, after the video to be processed is acquired, the video acquisition module 601 may perform key frame extraction on the video to be processed, where the key frame extraction method includes, but is not limited to: the method comprises a shot-based key frame extraction method, a motion analysis-based key frame extraction method and a video clustering-based key frame extraction method. The method for extracting the key frame based on the shot is the first developed method in the field of video processing and is the most mature general method at present, and the general implementation process of the method is as follows: firstly, a video file is divided according to shot changes, and then a first frame and a last frame are selected from each shot of the video to be used as key frames. The method has the advantages of simple implementation and small calculation amount, but has great limitation, and when the content in the video changes violently and the scene is very complex, the selection of the first frame and the last frame in the shot cannot represent the change of the whole content of the video, so the method cannot meet the standards and requirements of people in the current society for extracting the key frames. The key frame extracting method based on motion analysis is a method for extracting key frames based on the attributes of object motion characteristics, and the implementation process of the method is as follows: and analyzing the optical flow of the object motion in the video shot, and selecting the video frame with the minimum optical flow moving frequency in the video shot as the extracted key frame each time. The method can extract a proper amount of key frames from most video shots, and the extracted key frames can also effectively express the motion characteristics of the video. However, this method has the characteristic of poor robustness, because it not only depends on the local characteristics of the object motion, but also has a complex calculation process, and the algorithm has a large overhead cost in time. The method for extracting the key frame based on the video clustering can divide the video frame into a plurality of clusters through clustering in the process of extracting the key frame, and then select corresponding frames in each cluster as the key frame. The video key frame extracted by the algorithm has low redundancy, and the key frame can accurately reflect all contents generated in the video. However, the clustering-based method does not fully consider the time sequence of the change between frames in the process of dividing the clusters, and a certain number of clusters need to be preset before clustering, so the applicability of the method is limited to a certain extent. In a specific embodiment, there are many methods for extracting key frames from a video, and the key frames of the video can be extracted by using a specific key frame extraction method according to the characteristics of the video itself, or by combining a centralized key frame extraction method with the key frame extraction method. It is understood that any process of extracting key frames from video is covered by the protection scope of the present application.
In some possible implementations, the video acquisition module 601 includes:
the frame number determining unit 6011 is configured to acquire a video to be processed, determine a video key frame of the video to be processed and a frame number of the video key frame, determine the video to be processed as a target video when the frame number of the video key frame is greater than or equal to a frame number threshold, and determine a plurality of video key frames of the video to be processed as a plurality of target video key frames.
The chinese character determining unit 6012 is configured to obtain a video to be processed, determine a video key frame of the video to be processed and a chinese character occurrence rate of the video key frame, determine the video to be processed as a target video when the chinese character occurrence rate of any video key frame in the video to be processed is greater than or equal to an occurrence rate threshold, and determine a plurality of video key frames of the video to be processed as a plurality of target video key frames.
In some possible embodiments, the frame number determination unit 6011 may determine the frame number of video key frames of the video to be processed, and when the number of video key frames is greater than or equal to a frame number threshold (e.g., 5 frames), the video to be processed is suitable as a voice-to-text training corpus; the frame number determination unit 6011 may then determine the video to be processed as the target video and determine the plurality of video key frames of the video to be processed as the plurality of target video key frames.
In some possible embodiments, the chinese character determining unit 6012 may perform text recognition on the video key frame of the video to be processed, so as to determine the occurrence rate of chinese characters in the video key frame of the video to be processed. When the occurrence rate of the chinese character of any video key frame in the video to be processed is greater than or equal to the occurrence rate threshold, it is indicated that the video to be processed includes chinese and is suitable for being used as an audio-to-text corpus, and the chinese character determining unit 6012 may determine the video to be processed as a target video and determine a plurality of video key frames of the video to be processed as a plurality of target video key frames.
In some possible implementations, the video capture module 601 may perform text recognition (e.g., OCR) on the target video key frames after capturing the target video key frames to obtain text positions and text contents in the target video key frames.
The interval dividing module 602 is configured to determine a subtitle recognition interval of the target video according to the text position and the text content in each target video key frame, where different target video key frames correspond to different text contents at the same position in the subtitle recognition interval.
In some possible implementations, the interval division module 602 includes:
the interval repetition removing unit 6021 is configured to determine at least one character position from the character positions of the target video key frames as at least one character recognition interval to be selected, where the number of times of repeated appearance of the character content of the character recognition interval to be selected is smaller than a number threshold.
In some possible embodiments, the position of the subtitle appearing in the video is not fixed, and it is required to determine the subtitle recognition interval of the video in order to accurately recognize the subtitle of the video and further obtain the corpus suitable for the phonetic transcription training. As shown in fig. 5, 200a is a video key frame in which no subtitle appears, 200b is a video key frame in which a subtitle appears in the middle of a picture, 200c is a video key frame in which a subtitle appears above a picture, and 200d is a video key frame in which a subtitle appears with a size change and is vertically arranged. Whereas an ideal video caption interval is shown at 200e in fig. 5, and is a relatively fixed position where captions appear in the respective video key frames (e.g., captions in movie and television videos).
In some possible embodiments, the target video may further include some non-subtitle text, such as a logo, a watermark, and so on. However, the positions and the content of the characters in the video are usually unchanged or slightly changed, and the interval deduplication unit 6021 may eliminate the positions of the characters irrelevant to the subtitles by using this characteristic, so as to further improve the accuracy of extracting the subtitles in the following process. That is, the interval repetition removing unit 6021 may filter the text positions after determining the text positions and the text contents in the key frames of the target videos, and further determine the text intervals to be selected of the target videos.
In some possible implementations, the interval division module 602 includes:
the character recognition module 6023 is configured to determine the text similarity of the character contents appearing in the character positions of each target video key frame, and when the text similarity of the character contents appearing in any two times in any character position is greater than a threshold (for example, 85%), determine the character contents appearing in any two times as the character contents appearing repeatedly, and determine that any one of the character positions is not used as the character recognition interval to be selected.
In some possible implementations, the text recognition module 6023 may determine the text similarity of the text content appearing in the text positions of the target video key frames. When the text similarity of the text contents appearing in any two times in any text position is greater than a threshold (for example, 85%), the text contents appearing in any two times are determined as the repeatedly appearing text contents, and any text position is determined not to be used as the recognition interval of the characters to be selected.
In some possible implementations, the interval division module 602 includes:
the interval determining unit 6022 is configured to determine a jitter degree of each candidate text recognition interval, and determine a candidate text recognition interval with a jitter degree smaller than or equal to a jitter degree threshold as a subtitle recognition interval of the target video.
In some possible embodiments, after the text position is eliminated according to the text content in each target key frame, the section determination unit 6022 may determine the subtitle recognition section of the target video. As shown in fig. 6, the section determining unit 6022 may obtain the candidate text recognition section according to the coordinates of the text position, and further obtain the subtitle recognition section. For example, section determining section 6022 may express the caption identifying section by the abscissa and ordinate of the caption identifying section boundary pixel point, may express the caption identifying section by only the abscissa or ordinate of the caption identifying section boundary pixel point, and may express the caption identifying section by the abscissa and ordinate of the caption identifying section boundary diagonal pixel point.
In some possible embodiments, before determining the subtitle recognition interval of the target video, in order to further improve the accuracy of the subsequent subtitle extraction, the interval determination unit 6022 may eliminate some text positions with larger jitter in the video. That is, the interval determining unit 6022 may determine the jitter degree of each candidate character recognition interval according to at least one candidate character recognition interval, and determine the subtitle recognition interval of the target video according to the candidate character recognition interval with the jitter degree smaller than or equal to the jitter degree threshold (for example, the vertical coordinate jitter of the upper and lower boundaries of the character position does not exceed 6 pixels).
In particular, if the jitter degree of a text position in the target video key frames is too large (exceeds the jitter degree threshold), the subtitle quality in the target video is poor and the video is not suitable as a voice-to-text training corpus. The interval determination unit 6022 may also discard the existing target video, determine a new target video from the videos to be processed through the video acquisition module 601, and perform the foregoing operations again until a target video with a subtitle recognition interval suitable for producing a voice-to-text training corpus is obtained.
The subtitle extraction module 603 is configured to identify a subtitle of the target video according to the subtitle identification interval, so as to obtain a subtitle to be processed from the target video, and perform character processing on the subtitle to be processed according to a preset corpus obtaining rule to obtain the target subtitle of the target video.
In some possible embodiments, as shown in part b of fig. 4, after obtaining the subtitle recognition interval, the subtitle extraction module 603 may recognize (e.g., with OCR) the subtitle of the target video within the subtitle recognition interval to obtain the subtitle to be processed from the target video, and perform subtitle post-processing on the subtitle to be processed. The subtitle post-processing includes, but is not limited to, processing operations such as deduplication, short-phrase merging, word-count checking, and character filtering.
In some possible implementations, the subtitle extraction module 603 includes:
the subtitle dividing unit 6031 is configured to divide the subtitle to be processed into a plurality of subtitle clauses at preset time intervals, deduplicate the subtitle clauses, and merge subtitle clauses with a character length smaller than a character length threshold in the deduplicated subtitle clauses to obtain a merged subtitle to be processed.
In some possible embodiments, the subtitle dividing unit 6031 may obtain corresponding time axis information of the to-be-processed subtitles in the target video while identifying the target video to obtain the to-be-processed subtitles, for example, a first sentence subtitle appears in n second to m second (or j frame to k frame) of the target video. The subtitle dividing unit 6031 may divide the subtitle to be processed into a plurality of subtitle clauses at a preset time interval (e.g., 0.2 second), deduplicate each subtitle clause to prevent repeated subtitles from appearing consecutively for a plurality of times, and merge subtitle clauses in the deduplicated subtitle clauses, where the character length is smaller than a character length threshold (e.g., 2 characters), to prevent the subtitle clauses from being too short, so as to obtain a merged subtitle to be processed.
In some possible implementations, the subtitle extraction module 603 includes:
and a caption screening unit 6032, configured to perform caption clause screening based on the characters in the merged to-be-processed caption to determine a target caption of the target video.
In some possible embodiments, if the subtitle clauses of the subtitle to be processed are too long or too short (e.g., outside the range of 4 to 25 characters), that is, the speech rate in the target video is too fast or too slow (e.g., more than 6 words per second or fewer than 3 words per second), the target video is not suitable as a voice-to-text training corpus. The subtitle filtering unit 6032 may also discard the existing target video, determine a new target video from the videos to be processed through the video obtaining module 601 and the interval dividing module 602, and perform the above operations again until a subtitle to be processed suitable for producing a voice-to-text training corpus is obtained.
In some possible implementations, the subtitle filtering unit 6032 includes a sentence elimination subunit. The sentence elimination subunit is used for removing the subtitle clauses containing uncommon characters from the merged subtitle to be processed, so as to screen out the subtitle clauses that do not contain uncommon characters, wherein the uncommon characters include at least one of letters, numbers and uncommon radicals.
In some feasible implementation manners, because character recognition technology has certain technical limitations, some uncommon characters may not be recognized accurately; such characters generally deviate significantly from the characters actually shown in the target video, and if they were kept in the target subtitle used to generate the voice-to-text training corpus, the subsequent training effect would be affected. Therefore, the subtitle filtering unit 6032 can remove the subtitle clauses containing uncommon characters from the merged subtitle to be processed, so as to screen out the subtitle clauses that do not contain uncommon characters. The uncommon characters include at least one of letters, numbers and uncommon radicals.
And a corpus generating module 604, configured to generate an audio-text training corpus for video speech recognition according to the target video and the target subtitle.
In some possible embodiments, after obtaining the target caption, the corpus generating module 604 may generate a corpus of the voice-to-text according to the target video and the target caption, or extract an audio in the target video, and generate a corpus of the voice-to-text according to the extracted audio and the target caption.
In some possible embodiments, after obtaining the target subtitle, the corpus generating module 604 may verify the character error rate of the target subtitle, and generate the voice-to-text training corpus according to the target video and the target subtitle when the character error rate of the target subtitle is less than an error threshold (e.g., 5%). The character error rate can be calculated by formula 1, which is as follows:
CER = (S + D + I) / N
wherein CER is the character error rate, S is the number of substituted characters, D is the number of deleted characters, I is the number of inserted characters, and N is the total number of characters.
In the embodiment of the application, by acquiring a plurality of target video key frames of a target video, further, determining the character position and the character content from each target video key frame, the subtitle identification interval of the target video can be determined according to the character position and the character content in each target video key frame. It can be understood that different target video key frames correspond to different text contents at the same position in the subtitle identification interval. And performing character recognition on the subtitle of the target video according to the subtitle recognition interval, and obtaining the subtitle to be processed from the target video. And performing post-processing such as dividing, de-duplicating and merging on the to-be-processed subtitles according to the corpus acquisition rule, further removing the subtitles containing the uncommon characters in the merged to-be-processed subtitles in a sentence-by-sentence manner to obtain target subtitles, so that the word error rate of the target subtitles is reduced to be within a standard range, and further generating a phonetic transcription text training corpus for video voice recognition according to the target videos and the target subtitles. Therefore, automatic video acquisition can be realized, the subtitles of the video are extracted and screened, and then the training corpus suitable for the voice-to-text is obtained, the acquisition efficiency of the voice-to-text training corpus is improved, and meanwhile, the corpus quality of the voice-to-text training corpus is improved.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a terminal device provided in an embodiment of the present application. As shown in fig. 10, the terminal device 1000 in this embodiment may include: the processor 1001, the network interface 1004, and the memory 1005, and the terminal apparatus 1000 may further include: a user interface 1003, and at least one communication bus 1002. Wherein the communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a Display screen (Display) and a Keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory). The memory 1005 may optionally also be at least one storage device located remotely from the processor 1001. As shown in fig. 10, the memory 1005, which is a kind of computer-readable storage medium, may include therein an operating system, a network communication module, a user interface module, and a device control application program.
In the terminal device 1000 shown in fig. 10, the network interface 1004 may provide a network communication function; the user interface 1003 is an interface for providing a user with input; and the processor 1001 may be used to invoke a device control application stored in the memory 1005 to implement:
acquiring a plurality of target video key frames of a target video, and determining character positions and character contents from each target video key frame;
determining a subtitle recognition interval of the target video according to the character position and the character content in each target video key frame, wherein the character content of different target video key frames corresponding to the same position in the subtitle recognition interval is different;
identifying the subtitle of the target video according to the subtitle identification interval so as to obtain the subtitle to be processed from the target video, and performing character processing on the subtitle to be processed according to a preset corpus obtaining rule so as to obtain the target subtitle of the target video;
and generating an audio-text training corpus for video voice recognition according to the target video and the target subtitle.
In some possible embodiments, the processor 1001 is configured to:
acquiring a video to be processed, and determining a video key frame of the video to be processed and the frame number of the video key frame;
when the frame number of the video key frame is larger than or equal to the frame number threshold value, determining the video to be processed as a target video, and determining a plurality of video key frames of the video to be processed as a plurality of target video key frames.
In some possible embodiments, the processor 1001 is configured to:
acquiring a video to be processed, and determining a video key frame of the video to be processed and the Chinese character occurrence rate of the video key frame;
when the Chinese character occurrence rate of any video key frame in the video to be processed is greater than or equal to the occurrence rate threshold value, determining the video to be processed as a target video, and determining a plurality of video key frames of the video to be processed as a plurality of target video key frames.
In some possible embodiments, the processor 1001 is configured to:
determining at least one character position from the character positions of each target video key frame as at least one character recognition interval to be selected, wherein the repeated occurrence frequency of the character content of the character recognition interval to be selected is less than a frequency threshold value;
and determining the jitter degree of each character recognition interval to be selected, and determining the character recognition interval to be selected with the jitter degree smaller than or equal to the jitter degree threshold value as the subtitle recognition interval of the target video.
In some possible embodiments, the processor 1001 is further configured to:
determining text similarity of text contents appearing in the text positions of the key frames of each target video;
and when the text similarity of the text contents appearing in any two times in any text position is greater than a threshold value, determining the text contents appearing in any two times as the repeatedly appearing text contents, and determining that any text position is not used as a text identification interval to be selected.
In some possible embodiments, the processor 1001 is configured to:
dividing the subtitle to be processed into a plurality of subtitle clauses according to a preset time interval, removing the duplication of each subtitle clause, and combining the subtitle clauses with the character length smaller than a character length threshold value in the duplicated subtitle clauses to obtain a combined subtitle to be processed;
and performing sentence-by-sentence screening on the subtitles based on the characters in the combined subtitles to be processed to determine the target subtitles of the target video.
In some possible embodiments, the processor 1001 is configured to:
and eliminating the caption clauses containing the uncommon characters in the merged captions to be processed to screen out the caption clauses not containing the uncommon characters, wherein the uncommon characters at least comprise one of letters, numbers and radicals of the uncommon components.
It should be understood that in some possible embodiments, the processor 1001 may be a Central Processing Unit (CPU), and the processor may be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The memory may include both read-only memory and random access memory, and provides instructions and data to the processor. The portion of memory may also include non-volatile random access memory. For example, the memory may also store device type information.
It should be understood that the device control application stored in the memory 1005 described above may include the following functional modules:
the video acquisition module is used for acquiring a plurality of target video key frames of a target video and determining character positions and character contents from each target video key frame;
the interval division module is used for determining a subtitle recognition interval of the target video according to the character position and the character content in each target video key frame, wherein the character content of different target video key frames corresponding to the same position in the subtitle recognition interval is different;
the subtitle extraction module is used for identifying the subtitles of the target video according to the subtitle identification interval so as to obtain the subtitles to be processed from the target video, and performing character processing on the subtitles to be processed according to a preset corpus acquisition rule so as to obtain the target subtitles of the target video;
and the corpus generating module is used for generating an audio-to-text training corpus for video speech recognition according to the target video and the target subtitle.
In some possible embodiments, the video capturing module includes:
the device comprises a frame number determining unit and a processing unit, wherein the frame number determining unit is used for acquiring a video to be processed, determining a video key frame of the video to be processed and the frame number of the video key frame, determining the video to be processed as a target video when the frame number of the video key frame is greater than or equal to a frame number threshold value, and determining a plurality of video key frames of the video to be processed as a plurality of target video key frames.
In some possible embodiments, the video capturing module includes:
the Chinese character determining unit is used for acquiring a video to be processed, determining a video key frame of the video to be processed and the Chinese character occurrence rate of the video key frame, determining the video to be processed as a target video when the Chinese character occurrence rate of any video key frame in the video to be processed is greater than or equal to an occurrence rate threshold value, and determining a plurality of video key frames of the video to be processed as a plurality of target video key frames.
In some possible embodiments, the interval dividing module includes:
the interval duplication removing unit is used for determining at least one character position from the character positions of each target video key frame as at least one character recognition interval to be selected, and the repeated occurrence frequency of the character content of the character recognition interval to be selected is smaller than a frequency threshold value;
and the interval determining unit is used for determining the jitter degree of each character recognition interval to be selected, and determining the character recognition interval to be selected with the jitter degree smaller than or equal to the jitter degree threshold value as the subtitle recognition interval of the target video.
In some possible embodiments, the interval dividing module further includes:
and the character recognition module is used for determining the text similarity of the character contents appearing in the character positions of each target video key frame, determining the character contents appearing twice as the character contents appearing repeatedly when the text similarity of the character contents appearing twice in any character position is greater than a threshold value, and determining that any character position is not used as a character recognition interval to be selected.
In some possible embodiments, the subtitle extraction module includes:
the subtitle dividing unit is used for dividing the subtitle to be processed into a plurality of subtitle clauses according to a preset time interval, removing the duplication of each subtitle clause, and combining the subtitle clauses with the character length smaller than the character length threshold value in the duplicated subtitle clauses to obtain the combined subtitle to be processed;
and the caption screening unit is used for performing caption clause screening based on the characters in the merged captions to be processed so as to determine the target captions of the target video.
In some possible embodiments, the subtitle filtering unit includes:
and the sentence removing subunit is used for removing the subtitle clauses containing uncommon characters from the merged subtitle to be processed so as to screen out the subtitle clauses not containing uncommon characters, wherein the uncommon characters at least comprise one of letters, numbers and uncommon radicals.
In a specific implementation, the terminal device 1000 may execute, through each built-in functional module thereof, the implementation manners provided in each step in fig. 2, fig. 7, and/or fig. 8, which may specifically refer to the implementation manners provided in each step, and are not described herein again.
In the embodiment of the application, by acquiring a plurality of target video key frames of a target video, further, determining the character position and the character content from each target video key frame, the subtitle identification interval of the target video can be determined according to the character position and the character content in each target video key frame. It can be understood that different target video key frames correspond to different text contents at the same position in the subtitle identification interval. And performing character recognition on the subtitle of the target video according to the subtitle recognition interval, and obtaining the subtitle to be processed from the target video. And performing post-processing such as dividing, de-duplicating and merging on the to-be-processed subtitles according to the corpus acquisition rule, further removing the subtitles containing the uncommon characters in the merged to-be-processed subtitles in a sentence-by-sentence manner to obtain target subtitles, so that the word error rate of the target subtitles is reduced to be within a standard range, and further generating a phonetic transcription text training corpus for video voice recognition according to the target videos and the target subtitles. Therefore, automatic video acquisition can be realized, the subtitles of the video are extracted and screened, and then the training corpus suitable for the voice-to-text is obtained, the acquisition efficiency of the voice-to-text training corpus is improved, and meanwhile, the corpus quality of the voice-to-text training corpus is improved.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and is executed by a processor to implement the method provided in each step in fig. 2, fig. 7, and/or fig. 8, which may specifically refer to implementation manners provided in each step, and are not described herein again.
The computer readable storage medium may be an internal storage unit of the task processing device provided in any of the foregoing embodiments, for example, a hard disk or a memory of an electronic device. The computer readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Memory Card (SMC), a Secure Digital (SD) card, a flash card (flash card), and the like, which are provided on the electronic device. The computer readable storage medium may further include a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), and the like. Further, the computer readable storage medium may also include both an internal storage unit and an external storage device of the electronic device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the electronic device. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
In the embodiment of the application, by acquiring a plurality of target video key frames of a target video, further, determining the character position and the character content from each target video key frame, the subtitle identification interval of the target video can be determined according to the character position and the character content in each target video key frame. It can be understood that different target video key frames correspond to different text contents at the same position in the subtitle identification interval. And performing character recognition on the subtitle of the target video according to the subtitle recognition interval, and obtaining the subtitle to be processed from the target video. And performing post-processing such as dividing, de-duplicating and merging on the to-be-processed subtitles according to the corpus acquisition rule, further removing the subtitles containing the uncommon characters in the merged to-be-processed subtitles in a sentence-by-sentence manner to obtain target subtitles, so that the word error rate of the target subtitles is reduced to be within a standard range, and further generating a phonetic transcription text training corpus for video voice recognition according to the target videos and the target subtitles. Therefore, automatic video acquisition can be realized, the subtitles of the video are extracted and screened, and then the training corpus suitable for the voice-to-text is obtained, the acquisition efficiency of the voice-to-text training corpus is improved, and meanwhile, the corpus quality of the voice-to-text training corpus is improved.
The terms "first", "second", and the like in the claims and in the description and drawings of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus. Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments. The term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not to be construed as limiting the scope of the present application, so that the present application is not limited thereto, and all equivalent variations and modifications can be made to the present application.

Claims (10)

1. A method for obtaining a phonetic-to-text corpus, the method comprising:
acquiring a plurality of target video key frames of a target video, and determining character positions and character contents from each target video key frame;
determining a subtitle identification interval of the target video according to the character position and the character content in each target video key frame, wherein the character content of different target video key frames corresponding to the same position in the subtitle identification interval is different;
recognizing the subtitle of the target video according to the subtitle recognition interval so as to obtain a subtitle to be processed from the target video, and performing character processing on the subtitle to be processed according to a preset corpus obtaining rule so as to obtain the target subtitle of the target video;
and generating an audio-text training corpus for voice recognition according to the target video and the target subtitle.
2. The method of claim 1, wherein the obtaining a plurality of target video key frames of a target video comprises:
acquiring a video to be processed, and determining a video key frame of the video to be processed and a frame number of the video key frame;
and when the frame number of the video key frame is greater than or equal to the frame number threshold, determining the video to be processed as a target video, and determining a plurality of video key frames of the video to be processed as a plurality of target video key frames.
3. The method of claim 1, wherein the obtaining a plurality of target video key frames of a target video comprises:
acquiring a video to be processed, and determining a video key frame of the video to be processed and the Chinese character occurrence rate of the video key frame;
when the Chinese character occurrence rate of any video key frame in the video to be processed is greater than or equal to the occurrence rate threshold value, determining the video to be processed as a target video, and determining a plurality of video key frames of the video to be processed as a plurality of target video key frames.
4. The method according to any one of claims 1 to 3, wherein the determining the subtitle recognition interval of the target video according to the text position and the text content in the key frame of each target video comprises:
determining at least one character position from the character positions of the target video key frames as at least one character recognition interval to be selected, wherein the repeated occurrence frequency of the character content of the character recognition interval to be selected is less than a frequency threshold value;
and determining the jitter degree of each character recognition interval to be selected, and determining the character recognition interval to be selected with the jitter degree smaller than or equal to the jitter degree threshold value as the subtitle recognition interval of the target video.
5. The method of claim 4, further comprising:
determining the text similarity of the text content appearing in the text position of each target video key frame;
when the text similarity of the character contents appearing at any two times in any character position is larger than a threshold value, determining the character contents appearing at any two times as character contents appearing repeatedly, and determining that any character position is not used as a character recognition interval to be selected.
6. The method according to claim 5, wherein the performing character processing on the to-be-processed subtitle according to a preset corpus obtaining rule comprises:
dividing the subtitle to be processed into a plurality of subtitle clauses according to a preset time interval, removing the duplication of each subtitle clause, and combining the subtitle clauses with the character length smaller than a character length threshold value in the duplicated subtitle clauses to obtain a combined subtitle to be processed;
and performing caption clause screening based on the characters in the merged captions to be processed to determine the target captions of the target video.
7. The method of claim 6, wherein performing caption clause screening based on the characters in the merged caption to be processed comprises:
removing the caption clauses containing uncommon characters in the merged captions to be processed so as to screen out the caption clauses not containing the uncommon characters;
the uncommon characters at least comprise one of letters, numbers and uncommon radicals.
8. An apparatus for acquiring a phonetic-to-text training corpus, comprising:
the video acquisition module is used for acquiring a plurality of target video key frames of a target video and determining character positions and character contents from each target video key frame;
the interval division module is used for determining a subtitle identification interval of the target video according to the character position and the character content in each target video key frame, wherein the character content of different target video key frames corresponding to the same position in the subtitle identification interval is different;
the subtitle extraction module is used for identifying the subtitles of the target video according to the subtitle identification interval so as to obtain the subtitles to be processed from the target video, and performing character processing on the subtitles to be processed according to a preset corpus obtaining rule so as to obtain the target subtitles of the target video;
and the corpus generating module is used for generating a phonetic-to-text training corpus for video speech recognition according to the target video and the target subtitle.
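The four modules of claim 8 can be pictured as a small pipeline. The composition below is purely illustrative; the module interfaces (extract, divide, recognize, generate) are assumed, not disclosed.

class CorpusAcquisitionPipeline:
    def __init__(self, video_module, interval_module, subtitle_module, corpus_module):
        self.video_module = video_module        # key frames, character positions/contents
        self.interval_module = interval_module  # subtitle recognition interval
        self.subtitle_module = subtitle_module  # subtitle recognition and character processing
        self.corpus_module = corpus_module      # pairs target video with target subtitle

    def run(self, video):
        frames, positions, contents = self.video_module.extract(video)
        interval = self.interval_module.divide(positions, contents)
        target_subtitle = self.subtitle_module.recognize(video, interval)
        return self.corpus_module.generate(video, target_subtitle)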
9. A terminal device, comprising a processor and a memory, the processor and the memory being interconnected;
the memory for storing a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which is executed by a processor to implement the method of any one of claims 1-7.
CN202110282332.3A 2021-03-16 2021-03-16 Method, device and equipment for acquiring phonetic-to-text training corpus and storage medium Pending CN113705300A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110282332.3A CN113705300A (en) 2021-03-16 2021-03-16 Method, device and equipment for acquiring phonetic-to-text training corpus and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110282332.3A CN113705300A (en) 2021-03-16 2021-03-16 Method, device and equipment for acquiring phonetic-to-text training corpus and storage medium

Publications (1)

Publication Number Publication Date
CN113705300A true CN113705300A (en) 2021-11-26

Family

ID=78647825

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110282332.3A Pending CN113705300A (en) 2021-03-16 2021-03-16 Method, device and equipment for acquiring phonetic-to-text training corpus and storage medium

Country Status (1)

Country Link
CN (1) CN113705300A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114938473A (en) * 2022-05-16 2022-08-23 上海幻电信息科技有限公司 Comment video generation method and device
CN114938473B (en) * 2022-05-16 2023-12-12 上海幻电信息科技有限公司 Comment video generation method and comment video generation device
CN115396690A (en) * 2022-08-30 2022-11-25 京东方科技集团股份有限公司 Audio and text combination method and device, electronic equipment and storage medium
CN117998145A (en) * 2024-04-03 2024-05-07 海看网络科技(山东)股份有限公司 Subtitle real-time monitoring method, system and equipment
CN117998145B (en) * 2024-04-03 2024-06-18 海看网络科技(山东)股份有限公司 Subtitle real-time monitoring method, system and equipment

Similar Documents

Publication Publication Date Title
CN109688463B (en) Clip video generation method and device, terminal equipment and storage medium
CN113705300A (en) Method, device and equipment for acquiring phonetic-to-text training corpus and storage medium
US11363344B2 (en) Method and system of displaying subtitles, computing device, and readable storage medium
CN109218629B (en) Video generation method, storage medium and device
EP3499900A2 (en) Video processing method, apparatus and device
CN112287914B (en) PPT video segment extraction method, device, equipment and medium
CN110832583A (en) System and method for generating a summary storyboard from a plurality of image frames
CN111445902A (en) Data collection method and device, storage medium and electronic equipment
CN113035199B (en) Audio processing method, device, equipment and readable storage medium
CN112668559A (en) Multi-mode information fusion short video emotion judgment device and method
WO2022089170A1 (en) Caption area identification method and apparatus, and device and storage medium
CN109274999A (en) A kind of video playing control method, device, equipment and medium
WO2015165524A1 (en) Extracting text from video
CN111429341B (en) Video processing method, device and computer readable storage medium
WO2023045635A1 (en) Multimedia file subtitle processing method and apparatus, electronic device, computer-readable storage medium, and computer program product
CN112507167A (en) Method and device for identifying video collection, electronic equipment and storage medium
CN107203763B (en) Character recognition method and device
CN113205047A (en) Drug name identification method and device, computer equipment and storage medium
CN116645624A (en) Video content understanding method and system, computer device, and storage medium
CN115580758A (en) Video content generation method and device, electronic equipment and storage medium
CN117851639A (en) Video processing method, device, electronic equipment and storage medium
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
CN116233534A (en) Video processing method and device, electronic equipment and storage medium
CN116017088A (en) Video subtitle processing method, device, electronic equipment and storage medium
CN110781345B (en) Video description generation model obtaining method, video description generation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination