CN115690649A - Subtitle processing method and device, electronic equipment and storage medium - Google Patents

Subtitle processing method and device, electronic equipment and storage medium

Info

Publication number: CN115690649A
Application number: CN202211259549.3A
Authority: CN (China)
Prior art keywords: video, subtitle, frame, height, candidate
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 刘芳龙, 李鑫, 李甫, 何栋梁
Current Assignee: Beijing Baidu Netcom Science and Technology Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to: CN202211259549.3A
Publication of: CN115690649A


Abstract

The present disclosure provides a subtitle processing method and apparatus, an electronic device, and a storage medium, relating to the technical field of artificial intelligence, in particular to computer vision and deep learning, and applicable to scenarios such as AI-Generated Content (AIGC). The specific implementation scheme is as follows: acquiring a plurality of target video frames of a video to be processed; detecting the subtitles in each target video frame and determining the height of the subtitles in each target video frame; determining the highest subtitle height of the video to be processed based on the height of the subtitles in each target video frame; and cutting or erasing the subtitles in each video frame of the video to be processed based on the highest subtitle height to obtain a target video with the subtitles removed, thereby achieving intelligent removal of video subtitles.

Description

Subtitle processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, in particular to the fields of computer vision and deep learning, and specifically to a subtitle processing method and apparatus, an electronic device, and a storage medium.
Background
Subtitles are non-image content such as dialogue in television, film, and stage works displayed as text; the term also commonly refers to text added in the post-production of film and television works. When re-editing material such as a subtitled video, the subtitles and audio in the video are typically processed to achieve video re-creation.
Disclosure of Invention
The disclosure provides a subtitle processing method, a subtitle processing device, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a subtitle processing method including:
acquiring a plurality of target video frames of a video to be processed;
detecting the subtitles in each target video frame, and determining the height of the subtitles in each target video frame;
determining the highest subtitle height of the video to be processed based on the height of the subtitle in each target video frame;
and cutting or erasing the subtitles in each video frame of the video to be processed based on the highest subtitle height of the video to be processed to obtain the target video with the subtitles removed.
According to another aspect of the present disclosure, there is provided a subtitle processing apparatus including:
the video frame acquisition module is used for acquiring a plurality of target video frames of a video to be processed;
the subtitle height determining module is used for detecting the subtitles in each target video frame and determining the height of the subtitles in each target video frame;
a highest caption height determining module, configured to determine a highest caption height of the video to be processed based on a height of a caption in each of the target video frames;
and the subtitle removing module is used for cutting or erasing the subtitles in each video frame of the video to be processed based on the highest subtitle height of the video to be processed to obtain the target video with the subtitles removed.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the subtitle processing method of any embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the subtitle processing method of any embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product including a computer program which, when executed by a processor, implements the subtitle processing method of any embodiment of the present disclosure.
The embodiment of the disclosure realizes intelligent removal of video subtitles.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a subtitle processing method according to the present disclosure;
FIG. 2 is a schematic diagram of maximum subtitle height determination according to the present disclosure;
FIG. 3 is another schematic diagram of a subtitle processing method according to the present disclosure;
FIG. 4 is a schematic diagram of an embodiment of subtitle horizontal cropping according to the present disclosure;
FIG. 5 is yet another schematic diagram of a subtitle processing method according to the present disclosure;
FIG. 6a is a schematic diagram of the position information of a subtitle frame to be erased according to the present disclosure;
FIG. 6b is another schematic diagram of the position information of a subtitle frame to be erased according to the present disclosure;
FIG. 7 is an illustration of subtitle frame pixel prediction according to the present disclosure;
FIG. 8 is a schematic diagram of a subtitle processing apparatus according to the present disclosure;
FIG. 9 is a block diagram of an electronic device for implementing a subtitle processing method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Subtitles generally refer to text added in the post-production of film and television works. When material such as a subtitled video needs to be re-edited, subtitle removal is performed on each video frame of the video, and the subtitle-free video file is then edited to achieve video re-creation.
In the related art, professional image processing software such as Photoshop is used to erase subtitles frame by frame from a subtitled video, or the subtitle position of each video frame is determined manually and the subtitles are then cut out one frame at a time.
However, removing video subtitles with professional image processing software such as Photoshop requires the operator to have certain professional skills, and the frame-by-frame manual processing entails a heavy workload and high cost. Manually determining the subtitle position of each video frame and then cutting the subtitles one by one likewise consumes considerable labor cost.
In order to realize the intellectualization of video subtitle removal, the disclosure provides a subtitle processing method, which includes the steps of obtaining a plurality of target video frames of a video to be processed, detecting subtitles in each target video frame, determining the height of the subtitles in each target video frame, determining the highest subtitle height of the video to be processed based on the height of the subtitles in each target video frame, and cutting or erasing the subtitles in each video frame of the video to be processed based on the highest subtitle height of the video to be processed to obtain the target video with the subtitles removed.
In the embodiment of the present disclosure, subtitles in a plurality of target video frames of the video to be processed are detected, the height of the subtitles in each target video frame is determined, and the highest subtitle height of the video to be processed is determined from these heights. The subtitles in each video frame are then cut or erased based on the highest subtitle height to obtain a target video with the subtitles removed. Since the highest subtitle height used for cutting or erasing does not need to be determined manually, intelligent removal of the subtitles of the video to be processed is achieved.
The following describes the subtitle processing method provided by the embodiment of the present disclosure in detail.
The subtitle processing method provided by the embodiment of the present disclosure can be applied to electronic devices such as servers and intelligent terminals, and to scenarios such as intelligent video subtitle removal and AIGC (AI-Generated Content).
Referring to fig. 1, fig. 1 is a schematic flowchart of a subtitle processing method according to an embodiment of the present disclosure, including the following steps:
s101, acquiring a plurality of target video frames of a video to be processed.
The video to be processed is the video from which subtitles are to be removed. Subtitle removal means processing the subtitle region in each video frame of the video to be processed so that no subtitles are displayed in the video frames of the resulting video.
The plurality of target video frames of the video to be processed may be some or all of the video frames in the video to be processed.
In one possible implementation, acquiring a plurality of target video frames of a video to be processed may include: acquiring a video to be processed; and performing frame extraction processing on the video to be processed to obtain a plurality of target video frames.
Generally, each subtitle in the video to be processed is displayed for a period of time rather than flashing by within a single frame, so each subtitle display period corresponds to multiple video frames. To save subtitle detection time and improve subtitle detection efficiency, frame extraction can be performed on the video to be processed to obtain multiple target video frames; subtitle detection is then performed on these frames, so the highest subtitle height of the video to be processed can be determined quickly.
In one example, the video to be processed may be sampled at a preset number of frames per second, so as to retain video frames corresponding to each subtitle as far as possible. The preset frame number can be set according to actual requirements, for example, 2, 3, or 4 frames per second.
In the embodiment of the disclosure, the frame extraction processing is performed on the video to be processed to obtain a plurality of target video frames, and then the subtitle detection is performed on the plurality of target video frames, so that the subtitle detection time is saved, and the subtitle detection efficiency is improved.
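Purely as an illustration of this frame-extraction step, a minimal Python sketch using OpenCV might look as follows; the function name, the default of 3 sampled frames per second, and the choice of OpenCV are assumptions for illustration, not details taken from the patent.

```python
import cv2

def extract_target_frames(video_path: str, frames_per_second: int = 3):
    """Sample a preset number of frames per second from the video to be processed."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if fps metadata is missing
    step = max(int(round(native_fps / frames_per_second)), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:        # keep every `step`-th frame
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```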
And S102, detecting the subtitles in each target video frame, and determining the height of the subtitles in each target video frame.
In an example, for each target video frame, a text detection model may be used to detect text and obtain the position information of the detection frames corresponding to the text in the frame. Based on prior information about video subtitles, the detection frames corresponding to subtitles are then selected from these, and the height of the subtitle detection frame farthest from the bottom edge of the target video frame is taken as the height of the subtitles in that frame.
The text detection model may be a model pre-trained to detect text in video frames, for example an OCR (Optical Character Recognition) model.
The prior information about video subtitles consists of rules constraining the subtitle region. For example, subtitles are generally located at the bottom of the video frame, no higher than 1/3 of the frame height above the bottom edge; the text height of subtitles is generally greater than 5% of the overall frame height; and the center of a subtitle is no closer than 1/8 of the frame width to the left or right edge of the frame.
In an example, for each target video frame, the subtitle in the target video frame may be detected by using a subtitle detection model to obtain position information of a detection frame corresponding to the subtitle in the target video frame, and a height of a subtitle detection frame in the target video frame with a maximum distance from a bottom edge of the target video frame is determined as a height of the subtitle in the target video frame. The caption detection model is obtained by training according to the sample video frame and the position information of the caption detection frame in the sample video frame.
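To make the prior-information filtering concrete, the following is a hedged sketch of how generic text boxes might be reduced to subtitle boxes and a per-frame subtitle height. It assumes (x1, y1, x2, y2) boxes with the origin at the top-left corner and reads the subtitle height as the distance from a box's top edge to the bottom of the frame; the thresholds mirror the priors quoted above, and the function name is hypothetical.

```python
def frame_subtitle_height(frame_h: int, frame_w: int, text_boxes):
    """Filter (x1, y1, x2, y2) text boxes with the subtitle priors and return
    the frame's subtitle height, i.e. the largest distance between a surviving
    box's top edge and the bottom edge of the frame (None if no box survives)."""
    heights = []
    for x1, y1, x2, y2 in text_boxes:
        center_x = (x1 + x2) / 2.0
        if frame_h - y1 > frame_h / 3:                  # must lie in the bottom third
            continue
        if (y2 - y1) <= 0.05 * frame_h:                 # text taller than 5% of the frame
            continue
        if not (frame_w / 8 <= center_x <= frame_w - frame_w / 8):
            continue                                    # center not too close to either side
        heights.append(frame_h - y1)                    # distance of box top from bottom edge
    return max(heights) if heights else None
```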
And S103, determining the highest subtitle height of the video to be processed based on the height of the subtitle in each target video frame.
For example, the average value, the maximum value, the minimum value, or the like of the caption heights in each target video frame may be determined as the highest caption height of the video to be processed.
And S104, cutting or erasing the subtitles in each video frame of the video to be processed based on the highest subtitle height of the video to be processed to obtain the target video with the subtitles removed.
In one example, a region below the highest subtitle height of each video frame of the video to be processed is clipped, or a subtitle detection frame in the region below the highest subtitle height of each video frame of the video to be processed is shielded or subjected to pixel modification, so as to clip or erase subtitles in each video frame of the video to be processed, and obtain a target video with subtitles removed.
In the embodiment of the present disclosure, subtitles in a plurality of target video frames of the video to be processed are detected, the height of the subtitles in each target video frame is determined, and the highest subtitle height of the video is determined from these heights. The subtitles in each video frame are then cut or erased based on the highest subtitle height to obtain a target video with the subtitles removed. Because the highest subtitle height does not need to be determined manually, subtitle removal is intelligent, convenient, and fast, which further improves the convenience of re-editing the video to be processed.
In a possible implementation manner, referring to fig. 2, fig. 2 is a schematic flowchart of a method for determining a maximum subtitle height of a video to be processed according to an embodiment of the present disclosure, including the following steps:
s201, with a preset height from the bottom edge of the target video frame as a starting point and a first preset number of pixel points as a unit, dividing an area below the preset height from the bottom edge of the target video frame to obtain a plurality of candidate height intervals.
It will be appreciated that the height of each video frame in the same video should be the same. For any target video frame, the area below the preset height from the bottom edge of the target video frame is divided by taking the preset height from the bottom edge of the target video frame as a starting point and taking the first preset number of pixel points as a unit, so that a plurality of candidate height intervals can be obtained.
Illustratively, according to the prior information about video subtitles, the preset height may be set to 1/3 of the frame height above the bottom edge of the target video frame, and the first preset number may be set according to actual requirements, for example, 3, 5, or 8 pixels.
S202, based on the height of the subtitles in each target video frame, counting the number of subtitle heights falling in each candidate height interval.
And S203, determining the highest subtitle height of the video to be processed based on the subtitle height in each target video frame contained in the target candidate height interval.
The number of subtitle heights of target video frames falling in each candidate height interval is counted, the target candidate height interval is determined, and the highest subtitle height of the video to be processed is determined from the subtitle heights of the target video frames contained in the target candidate height interval. The target candidate height interval is the candidate height interval closest to the starting point that contains no fewer than a second preset number of subtitle heights.
For example, the second preset number may be set according to actual needs, such as 2, 3, or 5. In one example, the maximum, minimum, or average of the subtitle heights contained in the target candidate height interval may be determined as the highest subtitle height of the video to be processed.
Illustratively, with a second preset number of 3, the target candidate height interval is the candidate height interval closest to the starting point that contains no fewer than 3 subtitle heights. The number of subtitle heights falling in each candidate height interval is counted, the interval closest to the starting point containing at least 3 subtitle heights is determined as the target candidate height interval, and the maximum, minimum, or average of the subtitle heights it contains is determined as the highest subtitle height of the video to be processed.
In the embodiment of the present disclosure, the number of subtitle heights falling in each candidate height interval is counted, and the candidate height interval closest to the starting point that contains no fewer than the second preset number of subtitle heights is determined as the target candidate height interval. Intervals containing fewer subtitle heights are rejected as outliers, so the highest subtitle height of the video to be processed can be determined accurately from the subtitle heights contained in the target candidate height interval.
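The interval logic above can be expressed as a short sketch. It assumes per-frame subtitle heights measured in pixels from the bottom edge, a preset height defaulting to 1/3 of the frame height, and example settings from the text (5-pixel bins, a minimum count of 3, the maximum as the final reduction); the function name is hypothetical.

```python
def highest_subtitle_height(heights, frame_h, preset_height=None,
                            bin_px=5, min_count=3, pick=max):
    """Bin subtitle heights into `bin_px`-pixel intervals starting at
    `preset_height` and return `pick` (max/min/mean) over the interval
    closest to the starting point holding at least `min_count` heights."""
    if preset_height is None:
        preset_height = frame_h / 3          # prior: subtitles sit in the bottom third
    bins = {}
    for h in heights:
        if h is None or h > preset_height:   # skip frames without subtitles or outliers
            continue
        bins.setdefault(int((preset_height - h) // bin_px), []).append(h)
    for idx in sorted(bins):                 # idx 0 is the interval at the starting point
        if len(bins[idx]) >= min_count:      # sparse intervals are rejected as outliers
            return pick(bins[idx])
    return None
```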
In a possible implementation manner, referring to fig. 3, fig. 3 is a schematic flowchart of another subtitle processing method provided by an embodiment of the present disclosure, and includes the following steps:
s301, a plurality of target video frames of the video to be processed are obtained.
S302, detecting the subtitles in each target video frame, and determining the height of the subtitles in each target video frame.
And S303, determining the highest subtitle height of the video to be processed based on the subtitle height in each target video frame.
The implementation processes of steps S301 to S303 may refer to the implementation processes of steps S101 to S103, which is not described herein again in this disclosure.
S304, based on the highest caption height of the video to be processed, cutting the area below the highest caption height of each video frame of the video to be processed to obtain candidate video frames.
S305, performing horizontal clipping on each candidate video frame to obtain a target video with subtitles removed.
After the highest subtitle height of the video to be processed is determined, the area below it in each video frame is cropped to obtain candidate video frames. Because this cropping can noticeably change the aspect ratio of each frame, deforming the subject of the picture and degrading the viewing experience, each candidate video frame is additionally cropped horizontally to obtain the target video with subtitles removed.
In one example, a preset number of pixels is cropped from both sides of the picture of each candidate video frame to obtain the target video with subtitles removed. The preset pixel value can be set according to actual requirements.
Illustratively, 50, 80, or 100 pixels may be cropped from each side of the picture of every candidate video frame to obtain the target video with subtitles removed.
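Expressed as code, this cropping branch might look like the sketch below, assuming frames are NumPy arrays in OpenCV's row-major (height × width) layout, the subtitle height is measured in pixels from the bottom edge, and the 80-pixel side margin is one of the example values above.

```python
def remove_subtitles_by_cropping(frames, highest_subtitle_height: int, side_margin: int = 80):
    """Cut the region below the highest subtitle height off every frame, then
    trim `side_margin` pixels from each side to limit the aspect-ratio change."""
    result = []
    for frame in frames:
        h, w = frame.shape[:2]
        candidate = frame[: h - int(highest_subtitle_height), :]  # drop the subtitle strip
        result.append(candidate[:, side_margin : w - side_margin])
    return result
```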
In the embodiment of the present disclosure, subtitles in a plurality of target video frames are detected, the height of the subtitles in each target video frame is determined, the highest subtitle height of the video is determined from these heights, and the subtitles in each video frame are cut based on the highest subtitle height to obtain a target video with the subtitles removed. The highest subtitle height does not need to be determined manually, so subtitle removal is intelligent, convenient, and fast, improving the convenience of re-editing the video to be processed. Moreover, the area below the highest subtitle height of each video frame is cropped to obtain candidate video frames, which are then cropped horizontally; this limits the change in aspect ratio caused by cropping away the subtitle region, avoids deforming the subject of the picture, and reduces the impact on the viewing experience.
In a possible implementation manner, referring to fig. 4, fig. 4 is a schematic diagram of an implementation manner of horizontal cropping of subtitles provided by an embodiment of the present disclosure, and includes the following steps:
s401, main body detection is carried out on each candidate video frame to obtain the main body position of each candidate video frame.
And aiming at each candidate video frame, detecting the main body of the candidate video frame by using a target detection model to obtain the main body position of the candidate video frame. The target detection model is used for detecting a subject in the video frame and is obtained by training according to the sample video frame and the subject position of the sample video frame. The subject may be a person or an object included in the video frame, and the subject position may specifically be a position of a detection frame corresponding to the subject, such as a coordinate position of an upper left corner and a lower right corner of the detection frame.
S402, determining the horizontal clipping position of each candidate video frame based on the main body position and the preset aspect ratio of each candidate video frame.
And determining the horizontal pixel position of each candidate video frame containing the main body position under the preset aspect ratio according to the main body position and the preset aspect ratio of each candidate video frame, and determining the horizontal pixel position as the corresponding horizontal clipping position of each candidate video frame. Wherein, the preset height-width ratio can be set according to actual requirements.
Illustratively, the preset aspect ratio is 16, after obtaining the subject position of the candidate video frame, determining the subject position of the candidate video frame, and determining the horizontal pixel position of the candidate video frame corresponding to the aspect ratio of 16 as the horizontal cropping position corresponding to the candidate video frame, so as to ensure that the subject of the candidate video frame is not truncated when cropping.
And S403, performing horizontal cutting on each candidate video frame based on the horizontal cutting position of each candidate video frame to obtain a target video with subtitles removed.
In the embodiment of the present disclosure, the horizontal cropping position of each candidate video frame is determined from its subject position and the preset aspect ratio, and each candidate video frame is then cropped horizontally at that position to obtain the target video with subtitles removed. This ensures that the subject is not truncated when cropping and that the cropped target video keeps the original picture proportions, avoiding the subject deformation that a large change in aspect ratio would otherwise cause and reducing the impact on the viewing experience.
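A hedged sketch of the subject-aware horizontal crop follows: the window width is derived from the frame height and a preset width-to-height ratio (16:9 assumed here), and the window is centered on the detected subject box and clamped to the frame so the subject is not truncated. The function name and box format are assumptions.

```python
def subject_aware_crop(frame, subject_box, ratio_w_to_h: float = 16 / 9):
    """Horizontally crop `frame` to width `ratio_w_to_h * height` while
    keeping the detected (x1, y1, x2, y2) subject box inside the window."""
    h, w = frame.shape[:2]
    crop_w = min(int(round(h * ratio_w_to_h)), w)
    x1, _, x2, _ = subject_box
    left = int(round((x1 + x2) / 2.0 - crop_w / 2.0))  # center the window on the subject
    left = max(0, min(left, w - crop_w))               # clamp the window inside the frame
    return frame[:, left : left + crop_w]
```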
In a possible implementation manner, referring to fig. 5, fig. 5 is a schematic flowchart of a subtitle processing method according to an embodiment of the present disclosure, including the following steps:
s501, a plurality of target video frames of the video to be processed are obtained.
And S502, detecting the subtitles in each target video frame, and determining the height of the subtitles in each target video frame.
S503, determining the highest subtitle height of the video to be processed based on the subtitle height in each target video frame.
The implementation processes of steps S501 to S503 may refer to the implementation processes of steps S101 to S103, which are not described herein again in this embodiment of the disclosure.
S504, aiming at each video frame of the video to be processed, detecting subtitles in the video frame to obtain position information of a subtitle frame of the video frame.
In one example, for each video frame of the video to be processed, a text detection model may be used to detect text and obtain the position information of the corresponding detection frames. Based on the prior information about video subtitles, the detection frames corresponding to subtitles are then selected from these, and their position information is determined as the position information of the subtitle frames of the video frame.
The text detection model may be a model pre-trained to detect text in video frames, for example an OCR model.
In an example, for each video frame of the video to be processed, the subtitle detection model may be used to detect the subtitle in the video frame to obtain the position information of the detection frame corresponding to the subtitle in the video frame, and the position information of the detection frame corresponding to the subtitle in the video frame may be determined as the position information of the subtitle frame in the video frame. The caption detection model is obtained by training according to the sample video frame and the position information of the caption detection frame in the sample video frame.
For example, the position information of the subtitle box may be coordinate information of the upper left corner and the lower right corner of the subtitle box, and the like.
S505, based on the position information of the subtitle frame of the video frame and the highest subtitle height of the video to be processed, the position information of the subtitle frame to be erased is determined.
The position information of subtitle frames lying below the highest subtitle height of the video to be processed is determined as the position information of the subtitle frames to be erased; subtitle frames above the highest subtitle height are excluded to avoid false subtitle detections.
For example, as shown in fig. 6a, the video frame contains subtitle frames 1 and 2; subtitle frame 2 lies below the highest subtitle height and is therefore determined as the subtitle frame to be erased, and its position information is the position information of the subtitle frame to be erased. As shown in fig. 6b, the video frame contains subtitle frames 3 and 4, both below the highest subtitle height; both are determined as subtitle frames to be erased, and their position information is the position information of the subtitle frames to be erased.
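For illustration, selecting the subtitle frames to be erased can be as simple as the sketch below, again assuming (x1, y1, x2, y2) boxes with the origin at the top left and heights measured from the bottom edge of the frame; names are hypothetical.

```python
def boxes_to_erase(subtitle_boxes, frame_h: int, highest_subtitle_height: int):
    """Keep only subtitle boxes lying below the highest subtitle height,
    i.e. whose top edge is within that height of the frame's bottom edge."""
    return [
        (x1, y1, x2, y2)
        for x1, y1, x2, y2 in subtitle_boxes
        if frame_h - y1 <= highest_subtitle_height
    ]
```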
S506, determining candidate pixels based on the position information of the subtitle frame to be erased and the pixel information of the region outside the subtitle frame to be erased.
And predicting the pixel information in the subtitle frame to be erased according to the pixel information of the area outside the subtitle frame to be erased to obtain candidate pixel information so as to intelligently fill the pixels in the subtitle frame to be erased by using the candidate pixel information.
In one possible embodiment, the determining the candidate pixels based on the position information of the subtitle frame to be erased and the pixel information of the region outside the subtitle frame to be erased may include:
inputting the position information of the subtitle frame to be erased and the pixel information of the region outside the subtitle frame to be erased into a pre-trained pixel prediction model to predict the pixels of the subtitle frame to be erased so as to obtain candidate pixels; the pre-trained pixel prediction model is obtained by training according to the position information of the sample caption frame in the sample image, the pixel information of the area outside the sample caption frame in the sample image and the pixel information of the sample caption frame.
The pixel prediction model is trained in advance on the position information of sample subtitle frames in sample images, the pixel information of the regions outside the sample subtitle frames, and the pixel information inside the sample subtitle frames. The trained model can then predict the pixels of a subtitle frame to be erased from the frame's position information and the pixel information of the region outside it.
In the embodiment of the present disclosure, the pixel prediction model is trained in advance and used to predict the pixels of the subtitle frame to be erased, so that the pixel information inside the subtitle frame is predicted from the pixel information around the subtitle, and the predicted candidate pixels are then used to fill the subtitle frame intelligently.
S507, replacing the pixel in the subtitle frame to be erased by the candidate pixel to obtain the target video with the subtitle removed.
For example, as shown in fig. 7, the subtitle frame to be erased is subtitle frame 5; if subtitle frame 5 were simply masked, the subject's hand in fig. 7 would be displayed incompletely. In the embodiment of the present disclosure, the position information of subtitle frame 5 and the pixel information of the region outside it are input into the pre-trained pixel prediction model to predict the pixels of subtitle frame 5 and obtain candidate pixels. The candidate pixels then replace the pixels in subtitle frame 5, erasing the original subtitles while keeping the subject's hand fully displayed, so the pixels in subtitle frame 5 are filled intelligently with the candidate pixel information and the target video with subtitles removed is finally obtained.
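The patent's pixel prediction is a trained model; purely as a stand-in to show the mask-and-fill step, the sketch below builds a mask over the subtitle frames to be erased and fills it with OpenCV's classical inpainting. `cv2.inpaint` substitutes for the trained pixel prediction model and is not the patent's method.

```python
import cv2
import numpy as np

def erase_subtitle_boxes(frame, boxes):
    """Replace the pixels inside each (x1, y1, x2, y2) subtitle box with
    values synthesized from the surrounding region."""
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        mask[int(y1):int(y2), int(x1):int(x2)] = 255   # mark pixels to re-synthesize
    return cv2.inpaint(frame, mask, 3, cv2.INPAINT_TELEA)
```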
In the embodiment of the present disclosure, subtitles in a plurality of target video frames of the video to be processed are detected, the height of the subtitles in each target video frame is determined, the highest subtitle height of the video is determined from these heights, and the subtitles in each video frame are then processed based on the highest subtitle height to obtain a target video with the subtitles removed. The highest subtitle height does not need to be determined manually, so subtitle removal is intelligent, convenient, and fast, which further improves the convenience of re-editing the video to be processed. Moreover, for each video frame, the position information of the subtitle frames to be erased is determined from the position information of the frame's subtitle frames and the highest subtitle height; the pixel information inside each subtitle frame is then predicted from the pixel information around the subtitle, and the predicted candidate pixels fill the subtitle frame intelligently, so the subtitles are erased while the original video content is preserved.
In an application scenario, after subtitles of a video to be processed are removed, a target video with subtitles removed can be re-edited or re-created. In a possible implementation manner, after obtaining the target video with subtitles removed, the following steps may be further performed:
step one, adding a preset subtitle to a target video without the subtitle to obtain a candidate video;
and step two, replacing the audio in the candidate video by using the preset dubbing to obtain a new video.
For the target video with subtitles removed, intelligent video editing and production can be realized. A preset subtitle, which can be set according to actual requirements, is added to the target video with subtitles removed; the preset subtitle is then dubbed, and the preset dubbing corresponding to the preset subtitle replaces the audio of the original target video to generate a new video.
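One way to script this re-editing step is to drive FFmpeg from Python, as in the hedged sketch below; FFmpeg is not named by the patent, the sketch assumes an FFmpeg build with libass (required by the `subtitles` filter), and all file names are placeholders.

```python
import subprocess

def re_edit(video_in: str, srt_path: str, dub_audio: str, video_out: str):
    """Burn a preset subtitle file into the de-subtitled video and replace
    its audio track with the preset dubbing."""
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_in,                  # de-subtitled target video
        "-i", dub_audio,                 # preset dubbing
        "-vf", f"subtitles={srt_path}",  # burn in the preset subtitles
        "-map", "0:v", "-map", "1:a",    # video from input 0, audio from input 1
        "-c:v", "libx264", "-c:a", "aac",
        "-shortest", video_out,
    ], check=True)
```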
In the embodiment of the disclosure, after the subtitles of the video to be processed are removed, new subtitles and dubbing can be added to the target video from which the subtitles are removed, so that re-editing or re-creation of the video is realized.
In an application scene, the removal of subtitles from a video to be processed can be combined with other text-to-speech technologies to realize intelligent production of the video and the like.
Illustratively, in the embodiment of the present disclosure, the subtitle processing method may further include: acquiring a video to be processed; for each video frame of the video, detecting the subtitles in the frame to obtain the position information of its subtitle frames and determining that as the position information of the subtitle frames to be erased; inputting the position information of the subtitle frames to be erased and the pixel information of the regions outside them into the pre-trained pixel prediction model to predict the pixels of the subtitle frames to be erased and obtain candidate pixels; and replacing the pixels in the subtitle frames to be erased with the candidate pixels to obtain the target video with subtitles removed.
An embodiment of the present disclosure further provides a subtitle processing apparatus, with reference to fig. 8, the apparatus includes:
a video frame acquiring module 801, configured to acquire multiple target video frames of a video to be processed;
a caption height determining module 802, configured to detect captions in each target video frame, and determine a height of the captions in each target video frame;
a highest caption height determining module 803, configured to determine a highest caption height of the video to be processed based on the height of the caption in each target video frame;
and the subtitle removing module 804 is configured to clip or erase subtitles in each video frame of the video to be processed based on the highest subtitle height of the video to be processed, so as to obtain a target video from which the subtitles are removed.
In the embodiment of the present disclosure, subtitles in a plurality of target video frames of the video to be processed are detected, the height of the subtitles in each target video frame is determined, the highest subtitle height of the video is determined from these heights, and the subtitles in each video frame of the video to be processed are cut or erased based on the highest subtitle height to obtain the target video with the subtitles removed.
In a possible implementation manner, the video frame acquiring module 801 is specifically configured to:
acquiring a video to be processed;
and performing frame extraction processing on the video to be processed to obtain a plurality of target video frames.
In a possible implementation manner, the maximum caption height determining module 803 includes:
the interval division submodule is used for dividing an area below a preset height from the bottom edge of the target video frame by taking the preset height from the bottom edge of the target video frame as a starting point and taking a first preset number of pixel points as a unit to obtain a plurality of candidate height intervals;
the subtitle height counting submodule is used for counting the number of the subtitle heights contained in each candidate height interval based on the height of the subtitle in each target video frame;
the highest subtitle height determining submodule is used for determining the highest subtitle height of the video to be processed based on the height of subtitles in each target video frame contained in a target candidate height interval, the target candidate height interval being the candidate height interval closest to the starting point that contains no fewer than a second preset number of subtitle heights.
In a possible implementation manner, the subtitle removing module 804 includes:
the first subtitle clipping submodule is used for clipping an area below the highest subtitle height of each video frame of the video to be processed based on the highest subtitle height of the video to be processed to obtain a candidate video frame;
and the second subtitle clipping submodule is used for horizontally clipping each candidate video frame to obtain a target video with subtitles removed.
In a possible implementation manner, the second subtitle cropping sub-module is specifically configured to:
performing main body detection on each candidate video frame to obtain the main body position of each candidate video frame;
determining a horizontal cropping position of each candidate video frame based on the subject position and the preset aspect ratio of each candidate video frame;
and horizontally cutting each candidate video frame based on the horizontal cutting position of each candidate video frame to obtain the target video with the subtitles removed.
In a possible implementation manner, the subtitle removing module 804 includes:
the subtitle frame detection submodule is used for detecting subtitles in each video frame of the video to be processed to obtain position information of the subtitle frame of the video frame;
the subtitle frame determining submodule is used for determining the position information of the subtitle frame to be erased based on the position information of the subtitle frame of the video frame and the highest subtitle height of the video to be processed;
the pixel determination submodule is used for determining candidate pixels based on the position information of the subtitle frame to be erased and the pixel information of the area outside the subtitle frame to be erased;
and the pixel filling sub-module is used for replacing the pixels in the subtitle frame to be erased by using the candidate pixels to obtain the target video with the subtitles removed.
In a possible implementation, the pixel determination sub-module is specifically configured to:
inputting the position information of the subtitle frame to be erased and the region pixel information outside the subtitle frame to be erased into a pre-trained pixel prediction model to predict the pixels of the subtitle frame to be erased so as to obtain candidate pixels; the pre-trained pixel prediction model is obtained by training according to the position information of the sample caption frame in the sample image, the pixel information of the area outside the sample caption frame in the sample image and the pixel information of the sample caption frame.
In a possible embodiment, the above apparatus further comprises:
the subtitle adding module is used for adding preset subtitles to the target video without the subtitles to obtain a candidate video;
and the dubbing replacing module is used for replacing the audio in the candidate video by using the preset dubbing to obtain a new video.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the personal information involved all comply with the relevant laws and regulations and do not violate public order and good customs. It should be noted that the head model in this embodiment is not the head model of any specific user and cannot reflect the personal information of any specific user. It should also be noted that the two-dimensional face image in this embodiment comes from a public data set.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
The present disclosure provides an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of any one of the present disclosure.
The present disclosure provides a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of the present disclosure.
A computer program product comprising a computer program is provided by the present disclosure, which when executed by a processor implements the method of any one of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or loaded from a storage unit 908 into a Random Access Memory (RAM) 903. The RAM 903 can also store various programs and data required for the operation of the device 900. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 901 performs the methods and processes described above, such as the subtitle processing method. For example, in some embodiments, the subtitle processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the subtitle processing method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the subtitle processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A subtitle processing method, comprising:
acquiring a plurality of target video frames of a video to be processed;
detecting the subtitles in each target video frame, and determining the height of the subtitles in each target video frame;
determining the highest subtitle height of the video to be processed based on the height of subtitles in each target video frame;
and cutting or erasing the subtitles in each video frame of the video to be processed based on the highest subtitle height of the video to be processed to obtain the target video with the subtitles removed.
2. The method of claim 1, wherein said obtaining a plurality of target video frames of a video to be processed comprises:
acquiring a video to be processed;
and performing frame extraction processing on the video to be processed to obtain a plurality of target video frames.
3. The method of claim 1, wherein the determining a highest caption height of the video to be processed based on the caption height in each of the target video frames comprises:
dividing regions below a preset height from the bottom edge of the target video frame by taking the preset height from the bottom edge of the target video frame as a starting point and taking a first preset number of pixel points as a unit to obtain a plurality of candidate height intervals;
counting the number of the subtitle heights contained in each candidate height interval based on the height of the subtitle in each target video frame;
determining the highest subtitle height of the video to be processed based on the heights of the subtitles in the target video frames that are contained in the target candidate height interval; wherein the target candidate height interval is the candidate height interval that is closest to the starting point and contains no fewer than a second preset number of subtitle heights.
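A sketch may make the interval voting concrete. Heights are assumed to be measured in pixels from the bottom edge of the frame; `start_height`, `bin_size`, and `min_votes` stand in for the preset height, the first preset number, and the second preset number, none of which the claim fixes, and taking the maximum within the winning interval is one plausible reading of the final step:

```python
# Interval voting (claim 3): bin the per-frame subtitle heights into
# fixed-width intervals below the preset height, then pick the interval
# nearest the starting point that collected enough votes.
def estimate_highest_height(heights, start_height=200, bin_size=10, min_votes=3):
    n_bins = (start_height + bin_size - 1) // bin_size
    bins = [[] for _ in range(n_bins)]            # bin 0 lies closest to the start point
    for h in heights:
        if 0 <= h < start_height:
            bins[(start_height - 1 - h) // bin_size].append(h)
    for bucket in bins:                           # scan from the start point downward
        if len(bucket) >= min_votes:
            return max(bucket)                    # highest height in the target interval
    return None                                   # no interval passed the vote threshold
```

Voting across many sampled frames filters out one-off detections, such as a caption-like logo or a transient overlay, so the estimated height tracks the stable subtitle band.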
4. The method according to any one of claims 1 to 3, wherein the cropping the subtitles in each video frame of the video to be processed based on the highest subtitle height of the video to be processed to obtain the subtitle-removed target video comprises:
based on the highest subtitle height of the video to be processed, cutting a region below the highest subtitle height of each video frame of the video to be processed to obtain candidate video frames;
and horizontally cutting each candidate video frame to obtain a target video with subtitles removed.
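Under the same bottom-edge convention, the vertical cut of claim 4 reduces to a single array slice; a sketch assuming NumPy image arrays as produced by OpenCV:

```python
# Vertical cropping (claim 4): drop the band below the highest subtitle
# height, which is measured in pixels from the bottom edge.
import numpy as np

def crop_subtitle_band(frame: np.ndarray, max_height: int) -> np.ndarray:
    h = frame.shape[0]
    return frame[: h - max_height, :]    # keep only the region above the band
```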
5. The method of claim 4, wherein the performing horizontal cropping on each candidate video frame to obtain a de-subtitled target video comprises:
performing main body detection on each candidate video frame to obtain the main body position of each candidate video frame;
determining a horizontal cropping position of each candidate video frame based on the main body position of each candidate video frame and a preset aspect ratio;
and horizontally cutting each candidate video frame based on the horizontal cutting position of each candidate video frame to obtain the target video with subtitles removed.
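The horizontal cut keeps the detected main body centered while restoring a preset aspect ratio. A sketch in which `subject_cx`, the main body's horizontal center from any detection model, is assumed as input, and the 9:16 ratio is an illustrative default:

```python
# Horizontal cropping (claim 5): center a window of the preset aspect
# ratio on the detected main body, clamped to the frame bounds.
import numpy as np

def crop_to_aspect(frame: np.ndarray, subject_cx: int, aspect: float = 9 / 16) -> np.ndarray:
    h, w = frame.shape[:2]
    target_w = min(w, int(round(h * aspect)))    # width implied by the preset ratio
    left = min(max(subject_cx - target_w // 2, 0), w - target_w)
    return frame[:, left : left + target_w]
```

Because the vertical cut of claim 4 runs first, the window width is computed from the already-cropped height, which is why the two steps are ordered this way.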
6. The method according to any one of claims 1 to 3, wherein the erasing subtitles in each video frame of the video to be processed based on the highest subtitle height of the video to be processed to obtain a target video with subtitles removed comprises:
detecting subtitles in each video frame of the video to be processed to obtain position information of a subtitle frame of the video frame;
determining the position information of the subtitle frame to be erased based on the position information of the subtitle frame of the video frame and the highest subtitle height of the video to be processed;
determining candidate pixels based on the position information of the subtitle frame to be erased and the pixel information of the region outside the subtitle frame to be erased;
and replacing the pixels in the subtitle frame to be erased by using the candidate pixels to obtain the target video with the subtitles removed.
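Classical inpainting is one way to realize this replace-from-surrounding-pixels idea; the sketch below uses OpenCV's `cv2.inpaint` purely as a stand-in, whereas the patent's own refinement (claim 7) predicts the pixels with a trained model:

```python
# Erase branch (claim 6): replace the pixels inside the detected subtitle
# box with candidates derived from the region around it.
import cv2
import numpy as np

def erase_subtitle_box(frame: np.ndarray, box: tuple) -> np.ndarray:
    x, y, w, h = box                                  # detector output, pixel coordinates
    mask = np.zeros(frame.shape[:2], dtype=np.uint8)
    mask[y:y + h, x:x + w] = 255                      # 255 marks the pixels to fill
    return cv2.inpaint(frame, mask, 3, cv2.INPAINT_TELEA)
```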
7. The method according to claim 6, wherein the determining candidate pixels based on the position information of the subtitle frame to be erased and the pixel information of the region outside the subtitle frame to be erased comprises:
inputting the position information of the subtitle frame to be erased and the pixel information of the region outside the subtitle frame to be erased into a pre-trained pixel prediction model to predict the pixels of the subtitle frame to be erased, so as to obtain candidate pixels; wherein the pre-trained pixel prediction model is trained on the position information of a sample subtitle frame in a sample image, the pixel information of the region outside the sample subtitle frame in the sample image, and the pixel information of the sample subtitle frame.
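The training recipe described here amounts to masked reconstruction. A minimal PyTorch sketch of the supervision signal, where the encoder-decoder `model` (taking the masked frame concatenated with the mask) and the plain L1 loss are assumptions, not details taken from the patent:

```python
# Training signal for the pixel prediction model (claim 7): the model sees
# the frame with the subtitle box blanked out plus the box mask, and is
# supervised to reproduce the original pixels inside the box.
import torch
import torch.nn.functional as F

def inpaint_loss(model, image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # image: (N, 3, H, W) ground-truth frame; mask: (N, 1, H, W), 1 inside the box
    masked_input = image * (1 - mask)                 # hide the subtitle region
    pred = model(torch.cat([masked_input, mask], dim=1))
    return F.l1_loss(pred * mask, image * mask)       # reconstruct inside the box only
```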
8. The method of any of claims 1-7, further comprising:
adding a preset subtitle to the subtitle-removed target video to obtain a candidate video;
and replacing the audio in the candidate video by using a preset dubbing to obtain a new video.
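The re-creation step can be realized with any muxing tool; an ffmpeg invocation is one possibility, not named by the patent, for burning in the preset subtitles and swapping in the preset dubbing:

```python
# New-video assembly (claim 8), sketched with ffmpeg as one possible tool:
# render a preset subtitle file onto the de-subtitled video and take the
# audio track from a preset dubbing file instead of the original audio.
import subprocess

def add_subs_and_dub(video: str, srt: str, dubbing: str, out: str) -> None:
    subprocess.run(
        ["ffmpeg", "-y", "-i", video, "-i", dubbing,
         "-vf", f"subtitles={srt}",        # burn in the preset subtitles
         "-map", "0:v", "-map", "1:a",     # video from input 0, audio from input 1
         "-shortest", out],
        check=True,
    )
```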
9. A subtitle processing apparatus comprising:
the video frame acquisition module is used for acquiring a plurality of target video frames of a video to be processed;
the subtitle height determining module is used for detecting the subtitles in each target video frame and determining the height of the subtitles in each target video frame;
a highest subtitle height determining module, configured to determine the highest subtitle height of the video to be processed based on the height of subtitles in each of the target video frames;
and the subtitle removing module is used for cutting or erasing the subtitles in each video frame of the video to be processed based on the highest subtitle height of the video to be processed to obtain the target video with the subtitles removed.
10. The apparatus of claim 9, wherein the video frame acquisition module is specifically configured to:
acquiring a video to be processed;
and performing frame extraction processing on the video to be processed to obtain a plurality of target video frames.
11. The apparatus of claim 9, wherein the highest subtitle height determining module comprises:
the interval division submodule is used for taking the position at a preset height from the bottom edge of the target video frame as a starting point and dividing the region below the preset height into a plurality of candidate height intervals, each interval spanning a first preset number of pixel points;
the subtitle height counting submodule is used for counting the number of the subtitle heights contained in each candidate height interval based on the height of the subtitle in each target video frame;
a highest subtitle height determining submodule, configured to determine the highest subtitle height of the video to be processed based on the heights of the subtitles in the target video frames that are contained in the target candidate height interval; wherein the target candidate height interval is the candidate height interval that is closest to the starting point and contains no fewer than a second preset number of subtitle heights.
12. The apparatus according to any one of claims 9-11, wherein the subtitle removing module comprises:
the first subtitle clipping submodule is used for clipping the area below the highest subtitle height of each video frame of the video to be processed based on the highest subtitle height of the video to be processed to obtain a candidate video frame;
and the second subtitle clipping submodule is used for horizontally clipping each candidate video frame to obtain a target video with subtitles removed.
13. The apparatus of claim 12, wherein the second subtitle cropping sub-module is specifically configured to:
performing main body detection on each candidate video frame to obtain the main body position of each candidate video frame;
determining a horizontal cropping position of each candidate video frame based on the main body position of each candidate video frame and a preset aspect ratio;
and horizontally cutting each candidate video frame based on the horizontal cutting position of each candidate video frame to obtain the target video with subtitles removed.
14. The apparatus according to any one of claims 9-11, wherein the subtitle removing module comprises:
the subtitle frame detection submodule is used for detecting subtitles in each video frame of the video to be processed to obtain position information of a subtitle frame of the video frame;
the subtitle frame determining submodule is used for determining the position information of the subtitle frame to be erased based on the position information of the subtitle frame of the video frame and the highest subtitle height of the video to be processed;
the pixel determination submodule is used for determining candidate pixels based on the position information of the subtitle frame to be erased and the pixel information of the area outside the subtitle frame to be erased;
and the pixel filling submodule is used for replacing the pixels in the subtitle frame to be erased by using the candidate pixels to obtain the target video with the subtitles removed.
15. The apparatus of claim 14, wherein the pixel determination submodule is specifically configured to:
inputting the position information of the subtitle frame to be erased and the pixel information of the region outside the subtitle frame to be erased into a pre-trained pixel prediction model to predict the pixels of the subtitle frame to be erased, so as to obtain candidate pixels; wherein the pre-trained pixel prediction model is trained on the position information of a sample subtitle frame in a sample image, the pixel information of the region outside the sample subtitle frame in the sample image, and the pixel information of the sample subtitle frame.
16. The apparatus of any of claims 9-15, further comprising:
the subtitle adding module is used for adding preset subtitles to the subtitle-removed target video to obtain a candidate video;
and the dubbing replacing module is used for replacing the audio in the candidate video by using preset dubbing to obtain a new video.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
CN202211259549.3A 2022-10-14 2022-10-14 Subtitle processing method and device, electronic equipment and storage medium Pending CN115690649A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211259549.3A CN115690649A (en) 2022-10-14 2022-10-14 Subtitle processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115690649A (en) 2023-02-03

Family

ID=85066468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211259549.3A Pending CN115690649A (en) 2022-10-14 2022-10-14 Subtitle processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115690649A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination