CN114648751A - Method, device, terminal and storage medium for processing video subtitles - Google Patents

Method, device, terminal and storage medium for processing video subtitles

Info

Publication number
CN114648751A
CN114648751A (application CN202011492949.XA)
Authority
CN
China
Prior art keywords
video
frame
target image
image
determining
Prior art date
Legal status
Pending
Application number
CN202011492949.XA
Other languages
Chinese (zh)
Inventor
林染染
张传昊
刘阳兴
Current Assignee
Wuhan TCL Group Industrial Research Institute Co Ltd
Original Assignee
Wuhan TCL Group Industrial Research Institute Co Ltd
Priority date
Filing date
Publication date
Application filed by Wuhan TCL Group Industrial Research Institute Co Ltd filed Critical Wuhan TCL Group Industrial Research Institute Co Ltd
Priority to CN202011492949.XA
Publication of CN114648751A
Legal status: Pending

Abstract

The invention is applicable to the field of computer technology and provides a method, a device, a terminal and a storage medium for processing video subtitles. The method includes: acquiring a target image corresponding to a video to be processed; determining character edge features of the target image; determining contour features of the target image; and determining a video subtitle region corresponding to the target image according to the character edge features and the contour features. In this manner, the terminal extracts not only the character edge features of the target image corresponding to the video to be processed but also its contour features, and the video subtitle region determined from both the character edge features and the contour features effectively filters out the background text in the target image, so that the obtained video subtitle region is more accurate.

Description

Method, device, terminal and storage medium for processing video subtitles
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for processing video subtitles, a terminal for processing video subtitles, and a storage medium.
Background
With the continuous growth of internet video content, retrieving the desired videos from a massive number of videos has become very important. Traditional video retrieval based on keyword descriptions cannot meet the demands of large-scale video retrieval because of its limited descriptive power, strong subjectivity and other drawbacks.
Text information reflecting video content can instead be obtained by recognizing and translating the subtitles in a video; this, however, requires first detecting the position of the subtitles in the video. Existing video subtitle positioning methods perform text recognition directly on the video image, so a large amount of background text in the image is also extracted and an accurate video subtitle region cannot be obtained.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, an apparatus, a terminal and a storage medium for processing video subtitles, so as to solve the problem that the existing video subtitle positioning method performs text recognition directly on a video image, which extracts a large amount of background text from the video image and therefore fails to yield an accurate video subtitle region.
A first aspect of an embodiment of the present invention provides a method for processing video subtitles, including:
acquiring a target image corresponding to a video to be processed; the target image is determined according to a plurality of video frames in the video to be processed;
determining character edge characteristics of the target image;
determining the contour features of the target image;
and determining a video subtitle area corresponding to the target image according to the character edge characteristics and the outline characteristics.
In the above manner, the terminal extracts not only the character edge features of the target image corresponding to the video to be processed but also its contour features, and the video subtitle region determined from both the character edge features and the contour features effectively filters out the background text in the target image, so that the obtained video subtitle region is more accurate. This helps the terminal subsequently recognize and translate the subtitles in the video accurately according to the video subtitle region, and improves the accuracy of video retrieval.
Optionally, the target image includes a plurality of video frames in the video to be processed, or includes a fused image obtained by performing image fusion processing on the plurality of video frames.
Optionally, when the target image includes a fused image obtained by performing image fusion processing on the plurality of video frames, the video to be processed includes N video frames, N being an integer greater than 1, and the acquiring a target image corresponding to the video to be processed includes:
performing image fusion processing on the fused image corresponding to the i-th video frame and the (i+1)-th video frame to obtain a fused image corresponding to the (i+1)-th video frame;
wherein the fused image corresponding to the i-th video frame is obtained by performing image fusion processing on the fused image corresponding to the (i-1)-th video frame and the i-th video frame, i takes the values 2, 3, ..., N-1, and the fused image corresponding to the 1st video frame is the 1st video frame itself;
and when i is N-1, the fused image corresponding to the (i+1)-th video frame is the target image corresponding to the video to be processed.
Optionally, the determining, according to the text edge feature and the contour feature, a video subtitle region corresponding to the target image includes:
performing feature fusion processing on the character edge features and the outline features to obtain crossed edge features;
performing morphological operation on the crossed edge features to obtain a binary image;
and performing hole filling processing on the binary image to obtain the video subtitle area.
Optionally, the determining the text edge feature of the target image includes:
inputting the target image into a trained feature extraction model for processing to obtain character edge features corresponding to the target image; the feature extraction model is obtained by training a sample image set based on an initial feature extraction network; the sample image set comprises a plurality of sample images and text edge features corresponding to each sample image.
Optionally, the determining the contour feature of the target image includes:
carrying out noise reduction processing on the target image to obtain a noise reduction image;
and performing adaptive threshold binarization processing on the noise-reduced image to obtain the contour features.
Optionally, when the target image includes a plurality of video frames in the video to be processed, the determining text edge features of the target image includes:
extracting character edge characteristics of each video frame in the plurality of video frames;
the determining the contour feature of the target image comprises: extracting contour features of each of the plurality of video frames;
determining a video subtitle region corresponding to the target image according to the character edge feature and the contour feature, including: aiming at each video frame, obtaining a subtitle area corresponding to each video frame according to the character edge characteristics and the outline characteristics corresponding to each video frame; and fusing the subtitle areas respectively corresponding to the plurality of video frames to obtain the video subtitle areas.
Optionally, the fusing the subtitle regions respectively corresponding to the plurality of video frames to obtain the video subtitle regions includes:
determining overlapped areas in the caption areas respectively corresponding to the plurality of video frames;
and determining the video caption area based on the overlapped area.
Optionally, after determining the video subtitle region corresponding to the target image according to the text edge feature and the contour feature, the method further includes:
determining a rectangular frame corresponding to the video caption area to obtain a caption frame corresponding to the video caption area;
and displaying the caption frame and the video caption in the caption frame.
A second aspect of embodiments of the present application provides an apparatus for processing video subtitles, including:
the acquisition unit is used for acquiring a target image corresponding to a video to be processed; the target image is determined according to a plurality of video frames in the video to be processed;
the first extraction unit is used for determining character edge characteristics of the target image;
the second extraction unit is used for determining the contour feature of the target image;
and the determining unit is used for determining the video subtitle area corresponding to the target image according to the character edge characteristic and the outline characteristic.
Optionally, the target image includes a plurality of video frames in the video to be processed, or includes a fused image obtained by performing image fusion processing on the plurality of video frames.
Optionally, the video to be processed includes N video frames, where N is an integer greater than 1, and the obtaining unit is specifically configured to:
performing image fusion processing on the fused image corresponding to the i-th video frame and the (i+1)-th video frame to obtain a fused image corresponding to the (i+1)-th video frame;
wherein the fused image corresponding to the i-th video frame is obtained by performing image fusion processing on the fused image corresponding to the (i-1)-th video frame and the i-th video frame, i takes the values 2, 3, ..., N-1, and the fused image corresponding to the 1st video frame is the 1st video frame itself;
and when i is N-1, the fused image corresponding to the (i+1)-th video frame is the target image corresponding to the video to be processed.
Optionally, the determining unit is specifically configured to:
performing feature fusion processing on the character edge features and the outline features to obtain crossed edge features;
performing morphological operation on the crossed edge features to obtain a binary image;
and performing hole filling processing on the binary image to obtain the video subtitle area.
Optionally, the first extraction unit is specifically configured to:
inputting the target image into a trained feature extraction model for processing to obtain character edge features corresponding to the target image; the feature extraction model is obtained by training a sample image set based on an initial feature extraction network; the sample image set comprises a plurality of sample images and text edge features corresponding to each sample image.
Optionally, the second extraction unit is specifically configured to:
carrying out noise reduction processing on the target image to obtain a noise reduction image;
and performing adaptive threshold binarization processing on the noise-reduced image to obtain the contour features.
Optionally, when the target image comprises a plurality of video frames in the video to be processed,
the first extraction unit is specifically configured to: extracting character edge characteristics of each video frame in the plurality of video frames;
the second extraction unit is specifically configured to: extracting contour features of each of the plurality of video frames;
the determining unit is specifically configured to:
aiming at each video frame, obtaining a subtitle area corresponding to each video frame according to the character edge characteristics and the outline characteristics corresponding to each video frame; and fusing the subtitle areas respectively corresponding to the plurality of video frames to obtain the video subtitle areas.
Optionally, the fusing the subtitle regions respectively corresponding to the plurality of video frames to obtain the video subtitle regions includes:
determining overlapped areas in the caption areas respectively corresponding to the plurality of video frames;
determining the video subtitle region based on the coinciding region.
Optionally, the apparatus further comprises:
the caption frame determining unit is used for determining a rectangular frame corresponding to the video caption area to obtain a caption frame corresponding to the video caption area;
and the display unit is used for displaying the caption frame and the video caption in the caption frame.
A third aspect of embodiments of the present application provides a terminal for processing video subtitles, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method for processing video subtitles according to the first aspect.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, which stores a computer program, which when executed by a processor implements the steps of the method for processing video subtitles according to the first aspect.
A fifth aspect of embodiments of the present application provides a computer program product, which, when run on a terminal for processing video subtitles, causes the terminal for processing video subtitles to perform the steps of the method for processing video subtitles of the first aspect.
The method for processing the video caption, the device for processing the video caption, the terminal for processing the video caption and the storage medium provided by the embodiment of the application have the following beneficial effects:
according to the embodiment of the application, a terminal acquires a target image corresponding to a video to be processed; the target image is determined according to a plurality of video frames in the video to be processed; determining character edge characteristics of a target image; determining the contour feature of the target image; and determining a video subtitle area corresponding to the target image according to the character edge characteristics and the outline characteristics. In the above manner, the terminal not only extracts the character edge features of the target image corresponding to the video to be processed, but also extracts the contour features of the target image corresponding to the video to be processed, and the extracted character edge features and contour features are fused, so that background characters in the target image can be effectively filtered, and an accurate video subtitle region can be obtained. The method is beneficial to accurately identifying and translating the subtitles in the video according to the video subtitle region by a subsequent terminal, improves the accuracy of video retrieval, and is beneficial to video analysis. The text edge features corresponding to the target image are extracted based on the trained feature extraction model, the contour features of the target image are extracted based on the adaptive threshold binarization method, the feature extraction method is low in complexity and small in calculated amount, resources are saved, and the speed of determining the video subtitle region is increased.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a flowchart illustrating an implementation of a method for processing video subtitles according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating an implementation of a method for processing video subtitles according to another embodiment of the present invention;
fig. 3 is a flowchart illustrating an implementation of a method for processing video subtitles according to another embodiment of the present invention;
fig. 4 is a schematic diagram of an apparatus for processing video subtitles according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a terminal for processing video subtitles according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
In the prior art, text information reflecting video content is obtained by identifying and translating subtitles in a video, and retrieval of the video is realized through the text information. However, to recognize and translate subtitles in a video, the position of the subtitles in the video needs to be detected first. In the existing video subtitle positioning method, character recognition is directly performed on a video image, so that a large amount of background characters in the video image are extracted, and an accurate video subtitle area cannot be obtained.
In view of this, the present application provides a method for processing video subtitles, in which a terminal acquires a target image corresponding to a video to be processed; the target image is determined according to a plurality of video frames in the video to be processed; determining character edge characteristics of the target image; determining the contour feature of the target image; and determining a video subtitle area corresponding to the target image according to the character edge characteristic and the outline characteristic. In the above manner, the terminal not only extracts the character edge features of the target image corresponding to the video to be processed, but also extracts the contour features of the target image corresponding to the video to be processed, and the extracted character edge features and contour features are fused, so that background characters in the target image can be effectively filtered, and an accurate video subtitle region can be obtained. The method is beneficial to accurately identifying and translating the subtitles in the video according to the video subtitle region by a subsequent terminal, improves the accuracy of video retrieval, and is beneficial to video analysis. The text edge features corresponding to the target image are extracted based on the trained feature extraction model, the contour features of the target image are extracted based on the adaptive threshold value binarization method, the feature extraction method is low in complexity and small in calculation amount, resources are saved, and the speed of determining the video subtitle region is improved.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a method for processing a video subtitle according to an embodiment of the present invention. The method in this embodiment may be executed by a terminal, a server, or the like, where the terminal includes, but is not limited to, mobile terminals such as smart phones, tablet computers and Personal Digital Assistants (PDAs), as well as terminals such as desktop computers. In this embodiment, a terminal is taken as the execution subject by way of example, and the method for processing a video subtitle shown in fig. 1 may include:
s101: acquiring a target image corresponding to a video to be processed; the target image is determined according to a plurality of video frames in the video to be processed.
And the terminal acquires a target image corresponding to the video to be processed. The video to be processed refers to a video in a video subtitle area needing to be positioned, and the type and duration of the video to be processed are not limited. For example, the video to be processed may be composed of 5 to 15 video frames. The target image corresponding to the video to be processed is determined based on a plurality of video frames in the video to be processed, and the plurality of video frames are continuous video frames.
Illustratively, the target image corresponding to the video to be processed may include a plurality of video frames in the video to be processed. For example, the target image may be 8 video frames constituting the video to be processed. The target image corresponding to the video to be processed may also include a fused image obtained by performing image fusion processing on a plurality of consecutive video frames constituting the video to be processed. For example, the video to be processed is composed of 5 video frames, and a fused image obtained by performing image fusion processing on the 5 video frames is the target image.
Illustratively, when the target image includes a plurality of video frames in the video to be processed, after the terminal acquires the video to be processed, the terminal directly extracts the plurality of video frames constituting the video to be processed, so as to acquire the target image corresponding to the video to be processed. Optionally, several consecutive video frames may be selected from a plurality of video frames constituting the video to be processed as the target image corresponding to the video to be processed. For example, the video to be processed is composed of 15 video frames, the terminal selects 8 video frames from the 15 video frames, and the 8 video frames are used as the target images corresponding to the video to be processed.
Illustratively, when the target image includes a fused image obtained by performing image fusion processing on a plurality of video frames, after the terminal acquires the video to be processed, the terminal extracts the plurality of video frames constituting the video to be processed, performs image fusion processing on the extracted plurality of video frames, and takes the obtained fused image as the target image corresponding to the video to be processed.
Optionally, in a possible implementation manner, when the target image includes a fused image obtained by performing image fusion processing on a plurality of video frames, S101 may include S1011, specifically as follows:
S1011: performing image fusion processing on the fused image corresponding to the i-th video frame and the (i+1)-th video frame to obtain a fused image corresponding to the (i+1)-th video frame; the fused image corresponding to the i-th video frame is obtained by performing image fusion processing on the fused image corresponding to the (i-1)-th video frame and the i-th video frame, i takes the values 2, 3, ..., N-1, and the fused image corresponding to the 1st video frame is the 1st video frame itself; when i is N-1, the fused image corresponding to the (i+1)-th video frame is the target image corresponding to the video to be processed.
The video to be processed includes N video frames, where N is an integer greater than 1. A fused image represents an image obtained by fusing the pixel values of two video frames. The terminal performs image fusion processing on the fused image corresponding to the i-th video frame and the (i+1)-th video frame among the N video frames of the video to be processed, and obtains the fused image corresponding to the (i+1)-th video frame. Here i takes the values 2, 3, ..., N-1, the (i+1)-th video frame is the video frame immediately following the i-th video frame, and the (i-1)-th video frame is the video frame immediately preceding it. For example, when i is 3, the 4th video frame is the (i+1)-th video frame and the 2nd video frame is the (i-1)-th video frame. It should be noted that the fused image corresponding to the 1st video frame is the 1st video frame itself.
The terminal may perform the image fusion processing as follows: denoise the two video frames to be fused to obtain a first denoised image and a second denoised image; convert the two denoised images to grayscale to obtain a first grayscale image and a second grayscale image; multiply each grayscale image by the weight value preset for its video frame to obtain a weighted image corresponding to the first grayscale image and a weighted image corresponding to the second grayscale image; and add the pixel values of the corresponding pixels in the two weighted images, the resulting image being recorded as the fused image corresponding to the later of the two video frames.
Take N = 8 as an example. When the video to be processed contains 8 video frames, i takes the values 2, 3, 4, 5, 6, 7. It should be noted that the fused image corresponding to the 1st video frame is the 1st video frame itself. The 1st video frame and the 2nd video frame are fused to obtain the fused image corresponding to the 2nd video frame. For example, the image fusion processing may proceed as follows: the terminal denoises the 1st video frame and the 2nd video frame respectively to obtain a first denoised image corresponding to the 1st video frame and a second denoised image corresponding to the 2nd video frame; the terminal converts the two denoised images to grayscale to obtain a first grayscale image and a second grayscale image; the grayscale image of each video frame is then multiplied by the weight value preset for that frame. For example, if the weight value of the 1st video frame is 1 and the weight value of the 2nd video frame is 0.8, the pixel value of each pixel in the first grayscale image is multiplied by 1 to obtain the weighted image corresponding to the first grayscale image, and the pixel value of each pixel in the second grayscale image is multiplied by 0.8 to obtain the weighted image corresponding to the second grayscale image. The pixel values of the corresponding pixels in the two weighted images are added, and the resulting image is recorded as the fused image corresponding to the 2nd video frame.
When i is 2, the fused image corresponding to the 2nd video frame and the 3rd video frame are fused to obtain the fused image corresponding to the 3rd video frame; the specific fusion process may refer to the fusion of the 1st and 2nd video frames described above and is not repeated here. When i is 3, the fused image corresponding to the 3rd video frame and the 4th video frame are fused to obtain the fused image corresponding to the 4th video frame. By analogy, when i is N-1, that is, i is 7, the fused image corresponding to the 7th video frame and the 8th video frame are fused to obtain the fused image corresponding to the 8th video frame. At this point, the fused image corresponding to the 8th video frame is the target image corresponding to the video to be processed.
It should be noted that, in order to determine the video caption area more accurately, when a plurality of video frames are selected, a small number of consecutive video frames can be selected, so as to ensure that the video caption corresponding to the plurality of video frames does not change greatly, and accordingly, the front-back change of the video caption area is not too large, thereby ensuring that the determined video caption area is more accurate. In this embodiment, the terminal performs image fusion processing on the plurality of video frames to obtain a fused image, and this processing manner for the video frames can effectively solve the problem of background change of the video frames, so that it is more accurate to determine the video subtitle region based on the fused image subsequently.
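A minimal sketch of this iterative fusion, written with OpenCV for illustration only, is given below; the Gaussian denoising, grayscale conversion and the weights 1.0/0.8 follow the example above, while the final rescaling to the 8-bit range is an added assumption so that the accumulated sum remains a displayable image.

```python
import cv2
import numpy as np

def fuse_frames(frames, w_prev=1.0, w_curr=0.8):
    """frames: list of N consecutive BGR video frames of equal size."""
    # The fused image of the 1st frame is the 1st frame itself
    # (here denoised and converted to grayscale).
    fused = cv2.cvtColor(cv2.GaussianBlur(frames[0], (3, 3), 0),
                         cv2.COLOR_BGR2GRAY).astype(np.float32)
    for frame in frames[1:]:
        gray = cv2.cvtColor(cv2.GaussianBlur(frame, (3, 3), 0),
                            cv2.COLOR_BGR2GRAY).astype(np.float32)
        # Weighted pixel-wise sum of the accumulated fusion result and
        # the next frame (weights 1 and 0.8 in the example above).
        fused = w_prev * fused + w_curr * gray
    # Rescale back to the 8-bit range, since the accumulated sum can exceed 255.
    return cv2.normalize(fused, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```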
S102: and extracting character edge features of the target image.
And the terminal extracts the character edge characteristics of the target image, wherein the character edge characteristics are used for representing the characteristics corresponding to the character edges of the characters contained in the target image. Illustratively, the text edge feature includes a position corresponding to a text included in the target image and a text box corresponding to the text. Wherein the size, shape and position of the text box follows the corresponding text. For example, three groups of characters are shared in a target image, the first group of characters is vertically distributed on the left side of the target image, the second group of characters is horizontally distributed in the middle of the target image, and the third group of characters is horizontally distributed below the target image. At this time, the character edge feature of the target image includes position information corresponding to the first group of characters and a vertical text box corresponding to the first group of characters, position information corresponding to the second group of characters and a horizontal text box corresponding to the second group of characters, position information corresponding to the third group of characters and a horizontal text box corresponding to the third group of characters.
Illustratively, when the target image comprises a plurality of video frames in the video to be processed, the terminal extracts the text edge features of the target image, that is, extracts the text edge features corresponding to each video frame respectively. When the target image comprises a fusion image obtained by carrying out image fusion processing on a plurality of video frames, the terminal extracts the character edge characteristics of the target image, namely the character edge characteristics corresponding to the fusion image.
Specifically, the terminal can extract the character edge features of the target image through the trained feature extraction model. For example, the terminal may extract Text edge features of the target image through an Efficient and accurate Scene Text detection model (EAST). Exemplarily, the terminal extracting the text edge feature of the target image through the trained feature extraction model may be represented by the following formula:
MASK_east = f(Image_sum),    (1)
in the above formula (1), Image_sum represents the target image, MASK_east represents the character edge features of the target image, and f(·) represents processing of the target image by the trained feature extraction model.
Optionally, the terminal may also extract text edge features of the target image by an edge detection method.
Optionally, in a possible implementation manner, when the terminal extracts the text edge feature of the target image through the trained feature extraction model, the above S102 may include S1021, specifically as follows:
s1021: inputting a target image into a trained feature extraction model for processing to obtain character edge features corresponding to the target image; the feature extraction model is obtained by training a sample image set based on an initial feature extraction network.
The trained feature extraction model is obtained by training on the sample image set based on the initial feature extraction network. The sample image set comprises a plurality of sample images and the text edge features corresponding to each sample image. During training, a sample image is input into the initial feature extraction network for processing, and the network outputs predicted text edge features for that sample image; a loss value is calculated between the predicted text edge features and the text edge features of the sample image recorded in the sample image set. When the loss value is detected to be larger than a preset loss threshold, the network parameters of the initial feature extraction network are adjusted and training continues on the sample images in the sample image set with the adjusted network; when the loss value is detected to be less than or equal to the preset loss threshold, training stops and the initial feature extraction network at that moment is taken as the trained feature extraction model.
Take the extraction of the text edge features corresponding to the fused image as an example. The fused image is input into the trained feature extraction model, which predicts text boxes on the fused image and applies non-maximum suppression to the predicted text boxes to obtain the text edge features corresponding to the fused image. Specifically, the feature extraction model extracts feature maps of the fused image at different levels, for example feature maps at 1/32, 1/16, 1/8 and 1/4 of the original size. The extracted feature maps are merged layer by layer, and the finally merged feature map is sent to an output layer. The output layer of the feature extraction model projects the feature map onto a geometry feature map and outputs it, thereby obtaining the text edge features corresponding to the fused image. The terminal extracts the text edge features corresponding to individual video frames in a similar way, which is not repeated here.
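As an illustration only, the sketch below obtains a rough text-region mask with a pre-trained EAST model loaded through OpenCV's dnn module; the model file name, the 320×320 input size and the 0.5 score threshold are assumptions of the sketch, and the score map is simply thresholded and resized instead of decoding the geometry output into text boxes with non-maximum suppression as described above.

```python
import cv2
import numpy as np

def text_edge_mask(image, model_path="frozen_east_text_detection.pb",
                   score_thresh=0.5):
    """Rough approximation of MASK_east: a binary map of likely text pixels."""
    if image.ndim == 2:                      # the fused image may be grayscale
        image = cv2.cvtColor(image, cv2.COLOR_GRAY2BGR)
    h, w = image.shape[:2]
    net = cv2.dnn.readNet(model_path)        # pre-trained EAST network (assumed file)
    # EAST expects an input whose sides are multiples of 32.
    blob = cv2.dnn.blobFromImage(image, 1.0, (320, 320),
                                 (123.68, 116.78, 103.94),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    scores = net.forward("feature_fusion/Conv_7/Sigmoid")   # text score map
    score_map = scores[0, 0]                 # quarter-resolution score map
    mask = (score_map > score_thresh).astype(np.uint8) * 255
    # Bring the low-resolution score mask back to the original image size.
    return cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)
```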
S103: and extracting the contour features of the target image.
And the terminal extracts the contour features of the target image. Illustratively, the contour feature of the target image can be extracted through differential operation, and can also be extracted through an adaptive threshold value binarization method. Taking differential operation as an example for explanation, the terminal filters noise in the target image through smoothing processing, then performs first order differential operation or second order differential operation, calculates to obtain a maximum gradient value, selects a proper threshold value to extract a boundary, and obtains the contour feature of the target image.
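For illustration only, a gradient-based sketch of this differential-operation approach is given below; the Gaussian smoothing, the Sobel operator and the threshold value 60 are assumptions of the sketch rather than values specified by this embodiment.

```python
import cv2
import numpy as np

def contour_by_gradient(target_image, thresh=60):
    """Contour extraction by differential operation: smooth the image,
    take first-order derivatives, then threshold the gradient magnitude."""
    gray = (target_image if target_image.ndim == 2
            else cv2.cvtColor(target_image, cv2.COLOR_BGR2GRAY))
    smoothed = cv2.GaussianBlur(gray, (5, 5), 0)          # filter out noise first
    gx = cv2.Sobel(smoothed, cv2.CV_32F, 1, 0)            # first-order derivative in x
    gy = cv2.Sobel(smoothed, cv2.CV_32F, 0, 1)            # first-order derivative in y
    magnitude = cv2.magnitude(gx, gy)                     # gradient magnitude
    return (magnitude > thresh).astype(np.uint8) * 255    # keep the strong boundaries
```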
Illustratively, when the target image comprises a plurality of video frames in the video to be processed, the terminal extracts the contour features of the target image, that is, extracts the contour features corresponding to each video frame respectively. When the target image comprises a fusion image obtained by carrying out image fusion processing on a plurality of video frames, the terminal extracts the contour features of the target image, namely extracts the contour features corresponding to the fusion image.
Optionally, in a possible implementation manner, when the terminal extracts the contour feature of the target image by using an adaptive threshold binarization method, S103 may include S1031 to S1032, specifically as follows:
s1031: and carrying out noise reduction processing on the target image to obtain a noise-reduced image.
The noise reduction processing of the target image by the terminal may include gaussian noise reduction processing and/or bilateral filtering noise reduction processing, or other processing algorithms for reducing noise of the image. Taking gaussian noise reduction as an example for explanation, exemplarily, the terminal determines a gaussian template (kernel size of gaussian filtering) corresponding to the target image, scans each pixel point in the target image, determines a weighted average value of gray values of pixel points in a neighborhood of a certain pixel point through the gaussian template, and uses the weighted average value as a pixel value corresponding to the pixel point. And by analogy, performing the processing on each pixel point in the target image to finally obtain the noise reduction image corresponding to the target image.
Illustratively, the processing of gaussian noise reduction on the target image may also be implemented by a first preset function, which is as follows:
Image_sum_gau = filter_gau(Image_sum),    (2)
in the above formula (2), Image_sum_gau represents the noise-reduced image corresponding to the target image, Image_sum represents the target image, and filter_gau(·) represents Gaussian filtering.
For example, the target image may be input into a preset Gaussian noise reduction model for Gaussian noise reduction, and the Gaussian noise reduction model outputs a noise-reduced image corresponding to the target image. Alternatively, the terminal inputs the target image into a preset bilateral filtering algorithm and performs bilateral filtering noise reduction to obtain a noise-reduced image corresponding to the target image.
S1032: performing adaptive threshold binarization processing on the noise-reduced image to obtain the contour features.
The terminal performs adaptive threshold binarization processing on the noise-reduced image to obtain the contour features. Specifically, the terminal divides the noise-reduced image into regions and calculates a threshold for each region. Exemplarily, for a pixel (x, y) in the noise-reduced image, a region of size M × N is taken with the pixel (x, y) as any one of its center point, upper-left corner, lower-left corner, upper-right corner or lower-right corner; the average value of all pixels in the region is calculated, and a preset constant C is subtracted from this average to obtain the threshold corresponding to the region. Here M and N denote the side lengths of the region and may be adjusted according to the actual situation, which is not limited. Pixels in the region whose values are larger than the threshold are displayed as white, and pixels whose values are smaller than or equal to the threshold are displayed as black. After every pixel in the noise-reduced image has been processed in this way, the contour features corresponding to the noise-reduced image are obtained. It should be noted that the same manner of region division should be used throughout a given noise-reduced image; for example, if the first pixel is taken as the center point of its region, the other pixels in the noise-reduced image should also be taken as the center points of their regions. This description is illustrative only and is not intended to be limiting.
Illustratively, the terminal may also implement adaptive threshold binarization processing on the noise-reduced image through a second preset function, where the second preset function is as follows:
MASK_ada = g(Image_sum_gau),    (3)
in the above formula (3), Image_sum_gau represents the noise-reduced image corresponding to the target image, MASK_ada represents the contour features, and g(·) represents the adaptive threshold binarization processing.
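A minimal sketch of formulas (2) and (3) using OpenCV is shown below; the 5×5 Gaussian kernel, the 15-pixel block size and the constant C = 10 are illustrative assumptions, not values prescribed by this embodiment.

```python
import cv2

def contour_mask(target_image, block_size=15, C=10):
    """Approximate MASK_ada: denoise the target image, then apply
    mean-based adaptive threshold binarization."""
    gray = (target_image if target_image.ndim == 2
            else cv2.cvtColor(target_image, cv2.COLOR_BGR2GRAY))
    # Formula (2): Gaussian noise reduction of the target image.
    denoised = cv2.GaussianBlur(gray, (5, 5), 0)
    # Formula (3): for each pixel the threshold is the mean of its
    # block_size x block_size neighbourhood minus C; pixels above the
    # threshold become white (255), the rest black (0).
    return cv2.adaptiveThreshold(denoised, 255,
                                 cv2.ADAPTIVE_THRESH_MEAN_C,
                                 cv2.THRESH_BINARY, block_size, C)
```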
S104: and fusing the character edge feature and the outline feature to determine a video subtitle area corresponding to the target image.
The terminal fuses the text edge features of the target image with its contour features, which can be understood as obtaining the overlapping part of the text edge features and the contour features; this overlapping part is the video subtitle region corresponding to the target image.
Illustratively, the manner of fusing the text edge feature and the outline feature and determining the video subtitle region corresponding to the target image by the terminal may be represented by the following formula (4), specifically as follows:
MASK_cross = MASK_east ∩ MASK_ada,    (4)
in the above formula (4), MASK_cross represents the video subtitle region, MASK_east represents the text edge features of the target image, and MASK_ada represents the contour features of the target image.
Optionally, in a possible implementation manner, the S104 may include S1041 to S1043, which are specifically as follows:
s1041: and carrying out feature fusion processing on the character edge features and the outline features to obtain crossed edge features.
The cross edge feature can be understood as a cross-coincident part of the text edge feature and the outline feature of the target image. The terminal marks the pixel points with the pixel values of 1 in the character edge characteristic and the outline characteristic as white, marks the pixel points with the pixel values of 0 in the character edge characteristic and the outline characteristic as black, and the obtained pixel points marked as the same color are the cross edge characteristic.
S1042: and performing morphological operation on the crossed edge features to obtain a binary image.
The morphological operations include morphological erosion and morphological dilation, and mainly operate on the white regions in the image, that is, the regions composed of the white pixels marked in S1041. Morphological operations change the shape of objects; in plain terms, erosion thins a region and dilation thickens it. For example, suppose a video subtitle contains the character "凸" (convex) and the operation in S1041 fails to mark the topmost horizontal stroke of the character as white; applying morphological dilation expands the pixel region corresponding to the character so that the topmost stroke is also displayed in white (which can be understood as highlighted), which amounts to enlarging the highlighted region of the character. For another example, suppose a video subtitle contains the character "一" (one) and the pixel region marked for it is too large; applying morphological erosion shrinks the pixel region corresponding to the character so that only the horizontal stroke itself is displayed in white, which amounts to reducing the highlighted region of the character.
Exemplarily, the terminal performs morphological erosion operation on the cross edge feature to obtain an erosion image; and performing morphological expansion operation on the corrosion image to obtain a binary image. Optionally, in an achievable manner, morphological dilation operation may be performed on the cross edge feature to obtain a dilated image; and carrying out morphological corrosion operation on the expansion image to obtain a binary image. After morphological operation, the video caption area is displayed in white, which can also be understood as highlight display of the video caption area, and the other areas except the video caption area are displayed in black, so that the video caption area is very obvious in the binary image.
And the terminal extracts the connected domain of the binary image to obtain a video subtitle area based on the subtitle font. For the specific connected component extraction process, reference may be made to the prior art, which is not described herein again. It is understood that the video subtitle region obtained at this time is not a standard rectangular box, but a region matching the font style in the subtitle.
In this embodiment, the terminal marks the pixel points with the same color in the text edge feature and the contour feature, uses the pixel points marked with the same color as the cross edge feature, and performs morphological operation on the cross edge feature, so that the connected domain extracted by the terminal is more accurate and the edge is smoother.
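For illustration, the sketch below treats the cross edge feature as the set of pixels that are white in both feature maps and then applies the erosion-then-dilation variant described above; the 5×3 rectangular structuring element is an assumed value.

```python
import cv2

def cross_edge_binary(mask_east, mask_ada, kernel_size=(5, 3)):
    """S1041: keep only the pixels that are white in both masks;
    S1042: smooth the result with morphological erosion then dilation."""
    cross = cv2.bitwise_and(mask_east, mask_ada)     # cross edge feature
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, kernel_size)
    eroded = cv2.erode(cross, kernel)    # "thinning": removes isolated specks
    binary = cv2.dilate(eroded, kernel)  # "thickening": restores stroke width
    return binary
```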
S1043: and performing hole filling processing on the binary image to obtain a video subtitle area.
Determining a pixel (0, 0) in the binary image as an initial seed point, and filling the background of the binary image by taking the point as a starting point; negating the binary image obtained after filling to obtain a new binary image; and adding the image obtained at the moment and the original binary image to obtain a hole filling result, namely obtaining a video subtitle area.
In the embodiment, the terminal performs hole filling processing on the binary image, so that noise in the video caption area can be effectively removed, and the clear and accurate video caption area can be obtained.
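The hole-filling step of S1043 might be sketched as follows, assuming a 0/255 binary image whose corner pixel (0, 0) belongs to the background:

```python
import cv2
import numpy as np

def fill_holes(binary):
    """Flood-fill the background from pixel (0, 0), invert the result so
    only the holes remain white, and combine it with the original binary
    image to obtain the hole-filled video subtitle region."""
    filled = binary.copy()
    h, w = binary.shape[:2]
    flood_mask = np.zeros((h + 2, w + 2), np.uint8)  # floodFill needs a padded mask
    cv2.floodFill(filled, flood_mask, (0, 0), 255)   # background becomes white
    holes = cv2.bitwise_not(filled)                  # white = interior holes only
    return cv2.bitwise_or(binary, holes)             # original regions + filled holes
```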
According to the embodiment of the application, a terminal acquires a target image corresponding to a video to be processed; the target image is determined according to a plurality of video frames in the video to be processed; determining character edge characteristics of the target image; determining the contour feature of the target image; and determining a video subtitle area corresponding to the target image according to the character edge characteristic and the outline characteristic. In the above manner, the terminal not only extracts the character edge features of the target image corresponding to the video to be processed, but also extracts the contour features of the target image corresponding to the video to be processed, and the extracted character edge features and contour features are fused, so that background characters in the target image can be effectively filtered, and an accurate video subtitle region can be obtained. The method is beneficial to accurately identifying and translating the subtitles in the video according to the video subtitle region by a subsequent terminal, improves the accuracy of video retrieval, and is beneficial to video analysis. The text edge features corresponding to the target image are extracted based on the trained feature extraction model, the contour features of the target image are extracted based on the adaptive threshold value binarization method, the feature extraction method is low in complexity and small in calculation amount, resources are saved, and the speed of determining the video subtitle region is improved.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a method for processing video subtitles according to another embodiment of the present application. The method of processing a video subtitle as shown in fig. 2 may include:
s201: acquiring a target image corresponding to a video to be processed; the target image comprises a plurality of video frames in the video to be processed.
And the terminal acquires a target image corresponding to the video to be processed. In this embodiment, the target image includes a plurality of video frames in the video to be processed.
S202: text edge features of each of a plurality of video frames are extracted.
And the terminal respectively extracts the character edge characteristics of each video frame in the video to be processed. For a specific extraction process, reference may be made to the description in S102, which is not described herein again.
S203: contour features of each of the plurality of video frames are extracted.
And the terminal respectively extracts the contour characteristics of each video frame in the video to be processed. For a specific extraction process, reference may be made to the description in S103, which is not described herein again.
S204: and aiming at each video frame, obtaining a subtitle area corresponding to each video frame according to the character edge characteristics and the outline characteristics corresponding to each video frame.
Each video frame has its own text edge features and contour features. For each video frame, the text edge features and the contour features corresponding to that frame are fused to obtain the subtitle region corresponding to that frame. For example, for the 3rd video frame, the text edge features and contour features corresponding to the 3rd video frame are fused to obtain the subtitle region corresponding to the 3rd video frame; for the 4th video frame, the text edge features and contour features corresponding to the 4th video frame are fused to obtain the subtitle region corresponding to the 4th video frame. For the specific fusion process, reference may be made to the fusion process described in S104, which is not repeated here.
S205: and fusing the subtitle areas corresponding to the video frames to obtain the video subtitle areas.
And obtaining a subtitle region corresponding to each video frame through the processing in the S204, and fusing all the subtitle regions to obtain a video subtitle region corresponding to the video to be processed. Specifically, the terminal may take a region overlapped in the caption region corresponding to each video frame, and take the region as the video caption region corresponding to the video to be processed.
Optionally, in a possible implementation manner, the S205 may specifically include: determining overlapped areas in the caption areas respectively corresponding to the plurality of video frames; a video subtitle region is determined based on the overlapping region.
Illustratively, the coordinates of all pixels constituting each subtitle region are acquired from the subtitle region corresponding to each video frame. Whether the coordinates of pixels in the different subtitle regions are the same is then compared, that is, whether the pixel positions coincide is determined. The coinciding pixels are marked, and all the marked pixels form the overlapping region. Specifically, the coinciding pixels may be marked within the subtitle region of each video frame, and the region formed by the marked pixels in any one of the video frames, for example the last one, is taken as the overlapping region; alternatively, one video frame may be selected first and the coinciding pixels marked within its subtitle region, with the marked pixels finally forming the overlapping region. The overlapping region is the video subtitle region. This description is illustrative only and is not intended to be limiting.
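A sketch of this fusion, assuming every per-frame subtitle region is stored as a 0/255 binary mask of the same size, is shown below; keeping only the pixels marked in every frame yields the overlapping region.

```python
import cv2
from functools import reduce

def overlap_region(frame_masks):
    """frame_masks: list of binary subtitle masks, one per video frame.
    A pixel stays white only if it is white in every frame's mask."""
    return reduce(cv2.bitwise_and, frame_masks)
```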
In the embodiment of the application, the terminal extracts the character edge characteristics of each video frame and the outline characteristics of each video frame, fuses the character edge characteristics and the outline characteristics of each video frame to obtain the subtitle region corresponding to each video frame, and then fuses the subtitle regions corresponding to each video frame to obtain the video subtitle regions. By fusing the extracted text edge characteristics and outline characteristics of each video frame, background text of each video frame can be effectively filtered out, and accurate caption areas can be obtained. The processing is carried out on each video frame, so that the obtained subtitle area corresponding to each video frame is very accurate, and the final video subtitle area determined based on the subtitle areas is more accurate.
Referring to fig. 3, fig. 3 is a schematic flowchart illustrating a method for processing video subtitles according to another embodiment of the present application. The method may include S301 to S306. For reference, the steps S301 to S304 shown in fig. 3 may refer to the above description of S101 to S104, and for brevity, the description is omitted here. The following will specifically explain steps S305 and S306.
S305: and determining a rectangular frame corresponding to the video caption area to obtain a caption frame corresponding to the video caption area.
Based on the description in S104, it can be known that the video subtitle region is a connected component based on the subtitle font, that is, the video subtitle region is a region matching the font in the subtitle. The rectangular frame may be a rectangular frame corresponding to a minimum circumscribed rectangle that just contains the video caption area, or may be a rectangular frame corresponding to a circumscribed rectangle that is slightly larger than the video caption area, and the size of the rectangular frame is not limited.
Illustratively, the terminal obtains a rectangular frame corresponding to the minimum circumscribed rectangle of the video caption area to obtain a caption frame corresponding to the video caption area, that is, to obtain a caption frame corresponding to the video to be processed. For example, the terminal respectively takes a plurality of pixel points at the top, bottom, left and right edges of the video caption area, and determines a rectangular frame based on the pixel points to obtain a caption frame corresponding to the video caption area.
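As an illustration, the subtitle frame of S305 might be computed as the minimum axis-aligned rectangle enclosing all white pixels of the video subtitle mask; the optional margin parameter (for the slightly larger circumscribed rectangle mentioned above) is an assumption of this sketch.

```python
import numpy as np

def subtitle_box(subtitle_mask, margin=0):
    """Return (x, y, w, h) of the rectangle enclosing the video subtitle
    region; margin > 0 gives a slightly larger circumscribed rectangle."""
    ys, xs = np.where(subtitle_mask > 0)
    if ys.size == 0:
        return None                                   # no subtitle region detected
    x0 = max(int(xs.min()) - margin, 0)
    y0 = max(int(ys.min()) - margin, 0)
    x1 = min(int(xs.max()) + margin, subtitle_mask.shape[1] - 1)
    y1 = min(int(ys.max()) + margin, subtitle_mask.shape[0] - 1)
    return x0, y0, x1 - x0 + 1, y1 - y0 + 1
```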
S306: and displaying the caption frame and the video caption in the caption frame.
The video subtitle is a subtitle included in the video subtitle region. The subtitle frame is a rectangular frame corresponding to the video subtitle region, and accordingly, the video subtitle is included in the subtitle frame. The caption frame and the video caption in the caption frame can be displayed on a preset display interface. For example, the caption box and the video caption in the caption box are displayed on a display interface of the terminal.
It should be noted that S305 and S306 may also be executed after S205; this is not limited in actual implementation.
In the embodiment of the application, a subtitle frame corresponding to the video subtitle region is also determined, and the subtitle frame and the video subtitle within it are displayed. This allows developers to see intuitively whether the video subtitle in the subtitle frame is displayed completely and meets the requirements, and indirectly reflects whether the video subtitle region has been determined accurately and reasonably, which makes it convenient for developers to adjust the method.
Referring to fig. 4, fig. 4 is a schematic diagram of an apparatus for processing a video subtitle according to an embodiment of the present application. The device comprises units for performing the steps in the embodiments corresponding to fig. 1, 2,3. Please refer to the related descriptions in the corresponding embodiments of fig. 1, fig. 2, and fig. 3. For convenience of explanation, only the portions related to the present embodiment are shown. Referring to fig. 4, including:
an obtaining unit 410, configured to obtain a target image corresponding to a video to be processed; the target image is determined according to a plurality of video frames in the video to be processed;
a first extraction unit 420, configured to determine a text edge feature of the target image;
a second extraction unit 430, configured to determine a contour feature of the target image;
a determining unit 440, configured to determine a video subtitle region corresponding to the target image according to the text edge feature and the contour feature.
Optionally, the video to be processed includes N video frames, where N is an integer greater than 1, and the obtaining unit 410 is specifically configured to:
performing image fusion processing on the fusion image corresponding to the i-th video frame and the (i+1)-th video frame to obtain a fusion image corresponding to the (i+1)-th video frame;
the fusion image corresponding to the i-th video frame is obtained by performing image fusion processing on the fusion image corresponding to the (i-1)-th video frame and the i-th video frame, i takes the values 2, 3, …, N-1, and the fusion image corresponding to the 1st video frame is the 1st video frame itself;
and when i is N-1, the fusion image corresponding to the (i+1)-th video frame, that is, the N-th video frame, is the target image corresponding to the video to be processed (an illustrative sketch of this iterative fusion follows this list).
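The following sketch illustrates this accumulation, assuming a simple pairwise weighted blend as the fusion operation; the 0.5 weights and the use of OpenCV's addWeighted are illustrative choices, since the application does not prescribe a particular fusion formula.

```python
import cv2
import numpy as np

def fuse_frames(frames):
    """Iteratively fuse N video frames into one target image.

    fused(1) = frame 1; fused(k+1) = blend(fused(k), frame k+1).
    A pairwise 50/50 blend is used here purely as an example."""
    fused = frames[0].astype(np.float32)
    for frame in frames[1:]:
        fused = cv2.addWeighted(fused, 0.5, frame.astype(np.float32), 0.5, 0.0)
    return fused.astype(np.uint8)
```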
Optionally, the determining unit 440 is specifically configured to:
performing feature fusion processing on the character edge features and the contour features to obtain cross edge features;
performing a morphological operation on the cross edge features to obtain a binary image;
and performing hole filling processing on the binary image to obtain the video subtitle region (an illustrative sketch of these steps follows this list).
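A minimal sketch of these three steps, assuming the character edge features and contour features are binary masks of the same size; the elementwise AND used as feature fusion, the 5x5 closing kernel, and the flood-fill based hole filling (which assumes the pixel (0, 0) is background) are illustrative choices.

```python
import cv2
import numpy as np

def subtitle_region(edge_mask, contour_mask):
    """Combine edge and contour masks into a filled subtitle-region mask."""
    # Cross edge features: keep only edges confirmed by both feature maps.
    cross = cv2.bitwise_and(edge_mask, contour_mask)
    # Morphological closing joins nearby strokes into connected blobs.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    closed = cv2.morphologyEx(cross, cv2.MORPH_CLOSE, kernel)
    # Hole filling: flood-fill the background from (0, 0), invert, and OR back.
    h, w = closed.shape
    flood = closed.copy()
    fill_mask = np.zeros((h + 2, w + 2), np.uint8)
    cv2.floodFill(flood, fill_mask, (0, 0), 255)
    holes = cv2.bitwise_not(flood)      # pixels enclosed by the foreground
    return cv2.bitwise_or(closed, holes)
```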
Optionally, the first extraction unit 420 is specifically configured to:
inputting the target image into a trained feature extraction model for processing to obtain character edge features corresponding to the target image; the feature extraction model is obtained by training an initial feature extraction network on a sample image set, and the sample image set comprises a plurality of sample images and the character edge features corresponding to each sample image (a hedged sketch of such a model follows).
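Purely as an illustration of what such a feature extraction model could look like, the sketch below defines a small fully convolutional network in PyTorch that maps an RGB image to a single-channel edge probability map; the architecture, layer sizes, and the name EdgeNet are assumptions, since the application does not specify the network structure.

```python
import torch
import torch.nn as nn

class EdgeNet(nn.Module):
    """Toy fully convolutional model predicting a text-edge probability map."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1),  # per-pixel edge logit
        )

    def forward(self, x):  # x: (batch, 3, H, W)
        return torch.sigmoid(self.body(x))

# Training would minimise a pixel-wise loss (e.g. binary cross-entropy)
# against the text edge annotations in the sample image set.
```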
Optionally, the second extraction unit 430 is specifically configured to:
performing noise reduction processing on the target image to obtain a noise-reduced image;
and performing adaptive threshold binarization processing on the noise-reduced image to obtain the contour features (an illustrative sketch follows this list).
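A minimal sketch, assuming Gaussian blurring as the noise reduction step and OpenCV's adaptive thresholding for the binarization; the kernel size, block size, and constant are illustrative values.

```python
import cv2

def contour_features(target_image_bgr):
    """Denoise the target image and binarize it with an adaptive threshold."""
    gray = cv2.cvtColor(target_image_bgr, cv2.COLOR_BGR2GRAY)
    denoised = cv2.GaussianBlur(gray, (5, 5), 0)  # noise reduction
    # Adaptive threshold: block size 11, constant C = 2 (illustrative values).
    return cv2.adaptiveThreshold(
        denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2
    )
```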
Optionally, when the target image comprises a plurality of video frames in the video to be processed,
the first extraction unit 420 is specifically configured to: extracting character edge features of each of the plurality of video frames;
the second extraction unit 430 is specifically configured to: extracting contour features of each of the plurality of video frames;
the determining unit 440 is specifically configured to:
for each video frame, obtaining a subtitle region corresponding to that video frame according to the character edge features and contour features corresponding to that video frame; and fusing the subtitle regions respectively corresponding to the plurality of video frames to obtain the video subtitle region.
Optionally, the fusing the subtitle regions respectively corresponding to the plurality of video frames to obtain the video subtitle region includes:
determining an overlapping region among the subtitle regions respectively corresponding to the plurality of video frames;
and determining the video subtitle region based on the overlapping region (an illustrative sketch of this fusion follows this list).
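A minimal sketch of this per-frame fusion, assuming each per-frame subtitle region is a binary mask; keeping only pixels that overlap across all frames is one possible reading of the overlapping region, and a majority vote would be another.

```python
import numpy as np

def fuse_subtitle_regions(per_frame_masks):
    """Fuse per-frame binary subtitle masks into one video subtitle region
    by keeping only the pixels where all masks overlap."""
    fused = per_frame_masks[0].astype(bool)
    for mask in per_frame_masks[1:]:
        fused &= mask.astype(bool)
    return fused.astype(np.uint8) * 255
```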
Optionally, the apparatus further comprises:
a subtitle frame determining unit, configured to determine a rectangular frame corresponding to the video subtitle region to obtain a subtitle frame corresponding to the video subtitle region;
and a display unit, configured to display the subtitle frame and the video subtitle in the subtitle frame.
Referring to fig. 5, fig. 5 is a schematic diagram of a terminal for processing video subtitles according to another embodiment of the present application. As shown in fig. 5, the terminal 5 for processing video subtitles of this embodiment includes: a processor 50, a memory 51, and computer readable instructions 52 stored in said memory 51 and executable on said processor 50. The processor 50, when executing the computer readable instructions 52, implements the steps in the various method embodiments for processing video subtitles described above, such as S101 to S104 shown in fig. 1. Alternatively, the processor 50, when executing the computer readable instructions 52, implements the functions of the units in the above embodiments, such as the units 410 to 440 shown in fig. 4.
Illustratively, the computer readable instructions 52 may be divided into one or more units, which are stored in the memory 51 and executed by the processor 50 to implement the present application. The one or more units may be a series of computer readable instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer readable instructions 52 in the terminal 5 for processing video subtitles. For example, the computer readable instructions 52 may be divided into an acquisition unit, a first extraction unit, a second extraction unit, and a determination unit, each unit functioning specifically as described above.
The terminal for processing video subtitles may include, but is not limited to, the processor 50 and the memory 51. It will be understood by those skilled in the art that fig. 5 is only an example of the terminal 5 for processing video subtitles and does not constitute a limitation on the terminal; it may include more or fewer components than those shown, some components may be combined, or different components may be used. For example, the terminal for processing video subtitles may further include an input/output device, a network access device, a bus, and the like.
The processor 50 may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 51 may be an internal storage unit of the terminal for processing video subtitles, such as a hard disk or a memory of the terminal. The memory 51 may also be an external storage device equipped on the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card. Further, the memory 51 may include both an internal storage unit and an external storage device of the terminal. The memory 51 is used for storing the computer readable instructions and other programs and data required by the terminal, and may also be used to temporarily store data that has been output or is to be output.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting them; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that the technical solutions described in the foregoing embodiments may still be modified, or some technical features thereof may be equivalently replaced; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included within the scope of the present application.

Claims (12)

1. A method for processing video subtitles, comprising:
acquiring a target image corresponding to a video to be processed; the target image is determined according to a plurality of video frames in the video to be processed;
determining character edge characteristics of the target image;
determining the contour features of the target image;
and determining a video subtitle area corresponding to the target image according to the character edge characteristics and the contour features.
2. The method according to claim 1, wherein the target image comprises a plurality of video frames in the video to be processed, or comprises a fused image obtained by performing image fusion processing on the plurality of video frames.
3. The method according to claim 2, wherein the video to be processed includes N video frames, N is an integer greater than 1, and when the target image includes a fused image obtained by performing image fusion processing on the plurality of video frames, the obtaining of the target image corresponding to the video to be processed includes:
performing image fusion processing on the fusion image corresponding to the i-th video frame and the (i+1)-th video frame to obtain a fusion image corresponding to the (i+1)-th video frame;
the fusion image corresponding to the i-th video frame is obtained by performing image fusion processing on the fusion image corresponding to the (i-1)-th video frame and the i-th video frame, i takes the values 2, 3, …, N-1, and the fusion image corresponding to the 1st video frame is the 1st video frame;
and when i is N-1, the fusion image corresponding to the (i+1)-th video frame is the target image corresponding to the video to be processed.
4. The method of claim 1, wherein the determining a video subtitle area corresponding to the target image according to the character edge characteristics and the contour features comprises:
performing feature fusion processing on the character edge characteristics and the contour features to obtain cross edge features;
performing a morphological operation on the cross edge features to obtain a binary image;
and performing hole filling processing on the binary image to obtain the video subtitle area.
5. The method of claim 1, wherein the determining character edge characteristics of the target image comprises:
inputting the target image into a trained feature extraction model for processing to obtain the character edge characteristics corresponding to the target image; the feature extraction model is obtained by training an initial feature extraction network on a sample image set; and the sample image set comprises a plurality of sample images and character edge characteristics corresponding to each sample image.
6. The method of claim 1, wherein the determining the contour features of the target image comprises:
performing noise reduction processing on the target image to obtain a noise-reduced image;
and performing adaptive threshold binarization processing on the noise-reduced image to obtain the contour features.
7. The method of claim 1, wherein, when the target image comprises a plurality of video frames in the video to be processed, the determining character edge characteristics of the target image comprises:
extracting character edge characteristics of each video frame in the plurality of video frames;
the determining the contour features of the target image comprises:
extracting contour features of each of the plurality of video frames;
and the determining a video subtitle area corresponding to the target image according to the character edge characteristics and the contour features comprises:
for each video frame, obtaining a subtitle area corresponding to the video frame according to the character edge characteristics and the contour features corresponding to the video frame;
and fusing the subtitle areas respectively corresponding to the plurality of video frames to obtain the video subtitle area.
8. The method of claim 7, wherein the fusing the subtitle areas respectively corresponding to the plurality of video frames to obtain the video subtitle area comprises:
determining an overlapping area among the subtitle areas respectively corresponding to the plurality of video frames;
and determining the video subtitle area based on the overlapping area.
9. The method according to any one of claims 1 to 8, wherein, after the determining a video subtitle area corresponding to the target image according to the character edge characteristics and the contour features, the method further comprises:
determining a rectangular frame corresponding to the video subtitle area to obtain a subtitle frame corresponding to the video subtitle area;
and displaying the subtitle frame and the video subtitle in the subtitle frame.
10. An apparatus for processing video subtitles, comprising:
the acquisition unit is used for acquiring a target image corresponding to a video to be processed; the target image is determined according to a plurality of video frames in the video to be processed;
the first extraction unit is used for determining character edge characteristics of the target image;
a second extraction unit, configured to determine a contour feature of the target image;
and the determining unit is used for determining the video subtitle area corresponding to the target image according to the character edge characteristics and the contour features.
11. A terminal for processing video subtitles comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 9 when executing the computer program.
12. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 9.
CN202011492949.XA 2020-12-17 2020-12-17 Method, device, terminal and storage medium for processing video subtitles Pending CN114648751A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011492949.XA CN114648751A (en) 2020-12-17 2020-12-17 Method, device, terminal and storage medium for processing video subtitles

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011492949.XA CN114648751A (en) 2020-12-17 2020-12-17 Method, device, terminal and storage medium for processing video subtitles

Publications (1)

Publication Number Publication Date
CN114648751A true CN114648751A (en) 2022-06-21

Family

ID=81989834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011492949.XA Pending CN114648751A (en) 2020-12-17 2020-12-17 Method, device, terminal and storage medium for processing video subtitles

Country Status (1)

Country Link
CN (1) CN114648751A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115334335A (en) * 2022-07-13 2022-11-11 北京优酷科技有限公司 Video frame insertion method and device
CN115334335B (en) * 2022-07-13 2024-01-09 北京优酷科技有限公司 Video frame inserting method and device

Similar Documents

Publication Publication Date Title
US10896349B2 (en) Text detection method and apparatus, and storage medium
CN111275034B (en) Method, device, equipment and storage medium for extracting text region from image
CN109344824B (en) Text line region detection method, device, medium and electronic equipment
CN110570442A (en) Contour detection method under complex background, terminal device and storage medium
CN110443235B (en) Intelligent paper test paper total score identification method and system
CN112101386B (en) Text detection method, device, computer equipment and storage medium
CN114529459A (en) Method, system and medium for enhancing image edge
CN110738030A (en) Table reconstruction method and device, electronic equipment and storage medium
CN113436222A (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN111652142A (en) Topic segmentation method, device, equipment and medium based on deep learning
CN111563505A (en) Character detection method and device based on pixel segmentation and merging
CN115273115A (en) Document element labeling method and device, electronic equipment and storage medium
CN113591719A (en) Method and device for detecting text with any shape in natural scene and training method
CN108960247B (en) Image significance detection method and device and electronic equipment
CN112651953A (en) Image similarity calculation method and device, computer equipment and storage medium
CN111652140A (en) Method, device, equipment and medium for accurately segmenting questions based on deep learning
CN113487473B (en) Method and device for adding image watermark, electronic equipment and storage medium
CN113129298B (en) Method for identifying definition of text image
CN114444565A (en) Image tampering detection method, terminal device and storage medium
CN114648751A (en) Method, device, terminal and storage medium for processing video subtitles
KR20110087620A (en) Layout based page recognition method for printed medium
CN116798041A (en) Image recognition method and device and electronic equipment
CN112766073B (en) Table extraction method and device, electronic equipment and readable storage medium
CN111508045B (en) Picture synthesis method and device
CN113139629A (en) Font identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination