CN112118478B - Text processing method and device, electronic equipment and storage medium - Google Patents

Text processing method and device, electronic equipment and storage medium

Info

Publication number
CN112118478B
CN112118478B (application CN202011011676.2A)
Authority
CN
China
Prior art keywords
text
target
video image
target frame
coordinates
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011011676.2A
Other languages
Chinese (zh)
Other versions
CN112118478A
Inventor
华路延
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huya Technology Co Ltd
Original Assignee
Guangzhou Huya Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Huya Technology Co Ltd
Priority to CN202011011676.2A
Publication of CN112118478A
Application granted
Publication of CN112118478B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4312Generation of visual interfaces for content selection or interaction; Content or additional data rendering involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635Overlay text, e.g. embedded captions in a TV program
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4318Generation of visual interfaces for content selection or interaction; Content or additional data rendering by altering the content in the rendering process, e.g. blanking, blurring or masking an image region
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44012Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Controls And Circuits For Display Device (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides a text processing method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring text coordinates extracted from a video image of a target frame, and historical text coordinates of a video image of at least one historical frame adjacent to the target frame, where the historical text coordinates match the text region in the video image of the historical frame; if the error between the text coordinates extracted from the video image of the target frame and the historical text coordinates is within a threshold range, determining the historical text coordinates as the target text coordinates; if the error is not within the threshold range, determining the text coordinates extracted from the video image of the target frame as the target text coordinates; and performing text processing on the video image of the target frame according to the target text coordinates. By comparing adjacent frames, the invention corrects fluctuating text coordinates and performs text processing based on the corrected target text coordinates, which improves accuracy and reduces the risk of flicker and instability during text processing.

Description

Text processing method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of data processing, in particular to a text processing method and device, electronic equipment and a storage medium.
Background
With the rapid development of electronic devices, the functions of the electronic devices are gradually improved, and people often use the electronic devices to acquire information, store information, watch videos and the like. When the electronic equipment plays the video, the subtitle information can be displayed while the video image is displayed, so that the user can understand the video content conveniently.
When viewing or using some videos, a significant portion of the audience does not want to see the subtitles in the video, for various reasons, so there is a need to block or remove subtitles for this part of the population. Traditional subtitle processing schemes recognize font regions with low accuracy, so the processed region fluctuates widely during subsequent subtitle blocking/removal operations; the effect looks unnatural, and flicker instability readily occurs.
Disclosure of Invention
In view of this, the present invention provides a text processing method, a text processing apparatus, an electronic device, and a storage medium, so as to improve accuracy of text processing, reduce a risk of a flicker instability phenomenon, and obtain an ideal processing effect.
The technical scheme of the embodiment of the invention is as follows:
in a first aspect, the present invention provides a text processing method, including: acquiring text coordinates extracted from a video image of a target frame and historical text coordinates of a video image of at least one historical frame adjacent to the target frame; the historical text coordinates are matched with text regions in the video images of the historical frames; when the error between the text coordinate extracted from the video image of the target frame and the historical text coordinate is within a threshold range, determining the historical text coordinate as a target text coordinate; when the error between the text coordinate extracted from the video image of the target frame and the historical text coordinate is not within the threshold range, determining the text coordinate extracted from the video image of the target frame as a target text coordinate; and performing text processing on the video image of the target frame according to the target text coordinates.
Optionally, the performing text processing on the video image of the target frame according to the target text coordinates includes: acquiring a mask image and an occlusion image corresponding to the video image of the target frame; the area matched with the target text coordinate in the occlusion image has a fuzzy attribute; pixels in the mask map within an area matching the target text coordinates have a first pixel value; pixels in other areas of the mask map except the area matching the target text coordinate have a second pixel value; and performing text occlusion processing on the video image of the target frame based on the mask image and the occlusion image.
Optionally, the acquiring a mask map and an occlusion map corresponding to the video image of the target frame includes: constructing a mask image corresponding to the video image of the target frame according to the target text coordinates; and carrying out mean value blurring on the duplicate image of the video image of the target frame to obtain the occlusion image.
Optionally, the performing text processing on the video image of the target frame according to the target text coordinates includes: determining at least one font area in the video image of the target frame according to the target text coordinates; the font area is an area surrounded by outline edges of the font; constructing a mask map from the at least one font region; pixels in an area of the mask map that matches the font area have a first pixel value; pixels in other areas of the mask map except for the area matching the target text coordinates have second pixel values; performing elimination processing in the at least one font area in the video image of the target frame based on the mask map.
Optionally, the determining at least one font area in the video image of the target frame according to the target text coordinates includes: extracting font color data in an area matched with the target text coordinates in the video image of the target frame; the font color data is used for setting a color threshold; and performing pixel screening on the video image of the target frame based on the color threshold value to obtain the at least one font area.
Optionally, after determining at least one font area in the video image of the target frame according to the target text coordinates, the method further comprises: and performing erosion treatment and expansion treatment on the at least one font area.
Optionally, before acquiring the text coordinates extracted from the video image of the target frame and the historical text coordinates of the video image of at least one historical frame adjacent to the target frame, the method further includes: extracting a text coordinate set in video images of all frames in a video to be processed through a text detection network; the text coordinate set comprises text coordinates extracted from the video image of the target frame.
Optionally, before the obtaining the text coordinates extracted from the video image of the target frame and the historical text coordinates of the video image of at least one historical frame adjacent to the target frame, the method includes: receiving a user operation instruction, wherein the user operation instruction is used for indicating to acquire text coordinates extracted from a video image of a target frame and historical text coordinates of a video image of at least one historical frame adjacent to the target frame.
In a second aspect, the present invention provides a text processing apparatus comprising: the device comprises an acquisition module, a determination module and a processing module; the acquisition module is used for acquiring text coordinates extracted from a video image of a target frame and historical text coordinates of a video image of at least one historical frame adjacent to the target frame; the historical text coordinates are matched with text regions in the video images of the historical frames; the determining module is used for determining the historical text coordinates as target text coordinates when errors of the text coordinates extracted from the video image of the target frame and the historical text coordinates are within a threshold range, and determining the text coordinates extracted from the video image of the target frame as the target text coordinates when the errors of the text coordinates extracted from the video image of the target frame and the historical text coordinates are not within the threshold range; and the processing module is used for performing text processing on the video image of the target frame according to the target text coordinates.
In a third aspect, the present invention provides an electronic device, which includes a machine-readable storage medium and a processor, where the machine-readable storage medium stores machine-executable instructions, and when the processor executes the machine-executable instructions, the electronic device implements the text processing method according to the first aspect.
In a fourth aspect, the present invention provides a storage medium having stored therein machine-executable instructions that, when executed, implement the text processing method according to the first aspect.
The invention provides a text processing method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring text coordinates extracted from a video image of a target frame and historical text coordinates of a video image of at least one historical frame adjacent to the target frame, where the historical text coordinates match the text region in the video image of the historical frame; when the error between the extracted text coordinates and the historical text coordinates is within a threshold range, determining the historical text coordinates as the target text coordinates; when the error is not within the threshold range, determining the extracted text coordinates as the target text coordinates; and performing text processing on the video image of the target frame according to the target text coordinates. The invention differs from the prior art as follows. When the video image of the target frame is processed, the extracted text coordinates may fluctuate around the real text coordinates in the target frame image, so processing the text directly with the extracted coordinates can produce unnatural results such as over-processing or under-processing. To determine an accurate processing region for the text in the target frame image, the extracted text coordinates of the target frame are first compared with the text coordinates of the historical frame adjacent to the target frame, which determines whether the text coordinates of the target frame and the historical frame lie within the fluctuation range. If they do, the historical frame and the target frame contain the same sentence; because the text coordinates of the historical frame were already corrected when that frame was itself the target frame, they match the real position of the text and can be used as the target text coordinates of the target frame image. If the text coordinates of the target frame and the historical frame are not within the fluctuation range, the two frames do not contain the same sentence, and the text coordinates of the target frame are used as the target text coordinates.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a scene diagram provided in an embodiment of the present invention;
fig. 2 is a schematic flowchart of a text processing method according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of an implementation manner of S206 according to an embodiment of the present invention;
FIG. 4 is an exemplary diagram of a mask map and an occlusion map provided by an embodiment of the invention;
fig. 5 is a schematic flowchart of another implementation manner of S206 according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of one implementation of step S206-1b provided by an embodiment of the present invention;
FIG. 7 is an exemplary diagram of another mask map and occlusion map provided by embodiments of the invention;
FIG. 8 is a schematic flow chart diagram of another text processing method provided by the embodiments of the present invention;
FIG. 9 is a schematic flow chart diagram of another text processing method provided by the embodiment of the invention;
FIG. 10 is a schematic flow chart diagram of another text processing method provided by an embodiment of the invention;
fig. 11 is a schematic functional interface diagram of an electronic device according to an embodiment of the present invention;
FIG. 12 is a functional block diagram of a text processing apparatus according to an embodiment of the present invention;
fig. 13 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present invention, it should be noted that, if the terms "upper", "lower", "inner", "outer", etc. are used to indicate the orientation or positional relationship based on the orientation or positional relationship shown in the drawings or the orientation or positional relationship which the product of the present invention is used to usually place, it is only for convenience of description and simplification of the description, but it is not intended to indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.
Furthermore, the appearances of the terms "first," "second," and the like, if any, are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
It should be noted that the features of the embodiments of the present invention may be combined with each other without conflict.
The terms and expressions referred to in the embodiments of the present invention are applied to the following explanations.
1) Erosion (Erode): deletes some pixels on the object boundary and shrinks the image. The erosion algorithm scans each pixel in the image with an n×n structuring element and performs an AND operation between the structuring element and the binary-image pixels it covers: if all covered pixels are 1, the resulting pixel is 1; otherwise it is 0. After erosion, the image boundary shrinks inward.
2) Dilation (Dilate): adds some pixels on the object boundary and expands the image. The dilation algorithm scans each pixel in the image with an n×n structuring element and performs an AND operation between the structuring element and the binary-image pixels it covers: if all covered pixels are 0, the resulting pixel is 0; otherwise it is 1. After dilation, the image boundary expands outward.
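To make the two definitions concrete, the following is a minimal Python sketch (an illustration added here, not part of the patent) that applies them literally to a binary 0/1 image with an n×n structuring element of ones; production code would use library routines such as cv2.erode and cv2.dilate instead.

```python
import numpy as np

def erode_binary(img, n=3):
    # Scan each pixel with an n x n structuring element of ones.
    pad = n // 2
    padded = np.pad(img, pad, constant_values=0)
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            window = padded[i:i + n, j:j + n]
            out[i, j] = 1 if window.all() else 0   # 1 only if all covered pixels are 1
    return out

def dilate_binary(img, n=3):
    pad = n // 2
    padded = np.pad(img, pad, constant_values=0)
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            window = padded[i:i + n, j:j + n]
            out[i, j] = 0 if (window == 0).all() else 1  # 0 only if all covered pixels are 0
    return out
```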
At present, to meet users' need to remove subtitles when watching or using a video, the related art first adopts a text detection network to detect each frame of video image in the video to be processed, then extracts the text coordinates in the video image, and occludes the subtitles in the region of the video image matching the extracted text coordinates.
However, the inventors have found that the above approach may suffer from the following drawback: the text coordinates extracted by the text detection network fluctuate around the real coordinates of the text in the video image, so if text occlusion or elimination is performed according to the extracted text coordinates, the actually processed region may not coincide with the real region to be processed; over-occlusion or insufficient occlusion may occur, and the processing effect is unnatural.
For example, referring to fig. 1, fig. 1 is a scene graph provided by an embodiment of the present invention that includes video images of 3 frames. The video images of frames t4 and t5 contain the same subtitle sentence, while the video image of frame t6 contains a different sentence. Taking frame t5 as an example, the dashed box in the figure marks the extracted coordinates and the solid box marks the real coordinates of the text; if the text in the t5 image is occluded according to the coordinates represented by the dashed box, over-occlusion (of the background region) and insufficient occlusion (of the font region) may occur.
To solve the above problem, the present invention provides a text processing method. Referring to fig. 2, fig. 2 is a schematic flowchart of the text processing method according to an embodiment of the present invention. The execution subject of the text processing method may be the text processing apparatus provided by the embodiment of the present invention, or an electronic device integrating the text processing apparatus, where the text processing apparatus may be implemented in hardware or software. The electronic device may be a smart phone, a tablet computer, a palmtop computer, a notebook computer, or a desktop computer. The text processing method may include the following steps:
s203, acquiring text coordinates extracted from the video image of the target frame and historical text coordinates of the video image of at least one historical frame adjacent to the target frame.
In the embodiment of the present invention, the "historical text coordinates" are matched with the region where the text in the video image of the historical frame is located, and it can be understood that the historical text coordinates can be used to represent the real position of the text in the image of the historical frame in the image. The history frame may be one frame adjacent to the target frame or two consecutive frames, for example, referring to fig. 1, assuming that the target frame is t6, the history frame may be t5, and may also be t5 and t 4.
And S204, when the error between the text coordinate extracted from the video image of the target frame and the historical text coordinate is within the threshold range, determining the historical text coordinate as the target text coordinate.
S205, when the error between the text coordinate extracted from the video image of the target frame and the historical text coordinate is not within the threshold range, determining the text coordinate extracted from the video image of the target frame as the target text coordinate.
In the embodiment of the present invention, when the error between the text coordinates extracted from the video image of the target frame and the historical text coordinates is within the threshold range, the text in the video image of the target frame and the text in the video image of the historical frame are the same sentence. In that case the text coordinates extracted from the target frame are considered to fluctuate around the real position of the text in the image, while the historical text coordinates represent the real position of the text, so the historical text coordinates can be used as the target text coordinates of the target frame image, i.e. the text coordinates to be processed in the target frame image. Conversely, when the error is not within the threshold range, the text in the video image of the target frame and the text in the video image of the historical frame are not the same sentence, and the text coordinates extracted from the video image of the target frame can be used directly as the target text coordinates of the target frame image.
For example, with continued reference to fig. 1: assuming the target frame is t5, when the error between the text coordinates extracted from the video image of t5 (the dashed box) and the text coordinates of t4 (which match the true position of the text) is within the threshold range, the text coordinates of t4 are determined as the target text coordinates of the video image of the target frame. Assuming the target frame is t6, when the error between the text coordinates extracted from the video image of t6 and the text coordinates of t5 (the target text coordinates determined from the t4 frame) is not within the threshold range, the text coordinates of t6 are determined as the target text coordinates.
It can be understood that the target text coordinates represent the real region of the text in the video image of the target frame; occluding or eliminating the text in the region represented by the target text coordinates improves accuracy and reduces the risk of over-processing or under-processing.
And S206, performing text processing on the video image of the target frame according to the target text coordinates.
In the embodiment of the present invention, the target text coordinates may be used to determine a text position in the target frame image, in one implementation, the text in the target frame image may be subjected to occlusion processing according to the target text coordinates, and in another implementation, the text in the target frame image may be subjected to elimination processing according to the target text coordinates.
The principle of the invention for realizing the beneficial effects through the steps is as follows:
When the video image of the target frame is processed, the extracted text coordinates may fluctuate around the real text coordinates in the target frame image, so processing the text directly with the extracted coordinates can produce unnatural results such as over-processing or under-processing. To determine an accurate processing region for the text in the target frame image, the extracted text coordinates of the target frame are first compared with the text coordinates of the historical frame adjacent to the target frame, which determines whether the text coordinates of the target frame and the historical frame lie within the fluctuation range. If they do, the historical frame and the target frame contain the same sentence; because the text coordinates of the historical frame were already corrected when that frame was itself the target frame, and therefore match the real position of the text, they can be used as the target text coordinates of the target frame image. If the text coordinates of the target frame and the historical frame are not within the fluctuation range, the two frames do not contain the same sentence, and the text coordinates of the target frame are used as the target text coordinates. In this way, fluctuating text coordinates are corrected through the comparison of adjacent frames, accurate target text coordinates are obtained, and occluding or eliminating the text in the region represented by the target text coordinates improves accuracy and reduces the risk of flicker instability during text processing.
Optionally, the "text coordinates" may generally be expressed as the coordinates of a rectangular box, either as the position coordinates of each corner point of the box (upper-left, lower-left, upper-right, and lower-right) in the image, or in the format (x, y, w, h), where (x, y) is the coordinate of the upper-left corner of the rectangular box containing the text and w and h are the width and height of the box; this is not limited here. Whether the error between the text coordinates extracted from the video image of the target frame and the historical text coordinates is within the error range may then be determined as follows: calculate whether the distance between the coordinates of each corner point of the rectangular box extracted from the video image of the target frame and the coordinates of the corresponding corner point of the rectangular box of the historical frame is within the threshold range, or whether the distance between the coordinates of one corner point (for example, the upper-left corner) of the extracted box and the coordinates of the corner point at the corresponding position of the historical frame's box is within the threshold range.
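As an illustration only, the following Python sketch implements the single-corner variant of this check; the (x, y, w, h) box format comes from the text above, while the concrete threshold value is an assumption.

```python
import numpy as np

def select_target_coords(extracted_box, history_box, threshold=8.0):
    """Pick the target text coordinates for the target frame.

    extracted_box / history_box: (x, y, w, h) rectangles, with (x, y) the
    upper-left corner. The threshold value here is illustrative only.
    """
    (x1, y1, _, _), (x0, y0, _, _) = extracted_box, history_box
    error = np.hypot(x1 - x0, y1 - y0)  # distance between upper-left corners
    # Within the fluctuation range: reuse the already corrected history
    # coordinates; otherwise keep the freshly extracted coordinates.
    return history_box if error <= threshold else extracted_box
```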
Optionally, to improve the accuracy of the obtained target text coordinates, the "historical frame" may be one frame adjacent to the target frame, or two consecutive frames adjacent to the target frame. In one embodiment, when the historical frame is a single frame adjacent to the target frame, the target text coordinates may be determined as in steps 204 and 205 above. In another embodiment, when the historical frames are two consecutive frames adjacent to the target frame, whether the error between the text coordinates extracted from the video image of the target frame and each set of historical text coordinates is within the error range may be determined separately: when the error between the text coordinates of any one historical frame and the extracted text coordinates of the target frame is within the threshold range, the text coordinates of that historical frame may be taken as the target text coordinates; and when the errors for both consecutive historical frames are within the threshold range, the text coordinates of either historical frame may be taken as the target text coordinates.
Optionally, the obtained accurate target text coordinates may be used to represent the real position of the text in the video image of the target frame. In one embodiment, the text in the video image of the target frame may be occluded according to the target text coordinates, so an implementation of the occlusion processing is given below on the basis of fig. 2. Referring to fig. 3, fig. 3 is a schematic flowchart of an implementation of S206 according to an embodiment of the present invention, which may include:
s206-1a, acquiring a mask image and an occlusion image corresponding to the video image of the target frame.
In an embodiment of the present invention, pixels in an area of the mask map that matches the target text coordinates have a first pixel value, and pixels in other areas except the area that matches the target text coordinates have a second pixel value; for example, in one embodiment, the first pixel value may be 255, and the second pixel value may be 0, and the mask map includes a white area and a black area, where the white area matches the target text coordinate and may be characterized as an occlusion area for subsequent text occlusion. In another embodiment, the first pixel value may be 0 and the second pixel value may be 255, and the black area matches the target text coordinates.
In order to reduce the risk of damage to the video image of the target frame caused by occlusion processing, the embodiment of the invention also obtains an occlusion map corresponding to the video image of the target frame, wherein an area in the occlusion map matched with the coordinates of the target text has fuzzy attribute; fig. 4 is an exemplary diagram of a mask map and an occlusion map provided by an embodiment of the present invention, where a white area in the mask map coincides with an area where a target text of a video image of a target frame is located.
And S206-2a, performing text occlusion processing on the video image of the target frame based on the mask image and the occlusion image.
In the embodiment of the invention, the occlusion map, the mask map, and the video image of the target frame are superposed, finally yielding the video image of the target frame with its subtitle region blurred.
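A minimal compositing sketch, assuming 8-bit color frames and a single-channel mask whose pixels are 255 (the first pixel value) inside the text region and 0 (the second pixel value) elsewhere:

```python
import numpy as np

def occlude_text(frame, occlusion_map, mask):
    # Normalise the mask to 0..1 and add a channel axis so it broadcasts
    # over the three colour channels.
    m = (mask.astype(np.float32) / 255.0)[..., None]
    # Inside the mask take the blurred occlusion map; outside keep the frame.
    out = frame.astype(np.float32) * (1.0 - m) + occlusion_map.astype(np.float32) * m
    return out.astype(np.uint8)
```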
Optionally, to make the text occlusion effect appear natural, the occlusion map may be blurred while it is generated, with the degree of blurring controlled by a set blur radius. A way to obtain the mask map and the occlusion map corresponding to the video image of the target frame is therefore given below; step S206-1a may include the following sub-steps:
s206-1a-1, constructing a mask image corresponding to the video image of the target frame according to the target text coordinates.
In the embodiment of the present invention, a blank image of the same size as the video image of the target frame is constructed; the pixel values in the region of that image coinciding with the target text coordinates are then set to the first pixel value, and the pixel values in all other regions are set to the second pixel value, yielding the mask image.
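A sketch of this construction, assuming the target text coordinates arrive as a single (x, y, w, h) box in pixel units:

```python
import numpy as np

def build_mask(frame_shape, target_box, first_value=255, second_value=0):
    height, width = frame_shape[:2]
    mask = np.full((height, width), second_value, dtype=np.uint8)  # blank image
    x, y, w, h = target_box
    mask[y:y + h, x:x + w] = first_value  # region matching the target text coordinates
    return mask
```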
S206-1a-2, performing mean value blurring on the duplicate image of the video image of the target frame to obtain an occlusion image.
In the embodiment of the invention, the video image of the target frame is copied to obtain a duplicate image. To retain the background information in the image during occlusion processing, the duplicate image may then be mean-blurred with a fixed blur radius; mean blurring refers to arithmetic-mean blurring of the image, and background contour information is retained after this processing.
It is understood that the above "blur radius" may be a preset multiple (e.g. 5 times) of the height in the target text coordinates. The blur radius represents the degree of blurring of the occlusion region: the larger the blur radius, the more thoroughly the fonts in the occlusion region are removed and the closer the region comes to the background contour, so the occlusion effect tends toward a natural result.
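A sketch of S206-1a-2 using OpenCV's box filter; the 5× multiple of the text height is the example given above, while mapping the radius to an odd kernel size is an implementation assumption:

```python
import cv2

def build_occlusion_map(frame, text_height):
    radius = 5 * text_height      # blur radius as a multiple of the text height
    k = 2 * radius + 1            # odd kernel size for the mean (box) filter
    # Arithmetic-mean blur of a copy of the frame; background contours survive.
    return cv2.blur(frame.copy(), (k, k))
```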
Optionally, the foregoing embodiment describes an implementation of occluding the text according to the target text coordinates. In another implementation, the target text coordinates may instead be used to eliminate the text in the target frame image, so an implementation of the elimination processing is given below on the basis of fig. 2. Referring to fig. 5, fig. 5 is a schematic flowchart of another implementation of S206 provided by an embodiment of the present invention, where:
s206-1b, determining at least one font area in the video image of the target frame according to the target text coordinates.
In the embodiment of the present invention, a font area is the area enclosed by the contour edge of each font. Since pixels corresponding to each font exist in the region corresponding to the target text coordinates, and each such pixel value is a color value of the font, a possible implementation extracts the color data in the target coordinate region and processes the video image of the target frame with a set color threshold. Referring to fig. 6, fig. 6 is a schematic flowchart of an implementation of step S206-1b provided by the embodiment of the present invention; step S206-1b may include the following sub-steps:
s206-1b-1, extracting font color data in the area matched with the target text coordinate in the video image of the target frame.
In the embodiment of the present invention, the font color data is used to set a color threshold. For example, the region at the target text coordinates is selected and its RGB three-channel luminance histogram is obtained, giving the color with the highest luminance proportion. For instance, when the luminance in the RGB three-channel histogram is concentrated around 255, the color threshold is set so that R, G, and B are each close to 255.
S206-1b-2, performing pixel screening on the video image of the target frame based on the color threshold value to obtain at least one font area.
In the embodiment of the invention, applying the color threshold to the video image of the target frame yields a binarized image in which the white regions are the font areas. After the font areas are obtained, the subsequent steps can be performed.
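A hedged sketch of S206-1b-1 and S206-1b-2 together; the per-channel histogram peak stands in for "the color with the highest luminance ratio", and the tolerance band around it is an assumption:

```python
import cv2
import numpy as np

def font_mask_by_color(frame, target_box, tol=40):
    x, y, w, h = target_box
    roi = frame[y:y + h, x:x + w]  # region matching the target text coordinates
    # Histogram peak per colour channel, e.g. near 255 for white subtitles.
    peak = np.array([int(np.argmax(cv2.calcHist([roi], [c], None, [256], [0, 256])))
                     for c in range(3)])
    lower = np.clip(peak - tol, 0, 255).astype(np.uint8)
    upper = np.clip(peak + tol, 0, 255).astype(np.uint8)
    # Binarised image: white (255) where pixels fall inside the colour
    # threshold, i.e. the candidate font areas.
    return cv2.inRange(frame, lower, upper)
```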
S206-3b, constructing a mask map according to at least one font area.
In an embodiment of the present invention, referring to fig. 7, fig. 7 is an exemplary diagram of another mask map provided in the embodiment of the present invention. Pixels in the regions of the mask map matching the font areas have a first pixel value, and pixels in the other regions have a second pixel value. For example, in one implementation the first pixel value may be 255 and the second pixel value 0; the mask map then consists of white regions and black regions, where the white regions match the font areas and can be characterized as the regions to be processed in the subsequent elimination. In another embodiment, the first pixel value may be 0 and the second pixel value 255, and the black regions match the font areas.
And S206-4b, eliminating at least one font area in the video image of the target frame based on the mask image.
In the embodiment of the present invention, each font area may be gradually filled inward from its area boundary. In one possible implementation, the pixel tangent and pixel normal within each font area are first determined; the edge of each font area is then located using the gray-level difference between pixels that lie on either side of the tangent and normal directions at equal distances from the central pixel of the font area; finally, edge pixels are extracted to fill the font area. No font pixel remains in the filled font area, achieving the purpose of font elimination.
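The boundary-driven fill described above is specific to the patent; as a stand-in with similar behaviour, the sketch below uses OpenCV inpainting, which likewise reconstructs the masked region from surrounding pixels. The radius value is an assumption.

```python
import cv2

def erase_fonts(frame, font_mask, radius=3):
    # font_mask: single-channel, 255 inside the font areas (the mask map of
    # S206-3b), 0 elsewhere. Telea inpainting fills the masked pixels from
    # their neighbourhood, removing the font strokes.
    return cv2.inpaint(frame, font_mask, radius, cv2.INPAINT_TELEA)
```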
Optionally, in the process of obtaining the font areas through color screening, background pixels may be blended into a font area, producing mottled regions within it. To obtain clean font areas, reduce the influence of background pixels, remove the mottled regions, and at the same time preserve the integrity of the font edges and improve the accuracy of the font areas, a possible implementation is given below on the basis of fig. 5. Referring to fig. 8, fig. 8 is a schematic flowchart of another text processing method provided in an embodiment of the present invention; after the font areas are obtained, the method further includes the following step:
s206-2b, performing corrosion treatment and expansion treatment on at least one font area.
In the embodiment of the present invention, the erosion processing shrinks the font area boundary so that mottled background pixels mixed into the area are removed; on the other hand, when the proportion of the font area is large and the mottled regions cannot be completely removed by the erosion operation alone, the font area can subsequently be dilated to recover the font contour boundary, yielding a font area free of mottled regions.
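A sketch of this cleanup as an erode-then-dilate sequence (a morphological opening); the 3×3 structuring element and single iteration are assumptions:

```python
import cv2
import numpy as np

def clean_font_mask(font_mask):
    element = np.ones((3, 3), np.uint8)                    # n x n structuring element
    eroded = cv2.erode(font_mask, element, iterations=1)   # strip mottled background pixels
    return cv2.dilate(eroded, element, iterations=1)       # recover the font contour
```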
Optionally, before the video image of the target frame is processed, and in order to obtain the text coordinates extracted from the video image of the target frame quickly, a possible implementation is given below on the basis of fig. 2. Referring to fig. 9, fig. 9 is a schematic flowchart of another text processing method provided by an embodiment of the present invention; the method further includes:
s202, extracting a text coordinate set in the video images of all frames in the video to be processed.
In the embodiment of the present invention, the text coordinate set includes the text coordinates extracted from the video image of the target frame; after the video image of the target frame is obtained, the text coordinates corresponding to the target frame can be retrieved from the text coordinate set. The text detection network may be a connectionist text proposal network (CTPN).
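A hedged sketch of S202; `detector` is a placeholder for whatever text detection network (e.g. a CTPN) is used, and its `detect` method, assumed to return a list of (x, y, w, h) boxes per frame, is not a real library API.

```python
def extract_coordinate_set(video_frames, detector):
    """Build the text coordinate set: frame index -> list of text boxes."""
    coordinate_set = {}
    for index, frame in enumerate(video_frames):
        coordinate_set[index] = detector.detect(frame)  # hypothetical API
    return coordinate_set
```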
Optionally, the execution subject of each step may be an electronic device. To meet users' need to occlude or eliminate subtitles, the interactive interface of the electronic device in an embodiment of the present invention may include a functional area for receiving user operations; a user operation on this functional area can trigger the electronic device to execute the steps above and complete the occlusion or elimination of subtitles. A possible implementation is given below on the basis of fig. 2; referring to fig. 10, fig. 10 is a schematic flowchart of another text processing method provided in an embodiment of the present invention, and the method further includes:
s201, receiving a user operation instruction, wherein the user operation instruction is used for indicating to acquire text coordinates extracted from a video image of a target frame and historical text coordinates of a video image of at least one historical frame adjacent to the target frame.
For ease of understanding, please refer to fig. 11; fig. 11 is a schematic view of a function interface of an electronic device according to an embodiment of the present invention. In the function display interface shown in fig. 11, two functional areas exist: a file uploading area and a function selection area. The file uploading area instructs the user to input the video file to be processed, and the function selection area instructs the user to select a function; for example, one function corresponds to the text processing function provided by the embodiment of the present invention. After the user clicks or touches the corresponding area of that function, the electronic device receives the user's operation instruction and is triggered to execute the text processing method provided by the embodiment of the present invention, starting text processing on the video file in the file uploading area.
In order to implement the steps in the foregoing embodiments and achieve the corresponding technical effects, an implementation of a text processing apparatus is given below. Referring to fig. 12, fig. 12 is a functional block diagram of a text processing apparatus provided in an embodiment of the present invention, where the text processing apparatus 12 includes: an acquisition module 121, a determination module 122 and a processing module 123.
An obtaining module 121, configured to obtain text coordinates extracted from a video image of a target frame and historical text coordinates of a video image of at least one historical frame adjacent to the target frame; matching the historical text coordinates with text regions in the video images of the historical frames;
the determining module 122 is configured to determine, when an error between a text coordinate extracted from a video image of the target frame and a historical text coordinate is within a threshold range, the historical text coordinate as a target text coordinate; and when the error between the text coordinate extracted from the video image of the target frame and the historical text coordinate is not in the threshold range, determining the text coordinate extracted from the video image of the target frame as the target text coordinate.
And the processing module 123 is configured to perform text processing on the video image of the target frame according to the target text coordinates.
It will be appreciated that the obtaining module 121, the determining module 122 and the processing module 123 may be used to perform the steps of fig. 2, 3, 5, 6, 8 to achieve the corresponding technical effects.
Optionally, the text processing apparatus 12 further includes an extraction module, which is configured to extract a text coordinate set of the video to be processed.
It will be appreciated that the obtaining module 121, the determining module 122 and the extracting module may also be used to cooperatively perform the steps in fig. 9 to achieve the corresponding technical effect.
Optionally, the text processing apparatus 12 further includes a receiving module, configured to receive a user operation instruction, where the user operation instruction is used to instruct to acquire the image to be processed.
It is to be understood that the obtaining module 121, the determining module 122 and the receiving module can also be used to cooperatively perform the steps in fig. 10 to achieve the corresponding technical effect.
Fig. 13 shows an electronic device, and fig. 13 is a block diagram of the electronic device according to the embodiment of the present invention. The electronic device 13 includes a communication interface 131, a processor 132, and a memory 133. The processor 132, memory 133, and communication interface 131 are electrically connected to one another, directly or indirectly, to enable transfer or interaction of data. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The memory 133 may be used for storing software programs and modules, such as program instructions/modules corresponding to the text processing method provided by the embodiment of the present invention, and the processor 132 executes the software programs and modules stored in the memory 133, so as to execute various functional applications and data processing. The communication interface 131 may be used for communicating signaling or data with other node devices. The electronic device 13 may have a plurality of communication interfaces 131 in the present invention.
The memory 133 may be, but is not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a programmable read-only memory (PROM), an erasable read-only memory (EPROM), an electrically erasable read-only memory (EEPROM), and the like.
The processor 132 may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc.
It is to be understood that the respective modules of the above-described text processing apparatus 12 may be stored in the memory 133 of the electronic device 13 in the form of software or Firmware (Firmware) and executed by the processor 132, and at the same time, data, codes of programs, etc. required to execute the above-described modules may be stored in the memory 133.
An embodiment of the present invention provides a storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the text processing method according to any one of the foregoing embodiments. The computer readable storage medium may be, but is not limited to, various media that can store program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a PROM, an EPROM, an EEPROM, a magnetic or optical disk, etc.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (11)

1. A method of text processing, the method comprising:
acquiring text coordinates extracted from a video image of a target frame and historical text coordinates of a video image of at least one historical frame adjacent to the target frame; the historical text coordinates are matched with text regions in the video images of the historical frames; the historical text coordinates are coordinates that have been corrected;
when the error between the text coordinate extracted from the video image of the target frame and the historical text coordinate is within a threshold range, representing that the texts in the target frame and the historical frame are the same, and determining the historical text coordinate as a target text coordinate;
when the error between the text coordinate extracted from the video image of the target frame and the historical text coordinate is not within the threshold range, representing that the texts in the target frame and the historical frame are different, determining the text coordinate extracted from the video image of the target frame as a target text coordinate;
and performing text processing on the video image of the target frame according to the target text coordinates.
2. The text processing method according to claim 1, wherein the text processing the video image of the target frame according to the target text coordinates comprises:
acquiring a mask image and an occlusion image corresponding to the video image of the target frame; the area matched with the target text coordinate in the occlusion image has a fuzzy attribute; pixels in the mask map within an area matching the target text coordinates have a first pixel value; pixels in other areas of the mask map except for the area matching the target text coordinates have second pixel values;
and performing text occlusion processing on the video image of the target frame based on the mask image and the occlusion image.
3. The text processing method according to claim 2, wherein the acquiring mask images and occlusion images corresponding to the video images of the target frame comprises:
constructing a mask image corresponding to the video image of the target frame according to the target text coordinates;
and carrying out mean value blurring on the duplicate image of the video image of the target frame to obtain the occlusion image.
4. The text processing method according to claim 1, wherein performing text processing on the video image of the target frame according to the target text coordinates comprises:
determining at least one font area in the video image of the target frame according to the target text coordinates, wherein a font area is an area enclosed by the outline edges of a font;
constructing a mask map from the at least one font area, wherein pixels of the mask map within an area matching the font area have a first pixel value, and pixels of the mask map in areas other than the area matching the target text coordinates have a second pixel value; and
performing elimination processing on the at least one font area in the video image of the target frame based on the mask map.
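The claim leaves the "elimination processing" open; inpainting over the font mask is one common way to realize it, shown here as an assumption rather than the patented mechanism:

    import cv2

    def erase_fonts(frame, font_mask):
        # Reconstruct the pixels under the character strokes from their
        # surroundings; radius 3 is an illustrative choice.
        return cv2.inpaint(frame, font_mask, 3, cv2.INPAINT_TELEA)

Because the mask follows the stroke outlines rather than the whole bounding box, the background around the characters is left untouched.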
5. The text processing method according to claim 4, wherein determining the at least one font area in the video image of the target frame according to the target text coordinates comprises:
extracting font color data from the area of the video image of the target frame matching the target text coordinates, wherein the font color data is used to set a color threshold; and
performing pixel screening on the video image of the target frame based on the color threshold to obtain the at least one font area.
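A sketch of the color-threshold screening in claim 5; sampling the median color of the text box as the "font color data" and the +/-30 tolerance are both assumptions, since the claim does not say how the threshold is derived:

    import cv2
    import numpy as np

    def screen_font_pixels(frame, text_box, tolerance=30):
        x, y, w, h = text_box  # axis-aligned box from the target text coordinates
        region = frame[y:y + h, x:x + w].reshape(-1, 3)
        font_color = np.median(region, axis=0)  # stand-in for the font color data
        lower = np.clip(font_color - tolerance, 0, 255).astype(np.uint8)
        upper = np.clip(font_color + tolerance, 0, 255).astype(np.uint8)
        # 255 where the pixel passes the color screening, 0 elsewhere.
        return cv2.inRange(frame, lower, upper)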
6. The text processing method according to claim 4, wherein after the step of determining the at least one font area in the video image of the target frame according to the target text coordinates, the method further comprises:
performing erosion processing and dilation processing on the at least one font area.
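Erosion followed by dilation (a morphological opening) removes stray pixels picked up by the color screening and then restores the stroke width; the 3x3 kernel and single iterations below are illustrative choices, not values taken from the claim:

    import cv2
    import numpy as np

    def clean_font_mask(font_mask):
        kernel = np.ones((3, 3), dtype=np.uint8)
        eroded = cv2.erode(font_mask, kernel, iterations=1)  # drop isolated noise
        return cv2.dilate(eroded, kernel, iterations=1)      # re-cover the strokes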
7. The text processing method according to claim 1, wherein before acquiring the text coordinates extracted from the video image of the target frame and the historical text coordinates of the video image of the at least one historical frame adjacent to the target frame, the method further comprises:
extracting, through a text detection network, a text coordinate set from the video images of all frames of a video to be processed, wherein the text coordinate set comprises the text coordinates extracted from the video image of the target frame.
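The claim names no particular detector, so detect_text below is a placeholder for any text detection network returning text boxes for a frame (an EAST- or DBNet-style model, for example); only the frame-by-frame extraction loop is sketched:

    import cv2

    def extract_text_coordinate_set(video_path, detect_text):
        coords_per_frame = []
        capture = cv2.VideoCapture(video_path)
        while True:
            ok, frame = capture.read()
            if not ok:  # end of video
                break
            coords_per_frame.append(detect_text(frame))  # boxes for this frame
        capture.release()
        return coords_per_frame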
8. The text processing method according to claim 1, wherein before acquiring the text coordinates extracted from the video image of the target frame and the historical text coordinates of the video image of the at least one historical frame adjacent to the target frame, the method comprises:
receiving a user operation instruction, wherein the user operation instruction instructs acquisition of the text coordinates extracted from the video image of the target frame and the historical text coordinates of the video image of the at least one historical frame adjacent to the target frame.
9. A text processing apparatus, characterized by comprising an acquisition module, a determination module and a processing module, wherein:
the acquisition module is configured to acquire text coordinates extracted from a video image of a target frame and historical text coordinates of a video image of at least one historical frame adjacent to the target frame, wherein the historical text coordinates match text regions in the video images of the historical frames and are coordinates that have already been corrected;
the determination module is configured to determine the historical text coordinates as target text coordinates when an error between the text coordinates extracted from the video image of the target frame and the historical text coordinates is within a threshold range, indicating that the text in the target frame and the text in the historical frame are the same, and to determine the text coordinates extracted from the video image of the target frame as the target text coordinates when the error is not within the threshold range, indicating that the text in the target frame and the text in the historical frame are different; and
the processing module is configured to perform text processing on the video image of the target frame according to the target text coordinates.
10. An electronic device comprising a machine-readable storage medium having stored thereon machine-executable instructions and a processor, wherein the processor, when executing the machine-executable instructions, implements the text processing method of any one of claims 1-8.
11. A storage medium having stored therein machine-executable instructions that, when executed, implement the text processing method of any one of claims 1-8.
CN202011011676.2A 2020-09-23 2020-09-23 Text processing method and device, electronic equipment and storage medium Active CN112118478B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011011676.2A CN112118478B (en) 2020-09-23 2020-09-23 Text processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011011676.2A CN112118478B (en) 2020-09-23 2020-09-23 Text processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112118478A CN112118478A (en) 2020-12-22
CN112118478B (en) 2022-08-19

Family

ID=73800857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011011676.2A Active CN112118478B (en) 2020-09-23 2020-09-23 Text processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112118478B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108289191B (en) * 2017-02-15 2020-01-10 腾讯科技(深圳)有限公司 Image recognition method and device
CN109214999B (en) * 2018-09-21 2021-01-22 阿里巴巴(中国)有限公司 Method and device for eliminating video subtitles

Also Published As

Publication number Publication date
CN112118478A (en) 2020-12-22

Similar Documents

Publication Publication Date Title
US11151712B2 (en) Method and apparatus for detecting image defects, computing device, and computer readable storage medium
US11275961B2 (en) Character image processing method and apparatus, device, and storage medium
CN110008809B (en) Method and device for acquiring form data and server
US10169673B2 (en) Region-of-interest detection apparatus, region-of-interest detection method, and recording medium
CN110298858B (en) Image clipping method and device
CN110415237B (en) Skin flaw detection method, skin flaw detection device, terminal device and readable storage medium
CN111461126B (en) Space recognition method and device in text line, electronic equipment and storage medium
CN110210480B (en) Character recognition method and device, electronic equipment and computer readable storage medium
CN110708568B (en) Video content mutation detection method and device
CN112651953B (en) Picture similarity calculation method and device, computer equipment and storage medium
CN111784698A (en) Image self-adaptive segmentation method and device, electronic equipment and storage medium
CN113228105A (en) Image processing method and device and electronic equipment
EP4075381B1 (en) Image processing method and system
CN113487473B (en) Method and device for adding image watermark, electronic equipment and storage medium
CN112118478B (en) Text processing method and device, electronic equipment and storage medium
CN112183294A (en) Text elimination method and device, electronic equipment and storage medium
CN116976372A (en) Picture identification method, device, equipment and medium based on square reference code
CN116894849A (en) Image segmentation method and device
CN115345895B (en) Image segmentation method and device for visual detection, computer equipment and medium
CN111160358B (en) Image binarization method, device, equipment and medium
US11657511B2 (en) Heuristics-based detection of image space suitable for overlaying media content
CN113343797A (en) Information extraction method and device, terminal equipment and computer readable storage medium
CN112906708A (en) Picture processing method and device, electronic equipment and computer storage medium
CN112580638A (en) Text detection method and device, storage medium and electronic equipment
CN114119609B (en) Method, device and equipment for detecting image stain concentration and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant