CN113569613A - Image processing method, image processing apparatus, image processing device, and storage medium

Info

Publication number
CN113569613A
Authority
CN
China
Prior art keywords
image
text box
target
type
text
Prior art date
Legal status
Pending
Application number
CN202110195326.4A
Other languages
Chinese (zh)
Inventor
侯昊迪
余亭浩
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110195326.4A
Publication of CN113569613A

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 — Classification techniques

Abstract

The application discloses an image processing method, apparatus, device, and storage medium. The method includes: acquiring an image to be detected; calling a trained target detection model to determine the prediction category of the text box to which each piece of text content in the image to be detected belongs, the prediction category indicating the probability that the text content contained in the text box is incomplete; if the prediction category of at least one text box is a target category, obtaining the confidence of each target text box whose prediction category is the target category, the target category indicating that the probability that the text content contained in the text box is incomplete is greater than a first preset threshold; and if the confidence of at least one target text box is greater than a second preset threshold, outputting the prediction type of the image to be detected as a first type, the first type indicating that the image to be detected is an incomplete subtitle image. The method can improve the accuracy and recall of incomplete subtitle image recognition.

Description

Image processing method, image processing apparatus, image processing device, and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image processing method, an image processing apparatus, an image processing device, and a computer storage medium.
Background
With the popularity and development of information feeds and short videos, more and more platforms allow users (including self-media creators and ordinary users) to produce and distribute images or videos. Incomplete subtitles may appear in the images or videos that users upload and publish, which degrades image or video quality and seriously harms the viewing experience. Therefore, how to identify incomplete subtitle images is an important research topic in image processing technology.
Disclosure of Invention
The embodiment of the invention provides an image processing method, an image processing apparatus, an image processing device and a storage medium, which can determine the prediction category of each text box in an image to be detected through a target detection model, identify whether the image to be detected is an incomplete subtitle image based on the prediction categories of the text boxes, and improve the accuracy and recall of identifying incomplete subtitle images.
In one aspect, an embodiment of the present invention provides an image processing method, where the image processing method includes:
acquiring an image to be detected;
calling a trained target detection model to determine a prediction category of a text box to which each piece of text content in the image to be detected belongs, wherein the prediction category is used for indicating the probability that the text content contained in the text box is incomplete;
if the prediction category of at least one text box is a target category, obtaining a confidence of each target text box whose prediction category is the target category, wherein the target category is used for indicating that the probability that the text content contained in the text box is incomplete is greater than a first preset threshold;
and if the confidence of at least one target text box is greater than a second preset threshold, outputting the prediction type of the image to be detected as a first type, wherein the first type is used for indicating that the image to be detected is an incomplete subtitle image.
In another aspect, an embodiment of the present invention provides an image processing apparatus, including:
the acquisition unit is used for acquiring an image to be detected;
the determining unit is used for calling the trained target detection model to determine the prediction category of the text box to which each piece of text content in the image to be detected belongs, wherein the prediction category is used for indicating the probability that the text content contained in the text box is incomplete;
the obtaining unit is further configured to obtain, if the prediction category of at least one text box is a target category, a confidence of each target text box whose prediction category is the target category, wherein the target category is used to indicate that the probability that the text content contained in the text box is incomplete is greater than a first preset threshold;
and the output unit is used for outputting the prediction type of the image to be detected as a first type if the confidence of at least one target text box is greater than a second preset threshold, wherein the first type is used for indicating that the image to be detected is an incomplete subtitle image.
In another aspect, an embodiment of the present invention provides an image processing device, where the image processing device includes an input interface and an output interface, and further includes:
a processor adapted to implement one or more instructions; and,
a computer storage medium storing one or more instructions adapted to be loaded by the processor and to perform the steps of:
acquiring an image to be detected;
calling a trained target detection model to determine a prediction category of a text box to which each piece of text content in the image to be detected belongs, wherein the prediction category is used for indicating the probability that the text content contained in the text box is incomplete;
if the prediction category of at least one text box is a target category, obtaining a confidence of each target text box whose prediction category is the target category, wherein the target category is used for indicating that the probability that the text content contained in the text box is incomplete is greater than a first preset threshold;
and if the confidence of at least one target text box is greater than a second preset threshold, outputting the prediction type of the image to be detected as a first type, wherein the first type is used for indicating that the image to be detected is an incomplete subtitle image.
In yet another aspect, an embodiment of the present invention provides a computer storage medium, where one or more instructions are stored, and the one or more instructions are adapted to be loaded by a processor and execute the following steps:
acquiring an image to be detected;
calling a trained target detection model to determine a prediction category of a text box to which each piece of text content in the image to be detected belongs, wherein the prediction category is used for indicating the probability that the text content contained in the text box is incomplete;
if the prediction category of at least one text box is a target category, obtaining a confidence of each target text box whose prediction category is the target category, wherein the target category is used for indicating that the probability that the text content contained in the text box is incomplete is greater than a first preset threshold;
and if the confidence of at least one target text box is greater than a second preset threshold, outputting the prediction type of the image to be detected as a first type, wherein the first type is used for indicating that the image to be detected is an incomplete subtitle image.
When the embodiment of the invention identifies whether the image to be detected is an incomplete subtitle image, the trained target detection model may be called to determine the prediction category of the text box to which each piece of text content in the image to be detected belongs, and then it is determined whether the image to be detected contains a target text box whose prediction category is the target category. If such a target text box exists, whether the image to be detected is an incomplete subtitle image is determined according to the confidence of the target text box and a second preset threshold. Compared with incomplete subtitle image recognition methods based on image classification, the method considers both the text boxes that may contain complete subtitle content and the text boxes that may contain incomplete subtitle content in the image to be detected; by calling the trained target detection model to determine the prediction category of the text box to which each piece of text content belongs, each text box in the image to be detected can be distinguished and local area information in the image to be detected can be attended to, which improves the accuracy and recall of incomplete subtitle image recognition. Moreover, incomplete subtitle images are identified according to the comparison between the confidence of the target text boxes containing incomplete subtitle content and the second preset threshold, which further improves the accuracy of incomplete subtitle image recognition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of a partial subtitle provided by an embodiment of the present invention;
fig. 2 is a schematic diagram of a partial subtitle according to an embodiment of the present invention;
FIG. 3a is a schematic view of a scene of an image processing method according to an embodiment of the present invention;
FIG. 3b is a schematic view of another scene of an image processing method according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating an image processing method according to an embodiment of the present invention;
FIG. 5 is a flow chart of another image processing method according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an image processing device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Artificial intelligence technology is a comprehensive discipline that spans a wide range of fields, covering both hardware-level and software-level technology. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning. Computer Vision (CV) technology is a science that studies how to make machines "see"; more specifically, it uses cameras and image processing devices in place of human eyes to recognize, track, and measure targets, and further processes the images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision research attempts to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision techniques typically include image processing, video processing, and Optical Character Recognition (OCR), among others.
With the research and progress of artificial intelligence technology, it has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart medical care, and smart customer service. Artificial intelligence technology can also be applied in other fields; for example, the recognition of incomplete subtitle images can be realized using the computer vision technology within artificial intelligence.
An incomplete subtitle image is an image in which incomplete text content exists. In one embodiment, when a video or image published by a user is not directly captured by an image capture device but is obtained by the user editing image material, the published image or video may contain incomplete subtitles. For example, because of factors such as differing platform aspect ratios, a user may need to crop image material, which may cause the text box to which a subtitle belongs to be cropped, so that the subtitle content in the cropped text box is incomplete; this yields an image containing incomplete subtitle content, that is, an incomplete subtitle image. Subtitles can be divided by text direction into horizontal subtitles (as shown in fig. 1) and vertical subtitles (as shown in fig. 2). In the following, an incomplete subtitle image containing horizontal subtitles and an incomplete subtitle image containing vertical subtitles are used as examples.
Referring to fig. 1, fig. 1 includes an image 100 and a normal horizontal subtitle 101 in the image 100. Optionally, during cropping, the image 100 may be cropped horizontally to obtain the image 110. Accordingly, the normal subtitle 101 in the image 100 may also be cropped horizontally, yielding an incomplete subtitle, namely the horizontally cropped subtitle 111; the image 110 containing the horizontally cropped subtitle 111 is an incomplete subtitle image. Optionally, the image 100 may be cropped vertically to obtain the image 120. Accordingly, the normal subtitle 101 may also be cropped vertically, yielding an incomplete subtitle, namely the vertically cropped subtitle 121; the image 120 containing the vertically cropped subtitle 121 is an incomplete subtitle image. Optionally, the image 100 may be cropped both horizontally and vertically to obtain the image 130. Accordingly, the normal subtitle 101 may also be cropped both horizontally and vertically, yielding an incomplete subtitle, namely the cropped subtitle 131. For horizontal subtitles, however, the effect of horizontal cropping is greater than that of vertical cropping; therefore, a horizontal subtitle that has been cropped both horizontally and vertically may also be referred to as a horizontally cropped horizontal subtitle.
Referring to fig. 2, fig. 2 includes an image 200 and a normal vertical subtitle 201 in the image 200. Optionally, during cropping, the image 200 may be cropped horizontally to obtain the image 210. Accordingly, the normal vertical subtitle 201 in the image 200 may also be cropped horizontally, yielding an incomplete vertical subtitle, namely the horizontally cropped vertical subtitle 211; the image 210 containing the horizontally cropped vertical subtitle 211 is an incomplete subtitle image. Optionally, the image 200 may be cropped vertically to obtain the image 220. Accordingly, the normal vertical subtitle 201 may also be cropped vertically, yielding an incomplete vertical subtitle, namely the vertically cropped vertical subtitle 221; the image 220 containing the vertically cropped vertical subtitle 221 is an incomplete subtitle image. Optionally, the image 200 may be cropped both horizontally and vertically to obtain the image 230. Accordingly, the normal vertical subtitle 201 may also be cropped both horizontally and vertically, yielding an incomplete vertical subtitle, namely the cropped vertical subtitle 231. For vertical subtitles, the influence of vertical cropping is greater than that of horizontal cropping; therefore, a vertical subtitle that has been cropped both horizontally and vertically may also be referred to as a vertically cropped vertical subtitle.
The incomplete subtitle images illustrated above contain either incomplete horizontal subtitles or incomplete vertical subtitles. It should be understood that an incomplete subtitle image according to the present invention may also contain both incomplete horizontal subtitles and incomplete vertical subtitles.
In one embodiment, the embodiment of the present invention provides an image processing method applied to the field of image processing technology based on the machine learning mentioned above. In the image processing method, a trained target detection model may be called to determine the prediction category of the text box to which each piece of text content in the image to be detected belongs. When a target text box containing incomplete text content is identified in the image to be detected according to the prediction categories of the text boxes, whether the image to be detected is an incomplete subtitle image may be judged according to the confidence of the target text box. Compared with incomplete subtitle image recognition methods based on image classification, this effectively attends to local features in the image to be detected: the target detection model distinguishes the text content of each text box, so text boxes containing complete subtitle content and text boxes containing incomplete subtitle content can be detected from the image simultaneously, and incomplete subtitle images are identified from the text boxes of incomplete subtitle content. This improves the recognition accuracy and recall of incomplete subtitle images; at the same recognition accuracy, the recall is improved by 21 percentage points.
In a specific implementation, the image processing method may be executed by an image processing device, which may be any device with data computing capability, such as a terminal device or a server. The terminal device may include, but is not limited to: smartphones, tablets, laptops, wearable devices, desktop computers, and the like. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, Content Delivery Network (CDN), middleware services, domain name services, security services, and big data and artificial intelligence platforms.
The image processing method has a wide range of application scenarios and is applicable to various platforms that require content review or cover image selection, such as news and information platforms, browser platforms, and short video platforms. When the image processing method is applied to an information feed platform, content or cover images can be effectively reviewed, incomplete subtitle images on the platform are significantly reduced, and user experience is improved.
In one embodiment, the image processing method provided by the embodiment of the invention can be applied to various platforms that require content review. Before a user uploads or publishes an image or video, the image processing device processes the image or video to be uploaded or published using the image processing method and identifies whether incomplete subtitle images exist in it. If incomplete subtitle images exist, the content is intercepted or down-weighted; for example, an image or video to be uploaded is intercepted, or the distribution weight of an image or video to be published is reduced. If no incomplete subtitle image exists, the content is published or uploaded normally, for example into the recommendation pool, as shown in fig. 3 a. In one embodiment, the image processing method can also be applied to various platforms that require cover image selection. On information feed platforms and short video platforms, in order to attract users to read, the image processing device processes the candidate cover images of the image or video to be uploaded or published. If incomplete subtitles exist in a candidate cover image, that is, the candidate cover image is an incomplete subtitle image, it cannot be used as the final cover image; it is removed, and the next stage of cover selection is not triggered. If no incomplete subtitles exist in a candidate cover image, that is, it is a complete subtitle image, the next stage of cover selection can be triggered and the cover processing module executes the next step, as shown in fig. 3 b.
Based on the above description, an embodiment of the present invention provides an image processing method; the image processing method may be performed by the above-mentioned image processing device. Referring to fig. 4, the image processing method may include the following S401-S404:
S401, acquiring an image to be detected.
In the embodiment of the present invention, the resolution of the image to be detected may be any value; for example, it may be 600 × 800, or 3000 × 4000, and so on. Specifically, the image to be detected may be an independent single-frame image, or any frame in an image sequence. An image sequence is a set of images arranged in chronological order; for example, the image sequence may be a video, such as a movie or a short video. A short video is generally a video broadcast on new internet media with a duration within N minutes (e.g., 4 minutes, 5 minutes, etc.).
S402, calling the trained target detection model to determine the prediction category of the text box to which each text content in the image to be detected belongs, wherein the prediction category is used for indicating the probability that the text content contained in the text box is incomplete.
The prediction category of a text box is one of a plurality of candidate categories. The image processing device may determine the candidate categories of a text box based on the text content within the text box. In one embodiment, as can be seen from the foregoing, subtitles may fall into the following six cases: normal horizontal subtitles, horizontally cropped horizontal subtitles, vertically cropped horizontal subtitles, normal vertical subtitles, horizontally cropped vertical subtitles, and vertically cropped vertical subtitles. The candidate categories of a text box may accordingly include one or more of the following: a text box containing a normal horizontal subtitle, a text box containing a horizontally cropped horizontal subtitle, a text box containing a vertically cropped horizontal subtitle, a text box containing a normal vertical subtitle, a text box containing a horizontally cropped vertical subtitle, and a text box containing a vertically cropped vertical subtitle.
The prediction category is used to indicate a probability that the text content contained in the text box is incomplete, that is, the prediction category is used to indicate a probability that the subtitles contained in the text box are incomplete subtitles.
The target detection model may be any one of the following: a model constructed based on the EfficientDet algorithm, a model constructed based on the YOLO algorithm, or a model constructed based on the RCNN algorithm.
In one embodiment, a target detection model constructed with the EfficientDet algorithm may be composed of the neural network EfficientNet, a bidirectional feature pyramid network (BiFPN), and a target detection prediction head network. EfficientNet is a convolutional neural network (CNN) based image feature extraction network constructed through Neural Architecture Search (NAS); it extracts the basic feature set of the image to be detected, which comprises basic features of the image at multiple scales. The BiFPN is a feature fusion network used to fuse the multi-scale basic features in the basic feature set into fused features, so that the fused features aggregate information at multiple scales and the accuracy of the target detection model is improved. The target detection prediction head network predicts, from the fused features, the prediction category of the text box to which each piece of text content in the image to be detected belongs.
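To make this composition concrete, the following is a minimal structural sketch in PyTorch of an EfficientDet-style detector. The backbone, fusion, and head modules here are simplified stand-ins (not the real EfficientNet or BiFPN), and the layer sizes and six-category output are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyBackbone(nn.Module):
    """Stand-in for EfficientNet: emits a basic feature set at three scales."""
    def __init__(self, ch=64):
        super().__init__()
        self.s1 = nn.Sequential(nn.Conv2d(3, ch, 3, stride=4, padding=1), nn.ReLU())
        self.s2 = nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())
        self.s3 = nn.Sequential(nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        f1 = self.s1(x)
        f2 = self.s2(f1)
        f3 = self.s3(f2)
        return [f1, f2, f3]

class TinyBiFPN(nn.Module):
    """Stand-in for BiFPN: one top-down plus bottom-up fusion pass."""
    def __init__(self, ch=64):
        super().__init__()
        self.convs = nn.ModuleList(nn.Conv2d(ch, ch, 3, padding=1) for _ in range(3))

    def forward(self, feats):
        f1, f2, f3 = feats
        f2 = f2 + F.interpolate(f3, size=f2.shape[-2:])     # top-down fusion
        f1 = f1 + F.interpolate(f2, size=f1.shape[-2:])
        f2 = f2 + F.adaptive_max_pool2d(f1, f2.shape[-2:])  # bottom-up fusion
        f3 = f3 + F.adaptive_max_pool2d(f2, f3.shape[-2:])
        return [c(f) for c, f in zip(self.convs, (f1, f2, f3))]

class Head(nn.Module):
    """Prediction head: per cell, 4 box offsets plus scores for 6 text-box categories."""
    def __init__(self, ch=64, num_classes=6):
        super().__init__()
        self.out = nn.Conv2d(ch, 4 + num_classes, 1)

    def forward(self, feats):
        return [self.out(f) for f in feats]

class CaptionDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone, self.fpn, self.head = TinyBackbone(), TinyBiFPN(), Head()

    def forward(self, x):
        return self.head(self.fpn(self.backbone(x)))

preds = CaptionDetector()(torch.randn(1, 3, 256, 256))  # one prediction map per scale
```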
In one embodiment, a target detection model constructed based on the YOLO algorithm first divides the image to be detected into a fixed grid (for example, 7 × 7); if the center of a sample object falls in a grid cell, that cell is responsible for regressing the object's position. Each cell predicts the object's position and confidence information and encodes this information into a vector as the cell's prediction; the predictions of all cells are then combined to obtain the prediction category of the text box to which each piece of text content in the image to be detected belongs.
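A sketch of the grid-decoding step just described, assuming a 7 × 7 grid, one box per cell, and six text-box categories (the array layout is an illustrative assumption, not the exact YOLO output format):

```python
import numpy as np

S, C = 7, 6                              # grid size; number of text-box categories
pred = np.random.rand(S, S, 5 + C)       # stand-in for the network's grid output

def decode(pred, conf_thresh=0.5):
    """Combine per-cell predictions into (cx, cy, w, h, conf, category) tuples."""
    boxes = []
    for i in range(S):                   # grid row
        for j in range(S):               # grid column
            x, y, w, h, conf = pred[i, j, :5]
            if conf <= conf_thresh:      # keep only confident cells
                continue
            category = int(np.argmax(pred[i, j, 5:]))
            cx, cy = (j + x) / S, (i + y) / S   # cell-relative -> image-relative
            boxes.append((cx, cy, w, h, float(conf), category))
    return boxes

print(len(decode(pred)))
```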
In one embodiment, a target detection model constructed based on the RCNN algorithm first extracts candidate regions in the image to be detected using a region proposal algorithm and normalizes the extracted candidate regions to obtain the input of a CNN. The CNN then applies operations such as convolution and/or pooling to the input to obtain feature vectors of fixed dimension. Finally, a trained classifier classifies the feature vectors to obtain the prediction category of the text box to which each piece of text content in the image to be detected belongs.
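And a corresponding R-CNN-style sketch: candidate regions are warped to a fixed size, passed through a CNN to obtain fixed-dimension feature vectors, and classified. Here `propose`, `cnn_features`, and `classifier` are assumed callables standing in for the region proposal algorithm, the CNN, and the trained classifier:

```python
import cv2

def rcnn_predict(image, propose, cnn_features, classifier, size=224):
    """R-CNN sketch: proposals -> normalised regions -> features -> categories."""
    results = []
    for (x0, y0, x1, y1) in propose(image):          # candidate regions
        region = image[y0:y1, x0:x1]
        warped = cv2.resize(region, (size, size))    # normalise to CNN input size
        feat = cnn_features(warped)                  # fixed-dimension feature vector
        results.append(classifier(feat))             # predicted text-box category
    return results
```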
S403, if the prediction category of at least one text box is a target category, obtaining the confidence of each target text box whose prediction category is the target category, wherein the target category is used for indicating that the probability that the text content contained in the text box is incomplete is greater than a first preset threshold.
Specifically, if the prediction category of at least one text box in the image to be detected is the target category, the confidence of each target text box whose prediction category is the target category is obtained; if no text box in the image to be detected has the target category, the prediction type of the image to be detected is output as a second type, the second type indicating that the image to be detected is a complete subtitle image. The target text box may include one or more of the following: a text box containing a horizontally cropped horizontal subtitle, a text box containing a vertically cropped horizontal subtitle, a text box containing a horizontally cropped vertical subtitle, and a text box containing a vertically cropped vertical subtitle. The probability that the subtitle content contained in each target text box is an incomplete subtitle is greater than the first preset threshold.
For example, suppose there are three text boxes in the image to be detected: text box A, text box B, and text box C. The trained target detection model determines prediction category A for text box A, prediction category B for text box B, and prediction category C for text box C. When prediction category A indicates that the probability that the text content contained in text box A is incomplete is less than or equal to the first preset threshold, and prediction categories B and C indicate likewise for text boxes B and C, there is no target text box whose prediction category is the target category in the image to be detected, and the image processing device may output the type of the image to be detected as the second type. When prediction category A indicates that the probability that the text content contained in text box A is incomplete is less than or equal to the first preset threshold, while prediction category B indicates that the probability for text box B is greater than the first preset threshold and prediction category C indicates that the probability for text box C is greater than the first preset threshold, target text boxes whose prediction category is the target category exist in the image to be detected, and the image processing device may obtain the confidence of text box B and the confidence of text box C.
S404, if the confidence of at least one target text box is greater than a second preset threshold, outputting the prediction type of the image to be detected as a first type, wherein the first type is used for indicating that the image to be detected is an incomplete subtitle image.
Specifically, the image processing device may compare the confidence of each target text box with the second preset threshold; if the confidence of at least one target text box is greater than the second preset threshold, the prediction type of the image to be detected is output as the first type, and if the confidence of every target text box is less than or equal to the second preset threshold, the prediction type of the image to be detected is output as the second type.
Continuing the above example, after the confidences of text box B and text box C are obtained, if both are less than or equal to the second preset threshold, the type of the image to be detected is output as the second type; if the confidence of text box B or the confidence of text box C is greater than the second preset threshold, the image processing device may output the prediction type of the image to be detected as the first type.
The second preset thresholds corresponding to the prediction categories of the target text boxes may be the same or different. In one embodiment, when the second preset thresholds corresponding to the prediction categories of the target text boxes are the same, the confidence of each target text box is compared with this common second preset threshold, and the prediction type of the image to be detected is output according to the comparison results. Continuing the above example, the confidences of text box B and text box C are compared with the same second preset threshold, and the prediction type of the image to be detected is output accordingly. In another embodiment, the second preset threshold may differ per prediction category. For example: when the target text box contains a horizontally cropped horizontal subtitle, the second preset threshold is second preset threshold 1; when it contains a vertically cropped horizontal subtitle, second preset threshold 2; when it contains a horizontally cropped vertical subtitle, second preset threshold 3; and when it contains a vertically cropped vertical subtitle, second preset threshold 4. Following the example, if text box B contains a horizontally cropped horizontal subtitle and text box C contains a vertically cropped horizontal subtitle, the image processing device compares the confidence of text box B with second preset threshold 1 and the confidence of text box C with second preset threshold 2, and outputs the prediction type of the image to be detected according to the comparison results. As another example, a threshold may be shared by groups of categories: second preset threshold 1 when the target text box contains a horizontally cropped or vertically cropped horizontal subtitle, and second preset threshold 2 when it contains a horizontally cropped or vertically cropped vertical subtitle; and so on.
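Putting S403 and S404 together, a minimal sketch of the decision logic, assuming the detector returns (category, confidence) pairs; the category names and threshold values are illustrative assumptions, not values specified by the patent:

```python
# Four target (cropped) categories out of the six candidate categories.
TARGET = {"h_crop_h_sub", "v_crop_h_sub", "h_crop_v_sub", "v_crop_v_sub"}

# Per-category second preset thresholds (values are illustrative only);
# using the same value everywhere gives the shared-threshold variant.
SECOND_THRESH = {"h_crop_h_sub": 0.6, "v_crop_h_sub": 0.7,
                 "h_crop_v_sub": 0.7, "v_crop_v_sub": 0.6}

def classify_image(detections):
    """S403/S404: first type (incomplete) if any target text box's confidence
    exceeds the second preset threshold for its category, else second type."""
    targets = [(c, s) for c, s in detections if c in TARGET]
    if not targets:
        return "second type (complete subtitle image)"
    if any(s > SECOND_THRESH[c] for c, s in targets):
        return "first type (incomplete subtitle image)"
    return "second type (complete subtitle image)"

print(classify_image([("normal_h_sub", 0.95), ("h_crop_h_sub", 0.82)]))
```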
When the embodiment of the invention identifies whether the image to be detected is an incomplete subtitle image, the trained target detection model may be called to determine the prediction category of the text box to which each piece of text content in the image to be detected belongs, and then it is determined whether the image to be detected contains a target text box whose prediction category is the target category. If such a target text box exists, whether the image to be detected is an incomplete subtitle image is determined according to the confidence of the target text box and a second preset threshold. Compared with incomplete subtitle image recognition methods based on image classification, the method considers both the text boxes that may contain complete subtitle content and the text boxes that may contain incomplete subtitle content in the image to be detected; by calling the trained target detection model to determine the prediction category of the text box to which each piece of text content belongs, each text box in the image to be detected can be distinguished and local area information in the image to be detected can be attended to, which improves the accuracy and recall of incomplete subtitle image recognition. Moreover, incomplete subtitle images are identified according to the comparison between the confidence of the target text boxes containing incomplete subtitle content and the second preset threshold, which further improves the accuracy of incomplete subtitle image recognition.
As can be seen from the description of the embodiment shown in fig. 4, the image processing method shown in fig. 4 calls a trained target detection model to recognize incomplete subtitle images; the target detection model therefore needs to be trained before it is invoked. Based on this, referring to fig. 5, fig. 5 shows a flowchart of another image processing method, which may include S501-S506:
S501, performing character recognition on the initial image through an optical character recognition algorithm, and determining each candidate text box in the initial image and the text content contained in each candidate text box.
Optical Character Recognition (OCR) is a process in which the image processing device determines the candidate text boxes and the text content they contain by detecting patterns of dark and light, and then translates the detected shapes into computer text using a character recognition method.
In one embodiment, the image processing device may include an OCR module that performs character recognition on the initial image with an OCR algorithm, using an OCR recognition service provided by a data platform, to determine the candidate text boxes in the initial image and the text content contained in each candidate text box. The initial image may be image material acquired by an image capture device, or an image obtained through editing.
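As an illustration only, the sketch below uses the open-source pytesseract library in place of the platform's OCR service to extract candidate text boxes and their text content:

```python
import pytesseract
from pytesseract import Output
from PIL import Image

def candidate_text_boxes(path, min_conf=60):
    """Detect candidate text boxes in the initial image and the text they contain."""
    data = pytesseract.image_to_data(Image.open(path), output_type=Output.DICT)
    boxes = []
    for i, text in enumerate(data["text"]):
        if text.strip() and float(data["conf"][i]) >= min_conf:
            boxes.append({"text": text,
                          "box": (data["left"][i], data["top"][i],
                                  data["width"][i], data["height"][i])})
    return boxes
```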
S502, calling an incoherence recognition model to process the text content contained in each candidate text box, and determining, among all candidate text boxes contained in the initial image, the reference text boxes whose text content is incoherent.
Since the OCR algorithm detects and recognizes all text content in the initial image, it cannot by itself distinguish whether text content in the initial image is manually added subtitle content or background text carried by the image itself, for example, text contained in the captured image material. Moreover, because background text carried by the image itself usually lies in the lowest layer of the initial image, it may be partially covered by layers placed above the lowest layer, so that its text content reads as incoherent. Therefore, the incoherence recognition model may be called to process the text content contained in each candidate text box and determine, among all candidate text boxes contained in the initial image, the reference text boxes whose text content is incoherent.
Specifically, the image processing device may first call the incoherence recognition model to perform incoherence detection on each line of text content of each candidate text box, obtaining an incoherence evaluation value for each line. Next, the lines of text content of each candidate text box are concatenated in context order to obtain the concatenated text content of the candidate text box, and the incoherence recognition model is called to perform incoherence detection on the concatenated text content, obtaining its incoherence evaluation value. Finally, the incoherence evaluation value of each candidate text box is determined based on the incoherence evaluation values of its lines and of its concatenated text content; if the incoherence evaluation value of any candidate text box is greater than the incoherence evaluation threshold, that candidate text box is determined to be a reference text box whose text content is incoherent.
The incoherence recognition model may be a BERT-based text classification model; it performs semantic fluency recognition on text content and outputs an incoherence evaluation value for the text content.
In order to avoid misjudging subtitle content that contains line breaks as incoherent background text, the incoherence evaluation value of a candidate text box needs to be determined based on both the incoherence evaluation values of its individual lines and the incoherence evaluation value of its concatenated text content. Optionally, the minimum of the line-level incoherence evaluation values and the concatenated-content incoherence evaluation value may be taken as the incoherence evaluation value of the candidate text box. Optionally, the line-level incoherence evaluation values and the concatenated-content incoherence evaluation value may be weighted and summed to obtain the incoherence evaluation value of the candidate text box.
If the incoherence evaluation value of any candidate text box is greater than the incoherence evaluation threshold, that candidate text box is determined to be a reference text box whose text content is incoherent. The incoherence evaluation value may take multiple forms, for example an incoherence score or an incoherence grade. When the incoherence evaluation value is an incoherence score, the incoherence evaluation threshold is an incoherence score threshold; if the incoherence score of a candidate text box is greater than the score threshold, the candidate text box is a reference text box. When the incoherence evaluation value is an incoherence grade, the incoherence evaluation threshold is an incoherence grade threshold; if the incoherence grade of a candidate text box is higher than the grade threshold, the candidate text box is a reference text box.
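A minimal sketch of the evaluation scheme in S502, assuming `score_fn` is the BERT-based classifier and returns an incoherence score in [0, 1]; the threshold and weights are illustrative assumptions:

```python
def box_incoherence(lines, score_fn, combine="min"):
    """Combine per-line scores with the score of the concatenated lines, so a
    subtitle that merely wraps across lines is not flagged as incoherent."""
    per_line = [score_fn(line) for line in lines]
    joined = score_fn(" ".join(lines))                  # context-concatenated content
    scores = per_line + [joined]
    if combine == "min":
        return min(scores)                              # option 1: take the minimum
    weights = [0.5 / len(per_line)] * len(per_line) + [0.5]
    return sum(w * s for w, s in zip(weights, scores))  # option 2: weighted sum

def is_reference_box(lines, score_fn, thresh=0.8):
    """Reference (incoherent) text box if the evaluation exceeds the threshold."""
    return box_incoherence(lines, score_fn) > thresh
```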
S503, filtering out the text content contained in the reference text boxes from the initial image to obtain a target image.
The initial image may be one image or an image set including a plurality of images. Accordingly, the target image obtained from the initial image may be one image or may be an image set including a plurality of images.
S504, cropping the target image according to a preset cropping mode to obtain a sample image.
Optionally, when the target image is a single image, it may be cropped multiple times to obtain a sample image set comprising multiple sample images. Optionally, when the target image is an image set comprising multiple images, each image in the set may be cropped once in the preset cropping mode to obtain a sample image corresponding to each image, and the sample image set is then obtained from the sample images of all the images; alternatively, each image in the set may be cropped multiple times in the preset cropping mode to obtain multiple sample images per image, and the sample image set is then obtained from all the sample images. The preset cropping mode may include random cropping and/or fixed cropping. The vertical and horizontal extent of the cropping indicated by the preset cropping mode may not exceed one third of the height and width of the target image, respectively.
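A sketch of the random variant of the preset cropping mode, keeping the cropped extent within one third of the target image's width and height. PIL is used here as an assumed image library, and the returned offset is an illustrative convenience for the label derivation in S505:

```python
import random
from PIL import Image

def preset_crop(img, rng=random):
    """Random crop removing at most one third of the width and of the height."""
    w, h = img.size
    dx = rng.randint(0, w // 3)        # total width to remove
    dy = rng.randint(0, h // 3)        # total height to remove
    left = rng.randint(0, dx)          # split the removal between the two sides
    top = rng.randint(0, dy)
    sample = img.crop((left, top, w - (dx - left), h - (dy - top)))
    return sample, (left, top)         # offset maps boxes back to the target image

# usage: sample, offset = preset_crop(Image.open("target.jpg"))
```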
S505, identifying each text box in the sample image according to the position and size of each text box in the target image and the preset cropping mode, and comparing each text box in the sample image with the corresponding text box in the target image to determine a reference type of the sample image.
Specifically, each text box in the sample image is identified according to the position and size of each text box in the target image and the preset cropping mode; each text box in the sample image is then compared with the corresponding text box in the target image to determine the category of each text box in the sample image, from which the reference type of the sample image is determined.
To better illustrate the embodiment of the present invention, the example of fig. 1 is used. In fig. 1, suppose the image 100 is the target image, and the text box containing the normal subtitle 101 is named text box 102. Assume the sample image 110 is obtained through the preset cropping mode. In the sample image 110, the text box 112 can be identified; comparing text box 112 with text box 102 shows that text box 112 has been cropped, so it can be determined that the normal subtitle 101 within text box 112 was cropped, yielding the horizontally cropped subtitle 111. The category of each text box in each sample image of the sample image set can be determined by a method similar to this way of determining the category of the text box contained in the sample image. The text boxes in each sample image may fall into one or more of the following categories: a text box containing a normal horizontal subtitle, a text box containing a horizontally cropped horizontal subtitle, a text box containing a vertically cropped horizontal subtitle, a text box containing a normal vertical subtitle, a text box containing a horizontally cropped vertical subtitle, and a text box containing a vertically cropped vertical subtitle. Based on this, the reference type of each sample image may be determined from the categories of the text boxes within that sample image.
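A sketch of this comparison, assuming text boxes are (left, top, right, bottom) tuples in target-image coordinates and `crop_window` is the rectangle kept by the preset cropping mode; the category names match the illustrative ones used earlier:

```python
def box_category(orig_box, crop_window, orientation):
    """Derive a sample-image text box's category by comparing it with the
    corresponding box in the target image; orientation is "h" or "v"."""
    cl, ct, cr, cb = crop_window                # crop window in target coordinates
    l, t, r, b = orig_box
    il, it = max(l, cl), max(t, ct)             # intersection with the window
    ir, ib = min(r, cr), min(b, cb)
    if il >= ir or it >= ib:
        return None                             # box cropped away entirely
    h_cut = il > l or ir < r                    # lost width  -> horizontal crop
    v_cut = it > t or ib < b                    # lost height -> vertical crop
    if not (h_cut or v_cut):
        return f"normal_{orientation}_sub"
    if orientation == "h":                      # horizontal cropping dominates
        return "h_crop_h_sub" if h_cut else "v_crop_h_sub"
    return "v_crop_v_sub" if v_cut else "h_crop_v_sub"  # vertical dominates
```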
S506, constructing a training sample containing the sample image and the reference type of the sample image, and calling a target detection model to process the sample image to obtain the prediction type of the sample image; and training the target detection model according to the prediction type and the reference type to obtain the trained target detection model.
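A generic supervised training sketch for S506, assuming a PyTorch detector, a data loader of (sample image, reference type) pairs, and a `loss_fn` that compares the model's prediction with the reference; all names here are assumptions rather than the patent's specified implementation:

```python
import torch

def train(model, loader, loss_fn, epochs=10, lr=1e-4):
    """Train the target detection model from prediction-vs-reference error."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, reference in loader:     # training samples from S501-S505
            opt.zero_grad()
            loss = loss_fn(model(images), reference)
            loss.backward()
            opt.step()
    return model                             # the trained target detection model
```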
After obtaining the trained target detection model through the above S501-S506, the image processing device may use the trained target detection model to make predictions on the image to be detected and output the detection result, that is, output whether the prediction type of the image to be detected is an incomplete subtitle image or a complete subtitle image.
When constructing training samples, the embodiment of the invention performs character recognition on the initial image with an OCR algorithm, calls the incoherence recognition model to determine, among all candidate text boxes in the initial image, the reference text boxes whose text content is incoherent, filters the text content contained in the reference text boxes out of the initial image to obtain the target image, crops the target image according to the preset cropping mode to obtain the sample image, and determines the reference type of the sample image from each text box in the sample image and the corresponding text box in the target image. In this way, a training sample set containing a large number of sample images and their reference types can be obtained, saving the labor and time costs of data annotation. Meanwhile, compared with OCR-based incomplete subtitle image recognition methods, the reference text boxes containing background text can be determined through the incoherence recognition model, so that incomplete subtitles and background text detected by OCR are accurately distinguished and incoherent background text is not mistakenly recognized as incomplete subtitles. This improves the accuracy of training sample construction, which in turn improves the accuracy of the trained target detection model and the accuracy of incomplete subtitle image recognition.
Based on the description of the above embodiment of the image processing method, the embodiment of the present invention also discloses an image processing apparatus, which may be a computer program (including a program code) running in the above mentioned image processing device. The image processing apparatus may perform the method shown in fig. 4 or fig. 5. Referring to fig. 6, the image processing apparatus may operate the following units:
an acquiring unit 601 configured to acquire an image to be detected;
a determining unit 602, configured to invoke a trained target detection model to determine a prediction category of a text box to which each text content in the image to be detected belongs, where the prediction category is used to indicate a probability that the text content included in the text box is incomplete;
the obtaining unit 601 is further configured to obtain, if the prediction category of at least one text box is a target category, the confidence of each target text box whose prediction category is the target category, where the target category is used to indicate that the probability that the text content contained in the text box is incomplete is greater than a first preset threshold;
an output unit 603, configured to output that the prediction type of the image to be detected is a first type if the confidence of at least one target text box is greater than a second preset threshold, where the first type is used to indicate that the image to be detected is an incomplete subtitle image.
In an embodiment, after obtaining the confidence of each target text box whose prediction category is the target category, the output unit 603 is further configured to:
and if the confidence of each target text box is less than or equal to the second preset threshold, outputting that the type of the image to be detected is a second type, wherein the second type is used for indicating that the image to be detected is a complete subtitle image.
In another embodiment, in outputting the prediction type of the image to be detected as the first type when the confidence of at least one target text box is greater than a second preset threshold, the output unit 603 is configured to:
compare the confidence of each target text box with the second preset threshold corresponding to the prediction category of that target text box;
and if the confidence of at least one target text box is greater than the second preset threshold corresponding to its prediction category, output the type of the image to be detected as the first type.
In another embodiment, before the trained target detection model is called to determine the prediction category of the text box to which each piece of text content in the image to be detected belongs, the obtaining unit 601 is further configured to obtain a training sample, where the training sample includes a sample image and a reference type of the sample image;
calling a target detection model to process the sample image to obtain the prediction type of the sample image;
and training the target detection model according to the prediction type and the reference type to obtain the trained target detection model.
In another embodiment, when obtaining the training sample, the obtaining unit 601 is configured to:
crop the target image according to a preset cropping mode to obtain a sample image;
identify each text box in the sample image according to the position and size of each text box in the target image and the preset cropping mode;
compare each text box in the sample image with the corresponding text box in the target image to determine a reference type of the sample image;
and construct a training sample comprising the sample image and the reference type of the sample image.
In another embodiment, before cropping the target image according to the preset cropping mode to obtain the sample image, the obtaining unit 601 is further configured to:
perform character recognition on an initial image through an optical character recognition algorithm, and determine each candidate text box in the initial image and the text content contained in each candidate text box;
call an incoherence recognition model to process the text content contained in each candidate text box, and determine, among all candidate text boxes contained in the initial image, the reference text boxes whose text content is incoherent;
and filter out the text content contained in the reference text boxes in the initial image to obtain the target image.
In another embodiment, when calling the incoherence recognition model to process the text content contained in each candidate text box and determining, among all candidate text boxes contained in the initial image, the reference text boxes whose text content is incoherent, the obtaining unit 601 is configured to:
call the incoherence recognition model to perform incoherence detection on each line of text content of each candidate text box to obtain an incoherence evaluation value for each line of text content of each candidate text box;
concatenate the lines of text content of each candidate text box in context order to obtain the concatenated text content of each candidate text box, and call the incoherence recognition model to perform incoherence detection on the concatenated text content of each candidate text box to obtain an incoherence evaluation value for the concatenated text content of each candidate text box;
determine the incoherence evaluation value of each candidate text box based on the incoherence evaluation values of each line of text content of the candidate text box and the incoherence evaluation value of the concatenated text content of the candidate text box;
and if the incoherence evaluation value of any candidate text box is greater than the incoherence evaluation threshold, determine that this candidate text box is a reference text box whose text content is incoherent.
According to an embodiment of the present invention, each step involved in the method shown in fig. 4 or fig. 5 may be performed by a unit in the image processing apparatus shown in fig. 6. For example, steps S401 and S403 shown in fig. 4 are performed by the acquisition unit 601 shown in fig. 6, step S402 is performed by the determination unit 602 shown in fig. 6, and step S404 is performed by the output unit 603 shown in fig. 6.
According to another embodiment of the present invention, the units in the image processing apparatus shown in fig. 6 may be combined, individually or entirely, into one or several other units to form the image processing apparatus, or one or more of the units may be further split into multiple functionally smaller units to form the image processing apparatus. Either arrangement performs the same operations without affecting the technical effect of the embodiment of the present invention. The units are divided based on logical functions; in practical applications, the function of one unit may be implemented by multiple units, or the functions of multiple units may be implemented by one unit. In other embodiments of the present invention, the image processing apparatus may also include other units, and in practical applications these functions may be implemented with the assistance of other units or through the cooperation of multiple units.
According to another embodiment of the present invention, the image processing apparatus shown in fig. 6 may be constructed, and the image processing method of the embodiment of the present invention implemented, by running a computer program (including program code) capable of executing the steps of the corresponding method shown in fig. 4 or fig. 5 on a general-purpose computing device, such as a computer, that includes processing elements and storage elements such as a central processing unit (CPU), a random access memory (RAM), and a read-only memory (ROM). The computer program may, for example, be recorded on a computer-readable recording medium, and loaded into and executed in the image processing apparatus via the computer-readable recording medium.
When the embodiment of the invention identifies whether an image to be detected is an incomplete subtitle image, the trained target detection model can be called to determine the prediction category of the text box to which each text content in the image to be detected belongs, and it is then determined whether a target text box whose prediction category is the target category exists in the image to be detected. If such a target text box exists, whether the image to be detected is an incomplete subtitle image is determined according to the confidence of the target text box and the second preset threshold. Compared with incomplete-subtitle recognition methods based on whole-image classification, this approach considers both the text boxes that may contain complete subtitle content and the text boxes that may contain incomplete subtitle content in the image to be detected. Calling the trained target detection model to determine the prediction category of the text box to which each text content belongs distinguishes the individual text boxes in the image and attends to local region information, which improves the accuracy and recall of incomplete subtitle image recognition. Moreover, because incomplete subtitle images are identified by comparing the confidence of target text boxes containing incomplete subtitle content against the second preset threshold, the accuracy of incomplete subtitle image recognition can be further improved.
Based on the description of the foregoing image processing method embodiments, an embodiment of the invention also discloses an image processing device. Referring to fig. 7, the image processing device includes at least a processor 701, an input interface 702, an output interface 703, and a computer storage medium 704, which may be connected by a bus or in other ways.
The computer storage medium 704 is a memory device in the image processing device for storing programs and data. It is understood that the computer storage medium 704 here may include a built-in storage medium of the image processing device, and may also include an extended storage medium supported by the image processing device. The computer storage medium 704 provides a storage space that stores the operating system of the image processing device. One or more instructions suitable for loading and execution by the processor 701 are also stored in this storage space; these instructions may be one or more computer programs (including program code). The computer storage medium here may be a high-speed RAM memory; optionally, the image processing device may further include at least one computer storage medium remote from the processor. The processor, which may be referred to as a central processing unit (CPU), is the computing core and control center of the image processing device, and is adapted to implement one or more instructions, specifically loading and executing the one or more instructions to implement the corresponding method flow or function.
In one embodiment, one or more instructions stored in the computer storage medium 704 may be loaded and executed by the processor 701 to implement the steps of the corresponding method shown in fig. 4 or fig. 5; in particular, the one or more instructions stored in the computer storage medium 704 may be loaded and executed by the processor 701 to implement the following steps:
acquiring an image to be detected;
calling a trained target detection model to determine a prediction category of a text box to which each text content in the image to be detected belongs, wherein the prediction category is used for indicating the probability that the text content contained in the text box is incomplete content;
if the prediction category of at least one text box is a target category, obtaining the confidence of each target text box whose prediction category is the target category, wherein the target category is used for indicating that the probability that the text content contained in the text box is incomplete content is greater than a first preset threshold;
and if the confidence of at least one target text box is greater than a second preset threshold, outputting that the prediction type of the image to be detected is a first type, wherein the first type is used for indicating that the image to be detected is an incomplete subtitle image.
In one embodiment, after obtaining the confidence of each target text box whose prediction category is the target category, the processor 701 is further configured to:
if the confidence of each target text box is less than or equal to the second preset threshold, output that the type of the image to be detected is a second type, wherein the second type is used for indicating that the image to be detected is a complete subtitle image.
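Taken together with the preceding embodiment, these two branches form a two-threshold decision. A sketch follows, in which the detection output format and the threshold values are assumptions for illustration:

```python
def classify_image(detections, first_threshold=0.5, second_threshold=0.8):
    """Two-threshold decision over the detection model's output.

    detections: [(incomplete_probability, confidence), ...] per text box; this
                output format and both threshold values are assumed here.
    """
    # Target text boxes: incomplete-content probability above the first threshold.
    target_confidences = [conf for prob, conf in detections
                          if prob > first_threshold]
    if any(conf > second_threshold for conf in target_confidences):
        return "first type"    # incomplete subtitle image
    return "second type"       # complete subtitle image
```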
In another embodiment, when outputting that the prediction type of the image to be detected is the first type if the confidence of at least one target text box is greater than a second preset threshold, the processor 701 is configured to:
compare the confidence of each target text box with the second preset threshold corresponding to the prediction category of that target text box;
and if the confidence of at least one target text box is greater than the second preset threshold corresponding to its prediction category, output that the type of the image to be detected is the first type.
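This variant can be sketched as a lookup of a per-category second threshold; the category names and threshold values below are purely illustrative and not taken from the patent:

```python
# Illustrative per-category second thresholds; names and values are assumptions.
SECOND_THRESHOLDS = {
    "top_truncated": 0.80,
    "bottom_truncated": 0.85,
}

def is_first_type(target_boxes, default_threshold=0.80):
    """target_boxes: [(prediction_category, confidence), ...] for target boxes."""
    return any(conf > SECOND_THRESHOLDS.get(category, default_threshold)
               for category, conf in target_boxes)
```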
In another embodiment, before the trained target detection model is called to determine the prediction category of the text box to which each text content in the image to be detected belongs, the processor 701 is further configured to obtain a training sample, where the training sample includes a sample image and a reference type of the sample image;
call the target detection model to process the sample image to obtain a prediction type of the sample image;
and train the target detection model according to the prediction type and the reference type to obtain the trained target detection model.
In another embodiment, the processor 701 obtains the training sample by:
cropping the target image according to a preset cropping mode to obtain a sample image;
identifying each text box in the sample image according to the position and size of each text box in the target image and the preset cropping mode;
comparing each text box in the sample image with the corresponding text box in the target image to determine the reference type of the sample image;
and constructing a training sample including the sample image and the reference type of the sample image.
In another embodiment, before cropping the target image according to the preset cropping mode to obtain the sample image, the processor 701 is further configured to:
perform character recognition on an initial image through an optical character recognition algorithm, and determine each candidate text box in the initial image and the text content contained in each candidate text box;
call a non-compliance recognition model to process the text content contained in each candidate text box, and determine, among all the candidate text boxes contained in the initial image, a reference text box whose text content is non-compliant;
and filter out the text content contained in the reference text box in the initial image to obtain the target image.
In another embodiment, the processor 701 calls the non-compliance recognition model to process the text content contained in each candidate text box and determines, among all the candidate text boxes contained in the initial image, a reference text box whose text content is non-compliant, by:
calling the non-compliance recognition model to perform non-compliance detection on each line of text content of each candidate text box to obtain a non-compliance evaluation value for each line of text content of each candidate text box;
performing context splicing on the lines of text content of each candidate text box to obtain spliced text content of each candidate text box, and calling the non-compliance recognition model to perform non-compliance detection on the spliced text content of each candidate text box to obtain a non-compliance evaluation value for the spliced text content of each candidate text box;
determining the non-compliance evaluation value of each candidate text box based on the non-compliance evaluation value of each line of text content of each candidate text box and the non-compliance evaluation value of the spliced text content of each candidate text box;
and if the non-compliance evaluation value of any candidate text box is greater than a non-compliance evaluation threshold, determining that this candidate text box is a reference text box with non-compliant text content.
When the embodiment of the invention identifies whether an image to be detected is an incomplete subtitle image, the trained target detection model can be called to determine the prediction category of the text box to which each text content in the image to be detected belongs, and it is then determined whether a target text box whose prediction category is the target category exists in the image to be detected. If such a target text box exists, whether the image to be detected is an incomplete subtitle image is determined according to the confidence of the target text box and the second preset threshold. Compared with incomplete-subtitle recognition methods based on whole-image classification, this approach considers both the text boxes that may contain complete subtitle content and the text boxes that may contain incomplete subtitle content in the image to be detected. Calling the trained target detection model to determine the prediction category of the text box to which each text content belongs distinguishes the individual text boxes in the image and attends to local region information, which improves the accuracy and recall of incomplete subtitle image recognition. Moreover, because incomplete subtitle images are identified by comparing the confidence of target text boxes containing incomplete subtitle content against the second preset threshold, the accuracy of incomplete subtitle image recognition can be further improved.
It should be noted that an embodiment of the present invention also provides a computer program product or computer program, the computer program product or computer program including computer instructions stored in a computer-readable storage medium. The processor of the image processing device reads the computer instructions from the computer-readable storage medium and executes them, causing the image processing device to perform the steps of the method shown in fig. 4 or fig. 5 in the above-described image processing method embodiments.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. An image processing method, comprising:
acquiring an image to be detected;
calling a trained target detection model to determine a prediction category of a text box to which each text content in the image to be detected belongs, wherein the prediction category is used for indicating the probability that the text content contained in the text box is incomplete content;
if the prediction category of at least one text box is a target category, obtaining the confidence of each target text box whose prediction category is the target category, wherein the target category is used for indicating that the probability that the text content contained in the text box is incomplete content is greater than a first preset threshold;
and if the confidence of at least one target text box is greater than a second preset threshold, outputting that the prediction type of the image to be detected is a first type, wherein the first type is used for indicating that the image to be detected is an incomplete subtitle image.
2. The method of claim 1, wherein after the obtaining of the confidence of each target text box whose prediction category is the target category, the method further comprises:
if the confidence of each target text box is less than or equal to the second preset threshold, outputting that the type of the image to be detected is a second type, wherein the second type is used for indicating that the image to be detected is a complete subtitle image.
3. The method as claimed in claim 1, wherein the outputting, if the confidence of at least one target text box is greater than a second preset threshold, that the prediction type of the image to be detected is a first type comprises:
comparing the confidence of each target text box with a second preset threshold corresponding to the prediction category of that target text box;
and if the confidence of at least one target text box is greater than the second preset threshold corresponding to its prediction category, outputting that the type of the image to be detected is the first type.
4. The method of claim 1, wherein before the invoking of the trained target detection model to determine the prediction category of the text box to which each text content in the image to be detected belongs, the method further comprises:
acquiring a training sample, wherein the training sample comprises a sample image and a reference type of the sample image;
calling a target detection model to process the sample image to obtain the prediction type of the sample image;
and training the target detection model according to the prediction type and the reference type to obtain the trained target detection model.
5. The method of claim 4, wherein the obtaining training samples comprises:
cropping the target image according to a preset cropping mode to obtain a sample image;
identifying each text box in the sample image according to the position and size of each text box in the target image and the preset cropping mode;
comparing each text box in the sample image with a corresponding text box in the target image to determine a reference type of the sample image;
constructing a training sample comprising the sample image and a reference type for the sample image.
6. The method as claimed in claim 5, wherein before the cropping of the target image according to the preset cropping mode to obtain the sample image, the method further comprises:
performing character recognition on an initial image through an optical character recognition algorithm, and determining each candidate text box in the initial image and the text content contained in each candidate text box;
calling a non-compliance recognition model to process the text content contained in each candidate text box, and determining, among all the candidate text boxes contained in the initial image, a reference text box whose text content is non-compliant;
and filtering out the text content contained in the reference text box in the initial image to obtain the target image.
7. The method of claim 6, wherein the calling of the non-compliance recognition model to process the text content contained in each candidate text box and the determining of a reference text box whose text content is non-compliant among all the candidate text boxes contained in the initial image comprises:
calling the non-compliance recognition model to perform non-compliance detection on each line of text content of each candidate text box to obtain a non-compliance evaluation value for each line of text content of each candidate text box;
performing context splicing on the lines of text content of each candidate text box to obtain spliced text content of each candidate text box, and calling the non-compliance recognition model to perform non-compliance detection on the spliced text content of each candidate text box to obtain a non-compliance evaluation value for the spliced text content of each candidate text box;
determining the non-compliance evaluation value of each candidate text box based on the non-compliance evaluation value of each line of text content of each candidate text box and the non-compliance evaluation value of the spliced text content of each candidate text box;
and if the non-compliance evaluation value of any candidate text box is greater than a non-compliance evaluation threshold, determining that this candidate text box is a reference text box with non-compliant text content.
8. An image processing apparatus, comprising:
an obtaining unit, configured to obtain an image to be detected;
a determining unit, configured to call a trained target detection model to determine a prediction category of a text box to which each text content in the image to be detected belongs, wherein the prediction category is used for indicating the probability that the text content contained in the text box is incomplete content;
the obtaining unit being further configured to obtain, if the prediction category of at least one text box is a target category, the confidence of each target text box whose prediction category is the target category, wherein the target category is used for indicating that the probability that the text content contained in the text box is incomplete content is greater than a first preset threshold;
and an output unit, configured to output, if the confidence of at least one target text box is greater than a second preset threshold, that the prediction type of the image to be detected is a first type, wherein the first type is used for indicating that the image to be detected is an incomplete subtitle image.
9. An image processing device, comprising an input interface and an output interface, and further comprising:
a processor adapted to implement one or more instructions; and
a computer storage medium having stored thereon one or more instructions adapted to be loaded by the processor and to perform the image processing method according to any one of claims 1-7.
10. A computer storage medium having stored thereon one or more instructions adapted to be loaded by a processor and to perform the image processing method according to any of claims 1-7.
CN202110195326.4A 2021-02-20 2021-02-20 Image processing method, image processing apparatus, image processing device, and storage medium Pending CN113569613A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110195326.4A CN113569613A (en) 2021-02-20 2021-02-20 Image processing method, image processing apparatus, image processing device, and storage medium


Publications (1)

Publication Number Publication Date
CN113569613A true CN113569613A (en) 2021-10-29

Family

ID=78161157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110195326.4A Pending CN113569613A (en) 2021-02-20 2021-02-20 Image processing method, image processing apparatus, image processing device, and storage medium

Country Status (1)

Country Link
CN (1) CN113569613A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114495103A (en) * 2022-01-28 2022-05-13 北京百度网讯科技有限公司 Text recognition method, text recognition device, electronic equipment and medium
CN114495103B (en) * 2022-01-28 2023-04-04 北京百度网讯科技有限公司 Text recognition method and device, electronic equipment and medium
CN114666649A (en) * 2022-03-31 2022-06-24 北京奇艺世纪科技有限公司 Subtitle cut video identification method and device, electronic equipment and storage medium
CN114666649B (en) * 2022-03-31 2024-03-01 北京奇艺世纪科技有限公司 Identification method and device of subtitle cut video, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination