CN113761245B - Image recognition method and apparatus, electronic device, and computer-readable storage medium


Info

Publication number
CN113761245B
Authority
CN
China
Prior art keywords
attribute
candidate
frames
frame
image
Prior art date
Legal status
Active
Application number
CN202110510014.8A
Other languages
Chinese (zh)
Other versions
CN113761245A (en)
Inventor
侯昊迪
余亭浩
张绍明
陈少华
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110510014.8A
Publication of CN113761245A
Application granted
Publication of CN113761245B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/535Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The present application provides an image recognition method and apparatus, an electronic device, and a computer-readable storage medium. The method includes: performing attribute prediction processing on an image to obtain a plurality of candidate attribute frames corresponding to an object in the image; aggregating the plurality of candidate attribute frames based on their categories to obtain a plurality of groups of candidate attribute frames; screening each group of candidate attribute frames based on the intersection-over-union (IoU) ratios of the candidate attribute frames within the group to obtain a target attribute frame corresponding to each category; and performing category identification processing on the object based on the target attribute frame corresponding to each category to obtain the category of the image. The present application can improve the accuracy of image recognition.

Description

Image recognition method and apparatus, electronic device, and computer-readable storage medium
Technical Field
The present application relates to image processing technologies, and in particular to an image recognition method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Artificial Intelligence (AI) is a comprehensive technology of computer science that studies the design principles and implementation methods of various intelligent machines so that the machines can perceive, reason, and make decisions. Artificial intelligence is a broad interdiscipline covering many fields, such as natural language processing and machine learning; as the technology develops, it will be applied in ever more fields and deliver increasingly important value.
Image recognition is an important application of artificial intelligence. During image recognition, objects in an image are detected, and the detection result usually contains many similar candidate frames. In the related art, when these similar candidate frames are filtered, correct candidate frames are often filtered out by mistake while erroneous candidate frames are missed, so the accuracy of the image recognition result is poor.
Disclosure of Invention
The embodiments of the present application provide an image recognition method and apparatus, an electronic device, and a computer-readable storage medium, which can improve the accuracy of image recognition.
The technical solutions of the embodiments of the present application are implemented as follows:
An embodiment of the present application provides an image recognition method, including:
performing attribute prediction processing on an image to obtain a plurality of candidate attribute frames corresponding to an object in the image;
aggregating the plurality of candidate attribute frames based on their categories to obtain a plurality of groups of candidate attribute frames;
screening each group of candidate attribute frames based on the intersection-over-union (IoU) ratios of the candidate attribute frames within the group to obtain a target attribute frame corresponding to each category;
and performing category identification processing on the object based on the target attribute frame corresponding to each category to obtain the category of the image.
An embodiment of the present application provides an image recognition apparatus, including:
a prediction module, configured to perform attribute prediction processing on an image to obtain a plurality of candidate attribute frames corresponding to an object in the image;
an aggregation module, configured to aggregate the plurality of candidate attribute frames based on their categories to obtain a plurality of groups of candidate attribute frames;
a screening module, configured to screen each group of candidate attribute frames based on the IoU ratios of the candidate attribute frames within the group to obtain a target attribute frame corresponding to each category;
and an identification module, configured to perform category identification processing on the object based on the target attribute frame corresponding to each category to obtain the category of the image.
In the above solution, the plurality of candidate attribute frames include candidate overall attribute frames corresponding to the whole of the object and candidate local attribute frames corresponding to parts of the object, and the target attribute frames include a target overall attribute frame and target local attribute frames. The screening module is further configured to:
determine, based on at least one candidate overall attribute frame, a target overall attribute frame corresponding to the whole of the object;
and traverse each group of candidate local attribute frames, perform a filtering operation on the candidate local attribute frames within the same group, and take the filtered candidate local attribute frame that belongs to the object and has the largest attribute probability as the target local attribute frame.
In the above solution, the screening module is further configured to:
when there is exactly one candidate overall attribute frame, take that candidate overall attribute frame as the target overall attribute frame;
when there are multiple candidate overall attribute frames and one object, take the candidate overall attribute frame with the largest attribute probability as the target overall attribute frame;
when there are multiple candidate overall attribute frames and multiple objects, determine the IoU ratios between the candidate overall attribute frames, aggregate the candidate overall attribute frames whose IoU ratios are greater than a first IoU threshold into a plurality of groups, and take the candidate overall attribute frame with the largest attribute probability in each group as a target overall attribute frame.
In the above solution, the screening module is further configured to:
determine the IoU ratio of two candidate local attribute frames in the same group based on their positions;
when the IoU ratio is greater than the first IoU threshold, filter out the candidate local attribute frame with the smaller attribute probability of the two;
and when the IoU ratio is less than or equal to the first IoU threshold, filter out, based on the target overall attribute frame, the candidate local attribute frame that does not belong to the object from the two.
In the above solution, the two candidate local attribute frames are a first candidate local attribute frame and a second candidate local attribute frame, respectively. The screening module is further configured to:
determine the intersection area and the union area of the first candidate local attribute frame and the second candidate local attribute frame based on their positions;
and take the ratio of the intersection area to the union area as the IoU ratio of the first candidate local attribute frame and the second candidate local attribute frame.
In the above solution, the screening module is further configured to:
when the IoU ratio is less than or equal to the first IoU threshold, determine the IoU ratio of each of the two candidate local attribute frames with the target overall attribute frame;
and filter out the candidate local attribute frame whose IoU ratio with the target overall attribute frame is less than or equal to a second IoU threshold.
In the above solution, the prediction module is further configured to:
perform convolution processing on the image to obtain image features;
classify the image features to obtain a plurality of positive candidate frames;
and adjust the plurality of positive candidate frames to obtain the plurality of candidate attribute frames.
In the above solution, the identification module is further configured to:
query a mapping table based on the label of each target attribute frame to obtain a score corresponding to each target attribute frame;
add the scores corresponding to the target attribute frames to obtain a sum;
and determine the category of the image based on the score interval to which the sum belongs.
In the above solution, the categories of images include low-quality images and non-low-quality images. The identification module is further configured to:
reduce or prohibit recommendation of the image when the category of the image is low-quality;
and send the image to a recommendation queue to wait for recommendation when the category of the image is non-low-quality.
An embodiment of the present application provides an electronic device, including:
a memory, configured to store executable instructions;
and a processor, configured to implement the image recognition method provided by the embodiments of the present application when executing the executable instructions stored in the memory.
An embodiment of the present application provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the image recognition method provided by the embodiments of the present application.
An embodiment of the present application provides a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. A processor of an electronic device reads the computer instructions from the computer-readable storage medium and executes them, causing the electronic device to perform the image recognition method provided by the embodiments of the present application.
The embodiments of the present application have the following beneficial effects:
the plurality of candidate attribute frames of an image are aggregated by category so that each of the resulting groups corresponds to a different category, and the target attribute frame of each category is screened out of the corresponding group, which improves both the efficiency and the accuracy of filtering candidate attribute frames; performing category identification processing on the object in the image based on the target attribute frames obtained by this filtering can then improve the accuracy of image recognition.
Drawings
FIGS. 1A-1B are schematic diagrams of candidate attribute frames output by a target detection model according to an embodiment of the present application;
FIGS. 1C-1D are schematic diagrams of candidate attribute frames screened by the NMS algorithm of the related art;
FIGS. 1E-1F are schematic diagrams of candidate attribute frames screened by the class-specific NMS algorithm of the related art;
FIGS. 1G-1H are schematic diagrams of candidate attribute frames screened by the class-sensitive non-maximum suppression algorithm according to an embodiment of the present application;
FIG. 2A is an architecture diagram of an image recognition system 10 according to an embodiment of the present application;
FIG. 2B is another architecture diagram of the image recognition system 10 according to an embodiment of the present application;
FIG. 3A is a schematic flow chart of an image recognition method according to an embodiment of the present application;
FIG. 3B is a schematic flow chart of an image recognition method according to an embodiment of the present application;
FIG. 3C is a schematic flow chart of an image recognition method according to an embodiment of the present application;
FIG. 4 is a flow chart of content detection and recommendation provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of an intersection-over-union ratio provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a server 200-1 according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without inventive effort fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first/second/third" are used merely to distinguish between similar objects and do not represent a particular ordering of the objects, it being understood that the "first/second/third" may be interchanged with a particular order or precedence where allowed, to enable embodiments of the application described herein to be implemented in other than those illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before the embodiments of the present application are described in further detail, the terms involved in the embodiments of the present application are explained; the terms used in the embodiments are subject to the following interpretations.
1) Information flow (feed): a data format through which a website delivers its latest information to users, usually arranged along a time axis, which is the most primitive, intuitive, and basic presentation form of an information stream. A prerequisite for users to subscribe to a website is that the website provides a message source; the converging of message sources together is called aggregation.
2) Non-Maximum Suppression (NMS) algorithm: an algorithm that searches for local maxima and suppresses non-maximal elements. It finds wide application in computer vision tasks such as edge detection, face detection, and object detection. Taking object detection as an example, a large number of candidate attribute frames are generated at the position of the same image object during detection, and these frames may overlap one another; the NMS algorithm determines the optimal attribute frame of the image object and eliminates the redundant candidate frames (a minimal sketch is given after these definitions).
3) Machine Learning (ML): a multi-field interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills and reorganizes existing knowledge structures to continuously improve its own performance.
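As flagged in term 2) above, the following is a minimal sketch of the standard NMS procedure; the (x1, y1, x2, y2) box representation, the function names, and the 0.5 threshold are illustrative assumptions rather than the patent's implementation. The class-specific NMS variant discussed later simply runs the same procedure separately on the frames of each predicted class.

```python
# Minimal NMS sketch; boxes are (x1, y1, x2, y2) tuples (an assumption).

def iou(a, b):
    """Intersection-over-union of two boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(boxes, scores, threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return keep
```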
With the development of content industries such as information feeds and short videos, more and more image and video content is uploaded to the Internet by users. Because user-uploaded content is mixed and varies widely in quality, some low-quality content (such as vulgar images and incompletely cropped pictures) needs to be identified. Low-quality content is typically identified using a target detection model. Since a target detection model usually predicts a large number of candidate attribute frames, many of which are redundant or even erroneous, these frames need to be filtered to obtain more accurate detection results. In content auditing, detection targets of the same category may carry many different attributes; for example, in vulgar-content identification for human bodies, the chest category includes attributes such as chest normal, chest protruding, and chest bare, and in incomplete-picture identification, the subtitle category includes attributes such as normal subtitle, vertical subtitle, and horizontal subtitle.
The candidate attribute frame screening methods in the related art are mainly the NMS algorithm and the class-specific non-maximum suppression (class-specific NMS) algorithm. The NMS algorithm computes the IoU ratios of all candidate attribute frames output by the target detection model (including candidate overall attribute frames and candidate local attribute frames) and filters the frames according to the IoU ratios and the attribute probabilities of the corresponding frames. The class-specific NMS algorithm differs from the NMS algorithm in that the former performs IoU computation and filtering only on candidate attribute frames predicted to be of the same class.
In image low-quality content recognition tasks, a given kind of target generally has both corresponding high-quality types (such as a completely cropped face or a normal chest) and low-quality types (such as an incompletely cropped face or a bare chest). When identifying low-quality types such as vulgar or incomplete human bodies, the candidate overall attribute frames and the candidate local attribute frames of the human body must be detected simultaneously. These characteristics cause the screening methods of the related art to filter out correct candidate attribute frames by mistake while failing to filter out erroneous ones, so the accuracy of image recognition is low. The human body attribute and sensitive part detection task in vulgar-content identification is taken as an example below.
Referring to FIGS. 1A-1B, FIGS. 1A-1B are schematic diagrams of candidate attribute frames output by a target detection model according to an embodiment of the present application. For a certain part of the human body (such as a leg or the waist) or for the whole body, multiple candidate attribute frames may exist. These frames are independent of one another, and each has a corresponding label (indicating the category and attribute of the frame) and an attribute probability.
The human body attribute and sensitive part detection task needs to detect the candidate overall attribute frames and the candidate local attribute frames of the human body at the same time, and frames of different categories often intersect (a candidate overall attribute frame may intersect a candidate local attribute frame, and candidate local attribute frames may also intersect one another). Referring to FIGS. 1C-1D, FIGS. 1C-1D are schematic diagrams of candidate attribute frames screened by the NMS algorithm of the related art. Because many vulgar scenes deliberately feature certain parts of the human body, the intersection between the candidate overall attribute frame and the candidate local attribute frames is relatively large, so the NMS algorithm may mistakenly filter out some important candidate overall attribute frames and candidate local attribute frames. After screening by the NMS algorithm, both the leg candidate attribute frame in FIG. 1C and the chest candidate attribute frame in FIG. 1D have been mistakenly filtered out.
The class-specific NMS algorithm avoids the mis-filtering caused by intersections between candidate attribute frames of different categories, but because different attributes of the same part (such as foot normal/foot control, or chest bare/chest micro-bare/chest protruding) resemble one another, it often outputs multiple candidate attribute frames for the same part after filtering and thus cannot select the single optimal candidate attribute frame. Referring to FIGS. 1E-1F, FIGS. 1E-1F are schematic diagrams of candidate attribute frames screened by the class-specific NMS algorithm of the related art. Since the class-specific NMS algorithm cannot distinguish among different attributes of the same part (that is, it cannot distinguish candidate attribute frames with the same category but different attributes), noise candidate attribute frames such as "foot normal" in FIG. 1E and "chest bare" and "chest micro-bare" in FIG. 1F cannot be filtered out and are retained.
The embodiment of the application provides an image recognition method which can improve the accuracy of image recognition.
The image recognition method provided by the embodiments of the present application can be implemented by various electronic devices; for example, it can be implemented by a terminal or a server alone, or by a server and a terminal in cooperation. For example, the terminal alone performs the image recognition method described below, or the terminal sends a content upload request to a server and the server executes the image recognition method based on the received request.
The electronic device provided by the embodiments of the present application may be any of various types of terminal devices or servers. A server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and basic cloud computing services such as big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smartphone, a tablet computer, a notebook computer, or a desktop computer. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which the present application does not limit.
Taking a server as an example, it may be a server cluster deployed in the cloud that opens an artificial intelligence cloud service (AIaaS, AI as a Service) to users. An AIaaS platform splits several common AI services and provides independent or packaged services in the cloud. This service mode is similar to an AI-themed mall: all users can access one or more of the artificial intelligence services provided by the AIaaS platform through application programming interfaces.
For example, one of the artificial intelligence cloud services may be an image recognition service, that is, the cloud server encapsulates the image recognition program provided by the embodiments of the present application. In response to a user's content upload operation, the terminal sends a content upload request carrying an image to the cloud server; the cloud server invokes the encapsulated image recognition program to recognize the image and obtain its category, controls the uploading of the image based on that category, and returns the category of the image and the upload result (success or failure) to the terminal.
In some embodiments, an exemplary image recognition system is described by taking the case where a server and a terminal cooperatively implement the image recognition method provided by the embodiments of the present application. Referring to FIG. 2A, FIG. 2A is an architecture diagram of an image recognition system 10 according to an embodiment of the present application. The terminal 400 is connected to the server 200-1 through the network 300, which may be a wide area network, a local area network, or a combination of the two. The terminal 400 sends a content upload request carrying an image to the server 200-1 in response to a user's content upload operation. In response to the request, the server 200-1 performs attribute prediction processing on the image to obtain a plurality of candidate attribute frames; screens the candidate attribute frames that have different attributes within the same category to obtain a target attribute frame corresponding to each category; determines the category of the image based on the target attribute frames; and controls the uploading of the image based on that category. Finally, the server 200-1 returns the category of the image (e.g., low-quality or non-low-quality) and the upload result (success or failure) to the terminal 400.
The embodiments of the present application may also be implemented with blockchain technology; see FIG. 2B, where both servers and terminals may join the blockchain network 500 to become nodes. The type of the blockchain network 500 is flexible; it may be, for example, any of a public chain, a private chain, or a consortium chain. Taking a public chain as an example, any electronic device of a business entity, such as a server, may access the blockchain network 500 without authorization and act as a consensus node: for example, server 200-1 maps to consensus node 500-1 in the blockchain network 500, server 200-2 maps to consensus node 500-2, and server 200-3 maps to consensus node 500-0.
Taking the case where the blockchain network 500 is a consortium chain, a server may access the blockchain network 500 to become a node only after being authorized. In response to a content upload request carrying an image, the server 200-1 determines the category of the image and then sends it to other servers (such as the server 200-2 and the server 200-3), which can verify the category of the image (that is, verify whether it is correct) by executing a smart contract. When more nodes than a number threshold confirm that the verification passes, they append their digital signatures (i.e., endorsements); once the determined category of the image has enough endorsements, the server 200-1 controls the uploading of the image based on that category and returns the upload result (success or failure) to the terminal.
Thus, in the embodiments of the present application, consensus verification of image categories by multiple nodes can improve the accuracy and reliability of image recognition.
The image recognition method provided by the embodiments of the present application is described below with reference to the accompanying drawings. The execution subject of the method may be a server, specifically implemented by the server running the various computer programs described above; of course, as will be apparent from the following description, the image recognition method may also be implemented by a terminal, or by a terminal and a server in cooperation.
Referring to fig. 3A, fig. 3A is a schematic flow chart of an image recognition method according to an embodiment of the present application, and the steps shown in fig. 3A will be described.
In step 101, an attribute prediction process is performed on an image, so as to obtain a plurality of candidate attribute frames corresponding to objects in the image.
In some embodiments, the image may be an image to be published that a user uploads, or an image that has already been published. The image may be a standalone picture, a picture in image-text content, or a video frame. Taking a video frame as an example, when the server receives a video uploaded by a user, it extracts a plurality of video frames from the video and performs attribute prediction processing on them, in batches or one by one, to obtain a plurality of candidate attribute frames.
The object in an image may be a person, an animal, an item, or a scene, among others. Each image may contain one or more objects. When there are multiple objects, attribute prediction processing is performed on the image to obtain a plurality of candidate attribute frames corresponding to each object in the image.
In some embodiments, the image may be subjected to attribute prediction processing by a target detection model such as EfficientDet, YOLO, or SSD (Single Shot MultiBox Detector) to obtain the plurality of candidate attribute frames. Performing attribute prediction processing on the image to obtain a plurality of candidate attribute frames corresponding to the object in the image can be implemented as follows.
First, convolution processing is performed on the image to obtain image features. The image needs to be scaled to a fixed size before it is convolved. The image may be convolved by a plurality of convolution blocks, each of which may include sequentially connected convolution layers, an activation layer, and a pooling layer.
Next, the image features are classified to obtain a plurality of positive candidate frames. In some possible examples, the image features may be classified by a softmax function or a sigmoid function to obtain a plurality of positive candidate frames along with a plurality of negative candidate frames; the negative candidate frames are removed and the positive candidate frames are retained. A positive candidate frame characterizes a possible candidate attribute frame, and a negative candidate frame characterizes an erroneous one.
Then, the offset of each positive candidate frame is determined, and the position of the corresponding frame is adjusted according to the offset to obtain the plurality of candidate attribute frames. In some possible examples, the offset includes a center offset and scaling factors, the scaling factors including a vertical scaling factor and a horizontal scaling factor: the center of the positive candidate frame is first moved to a new position according to the center offset, then the height of the frame is scaled by the vertical scaling factor and its width by the horizontal scaling factor, yielding a candidate attribute frame. During this adjustment, frames that exceed the image boundary and frames whose sizes are smaller than a preset value are removed, yielding the plurality of candidate attribute frames. A sketch of this adjustment follows.
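The box-adjustment step just described can be sketched as follows; the box representation, parameter names, and the preset minimum size are illustrative assumptions, not the patent's parameterization.

```python
# Sketch of adjusting a positive candidate frame: shift its center by the
# predicted offset, then scale its width and height. Boxes are (x1, y1, x2, y2).

def adjust_frame(box, dx, dy, scale_w, scale_h):
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    w, h = box[2] - box[0], box[3] - box[1]
    cx, cy = cx + dx, cy + dy            # center offset
    w, h = w * scale_w, h * scale_h      # horizontal / vertical scaling
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def remove_invalid(frames, img_w, img_h, min_size=4):
    """Drop frames exceeding the image boundary or below a preset size."""
    return [(x1, y1, x2, y2) for (x1, y1, x2, y2) in frames
            if x1 >= 0 and y1 >= 0 and x2 <= img_w and y2 <= img_h
            and (x2 - x1) >= min_size and (y2 - y1) >= min_size]
```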
To facilitate the subsequent screening of candidate attribute frames, the category and attribute of each candidate attribute frame need to be determined once the frames have been obtained. The candidate attribute frames are classified to obtain the label and the attribute probability of each frame. A label comprises the category and the attribute of the corresponding candidate attribute frame. Attributes characterize the whole or a part of the object, and each category (e.g., the chest category) may include multiple attributes (e.g., chest normal, chest protruding). The attribute probability is the probability that the candidate attribute frame has the attribute in its label. For example, if the label of a candidate attribute frame is "adult female true", the categories of the frame are "adult" and "female", and its attribute is "true" (i.e., a real person rather than a model or a drawing).
In some embodiments, the target detection model is trained with multi-label sample data, where the labels of the sample data include category labels and attribute labels. The same object can be divided from different aspects into multiple different categories; a person, for example, can be divided by age, gender, occupation, and so on into multiple mutually independent categories. Accordingly, one sample may have only one category label, such as "adult" or "non-adult", while another may have multiple category labels, such as "adult", "female", and "teacher". A multi-class target detection model can therefore be trained with multi-label sample data, improving the model's detection capability. When the image is subjected to attribute prediction processing by a target detection model trained with sample data carrying both attribute labels and category labels, the categories of the object in the image and the corresponding attributes can be obtained.
In step 102, aggregation processing is performed on the multiple candidate attribute frames based on the categories of the multiple candidate attribute frames, so as to obtain multiple groups of candidate attribute frames.
In some embodiments, after obtaining the candidate attribute frames and determining the corresponding categories according to the labels thereof, the candidate attribute frames belonging to the same category may be aggregated into one group to obtain multiple groups of candidate attribute frames.
For example, suppose the labels of the plurality of candidate attribute frames are adult female true, foot control, foot normal, chest bare, chest micro-bare, and chest protruding. From the labels it can be determined that foot control and foot normal both belong to the foot category, while chest bare, chest micro-bare, and chest protruding all belong to the chest category. Therefore, the candidate attribute frames for foot control and foot normal can be aggregated into one group, the candidate attribute frames for chest bare, chest micro-bare, and chest protruding into another group, and the candidate attribute frame for adult female true (a candidate overall attribute frame) forms a group by itself, yielding three groups of candidate attribute frames, as in the sketch below.
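A minimal sketch of this aggregation step, using the example labels above; the dictionary encoding of a candidate frame (category, attribute, probability, box) and the numeric values are assumptions for illustration.

```python
# Sketch of step 102: group candidate attribute frames by category.
from collections import defaultdict

candidates = [
    {"category": "whole body", "attribute": "adult female true", "prob": 0.90, "box": (10, 5, 120, 300)},
    {"category": "foot",  "attribute": "foot control",     "prob": 0.64, "box": (42, 258, 83, 299)},
    {"category": "foot",  "attribute": "foot normal",      "prob": 0.81, "box": (40, 260, 80, 300)},
    {"category": "chest", "attribute": "chest bare",       "prob": 0.77, "box": (45, 80, 95, 130)},
    {"category": "chest", "attribute": "chest micro-bare", "prob": 0.58, "box": (44, 82, 96, 128)},
    {"category": "chest", "attribute": "chest protruding", "prob": 0.33, "box": (46, 79, 94, 131)},
]

groups = defaultdict(list)
for frame in candidates:
    groups[frame["category"]].append(frame)
# groups now holds three groups: "whole body", "foot", and "chest"
```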
In this way, not only are candidate overall attribute frames distinguished from candidate local attribute frames, but candidate local attribute frames of different categories are also distinguished from one another. This prevents candidate attribute frames of different categories from being compared against each other during screening merely because their positions intersect, and thus prevents them from being filtered out by mistake.
In step 103, each group of candidate attribute frames is screened based on the IoU ratios of the candidate attribute frames within the group, to obtain a target attribute frame corresponding to each category.
In some embodiments, the plurality of candidate attribute frames include candidate overall attribute frames corresponding to the whole of the object and candidate local attribute frames corresponding to parts of the object. For example, the candidate attribute frame labeled "adult female true" in FIG. 1A corresponds to the whole person and is therefore a candidate overall attribute frame, while the candidate attribute frame labeled "foot control" in FIG. 1A corresponds to a part of the person (the foot) and is therefore a candidate local attribute frame. The target attribute frames include a target overall attribute frame, corresponding to the whole of the object, and target local attribute frames, corresponding to parts of the object.
In some embodiments, screening each group of candidate attribute frames based on the IoU ratios within the group to obtain a target attribute frame corresponding to each category may be implemented through step 1031 and step 1032 in FIG. 3B.
In step 1031, a target overall attribute frame corresponding to the whole of the object is determined based on at least one candidate overall attribute frame.
In some embodiments, when there is exactly one candidate overall attribute frame, that frame is taken as the target overall attribute frame. For example, if FIG. 1A contains only one candidate overall attribute frame, labeled "adult female true", that frame is taken as the target overall attribute frame corresponding to the person in the image.
When there are multiple candidate overall attribute frames and one object, the candidate overall attribute frame with the largest attribute probability is taken as the target overall attribute frame. For example, suppose attribute prediction determines two candidate overall attribute frames for the single person in an image, labeled "adult female true" and "minor female true" with attribute probabilities 0.903 and 0.305, respectively; the frame with probability 0.903 is taken as the target overall attribute frame.
When there are multiple candidate overall attribute frames and multiple objects, the IoU ratios between the candidate overall attribute frames are determined, the frames whose pairwise IoU ratios exceed the first IoU threshold are aggregated into groups, and the frame with the largest attribute probability in each group is taken as a target overall attribute frame.
For example, suppose the image contains person a and person b and there are 5 candidate overall attribute frames. The pairwise IoU ratios of the 5 frames are determined in turn. The IoU ratio of candidate overall attribute frame 1 with frame 2 is greater than the first IoU threshold, as are the IoU ratios of frame 3 with frame 4, frame 3 with frame 5, and frame 4 with frame 5. Frames 1 and 2 are therefore aggregated into one group (e.g., corresponding to person a), and frames 3, 4, and 5 into another (e.g., corresponding to person b). If the attribute probability of frame 1 is greater than that of frame 2, frame 1 is taken as the target overall attribute frame for person a; if the attribute probability of frame 3 is greater than those of frames 4 and 5, frame 3 is taken as the target overall attribute frame for person b. In this way, the target overall attribute frame of each object in the image can be determined, avoiding mismatches between target overall attribute frames and objects. A sketch of this selection follows.
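The following is a minimal sketch of step 1031 under the assumptions of the earlier sketches (dictionary-encoded frames and an iou() helper, repeated here so the snippet stands alone); the greedy clustering order is an illustrative choice.

```python
# Sketch of step 1031: select target overall attribute frames.

def iou(a, b):  # same helper as in the NMS sketch above
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def select_overall(overall_frames, num_objects, first_iou_threshold=0.5):
    if len(overall_frames) == 1:          # exactly one candidate
        return overall_frames
    if num_objects == 1:                  # several candidates, one object
        return [max(overall_frames, key=lambda f: f["prob"])]
    # several candidates, several objects: cluster frames whose IoU with a
    # cluster's seed exceeds the first IoU threshold, then keep the
    # highest-probability frame of each cluster
    clusters = []
    for frame in sorted(overall_frames, key=lambda f: f["prob"], reverse=True):
        for cluster in clusters:
            if iou(frame["box"], cluster[0]["box"]) > first_iou_threshold:
                cluster.append(frame)
                break
        else:
            clusters.append([frame])
    return [cluster[0] for cluster in clusters]  # seeds have the largest prob
```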
In step 1032, each group of candidate local attribute frames is traversed, a filtering operation is performed on the candidate local attribute frames within the same group, and the remaining candidate local attribute frame that belongs to the object and has the largest attribute probability is taken as the target local attribute frame.
In some embodiments, when traversing each group of candidate local attribute frames, the frames in a group may be traversed randomly, or they may be sorted in descending (or ascending) order of attribute probability and traversed in the sorted order.
In some embodiments, because there may be multiple objects in the image, and because a candidate local attribute frame may not belong to any object at all, it is necessary to ensure that the candidate local attribute frames remaining after filtering belong to the corresponding object in the image. And because the same category of the same object may contain multiple mutually independent candidate local attribute frames (with the same or different attributes), the frame with the largest attribute probability in that category must be kept while the others are filtered out.
In some embodiments, performing the filtering operation on the candidate local attribute frames in the same group may be accomplished through steps 10321 to 10324 in FIG. 3C.
In step 10321, the IoU ratio of two candidate local attribute frames in the same group is determined based on their positions.
In some embodiments, the two candidate local attribute frames may be any two frames in the same group, or two adjacent frames; call them the first candidate local attribute frame and the second candidate local attribute frame. Based on the positions of the two frames, their intersection area and union area can be determined, and the ratio of the intersection area to the union area is taken as their IoU ratio. The IoU ratio of two candidate local attribute frames reflects their similarity: the larger the IoU ratio, the more the two frames overlap and the more they need to be screened. A sketch of this computation follows.
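A minimal sketch of the IoU computation in step 10321, spelling out the intersection-area/union-area decomposition; the (x1, y1, x2, y2) box representation and the example coordinates are assumptions carried over from the earlier sketches.

```python
# Sketch of step 10321: IoU = intersection area / union area.

def intersection_over_union(first, second):
    ix1, iy1 = max(first[0], second[0]), max(first[1], second[1])
    ix2, iy2 = min(first[2], second[2]), min(first[3], second[3])
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_first = (first[2] - first[0]) * (first[3] - first[1])
    area_second = (second[2] - second[0]) * (second[3] - second[1])
    union = area_first + area_second - intersection  # the merged area
    return intersection / union if union else 0.0

# Two heavily overlapping chest frames yield an IoU close to 1:
print(intersection_over_union((44, 82, 96, 128), (45, 80, 95, 130)))
```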
In step 10322, the IoU ratio is compared with the first IoU threshold. When the IoU ratio is greater than the first IoU threshold, step 10323 is performed; when the IoU ratio is less than or equal to the first IoU threshold, step 10324 is performed.
In some embodiments, the higher the first IoU threshold, the fewer the pairs of candidate local attribute frames whose IoU ratio exceeds it, and the higher the filtering efficiency, but redundant frames whose IoU ratio falls just below the threshold may escape filtering; the lower the first IoU threshold, the more pairs exceed it, and the more thorough the filtering, but the lower the filtering efficiency. The first IoU threshold therefore needs to be set reasonably. The IoU thresholds corresponding to candidate local attribute frames of different categories may be the same (e.g., all equal to the first IoU threshold) or different.
In step 10323, the candidate local attribute frame having the smaller attribute probability of the two candidate local attribute frames is filtered out.
In some embodiments, because a higher attribute probability indicates a more accurate candidate local attribute frame, when the IoU ratio is greater than the first IoU threshold, the frame with the lower attribute probability should be filtered out of the two highly overlapping frames and the frame with the higher attribute probability retained. In this way, the accuracy of the filtering can be improved.
In step 10324, the candidate local attribute frame that does not belong to the object is filtered out of the two, based on the target overall attribute frame.
In some embodiments, when the IoU ratio is less than or equal to the first IoU threshold, the two candidate local attribute frames do not belong to the same object: one of them belongs to the object currently undergoing the screening process, while the other either does not belong to any object in the image (it was misrecognized) or belongs to another object different from the one currently being screened. The other frame therefore needs to be filtered out.
In some embodiments, filtering out the candidate local attribute frame that does not belong to the object can be implemented as follows: determine the IoU ratio of each of the two candidate local attribute frames with the target overall attribute frame, and filter out the frame whose IoU ratio with the target overall attribute frame is less than or equal to a second IoU threshold.
The second IoU threshold is distinct from the first IoU threshold and measures the degree of overlap between a candidate local attribute frame and the target overall attribute frame, i.e., the target attribute frame corresponding to the whole of the object currently being screened. For example, suppose the second IoU threshold is 0.6, the image contains person 3 and person 4, and the object currently being screened is person 3. If the IoU ratios of the two candidate local attribute frames with the target overall attribute frame of person 3 are 0.7 and 0.2, the frame with ratio 0.2 is filtered out and the frame with ratio 0.7 is retained. This prevents two partially overlapping candidate local attribute frames that belong to different objects from being filtered against each other, which would mistakenly remove a correct frame that should be kept.
For multiple candidate local attribute frames in the same group, after the frame to be kept has been determined from a pair, the filtering operation continues between the kept frame and a new frame in the group until all candidate local attribute frames in the group have undergone the filtering operation. The candidate local attribute frame that belongs to the object and has the largest attribute probability is thereby obtained and taken as the target local attribute frame. In this way, a unique target local attribute frame is obtained from each group of candidate local attribute frames while its accuracy is guaranteed. A minimal sketch of this filtering operation follows.
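The following sketch combines steps 10321-10324 for one group of candidate local attribute frames, reusing the assumptions of the earlier sketches (dictionary-encoded frames, an inline iou() helper, illustrative thresholds). For brevity it tests only the lower-probability member of each low-overlap pair against the target overall attribute frame; the patent's description allows filtering either member.

```python
# Sketch of the filtering operation of steps 10321-10324 for one group.

def iou(a, b):  # same helper as in the earlier sketches
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def filter_group(group, target_overall_box, first_threshold=0.5, second_threshold=0.6):
    """Return the target local attribute frame of this group (or None)."""
    # traverse in descending attribute-probability order (one allowed order)
    kept = sorted(group, key=lambda f: f["prob"], reverse=True)
    i = 0
    while i < len(kept):
        j = i + 1
        while j < len(kept):
            if iou(kept[i]["box"], kept[j]["box"]) > first_threshold:
                del kept[j]   # step 10323: same part, drop lower probability
            elif iou(kept[j]["box"], target_overall_box) <= second_threshold:
                del kept[j]   # step 10324: frame does not lie on this object
            else:
                j += 1
        i += 1
    # kept[0] belongs to the object and has the largest attribute probability
    return kept[0] if kept else None
```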
In some embodiments, screening each group of candidate attribute frames based on the IoU ratios within the group to obtain a target attribute frame corresponding to each category may also be implemented as follows. A target overall attribute frame corresponding to the whole of the object is determined based on at least one candidate overall attribute frame. The IoU ratio of the target overall attribute frame with each candidate local attribute frame is computed, and from each group the candidate local attribute frames whose IoU ratio exceeds the second IoU threshold are selected; the selected frames correspond to the same object in the image as the target overall attribute frame. The candidate local attribute frame with the largest attribute probability among those selected from a group is then taken as the target local attribute frame of the category corresponding to that group.
When the image contains multiple objects, the image contains multiple target overall attribute frames in one-to-one correspondence with the objects. The IoU ratio of each target overall attribute frame with each candidate local attribute frame is computed; for each target overall attribute frame, the candidate local attribute frames whose IoU ratio exceeds the second IoU threshold are obtained, and these correspond to the same object as that target overall attribute frame. Then, when determining the target local attribute frames of each object, the candidate local attribute frame with the largest attribute probability can be selected, within each category (i.e., each group), from the frames corresponding to that object, as the target local attribute frame of the corresponding category.
For example, suppose the image contains person 5 and person 6, person 5 corresponding to target overall attribute frame 1 and person 6 to target overall attribute frame 2, and there are two groups of candidate local attribute frames: a chest group and a leg group. In the chest group, the IoU ratios of candidate local attribute frames 1 and 2 with target overall attribute frame 1 exceed the second IoU threshold, and the attribute probability of frame 1 is greater than that of frame 2. In the leg group, the IoU ratios of candidate local attribute frames 3 and 4 with target overall attribute frame 1 exceed the second IoU threshold, and the attribute probability of frame 3 is greater than that of frame 4. Candidate local attribute frame 1 is taken as the target local attribute frame of the chest group (chest category) for person 5, and candidate local attribute frame 3 as the target local attribute frame of the leg group (leg category) for person 5. The target local attribute frames of person 6 can be determined from the chest group and the leg group in the same way.
Therefore, candidate attribute frames belonging to different objects and of different categories can be distinguished, screening is carried out from a plurality of candidate attribute frames of the same category, screening accuracy is improved, and false screening is avoided.
In step 104, the object is subjected to category identification processing based on the target attribute frame corresponding to each category, and the category of the image is obtained.
In some embodiments, the object is subjected to category identification processing based on the target attribute frame corresponding to each category, so as to obtain the category of the image, which can be implemented in the following manner: inquiring a mapping table based on the label of each target attribute frame to obtain the score corresponding to each target attribute frame; adding the scores corresponding to each target attribute frame to obtain a sum; the category of the image is determined based on the summed score intervals.
The mapping table stores the labels of target attribute frames and their corresponding scores. For example, suppose each score lies in [0, 1]: when the label of a target attribute frame is "chest exposure", the corresponding score is 1; when the label is "chest normal", the corresponding score is 0. The higher the score, the more serious the exposure, and the more likely the image is a vulgar image. The sum of the scores of all target attribute frames characterizes how likely the image as a whole is to be a vulgar image: the higher the sum, the greater the likelihood. After the sum is obtained, the score interval to which it belongs is determined, with different score intervals corresponding to different categories. For example, the score interval [0, 0.5] may be set in advance to correspond to the non-low-quality category and the interval (0.5, +∞) to the low-quality category; when the sum is 0.3, the image is classified as a non-low-quality image.
It can be seen that, since the label of each target attribute frame reflects the category of the image to some extent, integrating the labels of all the target attribute frames yields an accurate category for the image.
In some possible examples, after the score of each target attribute frame is obtained from the mapping table, each score is multiplied by the weight of the corresponding target attribute frame and the products are summed; the category of the image is then determined from the score interval into which the sum falls. The weight of a target attribute frame may be its attribute probability, or a weight assigned to its category: target attribute frames of different categories differ in importance and therefore in weight. For example, in a human-body attribute and sensitive-part detection task, the weights of the chest, waist, leg, and foot categories may be set to 0.6, 0.2, 0.1, and 0.1, respectively.
In this way, the weights of target attribute frames of different categories are taken into account, so that the categories with greater influence on the image receive more consideration and a more accurate image category is obtained.
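A sketch of this scoring scheme follows; the label scores, category weights, and the 0.5 cut-off are the illustrative values from the description above, and all names are assumptions:

```python
# Mapping table from target-attribute-frame labels to scores, and
# per-category weights (values taken from the examples above).
SCORE_TABLE = {"chest exposure": 1.0, "chest normal": 0.0}
CATEGORY_WEIGHTS = {"chest": 0.6, "waist": 0.2, "leg": 0.1, "foot": 0.1}

def classify_image(target_frames, low_quality_threshold=0.5):
    """target_frames: (category, label) pairs, one per target attribute frame.
    Looks up each label's score, weights it by its category, sums the
    products, and maps the sum's score interval to an image category."""
    total = sum(SCORE_TABLE.get(label, 0.0) * CATEGORY_WEIGHTS.get(cat, 1.0)
                for cat, label in target_frames)
    return "low-quality" if total > low_quality_threshold else "non-low-quality"
```

Setting every weight to 1.0 recovers the unweighted summation described first.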
In some embodiments, the objects in the image may also be subjected to category identification processing by a machine learning method (for example, through a model such as AlexNet or GoogLeNet) to obtain the category of the image.
In some embodiments, different categories of images are given different recommendation policies. The categories of images may include low-quality images and non-low-quality images, where low-quality images include vulgar images, incomplete images, sensitive images (images containing sensitive vocabulary), and the like. When the category of an image is a low-quality image, recommendation of the image is reduced or prohibited; when the category is a non-low-quality image, the image is sent to the recommendation queue of a recommendation system to await recommendation.
In some possible examples, the recommendation system may sort the images in the recommendation queue in descending order of the level of the account that uploaded each image and recommend them in that order. A high-level account may be an original-content account, an official account, or an account whose number of followers exceeds a preset threshold (for example, one million). In this way, images uploaded by high-level accounts are distributed preferentially, so that important content published by authoritative sources, or high-quality content of broad public interest, is distributed first.
In other possible examples, the recommendation system may also order the images in the recommendation queue by presentation form, where the forms include general images, frames of animated images, and video frames. When an image is a video frame, the uploaded content is known to be a video, which places high demands on the network and plays smoothly only under a good connection; when the image is a frame of an animated image, the uploaded content is an animated image, which demands more of the network than a general image. The images in the recommendation queue can therefore be ordered by the priority of general images first, then frames of animated images, then video frames, and recommended in that order, so that most accounts can smoothly receive and view the content distributed by the recommendation system.
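A sketch combining the two ordering rules just described (the form-priority values and the account-level field are illustrative assumptions):

```python
# Lower priority number = distributed earlier; general images come first.
FORM_PRIORITY = {"general": 0, "animated": 1, "video_frame": 2}

def order_recommendation_queue(images):
    """images: dicts with a 'form' and an 'account_level' key. Sort by
    presentation form (general, animated, video frame), breaking ties by
    account level in descending order so high-level accounts go first."""
    return sorted(images, key=lambda im: (FORM_PRIORITY[im["form"]],
                                          -im["account_level"]))
```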
It can be seen that the embodiment of the present application can determine the distribution and recommendation policy according to the quality of each image, improving recommendation efficiency and reducing the workload of the device by reducing the recommendation of low-quality images.
In some embodiments, the categories of images may also include hot images, i.e., images with high recent popularity that attract much user attention, such as widely circulated meme images; and cold images, i.e., images with low recent popularity that attract little user attention, such as images used for scientific research. When the category of an image is a hot image, the image is sent to the recommendation queue to await recommendation; when the category is a cold image, recommendation of the image is reduced. In this way, the exposure rate of recommended images and the recommendation efficiency can be improved.
It can be seen that, in the embodiment of the present application, the multiple candidate attribute frames of an image are aggregated by category, so that each resulting group of candidate attribute frames corresponds to a different category, and the target attribute frame of each category is screened out of the corresponding group. This improves the efficiency and accuracy of filtering the candidate attribute frames and avoids the erroneous filtering caused by comparing candidate attribute frames of different categories. Performing category identification on the objects in the image based on the target attribute frames obtained by this filtering then improves the accuracy of image recognition.
In the following, an exemplary application of the image recognition method provided by the embodiment of the present application to low-quality content identification and recommendation scenarios is described.
Referring to fig. 4, fig. 4 is a flowchart of content detection and recommendation provided by an embodiment of the present application. The steps shown in fig. 4 will be described.
In step 201, an image is predicted by a target detection model, so as to obtain candidate attribute frames and corresponding attribute probabilities in the image.
After a user uploads content, the images or videos in the uploaded content need to be detected to determine whether the content is low-quality content. Images (including video frames) may be predicted by a target detection model such as EfficientDet, YOLO, SSD, RCNN (Regions with CNN features), or RetinaNet to determine the candidate attribute frames in the image and their corresponding attribute probabilities.
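Purely as an illustration of this step (the patent does not prescribe a library or model, and a deployed system would use weights trained on the human-body attribute labels rather than the COCO-pretrained ones loaded here), an off-the-shelf torchvision detector can supply candidate frames and probabilities:

```python
import torch
import torchvision

# Any detector of the EfficientDet/YOLO/SSD/RCNN/RetinaNet family works;
# Faster R-CNN is used here only because it ships with torchvision.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)     # stand-in for a real image tensor in [0, 1]
with torch.no_grad():
    pred = model([image])[0]        # dict with 'boxes', 'labels', 'scores'

# Each detection becomes a candidate attribute frame with an attribute probability.
candidates = list(zip(pred["boxes"].tolist(),
                      pred["labels"].tolist(),
                      pred["scores"].tolist()))
```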
In step 202, the candidate attribute frames are filtered and screened by a category-sensitive non-maximum suppression algorithm to obtain the target attribute frame of each category.
In step 203, the low-quality content identification module identifies the content in each target attribute frame; if the content is determined to be non-low-quality content, step 204 is executed, and if it is determined to be low-quality content, step 205 is executed.
In step 204, the image is sent to a recommendation pool (i.e., recommendation queue) awaiting recommendation.
In step 205, the image is intercepted or its distribution is suppressed.
Taking the persons in an image as an example, a person can be divided into different categories, such as foot, leg, and waist, according to the parts of the human body. Each category has corresponding attributes; for the chest category, for example, the attribute may be chest normal, chest exposure, chest protrusion, and so on. When an image is detected by the target detection model, multiple candidate attribute frames are obtained for each part of the human body, and their attributes may be the same or different. For example, in fig. 1A there are three candidate attribute frames for the foot, whose attributes are foot control, foot control, and foot normal, with attribute probabilities of 0.680, 0.289, and 0.206, respectively. The multiple attributes of each category need to be screened, i.e., the multiple candidate attribute frames under each category are screened. The process of screening candidate attribute frames by the category-sensitive non-maximum suppression algorithm is described below.
(1) Aggregate candidate attribute frames of the same category. Candidate attribute frames that belong to the same category but have different attributes are aggregated into one group; for example, the candidate attribute frames of the chest category, with attributes such as chest normal, chest exposure, and chest protrusion, are placed in one group, and the candidate attribute frames within a group are sorted from high to low by attribute probability.
(2) Calculate the intersection ratio of the candidate attribute frames. In order of attribute probability, the intersection ratio (IoU) between each pair of candidate attribute frames in the same group is calculated, i.e., the ratio of the area of the intersection region of the two frames to the area of their union region. As shown in fig. 5, the intersection ratio of candidate attribute frame A and candidate attribute frame B is the ratio of A∩B (the intersection of A and B) to A∪B (the union of A and B). The intersection ratio of each pair of candidate attribute frames is thus calculated from front to back in the sorted order, and when the intersection ratio is greater than the first intersection ratio threshold, only the candidate attribute frame with the larger attribute probability is retained.
(3) Screen the candidate attribute frames. If the intersection ratio of two candidate attribute frames is greater than the first intersection ratio threshold, the one with the smaller attribute probability is filtered out and only the one with the larger attribute probability is retained; proceeding in this way, the candidate attribute frame that remains for each category is screened out as the target attribute frame of the corresponding category.
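A compact sketch of the three steps above (pure Python; the data layout, threshold value, and names are illustrative assumptions rather than the patent's prescribed implementation):

```python
from collections import defaultdict

def iou(a, b):
    # Intersection ratio of boxes (x1, y1, x2, y2), as in fig. 5: A∩B / A∪B.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def category_sensitive_nms(candidates, iou_thresh_1=0.5):
    """candidates: (box, category, attribute, prob) tuples.
    Step (1): aggregate frames of the same category and sort by probability.
    Steps (2)-(3): within each group, suppress the lower-probability frame of
    any pair whose intersection ratio exceeds the first threshold. Frames of
    different categories (e.g. foot vs. leg) are never compared, so they can
    never suppress one another."""
    groups = defaultdict(list)
    for box, category, attribute, prob in candidates:
        groups[category].append((box, attribute, prob))

    kept = {}
    for category, frames in groups.items():
        frames.sort(key=lambda f: f[2], reverse=True)   # highest probability first
        survivors = []
        for frame in frames:
            if all(iou(frame[0], s[0]) <= iou_thresh_1 for s in survivors):
                survivors.append(frame)
        kept[category] = survivors   # per-category target attribute frames
    return kept
```

Assuming the three foot frames of fig. 1A overlap heavily, the 0.680 "foot control" frame suppresses the 0.289 and 0.206 frames, matching fig. 1G; a foot frame of a second, non-overlapping person would survive, which is the multi-person case discussed next.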
The purpose of calculating the intersection ratios of the candidate attribute frames and screening them in this way is to avoid an error that can arise when an image contains multiple people: candidate attribute frames of the same category but belonging to different people are aggregated into one group, and a single frame with the largest attribute probability is screened out of that group as the target attribute frame. For example, for person 1 the candidate attribute frames of the foot are candidate attribute frames 1 and 2, and for person 2 they are candidate attribute frames 3 and 4. If, when the target attribute frame of the foot of person 2 is to be determined, the candidate attribute frames of person 1 and person 2 are aggregated into one group and candidate attribute frame 1 has the largest attribute probability, then candidate attribute frame 1 (the foot of person 1) may be taken as the target attribute frame of the foot of person 2. The target attribute frame is then selected incorrectly, which affects the subsequent low-quality content discrimination.
As shown in fig. 1G and fig. 1H, with the category-sensitive non-maximum suppression algorithm provided by the embodiment of the present application, only the "foot control" candidate attribute frame with attribute probability 0.680 is retained among the foot-category candidate attribute frames in fig. 1G, while the "leg temptation" candidate attribute frame of the leg category is not erroneously filtered out. In fig. 1H, the two noise candidate attribute frames "chest exposure" and "chest micro exposure" are filtered out, and the "chest protrusion" candidate attribute frame with the highest attribute probability is retained.
It should be noted that the application scenario of the embodiment of the present application is not limited to low-quality content identification and recommendation; the image recognition method provided by the embodiment of the present application can be applied to screen candidate attribute frames in any scenario involving homogeneous multi-attribute target detection.
It can be seen that low-quality content identification with a target detection model often faces the problem of homogeneous multi-attribute target detection. In this situation, the candidate-attribute-frame screening methods of the related art often filter out many correct candidate attribute frames by mistake while failing to filter out incorrect ones, and therefore cannot meet application requirements. The embodiment of the present application accordingly provides a category-sensitive non-maximum suppression algorithm for homogeneous multi-attribute target detection to screen the candidate attribute frames. The algorithm takes the relationships among categories into account and screens only among candidate attribute frames of the same category with different attributes, never between candidate attribute frames of different categories (such as foot and leg). It thus filters out the noise candidate attribute frames within each category while avoiding the erroneous filtering caused by the overlap of candidate attribute frames of different categories, effectively resolving both the mis-filtering and the missed-filtering problems and improving the accuracy and recall of low-quality content identification.
An exemplary structure of an electronic device according to an embodiment of the present application is described below, taking a server as an example. Referring to fig. 6, fig. 6 is a schematic structural diagram of a server 200-1 according to an embodiment of the present application. The server 200-1 shown in fig. 6 includes at least one processor 210, a memory 240, and at least one network interface 220. The various components in the server 200-1 are coupled together by a bus system 230, which enables communication among them. In addition to a data bus, the bus system 230 includes a power bus, a control bus, and a status signal bus; for clarity of illustration, however, the various buses are all labeled as the bus system 230 in fig. 6.
The processor 210 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor (for example, a microprocessor or any conventional processor), a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The memory 240 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 240 optionally includes one or more storage devices that are physically located remote from processor 210.
The memory 240 includes volatile memory or non-volatile memory, and may include both. The non-volatile memory may be a read-only memory (ROM, Read Only Memory), and the volatile memory may be a random access memory (RAM, Random Access Memory). The memory 240 described in the embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 240 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
The operating system 241 includes system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, and a driver layer, and is used to implement various basic services and handle hardware-based tasks.
The network communication module 242 is used to reach other computing devices via one or more (wired or wireless) network interfaces 220; exemplary network interfaces 220 include Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB, Universal Serial Bus), and the like.
In some embodiments, the image recognition device 243 provided in the embodiments of the present application may be implemented in software. Fig. 6 shows the image recognition device 243 stored in the memory 240, which may be software in the form of a program, a plug-in, or the like, including the following software modules: a prediction module 2431, an aggregation module 2432, a screening module 2433, and an identification module 2434. These modules are logical, and can therefore be arbitrarily combined or further split according to the functions to be implemented. The functions of the respective modules are described below.
The prediction module 2431 is used to perform attribute prediction processing on the image to obtain a plurality of candidate attribute frames corresponding to the objects in the image; the aggregation module 2432 is used to aggregate the plurality of candidate attribute frames based on their categories to obtain a plurality of groups of candidate attribute frames; the screening module 2433 is used to screen each group of candidate attribute frames based on their intersection ratios to obtain the target attribute frame of each category; and the identification module 2434 is used to perform category identification processing on the object based on the target attribute frame of each category to obtain the category of the image.
In some embodiments, the plurality of candidate attribute frames include candidate overall attribute frames corresponding to the whole of the object and candidate local attribute frames corresponding to parts of the object, and the target attribute frames include a target overall attribute frame and target local attribute frames. The screening module 2433 is further used to determine the target overall attribute frame corresponding to the whole of the object based on the at least one candidate overall attribute frame; traverse each group of candidate local attribute frames, performing a filtering operation on the candidate local attribute frames in the same group; and take the candidate local attribute frame that is retained after filtering, belongs to the object, and has the largest attribute probability as the target local attribute frame.
In some embodiments, the screening module 2433 is further used to: when the number of the at least one candidate overall attribute frame is one, take the candidate overall attribute frame as the target overall attribute frame; when the number of the at least one candidate overall attribute frame is a plurality and the number of the objects is one, take the candidate overall attribute frame with the largest attribute probability as the target overall attribute frame; and when the number of the at least one candidate overall attribute frame is a plurality and the number of the objects is a plurality, determine the intersection ratios between the candidate overall attribute frames, aggregate the candidate overall attribute frames whose intersection ratio is greater than the first intersection ratio threshold into a plurality of groups of candidate overall attribute frames, and take the candidate overall attribute frame with the largest attribute probability in each group as a target overall attribute frame.
In some embodiments, the screening module 2433 is further used to determine the intersection ratio of two candidate local attribute frames based on the positions of the two candidate local attribute frames in the same group; when the intersection ratio is greater than the first intersection ratio threshold, filter out the one of the two candidate local attribute frames with the smaller attribute probability; and when the intersection ratio is less than or equal to the first intersection ratio threshold, filter out, based on the target overall attribute frame, the candidate local attribute frame that does not belong to the object.
In some embodiments, the two candidate local attribute frames are a first candidate local attribute frame and a second candidate local attribute frame, respectively. The screening module 2433 is further used to determine the intersection area and the union area of the first candidate local attribute frame and the second candidate local attribute frame based on their positions, and to take the ratio of the intersection area to the union area as the intersection ratio of the two frames.
In some embodiments, the screening module 2433 is further used to: when the intersection ratio is less than or equal to the first intersection ratio threshold, determine the intersection ratios of the two candidate local attribute frames respectively with the target overall attribute frame, and filter out the candidate local attribute frame whose intersection ratio with the target overall attribute frame is less than or equal to the second intersection ratio threshold.
In some embodiments, the prediction module 2431 is further used to perform convolution processing on the image to obtain image features; classify the image features to obtain a plurality of forward candidate frames; and adjust the plurality of forward candidate frames to obtain the plurality of candidate attribute frames.
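The three operations of this prediction module can be sketched as a small network; the layer sizes, anchor count, and all names below are illustrative assumptions, not an architecture fixed by the patent:

```python
import torch
from torch import nn

class AttributePredictor(nn.Module):
    """Sketch of the prediction module: convolution produces image features,
    one head classifies anchors into forward (foreground) candidate frames
    per attribute, and a second head regresses adjustments to the frames."""
    def __init__(self, num_attributes, num_anchors=9):
        super().__init__()
        self.backbone = nn.Sequential(        # convolution -> image features
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.cls_head = nn.Conv2d(128, num_anchors * num_attributes, 1)  # classify
        self.reg_head = nn.Conv2d(128, num_anchors * 4, 1)  # adjust candidate frames

    def forward(self, images):
        feats = self.backbone(images)
        return self.cls_head(feats), self.reg_head(feats)
```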
In some embodiments, the identification module 2434 is further used to query a mapping table to obtain the score corresponding to each target attribute frame; add the scores corresponding to the target attribute frames to obtain a sum; and determine the category of the image from the score interval to which the sum belongs.

In some embodiments, the categories of the image include low-quality images and non-low-quality images; the identification module 2434 is further used to reduce or prohibit recommendation of the image when the category of the image is a low-quality image, and to send the image to a recommendation queue to await recommendation when the category of the image is a non-low-quality image.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the image recognition method according to the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions that, when executed by a processor, cause the processor to perform an image recognition method provided by embodiments of the present application, for example, the image recognition method shown in fig. 3A.
In some embodiments, the computer-readable storage medium may be an FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM, or may be any device that includes one or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a hypertext markup language (HTML, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or, alternatively, distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the present application aggregates the candidate attribute frames of an image into multiple groups, each corresponding to a different category. Target attribute frames are screened from the candidate attribute frames within each category, which reduces the number of frames to compare, improves the efficiency and accuracy of filtering, and avoids the erroneous filtering caused by comparisons between candidate attribute frames of different categories. Performing category identification on the objects in the image based on the target attribute frames obtained by this filtering then improves the accuracy of image recognition.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (9)

1. An image recognition method, the method comprising:
performing attribute prediction processing on an image to obtain a plurality of candidate attribute frames corresponding to objects in the image;
performing aggregation processing on the plurality of candidate attribute frames based on the categories of the plurality of candidate attribute frames to obtain a plurality of groups of candidate attribute frames, wherein the plurality of groups of candidate attribute frames comprise candidate overall attribute frames corresponding to the whole of the object and candidate local attribute frames corresponding to parts of the object, and a target attribute frame comprises a target overall attribute frame and a target local attribute frame;
determining the target overall attribute frame corresponding to the whole of the object based on at least one candidate overall attribute frame;
traversing each group of the candidate local attribute frames, and determining the intersection ratio of two candidate local attribute frames based on the positions of the two candidate local attribute frames in the same group; when the intersection ratio is greater than a first intersection ratio threshold, filtering out the candidate local attribute frame with the smaller attribute probability of the two candidate local attribute frames; when the intersection ratio is less than or equal to the first intersection ratio threshold, determining the intersection ratios of the two candidate local attribute frames respectively with the target overall attribute frame, and filtering out the candidate local attribute frame whose intersection ratio with the target overall attribute frame is less than or equal to a second intersection ratio threshold, wherein the second intersection ratio threshold is used to measure the degree of coincidence between a candidate local attribute frame and the target overall attribute frame; and taking the candidate local attribute frame that is retained after the filtering, belongs to the object, and has the largest attribute probability as the target local attribute frame;
and carrying out category identification processing on the object based on the target attribute frame corresponding to each category to obtain the category of the image.
2. The method of claim 1, wherein the determining the target overall attribute frame corresponding to the whole of the object based on at least one candidate overall attribute frame comprises:
when the number of the at least one candidate overall attribute frame is one, taking the candidate overall attribute frame as the target overall attribute frame;
when the number of the at least one candidate overall attribute frame is a plurality and the number of the objects is one, taking the candidate overall attribute frame with the largest corresponding attribute probability as the target overall attribute frame;
when the number of the at least one candidate overall attribute frame is a plurality and the number of the objects is a plurality, determining the intersection ratios between the at least one candidate overall attribute frame, aggregating the candidate overall attribute frames whose intersection ratio is greater than the first intersection ratio threshold into a plurality of groups of candidate overall attribute frames, and taking the candidate overall attribute frame with the largest attribute probability in each group of candidate overall attribute frames as the target overall attribute frame.
3. The method of claim 1, wherein the two candidate local attribute frames are a first candidate local attribute frame and a second candidate local attribute frame, respectively;
the determining the intersection ratio of the two candidate local attribute frames based on the positions of the two candidate local attribute frames in the same group comprises:
determining an intersection area and a union area of the first candidate local attribute frame and the second candidate local attribute frame based on the positions of the first candidate local attribute frame and the second candidate local attribute frame;
and taking the ratio of the intersection area to the union area as the intersection ratio of the first candidate local attribute frame and the second candidate local attribute frame.
4. The method of claim 1, wherein the performing attribute prediction processing on the image to obtain a plurality of candidate attribute frames corresponding to the objects in the image comprises:
performing convolution processing on the image to obtain image features;
classifying the image features to obtain a plurality of forward candidate frames;
and adjusting the plurality of forward candidate frames to obtain the plurality of candidate attribute frames.
5. The method according to claim 1, wherein the performing category identification processing on the object based on the target attribute frame corresponding to each category to obtain the category of the image comprises:
querying a mapping table to obtain the score corresponding to each target attribute frame;
adding the scores corresponding to each target attribute frame to obtain a sum;
and determining the category of the image based on the score interval to which the sum belongs.
6. The method of claim 1, wherein the categories of the image comprise a low-quality image and a non-low-quality image, and the method further comprises:
reducing or prohibiting recommendation of the image when the category of the image is a low quality image;
and when the category of the image is a non-low quality image, sending the image to a recommendation queue to wait for recommendation.
7. An image recognition apparatus, comprising:
the prediction module is used for carrying out attribute prediction processing on the image to obtain a plurality of candidate attribute frames corresponding to the objects in the image;
the aggregation module is used for carrying out aggregation processing on the candidate attribute frames based on the categories of the candidate attribute frames to obtain a plurality of groups of candidate attribute frames;
a screening module, configured to determine a target overall attribute frame corresponding to the whole of the object based on at least one candidate overall attribute frame, wherein the plurality of groups of candidate attribute frames comprise candidate overall attribute frames corresponding to the whole of the object and candidate local attribute frames corresponding to parts of the object, and a target attribute frame comprises the target overall attribute frame and a target local attribute frame;
wherein the screening module is further configured to traverse each group of the candidate local attribute frames and determine the intersection ratio of two candidate local attribute frames based on the positions of the two candidate local attribute frames in the same group; when the intersection ratio is greater than a first intersection ratio threshold, filter out the candidate local attribute frame with the smaller attribute probability of the two candidate local attribute frames; when the intersection ratio is less than or equal to the first intersection ratio threshold, determine the intersection ratios of the two candidate local attribute frames respectively with the target overall attribute frame, and filter out the candidate local attribute frame whose intersection ratio with the target overall attribute frame is less than or equal to a second intersection ratio threshold, wherein the second intersection ratio threshold is used to measure the degree of coincidence between a candidate local attribute frame and the target overall attribute frame; and take the candidate local attribute frame that is retained after the filtering, belongs to the object, and has the largest attribute probability as the target local attribute frame;
and the identification module is used for carrying out category identification processing on the object based on the target attribute frame corresponding to each category to obtain the category of the image.
8. An electronic device, the electronic device comprising:
A memory for storing executable instructions;
a processor for implementing the image recognition method of any one of claims 1 to 6 when executing executable instructions stored in the memory.
9. A computer readable storage medium storing executable instructions for causing a processor to perform the image recognition method of any one of claims 1 to 6.
CN202110510014.8A 2021-05-11 2021-05-11 Image recognition method, device, electronic equipment and computer readable storage medium Active CN113761245B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110510014.8A CN113761245B (en) 2021-05-11 2021-05-11 Image recognition method, device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN113761245A (en) 2021-12-07
CN113761245B (en) 2023-10-13

Family

ID=78787052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110510014.8A Active CN113761245B (en) 2021-05-11 2021-05-11 Image recognition method, device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113761245B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011170890A (en) * 2011-06-06 2011-09-01 Fujifilm Corp Face detecting method, face detection device, and program
CN107067413A (en) * 2016-12-27 2017-08-18 南京理工大学 A kind of moving target detecting method of time-space domain statistical match local feature
CN108960266A (en) * 2017-05-22 2018-12-07 阿里巴巴集团控股有限公司 Image object detection method and device
CN109948497A (en) * 2019-03-12 2019-06-28 北京旷视科技有限公司 A kind of object detecting method, device and electronic equipment
CN110033424A (en) * 2019-04-18 2019-07-19 北京迈格威科技有限公司 Method, apparatus, electronic equipment and the computer readable storage medium of image procossing
CN111368600A (en) * 2018-12-26 2020-07-03 北京眼神智能科技有限公司 Method and device for detecting and identifying remote sensing image target, readable storage medium and equipment
CN111598078A (en) * 2019-02-20 2020-08-28 北京奇虎科技有限公司 Object detection method and system based on sequence optimization
CN112052787A (en) * 2020-09-03 2020-12-08 腾讯科技(深圳)有限公司 Target detection method and device based on artificial intelligence and electronic equipment
CN112633352A (en) * 2020-12-18 2021-04-09 浙江大华技术股份有限公司 Target detection method and device, electronic equipment and storage medium
CN112686298A (en) * 2020-12-29 2021-04-20 杭州海康威视数字技术股份有限公司 Target detection method and device and electronic equipment
CN112700469A (en) * 2020-12-30 2021-04-23 武汉卓目科技有限公司 Visual target tracking method and device based on ECO algorithm and target detection



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant