CN113761245A - Image recognition method and device, electronic equipment and computer readable storage medium - Google Patents


Info

Publication number
CN113761245A
Authority
CN
China
Prior art keywords
attribute
candidate
box
image
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110510014.8A
Other languages
Chinese (zh)
Other versions
CN113761245B (en)
Inventor
侯昊迪
余亭浩
张绍明
陈少华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110510014.8A
Publication of CN113761245A
Application granted
Publication of CN113761245B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/53 Querying
    • G06F 16/535 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an image recognition method and apparatus, an electronic device, and a computer-readable storage medium. The method includes: performing attribute prediction processing on an image to obtain a plurality of candidate attribute boxes corresponding to objects in the image; aggregating the candidate attribute boxes based on their categories to obtain a plurality of groups of candidate attribute boxes; screening each group of candidate attribute boxes based on the intersection-over-union (IoU) within the group to obtain a target attribute box corresponding to each category; and performing category identification processing on the objects based on the target attribute box corresponding to each category to obtain the category of the image. The method and apparatus can improve the accuracy of image recognition.

Description

Image recognition method and device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to image processing technologies, and in particular, to an image recognition method, an image recognition apparatus, an electronic device, and a computer-readable storage medium.
Background
Artificial Intelligence (AI) is a comprehensive technique in computer science: by studying the design principles and implementation methods of various intelligent machines, it gives machines the capabilities of perception, reasoning, and decision-making. Artificial intelligence is a comprehensive discipline covering a wide range of fields, such as natural language processing and machine learning; as the technology develops, it will be applied in more fields and deliver increasingly important value.
Image recognition is an important application of artificial intelligence. During image recognition, an image is detected, and the detection result generally includes many similar candidate boxes. When filtering these similar candidate boxes, the related art often mistakenly filters out correct candidate boxes or fails to filter out erroneous ones, so the accuracy of the image recognition result is poor.
Disclosure of Invention
The embodiments of the present application provide an image recognition method and apparatus, an electronic device, and a computer-readable storage medium, which can improve the accuracy of image recognition.
The technical solutions of the embodiments of the present application are implemented as follows:
An embodiment of the present application provides an image recognition method, including:
performing attribute prediction processing on an image to obtain a plurality of candidate attribute boxes corresponding to objects in the image;
aggregating the plurality of candidate attribute boxes based on their categories to obtain a plurality of groups of candidate attribute boxes;
screening each group of candidate attribute boxes based on the intersection-over-union (IoU) within the group to obtain a target attribute box corresponding to each category;
and performing category identification processing on the objects based on the target attribute box corresponding to each category to obtain the category of the image.
An embodiment of the present application provides an image recognition apparatus, including:
a prediction module, configured to perform attribute prediction processing on an image to obtain a plurality of candidate attribute boxes corresponding to objects in the image;
an aggregation module, configured to aggregate the plurality of candidate attribute boxes based on their categories to obtain a plurality of groups of candidate attribute boxes;
a screening module, configured to screen each group of candidate attribute boxes based on the IoU within the group to obtain a target attribute box corresponding to each category;
and an identification module, configured to perform category identification processing on the objects based on the target attribute box corresponding to each category to obtain the category of the image.
In the above solution, the plurality of candidate attribute boxes include candidate overall attribute boxes corresponding to the whole of the object and candidate local attribute boxes corresponding to local parts of the object, and the target attribute boxes include a target overall attribute box and target local attribute boxes. The screening module is further configured to:
determine the target overall attribute box corresponding to the whole of the object based on at least one candidate overall attribute box;
and traverse each group of candidate local attribute boxes, perform a filtering operation on the candidate local attribute boxes in the same group, and take the remaining candidate local attribute box that belongs to the object and has the highest attribute probability as the target local attribute box.
In the above solution, the screening module is further configured to:
when the number of the at least one candidate overall attribute box is one, take that candidate overall attribute box as the target overall attribute box;
when the number of the at least one candidate overall attribute box is more than one and the number of objects is one, take the candidate overall attribute box with the highest attribute probability as the target overall attribute box;
and when the number of the at least one candidate overall attribute box is more than one and the number of objects is more than one, determine the IoU between the candidate overall attribute boxes, aggregate candidate overall attribute boxes whose IoU is greater than a first IoU threshold into groups, and take the candidate overall attribute box with the highest attribute probability in each group as a target overall attribute box.
In the above solution, the screening module is further configured to:
determine the IoU of two candidate local attribute boxes in the same group based on their positions;
when the IoU is greater than the first IoU threshold, filter out the one of the two candidate local attribute boxes with the lower attribute probability;
and when the IoU is less than or equal to the first IoU threshold, filter out, based on the target overall attribute box, the candidate local attribute box that does not belong to the object.
In the above solution, the two candidate local attribute boxes are a first candidate local attribute box and a second candidate local attribute box, respectively. The screening module is further configured to:
determine the intersection area and union area of the first and second candidate local attribute boxes based on their positions;
and take the ratio of the intersection area to the union area as the IoU of the first and second candidate local attribute boxes.
In the above solution, the screening module is further configured to:
when the IoU is less than or equal to the first IoU threshold, determine the IoU of each of the two candidate local attribute boxes with the target overall attribute box;
and filter out any candidate local attribute box whose IoU with the target overall attribute box is less than or equal to a second IoU threshold.
In the above solution, the prediction module is further configured to:
perform convolution processing on the image to obtain image features;
classify the image features to obtain a plurality of positive candidate boxes;
and adjust the plurality of positive candidate boxes to obtain the plurality of candidate attribute boxes.
In the above solution, the identification module is further configured to:
query a mapping table to obtain a score corresponding to each target attribute box;
add the scores corresponding to the target attribute boxes to obtain a sum;
and determine the category of the image based on the score interval into which the sum falls.
In the above solution, the category of the image includes low-quality image and non-low-quality image. The identification module is further configured to:
reduce or prohibit recommendation of the image when the category of the image is low-quality image;
and send the image to a recommendation queue to wait for recommendation when the category of the image is non-low-quality image.
An embodiment of the present application provides an electronic device, including:
a memory, configured to store executable instructions;
and a processor, configured to implement the image recognition method provided by the embodiments of the present application when executing the executable instructions stored in the memory.
An embodiment of the present application provides a computer-readable storage medium storing executable instructions that, when executed by a processor, cause the processor to perform the image recognition method provided by the embodiments of the present application.
An embodiment of the present application provides a computer program product or computer program including computer instructions stored in a computer-readable storage medium; a processor of an electronic device reads the computer instructions from the computer-readable storage medium and executes them, causing the electronic device to perform the image recognition method provided by the embodiments of the present application.
The embodiments of the present application have the following beneficial effects:
By aggregating the image's candidate attribute boxes by category, so that each resulting group corresponds to a different category, and screening a target attribute box for that category out of each group, the filtering efficiency and filtering accuracy of the candidate attribute boxes are improved; performing category identification processing on the objects in the image based on the target attribute boxes obtained by this filtering then improves the accuracy of image recognition.
Drawings
FIGS. 1A-1B are schematic diagrams of candidate attribute boxes output by a target detection model according to an embodiment of the present application;
FIGS. 1C-1D are schematic diagrams of the NMS algorithm of the related art screening candidate attribute boxes;
FIGS. 1E-1F are schematic diagrams of the class-specific NMS algorithm of the related art screening candidate attribute boxes;
FIGS. 1G-1H are schematic diagrams of the class-sensitive non-maximum suppression algorithm provided by an embodiment of the present application screening candidate attribute boxes;
FIG. 2A is a schematic diagram of an architecture of an image recognition system 10 according to an embodiment of the present application;
FIG. 2B is a schematic diagram of an architecture of the image recognition system 10 according to an embodiment of the present application;
FIG. 3A is a schematic flowchart of an image recognition method according to an embodiment of the present application;
FIG. 3B is a schematic flowchart of an image recognition method according to an embodiment of the present application;
FIG. 3C is a schematic flowchart of an image recognition method according to an embodiment of the present application;
FIG. 4 is a flowchart of content detection and recommendation provided by an embodiment of the present application;
FIG. 5 is a schematic illustration of the intersection-over-union (IoU) provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a server 200-1 according to an embodiment of the present disclosure.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first/second/third" are used only to distinguish similar objects and do not denote a particular order. Where permissible, "first/second/third" may be interchanged in a particular order or sequence, so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, the terms and expressions used in the embodiments of the present application are explained below.
1) Information feed: a message source; a data format through which a website disseminates updated information to its users, usually arranged along a timeline. The timeline is the most primitive, intuitive, and basic presentation of an information feed. A prerequisite for a user to be able to subscribe to a website is that the website provides a message source; gathering message sources together is referred to as aggregation.
2) Non-Maximum Suppression (NMS) algorithm: it searches for local maxima and suppresses elements that are not maxima. It finds wide application in computer vision tasks such as edge detection, face detection, and object detection. Taking target detection as an example: during detection, a large number of candidate attribute boxes are generated at the same target position in an image, and these boxes may overlap with each other. The NMS algorithm can then determine the optimal target attribute box for the image target and eliminate redundant candidate attribute boxes.
3) Machine Learning (ML): a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and how it can reorganize existing knowledge structures to continuously improve its own performance.
With the development of content industries such as information feeds and short video, more and more image and video content is uploaded to the Internet by users. Because user-uploaded content is mixed and of uneven quality, some low-quality content (such as vulgar images and incompletely cropped pictures) needs to be identified. Low-quality content is typically identified using a target detection model. Since a target detection model usually predicts many candidate attribute boxes, many of which are redundant or even erroneous, the candidate attribute boxes need to be screened to obtain more accurate detection results. During content review, detection targets of the same category may carry many different attributes; for example, in human-body vulgarity recognition, the chest category includes attributes such as normal chest, protruding chest, and bare chest, and in incomplete-picture recognition, the caption category includes attributes such as normal horizontal captions and vertical captions.
The candidate-attribute-box screening methods in the related art mainly include the NMS algorithm and the class-specific NMS algorithm. The NMS algorithm computes the IoU pairwise over all candidate attribute boxes output by the target detection model (including candidate overall attribute boxes and candidate local attribute boxes) and filters the candidate attribute boxes according to the IoU and the attribute probability of each box. The class-specific NMS algorithm differs from the NMS algorithm in that it only computes IoUs and filters among candidate attribute boxes predicted to be of the same class.
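For concreteness, the following is a minimal sketch of the standard NMS procedure described above (plain Python; the box representation, the `score` field, and the 0.5 threshold are illustrative assumptions, not values from this application):

```python
# Minimal sketch of standard (class-agnostic) NMS. Each box is assumed to be
# a dict {"box": (x1, y1, x2, y2), "label": ..., "score": attribute_probability}.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, iou_threshold=0.5):
    """Repeatedly keep the highest-scoring box and drop boxes overlapping it."""
    remaining = sorted(boxes, key=lambda b: b["score"], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [b for b in remaining
                     if iou(best["box"], b["box"]) <= iou_threshold]
    return kept
```

A class-specific NMS differs only in that `nms` is run separately on the boxes of each predicted class rather than on all boxes at once.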
In the task of recognizing low-quality image content, a given class of object usually has corresponding high-quality types (such as a completely cropped face or a normal chest) and low-quality types (such as an incompletely cropped face or a bare chest). In recognizing low-quality types such as human-body vulgarity and human-body incompleteness, the candidate overall attribute box and the candidate local attribute boxes of the human body must be detected simultaneously. Because of these characteristics, the screening methods of the related art both mistakenly filter out correct candidate attribute boxes and fail to filter out erroneous ones, so the accuracy of image recognition is low. The task of detecting human-body attributes and sensitive parts in human-body vulgarity recognition is used below as an example.
Referring to FIGS. 1A-1B, which are schematic diagrams of candidate attribute boxes output by a target detection model according to an embodiment of the present application: there may be multiple candidate attribute boxes for a certain part of the human body (such as a leg or the waist) or for the human body as a whole. The candidate attribute boxes are independent of each other, and each has a corresponding label (indicating its category and attribute) and attribute probability.
The human-body attribute and sensitive-part detection task needs to detect the candidate overall attribute box and the candidate local attribute boxes of a human body at the same time, and candidate attribute boxes of different types often overlap (a candidate overall attribute box may overlap a candidate local attribute box, and candidate local attribute boxes may overlap each other). Referring to FIGS. 1C-1D, which are schematic diagrams of the NMS algorithm of the related art screening candidate attribute boxes: because many vulgar scenes deliberately emphasize certain parts of the human body, the overlap between the candidate overall attribute box and the candidate local attribute boxes is large, so the NMS algorithm may mistakenly filter out important candidate overall and local attribute boxes. After screening by the NMS algorithm, the candidate attribute box of the leg in FIG. 1C and the candidate attribute box of the chest in FIG. 1D are both filtered out by mistake.
Although the class-specific NMS algorithm avoids the mis-filtering caused by overlap between candidate attribute boxes of different types, different attributes of the same part (such as normal foot/foot control, bare chest/slightly exposed chest/protruding chest) are similar to one another, so multiple candidate attribute boxes for the same part are often left after filtering, and the optimal candidate attribute box cannot be selected. Referring to FIGS. 1E-1F, which are schematic diagrams of the class-specific NMS algorithm of the related art screening candidate attribute boxes: because the class-specific NMS algorithm cannot distinguish different attributes of the same part (i.e., cannot distinguish candidate attribute boxes with the same category but different attributes), noise candidate attribute boxes such as "foot normal" in FIG. 1E and "chest bare" and "chest slightly exposed" in FIG. 1F are not filtered out and are retained.
The embodiments of the present application provide an image recognition method that can improve the accuracy of image recognition.
The image recognition method provided by the embodiments of the present application can be implemented by various electronic devices; for example, it can be implemented by a terminal or a server alone, or by the server and the terminal cooperatively. For example, the terminal itself may perform the image recognition method described below, or the terminal may send a content upload request to the server, and the server executes the image recognition method based on the received request.
The electronic device provided by the embodiment of the application can be various types of terminal devices or servers, wherein the server can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, big data and an artificial intelligence platform; the terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application.
Taking a server as an example: the server cluster may be deployed in the cloud to provide an artificial intelligence cloud service (AIaaS, AI as a Service) to users. An AIaaS platform splits several types of common AI services and provides independent or packaged services in the cloud. This service model is similar to an AI-themed marketplace: any user may access one or more of the artificial intelligence services provided by the AIaaS platform through an application programming interface.
For example, one of the artificial intelligence cloud services may be an image recognition service, that is, the image recognition program provided by the embodiments of the present application is packaged in a cloud server. The terminal, in response to a user's content upload operation, sends a content upload request carrying an image to the cloud server; the cloud server calls the packaged image recognition program to recognize the image and obtain its category, controls the uploading of the image based on its category, and returns the category of the image and the upload result (success or failure) to the terminal.
In some embodiments, an exemplary image recognition system is described taking as an example a server and a terminal cooperatively implementing the image recognition method provided by the embodiments of the present application. Referring to FIG. 2A, which is a schematic diagram of the architecture of an image recognition system 10 according to an embodiment of the present disclosure: the terminal 400 is connected to the server 200-1 through the network 300, which may be a wide area network, a local area network, or a combination of the two. The terminal 400 sends a content upload request carrying an image to the server 200-1 in response to a user's content upload operation. The server 200-1, in response to the request, performs attribute prediction processing on the image to obtain a plurality of candidate attribute boxes; screens candidate attribute boxes with the same category but different attributes to obtain a target attribute box corresponding to each category; determines the category of the image based on the target attribute boxes; and controls the uploading of the image based on its category. Finally, the server 200-1 returns the category of the image (e.g., low-quality or non-low-quality) and the upload result (success or failure) to the terminal 400.
The embodiments of the present application can also be implemented using blockchain technology. Referring to FIG. 2B, both servers and terminals can join the blockchain network 500 and become nodes in it. The type of the blockchain network 500 is flexible and may be, for example, any of a public chain, a private chain, or a consortium chain. Taking a public chain as an example, an electronic device such as the server of any business entity may access the blockchain network 500 without authorization and serve as a consensus node of the blockchain network 500; for example, the server 200-1 is mapped to the consensus node 500-1 in the blockchain network 500, the server 200-2 to the consensus node 500-2, and the server 200-3 to the consensus node 500-3.
Taking the blockchain network 500 as a consortium chain as an example, a server may access the blockchain network 500 and become a node after obtaining authorization. The server 200-1, in response to a content upload request carrying an image, determines the category of the image and then sends it to other servers (such as the server 200-2 and the server 200-3), which may verify the category of the image by executing a smart contract (for example, verifying whether the category is correct). When more nodes than a number threshold confirm that the verification passes, they sign it with digital signatures (i.e., endorsement); when the determined category of the image has sufficient endorsements, the server 200-1 controls the uploading of the image based on its category and returns the upload result (success or failure) to the terminal.
In this way, the embodiments of the present application can improve the accuracy and reliability of image recognition by having multiple nodes perform consensus verification on the image category.
The image recognition method provided by the embodiments of the present application is described below with reference to the accompanying drawings. The executing entity of the method described below may be a server; specifically, the server may implement it by running the various computer programs described above. Of course, as will be apparent from the following description, the image recognition method may also be implemented by a terminal, or by a terminal and a server in cooperation.
Referring to FIG. 3A, which is a schematic flowchart of an image recognition method provided in an embodiment of the present application, the method is described below with reference to the steps shown in FIG. 3A.
In step 101, attribute prediction processing is performed on an image to obtain a plurality of candidate attribute boxes corresponding to objects in the image.
In some embodiments, the image may be an image that a user has uploaded for publication or is about to publish. It may be a standalone picture, a picture in image-text content, or a video frame. Taking a video frame as an example: when the server receives a video uploaded by a user, it extracts a plurality of video frames from the video and performs attribute prediction processing on them in batches or one by one to obtain a plurality of candidate attribute boxes.
The object in the image may be a person, an animal, an item, a scene, and so on. The number of objects in each image may be one or more. When there are multiple objects, the attribute prediction processing yields multiple candidate attribute boxes for each object in the image.
In some embodiments, the attribute prediction processing may be performed on the image by a target detection model such as EfficientDet, YOLO, or SSD (Single Shot MultiBox Detector) to obtain the plurality of candidate attribute boxes. The attribute prediction processing that yields the candidate attribute boxes corresponding to the objects in the image may proceed as follows.
First, convolution processing is performed on the image to obtain image features. The image needs to be scaled to a fixed size before convolution. The image may be convolved by a plurality of convolution blocks, each of which may include several sequentially connected convolution, activation, and pooling layers. The image features are then classified to obtain a plurality of positive candidate boxes. In some possible examples, the image features may be classified by a softmax or sigmoid function, yielding both positive and negative candidate boxes; the negative candidate boxes are removed and the positive ones retained. Positive candidate boxes characterize plausible candidate attribute boxes, and negative candidate boxes characterize erroneous ones. Next, the offset of each positive candidate box is determined, and the position of the box is adjusted according to the offset to obtain the candidate attribute boxes. In some possible examples, the offset includes a center offset and scale factors, the latter comprising a vertical scale factor and a horizontal scale factor: the center of the positive candidate box is moved to a new position according to the center offset, its height is scaled by the vertical scale factor, and its width is scaled by the horizontal scale factor, yielding a candidate attribute box. During this adjustment, candidate boxes that exceed the image boundary or whose size is smaller than a preset value are removed, leaving the plurality of candidate attribute boxes.
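As a rough illustration of the adjustment step just described (the variable names, clipping rule, and minimum size are assumptions of the sketch, not values from this application):

```python
# Sketch: adjust a positive candidate box by its predicted center offset
# (dx, dy) and horizontal/vertical scale factors (sw, sh), then discard it
# if it leaves the image or is smaller than a preset minimum size.

def adjust_box(box, dx, dy, sw, sh, img_w, img_h, min_size=4):
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2 + dx, (y1 + y2) / 2 + dy   # move the center
    w, h = (x2 - x1) * sw, (y2 - y1) * sh             # scale width/height
    nx1, ny1, nx2, ny2 = cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
    if nx1 < 0 or ny1 < 0 or nx2 > img_w or ny2 > img_h:
        return None   # exceeds the image boundary
    if (nx2 - nx1) < min_size or (ny2 - ny1) < min_size:
        return None   # smaller than the preset value
    return (nx1, ny1, nx2, ny2)
```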
To facilitate the subsequent screening, after the candidate attribute boxes are determined, their categories and attributes also need to be determined: the candidate attribute boxes are classified to obtain the label and attribute probability of each box. The label comprises the category and attribute of the corresponding candidate attribute box. An attribute characterizes an overall or local property of the object, and each category (e.g., the chest category) may include multiple attributes (e.g., normal chest, protruding chest). The attribute probability is the probability that the candidate attribute box has the attribute in its label. For example, if the label of a candidate attribute box is "adult female", its categories are "adult" and "female", and its attribute is "real" (i.e., a real person rather than a model or an image).
In some embodiments, the target detection model is trained with multi-label sample data, where the labels include category labels and attribute labels. For a person, for example, the same object may be classified by age, gender, occupation, etc., yielding several mutually independent categories. Accordingly, one sample may have only one category label, such as "adult" or "minor", or several category labels, such as "adult", "girl", and "teacher". A multi-class target detection model can thus be obtained through training with multi-label sample data, improving the model's detection capability. When attribute prediction is performed on an image through a model trained on samples carrying attribute labels and multiple category labels, the multiple categories of an object in the image and their corresponding attributes can be obtained.
In step 102, the plurality of candidate attribute boxes are aggregated based on their categories to obtain a plurality of groups of candidate attribute boxes.
In some embodiments, after the candidate attribute boxes are obtained and their categories are determined from their labels, the candidate attribute boxes belonging to the same category may be aggregated into one group, yielding multiple groups of candidate attribute boxes.
For example, suppose the labels of the candidate attribute boxes are adult girl, foot control, normal foot, chest bare, chest slightly exposed, and chest protruding. From these labels it can be determined that foot control and normal foot belong to the "foot" category, while chest bare, chest slightly exposed, and chest protruding belong to the "chest" category. Therefore, the candidate attribute boxes for foot control and normal foot can be aggregated into one group, the candidate attribute boxes for chest bare, chest slightly exposed, and chest protruding into another group, and the candidate overall attribute box for adult girl forms a group by itself, yielding three groups of candidate attribute boxes.
In this way, the candidate overall attribute box can be separated from the candidate local attribute boxes, and candidate local attribute boxes of different categories can be separated from each other, so that candidate attribute boxes of different categories that happen to overlap in position are not compared against each other during screening and mistakenly filtered out.
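Expressed as code, the aggregation of step 102 amounts to a simple grouping by the category carried in each label (the field names are assumptions of the sketch):

```python
# Sketch of step 102: group candidate attribute boxes by category.
from collections import defaultdict

def group_by_category(candidate_boxes):
    groups = defaultdict(list)
    for b in candidate_boxes:
        # e.g. b["category"] is "foot", "chest", or "overall"
        groups[b["category"]].append(b)
    return groups
```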
In step 103, each group of candidate attribute boxes is screened based on the IoU within the group to obtain the target attribute box corresponding to each category.
In some embodiments, the plurality of candidate attribute boxes include candidate overall attribute boxes corresponding to the whole of an object and candidate local attribute boxes corresponding to parts of the object. For example, the candidate attribute box labeled "adult girl" in FIG. 1A corresponds to the whole person and is therefore a candidate overall attribute box, while the candidate attribute box labeled "foot control" in FIG. 1A corresponds to a part of the person (the foot) and is therefore a candidate local attribute box. The target attribute boxes include a target overall attribute box, corresponding to the whole of the object, and target local attribute boxes, corresponding to parts of the object.
In some embodiments, screening each group of candidate attribute boxes based on the IoU within the group to obtain the target attribute box for each category may be implemented through steps 1031 and 1032 in FIG. 3B.
In step 1031, the target overall attribute box corresponding to the whole of the object is determined based on at least one candidate overall attribute box.
In some embodiments, when there is exactly one candidate overall attribute box, it is taken as the target overall attribute box; for example, if the only candidate overall attribute box is the one labeled "adult girl" in FIG. 1A, it is taken as the target overall attribute box corresponding to the person in the image.
When there are multiple candidate overall attribute boxes and only one object, the candidate overall attribute box with the highest attribute probability is taken as the target overall attribute box. For example, suppose the image contains only one person, and attribute prediction yields two candidate overall attribute boxes for that person, labeled "adult girl" and "minor girl", with attribute probabilities 0.903 and 0.305 respectively; the candidate overall attribute box with probability 0.903 is taken as the target overall attribute box.
When there are multiple candidate overall attribute boxes and multiple objects, the IoU between the candidate overall attribute boxes is determined, candidate overall attribute boxes whose IoU is greater than the first IoU threshold are aggregated into groups, and the candidate overall attribute box with the highest attribute probability in each group is taken as a target overall attribute box.
For example, suppose the image includes person a and person b, and there are 5 candidate overall attribute boxes. The IoU between each pair of the 5 boxes is determined in turn. The IoU of candidate overall attribute boxes 1 and 2 is greater than the first IoU threshold, and the IoUs of boxes 3 and 4, boxes 3 and 5, and boxes 4 and 5 are also greater than the first IoU threshold. Boxes 1 and 2 are therefore aggregated into one group (say, corresponding to person a), and boxes 3, 4, and 5 into another (corresponding to person b). If the attribute probability of box 1 is greater than that of box 2, box 1 is taken as the target overall attribute box for person a; if the attribute probability of box 3 is greater than those of boxes 4 and 5, box 3 is taken as the target overall attribute box for person b. In this way, the target overall attribute box for each object in the image can be determined, avoiding mismatches between target overall attribute boxes and objects.
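A sketch of step 1031 covering these three cases follows; it reuses `iou` from the NMS sketch above, and the clustering rule and parameter names are assumptions of the sketch:

```python
# Sketch of step 1031: select one target overall attribute box per object.

def select_target_overall(overall_boxes, num_objects, first_iou_threshold=0.5):
    if len(overall_boxes) == 1:
        return overall_boxes                                   # single candidate
    if num_objects == 1:
        return [max(overall_boxes, key=lambda b: b["score"])]  # single object
    # Multiple objects: cluster overall boxes whose pairwise IoU exceeds the
    # first threshold, then keep the highest-probability box of each cluster.
    clusters = []
    for box in sorted(overall_boxes, key=lambda b: b["score"], reverse=True):
        for cluster in clusters:
            if iou(box["box"], cluster[0]["box"]) > first_iou_threshold:
                cluster.append(box)
                break
        else:
            clusters.append([box])
    # Boxes were visited in descending score order, so cluster[0] is the
    # highest-probability box of its cluster.
    return [cluster[0] for cluster in clusters]
```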
In step 1032, each group of candidate local attribute boxes is traversed, a filtering operation is performed on the candidate local attribute boxes within the same group, and the remaining candidate local attribute box that belongs to the object and has the highest attribute probability is taken as the target local attribute box.
In some embodiments, when traversing each group of candidate local attribute boxes, the boxes in each group may be traversed in random order, or sorted in descending (or ascending) order of their attribute probabilities and traversed in that order.
In some embodiments, it must be ensured that the retained candidate local attribute box belongs to the corresponding object in the image, because there may be multiple objects in the image and a candidate local attribute box may not belong to any of them. Moreover, because the same category of the same object may contain multiple independent candidate local attribute boxes (whose attributes may or may not be the same), the candidate local attribute box with the highest attribute probability in that category must be retained and the others filtered out.
In some embodiments, performing the filtering operation on the candidate local attribute boxes in the same group may be implemented through steps 10321 to 10324 in FIG. 3C.
In step 10321, the IoU of two candidate local attribute boxes in the same group is determined based on their positions.
In some embodiments, the two candidate local attribute boxes may be any two boxes in the same group, or two adjacent boxes; call them the first and second candidate local attribute boxes. Based on their positions, the intersection area and union area of the two boxes are determined, and the ratio of the intersection area to the union area is taken as their IoU. The IoU of the two candidate local attribute boxes reflects their similarity: the larger the IoU, the more the two boxes overlap and the more they need to be screened.
In step 10322, the IoU is compared with the first IoU threshold; when the IoU is greater than the first IoU threshold, step 10323 is performed, and when it is less than or equal to the first IoU threshold, step 10324 is performed.
In some embodiments, the higher the first IoU threshold, the fewer pairs of candidate local attribute boxes have an IoU above it, making filtering more efficient, though all IoUs may then fall below the threshold. The lower the first IoU threshold, the more pairs exceed it, so filtering accuracy is higher but filtering efficiency is lower. The first IoU threshold therefore needs to be set reasonably. The IoU thresholds for candidate local attribute boxes of different categories may be the same (e.g., all equal to the first IoU threshold) or different.
In step 10323, the one of the two candidate local attribute boxes with the lower attribute probability is filtered out.
In some embodiments, since a higher attribute probability indicates a more accurate candidate local attribute box, when the IoU is greater than the first IoU threshold, the box with the lower attribute probability should be filtered out of the two strongly overlapping boxes and the box with the higher probability retained. This improves the accuracy of filtering.
In step 10324, the candidate local attribute box that does not belong to the object is filtered out of the two, based on the target overall attribute box.
In some embodiments, when the IoU is less than or equal to the first IoU threshold, the two candidate local attribute boxes do not belong to the same object. It may be that one of them belongs to the object currently being screened, while the other belongs to no object in the image (a misrecognition) or to a different object than the one currently being screened. The other candidate local attribute box therefore needs to be filtered out.
In some embodiments, filtering out the candidate local attribute box that does not belong to the object based on the target overall attribute box can be implemented as follows: determine the IoU of each of the two candidate local attribute boxes with the target overall attribute box, and filter out any candidate local attribute box whose IoU with the target overall attribute box is less than or equal to the second IoU threshold.
The second IoU threshold, which differs from the first, measures the degree of coincidence between a candidate local attribute box and the target overall attribute box. The target overall attribute box is the target attribute box corresponding to the whole of the object currently being screened. For example, suppose the second IoU threshold is 0.6, the image includes person 3 and person 4, and person 3 is the object currently being screened. If the IoUs of the two candidate local attribute boxes with the target overall attribute box of person 3 are 0.7 and 0.2 respectively, the box with IoU 0.2 is filtered out and the box with IoU 0.7 retained. This avoids running the suppression step on two partially overlapping candidate local attribute boxes that do not belong to the same object, which would mistakenly filter out a correct candidate local attribute box that should be retained.
For multiple candidate local attribute boxes in the same group, after the box to be retained is determined from a pair, the filtering operation continues between the retained box and a new candidate local attribute box in the group, until all candidate local attribute boxes in the group have undergone the filtering operation. The candidate local attribute box that belongs to the object and has the highest attribute probability is thereby obtained and taken as the target local attribute box. In this way, a unique target local attribute box is obtained for each group of candidate local attribute boxes while its accuracy is guaranteed.
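Putting steps 10321-10324 together, a sketch of the per-group filtering operation might look as follows (reusing `iou` from the NMS sketch; the second threshold echoes the 0.6 example above, while the first threshold and field names are assumptions):

```python
# Sketch of the filtering operation over one group of candidate local boxes:
# a pairwise IoU above the first threshold drops the lower-probability box;
# otherwise each box's IoU with the target overall box decides whether it
# belongs to the object currently being screened.

def filter_group(local_boxes, target_overall, first_thr=0.5, second_thr=0.6):
    boxes = sorted(local_boxes, key=lambda b: b["score"], reverse=True)
    removed = set()
    for i in range(len(boxes)):
        if i in removed:
            continue
        for j in range(i + 1, len(boxes)):
            if j in removed:
                continue
            if iou(boxes[i]["box"], boxes[j]["box"]) > first_thr:
                removed.add(j)            # same part: drop lower probability
            else:
                # Not the same part: drop whichever box does not lie on the
                # object, judged against the target overall attribute box.
                for k in (i, j):
                    if iou(boxes[k]["box"], target_overall["box"]) <= second_thr:
                        removed.add(k)
            if i in removed:
                break
    survivors = [b for k, b in enumerate(boxes) if k not in removed]
    return survivors[0] if survivors else None   # target local attribute box
```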
In some embodiments, the screening based on the IoU within each group of candidate attribute boxes to obtain the target attribute box for each category may also be implemented as follows. The target overall attribute box corresponding to the whole of the object is determined based on the at least one candidate overall attribute box. The IoU of the target overall attribute box with each candidate local attribute box is computed, and in each group the candidate local attribute boxes whose IoU with the target overall attribute box is greater than the second IoU threshold are selected; these boxes correspond to the same object in the image as the target overall attribute box. The candidate local attribute box with the highest attribute probability among those selected in each group is then taken as the target local attribute box of the category corresponding to that group.
When there are multiple objects in the image, there are multiple target overall attribute boxes in one-to-one correspondence with them. The IoU of each target overall attribute box with each candidate local attribute box is computed. For each target overall attribute box, the candidate local attribute boxes whose IoU with it is greater than the second IoU threshold correspond to the same object in the image. Then, when determining the target local attribute boxes of each object, the candidate local attribute box with the highest attribute probability among the object's candidate local attribute boxes in each category (i.e., each group) is selected as the target local attribute box of that category.
For example, suppose the image includes person 5, corresponding to target overall attribute box 1, and person 6, corresponding to target overall attribute box 2, and there are two groups of candidate local attribute boxes: a chest group and a leg group. In the chest group, the IoUs of candidate local attribute boxes 1 and 2 with target overall attribute box 1 are greater than the second IoU threshold, and box 1 has the higher attribute probability. In the leg group, the IoUs of candidate local attribute boxes 3 and 4 with target overall attribute box 1 are greater than the second IoU threshold, and box 3 has the higher attribute probability. Candidate local attribute box 1 is therefore taken as the target local attribute box of the chest group (chest category) for person 5, and candidate local attribute box 3 as the target local attribute box of the leg group (leg category) for person 5. Similarly, the target local attribute boxes for person 6 can be determined from the chest and leg groups.
In this way, candidate attribute boxes belonging to different categories of different objects can be distinguished, and screening is performed only among candidate attribute boxes of the same category, improving screening accuracy and avoiding mis-screening.
In step 104, a class identification process is performed on the object based on the object attribute box corresponding to each class, and the class of the image is obtained.
In some embodiments, the class identification processing is performed on the object based on the object attribute box corresponding to each class to obtain the class of the image, which may be implemented as follows: inquiring a mapping table based on the label of each target attribute box to obtain a score corresponding to each target attribute box; adding the corresponding scores of each target attribute frame to obtain a sum; the category of the image is determined based on the score interval of the sum.
The mapping table stores the labels of target attribute boxes and their corresponding scores. For example, each score lies in [0, 1]: when the label of a target attribute box is "chest bared", the corresponding score is 1; when the label is "chest normal", the corresponding score is 0. Here a higher score indicates more severe chest exposure and a greater likelihood that the image is vulgar. The sum of the scores of all target attribute boxes represents the likelihood that the image as a whole is low-quality: the higher the sum, the more likely the image is low-quality. After the sum is obtained, the score interval to which it belongs is determined, with different score intervals corresponding to different categories. For example, it may be preset that the score interval [0, 0.5] corresponds to non-low-quality images and the interval (0.5, ∞) to low-quality images; when the sum is 0.3, the image is determined to be non-low-quality.
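As a hedged illustration of this mapping-table scoring, the sketch below uses the example labels, scores, and 0.5 interval boundary from the text; the table contents and the threshold are example values rather than fixed parts of the method.

```python
# Illustrative mapping-table scoring: look up a score per target attribute
# box label, sum the scores, and map the sum to a category by interval.

SCORE_TABLE = {"chest bared": 1.0, "chest normal": 0.0}  # label -> score

def classify_image(target_box_labels, threshold=0.5):
    total = sum(SCORE_TABLE.get(label, 0.0) for label in target_box_labels)
    # Sum in [0, 0.5] -> non-low-quality; sum in (0.5, inf) -> low-quality.
    return "low-quality" if total > threshold else "non-low-quality"

# e.g. classify_image(["chest normal"]) -> "non-low-quality" (sum 0.0)
```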
In this way, since the label of each target attribute box reflects the image category to some extent, combining the labels of all target attribute boxes yields an accurate image category.
In some possible examples, after the score corresponding to each target attribute box is obtained from the mapping table, each score is multiplied by the weight of the corresponding target attribute box and the products are summed; the category of the image is then determined based on the score interval of the sum. The weight of a target attribute box may be its attribute probability, or a weight assigned to its category; target attribute boxes of different categories differ in importance and thus in weight. For example, in a human-body attribute and sensitive-part detection task, the weights of the chest, waist, leg, and foot categories may be set to 0.6, 0.2, 0.1, and 0.1, respectively.
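The weighted variant can be sketched in the same way; the category weights below are the example values given in the text and are assumptions of this illustration rather than prescribed constants.

```python
# Weighted scoring sketch: each label score is scaled by the weight of its
# category before summing, so influential categories count for more.

CATEGORY_WEIGHTS = {"chest": 0.6, "waist": 0.2, "leg": 0.1, "foot": 0.1}

def weighted_image_score(target_boxes):
    """target_boxes: iterable of (category, label_score) pairs."""
    return sum(CATEGORY_WEIGHTS.get(category, 0.0) * label_score
               for category, label_score in target_boxes)
```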
In this way, the weights of target attribute boxes of different categories are taken into account, so that the categories with greater influence on the image receive more consideration, yielding a more accurate image category.
In some embodiments, the category of the image may also be obtained by performing category identification processing on the object in the image with a machine learning method (e.g., with a model such as AlexNet or GoogLeNet).
In some embodiments, the recommendation strategies for images of different categories differ. The categories of images may include low-quality images and non-low-quality images, where low-quality images include vulgar images, incomplete images, sensitive images (containing sensitive words), and the like. When the category of the image is a low-quality image, recommendation of the image is reduced or prohibited; when the category is a non-low-quality image, the image is sent to a recommendation queue of a recommendation system to await recommendation.
In some possible examples, the recommendation system may sort the images in the recommendation queue in descending order of the rank of the account that uploaded each image, and recommend the images in that order. A high-rank account may be an original-content account, an official account, or an account whose number of followers exceeds a preset threshold (e.g., one million). Images uploaded by high-rank accounts are thus distributed preferentially, so that important content published officially, or high-quality content of general public interest, is distributed first.
In other possible examples, the recommendation system may also sort the images in the recommendation queue according to their presentation form, which includes ordinary images, images in animations, and video frames in videos. When an image is a video frame, the uploaded content is known to be a video, which places high demands on the network and plays smoothly only under good network conditions. When an image comes from an animation, the uploaded content is known to be an animation, whose network requirements are higher than those of an ordinary image. The images in the recommendation queue can therefore be ordered by the priority of ordinary images, then images in animations, then video frames, and recommended in that order, so that most accounts can smoothly receive and view the content distributed by the recommendation system. A sketch of such an ordering is given below.
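One possible way to combine the two example orderings above is a composite sort key, sorting first by account rank and then by presentation form; the field names, rank representation, and priority values in this sketch are illustrative assumptions.

```python
# Sketch of ordering a recommendation queue: higher account rank first,
# then ordinary images before animation frames before video frames.

FORM_PRIORITY = {"ordinary": 0, "animation": 1, "video_frame": 2}

def order_queue(images):
    """images: list of dicts like {"account_rank": int, "form": str}."""
    return sorted(images,
                  key=lambda im: (-im["account_rank"],
                                  FORM_PRIORITY[im["form"]]))
```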
In this way, the corresponding distribution and recommendation strategy can be determined according to image quality, recommendation efficiency is improved, and the workload of the device is reduced by cutting back recommendations of low-quality images.
In some embodiments, the categories of images may also include popular images, i.e., images that are currently trending and receive much user attention, such as common emoticon images, and unpopular images, i.e., images with little recent attention from users, such as images used for scientific research. When the category of the image is a popular image, the image is sent to the recommendation queue to await recommendation; when it is an unpopular image, recommendation of the image is reduced. This increases the exposure of recommended images and improves recommendation efficiency.
Thus, in the embodiments of the present application, the multiple candidate attribute boxes of an image are aggregated by category, so that the resulting groups of candidate attribute boxes correspond to different categories and the target attribute box of each category is screened out from its own group, which improves the efficiency and accuracy of filtering candidate attribute boxes and eliminates the erroneous filtering caused by comparisons between candidate attribute boxes of different categories. Performing category identification processing on the object in the image based on the filtered target attribute boxes then improves the accuracy of image recognition.
Next, an exemplary application of the image recognition method provided by the embodiment of the present application to a low-quality content identification and recommendation scenario is described.
Referring to fig. 4, fig. 4 is a flowchart of content detection and recommendation provided in an embodiment of the present application. The description will be made in conjunction with the steps shown in fig. 4.
In step 201, the image is predicted by a target detection model to obtain the candidate attribute boxes in the image and their corresponding attribute probabilities.
After a user uploads content, the images or videos in the uploaded content need to be detected to determine whether the content is low-quality. Images (including video frames) can be predicted by target detection models such as EfficientDet, YOLO, SSD, RCNN (Regions with CNN features), and RetinaNet to determine the candidate attribute boxes in the image and their corresponding attribute probabilities. A hedged example of such an inference call is given below.
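As one hedged example, the sketch below uses torchvision's off-the-shelf Faster R-CNN purely as a stand-in: the patent does not prescribe a specific model, and a real system would fine-tune the detector on attribute-annotated data so that its classes are attributes such as "chest bared" rather than generic object classes.

```python
# Run an off-the-shelf detector to get candidate boxes, class labels, and
# per-box probabilities (a stand-in for the attribute detector in the text).
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)        # placeholder for a real RGB image in [0, 1]
with torch.no_grad():
    (prediction,) = model([image])     # one output dict per input image

candidate_boxes = prediction["boxes"]    # (N, 4) tensor: x1, y1, x2, y2
attribute_probs = prediction["scores"]   # (N,) per-box confidence
attribute_labels = prediction["labels"]  # (N,) predicted class indices
```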
In step 202, the candidate attribute boxes are filtered by a category-sensitive non-maximum suppression algorithm to obtain the target attribute box corresponding to each category.
In step 203, the content in each target attribute box is identified by a low-quality content identification module; if the content is determined to be non-low-quality, step 204 is executed, and if the content is determined to be low-quality, step 205 is executed.
In step 204, the image is sent to a recommendation pool (i.e., a recommendation queue) to await recommendation.
In step 205, the image is intercepted or suppressed (i.e., blocked from distribution or down-weighted).
Taking the persons in an image as an example, a person can be divided into different categories by body part, such as foot, leg, waist, and so on. Each category has corresponding attributes; for the chest category, for example, the attribute may be "chest normal", "chest bared", "chest prominent", and so on. When the image is detected by the target detection model, several candidate attribute boxes are obtained for each body part, and their attributes may be the same or different. For example, in fig. 1A there are 3 candidate attribute boxes corresponding to the feet, with attributes "foot control", "foot control", and "foot normal", and corresponding attribute probabilities 0.680, 0.289, and 0.206. The multiple attributes under each category need to be filtered, that is, the multiple candidate attribute boxes under each category need to be screened. The process of filtering candidate attribute boxes with the category-sensitive non-maximum suppression algorithm is described below.
(1) Aggregate candidate attribute boxes of the same category. Candidate attribute boxes with the same category but different attributes are aggregated into one group. Taking the chest as an example, the candidate attribute boxes of chest attributes such as "chest normal", "chest bared", and "chest prominent" are placed in one group, and the candidate attribute boxes within a group are sorted by attribute probability from high to low.
(2) Compute the intersection over union of candidate attribute boxes. Following the attribute-probability ordering, the IoU between each pair of candidate attribute boxes in the same group is computed in turn, i.e., the ratio of the area of the intersection region of the two boxes to the area of their union region. As shown in fig. 5, the IoU of candidate attribute box A and candidate attribute box B is the ratio of A∩B (the intersection of A and B) to A∪B (the union of A and B). The pairwise IoU is thus computed front to back in sorted order, and whenever the IoU is greater than the first IoU threshold, only the candidate attribute box with the higher attribute probability is retained.
(3) Screen the candidate attribute boxes. If the IoU of two candidate attribute boxes is greater than the first IoU threshold, the box with the lower attribute probability is filtered out and only the box with the higher attribute probability is kept; proceeding in this way, the candidate attribute box screened out for each category is taken as the target attribute box of the corresponding category. A sketch of these three steps is given after this list.
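The following compact Python sketch puts steps (1)-(3) together: it groups the candidate boxes by category, sorts each group by attribute probability, and suppresses any box whose IoU with an already-kept box of the same category exceeds the first IoU threshold. The data layout and the 0.5 threshold are illustrative assumptions, not values fixed by the method.

```python
from collections import defaultdict

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def category_sensitive_nms(candidates, first_iou_threshold=0.5):
    """candidates: list of (box, category, attribute, probability) tuples.
    Boxes are only ever compared with boxes of the same category."""
    groups = defaultdict(list)
    for cand in candidates:                  # step (1): aggregate by category
        groups[cand[1]].append(cand)
    kept = []
    for group in groups.values():
        group.sort(key=lambda c: c[3], reverse=True)  # highest probability first
        survivors = []
        for cand in group:                   # steps (2)-(3): IoU then screening
            if all(iou(cand[0], s[0]) <= first_iou_threshold
                   for s in survivors):
                survivors.append(cand)
        kept.extend(survivors)
    return kept
```

Because boxes of different categories never suppress one another, a "leg tempting" box is untouched by foot-category screening, which is exactly the behavior illustrated in fig. 1G.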
Computing the IoU between candidate attribute boxes and screening them in this way avoids the errors that could arise if the candidate attribute boxes of different persons in the same category were grouped together and the single box with the highest attribute probability were screened out as the target attribute box. For example, for person 1 the candidate attribute boxes of the foot are candidate attribute boxes 1 and 2, and for person 2 they are candidate attribute boxes 3 and 4. If, when determining the target attribute box of person 2's foot, the candidate attribute boxes of person 1 and person 2 were aggregated into one group and candidate attribute box 1 had the highest attribute probability, then candidate attribute box 1 (person 1's foot) would be taken as the target attribute box of person 2's foot, an erroneous selection that would affect the subsequent low-quality content determination.
As shown in fig. 1G and fig. 1H, after the category-sensitive non-maximum suppression algorithm provided in the embodiment of the present application is applied, only the "foot control" candidate attribute box with attribute probability 0.680 is retained for the foot category in fig. 1G, while the "leg tempting" candidate attribute box of the leg category is not filtered out by mistake. In fig. 1H, the two noise candidate attribute boxes "chest bared" and "chest faint" are filtered out while the "chest prominent" candidate attribute box with the highest attribute probability is retained.
It should be noted that the application scenario of the embodiment of the present application is not limited to low-quality content identification and recommendation; the image recognition method provided in the embodiment of the present application can be applied to screen candidate attribute boxes in any scenario involving similar multi-attribute object detection.
It can be seen that low-quality content identification with a target detection model often faces the problem of same-category multi-attribute target detection. In this situation, the candidate attribute box screening methods in the related art often erroneously filter out many correct candidate attribute boxes while failing to filter out wrong ones, and cannot meet application requirements. The embodiment of the present application therefore provides a category-sensitive non-maximum suppression algorithm for same-category multi-attribute target detection, used to screen the candidate attribute boxes of such targets. The algorithm takes the relationship between categories into account: when screening candidate attribute boxes, it is sensitive only to candidate attribute boxes with different attributes of the same category and is insensitive to candidate attribute boxes of different categories (such as foot and leg). It can thus effectively filter out same-category noise candidate attribute boxes while avoiding the erroneous filtering caused by overlap between candidate attribute boxes of different categories, effectively solving the problems of erroneous filtering and missed filtering and improving the accuracy and recall of low-quality content identification.
An exemplary structure of the electronic device provided in the embodiment of the present application is described below, taking a server as an example. Referring to fig. 6, fig. 6 is a schematic structural diagram of a server 200-1 provided in an embodiment of the present application. The server 200-1 shown in fig. 6 includes at least one processor 210, a memory 240, and at least one network interface 220. The various components in the server 200-1 are coupled together by a bus system 230, which enables connection and communication between them. In addition to a data bus, the bus system 230 includes a power bus, a control bus, and a status signal bus; for clarity of illustration, however, the various buses are all labeled as bus system 230 in fig. 6.
The processor 210 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
The memory 240 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 240 optionally includes one or more storage devices physically located remote from processor 210.
The memory 240 may be volatile memory or nonvolatile memory, and may also include both. The nonvolatile memory may be a read-only memory (ROM), and the volatile memory may be a random access memory (RAM). The memory 240 described in the embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, the memory 240 can store data to support various operations; examples include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
The operating system 241 includes system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks.
A network communication module 242 is used for reaching other computing devices via one or more (wired or wireless) network interfaces 220; exemplary network interfaces 220 include Bluetooth, Wi-Fi, Universal Serial Bus (USB), and the like.
In some embodiments, the image recognition device 243 provided by the embodiments of the present application may be implemented in software; fig. 6 shows the image recognition device 243 stored in the memory 240, which may be software in the form of programs, plug-ins, and the like, and which includes the following software modules: a prediction module 2431, an aggregation module 2432, a screening module 2433, and an identification module 2434. These modules are logical, and can therefore be arbitrarily combined or further divided according to the functions implemented. The functions of the respective modules are explained below.
The prediction module 2431 is configured to perform attribute prediction processing on the image to obtain multiple candidate attribute boxes corresponding to the object in the image; the aggregation module 2432 is configured to aggregate the multiple candidate attribute boxes based on their categories to obtain multiple groups of candidate attribute boxes; the screening module 2433 is configured to screen each group of candidate attribute boxes based on the intersection over union of the candidate attribute boxes in the group to obtain the target attribute box corresponding to each category; and the identification module 2434 is configured to perform category identification processing on the object based on the target attribute box corresponding to each category to obtain the category of the image.
In some embodiments, the multiple candidate attribute boxes include a candidate overall attribute box corresponding to the whole of the object and candidate local attribute boxes corresponding to parts of the object, and the target attribute box includes a target overall attribute box and a target local attribute box. The screening module 2433 is further configured to determine a target overall attribute box corresponding to the whole of the object based on the at least one candidate overall attribute box; and to traverse each group of candidate local attribute boxes, perform a filtering operation on the candidate local attribute boxes in the same group, and take the filtered candidate local attribute box that belongs to the object and has the highest attribute probability as the target local attribute box.
In some embodiments, the screening module 2433 is further configured to: when the number of the at least one candidate overall attribute box is one, take that candidate overall attribute box as the target overall attribute box; when the number of the at least one candidate overall attribute box is multiple and the number of objects is one, take the candidate overall attribute box with the highest attribute probability as the target overall attribute box; and when the number of the at least one candidate overall attribute box is multiple and the number of objects is multiple, determine the IoU among the candidate overall attribute boxes, aggregate those whose IoU is greater than the first IoU threshold into multiple groups of candidate overall attribute boxes, and take the box with the highest attribute probability in each group as a target overall attribute box.
In some embodiments, the screening module 2433 is further configured to determine the IoU of two candidate local attribute boxes in the same group based on their positions; when the IoU is greater than the first IoU threshold, filter out the candidate local attribute box with the lower attribute probability of the two; and when the IoU is less than or equal to the first IoU threshold, filter out, based on the target overall attribute box, the candidate local attribute box of the two that does not belong to the object.
In some embodiments, the two candidate local attribute boxes are a first candidate local attribute box and a second candidate local attribute box, respectively; the screening module 2433 is further configured to determine the intersection area and union area of the first and second candidate local attribute boxes based on their positions, and take the ratio of the intersection area to the union area as the IoU of the first and second candidate local attribute boxes.
In some embodiments, the screening module 2433 is further configured to, when the IoU is less than or equal to the first IoU threshold, determine the IoU of each of the two candidate local attribute boxes with the target overall attribute box, and filter out the candidate local attribute box whose IoU with the target overall attribute box is less than or equal to the second IoU threshold.
In some embodiments, the prediction module 2431 is further configured to perform convolution processing on the image to obtain image features; classify the image features to obtain multiple forward candidate boxes; and adjust the multiple forward candidate boxes to obtain the multiple candidate attribute boxes.
In some embodiments, the identification module 2434 is further configured to query the mapping table to obtain the score corresponding to each target attribute box, add the scores of the target attribute boxes to obtain a sum, and determine the category of the image based on the score interval of the sum.
In some embodiments, the categories of images include low-quality images and non-low-quality images; the identification module 2434 is further configured to reduce or prohibit recommendation of the image when its category is a low-quality image, and to send the image to a recommendation queue to await recommendation when its category is a non-low-quality image.
Embodiments of the present application provide a computer program product or computer program that includes computer instructions stored in a computer-readable storage medium. The processor of an electronic device reads the computer instructions from the computer-readable storage medium and executes them, causing the electronic device to execute the image recognition method described in the embodiments of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions that, when executed by a processor, cause the processor to perform the image recognition method provided by the embodiments of the present application, for example the image recognition method shown in fig. 3A.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a Hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, in the embodiment of the present application, the candidate attribute boxes of an image are aggregated into groups, where different groups correspond to different categories. Target attribute boxes are screened from candidate attribute boxes of the same category, which narrows the set of possible target attribute boxes, improves the efficiency and accuracy of filtering candidate attribute boxes, and eliminates the erroneous filtering caused by comparisons between candidate attribute boxes of different categories. Performing category identification processing on the object in the image based on the filtered target attribute boxes then improves the accuracy of image recognition.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (10)

1. An image recognition method, characterized in that the method comprises:
performing attribute prediction processing on an image to obtain a plurality of candidate attribute boxes corresponding to an object in the image;
aggregating the plurality of candidate attribute boxes based on categories of the plurality of candidate attribute boxes to obtain a plurality of groups of candidate attribute boxes;
screening each group of candidate attribute boxes based on an intersection over union of the candidate attribute boxes in each group to obtain a target attribute box corresponding to each category; and
performing category identification processing on the object based on the target attribute box corresponding to each category to obtain a category of the image.
2. The method of claim 1, wherein the plurality of candidate attribute boxes comprises a candidate overall attribute box corresponding to the entirety of the object and a candidate local attribute box corresponding to a part of the object, and the target attribute box comprises a target overall attribute box and a target local attribute box;
the screening each group of candidate attribute boxes based on an intersection over union of the candidate attribute boxes in each group to obtain a target attribute box corresponding to each category comprises:
determining a target overall attribute box corresponding to the entirety of the object based on at least one of the candidate overall attribute boxes;
and traversing each group of candidate local attribute boxes, performing a filtering operation on the candidate local attribute boxes in the same group, and taking the filtered candidate local attribute box which belongs to the object and has the highest corresponding attribute probability as the target local attribute box.
3. The method of claim 2, wherein the determining a target overall attribute box corresponding to the entirety of the object based on at least one of the candidate overall attribute boxes comprises:
when the number of at least one candidate overall attribute box is one, taking the candidate overall attribute box as the target overall attribute box;
when the number of at least one candidate overall attribute box is multiple and the number of the objects is one, taking the candidate overall attribute box with the maximum corresponding attribute probability as the target overall attribute box;
when the number of the at least one candidate overall attribute box is multiple and the number of the objects is multiple, determining an intersection over union among the at least one candidate overall attribute box, aggregating the candidate overall attribute boxes whose intersection over union is greater than a first intersection-over-union threshold into a plurality of groups of candidate overall attribute boxes, and taking the candidate overall attribute box with the highest attribute probability in each group of candidate overall attribute boxes as the target overall attribute box.
4. The method of claim 2, wherein the performing a filtering operation on the candidate local attribute boxes in the same group comprises:
determining an intersection over union of two candidate local attribute boxes in the same group based on positions of the two candidate local attribute boxes;
when the intersection over union is greater than a first intersection-over-union threshold, filtering out the candidate local attribute box with the lower attribute probability of the two candidate local attribute boxes; and
when the intersection over union is less than or equal to the first intersection-over-union threshold, filtering out, based on the target overall attribute box, the candidate local attribute box which does not belong to the object from the two candidate local attribute boxes.
5. The method of claim 4, wherein the two candidate local attribute boxes are a first candidate local attribute box and a second candidate local attribute box, respectively;
the determining an intersection over union of the two candidate local attribute boxes based on the positions of the two candidate local attribute boxes in the same group comprises:
determining an intersection area and a union area of the first candidate local attribute box and the second candidate local attribute box based on positions of the first candidate local attribute box and the second candidate local attribute box; and
taking a ratio of the intersection area to the union area as the intersection over union of the first candidate local attribute box and the second candidate local attribute box.
6. The method according to claim 4, wherein the filtering out, based on the target overall attribute box, the candidate local attribute box which does not belong to the object from the two candidate local attribute boxes when the intersection over union is less than or equal to the first intersection-over-union threshold comprises:
when the intersection over union is less than or equal to the first intersection-over-union threshold, respectively determining an intersection over union of each of the two candidate local attribute boxes with the target overall attribute box; and
filtering out the candidate local attribute box whose intersection over union with the target overall attribute box is less than or equal to a second intersection-over-union threshold.
7. The method of claim 1, wherein the performing attribute prediction processing on the image to obtain a plurality of candidate attribute boxes corresponding to the object in the image comprises:
performing convolution processing on the image to obtain image features;
classifying the image features to obtain a plurality of forward candidate boxes; and
adjusting the plurality of forward candidate boxes to obtain the plurality of candidate attribute boxes.
8. The method according to claim 1, wherein the performing category identification processing on the object based on the target attribute box corresponding to each category to obtain the category of the image comprises:
inquiring a mapping table to obtain a score corresponding to each target attribute box;
adding the scores corresponding to the target attribute boxes to obtain a sum; and
determining a category of the image based on a score interval of the sum.
9. The method of claim 1, wherein the categories of images comprise low-quality images and non-low-quality images, and the method further comprises:
reducing or prohibiting a recommendation for the image when the category of the image is a low-quality image;
and when the category of the image is a non-low-quality image, sending the image to a recommendation queue to await recommendation.
10. An image recognition apparatus, comprising:
a prediction module, configured to perform attribute prediction processing on an image to obtain a plurality of candidate attribute boxes corresponding to an object in the image;
an aggregation module, configured to aggregate the plurality of candidate attribute boxes based on categories of the plurality of candidate attribute boxes to obtain a plurality of groups of candidate attribute boxes;
a screening module, configured to screen each group of candidate attribute boxes based on an intersection over union of the candidate attribute boxes in each group to obtain a target attribute box corresponding to each category; and
an identification module, configured to perform category identification processing on the object based on the target attribute box corresponding to each category to obtain a category of the image.
CN202110510014.8A 2021-05-11 2021-05-11 Image recognition method, device, electronic equipment and computer readable storage medium Active CN113761245B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110510014.8A CN113761245B (en) 2021-05-11 2021-05-11 Image recognition method, device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN113761245A true CN113761245A (en) 2021-12-07
CN113761245B CN113761245B (en) 2023-10-13

Family

ID=78787052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110510014.8A Active CN113761245B (en) 2021-05-11 2021-05-11 Image recognition method, device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113761245B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011170890A (en) * 2011-06-06 2011-09-01 Fujifilm Corp Face detecting method, face detection device, and program
CN107067413A (en) * 2016-12-27 2017-08-18 南京理工大学 A kind of moving target detecting method of time-space domain statistical match local feature
CN108960266A (en) * 2017-05-22 2018-12-07 阿里巴巴集团控股有限公司 Image object detection method and device
CN109948497A (en) * 2019-03-12 2019-06-28 北京旷视科技有限公司 A kind of object detecting method, device and electronic equipment
CN110033424A (en) * 2019-04-18 2019-07-19 北京迈格威科技有限公司 Method, apparatus, electronic equipment and the computer readable storage medium of image procossing
CN111368600A (en) * 2018-12-26 2020-07-03 北京眼神智能科技有限公司 Method and device for detecting and identifying remote sensing image target, readable storage medium and equipment
CN111598078A (en) * 2019-02-20 2020-08-28 北京奇虎科技有限公司 Object detection method and system based on sequence optimization
CN112052787A (en) * 2020-09-03 2020-12-08 腾讯科技(深圳)有限公司 Target detection method and device based on artificial intelligence and electronic equipment
CN112633352A (en) * 2020-12-18 2021-04-09 浙江大华技术股份有限公司 Target detection method and device, electronic equipment and storage medium
CN112686298A (en) * 2020-12-29 2021-04-20 杭州海康威视数字技术股份有限公司 Target detection method and device and electronic equipment
CN112700469A (en) * 2020-12-30 2021-04-23 武汉卓目科技有限公司 Visual target tracking method and device based on ECO algorithm and target detection

Also Published As

Publication number Publication date
CN113761245B (en) 2023-10-13

Similar Documents

Publication Publication Date Title
CN103299324B (en) Potential son is used to mark the mark learnt for video annotation
AU2007325117B2 (en) Identifying images using face recognition
US8032539B2 (en) Method and apparatus for semantic assisted rating of multimedia content
US20070196013A1 (en) Automatic classification of photographs and graphics
CN111914937A (en) Lightweight improved target detection method and detection system
CN112100438A (en) Label extraction method and device and computer readable storage medium
CN111008640A (en) Image recognition model training and image recognition method, device, terminal and medium
CN101517616A (en) Extracting dominant colors from images using classification techniques
CN111783712A (en) Video processing method, device, equipment and medium
EP3623998A1 (en) Character recognition
US11562179B2 (en) Artificial intelligence system for inspecting image reliability
CN112084812B (en) Image processing method, device, computer equipment and storage medium
US20230353828A1 (en) Model-based data processing method and apparatus
CN114648680B (en) Training method, device, equipment and medium of image recognition model
CN115659008A (en) Information pushing system and method for big data information feedback, electronic device and medium
CN114461853A (en) Training sample generation method, device and equipment of video scene classification model
CN111611781B (en) Data labeling method, question answering device and electronic equipment
CN113939827A (en) System and method for image-to-video re-recognition
CN117251761A (en) Data object classification method and device, storage medium and electronic device
CN116304104A (en) Knowledge graph construction method, knowledge graph construction device, medium and electronic equipment
Cao Analysis of English teaching based on convolutional neural network and improved random forest algorithm
CN113761245A (en) Image recognition method and device, electronic equipment and computer readable storage medium
CN114443904A (en) Video query method, video query device, computer equipment and computer readable storage medium
US11636677B2 (en) Systems, devices and methods for distributed hierarchical video analysis
CN114357301A (en) Data processing method, device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant