CN115116085A - Image identification method, device and equipment for target attribute and storage medium - Google Patents


Info

Publication number
CN115116085A
Authority
CN
China
Prior art keywords
human body
type
picture
network
body label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110297296.8A
Other languages
Chinese (zh)
Inventor
余亭浩
侯昊迪
张绍明
陈少华
杨韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority claimed from application CN202110297296.8A
Published as CN115116085A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods


Abstract

An embodiment of this application provides a method, apparatus, device, and storage medium for recognizing pictures with respect to a target attribute. The method includes the following steps: acquiring a picture to be recognized; determining at least one type of human body label corresponding to the picture to be recognized and the confidence of each such label, and determining at least one type of non-human body label corresponding to the picture to be recognized and the confidence of each such label; and performing recognition processing on the picture to be recognized according to at least one of the human body labels, the non-human body labels, and their respective confidences, to obtain a target attribute recognition result. By determining the types and confidences of the human body labels and non-human body labels corresponding to the picture to be recognized, the method recognizes whether the picture is a target attribute picture, improving the accuracy of picture recognition for target attributes in the field of artificial intelligence.

Description

Image identification method, device and equipment for target attribute and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for identifying a picture for a target attribute.
Background
Vulgar picture recognition is an important application direction of computer vision in artificial intelligence. The definition of "vulgar" (the target attribute here) differs across the business scenarios in which pictures are recognized. In scenarios aimed at teenagers, vulgar content is controlled strictly, and some borderline content, such as beach bikinis or runway shows, should be recognized as vulgar. In some promotion and traffic-drawing scenarios, control is also strict, and sensitive parts of the human body in the picture to be recognized may need to be recognized as vulgar. In a search scenario, by contrast, the standard for vulgar control is very relaxed: because the user searches actively, content need not be recognized as vulgar as long as it does not seriously violate regulatory requirements. The prior art cannot accurately recognize whether a picture to be recognized is vulgar under these different business scenarios.
Disclosure of Invention
To address the shortcomings of the existing approaches, the present application provides a picture recognition method, apparatus, device, and storage medium for a target attribute, aiming to accurately recognize whether a picture to be recognized is vulgar under different business scenarios.
In a first aspect, the present application provides a method for identifying a picture of a target attribute, including:
acquiring a picture to be identified;
determining at least one type of human body label corresponding to the picture to be recognized and the confidence of each such label, and determining at least one type of non-human body label corresponding to the picture to be recognized and the confidence of each such label;
and performing recognition processing on the picture to be recognized according to at least one of the at least one type of human body label, the at least one type of non-human body label, the confidence of the at least one type of human body label, and the confidence of the at least one type of non-human body label, to obtain a target attribute recognition result.
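The three claimed steps can be expressed as a minimal pipeline. This is an illustrative sketch, not the patent's implementation; the `Label` structure and both callables are assumptions standing in for the neural network and the decision rules described later:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Label:
    name: str          # e.g. "normal" or "beach_bikini" (illustrative names)
    confidence: float  # confidence of this label for the picture

def recognize_target_attribute(picture, label_model: Callable, decide: Callable) -> bool:
    """Minimal sketch of the three claimed steps: the picture is acquired by
    the caller, label_model determines the human/non-human labels with their
    confidences, and decide turns them into the recognition result."""
    human_labels, non_human_labels = label_model(picture)  # step 2
    return decide(human_labels, non_human_labels)          # step 3
```

Any concrete model and rule set can be plugged in; the claims only require that the decision consult at least one of the labels or confidences.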
In one embodiment, determining the at least one type of human body label corresponding to the picture to be recognized and the confidence of each such label, and the at least one type of non-human body label corresponding to the picture to be recognized and the confidence of each such label, includes:
inputting the picture to be recognized into the skeleton network of a preset neural network and performing convolution processing to obtain the feature map output by each network layer in the skeleton network;
inputting the feature map output by at least one network layer into the detection network of the neural network and performing detection processing to obtain the at least one type of human body label corresponding to the picture to be recognized and the confidence of each such label;
and inputting the feature map output by at least one network layer into the classification network of the neural network and performing classification processing to obtain the at least one type of non-human body label corresponding to the picture to be recognized and the confidence of each such label.
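The skeleton/detection/classification split described above can be sketched as one forward pass. All callables here are stand-ins for the components the patent names (e.g. an EfficientNet skeleton, a Bi-FPN detection network); only the data flow is illustrated:

```python
def forward(picture, skeleton_layers, detection_net, classification_net):
    """One forward pass through the claimed split: every skeleton-network
    layer emits a feature map, and the detection and classification heads
    each consume the collected layer outputs."""
    feature_maps = []
    x = picture
    for layer in skeleton_layers:      # convolution processing per layer
        x = layer(x)
        feature_maps.append(x)         # keep each network layer's feature map
    human = detection_net(feature_maps)           # human labels + confidences
    non_human = classification_net(feature_maps)  # non-human labels + confidences
    return human, non_human
```

The key design point is that both heads may read feature maps from any layer, not only the last one, which is what allows multi-scale detection.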
In one embodiment, the type of non-human tag includes at least one of an action class, a clothing class, an animal class, an article class, and a scene class; the classification network comprises at least one of a corresponding action classification sub-network, a dress classification sub-network, an animal classification sub-network, an article classification sub-network and a scene classification sub-network.
In one embodiment, the skeleton network comprises the convolutional neural network EfficientNet; the detection network comprises the bidirectional feature pyramid network Bi-FPN; and the classification network comprises a feature extraction model together with either the channel attention model SENet or the spatial attention model CBAM.
In one embodiment, inputting the feature map output by at least one network layer into the classification network of the neural network and performing classification processing to obtain the at least one type of non-human body label corresponding to the picture to be recognized and the confidence of each such label includes:
inputting the feature map output by at least one network layer into the feature extraction model, obtaining image features through convolution processing, and performing global pooling on the image features to obtain a dimension-reduced feature map;
inputting the dimension-reduced feature map into the channel attention model and weighting each channel of the feature map with a different weight, to obtain the at least one type of non-human body label corresponding to the picture to be recognized and the confidence of each such label;
or inputting the dimension-reduced feature map into the spatial attention model, weighting each channel of the feature map with a different weight to obtain a weighted feature map, and weighting each spatial position of the weighted feature map with a different weight, to obtain the at least one type of non-human body label corresponding to the picture to be recognized and the confidence of each such label.
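The two weighting routes can be illustrated on a tiny feature map, represented here as a plain list of channels, each holding values at spatial positions. In SENet and CBAM the weights are learned; the fixed weights below are purely illustrative:

```python
def channel_attention(fmap, channel_weights):
    """SENet-style route: weight every channel of the feature map."""
    return [[v * w for v in ch] for ch, w in zip(fmap, channel_weights)]

def spatial_attention(fmap, position_weights):
    """CBAM-style second step: weight each spatial position of the
    (already channel-weighted) feature map."""
    return [[v * pw for v, pw in zip(ch, position_weights)] for ch in fmap]

# 2 channels x 3 spatial positions
fmap = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
weighted = channel_attention(fmap, [0.5, 2.0])   # -> [[0.5, 1.0, 1.5], [8.0, 10.0, 12.0]]
out = spatial_attention(weighted, [1.0, 0.0, 1.0])  # -> [[0.5, 0.0, 1.5], [8.0, 0.0, 12.0]]
```

The channel route emphasizes which feature channels matter; the spatial route additionally suppresses or boosts image regions before the labels and confidences are produced.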
In one embodiment, performing recognition processing on the picture to be recognized according to at least one of the at least one type of human body label, the at least one type of non-human body label, the confidence of the at least one type of human body label, and the confidence of the at least one type of non-human body label to obtain a target attribute recognition result includes at least one of the following:
when the type of at least one human body label belongs to the normal classification and the confidence of that label is less than or equal to a preset first threshold, determining that the target attribute recognition result of the picture to be recognized is a target attribute picture;
when the type of at least one human body label belongs to an abnormal classification other than the normal classification and the confidence of that label is greater than a preset second threshold, or when the confidence of at least one non-human body label is greater than a preset third threshold, determining that the target attribute recognition result of the picture to be recognized is a target attribute picture;
and when the type of at least one non-human body label is the action class and the confidence of that label is greater than a preset fourth threshold, determining that the target attribute recognition result of the picture to be recognized is a target attribute picture.
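The three rules can be sketched as a single predicate over one label. The label names, the `kind` field, and the threshold values are illustrative assumptions; the patent leaves the concrete thresholds to the deployment scenario:

```python
def is_target_attribute(label_type, kind, confidence,
                        t1=0.3, t2=0.7, t3=0.7, t4=0.7):
    """Sketch of the three claimed decision rules (thresholds t1..t4 are
    illustrative placeholders, one per claimed preset threshold)."""
    if kind == "human" and label_type == "normal" and confidence <= t1:
        return True   # rule 1: "normal" human label with low confidence
    if kind == "human" and label_type != "normal" and confidence > t2:
        return True   # rule 2a: confident abnormal human label
    if kind == "non_human" and confidence > t3:
        return True   # rule 2b: confident non-human label
    if kind == "non_human" and label_type == "action" and confidence > t4:
        return True   # rule 3: confident action-class label
    return False
```

Because the claims say "at least one of", a given business scenario may enable only a subset of these rules, or tune the thresholds; a teenager scenario would pick stricter (lower) thresholds than a search scenario.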
In one embodiment, before acquiring the picture to be recognized, the method further includes:
constructing a training sample set based on a preset label set;
training a skeleton network, a detection network and a classification network of a neural network to be trained based on the training sample set to obtain a preset neural network;
wherein training the skeleton network, detection network and classification network of the neural network to be trained to obtain the preset neural network includes:
initializing a skeleton network, a detection network and a classification network of a neural network to be trained, and initializing a loss function comprising neural network parameters;
the following processing is executed in each iteration training process of the neural network to be trained:
taking a training picture included in a training sample set as an input sample of a neural network to be trained, taking at least one type of human body label and at least one type of non-human body label corresponding to the training picture as output results of the neural network to be trained, and substituting the input sample and the output results into a loss function to determine a corresponding neural network parameter when the loss function obtains a minimum value; and updating the neural network to be trained according to the determined neural network parameters.
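The iteration described above can be sketched as a generic minimization loop. The optimizer is not specified by the patent, so `update_fn` is a placeholder, and the loss and samples in the test are toy stand-ins:

```python
def train(samples, init_params, loss_fn, update_fn, epochs=10):
    """Sketch of the claimed training process: each iteration substitutes an
    input sample and its (human labels, non-human labels) output into the
    loss function, then updates the parameters toward the loss minimum."""
    params = init_params
    for _ in range(epochs):
        for picture, human_labels, non_human_labels in samples:
            loss = loss_fn(params, picture, (human_labels, non_human_labels))
            params = update_fn(params, loss)   # e.g. a gradient step
    return params
```

In practice `update_fn` would be a gradient-based optimizer over the skeleton, detection and classification network parameters jointly, since all three are trained together.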
In a second aspect, the present application provides an image recognition apparatus for a target attribute, including:
the first processing module is used for acquiring a picture to be identified;
the second processing module is used for determining the confidence degrees of at least one type of human body label corresponding to the picture to be recognized and the corresponding type of human body label, and the confidence degrees of at least one type of non-human body label corresponding to the picture to be recognized and the corresponding type of non-human body label;
and the third processing module is used for performing recognition processing on the picture to be recognized according to at least one of the at least one type of human body label, the at least one type of non-human body label, the confidence of the at least one type of human body label, and the confidence of the at least one type of non-human body label, to obtain a target attribute recognition result.
In a third aspect, the present application provides an electronic device, comprising: a processor, a memory, and a bus;
a bus for connecting the processor and the memory;
a memory for storing operating instructions;
the processor is configured to execute the image identification method for the target attribute according to the first aspect of the present application by calling the operation instruction.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program for executing the method for identifying a picture with respect to a target attribute of the first aspect of the present application.
The technical scheme provided by the embodiment of the application has at least the following beneficial effects:
A picture to be recognized is acquired; at least one type of human body label corresponding to the picture and the confidence of each such label are determined, together with at least one type of non-human body label and the confidence of each such label; and the picture is recognized according to at least one of these labels and confidences to obtain a target attribute recognition result. By determining the types and confidences of the human body labels and non-human body labels corresponding to the picture to be recognized, whether the picture is a target attribute picture can be recognized under different business scenarios, improving the accuracy of picture recognition for target attributes in the field of artificial intelligence.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic diagram of a system architecture provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a method for identifying an image according to a target attribute according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a neural network provided in an embodiment of the present application;
fig. 4 is a schematic diagram of image identification for a target attribute according to an embodiment of the present application;
fig. 5 is a schematic flowchart of another image identification method for a target attribute according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an image recognition apparatus for target attributes according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, features and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
The embodiment of the application provides a picture identification method for target attributes aiming at vulgar picture identification in the field of artificial intelligence, and relates to the technical field of computer vision in the field of artificial intelligence, such as image identification, and various fields of cloud technology, such as cloud computing, cloud service and the like in the cloud technology.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision technology (CV) is the science of how to make a machine "see": using cameras and computers instead of human eyes to recognize, track and measure targets, and further processing the results into images better suited for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques that attempt to build artificial intelligence systems capable of acquiring information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, and also include common biometric technologies such as face recognition and fingerprint recognition.
Cloud computing (cloud computing) is a computing model that distributes computing tasks over a pool of resources formed by a large number of computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". Resources in the "cloud" appear to the user as if they are infinitely expandable and can be acquired at any time, used on demand, expanded at any time, and paid for use.
As a basic capability provider of cloud computing, a cloud computing resource pool, called an IaaS (Infrastructure as a Service) platform for short, is established, and multiple types of virtual resources are deployed in the pool for external clients to use as needed.
According to the logic function division, a PaaS (Platform as a Service) layer can be deployed on an IaaS (Infrastructure as a Service) layer, a SaaS (Software as a Service) layer is deployed on the PaaS layer, and the SaaS can be directly deployed on the IaaS. PaaS is a platform on which software runs, such as a database, a web container, etc. SaaS is a variety of business software, such as web portal, sms, and mass texting. Generally speaking, SaaS and PaaS are upper layers relative to IaaS.
The so-called artificial intelligence cloud service is also generally called AIaaS (AI as a Service). It is a mainstream service mode of artificial intelligence platforms: an AIaaS platform splits up several types of common AI services and provides them independently or in packages in the cloud. This service model is similar to an AI-themed app store: all developers can access one or more of the platform's artificial intelligence services through an API (application programming interface), and some qualified developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate and maintain their own dedicated cloud artificial intelligence services.
For better understanding and description of the embodiments of the present application, some technical terms used in the embodiments of the present application will be briefly described below.
A neural network: a mathematical model that simulates the behavioral characteristics of biological neural networks and performs distributed parallel information processing. Depending on the complexity of the system, such a network processes information by adjusting the interconnections among a large number of internal nodes.
Loss function: the loss function is used to evaluate the degree of difference between the predicted value and the true value of the model. In addition, the loss function is also an optimized objective function in the neural network, the process of training or optimizing the neural network is the process of minimizing the loss function, the smaller the loss function is, the closer the predicted value of the model is to the true value, and the better the accuracy of the model is.
EfficientNet: the convolutional neural network EfficientNet is a skeleton network Backbone, combines EfficientNet and BiFPN, and provides a detector family EfficientDet.
Bi-FPN: a bidirectional Feature Pyramid Network (BiFPN, Bi-directional Feature Pyramid Network) introduces learnable weights to learn the importance of different input features, and simultaneously, multi-scale Feature fusion from top to bottom and from bottom to top is repeatedly applied.
EfficientDet: the detector family EfficientDet is a target detection algorithm series, and respectively comprises eight algorithms from D0 to D7.
Attention: the Attention mechanism is a method for extracting specific vectors from a vector expression set for weighted combination according to some rules or some additional information. Attention models (Attention models) such as SENET, CBAM, etc.
SENET: SENET (Squeeze-and-interaction Networks) is a channel attention model, SENET is an attention mechanism module for channels.
CBAM: CBAM (volumetric Block Attention module) is a spatial Attention model, which is a module of Attention mechanism that combines spatial (spatial) and channel (channel).
inception-v3+ attention: the inception-v3+ attention model comprises an inception-v3 and an attention model, wherein the inception-v3 is a convolution network.
Precision: precision is measured on the prediction results and indicates how many of the samples predicted as positive are truly positive.
Recall: recall is measured on the original samples and indicates how many of the positive samples are correctly predicted.
YOLO: YOLO (You Only Look Once) is an object recognition and localization algorithm based on a deep neural network; its main characteristic is high running speed, which makes it usable in real-time systems.
RefineDet: RefineDet is an object detection algorithm that inherits the advantages of both single-stage and two-stage design methods while overcoming their respective shortcomings.
Confidence: also called reliability, confidence level, or confidence coefficient. Because a sample is random, any estimate of a population parameter derived from it is uncertain. Interval estimation from mathematical statistics is therefore used: the probability that the estimate falls within a given allowable error range of the population parameter is called the confidence.
Cosplay: Cosplay (Costume Play) refers to using clothing, accessories, props, and make-up to play characters from anime and comic works, games, or historical settings.
The technical scheme provided by the embodiment of the application relates to an artificial intelligence image recognition technology, and the following detailed description is given to the technical scheme of the application and how to solve the technical problem by the technical scheme of the application. These several specific embodiments may be combined with each other below, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The scheme provided by the embodiments of this application is applicable to any scenario in the image recognition field that requires picture recognition for a target attribute. By determining the types and confidences of the human body labels and non-human body labels corresponding to the picture to be recognized, the scheme recognizes whether the picture is a target attribute picture under different business scenarios, improving the accuracy of picture recognition for target attributes in the image recognition field. To better understand the scheme provided by the embodiments of this application, it is described below with reference to a specific application scenario.
In an embodiment, fig. 1 shows a schematic structural diagram of an image identification system for a target attribute, to which the embodiment of the present application is applied, and it can be understood that the image identification method for a target attribute provided in the embodiment of the present application can be applied to, but is not limited to, the application scenario shown in fig. 1.
In the present example, as shown in fig. 1, the picture recognition system for the target attribute in this example may include, but is not limited to, a server 101, a network 102, and a user terminal 103 in which a client program is installed. The user terminal 103 may communicate with the server 101 through the network 102. The server 101 includes a database 1011 and a processing engine 1012. The user terminal 103 includes a man-machine interaction screen 1031 (user interface for application programs), a processor 1032 and a memory 1033; the man-machine interaction screen 1031 is used for a user to browse a picture to be recognized through the man-machine interaction screen, the processor 1032 is used for processing relevant operations of the user, and the memory 1033 is used for storing the picture to be recognized.
The system acquires a picture to be recognized; determines at least one type of human body label corresponding to the picture and the confidence of each such label, together with at least one type of non-human body label and the confidence of each such label; and performs recognition processing on the picture according to at least one of these labels and confidences to obtain a target attribute recognition result.
As shown in fig. 1, a specific implementation process of the picture identification method for the target attribute in the present application may include steps S1-S5:
in step S1, for any user, the picture to be recognized may be browsed through the man-machine interaction screen 1031 of the user terminal 103, and the user terminal 103 sends the picture to be recognized to the server 101.
Step S2, the processing engine 1012 in the server 101 acquires the picture to be recognized; the database 1011 in the server 101 may be used to store the picture to be recognized.
Step S3, the processing engine 1012 in the server 101 determines the confidence levels of at least one type of human body tag corresponding to the picture to be recognized and the corresponding type of human body tag, and the confidence levels of at least one type of non-human body tag corresponding to the picture to be recognized and the corresponding type of non-human body tag; the database 1011 in the server 101 may also be configured to store the confidence levels of at least one type of human body tag and a corresponding type of human body tag, and the confidence levels of at least one type of non-human body tag and a corresponding type of non-human body tag corresponding to the picture to be recognized.
Step S4, the processing engine 1012 in the server 101 performs recognition processing on the picture to be recognized according to at least one of at least one type of human body tag, at least one type of non-human body tag, a confidence level of at least one type of human body tag, and a confidence level of at least one type of non-human body tag, so as to obtain a target attribute recognition result; the database 1011 in the server 101 may also be used to store the target attribute identification result.
In step S5, the server 101 transmits the target attribute recognition result to the user terminal 103.
It should be understood that the above description is only an example, and the present embodiment is not limited thereto.
The server 101 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server or server cluster providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN (Content Delivery Network), and big data and artificial intelligence platforms. The network 102 may include, but is not limited to, a wired network or a wireless network, where the wired network includes a local area network, a metropolitan area network, and a wide area network, and the wireless network includes Bluetooth, Wi-Fi, and other networks that enable wireless communication. The user terminal 103 may be a smart phone (e.g., an Android phone, an iOS phone, etc.), a tablet computer, a notebook computer, a digital broadcast receiver, an MID (Mobile Internet Device), a PDA (personal digital assistant), a desktop computer, a vehicle-mounted terminal (e.g., a vehicle-mounted navigation terminal), a smart speaker, a smart watch, etc. The user terminal and the server may be directly or indirectly connected through wired or wireless communication, but are not limited thereto. The specific configuration may also be determined based on the requirements of the actual application scenario, and is not limited herein. The target attribute may be, for example, vulgarity.
Referring to fig. 2, fig. 2 shows a schematic flow chart of a picture identification method for a target attribute provided in an embodiment of the present application. The method may be executed by any electronic device; as an optional implementation, the method is described below as being executed by a server. As shown in fig. 2, the image identification method for the target attribute provided in the embodiment of the present application includes the following steps:
S101, acquiring a picture to be identified.
In one embodiment, the picture to be identified may be obtained from an online service flow, a database, or another data source; the picture to be identified may be a picture in an article published by an official account, a chat picture from an instant messaging session, an advertisement picture published by an e-commerce platform, and the like. In this embodiment the target attribute is vulgarity, so picture identification for the target attribute means identification of vulgar pictures. The purpose of acquiring the picture to be identified is to determine whether it is a vulgar picture.
S102, determining the confidence degrees of at least one type of human body label corresponding to the picture to be recognized and the corresponding type of human body label, and the confidence degrees of at least one type of non-human body label corresponding to the picture to be recognized and the corresponding type of non-human body label.
In one embodiment, a set of tags is pre-constructed, the set of tags including at least one type of human-body tags and at least one type of non-human-body tags.
In one embodiment, a human body label is composed of a type and an attribute. Types of human body labels include human body class-body part, human body class-age stage, human body class-gender, human body class-human type, and the like. The human body class-body part type can be subdivided into a plurality of sub-classes, e.g., human body class-body part-face, human body class-body part-chest, human body class-body part-waist-abdomen, human body class-body part-back, human body class-body part-hip, human body class-body part-triangle region, human body class-body part-leg, human body class-body part-foot, etc. The attributes of the face include normal and vulgar; the attributes of the chest include normal, first sensitive word, second sensitive word, and third sensitive word; the attributes of the waist-abdomen include normal and first sensitive word; the attributes of the back include normal and first sensitive word; the attributes of the hip include normal and first sensitive word; the attributes of the triangle region include normal, first sensitive word, and second sensitive word; the attributes of the leg include normal and first sensitive word; the attributes of the foot include normal and fourth sensitive word. The attributes of the human body class-age stage include infant, child, adolescent, adult, and elderly. The attributes of the human body class-gender include male, female, and unknown. The attributes of the human body class-human type include real person, virtual person, and artwork.
For example, if the human body label is chest-normal, the type of the human body label is human body class-body part-chest, and its attribute is the "normal" value included in the attributes of the chest. If the human body label is chest-first sensitive word, the type is human body class-body part-chest and the attribute is the first sensitive word; for example, if the first sensitive word is "vulgar", the attribute of the human body label is vulgar and the human body label is chest-vulgar.
In one embodiment, a non-human body label is composed of a type and an attribute. Types of non-human body labels include dressing, item, action, animal, scene, and the like. The attributes of the dressing class include nurse's clothing, servant's clothing, miss's clothing, uniform, sailor's clothing, lady's professional clothing, rally clothing, cosplay, maternity photographs, wedding photographs, etc. The item class may be subdivided into sub-classes, e.g., item class-adult items, item class-sensitive items, item class-underwear, etc.; the attributes of adult items include first adult item, second adult item, and third adult item; the attributes of sensitive items include first sensitive item, second sensitive item, and third sensitive item; and the attributes of underwear include upper undergarment and lower undergarment. The attributes of the action class include first through seventh sensitive actions. The attributes of the animal class include first, second, and third animal-sensitive behaviors. The attributes of the scene class include sports, cheering team performance, beach, swimming pool, walk show, dance performance, star red carpet, fitness scene, indoor, and outdoor.
For example, the non-body tag is a scene class-walk show, wherein the type of the non-body tag is a scene class, and the attribute of the non-body tag is walk show.
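The label scheme above — each label a (type, attribute) pair drawn from a fixed preset set — can be sketched as a small data structure. The Python below is purely illustrative: the dictionaries list only a subset of the labels named in the text, and `make_label` is a hypothetical helper, not part of the described system.

```python
# Illustrative subset of the preset label set: each label is composed of a
# type and one of that type's permitted attributes.
HUMAN_BODY_LABELS = {
    "human body class-body part-face": ["normal", "vulgar"],
    "human body class-body part-chest": [
        "normal", "first sensitive word", "second sensitive word", "third sensitive word",
    ],
    "human body class-age stage": ["infant", "child", "adolescent", "adult", "elderly"],
}

NON_HUMAN_BODY_LABELS = {
    "scene class": ["sports", "beach", "swimming pool", "walk show", "dance performance"],
    "action class": [f"sensitive action {i}" for i in range(1, 8)],
}

def make_label(label_type: str, attribute: str) -> str:
    """Compose a full label string such as 'scene class-walk show',
    validating the attribute against the label type's permitted set."""
    permitted = {**HUMAN_BODY_LABELS, **NON_HUMAN_BODY_LABELS}.get(label_type, [])
    if attribute not in permitted:
        raise ValueError(f"unknown label: {label_type}-{attribute}")
    return f"{label_type}-{attribute}"
```

For instance, `make_label("scene class", "walk show")` yields `"scene class-walk show"`, the non-human body label used in the example above.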
In one embodiment, determining the confidence levels of at least one type of human body tag corresponding to the picture to be recognized and the corresponding type of human body tag, and the confidence levels of at least one type of non-human body tag corresponding to the picture to be recognized and the corresponding type of non-human body tag includes steps a 1-A3:
and step A1, inputting the picture to be recognized into a preset skeleton network of a neural network, and performing convolution processing to obtain a feature map output by each network layer in the skeleton network.
In one embodiment, as shown in fig. 3, the neural network includes a skeleton network 201, a detection network 202 and a classification network 203, wherein the skeleton network (Backbone)201 may be a convolutional neural network EfficientNet; the detection network 202 includes a Bi-directional feature pyramid network (Bi-FPN) 206; the classification network 203 may include a feature extraction model (Category feature) 204 and an Attention model 205, where the Attention model 205 may be a channel Attention model SENet or a spatial Attention model CBAM. The picture to be identified is input into the skeleton network 201, and convolution processing is performed through each convolution layer in the skeleton network 201 to obtain a feature map output by each convolution layer in the skeleton network, wherein the network layer of the skeleton network comprises the convolution layers.
Step A2, inputting the feature map output by at least one network layer into the detection network of the neural network, and performing detection processing to obtain at least one type of human body label corresponding to the picture to be recognized and the confidence of the corresponding type of human body label.
In one embodiment, as shown in fig. 3, the Backbone network (Backbone)201 may be a convolutional neural network EfficientNet, the Backbone network 201 includes 7 submodule blocks, each submodule block includes at least one network layer, that is, each submodule block includes at least one convolutional layer; the 7 blocks are respectively P1, P2, P3, P4, P5, P6 and P7, feature maps output by the P3, the P4, the P5, the P6 and the P7 are input into the detection network 202 of the neural network, and detection processing is carried out to obtain at least one type of human body label corresponding to the picture to be recognized and the confidence coefficient of the human body label of the corresponding type.
For example, a picture to be recognized with a size of 768 × 768 × 3 is input to the skeleton network 201 for convolution processing; 768, 768 and 3 respectively represent the length, width and depth of the picture to be identified. The size of the feature map output by P3 is 96 × 96 × 144; by P4, 48 × 48 × 288; by P5, 48 × 48 × 528; by P6, 24 × 24 × 720; by P7, 24 × 24 × 2112. The feature maps output by P3, P4, P5, P6 and P7 are input into the detection network 202 for detection processing, to obtain at least one type of human body label corresponding to the picture to be identified and the confidence of the corresponding type of human body label.
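The spatial sizes quoted above follow from each block's cumulative downsampling stride relative to the input. The sketch below uses the strides and channel depths implied by the sizes in this example (note that P4/P5 and P6/P7 share a stride here, which is taken from the text rather than from the standard EfficientDet layout):

```python
# Strides and channel depths inferred from the feature-map sizes quoted above;
# these are read off the example, not taken from a reference implementation.
STRIDES = {"P3": 8, "P4": 16, "P5": 16, "P6": 32, "P7": 32}
CHANNELS = {"P3": 144, "P4": 288, "P5": 528, "P6": 720, "P7": 2112}

def feature_map_shape(block: str, input_size: int = 768) -> tuple:
    """Return (height, width, depth) of the feature map output by a block
    for a square input picture of side input_size."""
    side = input_size // STRIDES[block]
    return (side, side, CHANNELS[block])
```

For a 768 × 768 × 3 input, `feature_map_shape("P3")` gives `(96, 96, 144)`, matching the size quoted for P3.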
Step A3, inputting the feature map output by at least one network layer into the classification network of the neural network, and performing classification processing to obtain at least one type of non-human body label corresponding to the picture to be recognized and the confidence of the corresponding type of non-human body label.
In one embodiment, as shown in fig. 3, the classification network 203 includes five classification subnetworks, which are an action classification subnetwork 2031, a dress classification subnetwork 2032, an animal classification subnetwork 2033, an item classification subnetwork 2034, and a scene classification subnetwork 2035, respectively. The feature maps output by at least one network layer are respectively input into the action classification sub-network 2031, the dressing classification sub-network 2032, the animal classification sub-network 2033, the article classification sub-network 2034 and the scene classification sub-network 2035 for classification processing, and at least one type of non-human body label corresponding to the picture to be recognized and the confidence coefficient of the non-human body label of the corresponding type are obtained from at least one of the five classification sub-networks.
For example, as shown in fig. 3, the skeleton network 201 includes 7 submodule blocks, each including at least one network layer, that is, at least one convolutional layer; the 7 blocks are respectively P1, P2, P3, P4, P5, P6 and P7. The feature map output by P5, of size 48 × 48 × 528, is input into the scene classification sub-network 2035 for classification processing, to obtain the non-human body label of the scene class corresponding to the picture to be recognized and the confidence of that label.
In one embodiment, the types of non-human tags include at least one of an action class, a dressing class, an animal class, an article class, and a scene class; the classification network comprises at least one of a corresponding action classification sub-network, a dress classification sub-network, an animal classification sub-network, an article classification sub-network, and a scene classification sub-network.
In one embodiment, the types of non-human tags include action type, dress type, animal type, article type, scene type, and the like; the classification network comprises an action classification sub-network, a dress classification sub-network, an animal classification sub-network, an article classification sub-network, a scene classification sub-network and the like.
In one embodiment, the skeleton network comprises a convolutional neural network EfficientNet; the detection network comprises a bidirectional characteristic pyramid network Bi-FPN; the classification network comprises a feature extraction model and any one of a channel attention model SEnet and a space attention model CBAM.
In one embodiment, the backbone network is the convolutional neural network EfficientNet, and the detector EfficientDet-D2 is constructed from EfficientNet and the bidirectional feature pyramid network Bi-FPN.
In one embodiment, inputting the feature map output by at least one network layer into the classification network of the neural network and performing classification processing to obtain at least one type of non-human body label corresponding to the picture to be recognized and the confidence of the corresponding type of non-human body label includes steps B1-B3:
Step B1, inputting the feature map output by at least one network layer into the feature extraction model, obtaining image features through convolution processing, and performing global pooling on the image features to obtain a feature map after dimension reduction processing.
For example, as shown in fig. 3, the skeleton network 201 includes 7 submodule blocks, each including at least one network layer, that is, at least one convolutional layer; the 7 blocks are respectively P1, P2, P3, P4, P5, P6 and P7. The feature map output by P5, of size 48 × 48 × 528, is input to the feature extraction model 204 included in the scene classification sub-network 2035; image features are obtained through convolution processing, and global pooling (GlobalPooling) is performed on the image features to obtain the feature map after dimension reduction processing.
And step B2, inputting the feature map after the dimension reduction processing into a channel attention model, and weighting each channel of the feature map after the dimension reduction processing with different weights to obtain at least one type of non-human body label corresponding to the picture to be identified and the confidence coefficient of the non-human body label of the corresponding type.
In an embodiment, as shown in fig. 3, the feature map after the dimension reduction processing is input to the Attention model 205, and when the Attention model 205 is a channel Attention model, different weights are weighted on each channel of the feature map after the dimension reduction processing, so as to obtain the non-human body label of the scene class corresponding to the picture to be recognized and the confidence of the non-human body label of the scene class. For example, a scene class non-human label is a scene class-walk show, i.e., a non-human label is a walk show; the confidence of the non-human label for this scene class is 0.84.
Step B3 (as an alternative to step B2), inputting the feature map after the dimension reduction processing into a spatial attention model, weighting each channel of the feature map with different weights to obtain a weighted feature map, and then weighting different spatial positions of the weighted feature map with different weights, to obtain at least one type of non-human body label corresponding to the picture to be identified and the confidence of the corresponding type of non-human body label.
In an embodiment, as shown in fig. 3, the feature map after the dimension reduction processing is input to the Attention model 205. When the Attention model 205 is a spatial attention model, different weights are first applied to the channels of the feature map to obtain a weighted feature map, and different weights are then applied to different spatial positions of the weighted feature map, to obtain the non-human body label of the scene class corresponding to the picture to be recognized and the confidence of that label. For example, the non-human body label of the scene class is scene class-sports, i.e., the non-human body label is sports, and its confidence is 0.56.
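Steps B1 and B2 — global pooling of the backbone features followed by per-channel re-weighting — can be sketched in plain Python. This is a schematic illustration only: the channel weights below are toy values, whereas in SENet they are produced by a small learned sub-network from the pooled vector.

```python
def global_avg_pool(feature_map):
    """Step B1 (dimension reduction): collapse an H x W x C feature map,
    given as nested lists, to a C-dimensional vector (one mean per channel)."""
    h, w = len(feature_map), len(feature_map[0])
    c = len(feature_map[0][0])
    pooled = [0.0] * c
    for row in feature_map:
        for pixel in row:
            for ch, value in enumerate(pixel):
                pooled[ch] += value
    return [v / (h * w) for v in pooled]

def apply_channel_attention(pooled, channel_weights):
    """Step B2: weight each channel with its own attention weight
    (toy values here; learned in the actual model)."""
    return [v * w for v, w in zip(pooled, channel_weights)]
```

For a 2 × 2 × 2 toy feature map `[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]`, `global_avg_pool` returns `[4.0, 5.0]`, and weighting with `[0.5, 1.0]` then gives `[2.0, 5.0]`.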
S103, according to at least one of the at least one type of human body label, the at least one type of non-human body label, the confidence coefficient of the at least one type of human body label and the confidence coefficient of the at least one type of non-human body label, the image to be recognized is recognized, and a target attribute recognition result is obtained.
In one embodiment, as shown in FIG. 3, the detection network 202 outputs at least one type of human body label and a confidence level of the at least one type of human body label, and the classification network 203 outputs at least one type of non-human body label and a confidence level of the at least one type of non-human body label. For example, the human body tag output by the detection network 202 is a human body-body part-chest-first sensitive word, that is, the human body tag is a chest first sensitive word, and the confidence of the human body tag is 0.80, that is, the confidence of the chest first sensitive word is 0.80; the non-human label output by classification network 203 is a scene class-walk show, i.e., the non-human label is a walk show, and the confidence of the non-human label is 0.55.
In one embodiment, the image to be recognized is subjected to recognition processing according to at least one of at least one type of human body tag, at least one type of non-human body tag, the confidence of at least one type of human body tag, and the confidence of at least one type of non-human body tag, so as to obtain a target attribute recognition result, wherein the target attribute recognition result includes at least one of the following C1-C3:
C1, when the type of the at least one type of human body label belongs to the normal classification, and the confidence of the at least one type of human body label is less than or equal to a preset first threshold, determining that the target attribute identification result of the picture to be identified is the target attribute picture.
For example, the picture to be recognized includes a human face. The human body label is human body class-body part-face-normal, i.e., the label is "face normal", and its confidence is 0.52. The preset first threshold is 0.80, and the confidence of "face normal", 0.52, is less than the preset first threshold 0.80, indicating that the face is actually abnormal or vulgar; that is, the vulgarity recognition result of the picture to be identified is a vulgar picture, where the target attribute is vulgarity.
In one embodiment, when the type of the at least one type of human body label belongs to the normal classification and the confidence of the at least one type of human body label is greater than a preset first threshold, it is determined that the target attribute identification result of the to-be-identified picture is a non-target attribute picture.
C2, when the type of the at least one type of human body label belongs to an abnormal classification other than the normal classification and the confidence of the at least one type of human body label is greater than a preset second threshold, or the confidence of the at least one type of non-human body label is greater than a preset third threshold, determining that the target attribute identification result of the picture to be identified is the target attribute picture.
For example, as shown in fig. 4, the picture to be recognized includes a human face. The human body label is human body class-body part-face-vulgar, i.e., the label is "face vulgar", and its confidence is 0.85. The preset second threshold is 0.80, and the confidence of "face vulgar", 0.85, is greater than the preset second threshold 0.80, indicating that the face is abnormal or vulgar; that is, the vulgarity recognition result of the picture to be identified is a vulgar picture, where the target attribute is vulgarity.
As another example, the picture to be recognized includes a chest of a human body. The human body label is a human body class-body part-chest-second sensitive word, namely the human body label is a chest second sensitive word, the confidence coefficient of the human body label is 0.86, namely the confidence coefficient of the chest second sensitive word is 0.86, the preset second threshold value is 0.78, and the confidence coefficient of the chest second sensitive word is 0.86 which is greater than the preset second threshold value 0.78, so that the chest is actually abnormal or vulgar, namely the vulgar recognition result of the picture to be recognized is a vulgar picture, wherein the target attribute is vulgar.
As another example, the picture to be recognized includes a walking show and the back of a human body. The human body label is a human body-body part-back-first sensitive word, namely the human body label is a back first sensitive word, the confidence coefficient of the human body label is 0.80, namely the confidence coefficient of the back first sensitive word is 0.80, the preset second threshold value is 0.70, and the confidence coefficient of the back first sensitive word is 0.80 which is greater than the preset second threshold value 0.70; meanwhile, the non-human body label is a scene type walk show, namely the non-human body label is a walk show, the confidence coefficient of the non-human body label is 0.85, namely the confidence coefficient of the walk show is 0.85, a preset third threshold value is 0.75, and the confidence coefficient of the walk show, namely 0.85, is greater than the preset third threshold value and is 0.75; the fact that the back is abnormal or vulgar in the scene of the walking show is indicated, namely the vulgar identification result of the picture to be identified is a vulgar picture.
C3, when the type of the at least one type of non-human body label is the action class and the confidence of the at least one type of non-human body label is greater than a preset fourth threshold, determining that the target attribute identification result of the picture to be identified is the target attribute picture.
For example, the picture to be recognized includes a first sensitive action. The non-human body label is an action class-a first sensitive action, namely the non-human body label is a first sensitive action, the confidence coefficient of the non-human body label is 0.84, namely the confidence coefficient of the first sensitive action is 0.84, the preset fourth threshold value is 0.78, and the confidence coefficient of the first sensitive action is 0.84 which is greater than the fourth threshold value 0.78, so that the first sensitive action is abnormal or vulgar, namely the vulgar identification result of the picture to be identified is a vulgar picture.
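The decision rules C1-C3 can be collected into a single sketch. The tuple layout and default threshold values below are assumptions for illustration (the defaults are the example values from the text; the patent leaves the thresholds configurable), and the reading of the "or" branch in C2 as an independent non-human-label check is one plausible interpretation.

```python
# Sketch of decision rules C1-C3. Each label is a tuple
# (kind, label_type, attribute, confidence), with kind "human" or "non-human".
# t1..t4 correspond to the preset first to fourth thresholds.
def is_target_attribute_picture(labels, t1=0.80, t2=0.78, t3=0.75, t4=0.78):
    for kind, label_type, attribute, confidence in labels:
        if kind == "human":
            if attribute == "normal" and confidence <= t1:
                return True  # C1: low-confidence "normal" human label
            if attribute != "normal" and confidence > t2:
                return True  # C2: high-confidence abnormal human label
        else:
            if label_type == "action class" and confidence > t4:
                return True  # C3: high-confidence sensitive action
            if confidence > t3:
                return True  # C2 (non-human branch): high-confidence non-human label
    return False
```

With the face example above — `[("human", "human body class-body part-face", "normal", 0.52)]` — the function returns `True` under rule C1, since 0.52 ≤ 0.80.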
In one embodiment, before the picture to be recognized is obtained in step S101, steps D1-D2 are further included:
Step D1, constructing a training sample set based on a preset label set.
In one embodiment, the preset set of tags includes at least one type of human body tag and at least one type of non-human body tag. The training sample is a picture marked with a human body label and/or a non-human body label, wherein when the training sample comprises the human body picture, the human body picture also needs to mark the position of a human body part, and the human body part comprises a face, a chest, a waist and abdomen part, a back part, a hip part, a triangular area, legs, feet and the like.
Step D2, training the skeleton network, the detection network and the classification network of the neural network to be trained based on the training sample set, to obtain the preset neural network.
In one embodiment, training the skeleton network, the detection network and the classification network of the neural network to be trained to obtain a preset neural network includes steps D11-D12:
step D11, initializing the skeleton network, the detection network and the classification network of the neural network to be trained, and initializing the loss function including the neural network parameters.
Step D12, executing the following processing in each iterative training process of the neural network to be trained:
taking a training picture included in a training sample set as an input sample of a neural network to be trained, taking at least one type of human body label and at least one type of non-human body label corresponding to the training picture as output results of the neural network to be trained, and substituting the input sample and the output results into a loss function to determine a corresponding neural network parameter when the loss function obtains a minimum value; and updating the neural network to be trained according to the determined neural network parameters.
In one embodiment, the neural network is based on EfficientDet. During training, the human body detection branch, which comprises the skeleton network (Backbone) and the detection network, can first be trained in the conventional target detection manner; the parameters of the Backbone are then fixed, and the classification branches are trained in sequence, the classification branches being the action classification sub-network, dressing classification sub-network, animal classification sub-network, article classification sub-network, scene classification sub-network, etc. included in the classification network.
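The staged schedule just described — detection branch first, then each classification head against a frozen backbone — can be sketched abstractly. The `Module` class and `staged_training_plan` helper are stand-ins invented for illustration, not a real deep-learning framework API.

```python
class Module:
    """Minimal stand-in for a network component with trainable parameters."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

def staged_training_plan(backbone, detection_net, classification_heads):
    """Return the ordered list of training stages as (stage name, modules updated)."""
    # Stage 1: train the human body detection branch (backbone + detection net) jointly.
    stages = [("human body detection branch", [backbone, detection_net])]
    # Fix the backbone parameters before the classification stages.
    backbone.trainable = False
    # Stages 2..n: train each classification sub-network in sequence.
    for head in classification_heads:
        stages.append((head.name, [head]))
    return stages
```

For example, with a backbone, a Bi-FPN detection net, and the five classification heads named above, the plan has six stages and leaves the backbone frozen after the first.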
In the embodiment of the application, whether the picture to be identified is the target attribute picture or not is identified according to different service scenes by determining the type and the confidence coefficient of the human body label and the type and the confidence coefficient of the non-human body label corresponding to the picture to be identified, so that the accuracy of picture identification according to the target attribute in the field of artificial intelligence is improved.
In order to better understand the method provided by the embodiment of the present application, the following further describes the scheme of the embodiment of the present application with reference to an example of a specific application scenario.
The image identification method for the target attribute is applied in the field of image identification, for example, vulgar image identification; the identification of vulgar pictures has various business applications, such as directly intercepting vulgar pictures, assisting manual review of vulgar pictures, and the like.
Referring to fig. 5, fig. 5 is a schematic flowchart illustrating another image identification method for a target attribute according to an embodiment of the present application. The method may be executed by any electronic device; as an alternative implementation, a server is taken as the execution subject of the method. As shown in fig. 5, the image identification method for the target attribute provided in the embodiment of the present application includes the following steps:
s201, constructing a training sample set based on a preset label set.
In one embodiment, the preset set of tags includes at least one type of human body tags and at least one type of non-human body tags. The training sample is a picture marked with a human body label and a non-human body label, wherein when the training sample comprises the human body picture, the human body picture is required to be marked with the position of a human body part, and the human body part comprises a face, a chest, a waist and abdomen part, a back part, a hip part, a triangular area, legs and feet.
S202, training the neural network FGVNet (Fine-Grained Vulgar image recognition Network) to be trained based on the training sample set, to obtain the preset neural network FGVNet.
In one embodiment, the neural network FGVNet includes a backbone network (Backbone), a detection network, and a classification network. The Backbone is the convolutional neural network EfficientNet; the detection network includes the bidirectional feature pyramid network Bi-FPN, where EfficientNet and Bi-FPN together construct the detector EfficientDet-D2. The classification network comprises 5 classification sub-networks, respectively an action classification sub-network, a dressing classification sub-network, an animal classification sub-network, an article classification sub-network, and a scene classification sub-network; each of the 5 classification sub-networks includes a feature extraction model and an Attention model, where the Attention model may be the channel attention model SENet or the spatial attention model CBAM. The neural network FGVNet may alternatively include a YOLO or RefineDet detector.
S203, inputting the picture to be identified into the backbone network of the preset neural network FGVNet and performing convolution processing, to obtain the feature map output by each convolution layer in the backbone network.
In one embodiment, the picture to be recognized shows a teenager in a catwalk scene, including the teenager's buttocks.
S204, inputting the feature map output by at least one convolution layer into the Bi-FPN included in the detection network of the preset neural network FGVNet and performing detection processing, to obtain at least one type of human body label corresponding to the picture to be identified and the confidence of the corresponding type of human body label.
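The Bi-FPN named in S204 fuses feature maps from neighboring pyramid levels with learned, normalized weights (the "fast normalized fusion" of EfficientDet; a known detail of that detector, not spelled out in this application). A toy one-dimensional sketch, assuming the maps have already been resized to a common shape:

```python
# Toy 1-D stand-in for BiFPN's "fast normalized fusion": feature maps from
# adjacent pyramid levels are combined with non-negative learned weights
# normalized to sum to ~1 (a cheap substitute for a softmax over weights).
def fast_normalized_fusion(features, weights, eps=1e-4):
    w = [max(0.0, wi) for wi in weights]          # ReLU keeps weights >= 0
    total = sum(w) + eps                          # eps avoids division by zero
    fused = []
    for elems in zip(*features):                  # element-wise across maps
        fused.append(sum(wi * x for wi, x in zip(w, elems)) / total)
    return fused

# Two same-length feature vectors (one already upsampled) fused 3:1.
p4 = [1.0, 2.0, 3.0]
p5_up = [2.0, 2.0, 2.0]
fused = fast_normalized_fusion([p4, p5_up], [3.0, 1.0])
```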
In one embodiment, the picture to be recognized shows a teenager in a catwalk scene, including the teenager's buttocks. The human body label is human body - body part - buttocks - level-one sensitive, that is, the label is "buttocks, level-one sensitive", with a confidence of 0.80. The preset second threshold is 0.70, and the confidence of 0.80 is greater than the threshold of 0.70; this indicates that the exposure of the buttocks in the catwalk scene is abnormal or vulgar, that is, the vulgarity recognition result for the picture to be recognized is that it is a vulgar picture.
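The second-threshold test of this embodiment can be sketched as a simple rule (the label string and function name are illustrative):

```python
# Hypothetical sketch of the second-threshold rule of this embodiment.
def is_vulgar_by_human_label(label_type, confidence, second_threshold=0.70):
    """An abnormal (non-"normal") human body label whose confidence exceeds
    the preset second threshold marks the picture as vulgar."""
    return label_type != "normal" and confidence > second_threshold

# Label "buttocks, level-one sensitive" with confidence 0.80 > 0.70 -> vulgar.
verdict = is_vulgar_by_human_label("buttocks_level1_sensitive", 0.80)
```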
S205, inputting the feature map output by at least one network layer into a feature extraction model of the neural network FGVNet, obtaining image features through convolution processing, and performing global pooling processing on the image features to obtain a feature map after dimension reduction processing.
S206, inputting the feature map after the dimension reduction processing into an Attention model of a neural network FGVNet, and weighting each channel of the feature map after the dimension reduction processing with different weights to obtain at least one type of non-human body label corresponding to the picture to be identified and the confidence coefficient of the non-human body label of the corresponding type.
In one embodiment, the picture to be recognized shows a teenager in a catwalk scene, including the teenager's buttocks. The non-human body label is scene type - catwalk show, that is, the label is "catwalk show", with a confidence of 0.85. The preset third threshold is 0.75, and the confidence of 0.85 is greater than the threshold of 0.75; this indicates that the exposure of the buttocks in the catwalk scene is abnormal or vulgar, that is, the vulgarity recognition result for the picture to be recognized is that it is a vulgar picture. Through the attention model, the non-human body labels of scenes other than the catwalk show, and their confidences, can also be obtained for the picture to be recognized; for example, the other scenes include sports, cheerleading performances, beaches, swimming pools, dance performances, celebrity red carpets, fitness scenes, indoor and outdoor scenes, and the like.
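A minimal sketch of the squeeze-and-excitation idea behind the channel attention of S205-S206: global pooling squeezes each channel to one value, a gate maps it into (0, 1), and the channel is rescaled. The single-weight gate here is a toy stand-in for SENet's two fully connected layers, so the numbers are purely illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_channel_attention(feature_map, w1, w2):
    """feature_map: list of channels, each a flat list of activations.
    Squeeze: global average pooling per channel. Excite: a toy one-weight
    gate (real SENet uses two FC layers with a reduction ratio).
    Scale: each channel is multiplied by its gate in (0, 1)."""
    squeezed = [sum(ch) / len(ch) for ch in feature_map]        # global pooling
    gates = [sigmoid(w2 * max(0.0, w1 * s)) for s in squeezed]  # FC-ReLU-FC toy
    reweighted = [[gates[i] * x for x in ch]
                  for i, ch in enumerate(feature_map)]
    return reweighted, gates

# Two channels: a strongly activated one and a silent one.
reweighted, gates = se_channel_attention([[1.0, 1.0], [0.0, 0.0]],
                                         w1=1.0, w2=1.0)
```

The active channel receives a gate near 0.73 and the silent one 0.5, so each channel is weighted differently, as described in S206.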
S207, recognizing the picture to be recognized according to at least one of the at least one type of human body label, the at least one type of non-human body label, the confidence of the at least one type of human body label, and the confidence of the at least one type of non-human body label, to obtain a vulgarity recognition result.
In the embodiment of the application, the types and confidences of the human body labels and the non-human body labels corresponding to the picture to be recognized are determined, so that whether the picture to be recognized is a vulgar picture can be judged for different service scenarios, which improves the accuracy of vulgar picture recognition in the field of image recognition.
The image identification method for the target attribute provided by the embodiment of the application is highly flexible and significantly outperforms a simple classification model on multiple services. The method can flexibly meet the requirements of different service scenarios; different test sets were constructed according to the auditing standards of different services at different periods, and all test data were sampled from the corresponding services' actual data. The evaluation results on these test sets are shown in Table (1):
Table (1): accuracy and recall of FGVNet and BASELINE
[The contents of Table (1) were filed as images and are not reproduced here; they report the accuracy and recall of FGVNet and the BASELINE model on each test set.]
It should be noted that the BASELINE model compared in Table (1) is the classification model Inception-v3 + Attention. With the version of FGVNet in Table (1), the strategy can be flexibly adjusted in different service applications according to the results output by the model, thereby adapting to different service standards. Although the classification model Inception-v3 + Attention can meet availability requirements in some service scenarios, its accuracy and recall drop sharply when faced with different service standards, and the only remedy is to re-collect data and retrain the model, which is time-consuming and labor-intensive.
Based on the same inventive concept, the embodiment of the present application further provides a picture identification apparatus for a target attribute, a schematic structural diagram of the apparatus is shown in fig. 6, and the picture identification apparatus 40 for a target attribute includes a first processing module 401, a second processing module 402, and a third processing module 403.
The first processing module 401 is configured to obtain a picture to be identified;
the second processing module 402 is configured to determine confidence levels of at least one type of human body tag corresponding to the picture to be recognized and a corresponding type of human body tag, and confidence levels of at least one type of non-human body tag corresponding to the picture to be recognized and a corresponding type of non-human body tag;
the third processing module 403 is configured to perform recognition processing on the picture to be recognized according to at least one of the at least one type of human body tag, the at least one type of non-human body tag, the confidence of the at least one type of human body tag, and the confidence of the at least one type of non-human body tag, so as to obtain a target attribute recognition result.
In an embodiment, the second processing module 402 is specifically configured to:
inputting the picture to be recognized into the backbone network of a preset neural network and performing convolution processing to obtain the feature map output by each network layer in the backbone network;
inputting the feature map output by at least one network layer into the detection network of the neural network and performing detection processing to obtain at least one type of human body label corresponding to the picture to be recognized and the confidence of the corresponding type of human body label;
and inputting the feature map output by at least one network layer into the classification network of the neural network and performing classification processing to obtain at least one type of non-human body label corresponding to the picture to be recognized and the confidence of the corresponding type of non-human body label.
In one embodiment, the type of non-human tag includes at least one of an action class, a clothing class, an animal class, an article class, and a scene class; the classification network comprises at least one of a corresponding action classification sub-network, a dress classification sub-network, an animal classification sub-network, an article classification sub-network, and a scene classification sub-network.
In one embodiment, the backbone network comprises the convolutional neural network EfficientNet; the detection network comprises the bidirectional feature pyramid network Bi-FPN; and the classification network comprises a feature extraction model and either the channel attention model SENet or the spatial attention model CBAM.
In an embodiment, the second processing module 402 is specifically configured to:
inputting the feature map output by at least one network layer into a feature extraction model, obtaining image features through convolution processing, and performing global pooling processing on the image features to obtain a feature map subjected to dimension reduction processing;
inputting the feature map subjected to the dimension reduction processing into a channel attention model, and weighting different weights of each channel of the feature map subjected to the dimension reduction processing to obtain at least one type of non-human body label corresponding to the picture to be identified and the confidence coefficient of the corresponding type of non-human body label;
or inputting the feature map subjected to the dimension reduction processing into a spatial attention model, weighting each channel of the feature map subjected to the dimension reduction processing with different weights to obtain a weighted feature map, and weighting different spatial positions of the weighted feature map with different weights, to obtain at least one type of non-human body label corresponding to the picture to be identified and the confidence of the non-human body label of the corresponding type.
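A toy sketch of the CBAM-style ordering described above: channel weighting first, then per-position spatial weighting. Real CBAM derives both gates from pooled descriptors passed through learned layers; here the gates come directly from the pooled statistics, so this only mirrors the structure, not the learned model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cbam_like(feature_map):
    """feature_map: list of channels, each a flat list of positions.
    Channel gate first (from the per-channel mean), then a spatial gate
    per position (from the per-position mean across channels)."""
    ch_gates = [sigmoid(sum(ch) / len(ch)) for ch in feature_map]
    weighted = [[g * x for x in ch] for g, ch in zip(ch_gates, feature_map)]
    n = len(weighted)
    sp_gates = [sigmoid(sum(ch[j] for ch in weighted) / n)
                for j in range(len(weighted[0]))]
    return [[ch[j] * sp_gates[j] for j in range(len(ch))] for ch in weighted]

# One channel, two positions with opposite signs: the positive position is
# boosted relative to the negative one by the spatial gate.
out = cbam_like([[1.0, -1.0]])
```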
In an embodiment, the third processing module 403 is specifically configured to:
when the type of at least one type of human body label belongs to normal classification and the confidence coefficient of at least one type of human body label is less than or equal to a preset first threshold value, determining that the target attribute identification result of the picture to be identified is a target attribute picture;
when the type of at least one type of human body label belongs to abnormal classification except normal classification, and the confidence coefficient of at least one type of human body label is larger than a preset second threshold value, or the confidence coefficient of at least one type of non-human body label is larger than a preset third threshold value, determining that the target attribute recognition result of the picture to be recognized is a target attribute picture;
and when the type of the at least one type of non-human body label is an action type and the confidence coefficient of the at least one type of non-human body label is greater than a preset fourth threshold value, determining that the target attribute identification result of the picture to be identified is a target attribute picture.
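The three decision branches above can be sketched as one rule set. The thresholds t2/t3 follow the 0.70/0.75 given in the embodiments; t1/t4 and all label strings are placeholders:

```python
# Hypothetical combination of the three decision branches.
def recognize(human, non_human, t1=0.50, t2=0.70, t3=0.75, t4=0.60):
    """human / non_human: lists of (label_type, confidence) pairs.
    Returns True when any branch marks the picture as a target
    attribute picture."""
    for typ, conf in human:
        if typ == "normal" and conf <= t1:      # branch 1
            return True
        if typ != "normal" and conf > t2:       # branch 2 (human side)
            return True
    for typ, conf in non_human:
        if conf > t3:                           # branch 2 (non-human side)
            return True
        if typ == "action" and conf > t4:       # branch 3
            return True
    return False

vulgar = recognize([("buttocks_level1_sensitive", 0.80)], [])
```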
In one embodiment, the first processing module 401 is further configured to:
constructing a training sample set based on a preset label set;
and training the backbone network, the detection network, and the classification network of the neural network to be trained based on the training sample set, to obtain the preset neural network.
The first processing module 401 is specifically configured to:
initializing the backbone network, the detection network, and the classification network of the neural network to be trained, and initializing a loss function comprising neural network parameters;
the following processing is executed in each iterative training process of the neural network to be trained:
taking a training picture included in a training sample set as an input sample of a neural network to be trained, taking at least one type of human body label and at least one type of non-human body label corresponding to the training picture as output results of the neural network to be trained, and substituting the input sample and the output results into a loss function to determine a corresponding neural network parameter when the loss function obtains a minimum value; and updating the neural network to be trained according to the determined neural network parameters.
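The iterate-substitute-update loop described above can be mirrored with a toy one-parameter model and squared-error loss. The application does not specify the loss function or optimizer, so this sketch only shows the structure of each iteration (substitute samples into the loss, then update the parameter toward the minimum):

```python
# Toy stand-in for the iterative training of S202: a one-parameter "network"
# y = w * x with squared-error loss; each iteration substitutes the samples
# into the loss and moves w along the negative gradient.
def train(samples, lr=0.1, epochs=200):
    w = 0.0
    for _ in range(epochs):
        grad = sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad                      # update the network parameter
    return w

w = train([(1.0, 2.0), (2.0, 4.0)])         # samples follow y = 2x
```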
The application of the embodiment of the application has at least the following beneficial effects:
acquiring a picture to be identified; determining at least one type of human body label corresponding to the picture to be recognized and the confidence coefficient of the corresponding type of human body label, and determining at least one type of non-human body label corresponding to the picture to be recognized and the confidence coefficient of the corresponding type of non-human body label; and according to at least one of the at least one type of human body label, the at least one type of non-human body label, the confidence coefficient of the at least one type of human body label and the confidence coefficient of the at least one type of non-human body label, carrying out identification processing on the picture to be identified to obtain a target attribute identification result. Therefore, the type and the confidence coefficient of the human body label corresponding to the picture to be recognized and the type and the confidence coefficient of the non-human body label are determined, whether the picture to be recognized is the target attribute picture or not is recognized aiming at different service scenes, and the picture recognition accuracy aiming at the target attribute in the field of artificial intelligence is improved.
Based on the same inventive concept, the embodiment of the present application further provides an electronic device, a schematic structural diagram of which is shown in fig. 7, where the electronic device 9000 includes at least one processor 9001, a memory 9002, and a bus 9003, and at least one processor 9001 is electrically connected to the memory 9002; the memory 9002 is configured to store at least one computer executable instruction, and the processor 9001 is configured to execute the at least one computer executable instruction so as to perform the steps of any one of the image recognition methods for a target attribute as provided by any one of the embodiments or any one of the alternative embodiments in the present application.
Further, the processor 9001 may be an FPGA (Field-Programmable Gate Array) or another device with logic processing capability, such as an MCU (Microcontroller Unit) or a CPU (Central Processing Unit).
The application of the embodiment of the application has at least the following beneficial effects:
acquiring a picture to be identified; determining at least one type of human body label corresponding to the picture to be recognized and the confidence coefficient of the corresponding type of human body label, and determining at least one type of non-human body label corresponding to the picture to be recognized and the confidence coefficient of the corresponding type of non-human body label; and according to at least one item of the at least one type of human body label, the at least one type of non-human body label, the confidence coefficient of the at least one type of human body label and the confidence coefficient of the at least one type of non-human body label, carrying out recognition processing on the picture to be recognized to obtain a target attribute recognition result. Therefore, the type and the confidence coefficient of the human body label corresponding to the picture to be recognized and the type and the confidence coefficient of the non-human body label are determined, whether the picture to be recognized is the target attribute picture or not is recognized aiming at different service scenes, and the picture recognition accuracy aiming at the target attribute in the field of artificial intelligence is improved.
Based on the same inventive concept, embodiments of the present application further provide a computer-readable storage medium, which stores a computer program, and the computer program is used for implementing, when being executed by a processor, any one of the steps of the method for identifying a picture with respect to a target attribute provided in any one of the embodiments or any one of the alternative embodiments of the present application.
The computer-readable storage medium provided by the embodiments of the present application includes, but is not limited to, any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs, and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards, or optical cards. That is, a readable storage medium includes any medium that stores or transmits information in a form readable by a device (e.g., a computer).
The application of the embodiment of the application has at least the following beneficial effects:
acquiring a picture to be identified; determining at least one type of human body label corresponding to the picture to be recognized and the confidence coefficient of the corresponding type of human body label, and determining at least one type of non-human body label corresponding to the picture to be recognized and the confidence coefficient of the corresponding type of non-human body label; and according to at least one item of the at least one type of human body label, the at least one type of non-human body label, the confidence coefficient of the at least one type of human body label and the confidence coefficient of the at least one type of non-human body label, carrying out recognition processing on the picture to be recognized to obtain a target attribute recognition result. Therefore, the type and the confidence coefficient of the human body label corresponding to the picture to be recognized and the type and the confidence coefficient of the non-human body label are determined, whether the picture to be recognized is the target attribute picture or not is recognized aiming at different service scenes, and the picture recognition accuracy aiming at the target attribute in the field of artificial intelligence is improved.
The embodiment of the present application further provides a computer program product containing instructions, which when run on a computer device, causes the computer device to execute the image identification method for the target attribute provided in the foregoing method embodiments.
It will be understood by those within the art that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer programs. Those skilled in the art will appreciate that the computer program product may be implemented by a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, implement the aspects specified in the block or blocks of the block diagrams and/or flowchart illustrations disclosed herein.
Those of skill in the art will appreciate that the various operations, methods, steps in the processes, acts, or solutions discussed in this application can be interchanged, modified, combined, or eliminated. Further, other steps, measures, or schemes in various operations, methods, or flows that have been discussed in this application can be alternated, altered, rearranged, broken down, combined, or deleted. Further, steps, measures, schemes in the prior art having various operations, methods, procedures disclosed in the present application may also be alternated, modified, rearranged, decomposed, combined, or deleted.
The foregoing is only a partial embodiment of the present application. It should be noted that those skilled in the art can make several modifications and refinements without departing from the principle of the present application, and these modifications and refinements shall also fall within the protection scope of the present application.

Claims (10)

1. A picture identification method for target attributes is characterized by comprising the following steps:
acquiring a picture to be identified;
determining the confidence degrees of at least one type of human body label corresponding to the picture to be recognized and the corresponding type of human body label, and the confidence degrees of at least one type of non-human body label corresponding to the picture to be recognized and the corresponding type of non-human body label;
and according to at least one item of the at least one type of human body label, the at least one type of non-human body label, the confidence coefficient of the at least one type of human body label and the confidence coefficient of the at least one type of non-human body label, carrying out recognition processing on the picture to be recognized to obtain a target attribute recognition result.
2. The method according to claim 1, wherein the determining the confidence levels of the at least one type of human body tag and the corresponding type of human body tag corresponding to the picture to be recognized and the confidence levels of the at least one type of non-human body tag and the corresponding type of non-human body tag corresponding to the picture to be recognized comprises:
inputting the picture to be recognized into a preset skeleton network of a neural network, and performing convolution processing to obtain a feature map output by each network layer in the skeleton network;
inputting the feature map output by at least one network layer into a detection network of the neural network, and performing detection processing to obtain at least one type of human body label corresponding to the picture to be recognized and the confidence of the corresponding type of human body label;
and inputting the feature map output by at least one network layer into a classification network of the neural network, and performing classification processing to obtain at least one type of non-human body label corresponding to the picture to be recognized and the confidence of the non-human body label of the corresponding type.
3. The method of claim 2, wherein the type of the non-human body tag comprises at least one of an action class, a dressing class, an animal class, an article class, and a scene class; the classification network comprises at least one of a corresponding action classification sub-network, a dress classification sub-network, an animal classification sub-network, an article classification sub-network, and a scene classification sub-network.
4. The method of claim 2, wherein the skeleton network comprises a convolutional neural network EfficientNet; the detection network comprises a bidirectional feature pyramid network Bi-FPN; and the classification network comprises a feature extraction model and any one of a channel attention model SENet and a spatial attention model CBAM.
5. The method according to claim 4, wherein the step of inputting the feature map output by the at least one network layer into a classification network of the neural network for classification processing to obtain the confidence levels of the at least one type of non-human body label corresponding to the picture to be recognized and the corresponding type of non-human body label comprises:
inputting a feature map output by at least one network layer into the feature extraction model, obtaining image features through convolution processing, and performing global pooling processing on the image features to obtain a feature map subjected to dimension reduction processing;
inputting the feature map subjected to the dimension reduction processing into the channel attention model, and weighting each channel of the feature map subjected to the dimension reduction processing with different weights to obtain at least one type of non-human body label corresponding to the picture to be identified and the confidence coefficient of the non-human body label of the corresponding type;
or inputting the feature map subjected to the dimension reduction processing into the spatial attention model, weighting each channel of the feature map subjected to the dimension reduction processing with different weights to obtain a weighted feature map, and weighting different spatial positions of the weighted feature map with different weights to obtain at least one type of non-human body label corresponding to the picture to be identified and the confidence coefficient of the non-human body label of the corresponding type.
6. The method according to claim 1, wherein the identifying the picture to be identified according to at least one of the at least one type of human body tag, the at least one type of non-human body tag, the confidence of the at least one type of human body tag, and the confidence of the at least one type of non-human body tag obtains a target attribute identification result, and the method includes at least one of:
when the type of the at least one type of human body label belongs to normal classification, and the confidence coefficient of the at least one type of human body label is smaller than or equal to a preset first threshold value, determining that the target attribute identification result of the picture to be identified is a target attribute picture;
when the type of the at least one type of human body label belongs to abnormal classification except normal classification, and the confidence coefficient of the at least one type of human body label is larger than a preset second threshold value, or the confidence coefficient of the at least one type of non-human body label is larger than a preset third threshold value, determining that the target attribute identification result of the picture to be identified is a target attribute picture;
and when the type of the at least one type of non-human body label is an action type and the confidence coefficient of the at least one type of non-human body label is greater than a preset fourth threshold value, determining that the target attribute identification result of the picture to be identified is a target attribute picture.
7. The method according to claim 2, wherein before the obtaining the picture to be recognized, further comprising:
constructing a training sample set based on a preset label set;
training a skeleton network, a detection network and a classification network of a neural network to be trained based on the training sample set to obtain the preset neural network;
the training of the skeleton network, the detection network and the classification network of the neural network to be trained to obtain the preset neural network comprises the following steps:
initializing a skeleton network, a detection network and a classification network of the neural network to be trained, and initializing a loss function comprising neural network parameters;
executing the following processing in each iterative training process of the neural network to be trained:
taking a training picture included in the training sample set as an input sample of the neural network to be trained, taking at least one type of human body label and at least one type of non-human body label corresponding to the training picture as output results of the neural network to be trained, and substituting the input sample and the output results into the loss function to determine a corresponding neural network parameter when the loss function obtains a minimum value; and updating the neural network to be trained according to the determined neural network parameters.
8. An image recognition apparatus for a target attribute, comprising:
the first processing module is used for acquiring a picture to be identified;
the second processing module is used for determining the confidence degrees of at least one type of human body label corresponding to the picture to be recognized and the corresponding type of human body label, and the confidence degrees of at least one type of non-human body label corresponding to the picture to be recognized and the corresponding type of non-human body label;
and the third processing module is used for identifying the picture to be identified according to at least one of the at least one type of human body label, the at least one type of non-human body label, the confidence coefficient of the at least one type of human body label and the confidence coefficient of the at least one type of non-human body label to obtain a target attribute identification result.
9. An electronic device, comprising: a processor, a memory;
the memory for storing a computer program;
the processor is used for executing the picture identification method for the target attribute according to any one of claims 1 to 7 by calling the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored which, when being executed by a processor, is adapted to carry out the method for picture recognition of a target property according to any one of claims 1-7.
CN202110297296.8A 2021-03-19 2021-03-19 Image identification method, device and equipment for target attribute and storage medium Pending CN115116085A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110297296.8A CN115116085A (en) 2021-03-19 2021-03-19 Image identification method, device and equipment for target attribute and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110297296.8A CN115116085A (en) 2021-03-19 2021-03-19 Image identification method, device and equipment for target attribute and storage medium

Publications (1)

Publication Number Publication Date
CN115116085A true CN115116085A (en) 2022-09-27

Family

ID=83323110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110297296.8A Pending CN115116085A (en) 2021-03-19 2021-03-19 Image identification method, device and equipment for target attribute and storage medium

Country Status (1)

Country Link
CN (1) CN115116085A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115471871A (en) * 2022-09-22 2022-12-13 四川农业大学 Sheldrake gender classification and identification method based on target detection and classification network



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination