CN116958965A - Method and device for identifying cover picture, server and storage medium - Google Patents


Info

Publication number: CN116958965A
Application number: CN202210365407.9A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 刘刚
Assignee (current and original): Tencent Technology Chengdu Co Ltd
Prior art keywords: scene, target object, key point, cover picture, determining
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application filed by Tencent Technology Chengdu Co Ltd; priority to CN202210365407.9A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The disclosure provides a method, a device, a server and a storage medium for identifying a cover picture, applicable to various scenes such as cloud technology, artificial intelligence, intelligent traffic and assisted driving. The method comprises the following steps: performing key point detection on a target object in a cover picture to be identified to obtain a key point detection result of the target object; when the key point detection result is that the target object is incomplete, determining a semantic application scene corresponding to the cover picture to be identified based on the cover picture to be identified and the title information of the content item to which it belongs; and determining the integrity of the cover picture to be identified in that semantic application scene based on the key point detection result and the semantic application scene. In the method and device, whether the object in the cover picture is complete is no longer the sole criterion of cover-picture integrity; instead, the judgment combines the key point detection result of the target object with the semantic application scene of the cover picture to be identified, which improves the recognition accuracy of the cover picture.

Description

Method and device for identifying cover picture, server and storage medium
Technical Field
The disclosure relates to the technical field of artificial intelligence, and in particular relates to a method and a device for identifying a cover picture, a server and a storage medium.
Background
As internet technology has evolved and the threshold for content production has fallen, the volume of uploaded content has grown exponentially. In general, a content producer configures only one original cover picture that meets the requirements of the current scene, without considering other application scenes; a server therefore needs to crop the original cover picture for different application scenes, and the size and the objects of the original cover picture are inevitably damaged in the cropping process. Since the cover picture is a core element of the content and the integrity of the objects in the cover picture directly influences the distribution effect of the content, integrity recognition needs to be performed on the objects in the cover picture.
The related art performs integrity recognition on the objects in the cover picture based on a simple binary classification model. However, this approach can only identify cover pictures whose objects are incomplete in the physical sense, and its recognition accuracy is low.
Disclosure of Invention
The embodiment of the disclosure provides a method, a device, a server and a storage medium for identifying a cover picture, which can improve the identification accuracy of the cover picture. The technical scheme is as follows:
in a first aspect, a method for identifying a cover picture is provided, where the method includes:
Performing key point detection on a target object in a cover picture to be identified to obtain a key point detection result of the target object;
under the condition that the key point detection result is that the target object is incomplete, determining a semantic application scene corresponding to the cover picture to be identified based on the cover picture to be identified and the title information of the content item to which the cover picture to be identified belongs;
and determining the integrity of the cover picture to be identified in the semantic application scene based on the key point detection result and the semantic application scene.
In a second aspect, there is provided an apparatus for recognizing a cover picture, the apparatus comprising:
the detection module is used for carrying out key point detection on a target object in the cover picture to be identified to obtain a key point detection result of the target object;
the first determining module is used for determining a semantic application scene corresponding to the cover picture to be identified based on the cover picture to be identified and the title information of the content item to which the cover picture to be identified belongs under the condition that the key point detection result is that the target object is incomplete;
and the second determining module is used for determining the integrity of the cover picture to be identified in the semantic application scene based on the key point detection result and the semantic application scene.
In a third aspect, a server is provided, where the server includes a processor and a memory, where the memory stores at least one program code, and the at least one program code is loaded and executed by the processor to implement the method for identifying a cover image according to the first aspect.
In a fourth aspect, there is provided a computer readable storage medium having stored therein at least one program code loaded and executed by a processor to implement the method for identifying a cover image according to the first aspect.
In a fifth aspect, there is provided a computer program product comprising computer program code stored in a computer readable storage medium, the computer program code being read from the computer readable storage medium by a processor of a server, the processor executing the computer program code such that the server performs the method of identifying a cover image according to the first aspect.
The technical scheme provided by the embodiment of the disclosure has the beneficial effects that:
By performing key point detection on the target object in the cover picture to be identified, it is determined whether the target object in the cover picture is complete. Since the key points are the main points that make up the target object, this key point detection method sharpens the definition of the target object's integrity in the physical sense, and other unimportant points on the target object need not be attended to during detection, which improves detection speed. When determining the completeness of the cover picture to be identified, whether the object in the cover picture is complete is no longer used as the criterion of cover-picture completeness; instead, the determination combines the key point detection result of the target object with the semantic application scene of the cover picture to be identified.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 is a schematic diagram of an implementation environment related to a method for identifying a cover picture according to an embodiment of the disclosure;
fig. 2 is a schematic diagram of each functional module included in a cover picture recognition system according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a method for identifying a cover image according to an embodiment of the disclosure;
FIG. 4 is a flowchart of another method for identifying a cover image according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a key point labeling result provided by an embodiment of the present disclosure;
fig. 6 is a flowchart for identifying the integrity of a target object in a cover picture based on a key point detection result according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a cover photo according to an embodiment of the disclosure;
FIG. 8 is a schematic diagram of another cover image provided by an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of another cover image provided by an embodiment of the present disclosure;
FIG. 10 is an overall logic diagram of a method for identifying a cover image according to an embodiment of the present disclosure;
FIG. 11 is a flowchart of a training method for a semantic application scene recognition model provided by an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a semantic application scenario partitioning result provided by an embodiment of the present disclosure;
FIG. 13 is a schematic diagram of a Transformer unit according to an embodiment of the disclosure;
FIG. 14 is a schematic diagram of a network architecture for training a multimodal picture pre-training model provided by an embodiment of the present disclosure;
fig. 15 is a schematic structural diagram of a device for identifying a cover picture according to an embodiment of the disclosure;
fig. 16 is a server for cover picture identification, according to an exemplary embodiment.
Detailed Description
For the purposes of clarity, technical solutions and advantages of the present disclosure, the following further details the embodiments of the present disclosure with reference to the accompanying drawings.
It will be understood that the terms "each," "plurality," and "any" used in this disclosure are read as follows: "plurality" means two or more, "each" refers to every one of a corresponding plurality, and "any" refers to any one of a corresponding plurality. For example, if a plurality of words includes 10 words, "each word" refers to every one of the 10 words, and "any word" refers to any one of the 10 words.
Information (including but not limited to user equipment information, user personal information, and the like), data (including but not limited to data for analysis, stored data, displayed data, and the like) and signals involved in this disclosure are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data comply with the relevant laws, regulations and standards of the relevant countries and regions. For example, the face key points, body key points, etc. of the subjects referred to in this disclosure are all acquired with sufficient authorization.
Before describing the embodiments of the present disclosure, the technologies involved in the embodiments are first introduced.
Artificial intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly covers computer vision, natural language processing, machine learning/deep learning and other directions.
Computer vision (CV) is the science of studying how to make machines "see": replacing human eyes with cameras and computers to perform machine vision tasks such as recognition and measurement on a target, and further performing graphic processing so that the result becomes an image more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies theories and technologies for building artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e. the language people use daily, so it is closely related to linguistics. Natural language processing technologies typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
Cloud technology refers to a hosting technology that unifies hardware, software, network and other resources in a wide area network or a local area network to realize the computation, storage, processing and sharing of data. Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology and the like based on the cloud computing business model; it can form a resource pool, used on demand, which is flexible and convenient.
Next, terms related to the embodiments of the present disclosure will be explained.
Content item: refers to content that can be digitized by any electronic processing means such as scanning, as well as content that has already been digitized. Content items include documents composed of text and (still) images, audio, video with moving images, and the like.
Article: an article recommended to a user for reading, which may contain video or pictures; articles are usually actively edited and published by we-media authors after opening an official account.
Video: video recommended to the user for viewing, including vertical small videos and horizontal short videos, provided in the form of a Feeds stream.
MCN (Multi-Channel Network): a product form of the multi-channel network. It combines PGC content and, with strong capital support, guarantees the continuous output of content, finally achieving stable commercial monetization.
PGC (Professionally Generated Content): an internet term referring to content produced professionally (e.g., on video websites) or by experts (e.g., on microblogs). PGC broadly refers to content that is personalized, diversified in viewpoint and virtualized in social relationships.
UGC (User Generated Content): with the advocacy of individuality as its main characteristic, UGC is not a specific service but a new way for users to use the internet, shifting from download-dominated usage to usage in which downloading and uploading are equally important.
PUGC (Professional User Generated Content): professionally produced audio content in UGC form that is relatively close to PGC.
Feeds: a message source (English: web feed, news feed, syndicated feed), also translated as source material, feed, information supply, manuscript, abstract, source, news subscription or web feed. Feeds is a data format through which a website propagates its latest information to users, usually arranged on a timeline; the timeline is the most primitive and intuitive presentation form of a feed. A prerequisite for a user to be able to subscribe to a website is that the website provides a message source. Feeds are converged in one place called an aggregation, and the application used for aggregation is called an aggregator. For the end user, an aggregator is an application dedicated to subscribing to websites, commonly referred to as an RSS (Really Simple Syndication) reader, feed reader, news reader, and the like.
Deep learning: the concept of deep learning originates from research on artificial neural networks; a multi-layer perceptron with multiple hidden layers is a deep learning structure. Deep learning discovers distributed feature representations of data by combining low-level features to form more abstract high-level representations of attribute categories or features.
Referring to fig. 1, an implementation environment related to a method for identifying a cover image according to an embodiment of the disclosure is shown, where the implementation environment includes: a terminal 101 and a server 102, wherein the terminal 101 communicates with the server 102 via a network 103, and the network 103 may be a wired network or a wireless network.
The terminal 101 is a terminal used by a content producer, and the terminal 101 may be, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, an aircraft, and the like. The terminal 101 has installed therein a content item production distribution application, and based on the installed application, it is possible to transmit the produced content item and related information (including an original cover picture, a title, etc.) to the server 102, so that the server 102 performs operations of storage, auditing, distribution, etc.
The server 102 is a background server of the content item production and distribution application. The server 102 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN (Content Delivery Network), big data and artificial intelligence platforms. The server 102 receives and stores the content item and related information uploaded by the terminal 101, crops the original cover picture for different application scenes to obtain cover pictures for those scenes, and then performs integrity recognition on the cover pictures. When a cover picture is recognized as complete, the content item is distributed with that cover picture.
The above-described implementation environment may further include: and a terminal 104. The terminal 104 is a terminal used by a content consumer, and the terminal 104 communicates with the server 102 via the network 103. The terminal 104 may be, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, an aircraft, etc. The terminal 104 has installed therein a content item production distribution application, and based on the installed application, the terminal 104 can receive a content item pushed by the server 102 or pull the content item from the server 102.
Fig. 2 is a schematic diagram of a cover picture identification system provided in an embodiment of the disclosure. Referring to fig. 2, the system includes: a content production end, a content consumption end, an uplink and downlink content interface service module, a content database, a dispatch center service module, a deduplication service module, a content storage service module, a semantic incompleteness identification service module, a download file system, a content distribution outlet service module, a manual auditing system, a semantic application scene recognition service module, a video frame extraction and image-text content analysis cropping module, an object incompleteness detection service module, a picture multimodal pre-training database, a crawling and data preprocessing system, a multimodal picture pre-training model, and the like.
The content production end is used for generating content items and comprises at least one content producer such as PGC, UGC or MCN. The content production end provides content items such as image-text or video through an API (Application Programming Interface) system of the terminal or the server; these content items are the main content source of the recommendation system. The content production end can communicate with the uplink and downlink content interface service module and upload content items through it. If a content item is in image-text form, the image-text content usually comes from a lightweight publishing end and an editing content entry; if a content item is in video form, the video content usually comes from a shooting end, and during shooting the local video content can be matched with music, filter templates, video beautification functions, and the like. The content production end can also report playback behavior data during content upload (such as stalling, loading time and playback clicks) to the background for statistical analysis.
The content consumption end is used for acquiring content items. It can communicate with the uplink and downlink content interface service module to acquire content index information, which includes the title, author, cover picture, file size, cover picture link, code rate, file format, release time, an originality flag, and the like; if the content item is uploaded for the first time, it also includes classification and label information (including the primary, secondary and tertiary classifications assigned during manual auditing; for example, for an article about a mobile phone, the primary classification is science and technology, the secondary classification is smartphone, the tertiary classification is domestic mobile phone, and the label information is a specific brand and model of phone). The content consumption end can also communicate with the content storage service module and acquire content source files, such as video source files and picture source files, based on the content index information; and it can communicate with the content distribution outlet service module to receive pushed content items. The content consumption end can also report playback behavior data during content download (such as stalling, loading time and playback clicks) to the background for statistical analysis. In addition, content consumers typically browse content data through Feeds streams.
The uplink and downlink content interface service module is the main module communicating with the outside. Through communication with the content production end, it can receive content items uploaded by the content production end, store the content index information of each content item in the content database, and store the content source file in the content storage service module. It can also forward content items uploaded by the content production end, as well as content items from external channels, to the dispatch center service module for subsequent processing and circulation. In addition, the uplink and downlink content interface service module can store content items from external channel sources in the content storage service module and the content database.
The content database is the core database of content items, in which the content meta-information of the content items published by all content producers is stored. When the dispatch center service module processes a content item, it reads the content meta-information of the content item from the content database and returns the processing result to the content database to update the meta-information. The processing of content items by the dispatch center service module includes machine processing and manual auditing. Machine processing mainly audits the quality of content items to filter out low-quality ones, and can also deduplicate content items; the result is returned to the content database after processing. After machine processing, the manual auditing system audits the content item, and the auditing result and state are written back to the content database. In addition, when the video frame extraction and image-text content analysis cropping module crops cover pictures for different application scenes, it reads the content meta-information from the content database, crops the cover pictures accordingly, and returns them to the content database for storage. Finally, when constructing the multimodal picture pre-training model and the semantic application scene recognition model, if the label information of a cover picture used for modeling is needed, the content meta-information of the content item to which the cover picture belongs is also acquired from the content database.
The content storage service module is used for storing content source files of the content items, wherein the content source files are content entity information besides content meta information of the content items, such as video source files, picture source files of graphic content and the like. The content consumer may obtain the content source file for the content item directly from the content storage service based on the content meta-information of the content item. The content storage service module is also used for providing a content source file for the video frame extracting and picture-text analysis cutting module, so that the video frame extracting and picture-text analysis cutting module can extract video frames from the content source file and cut the extracted video frames to generate candidate cover pictures.
The dispatching center service module is responsible for the content item circulation and the whole dispatching process, can receive the content items sent by the uplink and downlink content interface service module, and can acquire content meta-information of the content items from the content database. The dispatch center service module is also capable of dispatching the machine processing system and the manual auditing system to process the content items and control the sequence and priority of dispatching. When it is determined by the manual review system that the content item is enabled, the content item is provided to the content consumer via a content distribution outlet service module (typically a recommendation engine or search engine, etc.), at which time the content consumer obtains content index information, i.e., the access address of the content item. The dispatching center service module can also communicate with the semantic incomplete identification service module, and the semantic incomplete identification service module is called to identify the integrity of the objects in the cover pictures, so that the screened cover pictures are ensured to be semantically complete and matched with the context in which the content item is located.
The deduplication service module is part of the machine processing system and is used for deduplicating content items uploaded by the content production end. When two content items are detected as similar, usually only one is kept for subsequent processing, to reduce the amount of computation later. When deduplicating content items, the module acquires the embedding vectors of the two content items, judges the similarity between them by computing the cosine distance between the two embedding vectors, and determines that the two content items are similar if the similarity is greater than a certain threshold.
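The following is a minimal sketch of this deduplication check, assuming the embedding vectors are produced by an upstream model; the function names, the vector dimension and the 0.9 threshold are illustrative assumptions, not values given in the disclosure.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two content-item embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_duplicate(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.9) -> bool:
    """Two content items are judged similar when the similarity of their
    embeddings exceeds the threshold; only one of the pair is then kept."""
    return cosine_similarity(emb_a, emb_b) > threshold

# Usage with two made-up 512-dimensional embeddings.
emb_1 = np.random.rand(512)
emb_2 = np.random.rand(512)
print(is_duplicate(emb_1, emb_2))
```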
The manual auditing system is a carrier of manual service capability and is mainly used for auditing and filtering sensitive contents which cannot be identified by a machine. In the auditing process, classification labels can be marked on short videos, small videos and the like, and the marked labels are secondarily confirmed.
The downloading file system is used for downloading the content source file from the content storage service module and controlling the downloading speed and the downloading progress in the downloading process. The download file system includes a set of parallel servers that form a server cluster for task scheduling and distribution for performing scheduling and distribution tasks. After the downloading of the content source file is completed, the downloaded content source file is provided for a video frame extracting and image-text content analyzing and cutting module, and the video frame extracting and image-text content analyzing and cutting module extracts the video frame from the content source file.
The video frame extraction and image-text content analysis cropping module is used for extracting video frames from the content source file; the extracted video frames provide a data source for candidate cover pictures. After a plurality of video frames are extracted, the image-text content of each video frame is analyzed, and the plurality of video frames plus the original cover picture are taken as candidate cover pictures. The module then intelligently crops the candidate cover pictures according to the display requirements of different application scenes to obtain cover pictures of different specifications, all of which need to undergo integrity detection.
The object incompleteness detection service module is used for detecting the integrity of objects in the cover picture and adopts a dual-branch target and key point detection model. The dual-branch target and key point detection model comprises a backbone network, a feature encoding module, prediction branches, and the like.
The crawling and data preprocessing system is used for crawling corresponding pictures from the Internet based on the search words, and the pictures corresponding to the same search word are considered to be similar in the crawling process.
The picture multimodal pre-training database is the database used for training the multimodal picture pre-training model. Its data are image-text pairs, and its sources include at least one of the following: first, image-text pairs formed from search words and the pictures crawled from the internet by a crawler system based on those search words; second, image-text pairs in currently public databases; third, image-text pairs consisting of the cover pictures and titles uploaded by the content production end. Image-text pairs from multiple sources enrich the picture multimodal pre-training database and enhance the generalization of its data. The multimodal picture pre-training model is a model pre-trained on the image-text pairs in the picture multimodal pre-training database with a plurality of pre-training tasks, and it can mine the intrinsic relationship between pictures and text.
The semantic application scene recognition service module is used for calling the semantic application scene recognition model to recognize the semantic application scene corresponding to a cover picture. The semantic application scene recognition model is obtained by taking the multimodal picture pre-training model as a basis and fine-tuning its parameters with training samples from different semantic application scenes. The different semantic application scenes include human body part close-up scenes, specific scenes, clothing display scenes, and the like.
The semantic incomplete identification service module is used for determining the integrity of the cover picture under the semantic application scene by combining the integrity identification result of the object in the cover picture, the semantic application scene corresponding to the cover picture and the identification strategy.
The content distribution outlet service module is the output outlet of content items; it communicates with the dispatch center service module and the content consumption end. Through this communication, the dispatch center service module distributes the content meta-information of content items that pass manual auditing via the content distribution outlet service module, and the content consumption end receives the distributed content meta-information. When distributing content meta-information to the content consumption end, the content distribution outlet service module mainly distributes automatically based on a recommendation algorithm, or distributes manually.
The embodiment of the present disclosure provides a method for identifying a cover picture, taking a server shown in fig. 1 to execute the embodiment of the present disclosure as an example, referring to fig. 3, a method flow provided by the embodiment of the present disclosure includes:
301. Perform key point detection on the target object in the cover picture to be identified to obtain a key point detection result of the target object.
302. When the key point detection result is that the target object is incomplete, determine the semantic application scene corresponding to the cover picture to be identified based on the cover picture to be identified and the title information of the content item to which it belongs.
303. Determine the integrity of the cover picture to be identified in the semantic application scene based on the key point detection result and the semantic application scene.
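Read together, steps 301 to 303 describe the following control flow. The sketch below stubs each step; the function names, types and the early return for a physically complete object are assumptions made for illustration and are not named in the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class KeypointResult:
    target_complete: bool                     # completeness in the physical sense
    missing_keypoints: list[str] = field(default_factory=list)

def detect_keypoints(cover_picture) -> KeypointResult:
    ...  # step 301: dual-branch face/body key point detection (stub)

def recognize_scene(cover_picture, title: str) -> str:
    ...  # step 302: semantic application scene recognition (stub)

def complete_in_scene(result: KeypointResult, scene: str) -> bool:
    ...  # step 303: scene-specific completeness rules (stub)

def identify_cover(cover_picture, title: str) -> bool:
    result = detect_keypoints(cover_picture)       # step 301
    if result.target_complete:
        return True                                # scene check only runs when incomplete
    scene = recognize_scene(cover_picture, title)  # step 302
    return complete_in_scene(result, scene)        # step 303
```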
According to the method provided by the embodiments of the disclosure, whether the target object in the cover picture to be identified is complete is determined by performing key point detection on the target object. Since key points are the main points that make up the target object, key point detection sharpens the definition of the target object's integrity in the physical sense, and other unimportant points on the target object need not be attended to during detection, which improves detection speed. When determining the completeness of the cover picture to be identified, whether the object in the cover picture is complete is no longer used as the criterion of cover-picture completeness; instead, the determination combines the key point detection result of the target object with the semantic application scene of the cover picture to be identified.
In another embodiment of the present disclosure, performing keypoint detection on a target object in a cover picture to be identified to obtain a keypoint detection result of the target object, including:
identifying a plurality of face key points and a plurality of human body key points of a target object in a cover picture to be identified;
determining the face integrity of the target object based on the plurality of face key points;
determining the human body integrity of the target object based on the plurality of human body key points;
and when at least one of the face or the human body of the target object is incomplete, determining that the key point detection result is that the target object is incomplete.
In another embodiment of the present disclosure, determining face integrity of a target object based on a plurality of face keypoints comprises:
fusing the features of the key points of the face in different dimensions to obtain fused features of the key points of the face;
determining a first key point score of the face key point based on the fusion characteristic of the face key point, wherein the first key point score is used for representing the probability of the face key point contained in the target object;
under the condition that the first key point score does not meet the first score threshold condition, determining that the target object does not contain the face key point;
and under the condition that the target object does not contain the key points of the human face, determining that the human face of the target object is incomplete.
In another embodiment of the present disclosure, identifying the human body integrity of the target object based on a plurality of human body keypoints comprises:
fusing the features of different dimensions of the key points of the human body to obtain the fused features of the key points of the human body;
determining a second key point score of the human key points based on the fusion characteristics of the human key points, wherein the second key point score is used for representing the probability of the human key points contained in the target object;
under the condition that the second key point score does not meet a second score threshold condition, determining that the target object does not contain human key points;
and under the condition that the target object does not contain the key points of the human body, determining that the human body of the target object is incomplete.
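The face and body flows above share one shape: fuse per-key-point features, predict a score, and threshold it. A minimal sketch of the thresholding step follows; the key point names, scores and the 0.5 threshold are made-up values, since the disclosure leaves the thresholds to the implementer.

```python
def missing_keypoints(scores: dict[str, float], score_threshold: float) -> list[str]:
    """Key points whose predicted score falls below the threshold, i.e. key
    points the target object is judged not to contain."""
    return [name for name, score in scores.items() if score < score_threshold]

# First key point scores for face key points: the probability, in [0, 1],
# that the object contains each key point (values here are invented).
face_scores = {"left_eye": 0.97, "right_eye": 0.95, "nose": 0.91, "chin": 0.12}
FACE_SCORE_THRESHOLD = 0.5  # assumed value

missing = missing_keypoints(face_scores, FACE_SCORE_THRESHOLD)
face_incomplete = len(missing) > 0  # any missing face key point => face incomplete
print(missing, face_incomplete)     # ['chin'] True
```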
In another embodiment of the present disclosure, determining a semantic application scenario corresponding to a cover picture to be identified based on the cover picture to be identified and header information of a content item to which the cover picture to be identified belongs includes:
invoking a semantic application scene recognition model to recognize the cover picture to be recognized and the title information to obtain a semantic application scene corresponding to the cover picture to be recognized, wherein the semantic application scene recognition model is used for recognizing the semantic application scene of the cover picture based on the cover picture and the corresponding title information.
In another embodiment of the present disclosure, determining the integrity of the cover picture to be identified in the semantic application scenario based on the key point detection result and the semantic application scenario includes:
determining the integrity of the target object in the semantic application scene based on the key point detection result;
under the condition that the target object is determined to be complete in the semantic application scene based on the key point detection result, determining that the cover picture to be identified is complete in the semantic application scene;
and under the condition that the target object is determined to be incomplete in the semantic application scene based on the key point detection result, determining that the cover picture to be identified is incomplete in the semantic application scene.
In another embodiment of the present disclosure, determining the integrity of the target object in the semantic application scenario based on the keypoint detection result includes:
when the semantic application scene is a human body part close-up scene, determining the human body part to be featured in the close-up scene, and, when that body part is determined to be complete based on the key point detection result, determining that the target object is complete in the semantic application scene;
when the semantic application scene is a clothing display scene, determining the clothing to be displayed in the scene, and, when that clothing is determined to be complete based on the key point detection result, determining that the target object is complete in the semantic application scene;
and when the semantic application scene is a specific scene, determining the scene elements to be represented in the specific scene, and, when those scene elements are determined to be complete based on the key point detection result, determining that the target object is complete in the semantic application scene.
In another embodiment of the present disclosure, the body part close-up scene includes any one of a head close-up scene, a neck close-up scene, a collarbone close-up scene, an arm close-up scene, an upper body close-up scene, a leg close-up scene, or a foot close-up scene;
the clothing display scene comprises any one of a coat display scene, a lower coat display scene or a shoe display scene;
the specific scene includes any one of an eating scene, a painting scene, a musical instrument playing scene, a handcrafting scene, a food-making scene, a character non-body scene, a multi-person scene, or a non-main-character scene.
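One way to operationalize the scene rules above is a mapping from each semantic application scene to the key points that must be present, with everything outside that set allowed to be cropped away. The scene names and key point groupings below are illustrative assumptions, not the disclosure's actual rule set.

```python
# Key points that must be present for a cover to count as complete in each
# scene; key points outside the set may be missing without penalty.
REQUIRED_KEYPOINTS: dict[str, set[str]] = {
    "head_closeup": {"left_eye", "right_eye", "nose", "mouth"},
    "leg_closeup":  {"left_hip", "right_hip", "left_knee", "right_knee",
                     "left_ankle", "right_ankle"},
    "shoe_display": {"left_ankle", "right_ankle"},
}

def target_complete_in_scene(scene: str, present_keypoints: set[str]) -> bool:
    """Complete in a scene when every key point the scene requires was
    detected, regardless of other missing key points."""
    return REQUIRED_KEYPOINTS.get(scene, set()).issubset(present_keypoints)

# A leg close-up with the upper body cropped away still counts as complete.
present = {"left_hip", "right_hip", "left_knee", "right_knee",
           "left_ankle", "right_ankle"}
print(target_complete_in_scene("leg_closeup", present))  # True
```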
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
The embodiment of the present disclosure provides a method for identifying a cover picture, taking a server shown in fig. 1 to execute the embodiment of the present disclosure as an example, referring to fig. 4, a method flow provided by the embodiment of the present disclosure includes:
401. The server acquires the cover picture to be identified.
The cover pictures to be identified are the pictures whose integrity needs to be detected in the embodiments of the disclosure. There is at least one cover picture to be identified, and the at least one cover picture can meet the display requirements of the same terminal in different scenes (for example, the different display requirements of horizontal-screen and vertical-screen scenes) as well as the display requirements of different terminals in the same scene (for example, terminals with different screen sizes). Sources of the cover pictures to be identified include, but are not limited to, the following:
In the first mode, after producing a content item, the content producer selects a picture as its original cover picture and uploads the content item and the original cover picture to the server, which stores, audits and distributes them. In response to receiving the original cover picture uploaded by the content producer, the server intelligently crops the original cover picture to the specifications and sizes required by terminals in different display scenes, obtaining cover pictures of different specifications, and then takes the uploaded original cover picture together with the cropped cover pictures of different specifications as the cover pictures to be identified.
In the second mode, after producing a content item, the content producer does not specify an original cover picture but simply uploads the content item to the server. In response to receiving the content item, the server selects at least one picture from the content item as an original cover picture, then intelligently crops the selected original cover picture to the specifications and sizes required by terminals in different display scenes, obtaining cover pictures of different specifications, and takes the at least one original cover picture together with the cropped cover pictures as the cover pictures to be identified. Of course, if the content item contains no picture (for example, plain text), the server may also acquire from the internet at least one picture matching the title, abstract, etc. of the text content to serve as the original cover picture.
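Both modes end with the same step: cropping one original cover picture into several specifications. The sketch below uses Pillow and a naive center crop to show the input/output shape of that step; the disclosure describes intelligent cropping, and the aspect ratios here are assumed, not taken from the text.

```python
from PIL import Image

# Assumed target aspect ratios for different display scenes.
SPECS = {"horizontal": (16, 9), "vertical": (3, 4), "square": (1, 1)}

def center_crop(img: Image.Image, ratio_w: int, ratio_h: int) -> Image.Image:
    """Crop the largest centered region with the requested aspect ratio."""
    w, h = img.size
    target_w = min(w, h * ratio_w // ratio_h)
    target_h = min(h, w * ratio_h // ratio_w)
    left, top = (w - target_w) // 2, (h - target_h) // 2
    return img.crop((left, top, left + target_w, top + target_h))

original = Image.new("RGB", (1280, 720))  # stand-in for the uploaded cover
cropped = {name: center_crop(original, rw, rh) for name, (rw, rh) in SPECS.items()}
# The original plus every cropped specification become the covers to identify.
covers_to_identify = [original, *cropped.values()]
print([c.size for c in covers_to_identify])
```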
402. The server performs target detection on the cover picture to be identified to obtain the target object of the cover picture to be identified.
The target object is a foreground object that makes up the cover content of the cover picture to be identified. Target objects include persons, animals, vehicles, furniture, clothing, and the like; there is at least one target object, belonging to at least one category. For example, if the picture content of the cover picture to be identified includes two persons, the target objects are those two persons; if the picture content includes a person and a car, the target objects are the person and the car.
When performing target detection on the cover picture to be identified, the server may proceed as follows: extract image features of the cover picture to obtain its corresponding feature map; determine a plurality of candidate boxes in the cover picture and map the candidate boxes into the feature map to obtain a plurality of corresponding candidate feature maps; apply max pooling to the candidate feature maps to obtain a plurality of candidate region maps of the same size; perform classification and candidate-box position regression on the candidate region maps to obtain the detection boxes of the target object; and finally identify the target object from each detection box based on its position.
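The pipeline just described (candidate boxes mapped onto a shared feature map, pooling to a fixed size, then classification and box regression) has the general shape of a two-stage detector such as Faster R-CNN. The disclosure does not name a specific detector, so the following only illustrates the step with an off-the-shelf torchvision model (torchvision >= 0.13); the 0.8 score cutoff is an assumed value.

```python
import torch
import torchvision

# A pretrained two-stage detector as a stand-in for the target detection step.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

cover = torch.rand(3, 480, 640)  # placeholder for the cover picture tensor
with torch.no_grad():
    (pred,) = model([cover])     # one dict ('boxes', 'labels', 'scores') per image

keep = pred["scores"] > 0.8      # keep confident detections as target objects
boxes, labels = pred["boxes"][keep], pred["labels"][keep]
print(boxes.shape, labels.tolist())
```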
403. The server performs key point detection on the target object in the cover picture to be identified to obtain a key point detection result of the target object.
To detect the integrity of the target object in the cover picture to be identified, key point detection is usually performed on the target object. Taking a person as the category of target object, the related art mainly detects either human body key points or face key points. However, when only human body key points are detected, only an incomplete body can be found, and incomplete facial features cannot be accurately identified; when only face key points are detected, only incomplete facial features can be found, and a missing body cannot be identified. The related art therefore has difficulty effectively solving person-integrity recognition. Although some related work (such as ZoomNet) detects face key points and body key points simultaneously and then judges the degree of incompleteness from the detected key points, which provides more guiding information for incompleteness recognition, it performs poorly on the key points of incomplete persons, and the model is complex and heavy, unsuitable for large-scale cover picture detection and difficult to deploy online.
In order to improve the accuracy of key point detection and better meet the requirement of calculation efficiency in the practical application process, the embodiment of the disclosure adopts a dual-branch target and key point detection model when the key point detection is performed on the target object. The dual-branch target and key point detection model comprises a key point detection module, a feature coding module, a prediction branch and the like.
The key point detection module may be a backbone network such as an EfficientNet-B4 network, and is used for detecting each key point of the target object and extracting a feature map for each key point. The backbone network comprises a face detection branch and a human body detection branch: the face detection branch is used for extracting the face key points of the target object and the features of different dimensions of each face key point, and the human body detection branch is used for extracting the human body key points of the target object and the features of different dimensions of each human body key point. In view of the different physiological characteristics of the face and the body, the numbers of face key points and human body key points extracted by the two branches differ: the number of face key points extracted by the face detection branch may be 44, 68, etc., and the number of human body key points extracted by the human body detection branch may be 17, 20, etc.; the embodiments of the disclosure do not limit these numbers. Fig. 5 shows schematic diagrams of the face key points and human body key points extracted by the method provided by the embodiments of the disclosure: referring to the left diagram in fig. 5, the number of human body key points extracted by the human body detection branch is 17; referring to the right diagram in fig. 5, the number of face key points extracted by the face detection branch is 44.
The feature encoding module adopts the hidden-layer outputs of EfficientNet-B4 with multi-scale fusion, and is used for performing multi-scale fusion on the features of different dimensions of each key point extracted by the key point detection module to obtain key point fusion features. For face key points, the feature encoding module fuses the features of different dimensions of each face key point to obtain the face key point fusion features; for human body key points, it fuses the features of different dimensions of each human body key point to obtain the human body key point fusion features. Because the dimensions of the features corresponding to face key points and human body key points differ, the dimensions of the fused face key point features and the fused human body key point features obtained by the feature encoding module also differ.
The prediction branches include a face prediction branch and a human body prediction branch. The face prediction branch is used for predicting a first key point score for each face key point. The first key point score of a face key point represents the probability that the object in the cover picture contains that face key point, and its value lies between 0 and 1: the higher the first key point score, the more likely the object in the cover picture contains the face key point; conversely, the lower the first key point score, the less likely the object contains the face key point and the more likely the face key point is missing. The human body prediction branch is used for predicting a second key point score for each human body key point. The second key point score of a human body key point represents the probability that the object in the cover picture contains that human body key point, and its value lies between 0 and 1: the higher the second key point score, the more likely the object contains the human body key point; conversely, the lower the second key point score, the less likely the object contains the human body key point and the more likely the human body key point is missing.
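A minimal sketch of the two prediction branches: each maps the fused key point features to one score per key point in [0, 1] through a sigmoid. The key point counts (44 face, 17 body) follow FIG. 5; the feature dimensions and single-layer heads are assumptions made for illustration.

```python
import torch
from torch import nn

class KeypointScoreHead(nn.Module):
    """Maps fused key point features to one presence score per key point."""
    def __init__(self, in_dim: int, num_keypoints: int):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_keypoints)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.fc(fused))  # scores lie in (0, 1)

# The two branches differ in fused-feature dimension and key point count.
face_branch = KeypointScoreHead(in_dim=256, num_keypoints=44)  # first key point scores
body_branch = KeypointScoreHead(in_dim=128, num_keypoints=17)  # second key point scores

face_scores = face_branch(torch.randn(1, 256))
body_scores = body_branch(torch.randn(1, 128))
print(face_scores.shape, body_scores.shape)  # (1, 44) (1, 17)
```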
In the embodiment of the disclosure, the server performs key point detection on the target object in the cover picture to be identified based on the dual-branch target and key point detection model to obtain the key point detection result of the target object. The specific detection process comprises the following steps:
4031. the server identifies a plurality of face key points and a plurality of human body key points of a target object in the cover picture to be identified.
The server adopts the key point detection module in the dual-branch target and key point detection model to perform key point detection on the target object in the cover picture to be identified, so as to extract a plurality of face key points and the features of different dimensions of each face key point, and extract a plurality of human body key points and the features of different dimensions of each human body key point.
4032. The server determines the face integrity of the target object based on the plurality of face keypoints.
For each face key point of the plurality of extracted face key points, the process of determining the face integrity of the target object based on the face key point by the server comprises:
the first step, the server fuses the features of the different dimensions of the key points of the face to obtain the fused features of the key points of the face.
The server adopts a feature coding module in the dual-branch target and key point detection model to fuse the features of different dimensions of the key points of the human face to obtain the fusion features of the key points of the human face, wherein the fusion features can represent the key points of the human face from multiple dimensions.
And secondly, the server determines a first key point score of the face key point based on the fusion characteristics of the face key point.
The server inputs the fusion characteristic of the face key points into a face prediction branch of the double-branch target and key point detection model, and outputs a first key point score of the face key points.
And thirdly, under the condition that the first key point score does not meet the first score threshold condition, the server determines that the target object does not contain the face key point.
The server compares the first key point score of the face key point with a first score threshold. When the first key point score is smaller than the first score threshold, the server determines that the first key point score of the face key point does not meet the first score threshold condition, and thus determines that the target object does not contain the face key point; conversely, when the first key point score is greater than or equal to the first score threshold, the server determines that the first key point score of the face key point meets the first score threshold condition, and thus determines that the target object contains the face key point. The first score threshold may be set by a technician, and is not specifically limited in the embodiments of the present disclosure.
Fourth, the server determines that the face of the target object is incomplete under the condition that the target object does not contain the face key points.
After each face key point is processed by the method, the server can determine the inclusion condition of each face key point in the target object. When the target object contains all the face key points, determining that the face of the target object is complete; and when the target object does not contain any face key point, determining that the face of the target object is incomplete.
4033. The server determines the human body integrity of the target object based on the plurality of human body keypoints.
For each human body keypoint of the plurality of extracted human body keypoints, the process of determining the human body integrity of the target object by the server based on the human body keypoint comprises:
the first step, the server fuses the features of different dimensions of the key points of the human body to obtain the fusion features of the key points of the human body.
The server adopts a feature coding module in the dual-branch target and key point detection model to fuse the features of different dimensions of the key points of the human body, so as to obtain the human body key point fusion features capable of representing the key points of the human body from multiple dimensions.
And a second step, the server determines a second key point score of the human key points based on the fusion characteristics of the human key points.
The server inputs the fusion characteristic of the human body key points into a human body prediction branch of the double-branch target and key point detection model, and outputs a second key point score of the human body key points.
And thirdly, determining that the target object does not contain the human body key point by the server under the condition that the second key point score does not meet the second score threshold condition.
The server compares the second key point score of the human body key point with a second score threshold. When the second key point score is smaller than the second score threshold, the server determines that the second key point score of the human body key point does not meet the second score threshold condition, and thus determines that the target object does not contain the human body key point; conversely, when the second key point score is greater than or equal to the second score threshold, the server determines that the second key point score of the human body key point meets the second score threshold condition, and thus determines that the target object contains the human body key point. The second score threshold may be set by a technician, and is not specifically limited in the embodiments of the present disclosure.
Fourth, the server determines that the human body of the target object is incomplete under the condition that the target object does not contain the human body key point.
After each human body key point is processed by the method, the server can determine the inclusion condition of each human body key point in the target object. When the target object contains all the human body key points, determining that the human body of the target object is complete; and when the target object does not contain any human body key point, determining that the human body of the target object is incomplete.
4034. And under the condition that at least one of the face or the human body of the target object is incomplete, the server determines that the key point detection result is that the target object is incomplete.
When it is determined based on the key point detection result that the face of the target object is incomplete, that the human body is incomplete, or that both the face and the human body are incomplete (that is, at least one of the face or the human body of the target object is incomplete), the server determines that the key point detection result is that the target object is incomplete, and further determines the integrity of the cover picture to be identified based on the subsequent step 404. It should be noted that a cover picture typically includes at least one object; if some of the objects are complete and some are incomplete, the key point detection result is likewise determined to be incomplete.
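Steps 4031 to 4034 can be summarized in plain Python as follows; this is a minimal sketch assuming the per-keypoint scores come from a model like the one sketched above, and the two threshold values are hypothetical placeholders for values a technician would set:

```python
# Illustrative aggregation of per-keypoint scores into a keypoint detection result.
FACE_SCORE_THRESHOLD = 0.5   # hypothetical first score threshold
BODY_SCORE_THRESHOLD = 0.5   # hypothetical second score threshold

def keypoint_detection_result(face_scores, body_scores):
    # A key point is contained only if its score meets the threshold condition.
    face_complete = all(s >= FACE_SCORE_THRESHOLD for s in face_scores)
    body_complete = all(s >= BODY_SCORE_THRESHOLD for s in body_scores)
    # The target object is incomplete if at least one of face or body is incomplete.
    return "complete" if (face_complete and body_complete) else "incomplete"
```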
Further, if both the face and the human body of the target object are complete, the target object is determined to be complete. In the case that the target object in the cover picture to be identified is complete, the server determines that the cover picture to be identified is complete, the integrity detection process for the cover picture to be identified ends, and step 404 is no longer executed.
In addition, the above steps 4031 to 4034 are described by taking the target object being a person as an example, where the integrity of the target object is determined by performing key point detection on the target object. When the target object is a non-person object such as an automobile or a garment, the outline of the target object may be detected when determining its integrity: if the outline of the target object is complete, the target object is determined to be complete; otherwise, if the outline is incomplete, the target object is determined to be incomplete. Of course, other detection modes besides outline-based detection of the target object may also be adopted, which are not described here.
Fig. 6 shows the prediction flow of the dual-branch target and key point detection model. Referring to fig. 6, after a cover picture is input into the dual-branch target and key point detection model, 68 face key points and 17 human body key points of the person in the cover picture are extracted based on the backbone network, the feature encoding module is adopted to perform multi-scale feature fusion on each face key point and each human body key point to obtain the fusion features of each face key point and the fusion features of each human body key point, and then the face prediction branch and the human body prediction branch are adopted to predict respectively. For each face key point, the face key point is input into the face prediction branch, and a first key point score of the face key point is output; if the first key point score meets the first score threshold condition, it is determined that the person in the cover picture contains the face key point; otherwise, the person does not contain the face key point, and the face key point is missing. For each human body key point, the human body key point is input into the human body prediction branch, and a second key point score of the human body key point is output; if the second key point score meets the second score threshold condition, it is determined that the person in the cover picture contains the human body key point; otherwise, the person does not contain the human body key point, and the human body key point is missing.
The embodiment of the disclosure adopts the double-branch target and key point detection model to detect the integrity of the target object in the cover picture to be identified, and has the following advantages:
on the one hand, the target detection and the key point detection can share the feature extraction and feature fusion module, so that the calculation amount of the model is greatly reduced.
On the other hand, for key point prediction, the model is not required to predict the positions of the key points; instead, for the specific task of target object integrity detection, the model only predicts whether each key point exists in the picture. This processing mode reduces the fitting difficulty of the model and improves the flexibility and convenience of using the prediction results in business scenes. Adopting the dual-branch target and key point detection model further improves the accuracy of predicting the degree of person incompleteness over existing models, and makes flexible application in business scenes convenient.
404. And under the condition that the key point detection result is that the target object is incomplete, the server determines a semantic application scene corresponding to the cover picture to be identified based on the cover picture to be identified and the title information of the content item to which the cover picture to be identified belongs.
In an actual application scene, the integrity of the cover picture depends not only on the integrity of the object in the physical sense; the semantic application scene of the content item in which the cover picture is located must also be considered. This is described below with reference to fig. 7, 8 and 9. The persons in the three cover pictures of fig. 7, 8 and 9 are all incomplete, so according to the detection mode of the related art the cover pictures in fig. 7, 8 and 9 would all be considered incomplete; however, further analysis in combination with the titles of fig. 7, 8 and 9 shows that the incompleteness of the persons does not affect the expression of the semantics. For example, the title of fig. 7 reads "As the sales mainstay of its department, is the Dongfeng Nissan Qijun's safety dependable?"; from the title content it can be seen that the semantic application scene of fig. 7 is a person non-subject scene, and in a person non-subject scene the incompleteness of the person does not affect the expression of the context semantics, so the cover picture shown in fig. 7 should not be considered an incomplete cover picture. As another example, the title of fig. 8 reads "How should a jacket be matched to look more elegant? A one-piece dress highlights elegance and feminine charm!"; from the title content it can be seen that the semantic application scene of fig. 8 is a jacket close-up scene, and in a jacket close-up scene the incompleteness of the person does not affect the expression of the context semantics, so the cover picture shown in fig. 8 should not be considered an incomplete cover picture. For another example, the title of fig. 9 reads "Draw an elegant eye line in one stroke"; from the title content it can be seen that the semantic application scene of fig. 9 is an eye close-up scene, and in an eye close-up scene the incompleteness of the person does not affect the expression of the context semantics, so the cover picture shown in fig. 9 should not be considered an incomplete cover picture.
In order to improve the recognition accuracy of the integrity of the cover picture to be identified, when it is determined based on the key point detection result that the target object is incomplete, the server acquires the title information of the content item to which the cover picture to be identified belongs, and then invokes a semantic application scene recognition model to recognize the cover picture to be identified and the title information, obtaining the semantic application scene corresponding to the cover picture to be identified. The semantic application scene recognition model is used for recognizing the semantic application scene of a cover picture based on the cover picture and its corresponding title information. The semantic application scene recognition model is trained on samples labeled under actual semantic application scenes; for details, refer to the training method described in the following embodiments, which is not repeated here.
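Purely as an illustration, the server-side invocation might look like the sketch below; the model interface, the preprocessing, and the candidate scene label set are assumptions of the example, not part of this disclosure:

```python
# Hypothetical invocation of the semantic application scene recognition model.
# scene_model is assumed to be a multimodal (cover picture + title) classifier
# fine-tuned from the multi-modal picture pre-training model.
SCENE_LABELS = ["head close-up", "eye close-up", "jacket close-up",
                "person non-subject", "multi-person", "other"]

def recognize_scene(scene_model, cover_image, title_text):
    # The model consumes the cover picture and the title information of its
    # content item and returns one score per candidate semantic application scene.
    logits = scene_model(images=cover_image, texts=title_text)
    return SCENE_LABELS[int(logits.argmax())]
```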
405. And the server determines the integrity of the cover picture to be identified in the semantic application scene based on the key point detection result and the semantic application scene.
The server determines the integrity of the target object in the semantic application scene based on the key point detection result. When the target object is determined to be complete in the semantic application scene based on the key point detection result, the cover picture to be identified is determined to be complete in the semantic application scene; when the target object is determined to be incomplete in the semantic application scene based on the key point detection result, the cover picture to be identified is determined to be incomplete in the semantic application scene. In fact, whether the target object is complete in the semantic application scene mainly depends on whether the content of the target object that needs to be highlighted in the semantic application scene is complete: if that content is determined to be complete based on the key point detection result, the target object is determined to be complete in the semantic application scene; if that content is determined to be incomplete based on the key point detection result, the target object is determined to be incomplete in the semantic application scene.
In another embodiment of the present disclosure, when the server determines the integrity of the target object in the semantic application scene based on the key point detection result, the following cases are distinguished:
Case 1: the semantic application scene is a human body part close-up scene
The human body part close-up scene comprises any one of a head close-up scene, a neck close-up scene, a collarbone close-up scene, an arm close-up scene, an upper body close-up scene, a leg close-up scene, a foot close-up scene and the like. When the human body part that needs to be featured in close-up is determined to be complete based on the key point detection result, the server determines that the target object is complete in the semantic application scene. For example, the semantic application scene of the cover picture shown in fig. 9 is an eye close-up scene; if the eyes of the person in fig. 9 are determined to be complete based on the key point detection result, the cover picture of fig. 9 is determined to be complete.
Case 2: the semantic application scene is a clothing display scene
The clothing display scene includes any one of a jacket display scene, a lower garment display scene, a shoe display scene and the like. Under the condition that the semantic application scene is a clothing display scene, the server determines the clothing to be displayed in the clothing display scene, and further determines whether that clothing is complete based on the key point detection result; when the clothing to be displayed is determined to be complete, the server determines that the target object is complete in the semantic application scene. For example, the semantic application scene of the cover picture shown in fig. 8 is a jacket close-up scene; if the jacket of the person in fig. 8 is determined to be complete based on the key point detection result, the cover picture of fig. 8 is determined to be complete.
Case 3: the semantic application scene is a specific scene
The specific scene comprises any one of an eating scene, a writing and drawing scene, a musical instrument playing scene, a handcraft making scene, a food making scene, a person non-subject scene, a multi-person scene, a non-main-character scene and the like. Under the condition that the semantic application scene is a specific scene, the server determines the scene elements to be represented in the specific scene, and further determines whether those scene elements are complete based on the key point detection result; when the scene elements to be represented are determined to be complete, the server determines that the target object is complete in the semantic application scene. For example, the semantic application scene of the cover picture shown in fig. 7 is a person non-subject scene; if the car in fig. 7 is determined to be complete based on the key point detection result, the cover picture of fig. 7 is determined to be complete.
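As a sketch of how the three cases above might be combined, the following illustrative mapping checks only the key points that the recognized semantic application scene requires to be highlighted; the scene names, the keypoint index values, and the fallback rule are all hypothetical:

```python
# Hypothetical mapping from semantic application scene to the key point indices
# that must be contained for the target object to be complete in that scene.
REQUIRED_KEYPOINTS = {
    "eye close-up": {"face": [36, 39, 42, 45]},    # illustrative eye landmark ids
    "jacket close-up": {"body": [5, 6, 11, 12]},   # illustrative shoulder/hip ids
    "person non-subject": {},                      # person completeness irrelevant
}

def complete_in_scene(scene, face_contained, body_contained):
    # face_contained / body_contained: boolean lists from the score thresholding.
    required = REQUIRED_KEYPOINTS.get(scene)
    if required is None:
        # Unknown scene: fall back to requiring the full face and human body.
        return all(face_contained) and all(body_contained)
    face_ok = all(face_contained[i] for i in required.get("face", []))
    body_ok = all(body_contained[i] for i in required.get("body", []))
    return face_ok and body_ok
```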
According to the embodiment of the disclosure, the result of semantic application scene recognition is added on the basis of target detection and key point detection, and finally the key point detection result and the semantic scene recognition result are combined for detection, so that the misjudgment rate is reduced, and the accuracy rate and recall rate of cover picture integrity recognition are improved.
Fig. 10 is a flowchart illustrating the overall identification flow of the method for identifying cover pictures provided by the embodiment of the present disclosure. Referring to fig. 10, for cover pictures of various specifications to be identified, object detection is performed on each cover picture to identify the target object in it, and key point detection is then performed on the target object to obtain the key point detection result. When it is determined based on the key point detection result that the target object (a person) in a cover picture is incomplete, the cover picture is screened out, the screened cover picture and its title are input into the semantic scene recognition model trained from the multi-modal pre-training model, and the semantic application scene of the cover picture is output; the integrity of the cover picture is then determined by combining the key point detection result and the semantic application scene of the cover picture.
According to the method provided by the embodiment of the disclosure, whether the target object in the cover picture to be identified is complete is determined by performing key point detection on the target object in the cover picture to be identified. Since the key points are the main points forming the target object, the key point detection method improves the definition of target object integrity in the physical sense, and the detection process does not need to attend to other unimportant points on the target object, which improves the detection speed. When determining the integrity of the cover picture to be identified, whether the object in the cover picture is complete is no longer used as the sole criterion of cover picture integrity; instead, the determination is made by combining the key point detection result of the target object with the semantic application scene of the cover picture to be identified.
The embodiment of the present disclosure provides a training method for the semantic application scene recognition model. Taking the method being executed by a server as an example, and referring to fig. 11, the method flow provided by the embodiment of the present disclosure includes:
1101. The server acquires a plurality of first image-text training samples.
The first image-text training sample is an image-text pair comprising a cover picture and a corresponding title. Each first image-text training sample is marked with a semantic application scene label, and the semantic application scene labels are marked according to actual semantic application scenes. For example, a semantic application scenario is shown in fig. 12, and includes a human body part close-up scenario, a specific scenario, a clothing presentation scenario, and others. In order to meet the actual service scene requirement, when the server acquires the first image-text training sample, the server can acquire corresponding cover pictures and titles according to semantic application scene labels in an actual semantic application scene to form the first image-text training sample.
1102. The server trains the multi-modal picture pre-training model based on the plurality of first image-text training samples to obtain the semantic application scene recognition model.
In order to improve the training speed of the semantic application scene recognition model and reduce the computing resources consumed in the model training process, the embodiment of the disclosure adopts the plurality of first image-text training samples marked with semantic application scene labels to fine-tune the model parameters of the multi-modal picture pre-training model to obtain the semantic application scene recognition model. The multi-modal picture pre-training model is used for mining the internal relation between pictures and words, and its training process is as follows:
The first step: the server acquires a plurality of second image-text training samples.

The second image-text training samples are image-text pairs comprising cover pictures and corresponding titles, and each second image-text training sample is marked with a label. The sources of the plurality of second image-text training samples include at least one of the following:
First source: cover pictures and title information uploaded by content producers
The server acquires the cover picture and the title information synchronously uploaded when a content producer uploads a content item, and forms a second image-text training sample from them. If the content producer did not edit title information when uploading the content item, the server may use OCR (Optical Character Recognition) to extract text information from the cover picture or another picture as the title information of the content item; if the content producer did not specify a cover picture when uploading the content item, the server may select a picture matching the title information as the cover picture of the content item. The cover picture and the title information are strongly correlated and can be regarded as weak supervision information; they do not need special manual labeling and can be automatically collected from the information stream link, whether they are the cover picture and title information selected by the content producer or those acquired by the server based on the content item.
Second source: existing image-text pair databases
The image-text pairs in existing image-text pair databases are used as second image-text training samples. Existing image-text pair databases include CC12M, CC3M and the like. CC12M contains a large number of image-text pairs for training vision-and-language models. CC3M is a dataset containing image URL (Uniform Resource Locator) and caption pairs for the training and evaluation of machine-learning image captioning systems; the captions in these datasets are weakly correlated descriptions automatically collected from the web and cleaned by filters. Existing image-text pair databases also include datasets with classification and target detection labels, for example the Flickr30K and COCO datasets; the image-text pairs formed by the pictures in the classification data and their labels serve as second image-text training samples.
Third source: pictures crawled by a crawler system
Entity words corresponding to the content statistical tags distributed in the information flow are used as search terms, pictures related to the search terms are retrieved from search applications and vertical websites, and the image-text pairs formed by the retrieved pictures and their corresponding search terms are used as second image-text training samples.
According to the embodiment of the disclosure, the second image-text training samples are acquired through multiple channels, which increases the generalization of the second image-text training samples and improves the precision of the multi-modal picture pre-training model; when the semantic application scene recognition model is trained based on the multi-modal picture pre-training model, a semantic application scene recognition model with higher precision can therefore be trained with fewer first image-text training samples.
The second step: the server trains the initial multi-modal pre-training model based on the plurality of second image-text training samples and a plurality of pre-training tasks to obtain the trained multi-modal picture pre-training model.
Before training the initial multi-modal pre-training model based on the plurality of second image-text training samples and the plurality of pre-training tasks, the server needs to convert the image data in each second image-text training sample into an image feature sequence and convert the text data into a text feature sequence. It then trains the initial multi-modal pre-training model based on the image feature sequence and text feature sequence corresponding to each second image-text training sample and the plurality of pre-training tasks, and takes the model obtained when the plurality of pre-training tasks are completed as the trained multi-modal picture pre-training model. When converting the image data in a second image-text training sample into an image feature sequence, a CNN (Convolutional Neural Network) model may be used, or a Transformer model may be used. When a Transformer model is used, the picture in the second image-text training sample is cut into a plurality of picture blocks, and the image feature sequence of each picture block is then obtained; this processing mode computes faster. When converting the text data in a second image-text training sample into a text feature sequence, a Transformer model may be used; this is not specifically limited in the embodiments of the present disclosure. Fig. 13 shows a Transformer structure, which includes an encoder and a decoder and is mainly composed of self-attention and feed-forward neural network layers.
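As an illustration of the picture-block conversion just described, the following is a minimal ViT-style patch-embedding sketch in PyTorch; the patch size, the embedding dimension, and the use of a strided convolution for patch extraction are assumptions of the example rather than the exact design of this disclosure:

```python
# Hypothetical patch embedding: cut the picture into blocks and project each
# block into one element of the image feature sequence.
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, embed_dim=768):
        super().__init__()
        # A stride=patch_size convolution is equivalent to slicing the image
        # into non-overlapping patches and linearly projecting each one.
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.num_patches = (img_size // patch_size) ** 2

    def forward(self, images):                  # (B, 3, 224, 224)
        x = self.proj(images)                   # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)     # (B, 196, 768) feature sequence
```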
The plurality of pre-training tasks include a training task of an MLM (Masked Language Model) model, a training task of picture-text matching, a training task of picture-text contrastive learning, and a training task of picture classification.
The main purpose of the MLM training task is to predict the masked words in the text sequence according to the picture content and the context information of the text sequence. For example, if the caption of the picture shown in fig. 14 is "Do you know why the two cats are quarreling", some of its words are masked during training, and the multi-modal picture pre-training model is pre-trained by predicting the probability that these masked words occur, so that it learns the ability to predict the subject expressed by the picture content. During training, the positions and the number of the masked words in the text sequence are random.
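A minimal sketch of the random masking step is shown below, assuming a generic tokenizer with a "[MASK]" token; the 15% mask ratio is an illustrative convention borrowed from BERT-style MLM training, not a value stated in this disclosure:

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_ratio=0.15):
    # Both the positions and the number of masked words are random.
    n = max(1, int(len(tokens) * mask_ratio))
    positions = set(random.sample(range(len(tokens)), n))
    masked = [MASK_TOKEN if i in positions else t for i, t in enumerate(tokens)]
    labels = {i: tokens[i] for i in positions}  # targets the model must predict
    return masked, labels

# Example: mask_tokens(["two", "cats", "are", "quarreling"]) might return
# (["two", "[MASK]", "are", "quarreling"], {1: "cats"})
```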
The training task of picture-text matching mainly aims to construct a large number of image-text training samples, increase the generalization of the training samples, and judge at the global level whether the picture and the text in an image-text training sample match. This training task constructs a large number of image-text training samples by replacing the picture corresponding to a text with another picture, or replacing the text corresponding to a picture with another text. When judging whether the picture and the text in an image-text training sample match, the text flag bit of the text in the sample is acquired, the feature map corresponding to the picture in the sample is mapped into a binary value, and whether the picture and the text match is judged according to the text flag bit and the binary value corresponding to the picture. When mapping the feature map corresponding to the picture into a binary value, a linear ITM (Image-Text Matching) head may be used.
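For illustration, a sketch of such a matching head and its negative-pair construction follows; the feature width, the element-wise sum used as a toy fusion of image and text features, and the shuffling trick for building mismatched pairs are assumptions of the example:

```python
import torch
import torch.nn as nn

itm_head = nn.Linear(768, 2)   # hypothetical linear ITM head: match / no match

def build_itm_batch(image_feats, text_feats):
    # Positive pairs: aligned image and text features.
    # Negative pairs: replace each text with another sample's text by shuffling.
    shuffled = text_feats[torch.randperm(text_feats.size(0))]
    fused = torch.cat([image_feats + text_feats,     # toy fusion: element-wise sum
                       image_feats + shuffled], dim=0)
    labels = torch.cat([torch.ones(len(image_feats)),
                        torch.zeros(len(image_feats))]).long()
    return itm_head(fused), labels                   # logits for matched / not
```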
The training task of picture-text contrastive learning is mainly used to help the multi-modal picture pre-training model learn implicit representations of pictures and texts, and is widely used in unsupervised representation learning in the image field. In the image field, picture augmentation operations include rotation, cropping, Gaussian noise, masking, color conversion, filters and the like; in the text field, text augmentation operations include back-translation, character insertion, character deletion, text comparison and the like. The training task of picture-text contrastive learning may perform contrastive learning based on a loss function, which may be a contrastive loss function (Contrastive Loss) or the like. The expression and principle of the contrastive loss function are as follows:
L = \frac{1}{2N}\sum_{n=1}^{N}\left[y_n d_n^2 + (1-y_n)\max(margin - d_n, 0)^2\right], where d_n = \|a_n - b_n\|_2 is the Euclidean distance between the features of the two samples a_n and b_n, y is the label of whether the two samples match (y = 1 means the two samples are similar or matched, y = 0 means they do not match), and margin is a set threshold. The contrastive loss function is mainly used in dimension reduction, and the principle of contrastive learning is as follows: samples that are originally similar remain similar in the feature space after dimension reduction (feature extraction), while samples that are originally dissimilar remain dissimilar in the feature space after dimension reduction.
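A minimal PyTorch implementation of this contrastive loss, as a sketch (the margin value is illustrative):

```python
import torch

def contrastive_loss(a, b, y, margin=1.0):
    # a, b: (N, D) feature batches; y: (N,) with 1 = matched pair, 0 = unmatched.
    d = torch.norm(a - b, p=2, dim=1)                           # Euclidean distance d_n
    positive = y * d.pow(2)                                     # pull matched pairs together
    negative = (1 - y) * torch.clamp(margin - d, min=0).pow(2)  # push unmatched apart
    return (positive + negative).mean() / 2                     # (1/2N) * sum over the batch
```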
The training task of picture classification mainly learns the category to which the picture content belongs. Domain data and supervision signals from open-source domains may be added during training; for example, typically, the classification and label information of a picture are used as supervision signals, and the labels corresponding to predicted pictures are used as weak supervision signals.
Fig. 14 shows the training process of the multi-modal picture pre-training model. Referring to fig. 14, a plurality of pictures are acquired; if any picture is not marked with a title or a label, OCR is used to extract the character information in the picture as its title or label. A Transformer is then used to extract the image feature sequence of each picture and the text feature sequence corresponding to its title or label, and the multi-modal picture pre-training model is trained based on the image feature sequences and text feature sequences corresponding to the pictures and the four training tasks of the MLM model, picture-text matching, picture-text contrastive learning, and picture classification.
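Purely as an illustration of how the four pre-training tasks might be combined in a single training step, the sketch below sums the four task losses with equal weights; the equal weighting and the loss-callable interface are assumptions, since the disclosure does not specify how the tasks are balanced:

```python
def pretraining_step(model, batch, task_losses, optimizer):
    # task_losses: dict of callables, one per pre-training task.
    img_seq, txt_seq, labels = batch
    total = (task_losses["mlm"](model, img_seq, txt_seq)
             + task_losses["itm"](model, img_seq, txt_seq)
             + task_losses["contrastive"](model, img_seq, txt_seq)
             + task_losses["classification"](model, img_seq, labels))
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```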
Referring to fig. 15, an embodiment of the present disclosure provides a device for identifying a cover picture, including:
the detection module 1501 is configured to perform key point detection on a target object in a cover picture to be identified, to obtain a key point detection result of the target object;
A first determining module 1502, configured to determine, based on the cover picture to be identified and header information of a content item to which the cover picture to be identified belongs, a semantic application scene corresponding to the cover picture to be identified when the key point detection result is that the target object is incomplete;
the second determining module 1503 is configured to determine, based on the key point detection result and the semantic application scene, the integrity of the cover picture to be identified in the semantic application scene.
In another embodiment of the present disclosure, a detection module 1501 is configured to identify a plurality of face keypoints and a plurality of human body keypoints of a target object in a cover picture to be identified; determining the face integrity of the target object based on the plurality of face key points; determining the human body integrity of the target object based on the plurality of human body key points; and under the condition that at least one of the face or the human body of the target object is incomplete, determining the key point detection result as the incomplete target object.
In another embodiment of the present disclosure, a detection module 1501 is configured to fuse features of different dimensions of the face key points to obtain fusion features of the face key points; determining a first key point score of the face key point based on the fusion characteristic of the face key point, wherein the first key point score is used for representing the probability of the face key point contained in the target object; under the condition that the first key point score does not accord with the first score threshold value condition, determining that the target object does not contain the face key point; and under the condition that the target object does not contain the key points of the human face, determining that the human face of the target object is incomplete.
In another embodiment of the present disclosure, the detection module 1501 is configured to fuse features of different dimensions of key points of a human body to obtain fused features of the key points of the human body; determining a second key point score of the human key points based on the fusion characteristics of the human key points, wherein the second key point score is used for representing the probability of the human key points contained in the target object; under the condition that the second key point score does not meet a second score threshold condition, determining that the target object does not contain human key points; and under the condition that the target object does not contain the key points of the human body, determining that the human body of the target object is incomplete.
In another embodiment of the present disclosure, a first determining module 1502 is configured to invoke a semantic application scene recognition model to recognize a cover picture to be recognized and header information to obtain a semantic application scene corresponding to the cover picture to be recognized, where the semantic application scene recognition model is configured to recognize the semantic application scene of the cover picture based on the cover picture and the corresponding header information.
In another embodiment of the present disclosure, a second determining module 1503 is configured to determine the integrity of the target object in the semantic application scenario based on the keypoint detection result; under the condition that the target object is determined to be complete in the semantic application scene based on the key point detection result, determining that the cover picture to be identified is complete in the semantic application scene; and under the condition that the target object is determined to be incomplete in the semantic application scene based on the key point detection result, determining that the cover picture to be identified is incomplete in the semantic application scene.
In another embodiment of the present disclosure, the second determining module 1503 is configured to determine, when the semantic application scene is a body part feature scene, a body part to be feature in the body part feature scene, determine, based on a key point detection result, that the body part to be feature is complete, and determine that the target object is complete in the semantic application scene; under the condition that the semantic application scene is a clothing display scene, determining clothing to be displayed in the clothing display scene, determining that the clothing to be displayed is complete based on a key point detection result, and determining that a target object is complete in the semantic application scene; and under the condition that the semantic application scene is a specific scene, determining scene elements to be represented in the specific scene, determining that the scene elements to be represented are complete based on the key point detection result, and determining that the target object is complete in the semantic scene.
In another embodiment of the present disclosure, the body part close-up scene includes any one of a head close-up scene, a neck close-up scene, a collarbone close-up scene, an arm close-up scene, an upper body close-up scene, a leg close-up scene, or a foot close-up scene;
the clothing display scene comprises any one of a coat display scene, a lower coat display scene or a shoe display scene;
The specific scene includes at least one of a eating scene, a painting scene, a musical instrument playing scene, a hand-made scene, a food-made scene, a character non-body scene, a multi-person scene, or a non-primary character scene.
In summary, the device provided by the embodiment of the disclosure determines whether the target object in the cover picture to be identified is complete by performing key point detection on the target object. Since the key points are the main points forming the target object, the key point detection method improves the definition of target object integrity in the physical sense, and the detection process does not need to attend to other unimportant points on the target object, which improves the detection speed. When determining the integrity of the cover picture to be identified, whether the object in the cover picture is complete is no longer used as the sole criterion of cover picture integrity; instead, the determination is made by combining the key point detection result of the target object with the semantic application scene of the cover picture to be identified.
Fig. 16 is a server for cover picture identification, according to an exemplary embodiment. Referring to fig. 16, server 1600 includes a processing component 1622 that further includes one or more processors and memory resources represented by memory 1632 for storing instructions, such as application programs, executable by processing component 1622. The application programs stored in memory 1632 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1622 is configured to execute instructions to perform the functions performed by the server in the above-described cover picture identification method.
The server 1600 may also include a power component 1626 configured to perform power management of the server 1600, a wired or wireless network interface 1650 configured to connect the server 1600 to a network, and an input/output (I/O) interface 1658. The server 1600 may operate an operating system stored in memory 1632, for example Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
According to the server provided by the embodiment of the disclosure, whether the target object in the cover picture to be identified is complete is determined by performing key point detection on the target object in the cover picture to be identified. Since the key points are the main points forming the target object, the key point detection method improves the definition of target object integrity in the physical sense, and the detection process does not need to attend to other unimportant points on the target object, which improves the detection speed. When determining the integrity of the cover picture to be identified, whether the object in the cover picture is complete is no longer used as the sole criterion of cover picture integrity; instead, the determination is made by combining the key point detection result of the target object with the semantic application scene of the cover picture to be identified.
Embodiments of the present disclosure provide a computer-readable storage medium having at least one program code stored therein, the at least one program code being loaded and executed by a processor to implement the method of identifying a cover picture. The computer-readable storage medium may be non-transitory. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a CD-ROM (Compact Disc Read-Only Memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.
The disclosed embodiments provide a computer program product including computer program code stored in a computer-readable storage medium, a processor of a server reading the computer program code from the computer-readable storage medium, the processor executing the computer program code to cause the server to perform a method of recognizing a cover image.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing descriptions are merely preferred embodiments of the present disclosure and are not intended to limit the disclosure; any modifications, equivalent replacements, improvements and the like made within the spirit and principles of the disclosure shall fall within the protection scope of the disclosure.

Claims (12)

1. A method for identifying a cover image, the method comprising:
performing key point detection on a target object in a cover picture to be identified to obtain a key point detection result of the target object;
under the condition that the key point detection result is that the target object is incomplete, determining a semantic application scene corresponding to the cover picture to be identified based on the cover picture to be identified and the title information of the content item to which the cover picture to be identified belongs;
and determining the integrity of the cover picture to be identified in the semantic application scene based on the key point detection result and the semantic application scene.
2. The method of claim 1, wherein the performing the keypoint detection on the target object in the cover picture to be identified to obtain the keypoint detection result of the target object includes:
identifying a plurality of face key points and a plurality of human body key points of a target object in the cover picture to be identified;
Determining the face integrity of the target object based on the plurality of face key points;
determining the human body integrity of the target object based on the plurality of human body keypoints;
and under the condition that at least one of the face or the human body of the target object is incomplete, determining that the key point detection result is that the target object is incomplete.
3. The method of claim 2, wherein the determining the face integrity of the target object based on the plurality of face keypoints comprises:
fusing the features of the face key points in different dimensions to obtain fused features of the face key points;
determining a first key point score of the face key point based on the fusion characteristic of the face key point, wherein the first key point score is used for representing the probability of the face key point contained in the target object;
determining that the target object does not contain the face key points under the condition that the first key point score does not accord with a first score threshold condition;
and under the condition that the target object does not contain the face key points, determining that the face of the target object is incomplete.
4. The method of claim 2, wherein identifying the human integrity of the target object based on the plurality of human keypoints comprises:
Fusing the features of the human body key points in different dimensions to obtain fused features of the human body key points;
determining a second key point score of the human key points based on the fusion characteristics of the human key points, wherein the second key point score is used for representing the probability of the human key points contained in the target object;
determining that the target object does not contain the human body key points under the condition that the second key point score does not meet a second score threshold condition;
and under the condition that the target object does not contain the human body key points, determining that the human body of the target object is incomplete.
5. The method of claim 1, wherein the determining, based on the cover picture to be identified and header information of the content item to which the cover picture to be identified belongs, a semantic application scene corresponding to the cover picture to be identified includes:
invoking a semantic application scene recognition model to recognize the cover picture to be recognized and the title information to obtain a semantic application scene corresponding to the cover picture to be recognized, wherein the semantic application scene recognition model is used for recognizing the semantic application scene of the cover picture based on the cover picture and the corresponding title information.
6. The method of claim 1, wherein the determining the integrity of the cover picture to be identified in the semantic application scenario based on the keypoint detection result and the semantic application scenario comprises:
determining the integrity of the target object in the semantic application scene based on the key point detection result;
under the condition that the target object is determined to be complete in the semantic application scene based on the key point detection result, determining that the cover picture to be identified is complete in the semantic application scene;
and under the condition that the target object is determined to be incomplete in the semantic application scene based on the key point detection result, determining that the cover picture to be identified is incomplete in the semantic application scene.
7. The method of claim 6, wherein determining the integrity of the target object in the semantic application scenario based on the keypoint detection result comprises:
under the condition that the semantic application scene is a body part close-up scene, determining a body part needing close-up in the body part close-up scene, determining that the body part needing close-up is complete based on the key point detection result, and determining that the target object is complete in the semantic application scene;
Determining the clothes to be displayed in the clothes display scene under the condition that the semantic application scene is the clothes display scene, determining the completeness of the clothes to be displayed based on the key point detection result, and determining the completeness of the target object in the semantic application scene;
and under the condition that the semantic application scene is a specific scene, determining scene elements to be represented in the specific scene, determining that the scene elements to be represented are complete based on the key point detection result, and determining that the target object is complete in the semantic scene.
8. The method of claim 7, wherein the body part close-up scene comprises any one of a head close-up scene, a neck close-up scene, a collarbone close-up scene, an arm close-up scene, an upper body close-up scene, a leg close-up scene, or a foot close-up scene;
the clothing display scene comprises any one of a coat display scene, a lower coat display scene or a shoe display scene;
the specific scene includes any one of a eating scene, a painting scene, a musical instrument playing scene, a hand-made scene, a food-made scene, a character non-body scene, a multi-person scene, or a non-main character scene.
9. An apparatus for identifying a cover image, the apparatus comprising:
the detection module is used for carrying out key point detection on a target object in the cover picture to be identified to obtain a key point detection result of the target object;
the first determining module is used for determining a semantic application scene corresponding to the cover picture to be identified based on the cover picture to be identified and the title information of the content item to which the cover picture to be identified belongs under the condition that the key point detection result is that the target object is incomplete;
and the second determining module is used for determining the integrity of the cover picture to be identified in the semantic application scene based on the key point detection result and the semantic application scene.
10. A server comprising a processor and a memory, wherein the memory stores at least one piece of program code that is loaded and executed by the processor to implement the method of identifying a cover image as claimed in any one of claims 1 to 8.
11. A computer-readable storage medium having stored therein at least one program code that is loaded and executed by a processor to implement the cover picture identification method of any one of claims 1 to 8.
12. A computer program product, characterized in that the computer program product comprises a computer program code, which is stored in a computer readable storage medium, from which a processor of a server reads the computer program code, which processor executes the computer program code, so that the server performs the method of recognizing a cover picture as claimed in any one of claims 1 to 8.
CN202210365407.9A 2022-04-07 2022-04-07 Method and device for identifying cover picture, server and storage medium Pending CN116958965A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210365407.9A CN116958965A (en) 2022-04-07 2022-04-07 Method and device for identifying cover picture, server and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210365407.9A CN116958965A (en) 2022-04-07 2022-04-07 Method and device for identifying cover picture, server and storage medium

Publications (1)

Publication Number Publication Date
CN116958965A true CN116958965A (en) 2023-10-27

Family

ID=88453362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210365407.9A Pending CN116958965A (en) 2022-04-07 2022-04-07 Method and device for identifying cover picture, server and storage medium

Country Status (1)

Country Link
CN (1) CN116958965A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination