CN114328990A - Image integrity identification method and device, computer equipment and storage medium

Publication number: CN114328990A
Application number: CN202111192347.7A
Applicant/Assignee: Tencent Technology (Shenzhen) Co., Ltd.
Inventor: Liu Gang (刘刚)
Original language: Chinese (zh)
Legal status: Pending
Abstract

The present application relates to an image integrity recognition method, apparatus, computer device, storage medium, and program product. The method comprises the following steps: obtaining multi-modal information of the multimedia content, wherein the multi-modal information comprises an alternative cover image and text description information; performing text feature extraction on the text description information to obtain text feature information, and performing image feature extraction on the alternative cover image to obtain image feature information, wherein the image feature information is used for describing image attribute features and image character features, and the image character features comprise at least one of human face key features and human body boundary region features; and performing integrity detection on the alternative cover image according to the text feature information and the image feature information to obtain an image integrity recognition result corresponding to the alternative cover image. By adopting the method, the integrity of the cover image can be accurately recognized.

Description

Image integrity identification method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an image integrity recognition method and apparatus, a computer device, and a storage medium.
Background
With the development of computer technology and the gradual rise of self-media platforms, the volume of multimedia content, including image-text content (including albums) and video content, distributed through media platforms is increasing at an exponential rate. For such multimedia content, the most central factors a user sees are its title, cover image, and author. It is therefore necessary to select cover images that preserve relevance to the multimedia content and completeness while ensuring basic quality (i.e., filtering out pictures apparently unsuitable for covers, such as blurry, anime-style (two-dimensional), nauseating or uncomfortable, and horror pictures).
In the conventional technology, the cover image of multimedia content is selected mainly by the media producer when publishing the content, either by manually matching an image or by screening with a machine model; the dimensions considered are mainly picture quality, such as sharpness and the filtering of low-quality pictures, and the cover image is then cropped into various specifications according to the display scene.
However, cropping the cover image into various specifications inevitably produces incomplete cover images, and some cover images are incomplete even before cropping. The conventional technology does not consider the possible incompleteness of the cover image and directly uses the cropped cover image, so some incomplete cover images fail to meet users' requirements.
Disclosure of Invention
In view of the above, it is necessary to provide an image integrity recognition method, apparatus, computer device, storage medium, and program product capable of accurately recognizing the integrity of a cover image.
An image integrity recognition method, the method comprising:
obtaining multi-modal information of the multimedia content, wherein the multi-modal information comprises an alternative cover image and text description information;
performing text feature extraction on the text description information to obtain text feature information, and performing image feature extraction on the alternative cover image to obtain image feature information, wherein the image feature information is used for describing image attribute features and image character features, and the image character features comprise at least one of human face key features and human body boundary region features; and
performing integrity detection on the alternative cover image according to the text feature information and the image feature information to obtain an image integrity recognition result corresponding to the alternative cover image.
An image integrity recognition device, the device comprising:
an acquisition module, configured to acquire multi-modal information of the multimedia content, wherein the multi-modal information comprises an alternative cover image and text description information;
a feature extraction module, configured to perform text feature extraction on the text description information to obtain text feature information, and perform image feature extraction on the alternative cover image to obtain image feature information, wherein the image feature information is used for describing image attribute features and image character features, and the image character features comprise at least one of human face key features and human body boundary region features; and
a detection module, configured to perform integrity detection on the alternative cover image according to the text feature information and the image feature information to obtain an image integrity recognition result corresponding to the alternative cover image.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
obtaining multi-modal information of the multimedia content, wherein the multi-modal information comprises an alternative cover image and text description information;
performing text feature extraction on the text description information to obtain text feature information, and performing image feature extraction on the alternative cover image to obtain image feature information, wherein the image feature information is used for describing image attribute features and image character features, and the image character features comprise at least one of human face key features and human body boundary region features; and
performing integrity detection on the alternative cover image according to the text feature information and the image feature information to obtain an image integrity recognition result corresponding to the alternative cover image.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
obtaining multi-modal information of the multimedia content, wherein the multi-modal information comprises an alternative cover image and text description information;
performing text feature extraction on the text description information to obtain text feature information, and performing image feature extraction on the alternative cover image to obtain image feature information, wherein the image feature information is used for describing image attribute features and image character features, and the image character features comprise at least one of human face key features and human body boundary region features; and
performing integrity detection on the alternative cover image according to the text feature information and the image feature information to obtain an image integrity recognition result corresponding to the alternative cover image.
A computer program product comprising a computer program which when executed by a processor performs the steps of:
obtaining multi-modal information of the multimedia content, wherein the multi-modal information comprises an alternative cover image and text description information;
performing text feature extraction on the text description information to obtain text feature information, and performing image feature extraction on the alternative cover image to obtain image feature information, wherein the image feature information is used for describing image attribute features and image character features, and the image character features comprise at least one of human face key features and human body boundary region features; and
performing integrity detection on the alternative cover image according to the text feature information and the image feature information to obtain an image integrity recognition result corresponding to the alternative cover image.
According to the image integrity recognition method, apparatus, computer device, storage medium, and program product, text feature extraction is performed on the text description information in the multi-modal information of the multimedia content to obtain text feature information; image feature extraction is performed on the alternative cover image in the multi-modal information to obtain image feature information describing image attribute features and image character features; integrity detection of the alternative cover image is performed according to the text feature information and the image feature information; and the image integrity recognition result corresponding to the alternative cover image is determined. Text features, image attribute features, and image character features can thus be combined to realize semantic-based image integrity detection and accurately recognize the integrity of the cover image.
Drawings
FIG. 1 is a schematic diagram of an incomplete picture in one embodiment;
FIG. 2 is a schematic diagram of an incomplete picture in another embodiment;
FIG. 3 is a diagram of an exemplary embodiment of an application environment for the image integrity recognition method;
FIG. 4 is a flowchart illustrating an image integrity recognition method according to one embodiment;
FIG. 5 is a diagram illustrating face framing in one embodiment;
FIG. 6 is a schematic diagram of a part grouping network in one embodiment;
FIG. 7 is a diagram of a trained multi-modal fusion model network architecture in one embodiment;
FIG. 8 is a diagram illustrating a system architecture corresponding to an image integrity recognition method in one embodiment;
FIG. 9 is a block diagram showing the structure of an image integrity recognition apparatus according to an embodiment;
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive discipline of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, giving machines the capabilities of perception, reasoning, and decision-making. The present application relates to artificial intelligence computer vision technology, natural language processing, and machine learning.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In the age of rapid internet development, as the threshold of multimedia content production is lowered, the upload volume of multimedia content grows exponentially. This multimedia content comes from self-media authors and organizations under various content creation mechanisms, such as PGC (Professional Generated Content), UGC (User Generated Content), and MCN (Multi-Channel Network). Whether for image-text content (including albums) or video content, the most central factors a user sees are the title, cover image, and author of the content. At present, the cover image of multimedia content is selected mainly by the media producer when publishing, either by manually matching an image or by machine-model screening; the dimensions considered are mainly picture quality, such as sharpness and the filtering of low-quality pictures, and the cover image is then cropped into various specifications according to the display scene. Since the author configures only one original cover image, without considering that the multimedia content will subsequently be displayed in different scenes, the original cover image must be cropped to various specifications and sizes, so cropped or incomplete cover images cannot be avoided, and some cover images are incomplete even before cropping. Meanwhile, the same multimedia content has multiple display scenes: a vertical image sometimes needs to be cropped into a horizontal cover image; in information-flow Feeds, the same content is usually displayed as a single image, a small image, a large image, or three images; and in the information-flow interest feed, multiple images are also displayed in a nine-square grid. The images in all of these scenes must remain complete. The conventional technology does not consider the possible incompleteness of the cover image and directly uses the cropped cover image, so some incomplete cover images fail to meet users' requirements. Adding an integrity check of the cover image of multimedia content can accommodate a variety of different scenes, and presentation should be avoided when no suitable image exists. The problem of incomplete cover images is currently serious, and machines lack the capability to recognize incompleteness during image selection and screenshot extraction, so the capability to detect and recognize incomplete images is urgently needed.
As illustrated in FIG. 1, an incomplete picture may refer to an incomplete human body, that is, the main person in the image is incomplete, including a missing head (110), a missing upper body (120), only a partial human body (130), and so on. As shown in FIG. 2, an incomplete picture may also refer to an incomplete face, that is, the main face in the image is incomplete, such as missing eyebrows (210) or more than half of the face missing (220).
In addition, it should be noted that the cover image, like the packaging of the content, determines the user's first impression of the content and directly affects the click conversion and experience of the end user. For example, for video content, the actual content, loading speed, and cover relevance are the user's core requirements. For image-text content, how well the cover image correlates with the actual content directly influences the distribution effect of the multimedia content. Accurately recognizing the integrity of the cover image allows the distribution effect of the multimedia content to be fully realized and meets the user's requirements for the cover image.
Based on the above, the image integrity recognition method provided by the present application combines text features, image attribute features, and image character features to realize semantic-based image integrity detection and accurately recognize the integrity of the cover image.
The image integrity recognition method provided by the present application can be applied to the application environment shown in FIG. 3, in which the terminal 302 communicates with the server 304 via a network. When a media producer uploads multimedia content to be published to the server 304 through the terminal 302, the server 304 acquires multi-modal information of the multimedia content, the multi-modal information comprising an alternative cover image and text description information; performs text feature extraction on the text description information to obtain text feature information; performs image feature extraction on the alternative cover image to obtain image feature information, where the image feature information describes image attribute features and image character features, and the image character features comprise at least one of human face key features and human body boundary region features; and performs integrity detection on the alternative cover image according to the text feature information and the image feature information to obtain an image integrity recognition result corresponding to the alternative cover image. When the image integrity recognition result is that the image is complete, the cover image of the multimedia content is obtained from the alternative cover image, and the multimedia content is published with that cover image. The terminal 302 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device; the server 304 may be implemented by an independent server or by a server cluster composed of multiple servers, and may also be a node on a blockchain.
In one embodiment, as shown in fig. 4, an image integrity recognition method is provided, which is described by taking the method as an example applied to the server in fig. 3, and includes the following steps:
step 402, obtaining multi-modal information of the multimedia content, wherein the multi-modal information comprises alternative cover page images and text description information.
The multimedia content refers to content to be published that is uploaded through a network platform. For example, the multimedia content may specifically refer to image-text content uploaded through a network platform, such as image-text content edited by a self-media author through an established official account. For another example, the multimedia content may specifically refer to video content uploaded through a network platform, such as video content to be distributed that is uploaded by a user of a content creation organization (e.g., PGC, UGC, MCN).
The video content herein may specifically refer to short video. Short video refers to video content pushed at high frequency and played on various new media platforms, suitable for viewing in mobile and short leisure periods, with lengths varying from a few seconds to a few minutes. The content integrates topics such as skill sharing, humor, fashion trends, social hotspots, street interviews, public education, advertising creativity, and business customization. Because the content is short, it can stand alone as a single piece or form part of a series of columns. Unlike micro-movies and live broadcasts, short video production has no specific requirements on form of expression or team configuration; it has the characteristics of a simple production process, a low production threshold, and strong participation, and has more propagation value than live broadcasts, while the ultra-short production cycle and the need for interesting content pose certain challenges to the copywriting and planning work of short video production teams. The advent of short video has enriched the forms of new media native advertising. At present, short video has gone from early UGC, PGC, and user uploads, to organizations specializing in short video production, to MCN, to professional short video apps (applications) and other head traffic platforms, and has become one of the important propagation modes of content creation and social media platforms.
A modality refers to a certain source or form of information. For example, a person has touch, hearing, vision, and smell, and media of information include voice, video, text, and so on; each of these may be referred to as a modality. Single-modal representation learning represents information as a numerical vector that can be processed by a computer, or further abstracts it into a higher-level feature vector; multi-modal representation learning exploits the complementarity among multiple modalities to remove inter-modality redundancy and learn better feature representations. In this embodiment, the complementarity of the multi-modal information of the multimedia content is used to learn better features and thereby accurately recognize image integrity.
In the present embodiment, the multi-modal information includes an alternative cover image and text description information. The alternative cover image refers to an image associated with the multimedia content that is a candidate for the cover image. For example, the alternative cover image may specifically refer to an alternative image-text cover corresponding to image-text content, or to an alternative cover frame corresponding to video content. The text description information refers to information associated with the multimedia content that describes the multimedia content in text form. For example, the text description information may specifically refer to the content text and content tags associated with the multimedia content. The content text is text that explains the multimedia content; for example, it may specifically refer to the article and content title in image-text content, or to the video summary and content title corresponding to video content. The text description information may also refer to a character recognition result obtained by performing character recognition on image data or video frames in the multimedia content.
Specifically, before the multimedia content to be published is published, the server acquires multi-modal information of the multimedia content, so as to determine whether the alternative cover page image of the multimedia content to be published is complete according to the multi-modal information. Wherein the multimodal information includes alternative cover page images and textual description information.
Step 404, performing text feature extraction on the text description information to obtain text feature information, and performing image feature extraction on the alternative cover image to obtain image feature information, wherein the image feature information is used for describing image attribute features and image character features, and the image character features comprise at least one of human face key features and human body boundary region features.
The text feature information refers to feature information obtained after semantic analysis of the text description information. For example, the text feature information may specifically refer to a text feature vector obtained by encoding the text description information. The image feature information refers to feature information obtained by fusing image attribute features and image character features, where the image character features comprise at least one of human face key features and human body boundary region features. For example, the image feature information may specifically refer to an image feature vector obtained by fusing image attribute features, human face key features, and human body boundary region features. The image attribute features describe the basic attributes of the image, including attributes such as texture and edge structure. The face key features describe the face features present in the image; for example, they may specifically refer to the face frame and face key points. The human body boundary region features describe the human body features present in the image; for example, they may specifically refer to an edge score map composed of the human body boundary regions of the image, where the edge score reflects the integrity of the human body edges.
Specifically, after obtaining the multi-modal information, the server performs text feature extraction on the text description information to obtain text feature information, and performs image feature extraction on the alternative cover image to obtain image feature information. The text feature extraction may be semantic analysis of the text description information, realizing its encoding. The image feature extraction may consist of performing image attribute feature extraction, face key feature extraction, and human body analysis on the alternative cover image respectively, and fusing the separately extracted features into image feature information describing the image attribute features, face key features, and human body boundary region features.
Step 406, performing integrity detection on the alternative cover image according to the text feature information and the image feature information to obtain an image integrity recognition result corresponding to the alternative cover image.
Integrity detection of the alternative cover image means detecting whether the alternative cover image is complete, that is, whether it can be used as the cover image of the multimedia content. For example, detecting whether the alternative cover image is complete may specifically mean detecting whether the human body in the alternative cover image is complete; what counts as complete is associated with the category of the multimedia content. For makeup-type multimedia content, a certain body part, such as an eyebrow, an eye, or the face, is usually emphasized and displayed, and as long as that displayed part is complete, the alternative cover image is complete. For multimedia content such as portrait photographs, if the person's body or the main subject is obviously cut off, the alternative cover image is incomplete.
Specifically, the server fuses the text feature information and the image feature information to obtain fused feature information, then performs integrity detection on the alternative cover image according to the fused feature information, determines the probability that the alternative cover image belongs to each preset image integrity recognition result, and determines the image integrity recognition result corresponding to the alternative cover image according to the probabilities. The preset image integrity recognition results may be set as needed; for example, they may be slightly incomplete, severely incomplete, and complete.
Further, when the image integrity recognition result is that the image is complete, the alternative cover image is complete, and the server can obtain the cover image of the multimedia content from the alternative cover image. If only one alternative cover image exists, the server directly uses it as the cover image of the multimedia content. If at least two alternative cover images are recognized as complete, the server selects the one with the highest probability of being complete as the cover image of the multimedia content.
According to the image integrity recognition method, multi-modal information of the multimedia content is acquired; text feature extraction is performed on the text description information to obtain text feature information; image feature extraction is performed on the alternative cover image to obtain image feature information describing image attribute features and image character features; and integrity detection of the alternative cover image is performed according to the text feature information and the image feature information to determine the image integrity recognition result corresponding to the alternative cover image. Text features, image attribute features, and image character features can thus be combined to realize semantic-based image integrity detection and accurately recognize the integrity of the cover image.
In one embodiment, obtaining multimodal information for multimedia content comprises:
when the multimedia content comprises image-text content, acquiring an alternative image-text cover, image-text content text and image-text content labels corresponding to the image-text content, and extracting image data in the image-text content;
and obtaining an alternative cover image according to the alternative image-text cover, performing character recognition on image data in the image-text content, and obtaining text description information according to the image-text content character recognition result, the image-text content text and the image-text content label.
The alternative image-text cover is the image-text cover selected by the media producer when uploading the image-text content. The image-text content text refers to the textual description in the image-text content, including the image-text content title and the article, where the title is written by the media producer. The image-text content tag is a tag the media producer associates with the image-text content, used for determining its category; for example, it may be makeup, close-up, eyebrow, beach, XX animal, and so on. The image data in the image-text content refers to the images accompanying the text.
Specifically, when the multimedia content includes image-text content, the server acquires the alternative image-text cover, image-text content text, and image-text content tag corresponding to the image-text content, extracts the image data in the image-text content, uses the alternative image-text cover as the alternative cover image, performs character recognition on the image data, and uses the character recognition result, the image-text content text, and the image-text content tag as the text description information. The character recognition of the image data may adopt OCR (Optical Character Recognition); this embodiment is not specifically limited in this respect.
In this embodiment, the text description information is obtained from the character recognition result, the image-text content text, and the image-text content tag, so that multi-modal information can be obtained, as sketched below.
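As a minimal sketch of the character-recognition step, the following uses the open-source pytesseract and Pillow libraries; the patent specifies only "OCR" and names no particular library, so the tool choice and the language packs are illustrative assumptions.

```python
# Sketch of the OCR step over the images embedded in image-text content.
# Assumes pytesseract with the chi_sim (Simplified Chinese) language pack
# installed; the patent does not name any specific OCR engine.
from PIL import Image
import pytesseract

def recognize_text(image_path: str, lang: str = "chi_sim+eng") -> str:
    """Run OCR over one embedded image and return the recognized text."""
    image = Image.open(image_path)
    return pytesseract.image_to_string(image, lang=lang)

# The OCR results of all embedded images are joined with the content text and
# content tags to form the text description information.
ocr_text = " ".join(recognize_text(p) for p in ["fig1.png", "fig2.png"])
```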
In one embodiment, obtaining multimodal information for multimedia content comprises:
when the multimedia content comprises video content, performing frame extraction on the video content to obtain a video frame set, selecting an alternative cover frame from the video frame set, and acquiring a video content text and a video content label corresponding to the video content;
and obtaining an alternative cover image according to the alternative cover frame, performing character recognition on the video frame in the video frame set, and obtaining text description information according to the video content character recognition result, the video content text and the video content label.
The alternative cover frame is a video frame selected from the video frame set that can serve as the cover of the video content. The video content text refers to text describing the video content, including the video content title and video summary, where the title is written by the media producer. The video content tag is a tag the media producer associates with the video content, used for determining its category; it may specifically be makeup, close-up, eyebrow, beach, XX animal, and so on.
Specifically, when the multimedia content includes video content, the server performs frame extraction on the video content to obtain a video frame set, selects an alternative cover frame from the video frame set, acquires a video content text and a video content label corresponding to the video content, uses the alternative cover frame as an alternative cover image, performs character recognition on the video frame in the video frame set, and uses a video content character recognition result, the video content text and the video content label as text description information.
The frame extraction mode for the video content may be uniform frame extraction, that is, extracting frames at a preset frequency, which may be set as needed; frame extraction may be implemented by calling processing tools such as FFmpeg (Fast Forward MPEG, a multimedia video processing tool) or OpenCV (a cross-platform computer vision and machine learning software library). When selecting the alternative cover frame from the video frame set, a random selection mode may be adopted, or all video frames may be used directly as alternative cover frames; this embodiment is not specifically limited in this respect. OCR may be employed to perform character recognition on the video frames in the video frame set.
In this embodiment, the text description information is obtained from the character recognition result, the video content text, and the video content tag, so that multi-modal information can be obtained; a frame-extraction sketch follows.
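The following is a minimal sketch of the uniform frame extraction described above, using OpenCV, one of the two tools the description names; the sampling interval is an illustrative assumption, since the patent leaves the frequency to be set as needed.

```python
# Uniform frame extraction with OpenCV: decode the video and keep one frame
# out of every `every_n_frames` frames.
import cv2

def extract_frames(video_path: str, every_n_frames: int = 30) -> list:
    frames = []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:                       # end of stream
            break
        if index % every_n_frames == 0:  # uniform sampling
            frames.append(frame)
        index += 1
    cap.release()
    return frames

video_frame_set = extract_frames("content.mp4")  # candidates for cover frames
```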
In one embodiment, performing text feature extraction on the text description information to obtain text feature information includes:
performing semantic analysis on a character recognition result and a content text in the text description information to obtain first semantic feature information, and performing semantic analysis on a content label in the text description information to obtain second semantic feature information;
and splicing the first semantic feature information and the second semantic feature information to obtain text feature information.
Specifically, the server splices the text recognition result and the content text in the text description information into a text to be analyzed, inputs the text to be analyzed into a trained semantic feature extraction model for semantic analysis to obtain first semantic feature information, inputs a content tag in the text description information into the trained semantic feature extraction model for semantic analysis to obtain second semantic feature information, and splices the first semantic feature information and the second semantic feature information to obtain text feature information. It should be noted that, when the multimedia content is a video content and the video content does not have a video abstract, the content text is only a content title, and the server splices the text recognition result and the content title to obtain a text to be analyzed.
When performing semantic analysis, the trained semantic feature extraction model encodes the text to be analyzed and the content tag to obtain corresponding encoding vectors as semantic feature information. For example, the trained semantic feature extraction model may specifically be a BERT (Bidirectional Encoder Representations from Transformers) model. The BERT model is a pre-trained model trained on large-scale data; its core is a bidirectional Transformer encoder, which gives it strong semantic comprehension capability. With a 12-layer Transformer encoder, BERT improved the baseline performance of NLP (Natural Language Processing) tasks by a large margin. The BERT model is based on the Transformer model, which uses a self-attention mechanism instead of the sequential structure of an RNN (Recurrent Neural Network), so the model can be trained in parallel and can capture global information. Compared with the word2vec model (a model for generating word vectors), a BERT model pre-trained on massive text can introduce more transferable knowledge into the classification algorithm and provide more accurate text features.
In this embodiment, the text to be analyzed or the content tag is processed through the BERT model to extract the corresponding semantic features, that is, it is encoded and converted into a vector. It should be noted that the vector of the penultimate layer of the BERT model is generally extracted as the text representation vector, since it balances local and global information and extracts semantics well.
In this embodiment, semantic analysis is performed both on content such as the character recognition result and content text and on the content tag, so text features can be extracted from the multimedia content from both the content and the tag perspectives, yielding more comprehensive text feature information.
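A sketch of the BERT encoding step follows, using the HuggingFace transformers API. The checkpoint name is only a stand-in (the patent uses an in-house BERT trained on information-flow corpora), and the mean pooling over tokens is an illustrative choice; only the penultimate-layer extraction is taken from the description above.

```python
# Extract the penultimate-layer BERT vector for a text, then splice the
# title/OCR vector with the tag vector as the text feature information.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # stand-in
model = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)
model.eval()

def text_feature(text: str) -> torch.Tensor:
    """Encode text; return the mean-pooled penultimate-layer vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    penultimate = outputs.hidden_states[-2]    # second-to-last encoder layer
    return penultimate.mean(dim=1).squeeze(0)  # (768,) sentence vector

title_vec = text_feature("OCR text plus content title")   # first semantics
tag_vec = text_feature("makeup; close-up; eyebrow")        # second semantics
text_feature_info = torch.cat([title_vec, tag_vec])        # spliced feature
```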
In one embodiment, performing image feature extraction on the alternative cover image to obtain image feature information includes:
respectively extracting image attribute features and face key features of the alternative cover images to obtain image attribute feature information and face key feature information;
and fusing the image attribute feature information and the face key feature information to obtain image feature information.
Specifically, the server performs image attribute feature extraction on the alternative cover image through a trained deep convolutional neural network to obtain image attribute feature information. For example, the trained deep convolutional neural network may specifically be an Inception-ResNet v2 network, an evolution of the earlier Inception v3 network. It is a multilayer convolutional network; after pre-training, the alternative cover image is input and a feature vector is obtained from the network.
Specifically, the server extracts the face key features of the alternative cover image through a trained face detection network to obtain face key feature information, which may specifically refer to the face frame and face key point information. For example, the trained face detection network may specifically be a RetinaFace network, a single-stage face detector that simultaneously outputs the face frame and 5 face key points (the two eyes, the nose tip, and the two mouth corners); these key points can be used for judging the integrity of the face.
The following illustrates the face key feature extraction in this embodiment. For a head shot in an alternative cover image, such as a frontal portrait, the face frame should cover the complete head of the character as far as possible, including hair, beard, facial features, chin, and everything above the neck (down to the beard if the beard extends below the neck, and down to the hair if the hair extends below the neck), as well as head accessories such as hats, headwear, and earrings. As shown in FIG. 5, block 504 frames the head of the character including the hat (502) and the ears. Block 506 shows the head framing boundaries: on the left and right sides of the head, the boundary follows the face, hair, and headgear; the lower boundary is below the neck or at the lowest point of the chin; and the upper boundary is at the highest point of the headgear or hair.
Specifically, after the image attribute feature information and the face key feature information are obtained, the server fuses them to obtain the image feature information. The fusion may be performed by splicing the image attribute feature information and the face key feature information.
In this embodiment, the alternative cover image is analyzed and features are extracted from the two angles of basic attributes and face key features, so more comprehensive image feature information can be obtained.
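The following sketch shows the image-attribute branch and the splicing fusion, assuming the Keras pre-trained InceptionResNetV2; the face branch is represented by a hypothetical face_key_features vector (e.g., box plus landmark coordinates from a RetinaFace detector), since the patent does not fix its encoding.

```python
# Image attribute features from Inception-ResNet v2, spliced with a
# hypothetical face key feature vector to form the image feature information.
import numpy as np
from tensorflow.keras.applications.inception_resnet_v2 import (
    InceptionResNetV2, preprocess_input)
from tensorflow.keras.preprocessing import image as keras_image

backbone = InceptionResNetV2(weights="imagenet", include_top=False, pooling="avg")

def image_attribute_features(img_path: str) -> np.ndarray:
    """1536-d attribute vector from the pooled Inception-ResNet v2 output."""
    img = keras_image.load_img(img_path, target_size=(299, 299))
    x = preprocess_input(np.expand_dims(keras_image.img_to_array(img), 0))
    return backbone.predict(x)[0]

attr_vec = image_attribute_features("cover_candidate.jpg")
# Hypothetical face vector: 4 box coordinates plus 5 (x, y) landmarks.
face_key_features = np.zeros(4 + 10, dtype=np.float32)
image_feature_info = np.concatenate([attr_vec, face_key_features])  # splicing
```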
In one embodiment, performing image feature extraction on the alternative cover image to obtain image feature information includes:
extracting image attribute features of the alternative cover image to obtain image attribute feature information, and performing human body analysis on the alternative cover image to obtain human body boundary area feature information;
and fusing the image attribute feature information and the human body boundary region feature information to obtain image feature information.
Specifically, the server performs human body analysis on the alternative cover image through a trained human body analysis model to identify human body parts and obtain human body boundary region feature information, which may specifically refer to an edge score map composed of the human body boundary regions. For example, the trained human body analysis model may specifically be a Part Grouping Network (PGN). As shown in FIG. 6, the Part Grouping Network first uses ResNet-101 (a 101-layer residual network) to extract a shared feature map, then adds two branches to capture part regions and human body boundary regions, generating a semantic part score map and an edge score map, and finally applies a refinement branch that refines the predicted part score map and edge score map by integrating the part segmentation and the human body boundary regions; the refined edge score map is the human body boundary region feature information.
Specifically, after obtaining the image attribute feature information and the human body boundary region feature information, the server fuses the image attribute feature information and the human body boundary region feature information to obtain the image feature information. The fusion mode may be to splice the image attribute feature information and the human body boundary region feature information.
In this embodiment, more comprehensive image feature information can be obtained by analyzing and extracting features of the alternative cover images from two angles of basic attributes and human body analysis.
In one embodiment, performing image feature extraction on the alternative cover image to obtain image feature information includes:
respectively extracting image attribute features and face key features of the alternative cover image to obtain image attribute feature information and face key feature information, and carrying out human body analysis on the alternative cover image to obtain human body boundary region feature information;
and fusing image attribute feature information, human face key feature information and human body boundary region feature information to obtain image feature information.
Specifically, after obtaining the image attribute feature information, the face key feature information and the human body boundary region feature information, the server may obtain the image feature information by fusing the image attribute feature information, the face key feature information and the human body boundary region feature information, and the fusing manner may be to directly splice the image attribute feature information, the face key feature information and the human body boundary region feature information.
In this embodiment, more comprehensive image feature information can be obtained by analyzing and extracting features of the alternative cover page image from multiple angles such as basic attributes, human face key features, human body analysis and the like.
In one embodiment, performing human body analysis on the alternative cover image to obtain human body boundary region feature information comprises:
performing shared feature extraction on the alternative cover image to obtain a shared feature map;
performing semantic part segmentation and instance-aware edge detection respectively according to the shared feature map to obtain a semantic part score map and an edge score map; and
integrating the semantic part score map and the edge score map to obtain the human body boundary region feature information.
The semantic part segmentation is used to assign each pixel in the alternative cover image to a human part (e.g., face, arm) based on the shared feature map. Instance-aware edge detection is used to partition the semantic parts among different human instances based on the shared feature map. The semantic part score map describes the proportion of the human part regions in the total image area, and the edge score map describes the integrity of the human body edges.
Specifically, the server extracts shared features of the alternative cover image to obtain a shared feature map; performs semantic part segmentation according to the shared feature map, assigning each pixel in the alternative cover image to a human part, to obtain a semantic part score map; performs instance-aware edge detection according to the shared feature map, partitioning the semantic parts among different human instances, to obtain an edge score map; and then refines the two maps against each other by integrating the semantic part score map and the edge score map, obtaining a refined semantic part score map and edge score map. The human body boundary region feature information can be obtained from the refined edge score map.
For example, the human body analysis may be performed on the alternative cover image using a trained PGN (Part Grouping Network) model to obtain the human body boundary region feature information. The Part Grouping Network reformulates instance-level human parsing as two twinned subtasks that can be learned jointly and refined against each other through a unified network: 1) semantic part segmentation, which assigns each pixel to a human part (e.g., face, arm); and 2) instance-aware edge detection, which partitions the semantic parts among different human instances. PGN mainly involves two consecutive grouping processes: part-level pixel grouping and instance-level part grouping. Part-level pixel grouping can be solved by a semantic part segmentation task that labels each pixel with a part, which learns categorical features; afterwards, given a set of independent semantic parts, instance-level part grouping determines which instance each part belongs to based on the predicted instance-aware edges, where parts separated by instance edges are assigned to different person instances. That is, PGN is a detection-free unified network that jointly optimizes semantic part segmentation and instance-aware edge detection.
As shown in FIG. 6, the Part Grouping Network first extracts a shared feature map using ResNet-101, then adds two branches to capture part regions (i.e., semantic part segmentation) and human body boundary regions (i.e., instance-aware edge detection), generating a semantic part score map and an edge score map, and finally applies a refinement branch that refines the predicted semantic part score map and edge score map by integrating the part segmentation and the human body boundary regions. Because the two branches maintain a high correlation with each other by sharing a consistent grouping target, a refinement branch is further integrated into the network, so that the two branches benefit from each other through complementary contextual information: the semantic part score map can be used to correct the edge score map, and the edge score map can likewise be used to correct the semantic part score map.
In this embodiment, the shared feature map is obtained by extracting shared features from the alternative cover image; semantic part segmentation and instance-aware edge detection are performed according to the shared feature map to obtain the semantic part score map and edge score map; and the two maps are integrated to obtain the human body boundary region feature information. A schematic sketch of this layout follows.
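The following PyTorch sketch mirrors the PGN-style layout just described: a shared ResNet-101 backbone, a semantic-part branch, an edge branch, and a refinement stage that integrates both score maps with the shared features. The channel sizes and 1x1-convolution heads are illustrative assumptions, not the published PGN configuration, and upsampling of the low-resolution score maps is omitted.

```python
# Schematic PGN-style network: shared features, two branches, refinement.
import torch
import torch.nn as nn
from torchvision.models import resnet101

class PGNSketch(nn.Module):
    def __init__(self, num_parts: int = 20):
        super().__init__()
        backbone = resnet101(weights=None)
        # Shared feature extractor: everything up to the last residual stage.
        self.shared = nn.Sequential(*list(backbone.children())[:-2])  # 2048 ch
        self.part_head = nn.Conv2d(2048, num_parts, kernel_size=1)  # part scores
        self.edge_head = nn.Conv2d(2048, 1, kernel_size=1)          # edge scores
        # Refinement: re-predict both maps from shared features plus both maps.
        self.refine_part = nn.Conv2d(2048 + num_parts + 1, num_parts, 1)
        self.refine_edge = nn.Conv2d(2048 + num_parts + 1, 1, 1)

    def forward(self, x):
        f = self.shared(x)
        part, edge = self.part_head(f), self.edge_head(f)
        fused = torch.cat([f, part, edge], dim=1)   # integrate both branches
        return self.refine_part(fused), self.refine_edge(fused)

model = PGNSketch()
part_map, edge_map = model(torch.randn(1, 3, 512, 512))  # refined score maps
```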
In one embodiment, performing alternative cover image integrity detection according to the text characteristic information and the image characteristic information, and determining an image integrity recognition result corresponding to the alternative cover image comprises:
fusing text characteristic information and image characteristic information to obtain fused characteristic information;
performing alternative cover image integrity detection according to the fusion characteristic information, and determining the probability that the alternative cover image belongs to each preset image integrity identification result;
and determining an image integrity recognition result corresponding to the alternative cover image according to the probability.
Specifically, the server fuses the text feature information and the image feature information to obtain fused feature information, performs integrity detection on the alternative cover image using the fused feature information to determine the probability that the alternative cover image belongs to each preset image integrity recognition result, and takes the recognition result with the maximum probability as the image integrity recognition result corresponding to the alternative cover image.
In this embodiment, the text feature information and the image feature information are fused to obtain fused feature information; integrity detection of the alternative cover image is performed according to the fused feature information to determine the probability of each preset image integrity recognition result; and the image integrity recognition result corresponding to the alternative cover image can then be determined from these probabilities, yielding an accurate result.
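A minimal sketch of this decision step follows: take the predicted probabilities over the three preset labels, pick the most probable label for each candidate, and, among candidates judged complete, prefer the one with the highest "complete" probability as the final cover. The file names and probability values are made up for illustration.

```python
# Argmax over the preset labels, then cover selection among complete candidates.
LABELS = ["slightly incomplete", "severely incomplete", "complete"]

def recognize(probs: list) -> str:
    """Return the label with the maximum predicted probability."""
    return LABELS[max(range(len(LABELS)), key=lambda i: probs[i])]

candidates = {"frame_12.jpg": [0.05, 0.03, 0.92],
              "frame_48.jpg": [0.10, 0.15, 0.75]}
complete = {k: p[2] for k, p in candidates.items() if recognize(p) == "complete"}
cover = max(complete, key=complete.get)   # -> "frame_12.jpg"
```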
In an embodiment, the image integrity recognition method of the present application can also be implemented by a trained multi-modal fusion model. FIG. 7 shows the network structure of the trained multi-modal fusion model, according to which the method is described in detail below.
The trained multi-modal fusion model comprises an image feature extraction network composed of Inception-ResNet v2 + RetinaFace + PGN, and two pre-trained BERT models. The image feature extraction network performs image feature extraction on the alternative cover image to obtain the image feature information (i.e., an image vector covering the alternative image-text cover or extracted video frame together with face and human body detection), while the two pre-trained BERT models perform semantic analysis on the content title (including the OCR-recognized text, i.e., the character recognition result and content text in the text description information) and on the content tags respectively, to obtain the corresponding semantic feature vectors (i.e., the content title vector and the content tag vector). For video content, key frame samples must be extracted from the video by frame extraction to obtain the alternative cover image.
After the image vector, content title vector, and content tag vector are obtained, the trained multi-modal fusion model can recognize whether the alternative cover image of the multimedia content is complete. As shown in FIG. 7, the model passes the image vector, content title vector, and content tag vector through a hidden layer, a fully connected layer, and a softmax multi-class multi-label layer to obtain the image integrity recognition result.
After the content title vector and the content tag vector are extracted, there are three ways to fuse them into the trained multi-modal fusion model: the first is to splice them with the image vector and feed the result to the Encoder input; the second is to add them directly to the Encoder output; the third is to use them as the initialization vector of the Decoder. The present application preferably uses the first way, by which the pre-trained BERT model is added to the trained multi-modal fusion model, somewhat enhancing the semantic comprehension of the text. Some video content has no alternative cover image; in that case the alternative cover image comes from frames extracted from the video content, and the image vector can be obtained by extracting key frames and using them as alternative cover images. Finally, the image vector and the text feature information (including the content title vector and content tag vector) undergo multi-modal fusion, and the fused results yield the integrity label information (the labels are preset as slightly incomplete, severely incomplete, and complete), greatly improving the overall accuracy. As shown in FIG. 7, in this embodiment the different probabilities of the multi-class labels are output through a softmax layer. It should be noted that the BERT model is not the native BERT model: a BERT model trained on the corpus of the information-flow service is actually used, since the information-flow corpus better extracts semantic features for information-flow content.
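The following sketch shows a fusion head of the kind described above, assuming the first fusion mode: the image vector and the two BERT text vectors are spliced and passed through a hidden layer, a fully connected layer, and softmax over the three preset labels. All dimensions are illustrative assumptions, not values given in the patent.

```python
# Multi-modal fusion head: splice image + title + tag vectors, then
# hidden layer -> fully connected layer -> softmax over three labels.
import torch
import torch.nn as nn

class MultiModalFusionHead(nn.Module):
    def __init__(self, img_dim=1536, txt_dim=768, hidden=512, num_labels=3):
        super().__init__()
        in_dim = img_dim + 2 * txt_dim        # image + title + tag vectors
        self.hidden = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.fc = nn.Linear(hidden, num_labels)

    def forward(self, img_vec, title_vec, tag_vec):
        fused = torch.cat([img_vec, title_vec, tag_vec], dim=-1)  # splicing
        logits = self.fc(self.hidden(fused))
        return torch.softmax(logits, dim=-1)  # probability per preset label

head = MultiModalFusionHead()
probs = head(torch.randn(1, 1536), torch.randn(1, 768), torch.randn(1, 768))
```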
In this embodiment, the image integrity recognition method is realized through the trained multi-modal fusion model, which can fully exploit the interactions among the three kinds of features (text, image, and video content, i.e., frames extracted from the video) and improve the recognition effect. Compared with using a separate model per modality, recognizing text and image content as a whole gives the trained multi-modal fusion model lower deployment cost and resource consumption. During training, the samples in the training set are predicted by the multi-modal fusion model being trained; samples whose manual labels disagree with the model predictions are re-labeled manually and the model is retrained. A few rounds of this iteration, sketched below, can rapidly improve both the sample labeling quality and the final effect.
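A schematic sketch of this disagreement-driven labeling loop follows; the train, predict, and relabel callables are hypothetical stand-ins for the training routine and the human review step, which the patent describes only as a process.

```python
# Retrain, find label/prediction disagreements, re-label them, and repeat.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    features: list
    label: str   # one of the three preset integrity labels

def refine(train: Callable, predict: Callable, relabel: Callable,
           dataset: List[Sample], rounds: int = 3):
    model = None
    for _ in range(rounds):
        model = train(dataset)
        disputed = [s for s in dataset if predict(model, s) != s.label]
        for s in disputed:
            s.label = relabel(s)  # manual review of each disagreement
    return model
```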
In this embodiment, an image integrity recognition method is implemented through the trained multi-modal fusion model. The existing incomplete-picture classification model is upgraded to a multi-modal machine learning method: human body recognition (including human faces) and key point detection (RetinaFace + Inception-ResNet v2 + PGN) and image classification technology are combined with text information (the context text of the content where the cover picture is located (such as OCR information), the content title and the label information); the BERT model is used to uniformly extract text features, and multi-modal fusion modeling is then performed to realize semantics-based image-text integrity detection. The core idea is as follows: a multi-modal deep learning technique provides an end-to-end detection model that recognizes the original cover pictures of content (including image-text, video and atlas content) on the content processing link, combined with integrity detection of the various regular sizes after cropping (namely, integrity detection on the alternative cover image); at the same time, the semantic scene of the alternative cover image is combined to judge whether the cover image is really incomplete (for example, for makeup and fashion content, the expressed and highlighted theme is a close-up of a face being made up, and in that combined scene the image is not incomplete).
In an embodiment, as shown in fig. 8, a system architecture diagram corresponding to the image integrity recognition method is provided to illustrate its application. The main functions of each service module in the system architecture are as follows:
First, content production and consumption end
(1) Content producers such as PGC, UGC and MCN provide image-text content or video content through the mobile terminal or a backend Application Programming Interface (API) system; these are the main content sources for recommendation and distribution;
(2) through communication with the uplink and downlink content interface service, image-text content is uploaded; the image-text content source is usually a lightweight publishing terminal and editing content entry, while video content is usually published from a shooting terminal, where local video content can be matched with music, filter templates, beautifying functions and the like during shooting;
(3) as a consumer, the terminal communicates with the uplink and downlink content interface server to obtain index information of recommended and accessed content, and then communicates with the content storage server to obtain the corresponding content, including recommended and subscribed content; the content storage server stores content entities such as video source files and picture source files, while the meta information of the content (such as title, author, cover image, classification and Tag information) is stored in the content database;
(4) meanwhile, behavior data generated during uploading and playing, such as stutters, loading time and playing clicks, are reported to the back end for statistical analysis;
(5) the consumption end generally browses content data in the form of a Feeds stream; various data from external channels also enter the platform system through the content consumption end via the uplink and downlink content interface server;
Second, uplink and downlink content interface server
(1) directly communicates with the content production end and stores the content submitted from the front end, usually the title, publisher, abstract, cover image and publishing time of the content, into the content database;
(2) writes the meta information of the content, such as file size, cover image link, title, release time and author, into the content database;
(3) synchronizes the content submitted by publishers (including content provided by external channels) to the dispatch center server for subsequent content processing and circulation;
Third, content database
(1) the core is the meta information of the content itself, such as file size, cover image link, code rate, file format, title, release time, author, video file size and video format, as well as whether the content is marked as original or first-published; it also includes the classification of the content given in the manual review process (including first-, second- and third-level classification and label information; for example, for an article about an XX (domestic mobile phone brand) phone, the first-level classification is science and technology, the second-level is smart phone, the third-level is domestic mobile phone, and the label information is XX plus the specific phone model);
(2) information in the content database is read during manual review, and the result and state of the manual review are written back to the content database;
(3) the dispatch center mainly involves machine processing and manual review processing; the machine-processing core judges various qualities such as low-quality filtering, content labels such as classification and label information, and content deduplication; the results are written into the content database, and duplicated content is not manually processed a second time;
(4) the meta information of the content is read from the content database when subsequent modeling and recognition need label information;
Fourth, dispatch center service
(1) responsible for the whole scheduling process of content circulation; it receives stored content through the uplink and downlink content interface server and then obtains the meta information of the content from the content database;
(2) schedules the manual review system and the machine processing system and controls the scheduling order and priority;
(3) content enabled by the manual review system is provided to content consumers at the terminal through the content export distribution service (usually a recommendation engine, a search engine or a directly operated display page); that is, the content index information obtained by the consumption end is the entry address for content consumption access;
(4) communicates with the cover image service; during cover image selection and cropping in information flow content circulation, the recognition of human-body-cropping semantic integrity of picture content in the cover image is completed through the cover image service, ensuring that the semantic integrity of the selected picture matches the context of the content;
Fifth, manual review system
(1) the manual review system is a carrier of manual service capability, mainly used for reviewing and filtering content that machines cannot determine, such as politically sensitive, pornographic or legally impermissible content;
(2) labels and secondarily confirms short-video and mini-video content;
Sixth, content storage service
(1) stores the content entity information other than the meta information, such as the video source files and the picture source files of image-text content; the terminal accesses the source files directly from the storage service when consuming video content;
(2) when video content labels are extracted, provides the video source file, including frame-extracted content from the middle of the source file, and at the same time provides a candidate set of video cover images from the frame-extracted content;
Seventh, semantic incompleteness recognition service (namely the image integrity recognition in the present application)
(1) serves the semantic incompleteness recognition model (namely the trained multi-modal fusion model);
(2) communicates with the cover image service, specifically completing the recognition of human-body-cropping semantic integrity of picture content in the cover image, ensuring that the semantic integrity of the selected picture matches the context of the content;
Eighth, cover image service
(1) communicates with the dispatch center and with the video frame extraction and image-text content analysis services, taking the frame-extracted cover images of videos in the main content processing flow (yielding alternative cover frames) and the pictures in the cover image candidate set of image-text content (yielding alternative image-text covers) as input;
(2) calls the semantic incompleteness recognition service to complete the human-body-cropping semantic integrity recognition of the cover image; the recognition result serves as a reference for subsequent picture selection, cropping and cover choice;
Ninth, semantic incompleteness recognition model (namely the trained multi-modal fusion model)
(1) according to the image integrity recognition described above, human body recognition and key point detection (RetinaFace + ResNet101 + PGN) and image classification technology are combined with text information (the context text of the content where the cover picture is located, the content title, the label information and the like); a BERT model is adopted to uniformly extract text features, and multi-modal fusion modeling is then performed to realize semantics-based image-text integrity detection;
Tenth, content distribution export service
(1) the outlet of the machine and manual processing links for content output; the dispatch center processes the finally generated content pool, which is distributed through this export service;
(2) the main distribution modes are recommendation algorithm distribution and manual operation;
(3) communicates directly with content consumption end users;
Eleventh, file downloading system
(1) downloads and acquires original video content from the video content storage server and controls the downloading speed and progress; it is usually a group of parallel servers composed of related task scheduling and distribution clusters;
(2) the downloaded file calls the frame extraction service to acquire the necessary video file frames from the video source file, providing the original data source for the subsequent video cover image;
Twelfth, deduplication service
(1) by comparing the fingerprint characteristics of video content, only one of a set of repeated similar video files is retained and passed on to subsequent links for processing, reducing unnecessary duplicated files on the link. The method mainly performs multi-modal feature extraction on the title, cover image and video content, constructs an embedding vector for the video, and judges the similarity of video files by calculating the cosine distance between vectors, as sketched below.
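A minimal sketch of this similarity check, assuming the embedding vectors are already available; the 0.1 distance threshold is an illustrative assumption:

import numpy as np

def cosine_distance(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def deduplicate(embeddings, threshold=0.1):
    # Keep the first file of each near-duplicate group; drop later look-alikes.
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_distance(emb, embeddings[j]) > threshold for j in kept):
            kept.append(i)
    return kept

videos = [np.random.rand(128) for _ in range(5)]    # stand-in video embeddings
print(deduplicate(videos))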
Thirteenth, video frame extraction and image-text content analysis service
(1) acquires the necessary video file frames from the video source file, providing the original data source for the subsequently constructed video cover image;
(2) the image-text content comprises a plurality of pictures; the image-text content is parsed, several pictures usable as cover images are extracted, and these, together with the cover image uploaded by the original author, are taken as input; the incompleteness recognition service is then called through the cover image service to complete the final recognition.
By the image integrity recognition method, the response and processing speed for semantically incomplete content can be improved and a large amount of review manpower saved. Meanwhile, the constructed incomplete samples are closely related to the service and therefore more targeted, which can greatly improve the overall recognition effect and efficiency. The method also copes with the case where only one original cover picture is provided but many display-scene specifications are required, increasing the number of scenes that content exposure can adapt to, improving the distribution efficiency of the content, and raising the semantic integrity and quality of the available cover image in each scene.
It should be understood that, although the steps in the flowcharts related to the above embodiments are shown in sequence as indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to the order shown, and the steps may be executed in other orders. Moreover, at least some of the steps in each flowchart related to the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; nor is their execution order necessarily sequential, as they may be performed in turns or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, an image integrity recognition apparatus is provided, which may be a part of a computer device implemented as a software module or a hardware module, or a combination of the two, and specifically includes: an obtaining module 902, a feature extraction module 904, and a detection module 906, wherein:
an obtaining module 902, configured to obtain multi-modal information of the multimedia content, where the multi-modal information includes an alternative cover image and text description information;
the feature extraction module 904 is configured to perform text feature extraction on the text description information to obtain text feature information, and perform image feature extraction on the alternative cover image to obtain image feature information, where the image feature information is used to describe an image attribute feature and an image person feature, and the image person feature is at least one of a face key feature and a human body boundary area feature;
and the detection module 906 is configured to perform integrity detection on the alternative cover image according to the text characteristic information and the image characteristic information to obtain an image integrity recognition result corresponding to the alternative cover image.
According to the image integrity recognition apparatus, text feature extraction is performed on the text description information in the multi-modal information of the multimedia content to obtain text feature information, and feature extraction is performed on the alternative cover image in the multi-modal information to obtain image feature information describing image attribute features and image person features. Integrity detection of the alternative cover image is then performed according to the text feature information and the image feature information to determine the image integrity recognition result corresponding to the alternative cover image. Combining text features, image attribute features and image person features enables semantics-based image integrity detection, so the integrity of the cover image can be accurately recognized.
In one embodiment, the obtaining module is further configured to, when the multimedia content includes the image-text content, obtain an alternative image-text cover, an image-text content text, and an image-text content tag corresponding to the image-text content, extract image data in the image-text content, obtain an alternative cover image according to the alternative image-text cover, perform text recognition on the image data in the image-text content, and obtain text description information according to an image-text content text recognition result, the image-text content text, and the image-text content tag.
In an embodiment, the obtaining module is further configured to, when the multimedia content includes video content, frame-extract the video content to obtain a video frame set, select an alternative cover frame from the video frame set, obtain a video content text and a video content tag corresponding to the video content, obtain an alternative cover image according to the alternative cover frame, perform character recognition on the video frame in the video frame set, and obtain text description information according to a video content character recognition result, the video content text, and the video content tag.
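For the frame-extraction step mentioned here, a minimal OpenCV sketch is given below; the fixed sampling stride of 30 frames is an assumption, since the application does not prescribe a sampling scheme:

import cv2

def extract_candidate_frames(path, stride=30):
    # Sample every `stride`-th frame of the video as a candidate cover frame.
    cap = cv2.VideoCapture(path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames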
In one embodiment, the feature extraction module is further configured to perform semantic analysis on the text recognition result and the content text in the text description information to obtain first semantic feature information, perform semantic analysis on the content tag in the text description information to obtain second semantic feature information, and collect the first semantic feature information and the second semantic feature information to obtain text feature information.
In an embodiment, the feature extraction module is further configured to perform image attribute feature extraction and face key feature extraction on the alternative cover image respectively to obtain image attribute feature information and face key feature information, perform human body analysis on the alternative cover image to obtain human body boundary region feature information, and fuse the image attribute feature information, the face key feature information and the human body boundary region feature information to obtain image feature information.
In one embodiment, the feature extraction module is further configured to perform shared feature extraction on the alternative cover image to obtain a shared feature map, perform semantic portion segmentation and instance perception edge detection respectively according to the shared feature map to obtain a semantic portion score map and an edge score map, and integrate the semantic portion score map and the edge score map to obtain human body boundary area feature information.
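A minimal sketch, under assumptions, of this two-branch design: one shared feature map feeds a semantic-part branch and an edge branch, and the two score maps are integrated into the human-body boundary feature. The tiny backbone and the 20 part classes are illustrative placeholders, not the actual parsing network:

import torch
import torch.nn as nn

class TwoBranchParsing(nn.Module):
    def __init__(self, num_parts=20):
        super().__init__()
        self.shared = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
        self.part_head = nn.Conv2d(64, num_parts, 1)   # semantic part score map
        self.edge_head = nn.Conv2d(64, 1, 1)           # instance-aware edge score map

    def forward(self, x):
        feat = self.shared(x)                          # shared feature map
        parts = self.part_head(feat)
        edges = self.edge_head(feat).sigmoid()
        # Integrate the two score maps into the boundary-region feature.
        return torch.cat([parts, edges], dim=1)

out = TwoBranchParsing()(torch.randn(1, 3, 256, 256))  # shape: (1, 21, 256, 256)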
In one embodiment, the detection module is further configured to fuse the text feature information and the image feature information to obtain fused feature information, perform integrity detection on the alternative cover image according to the fused feature information, determine probabilities that the alternative cover image belongs to each preset image integrity recognition result, and determine the image integrity recognition result corresponding to the alternative cover image according to the probabilities.
For specific limitations of the image integrity recognition apparatus, reference may be made to the above limitations of the image integrity recognition method, which are not described herein again. The modules in the image integrity recognition apparatus may be implemented in whole or in part by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data such as a preset image integrity recognition result and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement an image integrity recognition method.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing related hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the patent application. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (11)

1. An image integrity recognition method, the method comprising:
obtaining multi-modal information of multimedia content, wherein the multi-modal information comprises alternative cover page images and text description information;
performing text feature extraction on the text description information to obtain text feature information, and performing image feature extraction on the alternative cover image to obtain image feature information, wherein the image feature information is used for describing image attribute features and image character features, and the image character features are at least one of key features of a human face and features of a human body boundary area;
and carrying out integrity detection on the alternative cover image according to the text characteristic information and the image characteristic information to obtain an image integrity identification result corresponding to the alternative cover image.
2. The method of claim 1, wherein obtaining multimodal information for multimedia content comprises:
when the multimedia content comprises image-text content, acquiring an alternative image-text cover, image-text content text and image-text content labels corresponding to the image-text content, and extracting image data in the image-text content;
and obtaining an alternative cover image according to the alternative image-text cover, performing character recognition on image data in the image-text content, and obtaining text description information according to an image-text content character recognition result, the image-text content text and the image-text content label.
3. The method of claim 1, wherein obtaining multimodal information for multimedia content comprises:
when the multimedia content comprises video content, performing frame extraction on the video content to obtain a video frame set, selecting an alternative cover frame from the video frame set, and acquiring a video content text and a video content label corresponding to the video content;
and obtaining an alternative cover image according to the alternative cover frame, performing character recognition on the video frame in the video frame set, and obtaining text description information according to a video content character recognition result, the video content text and the video content label.
4. The method according to claim 1, wherein the extracting text features from the text description information to obtain text feature information comprises:
performing semantic analysis on a character recognition result and a content text in the text description information to obtain first semantic feature information, and performing semantic analysis on a content label in the text description information to obtain second semantic feature information;
and aggregating the first semantic feature information and the second semantic feature information to obtain text feature information.
5. The method of claim 1, wherein the image feature extraction of the alternative cover image to obtain image feature information comprises:
respectively carrying out image attribute feature extraction and face key feature extraction on the alternative cover image to obtain image attribute feature information and face key feature information, and carrying out human body analysis on the alternative cover image to obtain human body boundary region feature information;
and fusing the image attribute feature information, the face key feature information and the human body boundary region feature information to obtain image feature information.
6. The method of claim 5, wherein the human body analysis of the alternative cover image to obtain the characteristic information of the human body boundary area comprises:
carrying out sharing feature extraction on the alternative cover image to obtain a sharing feature map;
according to the shared feature map, respectively carrying out semantic part segmentation and example perception edge detection to obtain a semantic part score map and an edge score map;
and integrating the semantic part score map and the edge score map to obtain the characteristic information of the human body boundary area.
7. The method of claim 1, wherein performing integrity detection on the alternative cover image according to the text feature information and the image feature information to obtain the image integrity recognition result corresponding to the alternative cover image comprises:
fusing the text characteristic information and the image characteristic information to obtain fused characteristic information;
performing alternative cover image integrity detection according to the fusion characteristic information, and determining the probability that the alternative cover image belongs to each preset image integrity recognition result;
and determining an image integrity recognition result corresponding to the alternative cover image according to the probability.
8. An image integrity recognition apparatus, comprising:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring multi-modal information of multimedia content, and the multi-modal information comprises alternative cover images and text description information;
the feature extraction module is used for extracting text features of the text description information to obtain text feature information, and extracting image features of the alternative cover image to obtain image feature information, wherein the image feature information is used for describing image attribute features and image character features, and the image character features are at least one of key features of a human face and features of a human body boundary area;
and the detection module is used for detecting the integrity of the alternative cover image according to the text characteristic information and the image characteristic information to obtain an image integrity identification result corresponding to the alternative cover image.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
11. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 7 when executed by a processor.
CN202111192347.7A 2021-10-13 2021-10-13 Image integrity identification method and device, computer equipment and storage medium Pending CN114328990A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111192347.7A CN114328990A (en) 2021-10-13 2021-10-13 Image integrity identification method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114328990A true CN114328990A (en) 2022-04-12

Family

ID=81045642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111192347.7A Pending CN114328990A (en) 2021-10-13 2021-10-13 Image integrity identification method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114328990A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007172077A (en) * 2005-12-19 2007-07-05 Fuji Xerox Co Ltd Image search system, method thereof, and program thereof
US20180060705A1 (en) * 2016-08-30 2018-03-01 International Business Machines Corporation Image text analysis for identifying hidden text
CN108229504A (en) * 2018-01-29 2018-06-29 深圳市商汤科技有限公司 Method for analyzing image and device
US20200151503A1 (en) * 2018-11-08 2020-05-14 Adobe Inc. Training Text Recognition Systems
CN110472550A (en) * 2019-08-02 2019-11-19 南通使爱智能科技有限公司 A kind of text image shooting integrity degree judgment method and system
CN111191616A (en) * 2020-01-02 2020-05-22 广州织点智能科技有限公司 Face shielding detection method, device, equipment and storage medium
CN111191078A (en) * 2020-01-08 2020-05-22 腾讯科技(深圳)有限公司 Video information processing method and device based on video information processing model
CN112418011A (en) * 2020-11-09 2021-02-26 腾讯科技(深圳)有限公司 Method, device and equipment for identifying integrity of video content and storage medium
CN112070069A (en) * 2020-11-10 2020-12-11 支付宝(杭州)信息技术有限公司 Method and device for identifying remote sensing image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘方园; 王水花; 张煜东: "A Survey of Deep Belief Network Models and Applications", Computer Engineering and Applications, no. 01, 1 January 2018 (2018-01-01) *
李佳妮; 张宝华: "Face Recognition Combining Feature-Matching Fusion with an Improved Convolutional Neural Network", Laser & Optoelectronics Progress, no. 10, 24 May 2018 (2018-05-24) *

Similar Documents

Publication Publication Date Title
CN112163122B (en) Method, device, computing equipment and storage medium for determining label of target video
CN112749608B (en) Video auditing method, device, computer equipment and storage medium
CN110781347A (en) Video processing method, device, equipment and readable storage medium
CN104735468B (en) A kind of method and system that image is synthesized to new video based on semantic analysis
CN108509465A (en) A kind of the recommendation method, apparatus and server of video data
CN112257661A (en) Identification method, device and equipment of vulgar image and computer readable storage medium
CN111444357A (en) Content information determination method and device, computer equipment and storage medium
CN112231563B (en) Content recommendation method, device and storage medium
CN112149632A (en) Video identification method and device and electronic equipment
CN113392315A (en) Topic type mining method, device, equipment and storage medium
CN113590854B (en) Data processing method, data processing equipment and computer readable storage medium
CN116665083A (en) Video classification method and device, electronic equipment and storage medium
CN113011126A (en) Text processing method and device, electronic equipment and computer readable storage medium
Paolanti et al. Deep convolutional neural networks for sentiment analysis of cultural heritage
CN117173497A (en) Image generation method and device, electronic equipment and storage medium
CN111986259A (en) Training method of character and face detection model, auditing method of video data and related device
CN110516086B (en) Method for automatically acquiring movie label based on deep neural network
CN116977992A (en) Text information identification method, apparatus, computer device and storage medium
CN107369450A (en) Recording method and collection device
CN116955591A (en) Recommendation language generation method, related device and medium for content recommendation
CN112528073A (en) Video generation method and device
CN114328990A (en) Image integrity identification method and device, computer equipment and storage medium
CN115909390A (en) Vulgar content identification method, vulgar content identification device, computer equipment and storage medium
CN114529635A (en) Image generation method, device, storage medium and equipment
CN115114469A (en) Picture identification method, device and equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination