CN115830610A

CN115830610A - Multi-mode advertisement recognition method and system, electronic equipment and storage medium

Info

Publication number: CN115830610A
Application number: CN202211387710.5A
Authority: CN
Inventors: 李琳; 崔健; 吴小华; 袁景凌; 李国强; 刘磊
Original assignee: Wuhan University of Technology WUT
Current assignee: Wuhan University of Technology WUT
Priority date: 2022-11-07
Filing date: 2022-11-07
Publication date: 2023-03-21

Abstract

The invention provides a multimodal advertisement recognition method and system, an electronic device and a storage medium. The multimodal advertisement recognition system comprises an advertisement image-text recognition module, an information augmentation module, a feature coding module, a model training module and an advertising-technique detection module. An advertising-technique detection function based on an image-text multimodal matching task is realized by combining natural language processing, computer vision, optical character recognition (OCR) and multimodal pre-training tasks, so that the text features and visual features of an advertising poster are combined from a multimodal perspective and the accuracy of advertising-technique detection is greatly improved. Furthermore, reverse text features are generated from the label information to increase the number of training samples, and a detection module based on the distance between image features and text features is introduced to predict the advertising techniques used in the advertising data, which greatly improves the accuracy of advertising-technique recognition on small-sample datasets.

Description

Multi-mode advertisement recognition method and system, electronic device and storage medium

Technical Field

The invention relates to the technical field of advertising promotion recognition, and in particular to a multimodal advertisement recognition method and system, an electronic device and a storage medium.

Background

An advertisement, as the name implies, announces something widely, that is, it informs the general public of something. Advertising can be defined in a broad sense and in a narrow sense. Non-economic advertising is advertising that does not aim at profit, such as announcements, notices and statements issued by governments, educational and cultural institutions, municipal bodies and social organizations. Economic advertising is advertising that aims at profit, typically commercial advertising: a paid means of disseminating information about goods or services to consumers or users through advertising media in order to promote those goods or services. Commercial advertisements belong to this economic category.

In an era of diversified information, advertising pervades people's lives. In advertising posters, merchants enhance the promotional effect by applying promotional techniques such as exaggeration, contrast and repetition in both the visual and the textual modality. Analyzing advertisements from the perspective of the promotional techniques they use is therefore key to understanding market advertising trends and improving the effectiveness of enterprises' advertising. However, current analyses of advertising information still focus on the goods and the advertisement content, and the available data are small-sample datasets. Traditional methods therefore cannot learn the features of promotional techniques from small-sample data in a focused way, nor analyze the effect of advertising dissemination from the perspective of multimodal promotional techniques. How to effectively analyze the promotional techniques of advertising posters is thus an urgent problem.

Disclosure of Invention

To address the above technical problems in the prior art, the invention provides a multimodal advertisement recognition method and system, an electronic device and a storage medium, aiming to solve the problem of how to effectively analyze the promotional techniques of advertising posters.

According to a first aspect of the present invention, there is provided a multimodal advertisement recognition system, comprising an advertisement image-text recognition module, an information augmentation module, a feature coding module, a model training module and an advertising-technique detection module;

the advertisement image-text recognition module is configured to convert the characters in a target advertisement into text information using optical character recognition (OCR) and to acquire the manually annotated label of the target advertisement;

the information augmentation module is configured to augment the text information into a forward text and a reverse text based on the manually annotated label, obtaining augmented text information;

the feature coding module is configured to encode the image information of the target advertisement and the augmented text information to construct a multimodal training dataset;

the model training module is configured to train a preset model on the multimodal training dataset using preset training tasks, obtaining an advertising-technique detection model;

and the advertising-technique detection module is configured to detect, based on the advertising-technique detection model, the promotional techniques used in an advertisement to be detected.

On the basis of the above technical solution, the invention can be further improved as follows.

Preferably, augmenting the text information into a forward text and a reverse text based on the manually annotated label comprises:

acquiring the manually annotated label and the text information of the target advertisement;

and semantically modifying the text information based on the manually annotated label to obtain a forward text whose semantics agree with the label and a reverse text whose semantics are opposite to the label.

Preferably, encoding the image information of the target advertisement and the augmented text information comprises:

encoding the image information and the augmented text information with a preset language model and a preset visual coding model to obtain a forward image-text feature pair and a reverse image-text feature pair of the target advertisement; the preset language model is BERT, XLNet or RoBERTa; the preset visual coding model is ResNet, VGG or Faster R-CNN.

Preferably, the model training module comprises an attribute prediction unit and a relation prediction unit;

the attribute prediction unit is configured to replace the attribute information in the forward feature pair with a preset mask token and to predict the replaced attribute information;

and the relation prediction unit is configured to mask the relation information in the forward feature pair with the preset mask token and to predict the masked relation.

Preferably, the model training module comprises a masked-region unit;

and the masked-region unit is configured to mask region information in the image information and to predict the masked image information.

Preferably, the model training module further comprises an image-text matching unit;

and the image-text matching unit is configured to predict the image-text relationship in the forward feature pair and in the reverse feature pair.

Preferably, the preset training tasks comprise an attribute prediction task, a relation prediction task, a masked-region task and an image-text matching task.

According to a second aspect of the present invention, there is provided a multimodal advertisement recognition method, comprising:

converting the characters in a target advertisement into text information using OCR, and acquiring the manually annotated label of the target advertisement;

augmenting the text information into a forward text and a reverse text based on the manually annotated label to obtain augmented text information;

encoding the image information of the target advertisement and the augmented text information to construct a multimodal training dataset;

training a preset model on the multimodal training dataset using preset training tasks to obtain an advertising-technique detection model;

and detecting, based on the advertising-technique detection model, the promotional techniques used in an advertisement to be detected.

According to a third aspect of the present invention, there is provided an electronic device comprising a memory and a processor, the processor being configured to implement the steps of any of the multimodal advertisement recognition methods of the second aspect when executing a computer program stored in the memory.

According to a fourth aspect of the present invention, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the multimodal advertisement recognition methods of the second aspect.

In summary, the invention provides a multimodal advertisement recognition method and system, an electronic device and a storage medium. The system comprises an advertisement image-text recognition module, which converts the characters in a target advertisement into text information using OCR and acquires its manually annotated label; an information augmentation module, which augments the text information into a forward text and a reverse text based on that label; a feature coding module, which encodes the image information of the target advertisement and the augmented text information to construct a multimodal training dataset; a model training module, which trains a preset model on that dataset using preset training tasks to obtain an advertising-technique detection model; and an advertising-technique detection module, which uses that model to detect the promotional techniques used in an advertisement to be detected.
By combining natural language processing, computer vision, OCR and multimodal pre-training tasks, the invention realizes advertising-technique detection as an image-text multimodal matching task, so that the text and visual features of an advertising poster are combined from a multimodal perspective and detection accuracy is greatly improved. In addition, reverse text features generated from the label information increase the number of training samples, and a detection module based on the distance between image features and text features predicts the advertising techniques used in the advertising data, which greatly improves recognition accuracy on small-sample datasets.

Drawings

FIG. 1 is a schematic structural diagram of a multi-modal advertisement recognition system according to the present invention;

FIG. 2 is a schematic diagram of a process for constructing advertisement training data according to the present invention;

FIG. 3 is a diagram illustrating encoding of image-text data provided by the present invention;

FIG. 4 is a schematic diagram of the data flow of the pre-training tasks provided by the present invention;

FIG. 5 is a schematic diagram of an advertisement detection process provided by the present invention;

FIG. 6 is a schematic overall flow chart of the multi-modal advertisement recognition method provided by the present invention;

FIG. 7 is a flow chart of a multi-modal advertisement recognition method provided by the present invention;

FIG. 8 is a schematic diagram of the hardware structure of a possible electronic device provided by the present invention;

FIG. 9 is a schematic diagram of the hardware structure of a possible computer-readable storage medium provided by the present invention.

Detailed Description

The following detailed description of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

FIG. 1 is a schematic structural diagram of the multimodal advertisement recognition system provided by the present invention. As shown in FIG. 1, the system comprises an advertisement image-text recognition module, an information augmentation module, a feature coding module, a model training module and an advertising-technique detection module.

The advertisement image-text recognition module converts the characters in a target advertisement into text information using OCR and acquires the manually annotated label of the target advertisement; the information augmentation module augments the text information into a forward text and a reverse text based on the manually annotated label to obtain augmented text information; the feature coding module encodes the image information of the target advertisement and the augmented text information to construct a multimodal training dataset; the model training module trains a preset model on the multimodal training dataset using preset training tasks to obtain an advertising-technique detection model; and the advertising-technique detection module detects, based on that model, the promotional techniques used in an advertisement to be detected.

It should be noted that the preset model may be an image-text prediction model composed of an image encoder and a text encoder, where the text encoder may be BERT, XLNet or RoBERTa and the image encoder may be ResNet, VGG or Faster R-CNN.

It will be appreciated that OCR is mainly used to recognize the characters in an advertising poster and convert them into machine-readable text information, and that the manually annotated label is the label information an annotator assigns to each advertising poster, including but not limited to exaggeration, contrast and/or repetition. After the advertisement image-text recognition module, the system represents each advertising poster with the data structure <Image, Text, Label>.
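The <Image, Text, Label> record can be sketched as a small Python data class. This is a minimal sketch under stated assumptions, not the patented implementation: the class name `AdSample`, the field types and the label vocabulary `TECHNIQUES` are hypothetical, with the vocabulary taken from the example labels named above.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical label vocabulary built from the example techniques in the text.
TECHNIQUES: Tuple[str, ...] = ("exaggeration", "contrast", "repetition")


@dataclass(frozen=True)
class AdSample:
    """One <Image, Text, Label> record produced by the recognition module."""

    image: str  # assumption: a file path or ID referring to the poster image
    text: str   # OCR output recognized from the poster
    label: str  # manually annotated promotional technique

    def __post_init__(self) -> None:
        # Reject labels outside the (hypothetical) technique vocabulary.
        if self.label not in TECHNIQUES:
            raise ValueError(f"unknown technique label: {self.label!r}")


sample = AdSample(image="poster_001.jpg", text="Buy one, get one free", label="exaggeration")
```

Freezing the data class keeps each annotated record immutable once it leaves the recognition module, which suits a dataset that is only read during training.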

As an embodiment, augmenting the text information into a forward text and a reverse text based on the manually annotated label comprises: acquiring the manually annotated label and the text information of the target advertisement; and semantically modifying the text information based on the label to obtain a forward text whose semantics agree with the label and a reverse text whose semantics are opposite to it.

It can be understood that the information augmentation module augments the text of the advertisement according to the label information of the poster to obtain two texts, one consistent with the label information and the other opposite to it. For example, prompt engineering is used to add two different prompt prefixes to the text: text prefixed with "My technology is <Label>" is denoted Text+, and text prefixed with "My technology is Not <Label>" is denoted Text-. In this way, semantic modification based on the manually annotated label and the text information yields the forward and reverse text information. Each of the two texts forms an image-text pair with the visual information of the original poster: the pair <Image, Text+> serves as a positive sample pair in subsequent model training, and the pair <Image, Text-> participates in subsequent model training as an additional negative sample pair. The flow is shown in FIG. 2.
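The prompt-based augmentation can be sketched as follows. This is a minimal illustration, assuming the two prefixes quoted above are simply prepended to the OCR text; the separator `". "`, the function name `augment` and the 1/0 pair flags are hypothetical choices, not details given in the source.

```python
from typing import List, Tuple

# Prompt templates quoted in the text; how the prefix is joined to the OCR
# text (here ". ") is an assumption.
POS_TEMPLATE = "My technology is {label}"
NEG_TEMPLATE = "My technology is Not {label}"


def augment(image: str, text: str, label: str) -> List[Tuple[str, str, int]]:
    """Return [(image, Text+, 1), (image, Text-, 0)] for one annotated poster.

    1 marks the positive pair <Image, Text+>, 0 the extra negative pair
    <Image, Text->.
    """
    text_pos = f"{POS_TEMPLATE.format(label=label)}. {text}"
    text_neg = f"{NEG_TEMPLATE.format(label=label)}. {text}"
    return [(image, text_pos, 1), (image, text_neg, 0)]


pairs = augment("poster_001.jpg", "Half price today only", "exaggeration")
```

Each annotated poster thus yields exactly one positive and one extra negative pair, which is how the augmentation doubles the number of image-text pairs available from a small dataset.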

As an embodiment, encoding the image information of the target advertisement and the augmented text information comprises: encoding the image information and the augmented text information with a preset language model and a preset visual coding model to obtain a forward image-text feature pair and a reverse image-text feature pair of the target advertisement; the preset language model is BERT, XLNet or RoBERTa; the preset visual coding model is ResNet, VGG or Faster R-CNN.

Referring to FIG. 3, which is a schematic diagram of the encoding of image-text data provided by the present invention, features are extracted from the image-text pairs <Image, Text+> and <Image, Text-> by an image encoder and a text encoder respectively, yielding two types of image-text feature pairs. Such encoders are mature in both industry and academia: for text there are pre-trained language models such as BERT, XLNet and RoBERTa, and for images there are visual coding models such as ResNet, VGG and Faster R-CNN. Both types of encoder are usually obtained by pre-training on large-scale data and can be called directly to encode the image data and text data, finally yielding the two types of image-text feature pairs <I, T+> and <I, T->.
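The construction of <I, T+> and <I, T-> can be sketched without a deep-learning framework. In this hedged sketch, `toy_text_encoder` is a hypothetical hashed bag-of-words stand-in for BERT, XLNet or RoBERTa, and the image feature is taken as a precomputed vector assumed to come from an image encoder such as ResNet and to be projected into the same dimension as the text features; the source does not say how the two feature spaces are aligned.

```python
import math
import zlib
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
FeaturePair = Tuple[Vector, Vector]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def toy_text_encoder(text: str, dim: int = 16) -> Vector:
    """Stand-in for BERT/XLNet/RoBERTa: a hashed bag of words (assumption).

    zlib.crc32 is used instead of hash() so results are stable across runs.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return l2_normalize(vec)


def encode_pairs(
    image_feat: Sequence[float],
    text_pos: str,
    text_neg: str,
    text_encoder: Callable[[str], Vector] = toy_text_encoder,
) -> Tuple[FeaturePair, FeaturePair]:
    """Build the <I, T+> and <I, T-> feature pairs for one poster.

    image_feat is assumed to be the output of an image encoder (e.g. ResNet),
    already projected to the same dimension as the text features.
    """
    i = l2_normalize(image_feat)
    return (i, text_encoder(text_pos)), (i, text_encoder(text_neg))


pos_pair, neg_pair = encode_pairs(
    [0.5] * 16, "My technology is contrast", "My technology is Not contrast"
)
```

Passing the text encoder as a parameter keeps the pairing logic independent of the model, so a pretrained language model could replace the toy encoder without changing `encode_pairs`.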

As an embodiment, the model training module comprises an attribute prediction unit, a relation prediction unit, a masked-region unit and an image-text matching unit;

the attribute prediction unit replaces the attribute information in the forward feature pair with a preset mask token and predicts the replaced attribute information; the relation prediction unit masks the relation information in the forward feature pair with the preset mask token and predicts the masked relation; the masked-region unit masks region information in the image information and predicts the masked image information; and the image-text matching unit predicts the image-text relationship in the forward feature pair and in the reverse feature pair.

It is to be understood that the model training module takes the two types of image-text features as input and trains the model with four cross-modal pre-training tasks: an attribute prediction task, a relation prediction task, a masked-region task and an image-text matching task.

Referring to FIG. 4, which is a schematic diagram of the data flow of the pre-training tasks provided by the present invention: in the attribute prediction task, the model identifies the words in the text that describe attributes of objects, masks them and replaces them with [MASK] (the mask token may be a preset letter, digit or character string); the masked attribute information is then predicted from the information in the image. In the relation prediction task, the words expressing relations between objects are masked, again replaced with [MASK], and the relation between the two objects is predicted, again from the content of the image. In the masked-region task, region information in the image is masked, the masked region usually being the most salient part of the image, and the model then predicts the masked image region from the description in the text. These three are among the most popular cross-modal pre-training tasks and train the model well, improving its ability to learn from multimodal data. In practice, they correspond to two cases: first, text information is masked and the correct masked information is predicted; second, image information is masked and then predicted using the text as a cue.
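The inputs to the three masking tasks can be sketched as below. This is a simplified illustration under stated assumptions: the source does not specify how attribute or relation words are located (a part-of-speech tagger or parser would be one option), so their positions are passed in directly, and the "most salient" regions are chosen from hypothetical per-region saliency scores such as detector confidences. The names `mask_tokens` and `mask_salient_regions` are illustrative.

```python
from typing import List, Optional, Sequence, Set, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: Sequence[str], positions: Set[int]
) -> Tuple[List[str], List[Optional[str]]]:
    """Attribute/relation prediction input: replace chosen words with [MASK].

    Returns the masked token list and per-position targets, where None means
    "no prediction needed at this position".
    """
    masked: List[str] = []
    targets: List[Optional[str]] = []
    for i, tok in enumerate(tokens):
        if i in positions:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


def mask_salient_regions(
    regions: Sequence[Sequence[float]], saliency: Sequence[float], k: int = 1
) -> Tuple[List[List[float]], List[int]]:
    """Masked-region input: zero out the k regions with the highest saliency.

    Returns the masked region features and the indices the model must predict.
    """
    order = sorted(range(len(regions)), key=lambda i: saliency[i], reverse=True)
    hidden = sorted(order[:k])
    masked = [
        [0.0] * len(r) if i in hidden else list(r) for i, r in enumerate(regions)
    ]
    return masked, hidden


# "red" is an attribute word and "on" a relation word in this toy caption.
words, labels = mask_tokens(["a", "red", "car", "on", "the", "road"], {1, 3})
regions, hidden = mask_salient_regions(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0.2, 0.9, 0.5]
)
```

Both helpers return the prediction targets alongside the masked input, which is the pairing the attribute, relation and masked-region losses would need.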

Since T- obtained in the feature coding module is a text that contradicts the original promotional technique, involving T- in these three pre-training tasks would undoubtedly increase the difficulty of model training. The three cross-modal pre-training tasks therefore use only the positive image-text feature pairs <I, T+>. In addition, the invention combines a self-supervised contrastive learning task with the image-text matching task: within a batch, positive and negative sample pairs are constructed automatically and optimized with the InfoNCE loss. The purpose of this task is to pull positive image-text pairs closer together and push negative image-text pairs farther apart. Among the pairs <I, T+>, an image feature and a text feature from the same sample form a positive pair, while image features and text features from different samples are taken as negative pairs; all pairs <I, T-> participate in this task as negative pairs. This design effectively avoids the shortcomings caused by insufficient information in small-sample data.
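The contrastive image-text matching objective can be sketched in plain Python. This is a hedged sketch, not the patented loss: it computes only the image-to-text direction, assumes L2-normalized features, assumes that for each image the candidate set is every forward text in the batch plus every reverse text in the batch (a superset of the <I, T-> pairs described above), and uses a temperature of 0.07 borrowed from common contrastive-learning practice. A real implementation would use an autodiff framework so the loss can be back-propagated.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce(
    images: Sequence[Vector],
    texts_pos: Sequence[Vector],
    texts_neg: Sequence[Vector],
    tau: float = 0.07,
) -> float:
    """Mean image-to-text InfoNCE loss over one batch.

    For image i the positive is texts_pos[i]; the negatives are texts_pos[j]
    for j != i (other samples in the batch) and every reverse text in
    texts_neg. Features are assumed L2-normalized, so dot = cosine similarity.
    """
    candidates: List[Vector] = list(texts_pos) + list(texts_neg)
    total = 0.0
    for i, img in enumerate(images):
        logits = [dot(img, t) / tau for t in candidates]
        m = max(logits)  # log-sum-exp with max subtraction for stability
        log_z = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_z - logits[i]  # -log softmax probability of the positive
    return total / len(images)


imgs = [[1.0, 0.0], [0.0, 1.0]]
negs = [[-1.0, 0.0], [0.0, -1.0]]
aligned = info_nce(imgs, [[1.0, 0.0], [0.0, 1.0]], negs)
swapped = info_nce(imgs, [[0.0, 1.0], [1.0, 0.0]], negs)
```

When each image is most similar to its own forward text the loss is close to zero (`aligned`), and when the forward texts are swapped it is large (`swapped`); appending the reverse texts to the candidate list is what adds the extra negatives.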

In this embodiment, the method further includes a detection step using the trained model; see FIG. 5, which is a schematic diagram of the advertisement detection process provided by the present invention. In this stage, the advertising poster is preprocessed, and whether a given promotional technique is used in the current poster is then predicted from the distance between the image and the text. As FIG. 5 shows, the advertising-technique detection module processes data in the same way as in the training phase: the text information is recognized by OCR, the text data are processed by prompt engineering, and the image encoder and text encoder from the training stage encode them respectively to obtain two text features and the corresponding image feature. The distances between the image feature and the two text features are then calculated, and whether a given promotional technique is used in the current sample is judged from these distances. Notably, the classification task is redefined in the prediction stage as a distance-based image-text matching task, which is conceptually identical to the contrastive-learning-based image-text matching task used in pre-training; this design further improves detection for small-sample categories.
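The distance-based decision can be sketched as follows, assuming cosine distance as the metric (the source says only "distance") and assuming the binary check is repeated once per candidate technique label. The names `uses_technique` and `detect_all`, and the optional `margin` threshold, are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)


def uses_technique(
    image_feat: Vector, t_pos: Vector, t_neg: Vector, margin: float = 0.0
) -> bool:
    """True if the image is closer to the "My technology is <Label>" text
    feature than to the "My technology is Not <Label>" text feature."""
    return cosine_distance(image_feat, t_pos) + margin < cosine_distance(image_feat, t_neg)


def detect_all(
    image_feat: Vector, prompt_feats: Dict[str, Tuple[Vector, Vector]]
) -> List[str]:
    """Apply the binary decision once per technique label (assumption)."""
    return [
        label
        for label, (t_pos, t_neg) in prompt_feats.items()
        if uses_technique(image_feat, t_pos, t_neg)
    ]


found = detect_all(
    [1.0, 0.0],
    {"contrast": ([0.9, 0.1], [0.1, 0.9]), "repetition": ([0.1, 0.9], [0.9, 0.1])},
)
```

Because the decision compares the image against a matching and a contradicting prompt rather than training a separate classifier head, it reuses the same geometry the contrastive pre-training optimized.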

It can be appreciated that, to address the deficiencies described in the background, the embodiments of the present invention provide the multimodal advertisement recognition system described above, comprising the advertisement image-text recognition module, the information augmentation module, the feature coding module, the model training module and the advertising-technique detection module. By combining the text and visual features of advertising posters, generating reverse text features from the label information to enlarge the training samples, and detecting promotional techniques from the distance between image and text features, the system greatly improves the accuracy of advertising-technique recognition on small-sample datasets.

In one possible application scenario, referring to FIG. 6, which is a schematic overall flowchart of the multimodal advertisement recognition method provided by the present invention, this embodiment is divided into two stages: a training phase and a prediction phase.

In the training phase, OCR is used to recognize the characters in the image, and the promotional techniques of the poster are annotated manually, yielding the data structure <Image, Text, Label>. Prompt engineering then adds two prompt templates to the advertisement text: text prefixed with "My technology is <Label>" is denoted Text+, and text prefixed with "My technology is Not <Label>" is denoted Text-. Text+ forms a positive image-text pair that matches the visual information, and Text- forms a negative image-text pair that does not. The two types of image-text pairs are then processed by the image encoder and the text encoder respectively to generate two types of image-text features, <I, T+> and <I, T->. In the pre-training stage, both types of features participate in several cross-modal pre-training tasks: the attribute prediction, relation prediction and masked-region tasks use only the positive image-text features, whereas the image-text matching task uses both. For the image-text matching task, the invention combines it with self-supervised contrastive learning, which increases the number of negative image-text pairs during training and strengthens the learning of promotional techniques for small-sample categories.
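The routing of the two feature types to the four pre-training tasks can be sketched as below. Strings stand in for feature tensors, and the function names are hypothetical. The source does not say how the four task losses are combined, so `total_loss` assumes a weighted sum with equal weights by default; that choice is an assumption, not part of the described method.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# (image feature id, text feature id, is_positive); ids stand in for tensors.
Pair = Tuple[str, str, bool]

MASKING_TASKS = ("attribute_prediction", "relation_prediction", "masked_region")


def route_to_tasks(batch: Sequence[Pair]) -> Dict[str, List[Pair]]:
    """Send <I, T+> pairs to every task and <I, T-> pairs only to matching."""
    positives = [pair for pair in batch if pair[2]]
    routed = {task: list(positives) for task in MASKING_TASKS}
    routed["image_text_matching"] = list(batch)
    return routed


def total_loss(
    task_losses: Dict[str, float], weights: Optional[Dict[str, float]] = None
) -> float:
    """Weighted sum of per-task losses; equal weights by default (assumption)."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * loss for name, loss in task_losses.items())


batch = [("I1", "T1+", True), ("I1", "T1-", False), ("I2", "T2+", True)]
routed = route_to_tasks(batch)
```

Keeping the reverse pairs out of the three masking tasks mirrors the reasoning in the description: a contradictory text would give those tasks misleading targets, while for matching it is a useful negative.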

In the prediction phase, the advertisement data are processed exactly as in the training phase: the text information in the advertisement is extracted, the two prompt templates are added by prompt engineering, and the image and the texts are input into the image encoder and the text encoder respectively. Unlike the training phase, the next step computes the distances between the image features and the text features under the two prompt templates, and these distances serve as the criterion for judging whether a given promotional technique is used in the advertisement.

Referring to FIG. 7, which is a flowchart of a multimodal advertisement recognition method according to an embodiment of the present invention, the multimodal advertisement recognition method comprises:

Step S100: converting the characters in a target advertisement into text information using OCR, and acquiring the manually annotated label of the target advertisement;

It should be noted that the executing entity of the method in this embodiment may be a computer terminal device capable of data processing, network communication and program execution, such as a computer or a tablet computer; it may also be a server device with similar functions, or a cloud server with similar functions, which is not limited in this embodiment. For ease of understanding, this embodiment and the following embodiments are described taking a server device as an example.

Step S200: augmenting the text information into a forward text and a reverse text based on the manually annotated label to obtain augmented text information;

Step S300: encoding the image information of the target advertisement and the augmented text information to construct a multimodal training dataset;

Step S400: training a preset model on the multimodal training dataset using preset training tasks to obtain an advertising-technique detection model;

Step S500: detecting, based on the advertising-technique detection model, the promotional techniques used in an advertisement to be detected.

It can be understood that the multimodal advertisement recognition method provided by the present invention corresponds to the multimodal advertisement recognition system of the foregoing embodiments; for its relevant technical features, refer to those of the system, which are not repeated here.

Referring to FIG. 8, which is a schematic diagram of an embodiment of an electronic device according to the present invention, this embodiment provides an electronic device comprising a memory 1310, a processor 1320, and a computer program 1311 stored in the memory 1310 and executable on the processor 1320, where the processor 1320 executes the computer program 1311 to implement the following steps:

converting the characters in a target advertisement into text information using OCR, and acquiring the manually annotated label of the target advertisement; augmenting the text information into a forward text and a reverse text based on the manually annotated label to obtain augmented text information; encoding the image information of the target advertisement and the augmented text information to construct a multimodal training dataset; training a preset model on the multimodal training dataset using preset training tasks to obtain an advertising-technique detection model; and detecting, based on the advertising-technique detection model, the promotional techniques used in an advertisement to be detected.

Referring to fig. 9, fig. 9 is a schematic diagram of an embodiment of a computer-readable storage medium according to the present invention. As shown in fig. 9, the present embodiment provides a computer-readable storage medium 1400 storing a computer program 1411 which, when executed by a processor, implements the steps of the multi-modal advertisement recognition method described above.

The embodiments of the invention provide a multi-modal advertisement recognition method, a multi-modal advertisement recognition system, an electronic device and a storage medium. The multi-modal advertisement recognition system comprises an advertisement image-text recognition module, an information augmentation module, a feature coding module, a model training module and an advertisement technology detection module. The advertisement image-text recognition module converts characters in the target advertisement into text information based on optical character recognition (OCR) technology and acquires a manually annotated label of the target advertisement. The information augmentation module augments the text information into a forward text and a reverse text based on the manually annotated label to obtain augmented text information. The feature coding module encodes the image information of the target advertisement and the augmented text information to construct a multi-modal data set to be trained. The model training module trains a preset model on the multi-modal data set to be trained, based on preset training tasks, to obtain an advertisement technology detection model. The advertisement technology detection module performs propaganda effect detection on the advertisement to be detected based on the advertisement technology detection model.
The invention builds on natural language processing, computer vision, OCR technology and multi-modal pre-training tasks to perform advertising technology detection as an image-text multi-modal matching task. The text features and visual features of an advertising poster are therefore combined from a multi-modal perspective, which greatly improves the accuracy of advertising technology detection. Further, reverse text features are generated from the label information to increase the number of training samples available to the model. An advertising technology detection module based on the distance between image features and text features is also introduced to predict the advertising technology used in the advertising data. Together, these measures greatly improve the accuracy of advertising technology identification on small-sample data sets.
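The distance-based detection module can be illustrated with a minimal sketch. The source does not specify the distance measure or the decision rule, so the sketch makes three assumptions. First, cosine similarity stands in for the distance (a distance of 1 minus the similarity). Second, each technique label has a single encoded text feature. Third, a fixed, hypothetical `threshold` decides which techniques are reported. The label names in the check are also hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def predict_techniques(image_feature, label_text_features, threshold=0.5):
    """Rank technique labels by how close their text feature is to the image feature.

    label_text_features: mapping from label to the encoded text feature for that label
    threshold: assumed cut-off; labels scoring below it are not reported
    """
    scores = {label: cosine_similarity(image_feature, feature)
              for label, feature in label_text_features.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [label for label, score in ranked if score >= threshold]
```

Returning a ranked list lets one advertisement be reported with several techniques at once. Keeping only the first returned label would give a single-label prediction instead.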

It should be noted that each of the foregoing embodiments has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the relevant descriptions of other embodiments.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While the preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A multi-modal advertisement recognition system, the system comprising: an advertisement image-text recognition module, an information augmentation module, a feature coding module, a model training module and an advertisement technology detection module;

the advertisement image-text recognition module is used for converting characters in the target advertisement into text information based on optical character recognition (OCR) technology and acquiring a manually annotated label of the target advertisement;

2. The multi-modal advertisement recognition system of claim 1, wherein augmenting the text information into the forward text and the reverse text based on the manually annotated label comprises:

3. The multi-modal advertisement recognition system of claim 1, wherein encoding the image information of the target advertisement and the augmented text information comprises:

4. The multi-modal advertisement recognition system of claim 3, wherein the model training module comprises: an attribute prediction unit and a relationship prediction unit;

the attribute prediction unit is used for replacing the attribute information in the forward text feature pair by using a preset replacement tag and predicting the replaced attribute information;

5. The multi-modal advertisement recognition system of claim 4, wherein the model training module further comprises: a mask region unit;

6. The multi-modal advertisement recognition system of claim 4, wherein the model training module further comprises: an image-text matching unit;

7. The multi-modal advertisement recognition system of claim 1, wherein the preset training tasks include an attribute prediction task, a relationship prediction task, a mask region task and an image-text matching task.

8. A multi-modal advertisement recognition method, comprising:

augmenting the text information into a forward text and a reverse text based on the manually annotated label to obtain augmented text information;

encoding the image information of the target advertisement and the augmented text information to construct a multi-modal data set to be trained;

9. An electronic device, comprising a memory and a processor, wherein the processor is configured to implement the steps of the multi-modal advertisement recognition method of claim 8 when executing a computer management program stored in the memory.

10. A computer-readable storage medium storing a computer management program which, when executed by a processor, implements the steps of the multi-modal advertisement recognition method of claim 8.