CN115658964A - Training method and apparatus for a pre-training model and a somatosensory style recognition model - Google Patents

Training method and apparatus for a pre-training model and a somatosensory style recognition model

Info

Publication number
CN115658964A
Authority
CN
China
Prior art keywords
sample
content
training
somatosensory
text
Prior art date
Legal status
Granted
Application number
CN202210572644.2A
Other languages
Chinese (zh)
Other versions
CN115658964B (en)
Inventor
刘刚
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN202210572644.2A
Publication of CN115658964A
Application granted
Publication of CN115658964B
Legal status: Active

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to a training method and apparatus for a pre-training model and a somatosensory style recognition model. The method comprises the following steps: acquiring sample data pairs, where each sample data pair comprises a content image sample and data description information corresponding to the content image sample; obtaining a content classification label corresponding to the content image sample in each sample data pair; performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, where the sample features comprise image features and text features; and training an initial pre-training model based on the sample features of each content image sample to obtain a target pre-training model, where the target pre-training model is used for training a somatosensory style recognition model, and the somatosensory style recognition model is used for recognizing the somatosensory style category of data information. By adopting the method, the accuracy of somatosensory style recognition can be ensured.

Description

Training method and apparatus for a pre-training model and a somatosensory style recognition model
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to a training method and apparatus for a pre-training model and a somatosensory style recognition model.
Background
With the rapid development of the internet and the lowering of the threshold for content production, the volume of distributed content of all kinds has grown exponentially. Whether the content is of an image-text type or a video type, the somatosensory style of content produced by different users varies widely. The somatosensory style refers to the intuitive feeling that content gives users; this feeling may come from the title seen by the user, the cover image of the content, the account of the author who publishes the content, and the like. There is therefore a need to classify content along the somatosensory-style dimension, which describes the style and tonality of the content. The somatosensory style is a manifestation of the overall style of the content: content with the same style and tonality tends to share certain commonalities, such as positive energy or light entertainment, and can resonate with a particular class of users.
At present, the classification of somatosensory styles in information streams is generally performed with unsupervised or weakly supervised methods, which require collecting a large number of data samples and performing cluster analysis on them. Because the somatosensory style is highly subjective, the resulting classification has low accuracy. How to ensure the accuracy of somatosensory style recognition is therefore an urgent problem to be solved.
Disclosure of Invention
In view of the above, it is desirable to provide a training method and apparatus for a pre-training model and a somatosensory style recognition model that can ensure the accuracy of somatosensory style recognition.
In a first aspect, the present application provides a training method for a pre-training model. The method comprises the following steps:
acquiring sample data pairs, wherein each sample data pair comprises a content image sample and data description information corresponding to the content image sample;
obtaining a content classification label corresponding to the content image sample in each sample data pair;
performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, wherein the sample features comprise image features and text features;
and training an initial pre-training model based on the sample features of each content image sample to obtain a target pre-training model, wherein the target pre-training model is used for training a somatosensory style recognition model, and the somatosensory style recognition model is used for recognizing the somatosensory style category of data information.
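For illustration only, the following minimal sketch shows how these four steps could be wired together in PyTorch. The encoder callables, the PretrainModel class, the feature dimensions and the hyper-parameters are hypothetical placeholders and are not specified by this application.

```python
import torch
import torch.nn as nn

class PretrainModel(nn.Module):
    """Hypothetical initial pre-training model: fuses image and text features
    and predicts a content classification label."""
    def __init__(self, img_dim=512, txt_dim=512, hidden=256, num_labels=100):
        super().__init__()
        self.fuse = nn.Linear(img_dim + txt_dim, hidden)
        self.cls = nn.Linear(hidden, num_labels)

    def forward(self, img_feat, txt_feat):
        fused = torch.relu(self.fuse(torch.cat([img_feat, txt_feat], dim=-1)))
        return self.cls(fused)

def pretrain(sample_pairs, labels, img_encoder, txt_encoder, epochs=1):
    """sample_pairs: list of (content_image, description_text) tensors;
    labels: LongTensor of content classification labels, one per pair."""
    model = PretrainModel()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        # Step 3: feature extraction for every sample data pair.
        img_feats = torch.stack([img_encoder(img) for img, _ in sample_pairs])
        txt_feats = torch.stack([txt_encoder(txt) for _, txt in sample_pairs])
        # Step 4: train the initial pre-training model on the sample features.
        logits = model(img_feats, txt_feats)
        loss = loss_fn(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model  # target pre-training model, later used to train the style recognition model
```

The embodiments below refine this skeleton with augmented image views, masked text and additional loss terms.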
In one embodiment, the data description information comprises: a processed content image sample obtained by processing the content image sample, and text information corresponding to the content image sample;
performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample comprises:
respectively performing image feature extraction on each content image sample and the corresponding processed content image sample to obtain a first image feature and a second image feature of each content image sample; and
respectively performing text feature extraction on the content classification label and the text information corresponding to each content image sample to obtain a first text feature and a second text feature of each content image sample;
wherein the image features comprise the first image features and the second image features, and the text features comprise the first text features and the second text features.
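A minimal sketch of this feature-extraction step is shown below, assuming simple linear encoders; the actual backbones, feature dimensions, vocabulary size and augmentation operations are not specified by this application, and the names image_encoder, text_encoder and augment are illustrative.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# Hypothetical encoders; any image/text backbone producing fixed-size vectors would do.
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512))
text_encoder = nn.EmbeddingBag(num_embeddings=30000, embedding_dim=512)
augment = T.Compose([T.RandomResizedCrop(224), T.RandomHorizontalFlip()])

def extract_sample_features(content_image, label_token_ids, text_token_ids):
    """content_image: (3, 224, 224) tensor; *_token_ids: 1-D LongTensors."""
    processed_image = augment(content_image)                # processed content image sample
    f_img_1 = image_encoder(content_image.unsqueeze(0))     # first image feature: original image
    f_img_2 = image_encoder(processed_image.unsqueeze(0))   # second image feature: processed image
    f_txt_1 = text_encoder(label_token_ids.unsqueeze(0))    # first text feature: content classification label
    f_txt_2 = text_encoder(text_token_ids.unsqueeze(0))     # second text feature: title / publisher text
    return f_img_1, f_img_2, f_txt_1, f_txt_2
```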
In one embodiment, respectively performing text feature extraction on the content classification label and the text information corresponding to each content image sample to obtain a first text feature and a second text feature of each content image sample comprises:
performing text feature extraction on the content classification label corresponding to each content image sample to obtain each first text feature;
segmenting the text information corresponding to each content image sample to obtain a text sequence corresponding to each content image sample; and
performing mask processing on each text sequence, wherein in the mask-processed text sequence part of the text tokens are replaced with mask tokens, and generating each second text feature based on each mask-processed text sequence.
In one embodiment, the text sequence corresponding to a content image sample comprises a plurality of text tokens;
performing mask processing on each text sequence comprises:
calculating the contribution degree of each text token in each text sequence, wherein the contribution degree is the degree to which the text token contributes to predicting the content classification label; and
determining the key text tokens in each text sequence according to the contribution degree of each text token in the text sequence, and taking the key text tokens in each text sequence as the text tokens to be replaced in that text sequence.
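One plausible, occlusion-based reading of this contribution-based masking is sketched below; the mask-token id, the masking ratio and the contribution measure (the drop in the label's log-probability when a token is occluded) are assumptions, since the application does not fix a particular measure.

```python
import torch

MASK_ID = 103  # assumed mask-token id (BERT-style vocabularies use 103 for [MASK])

def mask_key_tokens(token_ids, label_logprob_fn, mask_ratio=0.15):
    """token_ids: 1-D LongTensor of the text sequence.
    label_logprob_fn(token_ids) -> log-probability of the content classification label.
    A token's contribution is the drop in that log-probability when the token is occluded."""
    base = float(label_logprob_fn(token_ids))
    contributions = []
    for i in range(len(token_ids)):
        occluded = token_ids.clone()
        occluded[i] = MASK_ID
        contributions.append(base - float(label_logprob_fn(occluded)))
    contributions = torch.tensor(contributions)
    k = max(1, int(mask_ratio * len(token_ids)))
    key_positions = contributions.topk(k).indices  # key text tokens: highest contribution
    masked = token_ids.clone()
    masked[key_positions] = MASK_ID                # replace the key tokens with mask tokens
    return masked, key_positions
```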
In one embodiment, training the initial pre-training model based on the sample features of each content image sample to obtain the target pre-training model comprises:
constructing a multi-modal feature corresponding to each content image sample based on the sample features of the content image sample; and
training the initial pre-training model based on the multi-modal feature of each content image sample to obtain the target pre-training model.
In one embodiment, training the initial pre-training model based on the multi-modal feature of each content image sample to obtain the target pre-training model comprises:
training the initial pre-training model based on at least one of a fusion feature and the sample features of each content image sample, together with the multi-modal feature, to obtain the target pre-training model;
wherein the fusion features comprise a first fusion feature and a second fusion feature, the first fusion feature is constructed based on the first image feature and the second text feature of the content image sample, and the second fusion feature is constructed based on the second image feature and the second text feature of the content image sample.
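As a concrete reading of this embodiment, the multi-modal feature and the two fusion features could be built from the four sample features by simple concatenation; concatenation is only one possible construction and is an assumption here.

```python
import torch

def build_features(f_img_1, f_img_2, f_txt_1, f_txt_2):
    """Each argument is a (1, d) feature tensor, e.g. from extract_sample_features."""
    # Multi-modal feature: all sample features of the content image sample together.
    multimodal = torch.cat([f_img_1, f_img_2, f_txt_1, f_txt_2], dim=-1)
    # First fusion feature: first image feature + second text feature.
    fused_1 = torch.cat([f_img_1, f_txt_2], dim=-1)
    # Second fusion feature: second image feature + second text feature.
    fused_2 = torch.cat([f_img_2, f_txt_2], dim=-1)
    return multimodal, fused_1, fused_2
```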
In one embodiment, training the initial pre-training model based on at least one of the fusion features and the sample features of each content image sample, together with the multi-modal features, to obtain the target pre-training model comprises,
during training:
obtaining a predicted content classification label corresponding to each content image sample based on the multi-modal feature, and calculating cross-entropy loss information corresponding to each content image sample from the predicted content classification label and the content classification label;
calculating similarity loss information corresponding to each content image sample based on at least one of the fusion features and the sample features of the content image sample; and
updating model parameters of the initial pre-training model based on the cross-entropy loss information and the similarity loss information.
In one embodiment, calculating the similarity loss information corresponding to each content image sample based on at least one of the fusion features and the sample features of the content image sample comprises:
calculating a first similarity between the first image feature and the second image feature of each content image sample;
calculating a second similarity between the text sub-features in the second text feature of each content image sample;
calculating a third similarity between the first fusion feature and the second fusion feature of each content image sample; and
obtaining the similarity loss information corresponding to each content image sample according to each first similarity, each second similarity and each third similarity.
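A minimal sketch of one way to turn these three similarities into a loss term follows, using cosine similarity and rewarding agreement between the two views of the same sample; the aggregation (summing the three terms) is an assumption, as the application does not fix a particular formula.

```python
import torch
import torch.nn.functional as F

def similarity_loss(f_img_1, f_img_2, txt_sub_feats, fused_1, fused_2):
    """txt_sub_feats: list of text sub-feature tensors taken from the second text feature."""
    sim_1 = F.cosine_similarity(f_img_1, f_img_2, dim=-1).mean()   # first similarity
    pair_sims = [F.cosine_similarity(a, b, dim=-1).mean()
                 for i, a in enumerate(txt_sub_feats) for b in txt_sub_feats[i + 1:]]
    sim_2 = torch.stack(pair_sims).mean() if pair_sims else torch.tensor(1.0)  # second similarity
    sim_3 = F.cosine_similarity(fused_1, fused_2, dim=-1).mean()   # third similarity
    # Both views describe the same sample, so higher similarity should mean lower loss.
    return 3.0 - (sim_1 + sim_2 + sim_3)
```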
In one embodiment, training the initial pre-training model based on at least one of the fusion features and the sample features of each content image sample, together with the multi-modal features, to obtain the target pre-training model further comprises:
during training, evaluating image-text matching based on the first image feature and the second text feature of each content image sample to obtain an image-text matching degree corresponding to each content image sample;
and updating the model parameters of the initial pre-training model based on the cross-entropy loss information and the similarity loss information comprises:
updating the model parameters of the initial pre-training model based on the cross-entropy loss information, the similarity loss information and the image-text matching degree.
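Putting the loss terms of this and the preceding embodiments together, the parameter update could be driven by a weighted sum such as the sketch below; the weights and the binary matched/unmatched formulation of the image-text matching degree are assumptions.

```python
import torch.nn.functional as F

def total_pretrain_loss(cls_logits, labels, sim_loss, itm_logits, itm_labels,
                        w_sim=1.0, w_itm=1.0):
    """cls_logits/labels: content-classification prediction and ground truth;
    itm_logits/itm_labels: image-text matching prediction (matched vs. not matched)."""
    ce_loss = F.cross_entropy(cls_logits, labels)       # cross-entropy loss information
    itm_loss = F.cross_entropy(itm_logits, itm_labels)  # image-text matching degree term
    return ce_loss + w_sim * sim_loss + w_itm * itm_loss
```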
In a second aspect, the present application provides a training method for a somatosensory style recognition model. The method comprises the following steps:
acquiring somatosensory style training samples and a somatosensory style label corresponding to each somatosensory style training sample;
training an initial somatosensory style recognition model based on each somatosensory style training sample to obtain a trained somatosensory style recognition model, wherein the somatosensory style recognition model is used for recognizing the somatosensory style category of data information;
wherein the initial somatosensory style recognition model is obtained as follows:
acquiring sample data pairs, wherein each sample data pair comprises a content image sample and data description information corresponding to the content image sample;
obtaining a content classification label corresponding to the content image sample in each sample data pair;
performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, wherein the sample features comprise image features and text features;
and training an initial pre-training model based on the sample features of each content image sample to obtain a target pre-training model, and taking the target pre-training model as the initial somatosensory style recognition model.
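A hedged sketch of this second-aspect fine-tuning step: the target pre-training model (here, any encoder producing a fixed-size feature) is reused as the initial somatosensory style recognition model and trained on style-labelled samples. The number of style categories, the classification head and the hyper-parameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_style_recognition_model(pretrained_encoder, feat_dim, train_feats, style_labels,
                                  num_styles=20, epochs=3, lr=1e-4):
    """pretrained_encoder: target pre-training model mapping inputs to feat_dim vectors;
    train_feats: batched inputs of the somatosensory style training samples;
    style_labels: LongTensor of somatosensory style labels."""
    head = nn.Linear(feat_dim, num_styles)           # new style-classification head
    model = nn.Sequential(pretrained_encoder, head)  # initial somatosensory style recognition model
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        loss = loss_fn(model(train_feats), style_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model  # trained somatosensory style recognition model
```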
In a third aspect, the present application provides a somatosensory style recognition method. The method comprises the following steps:
acquiring data information to be recognized;
based on the data information to be recognized, obtaining, through a somatosensory style recognition model, a predicted somatosensory style label corresponding to the data information to be recognized, wherein the predicted somatosensory style label is used for describing the somatosensory style category of the data information to be recognized;
wherein the somatosensory style recognition model is obtained as follows:
acquiring somatosensory style training samples and a somatosensory style label corresponding to each somatosensory style training sample;
training an initial somatosensory style recognition model based on each somatosensory style training sample to obtain the trained somatosensory style recognition model;
and the initial somatosensory style recognition model is obtained as follows:
acquiring sample data pairs, wherein each sample data pair comprises a content image sample and data description information corresponding to the content image sample;
obtaining a content classification label corresponding to the content image sample in each sample data pair;
performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, wherein the sample features comprise image features and text features;
and training an initial pre-training model based on the sample features of each content image sample to obtain a target pre-training model, and taking the target pre-training model as the initial somatosensory style recognition model.
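For the third aspect, inference with the trained model could look like the following sketch; the style category names are purely illustrative examples and not a list defined by this application.

```python
import torch

STYLE_NAMES = ["healing", "positive energy", "campus", "fashion"]  # illustrative categories only

@torch.no_grad()
def recognize_style(style_model, feature_extractor, data_info):
    """data_info: the data information to be recognized (e.g., cover image plus title)."""
    feats = feature_extractor(data_info)      # image/text features of the data information
    probs = style_model(feats).softmax(dim=-1)
    return STYLE_NAMES[int(probs.argmax())]   # predicted somatosensory style label
```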
In a fourth aspect, the present application further provides a training apparatus for a pre-training model. The apparatus comprises:
an acquisition module, configured to acquire sample data pairs, wherein each sample data pair comprises a content image sample and data description information corresponding to the content image sample, and to obtain a content classification label corresponding to the content image sample in each sample data pair;
a processing module, configured to perform feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, wherein the sample features comprise image features and text features; and
a first training module, configured to train an initial pre-training model based on the sample features of each content image sample to obtain a target pre-training model, wherein the target pre-training model is used for training a somatosensory style recognition model, and the somatosensory style recognition model is used for recognizing the somatosensory style category of data information.
In a fifth aspect, the present application further provides a training apparatus for a somatosensory style recognition model. The apparatus comprises:
an acquisition module, configured to acquire somatosensory style training samples and the somatosensory style label corresponding to each somatosensory style training sample; and
a second training module, configured to train an initial somatosensory style recognition model based on each somatosensory style training sample to obtain a trained somatosensory style recognition model;
wherein the initial somatosensory style recognition model is obtained as follows:
acquiring sample data pairs, wherein each sample data pair comprises a content image sample and data description information corresponding to the content image sample;
obtaining a content classification label corresponding to the content image sample in each sample data pair;
performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, wherein the sample features comprise image features and text features;
and training an initial pre-training model based on the sample features of each content image sample to obtain a target pre-training model, and taking the target pre-training model as the initial somatosensory style recognition model.
In a sixth aspect, the present application further provides a somatosensory style recognition apparatus. The apparatus comprises:
an acquisition module, configured to acquire data information to be recognized; and
a recognition module, configured to obtain, through a somatosensory style recognition model and based on the data information to be recognized, a predicted somatosensory style label corresponding to the data information to be recognized, wherein the predicted somatosensory style label is used for describing the somatosensory style category of the data information to be recognized;
wherein the somatosensory style recognition model is obtained as follows:
acquiring somatosensory style training samples and the somatosensory style label corresponding to each somatosensory style training sample;
training an initial somatosensory style recognition model based on each somatosensory style training sample to obtain the trained somatosensory style recognition model;
and the initial somatosensory style recognition model is obtained as follows:
acquiring sample data pairs, wherein each sample data pair comprises a content image sample and data description information corresponding to the content image sample;
obtaining a content classification label corresponding to the content image sample in each sample data pair;
performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, wherein the sample features comprise image features and text features;
and training an initial pre-training model based on the sample features of each content image sample to obtain a target pre-training model, and taking the target pre-training model as the initial somatosensory style recognition model.
In a seventh aspect, the present application further provides a computer device. The computer device comprises a memory storing a computer program and a processor that, when executing the computer program, implements the following steps:
acquiring sample data pairs, wherein each sample data pair comprises a content image sample and data description information corresponding to the content image sample;
obtaining a content classification label corresponding to the content image sample in each sample data pair;
performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, wherein the sample features comprise image features and text features;
and training an initial pre-training model based on the sample features of each content image sample to obtain a target pre-training model, wherein the target pre-training model is used for training a somatosensory style recognition model, and the somatosensory style recognition model is used for recognizing the somatosensory style category of data information.
In an eighth aspect, the present application further provides a computer device. The computer device comprises a memory storing a computer program and a processor that, when executing the computer program, implements the following steps:
acquiring somatosensory style training samples and a somatosensory style label corresponding to each somatosensory style training sample;
training an initial somatosensory style recognition model based on each somatosensory style training sample to obtain a trained somatosensory style recognition model, wherein the somatosensory style recognition model is used for recognizing the somatosensory style category of data information;
wherein the initial somatosensory style recognition model is obtained as follows:
acquiring sample data pairs, wherein each sample data pair comprises a content image sample and data description information corresponding to the content image sample;
obtaining a content classification label corresponding to the content image sample in each sample data pair;
performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, wherein the sample features comprise image features and text features;
and training an initial pre-training model based on the sample features of each content image sample to obtain a target pre-training model, and taking the target pre-training model as the initial somatosensory style recognition model.
In a ninth aspect, the present application further provides a computer device. The computer device comprises a memory storing a computer program and a processor that, when executing the computer program, implements the following steps:
acquiring data information to be recognized;
based on the data information to be recognized, obtaining, through a somatosensory style recognition model, a predicted somatosensory style label corresponding to the data information to be recognized, wherein the predicted somatosensory style label is used for describing the somatosensory style category of the data information to be recognized;
wherein the somatosensory style recognition model is obtained as follows:
acquiring somatosensory style training samples and a somatosensory style label corresponding to each somatosensory style training sample;
training an initial somatosensory style recognition model based on each somatosensory style training sample to obtain the trained somatosensory style recognition model;
and the initial somatosensory style recognition model is obtained as follows:
acquiring sample data pairs, wherein each sample data pair comprises a content image sample and data description information corresponding to the content image sample;
obtaining a content classification label corresponding to the content image sample in each sample data pair;
performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, wherein the sample features comprise image features and text features;
and training an initial pre-training model based on the sample features of each content image sample to obtain a target pre-training model, and taking the target pre-training model as the initial somatosensory style recognition model.
In a tenth aspect, the present application further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the following steps:
acquiring sample data pairs, wherein each sample data pair comprises a content image sample and data description information corresponding to the content image sample;
obtaining a content classification label corresponding to the content image sample in each sample data pair;
performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, wherein the sample features comprise image features and text features;
and training an initial pre-training model based on the sample features of each content image sample to obtain a target pre-training model, wherein the target pre-training model is used for training a somatosensory style recognition model, and the somatosensory style recognition model is used for recognizing the somatosensory style category of data information.
In an eleventh aspect, the present application further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the following steps:
acquiring somatosensory style training samples and a somatosensory style label corresponding to each somatosensory style training sample;
training an initial somatosensory style recognition model based on each somatosensory style training sample to obtain a trained somatosensory style recognition model, wherein the somatosensory style recognition model is used for recognizing the somatosensory style category of data information;
wherein the initial somatosensory style recognition model is obtained as follows:
acquiring sample data pairs, wherein each sample data pair comprises a content image sample and data description information corresponding to the content image sample;
obtaining a content classification label corresponding to the content image sample in each sample data pair;
performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, wherein the sample features comprise image features and text features;
and training an initial pre-training model based on the sample features of each content image sample to obtain a target pre-training model, and taking the target pre-training model as the initial somatosensory style recognition model.
In a twelfth aspect, the present application further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the following steps:
acquiring data information to be recognized;
based on the data information to be recognized, obtaining, through a somatosensory style recognition model, a predicted somatosensory style label corresponding to the data information to be recognized, wherein the predicted somatosensory style label is used for describing the somatosensory style category of the data information to be recognized;
wherein the somatosensory style recognition model is obtained as follows:
acquiring somatosensory style training samples and a somatosensory style label corresponding to each somatosensory style training sample;
training an initial somatosensory style recognition model based on each somatosensory style training sample to obtain the trained somatosensory style recognition model;
and the initial somatosensory style recognition model is obtained as follows:
acquiring sample data pairs, wherein each sample data pair comprises a content image sample and data description information corresponding to the content image sample;
obtaining a content classification label corresponding to the content image sample in each sample data pair;
performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, wherein the sample features comprise image features and text features;
and training an initial pre-training model based on the sample features of each content image sample to obtain a target pre-training model, and taking the target pre-training model as the initial somatosensory style recognition model.
In a thirteenth aspect, the present application further provides a computer program product comprising a computer program which, when executed by a processor, implements the following steps:
acquiring sample data pairs, wherein each sample data pair comprises a content image sample and data description information corresponding to the content image sample;
obtaining a content classification label corresponding to the content image sample in each sample data pair;
performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, wherein the sample features comprise image features and text features;
and training an initial pre-training model based on the sample features of each content image sample to obtain a target pre-training model, wherein the target pre-training model is used for training a somatosensory style recognition model, and the somatosensory style recognition model is used for recognizing the somatosensory style category of data information.
In a fourteenth aspect, the present application further provides a computer program product comprising a computer program which, when executed by a processor, implements the following steps:
acquiring somatosensory style training samples and a somatosensory style label corresponding to each somatosensory style training sample;
training an initial somatosensory style recognition model based on each somatosensory style training sample to obtain a trained somatosensory style recognition model, wherein the somatosensory style recognition model is used for recognizing the somatosensory style category of data information;
wherein the initial somatosensory style recognition model is obtained as follows:
acquiring sample data pairs, wherein each sample data pair comprises a content image sample and data description information corresponding to the content image sample;
obtaining a content classification label corresponding to the content image sample in each sample data pair;
performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, wherein the sample features comprise image features and text features;
and training an initial pre-training model based on the sample features of each content image sample to obtain a target pre-training model, and taking the target pre-training model as the initial somatosensory style recognition model.
In a fifteenth aspect, the present application further provides a computer program product comprising a computer program which, when executed by a processor, implements the following steps:
acquiring data information to be recognized;
based on the data information to be recognized, obtaining, through a somatosensory style recognition model, a predicted somatosensory style label corresponding to the data information to be recognized, wherein the predicted somatosensory style label is used for describing the somatosensory style category of the data information to be recognized;
wherein the somatosensory style recognition model is obtained as follows:
acquiring somatosensory style training samples and a somatosensory style label corresponding to each somatosensory style training sample;
training an initial somatosensory style recognition model based on each somatosensory style training sample to obtain the trained somatosensory style recognition model;
and the initial somatosensory style recognition model is obtained as follows:
acquiring sample data pairs, wherein each sample data pair comprises a content image sample and data description information corresponding to the content image sample;
obtaining a content classification label corresponding to the content image sample in each sample data pair;
performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, wherein the sample features comprise image features and text features;
and training an initial pre-training model based on the sample features of each content image sample to obtain a target pre-training model, and taking the target pre-training model as the initial somatosensory style recognition model.
According to the above training methods and apparatuses for the pre-training model and the somatosensory style recognition model, the computer device, the storage medium and the computer program product, sample data pairs are acquired, each comprising a content image sample and the data description information corresponding to the content image sample; the content classification label corresponding to the content image sample in each sample data pair is obtained; and feature extraction is performed on each sample data pair and the content classification label corresponding to each content image sample to obtain the sample features of each content image sample, the sample features comprising image features and text features. The initial pre-training model is then trained based on the sample features of each content image sample to obtain the target pre-training model, which is used for training the somatosensory style recognition model that recognizes the somatosensory style category of data information. Because each sample data pair comprises a content image sample and its corresponding data description information, the sample features of each content image sample can describe the feature information contained in the content image sample from multiple dimensions, so the target pre-training model can learn more of the feature information contained in the content image samples, which improves the accuracy of the trained target pre-training model. On this basis, the target pre-training model is further trained to obtain the somatosensory style recognition model, which further improves the accuracy of the somatosensory style recognition model and thus the accuracy of recognizing the somatosensory style category of data information.
Drawings
FIG. 1 is a diagram of an embodiment of an application environment for a method for training a pre-training model;
FIG. 2 is a block diagram of the somatosensory style recognition system in one embodiment;
FIG. 3 is a schematic flow chart diagram illustrating a method for training a pre-trained model according to one embodiment;
FIG. 4 is a schematic flow chart for obtaining sample features for various content map samples, under an embodiment;
FIG. 5 is a diagram illustrating data enhancement of a content image sample in one embodiment;
FIG. 6 is a schematic diagram of acquiring a first image feature and a second image feature in one embodiment;
FIG. 7 is a diagram that illustrates obtaining a first textual feature and a second textual feature, in one embodiment;
FIG. 8 is a schematic flow diagram illustrating obtaining first textual features and second textual features, in one embodiment;
FIG. 9 is a diagram illustrating obtaining second text features, under an embodiment;
FIG. 10 is a flow diagram that illustrates masking of text sequences, according to one embodiment;
FIG. 11 is a flow diagram that illustrates the determination of key text labels, in one embodiment;
FIG. 12 is a schematic flow diagram illustrating a portion of a method for training a pre-trained model in accordance with one embodiment;
FIG. 13 is a diagram that illustrates obtaining multimodal features, under an embodiment;
FIG. 14 is a partial flow diagram illustrating a method for training a pre-trained model according to another embodiment;
FIG. 15 is a schematic illustration of constructing a first fused feature and a second fused feature in one embodiment;
FIG. 16 is a partial flow diagram illustrating a method for training a pre-trained model in accordance with yet another embodiment;
FIG. 17 is a flow diagram illustrating the determination of similarity loss information in one embodiment;
FIG. 18 is a partial flow diagram illustrating a method for training a pre-trained model according to yet another embodiment;
FIG. 19 is a flowchart illustrating a method of feature processing during the training process in one embodiment;
FIG. 20 is a schematic flow diagram illustrating the training of a somatosensory style recognition model in one embodiment;
FIG. 21 is a schematic flow diagram illustrating the process of obtaining the somatosensory style label corresponding to each somatosensory style training sample in one embodiment;
FIG. 22 is a diagram of somatosensory style labels and corresponding descriptions in one embodiment;
FIG. 23 is a flowchart illustrating a somatosensory style recognition method in one embodiment;
FIG. 24 is a schematic overall flowchart of a somatosensory style recognition method in one embodiment;
FIG. 25 is a schematic structural diagram of a training apparatus for a pre-training model in one embodiment;
FIG. 26 is a schematic structural diagram of a training apparatus for a somatosensory style recognition model in one embodiment;
FIG. 27 is a schematic structural diagram of a somatosensory style recognition apparatus in one embodiment;
FIG. 28 is a diagram illustrating an internal structure of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely intended to illustrate the present application and are not intended to limit it.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision making. Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) technology is a science that studies how to make machines "see"; more specifically, it uses cameras and computers, in place of human eyes, to perform machine vision tasks such as recognizing and measuring targets, and further performs graphics processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field therefore involves natural language, i.e. the language that people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specifically studies how computers can simulate or realize human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The solutions provided in the embodiments of the present application relate to artificial-intelligence technologies such as image processing, text processing and machine learning, and are specifically described in the following embodiments:
the training method of the pre-training model provided by the embodiment of the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104, or may be placed on the cloud or other server.
Specifically, taking application to the server 104 as an example, the server 104 may obtain the sample data pairs and the content classification label corresponding to the content image sample in each sample data pair from the data storage system. The server 104 then performs feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain the sample features of each content image sample, where the sample features include image features and text features. On this basis, the server 104 trains the initial pre-training model based on the sample features of each content image sample to obtain a target pre-training model, where the target pre-training model is used for training a somatosensory style recognition model, and the somatosensory style recognition model is used for recognizing the somatosensory style category of data information.
Alternatively, the terminal 102 may obtain the sample data pairs and the content classification label corresponding to the content image sample in each sample data pair through communication with the server 104. The terminal 102 then performs feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain the sample features of each content image sample, where the sample features include image features and text features. On this basis, the terminal 102 trains the initial pre-training model to obtain a target pre-training model, and the target pre-training model is used for training a somatosensory style recognition model, which recognizes the somatosensory style category of data information.
The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices and portable wearable devices, and the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle-mounted devices, aircrafts, and the like. The portable wearable device can be a smart watch, a smart bracelet, a head-mounted device, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers. The embodiment of the invention can be applied to various scenes including but not limited to cloud technology, artificial intelligence, intelligent traffic, driving assistance and the like.
Specifically, the training method for the pre-training model provided in the embodiments of the present application can be applied to the somatosensory style recognition system shown in FIG. 2. The training method for the pre-training model, the training method for the somatosensory style recognition model and the somatosensory style recognition method are described below; the main functions of each service module are as follows:
1. content production end 201
Professionally Generated Content (PGC), User Generated Content (UGC) and Multi-Channel Network (MCN) producers provide image-text and video content through an application programming interface (API) system and are the main content sources of the content production end 201. The content production end 201 uploads image-text content by communicating with the uplink/downlink content interface server 203; the image-text content source is usually a lightweight publishing and editing entry, while video content is usually published from a shooting client.
2. Content consuming side 202
The content consumption end 202 communicates with the uplink/downlink content interface server 203 to obtain index information of the content to be accessed via push recommendation, and then communicates with the content storage server 204 to obtain the corresponding content, including recommended content and content on subscribed topics. The content storage server 204 stores content entities such as video source files and picture source files, while meta information of the content, such as title, author, cover image, category and tag information, is stored in the content database 205. In addition, the content consumption end 202 can report behavior data generated during uploading and downloading, such as playback, pauses, loading time and play clicks, to the back end for statistical analysis. The content consumption end 202 also browses content data, and data from various external channels enters the system through the content consumption end 202 via the uplink/downlink content interface server 203.
3. Uplink and downlink contents interface server 203
The uplink/downlink content interface server 203 communicates directly with the content production end 201. Content submitted from the front end, usually information such as the title, publisher, abstract, cover image and publishing time of the content, is stored in the content database 205. The uplink/downlink content interface server 203 can also write meta information of the image-text content, such as file size, cover image link, title, publishing time and author, into the content database 205. Further, the uplink/downlink content interface server 203 synchronizes the content submitted by the content production end 201 to the dispatch center server 206 for subsequent content processing and circulation.
4. Content storage server 204
The content storage server 204 stores content entity information other than the meta information of the content, such as video source files and picture source files of image-text content; when consuming video content, the terminal accesses the source files directly from the content storage server 204. In addition, when the labels corresponding to samples are extracted, frames extracted from the middle of the video source file are provided, and these extracted frames serve as a candidate set of samples.
5. Content database 205
All meta information of the content published by the content production end 201 is stored in the content database 205. The core is the meta information of the content itself, such as file size, cover image link, bit rate, file format, title, publishing time, author, video file size, video format, and whether the content is marked as original or first-published; it further includes the classification of the content obtained during the manual review process, which comprises first-level, second-level and third-level classifications and tag information. For example, for an article about mobile phone A, the first-level classification is technology, the second-level classification is smartphones, the third-level classification is domestic mobile phones, and the tag information is A. In addition, when the manual review system 207 performs manual review, it reads the information in the content database 205, and the review results and status produced by the manual review system 207 are also written back to the content database 205.
Further, the content processing performed by the dispatch center server mainly includes machine processing and manual review. The machine processing core performs various quality judgments such as low-quality filtering, assigns content labels such as classification and tag information, and performs content de-duplication in the de-duplication server 208. Specifically, the de-duplication result is written into the content database 205, and completely duplicated content is not passed on to the manual review system 207, so that repeated manual processing is avoided. When subsequent modeling and recognition need information such as content titles, cover images and tags, the meta information of the content is read from the content database 205.
6. Dispatch center server 206
The dispatch center server 206 is responsible for the entire scheduling process of content circulation. It receives incoming content through the uplink/downlink content interface server 203 and then obtains the meta information of the content from the content database 205. It also schedules the manual review system 207 and the machine processing system, and controls the scheduling order and priority. The content index information obtained by the content consumption end 202, i.e. the entry address for accessing content, can be provided to the content consumption end 202 through the display page of a content export distribution service (typically a recommendation engine, a search engine or an operation page). Further, by communicating with the content somatosensory style recognition service 209, the somatosensory style category of data information is recognized and marked while the information-stream content circulates.
7. Manual review system 207
The manual review system 207 is the carrier of manual service capability. It is mainly used for reviewing sensitive data information and other data information that machines cannot determine or judge. The manual review system 207 can also perform secondary confirmation by labeling the classification labels of special types of videos, thereby ensuring the labeling effect and quality.
8. Somatosensory style recognition service 209 and somatosensory style recognition model 210
With the training method for the somatosensory style recognition model provided in this application, the somatosensory style training samples and the somatosensory style label corresponding to each somatosensory style training sample are obtained from the somatosensory style training sample database 211, and the somatosensory style recognition model 210 is trained on the basis of the obtained target pre-training model. The somatosensory style recognition service 209 is performed based on the somatosensory style recognition model 210.
9. Target pre-training model 212 and multi-modal pre-training sample database 213
With the training method for the pre-training model provided in this application, the sample data pairs and the content classification label corresponding to the content image sample in each sample data pair are obtained from the multi-modal pre-training sample database 213, and the target pre-training model 212 is obtained through training.
10. Crawling and data preprocessing system 214
The crawling and data preprocessing system 214 crawls corresponding content images from the internet based on the information-stream content, so as to supplement the pre-training data relevant to the corresponding domain.
11. Video framing and teletext content parsing service 215
The video frame-extraction and image-text content parsing service 215 is used to retrieve the necessary video frames from the video source file to provide the original data source for constructing subsequent video cover images. Alternatively, when the image-text content contains multiple pictures, the video frame-extraction and image-text content parsing service 215 parses the image-text content and extracts the pictures that may serve as cover images; these pictures, together with the cover image uploaded by the original author, are used as input.
On this basis, in an embodiment, as shown in FIG. 3, a training method for a pre-training model is provided. The method is described here as being applied to the server in FIG. 1; it can also be applied to a terminal, or to a system comprising a terminal and a server and be implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
step 302, obtaining each sample data pair, where the sample data pair includes a content graph sample and data description information corresponding to the content graph sample.
The sample data pair includes a content map sample and the data description information corresponding to the content map sample. The content map sample may be a cover picture sample or a thumbnail sample. The data description information specifically includes a processed content map sample obtained by processing the content map sample and text information corresponding to the content map sample, and the text information specifically includes the Title (Title) of the content map sample, the publisher name (Puin_Name) of the content map sample, and the like.
Specifically, the server first obtains a video file sample set, which may be a plurality of video file samples downloaded from a database or uploaded by a terminal, without limitation here. The server then calls the video frame extraction and image-text content parsing service to acquire video file frames from each video file sample, and the acquired video file frames provide the original data source for constructing the content map samples.
Further, the server constructs a content image sample based on the obtained multiple video file frames, and performs data enhancement processing on each content image sample to obtain each processed content image sample. Secondly, text information extraction is carried out on each content graph sample by calling a graph-text content analysis service to obtain text information corresponding to each content graph sample, and data description information corresponding to each content graph sample can be formed through the processed content graph sample corresponding to each content graph sample and the text information, so that each sample data pair comprising the content graph sample and the data description information corresponding to the content graph sample is obtained.
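For ease of understanding, a minimal Python sketch of how one sample data pair might be assembled is given below. The helper name, the use of OpenCV and torchvision, and the particular data enhancement choices (cropping and color transformation) are illustrative assumptions and do not limit the present solution.

import cv2
from PIL import Image
from torchvision import transforms

def build_sample_pair(video_path: str, title: str, publisher_name: str):
    """Assemble one sample data pair (illustrative helper, not part of the solution)."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()                       # one video file frame as the content map sample
    cap.release()
    if not ok:
        raise ValueError(f"could not read a frame from {video_path}")
    content_map = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    # data enhancement processing: cropping and color transformation used here as examples
    enhance = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.ColorJitter(brightness=0.4, contrast=0.4),
    ])
    processed_map = enhance(content_map)

    data_description = {
        "processed_map": processed_map,          # processed content map sample
        "title": title,                          # text information: Title
        "publisher_name": publisher_name,        # text information: publisher name
    }
    return content_map, data_description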
And step 304, obtaining content classification labels corresponding to the content map samples in each sample data pair.
The content classification tags are used to describe the categories of content information included in the content map samples, and each content map sample may correspond to one or more content classification tags. For example, if the content map sample shows animals on a lawn, the content classification labels may be cat, dog, lawn, etc.
Step 306, performing feature extraction on each sample data pair and the content classification label corresponding to each content map sample to obtain sample features of each content map sample, where the sample features include image features and text features.
The image features include the image features corresponding to the content map sample and the image features corresponding to the images included in the data description information of the sample data pair. Similarly, the text features include the text features corresponding to the content classification labels and the text features corresponding to the text included in the data description information of the sample data pair.
And 308, training the initial pre-training model based on the sample characteristics of each content map sample to obtain a target pre-training model, wherein the target pre-training model is used for training to obtain a somatosensory painting recognition model, and the somatosensory painting recognition model is used for recognizing the somatosensory painting category of data information.
The somatosensory picture type is used for describing the style and tone corresponding type of data information, and the data information can be texts, pictures, videos or music. Specifically, the style and the tone are the manifestation of the overall style of the data information, and the data information with the same style and the tone has a certain commonality, so that resonance of a class of users can be caused, for example: eye care, healing, middle aged and elderly people, campus, and fashion. Therefore, style and tonality are a kind of overall feeling given to the user, and may be auditory feeling, such as relaxation and cheerfulness, etc., or visual feeling, such as pleasure and sadness, etc.
Based on the method, the sample characteristics of each content graph sample are used as the input of an initial pre-training model, the initial pre-training model outputs the predicted content classification labels of each content graph sample, and the initial pre-training model is trained based on the predicted content classification labels and the content classification labels to obtain a target pre-training model. The target pre-training model obtained at this time is the target pre-training model 212 in fig. 2, the target pre-training model is used for training to obtain the somatosensory painting recognition model, and the somatosensory painting recognition model is specifically used for recognizing the somatosensory painting category of the data information.
In the method for training the pre-training model, the sample data pair comprises the content graph samples and the data description information corresponding to the content graph samples, so that the sample characteristics of each content graph sample can describe the characteristic information included in the content graph sample from multiple dimensions, the acquired target pre-training model can learn more characteristic information included in the content graph samples, and the accuracy of the trained target pre-training model can be improved. Based on the method, the target pre-training model is used for further training to obtain the somatosensory picture wind recognition model, namely, the accuracy of the somatosensory picture wind recognition model is further improved, and the accuracy of the somatosensory picture wind type of the recognition data information is further improved.
In one embodiment, as shown in fig. 4, the data description information includes: the content image processing method comprises the steps of processing a content image sample, processing the content image sample, and text information corresponding to the content image sample;
step 306, performing feature extraction on each sample data pair and the content classification label corresponding to each content map sample to obtain sample features of each content map sample, which specifically includes:
step 402, performing image feature extraction on each content map sample and the processed content map sample respectively to obtain a first image feature and a second image feature of each content map sample.
The data description information comprises a processed content image sample after the content image sample is processed. Specifically, the processing of the content map sample is specifically data enhancement processing of the content map sample, such as rotation, cropping, gaussian noise, masking, color transformation, and filtering. For ease of understanding, as shown in fig. 5, after the color change processing is performed on the content map sample 502, a processed content map sample 504 is obtained. Next, the content map sample 506 is subjected to a cropping process, and a processed content map sample 508 is obtained. It should be understood that the foregoing examples are only for the understanding of the data enhancement process and are not limiting of the present application.
Based on the method, the server performs image feature extraction on each content pattern sample to obtain a first image feature corresponding to each content pattern sample. Similarly, the server performs image feature extraction on the processed content map samples included in the data description information to obtain second image features corresponding to the processed content map samples.
Specifically, the image feature extraction may be performed by using a Vision Transformer (ViT) model. As shown in fig. 6, the image features of the content map sample 602 and the processed content map sample 604 are respectively extracted by the Vision Transformer model, so as to obtain a first image feature 606 corresponding to the content map sample 602 and a second image feature 608 corresponding to the processed content map sample 604.
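As a hedged illustration only, image feature extraction with a ViT encoder could look like the sketch below (assuming a recent version of the Hugging Face transformers library); the checkpoint name and the use of the [CLS] vector as the image feature are assumptions.

import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# checkpoint name is an assumption; any ViT encoder plays the same role here
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

def image_feature(img: Image.Image) -> torch.Tensor:
    inputs = processor(images=img, return_tensors="pt")
    with torch.no_grad():
        out = vit(**inputs)
    return out.last_hidden_state[:, 0]           # [CLS] token used as the image feature

# first_image_feature  = image_feature(content_map)    # from the content map sample
# second_image_feature = image_feature(processed_map)  # from the processed content map sample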
Step 404, performing text feature extraction on the content classification labels and the text information corresponding to the content image samples respectively to obtain first text features and second text features of the content image samples.
The data description information comprises text information corresponding to the content graph sample. Based on the above, the server performs text feature extraction on the content classification labels corresponding to the content pattern samples to obtain first text features corresponding to the content classification labels corresponding to the content pattern samples. Similarly, the server extracts the text features of the text information included in each data description information to obtain second text features corresponding to each text information.
It is understood that the text information in the data description information may specifically include text describing the content of the content map sample, such as the title of the content map sample and the publisher name of the content map sample, which are not exhaustively listed here. Accordingly, the second text feature corresponding to each piece of text information may include a text sub-feature corresponding to the title of the content map sample, a text sub-feature corresponding to the publisher name of the content map sample, and so on, which are likewise not exhaustively listed here.
Specifically, the text feature extraction may be performed by using a Bidirectional Encoder Representations from Transformers (BERT) model. Taking as an example that the text information in the data description information includes the title of the content map sample and the publisher name of the content map sample, as shown in fig. 7, text feature extraction is performed on the content classification tag 702 and the text information 704 respectively through the BERT model, so that a first text feature 706 corresponding to the content classification tag 702 and a second text feature 708 corresponding to the text information 704 are obtained. Since the text information includes the title and the publisher name of the content map sample, the second text feature 708 specifically includes a text sub-feature 7082 corresponding to the title of the content map sample and a text sub-feature 7084 corresponding to the publisher name of the content map sample.
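For illustration only, text feature extraction with a BERT encoder could be sketched as follows; the Chinese checkpoint name and the use of the [CLS] vector as the text feature are assumptions.

import torch
from transformers import BertModel, BertTokenizer

# a Chinese BERT checkpoint is assumed since titles and publisher names are Chinese text
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def text_feature(text: str) -> torch.Tensor:
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = bert(**enc)
    return out.last_hidden_state[:, 0]            # [CLS] vector as the text feature

# first_text_feature    = text_feature("cat; dog; lawn")   # content classification labels
# title_sub_feature     = text_feature(title)              # text sub-feature of the title
# publisher_sub_feature = text_feature(publisher_name)     # text sub-feature of the publisher name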
At step 406, the image features include a first image feature and a second image feature, and the text features include a first text feature and a second text feature.
Specifically, since the first image feature and the second image feature can be obtained after the image feature extraction in step 402, that is, the image features in the sample features of each content map sample specifically include the first image feature and the second image feature. Similarly, the first text feature and the second text feature can be obtained after the text feature extraction in step 402, that is, the text features in the sample features of each content map sample specifically include the first text feature and the second text feature.
In this embodiment, image feature extraction is performed on each content map sample and on the processed content map sample, so that image features are extracted from multiple perspectives, which improves the richness of the image feature information contained in the obtained image features. Likewise, text feature extraction is performed on the content classification labels and on the text information corresponding to each content map sample, so that text features are also extracted from multiple perspectives, improving the richness of the text feature information contained in the obtained text features. Subsequent model training can therefore learn feature information of more dimensions, which further improves the accuracy of the target pre-training model obtained by training.
In one embodiment, as shown in fig. 8, in step 404, performing text feature extraction on the content classification label and the text information corresponding to each content graph sample, to obtain a first text feature and a second text feature of each content graph sample, includes:
step 802, performing text feature extraction on the content classification labels corresponding to the content graph samples to obtain first text features.
The first text feature is a text feature corresponding to the content classification label.
And 804, performing text division on the text information corresponding to each content pattern to obtain a text sequence corresponding to each content pattern.
Wherein, the text sequence corresponding to each content map sample comprises a plurality of text marks (Token). Specifically, a plurality of text labels can be obtained after text division is performed on text information corresponding to the content graph samples, and a text sequence corresponding to the content graph samples is formed based on the text labels.
For example, suppose the text information corresponding to a content map sample is "Dragged the girl for tens of meters! The girl cried". Dividing this text information yields a plurality of text marks: [Dragged], [the], [girl], [for], [tens], [of], [meters], [!], [The], [girl], [cried], and the corresponding text sequence is formed from these marks: [Dragged] [the] [girl] [for] [tens] [of] [meters] [!] [The] [girl] [cried]. Similarly, if the text information corresponding to the content map sample is "Do you know how two cats quarrel?", dividing it yields the text marks [Do], [you], [know], [how], [two], [cats], [quarrel], [?], and the corresponding text sequence is [Do] [you] [know] [how] [two] [cats] [quarrel] [?]. It should be understood that the foregoing examples are only for understanding the text sequences described in the present solution and should not be construed as limiting the present solution.
Step 806, performing mask processing on each text sequence, replacing part of text marks in the text sequence after the mask processing with mask marks, and generating each second text feature based on each text sequence after the mask processing.
Masking means covering (masking) part of the text marks in a text sequence, that is, replacing those text marks with mask marks. In this embodiment, random blank padding with 0 is adopted, i.e. the mask mark is specifically [mask] and carries no other information. For example, for the text sequence [Dragged] [the] [girl] [for] [tens] [of] [meters] [!] [The] [girl] [cried], the masked text sequence may be [Dragged] [the] [girl] [for] [tens] [of] [meters] [!] [The] [girl] [mask], or [Dragged] [the] [mask] [for] [tens] [of] [meters] [!] [The] [girl] [cried].
Specifically, each text sequence is subjected to mask processing so that part of its text marks are replaced with mask marks, yielding a masked text sequence, and each second text feature is then generated from the masked text sequences. For ease of understanding, take the text information "Do you know how two cats quarrel?" as an example. As shown in fig. 9, the text information 902 is divided into the text sequence corresponding to the content map sample, and the text sequence is masked to obtain a masked text sequence 904, which may be: [Do] [you] [know] [how] [mask] [cats] [mask] [?]. The masked text sequence 904 then undergoes text feature extraction by the BERT model to output a second text feature 906.
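For ease of understanding, a minimal sketch of the mask processing and of generating a second text feature from the masked sequence follows; the checkpoint, the 15% masking ratio, and the use of the [CLS] vector are assumptions and do not limit the present solution.

import random
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
encoder = BertModel.from_pretrained("bert-base-chinese")

text = "Do you know how two cats quarrel?"        # text information of one content map sample
token_ids = tokenizer(text, return_tensors="pt")["input_ids"][0].tolist()

# replace part of the text marks with the mask mark; a 15% ratio is an assumption
masked = token_ids[:]
for pos in range(1, len(masked) - 1):             # keep [CLS] and [SEP] intact
    if random.random() < 0.15:
        masked[pos] = tokenizer.mask_token_id

with torch.no_grad():
    out = encoder(torch.tensor([masked]))
second_text_feature = out.last_hidden_state[:, 0]  # feature generated from the masked sequence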
In this embodiment, the text sequence including the mask mark enables the text feature to focus on the context information corresponding to the mask mark, that is, the text feature can include more correlations between text information, and further enriches the obtained text feature.
In the process of identifying the somatosensory painting style of data information, the causes behind each somatosensory painting style category are different, so each somatosensory painting style label has its own emphasis. For example, negative-energy labels are mainly triggered by certain negative-energy keywords, while emotion-exaggeration labels are mainly triggered by certain emotional words or even punctuation marks (such as exclamation marks). To ensure that the obtained somatosensory painting style recognition model can identify the somatosensory style of data information more accurately, the characters or punctuation marks that trigger these key judgments need special treatment during the training of the pre-training model, so that these key components can be captured.
Based on this, in one embodiment, as shown in fig. 10, the text sequence corresponding to the content map sample includes a plurality of text labels;
step 806, performing mask processing on each text sequence, including:
step 1002, calculating the contribution degree of each text label in each text sequence, wherein the contribution degree is the contribution degree of the text label to the content classification label prediction.
The contribution degree is the contribution of a text mark to the content classification label prediction, that is, it measures how much each text mark contributes to predicting the correct content classification label.
Specifically, the server calculates the contribution degree of each text mark in each text sequence based on the following formula (1):
S(w_i) = P(y_t | s) − P(y_t | s'_{i−1} w_i);    (1)
where S(w_i) denotes the contribution degree of the text mark w_i to the content classification label prediction; s denotes the text sequence and w_i denotes the i-th text mark in the text sequence; y_t denotes the content classification label to be predicted; P(y_t | s) is the predicted probability of the content classification label given the full text sequence; and s'_{i−1} denotes the sequence composed of w_1, w_2, …, w_{i−1}, so that s'_{i−1} w_i is the sequence truncated after w_i.
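For ease of understanding, the following sketch applies formula (1) literally; prob_fn is a hypothetical callable standing in for a classifier that returns the probability of a content classification label given a list of text marks, and the ranking helper corresponds to the key-mark selection of the following step.

def contribution_scores(tokens, target_label, prob_fn):
    """S(w_i) = P(y_t | s) - P(y_t | s'_{i-1} w_i), following formula (1) literally.

    prob_fn(token_list, label) is a hypothetical helper returning the classifier's
    probability of `label` given the token list."""
    full_prob = prob_fn(tokens, target_label)            # P(y_t | s)
    scores = []
    for i in range(len(tokens)):
        prefix = tokens[: i + 1]                         # s'_{i-1} followed by w_i
        scores.append(full_prob - prob_fn(prefix, target_label))
    return scores

def top_key_marks(tokens, scores, k=1):
    """Rank by contribution degree and keep the top-k marks (used in step 1004)."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in order[:k]]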
Step 1004, determining the key text marks in each text sequence according to the contribution degree of each text mark in each text sequence, and determining the key text marks in each text sequence as the replaced text marks in the text sequence.
The key text mark may include one text mark or a plurality of text marks, and the key text mark is a text mark with a higher contribution degree in the text sequence. Based on this, the server may rank the contribution degree of each text label in each text sequence from high to low, and determine the text label with a higher contribution degree in the text sequence as a key text label in each text sequence, or determine the key text label in each text sequence through a key text label model, which is not limited here. Therefore, the server determines the key text marks in each text sequence as the replaced text marks in the text sequence, namely the text marks which are subjected to mask replacement are the key text marks.
For example, the text sequence includes a text label 1, a text label 2, a text label 3, and a text label 4, and the contribution degrees corresponding to the text label 1, the text label 2, the text label 3, and the text label 4 are ranked from high to low, specifically: the contribution degree of the text label 1, the contribution degree of the text label 4, the contribution degree of the text label 2, and the contribution degree of the text label 3, it may be determined that the contribution degree of the text label 1 is the highest, that is, the text label 1 may be used as a key text label based on requirements, and then the text label 1 is replaced with a mask label [ mask ] when performing the masking processing.
Take the text sequence [Motorcycle] [dragged] [the] [girl] [for] [tens] [of] [meters] [!] [The] [girl] [cried] as an example. If the random masking strategy of BERT is used, the masked text sequence may be: [Motorcycle] [dragged] [the] [girl] [for] [mask] [of] [meters] [!] [The] [girl] [cried]. Based on the method for determining key text marks provided in this embodiment, the masked text sequence instead becomes: [Motorcycle] [dragged] [the] [girl] [for] [tens] [of] [meters] [!] [The] [girl] [mask].
For convenience of understanding, in the following example, the key text labels are determined by using a key text label model, as shown in fig. 11, a text sequence sample 1102 with a small data size is obtained first, then a key text label 1104 corresponding to the text sequence sample 1102 is obtained by using the foregoing formula (1), then a text sequence sample with a large data size is obtained from a text sequence database 1106, and the text sequence sample with a large data size and the key text labels 1104 corresponding to the text sequence samples are used as input of a key text label model 1108, that is, the key text labels 1110 of each text sequence sample can be output by using the key text label model 1108.
In this embodiment, by calculating the contribution degree of each text mark in each text sequence, the key text mark that most affects the content classification label prediction result is determined from the text sequence, and this key text mark is the one replaced during mask processing. The text features therefore pay attention to the context information around the key text mark, so the correlations captured in the text features have a greater influence on the content classification label prediction result, further improving their accuracy. Secondly, by replacing the random masking strategy of BERT in this way, the pre-training model can learn more information that is useful for the somatosensory picture wind category during training, thereby improving the accuracy of the somatosensory picture wind recognition model obtained in subsequent training.
In one embodiment, as shown in fig. 12, step 308, training the initial pre-training model based on the sample features of each content map sample to obtain a target pre-training model, including:
and step 1202, constructing multi-modal characteristics corresponding to each content map sample based on the sample characteristics of each content map sample.
The multi-modal feature is obtained by performing multi-modal feature interaction on the image feature and the text feature, and the multi-modal feature interaction means that the image feature and the text feature of the content map sample are interacted. Specifically, the server performs cross-attention (cross-attention) feature extraction on a first image feature, a second image feature, a first text feature and a second text feature in sample features of each content map sample to construct a multi-modal feature corresponding to each content map sample.
For ease of understanding, as shown in fig. 13, by a method similar to that described in the foregoing embodiments, a first image feature 1302 corresponding to the content map sample and a second image feature 1304 corresponding to the processed content map sample are obtained through a Vision Transformer (ViT) model, and a first text feature 1306 corresponding to the content classification label and a second text feature 1308 corresponding to the text information are obtained through a BERT model. Cross-attention feature extraction is then performed on the first image feature 1302, the second image feature 1304, the first text feature 1306, and the second text feature 1308 of each content map sample to obtain a multi-modal feature 1310.
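For illustration only, a minimal cross-attention fusion module is sketched below; the feature dimension, the number of heads, the toy tensor shapes, and the choice of text as the query are assumptions and do not limit the present solution.

import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Cross-attention between text features and image features (dimensions assumed)."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # text marks act as queries and attend to image patches (key/value)
        fused, _ = self.cross_attn(text_tokens, image_tokens, image_tokens)
        return self.norm(fused + text_tokens)

# toy shapes: 2 samples, image tokens from both content map variants, text tokens from labels and description
image_tokens = torch.randn(2, 2 * 197, 768)
text_tokens = torch.randn(2, 64, 768)
multi_modal_feature = CrossModalFusion()(text_tokens, image_tokens)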
And 1204, training the initial pre-training model based on the multi-modal characteristics of each content graph sample to obtain a target pre-training model.
Specifically, the server trains the initial pre-training model based on the multi-modal features of each content map sample obtained in step 1202, so as to obtain the target pre-training model.
In the embodiment, the cross-attention feature extraction is performed on the image features and the text features of different dimensions in the sample features of each content map sample, and the obtained multi-modal features are fused with the dimension features on the basis of describing the multi-dimensional features, so that the correlation between the image features and the text features can be described, and the accuracy of the trained target pre-training model is further improved.
In one embodiment, as shown in fig. 14, step 1204, training the initial pre-training model based on the multi-modal features of each content graph sample, to obtain a target pre-training model, includes:
and 1402, training the initial pre-training model based on at least one of the fusion characteristics and the sample characteristics of each content map sample and the multi-modal characteristics to obtain a target pre-training model.
The fusion features comprise a first fusion feature and a second fusion feature, and the first fusion feature is a feature constructed based on a first image feature and a second text feature of the content image sample; the second fused feature is a feature constructed based on a second image feature and a second text feature of the content map sample. It is to be understood that the first fused feature is a feature constructed based on the first image feature of the content map sample and at least one text sub-feature of the second text feature, and the second fused feature is a feature constructed based on the second image feature of the content map sample and at least one text sub-feature of the second text feature.
For ease of understanding, taking the example that the second text feature includes a text sub-feature corresponding to the title of the content map sample and a text sub-feature corresponding to the publisher name of the content map sample, as shown in fig. 15, a first fused feature 1506 is constructed based on the first image feature 1502 and the text sub-feature 1504 corresponding to the title of the content map sample, and a second fused feature 1512 is constructed based on the second image feature 1508 and the text sub-feature 1510 corresponding to the publisher name of the content map sample. It should be understood that, in practical applications, other pairings may also be used, such as constructing the first fused feature from the first image feature and the text sub-feature corresponding to the publisher name of the content map sample, and the second fused feature from the second image feature and the text sub-feature corresponding to the title of the content map sample; the specific objects and manner of fusion are not limited herein.
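A minimal sketch of one possible construction of the fused features is given below; concatenation followed by a shared linear projection is an assumption, since the embodiment does not fix the fusion operation, and the tensors stand in for the features extracted earlier.

import torch
import torch.nn as nn

dim = 768                                       # feature dimension, assumed
first_image_feat  = torch.randn(4, dim)         # from the content map samples
second_image_feat = torch.randn(4, dim)         # from the processed content map samples
title_feat        = torch.randn(4, dim)         # text sub-feature: title
publisher_feat    = torch.randn(4, dim)         # text sub-feature: publisher name

fuse = nn.Linear(2 * dim, dim)                  # one shared projection, for illustration

# pairing follows fig. 15; other pairings are equally admissible
first_fused  = fuse(torch.cat([first_image_feat,  title_feat],     dim=-1))
second_fused = fuse(torch.cat([second_image_feat, publisher_feat], dim=-1))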
Specifically, the server may train the initial pre-training model based on at least one of the fusion features and the sample features of each content map sample and the multi-modal features to obtain the target pre-training model according to the requirements of practical applications. In other words, the server may train the initial pre-training model based on the fusion features and the multi-modal features of each content graph sample to obtain a target pre-training model. Or training the initial pre-training model based on the sample characteristics and the multi-modal characteristics to obtain a target pre-training model. Or training the initial pre-training model based on the fusion characteristics and the sample characteristics of each content graph sample to obtain a target pre-training model.
In this embodiment, in the process of training the initial pre-training model, at least one of the fusion feature and the sample feature is further introduced based on consideration of the fusion feature, so that the pre-training model learns more feature information in the training process, and the accuracy of obtaining the somatosensory painting wind recognition model in the subsequent training is further improved.
In the following, how to train the initial pre-training model based on the fusion features and sample features of each content map sample to obtain a detailed implementation of the target pre-training model will be described in detail, and it should be understood that the implementation only considering the fusion features or sample features is similar to the subsequent steps, and therefore will not be described in detail.
Based on this, in an embodiment, as shown in fig. 16, in step 1402, the training process of training the initial pre-training model based on at least one of the fusion features and the sample features of each content map sample and the multi-modal features to obtain the target pre-training model may specifically include the following processing processes:
and 1602, obtaining the predicted content classification labels corresponding to the content map samples based on the multi-modal features, and calculating cross entropy loss information corresponding to the content pattern samples through the predicted content classification labels and the content classification labels.
The cross entropy loss information is used for describing errors between the predicted content classification labels and the content classification labels, and particularly, the cross entropy errors between the predicted content classification labels and the content classification labels are calculated through a cross entropy loss function.
Specifically, in the training process, the initial pre-training model can obtain the predicted content classification label corresponding to each content map sample based on each multi-modal feature, where the predicted content classification label is used to describe the predicted category of the content information included in the content map sample, and each content map sample may correspond to one or more predicted content classification labels, which is not limited here. Based on this, the server uses a cross entropy loss function to calculate the cross entropy error between the predicted content classification label and the content classification label corresponding to each content map sample, and this cross entropy error serves as the cross entropy loss information corresponding to the content map sample.
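For illustration, the single-label case is sketched below with a standard cross-entropy loss; the label-space size, the classification head, and the toy features are assumptions.

import torch
import torch.nn as nn

num_labels = 1000                               # size of the content classification label space, assumed
head = nn.Linear(768, num_labels)               # classification head on the pooled multi-modal feature

multi_modal_feat = torch.randn(4, 768)          # pooled multi-modal features of 4 content map samples
labels = torch.tensor([3, 17, 52, 3])           # content classification labels (single-label case)

logits = head(multi_modal_feat)
cross_entropy_loss = nn.CrossEntropyLoss()(logits, labels)   # cross entropy loss information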
Step 1604, calculating and determining similarity loss information corresponding to each content pattern sample based on at least one of the fusion characteristics and the sample characteristics of each content pattern sample.
Wherein the similarity loss information is used to describe the degree of similarity between multiple features. Specifically, the server calculates the similarity between the first fusion feature and the second fusion feature through a similarity algorithm, and calculates the similarity between the sample features of each dimension among the sample features, so as to obtain the similarity loss information corresponding to each content map sample. The similarity algorithm may be a Euclidean distance similarity algorithm, a cosine similarity algorithm, or the like, and is not limited herein.
And 1606, updating model parameters of the initial pre-training model based on the cross entropy loss information and the similarity loss information.
Specifically, the server updates the model parameters of the initial pre-training model through the cross entropy loss information and the similarity loss information calculated in the previous steps. Therefore, after repeated iteration updating, when the loss function of the initial pre-training model reaches convergence, the target pre-training model is generated based on the model parameters of the initial pre-training model updated for the last time.
In this embodiment, the cross entropy loss information describes the error between the predicted content classification label and the content classification label, and the similarity loss information describes the degree of similarity between multiple features, which improves the accuracy and richness of the loss information of the pre-training model. Therefore, when the model parameters are updated, both the error between the predicted labels and the real labels and the degree of similarity between features are taken into account, making the model training process more reliable, that is, the obtained target pre-training model more accurate.
In one embodiment, as shown in fig. 17, step 1604, calculating and determining similarity loss information corresponding to each content pattern sample based on at least one of the fusion feature and the sample feature of each content pattern sample, includes:
at step 1702, a first similarity between a first image feature and a second image feature of each content map sample is calculated.
Wherein the first similarity is used for describing the similarity between the image features. Specifically, the server calculates the similarity between the first image feature and the second image feature by a similarity algorithm, so as to obtain the first similarity of each content map sample. The similarity calculation method may be an euclidean distance similarity calculation method, a cosine similarity calculation method, or the like, and is not limited herein.
At step 1704, a second similarity between the text sub-features in the second text feature of each content map sample is calculated.
And the second similarity is used for describing the similarity between the text sub-features. Specifically, the server calculates the similarity between the text sub-features in the second text feature through a similarity algorithm, so as to obtain the second similarity of each content map sample. For example, if the second text feature includes a text sub-feature corresponding to the title of the content diagram sample, and a text sub-feature corresponding to the publisher name of the content diagram sample is taken as an example for explanation, the second similarity is used to describe a similarity between the text sub-feature corresponding to the title of the content diagram sample and a text sub-feature corresponding to the publisher name of the content diagram sample.
At step 1706, a third similarity between the first fused feature and the second fused feature of each content map sample is calculated.
Wherein the third similarity is used for describing the similarity between the fusion features. Specifically, the server calculates the similarity between the first fusion feature and the second fusion feature by a similarity algorithm, so as to obtain a third similarity of each content map sample. The similarity calculation method may be an euclidean distance similarity calculation method, a cosine similarity calculation method, or the like, and is not limited herein.
It should be understood that there is no timing constraint between steps 1702, 1704, and 1706.
Step 1708, obtaining similarity loss information corresponding to each content map sample based on each first similarity, each second similarity, and each third similarity.
Wherein, the similarity loss information is specifically used to describe: the similarity between the image features, the similarity between the text sub-features, and the similarity between the fusion features. Specifically, the server obtains the similarity loss information corresponding to each content map sample based on the first similarity, the second similarity, and the third similarity of each content map sample obtained in the previous steps.
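A minimal sketch of one way to combine the three similarities into similarity loss information follows; cosine similarity and the equal-weight combination that pulls each pair of features together are assumptions, not fixed by the embodiment.

import torch
import torch.nn.functional as F

def similarity_loss(first_img, second_img, title_feat, publisher_feat,
                    first_fused, second_fused):
    s1 = F.cosine_similarity(first_img, second_img, dim=-1)        # first similarity
    s2 = F.cosine_similarity(title_feat, publisher_feat, dim=-1)   # second similarity
    s3 = F.cosine_similarity(first_fused, second_fused, dim=-1)    # third similarity
    # encourage the paired features to agree; equal weights are an assumption
    return (3.0 - (s1 + s2 + s3)).mean()

loss = similarity_loss(*[torch.randn(4, 768) for _ in range(6)])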
In this embodiment, the similarity loss information is obtained by specifically calculating the similarity between the image features, the similarity between the text sub-features, and the similarity between the fusion features, so as to improve the accuracy and the richness of the similarity loss information, that is, to further improve the accuracy and the richness of the loss information of the pre-training model.
In an embodiment, as shown in fig. 18, in step 1402, the training process of training the initial pre-training model based on at least one of the fusion features and the sample features of each content map sample and the multi-modal features to obtain the target pre-training model may specifically include the following processing processes:
and step 1802, performing image-text matching degree evaluation based on the first image characteristics and the second text characteristics of each content image sample to obtain image-text matching degrees corresponding to each content image sample.
The image-text matching degree is used for describing the matching degree between the first image characteristic and the second text characteristic, namely whether the second text characteristic of the content image sample can accurately describe the first image characteristic of the content image sample or not can be described according to the image-text matching degree.
Specifically, in the training process, the server performs self-attention feature extraction on the first image features of each content map sample through the initial pre-training model to obtain image self-attention features of each content map sample, and performs self-attention feature extraction on the second text features of each content map sample through the initial pre-training model to obtain text self-attention features of each content map sample. The image self-attention feature refers to an image feature extracted by self-attention during training. Text self-attention features refer to text features extracted by self-attention during training. Based on the above, the server evaluates the matching degree of each image self-attention feature and each text self-attention feature through the initial pre-training model, so as to obtain the image-text matching degree corresponding to each content pattern.
It should be understood that, in practical applications, when the image-text matching degree corresponding to a content map sample is high, the interaction between the image feature and the text feature may be strengthened during training, and when the image-text matching degree corresponding to the content map sample is low, the interaction between the image feature and the text feature may be weakened during training.
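For ease of understanding, a possible image-text matching head is sketched below: self-attention is applied to the image tokens and the text tokens, the pooled vectors are scored, and the score is squashed to (0, 1). The bilinear score, mean pooling, and toy shapes are assumptions and do not limit the present solution.

import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    """Estimates how well the second text feature describes the first image feature."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Bilinear(dim, dim, 1)

    def forward(self, image_tokens, text_tokens):
        img, _ = self.img_attn(image_tokens, image_tokens, image_tokens)   # image self-attention feature
        txt, _ = self.txt_attn(text_tokens, text_tokens, text_tokens)      # text self-attention feature
        return torch.sigmoid(self.score(img.mean(dim=1), txt.mean(dim=1))) # image-text matching degree

matching_degree = MatchingHead()(torch.randn(4, 197, 768), torch.randn(4, 32, 768))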
Since the model parameters of the initial pre-training model need to be updated when training the initial pre-training model to obtain the target pre-training model, after the image-text matching degree corresponding to each content map sample is obtained in step 1802, the process of updating the model parameters of the initial pre-training model based on each cross entropy loss information and each similarity loss information in step 1606 may specifically include the following processing:
and 1804, updating model parameters of the initial pre-training model based on the cross entropy loss information, the similarity loss information and the image-text matching degree.
Specifically, the server updates the model parameters of the initial pre-training model through the cross entropy loss information, the similarity loss information and the image-text matching degree calculated in the previous steps. Therefore, after repeated iteration updating, when the loss function of the initial pre-training model reaches convergence, the target pre-training model is generated based on the model parameters of the initial pre-training model updated for the last time.
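The parameter update itself can be illustrated with the minimal loop below; the stand-in modules, the zero placeholders for the similarity and matching terms, the fixed step count, and the equal weighting of the three loss terms are all assumptions.

import torch
import torch.nn as nn

backbone = nn.Linear(768, 768)                  # stand-in for the initial pre-training model
cls_head = nn.Linear(768, 1000)
optimizer = torch.optim.AdamW(list(backbone.parameters()) + list(cls_head.parameters()), lr=1e-4)

for step in range(100):                         # in practice, iterate until the loss converges
    feats  = torch.randn(8, 768)                # multi-modal features of one batch
    labels = torch.randint(0, 1000, (8,))       # content classification labels
    ce_loss    = nn.CrossEntropyLoss()(cls_head(backbone(feats)), labels)
    sim_loss   = torch.tensor(0.0)              # placeholder for the similarity loss information
    match_loss = torch.tensor(0.0)              # placeholder for the image-text matching term
    total = ce_loss + sim_loss + match_loss     # equal weighting is an assumption
    optimizer.zero_grad()
    total.backward()
    optimizer.step()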
In this embodiment, the matching degree between the first image feature and the second text feature is described through the image-text matching degree, that is, whether the second text feature of the content map sample can accurately describe the first image feature of the content map sample can be described according to the image-text matching degree, so that when the model parameter is updated, on the basis of considering the error between the predicted content classification label and the similarity degrees between a plurality of features, the matching degree between the image feature and the text feature can be further considered, so that the process of model training is more reliable, that is, the obtained target pre-training model is more accurate.
To describe the feature processing method in the training process in more detail, as shown in fig. 19, first, image feature extraction is performed on each content pattern sample 1901 and each processed content pattern sample 1902 to obtain a first image feature 1903 and a second image feature 1904 of each content pattern sample. Similarly, text feature extraction is performed on the content classification label 1905 and the text information 1906 corresponding to each content image sample, so as to obtain a first text feature 1907 and a second text feature including a first text sub-feature 1908 and a second text sub-feature 1909 of each content image sample.
Based on this, a multi-modal feature 1910 corresponding to each content map sample is constructed based on the first image feature 1903, the second image feature 1904, the first text feature 1907, and the second text feature comprising the first text sub-feature 1908 and the second text sub-feature 1909. Further, a first fused feature 1912 is constructed based on the first image feature 1903 of the content graph sample and the second text sub-feature 1909 of the second text feature, and a second fused feature 1912 is constructed based on the second image feature 1904 of the content graph sample and the first text sub-feature 1908 of the second text feature. And then obtaining a target pre-training model by the model training method described in the previous embodiment.
Further, after the target pre-training model is obtained through training, the somatosensory painting wind recognition model is obtained through training based on the target pre-training model, and a method for training the somatosensory painting wind recognition model will be described in detail below. In an embodiment, as shown in fig. 20, a training method of a somatosensory picture wind recognition model is provided, which is described by taking an example that the method is applied to a server in fig. 1, it is to be understood that the method can also be applied to a terminal, and can also be applied to a system including the terminal and the server, and is implemented through interaction between the terminal and the server. In this embodiment, the method includes the steps of:
step 2002, obtaining all somatosensory painting training samples and the somatosensory painting labels corresponding to all the somatosensory painting training samples.
The somatosensory painting type label is used for describing the somatosensory painting type of the somatosensory painting training sample, the somatosensory painting type is used for describing the style and tone corresponding type of data information, and the data information can be texts, pictures, videos or music. Specifically, the style and the tone are the embodiment of the overall style of the data information, and the data information with the same style and the tone has a certain commonality, which can cause resonance of a class of users, such as: eye care, healing, middle aged and elderly people, campus, and fashion. Therefore, style and tonality are a kind of overall feeling given to the user, and may be auditory feeling, such as relaxation and cheerfulness, etc., or visual feeling, such as pleasure and sadness, etc.
Specifically, the server obtains all somatosensory painting training samples and somatosensory painting labels corresponding to all the somatosensory painting training samples, wherein the somatosensory painting labels are obtained based on manual labeling.
It can be understood that, when the somatosensory painting tag corresponding to a somatosensory painting training sample is labeled manually, judging the category corresponding to the style and tonality of the data information is highly subjective. For example, the style and tonality of the data information may stem from the nature of the content (such as seriousness or low tonality), from the audience group (such as young or middle-aged groups), or, for data information in text form, from the writing style. Therefore, the somatosensory painting tag needs to be judged by combining the classification targets of various labels, such as the specific content category, intention, and emotion, and the final result is obtained from their union.
Based on this, in order to improve the accuracy of the somatosensory painting tags in practical applications, note that after the initial somatosensory painting tag corresponding to each somatosensory painting training sample is obtained through a first round of manual labeling, the initial tag may still be incomplete or inaccurate because the judgment of a somatosensory painting tag is highly subjective. Therefore, the initial somatosensory painting tag needs to be judged again by different annotators. As shown in fig. 21, the annotator 2102 judges the initial somatosensory painting tag to obtain a first judgment result 2104, and the annotator 2106 judges the initial somatosensory painting tag to obtain a second judgment result 2108. If the first judgment result 2104 and the second judgment result 2108 are consistent, the initial somatosensory painting tag corresponding to the somatosensory painting training sample is taken as the somatosensory painting tag of that training sample. On the contrary, if the first judgment result 2104 and the second judgment result 2108 are inconsistent, the initial somatosensory painting tag corresponding to the somatosensory painting training sample is adjusted, and a similar judgment step is performed on the adjusted tag until the judgment results are consistent.
Further, this embodiment provides descriptions of commonly used somatosensory painting tags. As shown in fig. 22, the somatosensory painting tags include emotional exaggeration, rural style, serious and formal, high tone, low tone, easy entertainment, deep professionalism, social positive energy, plain and easy to understand, healing, and the like. When the somatosensory painting tag is serious and formal, it can describe data information such as news reports with serious wording, often concerning international affairs or social livelihood. When the somatosensory painting tag is social positive energy, it can describe data information reflecting, for example, acts of courage for a just cause or a clear sense of values, that is, positive-energy news that uplifts people's spirits. The specific contents of fig. 22 are not exhaustively introduced here, and in practical applications the somatosensory painting tags and their corresponding descriptions are not limited to the foregoing examples.
And step 2004, training the initial somatosensory picture wind recognition model based on each somatosensory picture wind training sample to obtain a trained somatosensory picture wind recognition model, wherein the somatosensory picture wind recognition model recognizes the somatosensory picture wind category of the data information.
The initial somatosensory picture wind recognition model is a target pre-training model obtained through the embodiment, and the specific training mode is not repeated here.
Specifically, the server obtains the predicted somatosensory painting style labels of the somatosensory painting training samples through the initial somatosensory painting style recognition models based on the somatosensory painting training samples, and updates the model parameters of the initial somatosensory painting style recognition models based on the predicted somatosensory painting style labels and the somatosensory painting style labels of the somatosensory painting training samples. The somatosensory picture wind identification model is used for identifying the type of the somatosensory picture wind of the data information.
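A hedged sketch of this fine-tuning stage follows; the number of somatosensory painting style categories, the stand-in backbone for the target pre-training model, and the hyperparameters are assumptions.

import torch
import torch.nn as nn

NUM_STYLE_CLASSES = 12                              # number of somatosensory painting tags, assumed

backbone = nn.Linear(768, 768)                      # stand-in for the target pre-training model
style_head = nn.Linear(768, NUM_STYLE_CLASSES)      # new head for somatosensory painting style recognition
optimizer = torch.optim.AdamW(list(backbone.parameters()) + list(style_head.parameters()), lr=2e-5)
criterion = nn.CrossEntropyLoss()

for step in range(100):
    sample_feat = torch.randn(16, 768)                        # featurized somatosensory training samples
    style_label = torch.randint(0, NUM_STYLE_CLASSES, (16,))  # manually annotated somatosensory painting tags
    predicted = style_head(backbone(sample_feat))             # predicted somatosensory painting tags (logits)
    loss = criterion(predicted, style_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()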
In the training method of the somatosensory painting recognition model, the initial somatosensory painting recognition model is the target pre-training model, and the target pre-training model can learn the characteristic information included in more content image samples in the training process, so that the accuracy of the trained target pre-training model is improved, and the accuracy of the somatosensory painting recognition model is improved.
Further, after the somatosensory picture wind recognition model is obtained by training, as can be seen from fig. 2, the somatosensory picture wind recognition model is obtained by calling training through the somatosensory picture wind recognition service, and the somatosensory picture wind type of the data information is recognized. In an embodiment, as shown in fig. 23, a method for identifying a somatosensory style is provided, and is described by taking an example that the method is applied to a server in fig. 1, it is to be understood that the method may also be applied to a terminal, and may also be applied to a system including the terminal and the server, and is implemented through interaction between the terminal and the server. In this embodiment, the method includes the steps of:
and step 2302, acquiring the data information to be identified.
In a specific application, the server may acquire somatosensory picture wind identification information from the terminal and obtain, from that information, the data information to be identified for which the somatosensory picture wind category needs to be recognized; alternatively, the data information to be identified may be acquired from a database. This is not specifically limited herein.
Step 2304, based on the data information to be recognized, a predicted somatosensory painting tag corresponding to the data information to be recognized is obtained through the somatosensory painting recognition model, and the predicted somatosensory painting tag is used for describing the somatosensory painting type of the data information to be recognized.
The somatosensory picture wind recognition model is obtained through the embodiment, and the specific training mode is not repeated here.
Specifically, the server takes the data information to be recognized as the input of the trained somatosensory picture wind recognition model, and the somatosensory picture wind recognition model can output a predicted somatosensory picture wind tag corresponding to the data information to be recognized, wherein the predicted somatosensory picture wind tag is used for describing the somatosensory picture wind category of the data information to be recognized. The specific definition and examples of the somatosensory style category are described in detail in the foregoing embodiments, and are not described here again.
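Finally, inference with the trained recognition model can be sketched as below; the label list is a small illustrative subset and the model is represented abstractly as a callable, both of which are assumptions.

import torch

STYLE_LABELS = ["emotional exaggeration", "rural style", "serious and formal",
                "easy entertainment", "social positive energy", "healing"]   # illustrative subset

@torch.no_grad()
def predict_style(model, feature: torch.Tensor) -> str:
    """Return the predicted somatosensory painting tag for one piece of data information."""
    logits = model(feature)                        # model: the trained recognition model as a callable
    return STYLE_LABELS[int(logits.argmax(dim=-1))]

# usage: with a head whose output size matches len(STYLE_LABELS)
# head = torch.nn.Linear(768, len(STYLE_LABELS))
# predicted_tag = predict_style(head, torch.randn(1, 768))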
In the method for identifying the somatosensory paintings, the predicted somatosensory paintings labels corresponding to the data information to be identified need to be acquired through the somatosensory paintings identification model, the somatosensory paintings identification model is obtained through training based on the target pre-training model, the target pre-training model can learn the characteristic information included in more content picture samples in the training process, and the accuracy of the trained target pre-training model is improved, so that the accuracy of the somatosensory paintings identification model is improved, and the accuracy of the somatosensory paintings category of the identification data information is further improved.
The following describes detailed embodiments of the training method of the pre-training model, the training method of the somatosensory painting recognition model, and the corresponding somatosensory painting recognition method. As shown in fig. 24, the method includes:
step 2402, obtaining each sample data pair and a content classification label corresponding to the content map sample in each sample data pair.
The sample data pair comprises a content graph sample and data description information corresponding to the content graph sample. Secondly, the content classification labels are used to describe the categories of the content information included in the content map samples, and each content map sample may correspond to one or more content classification labels. It should be understood that the specific examples of the sample data pair and the content classification label are similar to the foregoing embodiments and are not described herein again.
Step 2404, performing image feature extraction on each content map sample and the processed content map sample respectively to obtain a first image feature and a second image feature of each content map sample.
The data description information comprises a processed content image sample after the content image sample is processed, and the processing of the content image sample is specifically data enhancement processing of the content image sample. The manner how the server obtains the image features is similar to the foregoing embodiment, and is not described here again.
Step 2406, performing text feature extraction on the content classification labels corresponding to the content graph samples to obtain first text features.
The data description information comprises text information corresponding to the content graph sample. And the server extracts the text features of the content classification labels corresponding to the content pattern samples to obtain first text features corresponding to the content classification labels corresponding to the content pattern samples.
Step 2408, performing text division on the text information corresponding to each content pattern to obtain a text sequence corresponding to each content pattern.
Wherein, the text sequence corresponding to each content map sample comprises a plurality of text marks (Token). Specifically, a plurality of text labels can be obtained after text division is performed on text information corresponding to the content graph samples, and a text sequence corresponding to the content graph samples is formed based on the text labels. The manner how the server obtains the text sequence corresponding to each content map sample is similar to the foregoing embodiment, and is not described herein again.
Step 2410, performing mask processing on each text sequence, replacing part of text marks in the text sequence after the mask processing with mask marks, and generating each second text characteristic based on each text sequence after the mask processing.
The masking process is to mask (mask) a part of the text marks in the text sequence, that is, replace the part of the text marks with mask marks. The manner how the server generates each second text feature is similar to the foregoing embodiment, and is not described herein again.
And step 2412, constructing the multi-modal characteristics corresponding to each content map sample based on the sample characteristics of each content map sample.
The sample features of each content map sample comprise a first image feature, a second image feature, a first text feature and a second text feature. The manner how the server gets the multi-modal features is similar to the previous embodiment and will not be described here.
And 2414, training the initial pre-training model based on at least one of the fusion characteristics and the sample characteristics of each content map sample and the multi-modal characteristics to obtain a target pre-training model.
The server can train the initial pre-training model based on the fusion features and the multi-modal features of each content map sample to obtain a target pre-training model. Or training the initial pre-training model based on the sample characteristics and the multi-modal characteristics to obtain a target pre-training model. Or training the initial pre-training model based on the fusion characteristics and the sample characteristics of each content graph sample to obtain a target pre-training model. The way how the server trains the initial pre-training model to obtain the target pre-training model is similar to that of the foregoing embodiment, and details are not repeated here.
Step 2416, obtaining each somatosensory picture wind training sample and the somatosensory picture wind label corresponding to each somatosensory picture wind training sample.
The somatosensory picture wind label is used for describing the somatosensory picture wind category of the somatosensory picture wind training sample, the somatosensory picture wind category is used for describing the category of the style and tonality of data information, and the data information can be text, a picture, a video or music. The manner in which the server obtains the somatosensory picture wind training samples and the corresponding somatosensory picture wind labels is similar to step 2002 and is not described here again.
Step 2418, training the initial somatosensory picture wind recognition model based on each somatosensory picture wind training sample to obtain a trained somatosensory picture wind recognition model.
The somatosensory picture wind recognition model recognizes the somatosensory picture wind category of data information, and the initial somatosensory picture wind recognition model is the target pre-training model obtained through the above training. The manner in which the server trains and obtains the somatosensory picture wind recognition model is similar to step 2004 and is not described here again.
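An illustrative sketch of this fine-tuning stage, assuming the target pre-training model is wrapped as a backbone and extended with a classification head over hypothetical somatosensory picture wind categories (all dimensions and names are assumptions, not the prescribed implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleRecognitionModel(nn.Module):
    """The target pre-training model serves as the initial recognition model;
    a small head maps its representation to somatosensory picture wind categories."""

    def __init__(self, pretrained_backbone, feature_dim=1024, num_styles=12):
        super().__init__()
        self.backbone = pretrained_backbone          # target pre-training model
        self.classifier = nn.Linear(feature_dim, num_styles)

    def forward(self, inputs):
        features = self.backbone(inputs)             # multi-modal representation
        return self.classifier(features)             # picture wind logits

def finetune_step(model, optimizer, inputs, style_labels):
    """One supervised update on somatosensory picture wind training samples."""
    logits = model(inputs)
    loss = F.cross_entropy(logits, style_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```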
Step 2420, acquiring the data information to be identified.
The manner in which the server obtains the data information to be identified is similar to step 2302 and is not described here again.
Step 2422, based on the data information to be identified, obtaining a predicted somatosensory picture wind label corresponding to the data information to be identified through the somatosensory picture wind recognition model, where the predicted somatosensory picture wind label is used for describing the somatosensory picture wind category of the data information to be identified.
The somatosensory picture wind recognition model is obtained through the above training method. The manner in which the server obtains the predicted somatosensory picture wind label corresponding to the data information to be identified is similar to step 2304 and is not described here again.
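A minimal inference sketch under the same assumptions as above, mapping the model output for the data information to be identified to a predicted somatosensory picture wind label:

```python
import torch

@torch.no_grad()
def predict_style(model, inputs, style_names):
    """Return the predicted somatosensory picture wind label(s) for the data
    information to be identified."""
    model.eval()
    logits = model(inputs)
    probs = torch.softmax(logits, dim=-1)
    indices = probs.argmax(dim=-1)
    return [style_names[i] for i in indices.tolist()]
```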
It should be understood that, although the steps in the flowcharts related to the embodiments described above are displayed sequentially as indicated by the arrows, the steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a part of the steps in the flowcharts related to the embodiments described above may include multiple steps or multiple stages, which are not necessarily performed at the same moment but may be performed at different moments, and the execution order of these steps or stages is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a part of the steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a training device of the pre-training model for realizing the training method of the pre-training model. The implementation scheme for solving the problem provided by the apparatus is similar to the implementation scheme described in the method, so the specific limitations in the following embodiments of the training apparatus for one or more pre-training models may refer to the limitations on the training method for the pre-training models, and are not described herein again.
In one embodiment, as shown in fig. 25, there is provided a training apparatus for pre-training a model, including: an acquisition module 2502, a processing module 2504, and a first training module 2506, wherein:
an obtaining module 2502, configured to obtain each sample data pair, where the sample data pair includes a content map sample and data description information corresponding to the content map sample; obtaining content classification labels corresponding to the content image samples in each sample data pair;
the processing module 2504 is configured to perform feature extraction on each sample data pair and the content classification label corresponding to each content map sample, so as to obtain sample features of each content map sample, where the sample features include image features and text features;
the first training module 2506 is configured to train the initial pre-training model based on the sample features of each content map sample to obtain a target pre-training model, where the target pre-training model is used to train and obtain a somatosensory picture wind recognition model, and the somatosensory picture wind recognition model recognizes the somatosensory picture wind category of data information.
In one embodiment, the data description information includes: a processed content map sample obtained by processing the content map sample, and text information corresponding to the content map sample;
the processing module 2504 is specifically configured to perform image feature extraction on each content map sample and the processed content map sample respectively, so as to obtain a first image feature and a second image feature of each content map sample; and perform text feature extraction on the content classification label and the text information corresponding to each content map sample respectively, so as to obtain a first text feature and a second text feature of each content map sample; the image features include the first image feature and the second image feature, and the text features include the first text feature and the second text feature.
In an embodiment, the processing module 2504 is specifically configured to perform text feature extraction on the content classification label corresponding to each content map sample to obtain each first text feature; perform text division on the text information corresponding to each content map sample to obtain a text sequence corresponding to each content map sample; and perform mask processing on each text sequence, where part of the text marks in the mask-processed text sequence are replaced with mask marks, and generate each second text feature based on each mask-processed text sequence.
In one embodiment, the text sequence corresponding to the content map sample includes a plurality of text marks;
the processing module 2504 is specifically configured to calculate a contribution degree of each text mark in each text sequence, where the contribution degree is the contribution of the text mark to the content classification label prediction; and determine the key text marks in each text sequence according to the contribution degree of each text mark in each text sequence, and determine the key text marks in each text sequence as the text marks to be replaced in that text sequence.
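One possible realization of the contribution degree is gradient saliency with respect to the content classification loss; the sketch below assumes a simple classifier over pooled token embeddings and is only one of many ways the contribution could be measured:

```python
import torch
import torch.nn.functional as F

def contribution_degrees(classifier, token_embeddings, label):
    """Estimate each text mark's contribution to the content classification
    prediction via gradient saliency.

    classifier: callable mapping a (1, dim) tensor to (1, num_classes) logits,
                e.g. an nn.Linear (assumption).
    token_embeddings: (seq_len, dim) float tensor, one row per text mark.
    label: scalar LongTensor holding the content classification label.
    """
    token_embeddings = token_embeddings.clone().requires_grad_(True)
    logits = classifier(token_embeddings.mean(dim=0, keepdim=True))
    loss = F.cross_entropy(logits, label.view(1))
    loss.backward()
    # L2 norm of the gradient per text mark as its contribution degree.
    return token_embeddings.grad.norm(dim=-1)

def select_key_marks(scores, ratio=0.15):
    """Pick the highest-contribution text marks as the ones to be replaced."""
    k = max(1, int(len(scores) * ratio))
    return torch.topk(scores, k).indices.tolist()
```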
In one embodiment, the first training module 2506 is specifically configured to construct a multi-modal feature corresponding to each content map sample based on the sample features of each content map sample; and train the initial pre-training model based on the multi-modal features of each content map sample to obtain the target pre-training model.
In an embodiment, the first training module 2506 is specifically configured to train the initial pre-training model based on at least one of a fusion feature and a sample feature of each content map sample, and a multi-modal feature, to obtain a target pre-training model; the fusion features comprise a first fusion feature and a second fusion feature, and the first fusion feature is a feature constructed based on a first image feature and a second text feature of the content image sample; the second fused feature is a feature constructed based on a second image feature and a second text feature of the content map sample.
In one embodiment, the first training module 2506 is specifically configured to, during the training process: obtain a predicted content classification label corresponding to each content map sample based on each multi-modal feature, and calculate cross entropy loss information corresponding to each content map sample through each predicted content classification label and each content classification label; calculate and determine similarity loss information corresponding to each content map sample based on at least one of the fusion features and the sample features of each content map sample; and update the model parameters of the initial pre-training model based on each piece of cross entropy loss information and each piece of similarity loss information.
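An illustrative training step combining the two kinds of loss information, assuming the model returns a dictionary containing the predicted label logits and the features needed for the similarity loss (the field names are hypothetical):

```python
import torch.nn.functional as F

def pretrain_step(model, optimizer, batch, similarity_loss_fn):
    """One illustrative update: cross entropy over predicted content
    classification labels plus the per-sample similarity loss."""
    out = model(batch)  # assumed to return a dict with the fields used below
    ce_loss = F.cross_entropy(out["label_logits"], batch["content_labels"])
    sim_loss = similarity_loss_fn(out)
    loss = ce_loss + sim_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```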
In an embodiment, the processing module 2504 is specifically configured to calculate a first similarity between the first image feature and the second image feature of each content map sample; calculate a second similarity between the text sub-features in the second text feature of each content map sample; calculate a third similarity between the first fusion feature and the second fusion feature of each content map sample; and obtain the similarity loss information corresponding to each content map sample according to each first similarity, each second similarity and each third similarity.
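A sketch of one way to turn the first, second and third similarities into similarity loss information, using cosine similarity and hypothetical field names; the application does not fix the exact similarity measure or combination:

```python
import torch.nn.functional as F

def similarity_loss(out):
    """Combine the first, second and third similarities into a single loss that
    encourages the paired features of each content map sample to agree."""
    s1 = F.cosine_similarity(out["first_image_feat"], out["second_image_feat"], dim=-1)
    s2 = F.cosine_similarity(out["text_subfeat_a"], out["text_subfeat_b"], dim=-1)
    s3 = F.cosine_similarity(out["first_fused_feat"], out["second_fused_feat"], dim=-1)
    # Higher similarity -> lower loss; averaged over the batch.
    return (3.0 - (s1 + s2 + s3)).mean()
```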
In an embodiment, the first training module 2506 is specifically configured to evaluate, in the training process, the image-text matching degree based on the first image feature and the second text feature of each content map sample, so as to obtain the image-text matching degree corresponding to each content map sample; and update the model parameters of the initial pre-training model based on each piece of cross entropy loss information, each piece of similarity loss information and each image-text matching degree.
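An illustrative image-text matching head that scores the agreement between the first image feature and the second text feature; the architecture and dimensions are assumptions for demonstration only:

```python
import torch
import torch.nn as nn

class ImageTextMatchingHead(nn.Module):
    """Score how well the first image feature and the second text feature of a
    content map sample match, yielding an image-text matching degree in (0, 1)."""

    def __init__(self, image_dim=512, text_dim=768, hidden_dim=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(image_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, first_image_feat, second_text_feat):
        pair = torch.cat([first_image_feat, second_text_feat], dim=-1)
        return torch.sigmoid(self.scorer(pair)).squeeze(-1)
```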
In one embodiment, as shown in fig. 26, there is provided a training apparatus for a somatosensory picture wind recognition model, including: an acquisition module 2602 and a second training module 2604, wherein:
an obtaining module 2602, configured to obtain data information to be identified;
the second training module 2604 is configured to train the initial somatosensory picture wind recognition model based on each somatosensory picture wind training sample to obtain a trained somatosensory picture wind recognition model;
the obtaining mode of the initial somatosensory picture wind recognition model comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
obtaining content classification labels corresponding to the content image samples in each sample data pair;
performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, wherein the sample features comprise image features and text features;
and training the initial pre-training model based on the sample characteristics of each content map sample to obtain a target pre-training model, and taking the target pre-training model as an initial somatosensory picture wind recognition model.
In one embodiment, the second training module 2604 comprises:
an initial somatosensory picture wind recognition model obtaining module, configured to obtain an initial somatosensory picture wind recognition model;
and a somatosensory picture wind model training module, configured to train, based on each somatosensory picture wind training sample, the initial somatosensory picture wind recognition model obtained by the initial somatosensory picture wind recognition model obtaining module, so as to obtain the trained somatosensory picture wind recognition model.
The initial somatosensory picture wind recognition model obtaining module may take the target pre-training model obtained by the training apparatus of the pre-training model as the initial somatosensory picture wind recognition model, or may be the training apparatus of the pre-training model itself, that is, the training apparatus of the pre-training model serves as the initial somatosensory picture wind recognition model obtaining module.
In one embodiment, as shown in fig. 27, there is provided a device for recognizing a somatosensory picture wind, including: an obtaining module 2702 and an identifying module 2704, wherein:
an obtaining module 2702, configured to obtain data information to be identified;
the identification module 2704 is configured to obtain, based on the data information to be identified, a predicted somatosensory picture wind label corresponding to the data information to be identified through the somatosensory picture wind recognition model, where the predicted somatosensory picture wind label is used to describe the somatosensory picture wind category of the data information to be identified;
the obtaining mode of the somatosensory picture wind recognition model comprises the following steps:
acquiring all somatosensory painting training samples and somatosensory painting labels corresponding to all the somatosensory painting training samples;
training an initial somatosensory picture wind recognition model based on each somatosensory picture wind training sample to obtain a trained somatosensory picture wind recognition model;
the obtaining mode of the initial somatosensory picture wind recognition model comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
obtaining content classification labels corresponding to the content image samples in each sample data pair;
performing feature extraction on each sample data pair and the content classification label corresponding to each content image sample to obtain sample features of each content image sample, wherein the sample features comprise image features and text features;
and training the initial pre-training model based on the sample characteristics of each content graph sample to obtain a target pre-training model, and taking the target pre-training model as an initial somatosensory picture wind recognition model.
In an embodiment, the device for recognizing the somatosensory picture wind further comprises a training device of the somatosensory picture wind recognition model, so as to train and obtain the somatosensory picture wind recognition model.
All or part of the modules in the training device for the pre-training model, the training device for the somatosensory picture wind recognition model and the somatosensory picture wind recognition device can be implemented by software, hardware, or a combination thereof. The modules can be embedded in or independent of a processor in the computer device in a hardware form, or can be stored in a memory in the computer device in a software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 28. The computer device includes a processor, a memory, an Input/Output interface (I/O for short), and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used for storing required data such as sample data pairs, somatosensory painting training samples, to-be-identified data information and the like. The input/output interface of the computer device is used for exchanging information between the processor and an external device. The communication interface of the computer device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a training method of a pre-training model.
Those skilled in the art will appreciate that the architecture shown in fig. 28 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply, as a particular computer device may include more or fewer components than those shown, or combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In an embodiment, a computer program product is provided, comprising a computer program which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant country and region.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by instructing relevant hardware through a computer program, which may be stored in a non-volatile computer-readable storage medium and which, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include a Read-Only Memory (ROM), a magnetic tape, a floppy disk, a flash memory, an optical memory, a high-density embedded non-volatile memory, a Resistive Random Access Memory (ReRAM), a Magnetoresistive Random Access Memory (MRAM), a Ferroelectric Random Access Memory (FRAM), a Phase Change Memory (PCM), a graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the various embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum-computing-based data processing logic devices, or the like.
All possible combinations of the technical features in the above embodiments may not be described for the sake of brevity, but should be considered as being within the scope of the present disclosure as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several implementations of the present application, and the descriptions thereof are specific and detailed, but they should not be construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and improvements can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (15)

1. A method for training a pre-training model, the method comprising:
obtaining each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
obtaining content classification labels corresponding to the content map samples in each sample data pair;
performing feature extraction on each sample data pair and a content classification label corresponding to each content map sample to obtain sample features of each content map sample, wherein the sample features comprise image features and text features;
training an initial pre-training model based on the sample characteristics of each content map sample to obtain a target pre-training model, wherein the target pre-training model is used for training to obtain a somatosensory picture wind recognition model, and the somatosensory picture wind recognition model recognizes the somatosensory picture wind category of data information.
2. The method of claim 1, wherein the data description information comprises: a processed content map sample obtained by processing the content map sample, and text information corresponding to the content map sample;
the performing feature extraction on each sample data pair and the content classification label corresponding to each content graph sample to obtain the sample feature of each content graph sample includes:
respectively carrying out image feature extraction on each content map sample and the processed content map sample to obtain a first image feature and a second image feature of each content map sample;
respectively extracting text features of the content classification labels and the text information corresponding to the content image samples to obtain first text features and second text features of the content image samples;
the image features include the first image features and second image features, and the text features include the first text features and the second text features.
3. The method according to claim 2, wherein performing text feature extraction on the content classification label and the text information corresponding to each of the content image samples to obtain a first text feature and a second text feature of each of the content image samples respectively comprises:
performing text feature extraction on the content classification labels corresponding to the content image samples to obtain first text features;
performing text division on the text information corresponding to each content map sample to obtain a text sequence corresponding to each content map sample;
and performing mask processing on each text sequence, replacing part of text marks in the text sequence after the mask processing with mask marks, and generating each second text feature based on each text sequence after the mask processing.
4. The method of claim 3, wherein the text sequence corresponding to the content map sample comprises a plurality of text marks;
the masking each text sequence includes:
calculating the contribution degree of each text mark in each text sequence, wherein the contribution degree is the contribution degree of the text mark to content classification label prediction;
and determining key text marks in each text sequence according to the contribution degree of each text mark in each text sequence, and determining the key text marks in each text sequence as replaced text marks in the text sequence.
5. The method according to claim 2 or 3, wherein the training an initial pre-training model based on the sample features of each content map sample to obtain a target pre-training model comprises:
constructing multi-modal features corresponding to each content pattern sample based on the sample features of each content pattern sample;
and training the initial pre-training model based on the multi-modal characteristics of each content map sample to obtain the target pre-training model.
6. The method of claim 5, wherein training the initial pre-training model based on the multi-modal features of each of the content map samples to obtain the target pre-training model comprises:
training the initial pre-training model based on at least one of the fusion features and the sample features of each content map sample and the multi-modal features to obtain the target pre-training model;
wherein the fused feature comprises a first fused feature and a second fused feature, the first fused feature being a feature constructed based on the first image feature and the second text feature of the content map sample; the second fused feature is a feature constructed based on the second image feature and the second text feature of the content map sample.
7. The method according to claim 6, wherein training the initial pre-training model based on at least one of fusion features and sample features of each of the content map samples and the multi-modal features to obtain the target pre-training model comprises:
during the training process:
obtaining a predicted content classification label corresponding to each content map sample based on each multi-modal feature, and calculating cross entropy loss information corresponding to each content map sample through each predicted content classification label and each content classification label;
calculating and determining similarity loss information corresponding to each content map sample based on at least one of fusion features and sample features of each content map sample;
and updating the model parameters of the initial pre-training model based on each piece of cross entropy loss information and each piece of similarity loss information.
8. The method according to claim 7, wherein calculating and determining similarity loss information corresponding to each content map sample based on at least one of a fusion feature and a sample feature of each content map sample comprises:
calculating a first similarity between the first image feature and the second image feature of each content map sample;
calculating a second similarity between each text sub-feature in the second text feature of each content map sample;
calculating a third similarity between the first fused feature and the second fused feature of each of the content map samples;
and obtaining similarity loss information corresponding to each content map sample based on each first similarity, each second similarity and each third similarity.
9. The method of claim 7, wherein training the initial pre-training model based on at least one of the fusion features and the sample features of each of the content map samples and the multi-modal features to obtain the target pre-training model further comprises:
in the training process, evaluating the image-text matching degree based on the first image characteristic and the second text characteristic of each content image sample to obtain the image-text matching degree corresponding to each content image sample;
updating the model parameters of the initial pre-training model based on each piece of cross entropy loss information and each piece of similarity loss information, including:
and updating the model parameters of the initial pre-training model based on the cross entropy loss information, the similarity loss information and the image-text matching degree.
10. A training method for a somatosensory picture wind recognition model, the method comprising:
acquiring all somatosensory picture wind training samples and somatosensory picture wind labels corresponding to all the somatosensory picture wind training samples;
training an initial somatosensory picture wind recognition model based on each somatosensory picture wind training sample to obtain a trained somatosensory picture wind recognition model, wherein the somatosensory picture wind recognition model recognizes the somatosensory picture wind type of data information;
the obtaining mode of the initial somatosensory picture wind recognition model comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
obtaining content classification labels corresponding to the content graph samples in each sample data pair;
performing feature extraction on each sample data pair and a content classification label corresponding to each content map sample to obtain sample features of each content map sample, wherein the sample features comprise image features and text features;
training an initial pre-training model based on the sample characteristics of each content map sample to obtain a target pre-training model, and taking the target pre-training model as the initial somatosensory picture wind recognition model.
11. A method for recognizing a somatosensory picture wind, the method comprising:
acquiring data information to be identified;
based on the data information to be identified, acquiring a predicted somatosensory picture wind label corresponding to the data information to be identified through a somatosensory picture wind recognition model, wherein the predicted somatosensory picture wind label is used for describing the somatosensory picture wind category of the data information to be identified;
the obtaining mode of the somatosensory picture wind recognition model comprises the following steps:
acquiring all somatosensory picture wind training samples and somatosensory picture wind labels corresponding to all the somatosensory picture wind training samples;
training an initial somatosensory picture wind recognition model based on each somatosensory picture wind training sample to obtain a trained somatosensory picture wind recognition model;
the obtaining mode of the initial somatosensory picture wind recognition model comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
obtaining content classification labels corresponding to the content map samples in each sample data pair;
performing feature extraction on each sample data pair and a content classification label corresponding to each content graph sample to obtain sample features of each content graph sample, wherein the sample features comprise image features and text features;
training an initial pre-training model based on the sample characteristics of each content graph sample to obtain a target pre-training model, and taking the target pre-training model as the initial somatosensory picture wind recognition model.
12. A training apparatus for a pre-training model, the apparatus comprising:
the acquisition module is used for acquiring each sample data pair, and the sample data pairs comprise content graph samples and data description information corresponding to the content graph samples; obtaining content classification labels corresponding to the content graph samples in each sample data pair;
the processing module is used for performing feature extraction on each sample data pair and the content classification label corresponding to each content graph sample to obtain sample features of each content graph sample, wherein the sample features comprise image features and text features;
the first training module is used for training an initial pre-training model based on the sample characteristics of each content map sample to obtain a target pre-training model, the target pre-training model is used for training to obtain a somatosensory picture wind recognition model, and the somatosensory picture wind recognition model recognizes the somatosensory picture wind category of data information.
13. A training apparatus for a somatosensory picture wind recognition model, wherein the apparatus comprises:
the acquisition module is used for acquiring data information to be identified;
the second training module is used for training the initial somatosensory picture wind recognition model based on each somatosensory picture wind training sample to obtain a trained somatosensory picture wind recognition model;
the obtaining mode of the initial somatosensory picture wind recognition model comprises the following steps:
obtaining each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
obtaining content classification labels corresponding to the content map samples in each sample data pair;
performing feature extraction on each sample data pair and a content classification label corresponding to each content map sample to obtain sample features of each content map sample, wherein the sample features comprise image features and text features;
training an initial pre-training model based on the sample characteristics of each content map sample to obtain a target pre-training model, and taking the target pre-training model as the initial somatosensory picture wind recognition model.
14. A recognition device for a somatosensory picture wind, wherein the device comprises:
the acquisition module is used for acquiring data information to be identified;
the identification module is used for acquiring, based on the data information to be identified, a predicted somatosensory picture wind label corresponding to the data information to be identified through a somatosensory picture wind recognition model, and the predicted somatosensory picture wind label is used for describing the somatosensory picture wind category of the data information to be identified;
the obtaining mode of the somatosensory picture wind recognition model comprises the following steps:
acquiring all somatosensory picture wind training samples and somatosensory picture wind labels corresponding to all the somatosensory picture wind training samples;
training an initial somatosensory picture wind recognition model based on each somatosensory picture wind training sample to obtain a trained somatosensory picture wind recognition model;
the obtaining mode of the initial somatosensory picture wind recognition model comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
obtaining content classification labels corresponding to the content map samples in each sample data pair;
performing feature extraction on each sample data pair and a content classification label corresponding to each content map sample to obtain sample features of each content map sample, wherein the sample features comprise image features and text features;
training an initial pre-training model based on the sample characteristics of each content graph sample to obtain a target pre-training model, and taking the target pre-training model as the initial somatosensory picture wind recognition model.
15. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method of any one of claims 1 to 11 when executing the computer program.
CN202210572644.2A 2022-05-25 2022-05-25 Training method and device for pre-training model and somatosensory wind identification model Active CN115658964B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210572644.2A CN115658964B (en) 2022-05-25 2022-05-25 Training method and device for pre-training model and somatosensory wind identification model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210572644.2A CN115658964B (en) 2022-05-25 2022-05-25 Training method and device for pre-training model and somatosensory wind identification model

Publications (2)

Publication Number Publication Date
CN115658964A true CN115658964A (en) 2023-01-31
CN115658964B CN115658964B (en) 2023-07-18

Family

ID=85023984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210572644.2A Active CN115658964B (en) 2022-05-25 2022-05-25 Training method and device for pre-training model and somatosensory wind identification model

Country Status (1)

Country Link
CN (1) CN115658964B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874296A (en) * 2015-12-14 2017-06-20 阿里巴巴集团控股有限公司 A kind of style recognition methods of commodity and device
CN106874924A (en) * 2015-12-14 2017-06-20 阿里巴巴集团控股有限公司 A kind of recognition methods of picture style and device
CN112232425A (en) * 2020-10-21 2021-01-15 腾讯科技(深圳)有限公司 Image processing method, image processing device, storage medium and electronic equipment
US20210073593A1 (en) * 2019-09-06 2021-03-11 The Yes Platform Cluster and Image-Based Feedback System
CN113269213A (en) * 2020-02-17 2021-08-17 百度在线网络技术(北京)有限公司 Training set acquisition method and device and electronic equipment
CN113283551A (en) * 2021-07-22 2021-08-20 智者四海(北京)技术有限公司 Training method and training device of multi-mode pre-training model and electronic equipment
CN114332477A (en) * 2021-12-22 2022-04-12 北京沃东天骏信息技术有限公司 Feature recognition model training method, article feature recognition method and article feature recognition device


Also Published As

Publication number Publication date
CN115658964B (en) 2023-07-18


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant