CN115658964B - Training method and device for pre-training model and somatosensory wind identification model


Info

Publication number
CN115658964B
Authority
CN
China
Prior art keywords
text, sample, content, somatosensory, training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210572644.2A
Other languages
Chinese (zh)
Other versions
CN115658964A (en)
Inventor
刘刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210572644.2A priority Critical patent/CN115658964B/en
Publication of CN115658964A publication Critical patent/CN115658964A/en
Application granted granted Critical
Publication of CN115658964B publication Critical patent/CN115658964B/en

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to a training method and apparatus for a pre-training model and a somatosensory wind identification model. The method comprises the following steps: acquiring sample data pairs, wherein each sample data pair comprises a content graph sample and data description information corresponding to the content graph sample; acquiring a content classification label corresponding to the content graph sample in each sample data pair; performing feature extraction on each sample data pair and the content classification label corresponding to each content graph sample to obtain sample features of each content graph sample, the sample features comprising image features and text features; and training an initial pre-training model based on the sample features of each content graph sample to obtain a target pre-training model, wherein the target pre-training model is used for training a somatosensory wind identification model, and the somatosensory wind identification model identifies the somatosensory wind category of data information. By adopting the method, the accuracy of identifying somatosensory wind can be ensured.

Description

Training method and device for pre-training model and somatosensory wind identification model
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a training method and apparatus for a pre-training model and a somatosensory wind identification model.
Background
In the era of rapid internet development, as the threshold of content production decreases, the amount of various distributed content grows at an exponential rate. Different users perceive the somatosensory wind of each piece of content differently; it is the user's visual feeling about the content, which may come from the title (Title) seen by the user, the cover image of the content, the account of the author who published the content, and so on. Content therefore needs to be classified along the somatosensory wind dimension, which describes the style and tonality of content. The somatosensory wind is, in particular, the overall style of the content; content of the same style and tonality has certain commonalities, such as positive energy or light entertainment, and can therefore resonate with a particular class of users.
At present, unsupervised or weakly supervised methods are generally adopted to classify information-flow content along the somatosensory wind dimension. Because such methods need to collect a large number of data samples and perform cluster analysis on the sample data, and because somatosensory wind is highly subjective, the accuracy of the resulting classification is low. How to ensure the accuracy of identifying somatosensory wind is therefore a problem to be solved.
Disclosure of Invention
Accordingly, in order to solve the above-mentioned problems, it is necessary to provide a training method and apparatus for a pre-training model and a somatosensory wind identification model that can ensure accuracy in identifying somatosensory wind.
In a first aspect, the present application provides a method of training a pre-training model. The method comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
acquiring content classification labels corresponding to content graph samples in each sample data pair;
extracting characteristics of each sample data pair and a content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics;
based on sample characteristics of each content graph sample, training an initial pre-training model to obtain a target pre-training model, wherein the target pre-training model is used for training to obtain a somatosensory wind identification model, and the somatosensory wind identification model identifies the somatosensory wind category of data information.
In one embodiment, the data description information includes: a processed content graph sample after processing the content graph sample, and text information corresponding to the content graph sample;
Extracting features of each sample data pair and content classification labels corresponding to each content graph sample to obtain sample features of each content graph sample, wherein the method comprises the following steps:
respectively extracting image features of each content image sample and the processed content image samples to obtain a first image feature and a second image feature of each content image sample;
respectively extracting text features of content classification labels and text information corresponding to the content graph samples to obtain first text features and second text features of the content graph samples;
the image features include a first image feature and a second image feature, and the text features include a first text feature and a second text feature.
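By way of a non-limiting illustrative sketch, the feature extraction of this embodiment can be pictured as a shared image encoder applied to the original and the processed content graph samples, and a shared text encoder applied to the content classification labels and the text information. The encoder architectures, dimensions and tensor shapes below are assumptions chosen for brevity, not a prescription of the claimed method.

```python
import torch
import torch.nn as nn

class SimpleImageEncoder(nn.Module):
    """Toy CNN encoder standing in for a real backbone (e.g. a ViT or ResNet)."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, images):                       # images: (B, 3, H, W)
        pooled = self.conv(images).flatten(1)        # (B, 64)
        return self.proj(pooled)                     # (B, dim)

class SimpleTextEncoder(nn.Module):
    """Toy Transformer text encoder standing in for a BERT-style model."""
    def __init__(self, vocab_size=30000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, token_ids):                    # token_ids: (B, L)
        hidden = self.encoder(self.embed(token_ids))  # (B, L, dim) token-level features
        return hidden.mean(dim=1)                    # (B, dim) pooled text feature

image_encoder = SimpleImageEncoder()
text_encoder = SimpleTextEncoder()

content_images = torch.randn(4, 3, 224, 224)          # content graph samples
processed_images = torch.randn(4, 3, 224, 224)        # data-enhanced (processed) versions
label_token_ids = torch.randint(0, 30000, (4, 8))     # content classification labels as text
text_token_ids = torch.randint(0, 30000, (4, 32))     # title / publisher text information

first_image_features = image_encoder(content_images)      # from the original samples
second_image_features = image_encoder(processed_images)   # from the processed samples
first_text_features = text_encoder(label_token_ids)       # from the classification labels
second_text_features = text_encoder(text_token_ids)       # from the text information
```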
In one embodiment, text feature extraction is performed on content classification labels and text information corresponding to each content graph sample, to obtain a first text feature and a second text feature of each content graph sample, including:
extracting text features from content classification labels corresponding to the content graph samples to obtain first text features;
text division is carried out on text information corresponding to each content graph sample, and a text sequence corresponding to each content graph sample is obtained;
masking each text sequence, wherein part of the text tokens in the masked text sequence are replaced with mask tokens, and generating each second text feature based on each masked text sequence.
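A minimal sketch of the masking operation, assuming a BERT-style vocabulary with a dedicated mask token id; the token ids and masked positions used here are illustrative only (the next embodiment describes how the positions are actually chosen).

```python
import torch

MASK_TOKEN_ID = 103          # assumed id of the [MASK] token in the vocabulary
PAD_TOKEN_ID = 0

def mask_text_sequence(token_ids: torch.Tensor, positions_to_mask) -> torch.Tensor:
    """Replace the tokens at the given positions of one text sequence with mask tokens.

    token_ids: (L,) token ids of one text sequence (title, publisher name, ...).
    positions_to_mask: iterable of token positions to replace.
    """
    masked = token_ids.clone()
    for pos in positions_to_mask:
        if masked[pos] != PAD_TOKEN_ID:              # never mask padding
            masked[pos] = MASK_TOKEN_ID
    return masked

# Example: mask two tokens of a toy 8-token sequence.
sequence = torch.tensor([101, 2769, 4263, 3221, 1920, 4511, 2094, 102])
masked_sequence = mask_text_sequence(sequence, positions_to_mask=[2, 5])
# The second text feature is then generated by running the text encoder
# (see the encoder sketch above) over `masked_sequence`.
```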
In one embodiment, the text sequence corresponding to a content graph sample includes a plurality of text tokens;
masking each text sequence includes:
calculating the contribution degree of each text token in each text sequence, wherein the contribution degree is the contribution of the text token to the prediction of the content classification label;
and determining the key text tokens in each text sequence according to the contribution degree of each text token in the text sequence, and determining the key text tokens in each text sequence as the text tokens to be replaced in that text sequence.
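The claims do not fix how the contribution degree is computed; the sketch below assumes a simple occlusion-style measure, where a token's contribution is the drop in the predicted probability of the content classification label when that token is masked out, and the highest-contribution tokens are then selected as the key text tokens to be replaced.

```python
import torch
import torch.nn as nn

def token_contributions(classifier, token_ids, label_index, mask_token_id=103):
    """Occlusion-based contribution of each token to the label prediction.

    classifier(token_ids) is assumed to return label logits of shape (num_labels,).
    """
    with torch.no_grad():
        base_prob = torch.softmax(classifier(token_ids), dim=-1)[label_index]
        contributions = torch.zeros(len(token_ids))
        for pos in range(len(token_ids)):
            occluded = token_ids.clone()
            occluded[pos] = mask_token_id
            prob = torch.softmax(classifier(occluded), dim=-1)[label_index]
            contributions[pos] = base_prob - prob    # large drop => important token
    return contributions

def select_key_tokens(contributions: torch.Tensor, top_k: int = 2):
    """Positions with the highest contribution are the key tokens to be masked."""
    return torch.topk(contributions, k=min(top_k, len(contributions))).indices.tolist()

# Tiny illustrative classifier: bag-of-embeddings + linear layer over 10 labels.
embed = nn.Embedding(30000, 64)
head = nn.Linear(64, 10)
toy_classifier = lambda ids: head(embed(ids).mean(dim=0))

token_ids = torch.randint(0, 30000, (8,))
scores = token_contributions(toy_classifier, token_ids, label_index=3)
key_positions = select_key_tokens(scores, top_k=2)
# `key_positions` would then be passed to the masking step sketched above.
```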
In one embodiment, training the initial pre-training model based on sample features of each content graph sample to obtain a target pre-training model includes:
based on sample characteristics of each content graph sample, constructing multi-mode characteristics corresponding to each content graph sample;
based on the multi-modal characteristics of each content graph sample, training the initial pre-training model to obtain a target pre-training model.
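As an illustrative assumption, the multi-modal feature of a content graph sample can be built by concatenating its image features and text features and projecting them into a joint space; the claims do not fix the fusion operator, so the concatenation-plus-projection design below is only one possibility.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Concatenate the image and text features of a content graph sample and project."""
    def __init__(self, dim=256):
        super().__init__()
        # first/second image feature + first/second text feature -> 4 * dim inputs
        self.proj = nn.Sequential(nn.Linear(4 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, img1, img2, txt1, txt2):        # each: (batch, dim)
        return self.proj(torch.cat([img1, img2, txt1, txt2], dim=-1))

# Stand-in features for a batch of 4 content graph samples.
img1, img2, txt1, txt2 = (torch.randn(4, 256) for _ in range(4))
multimodal_features = MultiModalFusion()(img1, img2, txt1, txt2)   # (4, 256)
```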
In one embodiment, training the initial pre-training model based on multi-modal characteristics of each content graph sample to obtain a target pre-training model includes:
training the initial pre-training model based on the multi-modal features and at least one of the fusion features and the sample features of each content graph sample, to obtain a target pre-training model;
wherein the fusion features comprise a first fusion feature and a second fusion feature, the first fusion feature being a feature constructed based on the first image feature and the second text feature of a content graph sample, and the second fusion feature being a feature constructed based on the second image feature and the second text feature of the content graph sample.
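A minimal sketch of how the first and second fusion features could be constructed; concatenation followed by a linear projection is an assumption, and any image-text fusion operator could be substituted.

```python
import torch
import torch.nn as nn

class PairFusion(nn.Module):
    """Fuse one image feature with one text feature (concatenation + projection)."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, image_feature, text_feature):   # each: (batch, dim)
        return self.proj(torch.cat([image_feature, text_feature], dim=-1))

pair_fusion = PairFusion()
first_img, second_img, second_txt = (torch.randn(4, 256) for _ in range(3))
first_fusion_feature = pair_fusion(first_img, second_txt)    # original image + masked text
second_fusion_feature = pair_fusion(second_img, second_txt)  # processed image + masked text
```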
In one embodiment, training the initial pre-training model based on the multi-modal features and at least one of the fusion features and the sample features of each content graph sample to obtain a target pre-training model includes:
during the training process:
based on the multi-mode features, obtaining predicted content classification labels corresponding to the content graph samples, and calculating cross entropy loss information corresponding to the content graph samples through the predicted content classification labels and the content classification labels;
calculating and determining similarity loss information corresponding to each content graph sample based on at least one of fusion characteristics and sample characteristics of each content graph sample;
based on the cross entropy loss information and the similarity loss information, model parameters of the initial pre-training model are updated.
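A compressed sketch of one parameter update under this embodiment, assuming a linear classification head over the multi-modal features and an AdamW optimizer; the unweighted sum of the two losses is likewise an assumption.

```python
import torch
import torch.nn as nn

# Assumed components of the initial pre-training model.
label_head = nn.Linear(256, 10)                       # predicts content classification labels
optimizer = torch.optim.AdamW(label_head.parameters(), lr=1e-4)

multimodal_features = torch.randn(4, 256)             # as built in the fusion sketch above
content_labels = torch.randint(0, 10, (4,))           # ground-truth content classification labels
similarity_loss = torch.tensor(0.3)                   # computed as in the next embodiment

logits = label_head(multimodal_features)              # predicted content classification labels
cross_entropy_loss = nn.functional.cross_entropy(logits, content_labels)

total_loss = cross_entropy_loss + similarity_loss     # relative weighting is an implementation choice
optimizer.zero_grad()
total_loss.backward()
optimizer.step()
```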
In one embodiment, calculating and determining similarity loss information corresponding to each content graph sample based on at least one of a fusion feature and a sample feature of each content graph sample includes:
Calculating a first similarity between the first image feature and the second image feature of each content graph sample;
calculating a second similarity between the text sub-features in the second text feature of each content graph sample;
calculating a third similarity between the first fusion feature and the second fusion feature of each content graph sample;
and obtaining similarity loss information corresponding to each content graph sample based on each first similarity, each second similarity and each third similarity.
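An illustrative way to turn the three similarities into similarity loss information, assuming cosine similarity for each pair of features and a loss that decreases as the paired features become more similar; the exact mapping from similarities to a loss value is not fixed by the claims.

```python
import torch
import torch.nn.functional as F

def similarity_loss(first_img, second_img, token_features, first_fusion, second_fusion):
    """Encourage the paired features of one content graph sample to stay consistent.

    first_img, second_img, first_fusion, second_fusion: (batch, dim)
    token_features: (batch, seq_len, dim) text sub-features of the second text feature
    """
    s1 = F.cosine_similarity(first_img, second_img, dim=-1)               # first similarity
    # Second similarity: mean pairwise cosine similarity between text sub-features.
    normed = F.normalize(token_features, dim=-1)
    s2 = torch.matmul(normed, normed.transpose(1, 2)).mean(dim=(1, 2))
    s3 = F.cosine_similarity(first_fusion, second_fusion, dim=-1)         # third similarity
    # Higher similarity -> lower loss (the "3 - sum" mapping is an assumption).
    return (3.0 - (s1 + s2 + s3)).mean()

loss = similarity_loss(torch.randn(4, 256), torch.randn(4, 256),
                       torch.randn(4, 16, 256), torch.randn(4, 256), torch.randn(4, 256))
```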
In one embodiment, training the initial pre-training model based on the multi-modal features and at least one of the fusion features and the sample features of each content graph sample to obtain a target pre-training model further includes:
in the training process, evaluating image-text matching based on the first image feature and the second text feature of each content graph sample, to obtain the image-text matching degree corresponding to each content graph sample;
updating model parameters of the initial pre-training model based on the cross entropy loss information and the similarity loss information, comprising:
and updating model parameters of the initial pre-training model based on the cross entropy loss information, the similarity loss information and the image-text matching degree.
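A sketch of one way the image-text matching degree could enter the objective, assuming a small binary matching head over the concatenated first image feature and second text feature; in practice mismatched (negative) image-text pairs would also be sampled, which is omitted here for brevity.

```python
import torch
import torch.nn as nn

# Assumed image-text matching head: predicts whether the first image feature and the
# second text feature of a content graph sample describe the same content.
match_head = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))

first_image_features = torch.randn(4, 256)
second_text_features = torch.randn(4, 256)
match_targets = torch.ones(4, 1)                  # all pairs matched in this toy batch

match_logits = match_head(torch.cat([first_image_features, second_text_features], dim=-1))
matching_loss = nn.functional.binary_cross_entropy_with_logits(match_logits, match_targets)

cross_entropy_loss = torch.tensor(0.7)            # as computed in the earlier sketches
similarity_loss = torch.tensor(0.3)
total_loss = cross_entropy_loss + similarity_loss + matching_loss
```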
In a second aspect, the present application provides a training method for a somatosensory wind identification model. The method comprises the following steps:
acquiring somatosensory wind training samples and somatosensory wind labels corresponding to the somatosensory wind training samples;
training an initial somatosensory wind identification model based on each somatosensory wind training sample to obtain a trained somatosensory wind identification model, wherein the somatosensory wind identification model identifies the somatosensory wind category of data information;
the method for obtaining the initial somatosensory wind identification model comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
acquiring content classification labels corresponding to content graph samples in each sample data pair;
extracting characteristics of each sample data pair and a content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics;
training the initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, and taking the target pre-training model as an initial somatosensory wind identification model.
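An illustrative fine-tuning sketch for this aspect: the target pre-training model is reused as the backbone of the initial somatosensory wind identification model, a new classification head over the somatosensory wind categories is attached, and the whole model is trained on the somatosensory wind training samples. The placeholder backbone, class count and learning rate are assumptions.

```python
import torch
import torch.nn as nn

NUM_SOMATOSENSORY_CLASSES = 12     # assumed number of somatosensory wind categories

# `pretrained_backbone` stands in for the target pre-training model obtained above;
# here it is a placeholder module with a 256-dimensional output.
pretrained_backbone = nn.Sequential(nn.Linear(512, 256), nn.ReLU())

somatosensory_model = nn.Sequential(
    pretrained_backbone,                            # initialised from the target pre-training model
    nn.Linear(256, NUM_SOMATOSENSORY_CLASSES),      # new somatosensory wind classification head
)

optimizer = torch.optim.AdamW(somatosensory_model.parameters(), lr=2e-5)
features = torch.randn(8, 512)                              # somatosensory wind training samples
labels = torch.randint(0, NUM_SOMATOSENSORY_CLASSES, (8,))  # somatosensory wind labels

logits = somatosensory_model(features)
loss = nn.functional.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```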
In a third aspect, the present application provides a method for identifying somatosensory wind. The method comprises the following steps:
Acquiring data information to be identified;
based on the data information to be identified, obtaining, through a somatosensory wind identification model, a predicted somatosensory wind label corresponding to the data information to be identified, wherein the predicted somatosensory wind label is used for describing the somatosensory wind category of the data information to be identified;
the method for obtaining the somatosensory wind identification model comprises the following steps:
acquiring somatosensory wind training samples and somatosensory wind labels corresponding to the somatosensory wind training samples;
training an initial somatosensory wind identification model based on each somatosensory wind training sample to obtain a trained somatosensory wind identification model;
the method for obtaining the initial somatosensory wind identification model comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
acquiring content classification labels corresponding to content graph samples in each sample data pair;
extracting characteristics of each sample data pair and a content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics;
training the initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, and taking the target pre-training model as an initial somatosensory wind identification model.
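For this identification aspect, inference reduces to a forward pass of the trained somatosensory wind identification model over the data information to be identified, followed by an argmax over the somatosensory wind categories; the category names and the toy model below are assumptions for illustration.

```python
import torch
import torch.nn as nn

SOMATOSENSORY_CATEGORIES = ["positive energy", "light entertainment", "healing", "trendy"]  # assumed labels

def identify_somatosensory_wind(model: nn.Module, data_information: torch.Tensor) -> str:
    """Return the predicted somatosensory wind label for one piece of data information."""
    model.eval()
    with torch.no_grad():
        logits = model(data_information.unsqueeze(0))         # (1, num_categories)
        predicted_index = int(logits.argmax(dim=-1))
    return SOMATOSENSORY_CATEGORIES[predicted_index]

# Toy stand-in for the trained somatosensory wind identification model.
toy_model = nn.Linear(512, len(SOMATOSENSORY_CATEGORIES))
predicted_label = identify_somatosensory_wind(toy_model, torch.randn(512))
```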
In a fourth aspect, the present application further provides a training device for pre-training a model. The device comprises:
the acquisition module is used for acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample; obtaining content classification labels corresponding to the content graph samples in each sample data pair;
the processing module is used for extracting the characteristics of each sample data pair and the content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics;
the first training module is used for training the initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, wherein the target pre-training model is used for training to obtain a somatosensory wind identification model, and the somatosensory wind identification model identifies the somatosensory wind category of data information.
In a fifth aspect, the present application further provides a training device for a somatosensory wind identification model. The device comprises:
the acquisition module is used for acquiring somatosensory wind training samples and somatosensory wind labels corresponding to the somatosensory wind training samples;
the second training module is used for training an initial somatosensory wind identification model based on each somatosensory wind training sample to obtain a trained somatosensory wind identification model;
The method for obtaining the initial somatosensory wind identification model comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
acquiring content classification labels corresponding to content graph samples in each sample data pair;
extracting characteristics of each sample data pair and a content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics;
training the initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, and taking the target pre-training model as an initial somatosensory wind identification model.
In a sixth aspect, the application further provides a device for identifying somatosensory wind. The device comprises:
the acquisition module is used for acquiring the data information to be identified;
the identification module is used for obtaining, through the somatosensory wind identification model and based on the data information to be identified, a predicted somatosensory wind label corresponding to the data information to be identified, wherein the predicted somatosensory wind label is used for describing the somatosensory wind category of the data information to be identified;
the method for obtaining the somatosensory wind identification model comprises the following steps:
acquiring somatosensory wind training samples and somatosensory wind labels corresponding to the somatosensory wind training samples;
training an initial somatosensory wind identification model based on each somatosensory wind training sample to obtain a trained somatosensory wind identification model;
the method for obtaining the initial somatosensory wind identification model comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
acquiring content classification labels corresponding to content graph samples in each sample data pair;
extracting characteristics of each sample data pair and a content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics;
training the initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, and taking the target pre-training model as an initial somatosensory wind identification model.
In a seventh aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
Acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
acquiring content classification labels corresponding to content graph samples in each sample data pair;
extracting characteristics of each sample data pair and a content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics;
based on sample characteristics of each content graph sample, training an initial pre-training model to obtain a target pre-training model, wherein the target pre-training model is used for training to obtain a somatosensory wind identification model, and the somatosensory wind identification model identifies the somatosensory wind category of data information.
In an eighth aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring somatosensory wind training samples and somatosensory wind labels corresponding to the somatosensory wind training samples;
training an initial somatosensory wind identification model based on each somatosensory wind training sample to obtain a trained somatosensory wind identification model, wherein the somatosensory wind identification model identifies the somatosensory wind category of data information;
The method for obtaining the initial somatosensory wind identification model comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
acquiring content classification labels corresponding to content graph samples in each sample data pair;
extracting characteristics of each sample data pair and a content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics;
training the initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, and taking the target pre-training model as an initial somatosensory wind identification model.
In a ninth aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring data information to be identified;
based on the data information to be identified, obtaining, through a somatosensory wind identification model, a predicted somatosensory wind label corresponding to the data information to be identified, wherein the predicted somatosensory wind label is used for describing the somatosensory wind category of the data information to be identified;
the method for obtaining the somatosensory wind identification model comprises the following steps:
acquiring somatosensory wind training samples and somatosensory wind labels corresponding to the somatosensory wind training samples;
training an initial somatosensory wind identification model based on each somatosensory wind training sample to obtain a trained somatosensory wind identification model;
the method for obtaining the initial somatosensory wind identification model comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
acquiring content classification labels corresponding to content graph samples in each sample data pair;
extracting characteristics of each sample data pair and a content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics;
training the initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, and taking the target pre-training model as an initial somatosensory wind identification model.
In a tenth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
Acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
acquiring content classification labels corresponding to content graph samples in each sample data pair;
extracting characteristics of each sample data pair and a content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics;
based on sample characteristics of each content graph sample, training an initial pre-training model to obtain a target pre-training model, wherein the target pre-training model is used for training to obtain a somatosensory wind identification model, and the somatosensory wind identification model identifies the somatosensory wind category of data information.
In an eleventh aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring somatosensory wind training samples and somatosensory wind labels corresponding to the somatosensory wind training samples;
training an initial somatosensory wind identification model based on each somatosensory wind training sample to obtain a trained somatosensory wind identification model, wherein the somatosensory wind identification model identifies the somatosensory wind category of data information;
The method for obtaining the initial somatosensory wind identification model comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
acquiring content classification labels corresponding to content graph samples in each sample data pair;
extracting characteristics of each sample data pair and a content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics;
training the initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, and taking the target pre-training model as an initial somatosensory wind identification model.
In a twelfth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring data information to be identified;
based on the data information to be identified, obtaining, through a somatosensory wind identification model, a predicted somatosensory wind label corresponding to the data information to be identified, wherein the predicted somatosensory wind label is used for describing the somatosensory wind category of the data information to be identified;
the method for obtaining the somatosensory wind identification model comprises the following steps:
acquiring somatosensory wind training samples and somatosensory wind labels corresponding to the somatosensory wind training samples;
training an initial somatosensory wind identification model based on each somatosensory wind training sample to obtain a trained somatosensory wind identification model;
the method for obtaining the initial somatosensory wind identification model comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
acquiring content classification labels corresponding to content graph samples in each sample data pair;
extracting characteristics of each sample data pair and a content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics;
training the initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, and taking the target pre-training model as an initial somatosensory wind identification model.
In a thirteenth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
Acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
acquiring content classification labels corresponding to content graph samples in each sample data pair;
extracting characteristics of each sample data pair and a content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics;
based on sample characteristics of each content graph sample, training an initial pre-training model to obtain a target pre-training model, wherein the target pre-training model is used for training to obtain a somatosensory wind identification model, and the somatosensory wind identification model identifies the somatosensory wind category of data information.
In a fourteenth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
acquiring somatosensory wind training samples and somatosensory wind labels corresponding to the somatosensory wind training samples;
training an initial somatosensory wind identification model based on each somatosensory wind training sample to obtain a trained somatosensory wind identification model, wherein the somatosensory wind identification model identifies the somatosensory wind category of data information;
The method for obtaining the initial somatosensory wind identification model comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
acquiring content classification labels corresponding to content graph samples in each sample data pair;
extracting characteristics of each sample data pair and a content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics;
training the initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, and taking the target pre-training model as an initial somatosensory wind identification model.
In a fifteenth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
acquiring data information to be identified;
based on the data information to be identified, obtaining, through a somatosensory wind identification model, a predicted somatosensory wind label corresponding to the data information to be identified, wherein the predicted somatosensory wind label is used for describing the somatosensory wind category of the data information to be identified;
the method for obtaining the somatosensory wind identification model comprises the following steps:
acquiring somatosensory wind training samples and somatosensory wind labels corresponding to the somatosensory wind training samples;
training an initial somatosensory wind identification model based on each somatosensory wind training sample to obtain a trained somatosensory wind identification model;
the method for obtaining the initial somatosensory wind identification model comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
acquiring content classification labels corresponding to content graph samples in each sample data pair;
extracting characteristics of each sample data pair and a content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics;
training the initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, and taking the target pre-training model as an initial somatosensory wind identification model.
According to the above training method and apparatus for the pre-training model and the somatosensory wind identification model, computer device, storage medium and computer program product, each sample data pair is acquired, where a sample data pair comprises a content graph sample and data description information corresponding to the content graph sample; the content classification label corresponding to the content graph sample in each sample data pair is acquired; feature extraction is performed on each sample data pair and the content classification label corresponding to each content graph sample to obtain sample features of each content graph sample, the sample features comprising image features and text features; and an initial pre-training model is trained based on the sample features of each content graph sample to obtain a target pre-training model, where the target pre-training model is used to train a somatosensory wind identification model that identifies the somatosensory wind category of data information. Because each sample data pair comprises a content graph sample and its corresponding data description information, the sample features of each content graph sample can describe the feature information contained in the content graph sample from multiple dimensions, and the target pre-training model can learn more of the feature information contained in the content graph samples, so the accuracy of the target pre-training model obtained through training can be improved. The somatosensory wind identification model is then trained on the basis of this target pre-training model, which further improves the accuracy of the somatosensory wind identification model and, in turn, the accuracy of identifying the somatosensory wind category of data information.
Drawings
FIG. 1 is an application environment diagram of a training method of a pre-training model in one embodiment;
FIG. 2 is a schematic diagram of a somatosensory wind identification system according to an embodiment;
FIG. 3 is a flow diagram of a training method of a pre-training model in one embodiment;
FIG. 4 is a flow diagram of sample features for obtaining samples of content graphs in one embodiment;
FIG. 5 is a schematic diagram of data enhancement processing of content graph samples in one embodiment;
FIG. 6 is a schematic diagram of acquiring a first image feature and a second image feature in one embodiment;
FIG. 7 is a schematic diagram of acquiring a first text feature and a second text feature in one embodiment;
FIG. 8 is a flow diagram of obtaining first text features and second text features in one embodiment;
FIG. 9 is a diagram of acquiring second text features, in one embodiment;
FIG. 10 is a flow diagram of masking text sequences in one embodiment;
FIG. 11 is a flow diagram of determining a key text label in one embodiment;
FIG. 12 is a partial flow diagram of a training method of a pre-training model in one embodiment;
FIG. 13 is a schematic diagram of acquiring multi-modal features in one embodiment;
FIG. 14 is a partial flow diagram of a training method of a pre-training model in another embodiment;
FIG. 15 is a schematic diagram of constructing a first fusion feature and a second fusion feature in one embodiment;
FIG. 16 is a partial flow diagram of a training method of a pre-training model in yet another embodiment;
FIG. 17 is a flow diagram of determining similarity loss information in one embodiment;
FIG. 18 is a partial flow diagram of a training method for a pre-training model in yet another embodiment;
FIG. 19 is a flow chart of a method of feature processing during the foregoing training process in one embodiment;
FIG. 20 is a flow chart of training of a somatosensory wind recognition model in one embodiment;
FIG. 21 is a flowchart of a method for obtaining somatosensory wind labels corresponding to somatosensory wind training samples according to an embodiment;
FIG. 22 is a schematic diagram of a somatosensory wind tag and corresponding description in one embodiment;
FIG. 23 is a flowchart of a method for recognizing somatosensory wind in an embodiment;
FIG. 24 is a flowchart illustrating a method for recognizing somatosensory wind according to an embodiment;
FIG. 25 is a schematic diagram of a training device for pre-training a model in one embodiment;
FIG. 26 is a schematic structural diagram of a training device for a somatosensory wind identification model in one embodiment;
FIG. 27 is a schematic structural diagram of a somatosensory wind identification device according to an embodiment;
fig. 28 is an internal structural view of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Artificial intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer vision (CV) is a science that studies how to make machines "see". More specifically, it uses cameras and computers instead of human eyes to recognize and measure targets, and further performs graphic processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e. the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robotic question answering, knowledge graph techniques, and the like.
Machine learning is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how computers simulate or implement human learning behaviors to acquire new knowledge or skills, and how they reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied in all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
The scheme provided by the embodiment of the application relates to technologies such as image processing, text processing and machine learning of artificial intelligence, and is specifically described by the following embodiments:
the training method of the pre-training model provided by the embodiment of the application can be applied to an application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be located on the cloud or other servers.
Specifically, taking application to the server 104 as an example, the server 104 may obtain the sample data pairs and the content classification label corresponding to the content graph sample in each sample data pair from the data storage system. The server 104 then performs feature extraction on each sample data pair and the content classification label corresponding to each content graph sample to obtain sample features of each content graph sample, where the sample features include image features and text features. Based on the sample features of each content graph sample, the server 104 trains an initial pre-training model to obtain a target pre-training model, where the target pre-training model is used to train a somatosensory wind identification model, and the somatosensory wind identification model identifies the somatosensory wind category of data information.
Alternatively, taking a terminal 102 with high computing power as an example, the terminal 102 may obtain the sample data pairs and the content classification labels corresponding to the content graph samples in each sample data pair through communication with the server 104. The terminal 102 then performs feature extraction on each sample data pair and the content classification label corresponding to each content graph sample to obtain sample features of each content graph sample, where the sample features include image features and text features. Based on the sample features of each content graph sample, the terminal 102 trains an initial pre-training model to obtain a target pre-training model, where the target pre-training model is used to train a somatosensory wind identification model, and the somatosensory wind identification model identifies the somatosensory wind category of data information.
The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices and portable wearable devices, and the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle devices, aircrafts, etc. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster of multiple servers. The embodiment of the invention can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, intelligent transportation, auxiliary driving and the like.
Specifically, the training method of the pre-training model provided by the embodiments of the application can be applied to the somatosensory wind identification system shown in fig. 2. The training method of the pre-training model, the training method of the somatosensory wind identification model, the somatosensory wind identification method, and the main functions of each service module are described below:
1. content production end 201
Professionally generated content (PGC), user generated content (UGC) and multi-channel network (MCN) producers provide image-text or video content through a mobile or back-end application program interface (API) system; these are the primary content sources of the content production end 201. The content production end 201 uploads image-text content by communicating with the uplink and downlink content interface server 203. The source of image-text content is typically a lightweight publishing terminal or a content editing portal, while video content is typically published from a shooting terminal.
2. Content consumer 202
The content consumption end 202 communicates with the uplink and downlink content interface server 203 to obtain index information of the content to be accessed, for example through recommendation pushes, and then communicates with the content storage server 204 to obtain the corresponding content, including recommended content. The content storage server 204 stores content entities such as video source files and picture source files, while meta information of the content, such as title, author, cover image, category and tag information, is stored in the content database 205. The content consumption end 202 also reports user behavior data collected during uploading and downloading, such as click time and play clicks, to the back end for statistical analysis. In addition, the content consumption end 202 browses content data, and data from external channels enters the system through the content consumption end 202 via the uplink and downlink content interface server 203.
3. Uplink and downlink content interface server 203
The uplink and downlink content interface server 203 communicates directly with the content production end 201. Content submitted from the front end, typically the title, publisher, abstract, cover image and publication time of the content, is stored in the content database 205. The uplink and downlink content interface server 203 also writes meta information of the image-text content, such as file size, cover image link, title, publication time and author, into the content database 205. Further, the uplink and downlink content interface server 203 synchronizes the content submitted by the content production end 201 to the dispatch center server 206 for subsequent content processing and circulation.
4. Content storage server 204
The content storage server 204 is used to store content entity information other than the meta information of the content, such as the video source files and picture source files of the image-text content; the terminal accesses the source files directly from the content storage server 204 when consuming video content. In addition, when the labels corresponding to the samples are extracted, the video source files provide frames extracted from the middle of the source file, and the frames extracted from the samples are used as a candidate set of samples.
5. Content database 205
All meta information of the content released by the content production end 201 is stored in the content database 205, mainly the meta information of the content itself, such as file size, cover image link, bit rate, file format, title, release time, author, video file size, video format, whether it is marked as original or as a first publication, and the classification of the content obtained in the manual review process. The classification of content in the manual review process includes first-level, second-level and third-level classifications and tag information; for example, for an article about an A-brand mobile phone, the first-level classification is technology, the second-level classification is smartphone, the third-level classification is domestic mobile phone, and the tag information is A. When the manual auditing system 207 performs manual review, it reads information from the content database 205, and the result and state of the manual review obtained by the manual auditing system 207 are also written back to the content database 205.
Further, the content processing performed by the dispatch center server mainly includes machine processing and manual review. The core of machine processing covers various quality judgments such as low-quality filtering, content labeling such as classification and tag information, and content deduplication, which is performed by the deduplication server 208; specifically, the deduplication result of the content is written into the content database 205, so that completely duplicated content is not sent to the manual auditing system 207, avoiding repeated secondary processing. The meta information of the content, such as the content title, cover image and tags, is read from the content database 205 when subsequent modeling and identification require it.
6. Dispatch center server 206
The dispatch center server 206 is responsible for the whole dispatching process of content circulation. It receives stored content through the uplink and downlink content interface server 203 and then obtains the meta information of the content from the content database 205. It also schedules the manual auditing system 207 and the machine processing system, controlling the order and priority of scheduling. Content is provided to the content consumption end 202 through a content outlet distribution service (typically a recommendation engine, a search engine, or operations) via a display page; that is, the content consumption end 202 obtains content index information, i.e., the entry address for content consumption access. Further, by communicating with the content somatosensory wind identification service 209, the somatosensory wind categories of data information are identified and marked during the circulation of information-flow content.
7. Manual auditing system 207
The manual auditing system 207 is a carrier of manual service capability and is mainly used for reviewing data information that machines cannot determine or judge, such as sensitive data information. The manual auditing system 207 can also perform secondary confirmation of the classification labels of special types of video, ensuring the labeling effect and quality.
8. Somatosensory wind identification service 209 and somatosensory wind identification model 210
According to the training method of the somatosensory wind identification model, somatosensory wind training samples and the somatosensory wind labels corresponding to the somatosensory wind training samples are obtained from the somatosensory wind training sample database 211, and the somatosensory wind identification model 210 is trained on the basis of the obtained target pre-training model. The somatosensory wind identification service 209 is provided based on the somatosensory wind identification model 210.
9. Target pre-training model 212 and multimodal pre-training sample database 213
According to the training method of the pre-training model, each sample data pair and the content classification label corresponding to the content graph sample in each sample data pair are obtained from the multi-modal pre-training sample database 213, and the target pre-training model 212 is obtained through training.
10. Crawling and data preprocessing system 214
The crawling and data preprocessing system 214 crawls content graphs corresponding to the information-flow content from the internet to supplement the pre-training data of the relevant domain.
11. Video frame extraction and image-text content parsing service 215
The video frame extraction and image-text content parsing service 215 is used to obtain the necessary video file frames from the video source file, providing the source of raw data for the subsequent construction of a video cover image. In addition, when the content contains multiple pictures, the service parses the content and extracts the pictures that may serve as cover images, which are used as input together with the cover image uploaded by the original author.
Based on this, in one embodiment, as shown in fig. 3, a training method of a pre-training model is provided. The method is described by taking its application to the server in fig. 1 as an example; it can be understood that the method may also be applied to a terminal, or to a system including the terminal and the server and implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
In step 302, each sample data pair is acquired, where the sample data pair includes a content graph sample and data description information corresponding to the content graph sample.
The sample data pair includes a content graph sample and data description information corresponding to the content graph sample. The content graph sample may be a cover image sample or a thumbnail sample, and the data description information specifically includes a processed content graph sample obtained by processing the content graph sample, and text information corresponding to the content graph sample, where the text information specifically includes the title (Title) of the content graph sample, the publisher name (Puin_Name) of the content graph sample, and the like.
Specifically, the server first obtains a video file sample set, where the video file sample set may be a plurality of video file samples downloaded from a database, or may be a plurality of video file samples uploaded through a terminal, which is not limited herein. The server then invokes the video snapshot and teletext content parsing service to acquire video file frames from each video file sample, and the acquired video file frames serve as the source of the raw data for constructing the content map samples.
Further, the server constructs content graph samples based on the acquired multiple video file frames, and performs data enhancement processing on each content graph sample to obtain each processed content graph sample. And secondly, calling a picture and text content analysis service to extract text information of each content picture sample to obtain text information corresponding to each content picture sample, and forming data description information corresponding to each content picture sample through the processed content picture sample and the text information corresponding to each content picture sample, thereby obtaining each sample data pair comprising the content picture sample and the data description information corresponding to the content picture sample.
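For ease of understanding, the following is a minimal sketch, in Python, of how a sample data pair might be assembled from one video file sample; the frame-selection strategy, the specific augmentations, and all field names are illustrative assumptions rather than part of the scheme described above.

```python
# Sketch: building one sample data pair (content map sample + data description information).
# Assumes OpenCV and torchvision are available; the augmentation choices are illustrative.
import random
import cv2
from torchvision import transforms

augment = transforms.Compose([                  # data enhancement of the content map sample
    transforms.ToPILImage(),
    transforms.RandomResizedCrop(224),
    transforms.ColorJitter(0.4, 0.4, 0.4),
    transforms.RandomHorizontalFlip(),
])

def extract_cover_frame(video_path):
    """Grab one frame from the video file sample as the content map sample."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, random.randrange(max(total, 1)))
    ok, frame = cap.read()
    cap.release()
    return frame if ok else None

def build_sample_pair(video_path, title, publisher_name):
    """Return one sample data pair: the content map sample and its data description information."""
    content_map = extract_cover_frame(video_path)
    return {
        "content_map": content_map,
        "description": {
            "processed_map": augment(content_map),   # processed content map sample
            "title": title,                          # title of the content map sample
            "publisher_name": publisher_name,        # publisher name (Puin_Name)
        },
    }
```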
Step 304, obtaining content classification labels corresponding to the content graph samples in each sample data pair.
The content classification labels are used for describing the categories of the content information included in the content graph samples, and each content graph sample may correspond to one or more content classification labels. For example, when the content map sample shows an animal on a lawn, the content classification labels may be cat, dog, grassland, and the like.
And 306, extracting the characteristics of each sample data pair and the content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics.
The image features include the image features corresponding to the content graph samples and the image features corresponding to the images included in the data description information of the sample data pairs. Similarly, the text features include the text features corresponding to the content classification labels and the text features corresponding to the text included in the data description information of the sample data pairs.
Step 308, training the initial pre-training model based on the sample characteristics of each content graph sample to obtain a target pre-training model, wherein the target pre-training model is used for training to obtain a somatosensory wind identification model, and the somatosensory wind identification model identifies the somatosensory wind category of the data information.
The somatosensory wind category is used for describing the category corresponding to the style and tonality of data information, where the data information may be text, pictures, videos, or music. Specifically, style and tonality refer to the overall style of the data information; data information of the same style and tonality has a certain commonality and can resonate with a class of users, for example: eye-pleasing, healing, middle-aged and elderly, campus, and trendy. Thus, style and tonality are an overall feeling given to the user, which may be auditory, e.g., soothing or cheerful, or visual, e.g., joyful or sad.
On this basis, the initial pre-training model is trained using the predicted content classification labels and the content classification labels, and the target pre-training model is obtained. The target pre-training model thus obtained is the target pre-training model 212 in fig. 2, which is used for training to obtain the somatosensory wind identification model, and the somatosensory wind identification model is specifically used for identifying the somatosensory wind category of data information.
In the training method of the pre-training model, since the sample data pair includes the content graph sample and the data description information corresponding to the content graph sample, the sample features of each content graph sample can describe the feature information included in the content graph sample from multiple dimensions, so the target pre-training model can learn the feature information included in more content graph samples, which improves the accuracy of the target pre-training model obtained by training. On this basis, the somatosensory wind identification model is further trained by using the target pre-training model, so the accuracy of the somatosensory wind identification model is improved, and the accuracy of identifying the somatosensory wind category of data information is further improved.
In one embodiment, as shown in FIG. 4, the data description information includes: a processed content graph sample after processing the content graph sample, and text information corresponding to the content graph sample;
step 306, extracting features of each sample data pair and a content classification label corresponding to each content graph sample to obtain sample features of each content graph sample, which specifically includes:
and step 402, respectively extracting image features of each content graph sample and the processed content graph samples to obtain a first image feature and a second image feature of each content graph sample.
The data description information comprises a processed content graph sample after the content graph sample is processed. Specifically, the processing of the content map sample is specifically data enhancement processing of the content map sample, such as rotation, clipping, gaussian noise, masking, color conversion, and filters. For ease of understanding, as shown in fig. 5, the color change process is performed on the content map sample 502 to obtain a processed content map sample 504. Next, the content map sample 506 is clipped, and a processed content map sample 508 is obtained. It should be understood that the foregoing examples are for understanding data enhancement processing only and are not limiting of the present application.
Based on the image characteristics, the server extracts the image characteristics of each content graph sample, and first image characteristics corresponding to each content graph sample are obtained. And similarly, the server extracts image features of the processed content graph samples included in the data description information to obtain second image features corresponding to the processed content graph samples.
Specifically, the image feature extraction may be performed by using a Vision Transformer (ViT) model. As shown in fig. 6, image feature extraction is performed on the content image sample 602 and the processed content image sample 604 through the ViT model, so as to obtain a first image feature 606 corresponding to the content image sample 602 and a second image feature 608 corresponding to the processed content image sample 604.
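As a concrete illustration of this step, the following sketch extracts the first and second image features with an off-the-shelf ViT encoder from the Hugging Face transformers library; the checkpoint name and the use of the [CLS] representation as the image feature are assumptions.

```python
# Sketch: extracting the first/second image features with a ViT encoder.
import torch
from transformers import ViTImageProcessor, ViTModel

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

@torch.no_grad()
def image_feature(image):
    """Return the [CLS] representation of one image as its image feature."""
    inputs = processor(images=image, return_tensors="pt")
    return vit(**inputs).last_hidden_state[:, 0]          # shape [1, 768]

# first image feature: the content map sample; second image feature: its processed copy
# first_image_feature = image_feature(content_map_sample)
# second_image_feature = image_feature(processed_content_map_sample)
```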
And step 404, respectively extracting text features of the content classification labels and the text information corresponding to the content graph samples to obtain a first text feature and a second text feature of the content graph samples.
The data description information comprises text information corresponding to the content graph sample. Based on the text feature extraction, the server extracts text features from the content classification labels corresponding to the content graph samples, and obtains first text features corresponding to the content classification labels corresponding to the content graph samples. And similarly, the server extracts text features of the text information included in the data description information to obtain second text features corresponding to the text information.
It will be appreciated that the text information in the data description information may specifically include the title of the content map sample, the publisher name of the content map sample, and other text information describing the text content of the content map sample, which are not exhaustively listed herein. Accordingly, the second text feature corresponding to the text information may specifically include one text sub-feature for each piece of text information, for example a text sub-feature corresponding to the title of the content graph sample and a text sub-feature corresponding to the publisher name of the content graph sample; these are likewise not exhaustively listed herein.
Specifically, the text feature extraction may be performed by using a Bidirectional Encoder Representations from Transformers (BERT) model. The following description takes the case where the text information in the data description information includes the title of the content graph sample and the publisher name of the content graph sample as an example. As shown in fig. 7, text feature extraction is performed on the content classification tag 702 and the text information 704 through the BERT model, so that a first text feature 706 corresponding to the content classification tag 702 and a second text feature 708 corresponding to the text information 704 are obtained. Since the text information includes the title of the content graph sample and the publisher name of the content graph sample, the second text feature 708 specifically includes a text sub-feature 7082 corresponding to the title of the content graph sample and a text sub-feature 7084 corresponding to the publisher name of the content graph sample.
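A corresponding sketch for the text side is given below, using a Chinese BERT checkpoint from the Hugging Face transformers library as a stand-in for the BERT model in fig. 7; the checkpoint name, the example strings, and the [CLS] pooling are assumptions.

```python
# Sketch: extracting the first text feature (content classification label) and the
# second text feature (one sub-feature per text field) with a BERT encoder.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

@torch.no_grad()
def text_feature(text):
    """Return the [CLS] representation of one piece of text as its text feature."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
    return bert(**inputs).last_hidden_state[:, 0]          # shape [1, 768]

first_text_feature = text_feature("cat")                   # content classification label
second_text_feature = {                                     # one sub-feature per text field
    "title": text_feature("do you know how two cats quarrel?"),
    "publisher_name": text_feature("pet channel"),
}
```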
In step 406, the image features include a first image feature and a second image feature, and the text features include a first text feature and a second text feature.
Specifically, since image feature extraction is performed in step 402, the first image feature and the second image feature are obtained; that is, the image features in the sample features of each content image sample specifically include the first image feature and the second image feature. Similarly, since text feature extraction is performed in step 404, the first text feature and the second text feature are obtained; that is, the text features in the sample features of each content map sample specifically include the first text feature and the second text feature.
In this embodiment, image features are extracted from both the content image sample and the processed content image sample, that is, image features are obtained from multiple perspectives, which improves the richness of the image feature information included in the obtained image features. Likewise, text feature extraction is performed on both the content classification label and the text information corresponding to each content image sample, so text features are also obtained from multiple perspectives, which improves the richness of the text feature information included in the obtained text features and further improves the accuracy of the target pre-training model obtained by training.
In one embodiment, as shown in fig. 8, step 404, performing text feature extraction on the content classification tag and the text information corresponding to each content graph sample to obtain a first text feature and a second text feature of each content graph sample, includes:
and step 802, extracting text features from the content classification labels corresponding to the content graph samples to obtain first text features.
The first text feature is a text feature corresponding to the content classification tag.
And step 804, performing text division on the text information corresponding to each content graph sample to obtain a text sequence corresponding to each content graph sample.
The text sequence corresponding to each content graph sample comprises a plurality of text labels (Token). Specifically, after text division is performed on text information corresponding to the content graph sample, a plurality of text labels can be obtained, and a text sequence corresponding to the content graph sample is formed based on the text labels.
For example, if the text information corresponding to the content graph sample is "drags girl tens of meters! Girl cries miserably", a plurality of text tokens can be obtained after dividing the text information: [drags], [girl], [tens], [of], [meters], [!], [girl], [cries], [miserably], where "[ ]" together with the content inside it represents one text token; the corresponding text sequence is therefore: [drags][girl][tens][of][meters][!][girl][cries][miserably]. Similarly, if the text information corresponding to the content graph sample is "do you know how two cats quarrel?", a plurality of text tokens can be obtained after dividing the text information: [do], [you], [know], [how], [two], [cats], [quarrel], [?], and the corresponding text sequence is: [do][you][know][how][two][cats][quarrel][?]. It should be understood that the foregoing examples are provided merely for understanding the text sequences described in the present scheme and should not be construed as limiting the present scheme.
At step 806, masking is performed on each text sequence, and a portion of text labels in the masked text sequence are replaced with masking labels, and each second text feature is generated based on each masked text sequence.
The masking process masks part of the text tokens in the text sequence, that is, replaces those text tokens with mask tokens. In this embodiment, random blank filling with 0 is used, that is, the mask token is specifically [mask] and carries no other information. For example, if the text sequence is [drags][girl][tens][of][meters][!][girl][cries], the masked text sequence may be [drags][girl][mask][of][meters][!][girl][cries], or [mask][girl][mask][of][meters][!][girl][cries], and so on.
Specifically, masking processing is performed on each text sequence, so that part of the text tokens in the text sequence are replaced with mask tokens to obtain the masked text sequence, and each second text feature is generated based on each masked text sequence. For ease of understanding, taking the text information "do you know how two cats quarrel?" as an example, as shown in fig. 9, the text information 902 is first divided to obtain the text sequence corresponding to the content graph sample, and then the text sequence is masked to obtain the masked text sequence 904, which may be: [do][you][know][how][mask][cats][mask][?]. The masked text sequence 904 is then fed into the BERT model for text feature extraction to output the second text feature 906.
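A minimal sketch of the text division and masking described above follows; the word-level split and the random choice of masked positions are placeholders (the key-token strategy for choosing the replaced tokens is described below).

```python
# Sketch: dividing text information into a text sequence and masking part of its tokens.
import random

MASK = "[mask]"

def to_text_sequence(text):
    """Split the text information into individual text tokens (word-level here)."""
    return text.split()

def mask_sequence(tokens, positions=None, ratio=0.15):
    """Replace the tokens at `positions` (or a random subset) with the mask token."""
    if positions is None:
        k = max(1, int(len(tokens) * ratio))
        positions = set(random.sample(range(len(tokens)), k))
    return [MASK if i in positions else t for i, t in enumerate(tokens)]

tokens = to_text_sequence("do you know how two cats quarrel ?")
masked = mask_sequence(tokens)          # fed to BERT to produce the second text feature
```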
In this embodiment, by including the text sequence of the mask mark, the text feature can pay attention to the context information corresponding to the mask mark, that is, the text feature can include the relevance between more text information, and the obtained text feature is further enriched.
In the process of identifying the somatosensory wind, since the causes of each type of somatosensory wind are different, each somatosensory wind label has its own emphasis. For example, the negative-energy label is mainly triggered by certain negative-energy keywords, while the emotion-exaggeration label is mainly triggered by certain emotional words and even punctuation marks (such as exclamation marks). Therefore, in order for the trained somatosensory wind identification model to identify the somatosensory wind category of data information more accurately, the characters or punctuation marks that trigger these keywords need to be given special treatment during the training of the pre-training model, since they are key components for obtaining the somatosensory wind identification model.
Based on this, in one embodiment, as shown in fig. 10, the text sequence corresponding to the content map sample includes a plurality of text labels;
step 806, masking each text sequence, including:
step 1002, calculating contribution degree of each text label in each text sequence, wherein the contribution degree is the contribution degree of the text label to content classification label prediction.
The contribution degree is the contribution of a text token to content classification label prediction, that is, it measures how much each text token contributes to correctly predicting the content classification label of the text sequence.
Specifically, the server calculates the contribution of each text marker in each text sequence, and specifically calculates based on the following formula (1):
S(w_i) = P(y_t | s) - P(y_t | s'_{i-1} w_i);    (1)

where S(w_i) represents the contribution degree of the text token w_i to content classification label prediction, y_t represents the content classification label of the text sequence, s represents the text sequence, w_i represents the i-th text token in the text sequence, P(y_t | s) is the probability of predicting the content classification label y_t from the whole text sequence s, and s'_{i-1} represents the text sequence composed of w_1, w_2, ..., w_{i-1}.
Step 1004, determining key text labels in each text sequence according to the contribution degree of each text label in each text sequence, and determining the key text labels in each text sequence as replaced text labels in the text sequence.
The key text labels may include one text label or a plurality of text labels, and the key text labels are text labels with higher contribution in the text sequence. Based on this, the server may rank the contribution of each text label in each text sequence from high to low, and determine the text label with the higher contribution in the text sequence as the key text label in each text sequence, or determine the key text label in each text sequence by a key text label model, which is not limited herein. Thus, the server determines the key text labels in each text sequence as the replaced text labels in the text sequence, i.e. the text labels subjected to mask replacement are the key text labels.
For example, the text sequence includes a text label 1, a text label 2, a text label 3 and a text label 4, and the contribution degrees corresponding to the text label 1, the text label 2, the text label 3 and the text label 4 are ranked from high to low, specifically: the contribution of text mark 1, the contribution of text mark 4, the contribution of text mark 2 and the contribution of text mark 3 may be determined to be the highest, i.e. text mark 1 may be regarded as a key text mark based on demand, and then text mark 1 may be replaced with a mask mark when masking.
Taking the text sequence [motorcycle][drags][girl][tens][of][meters][!][girl][cries][miserably] as an example, if the random masking strategy of BERT is used, the masked text sequence may be: [motorcycle][drags][girl][mask][of][meters][!][girl][cries][miserably]. Based on the method for determining key text tokens provided in this embodiment, the masked text sequence is instead: [motorcycle][drags][girl][tens][of][meters][!][girl][cries][mask].
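The contribution computation of formula (1) and the resulting key-token selection can be sketched as follows; label_prob stands for any trained classifier that returns P(label | token sequence) and is an assumption here, not part of the scheme.

```python
# Sketch: formula (1) plus key-token selection for masking.
def contribution_scores(tokens, label, label_prob):
    """S(w_i) = P(y_t | s) - P(y_t | s'_{i-1} w_i) for every token w_i in the sequence."""
    full = label_prob(tokens, label)                       # P(y_t | s)
    return [full - label_prob(tokens[: i + 1], label)      # prefix s'_{i-1} followed by w_i
            for i in range(len(tokens))]

def key_token_positions(scores, top_k=1):
    """The highest-contribution tokens are the ones replaced by [mask]."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
```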
For ease of understanding, the determination of key text tokens by a key text token model is described below. As shown in fig. 11, a set of text sequence samples 1102 with a smaller data amount is first obtained, and the key text token 1104 corresponding to each text sequence sample 1102 is obtained through the foregoing formula (1). A set of text sequence samples with a larger data amount is then obtained from the text sequence database 1106, and these samples together with the key text tokens 1104 are used as inputs of the key text token model 1108, that is, the key text token model 1108 outputs the key text token 1110 of each text sequence sample.
In this embodiment, by calculating the contribution degree of each text token in each text sequence, the key text tokens that can affect the content classification label prediction result are determined from the text sequence, and these key text tokens are the ones replaced during mask processing. The text features therefore pay attention to the context information corresponding to the key text tokens, that is, the relevance learned between pieces of text information has a greater influence on the content classification label prediction result, which further improves the accuracy of the text features. Secondly, by replacing the random mask strategy of BERT, the pre-training model can learn more information that is useful for the somatosensory wind category during training, so the accuracy of the somatosensory wind identification model obtained in subsequent training is improved.
In one embodiment, as shown in fig. 12, step 308, training the initial pre-training model based on the sample features of each content graph sample to obtain a target pre-training model includes:
step 1202, constructing multi-mode features corresponding to each content graph sample based on the sample features of each content graph sample.
The multi-modal feature is a feature obtained by carrying out multi-modal feature interaction on the image feature and the text feature, and the multi-modal feature interaction refers to interaction between the image feature and the text feature of the content image sample. Specifically, the server performs cross-attention (cross-attention) feature extraction on a first image feature, a second image feature, a first text feature and a second text feature in sample features of each content graph sample to construct multi-modal features corresponding to each content graph sample.
To facilitate understanding, as shown in fig. 13, by a similar method as described in the previous embodiment, a first image feature 1302 corresponding to a content image sample and a second image feature 1304 corresponding to a processed content image sample are obtained by a visual transformation (ViT) model, a first text feature 1306 corresponding to a content classification tag and a second text feature 1308 corresponding to text information are obtained by a bi-directional encoder (BERT) model of a transformer, and then cross-attention feature extraction is performed on the first image feature 1302, the second image feature 1304, the first text feature 1306 and the second text feature 1308 of each content image sample to obtain a multi-modal feature 1310.
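The cross-attention interaction can be sketched with a single multi-head attention layer in which the image features act as queries and the text features act as keys and values; the layer count, dimensions, and pooling are assumptions.

```python
# Sketch: cross-attention between image features and text features -> multi-modal feature.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, image_feats, text_feats):
        # queries from the image side, keys/values from the text side
        fused, _ = self.cross_attn(query=image_feats, key=text_feats, value=text_feats)
        return fused.mean(dim=1)                            # pooled multi-modal feature

image_feats = torch.randn(1, 2, 768)    # stand-in: first + second image features
text_feats = torch.randn(1, 3, 768)     # stand-in: label + title + publisher-name features
multimodal_feature = CrossModalFusion()(image_feats, text_feats)     # shape [1, 768]
```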
And step 1204, training the initial pre-training model based on the multi-modal characteristics of each content graph sample to obtain a target pre-training model.
Specifically, the server trains the initial pre-training model based on the multi-modal features of each content graph sample obtained in step 1202, to obtain the target pre-training model.
In this embodiment, cross-attention feature extraction is performed on image features and text features in different dimensions in sample features of each content graph sample, so that the obtained multi-modal feature is fused with each dimension feature on the basis of describing the multi-dimension feature, and therefore correlation between each image feature and text feature can be described, and further accuracy of a target pre-training model obtained through training is improved.
In one embodiment, as shown in fig. 14, step 1204 trains the initial pre-training model based on the multi-modal characteristics of each content graph sample to obtain a target pre-training model, including:
step 1402, training the initial pre-training model based on at least one of the fusion features and the sample features of each content graph sample and the multi-modal features to obtain a target pre-training model.
The fusion features comprise a first fusion feature and a second fusion feature, wherein the first fusion feature is a feature constructed based on a first image feature and a second text feature of a content graph sample; the second fusion feature is a feature constructed based on the second image feature and the second text feature of the content map sample. It should be appreciated that the first fused feature is a feature constructed based on at least one text sub-feature of the first image feature and the second text feature of the content map sample and the second fused feature is a feature constructed based on at least one text sub-feature of the second image feature and the second text feature of the content map sample.
For ease of understanding, take as an example the case where the second text feature includes a text sub-feature corresponding to the title of the content graph sample and a text sub-feature corresponding to the publisher name of the content graph sample. As shown in fig. 15, the first fusion feature 1506 is constructed based on the first image feature 1502 and the text sub-feature 1504 corresponding to the title of the content graph sample, and the second fusion feature 1512 is constructed based on the second image feature 1508 and the text sub-feature 1510 corresponding to the publisher name of the content graph sample. It should be understood that, in practical applications, other combinations are also possible, such as constructing the first fusion feature based on the text sub-feature corresponding to the publisher name and constructing the second fusion feature based on the text sub-feature corresponding to the title; the specific fusion objects and manner of fusion are not limited herein.
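One possible way to construct the two fusion features is concatenation followed by a linear projection, as sketched below; the text above does not fix the exact fusion operator, so this choice is an assumption.

```python
# Sketch: first and second fusion features from image features and text sub-features.
import torch
import torch.nn as nn

project = nn.Linear(768 * 2, 768)

def fuse(image_feature, text_sub_feature):
    """Concatenate one image feature with one text sub-feature and project back to 768-d."""
    return project(torch.cat([image_feature, text_sub_feature], dim=-1))

first_image_feature = torch.randn(1, 768)      # stand-in: feature of the content map sample
second_image_feature = torch.randn(1, 768)     # stand-in: feature of the processed sample
title_feature = torch.randn(1, 768)            # stand-in: title text sub-feature
publisher_feature = torch.randn(1, 768)        # stand-in: publisher-name text sub-feature

first_fusion = fuse(first_image_feature, title_feature)
second_fusion = fuse(second_image_feature, publisher_feature)
```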
Specifically, the server can train the initial pre-training model based on at least one of fusion characteristics and sample characteristics of each content graph sample and multi-mode characteristics according to the requirements of practical application, so as to obtain the target pre-training model. That is, the server may train the initial pre-training model based on the fusion features and the multi-modal features of each content graph sample to obtain the target pre-training model. Or training the initial pre-training model based on the sample characteristics and the multi-modal characteristics to obtain the target pre-training model. Or training the initial pre-training model based on the fusion characteristics and the sample characteristics of each content graph sample to obtain the target pre-training model.
In this embodiment, in the process of training the initial pre-training model, based on consideration of the fusion feature, at least one of the fusion feature and the sample feature is further introduced, so that the pre-training model learns more feature information in the training process, and the accuracy of obtaining the somatosensory wind recognition model in the subsequent training is further improved.
In the following, how the initial pre-training model is trained based on the fusion features and the sample features of each content graph sample to obtain the target pre-training model is described in detail. It should be understood that the implementations that consider only the fusion features or only the sample features are similar to the following steps, and thus are not described in detail.
Based on this, in one embodiment, as shown in fig. 16, step 1402, training the initial pre-training model based on at least one of the fusion feature and the sample feature of each content graph sample and the multi-modal feature, may specifically include the following processing steps in the training process of obtaining the target pre-training model:
step 1602, based on the multi-modal features, obtains a predicted content classification label corresponding to each content graph sample, and calculates cross entropy loss information corresponding to each content graph sample by using each predicted content classification label and each content classification label.
The cross entropy loss information is used for describing errors between the predicted content classification labels and the content classification labels, and specifically, the cross entropy errors between the predicted content classification labels and the content classification labels are calculated through a cross entropy loss function.
Specifically, during training, the initial pre-training model can obtain, based on each multi-modal feature, the predicted content classification label corresponding to each content graph sample, where the predicted content classification label is used to describe the predicted category of the content information included in the content graph sample, and each content graph sample may correspond to one or more predicted content classification labels, which is not limited herein. Based on this, the server calculates the cross entropy error between the predicted content classification label and the content classification label corresponding to each content map sample using the cross entropy loss function, and takes the cross entropy error as the cross entropy loss information corresponding to the content map sample. For example, the cross-entropy error may be calculated using cross-entropy loss as the loss function.
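A minimal sketch of this step is given below for the single-label case; the label vocabulary size and the linear classification head are assumptions, and a multi-label variant would use a binary cross entropy loss instead.

```python
# Sketch: cross entropy loss between predicted and true content classification labels.
import torch
import torch.nn as nn

num_labels = 1000                                   # assumed size of the content label vocabulary
classifier = nn.Linear(768, num_labels)
loss_fn = nn.CrossEntropyLoss()

multimodal_features = torch.randn(4, 768)           # stand-in: multi-modal features of 4 samples
content_label_ids = torch.tensor([3, 17, 256, 3])   # stand-in: ground-truth label indices

logits = classifier(multimodal_features)            # predicted content classification labels
cross_entropy_loss = loss_fn(logits, content_label_ids)
```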
Step 1604, calculating and determining similarity loss information corresponding to each content graph sample based on at least one of the fusion feature and the sample feature of each content graph sample.
Wherein the similarity loss information is used to describe the degree of similarity between the plurality of features. Specifically, the server calculates the similarity between the first fusion feature and the second fusion feature in the fusion features through a similarity algorithm, and calculates the similarity between the sample features in each dimension in the sample features, so as to obtain similarity loss information corresponding to each content graph sample. The similarity algorithm may be a euclidean distance similarity algorithm, a cosine similarity algorithm, or the like, which is not limited herein.
In step 1606, model parameters of the initial pre-trained model are updated based on the cross entropy loss information and the similarity loss information.
Specifically, the server updates model parameters of the initial pre-training model through the cross entropy loss information and the similarity loss information obtained through calculation in the previous steps. Therefore, after repeated iterative updating, when the loss function of the initial pre-training model reaches convergence, a target pre-training model is generated based on the model parameters of the initial pre-training model updated last time.
In this embodiment, the cross entropy loss information describes the error between the predicted content classification label and the content classification label, and the similarity loss information describes the degree of similarity between the plurality of features, which improves the accuracy and the richness of the loss information of the pre-training model. Therefore, both the error between the predicted label and the real label and the similarity between features are considered when the model parameters are updated, which makes the model training process more reliable, that is, the obtained target pre-training model more accurate.
In one embodiment, as shown in fig. 17, step 1604, calculating and determining similarity loss information corresponding to each content graph sample based on at least one of the fusion feature and the sample feature of each content graph sample includes:
step 1702, a first similarity between a first image feature and a second image feature of each content graph sample is calculated.
Wherein the first similarity is used to describe the similarity between image features. Specifically, the server calculates the similarity between the first image feature and the second image feature through a similarity algorithm, so as to obtain the first similarity of each content image sample. The similarity algorithm may be a euclidean distance similarity algorithm, a cosine similarity algorithm, or the like, which is not limited herein.
Step 1704, calculating a second similarity between each text sub-feature in the second text feature of each content graph sample.
The second similarity is used to describe the similarity between the text sub-features. Specifically, the server calculates the similarity between the text sub-features in the second text feature through a similarity algorithm, so as to obtain the second similarity of each content graph sample. For example, when the second text feature includes the text sub-feature corresponding to the title of the content graph sample and the text sub-feature corresponding to the publisher name of the content graph sample, the second similarity describes the similarity between these two text sub-features.
Step 1706, a third similarity between the first fused feature and the second fused feature of each content graph sample is calculated.
Wherein the third similarity is used to describe the similarity between the fused features. Specifically, the server calculates the similarity between the first fusion feature and the second fusion feature through a similarity algorithm, so as to obtain a third similarity of each content graph sample. The similarity algorithm may be a euclidean distance similarity algorithm, a cosine similarity algorithm, or the like, which is not limited herein.
It should be appreciated that there is no timing constraint between steps 1702, 1704, and 1706.
Step 1708, obtaining similarity loss information corresponding to each content graph sample based on each first similarity, each second similarity, and each third similarity.
The similarity loss information is specifically used for describing the similarity between the image features, the similarity between the text sub-features, and the similarity between the fusion features. Specifically, the server obtains the similarity loss information corresponding to each content graph sample based on the first similarity, the second similarity, and the third similarity corresponding to each content graph sample obtained in the previous steps.
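Assuming cosine similarity as the similarity algorithm and "1 - similarity" as the per-term loss, the similarity loss information could be computed as sketched below; both choices are assumptions, since the text above leaves the algorithm open.

```python
# Sketch: similarity loss from the first, second, and third similarities.
import torch
import torch.nn.functional as F

def one_minus_cosine(a, b):
    return (1 - F.cosine_similarity(a, b, dim=-1)).mean()

first_image, second_image = torch.randn(4, 768), torch.randn(4, 768)    # stand-in features
title_feat, publisher_feat = torch.randn(4, 768), torch.randn(4, 768)
first_fusion, second_fusion = torch.randn(4, 768), torch.randn(4, 768)

similarity_loss = (one_minus_cosine(first_image, second_image)          # first similarity
                   + one_minus_cosine(title_feat, publisher_feat)       # second similarity
                   + one_minus_cosine(first_fusion, second_fusion))     # third similarity
```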
In this embodiment, the similarity loss information is obtained by calculating the similarity between the image features, the similarity between the text sub-features and the similarity between the fusion features, so as to improve the accuracy and the richness of the similarity loss information, that is, further improve the accuracy and the richness of the loss information of the pre-training model.
In one embodiment, as shown in fig. 18, step 1402, training the initial pre-training model based on at least one of the fusion feature and the sample feature of each content graph sample and the multi-modal feature, and obtaining the training process of the target pre-training model may specifically include the following processing steps:
And step 1802, evaluating the image-text matching degree based on the first image feature and the second text feature of each content image sample, so as to obtain the image-text matching degree corresponding to each content image sample.
The image-text matching degree is used for describing the matching degree between the first image feature and the second text feature, namely whether the second text feature of the content image sample can accurately describe the first image feature of the content image sample or not can be described according to the image-text matching degree.
Specifically, in the training process, the server performs self-attention feature extraction on the first image features of each content graph sample through the initial pre-training model to obtain image self-attention features of each content graph sample, and performs self-attention feature extraction on the second text features of each content graph sample through the initial pre-training model to obtain text self-attention features of each content graph sample. Wherein, the image self-attention feature refers to the image feature extracted by self-attention during training. Text self-attention features refer to text features extracted by self-attention during training. Based on the matching degree evaluation, the server evaluates the matching degree of the self-attention characteristics of each image and the self-attention characteristics of each text through an initial pre-training model, so that the image-text matching degree corresponding to each content image sample is obtained.
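The evaluation of the image-text matching degree can be sketched as a small binary head over the paired self-attention features, as below; the head architecture and the matched/unmatched probability formulation are assumptions.

```python
# Sketch: image-text matching degree from image and text self-attention features.
import torch
import torch.nn as nn

itm_head = nn.Sequential(nn.Linear(768 * 2, 256), nn.ReLU(), nn.Linear(256, 2))

def matching_degree(image_self_attention_feature, text_self_attention_feature):
    """Probability that the text feature accurately describes the image feature."""
    pair = torch.cat([image_self_attention_feature, text_self_attention_feature], dim=-1)
    return itm_head(pair).softmax(dim=-1)[..., 1]

degree = matching_degree(torch.randn(4, 768), torch.randn(4, 768))   # stand-in features
```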
It should be understood that in practical application, when the image-text matching degree corresponding to the content image sample is higher, the interaction between the image features and the text features can be enhanced in the training process, and when the image-text matching degree corresponding to the content image sample is weaker, the interaction between the image features and the text features can be reduced in the training process.
Because the model parameters of the initial pre-training model need to be updated in the process of training the initial pre-training model to obtain the target pre-training model, after obtaining the image-text matching degree corresponding to each content image sample in step 1802, the process in step 1606 of updating the model parameters of the initial pre-training model based on each cross entropy loss information and each similarity loss information may specifically include the following processing:
At step 1804, model parameters of the initial pre-training model are updated based on each cross entropy loss information, each similarity loss information, and each image-text matching degree.
Specifically, the server updates model parameters of the initial pre-training model through the cross entropy loss information, the similarity loss information and the image-text matching degree obtained through calculation in the previous steps. Therefore, after repeated iterative updating, when the loss function of the initial pre-training model reaches convergence, a target pre-training model is generated based on the model parameters of the initial pre-training model updated last time.
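Put together, one parameter update could look like the sketch below, where the image-text matching degree has already been turned into a loss term (for example against matched/unmatched labels); the equal weighting of the three terms is an assumption.

```python
# Sketch: one optimisation step combining cross entropy loss, similarity loss and the
# image-text matching signal.
import torch

def update_step(optimizer, cross_entropy_loss, similarity_loss, matching_loss,
                weights=(1.0, 1.0, 1.0)):
    total = (weights[0] * cross_entropy_loss
             + weights[1] * similarity_loss
             + weights[2] * matching_loss)
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```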
In this embodiment, the matching degree between the first image feature and the second text feature is described through the image-text matching degree, that is, whether the second text feature of the content image sample can accurately describe the first image feature of the content image sample can be described according to the image-text matching degree, so that when the model parameters are updated, the matching degree between the image feature and the text feature can be further considered on the basis of considering the errors between the prediction content classification labels and the similarity degree between the multiple features, and the model training process is more reliable, that is, the obtained target pre-training model is more accurate.
To describe the method of feature processing in the foregoing training process in more detail, as shown in fig. 19, first, image feature extraction is performed on each content image sample 1901 and the processed content image sample 1902, respectively, to obtain a first image feature 1903 and a second image feature 1904 of each content image sample. Similarly, text feature extraction is performed on the content classification labels 1905 and the text information 1906 corresponding to each content graph sample, so as to obtain a first text feature 1907 and a second text feature including a first text sub-feature 1908 and a second text sub-feature 1909 of each content graph sample.
Based on this, a multimodal feature 1910 corresponding to each content map sample is constructed based on the first image feature 1903, the second image feature 1904, the first text feature 1907, and the second text feature including the first text sub-feature 1908 and the second text sub-feature 1909. Further, a first fusion feature 1912 is constructed based on the first image feature 1903 and a second text sub-feature 1909 of the content map sample, and a second fusion feature 1912 is constructed based on the second image feature 1904 and a first text sub-feature 1908 of the second text feature of the content map sample. And then obtaining a target pre-training model through the model training method described in the previous embodiment.
Further, after the training to obtain the target pre-training model, the training to obtain the somatosensory wind identification model based on the target pre-training model will be described in detail below. In one embodiment, as shown in fig. 20, a training method of a somatosensory wind recognition model is provided, and the method is described by taking an example that the method is applied to a server in fig. 1, it can be understood that the method can also be applied to a terminal, and can also be applied to a system comprising the terminal and the server, and is implemented through interaction between the terminal and the server. In this embodiment, the method includes the steps of:
Step 2002, obtaining a somatosensory wind training sample and a somatosensory wind label corresponding to the somatosensory wind training sample.
The somatosensory wind label is used for describing the somatosensory wind category of a somatosensory wind training sample, the somatosensory wind category is used for describing the category corresponding to the style and tonality of data information, and the data information may be text, pictures, videos, or music. Specifically, style and tonality refer to the overall style of the data information; data information of the same style and tonality has a certain commonality and can resonate with a class of users, for example: eye-pleasing, healing, middle-aged and elderly, campus, and trendy. Thus, style and tonality are an overall feeling given to the user, which may be auditory, e.g., soothing or cheerful, or visual, e.g., joyful or sad.
Specifically, the server acquires each somatosensory wind training sample and a somatosensory wind label corresponding to each somatosensory wind training sample, wherein the somatosensory wind label is obtained based on manual labeling.
It can be understood that when the somatosensory wind labels corresponding to the somatosensory wind training samples are labeled manually, judging the category corresponding to the style and tonality of data information is highly subjective. For example, the style and tonality of data information may be determined by the nature of the content (such as serious or vulgar), by the audience group (such as young groups or middle-aged groups), or, for data information in text form, by the writing technique. Therefore, the somatosensory wind label needs to take into account the classification targets of various labels, such as the specific content type, intention, and emotion, to obtain the final result.
Based on this, in order to improve the accuracy of the somatosensory wind labels in practical application, after the initial somatosensory wind label corresponding to each somatosensory wind training sample is obtained based on a first round of manual labeling, the initial somatosensory wind label still needs to be judged again by different labeling personnel, because the judgment of the somatosensory wind label is highly subjective and the initial somatosensory wind label may be incomplete or inaccurate. As shown in fig. 21, labeling person 2102 judges the initial somatosensory wind label to obtain a first judgment result 2104, and labeling person 2106 judges the initial somatosensory wind label to obtain a second judgment result 2108. If the first judgment result 2104 and the second judgment result 2108 are consistent, the initial somatosensory wind label corresponding to the somatosensory wind training sample is taken as the somatosensory wind label corresponding to the somatosensory wind training sample. Otherwise, if the first judgment result 2104 and the second judgment result 2108 are inconsistent, the initial somatosensory wind label corresponding to the somatosensory wind training sample is adjusted, and a similar judgment step is performed on the adjusted initial somatosensory wind label until the judgment results are consistent.
Further, in this embodiment, common somatosensory wind labels and the descriptions corresponding to them are provided. As shown in fig. 22, the somatosensory wind labels include emotion exaggeration, rustic style, serious and formal, highbrow, lowbrow, light entertainment, deep professionalism, social positive energy, plain and easy to understand, healing, and the like. For example, the serious and formal label can describe data information whose wording is serious, such as serious news reports, often on international, social, and people's livelihood topics. Similarly, the social positive energy label can describe data information that is commendatory in form, positive in values, and inspiring, such as positive-energy news. The details of fig. 22 are not fully described herein, and in practical applications the somatosensory wind labels and the specific description corresponding to each label are not limited to the foregoing examples.
Step 2004, training the initial somatosensory wind identification model based on each somatosensory wind training sample to obtain a trained somatosensory wind identification model, where the somatosensory wind identification model identifies the somatosensory wind category of data information.
The initial somatosensory wind recognition model is the target pre-training model obtained through the foregoing embodiment, and specific training modes are not described herein.
Specifically, based on each somatosensory wind training sample, the server obtains the predicted somatosensory wind label of each somatosensory wind training sample through the initial somatosensory wind identification model, and updates the model parameters of the initial somatosensory wind identification model based on the predicted somatosensory wind label and the somatosensory wind label of each somatosensory wind training sample. After repeated iterative updates, when the loss function of the initial somatosensory wind identification model converges, the somatosensory wind identification model is generated based on the most recently updated model parameters of the initial somatosensory wind identification model. The somatosensory wind identification model is used for identifying the somatosensory wind category of data information.
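A minimal fine-tuning sketch is given below, in which the target pre-training model is represented by a stand-in backbone and a linear somatosensory wind classification head is added on top; the head size, learning rate, and loop details are assumptions.

```python
# Sketch: fine-tuning the target pre-training model into the somatosensory wind
# identification model.
import torch
import torch.nn as nn

class StyleRecognizer(nn.Module):
    def __init__(self, backbone, num_style_labels=10):
        super().__init__()
        self.backbone = backbone                      # target pre-training model
        self.head = nn.Linear(768, num_style_labels)  # somatosensory wind labels

    def forward(self, features):
        return self.head(self.backbone(features))

backbone = nn.Linear(768, 768)                        # stand-in for the target pre-training model
model = StyleRecognizer(backbone)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

samples = torch.randn(4, 768)                         # stand-in somatosensory wind training samples
style_labels = torch.randint(0, 10, (4,))             # stand-in somatosensory wind labels
loss = loss_fn(model(samples), style_labels)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```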
According to the training method of the somatosensory wind identification model, the initial somatosensory wind identification model is the target pre-training model, and the target pre-training model can learn the characteristic information contained in more content graph samples in the training process, so that the accuracy of the target pre-training model obtained through training is improved, and the accuracy of the somatosensory wind identification model is improved.
Further, after the somatosensory wind identification model is obtained by training, based on fig. 2, the somatosensory wind identification service invokes the trained somatosensory wind identification model to identify the somatosensory wind category of data information. In one embodiment, as shown in fig. 23, a method for identifying the somatosensory wind is provided, and the method is described by taking its application to the server in fig. 1 as an example; it can be understood that the method can also be applied to a terminal, or to a system comprising the terminal and the server, implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
Step 2302, obtaining the data information to be identified.
In a specific application, the server may obtain somatosensory wind identification information from the terminal, and obtain data information to be identified, which needs to be identified in somatosensory wind category, from the somatosensory wind identification information. Or, obtaining the data information to be identified from the database. The present invention is not particularly limited herein.
And 2304, obtaining a predicted somatosensory wind-drawing label corresponding to the data information to be identified through a somatosensory wind-drawing identification model based on the data information to be identified, wherein the predicted somatosensory wind-drawing label is used for describing the somatosensory wind category of the data information to be identified.
The somatosensory wind recognition model is obtained through the foregoing embodiment, and specific training methods are not described herein.
Specifically, the server takes the data information to be identified as input of a body-sensing wind identification model obtained through training, and the body-sensing wind identification model can output a predicted body-sensing wind label corresponding to the data information to be identified, wherein the predicted body-sensing wind label is used for describing the body-sensing wind type of the data information to be identified. Specific definitions and examples of somatosensory wind categories are described in detail in the foregoing embodiments, and are not repeated here.
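Inference with the trained model can then be sketched as follows; the label vocabulary and the feature already extracted from the data information to be identified are placeholders.

```python
# Sketch: predicting the somatosensory wind label of data information to be identified.
import torch

STYLE_LABELS = ["emotion exaggeration", "light entertainment", "social positive energy",
                "healing", "serious and formal"]      # assumed label vocabulary

@torch.no_grad()
def predict_style(recognition_model, data_feature):
    """Return the predicted somatosensory wind label for one piece of data information."""
    logits = recognition_model(data_feature)
    return STYLE_LABELS[int(logits.argmax(dim=-1))]

# predict_style(trained_model, feature_of_data_info)  ->  one of STYLE_LABELS
```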
According to the method for identifying the somatosensory wind, the predicted somatosensory wind label corresponding to the data information to be identified is obtained through the somatosensory wind identification model. The somatosensory wind identification model is trained based on the target pre-training model, and the target pre-training model can learn the feature information contained in more content graph samples during training, which improves the accuracy of the target pre-training model obtained by training; the accuracy of the somatosensory wind identification model is improved accordingly, and the accuracy of identifying the somatosensory wind category of data information is further improved.
Detailed embodiments corresponding to the training method of the pre-training model and the somatosensory wind recognition method will be described in detail below, as shown in fig. 24, including:
step 2402, obtaining each sample data pair and a content classification label corresponding to the content graph sample in each sample data pair.
The sample data pair comprises a content graph sample and data description information corresponding to the content graph sample. Next, the content classification labels are used to describe the category of the content information included in the content graph samples, and the content classification labels corresponding to each content graph sample may be one or more. It should be appreciated that specific examples of sample data pairs and classification labels are similar to the previous embodiments and are not repeated here.
Step 2404, extracting image features of each content graph sample and the processed content graph sample, to obtain a first image feature and a second image feature of each content graph sample.
The data description information comprises a processed content graph sample after the content graph sample is processed, and the processing of the content graph sample is particularly the data enhancement processing of the content graph sample. How the server obtains the image features is similar to the foregoing embodiment, and will not be described here again.
Step 2406, extracting text features from the content classification labels corresponding to the content graph samples to obtain first text features.
The data description information comprises text information corresponding to the content graph sample. And the server extracts text features of the content classification labels corresponding to the content graph samples to obtain first text features corresponding to the content classification labels corresponding to the content graph samples.
Step 2408, performing text division on the text information corresponding to each content graph sample to obtain a text sequence corresponding to each content graph sample.
The text sequence corresponding to each content graph sample comprises a plurality of text labels (Token). Specifically, after text division is performed on text information corresponding to the content graph sample, a plurality of text labels can be obtained, and a text sequence corresponding to the content graph sample is formed based on the text labels. The manner how the server obtains the text sequence corresponding to each content graph sample is similar to the foregoing embodiment, and will not be repeated here.
In step 2410, masking each text sequence, replacing part of the text labels in the masked text sequence with masking labels, and generating each second text feature based on each masked text sequence.
Wherein the masking process is to mask (mask) part of the text labels in the text sequence, i.e. replace part of the text labels with mask labels. How the server generates each second text feature is similar to the foregoing embodiment, and will not be repeated here.
Step 2412, constructing multi-modal features corresponding to each content graph sample based on the sample features of each content graph sample.
The sample features of each content graph sample comprise a first image feature, a second image feature, a first text feature and a second text feature. How the server obtains the multi-modal feature is similar to the foregoing embodiment, and will not be described here again.
Step 2414, training the initial pre-training model based on at least one of the fusion features and the sample features of each content graph sample and the multi-modal features to obtain a target pre-training model.
The server can train the initial pre-training model based on the fusion characteristics and the multi-modal characteristics of each content graph sample to obtain the target pre-training model. Or training the initial pre-training model based on the sample characteristics and the multi-modal characteristics to obtain the target pre-training model. Or training the initial pre-training model based on the fusion characteristics and the sample characteristics of each content graph sample to obtain the target pre-training model. How the server trains the initial pre-training model to obtain the target pre-training model is similar to the previous embodiment, and will not be repeated here.
Step 2416, obtaining each somatosensory wind training sample and a somatosensory wind label corresponding to each somatosensory wind training sample.
The somatosensory wind label is used for describing a somatosensory wind type of a somatosensory wind training sample, the somatosensory wind type is used for describing a corresponding type of style and tone of data information, and the data information can be text, pictures, videos or music. The manner of obtaining the somatosensory wind training samples and the somatosensory wind labels corresponding to the somatosensory wind training samples by the server is similar to step 2002, and will not be repeated here.
Step 2418, training the initial somatosensory wind identification model based on the somatosensory wind training samples to obtain a trained somatosensory wind identification model.
The somatosensory wind identification model identifies the somatosensory wind category of data information, and the initial somatosensory wind identification model is the target pre-training model obtained through the training described above. The manner in which the server trains the somatosensory wind identification model is similar to step 2004 and is not repeated here.
Step 2420, obtain the data information to be identified.
The server obtains the data information to be identified in a similar manner to step 2302, which will not be described in detail herein.
Step 2422, based on the data information to be identified, obtaining a predicted somatosensory wind label corresponding to the data information to be identified through a somatosensory wind identification model, where the predicted somatosensory wind label is used for describing the somatosensory wind category of the data information to be identified.
The somatosensory wind identification model is obtained through the training method described above. The manner in which the server obtains the predicted somatosensory wind label corresponding to the data information to be identified is similar to step 2304 and is not repeated here.
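Purely for illustration, inference with the trained somatosensory wind identification model might look like the sketch below; the label set, the feature input and the assumption that the model outputs one logit per somatosensory wind category are not fixed by this embodiment.

    import torch

    SOMATOSENSORY_WIND_LABELS = ["positive energy", "easy entertainment", "other"]  # assumed label set

    @torch.no_grad()
    def predict_somatosensory_wind_label(model, features):
        """Return the predicted somatosensory wind label for the features of one
        piece of data information to be identified (assumes the model outputs
        one logit per somatosensory wind category)."""
        logits = model(features)
        index = int(torch.argmax(logits, dim=-1))
        return SOMATOSENSORY_WIND_LABELS[index]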
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with at least some of the other steps, sub-steps or stages.
Based on the same inventive concept, an embodiment of the present application further provides a training device of a pre-training model for implementing the above training method of the pre-training model. The implementation of the solution provided by the device is similar to the implementation described in the above method; therefore, for the specific limitations in the one or more embodiments of the training device of the pre-training model provided below, reference may be made to the limitations of the training method of the pre-training model above, and details are not repeated here.
In one embodiment, as shown in fig. 25, there is provided a training apparatus of a pre-training model, comprising: acquisition module 2502, processing module 2504, and first training module 2506, wherein:
an acquisition module 2502 for acquiring each sample data pair including a content graph sample and data description information corresponding to the content graph sample; obtaining content classification labels corresponding to the content graph samples in each sample data pair;
the processing module 2504 is configured to perform feature extraction on each sample data pair and a content classification label corresponding to each content graph sample, so as to obtain sample features of each content graph sample, where the sample features include image features and text features;
The first training module 2506 is configured to train the initial pre-training model based on sample features of each content graph sample to obtain a target pre-training model, where the target pre-training model is used for training to obtain a somatosensory wind identification model, and the somatosensory wind identification model identifies a somatosensory wind category of data information.
In one embodiment, the data description information includes: a processed content graph sample after processing the content graph sample, and text information corresponding to the content graph sample;
the processing module 2504 is specifically configured to perform image feature extraction on each content graph sample and each processed content graph sample, to obtain a first image feature and a second image feature of each content graph sample; and to perform text feature extraction on the content classification label and the text information corresponding to each content graph sample, respectively, to obtain a first text feature and a second text feature of each content graph sample; the image features include the first image feature and the second image feature, and the text features include the first text feature and the second text feature.
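The following sketch illustrates how the processing module could obtain the four sample features; image_encoder and text_encoder stand in for the image and text feature extractors, whose concrete implementations are assumptions rather than limitations of this embodiment.

    def extract_sample_features(image_encoder, text_encoder,
                                content_graph, processed_content_graph,
                                label_token_ids, masked_text_token_ids):
        """Produce the first/second image features and first/second text features
        of one content graph sample (illustrative sketch only)."""
        first_image_feature = image_encoder(content_graph)
        second_image_feature = image_encoder(processed_content_graph)
        first_text_feature = text_encoder(label_token_ids)          # from the content classification label
        second_text_feature = text_encoder(masked_text_token_ids)   # from the masked text sequence
        return (first_image_feature, second_image_feature,
                first_text_feature, second_text_feature)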
In one embodiment, the processing module 2504 is specifically configured to perform text feature extraction on the content classification label corresponding to each content graph sample, to obtain each first text feature; perform text division on the text information corresponding to each content graph sample, to obtain a text sequence corresponding to each content graph sample; and mask each text sequence, wherein some of the text tokens in the masked text sequence are replaced with mask tokens, and generate each second text feature based on each masked text sequence.
In one embodiment, the text sequence corresponding to the content graph sample includes a plurality of text tokens;
the processing module 2504 is specifically configured to calculate a contribution degree of each text token in each text sequence, where the contribution degree is the contribution degree of the text token to the prediction of the content classification label; determine key text tokens in each text sequence according to the contribution degree of each text token in each text sequence; and determine the key text tokens in each text sequence as the text tokens to be replaced (that is, masked) in that text sequence.
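This embodiment does not fix how the contribution degree is computed. One simple leave-one-out scheme, shown below only as an assumed illustration, measures how much the predicted probability of the content classification label drops when a single text token is masked; the text tokens with the largest drops would then be taken as the key text tokens.

    import torch

    def contribution_degrees(classifier, token_ids, label_index, mask_id):
        """Leave-one-out contribution degree (assumed scheme): the drop in the
        predicted probability of the content classification label when each
        text token in the sequence is replaced with the mask token in turn."""
        with torch.no_grad():
            base_prob = torch.softmax(classifier(token_ids), dim=-1)[label_index]
            degrees = []
            for position in range(token_ids.shape[-1]):
                perturbed = token_ids.clone()
                perturbed[..., position] = mask_id
                prob = torch.softmax(classifier(perturbed), dim=-1)[label_index]
                degrees.append(float(base_prob - prob))
        return degrees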
In one embodiment, the first training module 2506 is specifically configured to construct a multi-modal feature corresponding to each content graph sample based on the sample features of each content graph sample; and training the initial pre-training model based on the multi-modal characteristics of each content graph sample to obtain a target pre-training model.
In one embodiment, the first training module 2506 is specifically configured to train the initial pre-training model based on at least one of the fusion feature and the sample feature of each content graph sample and the multi-modal feature to obtain a target pre-training model; the fusion features comprise a first fusion feature and a second fusion feature, wherein the first fusion feature is a feature constructed based on a first image feature and a second text feature of a content graph sample; the second fusion feature is a feature constructed based on the second image feature and the second text feature of the content map sample.
In one embodiment, the first training module 2506 is specifically configured to, during training: obtain a predicted content classification label corresponding to each content graph sample based on the multi-modal features, and calculate cross entropy loss information corresponding to each content graph sample from the predicted content classification label and the content classification label; calculate similarity loss information corresponding to each content graph sample based on at least one of the fusion features and the sample features of each content graph sample; and update the model parameters of the initial pre-training model based on the cross entropy loss information and the similarity loss information.
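As an illustrative sketch only (the loss weighting and optimizer are assumptions), one parameter update combining the cross entropy loss and the similarity loss could be written as follows.

    import torch.nn.functional as F

    def pretraining_step(optimizer, predicted_label_logits, content_label,
                         similarity_loss, similarity_weight=1.0):
        """Update the model parameters with the cross entropy loss on the predicted
        content classification label plus a weighted similarity loss (weight assumed)."""
        cross_entropy_loss = F.cross_entropy(predicted_label_logits, content_label)
        total_loss = cross_entropy_loss + similarity_weight * similarity_loss
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
        return float(total_loss)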
In one embodiment, the processing module 2504 is specifically configured to calculate a first similarity between the first image feature and the second image feature of each content graph sample; calculate a second similarity between the text sub-features in the second text feature of each content graph sample; calculate a third similarity between the first fusion feature and the second fusion feature of each content graph sample; and obtain the similarity loss information corresponding to each content graph sample based on each first similarity, each second similarity and each third similarity.
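The similarity measure itself is not prescribed by this embodiment; a cosine-similarity sketch of the three quantities, with the pairing of text sub-features assumed purely for illustration, is:

    import torch.nn.functional as F

    def pairwise_similarities(first_image_feature, second_image_feature,
                              first_fusion_feature, second_fusion_feature,
                              text_sub_features):
        # First similarity: between the first and second image features.
        first_similarity = F.cosine_similarity(first_image_feature, second_image_feature, dim=-1)
        # Second similarity: between text sub-features of the second text feature
        # (pairing the first two sub-features is an assumption for illustration).
        second_similarity = F.cosine_similarity(text_sub_features[0], text_sub_features[1], dim=-1)
        # Third similarity: between the first and second fusion features.
        third_similarity = F.cosine_similarity(first_fusion_feature, second_fusion_feature, dim=-1)
        return first_similarity, second_similarity, third_similarity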
In one embodiment, the first training module 2506 is specifically configured to, during training, perform image-text matching degree evaluation based on the first image feature and the second text feature of each content graph sample, to obtain an image-text matching degree corresponding to each content graph sample; and update the model parameters of the initial pre-training model based on the cross entropy loss information, the similarity loss information and the image-text matching degrees.
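The image-text matching degree evaluation is likewise not fixed here; a common choice, assumed below purely for illustration, is a small binary matching head over the concatenated first image feature and second text feature.

    import torch
    import torch.nn as nn

    class ImageTextMatchingHead(nn.Module):
        """Assumed binary head: outputs the probability that an image feature and a
        text feature belong to the same content graph sample."""
        def __init__(self, feature_dim):
            super().__init__()
            self.score = nn.Linear(2 * feature_dim, 1)

        def forward(self, first_image_feature, second_text_feature):
            joint = torch.cat([first_image_feature, second_text_feature], dim=-1)
            return torch.sigmoid(self.score(joint))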
In one embodiment, as shown in fig. 26, there is provided a training apparatus of a somatosensory wind recognition model, including: an acquisition module 2602 and a second training module 2604, wherein:
an acquisition module 2602, configured to acquire each somatosensory wind training sample and the somatosensory wind label corresponding to each somatosensory wind training sample;
the second training module 2604 is configured to train the initial somatosensory wind identification model based on each somatosensory wind training sample, and obtain a trained somatosensory wind identification model;
the method for obtaining the initial somatosensory wind identification model comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
acquiring content classification labels corresponding to content graph samples in each sample data pair;
extracting characteristics of each sample data pair and a content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics;
Training the initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, and taking the target pre-training model as an initial somatosensory wind identification model.
In one embodiment, the second training module 2604 includes:
an initial somatosensory wind identification model obtaining module, configured to obtain the initial somatosensory wind identification model; and
a somatosensory wind model training module, configured to train, based on each somatosensory wind training sample, the initial somatosensory wind identification model obtained by the initial somatosensory wind identification model obtaining module, to obtain a trained somatosensory wind identification model.
The initial somatosensory wind identification model obtaining module may take the target pre-training model obtained by the above training device of the pre-training model as the initial somatosensory wind identification model, or the training device of the pre-training model may itself serve as the initial somatosensory wind identification model obtaining module.
In one embodiment, as shown in fig. 27, there is provided a somatosensory wind recognition apparatus, including: an acquisition module 2702 and an identification module 2704, wherein:
An acquisition module 2702, configured to acquire data information to be identified;
the identification module 2704 is configured to obtain, based on the data information to be identified, a predicted somatosensory wind label corresponding to the data information to be identified through a somatosensory wind identification model, where the predicted somatosensory wind label is used for describing the somatosensory wind category of the data information to be identified;
the method for obtaining the somatosensory wind identification model comprises the following steps:
acquiring somatosensory wind drawing training samples and somatosensory wind drawing labels corresponding to the somatosensory wind drawing training samples;
training the initial somatosensory wind-painting recognition model based on each somatosensory wind-painting training sample to obtain a trained somatosensory wind-painting recognition model;
the method for obtaining the initial somatosensory wind identification model comprises the following steps:
acquiring each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample;
acquiring content classification labels corresponding to content graph samples in each sample data pair;
extracting characteristics of each sample data pair and a content classification label corresponding to each content graph sample to obtain sample characteristics of each content graph sample, wherein the sample characteristics comprise image characteristics and text characteristics;
training the initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, and taking the target pre-training model as an initial somatosensory wind identification model.
In an embodiment, the somatosensory wind recognition apparatus may further include the above training device of the somatosensory wind identification model, which trains and obtains the somatosensory wind identification model.
All or part of the modules in the above training device of the pre-training model, training device of the somatosensory wind identification model and somatosensory wind recognition apparatus may be implemented by software, hardware or a combination thereof. The above modules may be embedded, in the form of hardware, in or independent of a processor of the computer device, or may be stored, in the form of software, in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in fig. 28. The computer device includes a processor, a memory, an input/output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data such as the sample data pairs, the somatosensory wind training samples and the data information to be identified. The input/output interface of the computer device is used to exchange information between the processor and an external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a training method of a pre-training model.
It will be appreciated by those skilled in the art that the structure shown in fig. 28 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. Volatile memory may include random access memory (Random Access Memory, RAM), external cache memory, and the like. By way of illustration, and not limitation, RAM is available in a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, the combination should be considered to fall within the scope of this specification.
The above examples represent only a few embodiments of the present application, which are described in relative detail, but they are not to be construed as limiting the scope of the present application. It should be noted that a person of ordinary skill in the art may make various modifications and improvements without departing from the concept of the present application, and such modifications and improvements all fall within the protection scope of the present application. Accordingly, the protection scope of the present application shall be subject to the appended claims.

Claims (18)

1. A method of training a pre-training model, the method comprising:
obtaining each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample, the content graph sample is constructed based on a plurality of video file frames, and the data description information comprises: a processed content graph sample after processing the content graph sample and text information corresponding to the content graph sample;
Acquiring content classification labels corresponding to the content graph samples in each sample data pair;
respectively extracting image features of each content graph sample and each processed content graph sample by adopting a visual transformer model, to obtain a first image feature and a second image feature of each content graph sample;
extracting text features from the content classification label corresponding to each content graph sample by adopting a bidirectional encoder of a transformer, to obtain each first text feature;
text division is carried out on text information corresponding to each content graph sample, a text sequence corresponding to each content graph sample is obtained, and the text sequence corresponding to each content graph sample comprises a plurality of text marks;
calculating the contribution degree of each text mark in each text sequence, wherein the contribution degree is the contribution degree of the text mark to content classification label prediction;
determining key text marks in each text sequence according to the contribution degree of each text mark in each text sequence, determining the key text marks in each text sequence as the text marks to be replaced in the text sequence, replacing the text marks to be replaced in the text sequence with mask marks, and generating each second text feature based on each masked text sequence by adopting the bidirectional encoder of the transformer;
Training an initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, wherein the sample characteristics comprise the first image characteristics, the second image characteristics, the first text characteristics and the second text characteristics; the target pre-training model is used for training to obtain a somatosensory wind identification model, the somatosensory wind identification model predicts a somatosensory wind label of data information, the somatosensory wind label is used for describing the somatosensory wind category of the data information, the data information is at least one of text, picture, video or music, the somatosensory wind category is used for describing the corresponding category of style and tone of the data information, the style and the tone are integral feelings formed for a user, and the integral feelings are any one of auditory feelings or visual feelings.
2. The method of claim 1, wherein training the initial pre-training model based on the sample features of each of the content map samples to obtain the target pre-training model comprises:
based on the sample characteristics of each content graph sample, constructing multi-mode characteristics corresponding to each content graph sample;
And training the initial pre-training model based on the multi-modal characteristics of each content graph sample to obtain the target pre-training model.
3. The method of claim 2, wherein the training the initial pre-training model based on the multi-modal characteristics of each of the content map samples to obtain the target pre-training model comprises:
training the initial pre-training model based on at least one of fusion features and sample features of each content graph sample and the multi-modal features to obtain the target pre-training model;
the fusion features comprise a first fusion feature and a second fusion feature, wherein the first fusion feature is a feature constructed based on the first image feature and the second text feature of the content graph sample; the second fusion feature is a feature constructed based on the second image feature and the second text feature of the content graph sample.
4. The method of claim 3, wherein the training the initial pre-training model based on at least one of the fusion features and the sample features of each of the content graph samples, and the multi-modal features, to obtain the target pre-training model comprises:
During the training process:
based on the multi-modal characteristics, obtaining predicted content classification labels corresponding to the content graph samples, and calculating cross entropy loss information corresponding to the content graph samples through the predicted content classification labels and the content classification labels;
calculating and determining similarity loss information corresponding to each content graph sample based on at least one of fusion characteristics and sample characteristics of each content graph sample;
and updating model parameters of the initial pre-training model based on each piece of cross entropy loss information and each piece of similarity loss information.
5. The method of claim 4, wherein computing the similarity loss information for each of the content graph samples based on at least one of the fusion features and the sample features of each of the content graph samples comprises:
calculating a first similarity between the first image feature and the second image feature of each of the content graph samples;
calculating a second similarity between each text sub-feature in the second text features of each of the content graph samples;
calculating a third similarity between the first fusion feature and the second fusion feature of each content graph sample;
And obtaining similarity loss information corresponding to each content graph sample based on each first similarity, each second similarity and each third similarity.
6. The method of claim 4, wherein the training the initial pre-training model based on at least one of the fusion features and the sample features of each of the content graph samples, and the multi-modal features, to obtain the target pre-training model further comprises:
in the training process, evaluating the image-text matching degree based on the first image features and the second text features of each content graph sample to obtain the image-text matching degree corresponding to each content graph sample;
the updating the model parameters of the initial pre-training model based on the cross entropy loss information and the similarity loss information comprises the following steps:
and updating model parameters of the initial pre-training model based on the cross entropy loss information, the similarity loss information and the image-text matching degree.
7. A training method of a somatosensory wind recognition model is characterized by comprising the following steps:
acquiring somatosensory wind drawing training samples and somatosensory wind drawing labels corresponding to the somatosensory wind drawing training samples;
Training an initial somatosensory wind recognition model based on each somatosensory wind training sample to obtain a trained somatosensory wind recognition model, wherein the somatosensory wind recognition model predicts a somatosensory wind label of data information, the somatosensory wind label is used for describing the somatosensory wind category of the data information, the data information is at least one of text, picture, video or music, the somatosensory wind category is used for describing the corresponding category of style and tone of the data information, the style and the tone are an overall feeling formed for a user, and the overall feeling is any one of auditory feeling or visual feeling;
the obtaining mode of the initial somatosensory wind identification model comprises the following steps:
obtaining each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample, the content graph sample is constructed based on a plurality of video file frames, and the data description information comprises: a processed content graph sample after processing the content graph sample and text information corresponding to the content graph sample;
acquiring content classification labels corresponding to the content graph samples in each sample data pair;
Respectively extracting image features of each content graph sample and each processed content graph sample by adopting a visual transformer model, to obtain a first image feature and a second image feature of each content graph sample;
extracting text features from the content classification label corresponding to each content graph sample by adopting a bidirectional encoder of a transformer, to obtain each first text feature;
text division is carried out on text information corresponding to each content graph sample, a text sequence corresponding to each content graph sample is obtained, and the text sequence corresponding to each content graph sample comprises a plurality of text marks;
calculating the contribution degree of each text mark in each text sequence, wherein the contribution degree is the contribution degree of the text mark to content classification label prediction;
determining key text marks in each text sequence according to the contribution degree of each text mark in each text sequence, determining the key text marks in each text sequence as the text marks to be replaced in the text sequence, replacing the text marks to be replaced in the text sequence with mask marks, and generating each second text feature based on each masked text sequence by adopting the bidirectional encoder of the transformer;
Training an initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, wherein the sample characteristics comprise the first image characteristics, the second image characteristics, the first text characteristics and the second text characteristics; and taking the target pre-training model as the initial somatosensory wind identification model.
8. A method for identifying somatosensory wind, the method comprising:
acquiring data information to be identified;
based on the data information to be identified, a predicted somatosensory wind-drawing label corresponding to the data information to be identified is obtained through a somatosensory wind-drawing identification model, the predicted somatosensory wind-drawing label is used for describing a somatosensory wind-drawing type of the data information to be identified, the data information is at least one of texts, pictures, videos or music, the somatosensory wind-drawing type is used for describing a style and a tone corresponding type of the data information, the style and the tone are an overall feeling formed for a user, and the overall feeling is any one of auditory feeling or visual feeling;
the method for obtaining the somatosensory wind identification model comprises the following steps of:
acquiring somatosensory wind drawing training samples and somatosensory wind drawing labels corresponding to the somatosensory wind drawing training samples;
Training the initial somatosensory wind identification model based on each somatosensory wind training sample to obtain a trained somatosensory wind identification model;
the obtaining mode of the initial somatosensory wind identification model comprises the following steps:
obtaining each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample, the content graph sample is constructed based on a plurality of video file frames, and the data description information comprises: a processed content graph sample after processing the content graph sample and text information corresponding to the content graph sample;
acquiring content classification labels corresponding to the content graph samples in each sample data pair;
respectively extracting image features of each content graph sample and each processed content graph sample by adopting a visual transformer model, to obtain a first image feature and a second image feature of each content graph sample;
extracting text features from the content classification label corresponding to each content graph sample by adopting a bidirectional encoder of a transformer, to obtain each first text feature;
text division is carried out on text information corresponding to each content graph sample, a text sequence corresponding to each content graph sample is obtained, and the text sequence corresponding to each content graph sample comprises a plurality of text marks;
Calculating the contribution degree of each text mark in each text sequence, wherein the contribution degree is the contribution degree of the text mark to content classification label prediction;
determining key text marks in each text sequence according to the contribution degree of each text mark in each text sequence, determining the key text marks in each text sequence as the text marks to be replaced in the text sequence, replacing the text marks to be replaced in the text sequence with mask marks, and generating each second text feature based on each masked text sequence by adopting the bidirectional encoder of the transformer;
training an initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, wherein the sample characteristics comprise the first image characteristics, the second image characteristics, the first text characteristics and the second text characteristics; and taking the target pre-training model as the initial somatosensory wind identification model.
9. A training device for pre-training a model, the device comprising:
the system comprises an acquisition module, a storage module and a storage module, wherein the acquisition module is used for acquiring each sample data pair, the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample, the content graph sample is constructed based on a plurality of video file frames, and the data description information comprises: a processed content graph sample after processing the content graph sample and text information corresponding to the content graph sample; obtaining content classification labels corresponding to the content graph samples in each sample data pair;
The processing module is used for respectively extracting image features of each content graph sample and each processed content graph sample by adopting a visual transformer model, to obtain a first image feature and a second image feature of each content graph sample; extracting text features from the content classification label corresponding to each content graph sample by adopting a bidirectional encoder of a transformer, to obtain each first text feature; performing text division on the text information corresponding to each content graph sample to obtain a text sequence corresponding to each content graph sample, wherein the text sequence corresponding to each content graph sample comprises a plurality of text marks; calculating the contribution degree of each text mark in each text sequence, wherein the contribution degree is the contribution degree of the text mark to content classification label prediction; determining key text marks in each text sequence according to the contribution degree of each text mark in each text sequence, determining the key text marks in each text sequence as the text marks to be replaced in the text sequence, replacing the text marks to be replaced in the text sequence with mask marks, and generating each second text feature based on each masked text sequence by adopting the bidirectional encoder of the transformer;
The first training module is used for training the initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, wherein the sample characteristics comprise the first image characteristics, the second image characteristics, the first text characteristics and the second text characteristics; the target pre-training model is used for training to obtain a somatosensory wind identification model, the somatosensory wind identification model predicts a somatosensory wind label of data information, the somatosensory wind label is used for describing the somatosensory wind category of the data information, the data information is at least one of text, picture, video or music, the somatosensory wind category is used for describing the corresponding category of style and tone of the data information, the style and the tone are integral feelings formed for a user, and the integral feelings are any one of auditory feelings or visual feelings.
10. The apparatus of claim 9, wherein the first training module is specifically configured to construct a multi-modal feature corresponding to each content graph sample based on a sample feature of each content graph sample; and training the initial pre-training model based on the multi-modal characteristics of each content graph sample to obtain the target pre-training model.
11. The apparatus of claim 10, wherein the first training module is specifically configured to train the initial pre-training model based on at least one of the fusion features and the sample features of each of the content graph samples, and the multi-modal features, to obtain the target pre-training model; the fusion features comprise a first fusion feature and a second fusion feature, wherein the first fusion feature is a feature constructed based on the first image feature and the second text feature of the content graph sample; the second fusion feature is a feature constructed based on the second image feature and the second text feature of the content graph sample.
12. The apparatus of claim 11, wherein the first training module is specifically configured to, during the training process: based on the multi-modal characteristics, obtaining predicted content classification labels corresponding to the content graph samples, and calculating cross entropy loss information corresponding to the content graph samples through the predicted content classification labels and the content classification labels; calculating and determining similarity loss information corresponding to each content graph sample based on at least one of fusion characteristics and sample characteristics of each content graph sample; and updating model parameters of the initial pre-training model based on each piece of cross entropy loss information and each piece of similarity loss information.
13. The apparatus according to claim 12, wherein the processing module is configured to calculate a first similarity between the first image feature and the second image feature of each of the content graph samples; calculate a second similarity between each text sub-feature in the second text features of each of the content graph samples; calculate a third similarity between the first fusion feature and the second fusion feature of each content graph sample; and obtain similarity loss information corresponding to each content graph sample based on each first similarity, each second similarity and each third similarity.
14. The apparatus of claim 12, wherein the first training module is specifically configured to perform, during the training process, an evaluation of a degree of matching between graphics and text based on the first image feature and the second text feature of each of the content graph samples, so as to obtain a degree of matching between graphics and text corresponding to each of the content graph samples; and updating model parameters of the initial pre-training model based on the cross entropy loss information, the similarity loss information and the image-text matching degree.
15. A training device for a somatosensory wind identification model, the device comprising:
the acquisition module is used for acquiring all the somatosensory wind-drawing training samples and somatosensory wind-drawing labels corresponding to the somatosensory wind-drawing training samples;
the second training module is used for training the initial somatosensory wind recognition model based on the somatosensory wind training samples to obtain a trained somatosensory wind recognition model, the somatosensory wind recognition model predicts a somatosensory wind label of data information, the somatosensory wind label is used for describing the somatosensory wind category of the data information, the data information is at least one of text, picture, video or music, the somatosensory wind category is used for describing the corresponding category of the style and the tone of the data information, the style and the tone are an overall feeling formed for a user, and the overall feeling is any one of auditory feeling or visual feeling;
the obtaining mode of the initial somatosensory wind identification model comprises the following steps:
obtaining each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample, the content graph sample is constructed based on a plurality of video file frames, and the data description information comprises: a processed content graph sample after processing the content graph sample and text information corresponding to the content graph sample;
Acquiring content classification labels corresponding to the content graph samples in each sample data pair;
respectively extracting image features of each content graph sample and each processed content graph sample by adopting a visual transformer model, to obtain a first image feature and a second image feature of each content graph sample;
extracting text features from the content classification label corresponding to each content graph sample by adopting a bidirectional encoder of a transformer, to obtain each first text feature;
text division is carried out on text information corresponding to each content graph sample, a text sequence corresponding to each content graph sample is obtained, and the text sequence corresponding to each content graph sample comprises a plurality of text marks;
calculating the contribution degree of each text mark in each text sequence, wherein the contribution degree is the contribution degree of the text mark to content classification label prediction;
determining key text marks in each text sequence according to the contribution degree of each text mark in each text sequence, determining the key text marks in each text sequence as the text marks to be replaced in the text sequence, replacing the text marks to be replaced in the text sequence with mask marks, and generating each second text feature based on each masked text sequence by adopting the bidirectional encoder of the transformer;
Training an initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, wherein the sample characteristics comprise the first image characteristics, the second image characteristics, the first text characteristics and the second text characteristics; and taking the target pre-training model as the initial somatosensory wind identification model.
16. A motion sensing and wind drawing recognition device, the device comprising:
the acquisition module is used for acquiring the data information to be identified;
the identification module is used for acquiring a predicted somatosensory wind-drawing label corresponding to the data information to be identified through a somatosensory wind-drawing identification model based on the data information to be identified, wherein the predicted somatosensory wind-drawing label is used for describing the somatosensory wind-drawing type of the data information to be identified, the data information is at least one of texts, pictures, videos or music, the somatosensory wind-drawing type is used for describing the style and the tone corresponding type of the data information, and the style and the tone are an overall feeling formed for a user, and the overall feeling is any one of auditory feeling or visual feeling;
the method for obtaining the somatosensory wind identification model comprises the following steps of:
Acquiring somatosensory wind drawing training samples and somatosensory wind drawing labels corresponding to the somatosensory wind drawing training samples;
training the initial somatosensory wind identification model based on each somatosensory wind training sample to obtain a trained somatosensory wind identification model;
the obtaining mode of the initial somatosensory wind identification model comprises the following steps:
obtaining each sample data pair, wherein the sample data pair comprises a content graph sample and data description information corresponding to the content graph sample, the content graph sample is constructed based on a plurality of video file frames, and the data description information comprises: a processed content graph sample after processing the content graph sample and text information corresponding to the content graph sample;
acquiring content classification labels corresponding to the content graph samples in each sample data pair;
respectively extracting image features of each content graph sample and each processed content graph sample by adopting a visual transformer model, to obtain a first image feature and a second image feature of each content graph sample;
extracting text features from the content classification label corresponding to each content graph sample by adopting a bidirectional encoder of a transformer, to obtain each first text feature;
Text division is carried out on text information corresponding to each content graph sample, a text sequence corresponding to each content graph sample is obtained, and the text sequence corresponding to each content graph sample comprises a plurality of text marks;
calculating the contribution degree of each text mark in each text sequence, wherein the contribution degree is the contribution degree of the text mark to content classification label prediction;
determining key text marks in each text sequence according to the contribution degree of each text mark in each text sequence, determining the key text marks in each text sequence as the text marks to be replaced in the text sequence, replacing the text marks to be replaced in the text sequence with mask marks, and generating each second text feature based on each masked text sequence by adopting the bidirectional encoder of the transformer;
training an initial pre-training model based on sample characteristics of each content graph sample to obtain a target pre-training model, wherein the sample characteristics comprise the first image characteristics, the second image characteristics, the first text characteristics and the second text characteristics; and taking the target pre-training model as the initial somatosensory wind identification model.
17. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.
18. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any one of claims 1 to 8.