CN113254741A

CN113254741A - Data processing method and system based on intra-modality fusion and inter-modality relation

Info

Publication number: CN113254741A
Application number: CN202110665991.5A
Authority: CN
Inventors: 李寿山; 安明慧; 王晶晶; 周国栋
Original assignee: Suzhou University
Current assignee: Digital Suzhou Construction Co ltd
Priority date: 2021-06-16
Filing date: 2021-06-16
Publication date: 2021-08-13
Anticipated expiration: 2041-06-16
Also published as: CN113254741B

Abstract

The application relates to a data processing method and a system based on the relation between fusion modalities and between modalities, which comprises the following steps: obtaining sample data of a social network directed target classification, dividing the sample data into a training set, a verification set and a test set, and obtaining sample data of the training set, sample data of the verification set and sample data of the test set; constructing a preset classification model, wherein the preset classification model comprises a feature extraction network, a target classification main task network and a multi-modal subject information auxiliary task network which are connected with the feature extraction network; and inputting the sample data of the training set into the preset classification model, training by using a preset loss function, and fusing the output of the main task and the auxiliary task by using a gating mechanism to obtain a social data classification model, wherein the social data classification model is used for classifying the input data to be classified. The method and the device can effectively improve the performance of the social network data in directing to the target classification.

Description

Data processing method and system based on intra-modality fusion and inter-modality relation

Technical Field

The present application relates to the field of data processing technologies, and more particularly, to a data processing method and system fusing intra-modality and inter-modality relationships.

Background

The expression and behavior of a human are variously expressed, reflecting the mental condition of the human. The identification of various expressions and behaviors and the classification of subjects having problems therein have become an essential process for social security and are also important research subjects in the field of psychomedicine. Such subjects are characterized by brain dysfunction that results in various degrees of impairment of mental activities such as cognition, emotion, will, and behavior under the influence of various biological, psychological and social environmental factors. There are a very large variety of such problematic expressions and behaviors, and many eventually develop into various mental disorders, such as autism, depression, delusional disorder, etc. Among them, depression is one of the most common mental disorders, seriously threatening the health of people. According to the world health organization, about 3 million people worldwide suffer from depression. Depression can lead to suicide in severe cases, which seriously affects the daily life of the patient. However, 76% to 85% of depression patients are not treated effectively in low-income and medium-income countries due to lack of medical resources and trained health care staff, and most depression patients miss appropriate treatment time ignoring their own depression symptoms. Early depression detection is of great significance in preventing mental health diseases such as depression.

The traditional mental disorder identification is mainly based on psychological knowledge, for example, for depression, a way of filling out a depression measuring table and a professional manual interview is adopted to judge whether a user has depression tendency, however, the following defects exist in the way: (1) the resource consumption is high, the professional medical personnel are limited, and the manual detection cost is high; (2) the diagnosis period is long, the diagnosis process needs medical personnel to follow up for a long time, and the process is slow; (3) the process is passive, and the patient can actively go to treatment only when symptoms are obvious, and the optimal treatment time is missed.

With the rapid development of the internet, social platforms such as Twitter, microblog and buffalo have become indispensable social tools for people, hundreds of millions of users share their thoughts and moods in various social platforms every day, the social network data including multi-modal information (such as texts, pictures, voice and the like) provides a new method and way for expression and behavior recognition of people, and more researchers use the multi-modal social network data to research various mental health diseases including depression. However, in the face of massive social network data, how to effectively model multi-modal sequence information becomes a key problem for improving data processing performance. Currently, modeling text or picture modal sequence information is mostly realized by a variation method such as RNN (radio network) and the like, and the problem of sequence dependence exists. Timing information cannot be modeled well.

Disclosure of Invention

The object of the present application is to solve the above technical problem. The application provides a social network data processing method and system fusing intra-modal and inter-modal relationships, data needing to be identified and classified are processed by using a new topic model to model multi-modal sequence information, the problem of sequence dependence caused by methods such as RNN (neural network) is relieved, and the performance of classification processing of a social network pointing to a target is further improved. The application provides the following technical scheme:

in a first aspect, a data processing method based on intra-modality and inter-modality relationships is provided, which includes:

obtaining sample data of a social network directed target classification, dividing the sample data into a training set, a verification set and a test set, and obtaining sample data of the training set, sample data of the verification set and sample data of the test set;

constructing a preset classification model, wherein the preset classification model comprises a feature extraction network, a target classification main task network and a multi-mode topic information auxiliary task network which are connected with the feature extraction network, the feature extraction network comprises a text feature extraction network and a picture feature extraction network, and the multi-mode topic information auxiliary task network comprises a text modal network, a picture modal network and an inter-modal network and is used for acquiring topic information in the text modal network, topic information in the picture modal network and network relation topic information between modalities;

and inputting the sample data of the training set into the preset classification model, training by using a preset loss function, and fusing the output of the main task and the auxiliary task by using a gating mechanism to obtain a social data classification model, wherein the social data classification model is used for classifying the input data to be classified.

Optionally, the text feature extraction network is a BERT model, and the picture feature extraction network is a VGG model.

Optionally, the training with the preset loss function includes simultaneously training the main task and the auxiliary task through a main task loss function, an auxiliary task loss function, and a joint loss function.

Optionally, wherein the text modality network, the picture modality network, and the inter-modality network are constructed based on a variational auto-encoder framework.

Optionally, the method for obtaining the theme information in the text modal network, the theme information in the picture modal network, and the network relationship theme information between modalities includes:

obtaining topic information in a mode by using a text mode network and a picture mode network;

obtaining the relation information between the text mode and the picture mode by using the following formula, and inputting the relation information into the inter-mode network to obtain the relation topic information between the multiple modes:

wherein the content of the first and second substances,

is a non-linear function of the standard,

is as followstThe individual text and its corresponding pictorial representation,

and is

Is a transformation vector of order 3,d，mthe size of the dimensions of the vector is represented,

，

vector multiplication for trainable parameters

The result of (a) is a vector

Each of which

。

Optionally, wherein the network model of the primary task is constructed based on an LSTM model.

Optionally, the merging of the outputs of the main task and the auxiliary task using the gating mechanism comprises:

wherein the content of the first and second substances,

for the final representation of the user of the social network,

is an output representation of the main task and,

for the representation of the output of the three kinds of subject information,

，

and

are trainable parameters.

Optionally, the main task loss function is:

wherein the content of the first and second substances,Nfor the number of samples to be taken,

is as followsiIndividual user

The true category label of (a) is,

is composed of

The coefficients of the regularization are,

all training parameters in the model;

auxiliary task loss function:

wherein the content of the first and second substances,Uin the form of an intermediate content matrix, the content matrix,

for a standard normal distribution, the first half of the formula measures the modeled distribution with a Kullback-Leibler divergence

And true distribution

The second part is the reconstruction loss of the model, the original input is reconstructed by generating the network,

representing a training parameter;

joint loss function:

wherein the content of the first and second substances,

as weights, to balance the penalty functions of the primary and secondary tasks,

is a function of the main loss as a function of,

for the purpose of the text modal loss function,

as a function of the modal loss of the picture,

is an inter-modal relationship loss function.

In a second aspect, there is provided a data processing system that fuses intra-modality and inter-modality relationships, comprising:

the system comprises a sample construction unit, a data acquisition unit and a data analysis unit, wherein the sample construction unit is used for acquiring an initial sample and dividing the sample into a training set, a verification set and a test set;

the model building unit is used for building a data classification model based on the relation topic information in the fusion modality and between the modalities;

and the model training unit is used for training a data classification model based on the intra-modal and inter-modal relation topic information.

The beneficial effects of this application include at least:

(1) compared with the prior social network data processing and classification method which mainly uses texts for training and only mines the relevant information of the text data, the method uses multi-modal social network data based on texts and pictures, and uses more useful information;

(2) the text features are extracted by using the latest BERT method, the picture features are extracted by using the VGG method, data information can be captured better, and the performance of the method is improved effectively;

(3) the invention provides a new theme model, which can learn sparse text theme characteristics and continuous picture characteristic themes;

(4) the topic model is used for learning the topic information in each mode and the relationship topic information among multiple modes, and the social network data target direction classification performance is obviously improved.

Additional advantages, objects, and features of the application will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the application.

Drawings

The present application may be better understood by describing exemplary embodiments thereof in conjunction with the following drawings, wherein:

fig. 1 is a general schematic diagram of a data processing method for fusing intra-modality and inter-modality relationships according to an embodiment of the present application.

Fig. 2 is a flowchart of a data processing method for fusing intra-modality and inter-modality relationships according to an embodiment of the present application.

Fig. 3 is a schematic diagram of text feature extraction based on a BERT model according to an embodiment of the present application.

Fig. 4 is a schematic diagram of extracting features of a picture based on a VGG model according to an embodiment of the present application.

Fig. 5 is a schematic diagram of a topic information model provided in an embodiment of the present application.

FIG. 6 is a data processing system diagram that merges intra-modality and inter-modality relationships provided in accordance with an embodiment of the present application.

Detailed Description

The following detailed description of the embodiments of the present application, taken in conjunction with the accompanying drawings and examples, will enable those skilled in the art to practice the embodiments of the present application with reference to the description.

It is noted that in the detailed description of these embodiments, in order to provide a concise description, all features of an actual implementation may not be described in detail. It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions are made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.

Fig. 2 is a flowchart of a data processing method for fusing intra-modality and inter-modality relationships according to an embodiment of the present application. The method at least comprises the following steps:

step S201, sample data of a social network pointing to a target classification is obtained, the sample data is divided into a training set, a verification set and a test set, and training set sample data, verification set sample data and test set sample data are obtained.

Step S202, a preset classification model is constructed, wherein the preset classification model comprises a feature extraction network, a target classification main task network and a multi-mode subject information auxiliary task network which are connected with the feature extraction network, the feature extraction network comprises a text feature extraction network and a picture feature extraction network, and the multi-mode subject information auxiliary task network comprises a text mode network, a picture mode network and an inter-mode network and is used for acquiring subject information in the text mode network, subject information in the picture mode network and subject information of network relation among modes.

In the present embodiment, text features are extracted using a BERT-base (uncached) model, and the structure of the BERT-base (uncached) model is shown in fig. 3, because each user has published many text sentences, "[ CLS ] in the BERT model is used to represent whole sentence vectors, and the sentence vectors are converted into the same dimensional features as the pictures. When all text sequences are BERT encoded, a text Content matrix (UGC) can be obtained, and the formula is as follows:

wherein the content of the first and second substances,

a matrix of text content representing the user,

is shown asnThe characteristics extracted by BERT were used in sentences.

In this embodiment, a VGG16 model is used to encode a picture, and an output vector of the first fully-connected layer of the VGG16 is used as an encoding vector of the picture, and is converted into a feature dimension with the same size as a text, where the VGG16 model is shown in fig. 4.

In the present embodiment, referring to FIG. 5, the following steps are followed to obtainObtaining the relation topic information between the internal topics of each modality and the multiple modalities: unlike traditional neural topic models that focus on modeling sparse text feature topics, the topic model in this embodiment aims to generate a matrix of content (e.g., content matrix) in between each modality

Or

Etc.), since both sparse and continuous features can be encoded into the same UGC matrix, they can be input into the topic model proposed by the present invention, which adopts a variational self-encoder framework, including inference networks and generation networks. The loss function of the subject model is as follows:

wherein the content of the first and second substances,Uthe intermediate content matrix is shown with superscripts omitted for convenience.

Is a standard normal distribution, and the first half of the formula measures the modeled distribution with a Kullback-Leibler divergence

And true distribution

The second part is the reconstruction loss of the model, i.e. the original input is reconstructed by the generating network.

Representing the training parameters.

Firstly, a theme model is used for modeling a theme in each mode, relationship information among multiple modes is obtained by using the following formula, and then the relationship information is input into the theme model to obtain relationship theme characteristics among the modes, wherein the formula is as follows:

wherein the content of the first and second substances,

is a standard non-linear function of the signal,

is shown astThe individual text and its corresponding pictorial representation,

and is

Is a transformation vector of order 3,d，mrepresenting the dimension size of the vector.

，

Are trainable parameters. We multiply the vectors

As a vector

Each of which

Is equivalent to

The bilinear models simultaneously capture multiline mutual information of the vectors. We approximate each vector slice using two low rank matrices using a method of vector factorization

The method comprises the following steps:

wherein the content of the first and second substances,

，

and is

。

In the example, an LSTM is utilized to construct a target-oriented classification network model, the training of the model is set as a main task, and a theme in each modeling mode and a relation theme between multiple modes are set as auxiliary tasks.

Step S203, inputting the sample data of the training set into the preset classification model, training by using a preset loss function, and fusing the output of the main task and the auxiliary task by using a gating mechanism to obtain a social data classification model, wherein the social data classification model is used for classifying the input data to be classified

Firstly, the characteristics of a text mode and a picture mode are fused and input into an LSTM network for training to obtain the representation of a main task. The loss function of a specific main task is defined as:

wherein the content of the first and second substances,Nindicating the number of samples, i.e., the number of Twitter users.

Is the firstiIndividual user

True category label of (2).

Is that

The coefficients of the regularization are,

all training parameters in the model are represented.

Then, in order to distinguish different influences of the topic information of different auxiliary tasks on the target detection classification result, a gating mechanism is used to distinguish the expressions of the main task and the auxiliary tasks, and a formula is defined as follows:

wherein the content of the first and second substances,

it is the last representation of the user that,

is an output representation of the main task and,

the method is a representation of three types of theme information output, wherein the theme information in each mode comprises a text theme and a picture theme, and then the relation theme information among multiple modes.

，

And

are trainable parameters.

The method usesThe combined loss function simultaneously optimizes the main task and the three auxiliary tasks, the loss function of the whole method comprises the three auxiliary tasks, such as text theme modeling, picture theme modeling and multi-modal relation theme modeling, and the corresponding loss function is expressed as

And

. The loss function of the loss union of the whole method is then expressed as:

wherein the content of the first and second substances,

is a weight that balances the penalty function for the primary and secondary tasks.

In the embodiment, the data classification is carried out by simultaneously fusing the characteristics of the main task and the auxiliary task through the following steps, and the characteristic result fused by using the gating mechanism is input into softmax to carry out the second classification of the pointing target:

wherein the content of the first and second substances,

，

is a parameter that is trainable,nis the total number of categories to be classified,ris the final representation of the user for classification.

Is the probability distribution of the two classes predicted by the model. The method inputs the social network user numberThe text and picture information of the table are converted into probability distribution and output as a classification label pointing to the target, taking depression as an example, 0 is non-depression and 1 is depression.

FIG. 6 is a data processing system for fusing multi-modal relationship topic information provided in one embodiment of the present application. The system at least comprises the following units: a sample construction unit 610, a model construction unit 620, and a model training unit 630.

The sample construction unit 610 acquires an initial sample, and divides the sample into a training set, a verification set and a test set;

a model construction unit 620 for constructing a data classification model based on the intra-modality fusion and inter-modality relationship topic information;

a model training unit 630 trains data classification models based on intra-modality and inter-modality relationship topic information.

For relevant details reference is made to the above-described method embodiments.

The basic principles of the present application have been described in connection with specific embodiments, but it should be noted that, for those skilled in the art, it can be understood that all or any of the steps or components of the method and apparatus of the present application can be implemented in hardware, firmware, software or their combination in any computing device (including processors, storage media, etc.) or network of computing devices, which can be implemented by those skilled in the art using their basic programming skills after reading the description of the present application.

The object of the present application can thus also be achieved by running a program or a set of programs on any computing device. The computing device may be a general purpose device as is well known. The object of the application can thus also be achieved merely by providing a program product comprising program code for implementing the method or the apparatus. That is, such a program product also constitutes the present application, and a storage medium storing such a program product also constitutes the present application. It is to be understood that the storage medium may be any known storage medium or any storage medium developed in the future.

It is further noted that in the apparatus and method of the present application, it is apparent that the components or steps may be disassembled and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application. Also, the steps of executing the series of processes described above may naturally be executed chronologically in the order described, but need not necessarily be executed chronologically. Some steps may be performed in parallel or independently of each other.

Unless otherwise defined, technical or scientific terms used in the claims and the specification should have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. The use of "first," "second," and similar terms in the description and claims of this patent application do not denote any order, quantity, or importance, but rather the terms are used to distinguish one element from another. The terms "a" or "an," and the like, do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprise" or "comprises", and the like, means that the element or item listed before "comprises" or "comprising" covers the element or item listed after "comprising" or "comprises" and its equivalent, and does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, nor are they restricted to direct or indirect connections.

The above-described embodiments should not be construed as limiting the scope of the present application. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A data processing method based on intra-modality and inter-modality relationships fusion includes:

2. The method of claim 1, wherein the text feature extraction network is a BERT model and the picture feature extraction network is a VGG model.

3. The method of claim 1, wherein the training with the preset loss function comprises simultaneously training the main task and the auxiliary task with a main task loss function, an auxiliary task loss function, and a joint loss function.

4. The method of claim 1, wherein the text modality network, picture modality network, and inter-modality network are constructed based on a variational self-encoder framework.

5. The method according to claim 4, wherein the method for obtaining the topic information of the text modal network, the topic information of the picture modal network and the topic information of the inter-modal network relationship comprises the following steps: