CN114169320A - Multi-source data fusion method and system based on word vector matrix decomposition technology

Info

Publication number: CN114169320A
Application number: CN202111330802.5A
Authority: CN
Inventors: 杜登斌; 杜小军; 杜乐
Original assignee: Wuhan Donghu Big Data Trading Center Co ltd
Current assignee: Wuhan Donghu Big Data Trading Center Co ltd
Priority date: 2021-11-11
Filing date: 2021-11-11
Publication date: 2022-03-11

Abstract

The invention provides a multi-source data fusion method and system based on a word vector matrix decomposition technology. Multi-source data samples are obtained, each comprising multi-modal data of text, voice, images and video, together with a corresponding implicit semantic knowledge base. The multi-source data samples and the corresponding semantic information extracted from the implicit semantic knowledge base are projected together into a shared semantic subspace to generate a word vector matrix. The word vector matrix is decomposed to obtain low-dimensional features of the multi-source data samples. A classifier is trained with the low-dimensional features of the multi-source data samples as input and the corresponding semantic information as labels. The multi-modal data of the target task to be mined is processed through the same word vector matrix generation and matrix decomposition, then input into the trained classifier to obtain the semantic information of the target task, completing the implicit semantic mining of the target task. The invention thereby effectively extracts the implicit meaning of multi-source data and the relationships among the data.

Description

Multi-source data fusion method and system based on word vector matrix decomposition technology

Technical Field

The invention relates to the technical field of multi-source data fusion, in particular to a multi-source data fusion method and system based on a word vector matrix decomposition technology.

Background

New data processing tools free data scientists from the tedious task of data preparation, but how to fuse multi-source data into an effective analysis data set tailored to each data analysis project remains a challenging bottleneck that data scientists must face.

With the development of computer and communication technology, new theories and methods continue to appear, multi-source data fusion technology is maturing, research is shifting from theory to wider practical application, and development is bound to move toward intelligent, real-time processing. Efficient learning of word vectors is of great importance to natural language processing. Traditional word vector learning methods usually depend on a large unlabeled text corpus but ignore the semantic information of words, such as the semantic relations among words. Word vectors learned from the corpus alone therefore cannot reflect well the meaning of words and the complex relationships among them.

Therefore, no generally applicable method is currently available to solve the problem that the implicit meaning of multi-source data and the relationships among the data cannot be effectively extracted.

Disclosure of Invention

In view of the above, the invention provides a multi-source data fusion method based on a word vector matrix decomposition technology, which is used for solving the problem that the implicit meaning of multi-source data and the relationship between data cannot be effectively extracted.

The technical scheme of the invention is realized as follows:

In a first aspect of the present invention, a multi-source data fusion method based on a word vector matrix decomposition technology is disclosed, comprising the following steps:

S1, acquiring multi-source data samples, each comprising multi-modal data of text, voice, images and video, and acquiring a corresponding implicit semantic knowledge base; simultaneously acquiring a target task;

S2, projecting, through word2vec, the multi-source data samples and the corresponding semantic information extracted from the implicit semantic knowledge base together into a shared semantic subspace to generate a word vector matrix;

S3, decomposing the word vector matrix to obtain low-dimensional features of the multi-source data samples;

S4, training a classifier with the low-dimensional features of the multi-source data samples as input and the corresponding semantic information as labels;

S5, processing the multi-modal data of the target task through the same steps S2 and S3, then inputting it into the trained classifier to obtain the semantic information of the target task, completing the implicit semantic mining of the target task.

With this method, the implicit meaning of the multi-source data and the relationships among the data are effectively extracted, and implicit semantic mining of unknown data can be realized.

On the basis of the above technical solution, preferably, step S1 specifically includes:

the implicit semantic knowledge base comprises description information and synonym information corresponding to the implicit associations of the sample data;

multi-modal data in text form is represented symbolically; multi-modal data in audio or visual form is represented as a signal and is converted to a corresponding text format, where the visual form includes pictures and video.

On the basis of the above technical solution, preferably, step S2 specifically includes:

S2-1, training a neural network with word2vec on the multi-source data samples, continuously adjusting the weight W through a gradient descent algorithm during training, and obtaining the final weight W when training is finished;

S2-2, multiplying each multi-source data sample of the input layer, represented as a one-hot vector, by the weight W; the resulting vector is the word vector, which gives the multi-modal data a uniform representation;

S2-3, projecting the plurality of word vectors and the corresponding semantic information extracted from the implicit semantic knowledge base together into a shared semantic subspace to generate an n × m word vector matrix X of rank r.

In this way, a word vector matrix of the multi-source data samples is constructed and the multi-modal data is expressed uniformly.
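Step S2-2 can be illustrated with a minimal sketch. It assumes a weight matrix W has already been trained (the word2vec training of S2-1 is not shown), and the vocabulary, the values in `W`, and the helper names `one_hot` and `vec_times_matrix` are hypothetical and chosen only for illustration. It shows that multiplying a one-hot vector by W simply selects the matching row of W.

```python
def one_hot(index, size):
    """Return a one-hot row vector with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def vec_times_matrix(vec, matrix):
    """Multiply a row vector (length k) by a k x d matrix."""
    return [
        sum(v * row[j] for v, row in zip(vec, matrix))
        for j in range(len(matrix[0]))
    ]


# Hypothetical vocabulary and hypothetical trained weights (3 words, 2 dimensions).
vocab = ["thriller", "comedy", "drama"]
W = [
    [0.2, -0.1],
    [0.7, 0.3],
    [-0.4, 0.9],
]

# One word vector per vocabulary entry: one-hot vector times W.
word_vectors = [
    vec_times_matrix(one_hot(i, len(vocab)), W) for i in range(len(vocab))
]

for word, vector in zip(vocab, word_vectors):
    print(word, vector)
```

Because each one-hot vector has a single non-zero entry, the product is a row lookup, so in practice the multiplication is usually replaced by indexing into W.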

On the basis of the above technical solution, preferably, step S3 specifically includes:

S3-1, calculating the m eigenvalues $\lambda_i$ of the real symmetric matrix $X^T X$ and the m corresponding unit eigenvectors $v_i$, where r of the eigenvalues are non-zero;

S3-2, sorting the m eigenvalues $\lambda_i$ from large to small, computing the singular values

$$\sigma_i = \sqrt{\lambda_i}, \quad i = 1, 2, \ldots, r,$$

and obtaining $S = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)$;

S3-3, arranging the m unit eigenvectors $v_i$ in the corresponding order to obtain the right singular matrix $V = (v_1, v_2, \ldots, v_r, \ldots, v_m)$, and simultaneously obtaining the eigenvector matrix of the non-zero eigenvalues $V_1 = (v_1, v_2, \ldots, v_r)$;

S3-4, determining the n × r matrix $U_1 = (u_1, u_2, \ldots, u_r)$ from $U_1 = X V_1 S^{-1}$;

S3-5, constructing n − r column vectors $u_j$ ($j = r+1, \ldots, n$), each a unit vector orthogonal to the other n − 1 column vectors, to form $U_2 = (u_{r+1}, u_{r+2}, \ldots, u_n)$, and obtaining the left singular matrix $U = (U_1, U_2) = (u_1, u_2, \ldots, u_r, \ldots, u_n)$;

S3-6, obtaining the singular value decomposition of the word vector matrix:

$$X = U \begin{pmatrix} S & 0 \\ 0 & 0 \end{pmatrix} V^T$$

where U is the left singular matrix and $V^T$ is the transpose of the right singular matrix V.

In this way, the word vector matrix is decomposed, dimensionality reduction is completed, and the low-dimensional features of the multi-source data samples are obtained.
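Steps S3-1 to S3-4 can be sketched in plain Python under a simplifying assumption: the matrix X has exactly two columns (m = 2), so the eigenvalues of the 2 × 2 matrix $X^T X$ have a closed form. The function names, the tolerance `tol`, and the sample matrix are hypothetical. Taking the rows of $U_1 S$ as the low-dimensional features is one common reading and is also an assumption, since the text does not state which factor is used.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def eig_sym_2x2(g):
    """Eigenpairs of a real symmetric 2x2 matrix, largest eigenvalue first."""
    a, b, c = g[0][0], g[0][1], g[1][1]
    if abs(b) < 1e-12:
        # Already diagonal: the eigenvectors are the coordinate axes.
        pairs = [(a, [1.0, 0.0]), (c, [0.0, 1.0])]
        return sorted(pairs, key=lambda p: p[0], reverse=True)
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    pairs = []
    for lam in (mean + radius, mean - radius):
        # (b, lam - a) solves (a - lam) * v0 + b * v1 = 0.
        norm = math.hypot(b, lam - a)
        pairs.append((lam, [b / norm, (lam - a) / norm]))
    return pairs


def thin_svd_two_columns(x, tol=1e-10):
    """Return (U1, singular values, V1) for an n x 2 matrix, following S3-1 to S3-4."""
    gram = matmul(transpose(x), x)                     # S3-1: X^T X
    pairs = [(lam, v) for lam, v in eig_sym_2x2(gram) if lam > tol]
    sigmas = [math.sqrt(lam) for lam, _ in pairs]      # S3-2: sigma_i = sqrt(lambda_i)
    v1 = transpose([v for _, v in pairs])              # S3-3: columns are v_1..v_r
    xv = matmul(x, v1)
    u1 = [[xv[i][j] / sigmas[j] for j in range(len(sigmas))]
          for i in range(len(x))]                      # S3-4: U1 = X V1 S^-1
    return u1, sigmas, v1


# Hypothetical 3 x 2 word vector matrix of rank 1.
x = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
u1, sigmas, v1 = thin_svd_two_columns(x)

# Assumed choice of low-dimensional features: the rows of U1 S (equal to X V1).
features = [[u1[i][j] * sigmas[j] for j in range(len(sigmas))] for i in range(len(x))]
print("rank:", len(sigmas))
print("features:", features)
```

For larger m, the closed-form eigen-solver would be replaced by a general symmetric eigenvalue routine; the remaining steps are unchanged.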

On the basis of the above technical solution, preferably, step S5 specifically includes:

after the target task is processed through steps S2 and S3, the low-dimensional features of the target task are obtained and input into the classifier, wherein the classifier comprises a decision tree algorithm model, and the low-dimensional features of the target task are classified by the decision tree algorithm model to obtain the semantic information of the target task.

On the basis of the above technical solution, preferably, the semantic information specifically includes:

textual form information associated with the target task.

In this way, the classifier classifies the low-dimensional features of the target task, finds the low-dimensional features of the multi-source data samples that are close to them, and thereby obtains the semantic information that may implicitly correspond to the target task.
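The text does not spell out how the decomposition from step S3 is reused for a new sample in step S5. One common reading, assumed here, is to project the new sample's word vector onto the retained right singular vectors $V_1$, because for the training matrix $X V_1 = U_1 S$, so training and target features then live in the same r-dimensional space. The function name `project` and all values below are hypothetical.

```python
import math


def project(sample, v1):
    """Map an m-dimensional row vector onto the r columns of V1 (an m x r matrix)."""
    return [
        sum(x * v1[i][j] for i, x in enumerate(sample))
        for j in range(len(v1[0]))
    ]


# Hypothetical V1 (m = 2, r = 1) from a training matrix whose rows are multiples of (1, 1).
v1 = [[1.0 / math.sqrt(2.0)], [1.0 / math.sqrt(2.0)]]

training_row = [1.0, 1.0]   # a row of the training matrix X
target_row = [2.0, 2.0]     # hypothetical word vector of the target task

training_feature = project(training_row, v1)
target_feature = project(target_row, v1)
print(training_feature, target_feature)
```

Under this assumption the classifier trained in S4 can be applied directly to `target_feature`, since it has the same dimension and basis as the training features.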

In a second aspect of the present invention, a multi-source data fusion system based on a word vector matrix decomposition technique is disclosed, the system comprising:

the multi-source data acquisition module: acquires multi-source data samples, including multi-modal data sets of text, voice, images and video, and acquires a corresponding implicit semantic knowledge base; simultaneously acquires a target task;

the multi-source data processing module: through word2vec, projects the multi-source data samples and the corresponding semantic information extracted from the implicit semantic knowledge base together into a shared semantic subspace to generate a word vector matrix; performs singular value decomposition on the word vector matrix to obtain low-dimensional features of the multi-source data samples;

the classifier module: comprises a decision tree algorithm model, by which the low-dimensional features of the target task are classified according to the word vector matrix to obtain the semantic information of the target task, wherein the semantic information comprises text-form information associated with the target task, including different expressions of shape, sound, color, smell and texture, as well as synonyms and expressions with the same meaning.

In a third aspect of the present invention, an electronic device is disclosed, the device comprising: at least one processor, at least one memory, a communication interface, and a bus; the processor, the memory and the communication interface complete mutual communication through the bus; the memory stores a word vector matrix decomposition technology-based multi-source data fusion method program executable by the processor, and the word vector matrix decomposition technology-based multi-source data fusion method program is configured to implement the word vector matrix decomposition technology-based multi-source data fusion method according to the first aspect of the present invention.

In a fourth aspect of the present invention, a computer-readable storage medium is disclosed, in which a word vector matrix decomposition technology-based multi-source data fusion method program is stored, and when executed, the word vector matrix decomposition technology-based multi-source data fusion method program implements the word vector matrix decomposition technology-based multi-source data fusion method according to the first aspect of the present invention.

Compared with the prior art, the multi-source data fusion method based on the word vector matrix decomposition technology has the following beneficial effects:

(1) valuable semantic information is extracted from the implicit semantic knowledge base and used as constraint supervision on information that would otherwise depend on the corpus alone; a word vector matrix decomposition model fusing this semantic information is provided, which greatly improves the quality of the word vectors and offers clear advantages for realizing target tasks based on multi-source data fusion and for automatic cognition and analysis;

(2) valuable semantic information is extracted from the implicit semantic knowledge base and integrated into the word vector learning process, the word vector matrix is decomposed, and finally the implicit semantics of the target task are mined through a decision tree algorithm; this realizes the fusion and cognition of multi-source data, learns the huge semantic space of the "cognitive iceberg" or "dark matter" implicit in the multi-source data, and can more closely approximate real human emotion and thinking.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flow chart of the multi-source data fusion method based on the word vector matrix decomposition technology.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

Examples

The working flow of the multi-source data fusion method based on the word vector matrix decomposition technology is shown in figure 1, and the processing steps are described as follows:

First, multi-source data samples are obtained, comprising multi-modal data of text, voice, images and video, and a corresponding implicit semantic knowledge base is obtained. Proceed to the second step.

It should be understood that the implicit semantic knowledge base comprises description information and synonym information corresponding to the implicit associations of the sample data. Multi-modal data in text form is represented symbolically; multi-modal data in audio or visual form is represented as a signal and is converted to a corresponding text format, where the visual form includes pictures and video.

For example, the content of the multi-modal data set of a multi-source data sample includes: subject matter, year, director, actors, genre, etc.

Second, through word2vec, the multi-source data samples and the corresponding semantic information extracted from the implicit semantic knowledge base are projected together into a shared semantic subspace to generate a word vector matrix. Proceed to the third step.

For example, the semantic information corresponding to a multi-modal data set, extracted from the implicit semantic knowledge base, is the name of a movie.

It should be understood that the second step specifically includes steps S2-1 to S2-3 set out above.

It should be appreciated that, because the input is one-hot encoded, the weight W is itself the word embedding matrix of all multi-source data sample words of the input layer.

Third, the word vector matrix is decomposed to obtain the low-dimensional features of the multi-source data samples. Proceed to the fourth step.

It should be understood that the third step mainly comprises steps S3-1 to S3-6 set out above.

The method performs singular value decomposition on the word vector matrix to complete dimensionality reduction: it reduces the dimension of the word vector matrix and also compresses and summarizes a large amount of data, yielding the low-dimensional features of the multi-source data samples.
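As a hand-worked illustration of the decomposition, the following applies steps S3-1 to S3-6 to a small matrix. The 2 × 2 matrix is hypothetical and chosen only because its rank is 1, so that a single feature dimension remains.

```latex
X = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \qquad
X^T X = \begin{pmatrix} 2 & 2 \\ 2 & 2 \end{pmatrix}, \qquad
\lambda_1 = 4,\ \lambda_2 = 0 \ \Rightarrow\ r = 1

v_1 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad
v_2 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \quad
\sigma_1 = \sqrt{\lambda_1} = 2, \quad S = (2)

U_1 = X V_1 S^{-1}
    = \tfrac{1}{2} \cdot \tfrac{1}{\sqrt{2}} \begin{pmatrix} 2 \\ 2 \end{pmatrix}
    = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
U_2 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}

X = U \begin{pmatrix} S & 0 \\ 0 & 0 \end{pmatrix} V^T
  = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
    \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix}
    \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
  = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}
```

Only the first singular value is non-zero, so each two-dimensional row of X is fully described by one coordinate along $v_1$.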

Fourth, a classifier is trained with the low-dimensional features of the multi-source data samples as input and the corresponding semantic information as labels. Proceed to the fifth step.

Fifth, the multi-modal data of the target task is processed in the same way as in the second and third steps and then input into the trained classifier to obtain the semantic information of the target task, completing the implicit semantic mining of the target task.

It should be understood that after the target task is processed through the second and third steps, the low-dimensional features of the target task are obtained and input into the classifier, where the classifier includes a decision tree algorithm model, and the low-dimensional features of the target task are classified by the decision tree algorithm model to obtain the semantic information of the target task.

For example, a user inputs preferred subject matter, actor and style information, which constitutes the multi-modal data of the target task. This data undergoes word vector matrix construction and matrix decomposition to obtain the low-dimensional features of the target task, which are input into the classifier. The decision tree algorithm model returns the corresponding semantic information, namely a movie name, and that movie is recommended to the user.
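The fourth and fifth steps can be sketched with a very small decision tree. The text names a decision tree algorithm model but does not specify the split criterion, so the use of Gini impurity with midpoint thresholds here is an assumption, and the feature values, the labels "Movie A" and "Movie B", and all function names are hypothetical. Labels are assumed to be strings, because tuples are used internally for tree nodes.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (score, feature index, threshold) of the best split, or None."""
    best = None
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # guard against a degenerate threshold
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(rows, labels):
    """Return a leaf label or a node tuple (feature, threshold, left, right)."""
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(rows, labels)
    if split is None:
        # Identical feature vectors with mixed labels: fall back to the majority.
        return Counter(labels).most_common(1)[0][0]
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f,
        t,
        build_tree([r for r, _ in left], [y for _, y in left]),
        build_tree([r for r, _ in right], [y for _, y in right]),
    )


def predict(tree, row):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Hypothetical low-dimensional training features and movie-name labels.
train_features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
train_labels = ["Movie A", "Movie A", "Movie B", "Movie B"]
tree = build_tree(train_features, train_labels)

# Hypothetical low-dimensional features of a user's target task.
recommendation = predict(tree, [0.85, 0.15])
print(recommendation)
```

A production system would more likely rely on an established decision tree library; the sketch only shows how low-dimensional features map to a text label such as a movie name.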

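The invention also discloses a multi-source data fusion system based on the word vector matrix decomposition technology, which comprises the multi-source data acquisition module, the multi-source data processing module and the classifier module described in the second aspect above.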

The invention also discloses an electronic device, comprising: at least one processor, at least one memory, a communication interface, and a bus; the processor, the memory and the communication interface complete mutual communication through the bus; the memory stores a word vector matrix decomposition technology-based multi-source data fusion method program executable by the processor, and the word vector matrix decomposition technology-based multi-source data fusion method program is configured to implement a word vector matrix decomposition technology-based multi-source data fusion method according to an embodiment of the present invention.

The invention also discloses a computer readable storage medium, the storage medium is stored with a multi-source data fusion method program based on the word vector matrix decomposition technology, and when the multi-source data fusion method program based on the word vector matrix decomposition technology is executed, the multi-source data fusion method based on the word vector matrix decomposition technology is realized.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A multi-source data fusion method based on a word vector matrix decomposition technology is characterized by comprising the following steps:

2. The multi-source data fusion method based on the word vector matrix decomposition technology of claim 1, wherein the step S1 specifically includes:

3. The multi-source data fusion method based on the word vector matrix decomposition technology of claim 1, wherein the step S2 specifically includes:

4. The multi-source data fusion method based on the word vector matrix decomposition technology of claim 3, wherein in the step S3, the method specifically includes:

and obtaining $S = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)$;

S3-4, determining the n × r matrix $U_1 = (u_1, u_2, \ldots, u_r)$ from $U_1 = X V_1 S^{-1}$;

5. The multi-source data fusion method based on the word vector matrix decomposition technology of claim 4, wherein the step S5 specifically includes:

after the target task is processed through steps S2 and S3, the low-dimensional features of the target task are obtained and input into the classifier, wherein the classifier comprises a decision tree algorithm model, and the low-dimensional features of the multi-modal data of the target task are classified by the decision tree algorithm model to obtain the semantic information of the target task.

6. The multi-source data fusion method based on the word vector matrix decomposition technology of claim 5, wherein the semantic information specifically includes:

textual form information associated with the target task.

7. A multi-source data fusion system based on a word vector matrix decomposition technique, the system comprising:

a classifier module: comprises a decision tree algorithm model, by which the low-dimensional features of the target task are classified according to the word vector matrix to obtain the semantic information of the target task, wherein the semantic information comprises text-form information associated with the target task, including expressions of shape, sound, color, smell, texture and meaning.

8. An electronic device comprising at least one processor, at least one memory, a communication interface, and a bus; the processor, the memory and the communication interface complete mutual communication through the bus; the memory stores a word vector matrix decomposition technology-based multi-source data fusion method program executable by the processor, and the word vector matrix decomposition technology-based multi-source data fusion method program is configured to implement a word vector matrix decomposition technology-based multi-source data fusion method according to any one of claims 1 to 7.

9. A computer-readable storage medium, wherein the storage medium stores thereon a word vector matrix decomposition technology-based multi-source data fusion method program, and when executed, the word vector matrix decomposition technology-based multi-source data fusion method program implements a word vector matrix decomposition technology-based multi-source data fusion method according to any one of claims 1 to 7.