CN112052679B

CN112052679B - Fused media information processing method based on MI-CFM-IMC algorithm

Info

Publication number: CN112052679B
Application number: CN202011050720.0A
Authority: CN
Inventors: 胡燕祝; 王松
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2020-09-29
Filing date: 2020-09-29
Publication date: 2022-08-02
Anticipated expiration: 2040-09-29
Also published as: CN112052679A

Abstract

The invention relates to a fused media information processing method based on MI-CFM-IMC algorithm, which is a fused media information processing method for dangerous chemical accidents in safety production, belongs to the field of intelligent safety, and is characterized by comprising the following steps: (1) calculating mutual information of the terms and each category; (2) constructing a context feature matrix; (3) constructing a synonym and antisense characteristic matrix; (4) constructing an attribute semantic feature matrix; (5) and acquiring a fusion matrix. The method can effectively solve the problem that the semantic associated information of the rare words cannot be fully expressed due to lack of context related data, and greatly improves the relation definition between the synonyms and the antonyms of the central words in the same text. The method combines the mutual information characteristic extraction MI and the IMC induction matrix completion algorithm, and obtains higher information fusion accuracy. The method with high information fusion accuracy is provided for the field of hazardous chemical substance accident information processing.

Description

Fused media information processing method based on MI-CFM-IMC algorithm

Technical Field

The invention relates to the field of intelligent safety, in particular to a fused media information processing method for dangerous chemical accidents in safety production.

Background

At present, in the field of safe production of hazardous chemicals, fused media information is processed, and key information of fused media text data is mainly acquired, a mainstream method is to adopt a distributed learning method, train by acquiring a large number of keywords of text information, show the interrelation of characteristic words and establish a matrix, but the information processing based on the method has many problems, and when the characteristics of the keywords of the text information are insufficient, the generated characteristic words have weak relationship and are insufficient to completely show the interrelation of the key information; when the anti-sense words of the central sentence appear in the same text segment, the words with opposite meanings are endowed with more similar semantic association in the processing process, and the words with similar meanings of the central vocabulary appear in the sentences with different semantic attributes, so that the generated relation matrix is easily endowed with farther relation.

For processing of fused media information, particularly text information, in order to acquire key information efficiently and timely, a central vocabulary needs to be accurately distinguished and associated, on the basis, a fused media information processing method based on an MI-CFM-IMC algorithm is provided, a context feature matrix is constructed for a central word of text features, the relation between the key information is found, a synonym feature matrix and an antisense feature matrix are constructed, the found key information is classified, then objects respectively corresponding to the key information are found through an attribute semantic feature matrix, and finally the three matrices are fused by using the IMC, so that the efficiency and the accuracy of the fused media information processing are improved, workers can find problems in various aspects, the problems are summarized stereoscopically, and various accident prevention and solution strategies are made.

Disclosure of Invention

In view of the problems in the prior art, the technical problem to be solved by the present invention is to provide a fused media information processing method based on MI-CFM-IMC algorithm, and the specific flow is shown in fig. 1.

The technical scheme for realizing the aim of the invention comprises the following specific steps:

the method comprises the following steps: calculating mutual information of the terms and each category;

establishing a stop word lexicon and a training text set, segmenting the training text in the data set, filtering stop words according to the stop word lexicon after segmentation, labeling the part of speech of the segmented text, and calculating the residual lexical items and the mutual information of each category;

in the formula, U is a term, C is a category, U, C are binary random variables, I represents an identity matrix, P represents probability, and when a document contains the term, the value of U is e _t 1, otherwise e _t 0; when the document belongs to the category C, the value e of C _c 1, otherwise e _c ＝0；

Calculating each lexical item and mutual information of each category, selecting k lexical items with the largest value, deleting repeated words among the categories, and screening out characteristic words;

step two: constructing a context feature matrix M ^SPPMI ；

And (3) preprocessing a context feature text by a similarity dictionary S formed by a test set, labeling all context words of each word in the similarity dictionary S, and calculating # (w), # (c) and # (w, c):

M ^SPPMI ＝(SPPMI(w _i ,c _j )) _|D|×|D| ；

in the formula, D is the number of total words, SPPMI is a negatively sampled non-negative point mutual information matrix, # (w) is the number of times of occurrence of each current central word, # (c) is the number of times of occurrence of context words, and# (w, c) is the number of times of occurrence of each word pair (w, c);

step three: constructing synonym and antisense characteristic matrix M ^SAM ：

After a crawler technology is utilized to obtain related texts, analyzing the json data of the text behaviors of synonyms and antonyms, extracting to obtain a synonym and antonym list of each word in a similarity dictionary, and constructing an SAM characteristic matrix:

in the formula, matrix M ^SAM Is a matrix of one size | S | × | S |,

represents M ^SAM The ith row and the jth column elements of the matrix, if the row head words and the list head words are synonyms, then

If the relation is an antisense word, then

Otherwise

Step four: constructing attribute semantic feature matrix M ^SFM ：

For each word S in the similarity dictionary S _i Extracting explanation and description texts of words from the related library, and preprocessing the semantic texts to obtain ST files; and sequencing the word frequency of each word in the ST according to the word frequency, and filtering to obtain a dictionary CN:

in the formula, for each text in ST, if the word name in CN appears in the text, setting

Otherwise set up

Up to s _i When the word is the last word in the S, constructing an attribute semantic feature matrix;

step five: acquiring a fusion matrix M:

M≈(M ^SAM ) ^T (M ^CFM ) ^T H(M ^SFM ) ^T

in the formula, M ^CFM The matrix is a target matrix to be decomposed in the IMC algorithm; m ^SAM The characteristic matrix is a matrix in an IMC algorithm; m ^SFM The characteristic matrix is a matrix in an IMC algorithm, and the H matrix represents a Hermit matrix.

Compared with the prior art, the invention has the advantages that:

(1) the present invention overcomes the disadvantage of weak relationships between information of distributed learning and may effectively improve the relevance between information of melting media.

(2) The method can effectively solve the problem that the semantic association information of the rare words cannot be fully expressed due to lack of context related data, firstly determines the opposite relation between the synonym and the antisense of the headword, and greatly improves the relation definition between the synonym and the antisense of the headword between the same text.

(3) The method combines the mutual information characteristic extraction MI and the IMC induction matrix completion algorithm, and obtains higher information fusion accuracy. The method has certain practical value for processing the accident information of the hazardous chemical substances in production.

Drawings

For a better understanding of the present invention, reference is made to the following further description taken in conjunction with the accompanying drawings.

FIG. 1 is a flow chart of steps for establishing a fused media information processing method based on MI-CFM-IMC algorithm;

FIG. 2 is a flow chart of an algorithm for establishing a fused media information processing method based on MI-CFM-IMC algorithm;

fig. 3 is a comparative experiment of four sets of information processing methods.

Detailed description of the preferred embodiments

The present invention will be described in further detail below with reference to examples.

The implementation case selects the constructed CFM data from Wikipedia, the constructed SMA data from Thesauus, and the constructed SFM data from Wikipedia, Wiktionary dictionary and online dictionary. After relevant data are crawled from relevant websites, data are preprocessed to achieve uniform format, and then matrix relevant processing is carried out. Wherein, the size of the similarity dictionary S constructed according to the data set in the CFM is 5987.

The overall flow of the fused media information processing method based on the MI-CFM-IMC algorithm is shown as the figure, and the specific steps are as follows:

(1) calculating mutual information of terms and each category:

establishing a stop word lexicon and a training text set, segmenting the training text in the data set, filtering stop words according to the stop word lexicon after segmentation, labeling the part of speech of the segmented text, and calculating the mutual information of the rest terms and each category:

where U is the term, C is the category, U, C are all binary random variables that, when a document contains terms,the value of U is e _t 1, otherwise e _t 0; when the document belongs to the category C, the value e of C _c 1, otherwise e _c ＝0。

And calculating the mutual information of each term and each category of each category, and selecting k terms with the maximum value. And deleting repeated words among all categories, and screening out the characteristic words.

(2) Constructing a context feature matrix M ^SPPMI ：

And (2) preprocessing a context feature text by a similarity dictionary S formed by a test set, wherein the size of the S is 5987, all context words are labeled for each word in the similarity dictionary S, 3770834 texts are reserved finally, and the calculation is performed, wherein the calculation is performed according to the following formula:

M ^SPPMI ＝(SPPMI(w _i ,c _j )) _|D|×|D|

in the formula, D is the total number of words, SPPMI is a negatively sampled non-negative point mutual information matrix, # (w) is the number of occurrences of each current core word, # (c) is the number of occurrences of a context word, and # (w, c) is the number of occurrences of each word pair (w, c).

(3) Constructing synonym and antonym feature matrix M ^SAM ：

After obtaining related texts by utilizing a crawler technology, analyzing the json data of the text behaviors of synonyms and antonyms, extracting to obtain a synonym and antonym list of each word in a similarity dictionary, and constructing an SAM characteristic matrix:

in the formula, matrix M ^SAM Is a matrix of one size | S | × | S |. If the head words of the row list and the head words of the list are synonyms, then

If the relation is an antisense word, then

Otherwise

(4) Constructing attribute semantic feature matrix M ^SFM ：

wherein, for each text in ST, if the word name in CN appears in the text, setting

Otherwise set up

Up to s _i And when the word is the last word in the S, constructing an attribute semantic feature matrix.

(5) Acquiring a fusion matrix M:

M≈(M ^SAM ) ^T (M ^CFM ) ^T H(M ^SFM ) ^T

in the formula, M ^CFM The matrix is a target matrix to be decomposed in the IMC algorithm; m ^SAM The characteristic matrix is a matrix in an IMC algorithm; m ^SFM The feature matrix is a matrix in the IMC algorithm.

In order to verify the accuracy of the invention in processing information of the fused media, four sets of information processing comparison experiments were performed on the invention, and the experimental results are shown in fig. 3. As can be seen from the figure, the fused media information processing method based on the MI-CFM-IMC algorithm has the advantages that the accuracy rate is kept above 90%, the higher accuracy rate can be achieved on the basis of ensuring the stability, and the effect is good. The fused media information processing method based on the MI-CFM-IMC algorithm is effective, provides a better method for processing the fused media information, and has certain practical value.

Claims

1. The invention discloses a fused media information processing method based on MI-CFM-IMC algorithm, which is characterized in that: (1) calculating mutual information of the terms and each category; (2) constructing a context feature matrix; (3) constructing a synonym and antisense characteristic matrix; (4) constructing an attribute semantic feature matrix; (5) acquiring a fusion matrix; the method specifically comprises the following five steps: