CN112052679B - Fused media information processing method based on MI-CFM-IMC algorithm - Google Patents
- Publication number
- CN112052679B (application number CN202011050720.0A)
- Authority
- CN
- China
- Prior art keywords
- matrix
- word
- constructing
- words
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a fused media information processing method based on the MI-CFM-IMC algorithm. It is a method for processing fused media information about hazardous chemical accidents in safety production, and belongs to the field of intelligent safety. The method comprises the following steps: (1) calculating the mutual information between terms and each category; (2) constructing a context feature matrix; (3) constructing a synonym and antonym feature matrix; (4) constructing an attribute semantic feature matrix; (5) acquiring a fusion matrix. The method addresses the problem that the semantic association information of rare words cannot be fully expressed for lack of context data, and makes the relations between a head word and its synonyms and antonyms within the same text considerably clearer. By combining mutual information (MI) feature extraction with the inductive matrix completion (IMC) algorithm, the method obtains higher information fusion accuracy, and thus provides a high-accuracy information fusion method for the field of hazardous chemical accident information processing.
Description
Technical Field
The invention relates to the field of intelligent safety, and in particular to a fused media information processing method for hazardous chemical accidents in safety production.
Background
At present, in the field of safe production of hazardous chemicals, processing fused media information mainly means acquiring the key information in fused media text data. The mainstream approach is distributed learning: a model is trained on a large number of keywords drawn from the text, the interrelations among feature words are represented, and a matrix is built. Information processing based on this approach has several problems. When the keyword features of the text are insufficient, the relations among the generated feature words are weak and cannot fully represent the interrelations of the key information. When antonyms of a head word appear in the same text segment, words with opposite meanings are assigned closer semantic associations during processing; and when words similar in meaning to the head word appear in sentences with different semantic attributes, the generated relation matrix tends to assign them more distant relations.
To acquire key information from fused media, and from text in particular, efficiently and in a timely manner, head words must be accurately distinguished and associated. On this basis, a fused media information processing method based on the MI-CFM-IMC algorithm is proposed. A context feature matrix is constructed for the head words of the text features to find the relations among the key information; a synonym and antonym feature matrix is constructed to classify the key information found; an attribute semantic feature matrix is then used to find the objects to which each piece of key information corresponds; finally, the three matrices are fused using IMC. This improves the efficiency and accuracy of fused media information processing, so that staff can identify problems in many aspects, summarize them from multiple angles, and formulate strategies for preventing and resolving accidents.
Disclosure of Invention
In view of the problems in the prior art, the technical problem to be solved by the present invention is to provide a fused media information processing method based on the MI-CFM-IMC algorithm; the specific flow is shown in FIG. 1.
The technical scheme for realizing the aim of the invention comprises the following specific steps:
Step one: calculating the mutual information between terms and each category;
A stop word lexicon and a training text set are established. The training texts in the data set are segmented, stop words are filtered out according to the stop word lexicon, the parts of speech of the segmented text are labeled, and the mutual information between each remaining term and each category is calculated:
$$I(U;C)=\sum_{e_t\in\{1,0\}}\sum_{e_c\in\{1,0\}}P(U=e_t,C=e_c)\log_2\frac{P(U=e_t,C=e_c)}{P(U=e_t)\,P(C=e_c)}$$
In the formula, U is a term and C is a category; U and C are both binary random variables, I denotes the mutual information, and P denotes probability. When a document contains the term, U takes the value $e_t=1$, otherwise $e_t=0$; when the document belongs to category C, C takes the value $e_c=1$, otherwise $e_c=0$;
The mutual information between each term and each category is calculated, the k terms with the largest values are selected, words repeated among the categories are deleted, and the feature words are screened out;
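As a non-limiting illustration of step one, the following minimal sketch computes the mutual information between one term and one category. It assumes the probabilities are estimated by maximum likelihood from document counts; the function name and the four count arguments are hypothetical and do not appear in the original text.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Mutual information I(U; C) in bits from a 2x2 table of document counts.

    n11: documents that contain the term and belong to the category
    n10: documents that contain the term but are outside the category
    n01: documents without the term that belong to the category
    n00: documents without the term and outside the category
    """
    n = n11 + n10 + n01 + n00
    total = 0.0
    # Pair each joint count with its term marginal and its category marginal.
    for n_joint, n_term, n_cat in (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ):
        if n_joint > 0:  # an empty cell contributes nothing (p * log p -> 0)
            total += (n_joint / n) * math.log2(n * n_joint / (n_term * n_cat))
    return total
```

A term that is statistically independent of the category scores zero, while a term that occurs in exactly the documents of a category covering half the collection scores one bit.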
Step two: constructing the context feature matrix $M_{SPPMI}$;
The context feature text is preprocessed using the similarity dictionary S formed from the test set, all context words of each word in S are labeled, and #(w), #(c) and #(w, c) are calculated:
$$M_{SPPMI}=\left(SPPMI(w_i,c_j)\right)_{|D|\times|D|};$$
In the formula, D is the total number of words, SPPMI is the shifted positive pointwise mutual information matrix (non-negative pointwise mutual information shifted by a negative-sampling term), #(w) is the number of occurrences of each current head word, #(c) is the number of occurrences of a context word, and #(w, c) is the number of occurrences of each word pair (w, c);
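As a non-limiting illustration of step two, the sketch below builds an SPPMI matrix from a table of pair counts. The text does not spell out the SPPMI expression, so the sketch assumes the definition commonly used in the literature, SPPMI(w, c) = max(PMI(w, c) - log k, 0) with PMI(w, c) = log(#(w, c)·N / (#(w)·#(c))), where N is the total pair count and #(w), #(c) are taken as row and column sums. The function name and the parameter `k` are hypothetical.

```python
import numpy as np


def sppmi_matrix(pair_counts, k=1):
    """Shifted positive PMI from a matrix with pair_counts[i, j] = #(w_i, c_j)."""
    pair_counts = np.asarray(pair_counts, dtype=float)
    total = pair_counts.sum()                       # assumed normaliser N
    w = pair_counts.sum(axis=1, keepdims=True)      # #(w): row sums
    c = pair_counts.sum(axis=0, keepdims=True)      # #(c): column sums
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(pair_counts * total / (w * c))
    # Pairs with undefined PMI (0/0) are treated like pairs that never co-occur.
    pmi = np.where(np.isnan(pmi), -np.inf, pmi)
    # Shift by log k, then clip at zero so the matrix is non-negative.
    return np.maximum(pmi - np.log(k), 0.0)
```

Clipping at zero keeps the matrix sparse, because every pair that never co-occurs maps to exactly zero.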
Step three: constructing the synonym and antonym feature matrix $M_{SAM}$:
After related texts are obtained with a web crawler, the JSON data in the texts are parsed for synonyms and antonyms, a list of the synonyms and antonyms of each word in the similarity dictionary is extracted, and the SAM feature matrix is constructed.
$M_{SAM}$ is a matrix of size |S| × |S|. The element in row i and column j of $M_{SAM}$ depends on the relation between the row head word and the column head word: it takes one value if they are synonyms, another if they are antonyms, and a third otherwise;
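As a non-limiting illustration of step three, the sketch below fills a |S| × |S| matrix from per-word synonym and antonym lists. The specific entry values are not recoverable from the text, so the defaults used here (1 for synonyms, -1 for antonyms, 0 otherwise) are an assumption, as are the function and parameter names.

```python
import numpy as np


def build_sam(vocab, synonyms, antonyms,
              syn_value=1.0, ant_value=-1.0, default=0.0):
    """Build the synonym/antonym matrix for the words in vocab.

    synonyms / antonyms map a head word to a list of related words.
    The three entry values are assumed, not taken from the source.
    """
    index = {word: i for i, word in enumerate(vocab)}
    m = np.full((len(vocab), len(vocab)), default)
    for value, relation in ((syn_value, synonyms), (ant_value, antonyms)):
        for head, related in relation.items():
            if head not in index:
                continue  # ignore head words outside the dictionary S
            for other in related:
                if other in index:
                    m[index[head], index[other]] = value
    return m
```

Words that fall outside the similarity dictionary are skipped, so the matrix keeps the fixed |S| × |S| shape.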
Step four: constructing the attribute semantic feature matrix $M_{SFM}$:
For each word $s_i$ in the similarity dictionary S, explanatory and descriptive texts of the word are extracted from the related library, and the semantic texts are preprocessed to obtain the ST file; the words in ST are sorted by word frequency and filtered to obtain the dictionary CN.
For each text in ST, the corresponding element of the matrix is set according to whether a word in CN appears in that text, taking one value if it appears and another otherwise. When $s_i$ is the last word in S, construction of the attribute semantic feature matrix is complete;
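As a non-limiting illustration of step four, the sketch below builds the attribute semantic feature matrix as a binary indicator matrix. The entry values are not recoverable from the text, so 1 for "appears" and 0 for "does not appear" is an assumption; the function name and the token-list input format are hypothetical.

```python
import numpy as np


def build_sfm(descriptions, cn):
    """Indicator matrix with one row per word s_i and one column per CN word.

    descriptions: list of token lists, the preprocessed text for each s_i
    cn: list of attribute words kept after frequency filtering
    """
    m = np.zeros((len(descriptions), len(cn)))
    for i, tokens in enumerate(descriptions):
        present = set(tokens)  # set lookup keeps each row linear in len(cn)
        for j, word in enumerate(cn):
            if word in present:
                m[i, j] = 1.0  # assumed value for "word appears in the text"
    return m
```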
Step five: acquiring the fusion matrix M:
$$M\approx(M_{SAM})^{T}(M_{CFM})^{T}H(M_{SFM})^{T}$$
In the formula, $M_{CFM}$ is the target matrix to be decomposed in the IMC algorithm; $M_{SAM}$ and $M_{SFM}$ are feature matrices in the IMC algorithm; and H denotes a Hermitian matrix.
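As a non-limiting illustration of the fusion in step five, the sketch below fits an inductive matrix completion model by plain gradient descent. It uses the standard IMC form from the literature, target ≈ X W Hᵀ Yᵀ, which is arranged differently from the formula above; treating X as the SAM features, Y as the SFM features and the target as the CFM matrix is an assumption. The optimiser, hyperparameters and all names are hypothetical.

```python
import numpy as np


def inductive_matrix_completion(target, x, y, rank=2, lam=1e-3, lr=1e-2,
                                steps=2000, mask=None, seed=0):
    """Fit target ~ x @ w @ h.T @ y.T and return the completed matrix.

    target: (n, m) matrix to complete
    x: (n, d1) row features; y: (m, d2) column features
    mask: 1 where target is observed, 0 where missing (default: all observed)
    """
    rng = np.random.default_rng(seed)
    if mask is None:
        mask = np.ones_like(target)
    w = 0.1 * rng.standard_normal((x.shape[1], rank))
    h = 0.1 * rng.standard_normal((y.shape[1], rank))
    for _ in range(steps):
        # Error on observed entries only.
        residual = mask * (x @ w @ h.T @ y.T - target)
        # Gradients of 0.5 * ||residual||^2 + 0.5 * lam * (||w||^2 + ||h||^2).
        grad_w = x.T @ residual @ y @ h + lam * w
        grad_h = y.T @ residual.T @ x @ w + lam * h
        w -= lr * grad_w
        h -= lr * grad_h
    return x @ w @ h.T @ y.T
```

With identity feature matrices the model reduces to ordinary low-rank factorization, which makes a convenient sanity check before real feature matrices are supplied.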
Compared with the prior art, the invention has the advantages that:
(1) The present invention overcomes the weak relations among information items in distributed learning and can effectively improve the relevance among items of fused media information.
(2) The method addresses the problem that the semantic association information of rare words cannot be fully expressed for lack of context data. It first determines the opposing relation between the synonyms and antonyms of a head word, which makes the relations between the head word's synonyms and antonyms within the same text considerably clearer.
(3) The method combines mutual information (MI) feature extraction with the inductive matrix completion (IMC) algorithm and obtains higher information fusion accuracy. It has practical value for processing information on hazardous chemical accidents in production.
Drawings
For a better understanding of the present invention, reference is made to the following further description taken in conjunction with the accompanying drawings.
FIG. 1 is a flow chart of steps for establishing a fused media information processing method based on MI-CFM-IMC algorithm;
FIG. 2 is a flow chart of an algorithm for establishing a fused media information processing method based on MI-CFM-IMC algorithm;
FIG. 3 shows a comparative experiment of four groups of information processing methods.
Detailed description of the preferred embodiments
The present invention will be described in further detail below with reference to examples.
In this embodiment, the data for constructing the CFM are taken from Wikipedia, the data for constructing the SAM from Thesaurus, and the data for constructing the SFM from Wikipedia, Wiktionary and online dictionaries. After the relevant data are crawled from these websites, they are preprocessed into a uniform format, and the matrix processing is then carried out. The similarity dictionary S constructed from the data set in the CFM contains 5987 words.
The overall flow of the fused media information processing method based on the MI-CFM-IMC algorithm is shown in the figure, and the specific steps are as follows:
(1) Calculating the mutual information between terms and each category:
A stop word lexicon and a training text set are established. The training texts in the data set are segmented, stop words are filtered out according to the stop word lexicon, the parts of speech of the segmented text are labeled, and the mutual information between each remaining term and each category is calculated:
$$I(U;C)=\sum_{e_t\in\{1,0\}}\sum_{e_c\in\{1,0\}}P(U=e_t,C=e_c)\log_2\frac{P(U=e_t,C=e_c)}{P(U=e_t)\,P(C=e_c)}$$
where U is a term and C is a category, both binary random variables. When a document contains the term, U takes the value $e_t=1$, otherwise $e_t=0$; when the document belongs to category C, C takes the value $e_c=1$, otherwise $e_c=0$.
The mutual information between each term and each category is calculated, and the k terms with the largest values are selected. Words repeated among the categories are deleted, and the feature words are screened out.
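As a non-limiting illustration of the selection in step (1), the sketch below keeps the k highest-scoring terms per category and removes repeats across categories. It assumes "deleting repeated words" means keeping a single copy of each word in the merged feature list; the function name and the nested-dictionary input are hypothetical.

```python
def select_feature_words(mi_scores, k):
    """Pick the top-k terms per category, then drop cross-category repeats.

    mi_scores: {category: {term: mutual information score}}
    Returns a list of feature words in first-seen order.
    """
    selected = []
    seen = set()
    for scores in mi_scores.values():
        # Terms sorted by descending mutual information, truncated to k.
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        for term in top_k:
            if term not in seen:  # keep only the first occurrence
                seen.add(term)
                selected.append(term)
    return selected
```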
(2) Constructing the context feature matrix $M_{SPPMI}$:
The context feature text is preprocessed using the similarity dictionary S formed from the test set, where S contains 5987 words. All context words are labeled for each word in S, and 3770834 texts are finally retained. The calculation is performed according to the following formula:
$$M_{SPPMI}=\left(SPPMI(w_i,c_j)\right)_{|D|\times|D|}$$
In the formula, D is the total number of words, SPPMI is the shifted positive pointwise mutual information matrix (non-negative pointwise mutual information shifted by a negative-sampling term), #(w) is the number of occurrences of each current head word, #(c) is the number of occurrences of a context word, and #(w, c) is the number of occurrences of each word pair (w, c).
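As a non-limiting illustration of how the counts #(w), #(c) and #(w, c) in step (2) might be gathered, the sketch below slides a symmetric window over tokenised sentences. The window size is not stated in the text and is an assumption, as is the choice to take #(w) and #(c) as marginals of the pair counts; all names are hypothetical.

```python
from collections import Counter


def cooccurrence_counts(sentences, vocab, window=2):
    """Count (head word, context word) pairs within a symmetric window.

    sentences: iterable of token lists
    vocab: head words of interest (the similarity dictionary S)
    Returns (word_counts, context_counts, pair_counts).
    """
    vocab = set(vocab)
    pair_counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            if w not in vocab:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    pair_counts[(w, tokens[j])] += 1
    word_counts = Counter()
    context_counts = Counter()
    for (w, c), n in pair_counts.items():
        word_counts[w] += n      # #(w) as the marginal over contexts
        context_counts[c] += n   # #(c) as the marginal over head words
    return word_counts, context_counts, pair_counts
```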
(3) Constructing the synonym and antonym feature matrix $M_{SAM}$:
After related texts are obtained with a web crawler, the JSON data in the texts are parsed for synonyms and antonyms, a list of the synonyms and antonyms of each word in the similarity dictionary is extracted, and the SAM feature matrix is constructed.
$M_{SAM}$ is a matrix of size |S| × |S|. Each element depends on the relation between the row head word and the column head word: it takes one value if they are synonyms, another if they are antonyms, and a third otherwise.
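As a non-limiting illustration of the parsing in step (3), the sketch below extracts synonym and antonym lists from one crawled JSON record using the Python standard library. The record layout, with the keys `word`, `synonyms` and `antonyms`, is purely hypothetical, since the text does not describe the schema of the crawled data.

```python
import json


def parse_thesaurus_record(raw, vocab):
    """Return (head word, synonyms, antonyms), restricted to words in vocab.

    raw: JSON string with the assumed keys 'word', 'synonyms', 'antonyms'
    vocab: set of words in the similarity dictionary S
    """
    record = json.loads(raw)

    def keep(words):
        # Discard related words that are not in the similarity dictionary.
        return [w for w in words if w in vocab]

    return (
        record["word"],
        keep(record.get("synonyms", [])),
        keep(record.get("antonyms", [])),
    )
```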
(4) Constructing the attribute semantic feature matrix $M_{SFM}$:
For each word $s_i$ in the similarity dictionary S, explanatory and descriptive texts of the word are extracted from the related library, and the semantic texts are preprocessed to obtain the ST file; the words in ST are sorted by word frequency and filtered to obtain the dictionary CN.
For each text in ST, the corresponding element of the matrix is set according to whether a word in CN appears in that text, taking one value if it appears and another otherwise. When $s_i$ is the last word in S, construction of the attribute semantic feature matrix is complete.
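As a non-limiting illustration of how the dictionary CN in step (4) might be derived, the sketch below ranks the words of the ST texts by frequency and filters them. The filtering criterion is not given in the text, so the minimum-count threshold and the optional stop word set are assumptions, and all names are hypothetical.

```python
from collections import Counter


def build_cn(st_texts, min_count=2, stop_words=frozenset()):
    """Rank words in the ST texts by frequency and keep the frequent ones.

    st_texts: list of token lists (the preprocessed semantic texts)
    min_count: assumed cut-off below which a word is dropped
    """
    freq = Counter(
        tok for tokens in st_texts for tok in tokens if tok not in stop_words
    )
    # most_common() yields words in descending order of frequency.
    return [word for word, n in freq.most_common() if n >= min_count]
```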
(5) Acquiring the fusion matrix M:
$$M\approx(M_{SAM})^{T}(M_{CFM})^{T}H(M_{SFM})^{T}$$
In the formula, $M_{CFM}$ is the target matrix to be decomposed in the IMC algorithm, and $M_{SAM}$ and $M_{SFM}$ are feature matrices in the IMC algorithm.
To verify the accuracy of the invention in processing fused media information, four groups of comparative information processing experiments were carried out; the results are shown in FIG. 3. As the figure shows, the accuracy of the fused media information processing method based on the MI-CFM-IMC algorithm remains above 90%, so the method achieves high accuracy while remaining stable. The method is therefore effective, provides a better approach to processing fused media information, and has practical value.
Claims (1)
1. A fused media information processing method based on the MI-CFM-IMC algorithm, characterized by: (1) calculating the mutual information between terms and each category; (2) constructing a context feature matrix; (3) constructing a synonym and antonym feature matrix; (4) constructing an attribute semantic feature matrix; (5) acquiring a fusion matrix; the method specifically comprises the following five steps:
Step one: calculating the mutual information between terms and each category;
A stop word lexicon and a training text set are established. The training texts in the data set are segmented, stop words are filtered out according to the stop word lexicon, the parts of speech of the segmented text are labeled, and the mutual information between each remaining term and each category is calculated:
$$I(U;C)=\sum_{e_t\in\{1,0\}}\sum_{e_c\in\{1,0\}}P(U=e_t,C=e_c)\log_2\frac{P(U=e_t,C=e_c)}{P(U=e_t)\,P(C=e_c)}$$
In the formula, U is a term and C is a category; U and C are both binary random variables, I denotes the mutual information, and P denotes probability. When a document contains the term, U takes the value $e_t=1$, otherwise $e_t=0$; when the document belongs to category C, C takes the value $e_c=1$, otherwise $e_c=0$;
The mutual information between each term and each category is calculated, the k terms with the largest values are selected, words repeated among the categories are deleted, and the feature words are screened out;
Step two: constructing the context feature matrix $M_{SPPMI}$;
The context feature text is preprocessed using the similarity dictionary S formed from the test set, all context words of each word in S are labeled, and #(w), #(c) and #(w, c) are calculated:
$$M_{SPPMI}=\left(SPPMI(w_i,c_j)\right)_{|D|\times|D|};$$
In the formula, D is the total number of words, SPPMI is the shifted positive pointwise mutual information matrix (non-negative pointwise mutual information shifted by a negative-sampling term), #(w) is the number of occurrences of each current head word, #(c) is the number of occurrences of a context word, and #(w, c) is the number of occurrences of each word pair (w, c);
Step three: constructing the synonym and antonym feature matrix $M_{SAM}$:
After related texts are obtained with a web crawler, the JSON data in the texts are parsed for synonyms and antonyms, a list of the synonyms and antonyms of each word in the similarity dictionary is extracted, and the SAM feature matrix is constructed.
$M_{SAM}$ is a matrix of size |S| × |S|. The element in row i and column j of $M_{SAM}$ depends on the relation between the row head word and the column head word: it takes one value if they are synonyms, another if they are antonyms, and a third otherwise;
Step four: constructing the attribute semantic feature matrix $M_{SFM}$:
For each word $s_i$ in the similarity dictionary S, explanatory and descriptive texts of the word are extracted from the related library, and the semantic texts are preprocessed to obtain the ST file; the words in ST are sorted by word frequency and filtered to obtain the dictionary CN.
For each text in ST, the corresponding element of the matrix is set according to whether a word in CN appears in that text, taking one value if it appears and another otherwise. When $s_i$ is the last word in S, construction of the attribute semantic feature matrix is complete;
Step five: acquiring the fusion matrix M:
$$M\approx(M_{SAM})^{T}(M_{CFM})^{T}H(M_{SFM})^{T}$$
In the formula, $M_{CFM}$ is the target matrix to be decomposed in the IMC algorithm; $M_{SAM}$ and $M_{SFM}$ are feature matrices in the IMC algorithm; and H denotes a Hermitian matrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011050720.0A CN112052679B (en) | 2020-09-29 | 2020-09-29 | Fused media information processing method based on MI-CFM-IMC algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011050720.0A CN112052679B (en) | 2020-09-29 | 2020-09-29 | Fused media information processing method based on MI-CFM-IMC algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112052679A (en) | 2020-12-08 |
CN112052679B (en) | 2022-08-02 |
Family
ID=73605231
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011050720.0A (Active) CN112052679B (en) | 2020-09-29 | 2020-09-29 | Fused media information processing method based on MI-CFM-IMC algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112052679B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109960786A (en) * | 2019-03-27 | 2019-07-02 | Beijing Information Science and Technology University | Chinese Measurement of word similarity based on convergence strategy |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3777756A1 (en) * | 2012-02-14 | 2021-02-17 | 3Shape A/S | Modeling a digital design of a denture |
Non-Patent Citations (2)
Title |
---|
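Text classification method based on word vectors and term relation extraction; Hou Qinglin; Mobile Communications; 20180715 (No. 07); 20190509 * |
Research on an improved keyword extraction algorithm; Wang Tao et al.; Journal of Chongqing Normal University (Natural Science Edition); 20190509 (No. 03); pp. 103-109 * |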
Also Published As
Publication number | Publication date |
---|---|
CN112052679A (en) | 2020-12-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Almuzaini et al. | Impact of stemming and word embedding on deep learning-based Arabic text categorization | |
CN109446404B (en) | Method and device for analyzing emotion polarity of network public sentiment | |
CN110059185B (en) | Medical document professional vocabulary automatic labeling method | |
CN112101041B (en) | Entity relationship extraction method, device, equipment and medium based on semantic similarity | |
CN109886020A (en) | Software vulnerability automatic classification method based on deep neural network | |
CN109508459B (en) | Method for extracting theme and key information from news | |
CN105975454A (en) | Chinese word segmentation method and device of webpage text | |
CN111753058B (en) | Text viewpoint mining method and system | |
CN111680509A (en) | Method and device for automatically extracting text keywords based on co-occurrence language network | |
CN112364628B (en) | New word recognition method and device, electronic equipment and storage medium | |
CN111143571B (en) | Entity labeling model training method, entity labeling method and device | |
CN112069307B (en) | Legal provision quotation information extraction system | |
CN111859961A (en) | Text keyword extraction method based on improved TopicRank algorithm | |
CN113705237A (en) | Relation extraction method and device fusing relation phrase knowledge and electronic equipment | |
CN115017303A (en) | Method, computing device and medium for enterprise risk assessment based on news text | |
CN113971404A (en) | Cultural relic security named entity identification method based on decoupling attention | |
Uddin et al. | Depression analysis of bangla social media data using gated recurrent neural network | |
CN112667806A (en) | Text classification screening method using LDA | |
Scharkow | Content analysis, automatic | |
Aziz et al. | Evaluating cross domain sentiment analysis using supervised machine learning techniques | |
CN110704638A (en) | Clustering algorithm-based electric power text dictionary construction method | |
CN111538893B (en) | Method for extracting network security new words from unstructured data | |
CN112052679B (en) | Fused media information processing method based on MI-CFM-IMC algorithm | |
CN113157913A (en) | Ethical behavior discrimination method based on social news data set | |
CN115952794A (en) | Chinese-Tai cross-language sensitive information recognition method fusing bilingual sensitive dictionary and heterogeneous graph |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||