CN112052679B - Fused media information processing method based on MI-CFM-IMC algorithm - Google Patents

Fused media information processing method based on MI-CFM-IMC algorithm

Info

Publication number
CN112052679B
Authority
CN
China
Prior art keywords
matrix
word
constructing
words
text
Prior art date
Legal status
Active
Application number
CN202011050720.0A
Other languages
Chinese (zh)
Other versions
CN112052679A (en)
Inventor
胡燕祝
王松
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202011050720.0A priority Critical patent/CN112052679B/en
Publication of CN112052679A publication Critical patent/CN112052679A/en
Application granted granted Critical
Publication of CN112052679B publication Critical patent/CN112052679B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a fused media information processing method based on the MI-CFM-IMC algorithm, a fused media information processing method for hazardous chemical accidents in production safety that belongs to the field of intelligent safety and is characterized by the following steps: (1) calculating mutual information between the terms and each category; (2) constructing a context feature matrix; (3) constructing a synonym-antonym feature matrix; (4) constructing an attribute semantic feature matrix; (5) acquiring a fusion matrix. The method can effectively solve the problem that the semantic association information of rare words cannot be fully expressed for lack of context data, and greatly improves the definition of relations between synonyms and antonyms of the central words within the same text. It combines mutual information (MI) feature extraction with the inductive matrix completion (IMC) algorithm and achieves higher information fusion accuracy, providing a high-accuracy information fusion method for the field of hazardous chemical accident information processing.

Description

Fused media information processing method based on MI-CFM-IMC algorithm
Technical Field
The invention relates to the field of intelligent safety, in particular to a fused media information processing method for dangerous chemical accidents in safety production.
Background
At present, in the field of safe production of hazardous chemicals, processing fused media information mainly means acquiring the key information of fused media text data. The mainstream approach is distributed learning: a large number of keywords are collected from the text information, trained to reveal the interrelations of feature words, and assembled into a matrix. Information processing based on this approach, however, has several problems. When the keyword features of the text information are insufficient, the generated feature words are weakly related and cannot fully express the interrelations of the key information. When antonyms of the central sentence appear in the same text segment, words with opposite meanings are given overly similar semantic associations during processing; and when words whose meanings are similar to the central vocabulary appear in sentences with different semantic attributes, the generated relation matrix tends to assign them an overly distant relation.
To process fused media information, especially text information, and to acquire key information efficiently and in time, the central vocabulary must be accurately distinguished and associated. On this basis, a fused media information processing method based on the MI-CFM-IMC algorithm is proposed: a context feature matrix is constructed for the central words of the text features to discover relations among key information; a synonym-antonym feature matrix is constructed to classify the discovered key information; the objects corresponding to each piece of key information are then located through an attribute semantic feature matrix; finally, the three matrices are fused with IMC. This improves the efficiency and accuracy of fused media information processing, so that workers can identify problems from multiple angles, summarize them comprehensively, and formulate strategies to prevent and resolve accidents.
Disclosure of Invention
In view of the problems in the prior art, the technical problem to be solved by the present invention is to provide a fused media information processing method based on the MI-CFM-IMC algorithm; the specific flow is shown in Fig. 1.
The technical scheme for realizing the aim of the invention comprises the following specific steps:
Step one: calculating mutual information between the terms and each category;
Establish a stop-word lexicon and a training text set, segment the training texts in the data set, filter out stop words with the stop-word lexicon after segmentation, label the part of speech of the segmented text, and calculate the mutual information between the remaining terms and each category:

I(U;C) = Σ_{e_t∈{0,1}} Σ_{e_c∈{0,1}} P(U=e_t, C=e_c) log2 [ P(U=e_t, C=e_c) / (P(U=e_t)·P(C=e_c)) ]

In the formula, U is a term and C is a category; U and C are binary random variables, I(U;C) denotes their mutual information, and P denotes probability. When a document contains the term, U takes the value e_t = 1, otherwise e_t = 0; when the document belongs to category C, C takes the value e_c = 1, otherwise e_c = 0;
Calculate the mutual information between each term and each category, select the k terms with the largest values, delete words repeated across the categories, and screen out the feature words;
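A minimal Python sketch of this feature-selection step is given below, assuming documents arrive as (token-set, label) pairs; the helper names, the per-category top-k selection, and the reading of "delete repeated words" as simple deduplication are illustrative assumptions, not the patent's exact procedure.

```python
import math
from collections import Counter

def mutual_information(docs, term, category):
    """Estimate I(U;C) for one term and one category from labelled documents.

    docs: list of (token_set, label) pairs. Counts follow the binary events
    above: e_t = 1 when the document contains the term, e_c = 1 when the
    document belongs to the category.
    """
    n = len(docs)
    counts = Counter()
    for tokens, label in docs:
        counts[(int(term in tokens), int(label == category))] += 1

    mi = 0.0
    for e_t in (0, 1):
        for e_c in (0, 1):
            n_tc = counts[(e_t, e_c)]
            if n_tc == 0:
                continue  # empty cells contribute nothing
            p_tc = n_tc / n
            p_t = (counts[(e_t, 0)] + counts[(e_t, 1)]) / n
            p_c = (counts[(0, e_c)] + counts[(1, e_c)]) / n
            mi += p_tc * math.log2(p_tc / (p_t * p_c))
    return mi

def select_features(docs, vocabulary, categories, k):
    """Keep the k highest-MI terms per category, then drop words that repeat
    across categories (read here as simple deduplication)."""
    seen, features = set(), []
    for c in categories:
        ranked = sorted(vocabulary,
                        key=lambda t: mutual_information(docs, t, c),
                        reverse=True)[:k]
        for t in ranked:
            if t not in seen:
                seen.add(t)
                features.append(t)
    return features
```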
Step two: constructing the context feature matrix M_SPPMI;
Preprocess the context feature text with the similarity dictionary S formed from the test set, label all context words of each word in S, and calculate #(w), #(c) and #(w,c):

M_SPPMI = (SPPMI(w_i, c_j))_{|D|×|D|}

In the formula, |D| is the total number of words, SPPMI is the non-negative pointwise mutual information matrix with negative sampling, #(w) is the number of occurrences of each current central word, #(c) is the number of occurrences of a context word, and #(w,c) is the number of occurrences of each word pair (w,c);
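A hedged sketch of how such a matrix can be assembled from windowed co-occurrence counts is shown below; the shifted-PPMI formula SPPMI(w,c) = max(PMI(w,c) − log k, 0) with PMI(w,c) = log(#(w,c)·N / (#(w)·#(c))) follows the common negative-sampling formulation and is assumed here, as are the window size and function names.

```python
import numpy as np
from collections import Counter

def build_sppmi(corpus, vocab, window=2, shift_k=1):
    """Assemble the |D| x |D| SPPMI matrix from windowed co-occurrence counts.

    corpus: list of token lists; vocab: the word list D.
    Assumed formula: SPPMI(w, c) = max(PMI(w, c) - log(shift_k), 0), with
    PMI(w, c) = log(#(w, c) * N / (#(w) * #(c))) and N the total pair count.
    """
    idx = {w: i for i, w in enumerate(vocab)}
    pair_counts, w_counts, c_counts = Counter(), Counter(), Counter()
    for sentence in corpus:
        for i, w in enumerate(sentence):
            if w not in idx:
                continue
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                c = sentence[j]
                if j == i or c not in idx:
                    continue
                pair_counts[(w, c)] += 1   # #(w, c)
                w_counts[w] += 1           # #(w)
                c_counts[c] += 1           # #(c)

    n_pairs = sum(pair_counts.values())
    m = np.zeros((len(vocab), len(vocab)))
    for (w, c), n_wc in pair_counts.items():
        pmi = np.log(n_wc * n_pairs / (w_counts[w] * c_counts[c]))
        m[idx[w], idx[c]] = max(pmi - np.log(shift_k), 0.0)
    return m
```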
Step three: constructing the synonym-antonym feature matrix M_SAM;
After obtaining the related texts with a web crawler, parse the JSON data whose text lines are synonyms and antonyms, extract the synonym and antonym list of each word in the similarity dictionary, and construct the SAM feature matrix:

M_SAM = (M^SAM_{ij})_{|S|×|S|}

In the formula, the matrix M_SAM has size |S|×|S| and M^SAM_{ij} denotes the element in row i, column j of M_SAM; if the row-head word and the column-head word are synonyms, M^SAM_{ij} = 1; if their relation is antonymy, M^SAM_{ij} = −1; otherwise M^SAM_{ij} = 0;
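A small Python sketch of this construction follows, assuming the crawled synonym and antonym lists have already been parsed into dictionaries; the 1 / −1 / 0 encoding mirrors the piecewise definition above.

```python
import numpy as np

def build_sam(vocab, synonyms, antonyms):
    """Build the |S| x |S| synonym-antonym matrix M_SAM.

    synonyms / antonyms: dicts mapping each word to a set of related words,
    e.g. parsed from the crawled JSON lists; the 1 / -1 / 0 encoding mirrors
    the piecewise definition in the text.
    """
    idx = {w: i for i, w in enumerate(vocab)}
    m = np.zeros((len(vocab), len(vocab)))
    for w in vocab:
        for s in synonyms.get(w, ()):
            if s in idx:
                m[idx[w], idx[s]] = 1.0    # synonym pair
        for a in antonyms.get(w, ()):
            if a in idx:
                m[idx[w], idx[a]] = -1.0   # antonym pair
    return m
```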
Step four: constructing the attribute semantic feature matrix M_SFM;
For each word s_i in the similarity dictionary S, extract the explanation and description texts of the word from the related corpus, and preprocess these semantic texts to obtain the ST file; sort the words in ST by word frequency and filter to obtain the dictionary CN:

M_SFM = (M^SFM_{ij}), where i indexes the words of S and j indexes the words of CN

In the formula, for each text in ST, if the word name in CN appears in that text, set M^SFM_{ij} = 1; otherwise set M^SFM_{ij} = 0; repeat until s_i is the last word in S, at which point the attribute semantic feature matrix has been constructed;
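The sketch below illustrates one way to build such an attribute matrix, assuming the per-word description texts have already been extracted and tokenized; the frequency cut-off used to form CN and the 1/0 encoding are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def build_sfm(vocab, descriptions, top_n=1000, stopwords=frozenset()):
    """Build the attribute semantic matrix M_SFM.

    descriptions: dict mapping each word s_i in S to its preprocessed
    explanation text as a token list (the ST file). The attribute dictionary
    CN is taken here as the top_n most frequent non-stopword tokens; top_n
    and the 1/0 encoding are illustrative choices.
    """
    freq = Counter(t for tokens in descriptions.values()
                   for t in tokens if t not in stopwords)
    cn = [w for w, _ in freq.most_common(top_n)]
    cn_idx = {w: j for j, w in enumerate(cn)}

    m = np.zeros((len(vocab), len(cn)))
    for i, s in enumerate(vocab):
        for t in set(descriptions.get(s, ())):
            j = cn_idx.get(t)
            if j is not None:
                m[i, j] = 1.0  # attribute word appears in the description of s_i
    return m, cn
```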
Step five: acquiring the fusion matrix M:

M ≈ (M_SAM)^T (M_CFM)^T H (M_SFM)^T

In the formula, M_CFM is the target matrix to be decomposed in the IMC algorithm; M_SAM and M_SFM are feature matrices in the IMC algorithm; and H is a Hermitian matrix.
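Once H is available, the fusion itself is a chain of matrix products; a one-line NumPy sketch under the assumption that the matrix shapes are conformable (no IMC optimization is performed here):

```python
import numpy as np

def fuse(m_sam, m_cfm, m_sfm, h):
    """Apply the fusion formula M ≈ (M_SAM)^T (M_CFM)^T H (M_SFM)^T for a
    given coupling matrix H; shapes are assumed conformable, and solving
    the IMC problem for H is left to a separate step."""
    return m_sam.T @ m_cfm.T @ h @ m_sfm.T
```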
Compared with the prior art, the invention has the advantages that:
(1) The present invention overcomes the weak inter-information relationships of distributed learning and can effectively improve the relevance between pieces of fused media information.
(2) The method can effectively solve the problem that the semantic association information of rare words cannot be fully expressed for lack of context data; it first determines the opposing relation between synonyms and antonyms of the headword, greatly improving the definition of relations between synonyms and antonyms of the headword within the same text.
(3) The method combines mutual information (MI) feature extraction with the inductive matrix completion (IMC) algorithm and achieves higher information fusion accuracy. It has practical value for processing hazardous chemical accident information in production.
Drawings
For a better understanding of the present invention, reference is made to the following further description taken in conjunction with the accompanying drawings.
FIG. 1 is a flow chart of steps for establishing a fused media information processing method based on MI-CFM-IMC algorithm;
FIG. 2 is a flow chart of an algorithm for establishing a fused media information processing method based on MI-CFM-IMC algorithm;
FIG. 3 is a comparison experiment of four groups of information processing methods.
Detailed description of the preferred embodiments
The present invention will be described in further detail below with reference to examples.
In this implementation case, the data for constructing the CFM come from Wikipedia, the data for constructing the SAM come from Thesaurus, and the data for constructing the SFM come from Wikipedia, the Wiktionary dictionary, and online dictionaries. After the relevant data are crawled from the corresponding websites, they are preprocessed into a uniform format, and the matrix-related processing is then carried out. The similarity dictionary S constructed from the CFM data set has size 5987.
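A hedged sketch of the kind of normalization and dictionary-building this preprocessing implies is given below; the regular expressions, the frequency threshold, and the function names are assumptions, not the patent's pipeline.

```python
import re
from collections import Counter

def normalize(text):
    """Unify the format of crawled text: strip residual HTML tags,
    lowercase, drop punctuation and tokenize on whitespace (an assumed
    preprocessing scheme, not the patent's exact one)."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return text.split()

def build_similarity_dictionary(token_lists, min_count=5):
    """Collect the similarity dictionary S from the preprocessed corpus.
    min_count is an assumed frequency filter; the patent reports |S| = 5987
    for its CFM data set."""
    freq = Counter(t for tokens in token_lists for t in tokens)
    return sorted(w for w, c in freq.items() if c >= min_count)
```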
The overall flow of the fused media information processing method based on the MI-CFM-IMC algorithm is shown in the drawings, and the specific steps are as follows:
(1) Calculating mutual information between terms and each category:
Establish a stop-word lexicon and a training text set, segment the training texts in the data set, filter out stop words with the stop-word lexicon after segmentation, label the part of speech of the segmented text, and calculate the mutual information between the remaining terms and each category:

I(U;C) = Σ_{e_t∈{0,1}} Σ_{e_c∈{0,1}} P(U=e_t, C=e_c) log2 [ P(U=e_t, C=e_c) / (P(U=e_t)·P(C=e_c)) ]

where U is a term and C is a category, and U and C are binary random variables; when a document contains the term, U takes the value e_t = 1, otherwise e_t = 0; when the document belongs to category C, C takes the value e_c = 1, otherwise e_c = 0.
Calculate the mutual information between each term and each category, select the k terms with the largest values, delete words repeated across all categories, and screen out the feature words.
(2) Constructing the context feature matrix M_SPPMI:
Preprocess the context feature text with the similarity dictionary S formed from the test set, where |S| = 5987; label all context words of each word in S (3,770,834 texts are retained in the end), and compute the matrix according to the following formula:

M_SPPMI = (SPPMI(w_i, c_j))_{|D|×|D|}

where |D| is the total number of words, SPPMI is the non-negative pointwise mutual information matrix with negative sampling, #(w) is the number of occurrences of each current central word, #(c) is the number of occurrences of a context word, and #(w,c) is the number of occurrences of each word pair (w,c).
(3) Constructing the synonym-antonym feature matrix M_SAM:
After obtaining the related texts with a web crawler, parse the JSON data whose text lines are synonyms and antonyms, extract the synonym and antonym list of each word in the similarity dictionary, and construct the SAM feature matrix:

M_SAM = (M^SAM_{ij})_{|S|×|S|}

where the matrix M_SAM has size |S|×|S|; if the row-head word and the column-head word are synonyms, M^SAM_{ij} = 1; if their relation is antonymy, M^SAM_{ij} = −1; otherwise M^SAM_{ij} = 0.
(4) Constructing the attribute semantic feature matrix M_SFM:
For each word s_i in the similarity dictionary S, extract the explanation and description texts of the word from the related corpus, and preprocess these semantic texts to obtain the ST file; sort the words in ST by word frequency and filter to obtain the dictionary CN:

M_SFM = (M^SFM_{ij}), where i indexes the words of S and j indexes the words of CN

where, for each text in ST, if the word name in CN appears in that text, set M^SFM_{ij} = 1; otherwise set M^SFM_{ij} = 0; repeat until s_i is the last word in S, at which point the attribute semantic feature matrix has been constructed.
(5) Acquiring the fusion matrix M:

M ≈ (M_SAM)^T (M_CFM)^T H (M_SFM)^T

where M_CFM is the target matrix to be decomposed in the IMC algorithm, and M_SAM and M_SFM are feature matrices in the IMC algorithm.
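The fusion formula requires a coupling matrix H; in inductive matrix completion it is typically recovered by minimizing the reconstruction error over the observed entries of the target matrix. The sketch below is a hedged least-squares illustration of that idea, where X and Y stand for the combined feature matrices on the two sides of H in the fusion formula; the squared loss, the observed-entry mask, plain gradient descent, and the regularizer are all illustrative choices rather than the patent's training procedure.

```python
import numpy as np

def fit_h(m_obs, mask, x, y, lr=1e-3, reg=0.1, iters=500):
    """Recover a coupling matrix H by minimizing
    ||mask * (X H Y^T - M_obs)||_F^2 + reg * ||H||_F^2
    with plain gradient descent over the observed entries."""
    h = np.zeros((x.shape[1], y.shape[1]))
    for _ in range(iters):
        residual = mask * (x @ h @ y.T - m_obs)  # error on observed entries only
        grad = x.T @ residual @ y + reg * h      # gradient of the masked loss
        h -= lr * grad
    return h
```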
To verify the accuracy of the invention in processing fused media information, four groups of information processing comparison experiments were performed; the results are shown in Fig. 3. As the figure shows, the fused media information processing method based on the MI-CFM-IMC algorithm keeps its accuracy above 90% and achieves high accuracy while remaining stable, with good overall effect. The method is therefore effective, offers a better approach to processing fused media information, and has practical value.

Claims (1)

1. A fused media information processing method based on the MI-CFM-IMC algorithm, characterized in that it comprises: (1) calculating mutual information between the terms and each category; (2) constructing a context feature matrix; (3) constructing a synonym-antonym feature matrix; (4) constructing an attribute semantic feature matrix; (5) acquiring a fusion matrix; the method specifically comprises the following five steps:
Step one: calculating mutual information between the terms and each category;
Establish a stop-word lexicon and a training text set, segment the training texts in the data set, filter out stop words with the stop-word lexicon after segmentation, label the part of speech of the segmented text, and calculate the mutual information between the remaining terms and each category:

I(U;C) = Σ_{e_t∈{0,1}} Σ_{e_c∈{0,1}} P(U=e_t, C=e_c) log2 [ P(U=e_t, C=e_c) / (P(U=e_t)·P(C=e_c)) ]

In the formula, U is a term and C is a category; U and C are binary random variables, I(U;C) denotes their mutual information, and P denotes probability; when a document contains the term, U takes the value e_t = 1, otherwise e_t = 0; when the document belongs to category C, C takes the value e_c = 1, otherwise e_c = 0;
Calculate the mutual information between each term and each category, select the k terms with the largest values, delete words repeated across the categories, and screen out the feature words;
Step two: constructing the context feature matrix M_SPPMI;
Preprocess the context feature text with the similarity dictionary S formed from the test set, label all context words of each word in S, and calculate #(w), #(c) and #(w,c):

M_SPPMI = (SPPMI(w_i, c_j))_{|D|×|D|}

In the formula, |D| is the total number of words, SPPMI is the non-negative pointwise mutual information matrix with negative sampling, #(w) is the number of occurrences of each current central word, #(c) is the number of occurrences of a context word, and #(w,c) is the number of occurrences of each word pair (w,c);
Step three: constructing the synonym-antonym feature matrix M_SAM;
After obtaining the related texts with a web crawler, parse the JSON data whose text lines are synonyms and antonyms, extract the synonym and antonym list of each word in the similarity dictionary, and construct the SAM feature matrix:

M_SAM = (M^SAM_{ij})_{|S|×|S|}

In the formula, the matrix M_SAM has size |S|×|S| and M^SAM_{ij} denotes the element in row i, column j of M_SAM; if the row-head word and the column-head word are synonyms, M^SAM_{ij} = 1; if their relation is antonymy, M^SAM_{ij} = −1; otherwise M^SAM_{ij} = 0;
Step four: constructing the attribute semantic feature matrix M_SFM;
For each word s_i in the similarity dictionary S, extract the explanation and description texts of the word from the related corpus, and preprocess these semantic texts to obtain the ST file; sort the words in ST by word frequency and filter to obtain the dictionary CN:

M_SFM = (M^SFM_{ij}), where i indexes the words of S and j indexes the words of CN

In the formula, for each text in ST, if the word name in CN appears in that text, set M^SFM_{ij} = 1; otherwise set M^SFM_{ij} = 0; repeat until s_i is the last word in S, at which point the attribute semantic feature matrix has been constructed;
Step five: acquiring the fusion matrix M:

M ≈ (M_SAM)^T (M_CFM)^T H (M_SFM)^T

In the formula, M_CFM is the target matrix to be decomposed in the IMC algorithm; M_SAM and M_SFM are feature matrices in the IMC algorithm; and H is a Hermitian matrix.
CN202011050720.0A 2020-09-29 2020-09-29 Fused media information processing method based on MI-CFM-IMC algorithm Active CN112052679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011050720.0A CN112052679B (en) 2020-09-29 2020-09-29 Fused media information processing method based on MI-CFM-IMC algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011050720.0A CN112052679B (en) 2020-09-29 2020-09-29 Fused media information processing method based on MI-CFM-IMC algorithm

Publications (2)

Publication Number Publication Date
CN112052679A CN112052679A (en) 2020-12-08
CN112052679B (en) 2022-08-02

Family

ID=73605231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011050720.0A Active CN112052679B (en) 2020-09-29 2020-09-29 Fused media information processing method based on MI-CFM-IMC algorithm

Country Status (1)

Country Link
CN (1) CN112052679B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109960786A (en) * 2019-03-27 2019-07-02 北京信息科技大学 Chinese Measurement of word similarity based on convergence strategy

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3777756A1 (en) * 2012-02-14 2021-02-17 3Shape A/S Modeling a digital design of a denture

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109960786A (en) * 2019-03-27 2019-07-02 北京信息科技大学 Chinese Measurement of word similarity based on convergence strategy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Text classification method based on word vectors and a term relation extraction method; Hou Qinglin; Mobile Communications; 20180715 (Issue 07); 20190509 *
Research on an improved keyword extraction algorithm; Wang Tao et al.; Journal of Chongqing Normal University (Natural Science Edition); 20190509 (Issue 03); pp. 103-109 *

Also Published As

Publication number Publication date
CN112052679A (en) 2020-12-08

Similar Documents

Publication Publication Date Title
Almuzaini et al. Impact of stemming and word embedding on deep learning-based Arabic text categorization
CN109446404B (en) Method and device for analyzing emotion polarity of network public sentiment
CN110059185B (en) Medical document professional vocabulary automatic labeling method
CN112101041B (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN109886020A (en) Software vulnerability automatic classification method based on deep neural network
CN109508459B (en) Method for extracting theme and key information from news
CN105975454A (en) Chinese word segmentation method and device of webpage text
CN111753058B (en) Text viewpoint mining method and system
CN111680509A (en) Method and device for automatically extracting text keywords based on co-occurrence language network
CN112364628B (en) New word recognition method and device, electronic equipment and storage medium
CN111143571B (en) Entity labeling model training method, entity labeling method and device
CN112069307B (en) Legal provision quotation information extraction system
CN111859961A (en) Text keyword extraction method based on improved TopicRank algorithm
CN113705237A (en) Relation extraction method and device fusing relation phrase knowledge and electronic equipment
CN115017303A (en) Method, computing device and medium for enterprise risk assessment based on news text
CN113971404A (en) Cultural relic security named entity identification method based on decoupling attention
Uddin et al. Depression analysis of bangla social media data using gated recurrent neural network
CN112667806A (en) Text classification screening method using LDA
Scharkow Content analysis, automatic
Aziz et al. Evaluating cross domain sentiment analysis using supervised machine learning techniques
CN110704638A (en) Clustering algorithm-based electric power text dictionary construction method
CN111538893B (en) Method for extracting network security new words from unstructured data
CN112052679B (en) Fused media information processing method based on MI-CFM-IMC algorithm
CN113157913A (en) Ethical behavior discrimination method based on social news data set
CN115952794A (en) Chinese-Tai cross-language sensitive information recognition method fusing bilingual sensitive dictionary and heterogeneous graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant