CN107451120B - Content conflict detection method and system for open text information - Google Patents
- Publication number
- CN107451120B (application CN201710646040.7A)
- Authority
- CN
- China
- Prior art keywords
- keyword
- text
- occurrence
- data set
- component
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/041—Abduction
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a content conflict detection method and system for public text intelligence. The method comprises the following steps: establishing a public text intelligence data set; extracting keywords and constructing a keyword co-occurrence matrix; binarizing the keyword co-occurrence matrix to establish a keyword co-occurrence network; and extracting the components of the keyword co-occurrence network to obtain a component data set. The method detects and judges the content of public text intelligence directly by correlation analysis, without structurally describing and storing the public text data. This reduces the amount of computation, overcomes the poor conflict detection accuracy that results when updates to a structured knowledge base cannot keep pace with highly real-time big-data public text intelligence, and enables content conflict detection for public text intelligence with big-data characteristics.
Description
Technical Field
The invention relates to the field of public text intelligence applications, and in particular to a method and a system for detecting content conflicts in public text intelligence.
Background
Public intelligence, also known as open source intelligence, refers to the intelligence collected and mined from public media (such as newspapers/publications, the internet, self-media platforms, etc.), and the intelligence content is mainly unstructured data, including numbers, texts, pictures, videos, etc.
Public text intelligence refers to information data in text format collected and mined from public media (e.g., newspapers/publications, the internet, self-media platforms, etc.).
Content conflicts refer to situations where descriptions of the same subject matter feature are inconsistent or contradictory in the same problem context.
Public text intelligence offers low acquisition cost, broad data source channels, and good real-time performance, and is valuable in fields such as military intelligence support and the study and assessment of enterprise competitive strategy. Meanwhile, with the progress of self-media technology and the popularization of the internet, public text intelligence has taken on the characteristics of big data: the data volume grows at a remarkable speed, the data come from multiple sources, and the propagation process is multi-channel, parallel, and intricate. Such a mass of public text intelligence inevitably contains conflicting content, which makes it difficult to analyze and use; deliberate misdirection by potential competitors further exacerbates the problem. Thus, the first step in applying public text intelligence efficiently and accurately is to detect and discover conflicting content.
The conflict content is a key factor for restricting the quality of the public text intelligence data, and if the potential conflict content cannot be timely and effectively detected, discovered and eliminated, the analysis result of the public text intelligence big data is unreliable, and the application value of the public text intelligence big data is reduced. Currently, content conflict detection for text data is mainly oriented to small-scale and medium-scale data, and is mainly applied to detecting and discovering metadata or structured data conflicts.
For example, the 20th Research Institute of China Electronics Technology Group Corporation proposed a method for detecting conflicts between instruction contents in a network management and control system, the method comprising the following steps:
1. counting mutually exclusive instructions of the network management and control system in content;
2. establishing a plurality of mutually exclusive instruction sets, wherein all instructions in each mutually exclusive instruction set are mutually exclusive;
3. setting an instruction interval time threshold t;
4. recording instructions received by the same equipment in a time period with the interval time of t, wherein if 2 or more instructions exist in the same exclusive instruction set, the instruction content conflict occurs; otherwise, there is no instruction content conflict.
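The four steps above can be sketched as follows. This is only an illustration of the prior-art idea: the function name, the `(timestamp, instruction)` log layout, and the reading of "2 or more instructions" as two or more distinct instructions are assumptions, not details from the cited method.

```python
from typing import Iterable, List, Set, Tuple


def has_instruction_conflict(
    received: Iterable[Tuple[float, str]],
    exclusive_sets: List[Set[str]],
    t: float,
) -> bool:
    """Return True if, within some window of length t, two or more distinct
    instructions received by one device fall in the same exclusive set."""
    events = sorted(received)  # (timestamp, instruction), ordered by time
    for i, (start, first) in enumerate(events):
        # Collect every instruction received no later than t after event i.
        window = {first}
        for timestamp, name in events[i + 1:]:
            if timestamp - start > t:
                break
            window.add(name)
        # A conflict exists if the window hits one exclusive set twice.
        if any(len(window & group) >= 2 for group in exclusive_sets):
            return True
    return False


# Hypothetical instruction log for a single device.
log = [(0.0, "open_port"), (0.4, "close_port"), (5.0, "reboot")]
groups = [{"open_port", "close_port"}]
print(has_instruction_conflict(log, groups, t=1.0))
```

Note that the exclusive sets must be enumerated in advance, which is the rigidity the description criticizes later.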
For another example, Zhao Xiaofei and Huang Zhiqiu proposed a description-logic-based method for detecting conflicts in CWM (Common Warehouse Metamodel) metadata, which includes the following steps:
1. establishing a description logic $DL_{id}$ that supports identity constraints on concepts;
2. formalizing the CWM metadata with the description logic $DL_{id}$ to establish a $DL_{id}$ knowledge base;
3. defining a set of requirements for a description logic query language;
4. according to the requirements for the description logic query language, establishing the query language with the following format:
5. querying the $DL_{id}$ knowledge base using nRQL to find content conflicts.
Existing content conflict detection methods for text data are mainly oriented to small- and medium-scale data and share two features: (1) the key first step is to structurally describe and store the text data; (2) an inference mechanism for conflict detection, such as a mutually exclusive instruction set or a conflict query language, is built on the structured knowledge base, and content conflicts are then detected with it. For public text intelligence with big-data characteristics, these methods have the following defects: (1) the workload of structurally describing and storing the data is enormous and very difficult to carry out; (2) a conflict detection inference mechanism built on a structured knowledge base is rigid and inflexible, and because public text intelligence big data is highly real-time, an established mechanism easily fails to fit new problem situations; (3) the detected conflicts are at the micro level, that is, conflicts among a small number of texts (usually 2), so content conflicts at the level of the whole large data set are difficult to present. The existing methods therefore cannot detect content conflicts in public text intelligence with big-data characteristics.
Disclosure of Invention
The invention aims to provide a method and a system for detecting content conflict of public text information to realize the detection of the content conflict of the public text information with the characteristic of big data.
In order to achieve the purpose, the invention provides the following scheme:
a content conflict detection method of open text information comprises the following steps:
acquiring public text information, and establishing a public text information data set, wherein the public text information data set comprises a plurality of texts;
extracting keywords of each text in the public text information data set, and constructing a keyword co-occurrence matrix;
carrying out binarization processing on the keyword co-occurrence matrix to obtain a binarization keyword co-occurrence matrix;
establishing a keyword co-occurrence network according to the binarization keyword co-occurrence matrix;
extracting components in the keyword co-occurrence network to obtain a component data set;
judging each component in the component data set, and judging whether content conflict exists in the corresponding component; and when the judgment result shows that the corresponding components have content conflict, determining the text with conflict in the public text intelligence data set according to the components with content conflict.
Optionally, the extracting keywords of each text in the public text intelligence data set, and constructing a keyword co-occurrence matrix specifically include:
segmenting words of each text in the public text intelligence data set to obtain an entry set of the text;
calculating the expectation of the cross information entropy of each entry in the entry set of the text;
sorting the entries in the entry set of the text in descending order of the expectation of their cross information entropy;
extracting the first k entries in the ordered entry set as keywords of the text;
establishing a keyword set according to the keywords of each text in the public text intelligence data set;
counting the times of common occurrence of any two keywords in the same text in the keyword set;
and establishing a keyword co-occurrence matrix according to the co-occurrence times of every two keywords in the same text.
Optionally, the keyword co-occurrence matrix is binarized to obtain a binarized keyword co-occurrence matrix, which specifically includes:
replacing elements which are larger than or equal to a set threshold value in the keyword co-occurrence matrix with 1;
and replacing elements smaller than the set threshold value in the keyword co-occurrence matrix with 0.
Optionally, extracting components in the keyword co-occurrence network to obtain a component data set, specifically including:
extracting components in the keyword co-occurrence network according to the principle that co-occurrence exists among keywords in the same component and no co-occurrence exists among keywords in different components;
and combining all the components in the extracted keyword co-occurrence network into a component data set.
Optionally, each component in the component data set is determined, and whether a content conflict exists in the corresponding component is determined; and when the judgment result shows that the corresponding components have content conflict, determining the text with conflict in the public text intelligence data set according to the components with content conflict, which specifically comprises the following steps:
judging each component in the component data set, and judging whether content semantic conflict exists in the corresponding component;
and when the judgment result shows that the corresponding component has content conflict, searching the corresponding text in the public text information data set according to the keywords with content semantic conflict in the component, and determining the text with conflict in the public text information data set.
A system for detecting content conflicts in published textual intelligence, comprising:
the public text information data set establishing module is used for acquiring public text information and establishing a public text information data set;
the keyword co-occurrence matrix construction module is used for extracting keywords of each text in the public text information data set and constructing a keyword co-occurrence matrix;
the binarization processing module is used for carrying out binarization processing on the keyword co-occurrence matrix to obtain a binarization keyword co-occurrence matrix;
the keyword co-occurrence network establishing module is used for establishing a keyword co-occurrence network according to the binarization keyword co-occurrence matrix;
the component extraction module is used for extracting components in the keyword co-occurrence network to obtain a component data set;
the conflict judging module is used for judging each component in the component data set and judging whether content conflicts exist in the corresponding components; and when the judgment result shows that the corresponding components have content conflict, determining the text with conflict in the public text intelligence data set according to the components with content conflict.
Optionally, the keyword co-occurrence matrix constructing module specifically includes:
the entry division submodule is used for carrying out word segmentation on each text in the public text intelligence data set to obtain an entry set of the text;
the expectation calculation submodule is used for calculating the expectation of the cross information entropy of each entry in the entry set of the text;
the sequencing submodule is used for sorting the entries in the entry set of the text in descending order of the expectation of their cross information entropy;
the keyword extraction submodule is used for extracting the first k entries in the ordered entry set as keywords of the text;
the keyword set establishing sub-module is used for establishing a keyword set according to the keywords of each text in the public text intelligence data set;
the co-occurrence frequency counting submodule is used for counting the co-occurrence frequency of any two keywords in the keyword set in the same text;
and the keyword co-occurrence matrix establishing submodule is used for establishing the keyword co-occurrence matrix according to the co-occurrence times of any two keywords in the same text.
Optionally, the binarization processing module specifically includes:
a 1 setting sub-module, configured to replace an element, which is greater than or equal to a set threshold, in the keyword co-occurrence matrix with 1;
and the 0 setting sub-module is used for replacing the elements smaller than the set threshold value in the keyword co-occurrence matrix with 0.
Optionally, the component extraction module specifically includes:
the component extraction submodule is used for extracting components in the keyword co-occurrence network according to the principle that co-occurrence exists among the keywords in the same component and the co-occurrence does not exist among the keywords in different components;
and the component data set establishing submodule is used for combining all the components in the extracted keyword co-occurrence network into a component data set.
Optionally, the conflict judgment module specifically includes:
the conflict judgment submodule is used for judging each component in the component data set and judging whether content semantic conflicts exist in the corresponding components;
and the conflict content determining submodule is used for searching the corresponding text in the public text information data set according to the keywords with content semantic conflict in the component and determining the text with conflict in the public text information data set when the judgment result shows that the content conflict exists in the corresponding component.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
The invention discloses a content conflict detection method and system for public text intelligence. First, public text intelligence is acquired and a public text intelligence data set is established. Next, keywords are extracted from each text in the data set and a keyword co-occurrence matrix is constructed; the matrix is binarized to obtain a binarized keyword co-occurrence matrix. A keyword co-occurrence network is then established from the binarized matrix, and its components are extracted to obtain a component data set. Finally, each component in the component data set is judged to determine whether a content conflict exists and, if so, which content conflicts. The method detects and judges the content of public text intelligence directly by correlation analysis, requiring neither a structured knowledge base nor structured description and storage of the public text data. This reduces the amount of computation, overcomes the poor conflict detection accuracy that results when knowledge base updates cannot keep pace with highly real-time big-data public text intelligence, and enables content conflict detection for public text intelligence with big-data characteristics.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a content conflict detection method for public text intelligence provided by the present invention.
Fig. 2 is a block diagram of a system for detecting content conflict of public text intelligence according to the present invention.
Detailed Description
The invention aims to provide a method and a system for detecting content conflict of public text information, so as to realize the detection of the content conflict of the public text information with the characteristic of big data.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the present invention provides a content conflict detection method for public text intelligence, which comprises the following steps:
Step 101, acquiring public text intelligence and establishing a public text intelligence data set, wherein the data set comprises a plurality of texts;
Step 102, extracting keywords of each text in the public text intelligence data set, and constructing a keyword co-occurrence matrix;
Step 103, performing binarization processing on the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix;
Step 104, establishing a keyword co-occurrence network according to the binarized keyword co-occurrence matrix; specifically, the keywords corresponding to the elements with a value of 1 in the binarized keyword co-occurrence matrix are connected to obtain the keyword co-occurrence network;
Step 105, extracting components in the keyword co-occurrence network to obtain a component data set;
Step 106, judging each component in the component data set to determine whether a content conflict exists in it; and when a content conflict exists in a component, determining the conflicting texts in the public text intelligence data set according to that component.
Optionally, the extracting a keyword of each text in the public text intelligence data set in step 102, and constructing a keyword co-occurrence matrix specifically includes:
segmenting words of each text in the public text intelligence data set to obtain an entry set of the text; specifically, word segmentation is performed on the m-th text $t_m$ to obtain the entry set $\{w_{l_m}\}$ of the text, where $w_{l_m}$ represents the $l_m$-th entry of the m-th text $t_m$, $l_m = 1, 2, \dots, L_m$, and $L_m$ represents the total number of entries in the m-th text $t_m$. For example, segmenting the text "Nanjing City Changjiang river bridge" yields { Nanjing, City Changjiang, Jiang bridge, Nanjing City, Changjiang river, Daqian, Changjiang river bridge }.
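The seven-entry example suggests that segmentation enumerates every dictionary word found in the text, including overlapping ones. A minimal sketch of that behavior follows. The source does not name a segmenter, so the enumeration strategy is an assumption. The Chinese string and the toy dictionary are also assumptions, chosen to mirror the translated example, and `full_segmentation` is a hypothetical helper.

```python
from typing import List, Set


def full_segmentation(text: str, dictionary: Set[str]) -> List[str]:
    """Return every dictionary word that occurs as a substring of text,
    ordered by start position and then by length (overlaps are kept)."""
    entries = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in dictionary:
                entries.append(piece)
    return entries


# Toy dictionary (assumption): seven words that overlap inside the text.
toy_dictionary = {"南京", "市长", "江大桥", "南京市", "长江", "大桥", "长江大桥"}
entries = full_segmentation("南京市长江大桥", toy_dictionary)
print(len(entries), entries)
```

The number of entries returned corresponds to $L_m$ for this text.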
Calculating the expectation of the cross information entropy of each entry in the entry set of the text;
wherein $P(c_i \mid w_{l_m})$ indicates the probability that text $t_m$ belongs to class $c_i$ when entry $w_{l_m}$ occurs, and $P(c_i)$ represents the probability distribution of the category of text $t_m$. The expectation of the cross information entropy reflects the distance between the category probability distribution of text $t_m$ and the category probability distribution given that entry $w_{l_m}$ occurs: the greater the value, the greater the influence of entry $w_{l_m}$ on the class distribution of text $t_m$.
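The formula that this clause explains did not survive in this text. The sketch below assumes the widely used expected cross entropy form $P(w)\sum_i P(c_i \mid w)\log\frac{P(c_i \mid w)}{P(c_i)}$, which may differ from the patent's own formula. It also assumes the probabilities are estimated from document counts over a collection with category labels. The function name and input layout are hypothetical.

```python
import math
from collections import Counter
from typing import List, Set, Tuple


def expected_cross_entropy(entry: str, docs: List[Tuple[Set[str], str]]) -> float:
    """Score one entry; docs is a list of (entries of a text, category label)."""
    n = len(docs)
    class_counts = Counter(label for _, label in docs)
    labels_with_entry = [label for entries, label in docs if entry in entries]
    if not labels_with_entry:
        return 0.0  # the entry never occurs, so it carries no information
    p_w = len(labels_with_entry) / n  # estimate of P(w)
    total = 0.0
    for label, count in Counter(labels_with_entry).items():
        p_c_given_w = count / len(labels_with_entry)  # estimate of P(c_i | w)
        p_c = class_counts[label] / n                 # estimate of P(c_i)
        total += p_c_given_w * math.log(p_c_given_w / p_c)
    return p_w * total


# Hypothetical labeled collection: "a" occurs only in class "x",
# while "b" is spread evenly over both classes.
docs = [({"a", "b"}, "x"), ({"a"}, "x"), ({"b"}, "y"), ({"c"}, "y")]
print(expected_cross_entropy("a", docs), expected_cross_entropy("b", docs))
```

An entry whose presence shifts the category distribution scores higher than one spread evenly across categories, which matches the interpretation given in the text.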
The expectation of the cross information entropy is calculated for each entry of text $t_m$ to obtain the entry feature set of the text:
sorting the entries in the entry set of the text in descending order of the expectation of their cross information entropy;
extracting the first k entries in the ordered entry set as keywords of the text; specifically, for the m-th text $t_m$, if $L_m \le 200$, then $k$ = (value not legible in the source); otherwise, $k = 10$;
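Sorting and top-k extraction can be sketched as below. Because the value of k for short texts is not legible here, k is passed in as a parameter. The alphabetical tie-break and the name `top_k_entries` are assumptions.

```python
from typing import Dict, List


def top_k_entries(scores: Dict[str, float], k: int) -> List[str]:
    """Sort entries by score in descending order and keep the first k.
    Ties are broken alphabetically so the result is deterministic."""
    ranked = sorted(scores, key=lambda entry: (-scores[entry], entry))
    return ranked[:k]


# Hypothetical scores for the entries of one text.
scores = {"bridge": 0.4, "river": 0.9, "city": 0.1}
print(top_k_entries(scores, k=2))
```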
establishing a keyword set according to the keywords of each text in the public text intelligence data set; specifically, the keyword set of the m-th text $t_m$ is $D_m = \{w_1, w_2, \dots, w_{k_m}\}$, where $w_1$ is the first entry of the m-th text $t_m$ after sorting and $k_m$ is the value of $k$ used when extracting keywords from the m-th text. The keyword sets of all texts together form the keyword set $D = D_1 \cup D_2 \cup \dots \cup D_M = \{d_1, d_2, \dots, d_s\}$, where $s$ is the number of keywords in the keyword set $D$ of the public text intelligence data set.
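The union $D = D_1 \cup \dots \cup D_M$ can be sketched as an order-preserving merge, so that each keyword gets a stable index $1, \dots, s$ for the matrix built next. Keeping first-seen order is an assumption, and `build_keyword_set` is a hypothetical name.

```python
from typing import Dict, List, Sequence


def build_keyword_set(per_text_keywords: Sequence[Sequence[str]]) -> List[str]:
    """Merge the per-text keyword sets D_1..D_M into D, dropping duplicates
    while preserving the order in which keywords are first seen."""
    seen: Dict[str, int] = {}
    for keywords in per_text_keywords:
        for keyword in keywords:
            seen.setdefault(keyword, len(seen))
    return list(seen)


# Hypothetical keyword sets of two texts that share one keyword.
d = build_keyword_set([["launch", "delay"], ["delay", "weather"]])
print(len(d), d)
```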
Counting the times of common occurrence of any two keywords in the same text in the keyword set;
establishing a keyword co-occurrence matrix according to the co-occurrence times of every two keywords in the same text;
specifically, for each pair of keywords $(d_u, d_v)$ in the set $D$, where $u = 1, 2, \dots, s$; $v = 1, 2, \dots, s$; $u \ne v$, the number of times the two keywords appear in the same text is counted and denoted $a_{u,v}$. This gives the keyword co-occurrence matrix based on the keyword set $D$: $A = (a_{u,v})_{s \times s}$.
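A minimal sketch of building $A$ follows. It assumes each text is represented by the set of terms it contains, that a pair counts once per text, and that the diagonal stays 0 because the text restricts pairs to $u \ne v$. The function name is hypothetical.

```python
from itertools import combinations
from typing import Iterable, List, Sequence, Set


def cooccurrence_matrix(
    keywords: Sequence[str], texts_terms: Iterable[Set[str]]
) -> List[List[int]]:
    """Build the s x s matrix whose (u, v) element counts the texts in which
    keywords d_u and d_v both appear."""
    index = {keyword: i for i, keyword in enumerate(keywords)}
    s = len(keywords)
    a = [[0] * s for _ in range(s)]
    for terms in texts_terms:
        present = sorted(index[term] for term in terms if term in index)
        for u, v in combinations(present, 2):
            a[u][v] += 1  # the matrix is symmetric, so update both halves
            a[v][u] += 1
    return a


# Hypothetical data set of three texts over three keywords.
matrix = cooccurrence_matrix(["a", "b", "c"], [{"a", "b"}, {"a", "b", "c"}, {"c"}])
print(matrix)
```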
Optionally, in step 103, performing binarization processing on the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix, which specifically includes:
replacing elements which are larger than or equal to a set threshold value in the keyword co-occurrence matrix with 1;
and replacing elements smaller than the set threshold value in the keyword co-occurrence matrix with 0.
Specifically, a threshold (an integer greater than 0) is set based on the number $s$ of keywords in the keyword set $D$; if $a_{u,v}$ is greater than or equal to the threshold, then $a'_{u,v} = 1$; otherwise $a'_{u,v} = 0$. This gives the binarized keyword co-occurrence matrix $A' = (a'_{u,v})_{s \times s}$.
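Binarization is an element-wise comparison against the threshold. A short sketch follows. How the threshold is derived from $s$ is not stated in the text, so it is simply a parameter here, and `binarize` is a hypothetical name.

```python
from typing import List


def binarize(matrix: List[List[int]], threshold: int) -> List[List[int]]:
    """Replace elements >= threshold with 1 and all other elements with 0."""
    if threshold <= 0:
        raise ValueError("the threshold must be a positive integer")
    return [[1 if value >= threshold else 0 for value in row] for row in matrix]


# Hypothetical co-occurrence counts for three keywords.
a = [[0, 2, 1], [2, 0, 1], [1, 1, 0]]
print(binarize(a, threshold=2))
```

A larger threshold keeps only strongly co-occurring pairs and therefore produces more, smaller components in the later steps.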
Optionally, in step 105, extracting components in the keyword co-occurrence network to obtain a component data set, which specifically includes:
extracting components in the keyword co-occurrence network according to the principle that co-occurrence exists among keywords in the same component and no co-occurrence exists among keywords in different components; specifically, keywords that are connected in the keyword co-occurrence network are placed in the same component, and keywords that are not connected are placed in different components, wherein the i-th component is $C_i = \{d_{i,1}, d_{i,2}, \dots\}$;
combining all the extracted components of the keyword co-occurrence network into a component data set, the component data set being $\{C_1, C_2, \dots, C_i, \dots\}$.
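Read as a graph problem, this step amounts to finding the connected components of the network whose adjacency matrix is $A'$. The sketch below uses breadth-first search, which is one of several equivalent choices and not one named by the source. Treating an unconnected keyword as a single-member component is an assumption, and the function name is hypothetical.

```python
from collections import deque
from typing import List, Sequence


def extract_components(
    keywords: Sequence[str], binary: Sequence[Sequence[int]]
) -> List[List[str]]:
    """Group keywords into components: two keywords share a component
    exactly when a path of 1-valued elements links them."""
    s = len(keywords)
    visited = [False] * s
    components = []
    for start in range(s):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        members = []
        while queue:
            u = queue.popleft()
            members.append(keywords[u])
            for v in range(s):
                if binary[u][v] == 1 and not visited[v]:
                    visited[v] = True
                    queue.append(v)
        components.append(members)
    return components


# Hypothetical binarized matrix: a-b are linked, c is isolated.
print(extract_components(["a", "b", "c"], [[0, 1, 0], [1, 0, 0], [0, 0, 0]]))
```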
Optionally, step 106, determining each component in the component data set, and determining whether a content conflict exists in the corresponding component; and when the judgment result shows that the corresponding components have content conflict, determining the text with conflict in the public text intelligence data set according to the components with content conflict, which specifically comprises the following steps:
judging each component in the component data set, and judging whether content semantic conflict exists in the corresponding component;
and when the judgment result shows that the corresponding component has content conflict, searching the corresponding text in the public text information data set according to the keywords with content semantic conflict in the component, and determining the text with conflict in the public text information data set.
Specifically, each component $C_i = \{d_{i,1}, d_{i,2}, \dots\}$ is manually interpreted in turn. If a content semantic conflict exists between keywords $d_{i,x}$ and $d_{i,y}$ ($x \ne y$), the corresponding texts are searched for in the public text intelligence data set $T = \{t_1, t_2, \dots, t_m, \dots, t_M\}$ according to the keywords $d_{i,x}$ and $d_{i,y}$, and the conflicting content is determined; otherwise, the texts corresponding to the keywords in component $C_i$ are considered to have no content conflict.
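The conflict judgment itself is manual, so only the lookup that follows it is sketched here. The sketch assumes a reviewer has already flagged a keyword pair and that each text is available as a set of terms. The function name, the 0-based text indices, and the grouping of results by keyword are illustrative choices.

```python
from typing import Dict, List, Sequence, Set, Tuple


def texts_for_conflict(
    pair: Tuple[str, str], texts_terms: Sequence[Set[str]]
) -> Dict[str, List[int]]:
    """For a keyword pair judged to conflict, list the indices of the texts
    that contain each keyword so the conflicting content can be inspected."""
    d_x, d_y = pair
    return {
        d_x: [m for m, terms in enumerate(texts_terms) if d_x in terms],
        d_y: [m for m, terms in enumerate(texts_terms) if d_y in terms],
    }


# Hypothetical data set: texts 0 and 1 disagree about the same launch.
texts = [{"launch", "delayed"}, {"launch", "on schedule"}, {"weather"}]
print(texts_for_conflict(("delayed", "on schedule"), texts))
```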
As shown in fig. 2, the present invention further provides a content conflict detection system for public text intelligence, which includes:
a public text information data set establishing module 201, configured to obtain public text information and establish a public text information data set;
a keyword co-occurrence matrix construction module 202, configured to extract keywords of each text in the public text intelligence data set, and construct a keyword co-occurrence matrix;
a binarization processing module 203, configured to perform binarization processing on the keyword co-occurrence matrix to obtain a binarization keyword co-occurrence matrix;
a keyword co-occurrence network establishing module 204, configured to establish a keyword co-occurrence network according to the binarization keyword co-occurrence matrix;
a component extraction module 205, configured to extract components in the keyword co-occurrence network to obtain a component data set;
a conflict judgment module 206, configured to judge each component in the component data set, and judge whether a content conflict exists in the corresponding component; and when the judgment result shows that the corresponding components have content conflict, determining the text with conflict in the public text intelligence data set according to the components with content conflict.
Optionally, the keyword co-occurrence matrix constructing module 202 specifically includes:
the entry division submodule is used for carrying out word segmentation on each text in the public text intelligence data set to obtain an entry set of the text;
the expectation calculation submodule is used for calculating the expectation of the cross information entropy of each entry in the entry set of the text;
the sequencing submodule is used for sorting the entries in the entry set of the text in descending order of the expectation of their cross information entropy;
the keyword extraction submodule is used for extracting the first k entries in the ordered entry set as keywords of the text;
the keyword set establishing sub-module is used for establishing a keyword set according to the keywords of each text in the public text intelligence data set;
the co-occurrence frequency counting submodule is used for counting the co-occurrence frequency of any two keywords in the keyword set in the same text;
and the keyword co-occurrence matrix establishing submodule is used for establishing the keyword co-occurrence matrix according to the co-occurrence times of any two keywords in the same text.
Optionally, the binarization processing module 203 specifically includes:
a 1 setting sub-module, configured to replace an element, which is greater than or equal to a set threshold, in the keyword co-occurrence matrix with 1;
and the 0 setting sub-module is used for replacing the elements smaller than the set threshold value in the keyword co-occurrence matrix with 0.
Optionally, the component extraction module 205 specifically includes:
the component extraction submodule is used for extracting components in the keyword co-occurrence network according to the principle that co-occurrence exists among the keywords in the same component and the co-occurrence does not exist among the keywords in different components;
and the component data set establishing submodule is used for combining all the components in the extracted keyword co-occurrence network into a component data set.
Optionally, the conflict judgment module 206 specifically includes:
the conflict judgment submodule is used for judging each component in the component data set and judging whether content semantic conflicts exist in the corresponding components;
and the conflict content determining submodule is used for searching the corresponding text in the public text information data set according to the keywords with content semantic conflict in the component and determining the text with conflict in the public text information data set when the judgment result shows that the content conflict exists in the corresponding component.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
The invention discloses a content conflict detection method and system for open text information. First, public text information is acquired and a public text information data set is established. Then, keywords of each text in the public text information data set are extracted and a keyword co-occurrence matrix is constructed. The keyword co-occurrence matrix is binarized to obtain a binarized keyword co-occurrence matrix, and a keyword co-occurrence network is established from the binarized matrix. Components of the keyword co-occurrence network are extracted to obtain a component data set. Finally, each component in the component data set is examined to judge whether a content conflict exists, and the conflicting content is determined. The method detects and judges the content of public text information directly by correlation analysis; it needs no structured knowledge base and does not require the public text data to be described and stored in structured form, which reduces the amount of computation. It thereby overcomes the technical defect of poor content conflict detection accuracy that arises because knowledge base updates cannot keep pace with highly real-time, big-data public text information, and it enables content conflict detection for public text information with big-data characteristics.
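The step of establishing the keyword co-occurrence network from the binarized matrix can be pictured as turning the matrix into an adjacency structure. A minimal sketch, assuming the network is undirected and stored as a dictionary of neighbor sets (the name `build_network` and this representation are assumptions, not taken from the patent):

```python
def build_network(vocab, binary_matrix):
    """Map each keyword to the set of keywords it is linked to."""
    network = {keyword: set() for keyword in vocab}
    for i, row in enumerate(binary_matrix):
        for j, linked in enumerate(row):
            # An entry of 1 off the diagonal is an edge between keywords i and j.
            if linked and i != j:
                network[vocab[i]].add(vocab[j])
    return network
```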
The embodiments in this description are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Specific examples are used herein to explain the principle and implementation of the present invention. The above description of the embodiments is intended only to help in understanding the method of the present invention and its core idea. The described embodiments are only some, not all, of the embodiments of the present invention, and all other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
Claims (4)
1. A content conflict detection method of open text information is characterized by comprising the following steps:
acquiring public text information, and establishing a public text information data set, wherein the public text information data set comprises a plurality of texts;
extracting keywords of each text in the public text information data set, and constructing a keyword co-occurrence matrix;
carrying out binarization processing on the keyword co-occurrence matrix to obtain a binarization keyword co-occurrence matrix;
establishing a keyword co-occurrence network according to the binarization keyword co-occurrence matrix;
extracting components in the keyword co-occurrence network to obtain a component data set;
judging each component in the component data set, and judging whether content conflict exists in the corresponding component; and when the judgment result is that the corresponding component has content conflict, determining the text of the public text information data set with conflict according to the component with content conflict; the method specifically comprises the following steps: judging each component in the component data set, and judging whether content semantic conflict exists in the corresponding component; when the judgment result is that the corresponding component has content conflict, searching the corresponding text in the public text information data set according to the keywords with content semantic conflict in the component, and determining the text with conflict in the public text information data set;
wherein extracting the keywords of each text in the public text information data set and constructing the keyword co-occurrence matrix specifically comprises: segmenting each text in the public text information data set into words to obtain an entry set of the text; calculating the expectation of the cross information entropy of each entry in the entry set of the text; sorting the entries in the entry set of the text in descending order of the expectation of the cross information entropy of each entry; extracting the first k entries in the sorted entry set as keywords of the text; establishing a keyword set according to the keywords of each text in the text information data set; counting the number of times any two keywords in the keyword set co-occur in the same text; and establishing the keyword co-occurrence matrix according to the number of times any two keywords co-occur in the same text;
extracting components in the keyword co-occurrence network to obtain a component data set, which specifically comprises the following steps: extracting components in the keyword co-occurrence network according to the principle that co-occurrence exists among keywords in the same component and no co-occurrence exists among keywords in different components; and combining all the components in the extracted keyword co-occurrence network into a component data set.
2. The method according to claim 1, wherein the binarizing processing is performed on the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix, and specifically comprises:
replacing elements which are larger than or equal to a set threshold value in the keyword co-occurrence matrix with 1;
and replacing elements smaller than the set threshold value in the keyword co-occurrence matrix with 0.
3. A content conflict detection system for open text information, comprising:
the public text information data set establishing module is used for acquiring public text information and establishing a public text information data set;
the keyword co-occurrence matrix construction module is used for extracting keywords of each text in the public text information data set and constructing a keyword co-occurrence matrix;
the binarization processing module is used for carrying out binarization processing on the keyword co-occurrence matrix to obtain a binarization keyword co-occurrence matrix;
the keyword co-occurrence network establishing module is used for establishing a keyword co-occurrence network according to the binarization keyword co-occurrence matrix;
the component extraction module is used for extracting components in the keyword co-occurrence network to obtain a component data set;
the conflict judging module is used for judging each component in the component data set and judging whether content conflicts exist in the corresponding components; and when the judgment result is that the corresponding component has content conflict, determining the text of the public text information data set with conflict according to the component with content conflict; the conflict judgment module specifically comprises: the conflict judgment submodule is used for judging each component in the component data set and judging whether content semantic conflicts exist in the corresponding components; a conflict content determining submodule for searching the corresponding text in the public text information data set according to the keyword with content semantic conflict in the component and determining the text with conflict in the public text information data set when the judgment result is that the content conflict exists in the corresponding component;
the keyword co-occurrence matrix construction module specifically comprises: the entry division submodule is used for carrying out word segmentation on each text in the public text information data set to obtain an entry set of the text; the expectation calculation submodule is used for calculating the expectation of the cross information entropy of each entry in the entry set of the text; the sequencing submodule is used for sorting the entries in the entry set of the text in descending order of the expectation of the cross information entropy of each entry; the keyword extraction submodule is used for extracting the first k entries in the sorted entry set as keywords of the text; the keyword set establishing submodule is used for establishing a keyword set according to the keywords of each text in the text information data set; the co-occurrence frequency counting submodule is used for counting the number of times any two keywords in the keyword set co-occur in the same text; the keyword co-occurrence matrix establishing submodule is used for establishing the keyword co-occurrence matrix according to the number of times any two keywords co-occur in the same text;
the component extraction module specifically comprises: the component extraction submodule is used for extracting components in the keyword co-occurrence network according to the principle that co-occurrence exists among the keywords in the same component and the co-occurrence does not exist among the keywords in different components; and the component data set establishing submodule is used for combining all the components in the extracted keyword co-occurrence network into a component data set.
4. The system according to claim 3, wherein the binarization processing module specifically comprises:
a 1-setting submodule, configured to replace each element of the keyword co-occurrence matrix that is greater than or equal to a set threshold with 1;
and a 0-setting submodule, configured to replace each element of the keyword co-occurrence matrix that is smaller than the set threshold with 0.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710646040.7A CN107451120B (en) | 2017-08-01 | 2017-08-01 | Content conflict detection method and system for open text information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710646040.7A CN107451120B (en) | 2017-08-01 | 2017-08-01 | Content conflict detection method and system for open text information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107451120A CN107451120A (en) | 2017-12-08 |
CN107451120B true CN107451120B (en) | 2020-10-30 |
Family
ID=60490592
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710646040.7A Active CN107451120B (en) | 2017-08-01 | 2017-08-01 | Content conflict detection method and system for open text information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107451120B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6987003B2 (en) * | 2018-03-20 | 2021-12-22 | 株式会社Screenホールディングス | Text mining methods, text mining programs, and text mining equipment |
CN110377690B (en) * | 2019-06-27 | 2021-03-16 | 北京信息科技大学 | Information acquisition method and system based on remote relationship extraction |
CN110442765B (en) * | 2019-07-04 | 2022-03-11 | 卓尔智联(武汉)研究院有限公司 | Information processing method, device, terminal and storage medium |
CN114003785A (en) * | 2021-10-29 | 2022-02-01 | 奇安信科技集团股份有限公司 | Method and device for obtaining threat information based on endogenous security |
CN114090781A (en) * | 2022-01-20 | 2022-02-25 | 北京零点远景网络科技有限公司 | Text data-based repulsion event detection method and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103577404A (en) * | 2012-07-19 | 2014-02-12 | 中国人民大学 | Microblog-oriented discovery method for new emergencies |
US9235563B2 (en) * | 2009-07-02 | 2016-01-12 | Battelle Memorial Institute | Systems and processes for identifying features and determining feature associations in groups of documents |
CN106599304A (en) * | 2016-12-29 | 2017-04-26 | 中南大学 | Small and medium-sized website-oriented modularized user retrieval intention modeling method |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9235563B2 (en) * | 2009-07-02 | 2016-01-12 | Battelle Memorial Institute | Systems and processes for identifying features and determining feature associations in groups of documents |
CN103577404A (en) * | 2012-07-19 | 2014-02-12 | 中国人民大学 | Microblog-oriented discovery method for new emergencies |
CN106599304A (en) * | 2016-12-29 | 2017-04-26 | 中南大学 | Small and medium-sized website-oriented modularized user retrieval intention modeling method |
Also Published As
Publication number | Publication date |
---|---|
CN107451120A (en) | 2017-12-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107451120B (en) | Content conflict detection method and system for open text information | |
CN110516067B (en) | Public opinion monitoring method, system and storage medium based on topic detection | |
CN106407484B (en) | Video tag extraction method based on barrage semantic association | |
US8285713B2 (en) | Image search using face detection | |
WO2018196561A1 (en) | Label information generating method and device for application and storage medium | |
CN103544255B (en) | Text semantic relativity based network public opinion information analysis method | |
CN103049435B (en) | Text fine granularity sentiment analysis method and device | |
CN110413787B (en) | Text clustering method, device, terminal and storage medium | |
CN104881458B (en) | A kind of mask method and device of Web page subject | |
CN101593200A (en) | Chinese Web page classification method based on the keyword frequency analysis | |
CN112131449A (en) | Implementation method of cultural resource cascade query interface based on elastic search | |
CN104199972A (en) | Named entity relation extraction and construction method based on deep learning | |
CN105279277A (en) | Knowledge data processing method and device | |
CN101079031A (en) | Web page subject extraction system and method | |
CN110738033B (en) | Report template generation method, device and storage medium | |
CN109344298A (en) | Method and device for converting unstructured data into structured data | |
CN102279890A (en) | Sentiment word extracting and collecting method based on micro blog | |
CN107180087B (en) | A kind of searching method and device | |
CN103886020A (en) | Quick search method of real estate information | |
CN114443855A (en) | Knowledge graph cross-language alignment method based on graph representation learning | |
CN107341142B (en) | Enterprise relation calculation method and system based on keyword extraction and analysis | |
CN115618014A (en) | Standard document analysis management system and method applying big data technology | |
Aslam et al. | Web-AM: An efficient boilerplate removal algorithm for Web articles | |
CN110083654A (en) | A kind of multi-source data fusion method and system towards science and techniques of defence field | |
CN104978431B (en) | Web data fusion method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||