CN107451120B - Content conflict detection method and system for open text information - Google Patents

Publication number
CN107451120B
Authority
CN
China
Prior art keywords
keyword
text
occurrence
data set
component
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710646040.7A
Other languages
Chinese (zh)
Other versions
CN107451120A (en
Inventor
李晓军
姚俊萍
沈涛
张锴琦
王利涛
马俊春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Rocket Force University of Engineering of PLA
Original Assignee
Rocket Force University of Engineering of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rocket Force University of Engineering of PLA
Priority to CN201710646040.7A
Publication of CN107451120A
Application granted
Publication of CN107451120B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/041 Abduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a content conflict detection method and system for public text intelligence. The method comprises the following steps: establishing a public text intelligence data set; extracting keywords and constructing a keyword co-occurrence matrix; binarizing the keyword co-occurrence matrix and establishing a keyword co-occurrence network; extracting the components of the keyword co-occurrence network to obtain a component data set; and judging each component to determine whether a content conflict exists. The method detects and judges content in public text intelligence directly by correlation analysis, without structured description and storage of the public text data. This reduces the amount of computation, overcomes the poor detection accuracy that arises when updates to a structured knowledge base cannot keep pace with highly real-time big-data public text intelligence, and enables content conflict detection for public text intelligence with big-data characteristics.

Description

Content conflict detection method and system for open text information
Technical Field
The invention relates to the field of public text information application, in particular to a method and a system for detecting content conflict of public text information.
Background
Public intelligence, also known as open source intelligence, refers to the intelligence collected and mined from public media (such as newspapers/publications, the internet, self-media platforms, etc.), and the intelligence content is mainly unstructured data, including numbers, texts, pictures, videos, etc.
Public text intelligence refers to information data in text format collected and mined from public media (e.g., newspapers/publications, the internet, self-media platforms, etc.).
Content conflicts refer to situations in which descriptions of the same feature of the same subject are inconsistent or contradictory within the same problem context.
Public text intelligence has the relative advantages of low acquisition cost, broad data sources, and good timeliness, and it has wide application value in fields such as military intelligence support and the study of corporate competitive strategy. Meanwhile, with the progress of self-media technology and the spread of the internet, public text intelligence has taken on big-data characteristics: its volume grows at a remarkable rate, it is generated from many sources, and it propagates through many parallel and intricate channels. Such a mass of public text intelligence inevitably contains conflicting content, which makes it difficult to analyze and use; deliberate misdirection by potential competitors further exacerbates the problem. Thus, the first step in applying public text intelligence efficiently and accurately is to detect and discover conflicting content.
Conflicting content is a key factor limiting the quality of public text intelligence data. If potential conflicting content cannot be detected, discovered, and eliminated in a timely and effective manner, the results of analyzing big-data public text intelligence are unreliable and its application value is reduced. At present, content conflict detection for text data is mainly oriented to small- and medium-scale data and is mainly applied to detecting conflicts in metadata or structured data.
For example, the 20th Research Institute of China Electronics Technology Group Corporation proposed a method for detecting conflicts between instruction contents in a network management and control system, the method comprising the following steps:
1. identifying the instructions of the network management and control system whose contents are mutually exclusive;
2. establishing a plurality of mutually exclusive instruction sets, wherein all instructions in each set are mutually exclusive with one another;
3. setting an instruction interval time threshold t;
4. recording the instructions received by the same device within a time period of length t, wherein if two or more of those instructions belong to the same mutually exclusive instruction set, an instruction content conflict has occurred; otherwise, there is no instruction content conflict.
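The four prior-art steps above can be sketched as follows. This is only an illustrative reading, not the cited institute's implementation: the function name `find_instruction_conflicts`, the event tuple layout, and the interpretation of "a time period of length t" as "two instructions whose timestamps differ by at most t" are all assumptions.

```python
from collections import defaultdict


def find_instruction_conflicts(events, exclusive_sets, t):
    """Report pairs of mutually exclusive instructions received by one device.

    events: iterable of (timestamp, device, instruction) tuples (assumed layout).
    exclusive_sets: list of sets; instructions in one set exclude each other.
    t: interval threshold; two instructions conflict only if their
       timestamps differ by at most t (assumed reading of step 4).
    """
    by_device = defaultdict(list)
    for ts, device, instr in sorted(events):
        by_device[device].append((ts, instr))

    conflicts = []
    for device, items in by_device.items():
        for i, (ts_a, a) in enumerate(items):
            for ts_b, b in items[i + 1:]:
                if ts_b - ts_a > t:
                    break  # items are time-ordered, so later ones are further away
                if a != b and any(a in s and b in s for s in exclusive_sets):
                    conflicts.append((device, a, b))
    return conflicts


events = [(0, "r1", "open"), (3, "r1", "close"), (20, "r1", "open"), (1, "r2", "open")]
conflicts = find_instruction_conflicts(events, [{"open", "close"}], t=5)
```

The sketch makes the limitation discussed in the Background concrete: the mutually exclusive sets must be enumerated by hand before any conflict can be found.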
As another example, Zhao Xiaofei and Huang Zhiqiu proposed a description-logic-based method for detecting conflicts in CWM (Common Warehouse Metamodel) metadata, which comprises the following steps:
1. establishing a description logic $DL_{id}$ that supports identification constraints on concepts;
2. formalizing the CWM metadata with the description logic $DL_{id}$ to establish a $DL_{id}$ knowledge base;
3. defining a set of requirements for a description logic query language;
4. establishing, according to those requirements, a query language with the following format:
Figure RE-BDA0001366880760000021
5. querying the $DL_{id}$ knowledge base with nRQL to find content conflicts.
In content conflict detection for text data, existing methods are mainly oriented to small- and medium-scale data and share two characteristics: (1) the key first step is to describe and store the text data in structured form; (2) an inference mechanism for conflict detection, such as a mutually exclusive instruction set or a conflict query language, is built on the structured knowledge base and then used to detect content conflicts. For public text intelligence with big-data characteristics, existing methods have the following defects: (1) the workload of describing and storing the public text intelligence data in structured form is enormous and very difficult to carry out; (2) an inference mechanism built on a structured knowledge base is rigid and inflexible, and because big-data public text intelligence changes in near real time, the established mechanism easily fails to fit new problem situations; (3) the detected conflicts are at the micro level, that is, they exist among a few texts (usually two), and conflicts at the level of the whole large data set are difficult to reveal. Existing content conflict detection methods therefore cannot detect content conflicts in public text intelligence with big-data characteristics.
Disclosure of Invention
The invention aims to provide a method and a system for detecting content conflict of public text information to realize the detection of the content conflict of the public text information with the characteristic of big data.
In order to achieve the purpose, the invention provides the following scheme:
a content conflict detection method of open text information comprises the following steps:
acquiring public text information, and establishing a public text information data set, wherein the public text information data set comprises a plurality of texts;
extracting keywords of each text in the public text information data set, and constructing a keyword co-occurrence matrix;
carrying out binarization processing on the keyword co-occurrence matrix to obtain a binarization keyword co-occurrence matrix;
establishing a keyword co-occurrence network according to the binarization keyword co-occurrence matrix;
extracting components in the keyword co-occurrence network to obtain a component data set;
judging each component in the component data set, and judging whether content conflict exists in the corresponding component; and when the judgment result shows that the corresponding components have content conflict, determining the text with conflict in the public text intelligence data set according to the components with content conflict.
Optionally, the extracting keywords of each text in the public text intelligence data set, and constructing a keyword co-occurrence matrix specifically include:
segmenting words of each text in the public text intelligence data set to obtain an entry set of the text;
calculating the expectation of the cross information entropy of each entry in the entry set of the text;
sorting the entries in the entry set of the text in descending order of the expectation of their cross information entropy;
extracting the first k entries in the ordered entry set as keywords of the text;
establishing a keyword set according to keywords of each text in the text information data set;
counting the times of common occurrence of any two keywords in the same text in the keyword set;
and establishing a keyword co-occurrence matrix according to the co-occurrence times of every two keywords in the same text.
Optionally, the keyword co-occurrence matrix is binarized to obtain a binarized keyword co-occurrence matrix, which specifically includes:
replacing elements which are larger than or equal to a set threshold value in the keyword co-occurrence matrix with 1;
and replacing elements smaller than the set threshold value in the keyword co-occurrence matrix with 0.
Optionally, extracting components in the keyword co-occurrence network to obtain a component data set, specifically including:
extracting components in the keyword co-occurrence network according to the principle that co-occurrence exists among keywords in the same component and no co-occurrence exists among keywords in different components;
and combining all the components in the extracted keyword co-occurrence network into a component data set.
Optionally, each component in the component data set is determined, and whether a content conflict exists in the corresponding component is determined; and when the judgment result shows that the corresponding components have content conflict, determining the text with conflict in the public text intelligence data set according to the components with content conflict, which specifically comprises the following steps:
judging each component in the component data set, and judging whether content semantic conflict exists in the corresponding component;
and when the judgment result shows that the corresponding component has content conflict, searching the corresponding text in the public text information data set according to the keywords with content semantic conflict in the component, and determining the text with conflict in the public text information data set.
A system for detecting content conflicts in published textual intelligence, comprising:
the public text information data set establishing module is used for acquiring public text information and establishing a public text information data set;
the keyword co-occurrence matrix construction module is used for extracting keywords of each text in the public text information data set and constructing a keyword co-occurrence matrix;
the binarization processing module is used for carrying out binarization processing on the keyword co-occurrence matrix to obtain a binarization keyword co-occurrence matrix;
the keyword co-occurrence network establishing module is used for establishing a keyword co-occurrence network according to the binarization keyword co-occurrence matrix;
the component extraction module is used for extracting components in the keyword co-occurrence network to obtain a component data set;
the conflict judging module is used for judging each component in the component data set and judging whether content conflicts exist in the corresponding components; and when the judgment result shows that the corresponding components have content conflict, determining the text with conflict in the public text intelligence data set according to the components with content conflict.
Optionally, the keyword co-occurrence matrix constructing module specifically includes:
the entry division submodule is used for carrying out word segmentation on each text in the public text intelligence data set to obtain an entry set of the text;
the expectation calculation submodule is used for calculating the expectation of the cross information entropy of each entry in the entry set of the text;
the sorting submodule is used for sorting the entries in the entry set of the text in descending order of the expectation of their cross information entropy;
the keyword extraction submodule is used for extracting the first k entries in the ordered entry set as keywords of the text;
the keyword set establishing sub-module is used for establishing a keyword set according to the keywords of each text in the text information data set;
the co-occurrence frequency counting submodule is used for counting the co-occurrence frequency of any two keywords in the keyword set in the same text;
and the keyword co-occurrence matrix establishing submodule is used for establishing the keyword co-occurrence matrix according to the co-occurrence times of any two keywords in the same text.
Optionally, the binarization processing module specifically includes:
a 1 setting sub-module, configured to replace an element, which is greater than or equal to a set threshold, in the keyword co-occurrence matrix with 1;
and the 0 setting sub-module is used for replacing the elements smaller than the set threshold value in the keyword co-occurrence matrix with 0.
Optionally, the component extraction module specifically includes:
the component extraction submodule is used for extracting components in the keyword co-occurrence network according to the principle that co-occurrence exists among the keywords in the same component and the co-occurrence does not exist among the keywords in different components;
and the component data set establishing submodule is used for combining all the components in the extracted keyword co-occurrence network into a component data set.
Optionally, the conflict judgment module specifically includes:
the conflict judgment submodule is used for judging each component in the component data set and judging whether content semantic conflicts exist in the corresponding components;
and the conflict content determining submodule is used for searching the corresponding text in the public text information data set according to the keywords with content semantic conflict in the component and determining the text with conflict in the public text information data set when the judgment result shows that the content conflict exists in the corresponding component.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
The invention discloses a content conflict detection method and system for public text intelligence. First, public text intelligence is acquired and a public text intelligence data set is established. Next, the keywords of each text in the data set are extracted and a keyword co-occurrence matrix is constructed; the matrix is binarized to obtain a binarized keyword co-occurrence matrix. A keyword co-occurrence network is then established from the binarized matrix, and its components are extracted to obtain a component data set. Finally, each component in the component data set is judged to determine whether a content conflict exists and, if so, which content conflicts. The method detects and judges content in public text intelligence directly by correlation analysis; it needs no structured knowledge base and no structured description and storage of the public text data. This reduces the amount of computation, overcomes the poor detection accuracy that arises when knowledge-base updates cannot keep pace with highly real-time big-data public text intelligence, and enables content conflict detection for public text intelligence with big-data characteristics.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a content conflict detection method for public text intelligence provided by the present invention.
Fig. 2 is a block diagram of a system for detecting content conflict of public text intelligence according to the present invention.
Detailed Description
The invention aims to provide a method and a system for detecting content conflict of public text information, so as to realize the detection of the content conflict of the public text information with the characteristic of big data.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the present invention provides a content conflict detection method for open text intelligence, which comprises the following steps:
Step 101, obtaining public text intelligence and establishing a public text intelligence data set, wherein the public text intelligence data set comprises a plurality of texts; specifically, the public text intelligence data set is $T = \{t_1, t_2, \ldots, t_m, \ldots, t_M\}$, where $t_m$ is the m-th text in the data set T and M is the total number of texts in T.
Step 102, extracting keywords of each text in the public text intelligence data set, and constructing a keyword co-occurrence matrix;
Step 103, performing binarization processing on the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix;
Step 104, establishing a keyword co-occurrence network according to the binarized keyword co-occurrence matrix; specifically, the keyword co-occurrence network is obtained by connecting the keywords that correspond to elements with a value of 1 in the binarized keyword co-occurrence matrix;
Step 105, extracting components in the keyword co-occurrence network to obtain a component data set;
Step 106, judging each component in the component data set, and judging whether content conflicts exist in the corresponding components; and when the judgment result shows that the corresponding components have content conflict, determining the text with conflict in the public text intelligence data set according to the components with content conflict.
Optionally, the extracting a keyword of each text in the public text intelligence data set in step 102, and constructing a keyword co-occurrence matrix specifically includes:
segmenting words of each text in the public text intelligence data set to obtain an entry set of the text; specifically, the m-th text $t_m$ is segmented to obtain the entry set of the text [Figure RE-BDA0001366880760000071], where [Figure RE-BDA0001366880760000074] denotes the $l_m$-th entry of the m-th text $t_m$, $l_m = 1, 2, \ldots, L_m$, and $L_m$ is the total number of entries in $t_m$. For example, segmenting the text "南京市长江大桥" (Nanjing Yangtze River Bridge) yields {南京 (Nanjing), 市长 (mayor), 江大桥 (Jiang Daqiao), 南京市 (Nanjing City), 长江 (Yangtze River), 大桥 (bridge), 长江大桥 (Yangtze River Bridge)}.
Calculating the expectation of the cross information entropy of each entry in the entry set of the text;
specifically, the expectation of the cross information entropy of the entry [Figure RE-BDA0001366880760000072] is calculated as:
[Figure RE-BDA0001366880760000073]
where [Figure RE-BDA0001366880760000081] is the probability that text $t_m$ belongs to class $c_i$ when the entry occurs; $P(c_i)$ is the probability distribution of the categories of text $t_m$; and [Figure RE-BDA0001366880760000083] reflects the distance between the probability distribution of the text categories and the probability distribution of the text categories when the entry occurs. The larger this value, the greater the influence of the entry on the category distribution of text $t_m$.
The expectation of the cross information entropy is calculated for each entry of text $t_m$ to obtain the entry feature set:
Figure RE-BDA0001366880760000086
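The scoring step can be illustrated with the form of expected cross entropy that is commonly used for feature selection in text classification, $ECE(w) = P(w) \sum_i P(c_i \mid w) \log \frac{P(c_i \mid w)}{P(c_i)}$. The patent's own formula is an image that did not survive extraction, so treating it as this form (including the $P(w)$ factor) is an assumption. The sketch also assumes a labelled reference corpus and estimates all probabilities by document frequency; the function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def expected_cross_entropy(docs, labels):
    """Score each entry by a commonly used expected-cross-entropy form.

    docs: list of entry collections, one per text (assumed already segmented).
    labels: class label c_i of each text (assumed to be available).
    Returns a dict mapping entry -> score; higher means the entry shifts the
    class distribution more.
    """
    n = len(docs)
    class_count = Counter(labels)
    entry_count = Counter()
    entry_class_count = Counter()
    for entries, label in zip(docs, labels):
        for w in set(entries):  # document frequency: count each entry once per text
            entry_count[w] += 1
            entry_class_count[(w, label)] += 1

    scores = {}
    for w, n_w in entry_count.items():
        total = 0.0
        for c, n_c in class_count.items():
            p_c_given_w = entry_class_count[(w, c)] / n_w
            if p_c_given_w > 0:  # a zero term contributes nothing to the sum
                total += p_c_given_w * math.log(p_c_given_w / (n_c / n))
        scores[w] = (n_w / n) * total  # P(w) factor; an assumption, see lead-in
    return scores


docs = [{"bridge", "river"}, {"bridge", "city"}, {"market", "city"}, {"market", "price"}]
labels = ["geo", "geo", "econ", "econ"]
scores = expected_cross_entropy(docs, labels)
```

In this toy corpus "bridge" appears only in one class and scores highest, while "city" appears equally in both classes and scores zero, which matches the stated intuition that a larger value means a greater influence on the category distribution.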
sorting the entries in the entry set of the text in descending order of the expectation of their cross information entropy;
extracting the first k entries in the sorted entry set as keywords of the text; specifically, for the m-th text $t_m$, if $L_m \le 200$, then k is given by [Figure RE-BDA0001366880760000088]; otherwise, k = 10;
establishing a keyword set according to the keywords of each text in the public text intelligence data set; specifically, the keyword set of the m-th text $t_m$ is $D_m = \{w_1, w_2, \ldots, w_{k_m}\}$, where $w_1$ is the first entry of the m-th text $t_m$ after sorting and $k_m$ is the value of k used when extracting keywords from the m-th text. The keyword sets of all texts together form the keyword set $D = D_1 \cup D_2 \cup \cdots \cup D_M = \{d_1, d_2, \ldots, d_s\}$, where s is the number of keywords in the keyword set D of the public text intelligence data set.
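A minimal sketch of the sorting, top-k extraction, and union steps follows. The rule for k when $L_m \le 200$ is a formula image that is not recoverable here, so the sketch takes the rule as a caller-supplied function; the rule used in the example (`min(length, 2)` for short texts) is purely hypothetical, and only the "otherwise k = 10" branch comes from the text. Function names and scores are illustrative.

```python
def top_k_keywords(scores, k):
    """Sort entries by score in descending order and keep the first k.

    Ties are broken alphabetically so the result is deterministic
    (a choice made for this sketch, not stated in the source).
    """
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k]


def build_keyword_set(per_text_scores, k_for_length):
    """Return (D_1..D_M as lists, D as a sorted list).

    per_text_scores: one dict per text mapping entry -> score.
    k_for_length: function mapping L_m (number of entries) to k_m.
    """
    per_text_keywords = [
        top_k_keywords(scores, k_for_length(len(scores)))
        for scores in per_text_scores
    ]
    keyword_set = sorted(set().union(*per_text_keywords))
    return per_text_keywords, keyword_set


def example_k_rule(length):
    # Only the "> 200 entries -> k = 10" branch is from the source;
    # the short-text branch is a hypothetical stand-in.
    return 10 if length > 200 else min(length, 2)


per_text_scores = [
    {"bridge": 0.9, "river": 0.5, "the": 0.1},
    {"bridge": 0.8, "toll": 0.7, "a": 0.0},
]
per_text_keywords, keyword_set = build_keyword_set(per_text_scores, example_k_rule)
```

Keeping D as a sorted list gives every keyword a stable index, which is convenient when the keywords later label the rows and columns of a matrix.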
Counting the times of common occurrence of any two keywords in the same text in the keyword set;
establishing a keyword co-occurrence matrix according to the co-occurrence times of every two keywords in the same text;
specifically, for a pair of keywords $(d_u, d_v)$ in the set D, where $u = 1, 2, \ldots, s$; $v = 1, 2, \ldots, s$; $u \ne v$, the number of times they appear in the same text is counted and denoted $a_{u,v}$. This yields the keyword co-occurrence matrix based on the keyword set D: $A = (a_{u,v})_{s \times s}$.
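The counting step can be sketched as below. The sketch assumes each text is available as a collection of entries and that "appearing in the same text" means both keywords are among that text's entries; the diagonal is left at zero because the text only defines $a_{u,v}$ for $u \ne v$. The function name and sample data are illustrative.

```python
from itertools import combinations


def cooccurrence_matrix(texts_entries, keyword_set):
    """Build A = (a_uv), where a_uv counts texts containing both d_u and d_v.

    texts_entries: one collection of entries per text.
    keyword_set: ordered list of keywords d_1..d_s (fixes row/column order).
    """
    index = {d: i for i, d in enumerate(keyword_set)}
    s = len(keyword_set)
    a = [[0] * s for _ in range(s)]
    for entries in texts_entries:
        # Each keyword is counted at most once per text.
        present = sorted({index[w] for w in entries if w in index})
        for u, v in combinations(present, 2):
            a[u][v] += 1
            a[v][u] += 1  # co-occurrence is symmetric
    return a


keyword_set = ["bridge", "price", "river", "toll"]
texts_entries = [
    ["bridge", "river", "toll"],
    ["bridge", "river"],
    ["toll", "price"],
]
A = cooccurrence_matrix(texts_entries, keyword_set)
```

Only pairs of keywords present in a text are visited, so the cost per text depends on the number of keywords it contains rather than on $s^2$.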
Optionally, in step 103, performing binarization processing on the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix, which specifically includes:
replacing elements which are larger than or equal to a set threshold value in the keyword co-occurrence matrix with 1;
and replacing elements smaller than the set threshold value in the keyword co-occurrence matrix with 0.
Specifically, based on the number s of keywords in the keyword set D, a threshold (a positive integer) is set; if $a_{u,v}$ is not less than the threshold, then $a'_{u,v} = 1$; otherwise, $a'_{u,v} = 0$. This yields the binarized keyword co-occurrence matrix $A' = (a'_{u,v})_{s \times s}$.
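Binarization is a single element-wise comparison, sketched below. The text says the threshold is chosen based on s but gives no rule, so the sketch takes the threshold as a parameter; the function name and the sample matrix are illustrative.

```python
def binarize(a, threshold):
    """Return A' with a'_uv = 1 if a_uv >= threshold, else 0."""
    if not isinstance(threshold, int) or threshold <= 0:
        raise ValueError("threshold must be a positive integer")
    return [[1 if x >= threshold else 0 for x in row] for row in a]


A = [[0, 0, 2, 1],
     [0, 0, 0, 1],
     [2, 0, 0, 1],
     [1, 1, 1, 0]]
A_bin = binarize(A, 2)
```

Raising the threshold removes weak links, so the resulting network breaks into more, smaller components.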
Optionally, in step 105, extracting components in the keyword co-occurrence network to obtain a component data set, which specifically includes:
extracting components in the keyword co-occurrence network according to the principle that co-occurrence exists among keywords in the same component and no co-occurrence exists among keywords in different components; specifically, keywords that are connected in the keyword co-occurrence network are placed in the same component, and keywords that are not connected are placed in different components, wherein the i-th component is $C_i = \{d_{i,1}, d_{i,2}, \ldots\}$;
combining all the components extracted from the keyword co-occurrence network into a component data set, wherein the component data set is $\{C_1, C_2, \ldots, C_i, \ldots\}$.
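One way to realize this step is to treat the binarized matrix as the adjacency matrix of an undirected graph and take its connected components with a breadth-first search. Reading "component" as "connected component" (so that keywords linked through intermediate keywords end up together) is an assumption, as is keeping isolated keywords as single-member components; the function name and sample data are illustrative.

```python
from collections import deque


def extract_components(binary, keyword_set):
    """Return the connected components of the keyword co-occurrence network.

    binary: binarized s x s co-occurrence matrix (1 = edge, 0 = no edge).
    keyword_set: ordered list of keywords matching the matrix rows.
    """
    s = len(keyword_set)
    seen = [False] * s
    components = []
    for start in range(s):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        members = []
        while queue:
            u = queue.popleft()
            members.append(keyword_set[u])
            for v in range(s):
                if u != v and binary[u][v] == 1 and not seen[v]:
                    seen[v] = True
                    queue.append(v)
        components.append(members)
    return components


keyword_set = ["bridge", "price", "river", "toll"]
binary = [[0, 0, 1, 0],
          [0, 0, 0, 1],
          [1, 0, 0, 0],
          [0, 1, 0, 0]]
components = extract_components(binary, keyword_set)
```

Each component groups keywords that tend to describe the same topic, which narrows the judgment in the next step to a small set of related keywords.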
Optionally, step 106, determining each component in the component data set, and determining whether a content conflict exists in the corresponding component; and when the judgment result shows that the corresponding components have content conflict, determining the text with conflict in the public text intelligence data set according to the components with content conflict, which specifically comprises the following steps:
judging each component in the component data set, and judging whether content semantic conflict exists in the corresponding component;
and when the judgment result shows that the corresponding component has content conflict, searching the corresponding text in the public text information data set according to the keywords with content semantic conflict in the component, and determining the text with conflict in the public text information data set.
Specifically, each component $C_i = \{d_{i,1}, d_{i,2}, \ldots\}$ is interpreted manually in turn. For example, if a content semantic conflict is found between keywords $d_{i,x}$ and $d_{i,y}$ ($x \ne y$), the corresponding texts are retrieved from the public text intelligence data set $T = \{t_1, t_2, \ldots, t_m, \ldots, t_M\}$ according to $d_{i,x}$ and $d_{i,y}$, and the conflicting content is determined; otherwise, the texts corresponding to the keywords in component $C_i$ are considered to have no content conflict.
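The judgment itself is manual in the description, so only the retrieval step lends itself to code. The sketch below assumes a reviewer has already flagged a pair of keywords as semantically conflicting and that each text is available as a collection of entries; it returns, for each keyword of the pair, the 1-based positions m of the texts $t_m$ that contain it. The function name and sample texts are hypothetical.

```python
def texts_for_conflict(texts_entries, conflicting_pair):
    """Map each keyword of a flagged pair to the texts that contain it.

    texts_entries: one collection of entries per text, in data-set order.
    conflicting_pair: (d_x, d_y) with d_x != d_y, flagged by a reviewer.
    Returns {keyword: [m, ...]} using 1-based text positions.
    """
    d_x, d_y = conflicting_pair
    hits = {d_x: [], d_y: []}
    for m, entries in enumerate(texts_entries, start=1):
        entry_set = set(entries)
        for d in (d_x, d_y):
            if d in entry_set:
                hits[d].append(m)
    return hits


texts_entries = [
    ["ship", "sunk", "harbor"],
    ["ship", "afloat"],
    ["harbor", "closed"],
]
hits = texts_for_conflict(texts_entries, ("sunk", "afloat"))
```

Returning the matches per keyword keeps the two sides of the conflict apart, so a reviewer can compare the texts behind each claim directly.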
As shown in fig. 2, the present invention further provides a content conflict detection system for public text intelligence, which includes:
a public text information data set establishing module 201, configured to obtain public text information and establish a public text information data set;
a keyword co-occurrence matrix construction module 202, configured to extract keywords of each text in the public text intelligence data set, and construct a keyword co-occurrence matrix;
a binarization processing module 203, configured to perform binarization processing on the keyword co-occurrence matrix to obtain a binarization keyword co-occurrence matrix;
a keyword co-occurrence network establishing module 204, configured to establish a keyword co-occurrence network according to the binarization keyword co-occurrence matrix;
a component extraction module 205, configured to extract components in the keyword co-occurrence network to obtain a component data set;
a conflict judgment module 206, configured to judge each component in the component data set, and judge whether a content conflict exists in the corresponding component; and when the judgment result shows that the corresponding components have content conflict, determining the text with conflict in the public text intelligence data set according to the components with content conflict.
Optionally, the keyword co-occurrence matrix constructing module 202 specifically includes:
the entry division submodule is used for carrying out word segmentation on each text in the public text intelligence data set to obtain an entry set of the text;
the expectation calculation submodule is used for calculating the expectation of the cross information entropy of each entry in the entry set of the text;
the sorting submodule is used for sorting the entries in the entry set of the text in descending order of the expectation of their cross information entropy;
the keyword extraction submodule is used for extracting the first k entries in the ordered entry set as keywords of the text;
the keyword set establishing sub-module is used for establishing a keyword set according to the keywords of each text in the text information data set;
the co-occurrence frequency counting submodule is used for counting the co-occurrence frequency of any two keywords in the keyword set in the same text;
and the keyword co-occurrence matrix establishing submodule is used for establishing the keyword co-occurrence matrix according to the co-occurrence times of any two keywords in the same text.
Optionally, the binarization processing module 203 specifically includes:
a set-to-1 submodule, configured to replace each element of the keyword co-occurrence matrix that is greater than or equal to a set threshold with 1;
and a set-to-0 submodule, configured to replace each element of the keyword co-occurrence matrix that is smaller than the set threshold with 0.
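The thresholding rule can be illustrated with a short sketch. The function name `binarize`, the sample matrix, and the threshold value of 2 are assumptions chosen for the example; the text only states that a set threshold is used.

```python
from typing import List


def binarize(matrix: List[List[int]], threshold: int) -> List[List[int]]:
    """Replace elements >= threshold with 1 and all other elements with 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in matrix]


# Hypothetical co-occurrence counts for three keywords.
cooccurrence = [
    [0, 3, 1],
    [3, 0, 2],
    [1, 2, 0],
]
binary = binarize(cooccurrence, threshold=2)
```

With a positive threshold the zero diagonal stays 0, so no keyword is linked to itself in the resulting network.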
Optionally, the component extraction module 205 specifically includes:
a component extraction submodule, configured to extract the components in the keyword co-occurrence network according to the principle that keywords in the same component are linked by co-occurrence while keywords in different components are not;
and a component data set establishing submodule, configured to combine all the components extracted from the keyword co-occurrence network into a component data set.
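One way to read the extraction principle is as finding the connected components of the keyword co-occurrence network; that reading is an assumption of this sketch, as are the function name `extract_components`, the breadth-first traversal, and the sample data.

```python
from collections import deque
from typing import List


def extract_components(
    keywords: List[str], binary_matrix: List[List[int]]
) -> List[List[str]]:
    """Group keywords into connected components of the 0/1 co-occurrence graph."""
    n = len(keywords)
    seen = [False] * n
    components: List[List[str]] = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        component: List[str] = []
        while queue:
            i = queue.popleft()
            component.append(keywords[i])
            for j in range(n):
                linked = binary_matrix[i][j] == 1 or binary_matrix[j][i] == 1
                if j != i and not seen[j] and linked:
                    seen[j] = True
                    queue.append(j)
        components.append(sorted(component))
    return components


# Hypothetical sample data.
keywords = ["budget", "confirmed", "delayed", "launch", "report"]
binary_matrix = [
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
]
component_data_set = extract_components(keywords, binary_matrix)
```

Under this reading "confirmed" and "delayed" fall into the same component because both co-occur with "launch", even though they never co-occur with each other. A keyword with no links would form a component of its own.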
Optionally, the conflict judgment module 206 specifically includes:
a conflict judgment submodule, configured to judge, for each component in the component data set, whether a content semantic conflict exists in that component;
and a conflict content determining submodule, configured to, when the judgment result shows that a content conflict exists in the component, search the public text information data set for the corresponding texts according to the keywords having the content semantic conflict in the component, and determine the conflicting texts in the public text information data set.
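The lookup performed by the conflict content determining submodule can be sketched as below. The text does not specify how a semantic conflict between keywords is judged, so this example assumes a hypothetical, externally supplied list of mutually opposing keyword pairs (`opposing_pairs`); the function name and the whitespace tokenisation are likewise assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def find_conflicting_texts(
    texts: Sequence[str],
    component: Sequence[str],
    opposing_pairs: Sequence[Tuple[str, str]],
) -> Dict[Tuple[str, str], List[int]]:
    """Map each opposing keyword pair found in the component to the indices
    of the texts that contain at least one keyword of the pair."""
    members = set(component)
    # Stand-in tokenisation: lowercase whitespace split.
    token_sets = [set(text.lower().split()) for text in texts]
    conflicts: Dict[Tuple[str, str], List[int]] = {}
    for a, b in opposing_pairs:
        # Assumption: a component conflicts when it holds both opposing keywords.
        if a in members and b in members:
            conflicts[(a, b)] = [
                i for i, tokens in enumerate(token_sets) if a in tokens or b in tokens
            ]
    return conflicts


# Hypothetical sample data.
texts = [
    "rocket launch delayed delayed",
    "rocket launch confirmed confirmed",
    "budget report budget",
]
component = ["confirmed", "delayed", "launch"]
opposing_pairs = [("confirmed", "delayed")]
conflicts = find_conflicting_texts(texts, component, opposing_pairs)
```

Here the first two texts are returned as the conflicting texts, since one says the launch is delayed and the other says it is confirmed; the third text is untouched because its keywords lie in a different component.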
According to the specific embodiments provided by the invention, the invention achieves the following technical effects:
The invention discloses a content conflict detection method and system for open text information. First, public text information is acquired and a public text information data set is established. Next, the keywords of each text in the public text information data set are extracted and a keyword co-occurrence matrix is constructed; the matrix is binarized to obtain a binarized keyword co-occurrence matrix, from which a keyword co-occurrence network is established. The components in the keyword co-occurrence network are then extracted to obtain a component data set. Finally, each component in the component data set is judged to determine whether a content conflict exists, and the conflicting content is determined. The method detects and judges the content of public text information directly by correlation analysis; it requires neither a structured knowledge base nor a structured description and storage of the public text data. This reduces the amount of computation and overcomes the poor conflict detection accuracy that arises when knowledge base updates cannot keep pace with highly real-time, big-data public text information, so that content conflicts can be detected in public text information with big-data characteristics.
The embodiments in this description are described progressively: each embodiment focuses on its differences from the others, and the parts that are the same or similar may be cross-referenced among the embodiments. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.
Specific examples are used herein to explain the principle and implementation of the present invention. The above description of the embodiments is intended only to aid understanding of the method of the present invention and its core idea. The described embodiments are only some, not all, of the embodiments of the present invention, and all other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

Claims (4)

1. A content conflict detection method for open text information, characterized by comprising the following steps:
acquiring public text information, and establishing a public text information data set, wherein the public text information data set comprises a plurality of texts;
extracting keywords of each text in the public text information data set, and constructing a keyword co-occurrence matrix;
carrying out binarization processing on the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix;
establishing a keyword co-occurrence network according to the binarized keyword co-occurrence matrix;
extracting components in the keyword co-occurrence network to obtain a component data set;
judging, for each component in the component data set, whether a content conflict exists in the component; and, when the judgment result is that a content conflict exists in the component, determining the conflicting text in the public text information data set according to the component having the content conflict; which specifically comprises: judging, for each component in the component data set, whether a content semantic conflict exists in the component; and, when the judgment result is that a content conflict exists in the component, searching the public text information data set for the corresponding text according to the keywords having the content semantic conflict in the component, and determining the conflicting text in the public text information data set;
wherein extracting the keywords of each text in the public text information data set and constructing the keyword co-occurrence matrix specifically comprises: performing word segmentation on each text in the public text information data set to obtain an entry set of the text; calculating the expectation of the cross information entropy of each entry in the entry set of the text; sorting the entries in the entry set of the text in descending order of the expectation of the cross information entropy of each entry; extracting the first k entries of the sorted entry set as the keywords of the text; establishing a keyword set from the keywords of each text in the public text information data set; counting the number of times any two keywords in the keyword set occur together in the same text; and establishing the keyword co-occurrence matrix from the co-occurrence counts of any two keywords in the same text;
extracting components in the keyword co-occurrence network to obtain a component data set, which specifically comprises the following steps: extracting components in the keyword co-occurrence network according to the principle that co-occurrence exists among keywords in the same component and no co-occurrence exists among keywords in different components; and combining all the components in the extracted keyword co-occurrence network into a component data set.
2. The method according to claim 1, wherein the binarizing processing is performed on the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix, and specifically comprises:
replacing elements which are larger than or equal to a set threshold value in the keyword co-occurrence matrix with 1;
and replacing elements smaller than the set threshold value in the keyword co-occurrence matrix with 0.
3. A content conflict detection system for open text information, comprising:
the public text information data set establishing module is used for acquiring public text information and establishing a public text information data set;
the keyword co-occurrence matrix construction module is used for extracting keywords of each text in the public text information data set and constructing a keyword co-occurrence matrix;
the binarization processing module is used for carrying out binarization processing on the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix;
the keyword co-occurrence network establishing module is used for establishing a keyword co-occurrence network according to the binarized keyword co-occurrence matrix;
the component extraction module is used for extracting components in the keyword co-occurrence network to obtain a component data set;
the conflict judgment module is used for judging, for each component in the component data set, whether a content conflict exists in the component, and, when the judgment result is that a content conflict exists in the component, determining the conflicting text in the public text information data set according to the component having the content conflict; the conflict judgment module specifically comprises: a conflict judgment submodule, used for judging, for each component in the component data set, whether a content semantic conflict exists in the component; and a conflict content determining submodule, used for, when the judgment result is that a content conflict exists in the component, searching the public text information data set for the corresponding text according to the keywords having the content semantic conflict in the component, and determining the conflicting text in the public text information data set;
the keyword co-occurrence matrix construction module specifically comprises: an entry division submodule, used for performing word segmentation on each text in the public text information data set to obtain an entry set of the text; an expectation calculation submodule, used for calculating the expectation of the cross information entropy of each entry in the entry set of the text; a sorting submodule, used for sorting the entries in the entry set of the text in descending order of the expectation of the cross information entropy of each entry; a keyword extraction submodule, used for extracting the first k entries of the sorted entry set as the keywords of the text; a keyword set establishing submodule, used for establishing a keyword set from the keywords of each text in the public text information data set; a co-occurrence counting submodule, used for counting the number of times any two keywords in the keyword set occur together in the same text; and a keyword co-occurrence matrix establishing submodule, used for establishing the keyword co-occurrence matrix from the co-occurrence counts of any two keywords in the same text;
the component extraction module specifically comprises: the component extraction submodule is used for extracting components in the keyword co-occurrence network according to the principle that co-occurrence exists among the keywords in the same component and the co-occurrence does not exist among the keywords in different components; and the component data set establishing submodule is used for combining all the components in the extracted keyword co-occurrence network into a component data set.
4. The system according to claim 3, wherein the binarization processing module specifically comprises:
a 1 setting sub-module, configured to replace an element, which is greater than or equal to a set threshold, in the keyword co-occurrence matrix with 1;
and the 0 setting sub-module is used for replacing the elements smaller than the set threshold value in the keyword co-occurrence matrix with 0.
CN201710646040.7A 2017-08-01 2017-08-01 Content conflict detection method and system for open text information Active CN107451120B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710646040.7A CN107451120B (en) 2017-08-01 2017-08-01 Content conflict detection method and system for open text information

Publications (2)

Publication Number Publication Date
CN107451120A CN107451120A (en) 2017-12-08
CN107451120B true CN107451120B (en) 2020-10-30

Family

ID=60490592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710646040.7A Active CN107451120B (en) 2017-08-01 2017-08-01 Content conflict detection method and system for open text information

Country Status (1)

Country Link
CN (1) CN107451120B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6987003B2 (en) * 2018-03-20 2021-12-22 株式会社Screenホールディングス Text mining methods, text mining programs, and text mining equipment
CN110377690B (en) * 2019-06-27 2021-03-16 北京信息科技大学 Information acquisition method and system based on remote relationship extraction
CN110442765B (en) * 2019-07-04 2022-03-11 卓尔智联(武汉)研究院有限公司 Information processing method, device, terminal and storage medium
CN114003785A (en) * 2021-10-29 2022-02-01 奇安信科技集团股份有限公司 Method and device for obtaining threat information based on endogenous security
CN114090781A (en) * 2022-01-20 2022-02-25 北京零点远景网络科技有限公司 Text data-based repulsion event detection method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577404A (en) * 2012-07-19 2014-02-12 中国人民大学 Microblog-oriented discovery method for new emergencies
US9235563B2 (en) * 2009-07-02 2016-01-12 Battelle Memorial Institute Systems and processes for identifying features and determining feature associations in groups of documents
CN106599304A (en) * 2016-12-29 2017-04-26 中南大学 Small and medium-sized website-oriented modularized user retrieval intention modeling method

Also Published As

Publication number Publication date
CN107451120A (en) 2017-12-08

Similar Documents

Publication Publication Date Title
CN107451120B (en) Content conflict detection method and system for open text information
CN110516067B (en) Public opinion monitoring method, system and storage medium based on topic detection
CN106407484B (en) Video tag extraction method based on barrage semantic association
US8285713B2 (en) Image search using face detection
WO2018196561A1 (en) Label information generating method and device for application and storage medium
CN103544255B (en) Text semantic relativity based network public opinion information analysis method
CN103049435B (en) Text fine granularity sentiment analysis method and device
CN110413787B (en) Text clustering method, device, terminal and storage medium
CN104881458B (en) A kind of mask method and device of Web page subject
CN101593200A (en) Chinese Web page classification method based on the keyword frequency analysis
CN112131449A (en) Implementation method of cultural resource cascade query interface based on elastic search
CN104199972A (en) Named entity relation extraction and construction method based on deep learning
CN105279277A (en) Knowledge data processing method and device
CN101079031A (en) Web page subject extraction system and method
CN110738033B (en) Report template generation method, device and storage medium
CN109344298A (en) Method and device for converting unstructured data into structured data
CN102279890A (en) Sentiment word extracting and collecting method based on micro blog
CN107180087B (en) A kind of searching method and device
CN103886020A (en) Quick search method of real estate information
CN114443855A (en) Knowledge graph cross-language alignment method based on graph representation learning
CN107341142B (en) Enterprise relation calculation method and system based on keyword extraction and analysis
CN115618014A (en) Standard document analysis management system and method applying big data technology
Aslam et al. Web-AM: An efficient boilerplate removal algorithm for Web articles
CN110083654A (en) A kind of multi-source data fusion method and system towards science and techniques of defence field
CN104978431B (en) Web data fusion method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant