CN107451120B

CN107451120B - A content conflict detection method and system for public text intelligence

Info

Publication number: CN107451120B
Application number: CN201710646040.7A
Authority: CN
Inventors: 李晓军; 姚俊萍; 沈涛; 张锴琦; 王利涛; 马俊春
Original assignee: Rocket Force University of Engineering of PLA
Current assignee: Rocket Force University of Engineering of PLA
Priority date: 2017-08-01
Filing date: 2017-08-01
Publication date: 2020-10-30
Anticipated expiration: 2037-08-01
Also published as: CN107451120A

Abstract

The invention discloses a content conflict detection method and system for public text intelligence. The method includes: establishing a public text intelligence data set; extracting keywords and constructing a keyword co-occurrence matrix; binarizing the keyword co-occurrence matrix and establishing a keyword co-occurrence network from it; extracting the components of the keyword co-occurrence network to obtain a component data set; and judging each component to determine whether a content conflict exists and, if so, which content conflicts. The method uses association analysis to detect and judge the content of public text intelligence directly, without requiring structured description and storage of the public text data. This reduces the amount of computation and overcomes the technical defect of poor content conflict detection accuracy that arises when updates to a structured knowledge base cannot keep pace with highly real-time, big-data public text intelligence. Content conflict detection for public text intelligence with big-data characteristics is thereby realized.

Description

A content conflict detection method and system for public text intelligence

Technical Field

The invention relates to the field of public text intelligence applications, and in particular to a content conflict detection method and system for public text intelligence.

Background

Public intelligence, also known as open-source intelligence, refers to intelligence collected and mined from public media (such as newspapers and periodicals, the Internet, and self-media platforms). Its content is mainly unstructured data, including numbers, text, pictures, and video.

Public text intelligence refers to intelligence data in text format collected and mined from public media (such as newspapers and periodicals, the Internet, and self-media platforms).

Content conflict refers to the situation in which descriptions of the same subject feature, within the same problem context, are inconsistent or mutually contradictory.

Public text intelligence has comparative advantages such as low acquisition cost, a wide range of data sources, and good timeliness, and it has broad application value and benefit in fields such as military intelligence support and the assessment of corporate competitive strategy. At the same time, with the advance of self-media technology and the spread of the Internet, public text intelligence has taken on big-data characteristics: the volume of data grows at an alarming rate, data is generated from multiple sources, and data is disseminated through many parallel and intertwined channels. Conflicting content inevitably exists in this mass of public text intelligence, which makes its analysis and use difficult, and deliberate misinformation from potential competitors aggravates the problem. The first step toward using public text intelligence efficiently and accurately is therefore the detection and discovery of conflicting content.

Conflicting content is a key factor limiting the data quality of public text intelligence. If potential conflicts are not detected and eliminated in a timely and effective manner, the results of big-data analysis of public text intelligence become unreliable and its application value is reduced. At present, content conflict detection for text data mainly targets small-scale and medium-scale data, and is mainly applied to detecting and discovering conflicts in metadata or structured data.

For example, Zhang Keren of the 20th Research Institute of China Electronics Technology Group Corporation proposed a method for detecting instruction content conflicts in a network management and control system, which includes the following steps:

1. Enumerate the instructions of the network management and control system that are mutually exclusive in content;

2. Establish multiple mutually exclusive instruction sets, where the instructions within each set are mutually exclusive;

3. Set an instruction interval time threshold t;

4. Record the instructions received by the same device within a time period of length t. If 2 or more of those instructions belong to the same mutually exclusive instruction set, an instruction content conflict has occurred; otherwise, there is no instruction content conflict.
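The four steps of this prior-art method can be sketched roughly as follows. This is a minimal illustration, not the cited author's implementation: the record layout `(device, timestamp, instruction)`, the function name `find_instruction_conflicts`, and the reading of "within a time period of length t" as "timestamps at most t apart" are all assumptions made for the example.

```python
from collections import defaultdict


def find_instruction_conflicts(records, exclusive_sets, t):
    """Return (device, instr_a, instr_b) triples that conflict.

    records        -- iterable of (device, timestamp, instruction) tuples (assumed layout)
    exclusive_sets -- list of sets; instructions in the same set are mutually exclusive
    t              -- interval threshold: two instructions are compared only if their
                      timestamps are at most t apart (assumed interpretation)
    """
    by_device = defaultdict(list)
    for device, timestamp, instruction in records:
        by_device[device].append((timestamp, instruction))

    conflicts = []
    for device, items in by_device.items():
        items.sort()  # order each device's instructions by timestamp
        for i, (ts_i, instr_a) in enumerate(items):
            for ts_j, instr_b in items[i + 1:]:
                if ts_j - ts_i > t:
                    break  # later instructions are even further away in time
                same_set = any(instr_a in s and instr_b in s for s in exclusive_sets)
                if instr_a != instr_b and same_set:
                    conflicts.append((device, instr_a, instr_b))
    return conflicts


records = [("dev1", 0, "open"), ("dev1", 3, "close"),
           ("dev1", 20, "open"), ("dev2", 1, "open")]
print(find_instruction_conflicts(records, [{"open", "close"}], t=5))
```

The sketch makes the limitation discussed in this section visible: the mutually exclusive sets must be enumerated by hand in advance, which is what becomes impractical for fast-changing, large-scale text data.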

As another example, Zhao Xiaofei and Huang Zhiqiu proposed a description-logic-based method for detecting conflicts in CWM (Common Warehouse Metamodel) metadata, which includes the following steps:

1. Establish a description logic DL_id that supports identity constraints on concepts;

2. Use the description logic DL_id to formalize the CWM metadata and establish a DL_id knowledge base;

3. Define the set of requirements for the description logic query language;

4. According to those requirements, establish a query language in the following format [format missing in source];

5. Use nRQL to query the DL_id knowledge base and discover content conflicts.

Existing methods for content conflict detection in text data mainly target small-scale and medium-scale data. Their main characteristics are: (1) their key steps begin with structured description and storage of the text data; (2) on the basis of the structured knowledge base, they establish a reasoning mechanism for conflict detection, such as mutually exclusive instruction sets or a conflict query language, and then perform content conflict detection. For public text intelligence with big-data characteristics, the existing methods have the following defects: (1) the workload of structured description and storage of public text intelligence data is enormous and becomes very difficult to carry out; (2) a conflict detection reasoning mechanism built on a structured knowledge base is rigid and inflexible, and because big-data public text intelligence is highly real-time, such a mechanism easily fails to adapt to new problem contexts; (3) the detected conflicts are at the micro level, that is, conflicts among a few texts (usually 2, and in any case a small number), so conflicts present at the level of the large data set as a whole are hard to reveal. The existing content conflict detection methods therefore cannot detect content conflicts in public text intelligence with big-data characteristics.

Summary of the Invention

The purpose of the present invention is to provide a content conflict detection method and system for public text intelligence, in order to detect content conflicts in public text intelligence with big-data characteristics.

To achieve the above purpose, the present invention provides the following scheme:

A content conflict detection method for public text intelligence, comprising the following steps:

acquiring public text intelligence and establishing a public text intelligence data set, wherein the public text intelligence data set includes a plurality of texts;

extracting the keywords of each text in the public text intelligence data set, and constructing a keyword co-occurrence matrix;

binarizing the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix;

establishing a keyword co-occurrence network according to the binarized keyword co-occurrence matrix;

extracting the components of the keyword co-occurrence network to obtain a component data set;

judging each component in the component data set to determine whether a content conflict exists in that component; and, when the judgment result is that a content conflict exists in the component, determining the conflicting texts in the public text intelligence data set according to the component in which the content conflict exists.

Optionally, extracting the keywords of each text in the public text intelligence data set and constructing a keyword co-occurrence matrix specifically includes:

performing word segmentation on each text in the public text intelligence data set to obtain the entry set of that text;

calculating the expected cross entropy of each entry in the entry set of that text;

sorting the entries in the entry set of that text in descending order of expected cross entropy;

extracting the first k entries of the sorted entry set as the keywords of that text;

establishing a keyword set from the keywords of each text in the text intelligence data set;

counting the number of times any two keywords in the keyword set appear together in the same text;

establishing the keyword co-occurrence matrix according to the number of times each pair of keywords appears together in the same text.

Optionally, binarizing the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix specifically includes:

replacing the elements of the keyword co-occurrence matrix that are greater than or equal to a set threshold with 1;

replacing the elements of the keyword co-occurrence matrix that are smaller than the set threshold with 0.

Optionally, extracting the components of the keyword co-occurrence network to obtain a component data set specifically includes:

extracting the components of the keyword co-occurrence network according to the principle that keywords in the same component co-occur, while keywords in different components do not co-occur;

combining all components extracted from the keyword co-occurrence network into the component data set.

Optionally, judging each component in the component data set to determine whether a content conflict exists in that component, and, when the judgment result is that a content conflict exists, determining the conflicting texts in the public text intelligence data set according to the component in which the content conflict exists, specifically includes:

judging each component in the component data set to determine whether a semantic content conflict exists in that component;

when the judgment result is that a content conflict exists in the component, retrieving the corresponding texts in the public text intelligence data set according to the keywords in that component that conflict semantically, and thereby determining the conflicting texts in the public text intelligence data set.

A content conflict detection system for public text intelligence, comprising:

a public text intelligence data set establishment module, configured to acquire public text intelligence and establish a public text intelligence data set;

a keyword co-occurrence matrix construction module, configured to extract the keywords of each text in the public text intelligence data set and construct a keyword co-occurrence matrix;

a binarization processing module, configured to binarize the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix;

a keyword co-occurrence network establishment module, configured to establish a keyword co-occurrence network according to the binarized keyword co-occurrence matrix;

a component extraction module, configured to extract the components of the keyword co-occurrence network to obtain a component data set;

a conflict judgment module, configured to judge each component in the component data set to determine whether a content conflict exists in that component, and, when the judgment result is that a content conflict exists in the component, to determine the conflicting texts in the public text intelligence data set according to the component in which the content conflict exists.

Optionally, the keyword co-occurrence matrix construction module specifically includes:

an entry division submodule, configured to perform word segmentation on each text in the public text intelligence data set to obtain the entry set of that text;

an expectation calculation submodule, configured to calculate the expected cross entropy of each entry in the entry set of that text;

a sorting submodule, configured to sort the entries in the entry set of that text in descending order of expected cross entropy;

a keyword extraction submodule, configured to extract the first k entries of the sorted entry set as the keywords of that text;

a keyword set establishment submodule, configured to establish a keyword set from the keywords of each text in the text intelligence data set;

a co-occurrence counting submodule, configured to count the number of times any two keywords in the keyword set appear together in the same text;

a keyword co-occurrence matrix establishment submodule, configured to establish the keyword co-occurrence matrix according to the number of times any two keywords appear together in the same text.

Optionally, the binarization processing module specifically includes:

a set-to-1 submodule, configured to replace the elements of the keyword co-occurrence matrix that are greater than or equal to a set threshold with 1;

a set-to-0 submodule, configured to replace the elements of the keyword co-occurrence matrix that are smaller than the set threshold with 0.

Optionally, the component extraction module specifically includes:

a component extraction submodule, configured to extract the components of the keyword co-occurrence network according to the principle that keywords in the same component co-occur, while keywords in different components do not co-occur;

a component data set establishment submodule, configured to combine all components extracted from the keyword co-occurrence network into the component data set.

Optionally, the conflict judgment module specifically includes:

a conflict judgment submodule, configured to judge each component in the component data set to determine whether a semantic content conflict exists in that component;

a conflicting content determination submodule, configured, when the judgment result is that a content conflict exists in the component, to retrieve the corresponding texts in the public text intelligence data set according to the keywords in that component that conflict semantically, and thereby determine the conflicting texts in the public text intelligence data set.

According to the specific embodiments provided by the present invention, the present invention discloses the following technical effects:

The invention discloses a content conflict detection method and system for public text intelligence. First, public text intelligence is acquired and a public text intelligence data set is established. Next, the keywords of each text in the data set are extracted and a keyword co-occurrence matrix is constructed, and the matrix is binarized to obtain a binarized keyword co-occurrence matrix. A keyword co-occurrence network is then established according to the binarized matrix, and the components of the network are extracted to obtain a component data set. Finally, each component in the component data set is judged to determine whether a content conflict exists, and the conflicting content is determined. The method uses association analysis to detect and judge the content of public text intelligence directly. It requires neither a structured knowledge base nor structured description and storage of the public text data, which reduces the amount of computation and overcomes the technical defect of poor content conflict detection accuracy that arises when knowledge base updates cannot keep pace with highly real-time, big-data public text intelligence. Content conflict detection for public text intelligence with big-data characteristics is thereby realized.

Brief Description of the Drawings

To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of the content conflict detection method for public text intelligence provided by the present invention.

FIG. 2 is a structural block diagram of the content conflict detection system for public text intelligence provided by the present invention.

Detailed Description

The purpose of the present invention is to provide a content conflict detection method and system for public text intelligence, so as to detect content conflicts in public text intelligence with big-data characteristics.

To make the above objects, features, and advantages of the present invention easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.

As shown in FIG. 1, the present invention provides a content conflict detection method for public text intelligence, comprising the following steps:

Step 101: acquire public text intelligence and establish a public text intelligence data set that includes a plurality of texts. Specifically, the public text intelligence data set T is T = {t₁,t₂,…,t_m,…,t_M}, where t_m is the m-th text in T and M is the total number of texts in T.

Step 102: extract the keywords of each text in the public text intelligence data set, and construct a keyword co-occurrence matrix.

Step 103: binarize the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix.

Step 104: establish a keyword co-occurrence network according to the binarized keyword co-occurrence matrix. Specifically, the two keywords corresponding to each element of value 1 in the binarized keyword co-occurrence matrix are connected by a line, which yields the keyword co-occurrence network.

Step 105: extract the components of the keyword co-occurrence network to obtain a component data set.

Step 106: judge each component in the component data set to determine whether a content conflict exists in that component; when the judgment result is that a content conflict exists in the component, determine the conflicting texts in the public text intelligence data set according to the component in which the content conflict exists.

Optionally, extracting the keywords of each text in the public text intelligence data set and constructing a keyword co-occurrence matrix in step 102 specifically includes:

Word segmentation is performed on each text in the public text intelligence data set to obtain the entry set of that text. Specifically, segmenting the m-th text t_m yields the entry set of that text [formula missing in source], whose l_m-th element is the l_m-th entry of t_m, where l_m = 1,2,…,L_m and L_m is the total number of entries in t_m. For example, segmenting the text "南京市长江大桥" (Nanjing Yangtze River Bridge) can yield {南京, 市长, 江大桥, 南京市, 长江, 大桥, 长江大桥}, that is, {Nanjing, mayor, Jiang Daqiao, Nanjing City, Yangtze River, bridge, Yangtze River Bridge}.

Specifically, the expected cross entropy of each entry is calculated as [formula missing in source], where the conditional probability term is the probability that the text t_m belongs to category c_i given that the entry appears, and p(c_i) is the probability distribution of the category of the text t_m. The expected cross entropy reflects the distance between the probability distribution of the category of the text t_m and the probability distribution of the text category given that the entry has appeared. The larger its value, the greater the influence of the entry on the category distribution of the text t_m.
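The formula itself is not reproduced in this text. For orientation only, the expected cross entropy commonly used for feature selection in text classification has the form below. It is offered as an assumption about what the omitted formula resembles, not as the patent's exact expression; the symbol w for an entry and the factor p(w) (the probability that the entry occurs) are part of that assumption.

```latex
\mathrm{ECE}(w) \;=\; p(w) \sum_{i} p(c_i \mid w)\,\log \frac{p(c_i \mid w)}{p(c_i)}
```

The sum is the Kullback-Leibler divergence between the category distribution conditioned on the entry, p(c_i | w), and the prior category distribution p(c_i), which matches the description of a distance between the two distributions that grows with the entry's influence on the category of the text.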

The expected cross entropy of every entry in the text t_m is calculated to obtain the entry feature set [formula missing in source].

The entries are sorted in descending order of expected cross entropy, and the first k entries of the sorted entry set are extracted as the keywords of the text. Specifically, for the m-th text t_m, if L_m ≤ 200, then k is given by [formula missing in source]; otherwise, k = 10.

A keyword set is established from the keywords of each text in the text intelligence data set. Specifically, the keyword set of the m-th text t_m is D_m [formula missing in source], where w₁ is the first entry of t_m after sorting and k_m is the value of k used when extracting keywords from the m-th text. The keyword sets of all texts together form the keyword set D = D₁∪D₂∪…∪D_M = {d₁,d₂,…,d_s}, where s is the number of keywords in the keyword set D of the public text intelligence data set.
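The selection of the top k entries per text and their union into the keyword set D can be sketched as follows. This is an illustrative sketch under stated assumptions: the scores are taken as already computed, the names `top_k_keywords` and `build_keyword_set` are hypothetical, and because the rule for k on short texts is not recoverable from the source, k is supplied by a caller-provided function `k_for_length` rather than hard-coded.

```python
def top_k_keywords(scores, k):
    """Return the k entries with the largest scores, highest first.

    scores -- dict mapping each entry of one text to its expected cross entropy
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [entry for entry, _ in ranked[:k]]


def build_keyword_set(per_text_scores, k_for_length):
    """Return (per-text keyword lists D_m, global keyword list D without duplicates).

    per_text_scores -- list of score dicts, one per text t_m
    k_for_length    -- hypothetical callable mapping L_m (number of entries) to k
    """
    per_text_keywords = [
        top_k_keywords(scores, k_for_length(len(scores)))
        for scores in per_text_scores
    ]
    keyword_set, seen = [], set()
    for keywords in per_text_keywords:
        for keyword in keywords:
            if keyword not in seen:  # D is a union, so each keyword is kept once
                seen.add(keyword)
                keyword_set.append(keyword)
    return per_text_keywords, keyword_set


# Toy scores; k is fixed at 2 here purely for illustration.
scores = [{"a": 0.9, "b": 0.1, "c": 0.5}, {"c": 0.7, "d": 0.8}]
per_text, D = build_keyword_set(scores, lambda length: 2)
print(per_text, D)
```

Keeping D as an ordered list rather than a plain set gives every keyword a stable index, which is convenient when the keywords later label the rows and columns of the co-occurrence matrix.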

A keyword co-occurrence matrix is established according to the number of times each pair of keywords appears together in the same text.

Specifically, for each pair of keywords (d_u, d_v) in the set D, where u = 1,2,…,s; v = 1,2,…,s; u ≠ v, the number of times they appear together in the same text is counted and denoted a_u,v. This yields the keyword co-occurrence matrix based on the keyword set D: A = (a_u,v)_s×s.
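A minimal sketch of this counting step follows. It assumes that each text is available as its list of segmented entries, that a pair is counted at most once per text (so a_u,v is the number of texts containing both keywords), and that the diagonal is left at 0 because the text only defines a_u,v for u ≠ v. The function name `cooccurrence_matrix` is hypothetical.

```python
from itertools import combinations


def cooccurrence_matrix(texts_entries, keyword_set):
    """Build the s x s matrix A where A[u][v] counts texts containing both d_u and d_v.

    texts_entries -- list of entry lists, one per text (assumed input format)
    keyword_set   -- ordered list of the s keywords d_1..d_s
    """
    index = {keyword: i for i, keyword in enumerate(keyword_set)}
    s = len(keyword_set)
    A = [[0] * s for _ in range(s)]
    for entries in texts_entries:
        # Indices of the keywords present in this text, each counted once.
        present = sorted({index[e] for e in entries if e in index})
        for u, v in combinations(present, 2):
            A[u][v] += 1
            A[v][u] += 1  # co-occurrence is symmetric
    return A


texts = [["a", "b", "c"], ["a", "b"], ["c", "d"]]
for row in cooccurrence_matrix(texts, ["a", "b", "c", "d"]):
    print(row)
```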

Optionally, binarizing the keyword co-occurrence matrix in step 103 to obtain a binarized keyword co-occurrence matrix specifically includes:

Specifically, a threshold ε (an integer, ε > 0) is set based on the number of keywords s in the keyword set D. If a_u,v ≥ ε, then a'_u,v = 1; otherwise a'_u,v = 0. This yields the binarized keyword co-occurrence matrix A' = (a'_u,v)_s×s.
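The thresholding rule can be written in a few lines. The sketch below assumes the matrix is a list of lists of integer counts; the function name `binarize` is hypothetical, and how ε should be derived from s is not specified in the text, so ε is simply passed in.

```python
def binarize(A, eps):
    """Return A' with A'[u][v] = 1 where A[u][v] >= eps, else 0.

    eps must be a positive integer, as required for the threshold.
    """
    if not isinstance(eps, int) or eps <= 0:
        raise ValueError("eps must be a positive integer")
    return [[1 if a >= eps else 0 for a in row] for row in A]


A = [[0, 2, 1], [2, 0, 0], [1, 0, 0]]
print(binarize(A, 2))
```

A larger ε keeps only strongly co-occurring keyword pairs and therefore tends to split the later network into more, smaller components; a smaller ε does the opposite.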

Optionally, extracting the components of the keyword co-occurrence network in step 105 to obtain a component data set specifically includes:

The components of the keyword co-occurrence network are extracted according to the principle that keywords in the same component co-occur, while keywords in different components do not co-occur. Specifically, keywords that are connected in the keyword co-occurrence network are placed in the same component, and keywords that are not connected are placed in different components, where the i-th component is C_i = {d_i,1,d_i,2,…}.

All components extracted from the keyword co-occurrence network are combined into the component data set; specifically, the component data set is {C₁,C₂,…,C_i,…}.
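One natural reading of this step, and it is only an assumption, is that the components are the connected components of the keyword co-occurrence network. Under that reading, the network and its components can be obtained from the binarized matrix with a breadth-first search, as sketched below. The function name `extract_components` is hypothetical; isolated keywords end up in singleton components.

```python
from collections import deque


def extract_components(B, keywords):
    """Return the connected components of the network defined by binary matrix B.

    B        -- s x s binarized co-occurrence matrix (1 means the two keywords are linked)
    keywords -- ordered list of the s keywords labelling rows and columns of B
    """
    s = len(keywords)
    # Adjacency list of the keyword co-occurrence network.
    neighbours = {u: [v for v in range(s) if v != u and B[u][v] == 1] for u in range(s)}

    seen = set()
    components = []
    for start in range(s):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            u = queue.popleft()
            component.append(keywords[u])
            for v in neighbours[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(component)
    return components


B = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 0]]
print(extract_components(B, ["a", "b", "c", "d"]))
```

Note that under this reading two keywords in the same component need not co-occur directly; they may be linked only through a chain of intermediate keywords.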

Optionally, judging each component in the component data set in step 106 to determine whether a content conflict exists in that component, and, when the judgment result is that a content conflict exists, determining the conflicting texts in the public text intelligence data set according to the component in which the content conflict exists, specifically includes:

Specifically, each component C_i = {d_i,1,d_i,2,…} is interpreted manually in turn. If a semantic content conflict is found between keywords d_i,x and d_i,y (x ≠ y), the corresponding texts in the public text intelligence data set T = {t₁,t₂,…,t_m,…,t_M} are retrieved according to the keywords d_i,x and d_i,y (x ≠ y), and the conflicting content is determined. Otherwise, the texts corresponding to the keywords in the component C_i are considered to contain no content conflict.
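The judgment itself is manual, but the retrieval that follows it is mechanical. The sketch below shows one way to look up the texts for a keyword pair that a reviewer has flagged. It assumes each text is stored as its list of segmented entries and that the texts containing either keyword are of interest; the function name `texts_for_conflict` and the returned dictionary layout are hypothetical.

```python
def texts_for_conflict(texts_entries, keyword_x, keyword_y):
    """Return, for each flagged keyword, the indices of the texts that contain it.

    texts_entries -- list of entry lists, one per text t_m (assumed input format)
    keyword_x/y   -- the pair of keywords a reviewer judged to be in semantic conflict
    """
    result = {keyword_x: [], keyword_y: []}
    for m, entries in enumerate(texts_entries):
        present = set(entries)
        for keyword in (keyword_x, keyword_y):
            if keyword in present:
                result[keyword].append(m)
    return result


texts = [["launch", "success"], ["launch", "failure"], ["weather"]]
print(texts_for_conflict(texts, "success", "failure"))
```

Grouping the hits by keyword lets the reviewer read the two sides of the suspected contradiction next to each other when confirming the conflicting content.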

As shown in FIG. 2, the present invention also provides a content conflict detection system for public text intelligence, comprising:

a public text intelligence data set establishment module 201, configured to acquire public text intelligence and establish a public text intelligence data set;

a keyword co-occurrence matrix construction module 202, configured to extract the keywords of each text in the public text intelligence data set and construct a keyword co-occurrence matrix;

a binarization processing module 203, configured to binarize the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix;

a keyword co-occurrence network establishment module 204, configured to establish a keyword co-occurrence network according to the binarized keyword co-occurrence matrix;

a component extraction module 205, configured to extract the components of the keyword co-occurrence network to obtain a component data set;

a conflict judgment module 206, configured to judge each component in the component data set to determine whether a content conflict exists in that component, and, when the judgment result is that a content conflict exists in the component, to determine the conflicting texts in the public text intelligence data set according to the component in which the content conflict exists.

Optionally, the keyword co-occurrence matrix construction module 202 specifically includes:

Optionally, the binarization processing module 203 specifically includes:

Optionally, the component extraction module 205 specifically includes:

Optionally, the conflict judgment module 206 specifically includes:

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the parts that are the same or similar across embodiments may be understood by cross-reference. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.

Specific examples are used herein to explain the principles and implementations of the invention. The description of the above embodiments is intended only to aid understanding of the method of the present invention and its core idea. The described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, fall within the protection scope of the present invention.

Claims

1. A content conflict detection method for public text intelligence, characterized by comprising the following steps:

acquiring public text intelligence, and establishing a public text intelligence data set, wherein the public text intelligence data set comprises a plurality of texts;

extracting keywords of each text in the public text intelligence data set, and constructing a keyword co-occurrence matrix;

performing binarization processing on the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix;

establishing a keyword co-occurrence network according to the binarized keyword co-occurrence matrix;

extracting components in the keyword co-occurrence network to obtain a component data set;

judging each component in the component data set to determine whether a content conflict exists in the corresponding component; and, when the judgment result is that a content conflict exists in the corresponding component, determining the conflicting texts in the public text intelligence data set according to the component in which the content conflict exists; which specifically comprises: judging each component in the component data set to determine whether a content semantic conflict exists in the corresponding component; and, when the judgment result is that a content conflict exists in the corresponding component, searching the public text intelligence data set for the corresponding texts according to the keywords having the content semantic conflict in the component, and determining the conflicting texts in the public text intelligence data set;

wherein extracting the keywords of each text in the public text intelligence data set and constructing the keyword co-occurrence matrix specifically comprises: performing word segmentation on each text in the public text intelligence data set to obtain an entry set of the text; calculating the expectation of the cross information entropy of each entry in the entry set of the text; sorting the entries in the entry set of the text in descending order of the expectation of the cross information entropy of each entry; extracting the first k entries in the sorted entry set as the keywords of the text; establishing a keyword set according to the keywords of each text in the public text intelligence data set; counting the number of times any two keywords in the keyword set co-occur in the same text; and establishing the keyword co-occurrence matrix according to the number of times every two keywords co-occur in the same text;

wherein extracting the components in the keyword co-occurrence network to obtain the component data set specifically comprises: extracting the components in the keyword co-occurrence network according to the principle that co-occurrence exists among keywords in the same component and no co-occurrence exists among keywords in different components; and combining all the extracted components of the keyword co-occurrence network into the component data set.
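By way of non-limiting illustration, the component extraction and the subsequent text lookup of claim 1 might be sketched in Python as follows. The sketch assumes that a "component" corresponds to a connected component of the keyword co-occurrence network and finds it by breadth-first search; the function names `extract_components` and `texts_containing` are hypothetical. How a semantic conflict is judged within a component is not shown, so the conflicting keywords are taken as an input.

```python
from collections import deque


def extract_components(keywords, binary_matrix):
    """Group keywords into connected components of the co-occurrence network.

    binary_matrix[i][j] == 1 means keywords[i] and keywords[j] are linked.
    Assumption: isolated keywords are returned as single-keyword components;
    a caller may choose to discard them.
    """
    count = len(keywords)
    seen = [False] * count
    components = []
    for start in range(count):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        component = []
        while queue:
            i = queue.popleft()
            component.append(keywords[i])
            for j in range(count):
                if i != j and binary_matrix[i][j] == 1 and not seen[j]:
                    seen[j] = True
                    queue.append(j)
        components.append(component)
    return components


def texts_containing(conflicting_keywords, keywords_per_text):
    """Return the indices of texts whose keywords include a conflicting keyword."""
    wanted = set(conflicting_keywords)
    return [i for i, kws in enumerate(keywords_per_text) if wanted & set(kws)]
```

For instance, with keywords "launch", "delayed", "on schedule" and "budget", where only "launch" is linked to "delayed" and to "on schedule", the sketch yields one three-keyword component and one single-keyword component.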

2. The method according to claim 1, wherein performing binarization processing on the keyword co-occurrence matrix to obtain the binarized keyword co-occurrence matrix specifically comprises:

replacing elements of the keyword co-occurrence matrix that are greater than or equal to a set threshold with 1; and

replacing elements of the keyword co-occurrence matrix that are smaller than the set threshold with 0.
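By way of non-limiting illustration, the thresholding of claim 2 might be sketched in Python as follows. The function name `binarize` and the list-of-lists matrix representation are assumptions made for this example.

```python
def binarize(matrix, threshold):
    """Map each element to 1 if it is >= threshold, otherwise to 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in matrix]
```

With a threshold of 2, a pair of keywords that co-occurs in only one text is treated as unlinked in the resulting keyword co-occurrence network.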

3. A content conflict detection system for public text intelligence, comprising:

a public text intelligence data set establishment module, configured to acquire public text intelligence and establish a public text intelligence data set;

a keyword co-occurrence matrix construction module, configured to extract keywords of each text in the public text intelligence data set and construct a keyword co-occurrence matrix;

a binarization processing module, configured to perform binarization processing on the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix;

a keyword co-occurrence network establishment module, configured to establish a keyword co-occurrence network according to the binarized keyword co-occurrence matrix;

a component extraction module, configured to extract components in the keyword co-occurrence network to obtain a component data set;

a conflict judgment module, configured to judge each component in the component data set to determine whether a content conflict exists in the corresponding component and, when the judgment result is that a content conflict exists in the corresponding component, to determine the conflicting texts in the public text intelligence data set according to the component in which the content conflict exists; wherein the conflict judgment module specifically comprises: a conflict judgment submodule, configured to judge each component in the component data set to determine whether a content semantic conflict exists in the corresponding component; and a conflict content determination submodule, configured to, when the judgment result is that a content conflict exists in the corresponding component, search the public text intelligence data set for the corresponding texts according to the keywords having the content semantic conflict in the component, and determine the conflicting texts in the public text intelligence data set;

wherein the keyword co-occurrence matrix construction module specifically comprises: a word segmentation submodule, configured to perform word segmentation on each text in the public text intelligence data set to obtain an entry set of the text; an expectation calculation submodule, configured to calculate the expectation of the cross information entropy of each entry in the entry set of the text; a sorting submodule, configured to sort the entries in the entry set of the text in descending order of the expectation of the cross information entropy of each entry; a keyword extraction submodule, configured to extract the first k entries in the sorted entry set as the keywords of the text; a keyword set establishment submodule, configured to establish a keyword set according to the keywords of each text in the public text intelligence data set; a co-occurrence counting submodule, configured to count the number of times any two keywords in the keyword set co-occur in the same text; and a keyword co-occurrence matrix establishment submodule, configured to establish the keyword co-occurrence matrix according to the number of times any two keywords co-occur in the same text;

wherein the component extraction module specifically comprises: a component extraction submodule, configured to extract the components in the keyword co-occurrence network according to the principle that co-occurrence exists among keywords in the same component and no co-occurrence exists among keywords in different components; and a component data set establishment submodule, configured to combine all the extracted components of the keyword co-occurrence network into the component data set.

4. The system according to claim 3, wherein the binarization processing module specifically comprises:

a set-to-1 submodule, configured to replace elements of the keyword co-occurrence matrix that are greater than or equal to a set threshold with 1; and

a set-to-0 submodule, configured to replace elements of the keyword co-occurrence matrix that are smaller than the set threshold with 0.