CN107451120B - A content conflict detection method and system for public text intelligence - Google Patents
- Publication number
- CN107451120B (CN201710646040.7A)
- Authority
- CN
- China
- Prior art keywords
- keyword
- text
- occurrence
- component
- data set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/041—Abduction
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a content conflict detection method and system for public text intelligence. The method includes: establishing a public text intelligence data set; extracting keywords and constructing a keyword co-occurrence matrix; binarizing the keyword co-occurrence matrix and establishing a keyword co-occurrence network; extracting the components of the keyword co-occurrence network to obtain a component data set; and judging each component to determine whether a content conflict exists and to identify the conflicting content. The method uses association analysis to detect and judge the content of public text intelligence directly, without structured description and storage of the public text data. This reduces the amount of computation and overcomes the technical defect that updates to a structured knowledge base cannot keep pace with highly real-time, big-data public text intelligence, which leads to poor conflict detection accuracy. Content conflict detection for public text intelligence with big-data characteristics is thereby achieved.
Description
Technical Field
The invention relates to the field of public text intelligence applications, and in particular to a content conflict detection method and system for public text intelligence.
Background
Public intelligence, also called open-source intelligence, is intelligence collected and mined from public media (such as newspapers and periodicals, the Internet, and self-media platforms). Its content is mainly unstructured data, including numbers, text, pictures, and video.
Public text intelligence is intelligence data in text format collected and mined from public media (such as newspapers and periodicals, the Internet, and self-media platforms).
A content conflict is a situation in which descriptions of the same subject feature, within the same problem context, are inconsistent or mutually contradictory.
Public text intelligence has relative advantages such as low acquisition cost, broad data sources, and good timeliness, and has wide application value in fields such as military intelligence support and the assessment of enterprise competitive strategy. At the same time, with the progress of self-media technology and the spread of the Internet, public text intelligence exhibits big-data characteristics: data volume grows at a striking rate, data is generated by many sources, and dissemination proceeds over many parallel and intertwined channels. Conflicting content is therefore unavoidable in massive public text intelligence, which makes its analysis and use difficult, and deliberate misinformation by potential competitors aggravates the problem. The first step toward efficient and accurate use of public text intelligence is thus the detection and discovery of conflicting content.
Conflicting content is a key factor limiting the data quality of public text intelligence. If potential conflicts are not detected and eliminated in a timely and effective manner, the results of big-data analysis of public text intelligence become unreliable and their application value falls. At present, content conflict detection for text data mainly targets small and medium-scale data, and is mainly applied to detecting conflicts in metadata or structured data.
For example, Zhang Keren of the 20th Research Institute of China Electronics Technology Group Corporation proposed a method for detecting instruction content conflicts in a network management and control system, comprising the following steps:
1. Identify the instructions of the network management and control system that are mutually exclusive in content;
2. Establish multiple mutually exclusive instruction sets, such that the instructions within each set are mutually exclusive;
3. Set an instruction interval time threshold t;
4. Record the instructions received by the same device within a time period of length t. If two or more of these instructions belong to the same mutually exclusive instruction set, an instruction content conflict has occurred; otherwise, there is no instruction content conflict.
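The four steps of this prior-art check can be sketched in Python as follows. This is only an illustration under stated assumptions: the log format (timestamp, device, instruction), the function name `find_instruction_conflicts`, and the sample instruction names are hypothetical and are not taken from the cited method.

```python
from collections import defaultdict

def find_instruction_conflicts(log, exclusive_sets, t):
    """Return (device, instr_a, instr_b) for mutually exclusive pairs
    received by the same device no more than t time units apart.

    log: iterable of (timestamp, device, instruction) tuples (assumed format).
    exclusive_sets: list of sets of mutually exclusive instructions.
    """
    by_device = defaultdict(list)
    for ts, device, instr in sorted(log):
        by_device[device].append((ts, instr))

    conflicts = []
    for device, events in by_device.items():
        for i, (ts_i, a) in enumerate(events):
            for ts_j, b in events[i + 1:]:
                if ts_j - ts_i > t:
                    break  # events are time-sorted, so later ones are farther away
                if a != b and any(a in s and b in s for s in exclusive_sets):
                    conflicts.append((device, a, b))
    return conflicts

log = [
    (0, "r1", "open_port"),
    (3, "r1", "close_port"),
    (20, "r1", "open_port"),
    (1, "r2", "open_port"),
]
conflicts = find_instruction_conflicts(log, [{"open_port", "close_port"}], t=5)
```

The sketch makes the limitation discussed in this section visible: every conflict must be enumerated in `exclusive_sets` beforehand, so the check depends on a hand-maintained structured rule base.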
As another example, Zhao Xiaofei and Huang Zhiqiu proposed a description-logic-based method for detecting conflicts in CWM (Common Warehouse Metamodel) metadata, comprising the following steps:
1. Establish a description logic DL_id that supports identity constraints on concepts;
2. Use the description logic DL_id to formalize the CWM metadata and build a DL_id knowledge base;
3. Define the set of requirements for a description logic query language;
4. According to these requirements, establish a query language in the following format: [format not extracted]
5. Use nRQL to query the DL_id knowledge base and discover content conflicts.
Existing methods for content conflict detection in text data mainly target small and medium-scale data. Their main characteristics are: (1) a key early step is the structured description and storage of the text data; (2) on the basis of the structured knowledge base, a reasoning mechanism for conflict detection is established, such as mutually exclusive instruction sets or a conflict query language, and content conflict detection is then performed. For public text intelligence with big-data characteristics, these methods have the following defects: (1) the workload of structured description and storage of public text intelligence data becomes enormous and very difficult to carry out; (2) a conflict detection reasoning mechanism built on a structured knowledge base is fixed and inflexible, and when public text intelligence is highly real-time it easily fails to fit new problem contexts; (3) the detected conflicts are at the micro level, that is, conflicts among a small number of texts (usually two), so conflicts at the level of the whole data set are hard to reveal. Existing content conflict detection methods therefore cannot detect content conflicts in public text intelligence with big-data characteristics.
Summary of the Invention
The purpose of the invention is to provide a content conflict detection method and system for public text intelligence, so as to detect content conflicts in public text intelligence with big-data characteristics.
To achieve this purpose, the invention provides the following scheme:
A content conflict detection method for public text intelligence, comprising the following steps:
obtaining public text intelligence and establishing a public text intelligence data set, the data set comprising a plurality of texts;
extracting the keywords of each text in the public text intelligence data set, and constructing a keyword co-occurrence matrix;
binarizing the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix;
establishing a keyword co-occurrence network according to the binarized keyword co-occurrence matrix;
extracting the components of the keyword co-occurrence network to obtain a component data set;
judging each component in the component data set to determine whether a content conflict exists in that component; and, when a content conflict exists in a component, determining the conflicting texts in the public text intelligence data set according to that component.
Optionally, extracting the keywords of each text in the public text intelligence data set and constructing a keyword co-occurrence matrix specifically includes:
performing word segmentation on each text in the public text intelligence data set to obtain the entry set of that text;
calculating the expectation of the cross-information entropy of each entry in the entry set of that text;
sorting the entries in the entry set of that text in descending order of the expectation of their cross-information entropy;
extracting the first k entries of the sorted entry set as the keywords of that text;
establishing a keyword set from the keywords of every text in the public text intelligence data set;
counting the number of times any two keywords in the keyword set occur together in the same text;
establishing the keyword co-occurrence matrix from the number of times each pair of keywords occurs together in the same text.
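The last three sub-steps above (keyword set, pair counting, matrix) can be sketched in Python as follows. This is a minimal illustration, assuming that a keyword counts as occurring in a text when it belongs to that text's extracted keyword set; the function name `cooccurrence_matrix` and the sample keywords are hypothetical.

```python
from itertools import combinations

def cooccurrence_matrix(keyword_sets):
    """Build a symmetric keyword co-occurrence matrix.

    keyword_sets: one set of keywords per text (assumed input).
    Returns (keywords, matrix), where matrix[u][v] is the number of texts
    in which keywords[u] and keywords[v] occur together (u != v).
    """
    keywords = sorted(set().union(*keyword_sets))  # the overall keyword set
    index = {kw: i for i, kw in enumerate(keywords)}
    s = len(keywords)
    matrix = [[0] * s for _ in range(s)]
    for kws in keyword_sets:
        for a, b in combinations(sorted(kws), 2):
            u, v = index[a], index[b]
            matrix[u][v] += 1
            matrix[v][u] += 1
    return keywords, matrix

keywords, matrix = cooccurrence_matrix([
    {"bridge", "river"},
    {"bridge", "river", "city"},
    {"city", "port"},
])
```

The diagonal is left at 0 here because the method only defines counts for pairs of distinct keywords.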
Optionally, binarizing the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix specifically includes:
replacing the elements of the keyword co-occurrence matrix that are greater than or equal to a set threshold with 1;
replacing the elements of the keyword co-occurrence matrix that are smaller than the set threshold with 0.
Optionally, extracting the components of the keyword co-occurrence network to obtain a component data set specifically includes:
extracting the components of the keyword co-occurrence network according to the principle that keywords within the same component co-occur, while keywords in different components do not;
combining all the extracted components of the keyword co-occurrence network into the component data set.
Optionally, judging each component in the component data set to determine whether a content conflict exists in that component, and, when a content conflict exists, determining the conflicting texts in the public text intelligence data set according to that component, specifically includes:
judging each component in the component data set to determine whether a semantic content conflict exists in that component;
when a content conflict exists in a component, retrieving the corresponding texts in the public text intelligence data set using the keywords of that component that are in semantic conflict, and thereby determining the conflicting texts in the public text intelligence data set.
A content conflict detection system for public text intelligence, comprising:
a public text intelligence data set establishment module, configured to obtain public text intelligence and establish a public text intelligence data set;
a keyword co-occurrence matrix construction module, configured to extract the keywords of each text in the public text intelligence data set and construct a keyword co-occurrence matrix;
a binarization module, configured to binarize the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix;
a keyword co-occurrence network establishment module, configured to establish a keyword co-occurrence network according to the binarized keyword co-occurrence matrix;
a component extraction module, configured to extract the components of the keyword co-occurrence network to obtain a component data set;
a conflict judgment module, configured to judge each component in the component data set to determine whether a content conflict exists in that component, and, when a content conflict exists, to determine the conflicting texts in the public text intelligence data set according to that component.
Optionally, the keyword co-occurrence matrix construction module specifically includes:
an entry division submodule, configured to perform word segmentation on each text in the public text intelligence data set to obtain the entry set of that text;
an expectation calculation submodule, configured to calculate the expectation of the cross-information entropy of each entry in the entry set of that text;
a sorting submodule, configured to sort the entries in the entry set of that text in descending order of the expectation of their cross-information entropy;
a keyword extraction submodule, configured to extract the first k entries of the sorted entry set as the keywords of that text;
a keyword set establishment submodule, configured to establish a keyword set from the keywords of every text in the public text intelligence data set;
a co-occurrence counting submodule, configured to count the number of times any two keywords in the keyword set occur together in the same text;
a keyword co-occurrence matrix establishment submodule, configured to establish the keyword co-occurrence matrix from the number of times any two keywords occur together in the same text.
Optionally, the binarization module specifically includes:
a set-to-1 submodule, configured to replace the elements of the keyword co-occurrence matrix that are greater than or equal to a set threshold with 1;
a set-to-0 submodule, configured to replace the elements of the keyword co-occurrence matrix that are smaller than the set threshold with 0.
Optionally, the component extraction module specifically includes:
a component extraction submodule, configured to extract the components of the keyword co-occurrence network according to the principle that keywords within the same component co-occur, while keywords in different components do not;
a component data set establishment submodule, configured to combine all the extracted components of the keyword co-occurrence network into the component data set.
Optionally, the conflict judgment module specifically includes:
a conflict judgment submodule, configured to judge each component in the component data set to determine whether a semantic content conflict exists in that component;
a conflicting content determination submodule, configured to, when a content conflict exists in a component, retrieve the corresponding texts in the public text intelligence data set using the keywords of that component that are in semantic conflict, and thereby determine the conflicting texts in the public text intelligence data set.
According to the specific embodiments provided by the invention, the invention discloses the following technical effects:
The invention discloses a content conflict detection method and system for public text intelligence. First, public text intelligence is obtained and a public text intelligence data set is established. Next, the keywords of each text in the data set are extracted and a keyword co-occurrence matrix is constructed, which is then binarized to obtain a binarized keyword co-occurrence matrix. A keyword co-occurrence network is established from the binarized matrix, and its components are extracted to obtain a component data set. Finally, each component in the component data set is judged to determine whether a content conflict exists and to identify the conflicting content. The method uses association analysis to detect and judge the content of public text intelligence directly. It needs neither a structured knowledge base nor structured description and storage of the public text data, which reduces the amount of computation and overcomes the technical defect that knowledge base updates cannot keep pace with highly real-time, big-data public text intelligence, which leads to poor conflict detection accuracy. Content conflict detection for public text intelligence with big-data characteristics is thereby achieved.
Description of Drawings
To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of the content conflict detection method for public text intelligence provided by the invention.
FIG. 2 is a structural block diagram of the content conflict detection system for public text intelligence provided by the invention.
Detailed Description
The purpose of the invention is to provide a content conflict detection method and system for public text intelligence, so as to detect content conflicts in public text intelligence with big-data characteristics.
To make the above purposes, features, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in FIG. 1, the invention provides a content conflict detection method for public text intelligence, comprising the following steps:
Step 101: obtain public text intelligence and establish a public text intelligence data set comprising a plurality of texts. Specifically, the public text intelligence data set is T = {t_1, t_2, …, t_m, …, t_M}, where t_m is the m-th text in T and M is the total number of texts in T.
Step 102: extract the keywords of each text in the public text intelligence data set, and construct a keyword co-occurrence matrix;
Step 103: binarize the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix;
Step 104: establish a keyword co-occurrence network according to the binarized keyword co-occurrence matrix. Specifically, the network is established by connecting with an edge the pair of keywords corresponding to each element of value 1 in the binarized keyword co-occurrence matrix;
Step 105: extract the components of the keyword co-occurrence network to obtain a component data set;
Step 106: judge each component in the component data set to determine whether a content conflict exists in that component; and, when a content conflict exists, determine the conflicting texts in the public text intelligence data set according to that component.
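Step 104 above, which links the keywords behind each element of value 1, can be sketched in Python as follows. The adjacency-map representation and the name `build_network` are assumptions made for illustration; the method itself does not prescribe a data structure.

```python
def build_network(keywords, binary_matrix):
    """Return an adjacency map {keyword: set of linked keywords}.

    keywords[u] is the keyword for row and column u of binary_matrix.
    """
    network = {kw: set() for kw in keywords}
    for u, row in enumerate(binary_matrix):
        for v, value in enumerate(row):
            if u != v and value == 1:  # skip the diagonal, link on value 1
                network[keywords[u]].add(keywords[v])
                network[keywords[v]].add(keywords[u])
    return network

keywords = ["bridge", "city", "port"]
binary = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
network = build_network(keywords, binary)
```

Keywords with no element of value 1 in their row, such as "port" in this sample, stay in the map with an empty neighbour set, so they remain available to the component extraction in step 105.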
Optionally, step 102, extracting the keywords of each text in the public text intelligence data set and constructing a keyword co-occurrence matrix, specifically includes:
Performing word segmentation on each text in the public text intelligence data set to obtain the entry set of that text. Specifically, the m-th text t_m is segmented to obtain its entry set [formula image not extracted], whose l_m-th element is the l_m-th entry of text t_m, l_m = 1, 2, …, L_m, where L_m is the total number of entries in t_m. For example, segmenting the text "南京市长江大桥" (Nanjing Yangtze River Bridge) can yield {南京, 市长, 江大桥, 南京市, 长江, 大桥, 长江大桥}.
Calculating the expectation of the cross-information entropy of each entry in the entry set of that text.
Specifically, the expectation of the cross-information entropy of an entry is calculated as: [formula image not extracted]
Here the conditional probability term is the probability that text t_m belongs to category c_i given that the entry occurs, and p(c_i) is the probability distribution of the category of text t_m. The expectation reflects the distance between the category distribution of text t_m and the category distribution given that the entry has occurred; the larger its value, the greater the influence of the entry on the category distribution of text t_m.
The expectation of the cross-information entropy is calculated for every entry in text t_m to obtain the entry feature set: [formula image not extracted]
Sorting the entries in the entry set of that text in descending order of the expectation of their cross-information entropy.
Extracting the first k entries of the sorted entry set as the keywords of that text. Specifically, for the m-th text t_m, if L_m ≤ 200, then k = [expression not extracted]; otherwise, k = 10.
Establishing a keyword set from the keywords of every text in the public text intelligence data set. Specifically, the keyword set D_m of the m-th text t_m consists of the first k_m entries of its sorted entry set, where k_m is the value of k used when extracting keywords from the m-th text. The keyword sets of all texts form the keyword set D = D_1 ∪ D_2 ∪ … ∪ D_M = {d_1, d_2, …, d_s}, where s is the number of keywords in D.
Counting the number of times any two keywords in the keyword set occur together in the same text.
Establishing the keyword co-occurrence matrix from the number of times each pair of keywords occurs together in the same text.
Specifically, for each pair of keywords (d_u, d_v) in D, where u = 1, 2, …, s; v = 1, 2, …, s; u ≠ v, the number of texts in which both occur is counted and denoted a_{u,v}. This gives the keyword co-occurrence matrix based on the keyword set D: A = (a_{u,v})_{s×s}.
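The entry-ranking sub-steps of step 102 can be sketched in Python as follows. Because the scoring formula is not reproduced in this text, the sketch uses a common textbook form of expected cross entropy, ECE(w) = p(w) · Σ_i p(c_i|w) · log(p(c_i|w) / p(c_i)), which is an assumption and may differ from the formula intended here. The category-labelled sample texts and the names `expected_cross_entropy` and `top_k_keywords` are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict

def expected_cross_entropy(labeled_docs):
    """Score entries with a textbook expected-cross-entropy formula (assumed).

    labeled_docs: list of (set_of_entries, category) pairs.
    Returns {entry: score}; higher means the entry shifts the category
    distribution more.
    """
    n = len(labeled_docs)
    class_count = Counter(category for _, category in labeled_docs)
    entry_count = Counter()
    entry_class_count = defaultdict(Counter)
    for entries, category in labeled_docs:
        for w in entries:
            entry_count[w] += 1
            entry_class_count[w][category] += 1

    scores = {}
    for w, n_w in entry_count.items():
        total = 0.0
        for category, n_wc in entry_class_count[w].items():
            p_c_given_w = n_wc / n_w          # p(c_i | w)
            p_c = class_count[category] / n   # p(c_i)
            total += p_c_given_w * math.log(p_c_given_w / p_c)
        scores[w] = (n_w / n) * total         # weight by p(w)
    return scores

def top_k_keywords(entries, scores, k):
    """Sort entries by descending score (ties by name) and keep the first k."""
    ranked = sorted(entries, key=lambda w: (-scores.get(w, 0.0), w))
    return ranked[:k]

docs = [
    ({"bridge", "river"}, "infrastructure"),
    ({"bridge", "budget"}, "infrastructure"),
    ({"budget", "election"}, "politics"),
    ({"election", "river"}, "politics"),
]
scores = expected_cross_entropy(docs)
keywords = top_k_keywords({"bridge", "river", "budget"}, scores, k=2)
```

In this sample, "bridge" occurs only in one category and so scores highest, while "river" and "budget" are spread evenly over both categories and score 0.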
Optionally, step 103, binarizing the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix, specifically includes:
Replacing the elements of the keyword co-occurrence matrix that are greater than or equal to a set threshold with 1;
Replacing the elements of the keyword co-occurrence matrix that are smaller than the set threshold with 0.
Specifically, a threshold ε (a positive integer) is set with reference to the number of keywords s in the keyword set D. If a_{u,v} ≥ ε, then a'_{u,v} = 1; otherwise a'_{u,v} = 0. This gives the binarized keyword co-occurrence matrix A' = (a'_{u,v})_{s×s}.
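The thresholding rule above is small enough to sketch directly in Python. The function name `binarize` and the sample matrix are hypothetical, and the threshold is passed in by the caller because the text does not say how ε is derived from s.

```python
def binarize(matrix, epsilon):
    """Map each element to 1 if it is >= epsilon, otherwise to 0."""
    return [[1 if value >= epsilon else 0 for value in row] for row in matrix]

counts = [
    [0, 3, 1],
    [3, 0, 2],
    [1, 2, 0],
]
binary = binarize(counts, epsilon=2)
```

Raising ε removes weak co-occurrence links, which tends to split the later network into more and smaller components.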
Optionally, step 105, extracting the components of the keyword co-occurrence network to obtain a component data set, specifically includes:
Extracting the components of the keyword co-occurrence network according to the principle that keywords within the same component co-occur, while keywords in different components do not. Specifically, keywords that are connected in the keyword co-occurrence network are placed in the same component, and keywords that are not connected are placed in different components, where the i-th component is C_i = {d_{i,1}, d_{i,2}, …};
Combining all the extracted components of the keyword co-occurrence network into the component data set. Specifically, the component data set is {C_1, C_2, …, C_i, …}.
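One way to realise this grouping is to read "component" as a connected component of the network and collect the components with a breadth-first search. That reading, the adjacency-map input, and the name `extract_components` are assumptions made for this sketch.

```python
from collections import deque

def extract_components(network):
    """Return the connected components of an undirected keyword network.

    network: {keyword: set of linked keywords}, with every keyword as a key.
    """
    seen, components = set(), []
    for start in sorted(network):
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in sorted(network[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components

network = {
    "bridge": {"river"},
    "river": {"bridge", "city"},
    "city": {"river"},
    "port": set(),
}
components = extract_components(network)
```

Under this reading, two keywords that never co-occur directly can still share a component when a chain of links joins them, as "bridge" and "city" do here through "river".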
Optionally, step 106, judging each component in the component data set to determine whether a content conflict exists in that component, and, when a content conflict exists, determining the conflicting texts in the public text intelligence data set from the component containing the conflict, specifically includes:
judging each component in the component data set to determine whether a semantic content conflict exists in that component;
when the judgment result is that a content conflict exists in the component, retrieving the corresponding texts in the public text intelligence data set using the keywords in that component that conflict semantically, and thereby determining the conflicting texts in the public text intelligence data set.
Specifically, each component $C_i = \{d_{i,1}, d_{i,2}, \dots\}$ is reviewed manually in turn. If a semantic content conflict is found between keywords $d_{i,x}$ and $d_{i,y}$ ($x \neq y$), the corresponding texts in the public text intelligence data set $T = \{t_1, t_2, \dots, t_m, \dots, t_M\}$ are retrieved using $d_{i,x}$ and $d_{i,y}$, and the conflicting content is determined; otherwise, the texts corresponding to the keywords in component $C_i$ are considered to contain no content conflict.
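The conflict judgment itself is manual, but the retrieval that follows it can be sketched. The example below assumes a reviewer has already flagged a keyword pair and that a text "corresponds" to a keyword when it contains that keyword as a substring. Both the matching rule and the name `retrieve_conflict_texts` are assumptions for illustration.

```python
def retrieve_conflict_texts(texts, keyword_x, keyword_y):
    """Return, for each flagged keyword, the indices of texts containing it.

    texts: the data set T as a list of strings.
    keyword_x, keyword_y: a keyword pair a reviewer judged to be in conflict.
    """
    hits = {keyword_x: [], keyword_y: []}
    for m, text in enumerate(texts):
        for keyword in (keyword_x, keyword_y):
            # Plain substring matching; a real system might match on the
            # segmented terms of each text instead.
            if keyword in text:
                hits[keyword].append(m)
    return hits


texts = [
    "The dam was intact after the flood",
    "Officials said the dam collapsed",
    "Rain is expected tomorrow",
]
hits = retrieve_conflict_texts(texts, "intact", "collapsed")
```

The returned index lists point the reviewer at the texts to compare side by side, here the first and second sample texts.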
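As shown in Figure 2, the present invention also provides a content conflict detection system for public text intelligence, including:
a public text intelligence data set establishment module 201, configured to acquire public text intelligence and establish a public text intelligence data set;
a keyword co-occurrence matrix construction module 202, configured to extract the keywords of each text in the public text intelligence data set and construct a keyword co-occurrence matrix;
a binarization processing module 203, configured to binarize the keyword co-occurrence matrix to obtain a binarized keyword co-occurrence matrix;
a keyword co-occurrence network establishment module 204, configured to establish a keyword co-occurrence network from the binarized keyword co-occurrence matrix;
a component extraction module 205, configured to extract the components of the keyword co-occurrence network to obtain a component data set;
a conflict judgment module 206, configured to judge each component in the component data set to determine whether a content conflict exists in that component, and, when a content conflict exists, to determine the conflicting texts in the public text intelligence data set from the component containing the conflict.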
Optionally, the keyword co-occurrence matrix construction module 202 specifically includes:
a term segmentation submodule, configured to perform word segmentation on each text in the public text intelligence data set to obtain the term set of that text;
an expectation calculation submodule, configured to calculate the expectation of the cross information entropy of each term in the term set of that text;
a sorting submodule, configured to sort the terms in the term set of that text in descending order of the expectation of their cross information entropy;
a keyword extraction submodule, configured to extract the first k terms of the sorted term set as the keywords of that text;
a keyword set establishment submodule, configured to establish a keyword set from the keywords of each text in the text intelligence data set;
a co-occurrence counting submodule, configured to count the number of times any two keywords in the keyword set appear together in the same text;
a keyword co-occurrence matrix establishment submodule, configured to establish the keyword co-occurrence matrix from the number of times any two keywords appear together in the same text.
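The sorting, keyword extraction, and keyword set submodules can be sketched as below. This passage does not define how the expectation of the cross information entropy is computed, so the sketch takes the scoring function as a caller-supplied parameter. The `score` argument, the function names, and the sample scores are all hypothetical stand-ins.

```python
def extract_keywords(terms, score, k):
    """Return the k highest-scoring distinct terms of one text."""
    unique_terms = list(dict.fromkeys(terms))  # drop repeats, keep order
    # Descending sort by score; ties keep their original order.
    ranked = sorted(unique_terms, key=score, reverse=True)
    return ranked[:k]


def build_keyword_set(term_lists, score, k):
    """Extract keywords per text and merge them into one keyword set."""
    per_text = [extract_keywords(terms, score, k) for terms in term_lists]
    keyword_set = list(dict.fromkeys(kw for kws in per_text for kw in kws))
    return per_text, keyword_set


# Hypothetical per-term scores standing in for the entropy expectation.
sample_scores = {"dam": 0.9, "flood": 0.7, "rain": 0.5, "the": 0.1}
per_text, keyword_set = build_keyword_set(
    [["the", "dam", "flood"], ["rain", "the", "flood"]],
    sample_scores.get,
    k=2,
)
```

Keeping the scoring function separate means the ranking and merging logic stays the same whatever entropy formula is plugged in.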
Optionally, the binarization processing module 203 specifically includes:
a set-to-1 submodule, configured to replace each element of the keyword co-occurrence matrix that is greater than or equal to the set threshold with 1;
a set-to-0 submodule, configured to replace each element of the keyword co-occurrence matrix that is smaller than the set threshold with 0.
Optionally, the component extraction module 205 specifically includes:
a component extraction submodule, configured to extract the components of the keyword co-occurrence network according to the principle that keywords within the same component co-occur, while keywords in different components do not;
a component data set establishment submodule, configured to combine all components extracted from the keyword co-occurrence network into the component data set.
Optionally, the conflict judgment module 206 specifically includes:
a conflict judgment submodule, configured to judge each component in the component data set to determine whether a semantic content conflict exists in that component;
a conflicting content determination submodule, configured to, when the judgment result is that a content conflict exists in the component, retrieve the corresponding texts in the public text intelligence data set using the keywords in that component that conflict semantically, and thereby determine the conflicting texts in the public text intelligence data set.
According to the specific embodiments provided by the present invention, the present invention discloses the following technical effects:
The present invention discloses a content conflict detection method and system for public text intelligence. First, public text intelligence is acquired and a public text intelligence data set is established. Next, the keywords of each text in the data set are extracted and a keyword co-occurrence matrix is constructed; the matrix is binarized to obtain a binarized keyword co-occurrence matrix. A keyword co-occurrence network is then established from the binarized matrix, and the components of the network are extracted to obtain a component data set. Finally, each component in the component data set is judged to determine whether a content conflict exists, and the conflicting content is determined. The method uses association analysis to detect and judge the content of public text intelligence directly. It requires neither a structured knowledge base nor a structured description and storage of the public text data, which reduces the amount of computation. It overcomes the technical defect that knowledge base updates cannot keep pace with highly time-sensitive, big-data public text intelligence, which leads to poor content conflict detection accuracy, and it thereby achieves content conflict detection for public text intelligence with big-data characteristics.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the parts that are the same or similar across embodiments may be cross-referenced. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for related details, refer to the description of the method.
Specific examples are used herein to explain the principles and implementations of the invention. The description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710646040.7A CN107451120B (en) | 2017-08-01 | 2017-08-01 | A content conflict detection method and system for public text intelligence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107451120A CN107451120A (en) | 2017-12-08 |
CN107451120B true CN107451120B (en) | 2020-10-30 |
Family
ID=60490592
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710646040.7A Active CN107451120B (en) | 2017-08-01 | 2017-08-01 | A content conflict detection method and system for public text intelligence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107451120B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6987003B2 (en) * | 2018-03-20 | 2021-12-22 | 株式会社Screenホールディングス | Text mining methods, text mining programs, and text mining equipment |
CN110377690B (en) * | 2019-06-27 | 2021-03-16 | 北京信息科技大学 | Information acquisition method and system based on remote relationship extraction |
CN110442765B (en) * | 2019-07-04 | 2022-03-11 | 卓尔智联(武汉)研究院有限公司 | Information processing method, device, terminal and storage medium |
CN114003785B (en) * | 2021-10-29 | 2025-03-18 | 奇安信科技集团股份有限公司 | A method and device for acquiring threat intelligence based on intrinsic security |
CN114090781A (en) * | 2022-01-20 | 2022-02-25 | 北京零点远景网络科技有限公司 | Text data-based repulsion event detection method and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103577404A (en) * | 2012-07-19 | 2014-02-12 | 中国人民大学 | Microblog-oriented discovery method for new emergencies |
US9235563B2 (en) * | 2009-07-02 | 2016-01-12 | Battelle Memorial Institute | Systems and processes for identifying features and determining feature associations in groups of documents |
CN106599304A (en) * | 2016-12-29 | 2017-04-26 | 中南大学 | Small and medium-sized website-oriented modularized user retrieval intention modeling method |
Also Published As
Publication number | Publication date |
---|---|
CN107451120A (en) | 2017-12-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110516067B (en) | Public opinion monitoring method, system and storage medium based on topic detection | |
CN107451120B (en) | A content conflict detection method and system for public text intelligence | |
US8285713B2 (en) | Image search using face detection | |
CN103544255B (en) | Text semantic relativity based network public opinion information analysis method | |
CN104252445B (en) | Approximate repetitive file detection method and device | |
CN108304502B (en) | Method and system for fast hot spot detection based on massive news data | |
WO2019196226A1 (en) | System information querying method and apparatus, computer device, and storage medium | |
CN107193796B (en) | Public opinion event detection method and device | |
CN108268600A (en) | Unstructured Data Management and device based on AI | |
CN113761208A (en) | Scientific and technological innovation information classification method and storage device based on knowledge graph | |
CN105005616B (en) | Method and system are illustrated based on the text that textual image feature interaction expands | |
CN108846117A (en) | The duplicate removal screening technique and device of business news flash | |
Färber et al. | On emerging entity detection | |
CN107918644A (en) | News subject under discussion analysis method and implementation system in reputation Governance framework | |
De Boom et al. | Semantics-driven event clustering in Twitter feeds | |
CN110162632A (en) | A method for discovering special news events | |
CN103678279A (en) | Figure uniqueness recognition method based on heterogeneous network temporal semantic path similarity | |
CN108121806A (en) | One kind is based on the matched image search method of local feature and system | |
CN118194995A (en) | Method and device for acquiring key information of scientific and technological literature in field of global environment | |
TWI793432B (en) | Document management method and system for engineering project | |
CN113282754A (en) | Public opinion detection method, device, equipment and storage medium for news events | |
CN115935953A (en) | False news detection method, device, electronic device and storage medium | |
Wang et al. | Graph-based reference table construction to facilitate entity matching | |
CN107562774A (en) | Generation method, system and the answering method and system of rare foreign languages word incorporation model | |
CN102722526B (en) | Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||