CN114722160B

CN114722160B - Text data comparison method and device

Info

Publication number: CN114722160B
Application number: CN202210631816.9A
Authority: CN
Inventors: 张万鹏; 张虎; 谷学强; 胡丽; 项凤涛; 王超; 杨景照; 张煜
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2022-06-07
Filing date: 2022-06-07
Publication date: 2022-09-02
Anticipated expiration: 2042-06-07
Also published as: CN114722160A

Abstract

The present application relates to a text data comparison method and device in the technical field of information processing. The method includes acquiring the text data item sets in two data dictionary tables; performing word segmentation on the two text data item sets to obtain a Chinese word set for each element of the two sets; calculating the similarity measure between the elements of the two text data item sets; and preprocessing the similarity measure with a preset similarity ratio threshold to obtain a similarity measure matrix. By abstracting and modeling the dictionary table comparison problem, the comparison of the two text data item sets is transformed into the problem of finding an optimal matching in a bipartite graph, which is solved with the KM algorithm. The method realizes semantics-based automatic comparison of dictionary table data, effectively relieves the workload of manual comparison during data reorganization, and provides a new approach to automated data comparison.

Description

Text data comparison method and device

Technical field

The present application relates to the technical field of information processing, and in particular to a text data comparison method and device.

Background

As the cost of collecting and storing data falls, data volumes are growing explosively. This growth places increasing demands on data association and fusion, which face more and more challenges. Data integration and reorganization, the key bridge between raw data and high-value data, plays an increasingly important role in data-based statistical analysis and has become an increasingly basic yet burdensome part of data processing.

The data dictionary table is the basic data that defines meta-information, such as data items, in current database systems, and it is key to applying and understanding the whole database system. Comparing, associating, and aligning data dictionary tables is therefore of great significance in data integration.

When aggregating and unifying data, comparing and associating heterogeneous databases, or data from different points in time, is a key step in data reorganization, integration, and updating. This is especially important for the data dictionary, which describes meta-information such as the data items and data structures in a database. The industry currently relies on the Extract-Transform-Load (ETL) technology of data warehouses to extract, transform, and fuse heterogeneous data. Existing research has used Python as an intermediate unit to compare table records across heterogeneous databases, which solves data access across different storage databases and comparison at the table-record level. It still does not solve the automatic comparison of the same data entity expressed in different ways, and it lacks semantics-based automated processing.

Summary of the invention

In view of the above technical problems, it is necessary to provide a text data comparison method and device.

A text data comparison method includes:

Acquiring the text data item sets in two data dictionary tables, and performing word segmentation on the two text data item sets to obtain a Chinese word set for each element of the two text data item sets.

Calculating, from the Chinese word set of each element of the two text data item sets, the similarity measure between the elements of the two text data item sets, and preprocessing the similarity measure with a preset similarity ratio threshold to obtain a similarity measure matrix.

Transforming, according to the similarity measure matrix and the two text data item sets, the comparison of the two text data item sets into a matching problem on a weighted bipartite graph.

Solving the weighted bipartite graph matching problem with the KM algorithm to obtain a globally optimal set of matching relationships between the two text data item sets.

In one embodiment, acquiring the text data item sets in two data dictionary tables and performing word segmentation on them to obtain a Chinese word set for each element of the two text data item sets includes:

Acquiring the text data item sets in the two data dictionary tables.

Segmenting the elements of the two text data item sets with a statistics-based word segmentation method to obtain a Chinese word set for each element of the two text data item sets.

In one embodiment, the rows and columns of the similarity measure matrix correspond to the elements of the first text data item set and the elements of the second text data item set, respectively.

Calculating the similarity measure between the elements of the two text data item sets and preprocessing it with the preset similarity ratio threshold to obtain the similarity measure matrix includes:

Calculating, from the Chinese word set of each element of the two text data item sets, the similarity measure between the elements of the two text data item sets.

When the similarity measure between two elements is greater than or equal to the preset similarity ratio threshold, the element at the corresponding position of the similarity measure matrix equals the similarity measure.

When the similarity measure between two elements is less than the preset similarity ratio threshold, the element at the corresponding position of the similarity measure matrix equals 0.

In one embodiment, the similarity measure used when calculating the similarity between the elements of the two text data item sets and building the similarity measure matrix is calculated as:

$$s_{ij} = \frac{|A_i \cap B_j|}{|A_i \cup B_j|}$$

where $s_{ij}$ is the similarity ratio, $A_i$ is the Chinese word set of the $i$-th element of the first text data item set, $B_j$ is the Chinese word set of the $j$-th element of the second text data item set, and $|\cdot|$ counts the number of elements in a set.

According to the sparsity of the similarity measure matrix, the matrix is divided into several mutually independent sub-similarity-measure matrices.

The KM algorithm is applied to each sub-similarity-measure matrix to obtain a globally optimal set of matching relationships between the two text data item sets.

In one embodiment, in the step of solving each sub-similarity-measure matrix with the KM algorithm, the KM algorithm searches for augmenting paths with a depth-first search.


In one embodiment, the data dictionary table includes a data number field, and before the step of solving each sub-similarity-measure matrix with the KM algorithm, the method includes: comparing, according to the data number field, the data dictionary tables as updated before and after a period of time.

A text data comparison device includes:

A comparison data acquisition module, configured to acquire the text data item sets in two data dictionary tables and to perform word segmentation on the two text data item sets to obtain a Chinese word set for each element of the two text data item sets.

A similarity measure matrix determination module, configured to calculate, from the Chinese word set of each element of the two text data item sets, the similarity measure between the elements of the two text data item sets, and to preprocess the similarity measure with a preset similarity ratio threshold to obtain a similarity measure matrix.

A text data comparison result determination module, configured to transform, according to the similarity measure matrix and the two text data item sets, the comparison of the two text data item sets into a matching problem on a weighted bipartite graph, and to solve that matching problem with the KM algorithm to obtain a globally optimal set of matching relationships between the two text data item sets.

In the above text data comparison method and device, the method acquires the text data item sets in two data dictionary tables; performs word segmentation on the two sets to obtain a Chinese word set for each element; calculates the similarity measure between the elements of the two text data item sets; and preprocesses the similarity measure with a preset similarity ratio threshold to obtain a similarity measure matrix. By abstracting and modeling the dictionary table comparison problem, the comparison of the two text data item sets is transformed into the problem of finding an optimal matching in a bipartite graph, which is solved with the KM algorithm. The method realizes semantics-based automatic comparison of dictionary table data, effectively relieves the workload of manual comparison during data reorganization, and provides a new approach to automated data comparison.

Description of drawings

FIG. 1 is a schematic flowchart of a text data comparison method in one embodiment;

FIG. 2 is a schematic diagram of data dictionary table comparison in one embodiment;

FIG. 3 is a schematic diagram of the word segmentation result for Chinese phrase data items in another embodiment;

FIG. 4 is a schematic flowchart of a text data comparison method in one embodiment;

FIG. 5 is a structural block diagram of a text data comparison device in one embodiment.

Detailed description

To make the purpose, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the application and do not limit it.

The text data comparison method provided by this application can be used to compare and analyze the text data items in two data dictionary tables. In essence, the invention takes two data sets that differ to some extent, uses the similarity between data items to construct a mapping from one set to the other, and annotates the data items that have changed. Because the data items in the two sets are not identical, different data values of the same data entity are usually associated through semantic similarity. For example, "广西壮族自治区南宁市" (Nanning City, Guangxi Zhuang Autonomous Region) and "广西南宁市" (Nanning City, Guangxi) are semantically equivalent but differ as data values. However, a data item can also bear some similarity to other, different data entities. For example, "训练一组" (Training Group 1) resembles both "训练二组" (Training Group 2) and "训练三组" (Training Group 3). Constructing the optimal mapping from a global perspective is therefore the key to data dictionary table comparison. For the data items in the two data sets, the invention constructs a measure that characterizes the semantic similarity of data items and uses it to perform an optimal, semantics-based comparison of the data items in the two sets, so that the overall similarity of the comparison result is globally optimal. This provides a relatively novel approach to automatic data dictionary comparison.
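To illustrate why a global view matters, the following minimal sketch contrasts a greedy pairing with an exhaustive search for the best one-to-one assignment. The weight values and the helper names `greedy_matching` and `best_matching` are hypothetical, and the brute-force search over permutations is only a stand-in for the KM algorithm that is practical for tiny square inputs.

```python
from itertools import permutations

# Hypothetical similarity values: rows are old data items, columns are new ones.
weights = [
    [0.9, 0.8, 0.0],
    [0.8, 0.0, 0.0],
    [0.0, 0.5, 0.4],
]


def greedy_matching(w):
    """Repeatedly take the largest remaining weight, using each row and column once."""
    candidates = sorted(
        ((w[i][j], i, j) for i in range(len(w)) for j in range(len(w[0]))),
        reverse=True,
    )
    used_rows, used_cols, result = set(), set(), {}
    for value, i, j in candidates:
        if value > 0 and i not in used_rows and j not in used_cols:
            result[i] = j
            used_rows.add(i)
            used_cols.add(j)
    return result


def best_matching(w):
    """Brute-force the assignment with the largest total weight (small square inputs only)."""
    n = len(w)
    best = max(permutations(range(n)), key=lambda p: sum(w[i][p[i]] for i in range(n)))
    return {i: best[i] for i in range(n) if w[i][best[i]] > 0}


print(greedy_matching(weights))  # {0: 0, 2: 1}: row 1 is left unmatched
print(best_matching(weights))    # {0: 1, 1: 0, 2: 2}: every row is matched
```

In this example the greedy choice of the single strongest pair blocks the only candidate for the second row, whereas the global optimum gives up that pair to obtain a larger total similarity.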

In one embodiment, as shown in FIG. 1, a text data comparison method is provided, which includes the following steps:

Step 100: Acquire the text data item sets in two data dictionary tables, and perform word segmentation on the two text data item sets to obtain a Chinese word set for each element of the two text data item sets.

Specifically, the two data dictionary tables are heterogeneous.

The data dictionary table is the basic data that defines meta-information, such as data items, in current database systems, and it is key to applying and understanding the whole database system.

The data dictionary table may be a unit sequence dictionary table, which mainly contains fields such as unit number, unit name, and level (the unit name field reflects the semantic relationship between the data entity before and after an adjustment). It may also be a department relationship dictionary table, a product type dictionary table, and so on.

To compare text data items in a data dictionary table, such as names, on the basis of semantic similarity, the data items must first be processed with natural language processing methods. The word is the smallest unit that expresses meaning and is the basic operand for constructing a similarity measure between data items, so effective word segmentation of Chinese phrase data items or abbreviated English phrases is a basic part of dictionary data processing. The invention is described mainly with Chinese phrase data items in the data dictionary as the example, using the open Wikipedia corpus and business vocabulary as the Chinese word segmentation dictionary. Either a dictionary-based or a statistics-based word segmentation method may be used.
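As a rough illustration of the dictionary-based option, the sketch below segments a phrase by forward maximum matching. It is a simplified stand-in rather than the segmenter used in the embodiments; the function name `forward_max_match`, the `max_len` parameter, and the tiny `business_dictionary` are assumptions made for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text greedily, taking the longest dictionary word at each position."""
    words, pos = [], 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                pos += size
                break
    return words


# Hypothetical business dictionary for the place-name example.
business_dictionary = {"广西", "壮族", "自治区", "南宁市"}

print(forward_max_match("广西壮族自治区南宁市", business_dictionary))
# ['广西', '壮族', '自治区', '南宁市']
print(forward_max_match("广西南宁市", business_dictionary))
# ['广西', '南宁市']
```

The two resulting word lists share the words "广西" and "南宁市", which is the kind of overlap the later similarity measure relies on.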

Step 102: Calculate, from the Chinese word set of each element of the two text data item sets, the similarity measure between the elements of the two text data item sets, and preprocess the similarity measure with a preset similarity ratio threshold to obtain a similarity measure matrix.

The number of rows of the similarity measure matrix equals the number of elements in the first text data item set, and the number of columns equals the number of elements in the second text data item set.

The element $w_{ij}$ of the similarity measure matrix is the result of applying the preset similarity ratio threshold to the similarity measure between the $i$-th element of the first text data item set and the $j$-th element of the second text data item set.

Step 104: According to the similarity measure matrix and the two text data item sets, transform the comparison of the two text data item sets into a matching problem on a weighted bipartite graph.

Specifically, as shown in FIG. 2, the data dictionary table comparison problem can be described as follows. There are two different sequences of data items, $X$ and $Y$, in which data item $x_i$ and data item $y_j$ are similar with similarity ratio $s_{ij}$. The task is to construct a one-to-one mapping such that each data item in $X$ corresponds to at most one data item in $Y$, each data item in $Y$ is matched by at most one data item in $X$, and the two sequences achieve the greatest overall similarity. From this description, comparing dictionary table data is a typical matching problem on a weighted bipartite graph.
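The matching problem described above can be written compactly as an assignment problem. In the sketch below, $w_{ij}$ denotes the weight taken from the similarity measure matrix, $n$ and $m$ are the numbers of data items in the two sets, and the binary variable $z_{ij}$ (equal to 1 when the $i$-th item of the first set is matched to the $j$-th item of the second) is notation introduced here for illustration only.

```latex
\max_{z} \; \sum_{i=1}^{n} \sum_{j=1}^{m} w_{ij} \, z_{ij}
\quad \text{subject to} \quad
\sum_{j=1}^{m} z_{ij} \le 1 \;\; (i = 1, \dots, n), \qquad
\sum_{i=1}^{n} z_{ij} \le 1 \;\; (j = 1, \dots, m), \qquad
z_{ij} \in \{0, 1\}.
```

The two inequality constraints express the "at most one" conditions on each side, and the objective is the total similarity of the matched pairs.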

Step 106: Solve the weighted bipartite graph matching problem with the KM algorithm to obtain a globally optimal set of matching relationships between the two text data item sets.

Specifically, the KM algorithm is a classic algorithm for finding the optimal matching of a bipartite graph. Its core idea is to adjust the labels $l_x$ and $l_y$ of the vertices so that the maximum-weight matching problem becomes a search for augmenting paths, until the number of matched pairs in the bipartite graph reaches its maximum and a globally optimal set of matching relationships is obtained.

Let the data items of the two dictionary tables form a bipartite graph $G = (V, E)$. The vertex set $V$ consists of the data item sets $X$ and $Y$ of the two dictionary tables, with $V = X \cup Y$ and $X \cap Y = \varnothing$. The edge set $E$ consists of the similarity relationships between set $X$ and set $Y$; the weight of an edge is $w_{ij}$, the similarity between vertex $x_i$ and vertex $y_j$. With these definitions, the data dictionary comparison method using the KM algorithm proceeds as follows:

Step 1: Initialize the label of every element of set $X$ and set $Y$. Set the label of each $y_j$ to $l_y(j) = 0$, and set the label of each $x_i$ to the largest similarity associated with it, that is, $l_x(i) = \max_{1 \le j \le m} w_{ij}$, where $m$ is the number of elements in set $Y$.

Step 2: Starting from an element $x_i$ of set $X$, search the bipartite graph $G$ for an augmenting path $P$ according to the condition $l_x(i) + l_y(j) = w_{ij}$. If an augmenting path $P$ exists, go to step 4; otherwise, go to step 3.

Step 3: When the search for an augmenting path $P$ fails, it yields a set of alternating paths, made up of alternating segments, that satisfy the corresponding label condition. Then go to step 2.

Step 4: Invert the original matching along the relevant segments of the augmenting path $P$ to form the updated matching.

Step 5: Check whether all elements of set $X$ have been traversed. If so, the procedure ends; otherwise, take the next element $x_{i+1}$ of the set and go to step 2.

In the above text data comparison method, the method acquires the text data item sets in two data dictionary tables; performs word segmentation on the two sets to obtain a Chinese word set for each element; calculates the similarity measure between the elements of the two text data item sets; and preprocesses the similarity measure with a preset similarity ratio threshold to obtain a similarity measure matrix. By abstracting and modeling the dictionary table comparison problem, the comparison of the two text data item sets is transformed into the problem of finding an optimal matching in a bipartite graph, which is solved with the KM algorithm. The method realizes semantics-based automatic comparison of dictionary table data, effectively relieves the workload of manual comparison during data reorganization, and provides a new approach to automated data comparison.

In one embodiment, step 100 includes: acquiring the text data item sets in the two data dictionary tables; and segmenting the elements of the two text data item sets with a statistics-based word segmentation method to obtain a Chinese word set for each element of the two text data item sets.

Specifically, the Han Language Processing (HanLP) word segmentation toolkit is used. After a business dictionary has been built, the toolkit's built-in second-order hidden Markov model performs prior learning on the Chinese corpus and segments the text data items. As shown in FIG. 3, the Chinese phrases in the two data items are each segmented into a set of words.

In one embodiment, the rows and columns of the similarity measure matrix correspond to the elements of the first text data item set and the elements of the second text data item set, respectively. Step 102 includes: calculating, from the Chinese word set of each element of the two text data item sets, the similarity measure between the elements of the two text data item sets; when the similarity measure between two elements is greater than or equal to the preset similarity ratio threshold, setting the element at the corresponding position of the similarity measure matrix to the similarity measure; and when the similarity measure between two elements is less than the preset similarity ratio threshold, setting the element at the corresponding position of the similarity measure matrix to 0.
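The thresholding rule of this embodiment can be sketched as follows. The function name `apply_threshold`, the raw similarity values, and the threshold of 0.3 are hypothetical and serve only to show the rule, including the boundary case where a value equals the threshold.

```python
def apply_threshold(raw, threshold):
    """Keep entries that reach the threshold and set all other entries to 0."""
    return [[s if s >= threshold else 0.0 for s in row] for row in raw]


# Rows follow the first data item set, columns follow the second.
raw_similarity = [
    [0.5, 0.25],
    [0.3, 1.0],
]
matrix = apply_threshold(raw_similarity, threshold=0.3)
print(matrix)  # [[0.5, 0.0], [0.3, 1.0]]
```

Zeroing the weak entries is what makes the matrix sparse, which the later embodiments exploit.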

In one embodiment, the similarity measure in step 102 is calculated as:

$$s_{ij} = \frac{|A_i \cap B_j|}{|A_i \cup B_j|} \tag{1}$$

where $s_{ij}$ is the similarity ratio; $A_i$ is the Chinese word set of the $i$-th element of the first text data item set, $i = 1, 2, \dots, n$, and $n$ is the number of elements in the first text data item set; $B_j$ is the Chinese word set of the $j$-th element of the second text data item set, $j = 1, 2, \dots, m$, and $m$ is the number of elements in the second text data item set; and $|\cdot|$ counts the number of elements in a set.

Specifically, one could build word vectors for the words in the two word sets, use the cosine measure between word vectors to select similar words, and then apply the Jaccard algorithm to measure the similarity of Chinese phrases. However, because the text data items in a data dictionary are usually short, concise professional phrases, and in order to reduce computational cost, the similarity measure is simplified to a Jaccard similarity between text data items at word granularity. This characterizes the semantic similarity between data items: the more words the two sets share, the more similar the meanings of the two Chinese phrases; conversely, the fewer words they share, the lower the similarity of their content. On this basis, the similarity measure of Chinese phrases is defined by formula (1) and is called the similarity ratio.

According to formula (1), the more the words of the two word sets overlap, the closer the similarity ratio $s_{ij}$ is to 1; correspondingly, the more the words of the two sets differ, the closer $s_{ij}$ is to 0. The similarity ratio $s_{ij}$ therefore shows a strong positive correlation with the semantic similarity of the two Chinese phrases.
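A word-level Jaccard similarity of this kind takes only a few lines. In the sketch below the function name `similarity_ratio` and the segmentations of the example phrases are assumptions; in particular, splitting "训练一组" into "训练" and "一组" is a hypothetical segmentation chosen for illustration.

```python
def similarity_ratio(words_a, words_b):
    """Jaccard similarity of two word sets: shared words over all distinct words."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Same entity written two ways: 2 shared words out of 4 distinct words.
print(similarity_ratio(["广西", "壮族", "自治区", "南宁市"], ["广西", "南宁市"]))  # 0.5

# Different entities that still share one word out of three.
print(similarity_ratio(["训练", "一组"], ["训练", "二组"]))
```

The second pair shows why a threshold and a global matching are both needed: distinct entities can still obtain a non-zero similarity ratio.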

In one embodiment, as shown in FIG. 4, a text data comparison method is provided, which includes the following steps:

Step 400: Acquire the text data item sets in two data dictionary tables, and perform word segmentation on the two text data item sets to obtain a Chinese word set for each element of the two text data item sets.

Step 402: Calculate, from the Chinese word set of each element of the two text data item sets, the similarity measure between the elements of the two text data item sets, and preprocess the similarity measure with a preset similarity ratio threshold to obtain a similarity measure matrix.

Step 404: According to the sparsity of the similarity measure matrix, divide the matrix into several mutually independent sub-similarity-measure matrices.

The original matrix is divided into mutually independent sub-similarity-measure matrices as follows: 1) traverse the similarity matrix data to be compared; 2) exploiting the sparsity of the matrix, use the regions of the similarity matrix whose entries are all zero as dividing lines; 3) save the result of the division to generate the sub-similarity-measure matrices.

The KM algorithm is then applied to each sub-matrix separately, which completes the comparison of the whole dictionary table.
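One possible way to realize this division is to treat rows and columns as vertices and group those connected through non-zero entries, so that each group is an independent sub-matrix. The sketch below takes that approach; the function name `split_blocks` and the connected-component strategy are assumptions, since the text only states that all-zero regions act as dividing lines.

```python
def split_blocks(matrix):
    """Group rows and columns linked by non-zero entries into independent blocks.

    Returns a list of (row_indices, col_indices) pairs. Rows or columns that
    are entirely zero belong to no block.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    seen_rows, seen_cols, blocks = set(), set(), []
    for start in range(n_rows):
        if start in seen_rows or not any(matrix[start]):
            continue
        rows, cols, stack = [], [], [("row", start)]
        seen_rows.add(start)
        while stack:
            kind, idx = stack.pop()
            if kind == "row":
                rows.append(idx)
                for j in range(n_cols):
                    if matrix[idx][j] and j not in seen_cols:
                        seen_cols.add(j)
                        stack.append(("col", j))
            else:
                cols.append(idx)
                for i in range(n_rows):
                    if matrix[i][idx] and i not in seen_rows:
                        seen_rows.add(i)
                        stack.append(("row", i))
        blocks.append((sorted(rows), sorted(cols)))
    return blocks


sparse = [
    [0.5, 0.0, 0.0, 0.0],
    [0.4, 0.0, 0.0, 0.3],
    [0.0, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(split_blocks(sparse))  # [([0, 1], [0, 3]), ([2], [1])]
```

Each returned block can be solved on its own, which keeps every matching problem small; the all-zero row 3 and column 2 have no candidate partner and are left out.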

步骤406：采用KM算法对每个子相似度量矩阵进行求解，得到两个文本数据项集合之间的一组全局最优的匹配关系。Step 406: Use the KM algorithm to solve each sub-similarity metric matrix, and obtain a set of globally optimal matching relationships between the two text data item sets.

在其中一个实施例中，相似度量矩阵的行和列分别与第一个文本数据项集合中元素和第二个文本数据项集合中元素对应；步骤402包括：根据两个文本数据项集合中每一个元素的中文词语集合，计算两个文本数据项集合的元素之间的相似性度量；当两个文本数据项集合的元素之间的相似性度量大于等于预设相似比阈值时，相似度量矩阵对应位置的元素等于相似性度量；当两个文本数据项集合的元素之间的相似性度量小于预设相似比阈值时，相似度量矩阵对应位置的元素等于0。In one of the embodiments, the rows and columns of the similarity metric matrix correspond to elements in the first text data item set and elements in the second text data item set, respectively; step 402 includes: according to each of the two text data item sets A set of Chinese words for one element, calculate the similarity measure between the elements of the two text data item sets; when the similarity measure between the elements of the two text data item sets is greater than or equal to the preset similarity ratio threshold, the similarity measure matrix The element at the corresponding position is equal to the similarity measure; when the similarity measure between the elements of the two text data item sets is less than the preset similarity ratio threshold, the element at the corresponding position of the similarity measure matrix is equal to 0.

In one embodiment, the KM algorithm in step 406 searches for augmenting paths using a depth-first search algorithm.

Specifically, preprocessing the association relationships between data items with the similarity ratio threshold effectively reduces the association depth between data items. When the KM algorithm then searches for an augmenting path, a depth-first search can therefore be used without incurring excessive recursive calls, which reduces computation and storage consumption.
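A compact sketch of a KM (Kuhn-Munkres) solver with depth-first augmenting-path search is given below for illustration. It is a textbook formulation written under stated assumptions, not the patent's own code: rectangular matrices are padded with zero-weight cells to make them square, and pairs whose weight is zero are dropped from the result because they were filtered out by the threshold. The name `km_match` is hypothetical.

```python
def km_match(weights):
    """Maximum-weight bipartite matching via KM with DFS augmentation.

    Returns sorted (row, col) pairs whose weight in `weights` is non-zero.
    """
    n_rows = len(weights)
    n_cols = len(weights[0]) if weights else 0
    n = max(n_rows, n_cols)
    # Pad to a square matrix; padded cells carry zero weight.
    w = [[weights[i][j] if i < n_rows and j < n_cols else 0.0
          for j in range(n)] for i in range(n)]
    label_x = [max(row) for row in w]   # feasible labels for rows
    label_y = [0.0] * n                 # feasible labels for columns
    match_y = [-1] * n                  # match_y[j] = row matched to column j
    eps = 1e-9

    def dfs(i, seen_x, seen_y, slack):
        """Try to extend an augmenting path from row i along tight edges."""
        seen_x[i] = True
        for j in range(n):
            if seen_y[j]:
                continue
            gap = label_x[i] + label_y[j] - w[i][j]
            if gap < eps:
                seen_y[j] = True
                if match_y[j] == -1 or dfs(match_y[j], seen_x, seen_y, slack):
                    match_y[j] = i
                    return True
            else:
                slack[j] = min(slack[j], gap)
        return False

    for i in range(n):
        while True:
            seen_x = [False] * n
            seen_y = [False] * n
            slack = [float("inf")] * n
            if dfs(i, seen_x, seen_y, slack):
                break
            # No augmenting path yet: relax labels so one more edge becomes tight.
            delta = min(slack[j] for j in range(n) if not seen_y[j])
            for k in range(n):
                if seen_x[k]:
                    label_x[k] -= delta
                if seen_y[k]:
                    label_y[k] += delta
    return sorted((i, j) for j, i in enumerate(match_y)
                  if i < n_rows and j < n_cols and weights[i][j] > 0)
```

For example, `km_match([[0.8, 0.5], [0.9, 0.0]])` pairs row 0 with column 1 and row 1 with column 0 (total weight 1.4), whereas pairing row 0 with its best column first would leave row 1 unmatched. The recursion depth of `dfs` is bounded by the number of rows in the matrix being solved, so smaller, sparser sub-matrices keep it shallow.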

In one embodiment, the similarity measure in step 402 is calculated as shown in formula (1).

In one embodiment, the data dictionary table includes a data number field, and the method includes, before step 406: comparing the versions of the data dictionary table from before and after an update interval according to the data number field.

Specifically, practical dictionary table designs usually include a number field, that is, an identity document (ID) field. Because the number field of a data item is relatively stable, when comparing versions of a data dictionary table from before and after an update interval, the number field can be matched first. This further reduces the computational cost of the algorithm and improves the efficiency of the comparison analysis.
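The ID-first strategy can be sketched as a pre-matching pass that runs before the semantic comparison. This is an assumption-laden illustration: each table version is modelled as a mapping from ID to item text, and items are paired purely on equal IDs. The function name `prematch_by_id` and the sample data are hypothetical.

```python
def prematch_by_id(old_table, new_table):
    """Pair items that share an ID; return the leftovers for semantic matching.

    Both tables map an ID (string) to the text of the data item.
    """
    shared = sorted(old_table.keys() & new_table.keys())
    matched = [(key, old_table[key], new_table[key]) for key in shared]
    old_rest = {k: v for k, v in old_table.items() if k not in new_table}
    new_rest = {k: v for k, v in new_table.items() if k not in old_table}
    return matched, old_rest, new_rest


old_version = {"001": "user number", "002": "order date", "003": "remark"}
new_version = {"001": "user number", "002": "order creation date", "004": "note"}
matched, old_rest, new_rest = prematch_by_id(old_version, new_version)
```

Only `old_rest` and `new_rest` then need to go through word segmentation, similarity calculation and KM matching, which shrinks the similarity matrix.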

It should be understood that although the steps in the flowcharts of FIG. 1 and FIG. 4 are shown in the sequence indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated herein, there is no strict order restriction on these steps, and they may be executed in other orders. Moreover, at least some of the steps in FIG. 1 and FIG. 4 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different times; nor are they necessarily executed one after another, but may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 5, a text data comparison device is provided, comprising a comparison data acquisition module, a similarity measure matrix determination module and a text data comparison result determination module, wherein:

The comparison data acquisition module is configured to acquire the text data item sets in two data dictionary tables and to perform word segmentation on the two text data item sets, obtaining the Chinese word set of each element in the two text data item sets.

The similarity measure matrix determination module is configured to calculate the similarity measure between the elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and to preprocess the similarity measure with a preset similarity ratio threshold to obtain the similarity measure matrix.

The text data comparison result determination module is configured to convert the comparison analysis problem of the two text data item sets into a matching problem on a weighted bipartite graph according to the similarity measure matrix and the two text data item sets, and to solve the matching problem on the weighted bipartite graph with the KM algorithm to obtain a set of globally optimal matching relationships between the two text data item sets.

In one embodiment, the comparison data acquisition module is further configured to acquire the text data item sets in the two data dictionary tables, and to perform word segmentation on the elements of the two text data item sets using a statistics-based word segmentation method, obtaining the Chinese word set of each element in the two text data item sets.

In one embodiment, the rows and columns of the similarity measure matrix correspond to the elements of the first text data item set and the elements of the second text data item set, respectively. The similarity measure matrix determination module is further configured to calculate the similarity measure between the elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets; when the similarity measure between two elements is greater than or equal to the preset similarity ratio threshold, the entry at the corresponding position of the similarity measure matrix equals the similarity measure; when the similarity measure between two elements is less than the preset similarity ratio threshold, the entry at the corresponding position of the similarity measure matrix equals 0.

In one embodiment, the similarity measure in the similarity measure matrix determination module is calculated as shown in formula (1).

For the specific limitations of the text data comparison device, refer to the limitations of the text data comparison method above, which are not repeated here. Each module of the text data comparison device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them to execute the operations corresponding to each module.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.

The embodiments described above represent only several implementations of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims

1. A method of comparing text data, the method comprising:

acquiring text data item sets in two data dictionary tables, and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets;

calculating a similarity measure between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measure through a preset similarity ratio threshold value to obtain a similarity measurement matrix; the similarity measure is calculated by a formula (not reproduced in this text) whose symbols denote, respectively: the similarity ratio; the set of Chinese words included in an element of the first text data item set; the set of Chinese words included in an element of the second text data item set; and an element number counting operation;

according to the similarity measurement matrix and the two text data item sets, converting a comparison analysis problem of the two text data item sets into a matching problem of a weighted bipartite graph;

solving the matching problem of the weighted bipartite graph by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets;

wherein, the rows and columns of the similarity measurement matrix respectively correspond to the elements in the first text data item set and the elements in the second text data item set;

wherein calculating the similarity measure between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measure through a preset similarity ratio threshold value to obtain a similarity measurement matrix, comprises:

calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets;

when the similarity measurement between the elements of the two text data item sets is greater than or equal to a preset similarity ratio threshold value, the elements of the corresponding positions of the similarity measurement matrix are equal to the similarity measurement;

when the similarity measure between the elements of the two text data item sets is smaller than the preset similarity ratio threshold value, the element of the corresponding position of the similarity measure matrix is equal to 0.

2. The method of claim 1, wherein acquiring text data item sets in two data dictionary tables, and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets, comprises:

acquiring text data item sets in two data dictionary tables;

and performing word segmentation processing on the elements in the two text data item sets by adopting a word segmentation method based on statistics to obtain a Chinese word set of each element in the two text data item sets.

3. A method of comparing text data, the method comprising:

calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measurement through a preset similarity ratio threshold value to obtain a similarity measurement matrix;

according to the characteristic of similarity measurement matrix sparsification, the similarity measurement matrix is divided into a plurality of sub-similarity measurement matrixes which are irrelevant to each other;

solving each sub-similarity measurement matrix by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets;

wherein, the rows and columns of the similarity measurement matrix respectively correspond to elements in the first set of text data items and elements in the second set of text data items;

when the similarity measurement between the elements of the two text data item sets is smaller than a preset similarity ratio threshold value, the element of the corresponding position of the similarity measurement matrix is equal to 0;

wherein, in the step of solving each sub-similarity measurement matrix by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets, the KM algorithm searches for augmenting paths using a depth-first search algorithm.

4. The method according to claim 3, wherein, in calculating a similarity measure between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measure through a preset similarity ratio threshold value to obtain a similarity measurement matrix, the similarity measure is calculated by a formula (not reproduced in this text) whose symbols denote, respectively: the similarity ratio; the set of Chinese words included in an element of the first text data item set; the set of Chinese words included in an element of the second text data item set; and an element number counting operation.

5. The method of claim 3, wherein the data dictionary table includes a data number field;

before solving each sub-similarity measurement matrix by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets, the method further comprises:

comparing the data dictionary tables updated before and after a period of time according to the data number field.

6. A text data comparison apparatus, the apparatus comprising:

the comparison data acquisition module is used for acquiring text data item sets in the two data dictionary tables and performing word segmentation processing on the two text data item sets to obtain a Chinese word set of each element in the two text data item sets;

the similarity measurement matrix determining module is used for calculating a similarity measure between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets, and preprocessing the similarity measure through a preset similarity ratio threshold value to obtain a similarity measurement matrix; the similarity measure is calculated by a formula (not reproduced in this text) whose symbols denote, respectively: the similarity ratio; the set of Chinese words included in an element of the first text data item set; the set of Chinese words included in an element of the second text data item set; and an element number counting operation;

the text data comparison result determining module is used for converting a comparison analysis problem of the two text data item sets into a matching problem of the weighted bipartite graph according to the similarity measurement matrix and the two text data item sets; solving the matching problem of the weighted bipartite graph by adopting a KM algorithm to obtain a group of globally optimal matching relations between the two text data item sets;

the similarity measurement matrix determining module is also used for calculating similarity measurement between elements of the two text data item sets according to the Chinese word set of each element in the two text data item sets; when the similarity measurement between the elements of the two text data item sets is greater than or equal to a preset similarity ratio threshold value, the elements at the corresponding positions of the similarity measurement matrix are equal to the similarity measurement; when the similarity measure between the elements of the two text data item sets is smaller than the preset similarity ratio threshold value, the element of the corresponding position of the similarity measure matrix is equal to 0.