CN114722160B - Text data comparison method and device - Google Patents
- Publication number
- CN114722160B (application CN202210631816.9A)
- Authority
- CN
- China
- Prior art keywords
- text data
- data item
- item sets
- similarity
- similarity measurement
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present application relates to a text data comparison method and device in the technical field of information processing. The method acquires the text data item sets from two data dictionary tables and performs word segmentation on both sets to obtain a Chinese word set for each element. It then calculates the similarity measure between the elements of the two sets and preprocesses the measures with a preset similarity ratio threshold to obtain a similarity measure matrix. By abstracting and modeling the dictionary table comparison problem, the comparison of the two text data item sets is transformed into the problem of finding an optimal matching in a bipartite graph, which is solved with the KM algorithm. The method achieves semantics-based automatic comparison and analysis of dictionary table data, effectively relieves the burden of manual comparison during data reorganization, and offers a new approach to automated data comparison.
Description
Technical Field
The present application relates to the technical field of information processing, and in particular to a text data comparison method and device.
Background
As the cost of data collection and storage falls, data volume is growing explosively. At the same time, data association and fusion face more and more requirements and challenges. Data integration and reorganization, the key bridge between raw data and high-value data, plays an increasingly important role in data-based statistical analysis and has become an increasingly basic yet heavy task in data processing.
The data dictionary table is the basic data that defines meta-information, such as data items, in current database systems, and it is key information for applying and understanding the whole database system. The comparison, association, and alignment of data dictionary tables is therefore of great significance in data integration.
In the process of aggregating, fusing, and unifying data, comparing and associating heterogeneous databases, or data from different points in time, is a key step in data reorganization, integration, and updating. This is especially important for data dictionaries, which describe meta-information such as the data items and data structures in a database. The industry currently relies on the Extract-Transform-Load (ETL) technology of data warehouses to extract, transform, and fuse heterogeneous data. Existing work uses Python as an intermediate unit to compare the record data of tables in heterogeneous databases, which solves data access across different storage databases and comparison at the table-record level. It still cannot automatically compare the same data entity when that entity is expressed in different ways, and it lacks semantics-based automated processing.
Summary of the Invention
In view of the above technical problems, it is necessary to provide a text data comparison method and device.
A text data comparison method, the method comprising:
Acquiring the text data item sets in two data dictionary tables, and performing word segmentation on the two text data item sets to obtain a Chinese word set for each element in the two sets.
Calculating, from the Chinese word set of each element in the two text data item sets, the similarity measure between the elements of the two sets, and preprocessing the similarity measures with a preset similarity ratio threshold to obtain a similarity measure matrix.
Transforming, according to the similarity measure matrix and the two text data item sets, the comparison of the two sets into a matching problem on a weighted bipartite graph.
Solving the weighted bipartite graph matching problem with the KM algorithm to obtain a globally optimal set of matching relationships between the two text data item sets.
In one embodiment, acquiring the text data item sets in two data dictionary tables and performing word segmentation on them to obtain a Chinese word set for each element includes:
Acquiring the text data item sets in the two data dictionary tables.
Segmenting the elements of the two text data item sets with a statistics-based word segmentation method to obtain a Chinese word set for each element in the two sets.
In one embodiment, the rows and columns of the similarity measure matrix correspond to the elements of the first text data item set and the elements of the second text data item set, respectively.
Calculating the similarity measures and preprocessing them with the preset similarity ratio threshold to obtain the similarity measure matrix includes:
Calculating, from the Chinese word set of each element in the two text data item sets, the similarity measure between the elements of the two sets.
When the similarity measure between two elements is greater than or equal to the preset similarity ratio threshold, the element at the corresponding position of the similarity measure matrix equals the similarity measure.
When the similarity measure between two elements is less than the preset similarity ratio threshold, the element at the corresponding position of the similarity measure matrix equals 0.
In one embodiment, the similarity measure used in the step of calculating the similarity measures and preprocessing them with the preset similarity ratio threshold is calculated as: $r_{ij} = \frac{|A_i \cap B_j|}{|A_i \cup B_j|}$
where $r_{ij}$ is the similarity ratio, $A_i$ is the set of Chinese words contained in the $i$-th element of the first text data item set, $B_j$ is the set of Chinese words contained in the $j$-th element of the second text data item set, and $|\cdot|$ is the operation that counts the elements of a set.
A text data comparison method, the method comprising:
Acquiring the text data item sets in two data dictionary tables, and performing word segmentation on the two text data item sets to obtain a Chinese word set for each element in the two sets.
Calculating, from the Chinese word set of each element in the two text data item sets, the similarity measure between the elements of the two sets, and preprocessing the similarity measures with a preset similarity ratio threshold to obtain a similarity measure matrix.
Splitting the similarity measure matrix, which is sparse after thresholding, into several mutually independent sub-similarity measure matrices.
Solving each sub-similarity measure matrix with the KM algorithm to obtain a globally optimal set of matching relationships between the two text data item sets.
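The splitting step can be read as finding the connected components of the bipartite graph whose edges are the non-zero entries of the thresholded matrix. The following is a minimal sketch under that assumption; the function name `split_blocks` and the sample matrix are hypothetical and do not come from the patent text.

```python
from collections import deque

def split_blocks(matrix):
    """Group rows and columns linked by non-zero entries into independent blocks.

    Returns a list of (row_indices, col_indices) pairs. Rows and columns with
    no non-zero entry are left out because they cannot be matched.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    seen_rows, seen_cols = set(), set()
    blocks = []
    for start in range(n_rows):
        if start in seen_rows or not any(matrix[start]):
            continue
        rows, cols = [], []
        queue = deque([("r", start)])
        seen_rows.add(start)
        while queue:
            kind, idx = queue.popleft()
            if kind == "r":
                rows.append(idx)
                # Follow every non-zero entry of this row to its column.
                for j in range(n_cols):
                    if matrix[idx][j] and j not in seen_cols:
                        seen_cols.add(j)
                        queue.append(("c", j))
            else:
                cols.append(idx)
                # Follow every non-zero entry of this column to its row.
                for i in range(n_rows):
                    if matrix[i][idx] and i not in seen_rows:
                        seen_rows.add(i)
                        queue.append(("r", i))
        blocks.append((sorted(rows), sorted(cols)))
    return blocks

# Hypothetical thresholded similarity matrix.
sparse = [
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
blocks = split_blocks(sparse)
```

Because no non-zero entry links two different blocks, an optimal matching of the whole matrix can be assembled from the optimal matchings of the blocks, and each block is much smaller than the full matrix.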
In one embodiment, the rows and columns of the similarity measure matrix correspond to the elements of the first text data item set and the elements of the second text data item set, respectively.
Calculating the similarity measures and preprocessing them with the preset similarity ratio threshold to obtain the similarity measure matrix includes:
Calculating, from the Chinese word set of each element in the two text data item sets, the similarity measure between the elements of the two sets.
When the similarity measure between two elements is greater than or equal to the preset similarity ratio threshold, the element at the corresponding position of the similarity measure matrix equals the similarity measure.
When the similarity measure between two elements is less than the preset similarity ratio threshold, the element at the corresponding position of the similarity measure matrix equals 0.
In one embodiment, in the step of solving each sub-similarity measure matrix with the KM algorithm to obtain a globally optimal set of matching relationships between the two text data item sets, the KM algorithm searches for augmenting paths with a depth-first search algorithm.
In one embodiment, the similarity measure used in the step of calculating the similarity measures and preprocessing them with the preset similarity ratio threshold is calculated as: $r_{ij} = \frac{|A_i \cap B_j|}{|A_i \cup B_j|}$
where $r_{ij}$ is the similarity ratio, $A_i$ is the set of Chinese words contained in the $i$-th element of the first text data item set, $B_j$ is the set of Chinese words contained in the $j$-th element of the second text data item set, and $|\cdot|$ is the operation that counts the elements of a set.
In one embodiment, the data dictionary table includes a data number field. Before the step of solving each sub-similarity measure matrix with the KM algorithm to obtain a globally optimal set of matching relationships between the two text data item sets, the method includes: comparing, according to the data number field, the versions of the data dictionary table from before and after an update period.
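The text does not detail how the data number field is used, so the following is only one plausible reading: items whose data number appears in both versions are paired directly, and only the remainder would need semantic matching. The function name `diff_by_number` and the sample tables are hypothetical.

```python
def diff_by_number(old_table, new_table):
    """Compare two versions of a dictionary table keyed by data number.

    old_table, new_table: dict mapping data number -> item name.
    Returns (unchanged, renamed, removed, added).
    """
    shared = old_table.keys() & new_table.keys()
    unchanged = sorted(k for k in shared if old_table[k] == new_table[k])
    renamed = {k: (old_table[k], new_table[k])
               for k in shared if old_table[k] != new_table[k]}
    removed = sorted(old_table.keys() - new_table.keys())
    added = sorted(new_table.keys() - old_table.keys())
    return unchanged, renamed, removed, added

# Hypothetical unit-sequence tables before and after an update.
old = {"001": "训练一组", "002": "训练二组", "003": "训练三组"}
new = {"001": "训练一组", "002": "训练第二组", "004": "训练四组"}
unchanged, renamed, removed, added = diff_by_number(old, new)
```

Under this reading, the `removed` and `added` items are the candidates left for the similarity-based matching.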
A text data comparison device, the device comprising:
A comparison data acquisition module, configured to acquire the text data item sets in two data dictionary tables and perform word segmentation on the two sets to obtain a Chinese word set for each element in the two sets.
A similarity measure matrix determination module, configured to calculate, from the Chinese word set of each element in the two text data item sets, the similarity measure between the elements of the two sets, and to preprocess the similarity measures with a preset similarity ratio threshold to obtain a similarity measure matrix.
A text data comparison result determination module, configured to transform, according to the similarity measure matrix and the two text data item sets, the comparison of the two sets into a matching problem on a weighted bipartite graph, and to solve that matching problem with the KM algorithm to obtain a globally optimal set of matching relationships between the two text data item sets.
In the above text data comparison method and device, the method acquires the text data item sets from two data dictionary tables and performs word segmentation on both sets to obtain a Chinese word set for each element. It calculates the similarity measure between the elements of the two sets and preprocesses the measures with a preset similarity ratio threshold to obtain a similarity measure matrix. By abstracting and modeling the dictionary table comparison problem, the comparison of the two text data item sets is transformed into the problem of finding an optimal matching in a bipartite graph, which is solved with the KM algorithm. The method achieves semantics-based automatic comparison and analysis of dictionary table data, effectively relieves the burden of manual comparison during data reorganization, and offers a new approach to automated data comparison.
Description of Drawings
FIG. 1 is a schematic flowchart of a text data comparison method in one embodiment;
FIG. 2 is a schematic diagram of data dictionary table comparison in one embodiment;
FIG. 3 is a schematic diagram of the word segmentation result for the Chinese phrases of data items in another embodiment;
FIG. 4 is a schematic flowchart of a text data comparison method in one embodiment;
FIG. 5 is a structural block diagram of a text data comparison device in one embodiment.
Detailed Description
To make the purpose, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the application and do not limit it.
The text data comparison method provided by the present application can be used to compare and analyze the text data items in two data dictionary tables. In essence, for two data sets that differ to some degree, the invention uses the similarity between data to build a mapping from one set to the other and annotates the data items that have changed. Because the data items in the two sets are not fully identical, different data values of the same data entity are usually associated through semantic similarity. For example, "广西壮族自治区南宁市" (Nanning City, Guangxi Zhuang Autonomous Region) and "广西南宁市" (Nanning City, Guangxi) are semantically equivalent but differ as data values. However, a data item is also somewhat similar to other, different data entities in the same set. For example, "训练一组" (Training Group 1) is similar to both "训练二组" (Training Group 2) and "训练三组" (Training Group 3). Building a mapping that is optimal from a global perspective is therefore the key to comparing data dictionary tables. For the data items in the two data sets, the invention constructs a measure that characterizes their semantic similarity and, based on this measure, performs an optimal comparison of the data items in the two sets by semantic association, so that the overall similarity of the whole comparison result is optimal. This offers a fairly novel approach to the automatic comparison and analysis of data dictionaries.
In one embodiment, as shown in FIG. 1, a text data comparison method is provided, the method including the following steps:
Step 100: Acquire the text data item sets in two data dictionary tables, and perform word segmentation on the two sets to obtain a Chinese word set for each element in the two sets.
Specifically, the two data dictionary tables are heterogeneous.
The data dictionary table is the basic data that defines meta-information, such as data items, in current database systems, and it is key information for applying and understanding the whole database system.
The data dictionary table may be a unit sequence dictionary table, which mainly contains fields such as unit number, unit name, and level (the unit name field reflects the semantic association of the data entity before and after an adjustment). It may also be a department relationship dictionary table, a product type dictionary table, and so on.
To compare text data items such as names in a data dictionary table by semantic similarity, the data items must first be processed with natural language processing methods. The word is the smallest unit that expresses semantics and the basic unit of computation for a similarity measure between data items, so effective word segmentation of Chinese phrase data items, or of abbreviated English phrases, is a basic part of dictionary data processing. The present invention takes the Chinese phrase data items in a data dictionary as its main example and uses the open Wikipedia corpus together with business terms as the Chinese word segmentation dictionary to introduce the proposed data dictionary comparison method. The segmentation method may be either dictionary-based or statistics-based.
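As an illustration of the dictionary-based option, the sketch below uses forward maximum matching, a common dictionary-based segmentation technique that the text does not itself name. The function `forward_max_match` and the tiny vocabulary are hypothetical; a real system would load a much larger segmentation dictionary.

```python
def forward_max_match(text, vocabulary, max_len=None):
    """Segment text greedily, taking the longest vocabulary word at each position."""
    if max_len is None:
        max_len = max((len(w) for w in vocabulary), default=1)
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until a vocabulary word
        # is found; a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                pos += size
                break
    return words

# Hypothetical business vocabulary.
VOCAB = {"广西", "壮族", "自治区", "南宁市"}
segments_full = forward_max_match("广西壮族自治区南宁市", VOCAB)
segments_short = forward_max_match("广西南宁市", VOCAB)
```

With this vocabulary the two place names share the words "广西" and "南宁市", which is the kind of overlap a word-level similarity measure relies on.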
Step 102: Calculate, from the Chinese word set of each element in the two text data item sets, the similarity measure between the elements of the two sets, and preprocess the similarity measures with a preset similarity ratio threshold to obtain a similarity measure matrix.
The number of rows of the similarity measure matrix equals the number of elements in the first text data item set, and the number of columns equals the number of elements in the second text data item set.
The element in row $i$ and column $j$ of the similarity measure matrix is the result of applying the preset similarity ratio threshold to the similarity measure between the $i$-th element of the first text data item set and the $j$-th element of the second text data item set.
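A minimal sketch of this step follows, assuming the similarity measure is the word-level Jaccard ratio that the description names and that entries below the threshold are set to 0. The function names, the sample word sets, and the threshold value 0.4 are hypothetical.

```python
def jaccard(words_a, words_b):
    """Shared words divided by all distinct words of the two items."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_matrix(first, second, threshold):
    """Rows follow `first`, columns follow `second`; both are lists of word lists."""
    matrix = []
    for words_a in first:
        row = []
        for words_b in second:
            score = jaccard(words_a, words_b)
            # Keep the score only when it reaches the preset threshold.
            row.append(score if score >= threshold else 0.0)
        matrix.append(row)
    return matrix

# Hypothetical segmented data items from two dictionary tables.
first = [["广西", "壮族", "自治区", "南宁市"], ["训练", "一组"]]
second = [["广西", "南宁市"], ["训练", "二组"], ["训练", "三组"]]
matrix = similarity_matrix(first, second, threshold=0.4)
```

Thresholding removes weak similarities such as the 1/3 between "训练一组" and "训练二组", which is what makes the resulting matrix sparse.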
Step 104: According to the similarity measure matrix and the two text data item sets, transform the comparison of the two sets into a matching problem on a weighted bipartite graph.
Specifically, the data dictionary table comparison problem, shown in FIG. 2, can be described as follows. There are two different sequences of data items, $X = \{x_1, \dots, x_n\}$ and $Y = \{y_1, \dots, y_m\}$, where data item $x_i$ and data item $y_j$ are similar with similarity ratio $r_{ij}$. The task is to construct a one-to-one mapping such that each data item in $X$ corresponds to at most one data item in $Y$, each data item in $Y$ is the image of at most one data item in $X$, and the two sequences are matched with the greatest overall similarity. From this description, comparing dictionary table data is a typical matching problem on a weighted bipartite graph.
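The problem statement above can be written as a standard assignment-style program. This is a sketch rather than a formula from the source: the similarity ratio of the $i$-th item of the first set and the $j$-th item of the second set is written $r_{ij}$, and the binary indicator $z_{ij}$, which is 1 when the two items are matched, is a hypothetical symbol introduced here.

```latex
\begin{aligned}
\max_{z}\quad & \sum_{i=1}^{n}\sum_{j=1}^{m} r_{ij}\, z_{ij} \\
\text{s.t.}\quad & \sum_{j=1}^{m} z_{ij} \le 1, \qquad i = 1,\dots,n, \\
& \sum_{i=1}^{n} z_{ij} \le 1, \qquad j = 1,\dots,m, \\
& z_{ij} \in \{0, 1\}.
\end{aligned}
```

The two inequality constraints express "at most one partner" for every item on each side, and the objective is the total similarity of the chosen pairs.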
Step 106: Solve the weighted bipartite graph matching problem with the KM algorithm to obtain a globally optimal set of matching relationships between the two text data item sets.
Specifically, the KM algorithm is a classic algorithm for finding the optimal matching of a bipartite graph. Its core idea is to adjust the label values $l(x_i)$ and $l(y_j)$ of the vertices so that the maximum-weight matching problem becomes an augmenting path search problem. The number of matched pairs in the bipartite graph eventually reaches its maximum, which yields a globally optimal set of matching relationships.
Let the data items of the two dictionary tables form a bipartite graph $G = (V, E)$. The vertex set $V$ consists of the data item sets $X$ and $Y$ of the two dictionary tables, with $V = X \cup Y$ and $X \cap Y = \emptyset$. The edge set $E$ consists of the similarity relations between set $X$ and set $Y$, and the weight $w_{ij}$ of an edge is the similarity between vertex $x_i$ and vertex $y_j$. With these definitions, the data dictionary comparison method based on the KM algorithm proceeds as follows:
Step 1: Initialize the label of every element of sets $X$ and $Y$. Set the label of each $y_j$ to $l(y_j) = 0$, and set the label of each $x_i$ to the maximum similarity associated with it, that is, $l(x_i) = \max_{1 \le j \le m} w_{ij}$, where $m$ is the number of elements in set $Y$.
Step 2: Starting from element $x_i$ of set $X$, search the bipartite graph for an augmenting path $P$ along edges that satisfy $l(x_i) + l(y_j) = w_{ij}$. If an augmenting path $P$ exists, go to step 4; otherwise go to step 3.
Step 3: When the search for an augmenting path $P$ fails, the search yields a set of alternating paths, made up of alternating matched and unmatched edges, that satisfy the equality condition. In the standard KM procedure the labels of the vertices on these paths are then adjusted so that at least one new edge satisfies the equality condition. Then go to step 2.
Step 4: Flip the matching state of the edges on the augmenting path $P$, so that matched edges become unmatched and unmatched edges become matched, which forms the new matching.
Step 5: Check whether every element of set $X$ has been processed. If so, the procedure ends; otherwise, take the next element $x_{i+1}$ of the set and go to step 2.
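The five steps can be sketched in Python as below. This is a textbook Kuhn-Munkres implementation with depth-first augmenting path search, not code from the patent. The function name `km_match`, the zero-padding to a square matrix, and the `slack` bookkeeping used for the label adjustment are choices made for this sketch.

```python
def km_match(weights):
    """Maximum-weight bipartite matching (Kuhn-Munkres, DFS augmentation).

    weights: n x m list of non-negative similarities.
    Returns sorted (row, column) pairs whose weight is positive.
    """
    n_rows = len(weights)
    n_cols = len(weights[0]) if weights else 0
    size = max(n_rows, n_cols)
    # Pad with zero-weight edges so that a perfect matching always exists.
    w = [[weights[i][j] if i < n_rows and j < n_cols else 0.0
          for j in range(size)] for i in range(size)]
    label_x = [max(row) for row in w]  # step 1: row labels = row maximum
    label_y = [0.0] * size             # step 1: column labels = 0
    match_y = [-1] * size              # match_y[j] = row matched to column j
    eps = 1e-12

    def dfs(i, seen_x, seen_y, slack):
        seen_x[i] = True
        for j in range(size):
            if seen_y[j]:
                continue
            gap = label_x[i] + label_y[j] - w[i][j]
            if gap < eps:  # edge satisfies the equality condition (step 2)
                seen_y[j] = True
                if match_y[j] == -1 or dfs(match_y[j], seen_x, seen_y, slack):
                    match_y[j] = i  # step 4: flip edges along the path
                    return True
            elif gap < slack[j]:
                slack[j] = gap
        return False

    for i in range(size):  # step 5: process every row in turn
        while True:
            seen_x = [False] * size
            seen_y = [False] * size
            slack = [float("inf")] * size
            if dfs(i, seen_x, seen_y, slack):
                break
            # Step 3: no augmenting path, so adjust labels by the smallest gap.
            delta = min(slack[j] for j in range(size) if not seen_y[j])
            for k in range(size):
                if seen_x[k]:
                    label_x[k] -= delta
                if seen_y[k]:
                    label_y[k] += delta
    return sorted((match_y[j], j) for j in range(n_cols)
                  if 0 <= match_y[j] < n_rows and weights[match_y[j]][j] > 0)

# Hypothetical similarities: greedy pairing would pick (0, 0) and (1, 1).
pairs = km_match([[0.9, 0.8, 0.0],
                  [0.8, 0.1, 0.0]])
```

In this example the globally optimal matching pairs row 0 with column 1 and row 1 with column 0 (total 1.6), whereas taking the single largest entry first would give a total of only 1.0. Pairs with zero weight are dropped from the result because a thresholded similarity of 0 means the items are not considered similar.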
上述文本数据比较方法中,所述方法包括获取两个数据字典表中的文本数据项集合,并对两个文本数据项集合进行分词处理,得到两个文本数据项集合中每一个元素的中文词语集合,计算两个文本数据项集合的元素之间的相似性度量,并通过预设相似比阈值对相似性度量进行预处理,得到相似度量矩阵,通过对字典表对比分析问题的抽象和建模,将两个文本数据项集合比对分析问题转化为一个二分图寻求最优匹配方案的问题,并利用KM算法对该问题进行求解。本方法实现了基于语义的字典表数据自动比对分析,有效的缓解了数据整编过程中依靠人工进行比对的工作压力,为数据对比自动化处理提供了一种新思路。In the above text data comparison method, the method includes acquiring text data item sets in two data dictionary tables, and performing word segmentation processing on the two text data item sets to obtain the Chinese words of each element in the two text data item sets. Set, calculate the similarity measure between the elements of two sets of text data items, and preprocess the similarity measure through a preset similarity ratio threshold to obtain a similarity measure matrix, and analyze the abstraction and modeling of the problem by comparing the dictionary table , transforming the comparative analysis problem of two sets of text data items into a bipartite graph to find the optimal matching solution, and using the KM algorithm to solve the problem. The method realizes automatic comparison and analysis of dictionary table data based on semantics, effectively relieves the work pressure of manual comparison in the process of data reorganization, and provides a new idea for automatic data comparison processing.
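In one embodiment, step 100 includes: acquiring the text data item sets in two data dictionary tables; and segmenting the elements of the two text data item sets with a statistics-based word segmentation method to obtain a Chinese word set for each element of the two text data item sets.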
Specifically, the HanLP (Han Language Processing) word segmentation toolkit is used: after a business dictionary is built, its built-in second-order hidden Markov model performs prior learning on the Chinese corpus and segments the text data items. As shown in FIG. 3, the Chinese phrases in the two data items are segmented into two word sets.
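In one embodiment, the rows and columns of the similarity measure matrix correspond to the elements of the first text data item set and the elements of the second text data item set, respectively. Step 102 includes: calculating the similarity measure between elements of the two text data item sets from the Chinese word set of each element; when the similarity measure between two elements is greater than or equal to the preset similarity ratio threshold, the entry at the corresponding position of the similarity measure matrix equals the similarity measure; when the similarity measure is less than the preset similarity ratio threshold, the entry equals 0.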
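The thresholding rule of this embodiment can be sketched in a few lines. The function name `threshold_matrix` and the parameter name `tau` are hypothetical, and the raw similarity values are assumed to have been computed already.

```python
def threshold_matrix(raw, tau):
    """Keep entries >= tau, set everything else to 0.

    raw[i][j] is the similarity measure between element i of the first
    text data item set and element j of the second set.
    """
    return [[s if s >= tau else 0.0 for s in row] for row in raw]


raw = [[0.8, 0.2],
       [0.5, 0.49]]
print(threshold_matrix(raw, 0.5))
```

Entries equal to the threshold are kept, matching the "greater than or equal to" condition in the text.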
In one embodiment, the similarity measure in step 102 is calculated as:

$$ r_{ij} = \frac{|A_i \cap B_j|}{|A_i \cup B_j|} \qquad (1) $$

where $r_{ij}$ is the similarity ratio; $A_i$ is the set of Chinese words contained in the $i$-th element of the first text data item set, $i = 1, 2, \ldots, n$, and $n$ is the number of elements in the first text data item set; $B_j$ is the set of Chinese words contained in the $j$-th element of the second text data item set, $j = 1, 2, \ldots, m$, and $m$ is the number of elements in the second text data item set; and $|\cdot|$ is the element-count operation.
Specifically, word vectors can be built for the words in the two word sets, similar words can be filtered with the cosine measure between word vectors, and the Jaccard algorithm can then be used to measure the similarity of Chinese phrases. However, because text data items in a data dictionary are usually short, concise professional phrases, and in order to reduce computational cost, the similarity measure is simplified to a word-level Jaccard similarity between text data items, which characterizes the semantic similarity between data items: the more words the two sets share, the more similar the semantics of the two Chinese phrases; conversely, the fewer words they share, the less similar the content the two phrases express. On this basis, the similarity measure of Chinese phrases is defined by formula (1) and is called the similarity ratio.
According to formula (1), the more the words of the two word sets overlap, the closer the similarity ratio is to 1; correspondingly, the more the words of the two sets differ, the closer the similarity ratio is to 0. The similarity ratio is therefore strongly positively correlated with the semantic similarity of the two Chinese phrases.
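As a small illustration of the word-level Jaccard measure discussed above, the sketch below computes the ratio of shared words to all distinct words for two segmented phrases. The function name `similarity_ratio` and the sample word lists are hypothetical, and returning 0 for two empty sets is an assumption the text does not address.

```python
def similarity_ratio(words_a, words_b):
    """Word-level Jaccard similarity between two segmented phrases."""
    a, b = set(words_a), set(words_b)
    union = a | b
    if not union:
        # Assumption: two empty word sets are treated as dissimilar.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical segmentation results for two short data item names.
print(similarity_ratio(["设备", "名称"], ["设备", "编号"]))  # one shared word out of three
print(similarity_ratio(["设备", "名称"], ["设备", "名称"]))  # identical sets
print(similarity_ratio(["设备", "名称"], ["人员", "编号"]))  # no shared words
```

The three calls cover the partially overlapping, identical, and disjoint cases, which correspond to a ratio between 0 and 1, exactly 1, and exactly 0.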
In one embodiment, as shown in FIG. 4, a text data comparison method is provided, and the method includes the following steps:
Step 400: Acquire the text data item sets in two data dictionary tables, and segment the two text data item sets into words to obtain a Chinese word set for each element of the two text data item sets.
Step 402: From the Chinese word set of each element of the two text data item sets, calculate the similarity measure between elements of the two sets, and preprocess the similarity measure with a preset similarity ratio threshold to obtain a similarity measure matrix.
Step 404: Based on the sparsity of the similarity measure matrix, split the similarity measure matrix into several mutually independent sub-similarity-measure matrices.
Based on the sparsity of the similarity measure matrix, the original matrix is split into several mutually independent sub-similarity-measure matrices. The specific steps are: 1) traverse the similarity matrix data to be compared; 2) exploiting the sparsity, use the regions of the similarity matrix whose entries are all zero as matrix dividing lines; 3) save the split result to generate the sub-similarity-measure matrices.
The KM algorithm is then applied to each sub-matrix separately, which completes the comparison and analysis of the entire dictionary table.
Step 406: Solve each sub-similarity-measure matrix with the KM algorithm to obtain a set of globally optimal matching relationships between the two text data item sets.
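One way to picture the splitting in step 404 is the sketch below. It assumes that "mutually independent" means no nonzero entry links a row of one block to a column of another, and it finds the blocks as connected components of the row/column graph with a breadth-first search. This is a swapped-in technique, not the dividing-line scan the text describes, and the function name `split_blocks` is hypothetical.

```python
from collections import deque


def split_blocks(matrix):
    """Group rows and columns into blocks linked by nonzero entries.

    Returns a list of (row_indices, col_indices) pairs. Rows that are
    entirely zero belong to no block.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    seen_rows, seen_cols = set(), set()
    blocks = []
    for start in range(n_rows):
        if start in seen_rows or not any(matrix[start]):
            continue
        rows, cols = [], []
        seen_rows.add(start)
        queue = deque([("r", start)])
        while queue:
            kind, idx = queue.popleft()
            if kind == "r":
                rows.append(idx)
                # Follow nonzero entries of this row to new columns.
                for j in range(n_cols):
                    if matrix[idx][j] and j not in seen_cols:
                        seen_cols.add(j)
                        queue.append(("c", j))
            else:
                cols.append(idx)
                # Follow nonzero entries of this column to new rows.
                for i in range(n_rows):
                    if matrix[i][idx] and i not in seen_rows:
                        seen_rows.add(i)
                        queue.append(("r", i))
        blocks.append((sorted(rows), sorted(cols)))
    return blocks


matrix = [[0.9, 0.0, 0.0],
          [0.6, 0.7, 0.0],
          [0.0, 0.0, 0.8],
          [0.0, 0.0, 0.0]]
print(split_blocks(matrix))
```

Each returned block can then be solved on its own, so the cost of the matching step depends on the size of the largest block rather than on the full matrix.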
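In one embodiment, the rows and columns of the similarity measure matrix correspond to the elements of the first text data item set and the elements of the second text data item set, respectively. Step 402 includes: calculating the similarity measure between elements of the two text data item sets from the Chinese word set of each element; when the similarity measure between two elements is greater than or equal to the preset similarity ratio threshold, the entry at the corresponding position of the similarity measure matrix equals the similarity measure; when the similarity measure is less than the preset similarity ratio threshold, the entry equals 0.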
In one embodiment, the KM algorithm in step 406 searches for augmenting paths with a depth-first search algorithm.
Specifically, preprocessing the associations between data items with the similarity ratio threshold effectively reduces the depth of association between data items. When the KM algorithm searches for an augmenting path, a depth-first search can therefore be used, which avoids excessive recursive calls and thus reduces computation and storage consumption.
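A compact sketch of a KM (Kuhn-Munkres) solver with a depth-first augmenting-path search is given below for illustration. It is a textbook formulation, not the patent's code. The function name `km_match`, the zero-padding of rectangular matrices, the floating-point tolerance `eps`, and the decision to drop zero-weight pairs from the result are all assumptions.

```python
def km_match(weights):
    """Maximum-weight matching on a rows x cols matrix of non-negative weights.

    Returns sorted (row, col) pairs; pairs whose weight is 0 are dropped.
    """
    n_rows = len(weights)
    n_cols = len(weights[0]) if weights else 0
    n = max(n_rows, n_cols)
    # Pad to a square matrix with zero-weight entries.
    w = [[weights[i][j] if i < n_rows and j < n_cols else 0.0
          for j in range(n)] for i in range(n)]
    lx = [max(row) for row in w]  # labels of left vertices
    ly = [0.0] * n                # labels of right vertices
    match_col = [-1] * n          # match_col[j] = row matched to column j
    eps = 1e-9

    def dfs(i, vis_x, vis_y, slack):
        # Depth-first search for an augmenting path in the equality subgraph.
        vis_x[i] = True
        for j in range(n):
            if vis_y[j]:
                continue
            gap = lx[i] + ly[j] - w[i][j]
            if gap < eps:
                vis_y[j] = True
                if match_col[j] == -1 or dfs(match_col[j], vis_x, vis_y, slack):
                    match_col[j] = i
                    return True
            elif gap < slack[j]:
                slack[j] = gap
        return False

    for i in range(n):
        slack = [float("inf")] * n
        while True:
            vis_x = [False] * n
            vis_y = [False] * n
            if dfs(i, vis_x, vis_y, slack):
                break
            # Search failed: adjust labels so a new edge enters the subgraph.
            delta = min(slack[j] for j in range(n) if not vis_y[j])
            for k in range(n):
                if vis_x[k]:
                    lx[k] -= delta
                if vis_y[k]:
                    ly[k] += delta
                else:
                    slack[k] -= delta

    return sorted(
        (match_col[j], j) for j in range(n)
        if match_col[j] < n_rows and j < n_cols and weights[match_col[j]][j] > 0
    )


print(km_match([[0.9, 0.6],
                [0.8, 0.0]]))
```

In the example, pairing row 0 with column 0 would leave row 1 with a zero-weight partner (total 0.9), so the solver instead pairs row 0 with column 1 and row 1 with column 0 (total 1.4).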
In one embodiment, the similarity measure in step 402 is calculated as shown in formula (1).
In one embodiment, the data dictionary table includes a data number field; before step 406, the method includes: comparing, according to the data number field, the data dictionary tables updated before and after a period of time.
Specifically, dictionary table designs in practice usually include a number field, that is, an identity document (ID) field. Because the number field of a data item is relatively stable, when comparing versions of a data dictionary table updated before and after a period of time, matching on the number field first can further reduce the computational cost of the algorithm and improve the efficiency of comparison and analysis.
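The idea of matching on the number field first can be sketched as follows. The representation of a dictionary table as a mapping from ID to text item, the function name `prematch_by_id`, and the sample IDs are all hypothetical; the text does not specify how ID matches are verified, so this sketch simply pairs items whose IDs appear in both versions and leaves the rest for the semantic comparison.

```python
def prematch_by_id(old_items, new_items):
    """Pair items that share an ID; return leftovers for semantic matching.

    old_items and new_items map an ID to the text data item.
    """
    shared = old_items.keys() & new_items.keys()
    matched = {key: (old_items[key], new_items[key]) for key in sorted(shared)}
    old_rest = {k: v for k, v in old_items.items() if k not in shared}
    new_rest = {k: v for k, v in new_items.items() if k not in shared}
    return matched, old_rest, new_rest


old = {"001": "设备名称", "002": "设备编号", "003": "生产厂家"}
new = {"001": "设备名称", "002": "设备编码", "004": "制造厂商"}
matched, old_rest, new_rest = prematch_by_id(old, new)
print(matched)
print(old_rest, new_rest)
```

Only the leftover items would then go through the similarity matrix and the KM algorithm, which shrinks the matrix that has to be solved.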
It should be understood that although the steps in the flowcharts of FIG. 1 and FIG. 4 are shown in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly ordered, and they may be performed in other orders. Moreover, at least some of the steps in FIG. 1 and FIG. 4 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same time but may be executed at different times, and they are not necessarily executed sequentially but may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in FIG. 5, a text data comparison device is provided, including a comparison data acquisition module, a similarity measure matrix determination module, and a text data comparison result determination module, wherein:
The comparison data acquisition module is configured to acquire the text data item sets in two data dictionary tables and segment the two text data item sets into words to obtain a Chinese word set for each element of the two text data item sets.
The similarity measure matrix determination module is configured to calculate, from the Chinese word set of each element of the two text data item sets, the similarity measure between elements of the two sets, and to preprocess the similarity measure with a preset similarity ratio threshold to obtain a similarity measure matrix.
The text data comparison result determination module is configured to convert the comparison of the two text data item sets into a weighted bipartite graph matching problem according to the similarity measure matrix and the two text data item sets, and to solve the weighted bipartite graph matching problem with the KM algorithm to obtain a set of globally optimal matching relationships between the two text data item sets.
In one embodiment, the comparison data acquisition module is further configured to acquire the text data item sets in two data dictionary tables, and to segment the elements of the two text data item sets with a statistics-based word segmentation method to obtain a Chinese word set for each element of the two text data item sets.
In one embodiment, the rows and columns of the similarity measure matrix correspond to the elements of the first text data item set and the elements of the second text data item set, respectively. The similarity measure matrix determination module is further configured to calculate the similarity measure between elements of the two text data item sets from the Chinese word set of each element; when the similarity measure between two elements is greater than or equal to the preset similarity ratio threshold, the entry at the corresponding position of the similarity measure matrix equals the similarity measure; when the similarity measure is less than the preset similarity ratio threshold, the entry equals 0.
In one embodiment, the similarity measure in the similarity measure matrix determination module is calculated as shown in formula (1).
For the specific limitations of the text data comparison device, refer to the limitations of the text data comparison method above, which are not repeated here. Each module of the text data comparison device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application. Their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. The protection scope of this patent shall therefore be subject to the appended claims.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210631816.9A CN114722160B (en) | 2022-06-07 | 2022-06-07 | Text data comparison method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114722160A (en) | 2022-07-08 |
CN114722160B (en) | 2022-09-02 |
Family
ID=82232868
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210631816.9A Active CN114722160B (en) | 2022-06-07 | 2022-06-07 | Text data comparison method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114722160B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115620323A (en) * | 2022-09-08 | 2023-01-17 | 广州希音国际进出口有限公司 | Image-sensitive-text-based contrast matching method, storage device and intelligent terminal |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100543735C (en) * | 2005-10-31 | 2009-09-23 | 北大方正集团有限公司 | Document Similarity Measuring Method Based on Document Structure |
CN106547739B (en) * | 2016-11-03 | 2019-04-02 | 同济大学 | A kind of text semantic similarity analysis method |
CN113934842A (en) * | 2020-06-29 | 2022-01-14 | 数网金融有限公司 | Text clustering method and device and readable storage medium |
CN113407767A (en) * | 2021-06-29 | 2021-09-17 | 北京字节跳动网络技术有限公司 | Method and device for determining text relevance, readable medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN114722160A (en) | 2022-07-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |