CN114911999A - A name matching method and device - Google Patents

Info

Publication number
CN114911999A
Authority
CN
China
Prior art keywords
name
matching
candidate
similarity
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210569401.3A
Other languages
Chinese (zh)
Other versions
CN114911999B (en
Inventor
胡玉婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN202210569401.3A
Publication of CN114911999A
Application granted
Publication of CN114911999B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a name matching method and device, relating to the field of big data. The method includes: in response to a search request for an original name to be searched, performing character-based matching on the original name; if the matching fails, splitting the original name into a plurality of word segments and classifying the word segments according to preset categories to obtain a plurality of categorized word segments; recombining the word segments based on a target category among the categories to obtain a plurality of candidate names; performing similarity matching between each candidate name and a preset name database to determine the target candidate name with the highest similarity; and performing semantic classification matching between the target candidate name and the name database using a trained classification model to obtain a matching result. Embodiments of the present invention reduce the matching computation workload and improve matching accuracy.

Description

A name matching method and device

Technical field

The present invention relates to the technical field of big data, and in particular to a name matching method and a name matching device.

Background

To achieve comprehensive customer insight and product recommendation, related technologies support obtaining customer information by entering an enterprise name into the system, including customer development information, revenue information, bidding information, business intelligence, enterprise graphs, and complaint information. In essence, the enterprise name (the name of a company crawled from web pages) must be linked with the customer name (the name of a company recorded and stored in the system) so that all information corresponding to the customer name in the system can be retrieved. However, users often enter abbreviated or shortened names. Because of non-standard input, typos, and the complexity and variety of abbreviations, the system cannot accurately find the customer name.

To solve the above problems, related solutions usually perform similarity matching between the enterprise name and the customer names, sort all matching scores, and take the customer names whose scores exceed a certain threshold as the matching result for the enterprise name. The accuracy of this similarity matching approach is usually low.

In addition, string matching is generally computed with a similarity algorithm based on the shortest edit distance. It considers the semantic relationship between text contexts as a whole, is a commonly used distance measure, and has been widely applied in the field of string similarity matching, but some problems remain:

1) The traditional edit distance algorithm considers only the number of edit operations and is therefore not universally applicable;

2) The traditional edit distance algorithm has a certain bias when handling insertion and deletion errors in long strings, which lowers matching accuracy.
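The limitation described in points 1) and 2) can be illustrated with a minimal sketch of the classic Levenshtein edit distance. The function name `edit_distance` and the sample abbreviation are illustrative assumptions, not part of the invention. Because the distance only counts edit operations, an abbreviation that is a plain substring of the full name is still penalized once for every omitted character.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,            # delete ca
                            curr[j - 1] + 1,        # insert cb
                            prev[j - 1] + cost))    # substitute or match
        prev = curr
    return prev[-1]


full_name = "北京一二三文化传播有限公司"
short_name = "一二三文化"  # hypothetical abbreviation typed by a user
print(edit_distance(full_name, short_name))  # 8: one deletion per omitted character
```

The distance of 8 is large relative to the 5-character query, even though every character of the query appears, in order, in the stored name.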

Summary of the invention

In view of the above problems, embodiments of the present invention are proposed to provide a name matching method and a corresponding name matching device that overcome, or at least partially solve, the above problems.

To solve the above problems, an embodiment of the present invention discloses a name matching method, the method including:

in response to a search request for an original name to be searched, performing character-based matching on the original name;

if the matching fails, splitting the original name to obtain a plurality of word segments, and classifying the plurality of word segments according to preset categories to obtain a plurality of categorized word segments;

recombining the plurality of word segments based on a target category among the categories to obtain a plurality of candidate names;

performing similarity matching between each of the plurality of candidate names and a preset name database, and determining the target candidate name with the highest similarity;

performing, based on a trained classification model, semantic classification matching between the target candidate name and the name database to obtain a matching result.

Preferably, performing character-based matching on the original name includes:

detecting whether the name database contains a name whose characters are identical to those of the original name;

if so, the matching succeeds; if not, the matching fails.

Preferably, splitting the original name to obtain a plurality of word segments, and classifying the plurality of word segments according to preset categories to obtain a plurality of categorized word segments, includes:

splitting the original name with jieba to obtain a plurality of word segments;

matching each word segment against preset category libraries to determine the category corresponding to each word segment, thereby obtaining a plurality of categorized word segments.

Preferably, recombining the plurality of word segments based on a target category among the categories to obtain a plurality of candidate names includes:

recombining the word segments having the target category with the word segments having non-target categories, respectively, to obtain a plurality of candidate names.

Preferably, performing similarity matching between each of the plurality of candidate names and a preset name database, and determining the target candidate name with the highest similarity, includes:

performing similarity matching between each of the plurality of candidate names and the name database, and determining a candidate similarity for each candidate name;

determining the target candidate similarity, which is the highest among the plurality of candidate similarities, and taking the candidate name corresponding to the target candidate similarity as the target candidate name.

Preferably, performing similarity matching between each of the plurality of candidate names and the name database, and determining a candidate similarity for each candidate name, includes:

for any candidate name among the plurality of candidate names, performing similarity matching between that candidate name and at least one preset name in the name database to obtain at least one similarity;

determining, as the candidate similarity, the highest of the at least one similarity.

Preferably, performing similarity matching between that candidate name and at least one preset name in the name database to obtain at least one similarity includes:

for any preset name among the at least one preset name, obtaining the forward maximum common substring and the backward maximum common substring of the candidate name and the preset name;

calculating a forward similarity based on the forward maximum common substring, and calculating a backward similarity based on the backward maximum common substring;

calculating the similarity between the candidate name and the preset name based on the forward similarity and the backward similarity.

Preferably, determining the target candidate similarity, which is the highest among the plurality of candidate similarities, and taking the candidate name corresponding to the target candidate similarity as the target candidate name, includes:

normalizing the plurality of candidate similarities based on the forward maximum common substring and the backward maximum common substring to obtain the target candidate similarity with the highest similarity;

taking the candidate name corresponding to the target candidate similarity as the target candidate name.

Preferably, performing, based on a trained classification model, semantic classification matching between the target candidate name and the name database to obtain a matching result includes:

inputting the target candidate name into the trained classification model, so that the classification model uses preset feature indicators to perform semantic classification matching between the target candidate name and the name database;

if the matching succeeds, taking the matched preset name as the matching result; if the matching fails, generating matching failure information.

Correspondingly, an embodiment of the present invention discloses a name matching device, the device including:

a first matching module, configured to perform character-based matching on an original name to be searched in response to a search request for the original name;

a word segmentation module, configured to, if the matching fails, split the original name to obtain a plurality of word segments, and classify the plurality of word segments according to preset categories to obtain a plurality of categorized word segments;

a recombination module, configured to recombine the plurality of word segments based on a target category among the categories to obtain a plurality of candidate names;

a second matching module, configured to perform similarity matching between each of the plurality of candidate names and a preset name database, and determine the target candidate name with the highest similarity;

a classification module, configured to perform, based on a trained classification model, semantic classification matching between the target candidate name and the name database to obtain a matching result.

Preferably, the first matching module is specifically configured to:

detect whether the name database contains a name whose characters are identical to those of the original name;

if so, the matching succeeds; if not, the matching fails.

Preferably, the word segmentation module is specifically configured to:

split the original name with jieba to obtain a plurality of word segments;

match each word segment against preset category libraries to determine the category corresponding to each word segment, thereby obtaining a plurality of categorized word segments.

Preferably, the recombination module is specifically configured to:

recombine the word segments having the target category with the word segments having non-target categories, respectively, to obtain a plurality of candidate names.

Preferably, the second matching module includes:

a similarity matching submodule, configured to perform similarity matching between each of the plurality of candidate names and the name database, and determine a candidate similarity for each candidate name;

a determination submodule, configured to determine the target candidate similarity, which is the highest among the plurality of candidate similarities, and take the candidate name corresponding to the target candidate similarity as the target candidate name.

Preferably, the similarity matching submodule includes:

a matching unit, configured to, for any candidate name among the plurality of candidate names, perform similarity matching between that candidate name and at least one preset name in the name database to obtain at least one similarity;

a determination unit, configured to determine, as the candidate similarity, the highest of the at least one similarity.

Preferably, the matching unit is specifically configured to:

for any preset name among the at least one preset name, obtain the forward maximum common substring and the backward maximum common substring of the candidate name and the preset name;

calculate a forward similarity based on the forward maximum common substring, and calculate a backward similarity based on the backward maximum common substring;

calculate the similarity between the candidate name and the preset name based on the forward similarity and the backward similarity.

Preferably, the determination submodule is specifically configured to:

normalize the plurality of candidate similarities based on the forward maximum common substring and the backward maximum common substring to obtain the target candidate similarity with the highest similarity;

take the candidate name corresponding to the target candidate similarity as the target candidate name.

Preferably, the classification module is specifically configured to:

input the target candidate name into the trained classification model, so that the classification model uses preset feature indicators to perform semantic classification matching between the target candidate name and the name database;

if the matching succeeds, take the matched preset name as the matching result; if the matching fails, generate matching failure information.

Correspondingly, an embodiment of the present invention discloses an electronic device, including: a processor, a memory, and a computer program stored in the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the above name matching method embodiment.

Correspondingly, an embodiment of the present invention discloses a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps of the above name matching method embodiment.

Embodiments of the present invention have the following advantages:

In response to a search request for an original name to be searched, the backend server performs character-based matching on the original name. If the matching fails, it splits the original name to obtain a plurality of word segments and classifies them according to preset categories to obtain a plurality of categorized word segments. It then recombines the word segments based on a target category among the categories to obtain a plurality of candidate names, performs similarity matching between each candidate name and a preset name database to determine the target candidate name with the highest similarity, and finally performs semantic classification matching between the target candidate name and the name database based on a trained classification model to obtain a matching result. In this way, after exact matching of the original name fails, the original name can be broken down by category into a plurality of word segments, and the word segments can then be recombined around a specified category to obtain a plurality of new names. The operation is simple, narrows the range of word-segment combinations, and reduces the matching computation workload. Moreover, for the determined target candidate name, semantic classification matching turns the traditional similarity matching problem into a machine classification problem and considers the similarity of a name pair in terms of semantic closeness, which improves matching accuracy and further reduces the matching computation workload.

Description of drawings

Fig. 1 is a flowchart of the steps of an embodiment of a name matching method of the present invention;

Fig. 2 is a structural block diagram of an embodiment of a name matching device of the present invention.

Detailed description

To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

One of the core ideas of the embodiments of the present invention is as follows. In response to a search request for an original name to be searched, the backend server performs character-based matching on the original name. If the matching fails, it splits the original name to obtain a plurality of word segments and classifies them according to preset categories to obtain a plurality of categorized word segments. It then recombines the word segments based on a target category among the categories to obtain a plurality of candidate names, performs similarity matching between each candidate name and a preset name database to determine the target candidate name with the highest similarity, and finally performs semantic classification matching between the target candidate name and the name database based on a trained classification model to obtain a matching result. In this way, after exact matching of the original name fails, the original name can be broken down by category into a plurality of word segments, and the word segments can then be recombined around a specified category to obtain a plurality of new names. The operation is simple, narrows the range of word-segment combinations, and reduces the matching computation workload. Moreover, for the determined target candidate name, semantic classification matching turns the traditional similarity matching problem into a machine classification problem and considers the similarity of a name pair in terms of semantic closeness, which improves matching accuracy and further reduces the matching computation workload.

Referring to Fig. 1, a flowchart of the steps of an embodiment of a name matching method of the present invention is shown, which may specifically include the following steps:

Step 101: in response to a search request for an original name to be searched, perform character-based matching on the original name.

Embodiments of the present invention may be applied to a backend server that exchanges data with a front-end device. The user enters the name to be searched (referred to as the "original name") on the front-end device and initiates a search instruction. After receiving the search instruction, the front-end device generates a search request from the original name and sends it to the backend server. After receiving the search request, the backend server matches the characters of the original name exactly against a preset name database to determine whether the database contains a name with the same characters as the original name.

The name database includes at least one stored name (referred to as a "preset name"). A name may be the name of an enterprise, a company, or the like, for example "北京一二三文化传播有限公司" (Beijing One Two Three Culture Communication Co., Ltd.).

In this embodiment of the present invention, performing character-based matching on the original name includes:

detecting whether the name database contains a name whose characters are identical to those of the original name;

if so, the matching succeeds; if not, the matching fails.

Specifically, exact matching checks whether the characters of the original name are identical to those of any preset name. If an identical preset name exists, the matching succeeds; if the original name is not identical to any preset name, the matching fails.

For example, suppose the original name is "北京一二三文化传播有限公司" (Beijing One Two Three Culture Communication Co., Ltd.). If a preset name in the name database is also "北京一二三文化传播有限公司", the exact match succeeds. If the name database instead holds the preset name "北京一二三文化传播公司" (Beijing One Two Three Culture Communication Company), the exact match fails.

When the exact match succeeds, the matched result is simply returned.
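Step 101 amounts to a membership test on the name database. The following is a minimal sketch, assuming the preset names fit in an in-memory Python set; the function name `exact_match` and the set-based storage are illustrative assumptions, and a real deployment might query a database table instead.

```python
from typing import Optional, Set


def exact_match(original_name: str, name_database: Set[str]) -> Optional[str]:
    """Return the preset name whose characters are identical to the original
    name, or None when the exact match fails."""
    return original_name if original_name in name_database else None


original_name = "北京一二三文化传播有限公司"

# Case 1: the database holds an identical preset name, so the match succeeds.
hit = exact_match(original_name, {"北京一二三文化传播有限公司"})

# Case 2: the database only holds a shorter variant, so the match fails
# and the method would continue with step 102.
miss = exact_match(original_name, {"北京一二三文化传播公司"})

print(hit is not None, miss is None)
```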

Step 102: if the matching fails, split the original name to obtain a plurality of word segments, and classify the plurality of word segments according to preset categories to obtain a plurality of categorized word segments.

When the exact match fails, the original name can be split into a plurality of word segments, and each word segment can then be classified to determine its corresponding category.

In this embodiment of the present invention, splitting the original name to obtain a plurality of word segments, and classifying the plurality of word segments according to preset categories to obtain a plurality of categorized word segments, includes:

splitting the original name with jieba to obtain a plurality of word segments;

matching each word segment against preset category libraries to determine the category corresponding to each word segment, thereby obtaining a plurality of categorized word segments.

Specifically, the word segmentation tool jieba can be used to split the original name into a plurality of word segments. Each word segment is then matched against the preset category libraries to determine its corresponding category.

The name of an enterprise or company generally consists of four parts: region, keyword, industry, and suffix. Because the vocabularies for region, industry, and suffix are limited, this embodiment of the present invention can preset three category libraries: a region library, an industry library, and a suffix library.

The region library includes, but is not limited to: Beijing (北京), Shanghai (上海), Sichuan (四川), Hebei (河北), Hunan (湖南), Shaanxi (陕西), Yunnan (云南), Henan (河南), Gansu (甘肃), Shandong (山东), Hubei (湖北), Guangxi (广西), Anhui (安徽), Jiangxi (江西), Xinjiang (新疆), Shanxi (山西), Fujian (福建), Inner Mongolia (内蒙古), Zhejiang (浙江), Heilongjiang (黑龙江), Guizhou (贵州), Jilin (吉林), Qinghai (青海), Tibet (西藏), Liaoning (辽宁), Guangdong (广东), Jiangsu (江苏), Hainan (海南), Ningxia (宁夏), Shenzhen (深圳), Chongqing (重庆), Hong Kong (香港), Macau (澳门), Taiwan (台湾).

The industry library includes, but is not limited to: information (信息), science and technology (科技), commerce (商贸), trade (贸易), service (服务), advertising (广告), technology (技术), culture (文化), media (传媒), communication (传播), development (发展), exchange (交流), consulting (咨询), management (管理), design (设计), maintenance (维修), logistics (物流), training (培训), leasing (租赁), construction (建筑), engineering (工程), equipment (设备).

The suffix library includes, but is not limited to: company (公司), limited company (有限公司), limited liability company (有限责任公司), joint-stock limited company (股份有限公司), branch company (分公司), limited liability company in its variant word order (责任有限公司), joint-stock company (股份公司).

In this way, after the word segments are obtained, each of them is matched against the three category libraries above to determine its corresponding category.

比如,接上例,对“北京一二三文化传播公司”进行分词,得到的分词为“北京”、“一二三”、“文化”、“传播”和“公司”,然后将各个分词分别与各个类别库进行匹配,确定出每个分词的类别如表1所示:For example, following the example above, the participles of "Beijing One Two Three Culture Communication Company" will be divided into "Beijing", "One Two Three", "Culture", "Communication" and "Company", and then each participle will be divided into Match with each category library, and determine the category of each participle as shown in Table 1:

Category | Value
Region | 北京 (Beijing)
Keyword | 一二三 (One-Two-Three)
Industry | 文化 (culture), 传播 (communication)
Suffix | 公司 (company)

Table 1
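The category lookup behind Table 1 can be sketched as below. This is a minimal illustration, not the patented implementation: the three libraries are abbreviated, the rule that any segment absent from all three libraries is treated as a keyword is an assumption inferred from the four-part name structure, and the function names are hypothetical.

```python
# Hypothetical, abbreviated category libraries based on the lists in the text.
REGION_LIBRARY = {"北京", "上海", "深圳"}
INDUSTRY_LIBRARY = {"信息", "科技", "文化", "传播"}
SUFFIX_LIBRARY = {"公司", "有限公司", "有限责任公司"}


def classify_segment(segment):
    """Return the category of one segment.

    Assumption: a segment found in none of the three libraries is a keyword.
    """
    if segment in REGION_LIBRARY:
        return "region"
    if segment in INDUSTRY_LIBRARY:
        return "industry"
    if segment in SUFFIX_LIBRARY:
        return "suffix"
    return "keyword"


def classify_segments(segments):
    """Group segments by category, preserving their original order."""
    grouped = {"region": [], "keyword": [], "industry": [], "suffix": []}
    for segment in segments:
        grouped[classify_segment(segment)].append(segment)
    return grouped


# The segments are supplied directly here; in practice they would come from
# a segmenter such as jieba.
segments = ["北京", "一二三", "文化", "传播", "公司"]
table_1 = classify_segments(segments)
```

Using set membership keeps each lookup constant-time, which matters little for three small libraries but scales if the libraries grow.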

It should be noted that the categories and the values they contain are not limited to those described above; other divisions and other values may be used and can be set according to actual requirements in practical applications. This embodiment of the present invention does not limit this.

Step 103: recombine the segments based on a target category among the categories to obtain multiple candidate names.

After the categorized segments are obtained, they can be recombined around the target category to obtain multiple candidate names.

In this embodiment of the present invention, recombining the segments based on the target category among the categories to obtain multiple candidate names includes:

recombining each segment that has the target category with the segments that have non-target categories to obtain multiple candidate names.

Specifically, among the four categories described above, the most important one in an enterprise or company name is the keyword, so the segments can be recombined around the keyword. Recombination strategies include, but are not limited to: keyword 1, keyword 2, ..., keyword n, keyword 1 + keyword 2, ..., region + keyword 1, region + keyword 2, ..., keyword 1 + industry, and so on.

For example, continuing the example above, the candidate names obtained by recombining "北京一二三文化传播公司" (Beijing One-Two-Three Culture Communication Company) may be as shown in Table 2:

[Figure BDA0003659628630000101: Table 2, candidate names obtained by recombination; image not reproduced]

Table 2
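The keyword-centred recombination can be sketched as follows. The sketch covers only the strategies named in the text (keyword alone, keywords joined, region + keyword, keyword + industry). It is an illustration under those assumptions and does not claim to reproduce the seven candidates of Table 2; the function name `recombine` is hypothetical.

```python
def recombine(grouped):
    """Build candidate names centred on the keyword category."""
    candidates = []
    keywords = grouped.get("keyword", [])
    for keyword in keywords:
        candidates.append(keyword)                  # keyword alone
    if len(keywords) > 1:
        candidates.append("".join(keywords))        # all keywords joined
    for keyword in keywords:
        for region in grouped.get("region", []):
            candidates.append(region + keyword)     # region + keyword
        for industry in grouped.get("industry", []):
            candidates.append(keyword + industry)   # keyword + industry
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(candidates))


grouped = {
    "region": ["北京"],
    "keyword": ["一二三"],
    "industry": ["文化", "传播"],
    "suffix": ["公司"],
}
candidates = recombine(grouped)
```

Restricting the combinations to those that contain a keyword is what keeps the candidate set small compared with enumerating every subset of segments.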

Step 104: perform similarity matching between each of the candidate names and a preset name database, and determine the target candidate name with the highest similarity.

After the candidate names are obtained, each can be matched against the name database to obtain its similarity, and the candidate name with the highest similarity among all candidates is taken as the target candidate name.

In this embodiment of the present invention, performing similarity matching between each of the candidate names and the preset name database and determining the target candidate name with the highest similarity includes:

performing similarity matching between each of the candidate names and the name database, and determining the candidate similarity of each candidate name;

determining the target candidate similarity, which is the highest of the candidate similarities, and taking the candidate name corresponding to the target candidate similarity as the target candidate name.

Specifically, each candidate name can be matched against all preset names in the name database to obtain the candidate similarity of that candidate name. The highest of the candidate similarities is then determined as the target candidate similarity, and the corresponding candidate name is taken as the target candidate name.

For example, for the 7 candidate names in Table 2, each candidate name is matched against the name database to compute its similarity, and the candidate name with the highest of the 7 similarities is taken as the target candidate name.

In this embodiment of the present invention, performing similarity matching between each of the candidate names and the name database and determining the candidate similarity of each candidate name includes:

for any one of the candidate names, performing similarity matching between that candidate name and at least one preset name in the name database to obtain at least one similarity;

determining the highest of the at least one similarity as the candidate similarity.

Specifically, for any one of the candidate names, that candidate name can be matched against all preset names in the name database to obtain multiple similarities, and the highest of them is taken as the candidate similarity of that candidate name. Repeating this for every candidate name yields the candidate similarity of each.

For example, continuing the example above, suppose the name database contains 50 preset names. For the candidate name "一二三" (One-Two-Three), one of the 7 candidate names, "一二三" is matched against the 50 preset names to obtain 50 similarities, and the highest of the 50 is taken as its candidate similarity. Repeating this yields the candidate similarity of each of the 7 candidate names.
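The two nested maxima described above (the best preset name per candidate, then the best candidate overall) can be sketched as follows. The similarity function is passed in as a parameter. `toy_similarity` is only a placeholder and is not the similarity measure of this embodiment; the preset names are invented for the illustration.

```python
def best_candidate(candidates, preset_names, similarity):
    """Return (candidate, score) for the candidate with the highest score.

    preset_names must be non-empty, otherwise max() has nothing to compare.
    """
    best_name, best_score = None, float("-inf")
    for candidate in candidates:
        # Candidate similarity: the highest similarity against any preset name.
        score = max(similarity(candidate, preset) for preset in preset_names)
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name, best_score


def toy_similarity(a, b):
    """Placeholder measure: shared characters over all distinct characters."""
    set_a, set_b = set(a), set(b)
    return len(set_a & set_b) / len(set_a | set_b)


presets = ["一二三文化", "四五六科技"]
name, score = best_candidate(
    ["一二三", "北京一二三", "一二三文化"], presets, toy_similarity
)
```

With 7 candidates and 50 preset names this loop evaluates the similarity 350 times, which is the cost the recombination step is meant to keep bounded.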

Performing similarity matching between the candidate name and at least one preset name in the name database to obtain at least one similarity includes:

for any one of the at least one preset name, obtaining the forward maximum common substring and the backward maximum common substring of the candidate name and that preset name;

calculating a forward similarity based on the forward maximum common substring, and calculating a backward similarity based on the backward maximum common substring;

calculating the similarity between the candidate name and that preset name based on the forward similarity and the backward similarity.

String matching is generally computed with a similarity algorithm based on the shortest edit distance. It considers the semantic relationship between text contexts as a whole, is a commonly used distance metric, and is widely applied in string similarity matching. The shortest edit distance is the minimum number of edit operations needed to convert a source string S into a target string T; the fewer operations needed, the more similar the two strings. There are three basic edit operations: (1) inserting a character into S; (2) deleting a character from S; (3) replacing a character in S with a character from T.

In the traditional shortest edit distance algorithm, the similarity between two strings can be calculated from the shortest edit distance. Intuitively, the smaller the edit distance between two strings, the higher their similarity. The formula that converts the edit distance into a similarity in the interval [0, 1] is as follows:

$$\mathrm{sim}(S, T) = 1 - \frac{ld}{\max(|S|, |T|)} \tag{1}$$

where |S| and |T| denote the lengths of strings S and T, and ld denotes the shortest edit distance between S and T. The larger sim(S, T) is, the more similar the two strings are.
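A minimal sketch of the traditional measure follows: a standard Levenshtein distance with unit costs, then the normalization to [0, 1]. It assumes formula (1) is the usual form 1 - ld / max(|S|, |T|); the function names and the convention that two empty strings have similarity 1 are assumptions of this sketch.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def edit_similarity(source, target):
    """Assumed formula (1): 1 - ld / max(|S|, |T|).

    Two empty strings are treated as identical (similarity 1.0).
    """
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(source, target) / longest
```

Only two rows of the dynamic-programming table are kept, so memory use is proportional to the length of the target string.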

However, the traditional shortest edit distance algorithm considers only the number of edit operations and the maximum string length; it does not consider the common substrings of the strings, so it is not universally applicable. For example, given the strings S1 = 'BC', S2 = 'CD', and S3 = 'EF', the similarities calculated according to formula (1) are as follows:

1) Converting S1 to S2 requires one deletion and one insertion (or, equivalently, two substitutions), so the shortest edit distance is 2 and sim(S1, S2) = 1 - 2/max(2, 2) = 0.

2) Converting S1 to S3 requires two substitutions, so the shortest edit distance is 2 and sim(S1, S3) = 1 - 2/max(2, 2) = 0.

The results show that the similarity of S1 and S2 equals that of S1 and S3, yet S1 and S2 are clearly more similar than S1 and S3, because S1 and S2 share the maximum common substring "C". The maximum common substring is defined as follows: a string sequence X is the maximum common substring of two given sequences if X is a subsequence of both strings and is the longest of all sequences satisfying this condition.
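The common substring that separates the two cases can be found with a short dynamic program. The sketch assumes a contiguous common substring. The definition above speaks of subsequences, and both readings give the same result for the BC / CD / EF example. The function name is hypothetical.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous run of characters shared by a and b.

    If several runs tie for the longest, the one ending earliest in a wins.
    """
    best_len, best_end = 0, 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                current[j] = previous[j - 1] + 1
                if current[j] > best_len:
                    best_len, best_end = current[j], i
        previous = current
    return a[best_end - best_len:best_end]
```

For S1 = 'BC' and S2 = 'CD' the function returns "C", while for S1 and S3 = 'EF' it returns the empty string, which is the difference the plain edit distance cannot express.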

To remedy the above defect of the shortest edit distance algorithm, the solution of the present invention proposes a shortest edit distance similarity algorithm that uses the forward maximum common substring and the backward maximum common substring.

Specifically, let S be any one of the candidate names and T be any one of the preset names. The forward maximum common substring of the two, denoted lcs, and the backward maximum common substring, denoted rcs, are obtained. The forward similarity is then calculated using formula (2):

[Formula (2): Figure BDA0003659628630000124, image not reproduced]

and the backward similarity is calculated using formula (3):

[Formula (3): Figure BDA0003659628630000126, image not reproduced]

where |S| and |T| denote the lengths of strings S and T, |lcs| and |rcs| denote the lengths of the forward maximum common substring lcs and the backward maximum common substring rcs, and ld denotes the shortest edit distance between S and T. The forward maximum common substring is the largest common substring of the two strings found from left to right, and the backward maximum common substring is the largest common substring of the two strings found from right to left.

The final similarity is then calculated from the forward similarity, the backward similarity, and their weights, as shown in formula (4):

[Formula (4): Figure BDA0003659628630000131, image not reproduced]

where α and β are the weights, with α + β = 1.
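Formula (4) is available only as an image, so the sketch below rests on an assumption: that the final similarity is the weighted sum of the forward and backward similarities, which is the natural reading of weights constrained by α + β = 1. The forward and backward similarities are taken as inputs because formulas (2) and (3) are likewise not reproduced; the function name is hypothetical.

```python
import math


def combined_similarity(forward_sim, backward_sim, alpha=0.5, beta=0.5):
    """Assumed form of formula (4): alpha * forward + beta * backward."""
    if not math.isclose(alpha + beta, 1.0):
        raise ValueError("alpha + beta must equal 1")
    return alpha * forward_sim + beta * backward_sim
```

Because the weights sum to 1, the combined value stays inside the range spanned by the two input similarities.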

For example, continuing the example above, with α = 0.5 and β = 0.5, the similarities are calculated using formula (4):

[Figures BDA0003659628630000132 and BDA0003659628630000133: worked calculations; images not reproduced]

The results show that S1 and S2 are more similar than S1 and S3, which matches the actual situation better than the result of the traditional shortest edit distance calculation.

In this embodiment of the present invention, determining the target candidate similarity, which is the highest of the candidate similarities, and taking the candidate name corresponding to the target candidate similarity as the target candidate name includes:

normalizing the candidate similarities based on the forward maximum common substring and the backward maximum common substring to obtain the target candidate similarity with the highest similarity;

taking the candidate name corresponding to the target candidate similarity as the target candidate name.

In string similarity matching, large differences in the lengths of the maximum common substrings affect the final similarity result, so the similarities need to be normalized. This addresses the computational defect of the traditional shortest edit distance algorithm and greatly improves matching accuracy. Specifically, after the candidate similarity of each candidate name is obtained, the candidate similarities can be normalized using the forward maximum common substring and the backward maximum common substring, as shown in formula (5):

[Formula (5): Figure BDA0003659628630000134, image not reproduced]

In this way, the final target candidate similarity can be determined from the candidate similarities, and the candidate name corresponding to it is taken as the final target candidate name.

Step 105: perform semantic classification matching between the target candidate name and the name database based on a trained classification model to obtain a matching result.

After the target candidate name is obtained, it is input into the trained classification model, which performs classification matching between the target candidate name and all preset names in the name database and produces a result indicating whether a match exists.

In this embodiment of the present invention, the target candidate name is input into the trained classification model so that the classification model performs semantic classification matching between the target candidate name and the name database using preset feature indicators;

if the matching succeeds, the matched preset name is taken as the matching result; if the matching fails, matching failure information is generated.

Specifically, the target candidate name is input into the trained classification model, which uses preset feature indicators to perform classification matching between the target candidate name and each preset name in the name database. The preset feature indicators include, but are not limited to, those in Table 3:

[Figure BDA0003659628630000141: Table 3, preset feature indicators; image not reproduced]

Table 3

Here, the customer name is the company name recorded and stored in the background server, that is, the name to be matched against; the enterprise name is the company name crawled from web pages, that is, the name to be searched.

During classification matching, the classification model judges whether the target candidate name expresses the same meaning as any of the preset names. If it has the same meaning as any preset name, the result is 1, meaning the matching succeeds, and that preset name is taken as the matching result. If its meaning differs from that of every preset name, the result is 0, meaning the matching fails, and matching failure information is generated. In this way, the text semantic similarity matching problem is converted into a classification problem, which addresses the low matching rate caused by considering only literal similarity and ignoring semantic similarity.
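The decision logic around the classifier can be sketched as below. The trained model is represented by a callable that returns 1 or 0 for a name pair. `toy_classifier` is a stand-in based on substring containment and is not the model described in this embodiment; the result dictionary and failure message are hypothetical.

```python
def classify_match(target_candidate, preset_names, same_meaning):
    """Return the first preset name judged to have the same meaning.

    same_meaning(a, b) stands for the trained classifier and returns 1 or 0.
    """
    for preset in preset_names:
        if same_meaning(target_candidate, preset) == 1:
            return {"matched": True, "result": preset}
    return {"matched": False, "result": "no matching name found"}


def toy_classifier(a, b):
    """Stand-in for the trained model: 1 if one name contains the other."""
    return 1 if a in b or b in a else 0


outcome = classify_match(
    "一二三文化",
    ["四五六科技公司", "北京一二三文化传播公司"],
    toy_classifier,
)
```

Returning on the first positive pair mirrors the rule that a match with any preset name counts as success.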

Further, the classification model can be generated as follows:

1) Determine the sample set.

Specifically, all customer names in the background server and all enterprise names crawled from web pages are obtained. A certain proportion (for example, 1%) of the customer names is randomly drawn as customer name samples, and a certain proportion (for example, 1%) of the enterprise names is randomly drawn as enterprise name samples; together they form the final training sample set. Each name pair in the training sample set (a name pair has the form "customer name - enterprise name") is then given a classification label: 1 means semantically similar and 0 means semantically dissimilar. The remaining customer names and enterprise names form the test sample set.
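The random split into training and test samples can be sketched as follows. The 1% fraction comes from the text; the fixed seed, the minimum of one sample, and the generated names are assumptions made so that the example is reproducible. Labelling the name pairs with 1 or 0 is a separate step and is not shown.

```python
import random


def split_samples(names, fraction=0.01, seed=0):
    """Randomly draw a fraction of names for training; the rest go to testing."""
    rng = random.Random(seed)
    # At least one sample, but never more than the names available.
    sample_size = min(len(names), max(1, round(len(names) * fraction)))
    chosen = set(rng.sample(range(len(names)), sample_size))
    training = [name for i, name in enumerate(names) if i in chosen]
    testing = [name for i, name in enumerate(names) if i not in chosen]
    return training, testing


# Hypothetical data: 1000 placeholder customer names.
customer_names = [f"customer-{i}" for i in range(1000)]
train_customers, test_customers = split_samples(customer_names)
```

Sampling indices rather than values keeps the split correct even if the name list contains duplicates.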

2) Set the feature indicators.

That is, set the feature indicators that the classification model uses for classification, including but not limited to those shown in Table 3.

The feature indicators cover at least two aspects: the various similarity features computed between the names of a pair, and the NLP (Natural Language Processing) data features computed for the name pair.

3) Build the classification learning model.

This embodiment of the present invention builds the machine learning model with two-layer stacking: the first layer uses three base classifiers, GaussianNBClassifier, RandomForestClassifier, and LogisticRegression, as the stacking base models, and the second layer is trained with a RandomForestClassifier.

The training parameters are set to: learning rate ρ = 0.001 and loss function adjustment factors α = 0.25 and γ = 0.15, which minimize the model loss function and reach the optimal solution.

4) Train the classification learning model.

The classification model is trained and validated using the above parameter settings and the test sample set; that is, the matching result between each customer name and each recombined enterprise name is computed, and the classification label 0 or 1 is output.

By iterating the classification model multiple times through 1) to 4), the comprehensive evaluation index (F value) of the final classification model reaches 0.79, and the accuracy of matching enterprise names to customer names exceeds 90%, so that enterprise names and customer names are matched efficiently and accurately.
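The text does not state which F value is reported. The sketch below assumes the common F1 score, the harmonic mean of precision and recall for label 1, computed from true and predicted labels. The function name and the zero-division conventions are assumptions of this sketch.

```python
def f_score(true_labels, predicted_labels):
    """Return (precision, recall, F1) for the positive class (label 1)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

An F value and an accuracy figure measure different things, which is why a model can report 0.79 for one and above 90% for the other on the same data.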

In this embodiment of the present invention, in response to a search request for an original name to be searched, the background server matches the original name on a character basis. If the matching fails, the original name is split into multiple segments, and the segments are classified according to preset categories to obtain multiple categorized segments. The segments are then recombined based on a target category among the categories to obtain multiple candidate names, each candidate name is matched for similarity against a preset name database to determine the target candidate name with the highest similarity, and the target candidate name is finally matched against the name database by semantic classification using a trained classification model to obtain the matching result. In this way, after exact matching of the original name fails, the original name is broken down by category into multiple segments, which are then recombined around a specified category to form multiple new names. The operation is simple, narrows the range of segment combinations, and reduces the matching workload.
Moreover, for the determined target candidate name, semantic classification matching converts the traditional similarity matching problem into a machine classification problem and takes the semantic closeness of the name pair into account, which improves matching accuracy and further reduces the matching workload.
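The overall control flow summarized above can be sketched as a single function that receives each stage as a callable. All parameter names are hypothetical, and the stages themselves (segmenter, category lookup, recombination, similarity ranking, classifier) are supplied by the caller; this is an outline of the sequence, not the implementation of any stage.

```python
def match_name(original, name_database, segment, classify, recombine,
               pick_best, classify_match):
    """Exact match first; otherwise segment, recombine, rank, and classify."""
    if original in name_database:        # character-based exact match
        return original
    segments = segment(original)         # split the name into segments
    grouped = classify(segments)         # attach a category to each segment
    candidates = recombine(grouped)      # rebuild names around the target category
    target = pick_best(candidates, name_database)   # highest-similarity candidate
    return classify_match(target, name_database)    # preset name, or None on failure
```

The later stages run only when the exact match fails, so names already present in the database cost a single membership test.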

Further, when the original name is broken down, a similarity algorithm based on the forward maximum common substring and the backward maximum common substring is used. This remedies the defect that the traditional edit distance similarity considers only the number of edits; the algorithm is universally applicable and greatly improves matching accuracy.

Furthermore, when determining the target candidate name, the traditional shortest edit distance algorithm ignores the effect of the common length of the strings on the edit distance, so its results deviate to some extent for errors such as insertions and deletions in long strings. This embodiment of the present invention proposes an improved normalization method that solves this problem and further improves matching accuracy.

It should be noted that, for simplicity of description, the method embodiments are described as a series of combined actions, but those skilled in the art should know that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.

Referring to FIG. 2, a structural block diagram of an embodiment of a name matching apparatus of the present invention is shown, which may specifically include the following modules:

a first matching module 201, configured to match the original name on a character basis in response to a search request for the original name to be searched;

a word segmentation module 202, configured to, if the matching fails, split the original name to obtain multiple segments, and classify the segments according to preset categories to obtain multiple categorized segments;

a recombination module 203, configured to recombine the segments based on a target category among the categories to obtain multiple candidate names;

a second matching module 204, configured to perform similarity matching between each of the candidate names and a preset name database, and determine the target candidate name with the highest similarity;

a classification module 205, configured to perform semantic classification matching between the target candidate name and the name database based on a trained classification model to obtain a matching result.

In this embodiment of the present invention, the first matching module is specifically configured to:

detect whether the name database contains a name whose characters are identical to those of the original name;

if so, the matching succeeds; if not, the matching fails.

In this embodiment of the present invention, the word segmentation module is specifically configured to:

split the original name with jieba to obtain multiple segments;

match each segment against the preset category libraries, determine the category of each segment, and obtain multiple categorized segments.

In this embodiment of the present invention, the recombination module is specifically configured to:

recombine each segment that has the target category with the segments that have non-target categories to obtain multiple candidate names.

In this embodiment of the present invention, the second matching module includes:

a similarity matching submodule, configured to perform similarity matching between each of the candidate names and the name database, and determine the candidate similarity of each candidate name;

a determination submodule, configured to determine the target candidate similarity, which is the highest of the candidate similarities, and take the candidate name corresponding to the target candidate similarity as the target candidate name.

In this embodiment of the present invention, the similarity matching submodule includes:

a matching unit, configured to, for any one of the candidate names, perform similarity matching between that candidate name and at least one preset name in the name database to obtain at least one similarity;

a determination unit, configured to determine the highest of the at least one similarity as the candidate similarity.

In this embodiment of the present invention, the matching unit is specifically configured to:

for any one of the at least one preset name, obtain the forward maximum common substring and the backward maximum common substring of the candidate name and that preset name;

calculate a forward similarity based on the forward maximum common substring, and calculate a backward similarity based on the backward maximum common substring;

calculate the similarity between the candidate name and that preset name based on the forward similarity and the backward similarity.

In this embodiment of the present invention, the determination submodule is specifically configured to:

normalize the candidate similarities based on the forward maximum common substring and the backward maximum common substring to obtain the target candidate similarity with the highest similarity;

take the candidate name corresponding to the target candidate similarity as the target candidate name.

In this embodiment of the present invention, the classification module is specifically configured to:

input the target candidate name into the trained classification model so that the classification model performs semantic classification matching between the target candidate name and the name database using preset feature indicators;

if the matching succeeds, take the matched preset name as the matching result; if the matching fails, generate matching failure information.

对于装置实施例而言,由于其与方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。As for the apparatus embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and reference may be made to the partial description of the method embodiment for related parts.

An embodiment of the present invention further provides an electronic device, including:

a processor, a memory, and a computer program stored in the memory and executable on the processor. When executed by the processor, the computer program implements each process of the above name matching method embodiment and achieves the same technical effect; to avoid repetition, details are not repeated here.

An embodiment of the present invention further provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements each process of the above name matching method embodiment and achieves the same technical effect; to avoid repetition, details are not repeated here.

The embodiments in this specification are described progressively: each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar, the embodiments may be referred to one another.

Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, an apparatus, or a computer program product. Accordingly, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.

The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to the embodiments. It should be understood that computer program instructions can implement each flow and/or block in the flowcharts and/or block diagrams, as well as combinations of flows and/or blocks in the flowcharts and/or block diagrams. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by that processor create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or another programmable data processing terminal device, so that a series of operational steps is performed on the computer or other programmable terminal device to produce a computer-implemented process. The instructions executed on the computer or other programmable terminal device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present invention have been described, those skilled in the art may make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present invention.

Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that includes the element.

The name matching method and the name matching apparatus provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, a person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (12)

1. A name matching method, the method comprising:
in response to a search request for an original name to be searched, matching the original name based on characters;
if the matching fails, splitting the original name to obtain a plurality of word segments, and classifying the word segments according to preset categories to obtain a plurality of word segments with categories;
recombining the plurality of word segments based on a target category among the categories to obtain a plurality of candidate names;
performing similarity matching between each of the candidate names and a preset name database to determine the target candidate name with the highest similarity; and
performing semantic classification matching between the target candidate name and the name database based on a trained classification model to obtain a matching result.
2. The name matching method according to claim 1, wherein the matching the original name based on characters comprises:
detecting whether a name whose characters are identical to those of the original name exists in the name database;
if so, the matching succeeds; if not, the matching fails.
3. The name matching method according to claim 1, wherein the splitting the original name to obtain a plurality of word segments, and classifying the word segments according to preset categories to obtain a plurality of word segments with categories comprises:
splitting the original name using jieba to obtain a plurality of word segments; and
matching each word segment against a preset category library, and determining the category of each word segment one by one, to obtain a plurality of word segments with categories.
4. The name matching method according to claim 1, wherein the recombining the plurality of word segments based on a target category among the categories to obtain a plurality of candidate names comprises:
recombining the word segments having the target category with the word segments having a non-target category, respectively, to obtain a plurality of candidate names.
5. The name matching method according to claim 1, wherein the performing similarity matching between each of the candidate names and a preset name database to determine the target candidate name with the highest similarity comprises:
performing similarity matching between each of the candidate names and the name database, and determining a candidate similarity for each candidate name; and
determining the target candidate similarity, which is the highest among the plurality of candidate similarities, and taking the candidate name corresponding to the target candidate similarity as the target candidate name.
6. The name matching method according to claim 5, wherein the performing similarity matching between each of the candidate names and the name database, and determining a candidate similarity for each candidate name comprises:
for any candidate name among the candidate names, performing similarity matching between the candidate name and at least one preset name in the name database to obtain at least one similarity; and
determining, as the candidate similarity, the highest similarity among the at least one similarity.
7. The name matching method according to claim 6, wherein the performing similarity matching between the candidate name and at least one preset name in the name database to obtain at least one similarity comprises:
for any preset name among the at least one preset name, obtaining a forward maximum common substring and a backward maximum common substring of the candidate name and the preset name;
calculating a forward similarity based on the forward maximum common substring, and calculating a backward similarity based on the backward maximum common substring; and
calculating the similarity between the candidate name and the preset name based on the forward similarity and the backward similarity.
8. The name matching method according to claim 5, wherein the determining the target candidate similarity, which is the highest among the plurality of candidate similarities, and taking the candidate name corresponding to the target candidate similarity as the target candidate name comprises:
normalizing the plurality of candidate similarities based on the forward maximum common substring and the backward maximum common substring to obtain the target candidate similarity with the highest similarity; and
taking the candidate name corresponding to the target candidate similarity as the target candidate name.
9. The name matching method according to claim 1, wherein the performing semantic classification matching between the target candidate name and the name database based on a trained classification model to obtain a matching result comprises:
inputting the target candidate name into the trained classification model, so that the classification model uses preset feature indicators to perform semantic classification matching between the target candidate name and the name database;
if the matching succeeds, taking the matched preset name as the matching result; and if the matching fails, generating matching failure information.
10. A name matching apparatus, characterized in that the apparatus comprises:
a first matching module, configured to, in response to a search request for an original name to be searched, match the original name based on characters;
a word segmentation module, configured to, if the matching fails, split the original name to obtain a plurality of word segments, and classify the word segments according to preset categories to obtain a plurality of word segments with categories;
a recombination module, configured to recombine the plurality of word segments based on a target category among the categories to obtain a plurality of candidate names;
a second matching module, configured to perform similarity matching between each of the candidate names and a preset name database to determine the target candidate name with the highest similarity; and
a classification module, configured to perform semantic classification matching between the target candidate name and the name database based on a trained classification model to obtain a matching result.
11. An electronic device, comprising: a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the name matching method according to any one of claims 1 to 9.
12. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the name matching method according to any one of claims 1 to 9.
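
The first stages recited in claims 1 to 4 (exact character match, splitting, category tagging, and recombination) can be sketched in Python as follows. This is an illustrative reading only, not the claimed implementation: the tokenizer is passed in as a callable (`jieba.lcut` could be supplied for Chinese text, per claim 3), the category library is modelled as a plain dictionary, and the recombination rule (each target-category segment alone, then paired with each non-target segment) is an assumption, since the claims do not fix the exact pairing. All function and parameter names are hypothetical.

```python
from itertools import product
from typing import Callable, Collection, Dict, List


def build_candidates(
    original: str,
    known_names: Collection[str],
    tokenize: Callable[[str], List[str]],
    category_of: Dict[str, str],
    target_category: str,
    sep: str = "",
) -> List[str]:
    """Return the original name on an exact hit, else recombined candidate names."""
    # Step 1: character-based matching against the name database.
    if original in known_names:
        return [original]

    # Step 2: split the name and tag each segment with its category.
    tagged = [(seg, category_of.get(seg, "unknown")) for seg in tokenize(original)]
    targets = [seg for seg, cat in tagged if cat == target_category]
    others = [seg for seg, cat in tagged if cat != target_category]

    # Step 3 (assumed rule): each target segment alone, then target + non-target pairs.
    combos = list(targets)
    combos += [t + sep + o for t, o in product(targets, others)]

    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(combos))
```

For space-delimited text, `str.split` with `sep=" "` is enough to exercise the sketch; the resulting candidates would then be passed to the similarity matching and classification steps.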
CN202210569401.3A 2022-05-24 2022-05-24 Name matching method and device Active CN114911999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210569401.3A CN114911999B (en) 2022-05-24 2022-05-24 Name matching method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210569401.3A CN114911999B (en) 2022-05-24 2022-05-24 Name matching method and device

Publications (2)

Publication Number Publication Date
CN114911999A (en) 2022-08-16
CN114911999B (en) 2024-11-29

Family

ID=82768152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210569401.3A Active CN114911999B (en) 2022-05-24 2022-05-24 Name matching method and device

Country Status (1)

Country Link
CN (1) CN114911999B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909532A (en) * 2019-10-31 2020-03-24 银联智惠信息服务(上海)有限公司 User name matching method and device, computer equipment and storage medium
CN111310456A (en) * 2020-02-13 2020-06-19 支付宝(杭州)信息技术有限公司 Entity name matching method, device and equipment
WO2021217850A1 (en) * 2020-04-26 2021-11-04 平安科技(深圳)有限公司 Disease name code matching method and apparatus, computer device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUOCHAO SONG 等: "Entity Matching Using Different Level Similarity for Different Attributes", 2018 IEEE 9TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND SERVICE SCIENCE (ICSESS), 10 March 2019 (2019-03-10), pages 779 - 782 *
孙海霞 等: "科技文献数据库中机构名称匹配策略研究", 数据分析与知识发现, no. 08, 25 August 2018 (2018-08-25), pages 88 - 97 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024066903A1 (en) * 2022-09-30 2024-04-04 上海寰通商务科技有限公司 Method and device for recognizing pharmaceutical-industry target object to be recognized, and medium
CN117633518A (en) * 2024-01-25 2024-03-01 北京大学 An industrial chain construction method and system
CN117633518B (en) * 2024-01-25 2024-04-26 北京大学 Industrial chain construction method and system

Also Published As

Publication number Publication date
CN114911999B (en) 2024-11-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant