CN114911999A - Name matching method and device - Google Patents
- Publication number
- CN114911999A (application CN202210569401.3A)
- Authority
- CN
- China
- Prior art keywords
- name
- matching
- candidate
- similarity
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiment of the invention provides a name matching method and a name matching device, which relate to the field of big data. The method comprises the following steps: in response to a search request for an original name to be searched, matching the original name based on characters; if the matching fails, splitting the original name to obtain a plurality of segmented words, and classifying the segmented words according to preset categories to obtain a plurality of categorized segmented words; recombining the segmented words based on a target category among the categories to obtain a plurality of candidate names; performing similarity matching between each candidate name and a preset name database to determine the target candidate name with the highest similarity; and performing semantic classification matching between the target candidate name and the name database based on a trained classification model to obtain a matching result. The embodiment of the invention reduces the matching calculation workload and improves the matching accuracy.
Description
Technical Field
The invention relates to the technical field of big data, in particular to a name matching method and a name matching device.
Background
To support comprehensive customer insight and product recommendation, the related art allows customer information (including customer development information, income information, bidding information, business status information, enterprise graphs, complaint information, and the like) to be retrieved by entering an enterprise name into a system. In essence, the enterprise name (a company name crawled from web pages) must be associated with the customer name (a company name recorded and stored in the system) so that all information corresponding to that customer name can be retrieved. Users often enter a shortened or abbreviated name, and because of non-standard input, wrongly written characters, and the wide variety of abbreviations, the system often cannot find the customer name accurately.
To address this problem, the related art typically computes the similarity between the enterprise name and each customer name, sorts the matching scores, and takes the customer names whose scores exceed a certain threshold as the matching result for the enterprise name. The accuracy of this kind of similarity matching is generally low.
In addition, string matching usually relies on a shortest-edit-distance similarity algorithm, which considers the semantic relation between text contexts as a whole. It is a common distance measure and is widely used in string similarity matching, but it still has some problems:
1) the traditional edit distance algorithm considers only the number of edit operations and lacks general applicability;
2) the traditional edit distance algorithm is biased when long character strings contain insertion or deletion errors, so the matching accuracy is low.
Disclosure of Invention
In view of the above problems, embodiments of the present invention are proposed to provide a name matching method and a corresponding name matching apparatus that overcome or at least partially solve the above problems.
In order to solve the above problem, an embodiment of the present invention discloses a name matching method, which is characterized in that the method includes:
in response to a search request for an original name to be searched, matching the original name based on characters;
if the matching fails, splitting the original name to obtain a plurality of segmented words, and classifying the segmented words according to preset categories to obtain a plurality of categorized segmented words;
recombining the plurality of segmented words based on a target category among the categories to obtain a plurality of candidate names;
respectively carrying out similarity matching on the candidate names and a preset name database to determine a target candidate name with the highest similarity;
and performing semantic classification matching on the target candidate name and the name database based on the trained classification model to obtain a matching result.
Preferably, the matching the original name based on characters includes:
detecting whether a name identical to the character of the original name exists in the name database;
if yes, matching is successful; if not, the matching fails.
Preferably, the splitting the original name to obtain a plurality of segmented words, and classifying the segmented words according to preset categories to obtain a plurality of categorized segmented words includes:
splitting the original name by using jieba to obtain a plurality of segmented words;
and matching each segmented word against preset category libraries, determining the category corresponding to each segmented word, and obtaining a plurality of categorized segmented words.
Preferably, the recombining the plurality of segmented words based on a target category among the categories to obtain a plurality of candidate names includes:
and recombining the segmented words of the target category with the segmented words of the non-target categories, respectively, to obtain a plurality of candidate names.
Preferably, the performing similarity matching between the candidate names and a preset name database respectively to determine a target candidate name with the highest similarity includes:
respectively carrying out similarity matching on the candidate names and the name database, and determining the candidate similarity of each candidate name;
and determining the target candidate similarity with the highest similarity in the candidate similarities, and taking the candidate name corresponding to the target candidate similarity as the target candidate name.
Preferably, the performing similarity matching between the plurality of candidate names and the name database, and determining the candidate similarity of each candidate name includes:
for any candidate name in the candidate names, performing similarity matching on the candidate name and at least one preset name in the name database to obtain at least one similarity;
and determining the candidate similarity with the highest similarity in the at least one similarity.
Preferably, the performing similarity matching between the any candidate name and at least one preset name in the name database to obtain at least one similarity includes:
aiming at any preset name in the at least one preset name, acquiring a forward maximum common substring and a backward maximum common substring of the any candidate name and the any preset name;
calculating forward similarity based on the forward maximum common substring, and calculating backward similarity by using the backward maximum common substring;
and calculating the similarity between any candidate name and any preset name based on the forward similarity and the backward similarity.
Preferably, the determining a target candidate similarity with the highest similarity among the multiple candidate similarities and taking a candidate name corresponding to the target candidate similarity as a target candidate name includes:
normalizing the candidate similarities based on the forward maximum common substring and the backward maximum common substring to obtain a target candidate similarity with the highest similarity;
and taking the candidate name corresponding to the target candidate similarity as a target candidate name.
Preferably, the semantic classification matching of the target candidate name and the name database based on the trained classification model to obtain a matching result includes:
inputting the target candidate name into a trained classification model, so that the classification model performs semantic classification matching on the target candidate name and the name database by adopting a preset characteristic index;
if the matching is successful, taking the matched preset name as a matching result; and if the matching fails, generating matching failure information.
Correspondingly, the embodiment of the invention discloses a name matching device, which is characterized by comprising:
a first matching module, configured to match an original name based on characters in response to a search request for the original name to be searched;
a word segmentation module, configured to split the original name to obtain a plurality of segmented words if the matching fails, and to classify the segmented words according to preset categories to obtain a plurality of categorized segmented words;
a recombination module, configured to recombine the plurality of segmented words based on a target category among the categories to obtain a plurality of candidate names;
a second matching module, configured to perform similarity matching between each candidate name and a preset name database to determine the target candidate name with the highest similarity;
and a classification module, configured to perform semantic classification matching between the target candidate name and the name database based on a trained classification model to obtain a matching result.
Preferably, the first matching module is specifically configured to:
detecting whether a name identical to the character of the original name exists in the name database;
if yes, matching is successful; if not, the matching fails.
Preferably, the word segmentation module is specifically configured to:
splitting the original name by using jieba to obtain a plurality of segmented words;
and matching each segmented word against preset category libraries, determining the category corresponding to each segmented word, and obtaining a plurality of categorized segmented words.
Preferably, the recombination module is specifically configured to:
and recombining the segmented words of the target category with the segmented words of the non-target categories, respectively, to obtain a plurality of candidate names.
Preferably, the second matching module includes:
the similarity matching submodule is used for respectively performing similarity matching on the candidate names and the name database and determining the candidate similarity of each candidate name;
and the determining submodule is used for determining the target candidate similarity with the highest similarity in the candidate similarities and taking the candidate name corresponding to the target candidate similarity as the target candidate name.
Preferably, the similarity matching sub-module includes:
a matching unit, configured to perform similarity matching on any candidate name in the multiple candidate names and at least one preset name in the name database to obtain at least one similarity;
a determining unit, configured to determine a candidate similarity with a highest similarity among the at least one similarity.
Preferably, the matching unit is specifically configured to:
aiming at any preset name in the at least one preset name, acquiring a forward maximum common substring and a backward maximum common substring of the any candidate name and the any preset name;
calculating forward similarity based on the forward maximum common substring, and calculating backward similarity by using the backward maximum common substring;
and calculating the similarity between any candidate name and any preset name based on the forward similarity and the backward similarity.
Preferably, the determining submodule is specifically configured to:
normalizing the candidate similarities based on the forward maximum common substring and the backward maximum common substring to obtain a target candidate similarity with the highest similarity;
and taking the candidate name corresponding to the target candidate similarity as a target candidate name.
Preferably, the classification module is specifically configured to:
inputting the target candidate name into a trained classification model, so that the classification model performs semantic classification matching on the target candidate name and the name database by adopting a preset characteristic index;
if the matching is successful, taking the matched preset name as a matching result; and if the matching fails, generating matching failure information.
Correspondingly, the embodiment of the invention discloses an electronic device, which comprises: a processor, a memory and a computer program stored on the memory and capable of running on the processor, which computer program, when executed by the processor, performs the steps of the above-described name matching method embodiments.
Correspondingly, the embodiment of the invention discloses a computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program realizes the steps of the name matching method embodiment.
The embodiment of the invention has the following advantages:
A background server responds to a search request for an original name to be searched and matches the original name based on characters. If the matching fails, the server splits the original name to obtain a plurality of segmented words, classifies the segmented words according to preset categories to obtain a plurality of categorized segmented words, recombines the segmented words based on a target category among the categories to obtain a plurality of candidate names, performs similarity matching between each candidate name and a preset name database to determine the target candidate name with the highest similarity, and performs semantic classification matching between the target candidate name and the name database based on a trained classification model to obtain a matching result. Therefore, after exact matching of the original name fails, the original name can be split by category into a plurality of segmented words, and the segmented words can then be recombined around a specified category to obtain a plurality of new names. Moreover, semantic classification matching is performed on the determined target candidate name, which converts the traditional similarity matching problem into a machine classification problem and evaluates the similarity of a name pair in terms of semantic similarity, so that the matching accuracy is improved and the matching calculation workload is further reduced.
Drawings
FIG. 1 is a flow chart of the steps of one embodiment of a name matching method of the present invention;
FIG. 2 is a block diagram of a name matching apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
One of the core ideas of the embodiment of the invention is as follows. A background server responds to a search request for an original name to be searched and matches the original name based on characters. If the matching fails, the server splits the original name to obtain a plurality of segmented words, classifies the segmented words according to preset categories to obtain a plurality of categorized segmented words, recombines the segmented words based on a target category among the categories to obtain a plurality of candidate names, performs similarity matching between each candidate name and a preset name database to determine the target candidate name with the highest similarity, and performs semantic classification matching between the target candidate name and the name database based on a trained classification model to obtain a matching result. Therefore, after exact matching of the original name fails, the original name can be split by category into a plurality of segmented words, and the segmented words can then be recombined around a specified category to obtain a plurality of new names. Moreover, semantic classification matching is performed on the determined target candidate name, which converts the traditional similarity matching problem into a machine classification problem and evaluates the similarity of a name pair in terms of semantic similarity, so that the matching accuracy is improved and the matching calculation workload is further reduced.
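Referring to FIG. 1, a flowchart of the steps of an embodiment of a name matching method according to the present invention is shown, which may specifically include the following steps:
Step 101, in response to a search request for an original name to be searched, matching the original name based on characters.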
The embodiment of the invention can be applied to a background server that exchanges data with a front-end device. A user enters the name to be searched (denoted the original name) on the front-end device and issues a search instruction. On receiving the instruction, the front-end device generates a search request from the original name and sends it to the background server. On receiving the search request, the background server performs exact character matching between the original name and a preset name database to determine whether the database contains a name whose characters are identical to those of the original name.
The name database comprises at least one stored name (denoted a preset name). A name may be the name of an enterprise, a company, and so on, such as "Beijing One-Two-Three Culture Communication Co., Ltd.".
In this embodiment of the present invention, the matching the original name based on the characters includes:
detecting whether a name identical to the character of the original name exists in the name database;
if yes, matching is successful; if not, the matching fails.
Specifically, exact matching checks whether the characters of the original name are identical to those of any preset name. If an identical preset name exists, the matching succeeds; otherwise, the matching fails.
For example, if the original name is "Beijing One-Two-Three Culture Communication Co., Ltd." and a preset name in the name database is also "Beijing One-Two-Three Culture Communication Co., Ltd.", the exact match succeeds; if the preset names in the name database differ from the original name in any character, the exact match fails.
When the exact match succeeds, the matching result is returned.
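The exact-match step can be sketched as a simple membership test. This is a minimal illustration, not the patented implementation: the function name `exact_match` and the in-memory set standing in for the name database are assumptions made for the example.

```python
def exact_match(original_name, preset_names):
    """Return the identical preset name, or None if exact matching fails.

    `preset_names` is a hypothetical in-memory stand-in for the name database.
    """
    # Exact matching is character-for-character equality, so a set lookup suffices.
    return original_name if original_name in preset_names else None


name_db = {"Beijing One-Two-Three Culture Communication Co., Ltd."}

hit = exact_match("Beijing One-Two-Three Culture Communication Co., Ltd.", name_db)
miss = exact_match("One-Two-Three Culture", name_db)  # abbreviated input
```

A `None` result corresponds to the "matching fails" branch, which hands the original name on to step 102.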
Step 102, if the matching fails, splitting the original name to obtain a plurality of segmented words, and classifying the segmented words according to preset categories to obtain a plurality of categorized segmented words.
When the exact match fails, the original name can be split into a plurality of segmented words, and each segmented word is then classified to determine its category.
In this embodiment of the present invention, the splitting the original name to obtain a plurality of segmented words, and classifying the segmented words according to preset categories to obtain a plurality of categorized segmented words includes:
splitting the original name by using jieba to obtain a plurality of segmented words;
and matching each segmented word against preset category libraries, determining the category corresponding to each segmented word, and obtaining a plurality of categorized segmented words.
Specifically, the word segmentation tool jieba may be used to split the original name into a plurality of segmented words. Each segmented word is then matched against the preset category libraries to determine its category.
The names of enterprises, companies, and the like generally consist of four parts: region, keyword, industry, and suffix. Because the number of region, industry, and suffix words is limited, three category libraries can be preset in the embodiment of the invention: a region library, an industry library, and a suffix library.
The region library includes, but is not limited to: Beijing, Shanghai, Sichuan, Hebei, Hunan, Shaanxi, Yunnan, Henan, Gansu, Shandong, Hubei, Guangxi, Anhui, Jiangxi, Xinjiang, Shanxi, Fujian, Inner Mongolia, Zhejiang, Heilongjiang, Guizhou, Jilin, Qinghai, Xizang, Liaoning, Guangdong, Jiangsu, Hainan, Ningxia, Shenzhen, Chongqing, Hong Kong, Macao, Taiwan.
The industry library includes, but is not limited to: information, science and technology, commerce, trade, service, advertising, technology, culture, media, dissemination, development, exchange, consultation, information, management, design, maintenance, logistics, training, design, lease, construction, engineering, equipment.
The suffix library includes, but is not limited to: company, division, and other company-type suffixes.
Thus, after the segmented words are obtained, each one is matched against the three category libraries to determine its category.
For example, in the above example, segmenting "Beijing One-Two-Three Culture Communication Co., Ltd." yields the segmented words "Beijing", "One-Two-Three", "Culture", "Communication", and "Co., Ltd.". Each segmented word is then matched against the category libraries, giving the categories shown in Table 1:
Category | Value |
Region | Beijing |
Keyword | One-Two-Three |
Industry | Culture, Communication |
Suffix | Co., Ltd. |
TABLE 1
It should be noted that the categories and the values in each category are not limited to those listed above. Other classification schemes and values may be set according to actual requirements, and the embodiment of the present invention is not limited in this respect.
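The classification step can be sketched as a lookup against the three category libraries. The sketch below is illustrative only: the library contents are abbreviated, the token list is hard-coded rather than produced by a segmenter such as jieba, and treating every token not found in any library as a keyword is an assumption made for this example.

```python
# Hypothetical, abbreviated category libraries; real ones would be much larger.
REGION_LIB = {"Beijing", "Shanghai", "Sichuan"}
INDUSTRY_LIB = {"Culture", "Communication", "Technology"}
SUFFIX_LIB = {"Co., Ltd.", "Company"}


def classify_tokens(tokens):
    """Label each segmented word as region, industry, suffix, or keyword."""
    labeled = []
    for token in tokens:
        if token in REGION_LIB:
            category = "region"
        elif token in INDUSTRY_LIB:
            category = "industry"
        elif token in SUFFIX_LIB:
            category = "suffix"
        else:
            # Assumption: anything outside the three libraries is a keyword.
            category = "keyword"
        labeled.append((token, category))
    return labeled


# Tokens as a word segmenter might return them (hard-coded for this sketch).
tokens = ["Beijing", "One-Two-Three", "Culture", "Communication", "Co., Ltd."]
labeled = classify_tokens(tokens)
```

Keeping the libraries as sets makes each lookup cheap, which fits the observation that only the region, industry, and suffix vocabularies are small enough to enumerate.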
Step 103, recombining the plurality of segmented words based on a target category among the categories to obtain a plurality of candidate names.
After the categorized segmented words are obtained, they may be recombined around a target category among the categories to obtain a plurality of candidate names.
In this embodiment of the present invention, the recombining the plurality of segmented words based on a target category among the categories to obtain a plurality of candidate names includes:
and recombining the segmented words of the target category with the segmented words of the non-target categories, respectively, to obtain a plurality of candidate names.
Specifically, among the four categories described above, the most important one in an enterprise or company name is the keyword, so the segmented words can be recombined around the keyword. Recombination strategies include, but are not limited to: keyword 1, keyword 2, ..., keyword n, keyword 1 + keyword 2, ..., region + keyword 1, region + keyword 2, ..., keyword 1 + industry, ....
For example, the candidate names obtained by recombining "Beijing One-Two-Three Culture Communication Co., Ltd." are shown in Table 2:
TABLE 2
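The keyword-centered recombination strategies can be sketched as follows. This is a hedged illustration under stated assumptions: the function `recombine`, the space separator (Chinese names would normally be concatenated without one), and the exact set of strategies are choices made for the example, so the output is not claimed to reproduce Table 2.

```python
from itertools import combinations


def recombine(labeled_tokens, sep=" "):
    """Build candidate names centered on the keyword tokens."""
    by_cat = {"region": [], "keyword": [], "industry": [], "suffix": []}
    for token, category in labeled_tokens:
        by_cat[category].append(token)

    keywords = by_cat["keyword"]
    candidates = []
    # Single keywords first, then in-order combinations of several keywords.
    for size in range(1, len(keywords) + 1):
        for combo in combinations(keywords, size):
            candidates.append(sep.join(combo))
    # region + keyword and keyword + industry pairs.
    for keyword in keywords:
        for region in by_cat["region"]:
            candidates.append(sep.join([region, keyword]))
        for industry in by_cat["industry"]:
            candidates.append(sep.join([keyword, industry]))
    # Remove duplicates while preserving the generation order.
    return list(dict.fromkeys(candidates))


labeled_tokens = [
    ("Beijing", "region"),
    ("One-Two-Three", "keyword"),
    ("Culture", "industry"),
    ("Communication", "industry"),
    ("Co., Ltd.", "suffix"),
]
candidates = recombine(labeled_tokens)
```

Suffix tokens are deliberately left out of the candidates here, on the assumption that they carry little distinguishing information; other strategies could include them.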
Step 104, performing similarity matching between each of the candidate names and a preset name database, and determining the target candidate name with the highest similarity.
After the candidate names are obtained, each candidate name may be matched for similarity against the name database to obtain a similarity for each candidate name, and the candidate name with the highest similarity among all candidate names is then taken as the target candidate name.
In the embodiment of the present invention, the performing similarity matching between the multiple candidate names and a preset name database respectively to determine a target candidate name with the highest similarity includes:
respectively carrying out similarity matching on the candidate names and the name database, and determining the candidate similarity of each candidate name;
and determining the target candidate similarity with the highest similarity in the candidate similarities, and taking the candidate name corresponding to the target candidate similarity as the target candidate name.
Specifically, for any candidate name in all candidate names, similarity matching may be performed between the any candidate name and all preset names in the name database, so as to obtain candidate similarity corresponding to each candidate name one to one, then a target candidate similarity with the highest similarity is determined from a plurality of candidate similarities, and the candidate name corresponding to the target candidate similarity is used as the target candidate name.
For example, for 7 candidate names in table 2, each candidate name is respectively matched with the name database, the one-to-one correspondence similarity of each candidate name is obtained through calculation, and then the candidate name with the highest similarity among the 7 similarities is used as the target candidate name.
In this embodiment of the present invention, the performing similarity matching between the plurality of candidate names and the name database, and determining the candidate similarity of each candidate name includes:
for any candidate name in the candidate names, performing similarity matching on the candidate name and at least one preset name in the name database to obtain at least one similarity;
and determining the candidate similarity with the highest similarity in the at least one similarity.
Specifically, for any candidate name in the plurality of candidate names, similarity matching may be performed between the any candidate name and all preset names in the name database to obtain a plurality of similarities, and then the similarity with the highest similarity is used as the candidate similarity of the any candidate name. And circulating the steps to obtain the candidate similarity corresponding to each candidate name one by one.
For example, in the above example, assume that the name database includes 50 preset names. The candidate name "One-Two-Three" among the 7 candidate names is matched for similarity against the 50 preset names to obtain 50 similarities, and the highest of these 50 similarities is taken as its candidate similarity. Repeating this for the other candidate names yields the candidate similarity of each of the 7 candidate names.
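The two-level selection described above (the best preset name per candidate, then the best candidate overall) can be sketched as below. The similarity function is a stand-in: `difflib.SequenceMatcher` from the Python standard library is used purely as a placeholder and is not the forward/backward common-substring measure of this method. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def stand_in_similarity(a, b):
    # Placeholder metric only; swap in the method's own similarity measure.
    return SequenceMatcher(None, a, b).ratio()


def candidate_similarity(candidate, preset_names, sim=stand_in_similarity):
    """Highest similarity between one candidate name and any preset name."""
    return max(sim(candidate, preset) for preset in preset_names)


def pick_target_candidate(candidates, preset_names, sim=stand_in_similarity):
    """Return (candidate, similarity) for the best-scoring candidate name."""
    scored = [(c, candidate_similarity(c, preset_names, sim)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])


presets = ["Beijing One-Two-Three Culture Communication Co., Ltd.",
           "Shanghai Four-Five-Six Technology Co., Ltd."]
target, score = pick_target_candidate(
    ["One-Two-Three", "Beijing One-Two-Three"], presets)
```

Passing the similarity function as a parameter keeps the selection logic independent of whichever string measure is eventually used.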
Wherein, the performing similarity matching between the any candidate name and at least one preset name in the name database to obtain at least one similarity comprises:
for any preset name in the at least one preset name, acquiring a forward maximum common substring and a backward maximum common substring of the candidate name and the preset name;
calculating forward similarity based on the forward maximum common substring, and calculating backward similarity by using the backward maximum common substring;
and calculating the similarity between any candidate name and any preset name based on the forward similarity and the backward similarity.
String matching commonly uses a similarity measure based on the shortest edit distance. It is a common distance function that considers the semantic relation between text contexts as a whole, and it is widely applied in the field of string similarity matching. The shortest edit distance is the minimum number of editing operations required to convert a source string S into a target string T; the fewer operations required, the higher the similarity between the two strings. There are 3 basic editing operations: inserting a character into the string S; deleting a character from the string S; and replacing a character in the string S with a character from the string T.
In the conventional shortest edit distance algorithm, the similarity between two strings can be calculated from the shortest edit distance. Intuitively, the smaller the edit distance between two strings, the higher their similarity. The formula that converts the edit distance into a similarity in the interval [0,1] is as follows:
$$\mathrm{sim}(S, T) = 1 - \frac{ld}{\max(|S|, |T|)} \qquad (1)$$
where |S| and |T| denote the lengths of the strings S and T, respectively, and ld denotes the shortest edit distance between S and T. The larger sim(S, T) is, the more similar the two strings are.
However, the conventional shortest edit distance algorithm considers only the number of editing operations and the maximum string length; it ignores the common substrings between the strings and is therefore not universally applicable. For example, for the strings S_1 = 'BC', S_2 = 'CD' and S_3 = 'EF', the pairwise similarities calculated according to equation (1) are as follows:
1) Converting S_1 into S_2 requires two editing operations (for example, deleting 'B' and appending 'D'), so the shortest edit distance is 2 and sim(S_1, S_2) = 1 - 2/2 = 0.
2) Converting S_1 into S_3 requires two replacement operations, so the shortest edit distance is 2 and sim(S_1, S_3) = 1 - 2/2 = 0.
The calculation shows that the similarity of S_1 and S_2 equals that of S_1 and S_3. Obviously, however, S_1 and S_2 are more similar than S_1 and S_3, because S_1 and S_2 share the maximum common substring "C". A maximum common substring of two strings is a string X that occurs in both strings and is the longest of all strings satisfying this condition.
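The limitation can be reproduced with a short sketch. It assumes the common normalization sim = 1 - ld / max(|S|, |T|) for the conventional similarity; the function names are illustrative only.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance with unit-cost insertion, deletion, and replacement."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs from s
                curr[j - 1] + 1,     # insert ct into s
                prev[j - 1] + cost,  # replace cs with ct, or keep it
            ))
        prev = curr
    return prev[-1]


def edit_similarity(s: str, t: str) -> float:
    """Conventional similarity in [0, 1]; assumed form 1 - ld / max(|S|, |T|)."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - edit_distance(s, t) / longest


print(edit_similarity("BC", "CD"))  # shares the character 'C'
print(edit_similarity("BC", "EF"))  # shares nothing
```

Both calls return the same value, even though only the first pair shares a character, which is the behaviour the improved algorithm is meant to correct.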
To remedy this shortcoming of the shortest edit distance algorithm, the present invention provides a shortest-edit-distance similarity algorithm that uses a forward maximum common substring and a backward maximum common substring.
Specifically, denote any one of the plurality of candidate names as S and any one of the preset names as T. The forward maximum common substring of S and T is obtained and denoted lcs, and the backward maximum common substring is obtained and denoted rcs. The forward similarity is then calculated using formula (2), and the backward similarity using formula (3).
Here |S| and |T| denote the lengths of the strings S and T, respectively; |lcs| and |rcs| denote the lengths of the forward maximum common substring lcs and the backward maximum common substring rcs, respectively; and ld denotes the shortest edit distance between S and T. The forward maximum common substring is the maximum common substring of the two strings taken from left to right, and the backward maximum common substring is the maximum common substring of the two strings taken from right to left.
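One possible reading of the forward and backward maximum common substrings is sketched below. The interpretation is an assumption: the forward variant scans from left to right and keeps the first longest common substring it finds, while the backward variant performs the same scan on the reversed strings. The two differ only when several common substrings tie for the greatest length. The function names are hypothetical.

```python
def forward_common_substring(s: str, t: str) -> str:
    """Longest common contiguous substring; ties go to the leftmost match in s."""
    best_len, best_end = 0, 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                # Extend the common run that ended at s[i-2], t[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:  # strict '>' keeps the earliest longest run
                    best_len, best_end = curr[j], i
        prev = curr
    return s[best_end - best_len:best_end]


def backward_common_substring(s: str, t: str) -> str:
    """Same search performed from right to left (assumed reading of 'backward')."""
    return forward_common_substring(s[::-1], t[::-1])[::-1]


print(forward_common_substring("BC", "CD"))   # common substring of S_1 and S_2
print(backward_common_substring("BC", "CD"))
```

For the example strings, both directions find "C" for the pair 'BC' and 'CD' and an empty string for 'BC' and 'EF', so |lcs| and |rcs| distinguish the two pairs where the plain edit distance does not.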
Then, the final similarity is calculated from the forward similarity, the backward similarity, and their respective weights, as shown in formula (4):
where α and β are the weights, and α + β = 1.
For example, continuing the above example and assuming that α = 0.5 and β = 0.5, the similarities are calculated using formula (4):
From the calculation result, the similarity of S_1 and S_2 is greater than that of S_1 and S_3, which matches the actual situation better than the result of the conventional shortest edit distance calculation.
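The weighting step can be sketched as below. The sketch assumes that the combination is a weighted sum of the two directional similarities; the forward and backward similarities themselves are taken as inputs because their formulas are not reproduced in this text. The function name is hypothetical.

```python
import math


def combined_similarity(
    sim_forward: float, sim_backward: float, alpha: float = 0.5, beta: float = 0.5
) -> float:
    """Weighted combination of forward and backward similarity (assumed weighted sum)."""
    if not math.isclose(alpha + beta, 1.0):
        raise ValueError("alpha and beta must sum to 1")
    return alpha * sim_forward + beta * sim_backward
```

Choosing α = β = 0.5 treats both directions equally; a larger α would favour agreement at the start of the names.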
In the embodiment of the present invention, the determining a target candidate similarity with the highest similarity among a plurality of candidate similarities and using a candidate name corresponding to the target candidate similarity as a target candidate name includes:
normalizing the candidate similarities based on the forward maximum common substring and the backward maximum common substring to obtain a target candidate similarity with the highest similarity;
and taking the candidate name corresponding to the target candidate similarity as a target candidate name.
In string similarity matching, when the lengths of the maximum common substrings differ greatly, the maximum common substring affects the final similarity result to some extent. The similarities therefore need to be normalized, which overcomes the calculation defect of the conventional shortest edit distance algorithm and improves matching accuracy. Specifically, after the candidate similarity of each candidate name is obtained, the candidate similarities may be normalized using the forward maximum common substring and the backward maximum common substring, as shown in formula (5):
In this way, the final target candidate similarity can be determined from the candidate similarities, and the candidate name corresponding to it is taken as the final target candidate name.
Step 105: performing semantic classification matching on the target candidate name and the name database based on the trained classification model to obtain a matching result.
After the target candidate name is obtained, it is input into the trained classification model, so that the classification model performs classification matching between the target candidate name and all preset names in the name database and outputs a matching result indicating whether the target candidate name is matched.
In the embodiment of the invention, the target candidate name is input into a trained classification model, so that the classification model performs semantic classification matching on the target candidate name and the name database by adopting a preset characteristic index;
if the matching is successful, taking the matched preset name as a matching result; and if the matching fails, generating matching failure information.
Specifically, the target candidate name is input into the trained classification model, and the classification model uses preset characteristic indexes to perform classification matching between the target candidate name and all preset names in the name database. The preset characteristic indexes include, but are not limited to, those shown in Table 3:
TABLE 3
The client name is the company name recorded and stored in the background server, i.e., the name to be matched against; the enterprise name is the company name crawled from a web page, i.e., the name to be searched.
During classification matching, the classification model judges, for each preset name, whether the target candidate name and that preset name have the same semantics. If the target candidate name has the same semantics as any preset name, the output is 1, i.e., the matching succeeds, and that preset name is taken as the matching result; if its semantics differ from those of all preset names, the output is 0, i.e., the matching fails, and matching failure information is generated. In this way, the text semantic similarity matching problem is converted into a classification problem, which addresses the low matching rate caused by considering only literal similarity and neglecting semantic similarity.
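The decision logic described above can be sketched as follows. The trained classification model is represented by a hypothetical callable `predict_same` that returns 1 when a name pair is judged to have the same semantics and 0 otherwise; the result format is likewise an assumption for illustration.

```python
from typing import Callable, Dict, Iterable, Union

Result = Dict[str, Union[bool, str]]


def classify_match(
    target: str,
    preset_names: Iterable[str],
    predict_same: Callable[[str, str], int],
) -> Result:
    """Return the first preset name labelled 1 by the classifier, or failure information."""
    for preset in preset_names:
        if predict_same(target, preset) == 1:  # label 1: same semantics
            return {"matched": True, "name": preset}
    # Every pair was labelled 0, so the matching fails.
    return {"matched": False, "message": f"no preset name matches {target!r}"}
```

In practice `predict_same` would compute the characteristic indexes for the pair and call the trained model; a case-insensitive equality check is enough to exercise the control flow.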
Further, the classification model may be generated by:
1) Determine a sample set.
Specifically, all client names in the background server and all enterprise names crawled from web pages are obtained. A certain proportion (for example, 1%) of the client names is randomly extracted as client name samples, and a certain proportion (for example, 1%) of the enterprise names is likewise randomly extracted as enterprise name samples; together these form the final training sample set. Classification labels are then set for the matching name pairs in the training sample set (a matching name pair is "client name-enterprise name"): label 1 indicates that the semantics are similar, and label 0 indicates that they are dissimilar. The remaining client names and enterprise names are used as the test sample set.
2) Set the characteristic indexes.
That is, the characteristic indexes used for classification are set for the classification model, including but not limited to those shown in Table 3.
The characteristic indexes comprise at least two aspects: on the one hand, various similarity features calculated between the names of a pair, and on the other hand, NLP (Natural Language Processing) data features calculated for the name pair.
3) Establish a classification learning model.
The embodiment of the invention uses a two-layer Stacking approach to build the machine learning model: the first layer selects GaussianNB, RandomForestClassifier and Logistic Regression as the Stacking base models, and the second layer selects RandomForestClassifier for training.
The training parameters are set as follows: learning rate ρ = 0.001 and loss function adjustment factors α = 0.25 and γ = 0.15, so that the model loss function can be minimized to reach the optimal solution.
4) Train the classification learning model.
The classification model is trained and verified using the above parameter settings and the test sample set; that is, the matching result of the client name and the recombined enterprise name is calculated, and a classification label 0 or 1 is output.
By repeating 1) to 4) to generate the classification model over multiple iterations, the comprehensive evaluation index (F value) of the final classification model can reach 0.79, and the matching accuracy between enterprise names and client names can exceed 90%, so that enterprise names and client names are matched efficiently and accurately.
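A two-layer Stacking model of the kind described in 3) can be sketched with scikit-learn's `StackingClassifier`. This is an illustrative assumption about tooling, not the embodiment's actual code: the synthetic features stand in for the characteristic indexes of Table 3, and the hyperparameters shown are placeholders rather than the ρ, α and γ settings given above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def build_stacking_model(random_state: int = 0) -> StackingClassifier:
    """First layer: three base models; second layer: a random forest meta-learner."""
    base_models = [
        ("gnb", GaussianNB()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=random_state)),
        ("lr", LogisticRegression(max_iter=1000)),
    ]
    return StackingClassifier(
        estimators=base_models,
        final_estimator=RandomForestClassifier(n_estimators=50, random_state=random_state),
        cv=5,  # out-of-fold base predictions are used to train the second layer
    )


# Synthetic stand-in for name-pair features; labels 1/0 mean similar/dissimilar.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = build_stacking_model().fit(X_train, y_train)
pred = model.predict(X_test)
print("F1 on held-out pairs:", f1_score(y_test, pred))
```

Training the second layer on cross-validated predictions of the first layer keeps the meta-learner from simply memorizing base-model outputs on the training pairs.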
In the embodiment of the invention, a background server, in response to a search request for an original name to be searched, matches the original name based on characters. If the matching fails, the original name is split to obtain a plurality of segmented words (participles); the segmented words are classified according to preset categories to obtain a plurality of segmented words with categories; the segmented words are recombined based on a target category among the categories to obtain a plurality of candidate names; the candidate names are each similarity-matched against a preset name database to determine the target candidate name with the highest similarity; and semantic classification matching is performed on the target candidate name and the name database based on a trained classification model to obtain a matching result. Thus, after exact matching of the original name fails, the original name can be split into a plurality of segmented words by category, and the segmented words can then be recombined around a specified category to obtain a plurality of new names. Moreover, semantic classification matching is performed on the determined target candidate name, which converts the traditional similarity matching problem into a machine classification problem and considers the similarity of a name pair in terms of semantics, improving matching accuracy and further reducing the matching calculation workload.
Furthermore, when the original name is split, the similarity calculation method based on the forward maximum common substring and the backward maximum common substring is used. This overcomes the defect that the conventional edit distance similarity considers only the number of edits, is universally applicable, and improves matching accuracy.
Furthermore, the conventional shortest edit distance algorithm ignores the influence of the common length of the strings on the edit distance when determining the target candidate name, so that a certain deviation exists in calculations involving, for example, insertion and deletion errors in long strings.
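The overall flow summarized above can be sketched as a single function. All helper callables (`segment`, `recombine`, `similarity`, `predict_same`) are hypothetical placeholders for the splitting, recombination, similarity and classification steps; the sketch only shows how the steps are chained.

```python
from typing import Callable, List, Optional, Sequence


def match_name(
    original: str,
    preset_names: Sequence[str],
    segment: Callable[[str], List[str]],
    recombine: Callable[[List[str]], List[str]],
    similarity: Callable[[str, str], float],
    predict_same: Callable[[str, str], int],
) -> Optional[str]:
    """Return the matched preset name, or None when every stage fails."""
    if not preset_names:
        return None
    # Character-based matching: an identical name already exists.
    if original in preset_names:
        return original
    # Split the name and recombine the segmented words into candidate names.
    candidates = recombine(segment(original))
    if not candidates:
        return None
    # Target candidate: the one with the highest similarity to any preset name.
    target = max(
        candidates,
        key=lambda c: max(similarity(c, p) for p in preset_names),
    )
    # Semantic classification matching on the single target candidate.
    for preset in preset_names:
        if predict_same(target, preset) == 1:
            return preset
    return None
```

Only one candidate reaches the classifier, so the number of model evaluations is bounded by the number of preset names rather than by candidates times preset names.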
It should be noted that for simplicity of description, the method embodiments are shown as a series of combinations of acts, but those skilled in the art will recognize that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 2, a block diagram of a name matching apparatus according to an embodiment of the present invention is shown, and specifically includes the following modules:
a first matching module 201, configured to, in response to a search request for an original name to be searched, match the original name based on characters;
a word segmentation module 202, configured to split the original name to obtain multiple segmented words if matching fails, and classify the multiple segmented words according to preset categories to obtain multiple segmented words with categories;
the recombination module 203 is configured to recombine the multiple participles based on a target category in the categories to obtain multiple candidate names;
a second matching module 204, configured to perform similarity matching on the multiple candidate names and a preset name database, respectively, and determine a target candidate name with the highest similarity;
the classification module 205 is configured to perform semantic classification matching on the target candidate name and the name database based on the trained classification model, so as to obtain a matching result.
In an embodiment of the present invention, the first matching module is specifically configured to:
detecting whether a name identical to the character of the original name exists in the name database;
if yes, matching is successful; if not, the matching fails.
In an embodiment of the present invention, the word segmentation module is specifically configured to:
splitting the original name by adopting jieba to obtain a plurality of participles;
and matching each participle with a preset category library, determining the category corresponding to each participle one by one, and obtaining a plurality of participles with the categories.
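The category lookup performed by the word segmentation module can be sketched as follows. The category library, the category labels and the default label are hypothetical examples. The tokens are written out by hand here; with the jieba package installed they could instead be produced by `jieba.lcut(original_name)`.

```python
from typing import Dict, List, Tuple

# Hypothetical category library: segmented word -> category label.
CATEGORY_LIBRARY: Dict[str, str] = {
    "上海": "region",
    "科技": "industry",
    "有限公司": "organization_form",
}


def categorize(
    tokens: List[str], category_library: Dict[str, str], default: str = "name"
) -> List[Tuple[str, str]]:
    """Attach a category to each segmented word; unknown words get the default label."""
    return [(token, category_library.get(token, default)) for token in tokens]


# Hand-written segmentation of a made-up company name.
tokens = ["上海", "示例", "科技", "有限公司"]
print(categorize(tokens, CATEGORY_LIBRARY))
```

Words that are absent from the library fall back to a default label, on the assumption that the distinctive part of a company name is the part least likely to appear in a preset library.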
In an embodiment of the present invention, the restructuring module is specifically configured to:
and recombining the participles with the target category and the participles with the non-target category respectively to obtain a plurality of candidate names.
In an embodiment of the present invention, the second matching module includes:
the similarity matching submodule is used for respectively performing similarity matching on the candidate names and the name database and determining the candidate similarity of each candidate name;
and the determining submodule is used for determining the target candidate similarity with the highest similarity in the candidate similarities and taking the candidate name corresponding to the target candidate similarity as the target candidate name.
In this embodiment of the present invention, the similarity matching sub-module includes:
a matching unit, configured to perform similarity matching on any candidate name in the multiple candidate names and at least one preset name in the name database to obtain at least one similarity;
a determining unit, configured to determine a candidate similarity with a highest similarity among the at least one similarity.
In an embodiment of the present invention, the matching unit is specifically configured to:
for any preset name in the at least one preset name, acquiring a forward maximum common substring and a backward maximum common substring of the candidate name and the preset name;
calculating forward similarity based on the forward maximum common substring, and calculating backward similarity by using the backward maximum common substring;
and calculating the similarity between any candidate name and any preset name based on the forward similarity and the backward similarity.
In an embodiment of the present invention, the determining sub-module is specifically configured to:
normalizing the candidate similarities based on the forward maximum common substring and the backward maximum common substring to obtain a target candidate similarity with the highest similarity;
and taking the candidate name corresponding to the target candidate similarity as a target candidate name.
In an embodiment of the present invention, the classification module is specifically configured to:
inputting the target candidate name into a trained classification model, so that the classification model performs semantic classification matching on the target candidate name and the name database by adopting a preset characteristic index;
if the matching is successful, taking the matched preset name as a matching result; and if the matching fails, generating matching failure information.
Since the device embodiment is basically similar to the method embodiment, its description is relatively brief; for relevant details, refer to the corresponding description of the method embodiment.
An embodiment of the present invention further provides an electronic device, including:
a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements each process of the above name matching method embodiment and achieves the same technical effect; to avoid repetition, details are not repeated here.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when being executed by a processor, the computer program implements each process of the name matching method embodiment, and can achieve the same technical effect, and in order to avoid repetition, the details are not repeated here.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The above detailed description is provided for a name matching method and a name matching device provided by the present invention, and the principle and the implementation manner of the present invention are explained by applying specific examples, and the description of the above embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
Claims (12)
1. A method of name matching, the method comprising:
in response to a search request for an original name to be searched, matching the original name based on characters;
if the matching fails, splitting the original name to obtain a plurality of participles, and classifying the participles according to preset categories to obtain a plurality of participles with categories;
recombining the multiple participles based on a target category in the categories to obtain multiple candidate names;
respectively carrying out similarity matching on the candidate names and a preset name database to determine a target candidate name with the highest similarity;
and performing semantic classification matching on the target candidate name and the name database based on the trained classification model to obtain a matching result.
2. The name matching method of claim 1, wherein said matching the original name based on characters comprises:
detecting whether a name identical to the character of the original name exists in the name database;
if yes, matching is successful; if not, the matching fails.
3. The name matching method according to claim 1, wherein the splitting the original name to obtain a plurality of participles, and classifying the plurality of participles according to preset categories to obtain a plurality of participles with categories comprises:
splitting the original name by adopting jieba to obtain a plurality of participles;
and matching each participle with a preset category library, determining the category corresponding to each participle one by one, and obtaining a plurality of participles with the categories.
4. The name matching method according to claim 1, wherein the recombining the plurality of participles based on a target category in the categories to obtain a plurality of candidate names comprises:
and recombining the participles with the target category and the participles with the non-target category respectively to obtain a plurality of candidate names.
5. The name matching method according to claim 1, wherein the similarity matching of the candidate names with a preset name database is performed to determine a target candidate name with the highest similarity, and the method includes:
respectively carrying out similarity matching on the candidate names and the name database, and determining the candidate similarity of each candidate name;
and determining the target candidate similarity with the highest similarity in the candidate similarities, and taking the candidate name corresponding to the target candidate similarity as the target candidate name.
6. The name matching method according to claim 5, wherein the similarity matching of the candidate names with the name database and the determination of the candidate similarity of each candidate name comprises:
for any candidate name in the candidate names, performing similarity matching on the candidate name and at least one preset name in the name database to obtain at least one similarity;
and determining the candidate similarity with the highest similarity in the at least one similarity.
7. The name matching method according to claim 6, wherein the similarity matching of any one of the candidate names with at least one preset name in the name database to obtain at least one similarity comprises:
for any preset name in the at least one preset name, acquiring a forward maximum common substring and a backward maximum common substring of the candidate name and the preset name;
calculating forward similarity based on the forward maximum common substring, and calculating backward similarity by using the backward maximum common substring;
and calculating the similarity between any candidate name and any preset name based on the forward similarity and the backward similarity.
8. The name matching method according to claim 5, wherein the determining a target candidate similarity with the highest similarity among the plurality of candidate similarities and taking a candidate name corresponding to the target candidate similarity as a target candidate name comprises:
normalizing the candidate similarities based on the forward maximum common substring and the backward maximum common substring to obtain a target candidate similarity with the highest similarity;
and taking the candidate name corresponding to the target candidate similarity as a target candidate name.
9. The name matching method according to claim 1, wherein the performing semantic classification matching on the target candidate name and the name database based on the trained classification model to obtain a matching result comprises:
inputting the target candidate name into a trained classification model, so that the classification model performs semantic classification matching on the target candidate name and the name database by adopting a preset characteristic index;
if the matching is successful, taking the matched preset name as a matching result; and if the matching fails, generating matching failure information.
10. A name matching apparatus, characterized in that the apparatus comprises:
a first matching module, configured to, in response to a search request for an original name to be searched, match the original name based on characters;
a word segmentation module, configured to split the original name to obtain a plurality of participles if the matching fails, and classify the plurality of participles according to preset categories to obtain a plurality of participles with categories;
the recombination module is used for recombining the multiple participles based on a target category in the categories to obtain multiple candidate names;
the second matching module is used for respectively carrying out similarity matching on the candidate names and a preset name database to determine a target candidate name with the highest similarity;
and the classification module is used for carrying out semantic classification matching on the target candidate name and the name database based on the trained classification model to obtain a matching result.
11. An electronic device, comprising: processor, memory and computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the name matching method as claimed in any one of claims 1 to 9.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the name matching method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210569401.3A CN114911999A (en) | 2022-05-24 | 2022-05-24 | Name matching method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114911999A true CN114911999A (en) | 2022-08-16 |
Family
ID=82768152
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210569401.3A Pending CN114911999A (en) | 2022-05-24 | 2022-05-24 | Name matching method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114911999A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110909532A (en) * | 2019-10-31 | 2020-03-24 | 银联智惠信息服务(上海)有限公司 | User name matching method and device, computer equipment and storage medium |
CN111310456A (en) * | 2020-02-13 | 2020-06-19 | 支付宝(杭州)信息技术有限公司 | Entity name matching method, device and equipment |
WO2021217850A1 (en) * | 2020-04-26 | 2021-11-04 | 平安科技(深圳)有限公司 | Disease name code matching method and apparatus, computer device and storage medium |
Non-Patent Citations (2)
Title |
---|
GUOCHAO SONG et al.: "Entity Matching Using Different Level Similarity for Different Attributes", 2018 IEEE 9TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND SERVICE SCIENCE (ICSESS), 10 March 2019 (2019-03-10), pages 779 - 782 * |
SUN Haixia (孙海霞) et al.: "Research on Institution Name Matching Strategies in Scientific and Technical Literature Databases" (科技文献数据库中机构名称匹配策略研究), Data Analysis and Knowledge Discovery (数据分析与知识发现), no. 08, 25 August 2018 (2018-08-25), pages 88 - 97 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024066903A1 (en) * | 2022-09-30 | 2024-04-04 | 上海寰通商务科技有限公司 | Method and device for recognizing pharmaceutical-industry target object to be recognized, and medium |
CN117633518A (en) * | 2024-01-25 | 2024-03-01 | 北京大学 | Industrial chain construction method and system |
CN117633518B (en) * | 2024-01-25 | 2024-04-26 | 北京大学 | Industrial chain construction method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10169337B2 (en) | Converting data into natural language form | |
Huq et al. | Sentiment analysis on Twitter data using KNN and SVM | |
CN111324784B (en) | Character string processing method and device | |
CN108280114B (en) | Deep learning-based user literature reading interest analysis method | |
US20100293179A1 (en) | Identifying synonyms of entities using web search | |
CN114911999A (en) | Name matching method and device | |
CN112416778B (en) | Test case recommendation method and device and electronic equipment | |
CN110163376B (en) | Sample detection method, media object identification method, device, terminal and medium | |
US9990268B2 (en) | System and method for detection of duplicate bug reports | |
CN111506608A (en) | Method and device for comparing structured texts | |
CN109408100B (en) | Software defect information fusion method based on multi-source data | |
CN112286799B (en) | Software defect positioning method combining sentence embedding and particle swarm optimization algorithm | |
CN109739554A (en) | Prevent code from repeating submission method, system, computer equipment and storage medium | |
CN115150354B (en) | Method and device for generating domain name, storage medium and electronic equipment | |
CN110309258B (en) | Input checking method, server and computer readable storage medium | |
CN112115362B (en) | Programming information recommendation method and device based on similar code recognition | |
CN115328945A (en) | Data asset retrieval method, electronic device and computer-readable storage medium | |
Ziv et al. | CompanyName2Vec: Company entity matching based on job ads | |
CN109344254B (en) | Address information classification method and device | |
JP2011100302A (en) | Ranking function generating device, ranking function generating method, and ranking function generation program | |
JP2009157458A (en) | Index creation device, its method, program, and recording medium | |
Li et al. | Machine learning methodology for enhancing automated process in IT incident management | |
Wunderle et al. | Pointer Networks: A Unified Approach to Extracting German Opinions | |
Van Delden et al. | Finding enterprise websites | |
Bansal et al. | Literature review of finding duplicate bugs in open source systems | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||