CN115712757A - Enterprise name matching method and device based on index tree - Google Patents


Info

Publication number
CN115712757A
Authority
CN
China
Prior art keywords
matching
matched
enterprise
index
name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211373485.XA
Other languages
Chinese (zh)
Inventor
向桥梁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liantong Hangzhou Technology Service Co ltd
Original Assignee
Liantong Hangzhou Technology Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liantong Hangzhou Technology Service Co ltd filed Critical Liantong Hangzhou Technology Service Co ltd
Priority to CN202211373485.XA priority Critical patent/CN115712757A/en
Publication of CN115712757A publication Critical patent/CN115712757A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application aims to provide an enterprise name matching method and device based on an index tree. The method determines the administrative region word and the word size industry word of each database enterprise name; establishes, according to the word size industry word of the database enterprise name, a multi-tuple corresponding to the administrative region word of that name, wherein the elements of the multi-tuple are all equal-length combinations of adjacent characters in the word size industry word; establishes an index tree whose index levels are, in order, the administrative region words, the multi-tuples and the database enterprise names; and determines, according to the index tree and an acquired enterprise name to be matched, one or more matching enterprise names corresponding to the enterprise name to be matched. Because the index tree is built from the compositional characteristics of enterprise names and matching is performed through the index tree, matching efficiency and matching accuracy are improved compared with the prior art.

Description

Enterprise name matching method and device based on index tree
Technical Field
The application relates to the technical field of computers, in particular to an enterprise name matching technology based on an index tree.
Background
The existing enterprise name matching method compares the enterprise name to be matched with every enterprise name in the database one by one to judge the degree of similarity, then screens all the database enterprise names according to that similarity and takes one or more of them as the matching result for the enterprise name to be matched.
Meanwhile, when the word size and industry content of two enterprise names are compared, a character string similarity algorithm is generally used. Although this can accurately determine how similar the word size and industry fields of the two names are, the high complexity of character string similarity algorithms makes execution inefficient and the time cost excessive.
Disclosure of Invention
The application aims to provide an index tree-based enterprise name matching method and equipment.
According to one aspect of the application, an index tree-based enterprise name matching method is provided, wherein the method comprises the following steps:
determining administrative region words and word size industry words of the database enterprise name;
establishing a multi-tuple corresponding to the administrative region word of the database enterprise name according to the word size industry word of the database enterprise name, wherein elements of the multi-tuple are all equal-length combinations of adjacent characters in the word size industry word;
establishing an index tree whose index levels are respectively the administrative region words, the multi-tuples and the database enterprise names;
and determining one or more matching enterprise names corresponding to the enterprise name to be matched according to the index tree and the acquired enterprise name to be matched.
Further, establishing the index tree whose index levels are respectively the administrative region words, the multi-tuples and the database enterprise names comprises:
taking the administrative region words as first-level index keys of the index tree;
taking each element of the multi-tuple as a secondary index key of the administrative region word corresponding to the element, wherein the secondary index key is an index value of the primary index key corresponding to the element;
and taking the database enterprise name simultaneously containing the secondary index key and the primary index key corresponding to the secondary index key as the index value of the secondary index key.
Further, after determining the administrative region word of the database business name, the method further includes:
converting the administrative region words into corresponding standard administrative region words;
the taking the administrative region words as the first-level index keys of the index tree comprises:
and taking the standard administrative region words as a first-level index key of the index tree.
Further, the determining one or more matching enterprise names corresponding to the enterprise name to be matched according to the index tree and the acquired enterprise name to be matched includes:
determining the administrative region word of the enterprise name to be matched and the corresponding multi-tuple;
determining a matching dictionary of the enterprise name to be matched according to the index tree and the administrative region word and multi-tuple of the enterprise name to be matched, wherein the matching dictionary is the set of all database enterprise names that contain one or more multi-tuple elements of the enterprise name to be matched;
determining a first matching score between the multi-tuple of each database enterprise name in the matching dictionary and the multi-tuple of the enterprise name to be matched;
and determining, according to the first matching score, one or more database enterprise names in the matching dictionary as the matching enterprise names of the enterprise name to be matched.
Further, the determining administrative region words and word size industry words of the database enterprise name further includes:
determining organization form words of all database enterprise names;
the determining the administrative region words and the word size industry words of the enterprise names to be matched further comprises: determining organization form words of the enterprise names to be matched;
wherein an n-order matrix containing the pairwise similarities between n organization form words, a first threshold and a second threshold are preset, and the determining one or more index values in the matching dictionary as the matching enterprise names according to the first matching score includes:
screening one or more names in the matching dictionary according to the first threshold and the first matching score to serve as pre-selected matching enterprise names;
determining the similarity of the enterprise name to be matched and the organization form words of each preselected matching enterprise name according to the n-order matrix;
determining a second matching score of the enterprise name to be matched and the word size industry word of each pre-selected matching enterprise name, wherein the second matching score is determined by a character string similarity algorithm;
determining the total matching score of the enterprise name to be matched and each pre-selected matching enterprise name according to the first matching score, the organization form similarity and the second matching score;
determining one or more of all pre-selected matching business names as the matching business name according to the second threshold and the overall matching score.
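The two-stage screening described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the organization form words, the matrix values, the weights and the weighted-sum combination are all hypothetical (the text does not say how the three quantities are combined), and `difflib.SequenceMatcher` is only a stand-in for the unspecified character string similarity algorithm.

```python
from difflib import SequenceMatcher

# Hypothetical organization form words and their n-order similarity matrix (n = 3).
ORG_FORMS = ["有限公司", "有限责任公司", "股份有限公司"]
ORG_SIM = [
    [1.0, 0.9, 0.6],
    [0.9, 1.0, 0.6],
    [0.6, 0.6, 1.0],
]


def org_similarity(a, b):
    """Look up the similarity of two organization form words in the matrix."""
    return ORG_SIM[ORG_FORMS.index(a)][ORG_FORMS.index(b)]


def second_score(core_a, core_b):
    """Stand-in string similarity; the source does not name an algorithm."""
    return SequenceMatcher(None, core_a, core_b).ratio()


def select_matches(query_core, query_org, candidates,
                   first_threshold, second_threshold,
                   weights=(0.4, 0.2, 0.4)):
    """candidates: list of (name, core, org_form, first_score) tuples.

    The weighted sum below is an assumption made for illustration only.
    """
    w1, w2, w3 = weights
    matches = []
    for name, core, org_form, first in candidates:
        # Stage 1: keep only names whose first matching score passes the first threshold.
        if first <= first_threshold:
            continue
        # Stage 2: combine the three quantities into an overall matching score.
        total = (w1 * first
                 + w2 * org_similarity(query_org, org_form)
                 + w3 * second_score(query_core, core))
        if total > second_threshold:
            matches.append((name, total))
    # Best overall score first.
    return sorted(matches, key=lambda m: m[1], reverse=True)
```

Running the costly string similarity only on the pre-selected names is the point of the design: the first threshold discards most candidates before any pairwise string comparison takes place.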
Further, the determining the matching dictionary of the enterprise name to be matched according to the index tree, the administrative region words and the multi-element group of the enterprise name to be matched comprises:
determining the primary index key corresponding to the administrative region word of the enterprise name to be matched in the index tree;
determining a plurality of secondary index keys corresponding to the multi-tuple of the enterprise name to be matched in the index values of the determined primary index keys;
and using the index value set of all the determined secondary index keys as the matching dictionary.
Further, a hit number with an initial value of 0 is preset for each database enterprise name, wherein determining the plurality of secondary index keys corresponding to the multi-tuple of the enterprise name to be matched among the index values of the determined primary index key comprises:
matching the multi-tuple elements of the enterprise name to be matched one by one against each index value of the determined primary index key;
if a multi-tuple element is identical to an index value of the determined primary index key, taking that index value as a secondary index key corresponding to the multi-tuple of the enterprise name to be matched, and adding one to the hit number of the database enterprise name corresponding to the index value of that secondary index key;
wherein the determining a first matching score between the multi-tuple of each database enterprise name in the matching dictionary and the multi-tuple of the enterprise name to be matched comprises:
determining the number m1 of multi-tuple elements of the enterprise name to be matched and the number m2 of multi-tuple elements corresponding to each database enterprise name in the matching dictionary, and acquiring the hit number m3 of each database enterprise name in the matching dictionary;
the first matching score being computed from m1, m2 and m3 by the formula given in Figure BDA0003925849340000031 (the formula image is not reproduced in this text).
Further, after establishing a tuple corresponding to an administrative domain word of the database enterprise name according to the word size industry word of the database enterprise name, the method further comprises:
establishing a comparison table, wherein the comparison table comprises the administrative region word, the word size industry word, the organization form word and the number of multi-tuple elements corresponding to each database enterprise name;
wherein the determining the number m2 of multi-tuple elements corresponding to each index value in the matching dictionary comprises:
determining, in the comparison table, the database enterprise name that is identical to the index value, and taking the number of multi-tuple elements of that database enterprise name as the number m2 of multi-tuple elements corresponding to the index value;
before determining the similarity of the enterprise name to be matched and the organization form words of each preselected matching enterprise name according to the n-order matrix, the method further comprises the following steps:
obtaining organization form words of each pre-selected matched enterprise name according to the comparison table;
wherein, before determining the second matching score between the enterprise name to be matched and the word size industry word of each preselected matching enterprise name, the method further comprises the following steps:
and acquiring the word size industry words of each pre-selected matched enterprise name according to the comparison table.
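A comparison table of this kind can be sketched as a dictionary keyed by database enterprise name. The class name `TableRow`, the function `build_table` and the sample name are hypothetical. Counting only distinct multi-tuple elements is an assumption; the text does not say whether repeated elements are counted once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRow:
    region: str      # administrative region word
    core: str        # word size industry word
    org_form: str    # organization form word
    gram_count: int  # number of multi-tuple elements (m2)


def build_table(parsed_names, length=2):
    """parsed_names: iterable of (full_name, region, core, org_form) tuples."""
    table = {}
    for full_name, region, core, org_form in parsed_names:
        # Distinct equal-length runs of adjacent characters (assumption: no repeats).
        grams = {core[i:i + length] for i in range(len(core) - length + 1)}
        table[full_name] = TableRow(region, core, org_form, len(grams))
    return table
```

With such a table, m2, the organization form word and the word size industry word of any pre-selected name are each a single dictionary lookup, so none of them has to be extracted again at query time.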
Further, the elements of the multi-tuple are all combinations of two adjacent characters in the word size industry word of the database enterprise name.
According to another aspect of the present application, there is also provided a computer readable medium having computer readable instructions stored thereon, the computer readable instructions being executable by a processor to implement the operations of the method as described above.
According to another aspect of the present application, there is also provided an index tree-based business name matching apparatus, where the apparatus includes:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the above-described method.
Compared with the prior art, the present application determines the administrative region word and the word size industry word of each database enterprise name; establishes, according to the word size industry word of the database enterprise name, a multi-tuple corresponding to the administrative region word of that name, wherein the elements of the multi-tuple are all equal-length combinations of adjacent characters in the word size industry word; establishes an index tree whose index levels are respectively the administrative region words, the multi-tuples and the database enterprise names; and determines, according to the index tree and the acquired enterprise name to be matched, one or more matching enterprise names corresponding to the enterprise name to be matched. In this way, the index tree is built from the compositional characteristics of enterprise names and enterprise names are matched through the index tree, which improves both matching efficiency and matching accuracy.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 illustrates a flow diagram of an index tree-based enterprise name matching method in accordance with an aspect of the present application;
FIG. 2 illustrates a flow diagram of an index tree-based enterprise name matching method according to a preferred embodiment of the present application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present invention is described in further detail below with reference to the attached drawing figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
To further illustrate the technical means and effects adopted by the present application, the following description clearly and completely describes the technical solution of the present application with reference to the accompanying drawings and preferred embodiments.
Fig. 1 illustrates an index tree-based enterprise name matching method provided in an aspect of the present application, where the method includes:
S11, determining administrative region words and word size industry words of the database enterprise names;
S12, establishing a multi-tuple corresponding to the administrative area word of the database enterprise name according to the word size industry word of the database enterprise name, wherein elements of the multi-tuple are all equal-length combinations of adjacent characters in the word size industry word;
S13, establishing an index tree whose index levels are respectively the administrative region words, the multi-tuples and the database enterprise names;
S14, determining one or more matching enterprise names corresponding to the enterprise name to be matched according to the index tree and the acquired enterprise name to be matched.
In this embodiment, in step S11, administrative area words and word size industry words of the database business names are determined.
According to actual enterprise naming practice, an enterprise name is composed of one or more of an administrative area word, a word size industry word and an organization form word. The word size is the enterprise's trade name (for example millet, huaji or association), and the organization form word indicates the form of the organization (for example group stock, group holding and group liability companies). In the application scenario of the present application, all enterprise names within the service range, that is, the database enterprise names, are stored in a database on the server. The method may be used as one link of a larger service or offered as a complete external service; the present application does not limit the specific application scenario. Since the purpose of the present application is enterprise name matching, and the database holds every enterprise name that a query can be matched against, features must be extracted from all database enterprise names before matching; specifically, the administrative area word and the word size industry word of each database enterprise name are extracted in this step. The specific feature extraction manner is not limited: for example, a model may be trained to extract the corresponding features and extraction then performed with that model; or a word bank may be established in advance for each feature, storing all values of that feature, and extraction performed against the word bank.
Furthermore, because an enterprise name may contain characters that are irrelevant to matching, such as punctuation marks and letter tone marks, a cleaning operation is performed on both the enterprise name to be matched and the database enterprise names before feature extraction. Characters irrelevant to matching are deleted, which improves matching accuracy.
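The cleaning operation can be sketched with Python's standard `unicodedata` module. The function name `clean_name` is hypothetical. Two readings are assumptions: that "letter tones" means diacritic or tone marks, and that whitespace and symbols also count as irrelevant to matching.

```python
import unicodedata


def clean_name(name):
    """Drop characters that carry no matching information."""
    # Decompose so that accents and tone marks become separate combining characters,
    # and full-width punctuation is folded to its ordinary form.
    decomposed = unicodedata.normalize("NFKD", name)
    kept = []
    for ch in decomposed:
        major = unicodedata.category(ch)[0]
        # M = combining marks, P = punctuation, S = symbols, Z = separators (spaces).
        if major in "MPSZ":
            continue
        kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept))
```

The same function would be applied to the enterprise name to be matched and to every database enterprise name, so that both sides of the comparison are normalized identically.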
Continuing in this embodiment, in step S12, a multi-tuple corresponding to the administrative area word of the database enterprise name is established according to the word size industry word of the database enterprise name, where the elements of the multi-tuple are all equal-length combinations of adjacent characters in the word size industry word.
Here, the word size and industry fields of an enterprise name are extremely unstable. Word size fields follow no naming rules and their content is highly arbitrary, so they are difficult to recognize directly by program; industries are continually being replaced, so previously unseen industries appear frequently. To prevent misrecognition, an elimination method is often used when extracting the features of an enterprise name: the administrative area word and the organization form word are first identified accurately, for example with the model or word bank described above, and the remaining fields of the enterprise name are then merged and treated as the word size industry word. Consequently, across all database enterprise names the repetition rate of word size industry words is extremely low, and the word size industry words of different enterprises generally do not repeat. If, during matching, the word size industry word of the enterprise name to be matched were compared with the word size industry word of every database enterprise name, the workload would be very large and much of it redundant.
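The elimination method can be sketched with two small word banks. Everything named here is hypothetical: the word bank contents, `NameParts` and `split_name`. The sketch also assumes, as a simplification, that the administrative area word is a prefix of the name and the organization form word is a suffix, which real enterprise names do not always satisfy.

```python
from typing import NamedTuple

# Hypothetical word banks; a real system would load far larger lists.
REGION_WORDS = ["杭州市", "杭州", "北京市", "北京"]
ORG_FORM_WORDS = ["股份有限公司", "有限责任公司", "有限公司"]


class NameParts(NamedTuple):
    region: str    # administrative area word
    core: str      # word size industry word (whatever is left over)
    org_form: str  # organization form word


def split_name(name):
    """Identify the region prefix and organization form suffix; the rest is the core."""
    # Try longer entries first so that "杭州市" wins over "杭州".
    by_length = lambda words: sorted(words, key=len, reverse=True)
    region = next((w for w in by_length(REGION_WORDS) if name.startswith(w)), "")
    rest = name[len(region):]
    org_form = next((w for w in by_length(ORG_FORM_WORDS) if rest.endswith(w)), "")
    core = rest[:len(rest) - len(org_form)] if org_form else rest
    return NameParts(region, core, org_form)
```

For example, `split_name("杭州市蓝天科技有限公司")` yields the region `杭州市`, the core `蓝天科技` and the organization form `有限公司`; the core is never recognized directly, only obtained as the remainder.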
To avoid the extra time and resource cost of redundant comparisons, the application exploits a characteristic of enterprise names: enterprise names with different administrative area words do not match each other. Word size industry words are therefore compared only for database enterprise names whose administrative area word matches that of the enterprise name to be matched; when the administrative area words do not match, no further operation is needed. This narrows the matching range and greatly reduces redundant matching.
To realize this layered matching, a correspondence between the administrative area words and the word size industry words of the database enterprise names must be established, so that after the administrative area word is matched, matching continues on the word size industry words of the database enterprise names that passed. If the degree of match between the word size industry words of the enterprise name to be matched and a database enterprise name were computed with a character string similarity algorithm, execution efficiency would be low and problem complexity high. The method instead extracts every segment of the word size industry word and collects the segments into a multi-tuple. Specifically, a fixed length is set according to the actual application scenario; the word size industry word is traversed character by character; for each traversed character, the character string that starts at that character and whose length equals the preset fixed length (that is, an equal-length combination of adjacent characters) is extracted and added to the multi-tuple. The multi-tuple of the enterprise name to be matched is then matched against the multi-tuple of a database enterprise name, and the similarity of their word size industry words is determined from the proportion of elements the two multi-tuples share. The similarity calculation between two long character strings is thereby converted into a matching problem between two multi-tuples, the complexity is reduced to O(n), and the efficiency of determining word size industry word similarity is greatly improved.
Meanwhile, all the segments of the word size industry words are extracted according to the preset fixed length, and all the segments of the fixed length in the word size industry words are fully covered, so that all the character combinations of the fixed length possibly existing in the word size industry words are matched in the matching process of the word size industry words, and the reliability of the matching result is ensured.
Preferably, the elements of the multi-tuple are all combinations of two adjacent characters in the word size industry word of the database enterprise name. The purpose of extracting segments of the word size industry word is to cover as many character combinations as possible and thereby improve the reliability of the word size industry word matching result; the fixed length is therefore set as small as possible so that as many character combinations as possible are extracted.
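Building the multi-tuple amounts to sliding a fixed-length window over the word size industry word. A minimal sketch, with the hypothetical function name `build_multi_tuple` and a default length of 2 taken from the preferred embodiment; positions too close to the end to yield a full-length segment are skipped, which is an assumption about how the traversal ends.

```python
def build_multi_tuple(core, length=2):
    """Return every run of `length` adjacent characters in `core`, left to right."""
    if length < 1:
        raise ValueError("length must be at least 1")
    # range() is empty when the word is shorter than the window.
    return tuple(core[i:i + length] for i in range(len(core) - length + 1))
```

A four-character word such as `蓝天科技` yields three two-character elements (`蓝天`, `天科`, `科技`), while the same word with length 3 yields only two, which illustrates why a smaller fixed length covers more character combinations.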
In this embodiment, in step S13, an index tree whose index levels are respectively the administrative area words, the multi-tuples and the database enterprise names is established.
By establishing an index tree whose index levels are, in order, administrative area words, multi-tuples and database enterprise names, matching first restricts the part of the index tree to be searched according to the administrative area word, then matches word size industry words against the lower-level index of the successfully matched administrative area word, and finally determines the matching enterprise names from the lower-level index of the successfully matched word size industry word segments.
In a preferred embodiment, referring to Fig. 2, steps S21, S22 and S26 in Fig. 2 are the same as or substantially the same as steps S11, S12 and S14 in the embodiment of Fig. 1; they are therefore not described again here and are incorporated by reference. Steps S23, S24 and S25 refine the establishing of the index tree whose index levels are respectively the administrative region words, the multi-tuples and the database enterprise names, and comprise: taking the administrative region words as first-level index keys of the index tree; taking each element of the multi-tuple as a secondary index key of the administrative region word corresponding to the element, wherein the secondary index key is an index value of the primary index key corresponding to the element; and taking the database enterprise name simultaneously containing the secondary index key and the primary index key corresponding to the secondary index key as the index value of the secondary index key.
Specifically, the de-duplicated administrative area words of all database enterprise names are taken as the first-level index keys of the index tree. For each administrative area word, the multi-tuples of the word size industry words of all database enterprise names containing that administrative area word are determined, and all elements of these multi-tuples are taken as second-level index keys under that administrative area word. For each multi-tuple element, the database enterprise names that contain both the element and the first-level index key corresponding to the element are taken as the index values of that second-level index key. A three-level index tree is thereby established.
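One way to picture the three-level index tree is as a nested dictionary: administrative area word, then multi-tuple element, then the set of database enterprise names. The sketch below is illustrative only; `build_index` and the sample records are hypothetical, and the fixed length of 2 follows the preferred embodiment.

```python
def build_index(records, length=2):
    """records: iterable of (region, core, full_name) tuples.

    Returns {region: {element: {full_name, ...}}}.
    """
    index = {}
    for region, core, full_name in records:
        # First level: one key per distinct administrative area word.
        second_level = index.setdefault(region, {})
        for i in range(len(core) - length + 1):
            element = core[i:i + length]
            # Second level: the element; third level: names containing it.
            second_level.setdefault(element, set()).add(full_name)
    return index
```

Because dictionary keys are unique, de-duplication of the administrative area words and of repeated elements happens implicitly as records are inserted.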
Furthermore, because an administrative area word can be written in several ways, the administrative area words are standardized to avoid interference from differing spellings. Specifically, after the administrative area word of a database enterprise name is determined, it is converted into the corresponding standard administrative area word; when the index tree is established, the de-duplicated standard administrative area words of all database enterprise names are taken as the first-level index keys, which ensures that every administrative area is written uniformly.
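Standardization can be as simple as an alias table applied before the index is built and again at query time. The alias entries and the function name `standardize_region` below are hypothetical; the text does not specify how the conversion is carried out.

```python
# Hypothetical alias table mapping written variants to one standard form.
REGION_ALIASES = {
    "杭州": "杭州市",
    "浙江杭州": "杭州市",
    "浙江省杭州市": "杭州市",
}


def standardize_region(word):
    """Return the standard administrative area word; unknown words pass through."""
    return REGION_ALIASES.get(word, word)
```

Applying the same function to the enterprise name to be matched ensures that the query and the first-level index keys use the same spelling.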
In this embodiment, in the step S14, one or more matching enterprise names corresponding to the to-be-matched enterprise name are determined according to the index tree and the acquired to-be-matched enterprise name.
Determining one or more matching enterprise names corresponding to the enterprise name to be matched according to the index tree and the acquired enterprise name to be matched includes: determining the administrative region word of the enterprise name to be matched and the corresponding multi-tuple; determining a matching dictionary of the enterprise name to be matched according to the index tree and the administrative region word and multi-tuple of the enterprise name to be matched, wherein the matching dictionary is the set of all index values (database enterprise names) that contain one or more multi-tuple elements of the enterprise name to be matched; determining a first matching score between the multi-tuple of each database enterprise name in the matching dictionary and the multi-tuple of the enterprise name to be matched; and determining, according to the first matching score, one or more database enterprise names in the matching dictionary as the matching enterprise names.
Specifically, the enterprise name to be matched is first processed in the same way as the database enterprise names: its administrative region word and word size industry word are determined, and a multi-tuple is built from the word size industry word according to the preset fixed length. The administrative region word of the enterprise name to be matched is then matched against the first-level index keys of the index tree. Under the first-level index key identical to that administrative region word, every second-level index key is compared one by one with the multi-tuple elements of the enterprise name to be matched, and for each second-level index key identical to one of those elements, its index values (that is, the pre-matched database enterprise names) are added to the matching dictionary. Finally, for each database enterprise name in the matching dictionary, a first matching score between it and the enterprise name to be matched is calculated, and the database enterprise names in the matching dictionary are screened according to the first matching score; for example, a screening threshold is set, and each database enterprise name whose first matching score is greater than the screening threshold is taken as a matching enterprise name of the enterprise name to be matched.
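The lookup that produces the matching dictionary can be sketched against a nested-dictionary index of the form {region: {element: set of names}}. That representation, the function name `matching_dictionary` and the sample data are assumptions made for illustration.

```python
def matching_dictionary(index, region, query_elements):
    """Collect every database name sharing at least one element with the query.

    index: {region: {element: {full_name, ...}}}
    """
    second_level = index.get(region)
    if second_level is None:
        # No database name has this administrative region word: nothing can match.
        return set()
    candidates = set()
    for element in query_elements:
        # Each lookup is a dictionary access, not a scan over all names.
        candidates |= second_level.get(element, set())
    return candidates
```

The early return reflects the layered design: when the administrative region word is absent from the first level, no word size industry word comparison is performed at all.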
Preferably, the first matching score is calculated with the aid of a preset hit number whose initial value is 0; the hit number measures how many elements the multi-tuple of a database enterprise name shares with the multi-tuple of the enterprise name to be matched.
Specifically, the multi-tuple elements of the enterprise name to be matched are matched one by one against each index value of the determined first-level index key; if a multi-tuple element is identical to an index value of the determined first-level index key, that index value is taken as a second-level index key corresponding to the multi-tuple of the enterprise name to be matched, and the hit number of the database enterprise name corresponding to the index value of that second-level index key is increased by one; the number m1 of multi-tuple elements of the enterprise name to be matched and the number m2 of multi-tuple elements corresponding to each database enterprise name in the matching dictionary are determined, and the hit number m3 of each database enterprise name in the matching dictionary is acquired; the first matching score is then computed from m1, m2 and m3 by the formula given in Figure BDA0003925849340000101 (the formula image is not reproduced in this text).
After the first-level index key corresponding to the administrative region word of the enterprise name to be matched is determined in the index tree, each element of the multi-tuple of the enterprise name to be matched (namely, each segment of its word size industry word) is matched one by one against all second-level index keys under that first-level index key. According to the way the index tree is built, the second-level index keys under a first-level index key are the word size industry word segments of all database enterprise names whose administrative region word is the same as that of the enterprise name to be matched. Matching the multi-tuple elements against each index value of the determined first-level index key therefore amounts to finding, among all database enterprise names with the same administrative region word, those that share a word size industry word segment with the enterprise name to be matched. For each second-level index key (namely, a word size industry word segment of a database enterprise name) that is the same as an element of the multi-tuple of the enterprise name to be matched, the hit number of each of its index values (namely, the database enterprise names) is incremented by one, indicating that the word size industry words of the two names share one more segment. The larger the hit number, the more segments the two word size industry words have in common; with the numbers of multi-tuple elements (namely, the numbers of segments) of the two names held fixed, a larger hit number means a higher similarity between the enterprise name to be matched and the database enterprise name.
Specifically, a calculation method for the similarity of the word size industry words of the enterprise name to be matched and the database enterprise name is designed: the number of multi-tuple elements of the enterprise name to be matched is m1, the number of multi-tuple elements corresponding to each database enterprise name in the matching dictionary is m2, and the hit number of each database enterprise name in the matching dictionary is m3. The first matching score is

$$Y = \frac{m_3}{(m_1 + m_2)/2} = \frac{2\,m_3}{m_1 + m_2}$$
The first matching score represents the similarity degree of the word size industry words of the business name to be matched and the database business name.
The similarity of the word size industry words of the enterprise name to be matched and the database enterprise name is measured as the ratio of the number of segments the two word size industry words share to the average number of segments of the two word size industry words. The first matching score Y is therefore greater than or equal to 0 and less than or equal to 1.
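As a hedged illustration of this ratio, the sketch below counts hits and computes the score as the hit number divided by the average number of segments, that is 2·m3 / (m1 + m2). The function names and sample strings are hypothetical. The sketch also assumes that segments are de-duplicated within each name, which keeps the hit number from exceeding either segment count and so keeps the score within [0, 1]; the source does not say how repeated segments are handled.

```python
from collections import Counter

def ngram_set(text, n=2):
    """Distinct length-n runs of adjacent characters (assumption: de-duplicated)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def first_matching_scores(query_words, db_entries, n=2):
    """db_entries maps database name -> its trade words (same region assumed).

    Returns {name: Y} for every name with at least one hit, where
    Y = m3 / ((m1 + m2) / 2) = 2 * m3 / (m1 + m2).
    """
    query_grams = ngram_set(query_words, n)
    db_grams = {name: ngram_set(words, n) for name, words in db_entries.items()}
    m1 = len(query_grams)

    hits = Counter()  # hit number per database name, starting from 0
    for gram in query_grams:
        for name, grams in db_grams.items():
            if gram in grams:
                hits[name] += 1

    return {name: 2 * m3 / (m1 + len(db_grams[name])) for name, m3 in hits.items()}

# Hypothetical example: "bluesea" vs "bluesky" share bl, lu, ue, es (m3 = 4),
# with m1 = m2 = 6, so Y = 8 / 12.
scores = first_matching_scores("bluesea", {"A": "bluesky", "B": "skyline"})
```

In a full implementation the inner loop would be replaced by the second-level index lookup, so that only names sharing a segment are touched at all.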
Furthermore, because an enterprise name also contains an organization form word, after the administrative region word and the word size industry word have been matched through the index tree, the matching result can be refined using the organization form word, so that the determined matching enterprise names are more accurate. In this scheme the organization form word of each database enterprise name needs to be determined. In one embodiment, the administrative region word, the word size industry word and the organization form word of a database enterprise name are determined in the same step: the administrative region word and the organization form word are determined first, and the remaining content of the name is taken as the word size industry word.
Furthermore, to refine the matching result using the organization form words, the organization form word of the enterprise name to be matched must be matched against that of the database enterprise name. Whether two organization form words match cannot, however, be decided directly from character string similarity. For example, in a scenario of querying cooperative enterprises, the user enters "Wuzhen Tourism Company" as the enterprise name to be matched, and the database contains the enterprise name "Wuzhen Tourism Stock Limited Company". The two names should match, but if the organization form fields are compared by character string similarity, the similarity between "stock limited company" and "company" is necessarily low. Treating the two organization form words as unmatched would contradict the actual situation and would likely have a large effect on the overall matching result.
Based on the above situation, an n-order matrix is set whose elements are manually set similarities between organization forms, and the meaning of its rows and columns is agreed in advance. For example, in U(i, j) the organization form word of the enterprise name to be matched corresponds to the i-th row and the organization form word of the database enterprise name corresponds to the j-th column, so that the element value U(i, j) is the similarity between the organization form words of the two names. The similarity values in the n-order matrix are empirical values from practical application; for example, the similarity of the organization form words "company" and "company" may be set to 1, that of "head company" and "branch company" to 0.9, that of "limited liability company" and "limited liability company" to 0.95, and that of "stock limited company" and "limited liability company" to 0. In this way, a fuzzy matching method is set based on the actual application situation and the similarity between organization form words is determined flexibly, which accommodates the fact that their similarity cannot be determined simply from character string similarity and improves the accuracy of the matching result. Each element value of the matrix U is greater than or equal to 0 and less than or equal to 1.
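One way such a matrix could be represented is sketched below. The helper names are hypothetical, and so are two design choices the source does not state: that the matrix is symmetric, and that unlisted pairs default to 0 while identical forms default to 1. The 0.9 and 0 values follow the examples in the text; the value for "company" versus "stock limited company" is an invented placeholder, since the text only says the two should match.

```python
def build_similarity_matrix(forms, pair_scores):
    """Build an n-order matrix U of manually set organization form similarities.

    forms: list of the n organization form words (row/column order).
    pair_scores: {(form_a, form_b): similarity}; filled in symmetrically
    (symmetry is an assumption of this sketch).
    """
    index = {form: i for i, form in enumerate(forms)}
    n = len(forms)
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for (a, b), score in pair_scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError("similarity must lie in [0, 1]")
        matrix[index[a]][index[b]] = score
        matrix[index[b]][index[a]] = score
    return index, matrix

def org_form_similarity(index, matrix, query_form, db_form):
    """U(i, j): row i is the query's form, column j is the database name's form."""
    return matrix[index[query_form]][index[db_form]]

forms = ["company", "stock limited company", "limited liability company",
         "head company", "branch company"]
index, U = build_similarity_matrix(forms, {
    ("head company", "branch company"): 0.9,                    # from the text
    ("stock limited company", "limited liability company"): 0.0,  # from the text
    ("company", "stock limited company"): 0.9,                  # placeholder value
})
```

Because the values are set by hand, a lookup like this can rate "company" and "stock limited company" as close even though their character strings differ widely.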
Based on the above, a first threshold and a second threshold are also preset. In one embodiment, one or more database enterprise names in the matching dictionary are screened as pre-selected matching enterprise names according to the first threshold and the first matching score; the similarity between the organization form words of the enterprise name to be matched and of each pre-selected matching enterprise name is determined according to the n-order matrix; the total matching score of the enterprise name to be matched and each pre-selected matching enterprise name is determined according to the first matching score and the organization form similarity; and one or more of the pre-selected matching enterprise names are determined as the matching enterprise names according to the second threshold and the total matching score. The first threshold is used in the pre-matching step: after pre-matching on the administrative region word and word size industry word dimensions through the index tree, it screens out of the matching result (namely the matching dictionary) the pre-selected matching enterprise names whose matching quality meets the requirement. The second threshold is used in the comprehensive matching step: among the pre-selected matching enterprise names, it further screens out the one or more matching enterprise names that match best once the organization form word dimension is included. This double matching mode, combining pre-matching and comprehensive matching, narrows the range of results through pre-matching and then determines the final result within that range through comprehensive matching, which improves the matching accuracy.
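The two-threshold flow can be sketched as below. The source does not specify how the first matching score and the organization form similarity are combined into the total matching score, so the weighted average used here, the weight of 0.7, the strict "greater than" comparisons and all names are assumptions for illustration only.

```python
def two_stage_match(candidates, first_threshold, second_threshold, weight=0.7):
    """candidates: {name: (first_score, org_form_similarity)}.

    Stage 1 (pre-matching): keep names whose first matching score exceeds
    the first threshold.
    Stage 2 (comprehensive matching): combine the two scores and keep names
    whose total exceeds the second threshold.
    The weighted average is an assumed combination rule, not the source's.
    """
    preselected = {
        name: pair for name, pair in candidates.items()
        if pair[0] > first_threshold
    }
    totals = {
        name: weight * first + (1.0 - weight) * org_sim
        for name, (first, org_sim) in preselected.items()
    }
    matched = [(name, total) for name, total in totals.items()
               if total > second_threshold]
    return sorted(matched, key=lambda item: item[1], reverse=True)

# Hypothetical candidates: C fails pre-matching, B fails comprehensive matching.
result = two_stage_match(
    {"A": (0.8, 1.0), "B": (0.6, 0.0), "C": (0.3, 1.0)},
    first_threshold=0.5,
    second_threshold=0.6,
)
```

Applying the cheaper first-stage filter before the organization form lookup means the second stage only runs on a short list of candidates.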
In another embodiment, one or more database enterprise names in the matching dictionary are screened as pre-selected matching enterprise names according to the first threshold and the first matching score; the similarity between the organization form words of the enterprise name to be matched and of each pre-selected matching enterprise name is determined according to the n-order matrix; a second matching score between the word size industry words of the enterprise name to be matched and of each pre-selected matching enterprise name is determined by a character string similarity algorithm; the total matching score of the enterprise name to be matched and each pre-selected matching enterprise name is determined according to the first matching score, the organization form similarity and the second matching score; and one or more of the pre-selected matching enterprise names are determined as the matching enterprise names according to the second threshold and the total matching score.
In the comprehensive matching step, the organization form word is taken into account and the word size industry word is also evaluated from a second angle: its similarity is scored again (the second matching score) from the character string similarity between the word size industry words of the enterprise name to be matched and the database enterprise name. The manually set similarity of the organization form words, the similarity determined by the number of repeated word size industry word segments, and the character string similarity determined by the algorithm are then combined, and the combined result is the total score of the comprehensive matching. Evaluating the similarity of the word size industry words from two angles improves the accuracy of the evaluation result.
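A minimal sketch of this second embodiment follows. The source names neither the character string similarity algorithm nor the combination rule, so both are stand-ins here: Python's `difflib.SequenceMatcher.ratio()` plays the role of the string similarity algorithm, and the total is a weighted sum with arbitrarily chosen weights. The function names are hypothetical.

```python
from difflib import SequenceMatcher

def second_matching_score(query_words, db_words):
    """String similarity of two trade-word strings, in [0, 1].

    SequenceMatcher.ratio() is a stand-in; an edit-distance or Jaro-Winkler
    measure could be substituted without changing the surrounding flow.
    """
    return SequenceMatcher(None, query_words, db_words).ratio()

def total_matching_score(first_score, org_similarity, second_score,
                         weights=(0.4, 0.2, 0.4)):
    """Weighted sum of the three partial scores (weights are assumptions).

    With non-negative weights summing to 1 and inputs in [0, 1],
    the total also stays in [0, 1].
    """
    w_first, w_org, w_second = weights
    return (w_first * first_score
            + w_org * org_similarity
            + w_second * second_score)

# Hypothetical example: longest common block is "blues" (5 characters),
# so ratio = 2 * 5 / (7 + 7).
s2 = second_matching_score("bluesea", "bluesky")
total = total_matching_score(first_score=2 / 3, org_similarity=0.9, second_score=s2)
```

The segment-overlap score ignores segment order while a sequence-based measure does not, which is one reason the two views can disagree and are worth combining.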
Furthermore, calculating the first matching score and the second matching score requires the administrative region word, the word size industry word, the organization form word and the number of multi-tuple elements corresponding to each database enterprise name. So that this information can be obtained directly instead of being determined again during scoring, a comparison table can be established: the administrative region word, word size industry word and organization form word of each database enterprise name are recorded while the index tree is built, and the number of multi-tuple elements is recorded when the multi-tuple of the database enterprise name is determined. Specifically, a comparison table is established that comprises the administrative region word, the word size industry word, the organization form word and the number of multi-tuple elements corresponding to each database enterprise name. Determining the number m2 of multi-tuple elements corresponding to each index value in the matching dictionary then comprises: determining the database enterprise name in the comparison table that is the same as the index value, and taking the number of multi-tuple elements of that database enterprise name as m2. Before the similarity between the organization form words of the enterprise name to be matched and of each preselected matching enterprise name is determined according to the n-order matrix, the organization form word of each preselected matching enterprise name is obtained from the comparison table. Likewise, before the second matching score between the word size industry words of the enterprise name to be matched and of each preselected matching enterprise name is determined, the word size industry word of each preselected matching enterprise name is obtained from the comparison table.
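The comparison table amounts to a per-name record cached at index-building time. A possible shape is sketched below; the record type, field names and sample data are hypothetical, the names are assumed to be already split into their three parts, and segments are counted as distinct two-character runs, which is an assumption of this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NameRecord:
    region_word: str    # administrative region word
    trade_words: str    # word size industry word
    org_form: str       # organization form word
    segment_count: int  # number of multi-tuple elements (m2)

def build_comparison_table(parsed_names, n=2):
    """parsed_names: {full_name: (region_word, trade_words, org_form)}.

    Returns {full_name: NameRecord}, so m2, the organization form word and
    the trade words can be looked up by index value without re-parsing.
    """
    table = {}
    for full_name, (region, trade_words, org_form) in parsed_names.items():
        segments = {trade_words[i:i + n]
                    for i in range(len(trade_words) - n + 1)}
        table[full_name] = NameRecord(region, trade_words, org_form,
                                      len(segments))
    return table

# Hypothetical entry (Latin letters stand in for Chinese characters).
table = build_comparison_table({
    "hangzhou bluesky company": ("hangzhou", "bluesky", "company"),
})
m2 = table["hangzhou bluesky company"].segment_count
```

Keying the table by the full database enterprise name matches the index values stored at the leaves of the index tree, so each candidate needs only one dictionary lookup during scoring.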
Compared with the prior art, the method and the device determine administrative region words and word size industry words of the names of the database enterprises; establishing a multi-tuple corresponding to the administrative region word of the database enterprise name according to the word size industry word of the database enterprise name, wherein elements of the multi-tuple are all equal-length combinations of adjacent characters in the word size industry word; establishing index trees with index levels respectively being the administrative region words, the multi-element groups and the database enterprise names; and determining one or more matching enterprise names corresponding to the enterprise names to be matched according to the index tree and the acquired enterprise names to be matched. By the method, the index tree is established according to the composition characteristics of the enterprise names, and the enterprise names are matched in the index tree mode, so that the matching efficiency and the matching accuracy are greatly improved.
Furthermore, the embodiment of the present application also provides a computer readable medium, on which computer readable instructions are stored, and the computer readable instructions can be executed by a processor to implement the foregoing method.
The embodiment of the present application further provides an enterprise name matching device based on an index tree, where the device includes:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the foregoing method.
For example, the computer readable instructions, when executed, cause the one or more processors to: determining administrative region words and word size industry words of the database enterprise name;
establishing a multi-tuple corresponding to the administrative region word of the database enterprise name according to the word size industry word of the database enterprise name, wherein elements of the multi-tuple are all equal-length combinations of adjacent characters in the word size industry word;
establishing index trees with index levels of the administrative region words, the multi-element groups and the database enterprise names respectively;
and determining one or more matching enterprise names corresponding to the enterprise names to be matched according to the index tree and the acquired enterprise names to be matched.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (11)

1. An index tree-based business name matching method, wherein the method comprises the following steps:
determining administrative region words and word size industry words of the database enterprise names;
establishing a multi-tuple corresponding to the administrative region word of the database enterprise name according to the word size industry word of the database enterprise name, wherein elements of the multi-tuple are all equal-length combinations of adjacent characters in the word size industry word;
establishing index trees with index levels of the administrative region words, the multi-element groups and the database enterprise names respectively;
and determining one or more matching enterprise names corresponding to the enterprise names to be matched according to the index tree and the acquired enterprise names to be matched.
2. The method of claim 1, wherein the establishing index trees with index levels of the administrative region words, the multi-tuples and the database enterprise names respectively comprises:
taking the administrative region words as first-level index keys of the index tree;
taking each element of the multi-tuple as a secondary index key of the administrative region word corresponding to the element, wherein the secondary index key is an index value of the primary index key corresponding to the element;
and taking the database enterprise name simultaneously containing the secondary index key and the primary index key corresponding to the secondary index key as the index value of the secondary index key.
3. The method of claim 2, wherein after determining the administrative region words of the database enterprise names, the method further comprises:
converting the administrative region words into corresponding standard administrative region words;
the taking the administrative region words as the first-level index keys of the index tree comprises:
and taking the standard administrative region words as a first-level index key of the index tree.
4. The method according to claim 2 or 3, wherein the determining one or more matching business names corresponding to the business name to be matched according to the index tree and the obtained business name to be matched comprises:
determining administrative region words of the enterprise name to be matched and corresponding multi-element groups thereof;
determining a matching dictionary of the enterprise name to be matched according to the index tree and the administrative region word and multi-tuple of the enterprise name to be matched, wherein the matching dictionary is the set of all database enterprise names containing one or more multi-tuple elements of the enterprise name to be matched;
determining a first matching score between each database enterprise name in the matching dictionary and the multi-tuple of the enterprise name to be matched;
and determining, according to the first matching score, one or more database enterprise names in the matching dictionary as the matching enterprise names of the enterprise name to be matched.
5. The method of claim 4, wherein the determining administrative region words and word size industry words of the database enterprise names further comprises:
determining organization form words of all database enterprise names;
the determining the administrative region words and the word size industry words of the enterprise names to be matched further comprises: determining organization form words of the enterprise names to be matched;
wherein an n-order matrix including a similarity between n organization form words, a first threshold and a second threshold are preset, and the determining one or more index values in the matching dictionary as the matching business names according to the first matching score includes:
screening one or more database enterprise names in the matching dictionary according to the first threshold and the first matching score to serve as pre-selected matching business names;
determining the similarity of the enterprise name to be matched and the organization form words of each preselected matching enterprise name according to the n-order matrix;
determining a second matching score of the enterprise name to be matched and the word size industry word of each pre-selected matching enterprise name, wherein the second matching score is determined by a character string similarity algorithm;
determining the total matching score of the business name to be matched and each pre-selected matching business name according to the first matching score, the organization form similarity and the second matching score;
and determining one or more of all the pre-selected matching business names as the matching business names according to the second threshold and the total matching score.
6. The method according to claim 4 or 5, wherein the determining the matching dictionary of the business name to be matched according to the index tree and the administrative regional words and the multi-element group of the business name to be matched comprises:
determining the primary index key corresponding to the administrative region word of the enterprise name to be matched in the index tree;
determining a plurality of secondary index keys corresponding to the multi-tuple of the enterprise name to be matched in the index values of the determined primary index keys;
and using the index value sets of all the determined secondary index keys as the matching dictionary.
7. The method according to claim 6, wherein the hit number is preset to an initial value of 0, and wherein determining, in the index values of the determined primary index keys, a number of secondary index keys corresponding to the tuples of the business name to be matched comprises:
matching the multi-element group elements of the enterprise name to be matched with each index value of the determined primary index key one by one;
if the multi-tuple element is the same as an index value of the determined first-level index key, taking the index value of the first-level index key as a second-level index key corresponding to the multi-tuple of the enterprise name to be matched, and adding one to the hit number of the database enterprise name corresponding to the index value of the second-level index key;
wherein the determining a first matching score between each database enterprise name in the matching dictionary and the multi-tuple of the enterprise name to be matched comprises:
determining the number m1 of multi-tuple elements of the enterprise name to be matched and the number m2 of multi-tuple elements corresponding to each database enterprise name in the matching dictionary, and acquiring the hit number m3 of each database enterprise name in the matching dictionary;
the first matching score

$$Y = \frac{m_3}{(m_1 + m_2)/2} = \frac{2\,m_3}{m_1 + m_2}$$
8. The method according to any one of claims 5 to 7, wherein the establishing a tuple corresponding to the administrative area word of the database business name according to the word size industry word of the database business name further comprises:
establishing a comparison table, wherein the comparison table comprises administrative region words, word size industry words, organization form words and the number of multi-group elements corresponding to the names of the database enterprises;
wherein the determining the number m2 of the multi-element elements corresponding to each index value in the matching dictionary comprises:
determining a database enterprise name which is the same as the index value in the comparison table, and taking the multi-element number of the database enterprise name as the multi-element number m2 corresponding to the index value;
before determining the similarity of the enterprise name to be matched and the organization form words of each preselected matching enterprise name according to the n-order matrix, the method further comprises the following steps:
obtaining organization form words of each pre-selected matched enterprise name according to the comparison table;
wherein, before determining the second matching score between the enterprise name to be matched and the word size industry word of each preselected matching enterprise name, the method further comprises the following steps:
and acquiring the word size industry words of each preselected matching enterprise name according to the comparison table.
9. The method of any one of claims 1 to 8, wherein the elements of the multi-tuple are all combinations of two adjacent characters in the word size industry word of the database enterprise name.
10. A computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement the method of any one of claims 1 to 9.
11. An index tree based enterprise name matching device, wherein the device comprises:
one or more processors; and
a memory storing computer readable instructions that, when executed, cause the processor to perform the operations of the method of any of claims 1 to 9.
CN202211373485.XA 2022-11-04 2022-11-04 Enterprise name matching method and device based on index tree Pending CN115712757A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211373485.XA CN115712757A (en) 2022-11-04 2022-11-04 Enterprise name matching method and device based on index tree

Publications (1)

Publication Number Publication Date
CN115712757A true CN115712757A (en) 2023-02-24

Family

ID=85232151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211373485.XA Pending CN115712757A (en) 2022-11-04 2022-11-04 Enterprise name matching method and device based on index tree

Country Status (1)

Country Link
CN (1) CN115712757A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116361517A (en) * 2023-05-29 2023-06-30 北京拓普丰联信息科技股份有限公司 Enterprise word size duplicate checking method, device, equipment and medium
CN116361517B (en) * 2023-05-29 2023-08-25 北京拓普丰联信息科技股份有限公司 Enterprise word size duplicate checking method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination