CN117573806A - Name matching method and device without separator

Info

Publication number: CN117573806A
Application number: CN202311557832.9A
Authority: CN
Inventors: 余孟泽; 陈云
Original assignee: China Construction Bank Corp; CCB Finetech Co Ltd
Current assignee: China Construction Bank Corp; CCB Finetech Co Ltd
Priority date: 2023-11-21
Filing date: 2023-11-21
Publication date: 2024-02-20

Abstract

The invention discloses a separator-free name matching method and device. The method comprises: performing word segmentation on every name of every entity in a financial list to obtain a word list for each name; splicing all words in each word list into a matching character string; constructing a name matching automaton from the plurality of matching character strings and their associated entity information; after a message to be analyzed is obtained, forming a character string to be analyzed; inputting the character string to be analyzed into the name matching automaton for matching to obtain an entity list; and removing from the entity list the entity information that meets a miss judgment condition. The invention addresses the missed alerts (false negatives) that a word-search-based screening system produces when word segmentation is wrong.

Description

Name matching method and device without separator

Technical Field

The invention relates to the technical field of big data, and in particular to a separator-free name matching method and device.

Background

This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.

Existing list monitoring is mainly based on word search, so the accuracy of the word segmentation result determines the accuracy of the final screening result. In existing banking transaction systems, some message fields are entered by the customer, who can freely split and combine the words of the original name. For example, because the length of each row in a SWIFT message is limited, a single word may be split across two rows. In such cases the words produced by existing word segmentation differ greatly from the actual word list of the name, and a word-search-based system cannot correctly screen the list name from the input.
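The row-length problem described above can be illustrated with a minimal Python sketch. The 14-character row width, the sample text and the helper names `wrap_fixed_width` and `word_search_tokens` are assumptions made only for illustration and do not come from the SWIFT standard or from the embodiment.

```python
def wrap_fixed_width(text: str, width: int) -> list[str]:
    """Cut text into rows of at most `width` characters, ignoring word boundaries."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def word_search_tokens(rows: list[str]) -> set[str]:
    """Tokenize each row on whitespace, as a word-based screener would."""
    return {token for row in rows for token in row.split()}


# Hypothetical message text containing the hypothetical list name JOSE OSCAR.
rows = wrap_fixed_width("PAY TO JOSE OSCAR", 14)
tokens = word_search_tokens(rows)

# The word OSCAR is cut into OS and CAR, so a word lookup cannot find it,
# although it is still present once the separators are removed.
joined = "".join(rows).replace(" ", "")
```

Here `rows` is `["PAY TO JOSE OS", "CAR"]`: the token set contains `OS` and `CAR` but not `OSCAR`, while the separator-free string `joined` still contains `OSCAR`.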

Disclosure of Invention

An embodiment of the invention provides a separator-free name matching method, which addresses the missed alerts produced by a word-search-based screening system when word segmentation is wrong. The method comprises the following steps:

word segmentation is carried out on all names of all entities in a financial list, and a word list of each name is obtained;

splicing all words in each word list into a matching character string;

constructing a name matching automaton according to the plurality of matching character strings and the associated entity information;

after the message to be analyzed is obtained, forming a character string to be analyzed;

inputting the character strings to be analyzed into a name matching automaton for matching, and obtaining an entity list;

and eliminating entity information meeting the miss judgment condition from the entity list.

An embodiment of the invention also provides a separator-free name matching device, which addresses the missed alerts produced by a word-search-based screening system when word segmentation is wrong. The device comprises the following modules:

the word segmentation module is used for segmenting all names of all entities in the financial list to obtain a word list of each name;

the character string splicing module is used for splicing all the words in each word list into a matching character string;

the automaton construction module is used for constructing a name matching automaton according to the plurality of matching character strings and the associated entity information;

the character string obtaining module is used for forming a character string to be analyzed after obtaining the message to be analyzed;

the matching module is used for inputting the character strings to be analyzed into the name matching automaton for matching to obtain an entity list;

and the miss rejection module is used for rejecting entity information meeting miss judgment conditions from the entity list.

The embodiment of the invention also provides computer equipment, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the no-separator name matching method when executing the computer program.

The embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the separator-free name matching method when being executed by a processor.

The embodiment of the invention also provides a computer program product, which comprises a computer program, wherein the computer program is executed by a processor to realize the separator-free name matching method.

In the embodiment of the invention, every name of every entity in a financial list is segmented to obtain a word list for each name; all words in each word list are spliced into a matching character string; a name matching automaton is constructed from the plurality of matching character strings and their associated entity information; after a message to be analyzed is obtained, a character string to be analyzed is formed; the character string to be analyzed is input into the name matching automaton for matching to obtain an entity list; and the entity information that meets a miss judgment condition is removed from the entity list. Because the name matching automaton is built over the matching character strings and the entity list is matched against it, the embodiment addresses the missed alerts that a conventional word-search-based screening system produces when word segmentation is wrong.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. In the drawings:

FIG. 1 is a flowchart of a separator-free name matching method according to an embodiment of the invention;

FIG. 2 is a schematic diagram of separator-free name matching in an embodiment of the present invention;

FIG. 3 is an example of a dictionary tree in an embodiment of the present invention;

FIG. 4 is a schematic diagram of the failure pointers calculated for the dictionary tree in an embodiment of the present invention;

FIG. 5 is a schematic diagram of a separator-free name matching device according to an embodiment of the invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings. The exemplary embodiments of the present invention and their descriptions herein are for the purpose of explaining the present invention, but are not to be construed as limiting the invention.

The inventors have found that the general procedure of an existing list monitoring and screening engine is as follows:

The collected blacklist data is parsed and the processed data is stored in a structured database; the list names are then preprocessed (word segmentation and the like), and an inverted index of the names is built.

For a message transmitted by the transaction system, the screening text is extracted according to the definition of the message fields, and the screening text is preprocessed by word segmentation, stop-word removal and the like.

The extracted word list is used to look up matching list names in the inverted index; the similarity between the input name and each matched list name is then calculated, list names whose similarity is below a threshold are eliminated, and the remaining list names are output as the final result.

The disadvantage of this kind of list monitoring and screening engine is that, when the input has been split or combined arbitrarily, the word list produced by word segmentation differs greatly from the actual word list of the name, so a word-based search cannot correctly screen the list name from the input.
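The conventional procedure and its weakness can be sketched roughly as follows. This is not the engine of any particular product: the whitespace tokenizer, the use of `difflib.SequenceMatcher` as the similarity measure and the 0.8 threshold are all assumptions chosen to keep the sketch self-contained.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_inverted_index(list_names):
    """Map every word of every list name to the set of names containing it."""
    index = defaultdict(set)
    for name in list_names:
        for word in name.upper().split():
            index[word].add(name)
    return index


def screen(index, text, threshold=0.8):
    """Look up candidates word by word, then keep those similar enough to the input."""
    words = text.upper().split()
    candidates = set().union(*(index.get(word, set()) for word in words))
    query = " ".join(words)
    return sorted(
        name for name in candidates
        if SequenceMatcher(None, query, name.upper()).ratio() >= threshold
    )


index = build_inverted_index(["JOSE OSCAR"])  # hypothetical list name
intact_hits = screen(index, "JOSE OSCAR")
broken_hits = screen(index, "JO SE OS CAR")   # every word cut in two
```

With intact words the name is found, but once every word has been cut (`JO SE OS CAR`) no input word is a key of the index, so the candidate set is empty before similarity is even computed. That is the missed alert the embodiment sets out to avoid.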

Therefore, an embodiment of the invention provides a separator-free name matching method that addresses the missed alerts a conventional word-search-based screening system produces when word segmentation is wrong. Several matching rules are built in, so that a user can configure the list screening and matching rules according to the business scenario and the user's own risk preference, thereby reducing the sanctions compliance risk caused by missed alerts while, as far as possible, avoiding transaction interruptions that would harm the customer experience.

The data acquisition, storage, use, processing and the like in the technical scheme meet the relevant regulations of national laws and regulations.

FIG. 1 is a flowchart of a separator-free name matching method according to an embodiment of the present invention. The method includes:

step 101, word segmentation is carried out on all names of all entities in a financial list, and a word list of each name is obtained;

step 102, splicing all words in each word list into a matching character string;

step 103, constructing a name matching automaton according to the plurality of matching character strings and the associated entity information;

step 104, after obtaining the message to be analyzed, forming a character string to be analyzed;

step 105, inputting the character strings to be analyzed into a name matching automaton for matching, and obtaining an entity list;

and step 106, eliminating entity information meeting the miss judgment condition from the entity list.

Each step is described in detail below. FIG. 2 is a schematic diagram of separator-free name matching in an embodiment of the present invention, in which steps 101-103 correspond to the name matching automaton construction of FIG. 2 and steps 104-106 correspond to the name screening of FIG. 2.

In step 101, word segmentation is performed on all names of all entities in a financial list, and a word list of each name is obtained;

in one embodiment, after obtaining the word list for each name, further comprising:

deleting the words of the suffix type in each word list according to the name suffix list;

splicing all words in each word list into a matching character string, including:

all words in each word list from which the suffix type words are deleted are spliced into a matching string.

The name suffix table contains suffixes of company names and of personal names, that is, corporate name suffixes as well as personal honorifics, job titles and the like. Examples of corporate name suffixes are CO, CO LTD and S.A. DE C.V. A company name suffix table and a personal name suffix table may also be established and applied separately.
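As a rough illustration of step 101 with suffix removal, the sketch below segments a name and drops the suffix-type words. The tokenizer (splitting on any non-alphanumeric run and upper-casing), the single-word suffix table and the sample company name are assumptions; a multi-word suffix such as S.A. DE C.V. would need phrase-level handling that is not shown here.

```python
import re

# Hypothetical single-word suffix table, used for illustration only.
COMPANY_SUFFIX_TABLE = {"CO", "LTD"}


def segment(name: str) -> list[str]:
    """Split a list name into upper-case words on any run of non-alphanumerics."""
    return [word for word in re.split(r"[^0-9A-Za-z]+", name.upper()) if word]


def drop_suffix_words(words: list[str], suffix_table: set[str]) -> list[str]:
    """Remove every word that appears in the suffix table."""
    return [word for word in words if word not in suffix_table]


words = segment("Acme Trading Co., Ltd.")  # hypothetical company name
core_words = drop_suffix_words(words, COMPANY_SUFFIX_TABLE)
```

In this sketch `words` is `["ACME", "TRADING", "CO", "LTD"]` and `core_words` is `["ACME", "TRADING"]`. A real table would also have to guard against names that consist only of suffix-type words, which this sketch would reduce to an empty list.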

In step 102, all words in each word list are spliced into a matching string;

If a personal name has both a first name and a last name, the first name is abbreviated to its initials and spliced with the last name, and the resulting abbreviated name is output.
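A minimal sketch of step 102 follows. The function names are hypothetical, and placing the initials before the last name is an assumption, since the text does not fix the order.

```python
def splice(words: list[str]) -> str:
    """Concatenate the words of a name with no separator between them."""
    return "".join(words)


def abbreviated_person_name(first_name_words: list[str],
                            last_name_words: list[str]) -> str:
    """Reduce each first-name word to its initial, then append the last name."""
    initials = "".join(word[0] for word in first_name_words if word)
    return initials + splice(last_name_words)


full_string = splice(["JOSE", "OSCAR"])                     # hypothetical name
short_string = abbreviated_person_name(["JOSE"], ["OSCAR"])
```

For the hypothetical name JOSE OSCAR this yields the matching character string `JOSEOSCAR` and the abbreviated form `JOSCAR`.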

In step 103, a name matching automaton is constructed from the plurality of matching strings and associated entity information.

In an embodiment, constructing a name matching automaton according to a plurality of matching strings and corresponding entity information includes:

constructing a dictionary tree, wherein the dictionary tree comprises a root node and a plurality of sub-nodes, each sub-node represents one character on a matching character string, and each sub-node stores entity information associated with the matching character string formed by the characters sequentially arranged from the root node to the sub-node;

determining the symbol of each child node in the dictionary tree;

calculating a failure pointer of the dictionary tree;

and constructing a name matching automaton according to the symbol and the failure pointer of each child node in the dictionary tree.

FIG. 3 is an example of a dictionary tree in an embodiment of the present invention. In this example, the child node representing the character y, the child node representing the character e and the child node representing the character i store the entity information associated with y, jose and oscai respectively; likewise, if a matching character string is scan, the child node representing the character r stores the entity information associated with scan. As can be seen from FIG. 3, each matching character string is laid out character by character along the child nodes, starting from the root node. For the child node representing the character y, scar is that child node's symbol. FIG. 4 is a schematic diagram of the failure pointers calculated for the dictionary tree according to an embodiment of the present invention, in which the letters a and s correspond to failure pointers.

In a specific implementation, the entity information at least includes an entity identifier, an entity type, an issuing body, a name type and a name content. The entity identifier and the name content are information without a common attribute, whereas the entity type, the issuing body and the name type are information with a common attribute.

The main disadvantage of a name matching automaton constructed in this way is that all data is loaded into memory, so memory consumption is high. For lists with data volumes in the tens of millions, each name in the automaton also stores its associated entity identifier, entity type, issuing body, name type and list content (the original name), and loading the entire list consumes a large amount of memory. For the list screening scenario, statistical analysis shows that the number of distinct combinations of entity type, issuing body and name type across all lists is on the order of ten thousand. These three fields are information with a common attribute, whereas the entity identifier and the name content are information without a common attribute. The entity type, issuing body and name type can therefore be grouped into a combined object using the Flyweight pattern: before a combined object is created, the combined object pool is checked for an existing instance, which is reused if present. This reduces the number of objects and hence the memory consumption.

Thus, the following structure can be formed:

the entity information at least comprises an entity identifier, a combined object and name content;

the combined object includes a plurality of pieces of information having a common attribute, each of which is an entity type, an issuing body or a name type.
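The pooled structure can be sketched as follows, as one way to apply the Flyweight pattern. The class and field names are hypothetical; `issuing_body` stands for the field that the text also renders as release structure, publication structure or issuing mechanism, and the sample values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CombinedObject:
    """Shared (flyweight) part: the fields with a common attribute."""
    entity_type: str
    issuing_body: str
    name_type: str


class CombinedObjectPool:
    """Hands out one shared CombinedObject per distinct field combination."""

    def __init__(self):
        self._pool = {}

    def get(self, entity_type: str, issuing_body: str, name_type: str) -> CombinedObject:
        key = (entity_type, issuing_body, name_type)
        obj = self._pool.get(key)
        if obj is None:                 # not pooled yet: create and remember it
            obj = self._pool[key] = CombinedObject(*key)
        return obj

    def __len__(self):
        return len(self._pool)


@dataclass(frozen=True)
class EntityInfo:
    """Per-name part: identifier and name content stay unshared."""
    entity_id: str
    combined: CombinedObject
    name_content: str


pool = CombinedObjectPool()
first = EntityInfo("E1", pool.get("person", "ISSUER_A", "primary"), "JOSE OSCAR")
second = EntityInfo("E2", pool.get("person", "ISSUER_A", "primary"), "ANNA SILVA")
```

Both entries reference the same `CombinedObject` instance, so the pool grows with the number of distinct combinations (on the order of ten thousand according to the text) rather than with the number of names.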

In step 104, after the message to be analyzed is obtained, a character string to be analyzed is formed, which specifically includes:

deleting all punctuation marks, spaces and preset characters from the message to be analyzed to form the character string to be analyzed.

The preset characters may be special characters such as the line feed character.
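A minimal sketch of step 104 is given below. It handles ASCII punctuation only, and the upper-casing and the default preset characters are assumptions added for illustration; the text itself only requires deleting punctuation marks, spaces and preset characters.

```python
import string


def to_analysis_string(message: str, preset_chars: str = "\r\n\t") -> str:
    """Delete ASCII punctuation, whitespace and preset characters from a message."""
    drop = set(string.punctuation) | set(string.whitespace) | set(preset_chars)
    kept = "".join(ch for ch in message if ch not in drop)
    return kept.upper()  # case folding is an assumption, not stated in the text


# Hypothetical message field in which a word is broken across two rows.
analysis_string = to_analysis_string("Pay to: JOSE OS\nCAR, Ltd.")
```

The line break inside the word disappears together with the other separators, leaving `PAYTOJOSEOSCARLTD` as the character string to be analyzed.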

In step 105, the character string to be analyzed is input into the name matching automaton for matching to obtain an entity list. In an embodiment, this includes:

after the character string to be analyzed matches a node in the name matching automaton, adding the entity information associated with that node to the entity list, wherein the entity list uses the entity name as its primary key and, during matching, jumps are made through the failure pointers;

removing from the entity list, according to the preconfigured value range of each piece of information in the combined object, the entity information whose corresponding information is not within the value range.

In a specific implementation, a value range can be determined separately for the entity type, the issuing body and the name type in the combined object, and the entity list is then filtered against each value range. Because the entity list uses the entity name as its primary key, it is also called the entity name list.
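The value-range filtering can be sketched as a simple dictionary filter. The field names, the sample values and the convention that an unconfigured field means "no restriction" are assumptions made for illustration.

```python
def filter_by_value_ranges(entity_list: dict, allowed: dict) -> dict:
    """Keep entries whose combined-object fields all lie in the configured ranges.

    entity_list maps a primary key to a dict of combined-object fields;
    allowed maps a field name to the set of accepted values.
    """
    return {
        key: info
        for key, info in entity_list.items()
        if all(info[field] in values for field, values in allowed.items())
    }


# Hypothetical entity list produced by the matching step.
entity_list = {
    "E1": {"entity_type": "person", "issuing_body": "ISSUER_A", "name_type": "primary"},
    "E2": {"entity_type": "company", "issuing_body": "ISSUER_B", "name_type": "low_quality_alias"},
}
# Hypothetical configuration: screen all entity types, but ignore low-quality aliases.
allowed = {
    "entity_type": {"person", "company"},
    "name_type": {"primary", "high_quality_alias"},
}
kept = filter_by_value_ranges(entity_list, allowed)
```

Only `E1` survives, because the name type of `E2` lies outside the configured range.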

Names are of many types, including primary names, high-quality aliases, low-quality aliases and so on, and actual screening shows that returning the entity list output above directly as the result produces a large number of false alarms. The names of the entities in the output entity list may fall into several cases: the name is only part of an input word, for example the input KANNA KHOSHABA hits the name ANN; the head or tail of the name matches only part of an input word; and so on. Based on the business scenario, a hit that does not satisfy the applicable requirement below is judged to be a miss and is deleted, which reduces the number of tasks for subsequent manual handling:

(1) If the entity type is a personal name, the hit should cover a number of complete words, that is, the first word and the last word of the hit should each be a complete word;

(2) If the entity type is a company name and the hit name consists of a single word, the hit should cover a number of complete words, that is, the first word and the last word of the hit should each be a complete word;

(3) If the entity type is a company name and the hit name consists of multiple words, the first word of the hit should be a complete word, and the part of the suffix that does not match should be a corporate stop word.
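Requirements (1) and (2) both ask whether a hit starts and ends on a word boundary of the original input. One way to check this, shown here as a sketch under assumptions, is to record the word boundaries while the separators are removed. The helper names are hypothetical, requirement (3) with its corporate stop words is not covered, and words that were themselves broken across rows would need extra handling.

```python
def analysis_string_with_boundaries(message: str):
    """Strip separators, remembering where original words started and ended.

    Returns (string, starts, ends): starts and ends are index sets into the
    stripped string, with end indices exclusive.
    """
    chars, starts, ends = [], set(), set()
    in_word = False
    for ch in message:
        if ch.isalnum():
            if not in_word:
                starts.add(len(chars))
                in_word = True
            chars.append(ch.upper())
        elif in_word:
            ends.add(len(chars))
            in_word = False
    if in_word:
        ends.add(len(chars))
    return "".join(chars), starts, ends


def is_whole_word_hit(start: int, end: int, starts: set, ends: set) -> bool:
    """True if the hit begins at a word start and finishes at a word end."""
    return start in starts and end in ends


text, starts, ends = analysis_string_with_boundaries("KANNA KHOSHABA")
ann_start = text.find("ANN")
ann_is_valid = is_whole_word_hit(ann_start, ann_start + 3, starts, ends)
```

For the example in the text, ANN is found at position 1 of `KANNAKHOSHABA`, which is neither a word start nor followed by a word end, so `ann_is_valid` is false and the hit is discarded as a miss.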

In the method provided by the embodiment of the invention, every name of every entity in a financial list is segmented to obtain a word list for each name; all words in each word list are spliced into a matching character string; a name matching automaton is constructed from the plurality of matching character strings and their associated entity information; after a message to be analyzed is obtained, a character string to be analyzed is formed; the character string to be analyzed is input into the name matching automaton for matching to obtain an entity list; and the entity information that meets a miss judgment condition is removed from the entity list. Because the name matching automaton is built over the matching character strings and the entity list is matched against it, the embodiment addresses the missed alerts that a conventional word-search-based screening system produces when word segmentation is wrong. In addition, several miss judgment conditions are built in, and a user can configure them according to the business scenario and the user's own risk preference, so that the risk caused by missed alerts is reduced while, as far as possible, transactions are not interrupted and the customer experience is not harmed.

The embodiment of the invention also provides a separator-free name matching device, which is described in the following embodiment. Because the principle of the device for solving the problem is similar to that of the no-separator name matching method, the implementation of the device can refer to the implementation of the no-separator name matching method, and the repetition is omitted.

FIG. 5 is a schematic diagram of a separator-free name matching device according to an embodiment of the present invention. The device includes:

the word segmentation module 501 is configured to segment all names of all entities in the financial list to obtain a word list of each name;

a string concatenation module 502, configured to concatenate all words in each word list into a matching string;

an automaton construction module 503, configured to construct a name matching automaton according to the plurality of matching strings and the associated entity information;

the character string obtaining module 504 is configured to form a character string to be analyzed after obtaining the message to be analyzed;

the matching module 505 is configured to input a character string to be analyzed into a name matching automaton for matching, so as to obtain an entity list;

the miss rejection module 506 is configured to reject, from the entity list, entity information that satisfies the miss judgment condition.

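In an embodiment, the word segmentation module is further configured to:

after obtaining the word list of each name, delete the suffix-type words in each word list according to the name suffix table;

the character string splicing module is specifically configured to:

splice all words in each word list from which the suffix-type words have been deleted into a matching character string.

In one embodiment, the automaton construction module is specifically configured to:

construct a dictionary tree, wherein the dictionary tree comprises a root node and a plurality of child nodes, each child node represents one character of a matching character string, and each child node stores the entity information associated with the matching character string formed by the characters arranged in sequence from the root node to that child node;

determine the symbol of each child node in the dictionary tree;

calculate the failure pointers of the dictionary tree;

construct the name matching automaton according to the symbol and the failure pointer of each child node in the dictionary tree.

In one embodiment, the entity information includes at least an entity identifier, a combined object and a name content;

the combined object includes a plurality of pieces of information having a common attribute, each of which is an entity type, an issuing body or a name type.

In one embodiment, the character string obtaining module is specifically configured to:

delete all punctuation marks, spaces and preset characters from the message to be analyzed to form the character string to be analyzed.

In an embodiment, the matching module is specifically configured to:

after the character string to be analyzed matches a node in the name matching automaton, add the entity information associated with that node to the entity list, wherein the entity list uses the entity name as its primary key and, during matching, jumps are made through the failure pointers;

remove from the entity list, according to the preconfigured value range of each piece of information in the combined object, the entity information whose corresponding information is not within the value range.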

In the device provided by the embodiment of the invention, every name of every entity in a financial list is segmented to obtain a word list for each name; all words in each word list are spliced into a matching character string; a name matching automaton is constructed from the plurality of matching character strings and their associated entity information; after a message to be analyzed is obtained, a character string to be analyzed is formed; the character string to be analyzed is input into the name matching automaton for matching to obtain an entity list; and the entity information that meets a miss judgment condition is removed from the entity list. Because the name matching automaton is built over the matching character strings and the entity list is matched against it, the embodiment addresses the missed alerts that a conventional word-search-based screening system produces when word segmentation is wrong. In addition, several miss judgment conditions are built in, and a user can configure them according to the business scenario and the user's own risk preference, so that the risk caused by missed alerts is reduced while, as far as possible, transactions are not interrupted and the customer experience is not harmed.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing description of the embodiments is provided to illustrate the general principles of the invention. It is not meant to limit the scope of the invention or to limit the invention to the particular embodiments described; any modifications, equivalents, improvements and the like that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

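1. A separator-free name matching method, comprising:

performing word segmentation on all names of all entities in a financial list to obtain a word list of each name;

splicing all words in each word list into a matching character string;

constructing a name matching automaton according to the plurality of matching character strings and the associated entity information;

after obtaining a message to be analyzed, forming a character string to be analyzed;

inputting the character string to be analyzed into the name matching automaton for matching to obtain an entity list;

and eliminating entity information meeting a miss judgment condition from the entity list.

2. The method of claim 1, further comprising, after obtaining the word list for each name:

deleting the suffix-type words in each word list according to a name suffix table;

wherein splicing all words in each word list into a matching character string comprises:

splicing all words in each word list from which the suffix-type words have been deleted into a matching character string.

3. The method of claim 1, wherein constructing a name matching automaton from a plurality of matching strings and corresponding entity information comprises:

constructing a dictionary tree, wherein the dictionary tree comprises a root node and a plurality of child nodes, each child node represents one character of a matching character string, and each child node stores the entity information associated with the matching character string formed by the characters arranged in sequence from the root node to that child node;

determining the symbol of each child node in the dictionary tree;

calculating a failure pointer of the dictionary tree;

and constructing the name matching automaton according to the symbol and the failure pointer of each child node in the dictionary tree.

4. The method of claim 1, wherein the entity information includes at least an entity identification, a combined object and a name content;

the combined object includes a plurality of pieces of information having a common attribute, each of which is an entity type, an issuing body or a name type.

5. The method of claim 1, wherein after obtaining the message to be analyzed, forming the string to be analyzed comprises:

deleting all punctuation marks, spaces and preset characters from the message to be analyzed to form the character string to be analyzed.

6. The method of claim 1, wherein inputting the character string to be analyzed into the name matching automaton for matching, obtaining the entity list, comprises:

after the character string to be analyzed matches a node in the name matching automaton, adding the entity information associated with that node to the entity list, wherein the entity list uses the entity name as its primary key and, during matching, jumps are made through the failure pointer;

and removing from the entity list, according to the preconfigured value range of each piece of information in the combined object, the entity information whose corresponding information is not within the value range.

7. A separator-free name matching device, comprising:

a word segmentation module, configured to segment all names of all entities in a financial list to obtain a word list of each name;

a character string splicing module, configured to splice all words in each word list into a matching character string;

an automaton construction module, configured to construct a name matching automaton according to the plurality of matching character strings and the associated entity information;

a character string obtaining module, configured to form a character string to be analyzed after obtaining a message to be analyzed;

a matching module, configured to input the character string to be analyzed into the name matching automaton for matching to obtain an entity list;

and a miss rejection module, configured to eliminate entity information meeting a miss judgment condition from the entity list.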

8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1 to 6 when executing the computer program.

9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, implements the method of any of claims 1 to 6.

10. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the method of any of claims 1 to 6.