CN115796150A - Fuzzy matching model establishing method, device and system for financial institution names

Info

Publication number: CN115796150A
Application number: CN202211510701.0A
Authority: CN
Inventors: 朱俊祺; 郑康豪; 王立; 何煜; 邓俊峰; 龙海; 陈祖杰
Original assignee: China Guangfa Bank Co Ltd
Current assignee: China Guangfa Bank Co Ltd
Priority date: 2022-11-29
Filing date: 2022-11-29
Publication date: 2023-03-14

Abstract

The invention discloses a method, a device and a system for establishing a fuzzy matching model for a financial institution name, wherein the method comprises the following steps: according to the open source data set and the loss function, performing supervised learning on the BERT model to obtain a cross-domain fuzzy matching model; and performing unsupervised learning on the cross-domain fuzzy matching model according to the financial institution name data set and the loss function to obtain the financial domain fuzzy matching model. By adopting the embodiment of the invention, the cross-domain fuzzy matching model obtained by training has the universal capability of accurate judgment, and the model obtained by training has higher calculation accuracy aiming at the similarity of the names of the financial institutions.

Description

Fuzzy matching model establishing method, device and system for financial institution names

Technical Field

The invention relates to the field of natural language processing, in particular to a method, a device and a system for establishing a fuzzy matching model for names of financial institutions.

Background

At present, the similarity between names of financial institutions can be calculated by a matching model, and such a model is mainly built in one of two ways: supervised learning on in-domain data, or transfer learning from out-of-domain data. Supervised learning requires a data set of financial institution name pairs to be constructed in advance. The data set can be built either by defining manual rules that label each name pair as similar or dissimilar, or by collecting users' search terms together with the institution names they click: a clicked institution name is regarded as similar to the search term, and the remaining names are regarded as negative samples. However, whether the name pairs are generated by manual rules or from click data, people must anticipate possible inputs, communicate continuously with business personnel and keep checking production data, or manually confirm what is entered into the system. Labeling errors can arise during this large amount of screening, which reduces the calculation accuracy of the model. With transfer learning, the resulting model has a broader, more general discrimination capability, so misjudgments occur easily when it is applied to the financial field to compare names of financial institutions. Rules then have to be written manually, and developers can only specify such rules after learning a large number of data characteristics, so it is difficult to effectively improve the calculation accuracy of the matching model.

Disclosure of Invention

The invention provides a method, a device and a system for establishing a fuzzy matching model for financial institution names, which aim to solve the technical problem that the calculation accuracy is low when the similarity between the financial institution names is calculated by the conventional matching model.

In order to solve the above technical problem, an embodiment of the present invention provides a fuzzy matching model establishing method for names of financial institutions, including:

according to the open source data set and the loss function, performing supervised learning on the BERT model to obtain a cross-domain fuzzy matching model; wherein the open source data set comprises: a first sentence, a second sentence to be matched corresponding to the first sentence, and a label indicating whether the first sentence and the second sentence are similar or dissimilar;

according to the financial institution name data set and the loss function, performing unsupervised learning on the cross-domain fuzzy matching model to obtain a financial domain fuzzy matching model; the financial institution name data set is a name pair data set provided with a positive sample and a negative sample according to a financial institution name library and a generation rule.

According to the method, the BERT model is trained in a supervised manner on the labeled open source data set. Because the open source data set contains the first statement, the corresponding second statement to be matched, and accurately set labels, the cross-domain fuzzy matching model trained with the open source data set and the loss function has a general capability for accurate judgment. The financial institution name data set contains a third statement together with the positive samples and negative samples corresponding to it, so the cross-domain fuzzy matching model performs unsupervised learning on the third statement, the positive samples and the negative samples in combination with the loss function, and the trained model calculates the similarity of financial institution names with higher accuracy.

Further, the supervised learning is performed on the BERT model according to the open source data set and the loss function to obtain a cross-domain fuzzy matching model, which specifically comprises:

inputting the first statement and the second statement to the BERT model, and respectively carrying out vector coding processing on the first statement and the second statement to obtain a first statement vector corresponding to the first statement and a second statement vector corresponding to the second statement; wherein the second statement vector comprises: the label indicates similar statement vectors corresponding to similar second statements, or the label indicates dissimilar statement vectors corresponding to dissimilar second statements;

calculating loss according to the first statement vector, the second statement vector, the label and the loss function to obtain a first loss result;

and according to the first loss result, performing gradient return and updating the weight until the BERT model converges to obtain the cross-domain fuzzy matching model.

According to the method, the BERT model performs supervised binary classification learning on the first statement, the second statement and the label in combination with the loss function, and the cross-domain fuzzy matching model is obtained when the BERT model converges; because the labels in the open source data set are accurate, the cross-domain fuzzy matching model has an accurate general capability and can calculate the similarity of different texts.

Further, the unsupervised learning of the cross-domain fuzzy matching model is performed according to the financial institution name data set and the loss function to obtain the financial domain fuzzy matching model, which specifically comprises:

wherein the financial institution name data set comprises: a third statement and a fourth statement to be matched corresponding to the third statement; the fourth statement includes: positive sample statements and negative sample statements;

inputting the third statement and the fourth statement to the cross-domain matching model, and respectively carrying out vector coding processing on the third statement and the fourth statement to obtain a third statement vector corresponding to the third statement and a fourth statement vector corresponding to the fourth statement; wherein the fourth statement vector comprises: a positive sample statement vector and a negative sample statement vector;

performing parameter zero setting processing on the fourth statement vector to obtain a processed fourth statement vector;

calculating loss according to the third statement vector, the processed fourth statement vector and the loss function to obtain a second loss result;

and according to the second loss result, performing gradient return and updating the weight until the cross-domain matching model converges to obtain the fuzzy matching model of the financial domain.

Further, the positive sample statement is a statement identical to the third statement; the negative sample statements are the statements in the financial institution name data set other than the third statement and the positive sample statement; the positive sample statement vector is obtained by vector coding processing of the positive sample statement; the negative sample statement vector is obtained by vector coding processing of the negative sample statement.

Further, performing the parameter zero setting processing on the fourth statement vector to obtain a processed fourth statement vector specifically includes:

when the fourth statement vector is a positive sample statement vector, setting all parameters greater than a probability threshold value in the positive sample statement vector to zero according to a preset probability threshold value to obtain a processed fourth statement vector;

when the fourth statement vector is a negative sample statement vector, the processed fourth statement vector is equal to the negative sample statement vector.

According to the generation rule, the invention sets statements identical to the third statement as positive sample statements and sets statements different from the third statement as negative sample statements. The parameter zero setting processing increases the difference between the third statement vector and the positive sample statement vector, so that the cross-domain fuzzy matching model performs unsupervised learning on the third statement vector and the processed positive sample statement vector, thereby obtaining higher calculation accuracy.

Further, the vector encoding process includes:

converting the sentence into an ID string according to a preset Chinese word list; the sentence includes: a first sentence, a second sentence, a third sentence, or a fourth sentence;

inquiring a word vector of the statement and a position vector of the statement according to the ID string;

correspondingly adding the word vector of the statement and the position vector of the statement to obtain an input vector;

according to an encoder, encoding the input vector for a plurality of times, and obtaining a corresponding statement vector after vector transverse summation and averaging; the statement vector includes: a first statement vector, a second statement vector, a third statement vector, or a fourth statement vector.

Further, the expression of the loss function is:

where sim is the cosine similarity; h_i is either the first statement vector or the third statement vector; the vector compared as a positive is either a similar statement vector or a positive sample statement vector; and the vector compared as a negative is either a dissimilar statement vector or a negative sample statement vector.

The loss function of the invention considers the positive sample and the negative sample in the unsupervised learning during the calculation, and also considers the similar statement vector and the dissimilar statement vector in the supervised learning; therefore, the method can be used for the convergence process of the BERT model during supervised learning and the convergence process of the cross-domain fuzzy matching model during unsupervised learning, and can improve the calculation accuracy of the fuzzy matching model in the financial domain after gradient return and weight update.
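The expression of the loss function is not reproduced in this text. Purely as an illustration, the sketch below uses a softmax-style contrastive loss with a temperature, a common choice when one anchor vector is compared against a positive and several negatives. The exact form, the temperature `tau`, and the function names `cosine` and `contrastive_loss` are all assumptions and should not be read as the patent's formula.

```python
import math
from typing import List, Sequence


def cosine(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm1 = math.sqrt(sum(a * a for a in h1))
    norm2 = math.sqrt(sum(b * b for b in h2))
    return dot / (norm1 * norm2)


def contrastive_loss(anchor: Sequence[float],
                     positive: Sequence[float],
                     negatives: List[Sequence[float]],
                     tau: float = 0.05) -> float:
    """Assumed loss form: small when the anchor is close to the positive
    vector and far from every negative vector."""
    pos_term = math.exp(cosine(anchor, positive) / tau)
    neg_term = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos_term / (pos_term + neg_term))


# Anchor matches the positive and is orthogonal to the negative: low loss.
loss_good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# Roles swapped: the anchor matches the negative instead, so the loss is high.
loss_bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Under this assumed form the same function serves both training stages: the anchor is the first or third statement vector, the positive is a similar or positive sample statement vector, and the negatives are dissimilar or negative sample statement vectors.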

Further, after the unsupervised learning of the cross-domain fuzzy matching model is performed according to the financial institution name data set and the loss function to obtain the financial domain fuzzy matching model, the method includes:

receiving a financial institution name and a financial institution name to be matched, which are input by a user, so that the fuzzy matching model of the financial field carries out similarity calculation to obtain a calculation result of the financial institution name and the financial institution name to be matched; wherein the calculation result comprises: similar or dissimilar.

On the other hand, the embodiment of the invention also provides a fuzzy matching model establishing device for the name of the financial institution, which comprises a first model establishing module and a second model establishing module;

the first model building module is used for carrying out supervised learning on the BERT model according to the open source data set and the loss function to obtain a cross-domain fuzzy matching model; wherein the open source data set comprises: a first sentence, a second sentence to be matched corresponding to the first sentence, and a label indicating whether the first sentence and the second sentence are similar or dissimilar;

the second model establishing module is used for enabling the cross-domain fuzzy matching model to perform unsupervised learning according to the financial institution name data set and the loss function to obtain a financial domain fuzzy matching model; the financial institution name data set is a name pair data set provided with a positive sample and a negative sample according to a financial institution name library and a generation rule.

On the other hand, the embodiment of the invention also provides a system for establishing the fuzzy matching model aiming at the name of the financial institution, which comprises matching equipment and a user side;

the matching equipment is used for executing the fuzzy matching model establishing method for the names of the financial institutions in the embodiment of the invention;

the user side is used for inputting the name of the financial institution and the name of the financial institution to be matched to the matching equipment; and for viewing the results of the calculations of the matching devices.

Drawings

FIG. 1 is a flow chart illustrating a fuzzy matching model building method for names of financial institutions according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart diagram illustrating a method for establishing a fuzzy matching model for names of financial institutions according to another embodiment of the present invention;

FIG. 3 is a schematic flowchart of a fuzzy matching model building method for financial institution names according to still another embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an embodiment of an encoding process provided by the present invention;

FIG. 5 is a flowchart illustrating an embodiment of an encoding process provided by the present invention;

FIG. 6 is a schematic structural diagram of an embodiment of a fuzzy matching model building apparatus for names of financial institutions according to the present invention;

FIG. 7 is a schematic structural diagram of an embodiment of the fuzzy matching model building system for financial institution names according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

Referring to fig. 1, a flow chart of an embodiment of the method for establishing a fuzzy matching model for a financial institution name according to the present invention mainly includes steps 101 and 102, which are as follows:

Step 101: according to the open source data set and the loss function, performing supervised learning on the BERT model to obtain a cross-domain fuzzy matching model; wherein the open source data set comprises: a first sentence, a second sentence to be matched corresponding to the first sentence, and a label indicating whether the first sentence and the second sentence are similar or dissimilar.

In this embodiment, the open source data set may use 1 to indicate that the semantics of the two sentences are similar or identical, and use 0 to indicate that the semantics of the two sentences are different, as shown in the following table:

step 102: according to the financial institution name data set and the loss function, performing unsupervised learning on the cross-domain fuzzy matching model to obtain a financial domain fuzzy matching model; the financial institution name data set is a name pair data set provided with a positive sample and a negative sample according to a financial institution name library and a generation rule.

In this embodiment, the financial institution name data set is generated from the financial institution name library and the generation rule without human intervention. For a given input financial institution name, a name to be matched that is identical to it is set as a positive sample, and every other, unpaired financial institution name is automatically set as a negative sample.
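The generation rule can be sketched as follows. This is a minimal illustration that assumes the name library is a plain list of strings; the function name `build_name_pairs`, the triple layout, and the placeholder names are hypothetical.

```python
from typing import List, Tuple


def build_name_pairs(name_library: List[str]) -> List[Tuple[str, str, int]]:
    """Return (input name, name to be matched, label) triples.

    Label 1 marks a positive sample and label 0 a negative sample.
    """
    pairs = []
    for anchor in name_library:
        for candidate in name_library:
            # Rule from the text: an identical name is a positive sample,
            # every other name in the library is a negative sample.
            label = 1 if candidate == anchor else 0
            pairs.append((anchor, candidate, label))
    return pairs


names = ["Bank A", "Bank B", "Bank C"]  # hypothetical placeholder names
pairs = build_name_pairs(names)
```

No manual labeling is involved: the labels follow mechanically from string identity within the library.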

Referring to fig. 2, a schematic flow chart of another embodiment of the method for establishing a fuzzy matching model for a financial institution name according to the present invention mainly includes steps 201 to 203, which are as follows:

step 201: inputting the first statement and the second statement to the BERT model, and respectively carrying out vector coding processing on the first statement and the second statement to obtain a first statement vector corresponding to the first statement and a second statement vector corresponding to the second statement; wherein the second statement vector comprises: the tag indicates a similar statement vector corresponding to a similar second statement, or the tag indicates a dissimilar statement vector corresponding to a dissimilar second statement.

Step 202: and calculating loss according to the first statement vector, the second statement vector, the label and the loss function to obtain a first loss result.

Step 203: and according to the first loss result, performing gradient return and updating the weight until the BERT model converges to obtain the cross-domain fuzzy matching model.

Referring to fig. 3, a flow chart of another embodiment of the method for establishing a fuzzy matching model for a financial institution name according to the present invention mainly includes steps 301 to 304, which are as follows:

In this embodiment, the financial institution name data set includes: a third statement and a fourth statement to be matched corresponding to the third statement; the fourth statement includes: positive sample statements and negative sample statements.

Step 301: inputting the third statement and the fourth statement to the cross-domain matching model, and respectively performing vector coding processing on the third statement and the fourth statement to obtain a third statement vector corresponding to the third statement and a fourth statement vector corresponding to the fourth statement; wherein the fourth statement vector comprises: a positive sample statement vector and a negative sample statement vector.

In this embodiment, the positive sample statement is a statement identical to the third statement; the negative sample statements are the statements in the financial institution name data set other than the third statement and the positive sample statement; the positive sample statement vector is obtained by vector coding processing of the positive sample statement; the negative sample statement vector is obtained by vector coding processing of the negative sample statement.

Step 302: carrying out parameter zero setting processing on the fourth statement vector to obtain a processed fourth statement vector.

In this embodiment, performing the parameter zero setting processing on the fourth statement vector to obtain a processed fourth statement vector specifically includes: when the fourth statement vector is a positive sample statement vector, setting to zero, according to a preset probability threshold value, all parameters in the positive sample statement vector that are larger than the probability threshold value, to obtain the processed fourth statement vector; when the fourth statement vector is a negative sample statement vector, the processed fourth statement vector is equal to the negative sample statement vector.

In this embodiment, the probability threshold may be set to 0.1: a value in the interval 0 to 1 is generated for each parameter of the vector, and the parameter is set to zero whenever that value is higher than 0.1.
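A minimal sketch of the parameter zero setting step, assuming the value generated in the interval 0 to 1 is a uniform random draw made once per vector element. The comparison direction (zero when the draw exceeds the threshold) follows the wording of the text literally; conventional dropout with a rate of 0.1 would instead zero each element with probability 0.1, so the direction should be treated as an assumption. The function name `zero_parameters` is hypothetical.

```python
import random
from typing import List, Optional, Sequence


def zero_parameters(vector: Sequence[float],
                    is_positive: bool,
                    threshold: float = 0.1,
                    rng: Optional[random.Random] = None) -> List[float]:
    """Apply parameter zero setting to a fourth statement vector."""
    if not is_positive:
        # Negative sample statement vectors pass through unchanged.
        return list(vector)
    if rng is None:
        rng = random.Random()
    processed = []
    for value in vector:
        draw = rng.random()  # uniform draw in [0, 1), assumed per element
        # Literal reading of the text: zero the element when draw > threshold.
        processed.append(0.0 if draw > threshold else value)
    return processed


positive_vec = zero_parameters([0.3, -0.7, 1.2, 0.5], is_positive=True,
                               rng=random.Random(0))
negative_vec = zero_parameters([0.3, -0.7, 1.2, 0.5], is_positive=False)
```

Because only the positive sample statement vector is perturbed, the model sees two slightly different encodings of the same name as a positive pair.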

Step 303: and calculating loss according to the third statement vector, the processed fourth statement vector and the loss function to obtain a second loss result.

Step 304: and according to the second loss result, performing gradient return and updating the weight until the cross-domain matching model converges to obtain the fuzzy matching model of the financial domain.

In this embodiment, the vector encoding process includes:

converting the sentence into an ID string according to a preset Chinese word list; the sentence includes: a first sentence, a second sentence, a third sentence, or a fourth sentence; inquiring a word vector of the statement and a position vector of the statement according to the ID string; correspondingly adding the word vector of the statement and the position vector of the statement to obtain an input vector; according to an encoder, encoding the input vector for a plurality of times, and obtaining a corresponding statement vector after vector transverse summation and averaging; the statement vector includes: a first statement vector, a second statement vector, a third statement vector, or a fourth statement vector.

In this embodiment, the Chinese word list is the one that comes with the BERT model; each Chinese word has a unique ID in the Chinese word list, and the ID is a positive integer. Before the similarity between names is calculated, each sentence is converted into an ID string according to the words it contains.

Referring to fig. 4, which is a schematic structural diagram of an embodiment of the encoding process provided by the present invention, the BERT model finds a word vector and a position vector of each sentence based on the ID, and adds the word vector and the position vector to obtain an input vector.

Referring to fig. 5, which is a schematic flow chart of an embodiment of the encoding processing provided by the present invention, the encoder passes the input vector X through 6 encoder blocks to obtain the corresponding statement vector C.
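The encoding steps above can be sketched end to end with toy data. This is an illustration only: the two-entry word list, the two-dimensional vectors, and all names are hypothetical; each character is treated as one token; `encoder_block` is an identity placeholder standing in for a real encoder block; and "transverse summation and averaging" is read here as averaging over token positions.

```python
from typing import Dict, List

Vector = List[float]


def encoder_block(rows: List[Vector]) -> List[Vector]:
    """Placeholder for one encoder block: returns its input unchanged."""
    return [list(row) for row in rows]


def encode_sentence(sentence: str,
                    vocab: Dict[str, int],
                    word_vecs: Dict[int, Vector],
                    pos_vecs: List[Vector],
                    num_blocks: int = 6) -> Vector:
    # Step 1: convert the sentence into an ID string using the word list.
    ids = [vocab[token] for token in sentence]
    # Steps 2 and 3: look up word and position vectors, add them element-wise.
    x = [[w + p for w, p in zip(word_vecs[token_id], pos_vecs[pos])]
         for pos, token_id in enumerate(ids)]
    # Step 4: pass the input through the encoder blocks (6 in the text).
    for _ in range(num_blocks):
        x = encoder_block(x)
    # Step 5: sum over token positions and average to get one statement vector.
    dim = len(x[0])
    return [sum(row[d] for row in x) / len(x) for d in range(dim)]


vocab = {"银": 1, "行": 2}                    # toy word list, IDs are positive integers
word_vecs = {1: [1.0, 0.0], 2: [0.0, 1.0]}   # toy word vectors
pos_vecs = [[0.1, 0.1], [0.2, 0.2]]          # toy position vectors
statement_vector = encode_sentence("银行", vocab, word_vecs, pos_vecs)
```

The output has the same dimension as a single token vector regardless of sentence length, which is what allows two names of different lengths to be compared by cosine similarity.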

In this embodiment, the expression of the loss function is:

where the vector compared as a positive is either a similar statement vector or a positive sample statement vector, and the vector compared as a negative is either a dissimilar statement vector or a negative sample statement vector.

In this example, the expression of the cosine similarity is:

$$\mathrm{sim}(h_1, h_2) = \frac{h_1 \cdot h_2}{\lVert h_1 \rVert \, \lVert h_2 \rVert}$$

wherein h_1 is a first statement vector or a third statement vector, and h_2 is a similar statement vector, a positive sample statement vector, a dissimilar statement vector, or a negative sample statement vector.
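Cosine similarity between two statement vectors can be computed as in the short sketch below. The function name is hypothetical, and raising an error for a zero-length vector is a design choice of this sketch, not something stated in the text.

```python
import math
from typing import Sequence


def cosine_similarity(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Dot product of h1 and h2 divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm1 = math.sqrt(sum(a * a for a in h1))
    norm2 = math.sqrt(sum(b * b for b in h2))
    if norm1 == 0.0 or norm2 == 0.0:
        # The similarity is undefined for an all-zero vector.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm1 * norm2)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The result lies between -1 and 1 and depends only on the direction of the vectors, not their length.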

In this embodiment, after the performing unsupervised learning on the cross-domain fuzzy matching model according to the financial institution name data set and the loss function to obtain the financial-domain fuzzy matching model, the method includes: receiving a financial institution name and a financial institution name to be matched, which are input by a user, so that the fuzzy matching model of the financial field carries out similarity calculation to obtain a calculation result of the financial institution name and the financial institution name to be matched; wherein the calculation result comprises: similar or dissimilar.

Referring to fig. 6, a schematic structural diagram of an embodiment of an apparatus for establishing a fuzzy matching model for a financial institution name according to the present invention mainly includes: a first model building module 601 and a second model building module 602.

In this embodiment, the first model building module 601 is configured to perform supervised learning on the BERT model according to the open source data set and the loss function, so as to obtain a cross-domain fuzzy matching model; wherein the open source data set comprises: a first sentence, a second sentence to be matched corresponding to the first sentence, and a label indicating whether the first sentence and the second sentence are similar or dissimilar.

In this embodiment, the first model building module 601 includes: a first encoding processing unit, a first loss calculating unit and a first adjusting unit; the first coding processing unit is used for inputting the first statement and the second statement to the BERT model, and respectively carrying out vector coding processing on the first statement and the second statement to obtain a first statement vector corresponding to the first statement and a second statement vector corresponding to the second statement; wherein the second statement vector comprises: the label indicates similar statement vectors corresponding to similar second statements, or the label indicates dissimilar statement vectors corresponding to dissimilar second statements; the first loss calculating unit is used for calculating loss according to the first statement vector, the second statement vector, the label and the loss function after the first coding processing unit obtains the first statement vector corresponding to the first statement and the second statement vector corresponding to the second statement, so as to obtain a first loss result; and the first adjusting unit is used for performing gradient return and updating the weight according to the first loss result after the first loss calculating unit obtains the first loss result until the BERT model converges to obtain the cross-domain fuzzy matching model.

The second model establishing module 602 is configured to perform unsupervised learning on the cross-domain fuzzy matching model according to the financial institution name data set and the loss function to obtain a financial-domain fuzzy matching model; the financial institution name data set is a name pair data set provided with a positive sample and a negative sample according to a financial institution name library and a generation rule.

In this embodiment, the second model building module 602 includes: a second encoding processing unit, a parameter zero setting unit, a second loss calculating unit and a second adjusting unit; wherein the financial institution name data set comprises: a third statement and a fourth statement to be matched corresponding to the third statement; the fourth statement includes: positive sample statements and negative sample statements; the second encoding processing unit is used for inputting the third statement and the fourth statement to the cross-domain matching model, and respectively carrying out vector coding processing on the third statement and the fourth statement to obtain a third statement vector corresponding to the third statement and a fourth statement vector corresponding to the fourth statement; wherein the fourth statement vector comprises: a positive sample statement vector and a negative sample statement vector; the parameter zero setting unit is used for performing parameter zero setting processing on the fourth statement vector, after the second encoding processing unit obtains the third statement vector and the fourth statement vector, to obtain a processed fourth statement vector; the second loss calculating unit is used for calculating loss according to the third statement vector, the processed fourth statement vector and the loss function, after the parameter zero setting unit obtains the processed fourth statement vector, so as to obtain a second loss result; and the second adjusting unit is used for performing gradient return and updating the weight according to the second loss result, after the second loss calculating unit obtains the second loss result, until the cross-domain matching model converges to obtain the fuzzy matching model of the financial domain.

Referring to fig. 7, a schematic structural diagram of an embodiment of a system for establishing a fuzzy matching model for names of financial institutions according to the present invention mainly includes: matching device 701 and user side 702.

In this embodiment, the matching device 701 is configured to execute the fuzzy matching model establishing method for the names of the financial institutions according to the embodiment of the present invention.

The user side 702 is configured to input a financial institution name and a financial institution name to be matched to the matching device 701; and for viewing the results of the calculations of the matching device 701.

The above-mentioned embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, and it should be understood that the above-mentioned embodiments are only examples of the present invention and are not intended to limit the scope of the present invention. It should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A fuzzy matching model building method for financial institution names is characterized by comprising the following steps:

2. The method for building the fuzzy matching model for the financial institution name as claimed in claim 1, wherein the supervised learning of the BERT model is performed according to the open source data set and the loss function to obtain the cross-domain fuzzy matching model, specifically:

inputting the first statement and the second statement to the BERT model, and respectively carrying out vector coding processing on the first statement and the second statement to obtain a first statement vector corresponding to the first statement and a second statement vector corresponding to the second statement; wherein the second sentence vector comprises: the label indicates similar statement vectors corresponding to similar second statements, or the label indicates dissimilar statement vectors corresponding to dissimilar second statements;

3. The method according to claim 1, wherein the cross-domain fuzzy matching model is subjected to unsupervised learning according to the financial institution name data set and the loss function to obtain the financial domain fuzzy matching model, and specifically comprises:

wherein the financial institution name data set comprises: a third statement and a fourth statement to be matched corresponding to the third statement; the fourth statement includes: positive sample statements and negative sample statements;

inputting the third statement and the fourth statement to the cross-domain matching model, and respectively performing vector coding processing on the third statement and the fourth statement to obtain a third statement vector corresponding to the third statement and a fourth statement vector corresponding to the fourth statement; wherein the fourth statement vector comprises: a positive sample statement vector and a negative sample statement vector;

4. The fuzzy matching model building method for financial institution names according to claim 3, wherein the positive sample statement is a statement identical to the third statement; the negative sample statements are the statements in the financial institution name data set other than the third statement and the positive sample statement; the positive sample statement vector is obtained by vector coding processing of the positive sample statement; the negative sample statement vector is obtained by vector coding processing of the negative sample statement.

5. The method for establishing the fuzzy matching model for the names of the financial institutions according to claim 3, wherein the parameter zero setting processing is performed on the fourth statement vector to obtain a processed fourth statement vector, specifically:

when the fourth statement vector is a positive sample statement vector, setting all parameters which are larger than a probability threshold value in the positive sample statement vector to zero according to a preset probability threshold value to obtain a processed fourth statement vector;

6. The fuzzy matching model building method for financial institution names according to any one of claims 2-5, wherein the vector encoding process comprises:

7. The fuzzy matching model building method for financial institution names as claimed in any one of claims 1 to 5, wherein the loss function is expressed as:

where the vector compared as a positive is either a similar statement vector or a positive sample statement vector, and the vector compared as a negative is either a dissimilar statement vector or a negative sample statement vector.

8. The method for building the fuzzy matching model for the financial institution name as claimed in claim 3, wherein after the unsupervised learning of the cross-domain fuzzy matching model according to the financial institution name data set and the loss function to obtain the fuzzy matching model for the financial domain, the method comprises:

9. The fuzzy matching model establishing device for the names of the financial institutions is characterized by comprising a first model establishing module and a second model establishing module;

10. A fuzzy matching model building system for financial institution names is characterized by comprising matching equipment and a user side;

wherein the matching device is used for executing the fuzzy matching model building method for the financial institution name according to any one of claims 1-8;