CN109165326A

CN109165326A - Character string matching method and device

Info

Publication number: CN109165326A
Application number: CN201810936946.7A
Authority: CN
Inventors: 曾伟雄; 薛重阳; 孟庆文; 王维; 刘晓东
Original assignee: Bee Wisdom (beijing) Technology Co Ltd
Current assignee: Bee Wisdom (beijing) Technology Co Ltd
Priority date: 2018-08-16
Filing date: 2018-08-16
Publication date: 2019-01-08

Abstract

The invention discloses a character string matching method and device. The method includes: after a first character string and a second character string are obtained, segmenting the two character strings into words and determining the field corresponding to each word they contain; determining the matching degree between the two character strings according to the weight value of each field; and, if the matching degree is greater than a preset threshold, determining that the two character strings match. The weight value of each field can be determined from sample character strings. Setting weight values for the different fields improves the accuracy of matching between different character strings. Further, compared with the manual comparison used in the prior art, the embodiment of the present invention requires no manual comparison, which effectively reduces labor cost, simplifies the operation of matching enterprise names, and shortens the matching time.

Description

Character string matching method and device

Technical Field

The invention relates to the field of data science, in particular to a character string matching method and device.

Background

Business name matching is an important technology in the field of risk control. For example, in the financial industry, and particularly in the credit industry, a customer is commonly required to fill in a business name for risk management purposes, and the business name filled in by the customer then needs to be matched. For example, the business name filled in by the customer can be matched against the business names listed in the customer's credit report to check whether the customer has previously worked at that business; alternatively, it can be compared with the business names filled in by other customers to check whether any of the customer's colleagues are also customers of the same institution.

In the prior art, business names are usually matched by manual comparison, that is, a person judges whether different business names match. Obviously, this approach has a high labor cost, is complex to operate, and is time-consuming.

Based on this, a method for matching character strings is needed to solve the problem of high labor cost caused by manual comparison in the prior art.

Disclosure of Invention

The embodiment of the invention provides a character string matching method and device, and aims to solve the technical problem of high labor cost caused by manual comparison mode for character string matching in the prior art.

The embodiment of the invention provides a character string matching method, which comprises the following steps:

acquiring a first character string and a second character string;

dividing words of the first character string and the second character string respectively to obtain words contained in the first character string and words contained in the second character string;

determining fields corresponding to the words contained in the first character string and fields corresponding to the words contained in the second character string according to a preset corresponding relation between the fields and the words;

determining the matching degree of the first character string and the second character string according to each word and each corresponding field contained in the first character string, each word and each corresponding field contained in the second character string and the weight value of each field; the weight value of each field is determined according to a plurality of sample character strings;

and if the matching degree is larger than a preset threshold value, determining that the first character string is matched with the second character string.

Therefore, the matching accuracy between different character strings can be improved by setting the weight values of different fields; further, compared with a manual comparison mode in the prior art, the embodiment of the invention does not need manual comparison, effectively reduces the labor cost, can simplify the operation of matching enterprise names, and can also shorten the matching time.

In one possible implementation, the weight value of each field is determined by:

performing word segmentation on each sample character string to obtain each word contained in each sample character string;

determining each field corresponding to each word contained in each sample character string according to the corresponding relation between the field and the word;

determining the repetition rate of each field according to the repetition rate of the word corresponding to each field;

and determining the weight value of each field according to the repetition rate of each field.

Determining the weight value of a field from its repetition rate in this way makes the weight value more accurate and better aligned with the importance of the field, which further improves the accuracy of matching between different character strings.

In one possible implementation, before determining the repetition rate of each field according to the plurality of words corresponding to each field, the method further includes:

determining, for each field, the repetition rate of each word corresponding to the field among all the words corresponding to the field.

In one possible implementation manner, determining the weight value of each field according to the repetition rate of each field includes:

determining the discrimination between a plurality of words corresponding to each field according to the repetition rate of each field;

determining the total discrimination corresponding to all the fields according to the discrimination between the plurality of words corresponding to each field;

and determining the weight value of the field according to the discrimination of each field and the total discrimination corresponding to all the fields.

The embodiment of the invention provides a character string matching device, which comprises:

an acquisition unit configured to acquire a first character string and a second character string;

the processing unit is used for segmenting the first character string and the second character string respectively to obtain each word contained in the first character string and each word contained in the second character string; determining fields corresponding to the words contained in the first character string and fields corresponding to the words contained in the second character string according to a preset corresponding relation between the fields and the words; determining the matching degree of the first character string and the second character string according to each word and each corresponding field contained in the first character string, each word and each corresponding field contained in the second character string and the weight value of each field; the weight value of each field is determined according to a plurality of sample character strings;

and the matching unit is used for determining that the first character string is matched with the second character string if the matching degree is greater than a preset threshold value.

In a possible implementation manner, the processing unit is specifically configured to:

performing word segmentation on each sample character string to obtain each word contained in each sample character string; determining each field corresponding to each word contained in each sample character string according to the corresponding relation between the field and the word; determining the repetition rate of each field according to the repetition rate of the word corresponding to each field; and determining the weight value of each field according to the repetition rate of each field.

In a possible implementation manner, before determining the repetition rate of each field according to the plurality of words corresponding to each field, the processing unit is further configured to:

In a possible implementation manner, the processing unit is specifically configured to:

determining the discrimination between a plurality of words corresponding to each field according to the repetition rate of each field; determining the total discrimination corresponding to all the fields according to the discrimination between the plurality of words corresponding to each field; and determining the weight value of the fields according to the discrimination of each field and the total discrimination corresponding to all the fields.

The embodiment of the present application further provides an apparatus, which has a function of implementing the above-described character string matching method. This function may be implemented by hardware executing corresponding software. In one possible design, the apparatus includes a processor, a transceiver, and a memory. The memory stores computer-executable instructions, the transceiver enables communication between the apparatus and other communication entities, and the processor and the memory are connected through a bus. When the apparatus runs, the processor executes the computer-executable instructions stored in the memory so that the apparatus performs the character string matching method described above.

An embodiment of the present invention further provides a computer storage medium, where a software program is stored, and when the software program is read and executed by one or more processors, the method for matching a character string described in the foregoing various possible implementation manners is implemented.

Embodiments of the present invention also provide a computer program product containing instructions, which when run on a computer, cause the computer to execute the string matching method described in the above-mentioned various possible implementation manners.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings that are required to be used in the description of the embodiments will be briefly described below.

Fig. 1 is a schematic flow chart of a character string matching method according to an embodiment of the present invention;

Fig. 2 is a flowchart illustrating a method for determining a weight value of a field according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a character string matching apparatus according to an embodiment of the present invention.

Detailed Description

The present application will be described in detail below with reference to the accompanying drawings, and the specific operation methods in the method embodiments can also be applied to the apparatus embodiments.

In the prior art, in addition to manual comparison, a computer can be used to determine whether two enterprise names match by full-character matching. However, this approach may produce erroneous judgments. For example, if the two enterprise names to be matched are "Association Company" and "Baidu Company", the existing full-character matching approach may conclude that the two names match because the word "Company" in "Association Company" is identical to the word "Company" in "Baidu Company". Obviously, this conclusion is wrong.

Based on this, an embodiment of the present invention provides a method for matching a character string, as shown in fig. 1, which is a schematic flow chart of the method for matching a character string provided in the embodiment of the present invention, and specifically includes the following steps:

Step 101, obtaining a first character string and a second character string.

Step 102, performing word segmentation on the first character string and the second character string respectively to obtain words contained in the first character string and words contained in the second character string.

Step 103, determining fields corresponding to the words contained in the first character string and fields corresponding to the words contained in the second character string according to a preset corresponding relationship between the fields and the words.

Step 104, determining the matching degree of the first character string and the second character string according to each word and each corresponding field contained in the first character string, each word and each corresponding field contained in the second character string and the weight value of each field.

Step 105, if the matching degree is determined to be greater than a preset threshold value, determining that the first character string is matched with the second character string.

Specifically, in step 101, the matching of character strings may be applied to matching between two character strings or among a plurality of character strings. A character string is composed of a plurality of characters, and the number of characters composing it can be determined according to specific requirements. The content of the character strings may also be determined according to specific requirements; for example, if character strings of the enterprise name class need to be matched, the obtained character strings may be "Association Company", "Baidu Company", and the like.

In step 102, a character string usually includes a plurality of characters, and, in consideration of the relationship between the characters, the character string may first be segmented into words. For example, if the first character string is "Beijing XXX Information Technology Co., Ltd.", it can be segmented into four words: "Beijing", "XXX", "Information Technology", and "Co., Ltd."; if the second character string is "Shanghai XXX Information Technology Co., Ltd.", it can be segmented into four words: "Shanghai", "XXX", "Information Technology", and "Co., Ltd.".
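The text does not specify a segmentation algorithm. As a minimal sketch only, the following uses greedy forward longest matching over whitespace tokens of an English rendering of the example names. The lexicon `LEXICON` and the function `segment` are hypothetical names introduced for illustration; a practical system for Chinese enterprise names would likely rely on a dedicated word segmenter.

```python
# Hypothetical lexicon of known words, each stored as a tuple of tokens.
# These entries are illustrative and are not taken from the patent.
LEXICON = {
    ("Beijing",),
    ("Shanghai",),
    ("Information", "Technology"),
    ("Co.,", "Ltd."),
}
MAX_LEN = max(len(entry) for entry in LEXICON)


def segment(name: str) -> list[str]:
    """Greedy forward longest-match segmentation over whitespace tokens."""
    tokens = name.split()
    words = []
    i = 0
    while i < len(tokens):
        # Try the longest possible lexicon entry first, then shorter ones.
        for size in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in LEXICON:
                break
        else:
            # No lexicon entry starts here: keep the single token as a word.
            size = 1
            candidate = (tokens[i],)
        words.append(" ".join(candidate))
        i += size
    return words


first_words = segment("Beijing XXX Information Technology Co., Ltd.")
second_words = segment("Shanghai XXX Information Technology Co., Ltd.")
print(first_words)
print(second_words)
```

Tokens that are not in the lexicon, such as "XXX", are kept as single words, so each example name yields four words.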

In step 103, the fields corresponding to a character string can be determined according to the type of the character string to be matched. Taking character strings of the business name class as an example, a business name normally consists of, in order, an administrative division, a word size (that is, the trade name), an industry, and an organization form (unless laws and regulations provide otherwise).

Further, according to the preset corresponding relationship between the fields and the words, the fields corresponding to the words contained in the character string can be determined. Specifically, the correspondence of the field to the word may be determined by:

(1) The "administrative division" field in a character string of the enterprise name class may be the name or place name of the administrative division, above the county level, where the enterprise is located. In some special cases, a country name may also be treated as an administrative division. That is, the "administrative division" field corresponds to words such as place names, administrative district names, and country names.

(2) The "word size" field in a character string of the enterprise name class may be composed of more than two Chinese characters. The name of a natural-person investor may be used as the word size, but the name of an administrative division may not (except where the place name of an administrative division above the county level has another meaning). That is, the "word size" field corresponds to words such as personal names and brand names.

(3) The "industry" field in a character string of the enterprise name class should be a term reflecting either the national economic industry to which the nature of the enterprise's economic activities belongs or the business characteristics of the enterprise, and the content expressed by the "industry" field should be consistent with the business scope of the enterprise. Where the economic activities of the enterprise fall into different major categories of the national economic industries, the industry in the enterprise name is expressed using the term for the national economic industry category to which its main economic activity belongs. That is, the "industry" field corresponds to words such as industry categories and business characteristics.

(4) According to relevant laws and regulations, an enterprise organized as a company may apply to use an organization form such as "limited company", "limited liability company", or "joint-stock limited company"; a non-company enterprise may apply to use "factory", "store", "department", "center", and the like as its organization form. That is, the "organization form" field corresponds to words such as "limited company", "limited liability company", "joint-stock limited company", "factory", "store", "department", and "center".

For example, if the first character string is "Beijing AA Information Technology Co., Ltd." and the second character string is "Beijing AA Technology Co., Ltd.", Table 1 shows an example of the fields corresponding to the words contained in the two character strings.

Table 1: example of fields corresponding to words contained in character string

Character string	Administrative division	Word size	Industry	Organization form
First character string	Beijing	AA	Information Technology	Co., Ltd.
Second character string	Beijing	AA	Technology	Co., Ltd.
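The preset correspondence between fields and words can be pictured as a lookup from word to field. The following is a minimal sketch under stated assumptions: the vocabularies in `FIELD_WORDS`, the fallback `DEFAULT_FIELD`, and the function `assign_fields` are hypothetical, and treating any unlisted word as the word size is an illustrative simplification rather than something the text prescribes.

```python
# Hypothetical preset correspondence between fields and words.
# A real system would load far larger vocabularies for each field.
FIELD_WORDS = {
    "administrative division": {"Beijing", "Shanghai"},
    "industry": {"Information Technology", "Technology"},
    "organization form": {"Co., Ltd.", "Company", "Group", "Factory", "Store", "Center"},
}
# Assumption: a word found in no vocabulary is treated as the word size.
DEFAULT_FIELD = "word size"


def assign_fields(words: list[str]) -> dict[str, str]:
    """Return a mapping from field name to the word of the name in that field."""
    fields = {}
    for word in words:
        field = next(
            (name for name, vocab in FIELD_WORDS.items() if word in vocab),
            DEFAULT_FIELD,
        )
        fields[field] = word
    return fields


first = assign_fields(["Beijing", "AA", "Information Technology", "Co., Ltd."])
second = assign_fields(["Beijing", "AA", "Technology", "Co., Ltd."])
print(first)
print(second)
```

With these illustrative vocabularies, the two mappings correspond to the two rows of Table 1.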

In step 104, the weight value of each field may be determined according to a plurality of sample character strings. Specifically, Fig. 2 shows a flowchart of the method for determining the weight value of a field provided in the embodiment of the present invention, which includes the following steps:

Step 201, performing word segmentation on each sample character string to obtain each word contained in each sample character string.

Before step 201 is executed, data cleaning may be performed on the sample character strings. Specific data cleaning methods include, but are not limited to, unifying the character set, deleting non-Chinese characters, and deleting repeated character strings.
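The cleaning operations listed above can be sketched as follows. This is an illustration under assumptions, not the patent's procedure: the character-set step is interpreted here as Unicode NFKC normalization, "Chinese characters" is taken to mean the CJK Unified Ideographs range U+4E00 to U+9FFF, and the function `clean_samples` and the sample strings are hypothetical.

```python
import re
import unicodedata

# Assumption: "Chinese characters" means code points U+4E00..U+9FFF.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def clean_samples(samples: list[str]) -> list[str]:
    """Normalize each sample, strip non-Chinese characters, drop duplicates."""
    seen = set()
    cleaned = []
    for raw in samples:
        # NFKC folds full-width and half-width variants into one form.
        text = unicodedata.normalize("NFKC", raw)
        text = NON_CHINESE.sub("", text)
        # Keep the first occurrence of each cleaned string, preserving order.
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


# Illustrative inputs containing a space, punctuation, and full-width brackets.
samples = clean_samples(["百度公司", "百度 公司!", "联想（集团）", "联想集团"])
print(samples)
```

Deduplication happens after normalization and stripping, so variants that differ only in punctuation or spacing collapse into one sample.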

Further, the manner of segmenting the sample character string may refer to what is described in the above step 102, and is not described in detail here.

Step 202, determining each field corresponding to each word contained in each sample character string according to the corresponding relationship between the field and the word.

The method for determining each field corresponding to each word included in the sample character string may refer to what is described in step 103, and is not described in detail here.

For example, Table 2 shows an example of the fields corresponding to sample character strings. Sample 1 is "Linkage Advantage Company": the word corresponding to the word size field is "Linkage Advantage", and the word corresponding to the organization form field is "Company". Sample 2 is "Association Group": the word corresponding to the word size field is "Association", and the word corresponding to the organization form field is "Group". Sample 3 is "Association Company": the word corresponding to the word size field is "Association", and the word corresponding to the organization form field is "Company". Sample 4 is "Baidu Company": the word corresponding to the word size field is "Baidu", and the word corresponding to the organization form field is "Company".

Table 2: example of fields corresponding to sample strings

Number	Sample string	Word size field	Organization form field
Sample 1	Linkage Advantage Company	Linkage Advantage	Company
Sample 2	Association Group	Association	Group
Sample 3	Association Company	Association	Company
Sample 4	Baidu Company	Baidu	Company

Step 203, determining the repetition rate of each field according to the repetition rate of the word corresponding to each field.

Before step 203 is executed, the repetition rate of each word corresponding to a field, among all the words corresponding to that field, may be determined; that is, the repetition rate of the words within each field is determined. Specifically, the repetition rate of a given word may be determined using Formula (1).

Wherein $C_i$ is the repetition rate of the i-th word corresponding to the field; $n_{ci}$ is the number of times the i-th word recurs in the field; and $N$ is the total number of words corresponding to the field.

For example, taking the sample character strings shown in Table 2, Table 3 shows the repetition rate of each word in the sample character strings. In the word size field, the repetition rate of "Linkage Advantage" is 0, the repetition rate of "Association" is 1/3, and the repetition rate of "Baidu" is 0; in the organization form field, the repetition rate of "Company" is 2/3, and the repetition rate of "Group" is 0.

Table 3: example of repetition rate of words in sample string
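Formula (1) itself does not appear in this text. The sketch below therefore assumes one form that agrees with the symbol definitions and with the repetition rates stated for the Table 2 samples: a word that occurs k times among the N words of a field recurs k - 1 times, and its repetition rate is taken as (k - 1) / (N - 1). Both this form and the function name `word_repetition_rates` are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction


def word_repetition_rates(words: list[str]) -> dict[str, Fraction]:
    """Repetition rate of each distinct word among the words of one field.

    Assumed form: (occurrences - 1) / (N - 1), where N is the number of
    words in the field (one word per sample string).
    """
    total = len(words)
    if total < 2:
        # With fewer than two words nothing can recur.
        return {word: Fraction(0) for word in words}
    counts = Counter(words)
    return {word: Fraction(n - 1, total - 1) for word, n in counts.items()}


# Words of each field, one per sample, taken from Table 2.
word_size = ["Linkage Advantage", "Association", "Association", "Baidu"]
organization_form = ["Company", "Group", "Company", "Company"]

word_size_rates = word_repetition_rates(word_size)
organization_rates = word_repetition_rates(organization_form)
print(word_size_rates)
print(organization_rates)
```

Exact fractions are used so the results can be compared directly with the values given in the example.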

Further, there are various methods for determining the repetition rate of each field, and in one example, the average value of the repetition rates of the words corresponding to each field may be used as the repetition rate of the field. Specifically, when calculating the repetition rate of a certain field, equation (2) may be employed to determine the repetition rate of the field.

$$Z_j = \frac{1}{N_j}\sum_{i=1}^{N_j} C_i \qquad \text{Formula (2)}$$

Wherein $Z_j$ is the repetition rate of the j-th field; $C_i$ is the repetition rate of the i-th word corresponding to the j-th field; and $N_j$ is the total number of words corresponding to the j-th field.

For example, taking the word repetition rates shown in Table 3, Table 4 shows the repetition rates of the fields of the sample character strings. The repetition rate of the word size field is 1/6, and the repetition rate of the organization form field is 1/2.

Table 4: example of repetition rate of fields of a sample string

	Word size field	Organization form field
Repetition rate of fields	1/6	1/2
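The averaging step can be checked numerically. The sketch below is illustrative only and rests on two assumptions: the word repetition rate takes the form (occurrences - 1) / (N - 1), and the average runs over every word occurrence in the field (one per sample string) rather than over distinct words. Under these assumptions the values 1/6 and 1/2 of Table 4 are reproduced. The function name `field_repetition_rate` is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def field_repetition_rate(words: list[str]) -> Fraction:
    """Average of the per-occurrence word repetition rates of one field."""
    total = len(words)
    if total < 2:
        return Fraction(0)
    counts = Counter(words)
    # Assumed word rate: (occurrences - 1) / (N - 1), once per occurrence.
    rates = [Fraction(counts[word] - 1, total - 1) for word in words]
    return sum(rates, Fraction(0)) / total


# Words of each field, one per sample, taken from Table 2.
word_size = ["Linkage Advantage", "Association", "Association", "Baidu"]
organization_form = ["Company", "Group", "Company", "Company"]

z_word_size = field_repetition_rate(word_size)
z_organization = field_repetition_rate(organization_form)
print(z_word_size, z_organization)
```

Averaging over distinct words instead would give 1/9 for the word size field, which does not agree with Table 4; that is why the per-occurrence average is assumed here.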

In other possible examples, the repetition rate of the field may also be determined by using other methods, for example, the repetition rate of the field is determined according to the repetition rate of the word corresponding to the field and a preset coefficient, which is not limited specifically.

Step 204, determining the weight value of each field according to the repetition rate of each field.

In the embodiment of the present invention, there are various ways to determine the weight values of the fields, and one possible implementation manner is to determine the discrimination between the words corresponding to each field according to the repetition rate of each field, then determine the total discrimination corresponding to all the fields according to the discrimination between the words corresponding to each field, and further determine the weight values of the fields according to the discrimination of each field and the total discrimination corresponding to all the fields.

Specifically, the repetition rate of the field is positively correlated with the repetition rate of the word corresponding to the field, that is, the higher the repetition rate of the field is, the more times the word corresponding to the field is repeated, the more difficult it is to distinguish the words corresponding to the field, that is, the repetition rate of the field and the distinguishing degree between the words corresponding to the field are in a negative correlation relationship. Further, when calculating the distinction degree between a plurality of words corresponding to a certain field, formula (3) may be used to determine the distinction degree between a plurality of words corresponding to the field.

$$Q_j = 1 - Z_j \qquad \text{Formula (3)}$$

Wherein $Q_j$ is the discrimination between the plurality of words corresponding to the j-th field; and $Z_j$ is the repetition rate of the j-th field.

For example, taking the field repetition rates shown in Table 4, Table 5 shows the discrimination of the fields of the sample character strings. The discrimination of the word size field is 5/6, and the discrimination of the organization form field is 1/2.

Table 5: example of discrimination of fields of a sample string

	Word size field	Organization form field
Repetition rate of fields	1/6	1/2
Discrimination of fields	5/6	1/2

Further, the total discrimination of all fields may be the sum of the discriminations of the individual fields; for the fields shown in Table 5, the total discrimination is 5/6 + 1/2 = 8/6.

Further, the greater the discrimination of the fields, the greater the weight value may be assigned. In particular, equation (4) may be employed to determine the weight value for the field.

$$W_j = \frac{Q_j}{\sum Q_j} \qquad \text{Formula (4)}$$

Wherein $W_j$ is the weight value of the j-th field; $Q_j$ is the discrimination between the plurality of words corresponding to the j-th field; and $\sum Q_j$ is the total discrimination of all fields.

For example, taking the field discriminations shown in Table 5, Table 6 shows the weight values of the fields of the sample character strings. The weight value of the word size field is 5/8, and the weight value of the organization form field is 3/8.

Table 6: example of weight values for fields of a sample string

	Word size field	Organization form field
Repetition rate of fields	1/6	1/2
Discrimination of fields	5/6	1/2
Weight value of field	5/8	3/8
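The step from field repetition rates to weight values can be sketched as follows. The discrimination follows Formula (3); the weight is assumed to be each field's discrimination divided by the total discrimination, which reproduces the values 5/8 and 3/8 in the table above. The function name `field_weights` and the error handling are illustrative choices, not part of the described method.

```python
from fractions import Fraction


def field_weights(repetition: dict[str, Fraction]) -> dict[str, Fraction]:
    """Turn field repetition rates into normalized weight values."""
    # Formula (3): discrimination is one minus the field repetition rate.
    discrimination = {field: 1 - z for field, z in repetition.items()}
    total = sum(discrimination.values(), Fraction(0))
    if total == 0:
        # Every field is fully repeated, so no weights can be derived.
        raise ValueError("total discrimination is zero")
    # Assumed normalization: each discrimination divided by the total.
    return {field: q / total for field, q in discrimination.items()}


weights = field_weights(
    {"word size": Fraction(1, 6), "organization form": Fraction(1, 2)}
)
print(weights)
```

Because the weights are normalized by the total discrimination, they sum to 1, so a matching degree built from them stays between 0 and 1.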

In other possible implementations, the weight value of the field may be determined in other manners, for example, a person skilled in the art may determine the weight value of the field according to experience and practical situations and by combining the repetition rate of the field.

Weight values determined as described in steps 201 to 204 above can be more accurate and better reflect the importance of the fields, which further improves the accuracy of matching between different character strings.

In the embodiment of the present invention, the matching degree between the first character string and the second character string may be determined using the field weight values determined above. Specifically, based on each word and its corresponding field in the first character string and each word and its corresponding field in the second character string, it may be determined, for each field, whether the words of the first character string and the second character string in that field are the same; if they are the same, the matching degree between the first character string and the second character string may be determined based on the weight value of that field.

For example, suppose the first character string is "A Company" and the second character string is "A Co., Ltd.". In the first character string, "A" corresponds to the word size field and "Company" corresponds to the organization form field; in the second character string, "A" corresponds to the word size field and "Co., Ltd." corresponds to the organization form field. The words corresponding to the word size field (i.e., "A") are therefore the same in the two character strings, while the words corresponding to the organization form field are different. Further, if the weight value of the word size field is 5/8 and the weight value of the organization form field is 3/8, the matching degree of the first character string and the second character string can be determined to be 5/8.

In step 105, the preset threshold may be determined by those skilled in the art based on experience and practical situations, and is not limited specifically.

For example, if the preset threshold is set to 1/2, the first character string is "A Company", and the second character string is "A Co., Ltd.", the matching degree of the two character strings is 5/8 as determined above. Since the matching degree (5/8) is greater than the preset threshold (1/2), it is determined that the first character string and the second character string match.
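Steps 104 and 105 can be sketched together. The text only states that the matching degree is determined from the weight value of a field whose words are the same; summing the weights over all such fields is an assumption made here, chosen because it yields 5/8 in the example above. The function names `matching_degree` and `is_match` are hypothetical.

```python
from fractions import Fraction


def matching_degree(
    first: dict[str, str],
    second: dict[str, str],
    weights: dict[str, Fraction],
) -> Fraction:
    """Sum the weights of the fields whose words are identical in both strings."""
    # Assumption: a field counts only if both strings have it with the same word.
    return sum(
        (
            weight
            for field, weight in weights.items()
            if field in first and first.get(field) == second.get(field)
        ),
        Fraction(0),
    )


def is_match(first, second, weights, threshold) -> bool:
    """Two strings match if the matching degree exceeds the preset threshold."""
    return matching_degree(first, second, weights) > threshold


weights = {"word size": Fraction(5, 8), "organization form": Fraction(3, 8)}
first = {"word size": "A", "organization form": "Company"}
second = {"word size": "A", "organization form": "Co., Ltd."}

degree = matching_degree(first, second, weights)
matched = is_match(first, second, weights, Fraction(1, 2))
print(degree, matched)
```

The comparison is strict, so a matching degree exactly equal to the threshold is treated as not matching.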

Further, if it is determined that the matching degree is less than or equal to a preset threshold, it may be determined that the first character string does not match the second character string.

Based on the same inventive concept, Fig. 3 exemplarily shows a schematic structural diagram of a character string matching apparatus provided by the embodiment of the present invention. As shown in Fig. 3, the apparatus includes an acquisition unit 301, a processing unit 302, and a matching unit 303; wherein,

an acquisition unit 301 configured to acquire a first character string and a second character string;

a processing unit 302, configured to perform word segmentation on the first character string and the second character string, respectively, to obtain words included in the first character string and words included in the second character string; determining fields corresponding to the words contained in the first character string and fields corresponding to the words contained in the second character string according to a preset corresponding relation between the fields and the words; determining the matching degree of the first character string and the second character string according to each word and each corresponding field contained in the first character string, each word and each corresponding field contained in the second character string and the weight value of each field; the weight value of each field is determined according to a plurality of sample character strings;

a matching unit 303, configured to determine that the first character string is matched with the second character string if it is determined that the matching degree is greater than a preset threshold.

In a possible implementation manner, the processing unit 302 is specifically configured to:

In a possible implementation manner, before determining the repetition rate of each field according to the plurality of words corresponding to each field, the processing unit 302 is further configured to:

In a possible implementation manner, the processing unit 302 is specifically configured to:

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method of string matching, the method comprising:

acquiring a first character string and a second character string;

2. The method of claim 1, wherein the weight value of each field is determined by:

3. The method of claim 2, wherein prior to determining the repetition rate for each field based on the plurality of words for each field, the method further comprises:

4. The method of claim 2, wherein determining the weight value of each field according to the repetition rate of each field comprises:

5. An apparatus for matching character strings, the apparatus comprising:

6. The apparatus according to claim 5, wherein the processing unit is specifically configured to:

7. The apparatus of claim 6, wherein the processing unit, prior to determining the repetition rate for each field based on the plurality of words for each field, is further configured to:

8. The apparatus according to claim 6, wherein the processing unit is specifically configured to:

9. A computer-readable storage medium, characterized in that the storage medium stores instructions that, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 4.

10. A computer device, comprising:

a memory for storing program instructions;

a processor for calling program instructions stored in said memory to execute the method of any of claims 1 to 4 in accordance with the obtained program.