CN109165326A - A kind of character string matching method and device - Google Patents


Publication number
CN109165326A
Authority
CN
China
Prior art keywords
field
character string
determining
word
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810936946.7A
Other languages
Chinese (zh)
Inventor
曾伟雄
薛重阳
孟庆文
王维
刘晓东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bee Wisdom (beijing) Technology Co Ltd
Original Assignee
Bee Wisdom (beijing) Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bee Wisdom (beijing) Technology Co Ltd
Priority to CN201810936946.7A
Publication of CN109165326A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a character string matching method and device. The method includes: after a first character string and a second character string are obtained, segmenting the two character strings into words and determining the field corresponding to each word contained in the two character strings; then determining the matching degree between the two character strings according to the weight value of each field; and, if the matching degree is greater than a preset threshold, determining that the two character strings match. The weight value of each field can be determined from sample character strings. In this way, setting weight values for different fields can improve the accuracy of matching between different character strings. Further, compared with the manual comparison used in the prior art, the embodiments of the invention require no manual comparison, which effectively reduces labor cost, simplifies the operation of matching enterprise names, and can shorten the matching time.

Description

Character string matching method and device
Technical Field
The invention relates to the field of data science, in particular to a character string matching method and device.
Background
Enterprise name matching is an important technology in the field of risk control. For example, in the financial industry, and particularly in the credit industry, a customer is usually required to fill in an enterprise name for risk management purposes, and the enterprise name filled in by the customer then needs to be matched. For example, the enterprise name filled in by the customer can be matched against the enterprise name listed in the credit investigation report to check whether the customer has worked at that enterprise; alternatively, it can be compared with the enterprise names filled in by other customers to check whether any of the customer's colleagues are also customers of the same institution.
In the prior art, enterprise names are usually matched by manual comparison, that is, a person judges whether different enterprise names match. Obviously, this approach has a high labor cost, is cumbersome to operate, and is time-consuming.
Based on this, a character string matching method is needed to solve the problem of high labor cost caused by manual comparison in the prior art.
Disclosure of Invention
The embodiment of the invention provides a character string matching method and device, which aim to solve the technical problem of high labor cost caused by matching character strings through manual comparison in the prior art.
The embodiment of the invention provides a character string matching method, which comprises the following steps:
acquiring a first character string and a second character string;
dividing words of the first character string and the second character string respectively to obtain words contained in the first character string and words contained in the second character string;
determining fields corresponding to the words contained in the first character string and fields corresponding to the words contained in the second character string according to a preset corresponding relation between the fields and the words;
determining the matching degree of the first character string and the second character string according to each word and each corresponding field contained in the first character string, each word and each corresponding field contained in the second character string and the weight value of each field; the weight value of each field is determined according to a plurality of sample character strings;
and if the matching degree is larger than a preset threshold value, determining that the first character string is matched with the second character string.
Therefore, the matching accuracy between different character strings can be improved by setting the weight values of different fields; further, compared with a manual comparison mode in the prior art, the embodiment of the invention does not need manual comparison, effectively reduces the labor cost, can simplify the operation of matching enterprise names, and can also shorten the matching time.
In one possible implementation, the weight value of each field is determined by:
performing word segmentation on each sample character string to obtain each word contained in each sample character string;
determining each field corresponding to each word contained in each sample character string according to the corresponding relation between the field and the word;
determining the repetition rate of each field according to the repetition rate of the word corresponding to each field;
and determining the weight value of each field according to the repetition rate of each field.
Because the weight value of each field is determined from the repetition rate of that field, the determined weight value is more accurate and better reflects the importance of the field, which further improves the accuracy of matching between different character strings.
In one possible implementation, before determining the repetition rate of each field according to the words corresponding to each field, the method further includes:
determining, for each field, the repetition rate of each word corresponding to the field among the words corresponding to the field.
In one possible implementation manner, determining the weight value of each field according to the repetition rate of each field includes:
determining the discrimination between a plurality of words corresponding to each field according to the repetition rate of each field;
determining the total discrimination corresponding to all the fields according to the discrimination between the plurality of words corresponding to each field;
and determining the weight value of the field according to the discrimination of each field and the total discrimination corresponding to all the fields.
The embodiment of the invention provides a character string matching device, which comprises:
an acquisition unit configured to acquire a first character string and a second character string;
the processing unit is used for segmenting the first character string and the second character string respectively to obtain each word contained in the first character string and each word contained in the second character string; determining fields corresponding to the words contained in the first character string and fields corresponding to the words contained in the second character string according to a preset corresponding relation between the fields and the words; determining the matching degree of the first character string and the second character string according to each word and each corresponding field contained in the first character string, each word and each corresponding field contained in the second character string and the weight value of each field; the weight value of each field is determined according to a plurality of sample character strings;
and the matching unit is used for determining that the first character string is matched with the second character string if the matching degree is greater than a preset threshold value.
In a possible implementation manner, the processing unit is specifically configured to:
performing word segmentation on each sample character string to obtain each word contained in each sample character string; determining each field corresponding to each word contained in each sample character string according to the corresponding relation between the field and the word; determining the repetition rate of each field according to the repetition rate of the word corresponding to each field; and determining the weight value of each field according to the repetition rate of each field.
In a possible implementation manner, before determining the repetition rate of each field according to the words corresponding to each field, the processing unit is further configured to:
determining, for each field, the repetition rate of each word corresponding to the field among the words corresponding to the field.
In a possible implementation manner, the processing unit is specifically configured to:
determining the discrimination between a plurality of words corresponding to each field according to the repetition rate of each field; determining the total discrimination corresponding to all the fields according to the discrimination between the plurality of words corresponding to each field; and determining the weight value of the fields according to the discrimination of each field and the total discrimination corresponding to all the fields.
The embodiment of the present application further provides an apparatus, which has a function of implementing the above-described character string matching method. This function may be implemented by hardware executing corresponding software, and in one possible design, the apparatus includes: a processor, a transceiver, a memory; the memory is used for storing computer execution instructions, the transceiver is used for realizing the communication between the device and other communication entities, the processor and the memory are connected through the bus, and when the device runs, the processor executes the computer execution instructions stored in the memory so as to enable the device to execute the character string matching method described above.
An embodiment of the present invention further provides a computer storage medium, where a software program is stored, and when the software program is read and executed by one or more processors, the method for matching a character string described in the foregoing various possible implementation manners is implemented.
Embodiments of the present invention also provide a computer program product containing instructions, which when run on a computer, cause the computer to execute the string matching method described in the above-mentioned various possible implementation manners.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings that are required to be used in the description of the embodiments will be briefly described below.
Fig. 1 is a schematic flow chart of a character string matching method according to an embodiment of the present invention;
Fig. 2 is a flowchart illustrating a method for determining a weight value of a field according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a character string matching apparatus according to an embodiment of the present invention.
Detailed Description
The present application will be described in detail below with reference to the accompanying drawings, and the specific operation methods in the method embodiments can also be applied to the apparatus embodiments.
In the prior art, besides manual comparison, a computer can also be used to perform full-character matching when determining whether two enterprise names match. However, this approach may produce misjudgments. For example, if the two enterprise names to be matched are "Lenovo company" and "Baidu company", the existing full-character matching approach finds that the word "company" in the character string "Lenovo company" is identical to the word "company" in the character string "Baidu company", and may therefore conclude that "Lenovo company" and "Baidu company" match. Obviously, this conclusion is wrong.
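The misjudgment described above can be illustrated with a minimal sketch. It assumes, as one reading of the example, that the naive matcher reports a match whenever the two names share any word; the function name `naive_word_overlap_match` is hypothetical, and the two enterprise names are rendered here as "Lenovo company" and "Baidu company".

```python
def naive_word_overlap_match(first_words, second_words):
    """Return True if the two word lists share at least one word.

    This is a deliberately naive stand-in for the prior-art behaviour
    described in the text, not an implementation taken from the patent.
    """
    return bool(set(first_words) & set(second_words))


# Both names contain the generic word "company", so the naive matcher
# reports a match even though the enterprises are different.
lenovo = ["Lenovo", "company"]
baidu = ["Baidu", "company"]
false_positive = naive_word_overlap_match(lenovo, baidu)
```

The shared word carries almost no identifying information, which is the motivation for weighting fields differently in the method that follows.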
Based on this, an embodiment of the present invention provides a method for matching a character string, as shown in fig. 1, which is a schematic flow chart of the method for matching a character string provided in the embodiment of the present invention, and specifically includes the following steps:
Step 101, acquiring a first character string and a second character string.
Step 102, performing word segmentation on the first character string and the second character string respectively to obtain words contained in the first character string and words contained in the second character string.
Step 103, determining fields corresponding to the words contained in the first character string and fields corresponding to the words contained in the second character string according to a preset corresponding relationship between the fields and the words.
Step 104, determining the matching degree of the first character string and the second character string according to each word and each corresponding field contained in the first character string, each word and each corresponding field contained in the second character string and the weight value of each field.
Step 105, if the matching degree is determined to be greater than a preset threshold value, determining that the first character string is matched with the second character string.
Therefore, the matching accuracy between different character strings can be improved by setting the weight values of different fields; further, compared with a manual comparison mode in the prior art, the embodiment of the invention does not need manual comparison, effectively reduces the labor cost, can simplify the operation of matching enterprise names, and can also shorten the matching time.
Specifically, in step 101, character string matching may be applied to matching between two character strings or among a plurality of character strings. A character string is composed of a plurality of characters, and the number of characters can be determined according to specific requirements. The content of the character strings can also be determined according to specific requirements; for example, if character strings of the enterprise name class need to be matched, the acquired character strings may be "Lenovo company", "Baidu company", and the like.
In step 102, a character string usually includes a plurality of characters, and, considering the relationships between the characters, the character string may first be segmented into words. For example, if the first character string is "Beijing XXX Information Technology, Inc.", it can be divided into four words: "Beijing", "XXX", "Information Technology", and "Inc."; if the second character string is "Shanghai XXX Information Technology, Inc.", it can likewise be divided into four words: "Shanghai", "XXX", "Information Technology", and "Inc.".
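The text does not specify a segmentation algorithm. As a minimal sketch only, the following assumes a greedy longest-phrase match over space-separated English renderings of the names, with a small hypothetical phrase vocabulary (`PHRASES`); a real system working on Chinese text would need a proper word segmenter.

```python
def segment(name, phrases, max_len=3):
    """Greedy longest-phrase segmentation over whitespace tokens.

    `phrases` is a hypothetical vocabulary of multi-token words;
    any token that starts no known phrase becomes a word on its own.
    """
    tokens = name.replace(",", " ").split()
    words = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, falling back to one token.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if size == 1 or candidate in phrases:
                words.append(candidate)
                i += size
                break
    return words


PHRASES = {"Information Technology"}  # illustrative vocabulary only
first_words = segment("Beijing XXX Information Technology, Inc.", PHRASES)
second_words = segment("Shanghai XXX Information Technology, Inc.", PHRASES)
```

Under these assumptions each name is split into the four words listed in the example.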
In step 103, the fields corresponding to a character string can be determined according to the type of the character string to be matched. Taking character strings of the enterprise name class as an example, an enterprise name generally consists, in order, of an administrative division, a word size (that is, the trade name), an industry, and an organization form, unless laws and regulations provide otherwise.
Further, according to the preset corresponding relationship between the fields and the words, the fields corresponding to the words contained in the character string can be determined. Specifically, the correspondence of the field to the word may be determined by:
(1) The "administrative division" field in a character string of the enterprise name class may be the name or place name of the administrative division, at or above the county level, in which the enterprise is located. In some special cases, a country name may also be treated as an administrative division. That is, the "administrative division" field corresponds to words such as place names, administrative district names, and country names.
(2) The "word size" field in a character string of the enterprise name class may consist of two or more Chinese characters. The name of a natural-person investor may be used as the word size, but the name of an administrative division may not (except where the place name of an administrative division at or above the county level has another meaning). That is, the "word size" field corresponds to words such as personal names and brand names.
(3) The "industry" field in a character string of the enterprise name class should be a term reflecting the national economic industry to which the enterprise's economic activities belong, or the enterprise's business characteristics, and the content expressed by the "industry" field should be consistent with the enterprise's business scope. Where the economic activities of an enterprise fall into different major categories of the national economic industry classification, the industry in the enterprise name is expressed using the term for the category to which its main economic activities belong. That is, the "industry" field corresponds to words such as industry categories and business characteristics.
(4) According to relevant laws and regulations, an enterprise organized as a company can apply for organization forms such as "limited company", "limited liability company", and "stock limited company"; an enterprise that is not a company can apply for organization forms such as "factory", "store", "department", and "center". That is, the "organization form" field corresponds to words such as "limited company", "limited liability company", "stock limited company", "factory", "store", "department", and "center".
For example, if the first character string is "Beijing AA Information Technology Company Limited" and the second character string is "Beijing AA Technology Co Ltd", Table 1 shows an example of the field corresponding to each word contained in the two character strings.
Table 1: example of fields corresponding to words contained in character strings

| Character string | Administrative division | Word size | Industry | Organization form |
| --- | --- | --- | --- | --- |
| First character string | Beijing | AA | Information technology | Company Limited |
| Second character string | Beijing | AA | Technology | Co Ltd |
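The field lookup of step 103 can be sketched as a dictionary from words to fields. The mapping `FIELD_OF_WORD` below is a tiny hypothetical excerpt of the preset correspondence, and treating any unknown word as the word size is an assumption made for this illustration rather than something stated in the text.

```python
# Hypothetical excerpt of the preset field/word correspondence.
FIELD_OF_WORD = {
    "Beijing": "administrative division",
    "Information technology": "industry",
    "Technology": "industry",
    "Company Limited": "organization form",
    "Co Ltd": "organization form",
}


def assign_fields(words, field_of_word=FIELD_OF_WORD, default_field="word size"):
    """Map each field name to the word of the character string in that field.

    Assumption: a word missing from the mapping is the word size (trade name).
    """
    return {field_of_word.get(word, default_field): word for word in words}


first_fields = assign_fields(["Beijing", "AA", "Information technology", "Company Limited"])
second_fields = assign_fields(["Beijing", "AA", "Technology", "Co Ltd"])
```

With these inputs the two results correspond to the two rows of Table 1.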
In step 104, the weight value of each field may be determined according to a plurality of sample character strings. Specifically, fig. 2 is a flowchart of the method for determining the weight value of a field provided in the embodiment of the present invention, which includes the following steps:
Step 201, performing word segmentation on each sample character string to obtain each word contained in each sample character string.
Before step 201 is executed, data cleaning may be performed on the sample character strings. Specific data cleaning methods include, but are not limited to, unifying the character set, deleting non-Chinese characters, and deleting duplicate character strings.
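A minimal sketch of such a cleaning pass follows. It assumes that unifying the character set can be approximated by Unicode NFKC normalization and that "Chinese characters" means the basic CJK Unified Ideographs range; both are assumptions of this illustration, and the function name `clean_samples` and the sample inputs are hypothetical.

```python
import re
import unicodedata

# Assumption: only the basic CJK Unified Ideographs block counts as Chinese.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def clean_samples(samples):
    """Normalize, strip non-Chinese characters, and drop duplicate strings."""
    seen = set()
    cleaned = []
    for raw in samples:
        text = unicodedata.normalize("NFKC", raw)  # unify character forms
        text = NON_CHINESE.sub("", text)           # delete non-Chinese characters
        if text and text not in seen:              # delete repeated strings
            seen.add(text)
            cleaned.append(text)
    return cleaned


cleaned = clean_samples(["百度公司", "百度 公司！", "联想集团（1）"])
```

The second input collapses to the same string as the first after cleaning, so only two samples remain.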
Further, the manner of segmenting the sample character string may refer to what is described in the above step 102, and is not described in detail here.
Step 202, determining each field corresponding to each word contained in each sample character string according to the corresponding relationship between the field and the word.
The method for determining each field corresponding to each word included in the sample character string may refer to what is described in step 103, and is not described in detail here.
For example, Table 2 shows an example of the fields corresponding to sample character strings. Sample 1 is "Linkage Advantage company", whose word in the word size field is "Linkage Advantage" and whose word in the organization form field is "company"; sample 2 is "Lenovo group", whose word in the word size field is "Lenovo" and whose word in the organization form field is "group"; sample 3 is "Lenovo company", whose word in the word size field is "Lenovo" and whose word in the organization form field is "company"; sample 4 is "Baidu company", whose word in the word size field is "Baidu" and whose word in the organization form field is "company".
Table 2: example of fields corresponding to sample character strings

| Number | Sample character string | Word size field | Organization form field |
| --- | --- | --- | --- |
| Sample 1 | Linkage Advantage company | Linkage Advantage | company |
| Sample 2 | Lenovo group | Lenovo | group |
| Sample 3 | Lenovo company | Lenovo | company |
| Sample 4 | Baidu company | Baidu | company |
Step 203, determining the repetition rate of each field according to the repetition rate of the word corresponding to each field.
Before step 203 is executed, the repetition rate of each word within its field may be determined from the words corresponding to that field. Specifically, the repetition rate of a word may be determined using formula (1).
Wherein $C_i$ is the repetition rate of the i-th word corresponding to the field; $n_{ci}$ is the number of times the i-th word recurs in the field; and $N$ is the total number of words corresponding to the field.
For example, taking the sample character strings shown in Table 2, Table 3 shows the repetition rate of each word in the sample character strings. In the word size field, the repetition rate of "Linkage Advantage" is 0, that of "Lenovo" is 1/3, and that of "Baidu" is 0; in the organization form field, the repetition rate of "company" is 2/3 and that of "group" is 0.
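Table 3: example of repetition rate of words in sample character strings

| Field | Word | Repetition rate |
| --- | --- | --- |
| Word size field | Linkage Advantage | 0 |
| Word size field | Lenovo | 1/3 |
| Word size field | Baidu | 0 |
| Organization form field | company | 2/3 |
| Organization form field | group | 0 |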
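The word repetition rates quoted for Table 3 can be reproduced with a short sketch. Formula (1) itself is not available in this text, so the code assumes that a word's repetition rate is the number of its additional occurrences divided by the number of other words in the field, that is (count - 1) / (N - 1). This assumption is chosen only because it yields the stated values 0, 1/3, and 2/3; the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def word_repetition_rates(field_words):
    """Repetition rate of each distinct word among the words of one field.

    Assumed form: (occurrences - 1) / (total words - 1).
    """
    total = len(field_words)
    counts = Counter(field_words)
    if total < 2:
        return {word: Fraction(0) for word in counts}
    return {word: Fraction(count - 1, total - 1) for word, count in counts.items()}


# Word size and organization form columns of the four samples in Table 2.
word_size_rates = word_repetition_rates(
    ["Linkage Advantage", "Lenovo", "Lenovo", "Baidu"])
organization_rates = word_repetition_rates(
    ["company", "group", "company", "company"])
```

Exact fractions are used so that the results can be compared directly with the values in the text.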
Further, there are various methods for determining the repetition rate of each field. In one example, the average of the repetition rates of the words corresponding to a field may be used as the repetition rate of that field. Specifically, the repetition rate of a field may be determined using formula (2).
$Z_j = \frac{1}{N_j} \sum_{i=1}^{N_j} C_i$    Formula (2)
Wherein $Z_j$ is the repetition rate of the j-th field; $C_i$ is the repetition rate of the i-th word corresponding to the j-th field; and $N_j$ is the total number of words corresponding to the j-th field.
For example, taking the word repetition rates shown in Table 3, Table 4 shows the field repetition rates of the sample character strings: the repetition rate of the word size field is 1/6, and that of the organization form field is 1/2.
Table 4: example of repetition rate of fields of sample character strings

| | Word size field | Organization form field |
| --- | --- | --- |
| Repetition rate of field | 1/6 | 1/2 |
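The averaging step can be sketched as follows. The code assumes the average is taken over every word occurrence in the field (four per field in the samples), since that interpretation reproduces the values 1/6 and 1/2 in Table 4; the function name and the inline rate dictionaries are illustrative.

```python
from fractions import Fraction


def field_repetition_rate(field_words, word_rates):
    """Average the word repetition rates over all word occurrences of a field."""
    total = sum((word_rates[word] for word in field_words), Fraction(0))
    return total / len(field_words)


# Word repetition rates as stated in the text for the four samples.
word_size_rate = field_repetition_rate(
    ["Linkage Advantage", "Lenovo", "Lenovo", "Baidu"],
    {"Linkage Advantage": Fraction(0), "Lenovo": Fraction(1, 3), "Baidu": Fraction(0)},
)
organization_rate = field_repetition_rate(
    ["company", "group", "company", "company"],
    {"company": Fraction(2, 3), "group": Fraction(0)},
)
```

Averaging over distinct words instead would give 1/9 and 1/3, which does not agree with the table, hence the choice made here.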
In other possible examples, the repetition rate of the field may also be determined by using other methods, for example, the repetition rate of the field is determined according to the repetition rate of the word corresponding to the field and a preset coefficient, which is not limited specifically.
Step 204, determining the weight value of each field according to the repetition rate of each field.
In the embodiment of the present invention, there are various ways to determine the weight values of the fields, and one possible implementation manner is to determine the discrimination between the words corresponding to each field according to the repetition rate of each field, then determine the total discrimination corresponding to all the fields according to the discrimination between the words corresponding to each field, and further determine the weight values of the fields according to the discrimination of each field and the total discrimination corresponding to all the fields.
Specifically, the repetition rate of a field is positively correlated with the repetition rates of the words corresponding to the field: the higher the repetition rate of the field, the more often the words corresponding to the field are repeated, and the harder it is to distinguish between those words. In other words, the repetition rate of a field is negatively correlated with the discrimination between the words corresponding to the field. Accordingly, the discrimination between the words corresponding to a field may be determined using formula (3).
$Q_j = 1 - Z_j$    Formula (3)
Wherein $Q_j$ is the discrimination between the words corresponding to the j-th field, and $Z_j$ is the repetition rate of the j-th field.
For example, taking the field repetition rates shown in Table 4, Table 5 shows the discrimination of the fields of the sample character strings: the discrimination of the word size field is 5/6, and that of the organization form field is 1/2.
Table 5: example of discrimination of fields of sample character strings

| | Word size field | Organization form field |
| --- | --- | --- |
| Repetition rate of field | 1/6 | 1/2 |
| Discrimination of field | 5/6 | 1/2 |
Further, the total discrimination of all fields may be the sum of the discriminations of the individual fields; for the fields shown in Table 5, the total discrimination is 5/6 + 1/2 = 8/6.
Further, the greater the discrimination of a field, the greater the weight value that may be assigned to it. In particular, formula (4) may be used to determine the weight value of a field.
$W_j = \frac{Q_j}{\sum Q_j}$    Formula (4)
Wherein $W_j$ is the weight value of the j-th field; $Q_j$ is the discrimination between the words corresponding to the j-th field; and $\sum Q_j$ is the total discrimination of all fields.
For example, taking the field discriminations shown in Table 5, Table 6 shows the weight values of the fields of the sample character strings: the weight value of the word size field is 5/8, and that of the organization form field is 3/8.
Table 6: example of weight values of fields of sample character strings

| | Word size field | Organization form field |
| --- | --- | --- |
| Repetition rate of field | 1/6 | 1/2 |
| Discrimination of field | 5/6 | 1/2 |
| Weight value of field | 5/8 | 3/8 |
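The chain from field repetition rate to weight value can be sketched in a few lines. The code applies the discrimination 1 - Z from formula (3) and then normalizes each discrimination by the total, which matches the worked values 5/8 and 3/8; the function name `field_weights` and the error handling are illustrative choices, not part of the source.

```python
from fractions import Fraction


def field_weights(field_repetition_rates):
    """Turn field repetition rates into weight values that sum to 1."""
    # Discrimination of each field: Q = 1 - Z.
    discrimination = {f: 1 - z for f, z in field_repetition_rates.items()}
    total = sum(discrimination.values())
    if total == 0:
        raise ValueError("all fields have zero discrimination")
    # Weight of each field: W = Q / total discrimination.
    return {f: q / total for f, q in discrimination.items()}


weights = field_weights({
    "word size": Fraction(1, 6),
    "organization form": Fraction(1, 2),
})
```

Because the weights are normalized by the total discrimination, they always sum to 1, so the matching degree of two identical names is 1.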
In other possible implementations, the weight value of the field may be determined in other manners, for example, a person skilled in the art may determine the weight value of the field according to experience and practical situations and by combining the repetition rate of the field.
By determining the weight values of the fields as described in steps 201 to 204 above, the determined weight values can be more accurate and better reflect the importance of the fields, which further improves the accuracy of matching between different character strings.
In the embodiment of the present invention, the matching degree between the first character string and the second character string may be determined using the weight values of the fields determined above. Specifically, for each field, whether the word of the first character string in that field is the same as the word of the second character string in that field may be determined according to the words and corresponding fields of the first character string and the words and corresponding fields of the second character string; if the words are the same, the matching degree between the first character string and the second character string may be determined according to the weight value of that field.
For example, suppose the first character string is "company A" and the second character string is "company A limited". In the first character string, "A" corresponds to the word size field and "company" corresponds to the organization form field; in the second character string, "A" corresponds to the word size field and "company limited" corresponds to the organization form field. The two character strings therefore have the same word (namely "A") in the word size field and different words in the organization form field. If the weight value of the word size field is 5/8 and that of the organization form field is 3/8, the matching degree of the first character string and the second character string is 5/8.
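The computation in this example can be sketched as below. The text only shows a case with a single matching field, so summing the weight values of all fields whose words agree is an assumption of this sketch; the function name and the dictionary layout (field name to word) are hypothetical.

```python
from fractions import Fraction


def matching_degree(first_fields, second_fields, weights):
    """Sum the weight values of the fields in which both strings have the same word."""
    degree = Fraction(0)
    for field, weight in weights.items():
        first_word = first_fields.get(field)
        if first_word is not None and first_word == second_fields.get(field):
            degree += weight
    return degree


weights = {"word size": Fraction(5, 8), "organization form": Fraction(3, 8)}
first = {"word size": "A", "organization form": "company"}
second = {"word size": "A", "organization form": "company limited"}
degree = matching_degree(first, second, weights)
```

Only the word size field agrees here, so the result equals the weight value of that field.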
In step 105, the preset threshold may be determined by those skilled in the art based on experience and practical situations, and is not limited specifically.
For example, if the preset threshold is set to 1/2, the first character string is "company a", and the second character string is "company a limited", it can be determined from the above that the matching degree of the first character string and the second character string is 5/8, and the matching degree (i.e., 5/8) is greater than the preset threshold (1/2), and it is determined that the first character string and the second character string match.
Further, if it is determined that the matching degree is less than or equal to a preset threshold, it may be determined that the first character string does not match the second character string.
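The decision in step 105 reduces to a strict comparison against the threshold. The sketch below uses the threshold 1/2 from the example as a default; the function name `is_match` is illustrative.

```python
from fractions import Fraction


def is_match(degree, threshold=Fraction(1, 2)):
    """Return True only if the matching degree is strictly greater than the threshold."""
    return degree > threshold


matched = is_match(Fraction(5, 8))       # degree from the "company A" example
borderline = is_match(Fraction(1, 2))    # equal to the threshold
```

A degree equal to the threshold counts as no match, in line with the "less than or equal to" rule stated above.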
Based on the same inventive concept, fig. 3 exemplarily shows a schematic structural diagram of a character string matching apparatus provided by the embodiment of the present invention. As shown in fig. 3, the apparatus includes an acquiring unit 301, a processing unit 302, and a matching unit 303, wherein:
an acquiring unit 301 configured to acquire a first character string and a second character string;
a processing unit 302, configured to perform word segmentation on the first character string and the second character string, respectively, to obtain words included in the first character string and words included in the second character string; determining fields corresponding to the words contained in the first character string and fields corresponding to the words contained in the second character string according to a preset corresponding relation between the fields and the words; determining the matching degree of the first character string and the second character string according to each word and each corresponding field contained in the first character string, each word and each corresponding field contained in the second character string and the weight value of each field; the weight value of each field is determined according to a plurality of sample character strings;
a matching unit 303, configured to determine that the first character string is matched with the second character string if it is determined that the matching degree is greater than a preset threshold.
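The division of work among the three units can be illustrated with a small Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the patent: the class name `StringMatcher`, whitespace splitting as a stand-in for word segmentation, the word-to-field table, the weight values, and the scoring rule (a field contributes its weight when both strings contain the same words for that field).

```python
from fractions import Fraction


class StringMatcher:
    """Hypothetical stand-in for the apparatus of fig. 3 (units 301 to 303)."""

    def __init__(self, field_of_word, field_weights, threshold):
        self.field_of_word = field_of_word  # preset word -> field correspondence
        self.field_weights = field_weights  # field -> weight value
        self.threshold = threshold          # preset threshold

    def acquire(self, first, second):
        """Acquiring unit 301: obtain the two character strings."""
        return first, second

    def _fields(self, text):
        # Assumption: whitespace splitting stands in for real word segmentation.
        fields = {}
        for word in text.lower().split():
            field = self.field_of_word.get(word)
            if field is not None:
                fields.setdefault(field, set()).add(word)
        return fields

    def process(self, first, second):
        """Processing unit 302: segment, map words to fields, compute the degree."""
        a, b = self._fields(first), self._fields(second)
        # Assumption: a field contributes its weight only when both strings
        # contain exactly the same words for that field.
        return sum(
            (self.field_weights.get(f, Fraction(0)) for f in a if a[f] == b.get(f)),
            Fraction(0),
        )

    def match(self, degree):
        """Matching unit 303: strict comparison against the preset threshold."""
        return degree > self.threshold


# Illustrative table and weights (made up for this sketch).
matcher = StringMatcher(
    field_of_word={"beijing": "region", "acme": "name",
                   "technology": "industry", "limited": "form"},
    field_weights={"region": Fraction(1, 8), "name": Fraction(1, 2),
                   "industry": Fraction(1, 4), "form": Fraction(1, 8)},
    threshold=Fraction(1, 2),
)
first, second = matcher.acquire("Beijing Acme Technology",
                                "Beijing Acme Technology Limited")
degree = matcher.process(first, second)
print(degree, matcher.match(degree))
```

Keeping the three responsibilities in separate methods mirrors the unit structure of the apparatus, so the segmentation or scoring rule could be replaced without touching the threshold decision.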
In a possible implementation manner, the processing unit 302 is specifically configured to:
performing word segmentation on each sample character string to obtain each word contained in each sample character string; determining each field corresponding to each word contained in each sample character string according to the corresponding relation between the field and the word; determining the repetition rate of each field according to the repetition rate of the word corresponding to each field; and determining the weight value of each field according to the repetition rate of each field.
In a possible implementation manner, before determining the repetition rate of each field according to the repetition rate of the words corresponding to each field, the processing unit 302 is further configured to:
determining, according to the words corresponding to each field, the repetition rate of any word corresponding to the field among the words corresponding to that field.
In a possible implementation manner, the processing unit 302 is specifically configured to:
determining the discrimination between a plurality of words corresponding to each field according to the repetition rate of each field; determining the total discrimination corresponding to all the fields according to the discrimination between the plurality of words corresponding to each field; and determining the weight value of the fields according to the discrimination of each field and the total discrimination corresponding to all the fields.
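The chain from repetition rate to discrimination to weight value can be sketched as follows. The formulas here are assumptions chosen for illustration, since this passage does not state them: the repetition rate of a word is taken as its share of all word occurrences in its field, the repetition rate of a field as the sum of the squared shares, the discrimination as one minus the field repetition rate, and the weight as the field's discrimination divided by the total discrimination. The function name `field_weights` and the sample data are likewise hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def field_weights(samples, field_of_word):
    """Derive a weight per field from sample strings (illustrative formulas)."""
    # Count how often each word occurs within its field across all samples.
    words_by_field = defaultdict(Counter)
    for sample in samples:
        for word in sample.lower().split():  # assumption: whitespace segmentation
            field = field_of_word.get(word)
            if field is not None:
                words_by_field[field][word] += 1

    discrimination = {}
    for field, counts in words_by_field.items():
        total = sum(counts.values())
        # Assumed field repetition rate: sum of squared word shares.
        repetition = sum(Fraction(c, total) ** 2 for c in counts.values())
        # Assumed discrimination: the less a field repeats, the more it tells apart.
        discrimination[field] = 1 - repetition

    total_discrimination = sum(discrimination.values())
    if total_discrimination == 0:
        raise ValueError("no field distinguishes the sample strings")
    return {f: d / total_discrimination for f, d in discrimination.items()}


samples = [
    "beijing acme limited",
    "beijing bolt limited",
    "shanghai core limited",
    "shanghai dune limited",
]
table = {
    "beijing": "region", "shanghai": "region",
    "acme": "name", "bolt": "name", "core": "name", "dune": "name",
    "limited": "form",
}
weights = field_weights(samples, table)
print(weights)
```

Under these assumptions, a field whose word is identical in every sample (here the organizational form "limited") receives a weight of zero, while the field with the most varied words (the name) receives the largest weight, and the weights of all fields sum to 1.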
An embodiment of the present application further provides an apparatus that has a function of implementing the character string matching method described above. This function may be implemented by hardware executing corresponding software. In one possible design, the apparatus includes a processor, a transceiver, and a memory. The memory is used for storing computer-executable instructions, the transceiver is used for communication between the apparatus and other communication entities, and the processor and the memory are connected through a bus. When the apparatus runs, the processor executes the computer-executable instructions stored in the memory, so that the apparatus performs the character string matching method described above.
An embodiment of the present invention further provides a computer storage medium, where a software program is stored, and when the software program is read and executed by one or more processors, the method for matching a character string described in the foregoing various possible implementation manners is implemented.
Embodiments of the present invention also provide a computer program product containing instructions, which when run on a computer, cause the computer to execute the string matching method described in the above-mentioned various possible implementation manners.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A method of string matching, the method comprising:
acquiring a first character string and a second character string;
performing word segmentation on the first character string and the second character string respectively to obtain words contained in the first character string and words contained in the second character string;
determining fields corresponding to the words contained in the first character string and fields corresponding to the words contained in the second character string according to a preset corresponding relation between the fields and the words;
determining the matching degree of the first character string and the second character string according to each word and each corresponding field contained in the first character string, each word and each corresponding field contained in the second character string and the weight value of each field; the weight value of each field is determined according to a plurality of sample character strings;
and if the matching degree is larger than a preset threshold value, determining that the first character string is matched with the second character string.
2. The method of claim 1, wherein the weight value of each field is determined by:
performing word segmentation on each sample character string to obtain each word contained in each sample character string;
determining each field corresponding to each word contained in each sample character string according to the corresponding relation between the field and the word;
determining the repetition rate of each field according to the repetition rate of the word corresponding to each field;
and determining the weight value of each field according to the repetition rate of each field.
3. The method of claim 2, wherein prior to determining the repetition rate for each field based on the plurality of words for each field, the method further comprises:
and determining the repetition rate of any word corresponding to each field in the words corresponding to the field according to the words corresponding to each field.
4. The method of claim 2, wherein determining the weight value of each field according to the repetition rate of each field comprises:
determining the discrimination between a plurality of words corresponding to each field according to the repetition rate of each field;
determining the total discrimination corresponding to all the fields according to the discrimination between the plurality of words corresponding to each field;
and determining the weight value of the field according to the discrimination of each field and the total discrimination corresponding to all the fields.
5. An apparatus for matching character strings, the apparatus comprising:
an acquisition unit configured to acquire a first character string and a second character string;
the processing unit is used for segmenting the first character string and the second character string respectively to obtain each word contained in the first character string and each word contained in the second character string; determining fields corresponding to the words contained in the first character string and fields corresponding to the words contained in the second character string according to a preset corresponding relation between the fields and the words; determining the matching degree of the first character string and the second character string according to each word and each corresponding field contained in the first character string, each word and each corresponding field contained in the second character string and the weight value of each field; the weight value of each field is determined according to a plurality of sample character strings;
and the matching unit is used for determining that the first character string is matched with the second character string if the matching degree is greater than a preset threshold value.
6. The apparatus according to claim 5, wherein the processing unit is specifically configured to:
performing word segmentation on each sample character string to obtain each word contained in each sample character string; determining each field corresponding to each word contained in each sample character string according to the corresponding relation between the field and the word; determining the repetition rate of each field according to the repetition rate of the word corresponding to each field; and determining the weight value of each field according to the repetition rate of each field.
7. The apparatus of claim 6, wherein the processing unit, prior to determining the repetition rate for each field based on the plurality of words for each field, is further configured to:
and determining the repetition rate of any word corresponding to each field in the words corresponding to the field according to the words corresponding to each field.
8. The apparatus according to claim 6, wherein the processing unit is specifically configured to:
determining the discrimination between a plurality of words corresponding to each field according to the repetition rate of each field; determining the total discrimination corresponding to all the fields according to the discrimination between the plurality of words corresponding to each field; and determining the weight value of the fields according to the discrimination of each field and the total discrimination corresponding to all the fields.
9. A computer-readable storage medium, characterized in that the storage medium stores instructions that, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 4.
10. A computer device, comprising:
a memory for storing program instructions;
a processor for calling program instructions stored in said memory to execute the method of any of claims 1 to 4 in accordance with the obtained program.
CN201810936946.7A 2018-08-16 2018-08-16 A kind of character string matching method and device Pending CN109165326A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810936946.7A CN109165326A (en) 2018-08-16 2018-08-16 A kind of character string matching method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810936946.7A CN109165326A (en) 2018-08-16 2018-08-16 A kind of character string matching method and device

Publications (1)

Publication Number Publication Date
CN109165326A true CN109165326A (en) 2019-01-08

Family

ID=64896089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810936946.7A Pending CN109165326A (en) 2018-08-16 2018-08-16 A kind of character string matching method and device

Country Status (1)

Country Link
CN (1) CN109165326A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427991A (en) * 2019-07-22 2019-11-08 联动优势科技有限公司 A kind of character string matching method and device
CN110750509A (en) * 2019-10-24 2020-02-04 赛诺贝斯(北京)营销技术股份有限公司 Enterprise name duplicate checking method and device, equipment and medium
CN111104795A (en) * 2019-11-19 2020-05-05 平安金融管理学院(中国·深圳) Company name matching method and device, computer equipment and storage medium
CN112954387A (en) * 2021-01-26 2021-06-11 广州欢网科技有限责任公司 Method, system and readable storage medium for updating and optimizing television program list
CN113343076A (en) * 2021-04-23 2021-09-03 山东师范大学 Innovative technology recommendation method and system based on feature matching degree
CN113553360A (en) * 2021-07-30 2021-10-26 北京金堤征信服务有限公司 Multi-enterprise relationship analysis method, device, electronic equipment, storage medium and computer program
CN114297461A (en) * 2021-12-10 2022-04-08 北京羽乐创新科技有限公司 Company information matching method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103309857A (en) * 2012-03-06 2013-09-18 腾讯科技(深圳)有限公司 Method and equipment for determining classified linguistic data
CN103761341A (en) * 2014-02-21 2014-04-30 北京嘉和美康信息技术有限公司 Information matching method and device
CN104268137A (en) * 2013-07-31 2015-01-07 深圳市华傲数据技术有限公司 Method and device for matching pharmaceutical name data
CN104778171A (en) * 2014-01-10 2015-07-15 携程计算机技术(上海)有限公司 Character string matching system and method
CN106033416A (en) * 2015-03-09 2016-10-19 阿里巴巴集团控股有限公司 A string processing method and device
CN106650803A (en) * 2016-12-09 2017-05-10 北京锐安科技有限公司 Method and device for calculating similarity between strings
CN106951415A (en) * 2017-04-01 2017-07-14 银联智策顾问(上海)有限公司 A kind of name of firm searching method and device
CN108363729A (en) * 2018-01-12 2018-08-03 中国平安人寿保险股份有限公司 A kind of string comparison method, device, terminal device and storage medium

Similar Documents

Publication Publication Date Title
CN109165326A (en) A kind of character string matching method and device
US20200065710A1 (en) Normalizing text attributes for machine learning models
CN105808988A (en) Method and device for identifying exceptional account
CN108734405A (en) A kind of data value Evaluation Platform and method
CN110990529B (en) Industry detail dividing method and system for enterprises
CN110019163A (en) Method, system, equipment and the storage medium of prediction, the recommendation of characteristics of objects
CN112948429B (en) Data reporting method, device and equipment
CN111353689B (en) Risk assessment method and device
CN113283675A (en) Index data analysis method, device, equipment and storage medium
CN116975284B (en) Entity relation extraction method and device based on priori knowledge and storage medium
CN106610932A (en) Corpus processing method and device and corpus analyzing method and device
CN113807096A (en) Text data processing method and device, computer equipment and storage medium
CN112418304A (en) OCR (optical character recognition) model training method, system and device
CN108415971B (en) Method and device for recommending supply and demand information by using knowledge graph
CN111221873A (en) Inter-enterprise homonym identification method and system based on associated network
WO2022012380A1 (en) Improved entity resolution of master data using qualified relationship score
CN113705164A (en) Text processing method and device, computer equipment and readable storage medium
CN116743474A (en) Decision tree generation method and device, electronic equipment and storage medium
CN115952156A (en) Data cleaning method and device, computer equipment and readable medium
CN106778048B (en) Data processing method and device
CN115237859A (en) Method, device and equipment for detecting quality of required document and storage medium
CN108629506A (en) Modeling method, device, computer equipment and the storage medium of air control model
CN114580354A (en) Synonym-based information encoding method, device, equipment and storage medium
CN114443493A (en) Test case generation method and device, electronic equipment and storage medium
CN113918709A (en) Industry classification model training method, classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190108