CN104778171A

CN104778171A - Character string matching system and method

Info

Publication number: CN104778171A
Application number: CN201410011078.3A
Authority: CN
Inventors: 叶亚明; 王威振
Original assignee: Ctrip Computer Technology Shanghai Co Ltd
Current assignee: Shanghai Ctrip Business Co Ltd
Priority date: 2014-01-10
Filing date: 2014-01-10
Publication date: 2015-07-15

Abstract

The invention provides a string matching system and method. The string matching system stores a plurality of key dimensions and a plurality of non-key dimensions, each of which is assigned a weight. The system comprises an input module, a word segmentation module, a labeling module, a comparison module, a computing module and an output module. The input module receives two input strings; the word segmentation module segments the two strings into phrases; the labeling module labels each phrase with its corresponding key or non-key dimension; and the comparison module compares the phrases of the two strings. If the two phrases in any key dimension differ, the output module is called to output a string mismatch message; otherwise, the computing module is called to calculate the matching degree between the two strings by a formula, and the output module is called to output that matching degree. The system can calculate the matching degree between strings quickly, flexibly and accurately.

Description

String matching system and method

Technical field

The present invention relates to a string matching system and a string matching method.

Background technology

Because natural language is flexible and naming styles vary, the same thing can be described in different ways; to a computer, these descriptions are simply two different strings. How to determine quickly whether two strings describe the same thing has therefore become a technical problem of practical significance.

Existing methods for computing the degree of association between strings either compare the strings rather mechanically or become mired in the heavy computation of semantic analysis, and so cannot calculate the similarity between strings quickly, flexibly and accurately.

Summary of the invention

The technical problem to be solved by the present invention is to overcome the inability of the prior art to calculate the similarity between strings quickly, flexibly and accurately, by providing a string matching system and method that can do so.

The present invention solves the above technical problem through the following technical solutions:

The invention provides a string matching system, characterized in that it stores a number of key dimensions and a number of non-key dimensions, each key dimension and each non-key dimension having a corresponding weight, and in that the string matching system comprises an input module, a word segmentation module, a labeling module, a comparison module, a computing module and an output module;

The input module is used to receive two input strings;

The word segmentation module is used to segment the two strings into phrases;

The labeling module is used to label each phrase with its corresponding key dimension or non-key dimension;

The comparison module is used to compare the phrases in the two strings. If the two phrases in any key dimension are not identical, it calls the output module to output a string mismatch message; otherwise it calls the computing module. "Otherwise" here covers two cases: either the two phrases in every key dimension are identical, or the two phrases in every key dimension present in both strings are identical while one of the strings lacks a phrase in one or more key dimensions. "Two phrases are identical" means that the two phrases express the same meaning, and is not limited to the two phrases agreeing character for character; similarly, "two phrases are not identical" means that the meanings they express differ;

The computing module is used to calculate the matching degree between the two strings by the formula $P = \frac{\sum_{i=1}^{n} a_i}{B}$, and to call the output module to output this matching degree; where P denotes the matching degree between the two strings, n denotes the number of identical phrases in the two strings, a_i is twice the weight corresponding to the i-th identical phrase in the two strings, and B is the sum of the weights corresponding to every phrase in the two strings.

Preferably, the string matching system further comprises a processing module, which is used to remove stop words from the two strings, correct wrongly written characters in the two strings, and replace pinyin in the two strings with Chinese characters.

Preferably, the string matching system stores a dictionary containing multiple words, and the word segmentation module comprises a division module and a matching module;

The division module is used to divide the two strings;

The matching module is used to match each divided word against all words in the dictionary and, if the match succeeds, to take the divided word as a phrase.
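The division and matching steps above can be sketched roughly as follows. The patent does not name a specific division strategy, so the greedy forward maximum matching used here, the function name `segment`, and the romanized sample dictionary are all assumptions made purely for illustration.

```python
def segment(text, dictionary, max_len=None):
    """Greedy forward maximum matching (an assumed strategy, not taken from the patent)."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    phrases = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                phrases.append(candidate)
                i += size
                break
        else:
            # No dictionary word starts at this position; skip one character.
            i += 1
    return phrases


# Hypothetical dictionary; entries are romanized only for readability.
DICTIONARY = {"shanghai", "xujiahui", "rujia", "express hotel"}
print(segment("shanghaixujiahuirujia", DICTIONARY))
```

Only divided words that are found in the dictionary are kept as phrases, which mirrors the "if the match succeeds" condition; characters that match nothing are simply dropped in this sketch.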

Preferably, the key dimensions and non-key dimensions are defined according to the application domain.

The present invention also provides a string matching method, characterized in that a number of key dimensions and a number of non-key dimensions are stored, each key dimension and each non-key dimension having a corresponding weight, and in that the string matching method comprises the following steps:

S₁, receiving two input strings;

S₂, segmenting the two strings into phrases;

S₃, labeling each phrase with its corresponding key dimension or non-key dimension;

S₄, comparing the phrases in the two strings; if the two phrases in any key dimension are not identical, proceeding to step S₅, otherwise proceeding to step S₆;

S₅, outputting a string mismatch message, and ending the process;

S₆, calculating the matching degree between the two strings by the formula $P = \frac{\sum_{i=1}^{n} a_i}{B}$, outputting this matching degree, and ending the process; where P denotes the matching degree between the two strings, n denotes the number of identical phrases in the two strings, a_i is twice the weight corresponding to the i-th identical phrase in the two strings, and B is the sum of the weights corresponding to every phrase in the two strings.
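A minimal sketch of the control flow of steps S₄ to S₆ may help here. It assumes each string has already been segmented and labeled, holds at most one phrase per dimension, and treats "identical" as plain string equality (the patent allows identity of meaning). The function name `match_strings`, the weights, and the sample phrases are hypothetical.

```python
def match_strings(phrases_a, phrases_b, weights, key_dimensions):
    """phrases_a / phrases_b map dimension -> phrase for one string.

    Returns None for a mismatch (step S5) or the matching degree P (step S6).
    """
    shared = set(phrases_a) & set(phrases_b)
    # Step S4: a differing phrase in any shared key dimension vetoes the match.
    for dim in shared:
        if dim in key_dimensions and phrases_a[dim] != phrases_b[dim]:
            return None
    # Step S6: each identical phrase contributes a_i, twice its weight.
    numerator = sum(2 * weights[d] for d in shared if phrases_a[d] == phrases_b[d])
    # B is the sum of the weights of every phrase in both strings.
    total = sum(weights[d] for d in phrases_a) + sum(weights[d] for d in phrases_b)
    return numerator / total if total else 0.0


# Hypothetical weights and inputs.
weights = {"city": 5, "hotel brand": 10, "hotel name descriptor": 1}
key_dimensions = {"city", "hotel brand"}
a = {"city": "Shanghai", "hotel brand": "BrandX", "hotel name descriptor": "express"}
b = {"city": "Shanghai", "hotel brand": "BrandX"}
c = {"city": "Beijing", "hotel brand": "BrandX"}
print(match_strings(a, b, weights, key_dimensions))
print(match_strings(a, c, weights, key_dimensions))
```

Returning `None` for the mismatch case keeps the veto distinct from a low but valid matching degree.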

Preferably, the following step is included between step S₁ and step S₂:

Removing stop words from the two strings, correcting wrongly written characters in the two strings, and replacing pinyin in the two strings with Chinese characters.
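The preprocessing step can be pictured as three table lookups applied in sequence. The sketch below is only an illustration: the patent does not describe how stop words, wrongly written characters or pinyin are detected, so the three lookup tables, the whitespace-style tokenization, and the function name `preprocess` are assumptions.

```python
import re

STOP_WORDS = {"of", "the"}        # hypothetical stop-word list
TYPO_FIXES = {"hotle": "hotel"}   # hypothetical table of known misspellings
PINYIN_TO_HANZI = {"ru": "如"}    # hypothetical pinyin-to-character table


def preprocess(text):
    """Drop stop words, fix known typos, and replace pinyin with Chinese characters."""
    tokens = re.findall(r"\w+", text.lower())
    cleaned = []
    for token in tokens:
        if token in STOP_WORDS:
            continue  # stop words carry no matching information
        token = TYPO_FIXES.get(token, token)
        token = PINYIN_TO_HANZI.get(token, token)
        cleaned.append(token)
    return cleaned


print(preprocess("the hotle of ru"))
```

A real system working on unsegmented Chinese text would need a different tokenization; the point here is only the order of the three clean-up operations.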

Preferably, the string matching method stores a dictionary containing multiple words, and step S₂ comprises the following steps:

S₂₁, dividing the two strings;

S₂₂, matching each divided word against all words in the dictionary and, if the match succeeds, taking the divided word as a phrase.

On the basis of common knowledge in the art, the above preferred conditions can be combined arbitrarily to obtain the preferred embodiments of the present invention.

The positive effects of the present invention are as follows:

The invention provides a string matching system and method that labels each segmented phrase and compares the phrases in the key dimensions. It applies a "reject on difference first" rule: when the two phrases in any key dimension are not identical, a string mismatch message is output; otherwise the matching degree between the two strings is calculated. The present invention can thus calculate the matching degree between two strings quickly, flexibly and accurately.

Brief description of the drawings

Fig. 1 is a structural block diagram of the string matching system of a preferred embodiment of the present invention.

Fig. 2 is a flow chart of the string matching method of a preferred embodiment of the present invention.

Detailed description of the embodiments

The present invention is further illustrated below by way of an embodiment, but the invention is not thereby limited to the scope of the described embodiment.

As shown in Fig. 1, this embodiment provides a string matching system that stores a number of key dimensions and a number of non-key dimensions, which can be defined according to the application domain. Each key dimension and each non-key dimension has a corresponding weight. The string matching system comprises an input module 1, a processing module 2, a word segmentation module 3, a labeling module 4, a comparison module 5, a computing module 6 and an output module 7.

The components of the string matching system have been described above; the function of each component is introduced below:

The input module 1 is used to receive two input strings;

The processing module 2 is used to remove stop words from the two strings, correct wrongly written characters in the two strings, and replace pinyin in the two strings with Chinese characters;

The word segmentation module 3 is used to segment the two strings into phrases;

The labeling module 4 is used to label each phrase with its corresponding key dimension or non-key dimension;

The comparison module 5 is used to compare the phrases in the two strings; if the two phrases in any key dimension are not identical, it calls the output module 7 to output a string mismatch message, and otherwise it calls the computing module 6;

The computing module 6 is used to calculate the matching degree between the two strings by the formula $P = \frac{\sum_{i=1}^{n} a_i}{B}$, and to call the output module 7 to output this matching degree; where P denotes the matching degree between the two strings, n denotes the number of identical phrases in the two strings, a_i is twice the weight corresponding to the i-th identical phrase in the two strings, and B is the sum of the weights corresponding to every phrase in the two strings.

The word segmentation module 3 further comprises a division module 31 and a matching module 32, and the string matching system stores a dictionary containing multiple words. The division module 31 is used to divide the two strings; the matching module 32 is used to match each divided word against all words in the dictionary and, if the match succeeds, to take the divided word as a phrase.

As shown in Fig. 2, this embodiment also provides a string matching method in which a number of key dimensions and a number of non-key dimensions are stored, each key dimension and each non-key dimension having a corresponding weight. The string matching method comprises the following steps:

Step 101, receiving two input strings;

Step 102, removing stop words from the two strings, correcting wrongly written characters in the two strings, and replacing pinyin in the two strings with Chinese characters;

Step 103, segmenting the two strings into phrases; this step further comprises two sub-steps: dividing the two strings; and matching each divided word against all words in the dictionary and, if the match succeeds, taking the divided word as a phrase;

Step 104, labeling each phrase with its corresponding key dimension or non-key dimension;

Step 105, comparing the phrases in the two strings; if the two phrases in any key dimension are not identical, proceeding to step 106, otherwise proceeding to step 107;

Step 106, outputting a string mismatch message, and ending the process;

Step 107, calculating the matching degree between the two strings by the formula $P = \frac{\sum_{i=1}^{n} a_i}{B}$, outputting this matching degree, and ending the process; where P denotes the matching degree between the two strings, n denotes the number of identical phrases in the two strings, a_i is twice the weight corresponding to the i-th identical phrase in the two strings, and B is the sum of the weights corresponding to every phrase in the two strings.

The string matching system and method are described below using a concrete example, namely the matching degree between two input hotel names, so that those skilled in the art can better understand the invention. The invention is not limited to calculating the matching degree between hotel names, however; it can be applied to calculating the matching degree between two strings in any domain.

Different domains and different application scenarios call for different dimensions, and the key dimensions extracted from them differ as well. In this example, for the hotel domain, the possible dimensions include "city", "hotel brand", "sub-brand", "hotel name descriptor", "region" and "meaningless word". The key dimensions are "city", "hotel brand", "sub-brand" and "region", and the non-key dimensions are "hotel name descriptor" and "meaningless word". Among the key dimensions, the weight of "city" is 5, the weight of "region" is 5, the weight of "hotel brand" is 10, and the weight of "sub-brand" is 8. Among the non-key dimensions, the weight of "hotel name descriptor" is 1 and the weight of "meaningless word" is 0.
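The dimension and weight settings of this example can be written down as a small configuration. The sketch below uses the weights stated above; the English dimension names and the helper names `is_key` and `weight_of` are illustrative choices, not part of the patent.

```python
# Weights taken from the hotel example; the key/non-key split follows the text.
KEY_WEIGHTS = {"city": 5, "region": 5, "hotel brand": 10, "sub-brand": 8}
NON_KEY_WEIGHTS = {"hotel name descriptor": 1, "meaningless word": 0}


def is_key(dimension):
    """True if a difference in this dimension should veto the match."""
    return dimension in KEY_WEIGHTS


def weight_of(dimension):
    """Look up the weight of a key or non-key dimension."""
    if dimension in KEY_WEIGHTS:
        return KEY_WEIGHTS[dimension]
    return NON_KEY_WEIGHTS[dimension]


print(weight_of("hotel brand"), is_key("meaningless word"))
```

Keeping the two tables separate makes it easy to retune a single weight, or to move a dimension between key and non-key, without touching the matching logic.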

The dictionaries comprise a general dictionary and special dictionaries. The general dictionary is the broadest and most common dictionary; it does not distinguish between industries and is shared across them, and includes, for example, an administrative region dictionary and a natural language dictionary. A special dictionary, by contrast, is one of a series of smaller, more specialized dictionaries organized for a specific industry. It holds far less data than the general dictionary but is more authoritative within its field, and is therefore more likely to be used. In the hotel domain of this example a special dictionary is used; by looking up the special dictionary and applying a standard segmentation algorithm, a set of words carrying semantic labels is obtained.

The input module 1 receives two input strings: the first string is "the quick hotel of ru family of Xujiahui, Shanghai", and the second string is "shop, IBIs Xujiahui China". The processing module 2 performs routine preprocessing: it removes the stop word from the first string and replaces the pinyin "ru" in the first string with the Chinese character "as".

The division module 31 divides the two strings: the first string is divided into "Shanghai", "Xujiahui", "as family" and "quick hotel", and the second string is divided into "IBIs", "Xujiahui" and "China". The matching module 32 matches the divided words "Shanghai", "Xujiahui", "as family", "quick hotel", "IBIs" and "China" against all words in the special dictionary described above; since the matches succeed, these divided words are taken as phrases.

The labeling module 4 labels each phrase with its corresponding key dimension or non-key dimension. The phrases in the first string are labeled "Shanghai (city)", "Xujiahui (region)", "as family (hotel brand)" and "quick hotel (hotel name descriptor)"; the phrases in the second string are labeled "IBIs (hotel brand)", "Xujiahui (region)" and "China (meaningless word)".
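Labeling amounts to looking up each phrase in a dictionary that records its dimension. The following sketch reproduces the labels of this example; the table `PHRASE_DIMENSIONS`, the function name `label`, and the choice to treat unknown phrases as "meaningless word" are assumptions for illustration.

```python
# Hypothetical special dictionary mapping each phrase to its dimension.
PHRASE_DIMENSIONS = {
    "Shanghai": "city",
    "Xujiahui": "region",
    "as family": "hotel brand",
    "IBIs": "hotel brand",
    "quick hotel": "hotel name descriptor",
    "China": "meaningless word",
}


def label(phrases, default="meaningless word"):
    """Attach a dimension to each phrase; unknown phrases get `default` (an assumption)."""
    return [(phrase, PHRASE_DIMENSIONS.get(phrase, default)) for phrase in phrases]


print(label(["Shanghai", "Xujiahui", "as family", "quick hotel"]))
print(label(["IBIs", "Xujiahui", "China"]))
```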

The comparison module 5 compares the phrases in the two strings. In the key dimension "region", the phrase "Xujiahui" in the first string is identical to the phrase "Xujiahui" in the second string. In the key dimension "hotel brand", the phrase "as family" in the first string is identical to the phrase "IBIs" in the second string ("identical" here means that the commercial brand is the same in the hotel domain, that is, "as family" and "IBIs" are the same commercial brand). The first string has a phrase in the key dimension "city" while the second string lacks one, so the key dimension "city" is not compared. The comparison thus finds that the two phrases in every key dimension present in both strings are identical, the only difference being that the second string lacks a phrase in the key dimension "city"; the computing module 6 therefore calculates the matching degree between the two strings.
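The comparison just described has two notable features: phrases are compared by meaning rather than character by character, and a key dimension that is missing from one string is skipped rather than treated as a difference. One way to sketch this is with an alias table that maps equivalent phrases to a shared identifier. The table `CANONICAL`, the identifier "brand-A", the phrase "Pudong", and the function names are hypothetical.

```python
KEY_DIMENSIONS = {"city", "region", "hotel brand", "sub-brand"}

# Hypothetical alias table: phrases that denote the same commercial brand.
CANONICAL = {"as family": "brand-A", "IBIs": "brand-A"}


def canonical(phrase):
    """Map a phrase to its canonical form so that synonyms compare equal."""
    return CANONICAL.get(phrase, phrase)


def key_dimensions_agree(labeled_a, labeled_b):
    """labeled_* map dimension -> phrase; only key dimensions present in both are compared."""
    for dim in KEY_DIMENSIONS:
        if dim in labeled_a and dim in labeled_b:
            if canonical(labeled_a[dim]) != canonical(labeled_b[dim]):
                return False  # a single differing key dimension vetoes the match
    return True


first = {"city": "Shanghai", "region": "Xujiahui",
         "hotel brand": "as family", "hotel name descriptor": "quick hotel"}
second = {"hotel brand": "IBIs", "region": "Xujiahui", "meaningless word": "China"}
print(key_dimensions_agree(first, second))
```

Here "city" is skipped because the second string has no city phrase, so the two strings pass the key-dimension check and proceed to the matching degree calculation.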

The computing module 6 calculates the matching degree between the two strings by the formula $P = \frac{\sum_{i=1}^{n} a_i}{B}$ as follows:

The number of identical phrases in the two strings is 2. a_1 is the sum, 10, of the weight 5 of the phrase "Xujiahui" in the first string and the weight 5 of the phrase "Xujiahui" in the second string; a_2 is the sum, 20, of the weight 10 of the phrase "as family" in the first string and the weight 10 of the phrase "IBIs" in the second string. B is the sum of the weights of every phrase in the two strings: the weight 5 of "Shanghai", the weight 5 of "Xujiahui", the weight 10 of "as family" and the weight 1 of "quick hotel" in the first string, plus the weight 10 of "IBIs", the weight 5 of "Xujiahui" and the weight 0 of "China" in the second string, giving B = 36.

The matching degree between the two strings is then P = (10+20)/(5+5+10+1+10+5+0) = 30/36 = 83.33%, and the output module 7 is called to output this matching degree of 83.33%.
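The arithmetic of this example can be checked with a few lines of code. The sketch below uses the phrase weights given in the example; the variable names are illustrative only.

```python
from fractions import Fraction

# (phrase, weight) pairs for each string, as labeled in the example.
first = [("Shanghai", 5), ("Xujiahui", 5), ("as family", 10), ("quick hotel", 1)]
second = [("IBIs", 10), ("Xujiahui", 5), ("China", 0)]

# a_1: "Xujiahui" in both strings; a_2: the matching hotel brand in both strings.
a = [5 + 5, 10 + 10]

# B: sum of the weights of every phrase in both strings.
B = sum(w for _, w in first) + sum(w for _, w in second)

P = Fraction(sum(a), B)
print(B, P, f"{float(P):.2%}")  # 36 5/6 83.33%
```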

Every matching result of the string matching system is recorded and manually audited. The auditor checks whether each matching result is correct and feeds the audit result back to the string matching system, which compiles statistics on the number and types of matching errors and displays them. In most cases an incorrect result arises because some special phrase is missing from the dictionary, so the phrases produced by segmentation are wrong and the matching result is wrong in turn. The auditor can therefore manually supplement and improve the dictionary to further increase the accuracy of the matching results. If the number of errors of the same type accumulates to a certain threshold, or the output matching degree is considered unreasonable, the auditor can manually adjust the weight allocation, for example by adjusting the weight of a particular key dimension or non-key dimension.
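The error statistics in this feedback loop can be sketched as a count of audited errors by type, with the types that reach a threshold flagged for attention. The record format, the error-type labels, and the function name `error_report` are assumptions; the patent does not specify how the statistics are stored or what threshold is used.

```python
from collections import Counter


def error_report(audit_results, threshold):
    """audit_results: iterable of (is_correct, error_type) pairs from manual review.

    Returns the per-type error counts and the types that reached the threshold.
    """
    counts = Counter(err for ok, err in audit_results if not ok)
    flagged = sorted(t for t, n in counts.items() if n >= threshold)
    return counts, flagged


# Hypothetical audit log.
audits = [
    (True, None),
    (False, "missing dictionary phrase"),
    (False, "missing dictionary phrase"),
    (False, "weight too low"),
]
counts, flagged = error_report(audits, threshold=2)
print(counts, flagged)
```

A flagged type would then prompt the auditor either to extend the dictionary or to retune the weight of the dimension involved.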

For the hotel name matching of this embodiment, manual verification of a large number of real cases shows that the accuracy of the string matching results of the system is close to 92% in the initial state. After a period of manual auditing and adjustment, the accuracy rises to about 97%, whereas a common comparison algorithm (such as a string comparison algorithm built around the shortest edit distance) achieves an accuracy of about 75%. The accuracy of the matching results of the present invention is therefore far higher than that of the common comparison algorithm.

Likewise, the string matching system has been applied to matching room type names. Although room type names are shorter strings and harder to match, manual verification of a large number of real cases shows that the accuracy of the string matching results is close to 88.3% in the initial state. After a period of manual auditing and adjustment, the accuracy rises to about 94.4%, whereas a common comparison algorithm (such as a string comparison algorithm built around the shortest edit distance) achieves an accuracy of about 70%. Here too, the accuracy of the matching results of the present invention is far higher than that of the common comparison algorithm.
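For reference, the kind of baseline mentioned in the two comparisons above can be sketched with the classic Levenshtein edit distance. The patent does not describe how the baseline was implemented, so this dynamic-programming version and the length-normalized `similarity` score are assumptions for illustration.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(s, t):
    """Normalize the distance by the longer length (an assumed normalization)."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1 - edit_distance(s, t) / longest


print(edit_distance("kitten", "sitting"))  # 3
```

Such a baseline treats every character equally, so it cannot tell a differing brand from a differing filler word, which is the gap the weighted key dimensions are meant to close.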

This embodiment labels each segmented phrase and compares the phrases in the key dimensions. It applies a "reject on difference first" rule: when the two phrases in any key dimension are not identical, a string mismatch message is output; otherwise the matching degree between the two strings is calculated. The present invention can thus calculate the matching degree between two strings quickly, flexibly and accurately.

Each functional module of the present invention can be implemented with existing software programming techniques on existing hardware, so the specific implementation is not described further here.

Although specific embodiments of the present invention have been described above, those skilled in the art will understand that they are merely illustrative, and that the scope of protection of the present invention is defined by the appended claims. Those skilled in the art can make various changes or modifications to these embodiments without departing from the principle and essence of the present invention, and all such changes and modifications fall within the scope of protection of the present invention.

Claims

1. A string matching system, characterized in that it stores a number of key dimensions and a number of non-key dimensions, each key dimension and each non-key dimension having a corresponding weight, and in that the string matching system comprises an input module, a word segmentation module, a labeling module, a comparison module, a computing module and an output module;

The input module is used to receive two input strings;

The comparison module is used to compare the phrases in the two strings; if the two phrases in any key dimension are not identical, it calls the output module to output a string mismatch message, and otherwise it calls the computing module;

The computing module is used to calculate the matching degree between the two strings by the formula $P = \frac{\sum_{i=1}^{n} a_i}{B}$, and to call the output module to output this matching degree; where P denotes the matching degree between the two strings, n denotes the number of identical phrases in the two strings, a_i is twice the weight corresponding to the i-th identical phrase in the two strings, and B is the sum of the weights corresponding to every phrase in the two strings.

2. The string matching system as claimed in claim 1, characterized in that the string matching system further comprises a processing module, which is used to remove stop words from the two strings, correct wrongly written characters in the two strings, and replace pinyin in the two strings with Chinese characters.

3. The string matching system as claimed in claim 1, characterized in that the string matching system stores a dictionary containing multiple words, and the word segmentation module comprises a division module and a matching module;

The division module is used to divide the two strings;

4. The string matching system as claimed in any one of claims 1-3, characterized in that the key dimensions and non-key dimensions are defined according to the application domain.

5. A string matching method, characterized in that a number of key dimensions and a number of non-key dimensions are stored, each key dimension and each non-key dimension having a corresponding weight, and in that the string matching method comprises the following steps:

S₁, receiving two input strings;

S₂, segmenting the two strings into phrases;

S₃, labeling each phrase with its corresponding key dimension or non-key dimension;

S₅, outputting a string mismatch message, and ending the process;

6. The string matching method as claimed in claim 5, characterized in that the following step is included between step S₁ and step S₂:

7. The string matching method as claimed in claim 5, characterized in that the string matching method stores a dictionary containing multiple words, and step S₂ comprises the following steps:

S₂₁, dividing the two strings;

8. The string matching method as claimed in any one of claims 5-7, characterized in that the key dimensions and non-key dimensions are defined according to the application domain.