CN102184169A - Method, device and equipment used for determining similarity information among character string information - Google Patents

Info

Publication number
CN102184169A
Authority
CN
China
Prior art keywords: information, character string, similarity, substring, type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201110099437
Other languages
Chinese (zh)
Other versions
CN102184169B (en)
Inventor
何径舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN 201110099437
Publication of CN102184169A
Application granted
Publication of CN102184169B
Legal status: Active
Anticipated expiration

Abstract

The invention provides a method, a device and equipment for determining similarity information between character string information based on multiple types of similarity. In this scheme, two pieces of character string information to be processed are obtained, at least two types of similarity information are obtained based on them, and the final similarity information between the two pieces of character string information is determined from those types. Compared with the prior art, the method, device and equipment have the following advantage: the final similarity information between the two pieces of character string information can be determined comprehensively from multiple types of similarity information, so that their similarity in pronunciation, glyph form and/or meaning is reflected more fully and the similarity judgment is more accurate.

Description

Method, device and equipment for determining similarity information between character string information
Technical field
The present invention relates to the field of computer technology, and in particular to a method, device and equipment for determining similarity information between character string information.
Background art
Determining the similarity between character string information has always been an important part of natural language processing. In the prior art, the similarity between character string information is often determined from a single aspect only. For example, the glyph similarity between two pieces of character string information is determined solely from their edit distance; as another example, the semantic similarity between two pieces of character string information is judged solely from a synonym dictionary. Such methods rarely reflect the similarity between two character strings comprehensively.
Summary of the invention
The object of the present invention is to provide a method, device and equipment for determining similarity information between character string information.
According to one aspect of the present invention, a computer-implemented method for determining similarity information between character string information based on multiple types is provided, wherein the method comprises the following steps:
a. obtaining two pieces of character string information to be processed;
b. determining the final similarity information between the two pieces of character string information according to at least two types of similarity information obtained based on the two pieces of character string information.
According to another aspect of the present invention, a similarity determining device for determining similarity information between character string information is also provided, wherein the similarity determining device comprises:
a first obtaining device, configured to obtain two pieces of character string information to be processed;
a first determining device, configured to determine the final similarity information between the two pieces of character string information according to at least two types of similarity information obtained based on the two pieces of character string information.
According to a further aspect of the present invention, computer equipment is also provided, wherein the computer equipment comprises the aforementioned similarity determining device.
Compared with the prior art, the present invention has the following advantages:
1) the final similarity information between two pieces of character string information can be determined comprehensively from multiple types of similarity information, so that their similarity in pronunciation, glyph form and/or meaning is reflected more fully and the resulting similarity judgment is more accurate;
2) by incorporating the weight information corresponding to each type, the final similarity information obtained better meets the needs of the application scenario; further, the weight information corresponding to each type can be adjusted automatically according to application environment information; still further, the types to be processed can be selected according to the application environment, so that the scheme can be applied adaptively to a variety of scenarios;
3) by dividing character string information into substring information, the final similarity information can be obtained faster and with lower system resource consumption; further, the speed can be improved and resource consumption reduced according to the way substring information is matched and combined, or by using the similarity information of historical substring combination pairs;
4) by jointly considering overall similarity information and substring similarity information, the accuracy of the final similarity information between the two pieces of character string information to be processed can be improved;
5) the invention is applicable to any scenario that requires similarity judgment; for example, in search, judging the similarity between a user's input sequence and the keywords contained in text candidates; as another example, in error correction, judging the similarity between a user's input sequence and the keywords contained in an error-correction dictionary; as yet another example, in synonym mining, judging the similarity between two character strings under evaluation.
Brief description of the drawings
Other features, objects and advantages of the present invention will become more apparent upon reading the detailed description of non-limiting embodiments made with reference to the following drawings:
Fig. 1 is a flowchart of a method for obtaining similarity information between character string information according to one aspect of the present invention;
Fig. 2 is a flowchart of a method for obtaining similarity information between character string information according to a preferred embodiment of the present invention;
Fig. 3 is a flowchart of a method for obtaining similarity information between character string information according to another preferred embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a similarity determining device for obtaining similarity information between character string information according to one aspect of the present invention;
Fig. 5 is a schematic structural diagram of a similarity determining device for obtaining similarity information between character string information according to a preferred embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a similarity determining device for obtaining similarity information between character string information according to another preferred embodiment of the present invention.
The same or similar reference numerals in the drawings denote the same or similar components.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings.
Fig. 1 is a flowchart of a method for obtaining similarity information between character string information according to one aspect of the present invention. The method according to the present invention may be carried out by an operating system or a processing controller in computer equipment; for brevity, the operating system or processing controller is hereinafter referred to as the similarity determining device. The computer equipment includes but is not limited to: 1) user equipment; 2) network equipment. The user equipment includes but is not limited to a PC, a smartphone, a PDA and the like. The network equipment includes but is not limited to a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of computers or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
In step S1, the similarity determining device obtains two pieces of character string information to be processed.
The ways in which the similarity determining device obtains the two pieces of character string information to be processed include but are not limited to:
1) obtaining two pre-stored pieces of character string information that require similarity judgment;
2) obtaining two pieces of character string information on which the computer equipment hosting the similarity determining device, or other computer equipment, currently needs to perform similarity judgment.
For example, one of the two pieces of character string information comes from an input sequence currently entered by the user, and the other comes from a text item that the computer equipment retrieved based on the former. The computer equipment hosting the similarity determining device, or other computer equipment, currently needs to judge the similarity between the two in order to decide whether to provide the resource to which the text item belongs to the user; the similarity determining device then obtains the two pieces of character string information provided by that computer equipment.
As another example, one of the two pieces of character string information comes from an input sequence that the user entered in an application program, and the other comes from the error-correction dictionary of the computer equipment hosting that application program. That computer equipment currently needs to judge the similarity between the two in order to decide whether to provide the latter to the user as an error-correction prompt; the similarity determining device then obtains the two pieces of character string information provided by that computer equipment.
It should be noted that the above examples are given only to better illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any implementation for obtaining two pieces of character string information to be processed falls within the scope of the present invention.
Then, in step S2, the similarity determining device determines the final similarity information between the two pieces of character string information according to at least two types of similarity information obtained based on the two. The similarity information comprises at least one of the following: 1) similarity information between the two pieces of character string information; 2) similarity information between some of the characters contained in the two pieces of character string information. The similarity information includes but is not limited to: 1) a similarity level; 2) a similarity value. The at least two types of similarity information comprise at least two kinds of similarity information obtained through at least two kinds of string similarity processing.
Specifically, the similarity determining device obtains the at least two types of similarity information using the similarity processing of each respective type, and derives the final similarity information directly from the similarity information obtained for each type; for example, it takes the mean, product, sum of squares, sum of reciprocals, sum of logarithms or the like of the individual similarity information as the final similarity information. Alternatively, the similarity determining device first normalizes each piece of similarity information and then derives the final similarity information from the normalized values. Alternatively, the similarity determining device selects part of the similarity information obtained and derives the final similarity information from the selected part.
For example, the similarity determining device applies Metaphone phonetic encoding to character string information A and character string information B and obtains the phonetic codes "KRM" and "KRL" respectively. It then takes the ratio of the number of positions at which the two phonetic codes have identical characters to the total number of characters contained in the two phonetic codes, and determines the pronunciation similarity information between A and B to be 2/(3+4) = 2/7. By querying a predetermined synonym dictionary, it also finds that the two are synonyms. Then, following a predetermined rule that, when the two pieces of character string information to be processed are synonyms, the final similarity information is their pronunciation similarity information multiplied by 2, the similarity determining device takes 2*2/7 = 4/7 as the final similarity information between A and B.
As another example, character string information A comprises substring information A1 and A2. By converting substring information A1 into character string information B, the similarity determining device obtains an edit distance of 2 between A1 and B; by converting substring information A2 into character string information B, it obtains an edit distance of 1 between A2 and B. In addition, from the stored short-text expansion vectors of A1, A2 and B, the similarity determining device obtains a vector distance of 1.755 between the short-text expansion vectors of A1 and B, and a vector distance of 1.025 between those of A2 and B. The similarity determining device then takes the reciprocal of the sum of the values obtained, 1/(2+1+1.755+1.025) = 0.173, as the final similarity information between A and B.
It should be noted that the above examples are given only to better illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any implementation that determines the final similarity information between the two pieces of character string information according to at least two types of similarity information obtained based on the two falls within the scope of the present invention.
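The direct combination strategies described for step S2 can be sketched as follows. This is only an illustrative sketch: the function name `combine` and the strategy labels are assumptions introduced here, and only three of the listed strategies are shown. The usage line reproduces the reciprocal-of-sum figure from the substring example above.

```python
from math import prod
from statistics import mean


def combine(values, strategy="mean"):
    """Combine per-type similarity (or distance) values into one final value.

    `strategy` labels are hypothetical names for three of the options the
    text lists: mean, product, and reciprocal of the sum.
    """
    if strategy == "mean":
        return mean(values)
    if strategy == "product":
        return prod(values)
    if strategy == "inverse_of_sum":
        return 1.0 / sum(values)
    raise ValueError(f"unknown strategy: {strategy}")


# Edit distances 2 and 1, vector distances 1.755 and 1.025 (from the example).
final = combine([2, 1, 1.755, 1.025], strategy="inverse_of_sum")
print(round(final, 3))
```

Other strategies named in the text, such as the sum of squares or the sum of logarithms, could be added as further branches in the same way.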
Preferably, the at least two types comprise at least two of the following:
1) edit distance type;
Specifically, similarity information of the edit distance type is obtained from the edit distance between the two pieces of character string information.
More preferably, similarity information of the edit distance type is obtained from the character change information associated with the editing operations performed in converting one of the two pieces of character string information into the other.
For example, the two pieces of character string information are "bai" and "bei". The editing operations performed in converting "bai" into "bei" are "copy character b", "replace character a with character e" and "copy character i", so the character change information obtained is "a → e". Querying a predetermined cost library gives a change cost of 0.2 for "a → e", so the similarity of the edit distance type between "bai" and "bei" is determined to be 1-0.2=0.8.
2) pronunciation type;
Specifically, similarity information of the pronunciation type is obtained by obtaining the phonetic notation or pinyin of the two pieces of character string information and determining the similarity of that phonetic notation or pinyin. For example, the pinyin of the two pieces of character string information to be processed is "baidu" and "paidu" respectively; the pronunciation-type similarity information of the two is then determined to be 0.75, based on the ratio of the number of consonants and vowels that are identical at the same ordinal position to the total number of consonants and vowels.
3) synonym matching type;
Specifically, similarity information of the synonym matching type is obtained by judging whether the two pieces of character string information are synonyms, or by judging the likelihood that they are synonyms.
4) short text expansion type;
Specifically, similarity information of the short text expansion type is obtained from the similarity between the short-text expansion information of the two pieces of character string information.
5) string feature vector type;
Specifically, similarity information of the string feature vector type is obtained from two string feature vectors, each derived from the retrieval results for one of the two pieces of character string information.
For example, retrieval based on character string information A yields multiple web pages; the text in these web pages is processed by word segmentation, removal of invalid keywords, counting of repeated keywords and the like, to obtain the string feature vector of A. The same processing is then repeated for character string information B to obtain the string feature vector of B. The similarity information of the string feature vector type between A and B is then obtained from the vector distance between the two string feature vectors.
6) topic distribution type;
Specifically, similarity information of the topic distribution type is obtained from the topics of multiple pieces of resource information related to each of the two pieces of character string information.
For example, retrieval based on character string information A yields three web pages whose predetermined topics are "news", "entertainment" and "news" respectively, so the topic distribution of A is determined to comprise "news: 2/3, entertainment: 1/3". Repeating the same operation for character string information B gives a topic distribution comprising "news: 1/2, entertainment: 1/4, recreation: 1/4". The sum, over the topics shared by A and B, of the mean of their two proportions, (2/3+1/2)/2+(1/3+1/4)/2=7/8, is then taken as the similarity information of the topic distribution type between A and B.
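The topic distribution arithmetic in the last example can be checked with a short sketch. The function names are assumptions made for illustration, and exact fractions are used so the result can be compared with 7/8 directly; how page topics are actually assigned is outside the scope of this sketch.

```python
from collections import Counter
from fractions import Fraction


def topic_distribution(page_topics):
    """Map each topic to the share of retrieved pages labelled with it."""
    counts = Counter(page_topics)
    total = len(page_topics)
    return {topic: Fraction(count, total) for topic, count in counts.items()}


def topic_similarity(dist_a, dist_b):
    """Sum, over topics present in both distributions, of the mean share."""
    shared = dist_a.keys() & dist_b.keys()
    return sum(((dist_a[t] + dist_b[t]) / 2 for t in shared), Fraction(0))


dist_a = topic_distribution(["news", "entertainment", "news"])
dist_b = {
    "news": Fraction(1, 2),
    "entertainment": Fraction(1, 4),
    "recreation": Fraction(1, 4),
}
print(topic_similarity(dist_a, dist_b))  # 7/8
```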
It should be noted that the above examples are given only to better illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any at least two kinds of similarity information obtained through at least two kinds of string similarity processing fall within the scope of the at least two types of similarity information of the present invention.
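The edit distance and pronunciation examples in the list above can be sketched as follows. Several points are assumptions rather than statements of the patent: the cost library is modelled as a plain dictionary, costs of multiple changes are simply summed, insertions and deletions cost 1, and the pinyin strings are assumed to be pre-segmented into consonant and vowel units (for instance "baidu" as b, ai, d, u), which is one reading that reproduces the stated value of 0.75.

```python
def weighted_edit_similarity(src, dst, change_cost, default_cost=1.0):
    """Return 1 minus the cheapest total cost of converting src into dst.

    Copying a character is free; a replacement costs whatever the
    (hypothetical) cost table says, or default_cost if it has no entry.
    """
    rows, cols = len(src) + 1, len(dst) + 1
    dp = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * default_cost
    for j in range(1, cols):
        dp[0][j] = j * default_cost
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = src[i - 1], dst[j - 1]
            replace = 0.0 if a == b else change_cost.get((a, b), default_cost)
            dp[i][j] = min(
                dp[i - 1][j] + default_cost,   # delete a
                dp[i][j - 1] + default_cost,   # insert b
                dp[i - 1][j - 1] + replace,    # copy or replace
            )
    return max(0.0, 1.0 - dp[-1][-1])


def pronunciation_similarity(units_a, units_b):
    """Share of positions at which the two unit sequences agree."""
    total = max(len(units_a), len(units_b))
    if total == 0:
        return 1.0
    matches = sum(1 for x, y in zip(units_a, units_b) if x == y)
    return matches / total


print(weighted_edit_similarity("bai", "bei", {("a", "e"): 0.2}))
print(pronunciation_similarity(["b", "ai", "d", "u"], ["p", "ai", "d", "u"]))
```

With the cost table from the example, the first call yields a similarity of about 0.8, and the second yields 0.75 under the segmentation assumed here.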
The method according to the present invention can determine the final similarity information between two pieces of character string information comprehensively from multiple types of similarity information, so that their similarity in pronunciation, glyph form and/or meaning is reflected more fully and the resulting similarity judgment is more accurate.
Fig. 2 is a flowchart of a method for obtaining similarity information between character string information according to a preferred embodiment of the present invention. The method according to this embodiment comprises step S1, step S3 and step S2'.
Step S1 has been described in detail with reference to the embodiment shown in Fig. 1; that description is incorporated here by reference and is not repeated.
In step S3, the similarity determining device obtains the weight information corresponding to each of the at least two types. The weight information includes but is not limited to: 1) a weight level; 2) a weight value.
Specifically, the ways in which the similarity determining device obtains the weight information include but are not limited to:
1) the similarity determining device obtains the weight information corresponding to each of the at least two types according to a predetermined correspondence between weight information and types;
2) the similarity determining device obtains application environment information for the final similarity information, and determines the weight information corresponding to each type according to that application environment information.
The ways in which the similarity determining device obtains the application environment information for the final similarity information include but are not limited to:
a) the similarity determining device obtains application environment information provided by other equipment or other devices; for example, another device in the computer equipment requests the similarity determining device to determine the final similarity information between two pieces of character string information and provides its API (Application Programming Interface) to the similarity determining device, and the similarity determining device takes the API provided by that other device as the application environment information;
b) the similarity determining device detects the application program associated with the two pieces of character string information to be processed, in order to obtain the application environment information; for example, when one of the two pieces of character string information to be processed is detected to have been obtained from the word program, the application environment information is determined to include identification information of the word program.
It should be noted that the above examples are given only to better illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any implementation for obtaining the application environment information for the final similarity information falls within the scope of the present invention.
The ways in which the similarity determining device determines the weight information corresponding to each type according to the application environment information include but are not limited to:
a) when the at least two types comprise all the types that the similarity determining device can adopt, the similarity determining device determines the weight information corresponding to each of those types according to the application environment information.
Specifically, the similarity determining device determines the weight information corresponding to each of the types according to a predetermined correspondence between application environment information and the weight information of each type.
For example, when all the types comprise the edit distance type and the topic distribution type, and the application environment information includes identification information of the word program, the similarity determining device determines from the correspondence that, for application environment information including identification information of the word program, the weight level corresponding to the edit distance type is the first level and the weight level corresponding to the topic distribution type is the third level.
b) the similarity determining device selects the at least two types from all the types according to the application environment information, and obtains the weight information corresponding to each of the selected types according to the application environment information.
Specifically, the similarity determining device selects the at least two types from all the types according to the types predetermined as required under each application environment; it then obtains the weight information corresponding to each of the selected types according to the application environment information.
For example, all the types that the similarity determining device can adopt comprise the pronunciation type, the synonym matching type, the short text expansion type, the string feature vector type and the topic distribution type, and the similarity determining device judges from the API contained in the obtained application environment information that the current application environment is a search environment. Following the predetermined rule that the short text expansion type, the string feature vector type and the topic distribution type are required in a search environment, the similarity determining device selects those three types from the five available types. It then obtains the weight information corresponding to each of the three selected types according to the weight information predetermined for each type in a search environment.
It should be noted that the above examples are given only to better illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any implementation that determines the weight information corresponding to each type according to the application environment information falls within the scope of the present invention.
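One possible shape for the environment-driven selection in option b) is a lookup table keyed by application environment. The sketch below is hypothetical throughout: the environment names, type identifiers and weight values are invented for illustration, since the text only states that such correspondences are predetermined.

```python
# Hypothetical profiles: environment name -> {similarity type: weight value}.
ENVIRONMENT_PROFILES = {
    "search": {
        "short_text_expansion": 0.4,
        "string_feature_vector": 0.3,
        "topic_distribution": 0.3,
    },
    "error_correction": {
        "edit_distance": 0.6,
        "pronunciation": 0.4,
    },
}


def select_types_and_weights(environment, available_types):
    """Pick the types required in this environment, with their weights.

    Only types the device can actually compute (available_types) are kept,
    and at least two must remain, as the method requires.
    """
    profile = ENVIRONMENT_PROFILES[environment]
    selected = {t: w for t, w in profile.items() if t in available_types}
    if len(selected) < 2:
        raise ValueError("at least two similarity types are required")
    return selected


available = {
    "pronunciation",
    "synonym_match",
    "short_text_expansion",
    "string_feature_vector",
    "topic_distribution",
}
print(sorted(select_types_and_weights("search", available)))
```

Keeping selection and weighting in one table means a new application environment can be supported by adding an entry rather than changing code, which matches the adaptive use the text describes.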
It should be noted that, when the weight information does not need to be obtained based on the two pieces of character string information to be processed, there is no fixed order between step S1 and step S3.
Then, in step S2', the similarity determining device determines the final similarity information between the two pieces of character string information according to the at least two types of similarity information obtained based on the two, in combination with the weight information corresponding to each type. The way in which the similarity determining device obtains the at least two types of similarity information has been described in detail for step S2 above; that description is incorporated here by reference and is not repeated.
Specifically, the similarity determining device weights the at least two types of similarity information obtained according to the weight information, and obtains the final similarity information from the weighted similarity information.
For example, the similarity determining device obtains a pronunciation similarity of 0.45 and a synonym matching similarity of 0.26 between character string information A and B, and in the aforementioned step S3 it obtained a weight value of 0.4 for the pronunciation type and a weight value of 0.5 for the synonym matching type. The similarity determining device then multiplies the similarity information of each type by the weight value corresponding to that type and adds the products, obtaining the final similarity information 0.4*0.45+0.5*0.26=0.31.
As another example, character string information A comprises substring information A1 and A2. The similarity determining device obtains an edit distance of 2 between A1 and character string information B, an edit distance of 1 between A2 and B, a vector distance of 1.755 between the short-text expansion vectors of A1 and B, and a vector distance of 1.025 between the short-text expansion vectors of A2 and B. In step S3 it obtained a weight value of 0.8 for the edit distance type and a weight value of 0.5 for the short text expansion type. The similarity determining device then takes the reciprocal of the weighted sum of the individual values, 1/(0.8*2+0.8*1+0.5*1.755+0.5*1.025)=0.2639, as the final similarity information between A and B.
It should be noted that the above examples are given only to better illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any implementation that determines the final similarity information between the two pieces of character string information according to the at least two types of similarity information obtained based on the two, in combination with the weight information corresponding to each type, falls within the scope of the present invention; for example, taking the sum of logarithms, the sum of squares or the product of the weighted similarity information as the final similarity information, or first normalizing each piece of similarity information and then taking the weighted sum of the normalized values as the final similarity information.
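The two weighted examples for step S2' can be reproduced with a minimal sketch. The function names and the dictionary layout are assumptions made for illustration; the numbers are those given in the examples above.

```python
def weighted_sum(similarities, weights):
    """Sum of each type's similarity multiplied by that type's weight."""
    return sum(weights[t] * s for t, s in similarities.items())


def inverse_weighted_sum(distances, weights):
    """Reciprocal of the weighted sum of per-substring distances.

    `distances` maps a type to the list of distances measured for each
    substring (here A1 and A2 against B).
    """
    total = sum(weights[t] * d for t, values in distances.items() for d in values)
    return 1.0 / total


first = weighted_sum(
    {"pronunciation": 0.45, "synonym_match": 0.26},
    {"pronunciation": 0.4, "synonym_match": 0.5},
)
second = inverse_weighted_sum(
    {"edit_distance": [2, 1], "short_text_expansion": [1.755, 1.025]},
    {"edit_distance": 0.8, "short_text_expansion": 0.5},
)
print(round(first, 2), round(second, 4))
```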
According to the method for present embodiment,, make the final similarity information that is obtained more meet the demand of application scenario by weight information in conjunction with each type correspondence; For example, when the method according to present embodiment is applied to error correction system, give editing distance type and the higher weight information of pronunciation type, when the method according to present embodiment is applied to search system, give the higher weight information of short text expansion type, character string proper vector type and body portion category type etc.Further, can also adjust the weight information of each type correspondence according to applied environment information automatically according to the method for present embodiment, and can select the type of required processing according to applied environment, so that the method for present embodiment can be applicable to multiple occasion adaptively.
Fig. 3 is a flowchart of a method for obtaining similarity information between character string information according to another preferred embodiment of the present invention. The method of this embodiment comprises step S1 and step S2, wherein step S2 further comprises step S21 and step S22.
Step S1 has been described in detail with reference to the embodiment shown in Fig. 1; that description is incorporated herein by reference and is not repeated.
In step S21, the similarity determination device divides at least one of the two pieces of character string information to obtain the plurality of substring information that it comprises.
Specifically, the similarity determination device divides at least one of the two pieces of character string information according to one or more factors, such as syllables, character encoding type, the language to which the characters belong, and/or keywords contained in a dictionary, to obtain the plurality of substring information that it comprises.
For example, for the character string information "secondary えろりんく", the similarity determination device divides it into the substring information "secondary", "えろ" and "りんく", because the character string "secondary" and the character string "えろりんく" have different character encoding types, and "えろ" and "りんく" are each a word in the dictionary. For brevity, identifiers are used below to denote character string information and substring information: character string information A denotes "secondary えろりんく", substring information A1 denotes "secondary", substring information A2 denotes "えろ", substring information A3 denotes "りんく", and so on. It should be noted that the foregoing is merely illustrative and does not limit the specific character strings that identifiers such as "A", "A1", "A2" and "A3" may denote.
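The two division criteria used in this example can be sketched as follows. Several things here are assumptions made for illustration: the first word of a character's Unicode name is used as a stand-in for its "character encoding type", the dictionary lookup is a greedy longest match, and the Latin string "abc" replaces the first substring of the example so that the input is unambiguous.

```python
import unicodedata
from itertools import groupby


def script_of(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def split_by_script(text):
    """Split text into maximal runs of characters that share a script."""
    return ["".join(run) for _, run in groupby(text, key=script_of)]


def split_by_dictionary(run, dictionary):
    """Greedy longest-match segmentation of one run.

    If some position cannot be matched by any dictionary word, the run is
    returned undivided.
    """
    pieces, start = [], 0
    while start < len(run):
        for end in range(len(run), start, -1):
            if run[start:end] in dictionary:
                pieces.append(run[start:end])
                start = end
                break
        else:
            return [run]
    return pieces


def divide(text, dictionary):
    """Divide text by script first, then by dictionary words within each run."""
    return [
        piece
        for run in split_by_script(text)
        for piece in split_by_dictionary(run, dictionary)
    ]


# Hypothetical dictionary holding the two words named in the example.
print(divide("abcえろりんく", {"えろ", "りんく"}))
```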
Then, in step S22, the similarity determination device determines the final similarity information between the two pieces of character string information according to at least two types of similarity information between one or more pieces of substring information contained in one of them and one or more pieces of substring information contained in the other. The manner in which the similarity determination device obtains the similarity information between two pieces of substring information is the same as or similar to the manner in which it obtains the similarity information between two pieces of character string information in the aforementioned step S2, and is not repeated here.
Specifically, the similarity determination device processes the obtained similarity information items to determine the final similarity information between the two pieces of character string information.
For example, in step S21 the similarity determination device finds that character string information A comprises substring information A1 and A2, and character string information B comprises substring information B1 and B2. In this step, it obtains a pronunciation-type similarity of 0.6 between substring information A1 and B1, a pronunciation-type similarity of 0.1 between substring information A1 and B2, an edit-distance-type similarity of 0.2 between substring information A2 and B1, and an edit-distance-type similarity of 0.8 between substring information A2 and B2. The similarity determination device then takes the mean of these items as the final similarity information = (0.6+0.1+0.2+0.8)/4 = 0.425.
It should be noted that the above examples serve only to better illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any implementation that determines the final similarity information between the two pieces of character string information according to at least two types of similarity information between one or more pieces of substring information contained in one of them and one or more pieces of substring information contained in the other falls within the scope of the present invention; for example, sorting the similarity items from high to low and obtaining the final similarity information from the top two.
According to the method of this embodiment, dividing the character string information into substring information can increase the speed of obtaining the final similarity information and reduce system resource consumption.
As one preferred variant of this embodiment, step S22 further comprises step S2211 (not shown), step S2212 (not shown) and step S2213 (not shown).
In step S2211, the similarity determination device obtains all matching combinations between all the substring information contained in one piece of character string information and all the substring information contained in the other.
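One possible way to enumerate such matching combinations is sketched below. It rests on an assumption that the text does not state explicitly: a matching combination cuts both substring sequences into the same number of contiguous, order-preserving groups and pairs the groups position by position. The function names are hypothetical.

```python
from itertools import combinations


def contiguous_groupings(items, k):
    """Yield every way to cut `items` into k non-empty contiguous groups."""
    n = len(items)
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0, *cuts, n)
        yield [tuple(items[bounds[i]:bounds[i + 1]]) for i in range(k)]


def matching_combinations(a_parts, b_parts):
    """Pair the i-th group of A with the i-th group of B for every grouping."""
    result = []
    for k in range(1, min(len(a_parts), len(b_parts)) + 1):
        for groups_a in contiguous_groupings(a_parts, k):
            for groups_b in contiguous_groupings(b_parts, k):
                result.append(list(zip(groups_a, groups_b)))
    return result


for combo in matching_combinations(["A1", "A2", "A3"], ["B1", "B2"]):
    print(combo)
```

Under this assumption, three substrings in A and two in B give three combinations: the whole of A against the whole of B, "A1 with B1, A2A3 with B2", and "A1A2 with B1, A3 with B2".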
For example, in step S21 the similarity determination device finds that character string information A comprises substring information A1, A2 and A3, and character string information B comprises substring information B1 and B2; all the matching combinations that it obtains between the substring information of character string information A and that of character string information B are then as follows:
Then, in step S2212, the similarity determination device obtains at least two types of similarity information between the two pieces of character string information according to all the matching combinations.
Specifically, for each matching combination, the similarity determination device obtains at least two types of similarity information between the matched substring information or substring combinations, thereby obtaining at least two types of similarity information between the two pieces of character string information under that matching combination. The manner in which the similarity determination device obtains at least two types of similarity information between two pieces of substring information, between two substring combinations, or between substring information and a substring combination is the same as or similar to the manner in which it obtains at least two types of similarity information between two pieces of character string information in the aforementioned step S2, and is not repeated here.
For example, take matching combination one and matching combination two of the aforementioned step S2211. In matching combination one, the similarity determination device obtains an edit-distance-type similarity of 0.8 and a pronunciation-type similarity of 0.3 between the matched substring information A1 and B1, and an edit-distance-type similarity of 0.05 and a pronunciation-type similarity of 0.88 between substring combination A2A3 and substring information B2. In matching combination two, it obtains an edit-distance-type similarity of 0.3 and a pronunciation-type similarity of 0.2 between the matched substring combination A1A2 and substring information B1, and an edit-distance-type similarity of 0.07 and a pronunciation-type similarity of 0.25 between substring information A3 and substring information B2. The similarity determination device then averages the similarity information of each type, obtaining an edit-distance-type similarity between character string information A and B of (0.8+0.05+0.3+0.07)/4 = 0.305 and a pronunciation-type similarity of (0.3+0.88+0.2+0.25)/4 = 0.4075.
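The per-type averaging in this example can be reproduced with the short sketch below. The data layout (a dictionary of matching combinations, each mapping a matched pair to its per-type similarities) and all names are illustrative assumptions.

```python
def average_by_type(matching_combos):
    """Average each similarity type over every matched pair in every combination."""
    totals, counts = {}, {}
    for pairs in matching_combos.values():
        for sims in pairs.values():
            for kind, value in sims.items():
                totals[kind] = totals.get(kind, 0.0) + value
                counts[kind] = counts.get(kind, 0) + 1
    return {kind: totals[kind] / counts[kind] for kind in totals}


# Values taken from the example above.
matching_combos = {
    "one": {
        ("A1", "B1"): {"edit_distance": 0.8, "pronunciation": 0.3},
        ("A2A3", "B2"): {"edit_distance": 0.05, "pronunciation": 0.88},
    },
    "two": {
        ("A1A2", "B1"): {"edit_distance": 0.3, "pronunciation": 0.2},
        ("A3", "B2"): {"edit_distance": 0.07, "pronunciation": 0.25},
    },
}

per_type = average_by_type(matching_combos)
print({kind: round(value, 4) for kind, value in per_type.items()})
```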
As another example, with the edit-distance-type and pronunciation-type similarity information between the substring information or substring combinations in matching combinations one and two as in the preceding example, the similarity determination device finds that the edit-distance-type similarity of 0.8 between substring information A1 and B1 exceeds a predetermined edit distance threshold of 0.7, and that the pronunciation-type similarity of 0.88 between substring combination A2A3 and substring information B2 exceeds a predetermined pronunciation threshold of 0.75. It therefore decides to obtain the edit-distance-type and pronunciation-type similarity information between character string information A and B from matching combination one, giving an edit-distance-type similarity between character string information A and B of (0.8+0.05)/2 = 0.425 and a pronunciation-type similarity of (0.3+0.88)/2 = 0.59.
It should be noted that the above examples serve only to better illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any implementation that obtains at least two types of similarity information between the two pieces of character string information according to all the matching combinations falls within the scope of the present invention.
Then, in step S2213, the similarity determination device determines the final similarity information between the two pieces of character string information according to the at least two types of similarity information between them. The manner in which it does so is the same as or similar to the manner in which, in the aforementioned step S2, it determines the final similarity information between two pieces of character string information according to at least two types of similarity information obtained based on them, and is not repeated here.
According to the method of this preferred variant, using the matching combinations between substring information can further increase the speed of obtaining the final similarity information and reduce system resource consumption.
As one preferred variant of this embodiment, step S22 further comprises step S2221 (not shown), step S2222 (not shown), step S2223 (not shown) and step S2224 (not shown).
In step S2221, the similarity determination device obtains current substring combination pair information from the two pieces of character string information. The current substring combination pair information comprises substring information and/or substring combinations belonging respectively to the two pieces of character string information.
Specifically, the similarity determination device obtains the current substring combination pair information according to the positions of the substring information within their respective character string information, in combination with a record of the operations already performed to obtain current substring combination pair information.
The operation record includes, but is not limited to, at least one of the following:
1) the number of obtaining operations already performed;
2) the current substring combination pair information obtained the previous time;
3) the number of pieces of substring information belonging to one piece of character string information, and the number belonging to the other, that are contained in the current substring combination pair information obtained the previous time.
For example, in step S21 the similarity determination device finds that character string information A comprises substring information A1, A2 and A3, and character string information B comprises substring information B1 and B2; from the operation record it finds that the current substring combination pair information obtained the previous time comprises substring combination A1A2 and substring information B1. The similarity determination device then randomly selects either substring combination A1A2A3 and substring information B1, or substring combination A1A2 and substring combination B1B2, as the current substring combination pair information.
It should be noted that the similarity determination device can obtain the current substring combination pair information, and the two pieces of substring information to be processed that it contains, in a variety of orders. For example, when character string information A comprises substring information A1, A2 and A3 arranged from left to right, and character string information B comprises substring information B1 and B2 arranged from left to right, the similarity determination device obtains the current substring combination pair information in any one of the following orders:
1) A1_B1, A1A2_B1, A1A2A3_B1, A1_B1B2, A1A2_B1B2, A1A2A3_B1B2;
2) A1_B1, A1_B1B2, A1A2_B1, A1A2_B1B2, A1A2A3_B1, A1A2A3_B1B2;
3) A3_B2, A2A3_B2, A1A2A3_B2, A3_B1B2, A2A3_B1B2, A1A2A3_B1B2;
4) A3_B2, A3_B1B2, A2A3_B2, A2A3_B1B2, A1A2A3_B2, A1A2A3_B1B2;
5) A1_B1, A1A2_B1, A1_B1B2, A1A2A3_B1, A1A2_B1B2, A1A2A3_B1B2;
6) A3_B2, A2A3_B2, A3_B1B2, A1A2A3_B2, A2A3_B1B2, A1A2A3_B1B2.
It should further be noted that the above examples serve only to better illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any implementation that obtains current substring combination pair information from the two pieces of character string information, together with the two pieces of substring information to be processed that it contains, falls within the scope of the present invention.
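Orders 1) and 2) above are the two nested-loop traversals over prefix lengths of the two substring sequences. The sketch below generates both; the function names and the underscore-joined labels are illustrative only.

```python
def pair_label(a_parts, b_parts, i, j):
    """Label the pair made of the first i substrings of A and the first j of B."""
    return "".join(a_parts[:i]) + "_" + "".join(b_parts[:j])


def order_growing_a_first(a_parts, b_parts):
    """Order 1): for each prefix of B, grow the prefix of A."""
    return [
        pair_label(a_parts, b_parts, i, j)
        for j in range(1, len(b_parts) + 1)
        for i in range(1, len(a_parts) + 1)
    ]


def order_growing_b_first(a_parts, b_parts):
    """Order 2): for each prefix of A, grow the prefix of B."""
    return [
        pair_label(a_parts, b_parts, i, j)
        for i in range(1, len(a_parts) + 1)
        for j in range(1, len(b_parts) + 1)
    ]


a_parts, b_parts = ["A1", "A2", "A3"], ["B1", "B2"]
print(order_growing_a_first(a_parts, b_parts))
print(order_growing_b_first(a_parts, b_parts))
```

Orders 3) and 4) are the same traversals over suffixes instead of prefixes. In every listed order, each pair appears only after the smaller pairs it could be built from, which is what allows earlier results to be reused as historical similarity information.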
Then, in step S2222, the similarity determination device obtains at least two types of similarity information for each pair of substring information to be processed that is contained in the current substring combination pair information. Each pair to be processed comprises substring information and/or substring combinations belonging respectively to the two pieces of character string information, and the similarity information of the pair is the similarity information between the substring information and/or substring combinations that it comprises. The manner in which the similarity determination device obtains at least two types of similarity information for a pair of substring information to be processed is the same as or similar to the manner in which it obtains at least two types of similarity information between two pieces of character string information to be processed in step S2, and is not repeated here.
For example, character string information A comprises substring information A1, A2 and A3 arranged from left to right, and character string information B comprises substring information B1 and B2 arranged from left to right. In step S2221, the current substring combination pair information obtained by the similarity determination device is "A1A2A3_B1B2". According to the possible matching combinations between the two substring combinations "A1A2A3" and "B1B2" in the current substring combination pair information, the similarity determination device obtains two pairs of substring information to be processed, "A2A3_B2" and "A3_B2"; it obtains edit-distance-type and pronunciation-type similarity information of 0.45 and 0.576 respectively for "A2A3_B2", and of 0.61 and 0.5 respectively for "A3_B2".
Then, in step S2223, the similarity determination device determines the similarity information of the current substring combination pair information according to the at least two types of similarity information of each pair of substring information to be processed and according to historical similarity information.
For example, the at least two types of similarity information of each pair of substring information to be processed are as in the example of the aforementioned step S2222, and the historical similarity records already obtained by the similarity determination device are as follows:
[Table (image BSA00000477888800171): historical similarity records, with columns "substring combination pair information" and "similarity information"]
The similarity determination device determines the similarity information of the pair to be processed "A2A3_B2" = (0.45+0.576)/2 = 0.513 and the similarity information of the pair to be processed "A3_B2" = (0.61+0.5)/2 = 0.555. Then, given that the historical similarity information of the substring combination pair "A1_B1" is 0.6 and that of "A1A2_B1" is 0.3, it determines the similarity information of the current substring combination pair information "A1A2A3_B1B2" under the matching combination "A1 matches B1; A2 and A3 match B2" = 0.6*0.513 = 0.3078, and under the matching combination "A1 and A2 match B1; A3 matches B2" = 0.3*0.555 = 0.1665. The similarity determination device then selects the larger value, 0.3078, as the similarity information of the current substring combination pair information "A1A2A3_B1B2".
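The arithmetic of this example can be checked with the sketch below. The dictionary layout and names are illustrative; the historical value 0.3 for "A1A2_B1" is read off from the multiplication 0.3*0.555 in the example, since the historical record table itself is not reproduced in the text.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


# Historical similarity of already processed substring combination pairs.
history = {"A1_B1": 0.6, "A1A2_B1": 0.3}

# (edit-distance-type, pronunciation-type) similarity of each pending pair.
pending = {"A2A3_B2": (0.45, 0.576), "A3_B2": (0.61, 0.5)}

# Each candidate extends one historical pair with one pending pair.
candidates = {
    "A1|B1 ; A2A3|B2": history["A1_B1"] * mean(pending["A2A3_B2"]),
    "A1A2|B1 ; A3|B2": history["A1A2_B1"] * mean(pending["A3_B2"]),
}

best_combination = max(candidates, key=candidates.get)
current_similarity = candidates[best_combination]
print(best_combination, round(current_similarity, 4))
```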
Then, in step S2224, the similarity determination device records the similarity information of the current substring combination pair information as one item of historical similarity information, and repeats steps S2221 to S2223 together with this recording step until the current substring combination pair information comprises both complete pieces of character string information. It then takes the similarity information of that current substring combination pair information as the final similarity information between the two pieces of character string information.
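Steps S2221 to S2224 together resemble a dynamic program over prefix pairs. The following is a minimal sketch under stated assumptions, not the patented procedure itself: scores are combined by product and maximum as in the worked example, an empty prefix pair with score 1.0 is added so that matching a whole prefix of A against a whole prefix of B is also a candidate, and `pair_sim` is a hypothetical caller-supplied function that returns one combined similarity for two joined substrings.

```python
def final_similarity(a_parts, b_parts, pair_sim):
    """Best product of group similarities over order-preserving matchings.

    history[(i, j)] stores the similarity of the pair formed by the first
    i substrings of A and the first j substrings of B.
    """
    n, m = len(a_parts), len(b_parts)
    history = {(0, 0): 1.0}  # assumption: neutral score for empty prefixes
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = 0.0
            for prev_i in range(i):
                for prev_j in range(j):
                    if (prev_i, prev_j) not in history:
                        continue  # e.g. (1, 0): B side would be empty
                    last = pair_sim(
                        "".join(a_parts[prev_i:i]), "".join(b_parts[prev_j:j])
                    )
                    best = max(best, history[(prev_i, prev_j)] * last)
            history[(i, j)] = best  # recorded as historical similarity
    return history[(n, m)]


def toy_pair_sim(left, right):
    """Hypothetical similarity: 1.0 for equal strings, otherwise 0.5."""
    return 1.0 if left == right else 0.5


print(final_similarity(["x", "y"], ["x", "y"], toy_pair_sim))
print(final_similarity(["x", "y"], ["x", "z"], toy_pair_sim))
```

Each prefix pair is scored once and then reused, which is the source of the speed-up that this preferred variant attributes to historical similarity information.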
According to the method for this preferred version, can further improve the speed of obtaining described final similarity information according to the right similarity information of historical substring combination, reduce system resources consumption.
As one of preferred version of present embodiment, also comprise step S4 according to the method for present embodiment, abovementioned steps S2 comprises step S22 '.
In step S4, similarity determines that device obtains the overall similarity information of at least a type between described two character string informations.Wherein, described overall similarity information is the similarity information that directly obtains according to two unallocated character string informations.Similarity determines to obtain similarity among one type the mode of overall similarity information and the step S2 by device, and to determine that device obtains the mode of one type similarity information between two pending character string informations same or similar, do not repeat them here.
In step S22 ', similarity determines that device is according to being contained in the one or more substring information in one of them character string information and being contained between one or more substring information in another character string information at least two types similarity information, and, determine the final similarity information between described two character string informations in conjunction with the overall similarity information of described at least a type.Wherein, similarity determines that device obtains the described mode that is contained in the one or more substring information in one of them character string information and is contained between one or more substring information in another character string information at least two types similarity information and described in detail in abovementioned steps S22, and be contained in this by reference, repeat no more.
Particularly, similarity determines that device according to described overall similarity information with describedly be contained in the one or more substring information in one of them character string information and be contained between one or more substring information in another character string information at least two types similarity information, determines that the mode of described final similarity information includes but not limited to:
1) obtaining split-string similarity information between the two pieces of character string information to be processed according to the one or more pieces of substring information contained in one of them and the one or more pieces of substring information contained in the other, and selecting whichever of the split-string similarity information and the overall similarity information has the higher similarity value or similarity grade as the final similarity information. The manner in which the similarity determination device obtains the split-string similarity information is the same as or similar to the manner in which it obtains the final similarity information in the aforementioned step S22, and is not repeated here.
For example, in the aforementioned step S4 the similarity determination device obtains one type of overall similarity information of 0.6 between character string information A and B, and, in the manner described in the aforementioned step S22, it obtains split-string similarity information of 0.83 between character string information A and B; the similarity determination device then selects the split-string similarity information, which has the higher similarity value, as the final similarity information.
2) processing the split-string similarity information and the at least one type of overall similarity information to obtain the final similarity information.
For example, in the aforementioned step S4 the similarity determination device obtains overall similarity information of 0.6 for the character string feature vector type and of 0.4 for the topic distribution type between character string information A and B, and, in the manner described in the aforementioned step S22, it obtains split-string similarity information of 0.83 between character string information A and B. With predetermined weights of 0.45 for the overall similarity information of the character string feature vector type, 0.47 for the overall similarity information of the topic distribution type, and 0.86 for the split-string similarity information, the similarity determination device takes the weighted sum of the two types of overall similarity information and the split-string similarity information to obtain the final similarity information = 0.45*0.6 + 0.47*0.4 + 0.86*0.83 = 1.1718.
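The two manners listed above can be sketched side by side. The function names and dictionary keys are hypothetical; the numbers are those of the two examples.

```python
def pick_higher(split_string_similarity, overall_similarity):
    """Manner 1): keep whichever similarity value is higher."""
    return max(split_string_similarity, overall_similarity)


def weighted_combination(values, weights):
    """Manner 2): weighted sum of overall and split-string similarities."""
    return sum(weights[name] * value for name, value in values.items())


# Manner 1), first example.
final_by_max = pick_higher(0.83, 0.6)

# Manner 2), second example.
values = {"feature_vector": 0.6, "topic_distribution": 0.4, "split_string": 0.83}
weights = {"feature_vector": 0.45, "topic_distribution": 0.47, "split_string": 0.86}
final_by_weighting = weighted_combination(values, weights)

print(final_by_max, round(final_by_weighting, 4))
```

Because the example weights do not sum to 1, the weighted result can exceed 1 (as 1.1718 does here), so results of the two manners are not on a common scale unless the weights are normalized.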
It should further be noted that the above examples serve only to better illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any implementation that determines the final similarity information according to the overall similarity information and at least two types of similarity information between one or more pieces of substring information contained in one piece of character string information and one or more pieces of substring information contained in the other falls within the scope of the present invention.
According to the method of this preferred variant, obtaining the final similarity information by jointly considering the overall similarity information and the split-string similarity information can improve the accuracy of the final similarity information between the two pieces of character string information to be processed.
Fig. 4 is a schematic structural diagram of a similarity determination device for obtaining similarity information between character string information according to one aspect of the present invention. The similarity determination device according to the present invention comprises a first obtaining device 1 and a first determining device 2.
The first obtaining device 1 obtains two pieces of character string information to be processed.
The manners in which the first obtaining device 1 obtains the two pieces of character string information to be processed include, but are not limited to:
1) obtaining two pre-stored pieces of character string information whose similarity is to be judged;
2) obtaining two pieces of character string information whose similarity the computer equipment to which the similarity determination device belongs, or other computer equipment, currently needs to judge.
For example, one of the two pieces of character string information comes from an input sequence currently entered by a user, and the other comes from text information that the computer equipment retrieved according to the former. The computer equipment to which the first obtaining device 1 belongs, or other computer equipment, currently needs to judge the similarity between the two in order to decide whether to provide the resource to which the text information belongs to the user; the first obtaining device 1 then obtains the two pieces of character string information provided by that computer equipment.
As another example, one of the two pieces of character string information comes from an input sequence entered by a user in an application program, and the other comes from the error correction dictionary of the computer equipment to which the application program belongs. That computer equipment currently needs to judge the similarity between the two in order to decide whether to offer the latter to the user as an error correction prompt; the first obtaining device 1 then obtains the two pieces of character string information provided by that computer equipment.
It should be noted that the above examples serve only to better illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any implementation that obtains two pieces of character string information to be processed falls within the scope of the present invention.
Then, the first determining device 2 determines the final similarity information between the two pieces of character string information according to at least two types of similarity information obtained based on them. The similarity information comprises at least one of the following two items: 1) similarity information between the two pieces of character string information; 2) similarity information between parts of the characters that the two pieces of character string information comprise. The similarity information includes, but is not limited to: 1) a similarity grade; 2) a similarity value. The at least two types of similarity information comprise at least two kinds of similarity information obtained by at least two kinds of character string similarity processing.
Specifically, the first determining device 2 obtains the at least two types of similarity information by the respective types of similarity processing and derives the final similarity information directly from them, for example by taking the mean, product, sum of squares, sum of reciprocals, or sum of logarithms of the similarity items as the final similarity information. Alternatively, the first determining device 2 first normalizes each similarity item and then derives the final similarity information from the normalized values. Alternatively, the first determining device 2 selects some of the obtained similarity items and derives the final similarity information from the selected items.
For example, the first determining device 2 applies Metaphone phonetic encoding to character string information A and character string information B and obtains the codes "KRM" and "KRL" respectively. It determines the pronunciation similarity information between character string information A and B as the ratio of the number of positions at which the two codes have the same character to the total number of characters in the two codes, = 2/(3+4) = 2/7, and, by querying a predetermined synonym dictionary, it finds that the two are synonyms. Then, according to a predetermined rule that, when the two pieces of character string information to be processed are synonyms, the final similarity information is their pronunciation similarity information multiplied by 2, the first determining device 2 takes 2*2/7 = 4/7 as the final similarity information between character string information A and B.
As another example, character string information A comprises substring information A1 and A2. By converting substring information A1 into character string information B, the first determining device 2 obtains an edit distance of 2 between substring information A1 and character string information B, and by converting substring information A2 into character string information B, it obtains an edit distance of 1 between substring information A2 and character string information B. In addition, the first determining device 2 obtains and stores the short-text expansion vectors of substring information A1, substring information A2 and character string information B, and obtains a vector distance of 1.755 between the short-text expansion vectors of substring information A1 and character string information B and a vector distance of 1.025 between the short-text expansion vectors of substring information A2 and character string information B. The first determining device 2 then takes the reciprocal of the sum of these values, 1/(2+1+1.755+1.025) = 0.173, as the final similarity information between character string information A and B.
It should be noted that the above examples serve only to better illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any implementation that determines the final similarity information between the two pieces of character string information according to at least two types of similarity information obtained based on them falls within the scope of the present invention.
Preferably, the at least two types include at least two of the following:
1) edit distance type;
Specifically, the similarity information of the edit distance type is obtained from the edit distance between the two character string information.
More preferably, a first-type similarity determining device (not shown) contained in the similarity determining device obtains the similarity information of the edit distance type according to the character change information associated with the edit operations performed in converting one of the two character string information into the other.
For example, the two character string information are "bai" and "bei". The edit operations performed by the first-type similarity determining device in converting "bai" into "bei" are "copy character b", "replace character a with character e" and "copy character i", so the character change information it obtains is "a → e". By querying a predetermined cost library, it finds that the change cost of "a → e" is 0.2, and therefore determines that the similarity of the edit distance type between "bai" and "bei" is 1-0.2 = 0.8.
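The "bai"/"bei" example can be reproduced with a cost-weighted edit distance. The following is only a minimal sketch: the function names, the cost table `COST_TABLE`, the default cost of 1.0 for insertions, deletions and unlisted substitutions, and the floor at 0 are all assumptions made for illustration, not details given in the text.

```python
def weighted_edit_cost(src, dst, sub_cost, default_cost=1.0):
    """Minimum total cost to turn src into dst (Levenshtein-style DP).

    sub_cost maps (old_char, new_char) to a substitution cost; copying a
    character is free; insertions, deletions and unlisted substitutions
    cost default_cost (an assumption, the text gives no such value).
    """
    rows, cols = len(src) + 1, len(dst) + 1
    dp = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * default_cost
    for j in range(1, cols):
        dp[0][j] = j * default_cost
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = src[i - 1], dst[j - 1]
            change = 0.0 if a == b else sub_cost.get((a, b), default_cost)
            dp[i][j] = min(
                dp[i - 1][j - 1] + change,    # copy or replace
                dp[i - 1][j] + default_cost,  # delete a
                dp[i][j - 1] + default_cost,  # insert b
            )
    return dp[-1][-1]


def edit_distance_similarity(src, dst, sub_cost):
    # Similarity = 1 - total change cost, floored at 0 (the floor is an assumption).
    return max(0.0, 1.0 - weighted_edit_cost(src, dst, sub_cost))


COST_TABLE = {("a", "e"): 0.2}  # hypothetical "cost library" entry from the example
similarity = edit_distance_similarity("bai", "bei", COST_TABLE)
print(similarity)
```

With the single table entry above, the cheapest conversion copies "b", replaces "a" with "e" at cost 0.2 and copies "i", which gives the similarity 0.8 stated in the example.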
2) pronunciation type;
Specifically, the similarity information of the pronunciation type is obtained by obtaining the phonetic notation or pinyin of the two character string information and determining the similarity between the phonetic notations or pinyin. For example, the pinyin of the two character string information to be processed are "baidu" and "paidu"; according to the ratio of the number of consonants and vowels that are identical at the same position to the total number of consonants and vowels, the similarity information of the pronunciation type of the two character string information is determined to be 0.75.
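The ratio in the "baidu"/"paidu" example can be sketched as follows. This is an illustrative reading only: the pinyin is assumed to be already split into consonant and vowel units, the function name is hypothetical, and using the longer sequence as the denominator is an assumption, since the example only covers sequences of equal length.

```python
def positional_match_ratio(units_a, units_b):
    """Share of positions holding the same consonant/vowel unit in both sequences.

    The denominator is the longer sequence length (an assumption; the
    example in the text only covers equal lengths).
    """
    total = max(len(units_a), len(units_b))
    if total == 0:
        return 1.0
    same = sum(1 for a, b in zip(units_a, units_b) if a == b)
    return same / total


# Pinyin already split into consonant and vowel units.
baidu = ["b", "ai", "d", "u"]
paidu = ["p", "ai", "d", "u"]
ratio = positional_match_ratio(baidu, paidu)  # 3 of 4 positions match
print(ratio)
```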
3) synonym matching type;
Specifically, the similarity information of the synonym matching type is obtained by judging whether the two character string information are synonyms, or by judging the likelihood that they are synonyms.
4) short text expansion type;
Specifically, the similarity determining device obtains the similarity information of the short text expansion type from the similarity between the short text expansion information of the two character string information.
5) character string feature vector type;
Specifically, a second-type similarity determining device (not shown) contained in the similarity determining device obtains the similarity information of the character string feature vector type according to two character string feature vectors, each derived from the retrieval results of one of the two character string information.
For example, the second-type similarity determining device performs a retrieval based on character string information A and obtains a plurality of web pages. It segments the text in these web pages into words, removes invalid keywords, counts the occurrences of repeated keywords, and so on, to obtain the character string feature vector of character string information A. The second-type similarity determining device then repeats this processing for character string information B to obtain its character string feature vector. Finally, the second-type similarity determining device obtains the similarity information of the character string feature vector type between character string information A and B from the vector distance between the two character string feature vectors.
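The feature-vector steps above (segment, drop invalid keywords, count, compare vectors) can be sketched as follows. The text only speaks of a "vector distance"; cosine similarity is used here as one common choice and is an assumption, as are the stop-word list, the simple regular-expression tokenizer standing in for word segmentation, and the made-up page texts.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is"}  # hypothetical invalid-keyword list


def feature_vector(page_texts):
    """Keyword-count vector built from the pages retrieved for one string."""
    counts = Counter()
    for text in page_texts:
        # A simple regex tokenizer stands in for real word segmentation.
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


def cosine_similarity(vec_a, vec_b):
    dot = sum(vec_a[k] * vec_b[k] for k in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up retrieval results for two strings A and B.
pages_a = ["Baidu is a search engine", "the search engine of China"]
pages_b = ["a search engine and a map", "search the web"]
sim_ab = cosine_similarity(feature_vector(pages_a), feature_vector(pages_b))
print(round(sim_ab, 3))
```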
6) topic distribution type;
Specifically, a third-type similarity determining device (not shown) contained in the similarity determining device obtains the similarity information of the topic distribution type according to the topics of a plurality of resource information related to each of the two character string information.
For example, the third-type similarity determining device performs a retrieval based on character string information A and obtains three web pages, whose predetermined topics are "news", "amusement" and "news"; it therefore determines that the topic distribution of character string information A includes "news: 2/3, amusement: 1/3". For character string information B, the third-type similarity determining device repeats the same operation and obtains a topic distribution that includes "news: 1/2, amusement: 1/4, recreation: 1/4". The third-type similarity determining device then sums, over the topics shared by character string information A and B, the mean of the two proportions, (2/3+1/2)/2 + (1/3+1/4)/2 = 7/8, and takes the result as the similarity information of the topic distribution type between character string information A and B.
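The topic distribution arithmetic can be checked with exact fractions. In this sketch the function names are hypothetical, and since the text does not list the pages retrieved for B, four pages whose topics match the stated proportions are assumed.

```python
from collections import Counter
from fractions import Fraction


def topic_distribution(page_topics):
    """Fraction of retrieved pages that carries each predetermined topic."""
    counts = Counter(page_topics)
    total = len(page_topics)
    return {topic: Fraction(n, total) for topic, n in counts.items()}


def topic_similarity(dist_a, dist_b):
    """Sum, over topics present in both distributions, of the mean share."""
    shared = dist_a.keys() & dist_b.keys()
    return sum(((dist_a[t] + dist_b[t]) / 2 for t in shared), Fraction(0))


dist_a = topic_distribution(["news", "amusement", "news"])
# B's pages are not listed in the text; four pages matching its shares are assumed.
dist_b = topic_distribution(["news", "news", "amusement", "recreation"])
similarity = topic_similarity(dist_a, dist_b)
print(similarity)
```

The topic "recreation" appears only in B's distribution, so it contributes nothing to the sum, which is how the example arrives at 7/8.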
It should be noted that the above examples merely illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any at least two kinds of similarity information obtained by at least two ways of processing character string similarity fall within the scope of the at least two types of similarity information of the present invention.
The similarity determining device according to the present invention can determine the final similarity information between two character string information comprehensively from multiple types of similarity information, and thus reflects more fully the similarity of the two character string information in pronunciation, glyph and/or meaning, so that the resulting similarity judgment is more accurate.
Fig. 5 is a schematic structural diagram of a similarity determining device for obtaining similarity information between character string information according to a preferred embodiment of the present invention. The similarity determining device according to this embodiment includes the first deriving means 1, the weight deriving means 3, and the first sub-determining device 23 contained in the first determining device.
The first deriving means 1 has been described in detail with reference to the embodiment shown in FIG. 4; that description is incorporated herein by reference and is not repeated.
The weight deriving means 3 obtains the weight information corresponding to each of the at least two types. The weight information includes, but is not limited to: 1) a weight grade; 2) a weight value.
Specifically, the ways in which the weight deriving means 3 obtains the weight information include, but are not limited to:
1) the weight deriving means 3 obtains the weight information corresponding to each of the at least two types according to a predetermined correspondence between weight information and types;
2) the weight deriving means 3 obtains the weight information corresponding to each of the at least two types through a first sub-deriving means (not shown) and a weight determining device (not shown) that it contains. The first sub-deriving means obtains the application environment information of the final similarity information; the weight determining device determines the weight information corresponding to each type according to the application environment information obtained by the first sub-deriving means.
The ways in which the first sub-deriving means obtains the application environment information of the final similarity information include, but are not limited to:
A) the first sub-deriving means obtains the application environment information provided by other equipment or other devices; for example, another device in the computer equipment requests the similarity determining device to determine the final similarity information between two character string information and provides its API (Application Programming Interface) to the similarity determining device; the first sub-deriving means then obtains the API that this other device provided to the similarity determining device and uses this API as the application environment information;
B) the first sub-deriving means detects the application program associated with the two obtained character string information to be processed, so as to obtain the application environment information; for example, on detecting that one of the two character string information to be processed was obtained from Word, the first sub-deriving means determines that the application environment information includes the identification information of the Word program, etc.
It should be noted that the above examples merely illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any implementation that obtains the application environment information of the final similarity information falls within the scope of the present invention.
The ways in which the weight determining device determines the weight information corresponding to each type, according to the application environment information obtained by the first sub-deriving means, include, but are not limited to:
A) when the at least two types comprise all types that the similarity determining device can adopt, the weight determining device determines the weight information corresponding to each of these types according to the application environment information.
Specifically, the weight determining device determines the weight information corresponding to each type according to a predetermined correspondence between application environment information and the weight information of each type.
For example, when all types comprise the edit distance type and the topic distribution type, and the application environment information includes the identification information of the Word program, the weight determining device determines from the correspondence that, in this case, the weight grade of the edit distance type is the first grade, the weight grade of the topic distribution type is the third grade, and so on.
B) the weight determining device obtains the weight information corresponding to each of the at least two types through a selecting device (not shown) and a sub-weight determining device (not shown) that it contains. The selecting device selects the at least two types from all types according to the application environment information; the sub-weight determining device obtains, according to the application environment information, the weight information corresponding to each of the at least two types selected by the selecting device.
Specifically, the selecting device selects the at least two types from all types according to the types predetermined as required in each application environment; the sub-weight determining device then obtains, according to the application environment information, the weight information corresponding to each of the at least two types selected by the selecting device.
For example, all types that the similarity determining device can adopt include the pronunciation type, the synonym matching type, the short text expansion type, the character string feature vector type and the topic distribution type. The selecting device judges from the API contained in the obtained application environment information that the current application environment is a search environment; then, according to the predetermined rule that the short text expansion type, the character string feature vector type and the topic distribution type are to be adopted in a search environment, it selects these three types from the five types above. The sub-weight determining device then obtains, according to the predetermined weight information of each type in a search environment, the weight information corresponding to each of the selected short text expansion type, character string feature vector type and topic distribution type.
It should be noted that the above examples merely illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any implementation that determines the weight information corresponding to each type according to the application environment information falls within the scope of the present invention.
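The selection and weight lookup described in way B) can be sketched with two lookup tables. Everything concrete here is an assumption for illustration: the environment keys, the type identifiers, the weight values and the function names do not come from the text, which only states that such predetermined correspondences exist.

```python
# Hypothetical predetermined tables; names and numbers are illustrative only.
TYPES_BY_ENVIRONMENT = {
    "search": ["short_text_expansion", "feature_vector", "topic_distribution"],
    "error_correction": ["edit_distance", "pronunciation"],
}
WEIGHTS_BY_ENVIRONMENT = {
    "search": {"short_text_expansion": 0.5, "feature_vector": 0.3,
               "topic_distribution": 0.2},
    "error_correction": {"edit_distance": 0.6, "pronunciation": 0.4},
}


def select_types(environment, available_types):
    """Selecting step: keep the available types this environment calls for."""
    wanted = TYPES_BY_ENVIRONMENT.get(environment, available_types)
    return [t for t in wanted if t in available_types]


def weights_for(environment, selected_types, default_weight=1.0):
    """Sub-weight step: look up the weight of every selected type."""
    table = WEIGHTS_BY_ENVIRONMENT.get(environment, {})
    return {t: table.get(t, default_weight) for t in selected_types}


ALL_TYPES = ["pronunciation", "synonym_match", "short_text_expansion",
             "feature_vector", "topic_distribution"]
selected = select_types("search", ALL_TYPES)
weights = weights_for("search", selected)
print(selected, weights)
```

Keeping the selection table separate from the weight table mirrors the split between the selecting device and the sub-weight determining device; an unknown environment falls back to all available types with a default weight.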
It should be noted that when the weight information need not be obtained according to the two character string information to be processed, the operations performed by the first deriving means 1 and the weight deriving means 3 have no fixed order.
The first sub-determining device 23 determines the final similarity information between the two character string information according to the at least two types of similarity information obtained based on the two character string information, in combination with the weight information corresponding to each type. The manner in which the first sub-determining device 23 obtains the at least two types of similarity information is the same as or similar to the manner in which the first determining device obtains them; it is incorporated herein by reference and is not repeated.
Specifically, the first sub-determining device 23 weights the obtained at least two types of similarity information according to the weight information, and obtains the final similarity information from the weighted similarity information.
For example, the first sub-determining device 23 obtains a pronunciation similarity information of 0.45 and a synonym matching similarity information of 0.26 between character string information A and B, and the weight deriving means 3 obtains a weight value of 0.4 for the pronunciation type and a weight value of 0.5 for the synonym matching type. The first sub-determining device 23 then multiplies the similarity information of each type by the weight value of that type and adds the products, obtaining the final similarity information = 0.4*0.45+0.5*0.26 = 0.31.
As another example, character string information A includes substring information A1 and A2. The first sub-determining device 23 obtains an edit distance of 2 between substring information A1 and character string information B, an edit distance of 1 between substring information A2 and character string information B, a vector distance of 1.755 between the short text expansion vectors of substring information A1 and character string information B, and a vector distance of 1.025 between the short text expansion vectors of substring information A2 and character string information B; the weight deriving means 3 obtains a weight value of 0.8 for the edit distance type and a weight value of 0.5 for the short text expansion type. The first sub-determining device 23 then takes the reciprocal of the weighted sum of these values, 1/(0.8*2+0.8*1+0.5*1.755+0.5*1.025) = 0.2639, as the final similarity information between character string information A and B.
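Both weighted combinations in the two examples above can be reproduced in a few lines. The function names and type identifiers are hypothetical; the numbers are those of the examples.

```python
def weighted_sum(similarities, weights):
    """Sum of weight * similarity over the types (first example)."""
    return sum(weights[t] * s for t, s in similarities.items())


def reciprocal_of_weighted_distances(distances, weights):
    """1 / sum(weight * distance) over (type, distance) pairs (second example)."""
    total = sum(weights[t] * d for t, d in distances)
    if total == 0:
        raise ValueError("weighted distance sum is zero")
    return 1.0 / total


final_1 = weighted_sum(
    {"pronunciation": 0.45, "synonym_match": 0.26},
    {"pronunciation": 0.4, "synonym_match": 0.5},
)

final_2 = reciprocal_of_weighted_distances(
    [("edit_distance", 2), ("edit_distance", 1),     # A1 to B, A2 to B
     ("short_text", 1.755), ("short_text", 1.025)],  # A1 to B, A2 to B
    {"edit_distance": 0.8, "short_text": 0.5},
)
print(round(final_1, 2), round(final_2, 4))
```

Note that the second combination works on distances rather than similarities, so a larger weighted sum yields a smaller final value.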
It should be noted that the above examples merely illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any implementation that determines the final similarity information between the two character string information according to the at least two types of similarity information obtained based on the two character string information, in combination with the weight information corresponding to each type, falls within the scope of the present invention. For example, the sum of logarithms, the sum of squares or the product of the weighted similarity information may be taken as the final similarity information; as another example, each similarity information may first be normalized and the weighted sum of the normalized values then taken as the final similarity information.
By taking into account the weight information corresponding to each type, the similarity determining device according to this embodiment makes the obtained final similarity information better match the needs of the application. For example, when the similarity determining device according to this embodiment is applied to an error correction system, higher weight information is given to the edit distance type and the pronunciation type; when it is applied to a search system, higher weight information is given to the short text expansion type, the character string feature vector type, the topic distribution type, etc. Furthermore, the similarity determining device according to this embodiment can automatically adjust the weight information of each type according to the application environment information and can select the types to be processed according to the application environment, so that it adapts to a variety of applications.
Fig. 6 is a schematic structural diagram of a similarity determining device for obtaining similarity information between character string information according to another preferred embodiment of the present invention. The similarity determining device according to this embodiment includes the first deriving means 1, and the classification apparatus 21 and the second sub-determining device 22 contained in the first determining device.
The first deriving means 1 has been described in detail with reference to the embodiment shown in FIG. 4; that description is incorporated herein by reference and is not repeated.
The classification apparatus 21 divides at least one of the two character string information to obtain the plurality of substring information that it contains.
Specifically, the classification apparatus 21 divides at least one of the two character string information according to one or more factors such as syllables, character encoding type, the language to which the characters belong, and/or keywords contained in a dictionary, to obtain the plurality of substring information that it contains.
For example, for the character string information "secondaryえろりんく", the classification apparatus 21 divides it into the substring information "secondary", "えろ" and "りんく", because the character string "secondary" and the character string "えろりんく" differ in character encoding type and "えろ" and "りんく" are each a word in the dictionary. For brevity, identifiers are used below to denote substring information: the character string information "secondaryえろりんく" is denoted character string information A, the substring information "secondary" is denoted substring information A1, "えろ" is denoted substring information A2, and "りんく" is denoted substring information A3. It should be noted that this is only an illustration and does not restrict the identifiers "A", "A1", "A2", "A3", etc. to these particular character strings.
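One way to sketch division by character type and dictionary is shown below. This is an assumption-laden illustration rather than the device's actual procedure: the script of a character is approximated by the first word of its Unicode name, the dictionary lookup is a greedy longest match, a run is split only when it decomposes entirely into dictionary words, and the sample string and dictionary are hypothetical.

```python
import unicodedata
from itertools import groupby


def script_of(ch):
    # First word of the Unicode name, e.g. "CJK", "HIRAGANA", "LATIN".
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def split_by_dictionary(run, dictionary):
    """Greedy longest-match split; unmatched characters stay single."""
    parts, i = [], 0
    while i < len(run):
        for j in range(len(run), i, -1):
            if run[i:j] in dictionary or j == i + 1:
                parts.append(run[i:j])
                i = j
                break
    return parts


def divide(text, dictionary):
    substrings = []
    for _, group in groupby(text, key=script_of):
        run = "".join(group)
        pieces = split_by_dictionary(run, dictionary)
        # Split the run only if it decomposes entirely into dictionary
        # words; otherwise keep it whole.
        if all(p in dictionary for p in pieces):
            substrings.extend(pieces)
        else:
            substrings.append(run)
    return substrings


DICTIONARY = {"りんご", "みかん"}  # hypothetical two-entry dictionary
result = divide("東京りんごみかん", DICTIONARY)
print(result)
```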
The second sub-determining device 22 determines the final similarity information between the two character string information according to at least two types of similarity information between one or more substring information contained in one of the character string information and one or more substring information contained in the other. The manner in which the second sub-determining device 22 obtains the similarity information between two substring information is the same as or similar to the manner in which the first determining device obtains the similarity information between two character string information, and is not repeated here.
Specifically, the second sub-determining device 22 processes the similarity information it has obtained to determine the final similarity information between the two character string information.
For example, the classification apparatus 21 obtains that character string information A includes substring information A1 and A2 and that character string information B includes substring information B1 and B2. The second sub-determining device 22 obtains a pronunciation type similarity information of 0.6 between substring information A1 and B1, a pronunciation type similarity information of 0.1 between substring information A1 and B2, an edit distance type similarity information of 0.2 between substring information A2 and B1, and an edit distance type similarity information of 0.8 between substring information A2 and B2. The second sub-determining device 22 then takes the mean of these similarity information as the final similarity information, obtaining the final similarity information = (0.6+0.1+0.2+0.8)/4 = 0.425.
It should be noted that the above examples merely illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any implementation that determines the final similarity information between the two character string information according to at least two types of similarity information between one or more substring information contained in one of the character string information and one or more substring information contained in the other, for example by sorting the similarity information from high to low and deriving the final similarity information from the top two, falls within the scope of the present invention.
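The averaging in the example, together with the top-two variant mentioned in the note, can be sketched as follows. The text does not say how the top two values are combined; taking their mean is an assumption, and the function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def mean_of_top_k(values, k):
    """Average of the k highest similarities (averaging is an assumption)."""
    return mean(sorted(values, reverse=True)[:k])


# The four substring-level similarity values from the example.
values = [0.6, 0.1, 0.2, 0.8]
final_mean = mean(values)               # (0.6 + 0.1 + 0.2 + 0.8) / 4
final_top2 = mean_of_top_k(values, 2)   # mean of 0.8 and 0.6
print(round(final_mean, 3), round(final_top2, 3))
```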
By dividing character string information into substring information, the similarity determining device according to this embodiment can obtain the final similarity information faster and reduce the consumption of system resources.
As one preferred scheme of this embodiment, the second sub-determining device 22 further includes a matching deriving means (not shown), a second deriving means (not shown) and a third sub-determining device (not shown).
The matching deriving means obtains all matching combination modes between all substring information contained in one character string information and all substring information contained in the other character string information.
For example, the classification apparatus 21 obtains that character string information A includes substring information A1, A2 and A3 and that character string information B includes substring information B1 and B2; the matching combination modes that the matching deriving means obtains between all substring information of character string information A and all substring information of character string information B are then as follows:
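[Figure BSA00000477888800291: table of all matching combination modes between substring information A1, A2, A3 and substring information B1, B2]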
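One possible enumeration of matching combination modes is sketched below. It assumes that substrings are grouped into consecutive runs and that the i-th group of A is matched with the i-th group of B; whether the original table also lists other pairings is not recoverable from the text, so this is an assumption, and the function names are hypothetical.

```python
def compositions(n, k):
    """All ways to write n as an ordered sum of k positive integers."""
    if k == 1:
        yield (n,)
        return
    for first in range(1, n - k + 2):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def group(items, sizes):
    """Cut items into consecutive groups of the given sizes."""
    groups, start = [], 0
    for size in sizes:
        groups.append("".join(items[start:start + size]))
        start += size
    return groups


def matching_modes(subs_a, subs_b):
    """Order-preserving modes: the i-th group of A is matched with the i-th group of B."""
    modes = []
    for k in range(1, min(len(subs_a), len(subs_b)) + 1):
        for sizes_a in compositions(len(subs_a), k):
            for sizes_b in compositions(len(subs_b), k):
                modes.append(list(zip(group(subs_a, sizes_a),
                                      group(subs_b, sizes_b))))
    return modes


modes = matching_modes(["A1", "A2", "A3"], ["B1", "B2"])
for mode in modes:
    print(mode)
```

Under these assumptions the enumeration contains the pairing of A1 with B1 and A2A3 with B2, and the pairing of A1A2 with B1 and A3 with B2, in addition to matching the two whole strings.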
Then, the second deriving means obtains at least two types of similarity information between the two character string information according to all the matching combination modes.
Specifically, the second deriving means obtains at least two types of similarity information between the substring information or substring combinations that are matched in each matching combination mode, so as to obtain at least two types of similarity information between the two character string information under that matching combination mode. The manner in which the second deriving means obtains at least two types of similarity information between two substring information, between two substring combinations, or between a substring information and a substring combination is the same as or similar to the manner in which the first determining device obtains at least two types of similarity information between two character string information, and is not repeated here.
For example, take matching combination mode one and matching combination mode two obtained by the matching deriving means. In matching combination mode one, the second deriving means obtains, between the matched substring information A1 and B1, an edit distance type similarity information of 0.8 and a pronunciation type similarity information of 0.3, and, between the substring combination A2A3 and substring information B2, an edit distance type similarity information of 0.05 and a pronunciation type similarity information of 0.88. In matching combination mode two, it obtains, between the matched substring combination A1A2 and substring information B1, an edit distance type similarity information of 0.3 and a pronunciation type similarity information of 0.2, and, between substring information A3 and substring information B2, an edit distance type similarity information of 0.07 and a pronunciation type similarity information of 0.25. By averaging the similarity information of each type, the second deriving means then obtains the edit distance type similarity information between character string information A and B = (0.8+0.05+0.3+0.07)/4 = 0.305, and the pronunciation type similarity information = (0.3+0.88+0.2+0.25)/4 = 0.4075.
As another example, with the edit distance type and pronunciation type similarity information of matching combination mode one and matching combination mode two as given above, the second deriving means finds that the edit distance type similarity information 0.8 between substring information A1 and B1 is higher than the predetermined edit distance threshold 0.7 and that the pronunciation type similarity information 0.88 between substring combination A2A3 and substring information B2 is higher than the predetermined pronunciation threshold 0.75. It therefore decides to obtain the edit distance type and pronunciation type similarity information between character string information A and B from matching combination mode one, and obtains the edit distance type similarity information between character string information A and B = (0.8+0.05)/2 = 0.425, and the pronunciation type similarity information = (0.3+0.88)/2 = 0.59.
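The two aggregation variants above can be sketched with the example's numbers. The rule used to pick a mode (each type must exceed its threshold for at least one matched pair) is one reading of the second example and is an assumption, as are the function and variable names.

```python
# Per-mode similarities of each matched pair: (edit_distance, pronunciation).
MODE_ONE = [(0.8, 0.3), (0.05, 0.88)]   # A1 with B1, A2A3 with B2
MODE_TWO = [(0.3, 0.2), (0.07, 0.25)]   # A1A2 with B1, A3 with B2


def average_by_type(pairs):
    """Column-wise mean: one value per similarity type."""
    return tuple(sum(col) / len(col) for col in zip(*pairs))


def pick_mode(modes, thresholds):
    """First mode in which every type exceeds its threshold for some pair."""
    for pairs in modes:
        if all(max(col) > limit for col, limit in zip(zip(*pairs), thresholds)):
            return pairs
    return None


# Variant 1: average over the pairs of all modes.
overall = average_by_type(MODE_ONE + MODE_TWO)

# Variant 2: keep only the mode that clears the thresholds 0.7 and 0.75.
chosen = pick_mode([MODE_ONE, MODE_TWO], thresholds=(0.7, 0.75))
selected = average_by_type(chosen)
print(overall, selected)
```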
It should be noted that the above examples merely illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that any implementation that obtains at least two types of similarity information between the two character string information according to all the matching combination modes falls within the scope of the present invention.
Then, the third sub-determining device determines the final similarity information between the two character string information according to the at least two types of similarity information between the two character string information. The manner in which the third sub-determining device does so is the same as or similar to the manner in which the first determining device determines the final similarity information between two character string information according to at least two types of similarity information obtained based on the two character string information, and is not repeated here.
By using the matching combination modes between substring information, the similarity determining device according to this preferred scheme can further speed up obtaining the final similarity information and reduce the consumption of system resources.
As one preferred scheme of this embodiment, the second sub-determining device 22 further includes a substring deriving means (not shown), a third deriving means (not shown), a fourth sub-determining device (not shown) and an iteration means (not shown).
The substring deriving means obtains current substring combination pair information from the two character string information. The current substring combination pair information includes substring information and/or substring combinations that belong to the two character string information respectively.
Specifically, the substring deriving means obtains the current substring combination pair information according to the positions of the substring information within their respective character string information, in combination with the operation record of the obtaining operations already performed.
The operation record includes, but is not limited to, at least one of the following:
1) the number of obtaining operations already performed;
2) the current substring combination pair information obtained last time;
3) the substring information belonging to one character string information and the substring information belonging to the other character string information that are contained in the current substring combination pair information obtained last time.
For example, the classification apparatus 21 obtains that character string information A includes substring information A1, A2 and A3 and that character string information B includes substring information B1 and B2, and the substring deriving means learns from the operation record that the current substring combination pair information obtained last time includes the substring combination A1A2 and the substring information B1. The substring deriving means then randomly selects either the substring combination A1A2A3 and the substring information B1, or the substring combination A1A2 and the substring combination B1B2, as the current substring combination pair information.
It should be noted that the substring deriving means can obtain the current substring combination pair information, and the two substring information to be processed that it contains, in various orders. For example, suppose character string information A to be processed includes substring information A1, A2 and A3, arranged from left to right in character string information A, and character string information B includes substring information B1 and B2, arranged from left to right in character string information B; the substring deriving means then obtains the current substring combination pair information in any of the following orders:
1)A1_B1、A1A2_B1、A1A2A3_B1、A1_B1B2、A1A2_B1B2、A1A2A3_B1B2;
2)A1_B1、A1_B1B2、A1A2_B1、A1A2_B1B2、A1A2A3_B1、A1A2A3_B1B2;
3)A3_B2、A2A3_B2、A1A2A3_B2、A3_B1B2、A2A3_B1B2、A1A2A3_B1B2;
4)A3_B2、A3_B1B2、A2A3_B2、A2A3_B1B2、A1A2A3_B2、A1A2A3_B1B2;
5)A1_B1、A1A2_B1、A1_B1B2、A1A2A3_B1、A1A2_B1B2、A1A2A3_B1B2;
6)A3_B2、A2A3_B2、A3_B1B2、A1A2A3_B2、A2A3_B1B2、A1A2A3_B1B2;
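The six orders listed above can be generated mechanically. The following Python sketch is only an illustration: the function names `spans` and `combination_order` and the integer `order` parameter are hypothetical and do not appear in the patent. Orders 1), 2) and 5) grow the combinations from the left; orders 3), 4) and 6) grow them from the right.

```python
def spans(parts, from_left):
    """Return the growing runs of substrings, anchored at the left or right end."""
    n = len(parts)
    return [parts[:k] if from_left else parts[n - k:] for k in range(1, n + 1)]


def combination_order(a_parts, b_parts, order):
    """List substring combination pairs in one of the six example orders (1..6)."""
    if order not in range(1, 7):
        raise ValueError("order must be between 1 and 6")
    from_left = order in (1, 2, 5)
    a_spans = spans(a_parts, from_left)
    b_spans = spans(b_parts, from_left)
    if order in (1, 3):
        # The A side varies fastest.
        pairs = [(a, b) for b in b_spans for a in a_spans]
    elif order in (2, 4):
        # The B side varies fastest.
        pairs = [(a, b) for a in a_spans for b in b_spans]
    else:
        # Orders 5 and 6: by total number of substrings, then by B-side length.
        pairs = sorted(
            ((a, b) for a in a_spans for b in b_spans),
            key=lambda p: (len(p[0]) + len(p[1]), len(p[1])),
        )
    return ["".join(a) + "_" + "".join(b) for a, b in pairs]


if __name__ == "__main__":
    for order in range(1, 7):
        print(order, combination_order(["A1", "A2", "A3"], ["B1", "B2"], order))
```

In every order the last pair is "A1A2A3_B1B2", that is, the pair that covers both pieces of character string information in full.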
It should further be noted that the above examples serve only to better illustrate the technical solution of the present invention and do not limit it. Those skilled in the art will understand that any implementation that obtains current substring combination pair information from the two pieces of character string information, together with the two pieces of pending substring information contained in that current substring combination pair information, falls within the scope of the present invention.
Then, the 3rd deriving means obtains at least two types of similarity information for each pending substring information pair contained in the current substring combination pair information. Each pending substring information pair comprises substring information and/or substring combinations belonging respectively to the two pieces of character string information, and the similarity information of the pair is the similarity information between the substring information and/or substring combinations it comprises. The way the 3rd deriving means obtains at least two types of similarity information for a pending substring information pair is the same as or similar to the way the first determining device obtains at least two types of similarity information between two pieces of pending character string information, and is not repeated here.
For example, character string information A comprises substring information A1, A2 and A3 arranged from left to right, and character string information B comprises substring information B1 and B2 arranged from left to right; the current substring combination pair information obtained by the substring deriving means is "A1A2A3_B1B2". The 3rd deriving means then, according to all possible matching combination modes of the substring information between the two substring combinations "A1A2A3" and "B1B2" in that pair, obtains two pending substring information pairs, "A2A3_B2" and "A3_B2". It obtains editing distance type and pronunciation type similarity information of 0.45 and 0.576, respectively, for "A2A3_B2", and of 0.61 and 0.5, respectively, for "A3_B2".
Then, the 4th sub-determining device determines the similarity information of the current substring combination pair information according to the at least two types of similarity information of each pending substring information pair and the historical similarity information.
For example, the at least two types of similarity information of each pending substring information pair are as given in the example above for the 3rd deriving means, and the historical similarity records already obtained by the similarity determining device are as follows:
Substring combination pair information    Similarity information
(Table image BSA00000477888800331 is not reproduced here.)
The 4th son determines that device determines pending substring information to the similarity information of " A2A3_B2 "=(0.45+0.576)/2=0.513, and pending substring information is to the similarity information of " A3_B2 "=(0.61+0.5)/2=0.555; Then the 4th son determines that device is 0.6 according to the similarity information that substring makes up information " A1_B1 ", determine at the similarity information=0.6*0.513=0.3078 of the down current substring combination of coupling array mode " A1 mates B1; A2 and A3 coupling B2 " information " A1A2A3_B1B2 ", at the similarity information=0.3*0.555=0.1665 of the down current substring combination of coupling array mode " A1 and A2 coupling B1, A3 mates B2 " to information " A1A2A3_B1B2 "; Then the 4th son determines that the bigger value 0.3078 of device selection is as the similarity information of current substring combination to information " A1A2A3_B1B2 ".
Then, the iteration means takes the similarity information of the current substring combination pair information as one item of historical similarity information, so that the substring deriving means, the 3rd deriving means and the 4th sub-determining device repeat their respective operations until the current substring combination pair information comprises the two pieces of character string information in full; the similarity information of that current substring combination pair information is then taken as the final similarity information between the two pieces of character string information.
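The iteration described above resembles a dynamic programme over left-anchored substring combinations. The following Python sketch shows one possible reading, under stated assumptions: combinations are visited in order 2) of the earlier list, each candidate is the product of a historical value and the similarity of the remaining substrings, and the largest candidate is kept. The names `final_similarity` and `pair_sim` are hypothetical, and `pair_sim` stands for whatever multi-type similarity the 3rd deriving means would supply. Unlike the worked example, the sketch also considers matching the whole combinations directly.

```python
from typing import Callable, Sequence


def final_similarity(
    a_parts: Sequence[str],
    b_parts: Sequence[str],
    pair_sim: Callable[[str, str], float],
) -> float:
    """Iterate over prefix pairs, reusing historical similarity values."""
    n, m = len(a_parts), len(b_parts)
    # history[i][j]: best similarity between the first i substrings of A
    # and the first j substrings of B; the empty pair is a neutral 1.0.
    history = [[0.0] * (m + 1) for _ in range(n + 1)]
    history[0][0] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = 0.0
            for k in range(i):
                for l in range(j):
                    if history[k][l] == 0.0:
                        continue  # no usable history for this split
                    tail = pair_sim("".join(a_parts[k:i]), "".join(b_parts[l:j]))
                    best = max(best, history[k][l] * tail)
            history[i][j] = best  # recorded as historical similarity
    return history[n][m]


if __name__ == "__main__":
    # Pair similarities taken from the worked example; other pairs score 0.
    table = {
        ("A1", "B1"): 0.6,
        ("A1A2", "B1"): 0.3,
        ("A2A3", "B2"): 0.513,
        ("A3", "B2"): 0.555,
    }
    score = final_similarity(
        ["A1", "A2", "A3"], ["B1", "B2"], lambda a, b: table.get((a, b), 0.0)
    )
    print(round(score, 4))
```

Because each prefix pair is scored once and then reused, the number of `pair_sim` calls grows polynomially with the number of substrings rather than with the number of matching combination modes.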
The similarity determining device according to this preferred solution can use the similarity information of historical substring combination pairs to further speed up obtaining the final similarity information and to reduce system resource consumption.
As one preferred solution of this embodiment, the similarity determining device according to this embodiment further comprises a 4th deriving means (not shown), and the second sub-determining device 22 further comprises a 5th sub-determining device (not shown).
The 4th deriving means obtains at least one type of overall similarity information between the two pieces of character string information. Here, overall similarity information is similarity information obtained directly from the two undivided pieces of character string information. The way the 4th deriving means obtains one type of overall similarity information is the same as or similar to the way the first determining device obtains one type of similarity information between two pieces of pending character string information, and is not repeated here.
The 5th sub-determining device determines the final similarity information between the two pieces of character string information according to at least two types of similarity information between one or more pieces of substring information contained in one piece of character string information and one or more pieces of substring information contained in the other, in combination with the at least one type of overall similarity information. The way the 5th sub-determining device obtains that at least two types of similarity information was described in detail in connection with the second sub-determining device 22, is incorporated here by reference, and is not repeated.
Specifically, the ways in which the 5th sub-determining device determines the final similarity information from the overall similarity information and from the at least two types of similarity information between the substring information of the two pieces of character string information include, but are not limited to:
1) Obtaining split-string similarity information between the two pieces of pending character string information according to the one or more pieces of substring information contained in one piece of character string information and the one or more pieces contained in the other, and selecting whichever of the split-string similarity information and the overall similarity information has the higher similarity value or similarity grade as the final similarity information. The way the 5th sub-determining device obtains the split-string similarity information is the same as or similar to the way the second sub-determining device 22 obtains the final similarity information, and is not repeated here.
For example, the 4th deriving means obtains one type of overall similarity information of 0.6 between character string information A and B, and the 5th sub-determining device obtains split-string similarity information of 0.83 between A and B; the 5th sub-determining device then selects the split-string similarity information, which has the higher similarity value, as the final similarity information.
2) Processing the split-string similarity information together with the at least one type of overall similarity information to obtain the final similarity information.
For example, the 4th deriving means obtains overall similarity information between character string information A and B of 0.6 for the character string feature vector type and 0.4 for the theme distribution type, and the 5th sub-determining device obtains split-string similarity information of 0.83 between A and B. Using predetermined weights of 0.45 for the character string feature vector type overall similarity information, 0.47 for the theme distribution type overall similarity information and 0.86 for the split-string similarity information, the 5th sub-determining device computes the weighted sum of the two types of overall similarity information and the split-string similarity information, obtaining final similarity information = 0.45*0.6+0.47*0.4+0.86*0.83=1.1718.
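The two ways of combining overall similarity information with the similarity obtained from the divided substrings can be sketched in Python as follows. The function names and dictionary keys are hypothetical; the numbers are those of the two examples above.

```python
def select_higher(split_sim, overall_sims):
    """Way 1): keep whichever similarity value is highest."""
    return max([split_sim, *overall_sims])


def weighted_sum(scores, weights):
    """Way 2): weighted sum over all similarity values, keyed by name."""
    return sum(weights[name] * scores[name] for name in scores)


if __name__ == "__main__":
    # Way 1): overall similarity 0.6 against divided-substring similarity 0.83.
    print(select_higher(0.83, [0.6]))

    # Way 2): two overall types plus the divided-substring similarity.
    scores = {"feature_vector": 0.6, "theme_distribution": 0.4, "split": 0.83}
    weights = {"feature_vector": 0.45, "theme_distribution": 0.47, "split": 0.86}
    print(round(weighted_sum(scores, weights), 4))
```

The weighted result exceeds 1 because the example weights sum to 1.78 rather than 1. Dividing by the sum of the weights would keep the result within the range of the inputs, but that normalization is an option not stated in the source.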
It should further be noted that the above examples serve only to better illustrate the technical solution of the present invention and do not limit it. Those skilled in the art will understand that any implementation that determines the final similarity information according to the overall similarity information and the at least two types of similarity information between one or more pieces of substring information contained in one piece of character string information and one or more pieces contained in the other falls within the scope of the present invention.
The similarity determining device according to this preferred solution obtains the final similarity information by jointly considering the overall similarity information and the split-string similarity information, which can improve the accuracy of the final similarity information obtained between two pieces of pending character string information.
To those skilled in the art, it is apparent that the invention is not limited to the details of the above exemplary embodiments and that it can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive; the scope of the invention is defined by the appended claims rather than by the above description, and all changes falling within the meaning and range of equivalents of the claims are therefore intended to be embraced by the invention. No reference sign in the claims should be regarded as limiting the claim concerned. Moreover, the word "comprising" obviously does not exclude other units or steps, and the singular does not exclude the plural. Several units or devices recited in a system claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.

Claims (25)

1. A computer-implemented method for determining similarity information between character string information based on multiple types, wherein the method comprises the following steps:
a. obtaining two pieces of pending character string information;
b. determining the final similarity information between the two pieces of character string information according to at least two types of similarity information obtained based on the two pieces of character string information.
2. The method according to claim 1, wherein the method further comprises the following step:
i. obtaining the weight information corresponding to each of the at least two types;
wherein step b further comprises:
- determining the final similarity information between the two pieces of character string information according to the at least two types of similarity information obtained based on the two pieces of character string information, in combination with the weight information corresponding to each type.
3. The method according to claim 2, wherein step i further comprises the following steps:
i1. obtaining application environment information of the final similarity information;
i2. determining the weight information corresponding to each type according to the application environment information.
4. The method according to claim 3, wherein step i2 further comprises the following steps:
- selecting the at least two types from among all types according to the application environment information;
- obtaining, according to the application environment information, the weight information corresponding to each of the selected at least two types.
5. The method according to any one of claims 1 to 4, wherein step b further comprises the following steps:
b1. dividing at least one of the two pieces of character string information to obtain a plurality of pieces of substring information contained in that at least one piece of character string information;
b2. determining the final similarity information between the two pieces of character string information according to at least two types of similarity information between one or more pieces of substring information contained in one piece of character string information and one or more pieces of substring information contained in the other.
6. The method according to claim 5, wherein step b2 further comprises the following steps:
- obtaining all matching combination modes between all the substring information contained in one piece of character string information and all the substring information contained in the other;
- obtaining at least two types of similarity information between the two pieces of character string information according to all the matching combination modes;
- determining the final similarity information between the two pieces of character string information according to the at least two types of similarity information between the two pieces of character string information.
7. The method according to claim 5, wherein step b2 further comprises the following steps:
b21. obtaining current substring combination pair information from the two pieces of character string information;
b22. obtaining at least two types of similarity information for each pending substring information pair contained in the current substring combination pair information;
b23. determining the similarity information of the current substring combination pair information according to the at least two types of similarity information of the pending substring information pairs and historical similarity information;
b24. taking the similarity information of the current substring combination pair information as one item of historical similarity information, repeating steps b21 to b23 and the foregoing step of taking the similarity information of the current substring combination pair information as one item of historical similarity information until the current substring combination pair information comprises the two pieces of character string information, and taking the similarity information of that current substring combination pair information as the final similarity information between the two pieces of character string information.
8. The method according to any one of claims 5 to 7, wherein the method further comprises the following step:
- obtaining at least one type of overall similarity information between the two pieces of character string information;
wherein step b2 further comprises the following step:
- determining the final similarity information between the two pieces of character string information according to at least two types of similarity information between one or more pieces of substring information contained in one piece of character string information and one or more pieces of substring information contained in the other, in combination with the at least one type of overall similarity information.
9. The method according to any one of claims 1 to 8, wherein the at least two types comprise any at least two of the following:
- editing distance type;
- pronunciation type;
- synonym matching type;
- short text expansion type;
- character string feature vector type;
- theme distribution type.
10. The method according to claim 9, wherein the at least two types comprise the editing distance type, and wherein the method further comprises the following step:
- determining the editing distance type similarity information between the two pieces of character string information according to character change information related to the editing operations performed in converting one of the two pieces of character string information into the other.
11. The method according to claim 9 or 10, wherein the at least two types comprise the character string feature vector type, and wherein the method further comprises the following step:
- determining the character string feature vector type similarity information between the two pieces of character string information according to two character string feature vectors obtained respectively based on the retrieval results of the two pieces of character string information.
12. The method according to any one of claims 9 to 11, wherein the at least two types comprise the theme distribution type, and wherein the method further comprises the following step:
- determining the theme distribution type similarity information between the two pieces of character string information according to the themes of a plurality of pieces of resource information respectively related to the two pieces of character string information.
13. A similarity determining device for determining similarity information between character string information, wherein the similarity determining device comprises:
a first deriving means, configured to obtain two pieces of pending character string information;
a first determining device, configured to determine the final similarity information between the two pieces of character string information according to at least two types of similarity information obtained based on the two pieces of character string information.
14. The similarity determining device according to claim 13, wherein the similarity determining device further comprises:
a weight deriving means, configured to obtain the weight information corresponding to each of the at least two types;
wherein the first determining device further comprises:
a first sub-determining device, configured to determine the final similarity information between the two pieces of character string information according to the at least two types of similarity information obtained based on the two pieces of character string information, in combination with the weight information corresponding to each type.
15. The similarity determining device according to claim 14, wherein the weight deriving means further comprises:
a first sub-deriving means, configured to obtain application environment information of the final similarity information;
a weight determining device, configured to determine the weight information corresponding to each type according to the application environment information.
16. The similarity determining device according to claim 15, wherein the weight determining device further comprises:
a selecting device, configured to select the at least two types from among all types according to the application environment information;
a sub-weight determining device, configured to obtain, according to the application environment information, the weight information corresponding to each of the selected at least two types.
17. The similarity determining device according to any one of claims 13 to 16, wherein the first determining device further comprises:
a classification apparatus, configured to divide at least one of the two pieces of character string information to obtain a plurality of pieces of substring information contained in that at least one piece of character string information;
a second sub-determining device, configured to determine the final similarity information between the two pieces of character string information according to at least two types of similarity information between one or more pieces of substring information contained in one piece of character string information and one or more pieces of substring information contained in the other.
18. The similarity determining device according to claim 17, wherein the second sub-determining device further comprises:
a matching deriving means, configured to obtain all matching combination modes between all the substring information contained in one piece of character string information and all the substring information contained in the other;
a second deriving means, configured to obtain at least two types of similarity information between the two pieces of character string information according to all the matching combination modes;
a 3rd sub-determining device, configured to determine the final similarity information between the two pieces of character string information according to the at least two types of similarity information between the two pieces of character string information.
19. The similarity determining device according to claim 17, wherein the second sub-determining device further comprises:
a substring deriving means, configured to obtain current substring combination pair information from the two pieces of character string information;
a 3rd deriving means, configured to obtain at least two types of similarity information for each pending substring information pair contained in the current substring combination pair information;
a 4th sub-determining device, configured to determine the similarity information of the current substring combination pair information according to the at least two types of similarity information of the pending substring information pairs and historical similarity information;
an iteration means, configured to take the similarity information of the current substring combination pair information as one item of historical similarity information, so that the substring deriving means, the 3rd deriving means and the 4th sub-determining device repeat their respective operations until the current substring combination pair information comprises the two pieces of character string information, and to take the similarity information of that current substring combination pair information as the final similarity information between the two pieces of character string information.
20. The similarity determining device according to any one of claims 17 to 19, wherein the similarity determining device further comprises:
a 4th deriving means, configured to obtain at least one type of overall similarity information between the two pieces of character string information;
wherein the second sub-determining device further comprises:
a 5th sub-determining device, configured to determine the final similarity information between the two pieces of character string information according to at least two types of similarity information between one or more pieces of substring information contained in one piece of character string information and one or more pieces of substring information contained in the other, in combination with the at least one type of overall similarity information.
21. The similarity determining device according to any one of claims 13 to 20, wherein the at least two types comprise any at least two of the following:
- editing distance type;
- pronunciation type;
- synonym matching type;
- short text expansion type;
- character string feature vector type;
- theme distribution type.
22. The similarity determining device according to claim 21, wherein the at least two types comprise the editing distance type, and wherein the similarity determining device further comprises:
a first type similarity determining device, configured to determine the editing distance type similarity information between the two pieces of character string information according to character change information related to the editing operations performed in converting one of the two pieces of character string information into the other.
23. The similarity determining device according to claim 21 or 22, wherein the at least two types comprise the character string feature vector type, and wherein the similarity determining device further comprises:
a second type similarity determining device, configured to determine the character string feature vector type similarity information between the two pieces of character string information according to two character string feature vectors obtained respectively based on the retrieval results of the two pieces of character string information.
24. The similarity determining device according to any one of claims 21 to 23, wherein the at least two types comprise the theme distribution type, and wherein the similarity determining device further comprises:
a 3rd type similarity determining device, configured to determine the theme distribution type similarity information between the two pieces of character string information according to the themes of a plurality of pieces of resource information respectively related to the two pieces of character string information.
25. A computer device, wherein the computer device comprises the similarity determining device according to at least one of claims 13 to 24.
CN 201110099437 2011-04-20 2011-04-20 Method, device and equipment used for determining similarity information among character string information Active CN102184169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110099437 CN102184169B (en) 2011-04-20 2011-04-20 Method, device and equipment used for determining similarity information among character string information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110099437 CN102184169B (en) 2011-04-20 2011-04-20 Method, device and equipment used for determining similarity information among character string information

Publications (2)

Publication Number Publication Date
CN102184169A true CN102184169A (en) 2011-09-14
CN102184169B CN102184169B (en) 2013-06-19

Family

ID=44570346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110099437 Active CN102184169B (en) 2011-04-20 2011-04-20 Method, device and equipment used for determining similarity information among character string information

Country Status (1)

Country Link
CN (1) CN102184169B (en)

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622338A (en) * 2012-02-24 2012-08-01 北京工业大学 Computer-assisted computing method of semantic distance between short texts
CN102955774A (en) * 2012-05-30 2013-03-06 华东师范大学 Control method and device for calculating Chinese word semantic similarity
CN103678272A (en) * 2012-09-17 2014-03-26 北京信息科技大学 Method for processing unknown words in Chinese-language dependency tree banks
CN103678655A (en) * 2013-12-23 2014-03-26 国家电网公司 Method and device for verifying information
CN104424279A (en) * 2013-08-30 2015-03-18 腾讯科技(深圳)有限公司 Text relevance calculating method and device
CN104462060A (en) * 2014-12-03 2015-03-25 百度在线网络技术(北京)有限公司 Method and device for calculating text similarity and realizing search processing through computer
CN104866985A (en) * 2015-05-04 2015-08-26 小米科技有限责任公司 Express bill number identification method, device and system
CN105095203A (en) * 2014-04-17 2015-11-25 阿里巴巴集团控股有限公司 Methods for determining and searching synonym, and server
CN106127222A (en) * 2016-06-13 2016-11-16 中国科学院信息工程研究所 The similarity of character string computational methods of a kind of view-based access control model and similarity determination methods
CN106446717A (en) * 2016-10-14 2017-02-22 深圳天珑无线科技有限公司 Information processing method, device and terminal
CN106484678A (en) * 2016-10-13 2017-03-08 北京智能管家科技有限公司 A kind of short text similarity calculating method and device
CN106598986A (en) * 2015-10-16 2017-04-26 北京国双科技有限公司 Similarity calculation method and apparatus
CN106649749A (en) * 2016-12-26 2017-05-10 浙江传媒学院 Chinese voice bit characteristic-based text duplication checking method
CN107564528A (en) * 2017-09-20 2018-01-09 深圳市空谷幽兰人工智能科技有限公司 A kind of speech recognition text and the method and apparatus of order word text matches
CN107688563A (en) * 2016-08-05 2018-02-13 中国移动通信有限公司研究院 A kind of recognition methods of synonym and identification device
CN107895251A (en) * 2016-12-24 2018-04-10 上海壹账通金融科技有限公司 Data error-correcting method and device
CN108255836A (en) * 2016-12-28 2018-07-06 普天信息技术有限公司 A kind of character string matching method and device
CN108399192A (en) * 2018-01-25 2018-08-14 链家网(北京)科技有限公司 A kind of cell information matching process and device
CN108664957A (en) * 2017-03-31 2018-10-16 杭州海康威视数字技术股份有限公司 Number-plate number matching process and device, character information matching process and device
CN109189809A (en) * 2018-10-17 2019-01-11 北京金堤科技有限公司 A kind of matched method and apparatus of shareholder's names associate
CN110348539A (en) * 2019-07-19 2019-10-18 知者信息技术服务成都有限公司 Short text correlation method of discrimination
CN110390015A (en) * 2019-07-23 2019-10-29 中国工商银行股份有限公司 A kind of data information processing method, apparatus and system
CN111324784A (en) * 2015-03-09 2020-06-23 阿里巴巴集团控股有限公司 Character string processing method and device
CN111444450A (en) * 2019-01-16 2020-07-24 北大方正集团有限公司 Method and device for determining reprinted data
CN111460215A (en) * 2020-03-30 2020-07-28 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium
CN111459789A (en) * 2019-08-28 2020-07-28 南京意博软件科技有限公司 Detection method and device for application programming interface
CN111488497A (en) * 2019-01-25 2020-08-04 北京沃东天骏信息技术有限公司 Similarity determination method and device for character string set, terminal and readable medium
CN111539197A (en) * 2020-04-15 2020-08-14 北京百度网讯科技有限公司 Text matching method and device, computer system and readable storage medium
CN113094559A (en) * 2021-04-25 2021-07-09 百度在线网络技术(北京)有限公司 Information matching method and device, electronic equipment and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106650803B (en) * 2016-12-09 2019-06-18 北京锐安科技有限公司 The method and device of similarity between a kind of calculating character string

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1434400A (en) * 2002-01-22 2003-08-06 住友电气工业株式会社 Method, device, program, and recording medium for character similarity calculation
CN101561813A (en) * 2009-05-27 2009-10-21 东北大学 Method for analyzing similarity of character string under Web environment
CN101702171A (en) * 2009-11-19 2010-05-05 新蛋信息技术(西安)有限公司 Approximating matching method for numerous character strings
CN102122298A (en) * 2011-03-07 2011-07-13 清华大学 Method for matching Chinese similarity

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
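Journal of the China Society for Scientific and Technical Information (《情报学报》) 20051231 Zhang Chengzhi (章成志) String similarity calculation model based on multi-layer features, Vol. 24, No. 06 *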

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622338A (en) * 2012-02-24 2012-08-01 北京工业大学 Computer-assisted computing method of semantic distance between short texts
CN102955774A (en) * 2012-05-30 2013-03-06 华东师范大学 Control method and device for calculating Chinese word semantic similarity
CN103678272A (en) * 2012-09-17 2014-03-26 北京信息科技大学 Method for processing unknown words in Chinese-language dependency tree banks
CN103678272B (en) * 2012-09-17 2016-04-06 北京信息科技大学 Method for processing unknown words in Chinese-language dependency tree banks
CN104424279A (en) * 2013-08-30 2015-03-18 腾讯科技(深圳)有限公司 Text relevance calculating method and device
CN104424279B (en) * 2013-08-30 2018-11-20 腾讯科技(深圳)有限公司 Text relevance calculating method and device
CN103678655B (en) * 2013-12-23 2017-02-08 国网浙江省电力公司 Method and device for verifying information
CN103678655A (en) * 2013-12-23 2014-03-26 国家电网公司 Method and device for verifying information
CN105095203B (en) * 2014-04-17 2018-10-23 阿里巴巴集团控股有限公司 Methods for determining and searching synonym, and server
CN105095203A (en) * 2014-04-17 2015-11-25 阿里巴巴集团控股有限公司 Methods for determining and searching synonym, and server
CN104462060A (en) * 2014-12-03 2015-03-25 百度在线网络技术(北京)有限公司 Method and device for calculating text similarity and realizing search processing through computer
CN104462060B (en) * 2014-12-03 2017-08-01 百度在线网络技术(北京)有限公司 Method and device for calculating text similarity and realizing search processing through computer
CN111324784B (en) * 2015-03-09 2023-05-16 创新先进技术有限公司 Character string processing method and device
CN111324784A (en) * 2015-03-09 2020-06-23 阿里巴巴集团控股有限公司 Character string processing method and device
CN104866985A (en) * 2015-05-04 2015-08-26 小米科技有限责任公司 Express bill number identification method, device and system
CN106598986A (en) * 2015-10-16 2017-04-26 北京国双科技有限公司 Similarity calculation method and apparatus
CN106598986B (en) * 2015-10-16 2020-11-27 北京国双科技有限公司 Similarity calculation method and device
CN106127222A (en) * 2016-06-13 2016-11-16 中国科学院信息工程研究所 Vision-based character string similarity calculation method and similarity determination method
CN107688563B (en) * 2016-08-05 2021-03-19 中国移动通信有限公司研究院 Synonym recognition method and recognition device
CN107688563A (en) * 2016-08-05 2018-02-13 中国移动通信有限公司研究院 Synonym recognition method and recognition device
CN106484678A (en) * 2016-10-13 2017-03-08 北京智能管家科技有限公司 Short text similarity calculating method and device
CN106446717A (en) * 2016-10-14 2017-02-22 深圳天珑无线科技有限公司 Information processing method, device and terminal
CN107895251A (en) * 2016-12-24 2018-04-10 上海壹账通金融科技有限公司 Data error-correcting method and device
CN106649749A (en) * 2016-12-26 2017-05-10 浙江传媒学院 Text duplicate checking method based on Chinese phoneme features
CN106649749B (en) * 2016-12-26 2019-07-16 浙江传媒学院 Text duplicate checking method based on Chinese phoneme features
CN108255836B (en) * 2016-12-28 2020-12-25 普天信息技术有限公司 Character string matching method and device
CN108255836A (en) * 2016-12-28 2018-07-06 普天信息技术有限公司 Character string matching method and device
CN108664957A (en) * 2017-03-31 2018-10-16 杭州海康威视数字技术股份有限公司 License plate number matching method and device, character information matching method and device
CN107564528B (en) * 2017-09-20 2020-12-15 广东惠禾科技发展有限公司 Method and equipment for matching voice recognition text with command word text
CN107564528A (en) * 2017-09-20 2018-01-09 深圳市空谷幽兰人工智能科技有限公司 Method and equipment for matching voice recognition text with command word text
CN108399192B (en) * 2018-01-25 2020-07-24 贝壳找房(北京)科技有限公司 Cell information matching method and device
CN108399192A (en) * 2018-01-25 2018-08-14 链家网(北京)科技有限公司 Cell information matching method and device
CN109189809A (en) * 2018-10-17 2019-01-11 北京金堤科技有限公司 Method and apparatus for associative matching of shareholder names
CN111444450A (en) * 2019-01-16 2020-07-24 北大方正集团有限公司 Method and device for determining reprinted data
CN111488497B (en) * 2019-01-25 2023-05-12 北京沃东天骏信息技术有限公司 Similarity determination method and device for character string set, terminal and readable medium
CN111488497A (en) * 2019-01-25 2020-08-04 北京沃东天骏信息技术有限公司 Similarity determination method and device for character string set, terminal and readable medium
CN110348539A (en) * 2019-07-19 2019-10-18 知者信息技术服务成都有限公司 Short text relevance discrimination method
CN110390015A (en) * 2019-07-23 2019-10-29 中国工商银行股份有限公司 Data information processing method, apparatus and system
CN111459789A (en) * 2019-08-28 2020-07-28 南京意博软件科技有限公司 Detection method and device for application programming interface
CN111459789B (en) * 2019-08-28 2023-11-03 南京意博软件科技有限公司 Detection method and device for application programming interface
CN111460215A (en) * 2020-03-30 2020-07-28 腾讯科技(深圳)有限公司 Audio data processing method and device, computer equipment and storage medium
CN111539197A (en) * 2020-04-15 2020-08-14 北京百度网讯科技有限公司 Text matching method and device, computer system and readable storage medium
CN111539197B (en) * 2020-04-15 2023-08-15 北京百度网讯科技有限公司 Text matching method and device, computer system and readable storage medium
CN113094559A (en) * 2021-04-25 2021-07-09 百度在线网络技术(北京)有限公司 Information matching method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN102184169B (en) 2013-06-19

Similar Documents

Publication Publication Date Title
CN102184169B (en) Method, device and equipment used for determining similarity information among character string information
KR102363369B1 (en) Generating vector representations of documents
WO2017114019A1 (en) Keyword recommendation method and system based on latent dirichlet allocation model
CN103491205B (en) Method and device for pushing related resource addresses based on video search
CN103914533B (en) Method and apparatus for displaying promoted search results
EP3051432A1 (en) Semantic information acquisition method, keyword expansion method thereof, and search method and system
CN109299383B (en) Method and device for generating recommended word, electronic equipment and storage medium
CN102081602B (en) Method and equipment for determining category of unlisted word
CN103092943B (en) Advertisement scheduling method and advertisement scheduling server
CN110427483B (en) Text abstract evaluation method, device, system and evaluation server
CN103699625A (en) Method and device for retrieving based on keyword
CN102110126A (en) Information retrieval method and device
CN108766451B (en) Audio file processing method and device and storage medium
CN104778283B (en) Microblog-based user occupation classification method and system
CN108108347B (en) Dialogue mode analysis system and method
CN102163228A (en) Method, apparatus and device for determining sorting result of resource candidates
CN103970748A (en) Related keyword recommending method and device
CN103927177B (en) Characteristic-interface digraph establishment method based on LDA model and PageRank algorithm
CN103534696A (en) Exploiting query click logs for domain detection in spoken language understanding
CN103744887A (en) Method and device for people search and computer equipment
CN103914569B (en) Input prompting method and device, and method and device for creating a dictionary tree model
CN108681564A (en) Method and apparatus for determining keywords and answers, and computer-readable storage medium
CN111611452A (en) Method, system, device and storage medium for ambiguity recognition of search text
CN102184195B (en) Method, device and equipment for acquiring similarity between character strings
CN111428011A (en) Word recommendation method, device, equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant