CN101030216A - Method for matching text string based on parameter characteristics - Google Patents

Method for matching text string based on parameter characteristics

Info

Publication number
CN101030216A
Authority
CN
China
Prior art keywords
character
array
text
term
influence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2007100487900A
Other languages
Chinese (zh)
Inventor
丁光耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CNA2007100487900A
Publication of CN101030216A
Priority to PCT/CN2008/070603 (WO2008119297A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A character string matching method based on characteristic parameters comprises: calculating the matching relationship between the characters of a text and the characters of a term (search word); calculating characteristic parameters from that matching relationship; calculating a characteristic matching degree from the characteristic parameters and the lengths of the text and the term; and outputting the calculated characteristic matching degree.

Description

Character string matching method based on characteristic parameters
Technical field
The present invention relates to a character string matching method, and in particular to a character string matching method based on characteristic parameters.
Background art
Dictionary retrieval is the most basic application of string matching technology. The retrieval techniques of existing dictionary retrieval products fall into two classes: those based on exact matching and those based on non-exact matching. Exact-match retrieval is not fault-tolerant, whereas non-exact-match retrieval tolerates a small number of errors in the user's input and is therefore of greater practical value.
For more than 40 years, research on non-exact string matching, in China and abroad, has relied on distance calculations based on error factors. The most commonly used are the Levenshtein distance and the edit distance (ED). The error factors that affect the distance mainly include insertion errors, deletion errors, replacement errors and exchange errors. Distance calculation based on error factors has some inherent problems that make dictionary retrieval results too coarse and limit fault tolerance. The problems are mainly the following:
1) Existing research on non-exact string matching by error-factor distance is oriented toward problem phenomena such as insertion, deletion, replacement, exchange and reversal errors. These error phenomena are not completely independent of one another, can be converted into one another, and take diverse forms; they are typical problem phenomena. For example, a replacement error can in essence be expressed qualitatively as one insertion error plus one deletion error, and an exchange error can likewise be expressed as one insertion error plus one deletion error. Some error factors are therefore not independent concepts, which is one of the main reasons why string matching has so far not formed a scientific classification system.
2) Descriptions of a string matching problem based on error factors are polymorphic, that is, the same problem admits several descriptions, and this directly affects the string matching result and the ordering of retrieval results. Table 1 shows this polymorphism for specific cases.
Text | Term | Description based on error factors
ABCDEFGH | ABCDEFGH | Exact match (substring match): no deletion, insertion, exchange or other error is allowed
ABCDEFGH | CDEF | Exact match (substring match): deletions before and after are allowed
ABCDEFGH | CDF | Non-exact match: 1. one deletion (E); or 2. some deletions before and after, plus one deletion (E) in the middle; or 3. one insertion (F); or 4. one replacement (E, F); etc.
ABCDEFGH | CEDF | Non-exact match: 1. one exchange (DE, ED); or 2. two replacements (D, E), (E, D); or 3. one insertion (D) and one deletion (D); etc.
ABCDEFGH | CEFD | Non-exact match: 1. one deletion (D) and one insertion (D); or 2. two insertions (C), (D); or 3. two insertions (E), (F); etc.
ABCDEFGH | ACEFXD | Non-exact match: 1. two deletions (B), (D) and two insertions (X), (D); or 2. two deletions (B), (D) and two replacements (G, X), (H, D); etc.
Table 1
In Table 1, leaving aside the quantitative effect of the distance calculation, the error-factor approach offers several qualitative descriptions of the matching problem for a given text and term. This polymorphism in describing one and the same problem makes classification and handling based on error factors difficult.
3) Non-exact string matching based on error-factor distance quantifies all the different error factors in a single number, as in the edit distance (ED) calculation. This blurs the nature of the different error factors in a match, so the matching result reflected by the distance is too coarse. For example, a distance of 2 may mean two insertion errors, two deletion errors, two replacement errors, or two errors of mixed kinds. Moreover, because the error-factor concepts are not independent and the nature of the errors in a match is uncertain, the matching situation cannot be refined further in terms of error factors. Consequently, when a matching degree is calculated in dictionary retrieval, no finer-grained parameters describing the exact matching situation are available, which hinders a reasonable ordering of the retrieved results.
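The coarseness described in point 3) can be illustrated with a minimal sketch. The function below is the textbook dynamic-programming Levenshtein distance (insertion, deletion, replacement, each with cost 1); it is not taken from the patent, and the four sample terms are hypothetical inputs chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance with unit-cost insert, delete and replace."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca by cb, or keep
        prev = curr
    return prev[-1]

text = "ABCDEFGH"
# Two insertions, two deletions, two replacements, one exchange of C and D.
for term in ("ABCDEFGHXY", "ABCDEF", "XBCDEFGY", "ABDCEFGH"):
    print(term, levenshtein(text, term))
```

All four terms come out at distance 2, although they differ from the text in four different ways, which is the loss of detail the paragraph above objects to.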
4) Existing dictionary retrieval rarely considers retrieval from the viewpoint of psychology, cognitive science, linguistics or behavioral science. In fact, depending on factors such as its position in the word, its pronunciation and its visual appearance, each character is cognitively salient to a different degree: some characters are easy to remember, while others are hard to remember or fade from memory over time. This is also the main cause of inexact input. Retrieval should therefore take into account how the cognitive differences among the characters of a word affect the dictionary retrieval result.
These problems show that the foundations of string matching are unsound and incomplete. They directly affect the reasonable ordering of dictionary retrieval results and limit fault tolerance, and although matching methods are diverse, they are hard to apply in an integrated way. The problems urgently need to be solved.
Summary of the invention
The object of the present invention is to overcome the above deficiencies of the prior art and to provide a character string matching method based on characteristic parameters that orders retrieval results more reasonably and has strong fault tolerance.
To achieve this object, in the character string matching method based on characteristic parameters of the present invention, given a text stored in a memory device and a term entered through an input device, an information processing device performs string matching based on characteristic parameters on the given text and term. The steps are:
Step A) calculate the matching relationship between the characters of the text and of the term;
Step B) calculate the characteristic parameters from the matching relationship between the characters of the text and of the term; the characteristic parameters comprise the dispersion number, which reflects the number of dispersed characters when the characters of the term appear in the text; the crossing number, which reflects the number of characters of the term that appear crossed (out of order) in the text; and the non-perfect number, which reflects the number of characters of the term that do not appear in the text;
Step C) calculate the characteristic matching degree from the characteristic parameters and the lengths of the text and the term;
Step D) output the characteristic matching degree.
Compared with the prior art, the beneficial effects of the present invention are:
First, Table 2 compares, by example, the existing descriptions based on error factors with the descriptions based on the three characteristics of the present invention, and makes the difference clear.
Text | Term | Existing description based on error factors | Description based on the three characteristics of the present invention
ABCDEFGH | ABCDEFGH | Exact match (substring match): no deletion, insertion, exchange or other error is allowed | Exact match (substring match): no dispersion, crossing or incompleteness is allowed
ABCDEFGH | CDEF | Exact match (substring match): only deletions before and after are allowed | Exact match (substring match): no dispersion, crossing or incompleteness is allowed
ABCDEFGH | CDF | Non-exact match: 1. one deletion (E); or 2. some deletions before and after, plus one deletion in the middle; or 3. one insertion (F); or 4. one replacement (E, F); etc. | Discrete match: only one dispersion (E)
ABCDEFGH | CEDF | Non-exact match: 1. one exchange (DE, ED); or 2. two replacements (D, E), (E, D); or 3. one insertion (D) and one deletion (D); etc. | Crossing match: only one crossing (D)
ABCDEFGH | CEFD | Non-exact match: 1. one deletion (D) and one insertion (D); or 2. two insertions (C), (D); or 3. two insertions (E), (F); etc. | Crossing match: only one crossing (D)
ABCDEFGH | ACEFXD | Non-exact match: 1. two deletions (B), (D) and two insertions (X), (D); or 2. two deletions (B), (D) and two replacements (G, X), (H, D); etc. | Discrete, crossing, non-perfect match: one dispersion (B), one crossing (D), one non-perfect character (X)
Table 2
As can be seen, the existing error-factor approach has several ways of describing the matching problem for a given text and term. This polymorphism in describing the same problem makes fine-grained classification and handling based on error factors difficult. With the three characteristics of the present invention, the same problem has only one description, which accurately reflects the correspondence between the characters of the text and those of the term.
For the two substring matching problems in Table 2, the existing error-factor approach gives two different descriptions; the present invention gives a single description for both, which is more consistent with the definition of a substring.
The comparison in the table above also makes clear the difference between the dispersion characteristic and a deletion error, between the crossing characteristic and an exchange error, and between incompleteness and an insertion error.
The three characteristic parameters adopted by the present invention, namely dispersion, crossing and incompleteness, differ entirely from one another in concept and nature and are mutually independent characteristics. Error factors are the outward manifestation of the three characteristics. Studying string matching in terms of the three characteristic parameters reveals the inherent regularities of the string matching problem more scientifically.
Second, because existing error-factor concepts are not independent and the nature of the errors in a match is uncertain, the matching situation cannot be refined further in terms of error factors. Consequently, when a matching degree is calculated in dictionary retrieval, no finer-grained parameters describing the exact matching situation are available, which hinders a reasonable ordering of the retrieved results.
By contrast, the three characteristic parameters adopted by the present invention, dispersion, crossing and incompleteness, differ entirely from one another in concept and nature and are mutually independent. When the characteristic matching degree is calculated from the three characteristic parameters, the influence of each characteristic on the matching degree can be considered separately, so the calculated characteristic matching degree reflects the similarity of text and term more accurately. The characteristic matching degree obtained according to the invention therefore orders all matching results for dictionary words more reasonably and greatly improves fault tolerance, overcoming the defect that error-factor distance results are too coarse for a comprehensive calculation of matching degree.
Embodiment
The character string matching method based on characteristic parameters of the present invention is described in further detail below with reference to embodiments.
An electronic English dictionary is an electronic dictionary database made up of English words. Retrieval in an electronic English dictionary means: given an input English word, the term P, perform a string matching computation against each word in the dictionary database, the text T, and, according to the matching results, output the words that satisfy the conditions in sorted order for the user to choose from.
The core technology of an electronic English dictionary is string matching. The matching result directly determines the final rank of every retrieved word and is an important indicator of the retrieval quality of an electronic English dictionary.
In this embodiment, the given text stored in the memory device is T = "t_1 t_2 … t_n" and the term entered through the input device is P = "p_1 p_2 … p_m", where t_i and p_j (1 ≤ i ≤ n, 1 ≤ j ≤ m) are characters and m and n are both greater than zero. The concrete steps of step A, calculating the matching relationship between the characters of the text and of the term, are:
Step a) Stable sort of the term P
All characters of the term P = "p_1 p_2 … p_m" are sorted in ascending order with a stable sort and stored in an array PT in memory, together with each character's original position in the term. The characters stored in PT are called the character sub-array PTc, and the positions stored in PT the position sub-array PTp;
Step b) Stable sort of the text T
All characters of the text T = "t_1 t_2 … t_n" are sorted in ascending order with a stable sort and stored in an array WT in memory, together with each character's original position in the text. The characters stored in WT are called the character sub-array WTc, and the positions stored in WT the position sub-array WTp;
Step c) Parameter initialization
An array POS in memory stores the corresponding position of each character, and all of its elements are initialized to -1; non-perfect number = 0; position W of array WT = 1; position P of array PT = 1; maximum position = 0; minimum position = n;
Step d) Check whether the loop is finished
If array WT or array PT has been compared to its end, go to step f;
Step e) Comparison
Depending on how the character stored at position W of WTc compares with the character stored at position P of PTc, one of the following cases applies:
If the character at position W of WTc < the character at position P of PTc, position W is increased by 1; go to step d;
If the character at position W of WTc > the character at position P of PTc, position P is increased by 1 and the non-perfect number is increased by 1; go to step d;
If the character at position W of WTc = the character at position P of PTc, the value stored at position W of WTp is stored in array POS, at the index given by the value stored at position P of PTp; if the value stored at position W of WTp > maximum position, that value is stored as the maximum position; if the value stored at position W of WTp < minimum position, that value is stored as the minimum position; position W is increased by 1 and position P is increased by 1; go to step d;
Step f) Finish
The array POS, the maximum position, the minimum position, the position P and the non-perfect number, which together represent the matching relationship between the characters of the text and of the term, are obtained.
Sorting the text and the term before calculating the character matching relationship in this way speeds up the computation.
The time complexity is O(k × log2 k), where k = max(m, n).
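Steps a) to f) can be sketched in Python as follows. This is an illustrative reading of the description, not the patentee's implementation: the function name match_relation is an assumption, positions are kept 1-based to match the text, and Python's built-in stable sort stands in for the unspecified stable ascending sort.

```python
def match_relation(text: str, term: str):
    """Sort-merge sketch of step A; positions are 1-based as in the description."""
    n, m = len(text), len(term)
    # sorted() is stable, so equal characters keep their original order.
    wt = sorted(enumerate(text, start=1), key=lambda e: e[1])  # (WTp, WTc) pairs
    pt = sorted(enumerate(term, start=1), key=lambda e: e[1])  # (PTp, PTc) pairs
    pos = [-1] * m              # POS: text position of each term character
    non_perfect = 0
    w = p = 0                   # 0-based cursors into wt and pt
    max_pos, min_pos = 0, n
    while w < n and p < m:      # step d): stop when either array is exhausted
        (w_pos, w_ch), (p_pos, p_ch) = wt[w], pt[p]
        if w_ch < p_ch:         # this text character is not used by the term
            w += 1
        elif w_ch > p_ch:       # term character has no unused occurrence in the text
            p += 1
            non_perfect += 1
        else:                   # match: record where the term character sits
            pos[p_pos - 1] = w_pos
            max_pos = max(max_pos, w_pos)
            min_pos = min(min_pos, w_pos)
            w += 1
            p += 1
    # Return P as a 1-based position so the later formulas can use it unchanged.
    return pos, max_pos, min_pos, p + 1, non_perfect

print(match_relation("ABCDEFGH", "ACEFXD"))
```

For the text ABCDEFGH and the term ACEFXD, POS records that X is unmatched (-1) and that D sits at text position 4, after F at position 6.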
In another embodiment, building on the previous embodiment, the steps of step B, calculating the characteristic parameters from the matching relationship between the characters of the text and of the term, are:
Step a) non-perfect number = non-perfect number + m - position P + 1;
Step b) dispersion number = maximum position - minimum position + 1 - m + non-perfect number;
Step c) crossing number = calculated from the contents of array POS.
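Steps a) and b) are plain arithmetic on the outputs of step A, as the following minimal sketch shows. The function name characteristic_counts is hypothetical, and the sketch assumes that at least one character matched; the case of no match is not addressed in the description.

```python
def characteristic_counts(m, max_pos, min_pos, p, non_perfect):
    """Steps a) and b) of step B; p is the 1-based position P left by step A."""
    # Step a): term characters never reached by the merge are also unmatched.
    non_perfect = non_perfect + m - p + 1
    # Step b): width of the matched span in the text, minus the matched characters.
    dispersion = max_pos - min_pos + 1 - m + non_perfect
    return non_perfect, dispersion

# Text ABCDEFGH, term CDF: step A leaves max 6, min 3, P = 4, non-perfect 0.
print(characteristic_counts(3, 6, 3, 4, 0))   # one dispersion (E)
# Text ABCDEFGH, term ACEFXD: step A leaves max 6, min 1, P = 6, non-perfect 0.
print(characteristic_counts(6, 6, 1, 6, 0))   # X unmatched, B dispersed
```

The two calls correspond to the CDF and ACEFXD rows of Table 2.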
The crossing number may be calculated as follows:
(1) find the length of the maximum interval ascending sequence (that is, the longest increasing subsequence) of array POS;
(2) crossing number = m - non-perfect number - length of the maximum interval ascending sequence.
The length of the maximum interval ascending sequence may be found as follows:
(1) Initialization
All values greater than zero in array POS are stored, in order, in a temporary array in memory, and an end mark is appended to the temporary array. If the temporary array contains 0 elements, the length 0 is returned directly; if it contains 1 element, the length 1 is returned directly. Otherwise, an array LPOS in memory is used to hold the maximum interval ascending sequence, and the value at the first position of LPOS is initialized to the first value in the temporary array; LP indicates the position of LPOS currently being processed and is initialized to 1; the second value in the temporary array is taken as the comparison data;
(2) Check whether finished
If the comparison data is the end mark, go to step (4);
(3) Handle one of two cases according to the comparison
If the comparison data is greater than the value at position LP of array LPOS, LP is increased by 1, the comparison data is stored at position LP of LPOS, and the next value in the temporary array is taken as the comparison data; go to step (2);
If the comparison data is less than the value at position LP of array LPOS, a binary search is performed in LPOS, starting from the first position, for the first value greater than the comparison data, and that value is overwritten with the comparison data; the next value in the temporary array is taken as the comparison data; go to step (2);
(4) Length of the maximum interval ascending sequence = LP.
During matching, array POS stores the position in the text at which each matched character of the term appears. The data in POS have the following property: apart from the value -1, all values are integers greater than 0 and are mutually distinct. Because the value -1 denotes an unmatched character, it is not counted in the maximum interval ascending sequence.
The length of the maximum interval ascending sequence of array POS means: the longest ascending sequence that can be picked out of POS, not necessarily from adjacent elements, according to the size of its data; the number of elements in that sequence is the length of the maximum interval ascending sequence.
The strict definition of the maximum interval ascending sequence and its length is as follows:
Definition: let a_1 a_2 … a_n (a_i ≠ a_j for i ≠ j) be any sequence whose elements can be compared. If there is a longest subsequence a_k1 a_k2 … a_km satisfying
1. 1 ≤ k1 < k2 < … < km ≤ n, and
2. a_k1 < a_k2 < … < a_km,
then the subsequence a_k1 a_k2 … a_km is called a maximum interval ascending sequence of the sequence a_1 a_2 … a_n, and its number of elements m is its length.
For example, for 7, 8, 9, 1, 2, 6, 3, 4, 12:
Maximum interval ascending sequence: 1, 2, 3, 4, 12;
Length of the maximum interval ascending sequence: 5
From the length of the maximum interval ascending sequence, the crossing number of a match between text and term can be obtained. The crossing number is used together with the dispersion number and the non-perfect number to calculate the characteristic matching degree, which makes it easy to order the retrieved texts and satisfy the user's retrieval requirements.
The time complexity of this algorithm is O(m log2 m).
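The length calculation and the crossing number can be sketched as follows. The names lis_length and crossing_number are assumptions, the list tails plays the role of array LPOS, and the standard-library function bisect_left stands in for the binary search of step (3); because the values in POS are distinct, it finds the first stored value greater than the comparison data.

```python
from bisect import bisect_left

def lis_length(pos):
    """Length of the maximum interval ascending sequence of the matched entries."""
    tails = []                               # plays the role of array LPOS
    for value in (v for v in pos if v > 0):  # -1 marks an unmatched character
        if not tails or value > tails[-1]:
            tails.append(value)              # extends the longest sequence so far
        else:
            # Overwrite the first stored value greater than the new value.
            tails[bisect_left(tails, value)] = value
    return len(tails)

def crossing_number(pos, non_perfect):
    """crossing number = m - non-perfect number - length, with m = len(pos)."""
    return len(pos) - non_perfect - lis_length(pos)

print(lis_length([7, 8, 9, 1, 2, 6, 3, 4, 12]))      # the example sequence above
print(crossing_number([1, 3, 5, 6, -1, 4], 1))       # text ABCDEFGH, term ACEFXD
```

Each element costs one binary search, which is consistent with the O(m log2 m) bound stated above.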
In another embodiment, building on the embodiments of steps A and B above, the steps of step C, calculating the characteristic matching degree from the characteristic parameters and the lengths of the text and the term, are:
Step a) Calculate the relevant character number of the characters actually matched between the term and the text
relevant character number = 2 × (m - non-perfect number);
Step b) Calculate influence factor 1 of the characteristic parameters on the characteristic matching degree
influence factor 1 = k_1 × crossing number;
Step c) Calculate influence factor 2 of the characteristic parameters on the characteristic matching degree
influence factor 2 = q_1 × non-perfect number + q_2 × crossing number + q_3 × dispersion number;
Step d) Calculate the characteristic matching degree
characteristic matching degree = (relevant character number - influence factor 1) ÷ (m + n + influence factor 2);
where k_1, q_1, q_2 and q_3 are the weight coefficients of the characteristic parameters in the characteristic matching degree; k_1 is a real number greater than or equal to zero and less than or equal to 2, and q_1, q_2 and q_3 are real numbers greater than or equal to zero. Different values of k_1, q_1, q_2 and q_3 can be chosen for different products and different applications, which changes the characteristic matching degree of the retrieved texts and hence their ordering. In one concrete application, the weight coefficients are k_1 = 2/3, q_1 = 1, q_2 = 2/3, q_3 = 1/3.
The influence factors are introduced so that, for different products and different application environments, the weight of each characteristic parameter's influence on the characteristic matching degree can be taken into account, making the ordering of retrieval results better suited to the requirements of users in a particular environment.
In this embodiment, the characteristic matching degree is a real number greater than or equal to zero and less than or equal to 1.
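The formula of step C translates directly into code. In the sketch below the function name matching_degree is an assumption, and the default coefficients are the values k_1 = 2/3, q_1 = 1, q_2 = 2/3, q_3 = 1/3 given above for one concrete application.

```python
def matching_degree(m, n, non_perfect, crossing, dispersion,
                    k1=2/3, q1=1.0, q2=2/3, q3=1/3):
    """Characteristic matching degree from the three characteristic parameters."""
    relevant = 2 * (m - non_perfect)                           # step a)
    factor1 = k1 * crossing                                    # step b)
    factor2 = q1 * non_perfect + q2 * crossing + q3 * dispersion  # step c)
    return (relevant - factor1) / (m + n + factor2)            # step d)

# Term "whta" against text "what": m = n = 4, one crossing, nothing else.
print(round(matching_degree(4, 4, 0, 1, 0), 3))
# Term "irror" against text "error": m = n = 5, one non-perfect character.
print(round(matching_degree(5, 5, 1, 0, 0), 3))
```

The two calls give 0.846 and 0.727, the values listed for "what" and "error" in Example one of the description.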
Example one
The following are retrieval results, with their characteristics, from an electronic English dictionary retrieval example designed according to the method of the embodiments of steps A, B and C above.
The electronic English dictionary database of this example contains 6,000 common English words; the weight coefficients are chosen as k_1 = 2/3, q_1 = 1, q_2 = 2/3, q_3 = 1/3; the retrieval results are sorted in descending order of the calculated characteristic matching degree, and only the first five words are output.
1. Discrete retrieval
When an English word is entered, any of its characters may be omitted.
For example, the target word is "wonderful"
Input characters: "wdfl"
Retrieval results: 1 wonderful, 2 handful, 3 unfold, 4 wind, 5 windy
Characteristic matching degree: 0.546, 0.487, 0.444, 0.375, 0.343
2. Crossing retrieval
When an English word is entered, any of its characters may be transposed (crossed).
For example, the target word is "what"
Input characters: "whta"
Retrieval results: 1 what, 2 wheat, 3 watch, 4 hat, 5 white
Characteristic matching degree: 0.846, 0.733, 0.625, 0.615, 0.581
3. Non-perfect characters allowed
When an English word is entered, wrong characters are allowed.
For example, the target word is "error"
Input characters: "irror"
Retrieval results: 1 mirror, 2 error, 3 terror, 4 terrorist, 5 territory
Characteristic matching degree: 0.909, 0.727, 0.667, 0.636, 0.622
4. Combined example
When an English word is entered, dispersion, crossing and incompleteness may occur together.
For example, the target word is "marvelous"
Input characters: "mvrilus"
Retrieval results: 1 marvelous, 2 various, 3 survival, 4 minus, 5 visual
Characteristic matching degree: 0.607, 0.600, 0.536, 0.522, 0.520
In this input, a, e and o are omitted, there is one wrong character i, and there is one crossing (v, r).
5. Special example
For example, the target word is "marvelous"
Input characters: "mrxxxxxxlus"
Retrieval results: 1 marvelous, 2 muscular, 3 marxist, 4 marxism, 5 luxurious
Characteristic matching degree: 0.366, 0.317, 0.312, 0.312, 0.302
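As a rough check of the examples above, steps A to C can be combined into one compact function and applied to a small candidate list. This is a sketch under stated assumptions: the function name is hypothetical, the candidate list contains only the five words reported for the input "whta" rather than the 6,000-word dictionary, and returning 0 when no character matches is an assumption, since the description does not cover that case.

```python
from bisect import bisect_left

def characteristic_matching_degree(text, term, k1=2/3, q1=1.0, q2=2/3, q3=1/3):
    n, m = len(text), len(term)
    # Step A: stable sort of both strings, then a merge that fills POS (1-based).
    wt = sorted(enumerate(text, start=1), key=lambda e: e[1])
    pt = sorted(enumerate(term, start=1), key=lambda e: e[1])
    pos, w, p = [-1] * m, 0, 0
    while w < n and p < m:
        if wt[w][1] < pt[p][1]:
            w += 1
        elif wt[w][1] > pt[p][1]:
            p += 1
        else:
            pos[pt[p][0] - 1] = wt[w][0]
            w += 1
            p += 1
    matched = [v for v in pos if v > 0]
    if not matched:
        return 0.0                      # assumption: no-match case is unspecified
    # Step B: the three characteristic parameters.
    non_perfect = m - len(matched)      # term characters without a partner
    dispersion = max(matched) - min(matched) + 1 - m + non_perfect
    tails = []                          # longest-increasing-subsequence tails
    for v in matched:
        if not tails or v > tails[-1]:
            tails.append(v)
        else:
            tails[bisect_left(tails, v)] = v
    crossing = m - non_perfect - len(tails)
    # Step C: characteristic matching degree.
    factor1 = k1 * crossing
    factor2 = q1 * non_perfect + q2 * crossing + q3 * dispersion
    return (2 * (m - non_perfect) - factor1) / (m + n + factor2)

candidates = ["white", "hat", "watch", "wheat", "what"]
ranked = sorted(candidates,
                key=lambda word: characteristic_matching_degree(word, "whta"),
                reverse=True)
for word in ranked:
    print(word, round(characteristic_matching_degree(word, "whta"), 3))
```

With these five candidates, the sketch reproduces the order and the degrees listed under "2. Crossing retrieval": what 0.846, wheat 0.733, watch 0.625, hat 0.615, white 0.581.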
In another embodiment of the character string matching method based on characteristic parameters of the present invention, each character of the text T = "t_1 t_2 … t_n" has a cognitive weight w stored with it, forming the cognitive weight sequence W = "w_1 w_2 … w_n" of the text, with w_1 + w_2 + … + w_n = 1. The cognitive weight w_i (1 ≤ i ≤ n) represents the probability that character t_i is recognized within the text "t_1 t_2 … t_n";
Building on the embodiments of steps A and B above, the steps of step C, calculating the characteristic matching degree from the characteristic parameters and the lengths of the text and the term, are:
Step a) Calculate the relevant character number of the characters actually matched between the term and the text
relevant character number = 2 × (m - non-perfect number);
Step b) Calculate influence factor 1 of the characteristic parameters on the characteristic matching degree
influence factor 1 = k_1 × crossing number;
Step c) Calculate influence factor 2 of the characteristic parameters on the characteristic matching degree
influence factor 2 = q_1 × non-perfect number + q_2 × crossing number + q_3 × dispersion number;
Step d) Calculate the sum of the cognitive weights of the matched characters of the text
According to the positions of the matched text characters recorded in array POS, obtain the sum of the cognitive weights of all matched characters;
Step e) characteristic matching degree = [(relevant character number - influence factor 1) ÷ (m + n + influence factor 2)] × cognitive weight sum.
Here k_1, q_1, q_2 and q_3 are the weight coefficients of the characteristic parameters in the characteristic matching degree; k_1 is a real number greater than or equal to zero and less than or equal to 2, and q_1, q_2 and q_3 are real numbers greater than or equal to zero. Different values of k_1, q_1, q_2 and q_3 can be chosen for different products and different applications, which changes the characteristic matching degree of the retrieved texts and hence their ordering.
The influence factors are introduced so that, for different products and different application environments, the weight of each characteristic parameter's influence on the characteristic matching degree can be taken into account, making the ordering of retrieval results better suited to the requirements of users in a particular environment.
Adding the influence of cognitive weights to the characteristic matching degree in this way is an improvement to its calculation. It conforms to people's actual cognitive differences, combines cognitive ideas from several disciplines such as psychology, behavioral science, linguistics and statistics, and is especially suited to dictionary retrieval. It strengthens the weight of characters that are easy to recognize and weakens the weight of error-prone characters, so the ordering of results by the resulting characteristic matching degree better meets the user's requirements.
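Steps d) and e) only scale the unweighted degree by the weights of the matched text characters, as in the following minimal sketch. The function name weighted_matching_degree is hypothetical; the weights used for "what" are the ones derived in Example two of this description, and the term "wht" is an illustrative input.

```python
def weighted_matching_degree(base_degree, pos, weights):
    """Scale the unweighted degree by the cognitive weights of matched characters.

    pos holds 1-based text positions (or -1 for unmatched term characters);
    weights holds one cognitive weight per text character and sums to 1.
    """
    weight_sum = sum(weights[p - 1] for p in pos if p > 0)   # step d)
    return base_degree * weight_sum                          # step e)

weights_what = [2 / 5.9, 1.2 / 5.9, 1.1 / 5.9, 1.6 / 5.9]    # w, h, a, t

# Term "wht" against text "what": POS = [1, 2, 4], one dispersion (a),
# so the unweighted degree is 6 / (3 + 4 + 1/3).
base = 6 / (3 + 4 + 1 / 3)
print(weighted_matching_degree(base, [1, 2, 4], weights_what))
```

When every text character is matched, the weight sum is 1 and the degree is unchanged; omitting a low-weight character such as "a" costs less than omitting the initial "w".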
The main factors that determine the cognitive weights are:
1) whether the character is in the first position of the word;
2) whether the character is the first character of a syllable of the pronunciation;
3) whether the character is pronounced regularly within its syllable, or has a clear effect on the pronunciation;
4) whether the character is visually prominent in the word;
5) whether the character is a character of the word; etc.
Example two
According to the aforementioned embodiment based on the character string matching method of characterisitic parameter that cognitive weights are arranged, this example increases to the electronics English dictionary retrieval of cognitive weights.The electronics English dictionary is the electronic dictionary storehouse that is made of English word and cognitive weights.
The method of method of this example two and example one is basic identical, each character that different only is among the text T, corresponding stored cognitive weights, increased of the influence of cognitive weights to the estimated performance matching degree.
Be a kind of method of calculating cognitive weights below:
1, character first position of word whether, score value 0.4;
2, whether character is the initial character of each syllable of pronunciation, score value 0.3;
3, character standard or whether act on obviously score value 0.1 whether in the pronunciation of syllable;
4, whether character vision in word is obvious, score value 0.2;
5, be character in the word, score value 1.
For example, consider the English word "what".
Character w satisfies conditions 1, 2, 3, 4 and 5, so the score of w = 0.4 + 0.3 + 0.1 + 0.2 + 1 = 2;
Character h satisfies conditions 4 and 5, so the score of h = 1.2;
Character a satisfies conditions 3 and 5, so the score of a = 1.1;
Character t satisfies conditions 2, 3, 4 and 5, so the score of t = 1.6.
The total score of the English word "what" is 5.9, and the cognitive weight of each character is:
cognitive weight of w = score of w / total score of "what" = 2/5.9;
cognitive weight of h = score of h / total score of "what" = 1.2/5.9;
cognitive weight of a = score of a / total score of "what" = 1.1/5.9;
cognitive weight of t = score of t / total score of "what" = 1.6/5.9.
This finally gives the cognitive weight sequence of the English word "what": 2/5.9, 1.2/5.9, 1.1/5.9, 1.6/5.9.
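The scoring scheme above can be sketched in Python as follows. This is only an illustrative sketch: the function name `cognitive_weights` and the representation of each character as a tuple of satisfied criterion numbers are assumptions made for the example, not part of the patent text. Exact fractions are used so that the normalized weights sum to exactly 1.

```python
from fractions import Fraction

# Score of each criterion (1 to 5), taken from the list in the text.
CRITERION_SCORE = {
    1: Fraction(4, 10),  # first character of the word
    2: Fraction(3, 10),  # initial character of a syllable
    3: Fraction(1, 10),  # standard pronunciation / obvious role in pronunciation
    4: Fraction(2, 10),  # visually prominent in the word
    5: Fraction(1),      # is a character of the word
}


def cognitive_weights(criteria_per_char):
    """Return one normalized weight per character.

    criteria_per_char: for each character of the word, the tuple of
    criterion numbers that the character satisfies (hypothetical input format).
    """
    scores = [sum(CRITERION_SCORE[c] for c in crit) for crit in criteria_per_char]
    total = sum(scores)
    return [s / total for s in scores]


# The word "what", using the criteria assigned to w, h, a, t in the text.
what = [(1, 2, 3, 4, 5), (4, 5), (3, 5), (2, 3, 4, 5)]
weights = cognitive_weights(what)
# weights are 20/59, 12/59, 11/59, 16/59, i.e. 2/5.9, 1.2/5.9, 1.1/5.9, 1.6/5.9
```

Deciding which criteria a character satisfies (syllable boundaries, visual prominence) is left outside this sketch, since the text does not specify how those judgments are made.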
From the embodiments above it can be seen that, compared with existing distance calculations based on error factors, the characteristic parameters used by the present invention to compute the characteristic matching degree are conceptually independent of one another, and the computed parameters reflect more finely how the text and the search term differ on each parameter. The characteristic matching degree computed from the three characteristic parameters therefore reflects the degree of match more reasonably, and the resulting ranking of dictionary search results better meets users' actual needs.
It can also be seen that the characteristic-parameter-based string matching method of the present invention has very strong fault-tolerant retrieval capability and is well suited to dictionary retrieval.
Although illustrative embodiments of the present invention have been described above, it should be understood that the invention is not limited to the scope of these embodiments. To those of ordinary skill in the art, variations are obvious as long as they fall within the spirit and scope of the invention as defined and determined by the appended claims, and all innovations that make use of the inventive concept are protected.

Claims (8)

1. A string matching method based on characteristic parameters, given a text stored in a storage device and a search term entered through an input device, characterized in that an information processing device performs characteristic-parameter-based string matching on the given text and the search term, the steps being:
Step A) compute the matching relationship between the characters of the text and of the search term;
Step B) compute the characteristic parameters from the matching relationship between the characters of the text and of the search term, the characteristic parameters comprising a dispersion number, which reflects how many characters of the search term appear scattered in the text; a crossing number, which reflects how many characters of the search term appear in the text in crossed (out-of-order) positions; and a non-perfect number, which reflects how many characters of the search term do not appear in the text;
Step C) compute the characteristic matching degree from the characteristic parameters and the lengths of the text and the search term;
Step D) output the characteristic matching degree.
2. The string matching method based on characteristic parameters according to claim 1, wherein the given text stored in the storage device is T = "$t_1 t_2 \dots t_n$" and the search term entered through the input device is P = "$p_1 p_2 \dots p_m$", where $t_i$ and $p_j$ (1 ≤ i ≤ n, 1 ≤ j ≤ m) are characters and m and n are both greater than zero, characterized in that the specific steps of step A, computing the matching relationship between the characters of the text and of the search term, are:
Step a) Stable sort of the search term P
Sort all characters of the search term P = "$p_1 p_2 \dots p_m$" in ascending order using a stable sort and store them in array PT in memory. Array PT also stores each character's original position in the search term. The characters stored in PT are called the character sub-array PTc, and the positions stored in PT are called the position sub-array PTp;
Step b) Stable sort of the text T
Sort all characters of the text T = "$t_1 t_2 \dots t_n$" in ascending order using a stable sort and store them in array WT in memory. Array WT also stores each character's original position in the text. The characters stored in WT are called the character sub-array WTc, and the positions stored in WT are called the position sub-array WTp;
Step c) Parameter initialization
Array POS in memory stores the corresponding position of each character, and all of its elements are initialized to -1; non-perfect number = 0; position W in array WT = 1; position P in array PT = 1; maximum position = 0; minimum position = n;
Step d) Check whether the loop has finished
If array WT has been fully compared or array PT has been fully compared, go to step f;
Step e) Comparison
Depending on how the character stored at position W of WTc compares with the character stored at position P of PTc, handle one of the following cases:
If the character stored at position W of WTc < the character stored at position P of PTc, increase position W by 1 and go to step d;
If the character stored at position W of WTc > the character stored at position P of PTc, increase position P by 1, increase the non-perfect number by 1, and go to step d;
If the character stored at position W of WTc = the character stored at position P of PTc, store the value held at position W of WTp in array POS, at the index given by the value held at position P of PTp; if the value held at position W of WTp > maximum position, store that value in maximum position; if the value held at position W of WTp < minimum position, store that value in minimum position; increase position W by 1, increase position P by 1, and go to step d;
Step f) Finish
This yields array POS, maximum position, minimum position, position P and the non-perfect number, which together represent the matching relationship between the characters of the text and of the search term.
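Steps a to f of claim 2 can be sketched in Python as below. This is a hedged illustration rather than the patent's implementation: the function name `match_relationship` and the tuple it returns are assumptions, positions are kept 1-based as in the claim, and Python's built-in `sorted` (which is stable) stands in for the stable sort.

```python
def match_relationship(text, term):
    """Merge-style comparison of the sorted text and sorted search term.

    Returns (pos, max_pos, min_pos, p, non_perfect), where pos[j-1] is the
    1-based text position matched to the j-th character of the term, or -1.
    """
    n, m = len(text), len(term)
    # Steps a and b: stable ascending sort by character, keeping original positions.
    wt = sorted(((ch, i) for i, ch in enumerate(text, start=1)), key=lambda e: e[0])
    pt = sorted(((ch, j) for j, ch in enumerate(term, start=1)), key=lambda e: e[0])

    # Step c: initialization (index 0 of pos is unused so positions stay 1-based).
    pos = [-1] * (m + 1)
    non_perfect = 0
    w = p = 1
    max_pos, min_pos = 0, n

    # Step d: stop when either sorted array has been fully compared.
    while w <= n and p <= m:
        text_char, text_position = wt[w - 1]
        term_char, term_position = pt[p - 1]
        # Step e: three-way comparison.
        if text_char < term_char:
            w += 1
        elif text_char > term_char:
            p += 1
            non_perfect += 1
        else:
            pos[term_position] = text_position
            max_pos = max(max_pos, text_position)
            min_pos = min(min_pos, text_position)
            w += 1
            p += 1

    # Step f: the matching relationship.
    return pos[1:], max_pos, min_pos, p, non_perfect


print(match_relationship("what", "waht"))  # ([1, 3, 2, 4], 4, 1, 5, 0)
```

Because both arrays are sorted, each character of the text and of the search term is visited at most once in the loop, so the comparison itself is linear in m + n after the two sorts.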
3. The string matching method based on characteristic parameters according to claim 2, characterized in that the steps of step B, computing the characteristic parameters from the matching relationship between the characters of the text and of the search term, are:
Step a) non-perfect number = non-perfect number + m - position P + 1;
Step b) dispersion number = maximum position - minimum position + 1 - m + non-perfect number;
Step c) crossing number = the result of the crossing number calculation performed on array POS.
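Steps a and b of claim 3 are plain arithmetic on the values produced by the matching step. A minimal Python sketch follows; the function names are hypothetical, and the sample inputs assume the text "what" matched against the search terms "wxt" and "waht".

```python
def final_non_perfect(non_perfect, m, p):
    """Step a: add the term characters left unprocessed when the loop stopped."""
    return non_perfect + m - p + 1


def dispersion_number(max_pos, min_pos, m, non_perfect):
    """Step b: text characters inside the matched span that belong to no match."""
    return max_pos - min_pos + 1 - m + non_perfect


# Text "what", term "wxt": the loop stops with p = 3, non-perfect number = 0,
# maximum position = 4 and minimum position = 1 (assumed outputs of the matching step).
np_wxt = final_non_perfect(0, m=3, p=3)                 # 1: the character x is unmatched
disp_wxt = dispersion_number(4, 1, m=3, non_perfect=np_wxt)  # 2: h and a lie between w and t
```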
4. The string matching method based on characteristic parameters according to claim 3, characterized in that the steps of step C, computing the characteristic matching degree from the characteristic parameters and the lengths of the text and the search term, are:
Step a) compute the relevant character number of the characters actually matched between the search term and the text:
relevant character number = 2 × (m - non-perfect number);
Step b) compute influence factor 1 of the characteristic parameters on the characteristic matching degree:
influence factor 1 = $k_1$ × crossing number;
Step c) compute influence factor 2 of the characteristic parameters on the characteristic matching degree:
influence factor 2 = $q_1$ × non-perfect number + $q_2$ × crossing number + $q_3$ × dispersion number;
Step d) compute the characteristic matching degree:
characteristic matching degree = (relevant character number - influence factor 1) ÷ (m + n + influence factor 2);
where $k_1$, $q_1$, $q_2$, $q_3$ are the weight coefficients of the characteristic parameters in the characteristic matching degree; $k_1$ is a real number greater than or equal to zero and less than or equal to 2; and $q_1$, $q_2$, $q_3$ are real numbers greater than or equal to zero. Different values of the weight coefficients $k_1$, $q_1$, $q_2$, $q_3$ may be chosen for different products and different applications, which affects the characteristic matching degree of the retrieved texts and hence their ranking.
5. The string matching method based on characteristic parameters according to claim 4, characterized in that the values of the weight coefficients $k_1$, $q_1$, $q_2$, $q_3$ are $k_1 = 2/3$, $q_1 = 1$, $q_2 = 2/3$, $q_3 = 1/3$.
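The formula of claim 4 with the coefficient values of claim 5 can be sketched in Python as follows. The function name `matching_degree` and its parameter names are assumptions for illustration; exact fractions are used only so that the sample results can be checked exactly.

```python
from fractions import Fraction


def matching_degree(m, n, non_perfect, crossing, dispersion,
                    k1=Fraction(2, 3), q1=Fraction(1),
                    q2=Fraction(2, 3), q3=Fraction(1, 3)):
    """Characteristic matching degree for term length m and text length n."""
    relevant = 2 * (m - non_perfect)                                  # step a
    factor1 = k1 * crossing                                           # step b
    factor2 = q1 * non_perfect + q2 * crossing + q3 * dispersion      # step c
    return (relevant - factor1) / (m + n + factor2)                   # step d


# Identical strings of length 4: all three parameters are 0, so the degree is 1.
exact = matching_degree(4, 4, 0, 0, 0)
# Text "what", term "waht": one crossed character, nothing missing or scattered.
swapped = matching_degree(4, 4, non_perfect=0, crossing=1, dispersion=0)  # 11/13
# Text "what", term "wxt": one missing character, dispersion 2.
missing = matching_degree(3, 4, non_perfect=1, crossing=0, dispersion=2)  # 6/13
```

With these coefficients a transposition ("waht") ranks above a term with a wrong character ("wxt"), which illustrates how the choice of weights shapes the ordering of results.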
6. The string matching method based on characteristic parameters according to claim 3, characterized in that the steps of the crossing number calculation are:
(1) find the length of the longest spaced ascending sequence (that is, the longest increasing subsequence, whose elements need not be adjacent);
(2) crossing number = m - non-perfect number - length of the longest spaced ascending sequence.
7. The string matching method based on characteristic parameters according to claim 6, characterized in that the steps for finding the length of the longest spaced ascending sequence are:
(1) Initialization
Store all values of array POS that are greater than zero, in order, in a temporary array in memory, and append an end mark to the temporary array. If the temporary array contains 0 elements, directly return 0 as the length of the longest spaced ascending sequence; if it contains 1 element, directly return 1. Otherwise, use array LPOS in memory to hold the longest spaced ascending sequence and initialize the first position of LPOS to the first value of the temporary array; LP indicates the position of LPOS currently being processed and is initialized to 1; fetch the second value of the temporary array as the comparison value;
(2) Check whether processing has finished
If the comparison value is the end mark, go to step (4);
(3) Handle two cases according to the comparison
If the comparison value is greater than the value at position LP of LPOS, increase LP by 1, store the comparison value at position LP of LPOS, fetch the next value of the temporary array as the comparison value, and go to step (2);
If the comparison value is less than the value at position LP of LPOS, binary-search LPOS from its first position onward for the first value greater than the comparison value and overwrite that value with the comparison value; fetch the next value of the temporary array as the comparison value and go to step (2);
(4) Length of the longest spaced ascending sequence = LP.
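Claims 6 and 7 can be sketched in Python as below. This is an illustrative sketch, not the patent's code: the function names are hypothetical, Python's `bisect` module replaces the hand-written binary search, and the explicit end mark and the special cases for 0 or 1 elements are folded into an ordinary loop. `bisect_left` is used because the matched positions in POS are distinct, in which case it finds the same element as "the first value greater than the comparison value".

```python
from bisect import bisect_left


def longest_ascending_length(pos):
    """Length of the longest increasing subsequence of the positive values in pos."""
    lpos = []  # lpos[k] = smallest tail value of an ascending sequence of length k + 1
    for value in (v for v in pos if v > 0):
        if not lpos or value > lpos[-1]:
            lpos.append(value)                    # extends the longest sequence found so far
        else:
            lpos[bisect_left(lpos, value)] = value  # overwrite the first larger tail
    return len(lpos)


def crossing_number(pos, m, non_perfect):
    """Claim 6: matched characters that fall outside the longest ascending sequence."""
    return m - non_perfect - longest_ascending_length(pos)


print(longest_ascending_length([1, 3, 2, 4]))  # 3
print(crossing_number([1, 3, 2, 4], 4, 0))     # 1
```

The binary search keeps the whole computation at O(m log m) for a search term of length m, instead of the O(m²) of a straightforward dynamic-programming approach.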
8. The string matching method based on characteristic parameters according to claim 3, characterized in that a cognitive weight w is stored for each character of the text T = "$t_1 t_2 \dots t_n$", forming the cognitive weight sequence of the text W = "$w_1 w_2 \dots w_n$", which satisfies $w_1 + w_2 + \dots + w_n = 1$; the cognitive weight $w_i$ (1 ≤ i ≤ n) represents the probability that character $t_i$ is recognized within the text "$t_1 t_2 \dots t_n$";
the steps of step C, computing the characteristic matching degree from the characteristic parameters and the lengths of the text and the search term, are:
Step a) compute the relevant character number of the characters actually matched between the search term and the text:
relevant character number = 2 × (m - non-perfect number);
Step b) compute influence factor 1 of the characteristic parameters on the characteristic matching degree:
influence factor 1 = $k_1$ × crossing number;
Step c) compute influence factor 2 of the characteristic parameters on the characteristic matching degree:
influence factor 2 = $q_1$ × non-perfect number + $q_2$ × crossing number + $q_3$ × dispersion number;
Step d) compute the sum of the cognitive weights of the matched characters of the text:
using the positions of the matched text characters stored in array POS, obtain the sum of the cognitive weights of all matched characters;
Step e) characteristic matching degree = [(relevant character number - influence factor 1) ÷ (m + n + influence factor 2)] × sum of cognitive weights;
where $k_1$, $q_1$, $q_2$, $q_3$ are the weight coefficients of the characteristic parameters in the characteristic matching degree; $k_1$ is a real number greater than or equal to zero and less than or equal to 2; and $q_1$, $q_2$, $q_3$ are real numbers greater than or equal to zero. Different values of the weight coefficients $k_1$, $q_1$, $q_2$, $q_3$ may be chosen for different products and different applications, which affects the characteristic matching degree of the retrieved texts and hence their ranking.
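The weighted variant of claim 8 can be sketched in Python as follows. The function name `weighted_matching_degree`, its parameter names, and the default coefficients (borrowed from claim 5) are assumptions for illustration. The sample weights are those derived for the word "what" in the description.

```python
from fractions import Fraction


def weighted_matching_degree(pos, weights, m, n, non_perfect, crossing, dispersion,
                             k1=Fraction(2, 3), q1=Fraction(1),
                             q2=Fraction(2, 3), q3=Fraction(1, 3)):
    """Claim 8: unweighted degree scaled by the cognitive weight of matched text characters.

    pos[j-1] is the 1-based text position matched to term character j, or -1;
    weights[i-1] is the cognitive weight of text character i.
    """
    relevant = 2 * (m - non_perfect)                                  # step a
    factor1 = k1 * crossing                                           # step b
    factor2 = q1 * non_perfect + q2 * crossing + q3 * dispersion      # step c
    weight_sum = sum(weights[t - 1] for t in pos if t > 0)            # step d
    return (relevant - factor1) / (m + n + factor2) * weight_sum      # step e


# Cognitive weights of "what" from the description: 2/5.9, 1.2/5.9, 1.1/5.9, 1.6/5.9.
what_weights = [Fraction(20, 59), Fraction(12, 59), Fraction(11, 59), Fraction(16, 59)]

# Term "wxt" against text "what": w and t match (positions 1 and 4), x does not.
degree = weighted_matching_degree([1, -1, 4], what_weights, m=3, n=4,
                                  non_perfect=1, crossing=0, dispersion=2)
```

Since the weights of a text sum to 1, a search term that matches every character of the text leaves the unweighted degree unchanged, while a partial match is scaled down according to how cognitively important the matched characters are.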
CNA2007100487900A 2007-04-02 2007-04-02 Method for matching text string based on parameter characteristics Pending CN101030216A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CNA2007100487900A CN101030216A (en) 2007-04-02 2007-04-02 Method for matching text string based on parameter characteristics
PCT/CN2008/070603 WO2008119297A1 (en) 2007-04-02 2008-03-27 Method for matching character string based on characteristic parameters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2007100487900A CN101030216A (en) 2007-04-02 2007-04-02 Method for matching text string based on parameter characteristics

Publications (1)

Publication Number Publication Date
CN101030216A true CN101030216A (en) 2007-09-05

Family

ID=38715562

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2007100487900A Pending CN101030216A (en) 2007-04-02 2007-04-02 Method for matching text string based on parameter characteristics

Country Status (2)

Country Link
CN (1) CN101030216A (en)
WO (1) WO2008119297A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008119297A1 (en) * 2007-04-02 2008-10-09 Guangyao Ding Method for matching character string based on characteristic parameters
CN102184195A (en) * 2011-04-20 2011-09-14 北京百度网讯科技有限公司 Method, device and device for acquiring similarity between character strings
CN102682033A (en) * 2011-03-17 2012-09-19 环达电脑(上海)有限公司 Method for querying words by matching binary characteristic values
CN101345957B (en) * 2008-08-20 2013-01-09 宇龙计算机通信科技(深圳)有限公司 Recognition method, system and mobile terminal for login cipher
CN107239500A (en) * 2017-05-03 2017-10-10 成都国腾实业集团有限公司 A kind of character string matching method and system
CN112182313A (en) * 2020-09-30 2021-01-05 国网青海省电力公司 Relay protection setting value name matching method and system
CN112508845A (en) * 2020-10-15 2021-03-16 福州大学 Depth learning-based automatic osd menu language detection method and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1158460A (en) * 1996-12-31 1997-09-03 复旦大学 Multiple languages automatic classifying and searching method
JP3924899B2 (en) * 1998-02-20 2007-06-06 富士ゼロックス株式会社 Text search apparatus and text search method
CN1392497A (en) * 2002-07-24 2003-01-22 彭泉 Matching method for large character string
CN1916896A (en) * 2006-09-08 2007-02-21 丁光耀 Matching method based on discrete, intersectional, not complete character string mode
CN101030216A (en) * 2007-04-02 2007-09-05 丁光耀 Method for matching text string based on parameter characteristics

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008119297A1 (en) * 2007-04-02 2008-10-09 Guangyao Ding Method for matching character string based on characteristic parameters
CN101345957B (en) * 2008-08-20 2013-01-09 宇龙计算机通信科技(深圳)有限公司 Recognition method, system and mobile terminal for login cipher
CN102682033A (en) * 2011-03-17 2012-09-19 环达电脑(上海)有限公司 Method for querying words by matching binary characteristic values
CN102184195A (en) * 2011-04-20 2011-09-14 北京百度网讯科技有限公司 Method, device and device for acquiring similarity between character strings
CN102184195B (en) * 2011-04-20 2014-01-08 北京百度网讯科技有限公司 Method, device and device for acquiring similarity between character strings
CN107239500A (en) * 2017-05-03 2017-10-10 成都国腾实业集团有限公司 A kind of character string matching method and system
CN112182313A (en) * 2020-09-30 2021-01-05 国网青海省电力公司 Relay protection setting value name matching method and system
CN112508845A (en) * 2020-10-15 2021-03-16 福州大学 Depth learning-based automatic osd menu language detection method and system

Also Published As

Publication number Publication date
WO2008119297A1 (en) 2008-10-09

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication