CN101930458B - Short message matching method based on characteristic value - Google Patents

Publication number
CN101930458B
Authority
CN
China
Prior art keywords
note
seed
eigenwert
short message
characteristic value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2010102566063A
Other languages
Chinese (zh)
Other versions
CN101930458A (en
Inventor
廖建新
王晶
王纯
李炜
张少杰
彭刚
钱苏林
朱晓民
张磊
徐童
张乐剑
沈奇威
樊利民
程莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dongxin Beiyou Information Technology Co Ltd
Original Assignee
Hangzhou Dongxin Beiyou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dongxin Beiyou Information Technology Co Ltd
Priority to CN2010102566063A
Publication of CN101930458A
Application granted
Publication of CN101930458B
Legal status: Expired - Fee Related
Anticipated expiration

Abstract

The invention discloses a short message matching method based on characteristic values, comprising the following steps: 1. initializing a seed short message library; 2. calculating the characteristic value set of a user short message, looking up in the seed short message library the seed short messages whose characteristic values coincide with each characteristic value in the set, and matching each seed short message found against the user short message one by one. Step 1 further comprises step A: calculating the characteristic value set of each seed short message in the seed short message library, and storing the content and the characteristic values of the seed short message. The invention belongs to the technical field of mobile communication. Because the characteristic value space is large, the method effectively reduces the number of short message comparisons and thereby improves matching efficiency under high traffic; the characteristic value set improves matching accuracy and lowers the miss rate. The invention supports both plaintext short messages and MD5-encrypted short messages.

Description

A short message matching method based on characteristic values
Technical field
The present invention relates to a short message matching method based on characteristic values, and belongs to the field of mobile communication technology.
Background
With the development of mobile communication technology, short message matching has been widely applied in a growing number of high-traffic short message services. For example, a short message marketing service lets users order or request short messages on demand. After ordering, a user becomes a seed user: the marketing platform delivers a seed short message to the seed user, who forwards it to other users, who in turn forward it further, forming a multi-level forwarding chain. A typical short message marketing deployment has 500,000 seed short messages and a traffic of 40,000 short messages per second; to avoid a backlog, 500,000 × 40,000 = 20,000,000,000 content comparisons would have to be completed every second. Another example is the monitoring and interception of spam short messages: as monitoring platforms and spamming platforms compete, the variety of illegal keywords and their variants keeps growing, and the monitoring platform must quickly identify illegal short messages against a massive keyword set under high traffic.
Although many methods exist for matching short messages (character strings), they all focus on matching two given strings. Because the efficiency of a string matching method itself has little room for improvement, such methods cannot achieve efficient short message matching at this data scale. How can short message matching efficiency be improved under high traffic? Some related solutions have been proposed:
Patent application CN 200810107117 (title: large-scale rapid matching method at sentence level; filing date: 2008-07-17; applicant: Anhui Kedaxunfei Science and Technology information Co., Ltd) relates to a large-scale rapid sentence-level matching method comprising three stages: index building, fuzzy matching and exact matching. The index building stage normalizes and transcodes the sentence content. The fuzzy matching stage selects, from a massive set of sentences, candidate sentences that may match a new sentence, keeping their number within a feasible range. The exact matching stage uses a similarity measure based on edit distance and ranks the candidate sentences by exact similarity to obtain the final match. The hash value computed in this scheme has a range of only 256 values, so as the number of candidate sentences grows the matching workload grows linearly and efficiency falls linearly with it. Moreover, when the matching degree of two short messages is very low, that is, when their contents are essentially unrelated, the scheme still goes on to compute their exact matching degree, which wastes a considerable share of the work. This scheme therefore improves short message matching efficiency under high traffic only to a very limited extent.
Improving short message matching efficiency under high traffic has therefore become a key technical problem for the large-scale deployment and adoption of high-traffic short message services such as short message marketing services and short message monitoring platforms.
Summary of the invention
In view of this, the purpose of the present invention is to provide a short message matching method based on characteristic values that improves short message matching efficiency under high traffic.
To achieve this purpose, the invention provides a short message matching method based on characteristic values, comprising:
Step 1: initialize the seed short message library;
Step 2: calculate the characteristic value set of the user short message; for each characteristic value in the set, look up in the seed short message library the seed short messages having the same characteristic value, and match each seed short message found against the user short message one by one, where a seed short message is short message content that needs to be matched.
Step 1 further comprises:
Step A: calculate the characteristic value set of every seed short message in the seed short message library, and store the content and characteristic values of each seed short message;
Step B: calculate the minimum matching length of each seed short message, that is, the seed short message length multiplied by the seed matching degree;
Step C: compare the minimum matching lengths of all seed short messages and select the smallest, which is denoted Lmin.
Step 2 further comprises:
Step 21, first length filter: determine whether the length of the user short message is less than Lmin. If so, no seed short message matches the user short message and the flow ends; if not, continue with step 22;
Step 22: calculate the characteristic value set of the user short message;
Step 23: extract the characteristic values one by one from the characteristic value set of the user short message, look up the seed short messages having the same characteristic value, match the user short message against each such seed short message, and finally return the seed short messages that meet the matching requirement as the matching result.
The characteristic value set of a short message in step A and step 2 is calculated as follows:
Step A1: split the short message at punctuation marks into several segments, and arrange the segments into a segment queue in the order in which they appear in the short message content;
Step A2: sort the segments in the segment queue by length in descending order, and take the N longest segments in order from the queue, where N is the characteristic value capacity and is set according to the host model;
Step A3: at each configured byte extraction position, extract one byte from each of the N segments as a characteristic value seed. The N characteristic value seeds taken at the same byte extraction position form one characteristic value seed group; seeds taken at different byte extraction positions belong to different groups;
Step A4: choose any M characteristic value seeds from each characteristic value seed group, arrange them in ascending order, and combine the ASCII codes of the arranged seeds into one integer; this integer is a characteristic value. All characteristic values obtained from all seed groups form the characteristic value set, where M is the characteristic value set capacity and is set according to the host model.
Step 23 further comprises:
Step 231: extract one characteristic value from the characteristic value set of the user short message;
Step 232: determine whether the seed short message library contains a seed short message with the same characteristic value. If so, continue with step 233; if not, go to step 239;
Step 233, second length filter: determine whether the length of the user short message is not less than the minimum matching length of the seed short message and not greater than its maximum matching length, where minimum matching length = seed short message length × seed matching degree, and maximum matching length = seed short message length + seed short message length × (1 − seed matching degree). If so, continue with step 234; if not, go to step 238;
Step 234: use the improved bit vector method to calculate the maximum possible matching degree of the seed short message and the user short message;
Step 235: determine whether the maximum possible matching degree given by the final bit vector value is less than the matching degree required by the seed short message. If so, go to step 238; if not, continue with step 236;
Step 236: use the improved edit distance method to calculate the edit distance between the seed short message and the user short message: if string A can be transformed into string B in N operations, N is the edit distance of A and B, and the exact matching degree of A and B is then computed from the ratio N / max(length(A), length(B));
Step 237: save the seed short message, together with the exact matching degree calculated from the edit distance, as a qualifying matching result;
Step 238: determine whether the seed short message library contains further seed short messages with the same characteristic value. If so, go to step 233; if not, go to step 239;
Step 239: determine whether the characteristic value set of the user short message contains characteristic values not yet extracted. If so, extract the next characteristic value and go to step 232; if not, return the qualifying seed short messages and their exact matching degrees as the matching result.
Compared with the prior art, the invention has the following beneficial effects. The invention computes a characteristic value set, and the chosen characteristic values are unique, easy to compute, efficient, discrete and non-linear. Because the characteristic value capacity is large, each user short message needs to be compared with only a few seed short messages, and as the number of seed short messages grows the number of comparisons grows only logarithmically, so the invention effectively reduces the number of comparisons and greatly improves matching efficiency under high traffic. Each short message can correspond to a characteristic value set of up to 70 × 2 characteristic values; using this set improves the accuracy and efficiency of matching and lowers the miss rate. The edit distance method computes the matching degree of two short messages precisely but is relatively slow, so before applying it the invention first performs an efficient bit vector match and computes the edit distance only when the highest possible matching degree reaches the required value, which further improves efficiency. Because practical business scenarios may require only partial matching, the invention allows a matching degree to be configured, so that reaching a given matching degree suffices. The invention supports both plaintext short messages and MD5-encrypted short messages.
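The idea of a cheap pre-check before the edit distance can be sketched as follows. The source does not spell out its improved bit vector method, so this sketch swaps in a different, simpler technique: a character-count bound. It also assumes that the matching degree is 1 − distance / max(length), which is an interpretation rather than a statement from the source. The function name `max_possible_match` is hypothetical.

```python
from collections import Counter


def max_possible_match(seed: str, user: str) -> float:
    """Upper bound on the matching degree of two strings.

    Assumption: matching degree = 1 - edit_distance / max(len(seed), len(user)).
    Every aligned pair of equal characters uses one character from each string,
    so the number of equal pairs cannot exceed the multiset overlap `common`.
    Hence edit_distance >= longest - common, and the degree is at most
    common / longest.
    """
    longest = max(len(seed), len(user))
    if longest == 0:
        return 1.0
    # Counter & Counter keeps the minimum count of every shared character.
    common = sum((Counter(seed) & Counter(user)).values())
    return common / longest


# A pair whose bound is already below the required degree can be skipped
# without running the more expensive edit distance calculation.
required_degree = 0.8
skip = max_possible_match("hello world", "hello there") < required_degree
```

The bound never underestimates the true degree under the stated assumption, so using it as a filter does not discard a pair that the edit distance would have accepted.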
Description of the drawings
Fig. 1 is a flowchart of the short message matching method based on characteristic values according to the invention.
Fig. 2 is a detailed flowchart of the seed short message library initialization.
Fig. 3 is a flowchart of the method for calculating the characteristic value set of a short message.
Fig. 4 is a schematic diagram of how the characteristic values and content of seed short messages are stored.
Fig. 5 is a detailed flowchart of step 2 in Fig. 1.
Fig. 6 is a detailed flowchart of step 23 in Fig. 5.
Detailed description of the embodiments
To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and an embodiment.
The terms used in the invention are explained as follows:
1. Seed short message: short message content that needs to be matched.
2. User short message: the actual short message content sent by a mobile user.
3. Short message segment: each short message is split at punctuation marks, and each piece of content after splitting is called a short message segment.
4. Characteristic value: the characteristic values of a short message are determined by its content. A given short message content always yields the same characteristic value, but one characteristic value may correspond to several short message contents.
5. MD5-encrypted short message: in some application scenarios, short message segments must be MD5-encrypted to keep the user's short message confidential. The MD5 ciphertext has a fixed length of 32 bytes, and changing any part of the seed content before encryption yields a different ciphertext. The MD5 ciphertext cannot be decrypted.
As shown in Fig. 1, the short message matching method based on characteristic values comprises the following steps:
Step 1: initialize the seed short message library;
Step 2: calculate the characteristic value set of the user short message; for each characteristic value in the set, look up in the seed short message library the seed short messages having the same characteristic value, and match each seed short message found against the user short message one by one.
The invention uses the characteristic value set to filter out most unrelated seed short messages, reducing the number of short message comparisons required. A characteristic value is like a person's DNA: a given person has definite DNA; members of the same family have similar DNA (a high matching degree); members of different families may share highly similar fragments, yet their DNA as a whole differs greatly. The characteristic values and characteristic value set of a short message behave similarly.
As shown in Fig. 2, step 1 of Fig. 1 further comprises the following steps:
Step A: calculate the characteristic value set of every seed short message in the seed short message library, and store the content and characteristic values of each seed short message;
Step B: calculate the minimum matching length of each seed short message, that is, the seed short message length multiplied by the seed matching degree. The minimum matching length is the length of the shortest user short message that can still satisfy the seed matching degree. Some seed short messages require a complete match, i.e. a seed matching degree of 100%, in which case the minimum matching length equals the seed short message length; others only require a certain matching degree, in which case the minimum matching length is the seed short message length multiplied by that matching degree;
Step C: compare the minimum matching lengths of all seed short messages and select the smallest, which is denoted Lmin.
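Steps B and C, together with the first length filter that later uses Lmin, can be sketched as follows. This is a minimal illustration under assumptions: the class and function names are hypothetical, length is measured in characters with `len()` rather than in bytes, and the product is left unrounded because the source does not specify a rounding rule.

```python
from dataclasses import dataclass


@dataclass
class Seed:
    text: str
    match_degree: float  # required matching degree, e.g. 1.0 for a complete match

    @property
    def min_match_len(self) -> float:
        # Step B: seed length multiplied by the seed matching degree.
        return len(self.text) * self.match_degree


def library_l_min(seeds) -> float:
    # Step C: the smallest minimum matching length over the whole library.
    return min(seed.min_match_len for seed in seeds)


def passes_first_length_filter(user_message: str, l_min: float) -> bool:
    # A user message shorter than Lmin cannot match any seed in the library.
    return len(user_message) >= l_min


seeds = [Seed("HAPPY NEW YEAR TO YOU", 1.0), Seed("SPECIAL OFFER TODAY", 0.8)]
L_MIN = library_l_min(seeds)
```

Because Lmin depends only on the seed library, it can be computed once at initialization and reused for every incoming user message.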
As shown in Fig. 3, the characteristic value set of a short message in step A of Fig. 2 is calculated as follows:
Step A1: split the short message at punctuation marks into several segments, and arrange the segments into a segment queue in the order in which they appear in the short message content.
Step A2: sort the segments in the segment queue by length in descending order, and take the N longest segments in order from the queue. N is the characteristic value capacity and can be set according to the host model: N may be 4 on a 32-bit host and 8 on a 64-bit host. For example, if the short message content is "A12345, B123456, C1234567, D12345678, E123456789", then after sorting the segments by length and counting from the longest, the 4 segments extracted are, in order, E123456789, D12345678, C1234567 and B123456. The invention supports both plaintext and MD5-encrypted short messages. For MD5-encrypted short messages, whose segments all have a length of 32 bytes, the invention can take the first N segments from the segment queue; the matching steps are otherwise the same as for plaintext short messages and are not repeated below.
Step A3: at each configured byte extraction position, extract one byte from each of the N segments as a characteristic value seed. The N characteristic value seeds taken at the same byte extraction position form one characteristic value seed group; seeds taken at different byte extraction positions belong to different groups. The byte extraction position can be set to the first byte of each segment. However, users are likely to edit the content when forwarding a short message, for example by adding relatively short phrases such as a salutation or honorific, or by modifying individual segments. Depending on the actual situation, the byte extraction positions can therefore be set to both the first byte and the last byte of each segment, so that the first bytes and the last bytes form two separate characteristic value seed groups.
Step A4: choose any M characteristic value seeds from each seed group, arrange them in ascending order, and combine the ASCII codes of the arranged seeds into one integer (left-padded with zeros if fewer than 4 bytes); this integer is a characteristic value, and all characteristic values obtained from all seed groups form the characteristic value set. M is the characteristic value set capacity and can also be set according to the host model, for example M = 3 on a 32-bit host and M = 4 on a 64-bit host.
The steps above are illustrated by an example. A 32-bit host is used, with N = 4 and M = 3. When the short message "A12345, B123456, C1234567, D12345678, E123456789" is received, the segments are first sorted by length and the 4 segments extracted are, in order, E123456789, D12345678, C1234567 and B123456. The first and last bytes of each segment are then taken as characteristic value seeds, giving the two seed groups E, D, C, B and 9, 8, 7, 6. Finally, any 3 seeds are chosen from each group and arranged in ascending order to form a characteristic value; the resulting characteristic value set is written {EDC, EDB, ECB, DCB, 987, 986, 976, 876}. For the first characteristic value, EDC, the ascending arrangement is C, D, E with ASCII codes 0x43, 0x44 and 0x45, so on a 32-bit host the characteristic value is 0x434445, left-padded with zeros to 4 bytes.
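Steps A1 to A4 and the worked example can be sketched in Python as follows. This is an illustration under assumptions, not the patented implementation: the punctuation set in `PUNCT`, the UTF-8 byte encoding, the function name `characteristic_values` and the big-endian packing are all choices made for this sketch. Seeds are arranged in ascending byte order, following the C, D, E arrangement of the example.

```python
import re
from itertools import combinations

# Assumed punctuation set; the source does not list the exact marks.
PUNCT = re.compile(r"[,.;:!?，。；：！？、]+")


def characteristic_values(message: str, n: int = 4, m: int = 3) -> set:
    # Step A1: split at punctuation, keeping segments in message order.
    pieces = (part.strip().encode("utf-8") for part in PUNCT.split(message))
    segments = [piece for piece in pieces if piece]
    # Step A2: sort by length, longest first, and keep the n longest segments.
    longest = sorted(segments, key=len, reverse=True)[:n]
    values = set()
    # Step A3: one seed group per extraction position (first byte, last byte).
    for position in (0, -1):
        group = [segment[position] for segment in longest]
        # Step A4: choose any m seeds, arrange them in ascending order and
        # pack their byte codes into one integer (high bytes are implicitly 0).
        for combo in combinations(group, m):
            values.add(int.from_bytes(bytes(sorted(combo)), "big"))
    return values


example = characteristic_values(
    "A12345, B123456, C1234567, D12345678, E123456789", n=4, m=3
)
# The seeds E, D, C arranged as C, D, E pack to 0x434445.
```

A message with fewer than m segments yields an empty set in this sketch, since no combination of m seeds exists; how the source treats such short messages is not stated.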
On a 64-bit host more characteristic value seeds can be taken, up to 4 at a time: any 4 of the 8 characteristic value seeds in a group are combined into one characteristic value. Since there are 70 ways of choosing 4 seeds out of 8, a short message with two seed groups corresponds to a characteristic value set of up to 70 × 2 characteristic values. Matching with the characteristic value set therefore greatly improves accuracy and lowers the miss rate. The characteristic values of the invention have the following properties:
1. Uniqueness: by construction, the characteristic value set of a given short message is uniquely determined.
2. Ease of computation: choosing characteristic values requires no complex operations; bytes are picked at the specified positions and combined with bit operations, which are very efficient.
3. Efficiency: each characteristic value in the set is an integer. A computer compares and operates on a 64-bit value as efficiently as on an 8-bit value, so comparing a 64-bit integer costs the same as comparing a single character and is far cheaper than comparing character strings.
4. Discreteness: on a 64-bit host there can be 8 characteristic value seeds per group. When 4 of the 8 seeds form each characteristic value, the value space is 2^32, i.e. more than 4.2 billion; even allowing for the 70 × 2 characteristic values of each short message, this leaves room for about 30 million seed short messages.
5. Non-linearity: in theory the range of a characteristic value is 256^4, i.e. more than 4.2 billion. Each short message has up to 70 × 2 = 140 characteristic values, so on average about 30 million seed short messages can be accommodated. In other words, for a seed library of fewer than 30 million short messages the theoretical number of comparisons is 1, independent of the library size.
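The capacity figures above can be checked with a few lines of arithmetic. The variable names are illustrative only, and the estimate assumes, as the text does, that characteristic values are spread evenly over the value space.

```python
from math import comb

values_per_group = comb(8, 4)               # ways to choose 4 seeds out of 8
values_per_message = values_per_group * 2   # two seed groups: first and last bytes
value_space = 256 ** 4                      # 4 bytes per characteristic value
seed_capacity = value_space // values_per_message

print(values_per_group, values_per_message, value_space, seed_capacity)
```

The last figure is the number of seed short messages that fit before characteristic values would, on average, start to be shared, which is where the rounded "30 million" in the text comes from.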
As shown in Fig. 4, in step A of Fig. 2 the characteristic values and content of the seed short messages can be stored in a sorted B+ tree container. Because each seed short message corresponds to at most 70 × 2 characteristic values, a seed short message node queue can hold the details of the seed short messages, with each seed short message corresponding to one node in the queue, while the characteristic value B+ tree stores each characteristic value together with a pointer to its seed short message.
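The storage layout of Fig. 4 can be approximated as follows. Note the substitution: this sketch uses a Python dictionary (a hash map) in place of the B+ tree named in the text, and list positions in place of pointers; the class name `SeedLibrary` and its methods are hypothetical.

```python
from collections import defaultdict


class SeedLibrary:
    def __init__(self):
        # Seed node queue: one record per seed short message.
        self.nodes = []
        # Index: characteristic value -> positions of seeds in self.nodes.
        # A dict stands in for the B+ tree described in the text.
        self.index = defaultdict(list)

    def add(self, text, match_degree, values):
        position = len(self.nodes)
        self.nodes.append((text, match_degree))
        for value in values:
            self.index[value].append(position)
        return position

    def lookup(self, value):
        # .get avoids creating empty entries for values that are not indexed.
        return [self.nodes[p] for p in self.index.get(value, [])]


library = SeedLibrary()
library.add("seed one", 1.0, {0x434445, 0x373839})
library.add("seed two", 0.8, {0x434445})
candidates = library.lookup(0x434445)
```

Storing the seed details once and indexing them by position keeps each characteristic value entry small, which matters when a single seed contributes up to 140 entries.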
As shown in Fig. 5, step 2 of Fig. 1 further comprises:
Step 21, first length filter: determine whether the length of the user short message is less than Lmin. If so, no seed short message matches the user short message and the flow ends; if not, continue with step 22.
Step 22: calculate the characteristic value set of the user short message, following the method flow of Fig. 3.
Step 23: extract the characteristic values one by one from the characteristic value set of the user short message, look up the seed short messages having the same characteristic value, match the user short message against each such seed short message, and finally return the seed short messages that meet the matching requirement as the matching result. As shown in Fig. 6, step 23 further comprises the following steps:
Step 231: extract one characteristic value from the characteristic value set of the user short message.
Step 232: determine whether the seed short message library contains a seed short message with the same characteristic value. If so, continue with step 233; if not, go to step 239.
Step 233, second length filter: determine whether the length of the user short message is not less than the minimum matching length of the seed short message and not greater than its maximum matching length, where minimum matching length = seed short message length × seed matching degree, and maximum matching length = seed short message length + seed short message length × (1 − seed matching degree). If so, continue with step 234; if not, go to step 238.
Step 234: use the improved bit vector method to calculate the maximum possible matching degree of the seed short message and the user short message. Step 234 further comprises:
Step 2341: during the bit vector calculation, determine whether the difference value already exceeds the maximum difference allowed by the required matching degree. If so, stop the bit vector calculation and go to step 238; if not, continue the bit vector calculation.
Step 235: determine whether the maximum possible matching degree given by the final bit vector value is less than the matching degree required by the seed short message. If so, go to step 238; if not, continue with step 236.
Step 236: use the improved edit distance method to calculate the edit distance between the seed short message and the user short message: if string A can be transformed into string B in N operations, N is the edit distance of A and B, and the matching degree of A and B is then computed from the ratio N / max(length(A), length(B)). Step 236 further comprises:
Step 2361: during the edit distance calculation, determine whether the distance value already exceeds the maximum distance allowed by the required matching degree. If so, stop the edit distance calculation and go to step 238; if not, continue the edit distance calculation.
Step 237: save the seed short message, together with the exact matching degree calculated from the edit distance, as a qualifying matching result.
Step 238: determine whether the seed short message library contains further seed short messages with the same characteristic value. If so, go to step 233; if not, go to step 239.
Step 239: determine whether the characteristic value set of the user short message contains characteristic values not yet extracted. If so, extract the next characteristic value and go to step 232; if not, return the qualifying seed short messages and their exact matching degrees as the matching result.
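The control flow of steps 21 and 231 to 239 can be sketched as follows. Several points are assumptions made for this sketch: the function names, the index shape (a dict from characteristic value to a list of `(seed_text, required_degree)` pairs), a plain Levenshtein distance standing in for the "improved" edit distance, and the reading of the matching degree as 1 − N / max(length), which agrees with the accuracy analysis where 10 changed characters out of 70 give about 85%. The coarse pre-check of steps 234 and 235 is passed in as a callable and defaults to a no-op.

```python
def bounded_edit_distance(a: str, b: str, limit: int):
    """Levenshtein distance, or None as soon as it must exceed `limit` (step 2361)."""
    if abs(len(a) - len(b)) > limit:
        return None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        if min(cur) > limit:  # the row minimum never decreases in later rows
            return None
        prev = cur
    return prev[-1] if prev[-1] <= limit else None


def match(user, user_values, index, l_min, coarse_bound=lambda seed, user: 1.0):
    results = {}
    if len(user) < l_min:                              # step 21
        return results
    for value in user_values:                          # steps 231 and 239
        for seed, degree in index.get(value, []):      # steps 232 and 238
            n = len(seed)
            if not (n * degree <= len(user) <= n + n * (1 - degree)):  # step 233
                continue
            if coarse_bound(seed, user) < degree:      # steps 234 and 235
                continue
            longest = max(n, len(user))
            # Largest distance still meeting the degree; 1e-9 absorbs float error.
            limit = int(longest * (1 - degree) + 1e-9)
            distance = bounded_edit_distance(seed, user, limit)  # step 236
            if distance is not None:
                results[seed] = 1 - distance / longest  # step 237
    return results


index = {0x434445: [("HAPPY NEW YEAR", 0.8), ("completely different text here", 1.0)]}
found = match("HAPPY NEW YEAR!", {0x434445}, index, l_min=11)
```

Ordering the checks from cheapest to most expensive means the edit distance, the costliest step, runs only for the few seeds that survive the index lookup and both length filters.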
The invention has been tested in a real service. Analysis of the test data confirms its effectiveness and accuracy: the invention solves the stated technical problem and achieves the expected technical effect:
1. Efficiency analysis: under high traffic, the dominant cost is comparing the contents of two short messages, and each short message has to be compared with many seed short messages. For example, with a seed library of 500,000 seed short messages, every short message would need 500,000 comparisons. Because the characteristic value capacity of the invention is very large, each user short message needs to be compared with only a few seed short messages, which can improve efficiency by a factor of more than ten thousand. Moreover, as the number of seed short messages grows, the number of comparisons does not grow linearly in proportion but only logarithmically, which effectively limits the growth in comparisons. The test data show that 100,000 seed short messages have 1,480,000 characteristic values, and each user short message shares a characteristic value with 32 seed short messages on average; 200,000 seed short messages have 2,350,000 characteristic values, with 41 seed short messages on average; 500,000 seed short messages have 4,570,000 characteristic values, with 53 seed short messages on average. The number of seed short messages sharing a characteristic value with a user short message thus grows roughly logarithmically, not linearly, with the number of seed short messages. Efficiency improves by a factor of at least 10,000, and considering that most of the seed short messages sharing a characteristic value are duplicates, the improvement is greater still: roughly 100,000 times in the actual test cases.
2. Accuracy analysis:
In an actual short message marketing service, the following cases arise:
(1) For user short messages that have not been altered, the seed short message and the user short message are identical and so are their characteristic values; the accuracy is therefore 100%;
(2) For altered user short messages: seed short messages are usually multi-segment greeting messages, such as "You have been tired for a year and should take a rest; today is Father's Day, and as your child I have grown into an adult and must not let you worry about me any more; I wish you a happy holiday and happiness forever" or "I know your love for me is like the sun, only you express it to me in the manner of the moon; I know your love for me is like the sea, only you express it to me in the manner of a brook; today is Father's Day, and I wish father happiness and good health." They generally have more than 4 segments, especially greetings written in parallel sentences. Actual user short messages fall into the following sub-cases:
1) In actual sending, the overwhelming majority of alterations add a salutation at the start or a signature, such as "Hello XX:". This only increases the number of segments and does not change the content of the original segments, so the corresponding characteristic values are unchanged and the accuracy is 100%;
2) If the alteration is made in the middle of the content, the characteristic value calculation is unaffected, because the characteristic value seeds are taken from the first and last bytes of each segment; the accuracy is again 100%;
3) Accuracy falls below 100% only when no punctuation is altered, 5 or more segments are modified, and both the beginning and the end of each of those segments are changed. In that case at least 10 characters have been modified, and a short message holds only 70 Chinese characters, so the highest possible matching degree is about 85% and the message need not be recognized as the seed short message.

Claims (6)

1. A short message matching method based on characteristic values, characterized in that said method comprises the following steps:
Step 1: initialize the seed short message library;
Step 2: calculate the characteristic value set of a user short message; for each characteristic value in said characteristic value set, search the seed short message library for the seed short messages consistent with said characteristic value, and match the seed short messages found against the user short message one by one, said seed short message being short message content that needs to be matched,
Said step 1 further comprises:
Step A: calculate the characteristic value set of every seed short message in the seed short message library, and store the short message content and characteristic values of said seed short message;
Step B: calculate the minimum matching length of each seed short message, i.e. multiply the seed short message length by the seed matching degree;
Step C: compare the minimum matching lengths of all seed short messages, select the smallest of them, and denote this smallest minimum matching length as Lmin,
Said step 2 further comprises:
Step 21, first length filter: determine whether the length of said user short message is less than Lmin; if so, there is no seed short message that matches the user short message, and the procedure ends; if not, continue with step 22;
Step 22: calculate the characteristic value set of the user short message;
Step 23: extract each characteristic value one by one from the characteristic value set of the user short message, search for the seed short messages consistent with said characteristic value, match the user short message against said seed short messages, and finally return the seed short messages that meet the matching requirement as the matching result,
The characteristic value set of a short message in said step A and step 2 is calculated as follows:
Step A1: split a short message at punctuation marks into several segments, and arrange said segments, in the order in which they appear in the short message content, into a segment queue;
Step A2: sort the segments in said segment queue by length in descending order, and take the N longest segments from the queue in that order, where N is the characteristic value capacity and its value is set according to the host model;
Step A3: at each configured byte extraction position, extract the byte at the corresponding position of each of said N segments as a characteristic value seed; the N characteristic value seeds corresponding to the same byte extraction position form one characteristic value seed group, and characteristic value seeds extracted at different byte extraction positions belong to different characteristic value seed groups;
Step A4: select any M characteristic value seeds from each characteristic value seed group, arrange said M characteristic value seeds in forward order, and combine the ASCII codes of the arranged seeds into one integer, said integer being a characteristic value; all characteristic values contained in all characteristic value seed groups form the characteristic value set, where M is the characteristic value set capacity and its value is set according to the host model,
Said step 23 further comprises:
Step 231: extract one characteristic value from the characteristic value set of the user short message;
Step 232: determine whether the seed short message library contains a seed short message consistent with said characteristic value; if so, continue with step 233; if not, go to step 239;
Step 233, second length filter: determine whether the length of the user short message is not less than the minimum matching length of the seed short message and not greater than the maximum matching length of the seed short message, where the minimum matching length of the seed short message = seed short message length × seed matching degree, and the maximum matching length of the seed short message = seed short message length + seed short message length × (1 − seed matching degree); if so, continue with step 234; if not, go to step 238;
Step 234: calculate the maximum matching degree of the seed short message and the user short message using an improved bit-vector method;
Step 235: determine whether the maximum matching degree that can be reached according to the finally calculated bit-vector value is less than the matching degree required by said seed short message; if so, go to step 238; if not, continue with step 236;
Step 236: calculate the edit distance between the seed short message and the user short message using an improved edit distance method: if string A can be changed into string B through N operations, the edit distance between A and B is N, and the exact matching degree of A and B is then N/max(length(A), length(B));
Step 237: store said seed short message, together with the exact matching degree calculated from the edit distance, as a matching result that meets the requirement;
Step 238: determine whether the seed short message library contains further seed short messages consistent with said characteristic value; if so, go to step 233; if not, go to step 239;
Step 239: determine whether the characteristic value set of the user short message still contains characteristic values that have not been extracted; if so, extract the next characteristic value and go to step 232; if not, return the seed short messages that meet the requirement and their exact matching degrees as the matching result.
2. The short message matching method based on characteristic values according to claim 1, characterized in that the byte extraction position in said step A3 can be set to the first byte of each segment, or to the first byte and the last byte of each segment; when the host is a 32-bit host, N is 4 and M is 3, and when the host is a 64-bit host, N is 8 and M is 4.
3. The short message matching method based on characteristic values according to claim 1, characterized in that, in step A, a seed short message node queue is used to store the details of the seed short messages, each seed short message corresponding to one node in the seed short message node queue, and a characteristic value B+ tree is used to store each characteristic value together with the pointer address of its corresponding seed short message.
4. The short message matching method based on characteristic values according to claim 1, characterized in that step 234 further comprises:
determining, during the bit-vector calculation, whether the difference value has exceeded the maximum difference value allowed by the matching degree requirement; if so, the remaining bit-vector calculation is not carried out and the method goes to step 238; if not, the bit-vector calculation continues.
5. The short message matching method based on characteristic values according to claim 1, characterized in that step 236 further comprises:
determining, during the edit distance calculation, whether the distance value has exceeded the maximum distance value allowed by the matching degree requirement; if so, the remaining edit distance calculation is not carried out and the method goes to step 238; if not, the edit distance calculation continues.
6. The short message matching method based on characteristic values according to claim 1, characterized in that said method supports both plaintext short messages and MD5-encrypted short messages.
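As an illustration of the second length filter (step 233) and of the edit distance calculation with early termination (step 236 and claim 5), a minimal sketch follows. It is not the patented "improved" method: it uses a plain Levenshtein distance with a cutoff, it omits the bit-vector pre-check of step 234, and all function names are hypothetical. It also assumes that the matching degree is 1 − N/max(length(A), length(B)), which is the reading consistent with the 70-character example in the accuracy analysis, whereas the claim text literally states N/max(length(A), length(B)).

```python
def length_window(seed_length, degree):
    """Minimum and maximum user message length allowed for a seed (step 233)."""
    return seed_length * degree, seed_length + seed_length * (1 - degree)


def bounded_edit_distance(a, b, max_distance):
    """Levenshtein distance of a and b, or None once it must exceed max_distance."""
    if abs(len(a) - len(b)) > max_distance:
        return None
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        # Row minima never decrease, so the final distance cannot be smaller.
        if min(current) > max_distance:
            return None
        previous = current
    return previous[-1] if previous[-1] <= max_distance else None


def match_degree(seed, user, degree):
    """Return the assumed matching degree, or None if the pair is filtered out."""
    low, high = length_window(len(seed), degree)
    if not (low <= len(user) <= high):
        return None
    longest = max(len(seed), len(user))
    distance = bounded_edit_distance(seed, user, int(longest * (1 - degree)))
    if distance is None:
        return None
    return 1 - distance / longest


seed = "wishing you a happy holiday and happiness forever"
print(match_degree(seed, "Dad, " + seed, 0.85))
print(match_degree(seed, "completely different text", 0.85))
```

The cutoff passed to `bounded_edit_distance` is derived from the required matching degree, so any pair that survives it already meets the requirement. A candidate that fails either filter returns `None`, which corresponds to moving on to the next seed short message in step 238.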
CN2010102566063A 2010-08-18 2010-08-18 Short message matching method based on characteristic value Expired - Fee Related CN101930458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102566063A CN101930458B (en) 2010-08-18 2010-08-18 Short message matching method based on characteristic value

Publications (2)

Publication Number Publication Date
CN101930458A CN101930458A (en) 2010-12-29
CN101930458B true CN101930458B (en) 2012-02-01

Family

ID=43369635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102566063A Expired - Fee Related CN101930458B (en) 2010-08-18 2010-08-18 Short message matching method based on characteristic value

Country Status (1)

Country Link
CN (1) CN101930458B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120201

Termination date: 20160818
