CN104036047A - Method and system for automatically correcting character strings - Google Patents

Publication number
CN104036047A
Authority
CN
China
Prior art keywords
phrase
word
character string
string
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410312846.9A
Other languages
Chinese (zh)
Other versions
CN104036047B (en)
Inventor
刘利
黄晓君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Ctrip Business Co Ltd
Original Assignee
Ctrip Computer Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ctrip Computer Technology Shanghai Co Ltd
Priority to CN201410312846.9A
Publication of CN104036047A
Application granted
Publication of CN104036047B
Legal status: Active
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G06F16/90335: Query processing
    • G06F16/90344: Query processing by using string matching techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/232: Orthographic correction, e.g. spell checking or vowelisation

Abstract

The invention discloses a method and system for automatically correcting character strings. The method comprises the following steps: generating a keyword database that records first-type words, second-type words, preset phrases, and a word order; generating a phrase arrangement statistics table from the keyword database; reading an input character string; selecting first-type words from the input character string and dividing the string into key phrases; selecting valid phrases, words to be combined, and invalid characters from the key phrases; forming valid phrases from the words to be combined according to the phrase arrangement statistics table; generating an output character string; and calculating an accuracy from the phrase arrangement statistics table and outputting it. The method and system rely partly on dictionary matching and partly on statistical probability. They can judge the accuracy of input character string information and can recognize and automatically correct clerical errors made during user input, thereby improving the operating efficiency of e-commerce.

Description

Method and system for automatically correcting character strings
Technical field
The present invention relates to a method and system for automatically correcting character strings.
Background art
As e-commerce plays an ever larger role in daily life, the authenticity and accuracy of the information that users enter have become a major concern for many e-commerce companies. E-commerce routinely involves filling in information in conventional forms, such as shipping addresses, and this information usually plays an important role in the interaction and communication between merchants and users. However, the vast volume of user input inevitably contains some harassing, that is, false, information, and it also inevitably contains clerical errors caused by carelessness during input. For these two reasons, the authenticity and accuracy of part of the input is in doubt, which hinders further communication between merchants and users or the progress of transactions.
In fact, minor errors caused by slips in user input currently cannot be corrected automatically, which greatly reduces the operating efficiency of e-commerce; for users, having to re-enter the information is also inconvenient. As for harassing false information, it is difficult to derive from the input, automatically and efficiently, a reasonably accurate basis for judgment or identification, so such information not only drags down the operating efficiency of e-commerce but also raises the cost of anti-fraud risk control. These problems have long troubled e-commerce service providers, merchants, and consumers.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the following defects of the prior art: the authenticity or accuracy of character string information entered by a user cannot be judged automatically, efficiently, and with reasonable accuracy, and clerical errors made during input are difficult to identify, so minor errors in the string cannot be corrected automatically. These defects lower the operating efficiency of e-commerce and raise the cost of anti-fraud risk control. To this end, a method and system for automatically correcting character strings are proposed.
The present invention solves the above technical problem through the following technical solution:
The invention provides a method for automatically correcting character strings, characterized in that a string database stores a plurality of verified character strings and a plurality of preset first-type words, each verified character string containing several first-type words, and the method comprises the following steps:
S1. From the plurality of character strings, extract the other words separated by the first-type words as second-type words, and take each second-type word together with the first-type word immediately following it as a preset phrase; then generate a keyword database that records the plurality of first-type words, the second-type words, the preset phrases, and a word order, the word order being a preset ordering of the first-type words;
S2. Generate a phrase arrangement statistics table that records the arrangement probability of each preset phrase appearing at the beginning of the plurality of character strings, and the arrangement probability of each preset phrase appearing immediately after each preset phrase in the plurality of character strings;
S3. Read an input character string;
S4. Select first-type words from the input character string as level keywords and, according to the positions of the level keywords in the input character string, divide the input character string into key phrases, each level keyword being located at the end of its key phrase;
S5. Select preset phrases from each key phrase as valid phrases, and mark the part of each key phrase other than the valid phrase as an invalid part;
S6. Select second-type words from the invalid parts as words to be combined, and mark everything in the invalid parts other than the words to be combined as the invalid-character part;
S7. Proceeding from front to back through the input character string, generate in turn the valid phrase corresponding to each word to be combined, based on the valid phrase immediately preceding that word and on the phrase arrangement statistics table; the generated valid phrase is, among the phrases obtained by combining the word to be combined with each first-type word, the one with the highest arrangement probability in the phrase arrangement statistics table;
S8. Generate an output character string in which the valid phrases are arranged in an order determined by the word order;
S9. Query the phrase arrangement statistics table for the arrangement probability of the valid phrase at the beginning of the output character string and of each pair of adjacent valid phrases, and compute the sum of the arrangement probabilities obtained as the accuracy;
S10. Output the accuracy.
Here, the verified character strings should be understood as character strings that satisfy a specific format requirement, namely that the string must contain certain first-type words. Taking character strings that represent addresses as an example, each must contain words that indicate address levels, such as "road" and "district". A person of ordinary skill in the art will readily appreciate that, in e-commerce, the data stored in the string database can usually come from the same source as the input character strings read in step S3; the difference is that the data stored in the string database has all been verified and is therefore true and accurate.
The word order applies to the first-type words. The second-type words can be understood as what is obtained by splitting the character strings with the preset first-type words; preset phrases, like second-type words, can be regarded as parts of the character strings obtained after splitting. In step S1, the raw data can be regarded as consisting only of the character strings and the first-type words. Step S2 in effect computes statistics over the plurality of character strings used in S1. Here, an arrangement probability is the probability of an arrangement of preset phrases, and in the present invention it refers only to the probability that two preset phrases are arranged immediately one after the other. An arrangement probability is thus the probability that a given preset phrase is immediately followed by another preset phrase. The only special case is a preset phrase at the beginning of a character string, where the arrangement probability is the probability that the preset phrase appears at the beginning of a character string. All arrangement probabilities are computed from the plurality of character strings, that is, from true and accurate verified data.
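Steps S1 and S2 can be illustrated with a minimal sketch. It rests on several assumptions that the text does not state: strings are already tokenized into space-separated words, the first-type words are the hypothetical tokens "City", "District", and "Road", and arrangement probabilities are estimated as simple relative frequencies. The function names `split_phrases` and `build_stats` are illustrative only.

```python
from collections import Counter, defaultdict

# Assumed first-type words, listed in the assumed word order.
FIRST_TYPE = ["City", "District", "Road"]

def split_phrases(tokens, first_type=FIRST_TYPE):
    """Split tokens into phrases, each ending at a first-type word."""
    phrases, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in first_type:
            phrases.append(" ".join(current))
            current = []
    return phrases  # trailing tokens with no first-type word are dropped

def build_stats(verified):
    """Estimate start and follow-on arrangement probabilities (assumed: relative frequency)."""
    start_counts = Counter()
    next_counts = defaultdict(Counter)
    for s in verified:
        phrases = split_phrases(s.split())
        if not phrases:
            continue
        start_counts[phrases[0]] += 1
        for a, b in zip(phrases, phrases[1:]):
            next_counts[a][b] += 1
    total = sum(start_counts.values())
    start_prob = {p: c / total for p, c in start_counts.items()}
    next_prob = {
        a: {b: c / sum(cnt.values()) for b, c in cnt.items()}
        for a, cnt in next_counts.items()
    }
    return start_prob, next_prob

verified = [
    "Shanghai City Nanjing Road",
    "Shanghai City Huangpu District Xx Road",
    "Huangpu District Dongfang Road",
]
start_prob, next_prob = build_stats(verified)
```

With these three toy strings, "Shanghai City" starts two of the three strings, and it is followed equally often by "Nanjing Road" and "Huangpu District". Real address data in Chinese would need character-level segmentation rather than whitespace splitting.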
Steps S3 to S6 in effect obtain the input character string and then partition it on the basis of the keyword database: from the first-type and second-type words in the input character string they derive the valid phrases, the words to be combined, the invalid parts, and the invalid-character part for subsequent processing. The valid phrases are output directly as components of the output character string that is finally generated. A word to be combined is a word that matches a second-type word in the keyword database but is not followed by a first-type word.
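The partitioning in steps S4 to S6 might look like the following sketch. It assumes a pre-tokenized input in which every second-type word is a single token, so a preset phrase is always a second-type token followed by a first-type token; the function name `classify` and the sample vocabularies are hypothetical.

```python
def classify(tokens, first_type, second_type, preset):
    """Return (valid phrases, words to be combined, leftover invalid tokens)."""
    valid, to_combine, invalid_rest = [], [], []
    pending = []  # tokens seen since the last level keyword

    def flush(items):
        # S6: second-type words become words to be combined; the rest is invalid.
        for t in items:
            (to_combine if t in second_type else invalid_rest).append(t)

    for tok in tokens:
        if tok in first_type:  # S4: a level keyword closes a key phrase
            phrase = f"{pending[-1]} {tok}" if pending else tok
            if pending and phrase in preset:  # S5: preset phrase found
                valid.append(phrase)
                flush(pending[:-1])
            else:
                flush(pending + [tok])
            pending = []
        else:
            pending.append(tok)
    flush(pending)  # tokens after the last level keyword
    return valid, to_combine, invalid_rest

first_type = {"City", "District", "Road"}
second_type = {"Nanjing", "Shanghai", "Huangpu", "Dongfang"}
preset = {"Huangpu District", "Shanghai City", "Nanjing Road"}
result = classify("Nanjing Huangpu District Shanghai zzz Dongfang".split(),
                  first_type, second_type, preset)
```

Here "Huangpu District" is kept as a valid phrase, the three bare place names become words to be combined, and the unrecognized token "zzz" ends up in the invalid remainder.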
For a word to be combined, what is missing after it may be a first-type word. Step S7 uses the arrangement probabilities in the phrase arrangement statistics table, together with the order in which words appear in the input character string, to select for each word to be combined the valid phrase with the highest arrangement probability. The output character string is then generated. Because it contains only valid phrases, the output character string should be a string that is determined from the input character string and contains accurate information from the string database.
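The selection in step S7 can be sketched as a lookup followed by a maximum. The sketch assumes the statistics table is held as two dictionaries (`start_prob` for string-initial phrases and `next_prob` for phrases that follow a given phrase); these names, the function `best_phrase`, and the probabilities below are illustrative, not taken from the patent.

```python
def best_phrase(word, prev_phrase, first_type, start_prob, next_prob):
    """Pick the first-type word that maximizes the arrangement probability."""
    if prev_phrase is None:
        table = start_prob  # no preceding valid phrase: use start probabilities
    else:
        table = next_prob.get(prev_phrase, {})
    candidates = [f"{word} {ft}" for ft in first_type]
    # Unseen candidates score 0.0; on a tie the earliest first-type word wins.
    return max(candidates, key=lambda p: table.get(p, 0.0))

first_type = ["City", "District", "Road"]
start_prob = {"Shanghai City": 0.7, "Shanghai Road": 0.1}
next_prob = {"Shanghai City": {"Nanjing Road": 0.4, "Nanjing District": 0.1}}

first = best_phrase("Shanghai", None, first_type, start_prob, next_prob)
second = best_phrase("Nanjing", first, first_type, start_prob, next_prob)
```

How ties and entirely unseen candidates should be handled is not specified in the text; falling back to the first candidate is just one possible choice.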
The steps above can be regarded as the process of generating the output character string. This process is based partly on dictionary matching and partly on statistical phrase matching. Step S9 also uses the phrase arrangement statistics table to obtain the accuracy of the output character string. It should be understood that although the accuracy in the present invention cannot directly establish whether the output character string is correct, it approximates a measure of how similar the output character string is, in its overall features, to the verified character strings; if the verified character strings are highly representative, the accuracy is correspondingly reliable. Through the steps above, the input character string can be identified and corrected to a certain extent, reasonably efficiently and accurately, and the accuracy can be calculated automatically and efficiently to serve as a basis for judging the correctness of the output character string, which reduces the cost of online anti-fraud risk control.
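The accuracy of step S9 is a plain sum of looked-up probabilities. A minimal sketch, assuming the same hypothetical `start_prob` and `next_prob` dictionaries as the table representation and treating missing entries as probability 0.0 (an assumption; the text does not say how unseen pairs are scored):

```python
def accuracy(phrases, start_prob, next_prob):
    """Sum the start probability and the probabilities of adjacent phrase pairs."""
    if not phrases:
        return 0.0
    total = start_prob.get(phrases[0], 0.0)
    for a, b in zip(phrases, phrases[1:]):
        total += next_prob.get(a, {}).get(b, 0.0)
    return total

start_prob = {"Shanghai City": 0.5}
next_prob = {
    "Shanghai City": {"Huangpu District": 0.25},
    "Huangpu District": {"Nanjing Road": 0.125},
}
score = accuracy(["Shanghai City", "Huangpu District", "Nanjing Road"],
                 start_prob, next_prob)
```

Because the value is a sum rather than a product, it grows with the number of phrases and is not itself a probability, which matches the description of it as a similarity-like measure.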
Preferably, the string database also stores a weight for each first-type word, and S9 is replaced by S9a, which is:
Query the phrase arrangement statistics table for the arrangement probability of the valid phrase at the beginning of the output character string and of each pair of adjacent valid phrases, and compute the weighted average of the arrangement probabilities obtained as the accuracy, where the weight of each arrangement probability equals the weight of the first-type word in the valid phrase at the beginning of the output character string or, for a pair of adjacent valid phrases, the weight of the first-type word in the latter phrase.
In the present invention, character strings are in fact divided on the basis of the preset first-type words, and the first-type words may differ in their standing within a string. For example, if the first-type words express a hierarchy that descends level by level, the total number of second-type words corresponding to a higher-level first-type word is likely to be smaller, so the arrangement probabilities of valid phrases involving those first-type words will be noticeably larger. Weighting by first-type word therefore prevents one excessively large arrangement probability in a character string from dominating the accuracy and drowning out the influence of the other arrangement probabilities.
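The weighted variant S9a can be sketched as follows. The sketch assumes each phrase ends in its first-type word as the last space-separated token, that weights are positive, and that missing table entries count as 0.0; `weighted_accuracy` and the sample weights are hypothetical.

```python
def weighted_accuracy(phrases, start_prob, next_prob, weights):
    """Weighted average of arrangement probabilities, one weight per phrase."""
    if not phrases:
        return 0.0
    probs = [start_prob.get(phrases[0], 0.0)]
    probs += [next_prob.get(a, {}).get(b, 0.0)
              for a, b in zip(phrases, phrases[1:])]
    # The i-th probability is weighted by the first-type word of the i-th phrase,
    # i.e. the starting phrase or the latter phrase of an adjacent pair.
    ws = [weights[p.split()[-1]] for p in phrases]
    return sum(w * p for w, p in zip(ws, probs)) / sum(ws)

weights = {"City": 1.0, "District": 3.0}  # assumed: lower weight for higher levels
start_prob = {"Shanghai City": 0.8}
next_prob = {"Shanghai City": {"Huangpu District": 0.4}}
score = weighted_accuracy(["Shanghai City", "Huangpu District"],
                          start_prob, next_prob, weights)
```

In this toy case the large city-level probability 0.8 carries weight 1 and the district-level probability 0.4 carries weight 3, giving (0.8 + 1.2) / 4 = 0.5, so the high-level phrase no longer dominates.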
Preferably, S2 further comprises: rewriting every arrangement probability in the phrase arrangement statistics table that is greater than a preset probability threshold so that it equals the probability threshold. This prevents an arrangement probability from becoming excessively large because some preset phrase occurs too rarely in the plurality of character strings, which would leave the accuracy unable to fully reflect the overall similarity between the valid phrases and the verified character strings.
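Capping the table entries is a one-line transformation. A minimal sketch, assuming the follow-on probabilities are stored as a nested dictionary (a hypothetical representation) and that the threshold value is chosen by the operator:

```python
def cap_probabilities(next_prob, threshold):
    """Return a copy of the table with every probability clipped to the threshold."""
    return {
        a: {b: min(p, threshold) for b, p in row.items()}
        for a, row in next_prob.items()
    }

table = {"Rare Road": {"Only District": 1.0, "Other District": 0.2}}
capped = cap_probabilities(table, 0.6)
```

A phrase seen only once in the verified data would otherwise give its single successor a probability of 1.0; after capping, that entry contributes no more than the threshold to the accuracy.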
Preferably, the following step is further included after S10:
S11a. Add the output character string to the string database for storage.
Preferably, S61 is performed after S6, S61 being: select from the invalid-character part the phrases that contain a first-type word as unknown phrases, and then perform S7;
the following steps are further included after S10:
S11. Determine whether the accuracy is greater than a preset accuracy threshold; if not, perform S12; if so, perform S13;
S12. Add the output character string to the string database for storage, and end the procedure;
S13. Add the unknown phrases to the output character string according to the first-type words they contain to generate a return character string, in which the first-type words follow the word order, and perform S14;
S14. Add the return character string to the string database for storage.
Because an unknown phrase contains a first-type word, it is possible that the unknown phrase went unrecognized only because there are not enough verified character strings. In that case, if the accuracy check finds that the output character string has high accuracy, the unknown phrase is considered unlikely to have resulted from a user input error, so it is included in the return character string and added to the string database.
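The branch in steps S11 to S14 can be sketched as a small function that decides which string is stored. The sketch assumes phrases are space-separated with the first-type word last, and that merging unknown phrases means re-sorting all phrases by the word order; `choose_string_to_store` and the sample data are hypothetical.

```python
def choose_string_to_store(output_phrases, unknown_phrases, accuracy,
                           threshold, word_order):
    """S11: low accuracy stores the output string; high accuracy adds unknown phrases."""
    if accuracy <= threshold:
        return " ".join(output_phrases)  # S12
    merged = sorted(  # S13: stable sort keeps the original order within a level
        output_phrases + unknown_phrases,
        key=lambda p: word_order.index(p.split()[-1]),
    )
    return " ".join(merged)  # stored in S14

word_order = ["City", "District", "Road"]
output = ["Shanghai City", "Nanjing Road"]
unknown = ["Pudong District"]
high = choose_string_to_store(output, unknown, 0.9, 0.5, word_order)
low = choose_string_to_store(output, unknown, 0.3, 0.5, word_order)
```

With a high accuracy the unknown phrase "Pudong District" is slotted between the city and the road; with a low accuracy it is left out and only the output string is stored.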
Preferably, S10 further comprises: outputting the number of words to be combined and/or the number of characters in the invalid-character part;
S11 is replaced by S11b, which is: determine whether the following conditions hold simultaneously: the accuracy is greater than the accuracy threshold, and the number of words to be combined is less than a preset to-be-combined-word count threshold and/or the number of characters in the invalid-character part is less than a preset invalid-character count threshold; if not, perform S12; if so, perform S13.
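One way to read the combined test of S11b is as a conjunction in which the two count checks are optional. The sketch below takes that reading as an assumption, since the "and/or" in the text leaves the exact combination open; the function and parameter names are hypothetical.

```python
def passes_checks(accuracy, accuracy_threshold,
                  n_to_combine=None, max_to_combine=None,
                  n_invalid_chars=None, max_invalid_chars=None):
    """Return True when every check that has been configured is satisfied."""
    ok = accuracy > accuracy_threshold
    if n_to_combine is not None and max_to_combine is not None:
        ok = ok and n_to_combine < max_to_combine
    if n_invalid_chars is not None and max_invalid_chars is not None:
        ok = ok and n_invalid_chars < max_invalid_chars
    return ok

all_good = passes_checks(0.9, 0.5, n_to_combine=1, max_to_combine=3,
                         n_invalid_chars=0, max_invalid_chars=4)
too_many = passes_checks(0.9, 0.5, n_to_combine=5, max_to_combine=3)
```

A True result corresponds to proceeding to S13, and a False result to S12. The extra checks make the decision stricter: a string that needed many reconstructed phrases is not trusted even if its accuracy is high.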
Preferably, the first-type words include city, district, residential quarter, and road.
The present invention also provides a system for automatically correcting character strings, characterized in that it comprises:
A string database module, configured to store a plurality of verified character strings and a plurality of preset first-type words, each verified character string containing several first-type words;
A keyword database module, configured to extract from the plurality of character strings the other words separated by the first-type words as second-type words, take each second-type word together with the first-type word immediately following it as a preset phrase, and then generate a keyword database that records the plurality of first-type words, the second-type words, the preset phrases, and a word order, the word order being a preset ordering of the first-type words;
A phrase arrangement statistics module, configured to calculate and record, from the records of the string database module and the keyword database, the arrangement probability of each preset phrase appearing at the beginning of the plurality of character strings and the arrangement probability of each preset phrase appearing immediately after each preset phrase in the plurality of character strings;
A character string reading module, configured to read an input character string;
A character string division module, configured to select first-type words from the input character string as level keywords and, according to the positions of the level keywords in the input character string, divide the input character string into key phrases, each level keyword being located at the end of its key phrase;
A valid phrase selection module, configured to select preset phrases from each key phrase as valid phrases and mark the part of each key phrase other than the valid phrase as an invalid part;
A to-be-combined word selection module, configured to select second-type words from the invalid parts as words to be combined and mark everything in the invalid parts other than the words to be combined as the invalid-character part;
A phrase construction module, configured to proceed from front to back through the input character string and generate in turn the valid phrase corresponding to each word to be combined, based on the valid phrase immediately preceding that word and on the phrase arrangement statistics table, the generated valid phrase being, among the phrases obtained by combining the word to be combined with each first-type word, the one with the highest arrangement probability in the phrase arrangement statistics table;
An output module, configured to generate an output character string in which the valid phrases are arranged in an order determined by the word order;
A first calculation module, configured to query the phrase arrangement statistics module for the arrangement probability of the valid phrase at the beginning of the output character string and of each pair of adjacent valid phrases, and compute the sum of the arrangement probabilities obtained as the accuracy;
An accuracy module, configured to output the accuracy.
Preferably, the string database module also stores a weight for each first-type word, and the first calculation module is replaced by a second calculation module;
The second calculation module is configured to query the phrase arrangement statistics module for the arrangement probability of the valid phrase at the beginning of the output character string and of each pair of adjacent valid phrases, and compute the weighted average of the arrangement probabilities obtained as the accuracy, where the weight of each arrangement probability equals the weight of the first-type word in the valid phrase at the beginning of the output character string or, for a pair of adjacent valid phrases, the weight of the first-type word in the latter phrase.
Preferably, the phrase arrangement statistics module is also configured to rewrite every arrangement probability greater than a preset probability threshold so that it equals the probability threshold.
Preferably, the system also comprises an output string return module, configured to add the output character string to the string database module for storage.
Preferably, the to-be-combined word selection module is also configured to select from the invalid-character part the phrases that contain a first-type word as unknown phrases, and the system also comprises:
A first judgment module, configured to determine whether the accuracy is greater than a preset accuracy threshold, enable a first return module if not, and enable a second return module if so;
The first return module, configured to add the output character string to the string database module for storage;
The second return module, configured to add the unknown phrases to the output character string according to the first-type words of the unknown phrases to generate a return character string in which the first-type words follow the word order, and then add the return character string to the string database module for storage.
Preferably, the accuracy module is also configured to output the number of words to be combined and/or the number of characters in the invalid-character part;
The first judgment module is replaced by a second judgment module, which is configured to determine whether the following conditions hold simultaneously: the accuracy is greater than the accuracy threshold, and the number of words to be combined is less than a preset to-be-combined-word count threshold and/or the number of characters in the invalid-character part is less than a preset invalid-character count threshold; it enables the first return module if not, and the second return module if so.
Preferably, the first-type words include city, district, residential quarter, and road.
On the basis of common knowledge in the art, the above preferred conditions can be combined in any manner to obtain the preferred embodiments of the invention.
The positive effects of the present invention are as follows:
The method and system for automatically correcting character strings of the present invention rely partly on dictionary matching and partly on statistical probability to infer the correctness of a character string. They can judge the authenticity or accuracy of character string information entered by a user automatically and efficiently, and can also identify clerical errors made during input and automatically correct minor errors in the string, thereby improving the operating efficiency of e-commerce.
Brief description of the drawings
Fig. 1 is a flowchart of the method for automatically correcting character strings according to Embodiment 1 of the present invention.
Fig. 2 is a schematic diagram of the system for automatically correcting character strings according to Embodiment 2 of the present invention.
Detailed description of the embodiments
Preferred embodiments of the present invention are given below with reference to the accompanying drawings to describe the technical solution in detail, without thereby limiting the present invention to the scope of the described embodiments.
Embodiment 1
In the method for automatically correcting character strings of this embodiment, a string database stores a plurality of verified character strings and a plurality of preset first-type words, each verified character string containing several first-type words. As shown in Fig. 1, the method comprises the following steps:
S1. From the plurality of character strings, extract the other words separated by the first-type words as second-type words, and take each second-type word together with the first-type word immediately following it as a preset phrase; then generate a keyword database that records the plurality of first-type words, the second-type words, the preset phrases, and a word order, the word order being a preset ordering of the first-type words;
S2. Generate a phrase arrangement statistics table that records the arrangement probability of each preset phrase appearing at the beginning of the plurality of character strings, and the arrangement probability of each preset phrase appearing immediately after each preset phrase in the plurality of character strings;
S3. Read an input character string;
S4. Select first-type words from the input character string as level keywords and, according to the positions of the level keywords in the input character string, divide the input character string into key phrases, each level keyword being located at the end of its key phrase;
S5. Select preset phrases from each key phrase as valid phrases, and mark the part of each key phrase other than the valid phrase as an invalid part;
S6. Select second-type words from the invalid parts as words to be combined, and mark everything in the invalid parts other than the words to be combined as the invalid-character part;
S7. Proceeding from front to back through the input character string, generate in turn the valid phrase corresponding to each word to be combined, based on the valid phrase immediately preceding that word and on the phrase arrangement statistics table; the generated valid phrase is, among the phrases obtained by combining the word to be combined with each first-type word, the one with the highest arrangement probability in the phrase arrangement statistics table;
S8. Generate an output character string in which the valid phrases are arranged in an order determined by the word order;
S9. Query the phrase arrangement statistics table for the arrangement probability of the valid phrase at the beginning of the output character string and of each pair of adjacent valid phrases, and compute the sum of the arrangement probabilities obtained as the accuracy;
S10. Output the accuracy;
S11a. Add the output character string to the string database for storage.
Here, S2 further comprises: rewriting every arrangement probability in the phrase arrangement statistics table that is greater than a preset probability threshold so that it equals the probability threshold. This prevents an arrangement probability from becoming excessively large because some preset phrase occurs too rarely in the plurality of character strings, which would leave the accuracy unable to fully reflect the overall similarity between the valid phrases and the verified character strings.
The verified character strings should be understood as character strings that satisfy a specific format requirement, namely that the string must contain certain first-type words. The word order applies to the first-type words. The second-type words can be understood as what is obtained by splitting the character strings with the preset first-type words; preset phrases, like second-type words, can be regarded as parts of the character strings obtained after splitting.
Step S2 in effect computes statistics over the plurality of character strings used in S1. Here, an arrangement probability is the probability that preset phrases are arranged immediately one after the other. For a word to be combined, what is missing after it may be a first-type word. Step S7 uses the arrangement probabilities in the phrase arrangement statistics table, together with the order in which words appear in the input character string, to select for each word to be combined the valid phrase with the highest arrangement probability. The output character string is then generated. Because it contains only valid phrases, the output character string should be a string that is determined from the input character string and contains accurate information from the string database.
The steps above can be regarded as the process of generating the output character string. This process is based partly on dictionary matching and partly on statistical phrase matching. The application of the method of this embodiment to the automatic correction of address character strings is illustrated below.
In this application example, first kind word has comprised city, district, Village, road, and this row's word order is city, district, Village, road.What in the plurality of character string, include similar " Nanjing Road, Shanghai City ", " XX road, Huangpu District, Shanghai City ", " Village, east, Huangpu District " and so on examines character string.By these character strings, can add up the arrangement probability that draws " Shanghai City ", " Village, Shanghai ", " upper sea route ", " Nanjing Road ", " Nanjing " etc. default phrase.
In this case, suppose for instance that the input character string read is "East, Shanghai, Huangpu District, Nanjing". Although a legal address is hard to recognize directly from this input character string, the examined character strings recorded in the phrase arrangement statistics table make it possible to determine that the words lacking a first-type word in the input character string, "Nanjing", "Shanghai" and "East", may each form a preset phrase with one or more of the first-type words. Step S4 is performed first, marking off one key phrase, "Huangpu District".
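The division into key phrases (S4) and the selection of a valid phrase from a key phrase (S5) can be sketched as follows. This is a minimal sketch under stated assumptions: the input is a pre-tokenized list written largest unit first, a trailing remainder with no level keyword is returned separately, and the longest matching suffix is preferred. The helper names `split_key_phrases` and `pick_valid` are hypothetical.

```python
FIRST_TYPE = ("City", "District", "Village", "Road")

def split_key_phrases(tokens):
    """S4 sketch: cut after every first-type word (level keyword).
    Returns (key_phrases, trailing_remainder)."""
    phrases, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in FIRST_TYPE:
            phrases.append(current)
            current = []
    return phrases, current

def pick_valid(key_phrase, presets):
    """S5 sketch: find the longest suffix of the key phrase that is a preset
    phrase; everything before it is the invalid part."""
    keyword = key_phrase[-1]
    for i in range(len(key_phrase) - 1):
        candidate = (" ".join(key_phrase[i:-1]), keyword)
        if candidate in presets:
            return candidate, key_phrase[:i]
    return None, key_phrase  # no preset phrase found

# Toy preset phrases and the input string of the example (assumed token order).
presets = {("Shanghai", "City"), ("Huangpu", "District"), ("Nanjing", "Road")}
tokens = ["Nanjing", "Huangpu", "District", "Shanghai", "East"]

key_phrases, tail = split_key_phrases(tokens)
valid, invalid = pick_valid(key_phrases[0], presets)
```

Here the only key phrase ends in "District", the valid phrase found in it is "Huangpu District", and "Nanjing" together with the trailing "Shanghai" and "East" is left over as material for the later steps.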
Next, the rest of the input character string "East, Shanghai, Huangpu District, Nanjing" is marked as the invalid part. The examined character strings establish that "Nanjing", "Shanghai" and "East" are all second-type words, so they are selected as words to be combined. All possible combinations of these second-type words with the first-type words are then considered, and the arrangement probabilities in the phrase arrangement statistics table are used to find, among the character strings formed by adding candidate first-type words, the arrangement with the highest probability. For example, "Nanjing Road" with "Shanghai City" and "Nanjing City" with "Shanghai Road" are both possible. Once the key phrase "Huangpu District" and the ordering of the first-type words are taken into account, however, the arrangement probability of "Huangpu District, Shanghai City" is larger than that of "Huangpu District, Nanjing City", so the former is adopted.
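The selection in step S7 can be illustrated with a short Python sketch. It assumes the statistics table is held as two dictionaries (start probabilities and follow-on probabilities keyed by the preceding valid phrase), that phrases missing from the table have probability zero, and that the probabilities shown are invented toy values. The function name `best_phrase` is hypothetical.

```python
FIRST_TYPE = ("City", "District", "Village", "Road")

def best_phrase(word, prev_phrase, start_prob, next_prob):
    """S7 sketch: combine `word` with every first-type word and keep the
    candidate with the highest arrangement probability. With no preceding
    valid phrase the start probabilities are used. Ties (including the
    all-zero case) fall to the earliest first-type word in FIRST_TYPE."""
    table = start_prob if prev_phrase is None else next_prob.get(prev_phrase, {})
    candidates = [(word, first) for first in FIRST_TYPE]
    return max(candidates, key=lambda phrase: table.get(phrase, 0.0))

# Invented toy probabilities, for illustration only.
start_prob = {("Shanghai", "City"): 0.5, ("Shanghai", "Road"): 0.125}
next_prob = {
    ("Huangpu", "District"): {("Nanjing", "Road"): 0.25, ("Nanjing", "City"): 0.0},
}

first = best_phrase("Shanghai", None, start_prob, next_prob)
second = best_phrase("Nanjing", ("Huangpu", "District"), start_prob, next_prob)
```

In this toy table "Shanghai" is completed to "Shanghai City" and "Nanjing", following "Huangpu District", is completed to "Nanjing Road", matching the outcome described above.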
The output character string finally generated is thus "Dongfang (East) Residential Quarter, Nanjing Road, Huangpu District, Shanghai City". It should be understood that although the method does not guarantee that the final output character string is correct or clearly recognizable, it greatly improves the correctness of the output character string in terms of statistical probability. This is especially evident when the character string is long and contains many first-type words (which can be regarded as level keywords): because the method uses the ordering of first-type words to exploit the examined character strings more effectively when correcting the character string to be confirmed, the arrangement probabilities corroborate one another when more of them are involved.
In step S9, the accuracy of the output character string is also obtained from the phrase arrangement statistics table. It should be understood that the accuracy in the present invention cannot directly determine whether the output character string is correct. It is rather a measure of how similar the overall features of the output character string are to those of the examined character strings; if the examined character strings are highly representative, the accuracy is correspondingly more reliable.
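The accuracy of step S9, the sum of the start probability of the first valid phrase and the arrangement probabilities of each adjacent pair, can be sketched as below. The table layout (two dictionaries), the zero default for unseen phrases, and the toy probabilities are assumptions of this sketch; `accuracy` is a hypothetical name.

```python
def accuracy(phrases, start_prob, next_prob):
    """S9 sketch: start probability of the first valid phrase plus the
    arrangement probability of every adjacent pair of valid phrases."""
    if not phrases:
        return 0.0
    total = start_prob.get(phrases[0], 0.0)
    for a, b in zip(phrases, phrases[1:]):
        total += next_prob.get(a, {}).get(b, 0.0)
    return total

# Invented toy probabilities, for illustration only.
start_prob = {("Shanghai", "City"): 0.5}
next_prob = {
    ("Shanghai", "City"): {("Huangpu", "District"): 0.25},
    ("Huangpu", "District"): {("Nanjing", "Road"): 0.125},
}
output = [("Shanghai", "City"), ("Huangpu", "District"), ("Nanjing", "Road")]
score = accuracy(output, start_prob, next_prob)  # 0.5 + 0.25 + 0.125
```

Because the score is a sum, longer output strings can accumulate larger values; it is a similarity measure against the examined strings rather than a probability.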
The method of this embodiment identifies and corrects input character strings fairly efficiently and accurately, and it can also compute the accuracy automatically and efficiently as a basis for judging the correctness of the output character string, which reduces the cost of online anti-fraud risk control.
Embodiment 2
The character string automatic correction method of this embodiment differs from Embodiment 1 only in the following:
The string database also stores a weight value for each first-type word, and S9 is replaced by S9a, where S9a is:
Query the phrase arrangement statistics table to obtain the arrangement probability of the valid phrase at the start of the output character string and the arrangement probabilities of adjacent valid phrases, and compute the weighted mean of the obtained arrangement probabilities as the accuracy, where the weight of each arrangement probability equals the weight value of the first-type word in the valid phrase at the start of the output character string, or the weight value of the first-type word in the later of the two adjacent valid phrases.
In addition, S61 is performed after S6, where S61 is: select from the invalid-word part the phrases that contain a first-type word as unknown phrases, and perform S7;
S10 further includes: outputting the number of words to be combined and/or the number of characters contained in the invalid-word part;
The following steps are further included after S10:
S11b. Judge whether the following hold simultaneously: the accuracy is greater than the accuracy threshold, and the number of words to be combined is less than a preset word-to-be-combined count threshold and/or the number of characters contained in the invalid-word part is less than a preset invalid-character-count threshold. If the result is no, perform S12; if the result is yes, perform S13;
S12. Add the output character string to the string database for storage, and end the process;
S13. According to the first-type words in the unknown phrases, add the unknown phrases to the output character string to generate a return character string, in which the first-type words are ordered according to the word order, and perform S14;
S14. Add the return character string to the string database for storage.
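The branch taken in S11b to S13 can be sketched as a single function. This is only a sketch under stated assumptions: it implements the "and" reading of "and/or" (all three conditions must hold), represents phrases as (second-type word, first-type word) pairs, and returns the phrase list that would be stored instead of writing to a database. The name `build_return` and all parameter names are hypothetical.

```python
WORD_ORDER = ("City", "District", "Village", "Road")

def build_return(output_phrases, unknown_phrases, accuracy,
                 n_combined, n_invalid_chars,
                 acc_min, combined_max, invalid_max):
    """S11b-S13 sketch. If any condition fails, the output string is kept
    as is (S12). If all hold, the unknown phrases are merged in and the
    result is ordered by the first-type word order (S13)."""
    all_hold = (accuracy > acc_min
                and n_combined < combined_max
                and n_invalid_chars < invalid_max)
    if not all_hold:
        return list(output_phrases)
    merged = list(output_phrases) + list(unknown_phrases)
    # sorted() is stable, so phrases sharing a first-type word keep their order.
    return sorted(merged, key=lambda phrase: WORD_ORDER.index(phrase[1]))

output = [("Shanghai", "City"), ("Nanjing", "Road")]
unknown = [("Xuhui", "District")]
stored_yes = build_return(output, unknown, 0.9, 1, 2,
                          acc_min=0.5, combined_max=3, invalid_max=5)
stored_no = build_return(output, unknown, 0.4, 1, 2,
                         acc_min=0.5, combined_max=3, invalid_max=5)
```

The thresholds above are arbitrary placeholders; the patent leaves their values to the implementer.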
Building on Embodiment 1, the method of this embodiment takes into account that the first-type words may differ in importance within a character string, and weights the arrangement probabilities according to the first-type words. This prevents a single excessively large arrangement probability in the character string from dominating the accuracy result and masking the influence of the other arrangement probabilities.
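The weighted accuracy of this embodiment (S9a) can be sketched as follows. The sketch assumes the same dictionary layout for the statistics table as a simple implementation might use, takes the weight of each probability from the first-type word of the first phrase (for the start probability) or of the later phrase in each adjacent pair, and uses invented weights and probabilities. `weighted_accuracy` is a hypothetical name.

```python
def weighted_accuracy(phrases, start_prob, next_prob, weights):
    """S9a sketch: weighted mean of the arrangement probabilities.
    Probability i is weighted by the first-type word of phrase i."""
    if not phrases:
        return 0.0
    probs = [start_prob.get(phrases[0], 0.0)]
    probs += [next_prob.get(a, {}).get(b, 0.0)
              for a, b in zip(phrases, phrases[1:])]
    ws = [weights[first_type] for _, first_type in phrases]
    return sum(w * p for w, p in zip(ws, probs)) / sum(ws)

# Invented toy values, for illustration only.
weights = {"City": 4, "District": 2, "Village": 1, "Road": 2}
start_prob = {("Shanghai", "City"): 0.5}
next_prob = {
    ("Shanghai", "City"): {("Huangpu", "District"): 0.25},
    ("Huangpu", "District"): {("Nanjing", "Road"): 0.125},
}
output = [("Shanghai", "City"), ("Huangpu", "District"), ("Nanjing", "Road")]
score = weighted_accuracy(output, start_prob, next_prob, weights)
# (4*0.5 + 2*0.25 + 2*0.125) / (4 + 2 + 2)
```

Unlike the plain sum of Embodiment 1, the weighted mean stays within the range of the individual probabilities, so one large term cannot dominate on its own.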
Embodiment 3
Referring to Fig. 2, the character string automatic correction system of this embodiment comprises:
String database module 1, for storing a plurality of examined character strings and a plurality of preset first-type words, each examined character string containing some first-type words;
Keyword database module 2, for extracting from the plurality of character strings the other words separated by first-type words as second-type words, taking each second-type word together with the first-type word immediately following it as a preset phrase, and then generating a keyword database that records a plurality of first-type words, second-type words and preset phrases as well as a word order, the word order being the preset order of the first-type words;
Phrase arrangement statistics module 3, for computing and recording, from the records of the string database module and the keyword database, the arrangement probability that each preset phrase appears at the start of the plurality of character strings and the arrangement probability that each preset phrase appears immediately after each preset phrase in the plurality of character strings;
Character string reading module 4, for reading an input character string;
Character string division module 5, for selecting first-type words from the input character string as level keywords and dividing the input character string into key phrases according to the positions of the level keywords in the input character string, each level keyword being at the end of a key phrase;
Valid phrase selection module 6, for selecting preset phrases from each key phrase as valid phrases and marking the part of each key phrase other than the valid phrases as the invalid part;
Word-to-be-combined selection module 7, for selecting second-type words from the invalid part as words to be combined and marking all words in the invalid part other than the words to be combined as the invalid-word part;
Phrase construction module 8, for generating, in front-to-back order in the input character string, the valid phrase corresponding to each word to be combined in turn, according to the valid phrase immediately preceding that word and the phrase arrangement statistics table, each generated valid phrase being, among the phrases obtained by combining the corresponding word to be combined with each first-type word, the one with the highest arrangement probability in the phrase arrangement statistics table;
Output module 9, for generating an output character string in which the valid phrases are arranged in an order determined by the word order;
First calculation module 10, for querying the phrase arrangement statistics module to obtain the arrangement probability of the valid phrase at the start of the output character string and the arrangement probabilities of adjacent valid phrases, and computing the sum of the obtained arrangement probabilities as the accuracy;
Accuracy module 11, for outputting the accuracy.
The phrase arrangement statistics module is also used to rewrite every arrangement probability greater than a preset probability threshold so that it equals that threshold. The character string automatic correction system further comprises an output character string return module 12, which adds the output character string to the string database module for storage.
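The rewriting of over-large arrangement probabilities performed by the phrase arrangement statistics module can be sketched in a few lines. The two-dictionary table layout and the name `cap_probabilities` are assumptions of this sketch, and the values are invented.

```python
def cap_probabilities(start_prob, next_prob, cap):
    """Rewrite every arrangement probability above `cap` to equal `cap`,
    returning new tables and leaving the inputs untouched."""
    capped_start = {p: min(v, cap) for p, v in start_prob.items()}
    capped_next = {
        a: {b: min(v, cap) for b, v in row.items()}
        for a, row in next_prob.items()
    }
    return capped_start, capped_next

# Invented toy values, for illustration only.
start_prob = {("Shanghai", "City"): 0.9, ("Huangpu", "District"): 0.1}
next_prob = {("Shanghai", "City"): {("Nanjing", "Road"): 0.75}}
capped_start, capped_next = cap_probabilities(start_prob, next_prob, cap=0.5)
```

Capping keeps any one very frequent phrase from contributing more than the threshold to the summed accuracy.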
Embodiment 4
The character string automatic correction system of this embodiment differs from Embodiment 3 only in the following:
The string database module also stores a weight value for each first-type word, and the first calculation module is replaced by a second calculation module. The second calculation module is used to query the phrase arrangement statistics module to obtain the arrangement probability of the valid phrase at the start of the output character string and the arrangement probabilities of adjacent valid phrases, and to compute the weighted mean of the obtained arrangement probabilities as the accuracy, where the weight of each arrangement probability equals the weight value of the first-type word in the valid phrase at the start of the output character string, or the weight value of the first-type word in the later of the two adjacent valid phrases.
In addition, the word-to-be-combined selection module is also used to select from the invalid-word part the phrases that contain a first-type word as unknown phrases, and the character string automatic correction system further comprises:
A first judgment module, for judging whether the accuracy is greater than a preset accuracy threshold, enabling the first return module if the result is no and enabling the second return module if the result is yes;
A first return module, for adding the output character string to the string database module for storage;
A second return module, for adding the unknown phrases to the output character string according to the first-type words of the unknown phrases to generate a return character string in which the first-type words are ordered according to the word order, and then adding the return character string to the string database module for storage.
Meanwhile, the accuracy module is also used to output the number of words to be combined and/or the number of characters contained in the invalid-word part;
The first judgment module is replaced by a second judgment module, which is used to judge whether the following hold simultaneously: the accuracy is greater than the accuracy threshold, and the number of words to be combined is less than a preset word-to-be-combined count threshold and/or the number of characters contained in the invalid-word part is less than a preset invalid-character-count threshold. The second judgment module enables the first return module if the result is no and enables the second return module if the result is yes.
Although specific embodiments of the present invention have been described above, those skilled in the art will understand that these are only illustrative, and that the scope of protection of the present invention is defined by the appended claims. Those skilled in the art can make various changes or modifications to these embodiments without departing from the principle and essence of the present invention, and all such changes and modifications fall within the scope of protection of the present invention.

Claims (14)

1. A character string automatic correction method, characterized in that a string database stores a plurality of examined character strings and a plurality of preset first-type words, each examined character string containing some first-type words, and the character string automatic correction method comprises the following steps:
S1. From the plurality of character strings, extract the other words separated by first-type words as second-type words, take each second-type word together with the first-type word immediately following it as a preset phrase, and then generate a keyword database that records a plurality of first-type words, second-type words and preset phrases as well as a word order, the word order being the preset order of the first-type words;
S2. Generate a phrase arrangement statistics table that records the arrangement probability that each preset phrase appears at the start of the plurality of character strings and the arrangement probability that each preset phrase appears immediately after each preset phrase in the plurality of character strings;
S3. Read an input character string;
S4. Select first-type words from the input character string as level keywords, and divide the input character string into key phrases according to the positions of the level keywords in the input character string, each level keyword being at the end of a key phrase;
S5. Select preset phrases from each key phrase as valid phrases, and mark the part of each key phrase other than the valid phrases as the invalid part;
S6. Select second-type words from the invalid part as words to be combined, and mark all words in the invalid part other than the words to be combined as the invalid-word part;
S7. In front-to-back order in the input character string, generate the valid phrase corresponding to each word to be combined in turn, according to the valid phrase immediately preceding that word and the phrase arrangement statistics table, each generated valid phrase being, among the phrases obtained by combining the corresponding word to be combined with each first-type word, the one with the highest arrangement probability in the phrase arrangement statistics table;
S8. Generate an output character string in which the valid phrases are arranged in an order determined by the word order;
S9. Query the phrase arrangement statistics table to obtain the arrangement probability of the valid phrase at the start of the output character string and the arrangement probabilities of adjacent valid phrases, and compute the sum of the obtained arrangement probabilities as the accuracy;
S10. Output the accuracy.
2. The character string automatic correction method as claimed in claim 1, characterized in that the string database also stores a weight value for each first-type word, and S9 is replaced by S9a, where S9a is:
Query the phrase arrangement statistics table to obtain the arrangement probability of the valid phrase at the start of the output character string and the arrangement probabilities of adjacent valid phrases, and compute the weighted mean of the obtained arrangement probabilities as the accuracy, where the weight of each arrangement probability equals the weight value of the first-type word in the valid phrase at the start of the output character string, or the weight value of the first-type word in the later of the two adjacent valid phrases.
3. The character string automatic correction method as claimed in claim 1, characterized in that S2 further includes: rewriting every arrangement probability in the phrase arrangement statistics table that is greater than a preset probability threshold so that it equals that probability threshold.
4. The character string automatic correction method as claimed in any one of claims 1-3, characterized in that the following step is further included after S10:
S11a. Add the output character string to the string database for storage.
5. The character string automatic correction method as claimed in any one of claims 1-3, characterized in that S61 is performed after S6, where S61 is: select from the invalid-word part the phrases that contain a first-type word as unknown phrases, and perform S7;
The following steps are further included after S10:
S11. Judge whether the accuracy is greater than a preset accuracy threshold; if the result is no, perform S12; if the result is yes, perform S13;
S12. Add the output character string to the string database for storage, and end the process;
S13. According to the first-type words in the unknown phrases, add the unknown phrases to the output character string to generate a return character string, in which the first-type words are ordered according to the word order, and perform S14;
S14. Add the return character string to the string database for storage.
6. The character string automatic correction method as claimed in claim 5, characterized in that S10 further includes: outputting the number of words to be combined and/or the number of characters contained in the invalid-word part;
S11 is replaced by S11b, where S11b is: judge whether the following hold simultaneously: the accuracy is greater than the accuracy threshold, and the number of words to be combined is less than a preset word-to-be-combined count threshold and/or the number of characters contained in the invalid-word part is less than a preset invalid-character-count threshold; if the result is no, perform S12; if the result is yes, perform S13.
7. The character string automatic correction method as claimed in claim 6, characterized in that the first-type words include city, district, village (residential quarter) and road.
8. A character string automatic correction system, characterized by comprising:
A string database module, for storing a plurality of examined character strings and a plurality of preset first-type words, each examined character string containing some first-type words;
A keyword database module, for extracting from the plurality of character strings the other words separated by first-type words as second-type words, taking each second-type word together with the first-type word immediately following it as a preset phrase, and then generating a keyword database that records a plurality of first-type words, second-type words and preset phrases as well as a word order, the word order being the preset order of the first-type words;
A phrase arrangement statistics module, for computing and recording, from the records of the string database module and the keyword database, the arrangement probability that each preset phrase appears at the start of the plurality of character strings and the arrangement probability that each preset phrase appears immediately after each preset phrase in the plurality of character strings;
A character string reading module, for reading an input character string;
A character string division module, for selecting first-type words from the input character string as level keywords and dividing the input character string into key phrases according to the positions of the level keywords in the input character string, each level keyword being at the end of a key phrase;
A valid phrase selection module, for selecting preset phrases from each key phrase as valid phrases and marking the part of each key phrase other than the valid phrases as the invalid part;
A word-to-be-combined selection module, for selecting second-type words from the invalid part as words to be combined and marking all words in the invalid part other than the words to be combined as the invalid-word part;
A phrase construction module, for generating, in front-to-back order in the input character string, the valid phrase corresponding to each word to be combined in turn, according to the valid phrase immediately preceding that word and the phrase arrangement statistics table, each generated valid phrase being, among the phrases obtained by combining the corresponding word to be combined with each first-type word, the one with the highest arrangement probability in the phrase arrangement statistics table;
An output module, for generating an output character string in which the valid phrases are arranged in an order determined by the word order;
A first calculation module, for querying the phrase arrangement statistics module to obtain the arrangement probability of the valid phrase at the start of the output character string and the arrangement probabilities of adjacent valid phrases, and computing the sum of the obtained arrangement probabilities as the accuracy;
An accuracy module, for outputting the accuracy.
9. The character string automatic correction system as claimed in claim 8, characterized in that the string database module also stores a weight value for each first-type word, and the first calculation module is replaced by a second calculation module;
The second calculation module is used to query the phrase arrangement statistics module to obtain the arrangement probability of the valid phrase at the start of the output character string and the arrangement probabilities of adjacent valid phrases, and to compute the weighted mean of the obtained arrangement probabilities as the accuracy, where the weight of each arrangement probability equals the weight value of the first-type word in the valid phrase at the start of the output character string, or the weight value of the first-type word in the later of the two adjacent valid phrases.
10. The character string automatic correction system as claimed in claim 8, characterized in that the phrase arrangement statistics module is also used to rewrite every arrangement probability greater than a preset probability threshold so that it equals that probability threshold.
11. The character string automatic correction system as claimed in any one of claims 8-10, characterized in that the system further comprises an output character string return module, which adds the output character string to the string database module for storage.
12. The character string automatic correction system as claimed in any one of claims 8-10, characterized in that the word-to-be-combined selection module is also used to select from the invalid-word part the phrases that contain a first-type word as unknown phrases, and the system further comprises:
A first judgment module, for judging whether the accuracy is greater than a preset accuracy threshold, enabling the first return module if the result is no and enabling the second return module if the result is yes;
A first return module, for adding the output character string to the string database module for storage;
A second return module, for adding the unknown phrases to the output character string according to the first-type words of the unknown phrases to generate a return character string in which the first-type words are ordered according to the word order, and then adding the return character string to the string database module for storage.
13. The character string automatic correction system as claimed in claim 12, characterized in that the accuracy module is also used to output the number of words to be combined and/or the number of characters contained in the invalid-word part;
The first judgment module is replaced by a second judgment module, which is used to judge whether the following hold simultaneously: the accuracy is greater than the accuracy threshold, and the number of words to be combined is less than a preset word-to-be-combined count threshold and/or the number of characters contained in the invalid-word part is less than a preset invalid-character-count threshold; the second judgment module enables the first return module if the result is no and enables the second return module if the result is yes.
14. The character string automatic correction system as claimed in claim 13, characterized in that the first-type words include city, district, village (residential quarter) and road.
CN201410312846.9A 2014-07-02 2014-07-02 Method and system for automatically correcting character strings Active CN104036047B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410312846.9A CN104036047B (en) 2014-07-02 2014-07-02 Method and system for automatically correcting character strings

Publications (2)

Publication Number Publication Date
CN104036047A true CN104036047A (en) 2014-09-10
CN104036047B CN104036047B (en) 2017-05-17

Family

ID=51466817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410312846.9A Active CN104036047B (en) 2014-07-02 2014-07-02 Method and system for automatically correcting character strings

Country Status (1)

Country Link
CN (1) CN104036047B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477565A (en) * 2009-01-22 2009-07-08 北京搜狗科技发展有限公司 Method and apparatus for confirming correctness of input character string in search engine
CN101639830A (en) * 2009-09-08 2010-02-03 西安交通大学 Chinese term automatic correction method in input process
EP2395438A1 (en) * 2009-02-03 2011-12-14 Huawei Technologies Co., Ltd. Character string processing method and system and matcher
CN102867040A (en) * 2012-08-31 2013-01-09 中国科学院计算技术研究所 Chinese search engine mixed speech-oriented query error corrosion method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
赵华: "多模型下的近似字符串匹配算法研究", 《中国博士学位论文全文数据库·信息科技辑》 *
马恒等: "多属性字符串辨识码的高效率查询算法", 《PROCEEDINGS OF INTERNATIONAL CONFERENCE ON ENGINEERING AND BUSINESS MANAGEMENT(EBM2010) 》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291698A (en) * 2017-06-30 2017-10-24 广东欧珀移动通信有限公司 Information revision method, device, storage medium and electronic equipment
CN110135414A (en) * 2019-05-16 2019-08-16 京北方信息技术股份有限公司 Corpus update method, device, storage medium and terminal
CN110135414B (en) * 2019-05-16 2021-07-09 京北方信息技术股份有限公司 Corpus updating method, apparatus, storage medium and terminal
CN116502614A (en) * 2023-06-26 2023-07-28 北京每日信动科技有限公司 Data checking method, system and storage medium
CN116502614B (en) * 2023-06-26 2023-09-01 北京每日信动科技有限公司 Data checking method, system and storage medium

Also Published As

Publication number Publication date
CN104036047B (en) 2017-05-17

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20160205

Address after: 200335 Shanghai city Changning District Admiralty Road No. 968 Building No. 16 10 floor

Applicant after: SHANGHAI XIECHENG BUSINESS CO., LTD.

Address before: 200335 Shanghai City, Changning District Fuquan Road No. 99, Ctrip network technology building

Applicant before: Ctrip computer technology (Shanghai) Co., Ltd.

GR01 Patent grant