CN104036047B - Method and system for automatically correcting character strings - Google Patents

Info

Publication number
CN104036047B
Authority
CN
China
Prior art keywords
word
phrase
string
character string
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410312846.9A
Other languages
Chinese (zh)
Other versions
CN104036047A (en)
Inventor
刘利
黄晓君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Ctrip Business Co Ltd
Original Assignee
Shanghai Ctrip Business Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Ctrip Business Co Ltd
Priority to CN201410312846.9A
Publication of CN104036047A
Application granted
Publication of CN104036047B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Document Processing Apparatus (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and system for automatically correcting character strings. The method comprises the following steps: generating a keyword database that records first-class words, second-class words, preset phrases and a word order; generating a phrase arrangement statistics table from the keyword database; reading an input character string; selecting first-class words from the input character string and dividing the string into key phrases; selecting valid phrases, words to be combined and invalid words from the key phrases; forming valid phrases from the words to be combined according to the phrase arrangement statistics table; generating an output character string; and calculating an accuracy from the phrase arrangement statistics table and outputting it. The method and system rely partly on dictionary matching and partly on statistical probability, so that the accuracy of input character string information can be assessed and slips made during user input can be recognized and corrected automatically, which improves the operating efficiency of e-commerce.

Description

Method and system for automatically correcting character strings
Technical field
The present invention relates to a method and system for automatically correcting character strings.
Background art
As e-commerce plays an increasingly important role in people's daily lives, the authenticity and accuracy of the information that users enter has become a major concern for many e-commerce companies. E-commerce often involves filling in information that follows a general format, such as a shipping address, and this information usually plays an important role in the interaction and communication between merchants and users. However, in the massive volume of user-entered information, some of it is inevitably nuisance information, that is, false information; in addition, some users inevitably make slips because they are not careful enough when entering information. For these two reasons, the authenticity and accuracy of part of the input information is in doubt, which hinders further communication or transactions between merchants and users.
In fact, because minor errors caused by users' slips cannot be corrected automatically, they greatly reduce the operating efficiency of e-commerce, and users are inconvenienced by having to re-enter the information. As for nuisance false information, because it is difficult to derive, automatically and efficiently, a basis for accurate judgment or identification from the input itself, such information not only drags down the operating efficiency of e-commerce but also raises the cost of anti-fraud risk control. These problems have long troubled e-commerce service providers, merchants and consumers.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the defects of the prior art, in which the authenticity or accuracy of a character string entered by a user cannot be judged automatically, efficiently and accurately, and in which slips made during input are hard to recognize and minor errors in the string are hard to correct automatically, so that the operating efficiency of e-commerce is reduced and the cost of anti-fraud risk control is relatively high. To this end, a method and system for automatically correcting character strings are proposed.
The present invention solves the above technical problem by the following technical solutions:
The invention provides a method for automatically correcting character strings, characterized in that a string database stores a plurality of verified character strings and a plurality of preset first-class words, each verified character string including several first-class words, and the method comprises the following steps:
S1. From the plurality of character strings, extract the other words separated by the first-class words as second-class words, take the phrase formed by each second-class word together with the first-class word immediately following it as a preset phrase, and then generate a keyword database. The keyword database records a plurality of first-class words, second-class words and preset phrases, as well as a word order, the word order being the preset arrangement order of the first-class words;
S2. Generate a phrase arrangement statistics table, which records the arrangement probability that each preset phrase appears at the beginning of the plurality of character strings and the arrangement probability that each preset phrase appears immediately after each preset phrase in the plurality of character strings;
S3. Read an input character string;
S4. Select the first-class words in the input character string as level keywords and, according to the positions of the level keywords in the input character string, divide the input character string into key phrases, each level keyword being located at the end of its key phrase;
S5. Select the preset phrases in each key phrase as valid phrases, and mark the part of the input character string other than the valid phrases as the invalid part;
S6. Select the second-class words in the invalid part as words to be combined, and mark all words in the invalid part other than the words to be combined as the invalid-word part;
S7. Proceeding from front to back through the input character string, generate in turn the valid phrase corresponding to each word to be combined, according to the valid phrase adjacent to that word and the phrase arrangement statistics table. Among the phrases obtained by combining the word to be combined with each first-class word, the generated valid phrase is the one with the largest arrangement probability in the phrase arrangement statistics table;
S8. Generate an output character string in which the valid phrases are arranged in an order determined by the word order;
S9. Query the phrase arrangement statistics table for the arrangement probability of the valid phrase at the beginning of the output character string and the arrangement probabilities of adjacent valid phrases, and calculate the sum of the obtained arrangement probabilities as the accuracy;
S10. Output the accuracy.
Here, the verified character strings should be understood as character strings that satisfy a specific format requirement, which requires that a string include several first-class words. Taking strings that represent addresses as an example, they necessarily include words that indicate the address level, such as "road" and "district". Those skilled in the art will readily understand that, in e-commerce, the data stored in the string database usually comes from the same source as the input character string read in step S3; the difference is that the data stored in the string database has been verified and is therefore true and accurate.
The word order applies to the first-class words. Second-class words can essentially be understood as what is obtained by splitting the character strings at the preset first-class words; preset phrases are similar and can likewise be regarded as parts of the strings obtained after splitting. In step S1, therefore, the only raw data are the character strings and the first-class words. Step S2 is essentially a statistical summary of the character strings used in S1. The arrangement probability is the probability of an arrangement of preset phrases; in the present invention it refers only to the probability that two preset phrases are arranged immediately one after the other. That is, it is the probability that a given preset phrase is immediately followed by another preset phrase. The only special case is the preset phrase at the beginning of a string, for which the arrangement probability is the probability that the phrase appears at the beginning of a string. All arrangement probabilities are counted and calculated from the plurality of character strings, that is, from true and accurate verified data.
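Steps S1 and S2 can be illustrated with a minimal Python sketch. Several things here are assumptions rather than part of the text: the first-class words are taken to be the Chinese address suffixes 市, 区, 新村 and 路 (one reading of "city, district, new village, road"), the names `split_preset_phrases` and `build_tables` are hypothetical, and the arrangement probability after a phrase is normalised as a conditional relative frequency, which the text does not specify.

```python
import re
from collections import Counter, defaultdict

# Assumed first-class words (address-level suffixes); illustrative only.
FIRST_CLASS = ["新村", "市", "区", "路"]
PATTERN = re.compile("(.+?)(" + "|".join(map(re.escape, FIRST_CLASS)) + ")")

def split_preset_phrases(verified):
    """S1: split a verified string into (second-class word, first-class word) pairs."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(verified)]

def build_tables(verified_strings):
    """S2: count how often each preset phrase starts a string or follows another."""
    start = Counter()
    follow = defaultdict(Counter)
    for s in verified_strings:
        phrases = [word + fc for word, fc in split_preset_phrases(s)]
        if not phrases:
            continue
        start[phrases[0]] += 1
        for prev, nxt in zip(phrases, phrases[1:]):
            follow[prev][nxt] += 1
    total = sum(start.values())
    # Assumed normalisation: relative frequencies.
    start_p = {p: c / total for p, c in start.items()}
    follow_p = {
        prev: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for prev, row in follow.items()
    }
    return start_p, follow_p

start_p, follow_p = build_tables(["上海市南京路", "上海市黄浦区南京路", "黄浦区东方新村"])
print(start_p)
print(follow_p)
```

The sketch splits naively at the first suffix character it meets, so a second-class word that itself contains a suffix character would be split incorrectly; a real keyword database would need a more careful tokenizer.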
Steps S3 to S6 obtain the input character string and then divide it on the basis of the keyword database: according to the first-class and second-class words in the input, they produce the valid phrases, the words to be combined, the invalid part, the invalid-word part and so on for subsequent processing. The valid phrases can be output directly as components of the final output character string. A word to be combined is a word that matches a second-class word in the keyword database but is not followed by a first-class word.
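The partition performed in steps S4 to S6 might look like the following sketch. The function name `partition`, the Chinese example data and the greedy longest-match strategy are assumptions made for illustration; the text does not say how matching is implemented. The sketch also assumes that the set of second-class words is not empty.

```python
import re

def partition(text, first_class, second_class, preset_phrases):
    """Return (valid phrases, words to be combined, invalid-word pieces)."""
    # S4: cut the input after every first-class word (level keyword).
    suffix = "|".join(sorted(map(re.escape, first_class), key=len, reverse=True))
    key_phrases = re.findall(f".+?(?:{suffix})", text)
    tail = text[sum(len(k) for k in key_phrases):]  # text after the last level keyword

    # S5: a key phrase that ends in a preset phrase yields a valid phrase;
    # everything else belongs to the invalid part.
    valid, invalid = [], []
    for k in key_phrases:
        hit = max((p for p in preset_phrases if k.endswith(p)), key=len, default=None)
        if hit:
            valid.append(hit)
            k = k[: -len(hit)]
        if k:
            invalid.append(k)
    if tail:
        invalid.append(tail)

    # S6: second-class words found in the invalid part are words to be combined;
    # whatever is left over is the invalid-word part.
    word_re = re.compile(
        "|".join(sorted(map(re.escape, second_class), key=len, reverse=True))
    )
    to_combine, invalid_words = [], []
    for piece in invalid:
        to_combine += word_re.findall(piece)
        invalid_words += [x for x in word_re.split(piece) if x]
    return valid, to_combine, invalid_words

print(partition(
    "南京黄浦区上海东方",
    ["市", "区", "新村", "路"],
    {"上海", "南京", "黄浦", "东方"},
    {"上海市", "南京路", "黄浦区", "东方新村"},
))
```

With this assumed data, the call separates one valid phrase from three words that still lack a first-class word and leaves nothing in the invalid-word part.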
For a word to be combined, what is missing after it is probably a first-class word. Step S7 uses the arrangement probabilities in the phrase arrangement statistics table, together with the neighbouring words in the input character string, to select for each word to be combined the valid phrase with the largest arrangement probability. The output character string is then generated. It contains only valid phrases, so it should be the string of accurate information that the string database determines for the input character string.
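One possible reading of step S7 is sketched below: each first-class word is tried as a suffix for the word to be combined, and the candidate with the largest arrangement probability next to the adjacent valid phrase is kept. The function name `complete`, the choice to look up the candidate both before and after its neighbour, and the probability values are all assumptions made for illustration.

```python
def complete(word, neighbor, first_class, follow_p):
    """Pick the first-class suffix that gives `word` the most probable phrase
    next to the adjacent valid phrase `neighbor`."""
    best, best_p = None, 0.0
    for fc in first_class:
        candidate = word + fc
        # The candidate may end up before or after its neighbour once the
        # output is sorted, so both directions are looked up (an assumption).
        p = max(
            follow_p.get(candidate, {}).get(neighbor, 0.0),
            follow_p.get(neighbor, {}).get(candidate, 0.0),
        )
        if p > best_p:
            best, best_p = candidate, p
    return best, best_p

# Hypothetical arrangement probabilities.
follow_p = {
    "上海市": {"黄浦区": 0.5, "南京路": 0.5},
    "南京市": {"黄浦区": 0.125},
    "黄浦区": {"南京路": 0.5, "东方新村": 0.5},
}
first_class = ["市", "区", "新村", "路"]
print(complete("上海", "黄浦区", first_class, follow_p))
print(complete("南京", "黄浦区", first_class, follow_p))
```

If no candidate has a non-zero probability, the sketch returns `None`, leaving it to the caller to decide how such a word is handled.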
The above steps can be regarded as the process of generating the output character string. This process is based partly on dictionary matching and partly on phrase matching by statistical methods. In step S9, the accuracy of the output character string is additionally obtained from the phrase arrangement statistics table. It should be understood that this accuracy cannot directly determine whether the output character string is correct; rather, it resembles a measure of how similar the overall features of the output character string are to those of the verified character strings. If the verified strings are highly representative, the accuracy is correspondingly more reliable. Through the above steps, the input character string can be recognized and corrected efficiently and with reasonable accuracy, and the accuracy can be calculated automatically and efficiently as a basis for judging the output character string, which reduces the cost of online anti-fraud risk control.
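The accuracy of step S9 is a plain sum of table lookups. A minimal sketch follows, in which the function name `accuracy`, the two-dictionary layout of the table and the probability values are assumptions.

```python
def accuracy(phrases, start_p, follow_p):
    """S9: start probability of the first valid phrase plus the arrangement
    probability of every adjacent pair of valid phrases."""
    if not phrases:
        return 0.0
    total = start_p.get(phrases[0], 0.0)
    for prev, nxt in zip(phrases, phrases[1:]):
        total += follow_p.get(prev, {}).get(nxt, 0.0)
    return total

# Hypothetical table entries.
start_p = {"上海市": 0.5}
follow_p = {"上海市": {"黄浦区": 0.25}, "黄浦区": {"南京路": 0.5}}
print(accuracy(["上海市", "黄浦区", "南京路"], start_p, follow_p))
```

Because the result is a sum, it is not bounded by 1 and grows with the number of phrases, which is consistent with the text describing it as a similarity-like measure rather than a probability of correctness.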
Preferably, the string database also stores a weight value for each first-class word, and S9 is replaced by S9a, where S9a is:
Query the phrase arrangement statistics table for the arrangement probability of the valid phrase at the beginning of the output character string and the arrangement probabilities of adjacent valid phrases, and calculate the weighted mean of the obtained arrangement probabilities as the accuracy, where the weight of each arrangement probability equals the weight value of the first-class word in the valid phrase at the beginning of the output character string, or the weight value of the first-class word in the later of the two adjacent valid phrases.
In the present invention, character strings are in effect divided on the basis of the preset first-class words, and the first-class words may differ in their standing within a string. For example, if the first-class words represent a hierarchy that descends level by level, the total number of second-class words corresponding to a higher-level first-class word may be smaller, so the arrangement probabilities of valid phrases involving those first-class words can be markedly larger. Weighting by first-class word therefore prevents one arrangement probability in a string from being so large that it dominates the accuracy result and swamps the effect of the other arrangement probabilities.
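The weighted variant S9a can be sketched as follows. Phrases are represented as (second-class word, first-class word) pairs so that the weight of each phrase's first-class word can be looked up; this representation, the function name `weighted_accuracy` and the weight values are assumptions.

```python
def weighted_accuracy(phrases, start_p, follow_p, weights):
    """S9a: weighted mean of the arrangement probabilities. Each probability
    is weighted by the first-class word of the phrase it leads to."""
    if not phrases:
        return 0.0
    names = [word + fc for word, fc in phrases]
    probs = [start_p.get(names[0], 0.0)]
    probs += [follow_p.get(a, {}).get(b, 0.0) for a, b in zip(names, names[1:])]
    ws = [weights[fc] for _, fc in phrases]
    return sum(w * p for w, p in zip(ws, probs)) / sum(ws)

# Hypothetical table entries and weights.
start_p = {"上海市": 0.5}
follow_p = {"上海市": {"黄浦区": 0.25}, "黄浦区": {"南京路": 0.5}}
weights = {"市": 1.0, "区": 1.0, "路": 2.0}
print(weighted_accuracy(
    [("上海", "市"), ("黄浦", "区"), ("南京", "路")], start_p, follow_p, weights
))
```

Giving lower-level first-class words a larger weight, as in this example, is one way to keep a near-certain top-level phrase from dominating the result.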
Preferably, S2 further includes: rewriting every arrangement probability in the phrase arrangement statistics table that exceeds a preset probability threshold to equal the probability threshold. This prevents an arrangement probability from becoming excessively large because some preset phrase occurs too rarely in the plurality of character strings, in which case the accuracy result could not fully reflect the overall similarity between the valid phrases and the verified character strings.
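Capping the table is a one-line transformation. In the sketch below the table is assumed to be a flat dictionary keyed by (previous phrase, phrase), with `None` as the previous phrase for start-of-string entries; that layout and the name `cap_probabilities` are hypothetical.

```python
def cap_probabilities(table, threshold):
    """Rewrite every arrangement probability above `threshold` to `threshold`."""
    return {key: min(p, threshold) for key, p in table.items()}

# A phrase seen only once in the verified strings gets probability 1.0 for
# its single successor; the cap keeps that entry from dominating the sum.
table = {(None, "上海市"): 0.75, ("罕见路", "东方新村"): 1.0, ("上海市", "黄浦区"): 0.25}
print(cap_probabilities(table, 0.5))
```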
Preferably, the following step is also included after S10:
S11a. Add the output character string to the string database for storage.
Preferably, S61 is performed after S6, where S61 is: select the phrases that include a first-class word from the invalid-word part as unknown phrases, and then perform S7;
The following steps are also included after S10:
S11. Judge whether the accuracy is greater than a preset accuracy threshold; if not, perform S12; if so, perform S13;
S12. Add the output character string to the string database for storage, and end the process;
S13. Add the unknown phrases to the output character string according to the first-class words in the unknown phrases to generate a return character string, in which the order of the first-class words conforms to the word order, and perform S14;
S14. Add the return character string to the string database for storage.
Because an unknown phrase includes a first-class word, it may well be unrecognized only because there are not enough verified character strings. In that case, if the accuracy indicates that the output character string is highly accurate, the unknown phrase is considered unlikely to be the result of a user input error, so it is included in the return character string and added to the string database.
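The branch in steps S11 to S14 can be sketched as a function that returns the string to be stored. Phrases are again (second-class word, first-class word) pairs; the function name `string_to_store`, the example phrase 静安区 and the word order used are hypothetical.

```python
def string_to_store(output_phrases, unknown_phrases, accuracy, threshold, word_order):
    """S11 to S14: decide which string is added to the string database."""
    rank = {fc: i for i, fc in enumerate(word_order)}
    if accuracy <= threshold:
        # S12: keep the output string as it is.
        phrases = output_phrases
    else:
        # S13: merge the unknown phrases and restore the word order.
        phrases = sorted(output_phrases + unknown_phrases, key=lambda p: rank[p[1]])
    return "".join(word + fc for word, fc in phrases)

output_phrases = [("上海", "市"), ("南京", "路")]
unknown_phrases = [("静安", "区")]  # has a first-class word but is not a preset phrase
order = ["市", "区", "路"]
print(string_to_store(output_phrases, unknown_phrases, 0.9, 0.5, order))
print(string_to_store(output_phrases, unknown_phrases, 0.3, 0.5, order))
```

Python's `sorted` is stable, so phrases that share a first-class word keep their relative order; how the method orders such phrases is not stated in the text.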
Preferably, S10 further includes: outputting the number of words to be combined and/or the number of characters in the invalid-word part;
S11 is replaced by S11b, where S11b is: judge whether the following conditions hold simultaneously: the accuracy is greater than the accuracy threshold, the number of words to be combined is less than a preset threshold on the number of words to be combined, and/or the number of characters in the invalid-word part is less than a preset invalid-character threshold; if not, perform S12; if so, perform S13.
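The combined test in S11b might be written as the predicate below. The text leaves the second and third conditions optional ("and/or"); this sketch assumes all three are applied, and the function and parameter names are hypothetical.

```python
def should_merge_unknown(accuracy, n_to_combine, n_invalid_chars,
                         accuracy_threshold, max_to_combine, max_invalid_chars):
    """S11b: True selects S13 (merge unknown phrases), False selects S12."""
    return (
        accuracy > accuracy_threshold
        and n_to_combine < max_to_combine
        and n_invalid_chars < max_invalid_chars
    )

print(should_merge_unknown(0.9, 1, 0, 0.5, 3, 4))
print(should_merge_unknown(0.9, 5, 0, 0.5, 3, 4))
```

The extra conditions mean that an input which needed many repairs, or which contained many unrecognized characters, does not contribute unknown phrases to the database even when its accuracy is high.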
Preferably, the first-class words include city, district, new village and road.
The present invention also provides a system for automatically correcting character strings, characterized in that it comprises:
a string database module for storing a plurality of verified character strings and a plurality of preset first-class words, each verified character string including several first-class words;
a keyword database module for extracting, from the plurality of character strings, the other words separated by the first-class words as second-class words, taking the phrase formed by each second-class word together with the first-class word immediately following it as a preset phrase, and then generating a keyword database, the keyword database recording a plurality of first-class words, second-class words and preset phrases as well as a word order, the word order being the preset arrangement order of the first-class words;
a phrase arrangement statistics module for calculating and recording, from the records of the string database module and the keyword database, the arrangement probability that each preset phrase appears at the beginning of the plurality of character strings and the arrangement probability that each preset phrase appears immediately after each preset phrase in the plurality of character strings;
a string reading module for reading an input character string;
a string division module for selecting the first-class words in the input character string as level keywords and dividing the input character string into key phrases according to the positions of the level keywords in the input character string, each level keyword being located at the end of its key phrase;
a valid phrase selection module for selecting the preset phrases in each key phrase as valid phrases and marking the part of the input character string other than the valid phrases as the invalid part;
a to-be-combined word selection module for selecting the second-class words in the invalid part as words to be combined and marking all words in the invalid part other than the words to be combined as the invalid-word part;
a phrase construction module for generating in turn, proceeding from front to back through the input character string, the valid phrase corresponding to each word to be combined, according to the valid phrase adjacent to that word and the phrase arrangement statistics table, the generated valid phrase being, among the phrases obtained by combining the word to be combined with each first-class word, the one with the largest arrangement probability in the phrase arrangement statistics table;
an output module for generating an output character string in which the valid phrases are arranged in an order determined by the word order;
a first calculation module for querying the phrase arrangement statistics module to obtain the arrangement probability of the valid phrase at the beginning of the output character string and the arrangement probabilities of adjacent valid phrases, and calculating the sum of the obtained arrangement probabilities as the accuracy;
an accuracy module for outputting the accuracy.
Preferably, the string database module also stores a weight value for each first-class word, and the first calculation module is replaced by a second calculation module;
the second calculation module is used for querying the phrase arrangement statistics module to obtain the arrangement probability of the valid phrase at the beginning of the output character string and the arrangement probabilities of adjacent valid phrases, and calculating the weighted mean of the obtained arrangement probabilities as the accuracy, where the weight of each arrangement probability equals the weight value of the first-class word in the valid phrase at the beginning of the output character string, or the weight value of the first-class word in the later of the two adjacent valid phrases.
Preferably, the phrase arrangement statistics module is also used for rewriting every arrangement probability that exceeds a preset probability threshold to equal the probability threshold.
Preferably, the system also includes an output string return module for adding the output character string to the string database module for storage.
Preferably, the to-be-combined word selection module is also used for selecting the phrases that include a first-class word from the invalid-word part as unknown phrases, and the system further comprises:
a first judgment module for judging whether the accuracy is greater than a preset accuracy threshold, enabling a first return module if not and a second return module if so;
the first return module, for adding the output character string to the string database module for storage;
the second return module, for adding the unknown phrases to the output character string according to the first-class words in the unknown phrases to generate a return character string in which the order of the first-class words conforms to the word order, and then adding the return character string to the string database module for storage.
Preferably, the accuracy module is also used for outputting the number of words to be combined and/or the number of characters in the invalid-word part;
the first judgment module is replaced by a second judgment module, which is used for judging whether the following conditions hold simultaneously: the accuracy is greater than the accuracy threshold, the number of words to be combined is less than a preset threshold on the number of words to be combined, and/or the number of characters in the invalid-word part is less than a preset invalid-character threshold; the first return module is enabled if not, and the second return module if so.
Preferably, the first-class words include city, district, new village and road.
On the basis of common knowledge in the art, the above preferred conditions can be combined arbitrarily to obtain the preferred embodiments of the present invention.
The positive effects of the present invention are as follows:
The method and system for automatically correcting character strings of the present invention are based partly on dictionary matching and partly on inferring the correctness of a string from statistical probability. They can judge the authenticity or accuracy of a user-entered character string automatically and efficiently, and can also recognize slips made during input and automatically correct minor errors in the string, which improves the operating efficiency of e-commerce.
Brief description of the drawings
Fig. 1 is a flowchart of the method for automatically correcting character strings according to Embodiment 1 of the present invention.
Fig. 2 is a schematic diagram of the system for automatically correcting character strings according to Embodiment 2 of the present invention.
Detailed description of the embodiments
Preferred embodiments of the present invention are given below with reference to the accompanying drawings to describe the technical solutions in detail, but the present invention is not thereby limited to the scope of the described embodiments.
Embodiment 1
In the method for automatically correcting character strings of this embodiment, a string database stores a plurality of verified character strings and a plurality of preset first-class words, each verified character string including several first-class words. Referring to Fig. 1, the method comprises the following steps:
S1. From the plurality of character strings, extract the other words separated by the first-class words as second-class words, take the phrase formed by each second-class word together with the first-class word immediately following it as a preset phrase, and then generate a keyword database. The keyword database records a plurality of first-class words, second-class words and preset phrases, as well as a word order, the word order being the preset arrangement order of the first-class words;
S2. Generate a phrase arrangement statistics table, which records the arrangement probability that each preset phrase appears at the beginning of the plurality of character strings and the arrangement probability that each preset phrase appears immediately after each preset phrase in the plurality of character strings;
S3. Read an input character string;
S4. Select the first-class words in the input character string as level keywords and, according to the positions of the level keywords in the input character string, divide the input character string into key phrases, each level keyword being located at the end of its key phrase;
S5. Select the preset phrases in each key phrase as valid phrases, and mark the part of the input character string other than the valid phrases as the invalid part;
S6. Select the second-class words in the invalid part as words to be combined, and mark all words in the invalid part other than the words to be combined as the invalid-word part;
S7. Proceeding from front to back through the input character string, generate in turn the valid phrase corresponding to each word to be combined, according to the valid phrase adjacent to that word and the phrase arrangement statistics table. Among the phrases obtained by combining the word to be combined with each first-class word, the generated valid phrase is the one with the largest arrangement probability in the phrase arrangement statistics table;
S8. Generate an output character string in which the valid phrases are arranged in an order determined by the word order;
S9. Query the phrase arrangement statistics table for the arrangement probability of the valid phrase at the beginning of the output character string and the arrangement probabilities of adjacent valid phrases, and calculate the sum of the obtained arrangement probabilities as the accuracy;
S10. Output the accuracy;
S11a. Add the output character string to the string database for storage.
Here, S2 further includes: rewriting every arrangement probability in the phrase arrangement statistics table that exceeds a preset probability threshold to equal the probability threshold. This prevents an arrangement probability from becoming excessively large because some preset phrase occurs too rarely in the plurality of character strings, in which case the accuracy result could not fully reflect the overall similarity between the valid phrases and the verified character strings.
In addition, the verified character strings should be understood as character strings that satisfy a specific format requirement, which requires that a string include several first-class words. The word order applies to the first-class words. Second-class words can essentially be understood as what is obtained by splitting the character strings at the preset first-class words; preset phrases are similar and can likewise be regarded as parts of the strings obtained after splitting.
Step S2 is essentially a statistical summary of the character strings used in S1. The arrangement probability is the probability that preset phrases are arranged immediately one after the other. For a word to be combined, what is missing after it is probably a first-class word; step S7 uses the arrangement probabilities in the phrase arrangement statistics table, together with the neighbouring words in the input character string, to select for each word to be combined the valid phrase with the largest arrangement probability. The output character string is then generated. It contains only valid phrases, so it should be the string of accurate information determined from the string database for the input character string.
The above steps can be regarded as the process of generating the output character string, which is based partly on dictionary matching and partly on phrase matching by statistical methods. The application of the method of this embodiment to the automatic correction of address strings is described below.
In this application example, first kind word includes city, area, Village, road, and row's word order is city, area, new Village, road.Include in the plurality of character string similar " Shanghai City Nanjing Road ", " Shanghai City Huangpu District XX roads ", " Huangpu District east is new Village " etc examines character string.By these character strings, can count draw " Shanghai City ", " Shanghai Village ", " upper sea route ", The arrangement probability of the default phrase in " Nanjing Road ", " Nanjing " etc..
In this case, suppose the input character string that is read is "Nanjing Huangpu District Shanghai Dongfang". Although it is difficult to recognize a valid address directly from this input string, the verified strings underlying the phrase arrangement statistical table show that the words "Nanjing", "Shanghai", and "Dongfang", which lack a first-type word in the input string, may each form a preset phrase with one or more of the first-type words. Step S4 is performed first, to mark off the key phrase "Huangpu District".
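Steps S4 to S6 for this input can be sketched roughly as below. The sketch assumes a pre-tokenized input and, for brevity, that every second-type word is a single token; `segment` and its parameters are hypothetical names, and the small dictionaries stand in for the keyword database.

```python
def segment(tokens, first_type, preset_phrases, second_type):
    """Rough sketch of S4-S6 on a pre-tokenized input string.

    Returns (valid_phrases, words_to_combine, invalid_words).
    Assumption: a preset phrase is one token followed by a first-type word.
    """
    valid, invalid = [], []
    chunk = []
    for tok in tokens:
        chunk.append(tok)
        if tok in first_type:               # S4: a level keyword ends a key phrase
            cand = " ".join(chunk[-2:])
            if len(chunk) >= 2 and cand in preset_phrases:
                valid.append(cand)          # S5: preset phrase becomes a valid phrase
                invalid.extend(chunk[:-2])  # the rest of the key phrase is invalid
            else:
                invalid.extend(chunk)
            chunk = []
    invalid.extend(chunk)                   # trailing tokens without a level keyword
    # S6: second-type words in the invalid portion are words to be combined
    to_combine = [t for t in invalid if t in second_type]
    invalid_words = [t for t in invalid if t not in second_type]
    return valid, to_combine, invalid_words

result = segment(
    ["Nanjing", "Huangpu", "District", "Shanghai", "Dongfang"],
    first_type={"City", "District", "New Village", "Road"},
    preset_phrases={"Huangpu District", "Shanghai City", "Nanjing Road"},
    second_type={"Nanjing", "Huangpu", "Shanghai", "Dongfang"},
)
```

For the example input, the only valid phrase found is "Huangpu District", and the three remaining words all become words to be combined.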
Then, for the input character string "Nanjing Huangpu District Shanghai Dongfang", the remaining part is marked as the invalid portion. The verified strings establish that "Nanjing", "Shanghai", and "Dongfang" are second-type words, so they are selected as words to be combined. Next, all possible combinations of these second-type words with first-type words are considered, and the arrangement probabilities in the phrase arrangement statistical table are used to find the most probable arrangement of the string formed after the potential first-type words are added. For example, "Nanjing Road" with "Shanghai City" and "Nanjing City" with "Shanghai Road" are both possible. However, once the key phrase "Huangpu District" is taken into account and the phrases are sorted by first-type word, the arrangement probability of "Shanghai City Huangpu District" is greater than that of "Nanjing City Huangpu District", so the former is adopted.
Thus, the output string finally generated is "Shanghai City Huangpu District Nanjing Road Dongfang New Village". It should be understood that, although the above method does not guarantee that the output string is necessarily correct or unambiguously recognizable, from the standpoint of statistical probability the correctness of the output string is substantially improved. This is particularly evident when the character string is longer, that is, when it contains more first-type words (which can be regarded as level keywords). The reason is that the method can sort by first-type words and thereby use the verified strings to correct the string to be confirmed more effectively, and when more arrangement probabilities are involved they corroborate one another.
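The choice between the competing combinations can be illustrated with a small exhaustive search. This is only a sketch under several assumptions: each free first-type word is used at most once, an arrangement is scored by the sum of its start and adjacent arrangement probabilities, and the probability values are invented for illustration. To keep it short, "Dongfang" and "New Village" are left out. All function and variable names are hypothetical.

```python
from itertools import permutations

def best_completion(to_combine, key_phrases, free_first_type, order, start_p, follow_p):
    """Try every assignment of free first-type words to the words to be
    combined, sort phrases by the word ordering, and keep the arrangement
    with the highest summed arrangement probability (assumed scoring rule)."""
    rank = {w: i for i, w in enumerate(order)}
    best, best_score = None, float("-inf")
    for firsts in permutations(free_first_type, len(to_combine)):
        phrases = list(zip(to_combine, firsts)) + list(key_phrases)
        phrases.sort(key=lambda p: rank[p[1]])
        names = [f"{word} {suffix}" for word, suffix in phrases]
        score = start_p.get(names[0], 0.0) + sum(
            follow_p.get(a, {}).get(b, 0.0) for a, b in zip(names, names[1:])
        )
        if score > best_score:
            best, best_score = names, score
    return best, best_score

# Invented probabilities, for illustration only.
start_p = {"Shanghai City": 0.6, "Nanjing City": 0.3}
follow_p = {
    "Shanghai City": {"Huangpu District": 0.5},
    "Nanjing City": {"Huangpu District": 0.01},
    "Huangpu District": {"Nanjing Road": 0.3, "Shanghai Road": 0.05},
}
best, score = best_completion(
    to_combine=["Nanjing", "Shanghai"],
    key_phrases=[("Huangpu", "District")],
    free_first_type=["City", "Road"],
    order=["City", "District", "Road"],
    start_p=start_p,
    follow_p=follow_p,
)
```

With these numbers the search prefers "Shanghai City, Huangpu District, Nanjing Road" over "Nanjing City, Huangpu District, Shanghai Road", mirroring the reasoning in the example.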
In step S9, the accuracy of the output string is also obtained from the phrase arrangement statistical table. It should be understood that, although the accuracy in the present invention cannot directly determine whether the output string is correct, it resembles a measure of the similarity between the output string and the overall features of the verified character strings. If the verified character strings are highly representative, the accuracy is correspondingly reliable.
The method of this embodiment not only recognizes and corrects input character strings more efficiently and accurately, but also automatically and efficiently computes an accuracy that serves as a basis for judging the correctness of the output string, which makes it possible to reduce the cost of online anti-fraud risk control.
Embodiment 2
The automatic character string correction method of this embodiment differs from Embodiment 1 only in the following respects:
The string database also stores a weight value for each first-type word, and S9 is replaced by S9a, where S9a is:
Query the phrase arrangement statistical table to obtain the arrangement probabilities of the valid phrase at the beginning of the output string and of the adjacent valid phrases, and compute the weighted mean of the obtained arrangement probabilities as the accuracy, where the weight of each arrangement probability equals the weight value of the first-type word in the valid phrase at the beginning of the output string, or the weight value of the first-type word in the latter of the two adjacent valid phrases.
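The weighted mean of S9a might be computed as in the following sketch. It assumes the mean is normalized by the sum of the weights and that missing table entries count as probability zero; neither detail is spelled out in the text. The function name, the table values, and the weights are all hypothetical.

```python
def weighted_accuracy(phrases, start_p, follow_p, weights):
    """Weighted mean of arrangement probabilities for an output string.

    `phrases` is the output string as ordered (second_word, first_type_word)
    pairs. The start probability is weighted by the first-type word of the
    first phrase; each adjacent-pair probability is weighted by the
    first-type word of the latter phrase.
    """
    names = [f"{word} {suffix}" for word, suffix in phrases]
    probs = [start_p.get(names[0], 0.0)]
    probs += [follow_p.get(a, {}).get(b, 0.0) for a, b in zip(names, names[1:])]
    ws = [weights[suffix] for _, suffix in phrases]
    return sum(p * w for p, w in zip(probs, ws)) / sum(ws)

# Invented table values and weights, for illustration only.
accuracy = weighted_accuracy(
    [("Shanghai", "City"), ("Huangpu", "District"), ("Nanjing", "Road")],
    start_p={"Shanghai City": 0.6},
    follow_p={
        "Shanghai City": {"Huangpu District": 0.5},
        "Huangpu District": {"Nanjing Road": 0.3},
    },
    weights={"City": 3, "District": 2, "Road": 1},
)
```

Giving "City" a larger weight than "Road" makes the higher-level part of the address count more toward the accuracy.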
Also, S61 is performed after S6, where S61 is: select from the invalid-word part the phrases that contain a first-type word as unknown phrases, and perform S7.
S10 further includes: outputting the number of words to be combined and/or the number of characters contained in the invalid-word part;
The following steps are further included after S10:
S11b: judge whether the following conditions hold simultaneously: the accuracy is greater than the accuracy threshold, the number of words to be combined is less than a preset threshold on the number of words to be combined, and/or the number of characters contained in the invalid-word part is less than a preset invalid-character-count threshold; if the result is no, perform S12, and if the result is yes, perform S13;
S12: add the output string to the string database for storage, and end the flow;
S13: according to the first-type words in the unknown phrases, add the unknown phrases to the output string to generate a return string, the order of the first-type words in the return string conforming to the word ordering, and perform S14;
S14: add the return string to the string database for storage.
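The branching of S11b and the merge of S13 can be sketched as follows. The text joins the conditions with "and/or"; this sketch assumes the variant in which all three conditions must hold. The function names and thresholds are hypothetical.

```python
def next_step(accuracy, n_to_combine, n_invalid_chars,
              acc_threshold, max_to_combine, max_invalid_chars):
    """S11b, assuming all three conditions are required simultaneously.

    Returns "S13" when the judgment is yes and "S12" when it is no,
    following the branch directions given in the text.
    """
    passed = (
        accuracy > acc_threshold
        and n_to_combine < max_to_combine
        and n_invalid_chars < max_invalid_chars
    )
    return "S13" if passed else "S12"

def merge_unknown(output_phrases, unknown_phrases, order):
    """S13: add unknown phrases to the output string, keeping the
    first-type words in the preset word ordering (stable sort)."""
    rank = {w: i for i, w in enumerate(order)}
    return sorted(output_phrases + unknown_phrases, key=lambda p: rank[p[1]])

step = next_step(0.9, 1, 0, acc_threshold=0.5, max_to_combine=3, max_invalid_chars=4)
returned = merge_unknown(
    [("Shanghai", "City"), ("Nanjing", "Road")],
    [("Huangpu", "District")],
    order=["City", "District", "Road"],
)
```

Phrases are represented here as (second-type word, first-type word) pairs, so sorting by the rank of the first-type word is enough to restore the word ordering.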
Building on Embodiment 1, the method of this embodiment takes into account that the individual first-type words in a character string may differ in importance, and therefore weights the arrangement probabilities according to the different first-type words. This prevents a single overly large arrangement probability in the string from dominating the accuracy result and masking the influence of the other arrangement probabilities.
Embodiment 3
Referring to Fig. 2, the automatic character string correction system of this embodiment includes:
A string database module 1, configured to store a plurality of verified character strings and a plurality of preset first-type words, each verified character string including several first-type words;
A keyword database module 2, configured to extract from the plurality of character strings the other words separated by the first-type words as second-type words, to take the phrase formed by each second-type word together with the first-type word immediately following it as a preset phrase, and then to generate a keyword database, the keyword database recording the plurality of first-type words, the second-type words, the preset phrases, and a word ordering, the word ordering being the preset order of the first-type words;
A phrase arrangement statistics module 3, configured to compute and record, from the records of the string database module and the keyword database, the arrangement probability that each preset phrase appears at the beginning of the plurality of character strings and the arrangement probability that each preset phrase appears immediately after each preset phrase in the plurality of character strings;
A string reading module 4, configured to read an input character string;
A string division module 5, configured to select first-type words from the input character string as level keywords and to divide the input character string into key phrases according to the positions of the level keywords in the input character string, each level keyword being located at the end of its key phrase;
A valid phrase selection module 6, configured to select preset phrases from each key phrase as valid phrases and to mark the part of the input character string other than the valid phrases as the invalid portion;
A to-be-combined word selection module 7, configured to select second-type words from the invalid portion as words to be combined and to mark all words in the invalid portion other than the words to be combined as the invalid-word part;
A phrase construction module 8, configured to generate, in front-to-back order in the input character string, the valid phrase corresponding to each word to be combined according to the valid phrase adjacent to that word and the phrase arrangement statistical table, each generated valid phrase being, among the phrases obtained by combining the corresponding word to be combined with each first-type word, the phrase with the greatest arrangement probability in the phrase arrangement statistical table;
An output module 9, configured to generate an output string in which the valid phrases are arranged in an order determined by the word ordering;
A first calculation module 10, configured to query the phrase arrangement statistics module to obtain the arrangement probabilities of the valid phrase at the beginning of the output string and of the adjacent valid phrases, and to compute the sum of the obtained arrangement probabilities as the accuracy;
An accuracy module 11, configured to output the accuracy.
The phrase arrangement statistics module is further configured to rewrite every arrangement probability greater than a preset probability threshold so that it equals the probability threshold. The automatic character string correction system also includes an output string return module 12, configured to add the output string to the string database module for storage.
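The threshold rewrite performed by the phrase arrangement statistics module amounts to capping each probability, as in this sketch. The table layout (a dictionary of start probabilities and a nested dictionary of follow probabilities) and the name `clip_table` are assumptions made for illustration.

```python
def clip_table(start_p, follow_p, cap):
    """Return copies of both tables in which every arrangement probability
    greater than `cap` is replaced by `cap`."""
    clipped_start = {p: min(v, cap) for p, v in start_p.items()}
    clipped_follow = {
        a: {b: min(v, cap) for b, v in nxt.items()}
        for a, nxt in follow_p.items()
    }
    return clipped_start, clipped_follow

clipped_start, clipped_follow = clip_table(
    {"Shanghai City": 0.9, "Nanjing City": 0.2},
    {"Shanghai City": {"Huangpu District": 0.95, "Nanjing Road": 0.05}},
    cap=0.8,
)
```

Capping limits how much a single very common arrangement can contribute when the probabilities are later summed into the accuracy.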
Embodiment 4
The automatic character string correction system of this embodiment differs from Embodiment 3 only in the following respects:
The string database module also stores a weight value for each first-type word, and the first calculation module is replaced by a second calculation module. The second calculation module is configured to query the phrase arrangement statistics module to obtain the arrangement probabilities of the valid phrase at the beginning of the output string and of the adjacent valid phrases, and to compute the weighted mean of the obtained arrangement probabilities as the accuracy, where the weight of each arrangement probability equals the weight value of the first-type word in the valid phrase at the beginning of the output string, or the weight value of the first-type word in the latter of the two adjacent valid phrases.
In addition, the to-be-combined word selection module is further configured to select from the invalid-word part the phrases that contain a first-type word as unknown phrases, and the automatic character string correction system also includes:
A first judgment module, configured to judge whether the accuracy is greater than a preset accuracy threshold, to enable the first return module if the result is no, and to enable the second return module if the result is yes;
A first return module, configured to add the output string to the string database module for storage;
A second return module, configured to add the unknown phrases to the output string according to the first-type words in the unknown phrases so as to generate a return string in which the order of the first-type words conforms to the word ordering, and then to add the return string to the string database module for storage.
Meanwhile, the accuracy module is further configured to output the number of words to be combined and/or the number of characters contained in the invalid-word part;
The first judgment module is replaced by a second judgment module, which is configured to judge whether the following conditions hold simultaneously: the accuracy is greater than the accuracy threshold, the number of words to be combined is less than a preset threshold on the number of words to be combined, and/or the number of characters contained in the invalid-word part is less than a preset invalid-character-count threshold; the second judgment module enables the first return module if the result is no and enables the second return module if the result is yes.
Although specific embodiments of the present invention have been described above, those skilled in the art will appreciate that they are merely illustrative and that the scope of protection of the present invention is defined by the appended claims. Those skilled in the art may make various changes or modifications to these embodiments without departing from the principle and essence of the present invention, and all such changes and modifications fall within the scope of protection of the present invention.

Claims (14)

1. An automatic character string correction method, characterized in that a string database stores a plurality of verified character strings and a plurality of preset first-type words, each verified character string including several first-type words, the method comprising the following steps:
S1: extracting from the plurality of character strings the other words separated by the first-type words as second-type words, taking the phrase formed by each second-type word together with the first-type word immediately following it as a preset phrase, and then generating a keyword database, the keyword database recording the plurality of first-type words, the second-type words, the preset phrases, and a word ordering, the word ordering being the preset order of the first-type words;
S2: generating a phrase arrangement statistical table, the phrase arrangement statistical table recording the arrangement probability that each preset phrase appears at the beginning of the plurality of character strings and the arrangement probability that each preset phrase appears immediately after each preset phrase in the plurality of character strings;
S3: reading an input character string;
S4: selecting first-type words from the input character string as level keywords, and dividing the input character string into key phrases according to the positions of the level keywords in the input character string, each level keyword being located at the end of its key phrase;
S5: selecting preset phrases from each key phrase as valid phrases, and marking the part of the input character string other than the valid phrases as the invalid portion;
S6: selecting second-type words from the invalid portion as words to be combined, and marking all words in the invalid portion other than the words to be combined as the invalid-word part;
S7: generating, in front-to-back order in the input character string, the valid phrase corresponding to each word to be combined according to the valid phrase adjacent to that word and the phrase arrangement statistical table, each generated valid phrase being, among the phrases obtained by combining the corresponding word to be combined with each first-type word, the phrase with the greatest arrangement probability in the phrase arrangement statistical table;
S8: generating an output string in which the valid phrases are arranged in an order determined by the word ordering;
S9: querying the phrase arrangement statistical table to obtain the arrangement probabilities of the valid phrase at the beginning of the output string and of the adjacent valid phrases, and computing the sum of the obtained arrangement probabilities as the accuracy;
S10: outputting the accuracy.
2. The automatic character string correction method as claimed in claim 1, characterized in that the string database also stores a weight value for each first-type word, and S9 is replaced by S9a, where S9a is:
querying the phrase arrangement statistical table to obtain the arrangement probabilities of the valid phrase at the beginning of the output string and of the adjacent valid phrases, and computing the weighted mean of the obtained arrangement probabilities as the accuracy, where the weight of each arrangement probability equals the weight value of the first-type word in the valid phrase at the beginning of the output string, or the weight value of the first-type word in the latter of the two adjacent valid phrases.
3. The automatic character string correction method as claimed in claim 1, characterized in that S2 further includes: rewriting every arrangement probability in the phrase arrangement statistical table that is greater than a preset probability threshold so that it equals the probability threshold.
4. The automatic character string correction method as claimed in any one of claims 1 to 3, characterized in that the following step is further included after S10:
S11a: adding the output string to the string database for storage.
5. The automatic character string correction method as claimed in any one of claims 1 to 3, characterized in that S61 is performed after S6, where S61 is: selecting from the invalid-word part the phrases that contain a first-type word as unknown phrases, and performing S7;
the following steps are further included after S10:
S11: judging whether the accuracy is greater than a preset accuracy threshold, performing S12 if the result is no, and performing S13 if the result is yes;
S12: adding the output string to the string database for storage, and ending the flow;
S13: according to the first-type words in the unknown phrases, adding the unknown phrases to the output string to generate a return string, the order of the first-type words in the return string conforming to the word ordering, and performing S14;
S14: adding the return string to the string database for storage.
6. The automatic character string correction method as claimed in claim 5, characterized in that S10 further includes: outputting the number of words to be combined and/or the number of characters contained in the invalid-word part;
S11 is replaced by S11b, where S11b is: judging whether the following conditions hold simultaneously: the accuracy is greater than the accuracy threshold, the number of words to be combined is less than a preset threshold on the number of words to be combined, and/or the number of characters contained in the invalid-word part is less than a preset invalid-character-count threshold; performing S12 if the result is no, and performing S13 if the result is yes.
7. The automatic character string correction method as claimed in claim 6, characterized in that the first-type words include City, District, New Village, and Road.
8. An automatic character string correction system, characterized by comprising:
a string database module, configured to store a plurality of verified character strings and a plurality of preset first-type words, each verified character string including several first-type words;
a keyword database module, configured to extract from the plurality of character strings the other words separated by the first-type words as second-type words, to take the phrase formed by each second-type word together with the first-type word immediately following it as a preset phrase, and then to generate a keyword database, the keyword database recording the plurality of first-type words, the second-type words, the preset phrases, and a word ordering, the word ordering being the preset order of the first-type words;
a phrase arrangement statistics module, configured to compute and record, from the records of the string database module and the keyword database, the arrangement probability that each preset phrase appears at the beginning of the plurality of character strings and the arrangement probability that each preset phrase appears immediately after each preset phrase in the plurality of character strings;
a string reading module, configured to read an input character string;
a string division module, configured to select first-type words from the input character string as level keywords and to divide the input character string into key phrases according to the positions of the level keywords in the input character string, each level keyword being located at the end of its key phrase;
a valid phrase selection module, configured to select preset phrases from each key phrase as valid phrases and to mark the part of the input character string other than the valid phrases as the invalid portion;
a to-be-combined word selection module, configured to select second-type words from the invalid portion as words to be combined and to mark all words in the invalid portion other than the words to be combined as the invalid-word part;
a phrase construction module, configured to generate, in front-to-back order in the input character string, the valid phrase corresponding to each word to be combined according to the valid phrase adjacent to that word and the phrase arrangement statistical table, each generated valid phrase being, among the phrases obtained by combining the corresponding word to be combined with each first-type word, the phrase with the greatest arrangement probability in the phrase arrangement statistical table;
an output module, configured to generate an output string in which the valid phrases are arranged in an order determined by the word ordering;
a first calculation module, configured to query the phrase arrangement statistics module to obtain the arrangement probabilities of the valid phrase at the beginning of the output string and of the adjacent valid phrases, and to compute the sum of the obtained arrangement probabilities as the accuracy;
an accuracy module, configured to output the accuracy.
9. The automatic character string correction system as claimed in claim 8, characterized in that the string database module also stores a weight value for each first-type word, and the first calculation module is replaced by a second calculation module;
the second calculation module is configured to query the phrase arrangement statistics module to obtain the arrangement probabilities of the valid phrase at the beginning of the output string and of the adjacent valid phrases, and to compute the weighted mean of the obtained arrangement probabilities as the accuracy, where the weight of each arrangement probability equals the weight value of the first-type word in the valid phrase at the beginning of the output string, or the weight value of the first-type word in the latter of the two adjacent valid phrases.
10. The automatic character string correction system as claimed in claim 8, characterized in that the phrase arrangement statistics module is further configured to rewrite every arrangement probability greater than a preset probability threshold so that it equals the probability threshold.
11. The automatic character string correction system as claimed in any one of claims 8 to 10, characterized in that the system further includes an output string return module, configured to add the output string to the string database module for storage.
12. The automatic character string correction system as claimed in any one of claims 8 to 10, characterized in that the to-be-combined word selection module is further configured to select from the invalid-word part the phrases that contain a first-type word as unknown phrases, and the system further includes:
a first judgment module, configured to judge whether the accuracy is greater than a preset accuracy threshold, to enable the first return module if the result is no, and to enable the second return module if the result is yes;
a first return module, configured to add the output string to the string database module for storage;
a second return module, configured to add the unknown phrases to the output string according to the first-type words in the unknown phrases so as to generate a return string in which the order of the first-type words conforms to the word ordering, and then to add the return string to the string database module for storage.
13. The automatic character string correction system as claimed in claim 12, characterized in that the accuracy module is further configured to output the number of words to be combined and/or the number of characters contained in the invalid-word part;
the first judgment module is replaced by a second judgment module, which is configured to judge whether the following conditions hold simultaneously: the accuracy is greater than the accuracy threshold, the number of words to be combined is less than a preset threshold on the number of words to be combined, and/or the number of characters contained in the invalid-word part is less than a preset invalid-character-count threshold; the second judgment module enables the first return module if the result is no and enables the second return module if the result is yes.
14. The automatic character string correction system as claimed in claim 13, characterized in that the first-type words include City, District, New Village, and Road.
CN201410312846.9A 2014-07-02 2014-07-02 Method and system for automatically correcting character strings Active CN104036047B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410312846.9A CN104036047B (en) 2014-07-02 2014-07-02 Method and system for automatically correcting character strings

Publications (2)

Publication Number Publication Date
CN104036047A CN104036047A (en) 2014-09-10
CN104036047B true CN104036047B (en) 2017-05-17

Family

ID=51466817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410312846.9A Active CN104036047B (en) 2014-07-02 2014-07-02 Method and system for automatically correcting character strings

Country Status (1)

Country Link
CN (1) CN104036047B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291698B (en) * 2017-06-30 2020-08-04 Oppo广东移动通信有限公司 Information correction method, information correction device, storage medium and electronic equipment
CN110135414B (en) * 2019-05-16 2021-07-09 京北方信息技术股份有限公司 Corpus updating method, apparatus, storage medium and terminal
CN116502614B (en) * 2023-06-26 2023-09-01 北京每日信动科技有限公司 Data checking method, system and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477565A (en) * 2009-01-22 2009-07-08 北京搜狗科技发展有限公司 Method and apparatus for confirming correctness of input character string in search engine
CN101639830A (en) * 2009-09-08 2010-02-03 西安交通大学 Chinese term automatic correction method in input process
EP2395438A1 (en) * 2009-02-03 2011-12-14 Huawei Technologies Co., Ltd. Character string processing method and system and matcher
CN102867040A (en) * 2012-08-31 2013-01-09 中国科学院计算技术研究所 Chinese search engine mixed speech-oriented query error correction method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
多属性字符串辨识码的高效率查询算法;马恒等;《Proceedings of International Conference on Engineering and Business Management(EBM2010) 》;20100325;第1826-1829页 *
多模型下的近似字符串匹配算法研究;赵华;《中国博士学位论文全文数据库·信息科技辑》;20140215(第2期);第5-19页 *

Also Published As

Publication number Publication date
CN104036047A (en) 2014-09-10

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20160205

Address after: 200335 Shanghai city Changning District Admiralty Road No. 968 Building No. 16 10 floor

Applicant after: SHANGHAI XIECHENG BUSINESS CO., LTD.

Address before: 200335 Shanghai City, Changning District Fuquan Road No. 99, Ctrip network technology building

Applicant before: Ctrip computer technology (Shanghai) Co., Ltd.

GR01 Patent grant