CN106959977A - Candidate collection computational methods and device, word error correction method and device in word input - Google Patents

Info

Publication number
CN106959977A
Authority
CN
China
Prior art keywords
error correction
word
probability
input
path
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610020331.0A
Other languages
Chinese (zh)
Inventor
吴岳
谢玄亮
陈凯成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Guangzhou Dongjing Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Dongjing Computer Technology Co Ltd filed Critical Guangzhou Dongjing Computer Technology Co Ltd
Priority to CN201610020331.0A priority Critical patent/CN106959977A/en
Publication of CN106959977A publication Critical patent/CN106959977A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a candidate set calculation method for text input, comprising the following steps. An extraction step extracts error-correction query pairs from user logs and builds an error-correction string pair for each query pair; an error-correction query pair is the correspondence between an incorrectly entered text and the correctly entered text, and an error-correction string pair is the correspondence between the incorrectly entered string and the correctly entered string within that query pair. A candidate set calculation step, when a string in an input word $t_i$ matches an error-correction string pair, generates from the string pair a set of variants of the word, $V=\{v_1,v_2,\dots,v_n\}$, uses it as the candidate set $C=\{c_1,c_2,\dots,c_n\}$, and calculates the corresponding output probabilities $P=\{p_1,p_2,\dots,p_n\}$. A candidate set calculation device, an input error correction method, and a corresponding device are also disclosed. The invention can improve error-correction accuracy while covering most error-correction cases, and it also adapts well to new words.

Description

Candidate set calculation method and device, and text error correction method and device, for text input
Technical field
The present invention relates to the technical field of natural language processing, and in particular to a candidate set calculation method and device, and a text error correction method and device, for text input.
Background art
Error correction is an important part of search. According to published statistics, roughly 10%-15% of search engine queries are entered incorrectly. In some user groups with particular language habits, for example in Indian English or Indian music search projects, erroneous queries account for as much as 30%. Common search error correction methods include the noisy channel model and the hidden Markov model (HMM). The noisy channel model obtains a candidate set by edit distance and then finds the maximum transition probability statistically, yielding the best correction candidate. The hidden Markov model treats the query as a sequence of observed states and the corresponding candidate sets as a sequence of hidden states; each observed state has an output probability for each hidden state, and the hidden states have transition probabilities between them, from which the optimal hidden state sequence is calculated. Both methods generally compute the candidate set and its error probabilities by edit distance and ignore the rules of the language itself, so in practice it is hard to balance the precision and coverage of the candidate set.
For example, the inventors found in project practice that mistyped search queries are a noticeably bigger problem for Indian users than for typical English or Chinese users. A major reason lies in the characteristics of their language. For historical reasons, the main language Indians use online is Indian English, or Hinglish (https://en.wikipedia.org/wiki/Hinglish), a hybrid of English and native Indian languages (Hindi, Punjabi, etc.). Users transliterate words of the native languages (Hindi, Punjabi) into the Latin alphabet. No unified rule governs this process; it simply follows pronunciation, so a single Hindi word often has several Latin spellings. For example, the film title "aashiqui 2" can also be spelled "ashiqui 2". The mixing of multiple native Indian languages therefore produces a large number of search input errors.
In existing hidden Markov search error correction, a reasonable estimate of the candidate set is a key problem. Two approaches are common: 1) Calculate the edit distance between words and derive transition probabilities from it. This considers only character differences, and its accuracy is poor. 2) Mine the relations between error-correction word pairs from logs and derive transition probabilities from them. This depends on very comprehensive user logs, often covers only a limited range of corrections, and cannot handle new words. The inventors found in practice that neither method performs well enough on input with particular language habits, such as Indian English (Hinglish).
Summary of the invention
An object of the present invention is to provide a new solution that supplies candidate sets and performs error correction for input with particular language habits and characteristics.
According to a first aspect of the invention, there is provided a candidate set calculation method for text input, comprising the following steps:
an extraction step, for extracting error-correction query pairs from user logs and building an error-correction string pair for each query pair, where an error-correction query pair is the correspondence between an incorrectly entered text and the correctly entered text, and an error-correction string pair is the correspondence between the incorrectly entered string and the correctly entered string within that query pair;
a candidate set calculation step, for generating, when a string in an input word $t_i$ matches an error-correction string pair, the variant set $V=\{v_1,v_2,\dots,v_n\}$ of the word from the string pair, using it as the candidate set $C=\{c_1,c_2,\dots,c_n\}$, and calculating the corresponding output probabilities $P=\{p_1,p_2,\dots,p_n\}$.
Preferably, calculating the output probability of the set $V$ in the candidate set calculation step comprises:
calculating the output probability of word $v_j$ according to the formula $p_j = r^{(l-\theta)}(1-r)^{\theta}$, where
$l$ is the string length of the input word $t_i$; $r$ is the probability that a single character is entered correctly; and $\theta$ is a constant between 0 and 1.
Preferably, the extraction step further comprises selecting those error-correction string pairs in which the incorrectly entered string and the correctly entered string both fall below a predetermined edit distance.
Preferably, the extraction step further comprises counting the occurrences of each error-correction string pair and establishing as final error-correction string pairs those that occur more often than a predetermined threshold.
Preferably, the method further comprises:
a rewrite probability calculation step, for calculating, from the results of mining the user logs, the probability $P_h$ of each kind of character rewrite, a character rewrite being the mistyping, omission, or insertion of a single character; and
the candidate set calculation step is further used to obtain the set $U=\{u_1,u_2,\dots,u_m\}$ of all words whose edit distance from word $t_i$ is less than a predetermined edit distance, calculate the corresponding output probabilities $P=\{p_{n+1},p_{n+2},\dots,p_{n+m}\}$, and merge the sets $V$ and $U$, thereby obtaining the candidate set of word $t_i$, $C=\{c_1,c_2,\dots,c_n,c_{n+1},\dots,c_{n+m}\}$, and the corresponding output probabilities $P=\{p_1,p_2,\dots,p_n,p_{n+1},p_{n+2},\dots,p_{n+m}\}$.
Preferably, calculating the output probability of the set $U$ in the candidate set calculation step comprises:
calculating the output probabilities $P=\{p_{n+1},p_{n+2},\dots,p_{n+m}\}$ from the character rewrite probabilities $P_h$ of the character operations on the edit conversion path between $t_i$ and each word $u_j$ in the set $U$.
Preferably, calculating the output probability of the set $U$ in the candidate set calculation step comprises:
calculating the output probability of word $u_j$ according to the formula $p_j = r^{(l-k)}(1-r)^{k}\prod_{m=1}^{k} p_{h_m}$, where
$l$ is the string length of word $t_i$; $k$ is the number of edit conversion steps from $t_i$ to $u_j$; $p_{h_m}$ is the character rewrite probability of the $m$-th step; and $r$ is the probability that a single character is entered correctly.
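The formula for the set $U$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the value of `r`, and the two rewrite probabilities are hypothetical and do not come from the patent.

```python
import math

def edit_candidate_probability(word_length, rewrite_probs, r):
    """p_j = r^(l-k) * (1-r)^k * product of the k rewrite probabilities.

    word_length   -- l, the string length of the input word t_i
    rewrite_probs -- one rewrite probability per edit step, so k = len(rewrite_probs)
    r             -- probability that a single character is entered correctly
    """
    k = len(rewrite_probs)
    return r ** (word_length - k) * (1 - r) ** k * math.prod(rewrite_probs)

# "laappy" -> "happy" takes two edit steps; the probabilities below are made up.
p = edit_candidate_probability(word_length=6, rewrite_probs=[0.001, 0.002], r=0.99)
```

With no edit steps the product is empty and the expression reduces to $r^{l}$, the probability that every character of the word was typed correctly.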
According to a second aspect of the invention, there is provided an input error correction method, comprising the following steps:
a transition probability calculation step, for calculating the state transition probabilities $P'$ of the sentences in a corpus;
an input step, for inputting a sentence;
a segmentation step, for dividing the sentence into words $t_i$;
a candidate set calculation step, for calculating the candidate set $C$ and its output probabilities $P$ for each segmented word $t_i$ according to the foregoing candidate set calculation method;
an error correction path calculation step, for calculating, from the output probabilities $P$ and the transition probabilities $P'$, the optimal error correction path and its probability $p_1$, as well as the probability $p_0$ of the original input path;
a judgment step, for judging whether the optimal error correction path equals the original input path, wherein
if the judgment step finds that the optimal error correction path equals the original input path, the original input sentence is returned; and wherein
if the judgment step finds that the optimal error correction path does not equal the original input path, the difference between the probability $p_1$ of the optimal error correction path and the probability $p_0$ of the original path is calculated; if the difference exceeds a predetermined difference threshold, the error correction result corresponding to the optimal error correction path is returned; otherwise, the original input sentence is returned.
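The path calculation and the judgment step can be sketched as follows. The text does not name a decoding algorithm; this sketch assumes Viterbi decoding, which is the usual choice for hidden Markov models. It also assumes that each candidate set contains the original word with an output probability of its own, so that $p_0$ can be computed. All function names and the toy probabilities are hypothetical.

```python
def viterbi(candidates, transition):
    """candidates: one dict {word: output probability} per position.
    transition(prev, cur): P'(cur | prev). Returns (best_path, probability)."""
    # best[w] = (probability of the best path ending in w, that path)
    best = {w: (p, [w]) for w, p in candidates[0].items()}
    for column in candidates[1:]:
        new_best = {}
        for w, p_out in column.items():
            prob, path = max(
                ((prev_prob * transition(prev, w) * p_out, prev_path)
                 for prev, (prev_prob, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[w] = (prob, path + [w])
        best = new_best
    prob, path = max(best.values(), key=lambda item: item[0])
    return path, prob

def path_probability(words, candidates, transition):
    """Probability of one fixed path, e.g. the original input."""
    prob = candidates[0][words[0]]
    for i in range(1, len(words)):
        prob *= transition(words[i - 1], words[i]) * candidates[i][words[i]]
    return prob

def correct(words, candidates, transition, threshold):
    """Return the corrected words only if p1 - p0 exceeds the threshold."""
    best_path, p1 = viterbi(candidates, transition)
    if best_path == words:
        return words
    p0 = path_probability(words, candidates, transition)
    return best_path if p1 - p0 > threshold else words

# Toy data (made up): the user typed "tere nam".
words = ["tere", "nam"]
candidates = [{"tere": 0.9}, {"nam": 0.3, "naam": 0.6}]
table = {("tere", "naam"): 0.5, ("tere", "nam"): 0.01}
transition = lambda prev, cur: table.get((prev, cur), 1e-6)
corrected = correct(words, candidates, transition, threshold=0.1)
```

The threshold keeps the original sentence whenever the best path is only marginally more likely, which guards against over-correcting rare but valid queries.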
Preferably, the transition probability calculation step comprises: taking the sentence as the unit, calculating the transition probability $P'(t_i \mid t_j)$ between every two words in the corpus.
Preferably, the transition probability calculation step comprises:
calculating the transition probability between every two words in the corpus according to the formula $P'(t_i \mid t_j) = (c(t_i,t_j)+\theta)/(c(t_j)+v)$, where
$\theta$ is a constant between 0 and 1; $c(t_j)$ is the number of occurrences of word $t_j$ in the corpus; $c(t_i,t_j)$ is the number of times the two words $t_i$ and $t_j$ occur adjacent to each other; and $v$ is the number of all adjacent word combinations in the corpus.
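The smoothed transition probability can be estimated from a corpus roughly as below. Several points are assumptions of this sketch rather than statements of the patent: sentences are split on whitespace, $c(t_i,t_j)$ counts $t_j$ immediately followed by $t_i$, and $v$ is read as the number of distinct adjacent word pairs. The function name and the sample corpus are hypothetical.

```python
from collections import Counter

def train_transition(sentences, theta=0.5):
    """Return a function transition(t_j, t_i) = (c(t_i,t_j) + theta) / (c(t_j) + v)."""
    unigram = Counter()
    bigram = Counter()
    for sentence in sentences:
        words = sentence.split()              # assumed whitespace segmentation
        unigram.update(words)
        bigram.update(zip(words, words[1:]))  # pairs never cross sentence boundaries
    v = len(bigram)  # assumed reading of v: number of distinct adjacent pairs

    def transition(t_j, t_i):
        return (bigram[(t_j, t_i)] + theta) / (unigram[t_j] + v)

    return transition

transition = train_transition(["tere naam song", "tere naam movie"])
seen = transition("tere", "naam")    # (2 + 0.5) / (2 + 3)
unseen = transition("tere", "song")  # (0 + 0.5) / (2 + 3)
```

The constant $\theta$ gives unseen word pairs a small non-zero probability, so a single unseen transition does not zero out an entire path.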
According to a third aspect of the invention, there is provided a candidate set calculation device for text input, comprising:
an extraction module, for extracting error-correction query pairs from user logs and building an error-correction string pair for each query pair, where an error-correction query pair is the correspondence between an incorrectly entered text and the correctly entered text, and an error-correction string pair is the correspondence between the incorrectly entered string and the correctly entered string within that query pair;
a candidate set calculation module, for generating, when a string in an input word $t_i$ matches an error-correction string pair, the variant set $V=\{v_1,v_2,\dots,v_n\}$ of the word from the string pair, using it as the candidate set $C=\{c_1,c_2,\dots,c_n\}$, and calculating the corresponding output probabilities $P=\{p_1,p_2,\dots,p_n\}$.
Preferably, calculating the output probability of the set $V$ in the candidate set calculation module comprises:
calculating the output probability of word $v_j$ according to the formula $p_j = r^{(l-\theta)}(1-r)^{\theta}$, where
$l$ is the string length of the input word $t_i$; $r$ is the probability that a single character is entered correctly; and $\theta$ is a constant between 0 and 1.
Preferably, the extraction module is further used to select those error-correction string pairs in which the incorrectly entered string and the correctly entered string both fall below a predetermined edit distance.
More preferably, the extraction module is further used to count the occurrences of each error-correction string pair and to establish as final error-correction string pairs those that occur more often than a predetermined threshold.
Preferably, the device further comprises: a rewrite probability calculation module, for calculating, from the log mining results, the probability $P_h$ of each kind of character rewrite, a character rewrite being the mistyping, omission, or insertion of a single character; and
the candidate set calculation module is further used to obtain the set $U=\{u_1,u_2,\dots,u_m\}$ of all words whose edit distance from word $t_i$ is less than a specific edit distance, calculate the corresponding output probabilities $P=\{p_{n+1},p_{n+2},\dots,p_{n+m}\}$, and merge the sets $V$ and $U$, thereby obtaining the candidate set of word $t_i$, $C=\{c_1,c_2,\dots,c_n,c_{n+1},\dots,c_{n+m}\}$, and the corresponding output probabilities $P=\{p_1,p_2,\dots,p_n,p_{n+1},p_{n+2},\dots,p_{n+m}\}$.
Preferably, calculating the output probability of the set $U$ in the candidate set calculation module comprises:
calculating the output probabilities $P=\{p_{n+1},p_{n+2},\dots,p_{n+m}\}$ from the character rewrite probabilities $P_h$ of the character operations on the edit conversion path between $t_i$ and each word $u_j$ in the set $U$.
Preferably, calculating the output probability of the set $U$ in the candidate set calculation module comprises:
calculating the output probability of word $u_j$ according to the formula $p_j = r^{(l-k)}(1-r)^{k}\prod_{m=1}^{k} p_{h_m}$, where
$l$ is the string length of word $t_i$; $k$ is the number of edit conversion steps from $t_i$ to $u_j$; $p_{h_m}$ is the character rewrite probability of the $m$-th step; and $r$ is the probability that a single character is entered correctly.
According to a fourth aspect of the invention, there is provided a programmable device comprising a memory and a processor, wherein the memory stores instructions that control the processor to perform the foregoing candidate set calculation method.
According to a fifth aspect of the invention, there is provided an input error correction device, comprising:
a transition probability calculation module, for calculating the state transition probabilities $P'$ of the sentences in a corpus;
an input module, for inputting a sentence;
a segmentation module, for dividing the sentence into words $t_i$;
the foregoing candidate set calculation device, for calculating the candidate set $C$ and its output probabilities $P$ for each segmented word $t_i$;
an error correction path calculation module, for calculating, from the output probabilities $P$ and the transition probabilities $P'$, the optimal error correction path and its probability $p_1$, as well as the probability $p_0$ of the original input path;
a judgment module, for judging whether the optimal error correction path equals the original input path, wherein
if the judgment module finds that the optimal error correction path equals the original input path, the original input sentence is returned; and wherein
if the judgment module finds that the optimal error correction path does not equal the original input path, the difference between the probability $p_1$ of the optimal error correction path and the probability $p_0$ of the original path is calculated; if the difference exceeds a predetermined difference threshold, the error correction result corresponding to the optimal error correction path is returned; otherwise, the original input sentence is returned.
Preferably, the transition probability calculation module is used to calculate, taking the sentence as the unit, the transition probability $P'(t_i \mid t_j)$ between every two words in the corpus.
Preferably, the transition probability calculation module is used to:
calculate the transition probability between every two words in the corpus according to the formula $P'(t_i \mid t_j) = (c(t_i,t_j)+\theta)/(c(t_j)+v)$, where
$\theta$ is a constant between 0 and 1; $c(t_j)$ is the number of occurrences of word $t_j$ in the corpus; $c(t_i,t_j)$ is the number of times the two words $t_i$ and $t_j$ occur adjacent to each other; and $v$ is the number of all adjacent word combinations in the corpus.
According to a sixth aspect of the invention, there is provided a programmable device comprising a memory and a processor, wherein the memory stores instructions that control the processor to perform the foregoing input error correction method.
The inventors found that the prior art has not proposed a candidate set calculation method and a corresponding error correction method for input with particular language habits and characteristics. The technical task to be accomplished or the technical problem to be solved by the invention was therefore never contemplated or anticipated by those skilled in the art, and the invention is thus a new technical solution.
Other features and advantages of the invention will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain its principles.
Fig. 1 shows a block diagram of the hardware configuration of a computer system 1000 that can implement embodiments of the invention;
Fig. 2 shows a flow chart of the candidate set calculation method for text input according to an embodiment of the invention;
Fig. 3 shows a block diagram of the candidate set calculation device according to an embodiment of the invention;
Fig. 4 shows a flow chart of the transition probability calculation method according to an embodiment of the invention;
Fig. 5 shows a flow chart of the input error correction method according to an embodiment of the invention;
Fig. 6 shows a block diagram of the input error correction device according to an embodiment of the invention.
Detailed description of the embodiments
Various exemplary embodiments of the invention will now be described in detail with reference to the accompanying drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the invention.
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the invention, its application, or its uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate they should be considered part of the specification.
In all examples shown and discussed here, any specific value should be interpreted as merely illustrative and not as a limitation. Other examples of the exemplary embodiments may therefore have different values.
It should be noted that similar reference numerals and letters denote similar items in the following figures; once an item is defined in one figure, it need not be discussed further in subsequent figures.
<Hardware configuration>
Fig. 1 is a block diagram showing the hardware configuration of a computer system 1000 that can implement embodiments of the invention.
As shown in Fig. 1, the computer system 1000 includes a computer 1110. The computer 1110 includes a processing unit 1120, a system memory 1130, a fixed non-volatile memory interface 1140, a removable non-volatile memory interface 1150, a user input interface 1160, a network interface 1170, a video interface 1190, and a peripheral interface 1195, all connected via a system bus 1121.
The system memory 1130 includes ROM (read-only memory) and RAM (random access memory). The BIOS (basic input/output system) resides in the ROM. The operating system, application programs, other program modules, and some program data reside in the RAM.
A fixed non-volatile memory such as a hard disk is connected to the fixed non-volatile memory interface 1140. The fixed non-volatile memory can store, for example, the operating system, application programs, other program modules, and some program data.
Removable non-volatile memories such as a floppy disk drive and a CD-ROM drive are connected to the removable non-volatile memory interface 1150. For example, a floppy disk can be inserted into the floppy disk drive, and a CD (compact disc) can be inserted into the CD-ROM drive.
Input devices such as a mouse and a keyboard are connected to the user input interface 1160.
The computer 1110 can be connected to a remote computer 1180 through the network interface 1170. For example, the network interface 1170 can be connected to the remote computer through a local area network. Alternatively, the network interface 1170 can be connected to a modem (modulator-demodulator), and the modem is connected to the remote computer 1180 via a wide area network.
The remote computer 1180 can include a memory such as a hard disk, which can store remote application programs.
The video interface 1190 is connected to a monitor.
The peripheral interface 1195 is connected to a printer and loudspeakers.
The computer system shown in Fig. 1 is merely illustrative and is in no way intended to limit the invention, its application, or its uses.
<First embodiment>
According to the first embodiment of the invention, there is provided a candidate set calculation method for text input, as shown in Fig. 2, comprising the following steps:
First, in step S2100, user logs are mined. The user logs can be selected for a specific user group with specific input habits; for example, the logs of users of a minority language can be selected. In particular, Indian English mixes several languages: users usually transliterate words of other native Indian languages into Latin letters, which often yields several spellings that all express the same meaning, following the phonetic rule that such words are spelled as they are pronounced. For this kind of case, user logs for Indian English can be selected and mined to obtain candidate sets that match Indian English input habits. Because user logs are updated dynamically online, the logs of the selected user group can also be mined periodically, at a predetermined statistical period; for example, periodically mining the Indian English user logs keeps the matching candidate sets up to date.
In step S2200, the extraction step, error-correction query pairs are extracted from the user logs and an error-correction string pair is built for each query pair. An error-correction query pair is the correspondence between an incorrectly entered text and the correctly entered text, and an error-correction string pair is the correspondence between the incorrectly entered string and the correctly entered string within that query pair. The extraction step further comprises selecting those error-correction string pairs in which the incorrectly entered string and the correctly entered string both fall below a predetermined edit distance. Here, the edit distance between two strings is the minimum number of edit operations needed to change one into the other. The predetermined edit distance can be chosen according to the application scenario; for example, if strings are expected to differ by at most 2 edits, the predetermined edit distance can be set to 2. Next, the occurrences of each error-correction string pair are counted, and pairs that occur more often than a predetermined threshold are established as final error-correction string pairs. The predetermined threshold is an integer greater than 0 and can be chosen according to the application scenario and experience.
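The edit distance defined above can be computed with the standard dynamic-programming (Levenshtein) recurrence. The sketch below is a generic textbook implementation, not code from the patent; it allows the three single-character operations of insertion, deletion, and replacement.

```python
def edit_distance(a, b):
    """Minimum number of single-character inserts, deletes, and replacements
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # replace ca with cb, or keep a match
            ))
        prev = cur
    return prev[-1]

d1 = edit_distance("ashiqui 2", "aashiqui 2")  # one inserted "a"
d2 = edit_distance("khoobsurat", "khubsurat")  # "oo" versus "u"
```

Only two rows of the table are kept at a time, so memory use grows with the length of one string rather than with the product of both lengths.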
Taking Indian English as an illustration, the phonetic rules of Indian English spelling yield some obvious error-correction string pairs, for example those shown in Table 1 below.

Error-correction query pair      Error-correction short string pair
ashiqui 2->aashiqui 2            a->aa
tere nam->tere naam              a->aa
Khoobsurat->khubsurat            oo->u
zaruri tha->zaroori tha          u->oo

Table 1
Error-correction query pairs can be obtained in several ways. For example, the user logs can first be filtered, candidate query pairs can then be found by methods such as edit distance, and the candidates can finally be confirmed by manual review. Error-correction string pairs such as aa->a are then extracted from the query pairs, for example by the following steps:
First, obtain the error-correction query pairs from the user logs. Let $p$, $q$ be an error-correction query pair, where $p$ is the incorrect input and $q$ is the corresponding correct input. Each consists of a sequence of characters: $p=a_1a_2\dots a_n$, $q=b_1b_2\dots b_m$.
Let $x=1$. If $a_x = b_x$, increment $x$ by 1. Iterate until $a_x \neq b_x$.
Let $y=0$, where $n$ is the string length of $p$ and $m$ is the string length of $q$. If $a_{n-y} = b_{m-y}$, increment $y$ by 1. Iterate until $a_{n-y} \neq b_{m-y}$.
Choose an edit distance upper limit $b$ (preferably $b \le 2$). If $0<n-y-x<b$ and $0<m-y-x<b$, a candidate error-correction string pair $[a_{x-1},a_x,a_{x+1},\dots,a_{n-y+1}] \to [b_{x-1},b_x,b_{x+1},\dots,b_{m-y+1}]$ is obtained and added to the candidate result set $R$.
After all error-correction query pairs have been scanned, each candidate string pair in $R$ whose occurrence count exceeds the predetermined threshold becomes a final error-correction string pair. The predetermined threshold is an integer greater than 0 and can be chosen according to the application scenario and experience.
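The extraction steps above can be approximated as follows. The index bounds in the text are ambiguous, so this sketch makes its own assumptions: it strips the common prefix and the common suffix of a query pair, keeps the differing middle parts if neither is longer than `max_len`, and adds one character of neighbouring context when one side would otherwise be empty (which reproduces pairs such as a->aa from Table 1). The function names and parameters are hypothetical.

```python
from collections import Counter

def extract_string_pair(wrong, right, max_len=2):
    """Return the differing (wrong, right) substrings of a query pair, or None."""
    shorter = min(len(wrong), len(right))
    x = 0  # length of the common prefix
    while x < shorter and wrong[x] == right[x]:
        x += 1
    y = 0  # length of the common suffix, not overlapping the prefix
    while y < shorter - x and wrong[-1 - y] == right[-1 - y]:
        y += 1
    w_mid = wrong[x:len(wrong) - y]
    r_mid = right[x:len(right) - y]
    if wrong == right or max(len(w_mid), len(r_mid)) > max_len:
        return None
    if not w_mid or not r_mid:
        # Pure insertion or deletion: keep one character of context (assumption)
        if x > 0:
            w_mid, r_mid = wrong[x - 1] + w_mid, right[x - 1] + r_mid
        elif y > 0:
            w_mid, r_mid = w_mid + wrong[len(wrong) - y], r_mid + right[len(right) - y]
    return w_mid, r_mid

def mine_string_pairs(query_pairs, threshold=1, max_len=2):
    """Count extracted pairs and keep those occurring more than `threshold` times."""
    counts = Counter()
    for wrong, right in query_pairs:
        pair = extract_string_pair(wrong, right, max_len)
        if pair is not None:
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n > threshold}

queries = [
    ("ashiqui 2", "aashiqui 2"),
    ("tere nam", "tere naam"),
    ("khoobsurat", "khubsurat"),
    ("zaruri tha", "zaroori tha"),
]
final_pairs = mine_string_pairs(queries, threshold=1)
```

In this toy run only a->aa occurs more than once, so it is the only pair that survives a threshold of 1. The sample queries are lowercased versions of the Table 1 entries.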
Preferably, the method may further include step S2300, a rewrite probability calculation step, for calculating, from the user log mining results, the probability $P_h$ of each kind of character rewrite, a character rewrite being the mistyping, omission, or insertion of a single character.
Step S2200 handles multi-character errors by extracting error-correction string pairs. Single-character errors, that is, a single character being mistyped, omitted, or inserted, additionally require rewrite probabilities.
In S2300, the probability $P_h$ with which each kind of character rewrite occurs in the user logs can be counted on the basis of edit distance. The only edit operations permitted in this step are replacing one character with another, inserting a character, and deleting a character. $P_h$ can be calculated by the following formula:
$P_h = \mathrm{count}(error_i) / \sum_{i} \mathrm{count}(error_i)$ (Formula 1)
Here $\mathrm{count}(error_i)$ is the number of times a particular character rewrite error occurs in the user logs, and $\sum_{i} \mathrm{count}(error_i)$ is the total number of times all kinds of character rewrite errors occur in the user logs.
These probabilities supplement the error-correction string pairs of step S2200. Table 2 gives an illustration.
Character rewrite             Occurrences   Share
Delete letter o               10            0.001
Insert letter j               15            0.0015
Letter a mistyped as b        20            0.002
Letter m mistyped as n        20            0.002
...                           ...           ...

Table 2
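Formula 1 amounts to normalising the per-rewrite counts. The sketch below reproduces the four rows of Table 2; the remaining rewrite types are lumped into one hypothetical "other" entry sized so that the total is 10000, which is the total the table's shares imply. The function name and the key layout are assumptions.

```python
def rewrite_probabilities(error_counts):
    """P_h = count(error_i) / sum of all rewrite error counts (Formula 1)."""
    total = sum(error_counts.values())
    return {error: n / total for error, n in error_counts.items()}

error_counts = {
    ("delete", "o"): 10,
    ("insert", "j"): 15,
    ("replace", "a", "b"): 20,
    ("replace", "m", "n"): 20,
    ("other",): 9935,  # hypothetical placeholder for all remaining rewrite types
}
probs = rewrite_probabilities(error_counts)
```

Keying each rewrite by its operation and characters lets the same table be looked up later, one entry per step of an edit conversion path.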
Then, in step S2400, the candidate set calculation step, after an input word $t_i$ is received, $t_i$ is checked against the error-correction string pairs. If it matches, a variant set $V=\{v_1,v_2,\dots,v_n\}$ of the word is generated from the matching error-correction string pairs.
In one embodiment, only error-correction string matching is performed on the word $t_i$; the set V is used as the candidate set $C=\{c_1,c_2,\dots,c_n\}$, and the corresponding output probabilities $P=\{p_1,p_2,\dots,p_n\}$ are calculated.
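One way to read the variant generation is as substring replacement: wherever the word contains the erroneous string of a pair, substitute the correct string. The sketch below rests on that assumption; the text does not spell out the matching rule, and the function name and sample pairs are hypothetical.

```python
def generate_variants(word, correction_pairs):
    """Return variants of `word`, one per matched occurrence of a wrong string."""
    variants = []
    for wrong, right in correction_pairs:
        start = word.find(wrong)
        while start != -1:
            variant = word[:start] + right + word[start + len(wrong):]
            if variant != word and variant not in variants:
                variants.append(variant)
            # Look for the next (possibly overlapping) occurrence.
            start = word.find(wrong, start + 1)
    return variants


# Hypothetical error-correction string pairs (wrong, right).
pairs = [("aa", "a"), ("ph", "f")]
V = generate_variants("laappy", pairs)
```

Each occurrence is replaced independently, so a word with several matches yields several single-replacement variants rather than one variant with all replacements applied.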
For example, the output probability of a word $v_j$ can be calculated according to the formula:
$$p_j = r^{(l-\theta)}(1-r)^{\theta}$$ (Formula 2)
where l is the string length of the input word $t_i$, r is the probability that a single character is entered correctly, and θ is a constant between 0 and 1.
r can be obtained from user log statistics as $r = 1 - \left(\sum \mathrm{count}(error_i) / \text{total number of characters}\right)$. θ is a parameter that can be chosen from historical experience or by experiment.
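Formula 2 and the estimate of r translate directly into code. In this sketch the function names and the log totals (10000 rewrite errors among 1000000 characters) are hypothetical, as is the choice θ = 0.5.

```python
def correct_char_probability(total_errors, total_chars):
    """r = 1 - (sum of rewrite error counts / total number of characters)."""
    return 1 - total_errors / total_chars


def variant_output_probability(word_length, r, theta):
    """Formula 2: p_j = r^(l - theta) * (1 - r)^theta."""
    return r ** (word_length - theta) * (1 - r) ** theta


# Hypothetical log totals: 10000 rewrite errors among 1000000 typed characters.
r = correct_char_probability(total_errors=10_000, total_chars=1_000_000)
p = variant_output_probability(word_length=len("laappy"), r=r, theta=0.5)
```

With θ = 0 the formula reduces to $r^l$, the probability that every character of the word was typed correctly; larger θ penalizes the variant more.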
In another embodiment, the candidate set calculation step performs both error-correction string matching and rewrite probability calculation on the word $t_i$: it additionally obtains the set $U=\{u_1,u_2,\dots,u_m\}$ of all words whose edit distance to $t_i$ is less than a specific edit distance (for example, edit distance <= 2) and calculates the corresponding output probabilities $P=\{p_{n+1},p_{n+2},\dots,p_{n+m}\}$. The sets V and U are merged to obtain the candidate set $C=\{c_1,c_2,\dots,c_n,c_{n+1},\dots,c_{n+m}\}$ of the word $t_i$ and the corresponding output probabilities $P=\{p_1,p_2,\dots,p_n,p_{n+1},p_{n+2},\dots,p_{n+m}\}$.
Preferably, the output probabilities $P=\{p_{n+1},p_{n+2},\dots,p_{n+m}\}$ can be calculated from the character rewrite probability $p_h$ corresponding to each character operation on the edit conversion path between $t_i$ and a word $u_j$ in the word set U. The edit conversion path is the sequence of edit operations by which one word is converted into another under edit distance. For example, to convert laappy into happy, the edit conversion path is: delete a, replace l with h.
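The word set U can be sketched with a standard Levenshtein distance (substitution, insertion, deletion) over a dictionary. The function names and the small dictionary are hypothetical, and a linear scan is used only for clarity; a production system would likely index the dictionary.

```python
def edit_distance(a, b):
    """Levenshtein distance using substitution, insertion and deletion."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def nearby_words(word, dictionary, max_distance=2):
    """All dictionary words within `max_distance` edits of `word`."""
    return [w for w in dictionary if edit_distance(word, w) <= max_distance]


# Hypothetical dictionary.
dictionary = ["happy", "apple", "lappy", "laappy"]
U = nearby_words("laappy", dictionary)
```

For the example in the text, `edit_distance("laappy", "happy")` is 2, matching the two-step path (delete a, replace l with h).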
More preferably, the output probability of a word $u_j$ can be calculated according to the formula:
$$p_j = r^{(l-k)}(1-r)^{k}\prod_{m=1}^{k} p_{h_m}$$ (Formula 3)
where l is the string length of the word $t_i$; k is the number of edit conversion steps from $t_i$ to $u_j$; $p_{h_m}$ is the character rewrite probability corresponding to the m-th edit operation; and r is the probability that a single character is entered correctly.
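Formula 3 can be evaluated once the rewrite probabilities along the edit conversion path are known. In the sketch below the two rewrite probabilities for the laappy to happy path and the value of r are hypothetical; only the formula itself comes from the text.

```python
import math


def edit_output_probability(word_length, r, rewrite_probs):
    """Formula 3: p_j = r^(l - k) * (1 - r)^k * product of p_h over the path."""
    k = len(rewrite_probs)  # number of edit conversion steps
    return r ** (word_length - k) * (1 - r) ** k * math.prod(rewrite_probs)


r = 0.99  # assumed probability that a single character is typed correctly
# Hypothetical rewrite probabilities for the path: delete a, replace l with h.
path_probs = [0.002, 0.001]
p = edit_output_probability(len("laappy"), r, path_probs)
```

An empty path (k = 0) gives $r^l$, so an unedited word keeps the probability of having been typed entirely correctly.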
Fig. 3 shows a block diagram of a candidate set computing device 3000 according to an embodiment of the present invention. The candidate set computing device 3000 can be used to implement the method shown in Fig. 2, so the overlapping parts are not described in detail again.
The candidate set computing device 3000 includes an extraction module 3010 and a candidate set computing module 3030, and preferably also includes a rewrite probability calculation module 3020.
The extraction module 3010 is configured to extract error-correction query pairs from the user log and to establish error-correction string pairs for each error-correction query pair. An error-correction query pair is the correspondence between erroneously entered text and correctly entered text; an error-correction string pair is the correspondence between an erroneously entered string and a correctly entered string within the error-correction query pair.
The candidate set computing module 3030 is configured to, when a string in the input word $t_i$ matches an error-correction string pair, generate a variant set $V=\{v_1,v_2,\dots,v_n\}$ of the word from the error-correction string pair as the candidate set $C=\{c_1,c_2,\dots,c_n\}$ and calculate the corresponding output probabilities $P=\{p_1,p_2,\dots,p_n\}$.
Preferably, calculating the output probabilities of the set V in the candidate set computing module includes calculating the output probability of a word $v_j$ according to the formula:
$$p_j = r^{(l-\theta)}(1-r)^{\theta}$$ (Formula 2)
where l is the string length of the input word $t_i$, r is the probability that a single character is entered correctly, and θ is a constant between 0 and 1.
Preferably, the extraction module 3010 is further configured to select the error-correction string pairs whose erroneously entered string and correctly entered string are both below a predetermined edit distance, to count the occurrences of each such pair, and to establish the pairs whose occurrence count exceeds a predetermined threshold as final error-correction string pairs. The predetermined threshold is an integer greater than 0, and a suitable value can be chosen according to the application scenario and practical experience.
In particular, the device may further include the rewrite probability calculation module 3020, configured to calculate, based on the user log mining results, the probability $P_h$ of each kind of character rewrite, where a character rewrite is a single character being mistyped, omitted, or redundantly typed.
When the rewrite probability calculation module 3020 is included, the candidate set computing module 3030 is further configured to obtain the set $U=\{u_1,u_2,\dots,u_m\}$ of all words whose edit distance to the word $t_i$ is less than a specific edit distance, to calculate the corresponding output probabilities $P=\{p_{n+1},p_{n+2},\dots,p_{n+m}\}$, and to merge the sets V and U, thereby obtaining the candidate set $C=\{c_1,c_2,\dots,c_n,c_{n+1},\dots,c_{n+m}\}$ of the word $t_i$ and the corresponding output probabilities $P=\{p_1,p_2,\dots,p_n,p_{n+1},p_{n+2},\dots,p_{n+m}\}$.
Preferably, calculating the output probabilities of the set U in the candidate set computing module 3030 includes calculating the output probabilities $P=\{p_{n+1},p_{n+2},\dots,p_{n+m}\}$ from the character rewrite probability $p_h$ corresponding to each character operation on the edit conversion path between $t_i$ and a word $u_j$ in the word set U.
More preferably, calculating the output probabilities of the word set U in the candidate set computing module 3030 includes calculating the output probability of a word $u_j$ according to the formula:
$$p_j = r^{(l-k)}(1-r)^{k}\prod_{m=1}^{k} p_{h_m}$$ (Formula 3)
where l is the string length of the word $t_i$; k is the number of edit conversion steps from $t_i$ to $u_j$; $p_{h_m}$ is the character rewrite probability corresponding to the m-th edit operation; and r is the probability that a single character is entered correctly.
According to yet another embodiment of the present invention, a programmable device is also provided, including a memory and a processor, where the memory stores instructions and the instructions control the processor to perform the method described in Fig. 2.
The first embodiment of the present invention has been described above with reference to the accompanying drawings. According to this embodiment, user logs reflecting particular language habits are mined in a targeted manner, and error-correction string pairs are generated and character rewrite probabilities are calculated from them. During input matching and error correction, the variant set that matches the error-correction string pairs and the character rewrites are included in the candidate set and the corresponding probabilities are calculated, which improves the ability to correct natural language input from specific groups with particular text input habits. Good results were achieved in particular on Indian English (Hinglish): correction accuracy is maintained, most error-correction problems are covered, and the method also adapts well to correcting new words.
<Second embodiment>
According to the second embodiment of the present invention, as shown in Figs. 4 and 5, an input error correction method based on the method described in the first embodiment is provided. The overlapping parts are therefore not described in detail again.
As shown in Fig. 4, the input error correction method according to this embodiment includes a transition probability calculation step. In traditional pattern recognition theory, user input can be regarded as a sequence of states. Calculating the transition probability between states means finding, from a corpus, the probability that two words form an adjacent context. For example, suppose the English corpus is as follows:
it is over
How Sweet It Is
it is time to say goodbye
The transition probability from it to is can then be calculated as P(is | it) = 3/3 = 1, and the transition probability from is to over as P(over | is) = 1/3.
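The two values in this example can be reproduced by counting words and adjacent word pairs. This is an unsmoothed sketch; lowercasing the corpus is an assumption made so that "It Is" and "it is" count as the same words, which is what the stated result 3/3 implies.

```python
from collections import Counter
from fractions import Fraction

corpus = ["it is over", "How Sweet It Is", "it is time to say goodbye"]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.lower().split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))  # adjacent word pairs


def transition(prev_word, next_word):
    """Unsmoothed P(next_word | prev_word) = c(prev, next) / c(prev)."""
    return Fraction(bigrams[(prev_word, next_word)], unigrams[prev_word])


p_is_given_it = transition("it", "is")
p_over_given_is = transition("is", "over")
```

`Fraction` keeps the results exact, so they can be compared directly with the hand-computed values.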
The transition probability can be calculated as follows:
S4100: build a corpus $Y=\{s_1,s_2,\dots,s_n\}$, where s denotes a short sentence and n denotes the size of the corpus; $s_i=\{t_1,t_2,\dots,t_m\}$, where t denotes a word. Also generate a global dictionary $D=\{t_1,t_2,\dots,t_c\}$.
S4200: mark the beginning and end of each short sentence. For example, the markers <s> and </s> can be placed at the beginning and end of each short sentence s to identify the sentence start and sentence end, which facilitates automatic recognition.
S4300: calculate the transition probability $P'(t_i|t_j)$ between every pair of words.
Preferably, the transition probability between every pair of words in the corpus is calculated according to the formula:
$$P'(t_i|t_j) = \frac{c(t_i,t_j)+\theta}{c(t_j)+v}$$ (Formula 4)
where θ is a constant between 0 and 1; $c(t_j)$ is the number of occurrences of the word $t_j$ in the corpus; $c(t_i,t_j)$ is the number of times the two words $t_i$ and $t_j$ appear adjacent to each other; and v is the number of distinct adjacent word combinations in the corpus.
For example, consider the following corpus:
hello world
world peace
say hello world in python
The word world occurs 3 times in the corpus, so c(world) = 3.
The pair hello world occurs 2 times in the corpus, so c(hello, world) = 2.
The corpus contains five distinct adjacent word pairs (hello world, world peace, say hello, world in, in python), so v = 5.
The transition probability $P'(t_i|t_j)$ between the words $t_i$ and $t_j$ is then calculated according to the formula $P'(t_i|t_j) = (c(t_i,t_j)+\theta)/(c(t_j)+v)$.
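The counts in this worked example, and the resulting value of Formula 4, can be checked with a short sketch. Two assumptions are made: θ = 0.5 is an arbitrary choice within the stated range, and the sentence markers <s> and </s> are left out so that v matches the five pairs listed in the example. The function takes the counts as arguments and plugs in the same numbers as the example, c(hello, world) = 2 and c(world) = 3.

```python
from collections import Counter

corpus = ["hello world", "world peace", "say hello world in python"]

word_counts = Counter()
pair_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    word_counts.update(words)
    pair_counts.update(zip(words, words[1:]))  # adjacent word pairs

v = len(pair_counts)  # number of distinct adjacent word pairs


def smoothed_transition(pair_count, word_count, v, theta):
    """Formula 4: (c(t_i, t_j) + theta) / (c(t_j) + v)."""
    return (pair_count + theta) / (word_count + v)


p = smoothed_transition(
    pair_count=pair_counts[("hello", "world")],
    word_count=word_counts["world"],
    v=v,
    theta=0.5,  # assumed smoothing constant
)
```

Because θ is added to the numerator, a pair never seen in the corpus still receives a small non-zero probability, which keeps a single unseen transition from zeroing out a whole path.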
After the transition probability calculation step shown in Fig. 4 and the error-correction string extraction and character rewrite probability calculation steps shown in Fig. 2 have been carried out, real-time text input error correction can be provided to users online, as shown in Fig. 5. The method includes:
S5100, an input step, for inputting a sentence;
S5200, a segmentation step, for dividing the sentence into words $t_i$;
S5300, a candidate set calculation step, for calculating, according to the method of the first embodiment, the candidate set C and its output probabilities P for each word $t_i$ obtained by segmentation;
S5400, an error-correction path calculation step, for calculating, from the output probabilities P and the transition probabilities P' obtained by the method shown in Fig. 4, the optimal error-correction path and its probability $p_l$, as well as the probability $p_0$ of the original input path, where the optimal error-correction path is the best-matching path selected by probability calculation from the candidate error-correction paths obtained from the candidate set C;
S5500, a judgment step, for judging whether the optimal error-correction path is the same as the original input path, where:
if the judgment step finds that the optimal error-correction path is the same as the original input path, the originally input sentence is returned; and
if the judgment step finds that the optimal error-correction path differs from the original input path, the difference between the probability $p_l$ of the optimal error-correction path and the probability $p_0$ of the original path is calculated; if the difference is greater than a predetermined difference threshold, the error-correction result corresponding to the optimal error-correction path is returned, and otherwise the originally input sentence is returned. The predetermined difference threshold is a constant greater than or equal to 0, and a suitable value can be chosen for the application scenario from implementation experience or by a conventional optimization method.
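The judgment logic of S5500 can be sketched as a small function. The function name, the sample word lists, the probabilities, and the threshold value are hypothetical; only the decision rule follows the text.

```python
def choose_output(original_words, best_words, p_best, p_original, diff_threshold):
    """Return the corrected sentence only if it beats the original by a margin."""
    if best_words == original_words:
        return original_words
    if p_best - p_original > diff_threshold:
        return best_words
    return original_words


# Hypothetical paths and probabilities.
original = ["helo", "world"]
best = ["hello", "world"]
result = choose_output(original, best, p_best=0.14, p_original=0.0042,
                       diff_threshold=0.05)
```

The threshold guards against replacing the user's input when the corrected path is only marginally more probable than what was actually typed.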
In step S5400, the optimal error-correction path l and its probability $p_l$ can preferably be calculated with the Viterbi method of the traditional hidden Markov model (HMM). The Viterbi method is a dynamic programming method well known in the prior art and is not described further here.
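For readers unfamiliar with it, a generic Viterbi search over per-word candidate sets might look like the sketch below. It is an illustration under assumptions, not the patent's implementation: the data layout (`candidates`, `output_probs`, a `transition` callable), the sample probabilities, and the omission of the sentence start and end markers are all choices made here for brevity.

```python
def viterbi(candidates, output_probs, transition):
    """Return (probability, path) of the most probable candidate sequence.

    candidates[i]      -- candidate words for position i
    output_probs[i][w] -- output probability of candidate w at position i
    transition(a, b)   -- transition probability of b following a
    """
    # best[w] = (probability of the best path ending in w, that path)
    best = {w: (output_probs[0][w], [w]) for w in candidates[0]}
    for i in range(1, len(candidates)):
        new_best = {}
        for w in candidates[i]:
            new_best[w] = max(
                ((p * transition(prev, w) * output_probs[i][w], path + [w])
                 for prev, (p, path) in best.items()),
                key=lambda item: item[0],
            )
        best = new_best
    return max(best.values(), key=lambda item: item[0])


def path_probability(words, output_probs, transition):
    """Probability of one fixed path, e.g. the original input (p_0)."""
    prob = output_probs[0][words[0]]
    for i in range(1, len(words)):
        prob *= transition(words[i - 1], words[i]) * output_probs[i][words[i]]
    return prob


# Hypothetical candidate sets and probabilities for the input "helo world".
candidates = [["helo", "hello"], ["world", "word"]]
output_probs = [{"helo": 0.6, "hello": 0.4}, {"world": 0.7, "word": 0.3}]
table = {("hello", "world"): 0.5, ("hello", "word"): 0.05,
         ("helo", "world"): 0.01, ("helo", "word"): 0.01}


def transition(a, b):
    return table[(a, b)]


p_l, best_path = viterbi(candidates, output_probs, transition)
p_0 = path_probability(["helo", "world"], output_probs, transition)
```

Here `p_l` and `p_0` are the two quantities compared against the difference threshold in the judgment step. For long sentences, summing log probabilities instead of multiplying raw probabilities avoids floating-point underflow.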
In addition, as shown in Fig. 6, an input error correction device 6000 is also provided, including:
a transition probability calculation module 6060, for calculating the state transition probabilities P' of the sentences in the corpus;
an input module 6040, for inputting a sentence;
a segmentation module 6050, for dividing the sentence into words;
the candidate set computing device 3000 shown in Fig. 3, for calculating the candidate set C and its output probabilities P for each word $t_i$ obtained by segmentation;
an error-correction path calculation module 6070, for calculating, from the output probabilities P and the transition probabilities P', the optimal error-correction path and its probability $p_l$, as well as the probability $p_0$ of the original input path;
a judgment module 6080, for judging whether the optimal error-correction path is the same as the original input path, where:
if the judgment module finds that the optimal error-correction path is the same as the original input path, the originally input sentence is returned; and
if the judgment module finds that the optimal error-correction path differs from the original input path, the difference between the probability $p_l$ of the optimal error-correction path and the probability $p_0$ of the original path is calculated; if the difference is greater than a predetermined difference threshold, the error-correction result corresponding to the optimal error-correction path is returned, and otherwise the originally input sentence is returned. The predetermined difference threshold is a constant greater than or equal to 0, and a suitable value can be chosen for the application scenario from implementation experience or by a conventional optimization method.
According to yet another embodiment of the present invention, a programmable device is also provided, including a memory and a processor, where the memory stores instructions and the instructions control the processor to perform the method described in Fig. 5.
The second embodiment of the present invention has been described above with reference to the accompanying drawings. This embodiment provides a complete text error correction method and device. Offline, a corpus is built and the state transition probabilities of its sentences are calculated; also offline, user logs reflecting particular language habits are mined in a targeted manner, and error-correction string pairs are generated and character rewrite probabilities are calculated from them. Online, after a user's query input is received, the variant set that matches the error-correction string pairs and the character rewrites are included in the candidate set and the corresponding probabilities are calculated; the optimal error-correction path is then calculated from the state transition probabilities computed offline, the candidate set, and the corresponding probabilities. This scheme improves the ability to correct natural language input from specific groups with particular text input habits. Good results were achieved in particular on Indian English (Hinglish): correction accuracy is maintained, most error-correction problems are covered, and the method adapts well to correcting new words.
Those skilled in the art will appreciate that the candidate set computing device and the text error correction device can be implemented in various ways. For example, they can be implemented by configuring a processor with instructions. For example, the instructions can be stored in a ROM and, when the device starts, read from the ROM into a programmable device to implement the candidate set computing device and the text error correction device. For example, the candidate set computing device and the text error correction device can be built into a dedicated device (such as an ASIC). The two devices can be divided into mutually independent units or can be implemented together. They can be implemented in one of the above ways or by a combination of two or more of them.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to carry out aspects of the present invention.
The computer-readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (for example, light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device via a network, for example the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++ and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry such as programmable logic circuitry, a field-programmable gate array (FPGA), or a programmable logic array (PLA) is personalized using state information of the computer-readable program instructions, and the electronic circuitry can execute the computer-readable program instructions in order to carry out aspects of the present invention.
Aspects of the present invention are described herein with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions that implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to multiple embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions. It is well known to those skilled in the art that implementation in hardware, implementation in software, and implementation in a combination of software and hardware are all equivalent.
Various embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the present invention is defined by the appended claims.

Claims (10)

1. A candidate set calculation method for text input, characterized by comprising the following steps:
an extraction step, for extracting error-correction query pairs from a user log and establishing error-correction string pairs for each error-correction query pair, wherein an error-correction query pair is the correspondence between erroneously entered text and correctly entered text, and an error-correction string pair is the correspondence between an erroneously entered string and a correctly entered string within the error-correction query pair;
a candidate set calculation step, for, when a string in an input word $t_i$ matches an error-correction string pair, generating a variant set $V=\{v_1,v_2,\dots,v_n\}$ of the word from the error-correction string pair as a candidate set $C=\{c_1,c_2,\dots,c_n\}$ and calculating the corresponding output probabilities $P=\{p_1,p_2,\dots,p_n\}$.
2. The method according to claim 1, characterized in that calculating the output probabilities of the set V in the candidate set calculation step comprises:
calculating the output probability of a word $v_j$ according to the formula $p_j = r^{(l-\theta)}(1-r)^{\theta}$, wherein
l is the string length of the input word $t_i$; r is the probability that a single character is entered correctly; and θ is a constant between 0 and 1.
3. The method according to claim 1, characterized in that the method further comprises:
a rewrite probability calculation step, for calculating, based on the user log mining results, the probability $P_h$ of each kind of character rewrite, wherein a character rewrite is a single character being mistyped, omitted, or redundantly typed; and
the candidate set calculation step is further used for obtaining the set $U=\{u_1,u_2,\dots,u_m\}$ of all words whose edit distance to the word $t_i$ is less than a predetermined edit distance, calculating the corresponding output probabilities $P=\{p_{n+1},p_{n+2},\dots,p_{n+m}\}$, and merging the sets V and U, thereby obtaining the candidate set $C=\{c_1,c_2,\dots,c_n,c_{n+1},\dots,c_{n+m}\}$ of the word $t_i$ and the corresponding output probabilities $P=\{p_1,p_2,\dots,p_n,p_{n+1},p_{n+2},\dots,p_{n+m}\}$.
4. An input error correction method, characterized by comprising the following steps:
a transition probability calculation step, for calculating the state transition probabilities P' of the sentences in a corpus;
an input step, for inputting a sentence;
a segmentation step, for dividing the sentence into words $t_i$;
a candidate set calculation step, for calculating, according to the method of any one of claims 1-3, the candidate set C and its output probabilities P for each word $t_i$ obtained by segmentation;
an error-correction path calculation step, for calculating, from the output probabilities P and the transition probabilities P', the optimal error-correction path and its probability $p_l$, as well as the probability $p_0$ of the original input path;
a judgment step, for judging whether the optimal error-correction path is the same as the original input path, wherein
if the judgment step finds that the optimal error-correction path is the same as the original input path, the originally input sentence is returned; and wherein
if the judgment step finds that the optimal error-correction path differs from the original input path, the difference between the probability $p_l$ of the optimal error-correction path and the probability $p_0$ of the original path is calculated; if the difference is greater than a predetermined difference threshold, the error-correction result corresponding to the optimal error-correction path is returned, and otherwise the originally input sentence is returned.
5. A candidate set computing device for text input, comprising:
an extraction module, for extracting error-correction query pairs from a user log and establishing error-correction string pairs for each error-correction query pair, wherein an error-correction query pair is the correspondence between erroneously entered text and correctly entered text, and an error-correction string pair is the correspondence between an erroneously entered string and a correctly entered string within the error-correction query pair;
a candidate set computing module, for, when a string in an input word $t_i$ matches an error-correction string pair, generating a variant set $V=\{v_1,v_2,\dots,v_n\}$ of the word from the error-correction string pair as a candidate set $C=\{c_1,c_2,\dots,c_n\}$ and calculating the corresponding output probabilities $P=\{p_1,p_2,\dots,p_n\}$.
6. The device according to claim 5, characterized in that calculating the output probabilities of the set V in the candidate set computing module comprises:
calculating the output probability of a word $v_j$ according to the formula $p_j = r^{(l-\theta)}(1-r)^{\theta}$, wherein
l is the string length of the input word $t_i$; r is the probability that a single character is entered correctly; and θ is a constant between 0 and 1.
7. The device according to claim 5, characterized by further comprising:
a rewrite probability calculation module, for calculating, based on the user log mining results, the probability $P_h$ of each kind of character rewrite, wherein a character rewrite is a single character being mistyped, omitted, or redundantly typed; and
the candidate set computing module is further used for obtaining the set $U=\{u_1,u_2,\dots,u_m\}$ of all words whose edit distance to the word $t_i$ is less than a specific edit distance, calculating the corresponding output probabilities $P=\{p_{n+1},p_{n+2},\dots,p_{n+m}\}$, and merging the sets V and U, thereby obtaining the candidate set $C=\{c_1,c_2,\dots,c_n,c_{n+1},\dots,c_{n+m}\}$ of the word $t_i$ and the corresponding output probabilities $P=\{p_1,p_2,\dots,p_n,p_{n+1},p_{n+2},\dots,p_{n+m}\}$.
8. A programmable device, comprising a memory and a processor, wherein the memory is used for storing instructions, and the instructions are used for controlling the processor to perform the method according to any one of claims 1-3.
9. An input error correction device, characterized by comprising:
a transition probability computing module, configured to calculate the state transition probabilities P' of sentences in a corpus;
an input module, configured to input a sentence;
a segmentation module, configured to split the sentence into words $t_i$;
the candidate set computing device according to any one of claims 5-7, configured to calculate the candidate set C and its output probabilities P for each segmented word $t_i$;
an error correction path computing module, configured to calculate, according to the output probabilities P and the transition probabilities P', the optimal error correction path and its corresponding probability $p_1$, as well as the probability $p_0$ of the original input path; and
a judging module, configured to judge whether the optimal error correction path is identical to the original input path, wherein
if the judging module judges that the optimal error correction path is identical to the original input path, the original input sentence is returned; and wherein
if the judging module judges that the optimal error correction path differs from the original input path, the difference between the probability $p_1$ of the optimal error correction path and the probability $p_0$ of the original path is calculated; if the difference is greater than a predetermined difference threshold, the error correction result corresponding to the optimal error correction path is returned; otherwise, the original input sentence is returned.
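The path computation and threshold decision of claim 9 can be illustrated with a Viterbi-style search over the per-word candidate sets. This is a sketch under assumptions rather than the claimed implementation: the sentence-start symbol `START`, the `floor` probability for unseen transitions, the dictionary-based data layout, and all function names are hypothetical, and each candidate set is assumed to contain the originally typed word.

```python
from typing import Dict, List, Tuple

START = "<s>"  # hypothetical sentence-start symbol

Candidates = List[Dict[str, float]]         # per position: word -> output probability
Transitions = Dict[Tuple[str, str], float]  # (previous word, word) -> transition probability


def path_probability(path: List[str], candidates: Candidates,
                     transitions: Transitions, floor: float = 1e-6) -> float:
    """Probability of one fixed path (used for the original input path, p0)."""
    prob, prev = 1.0, START
    for word, cands in zip(path, candidates):
        prob *= transitions.get((prev, word), floor) * cands.get(word, floor)
        prev = word
    return prob


def best_path(candidates: Candidates, transitions: Transitions,
              floor: float = 1e-6) -> Tuple[List[str], float]:
    """Viterbi search: keep the best-scoring path ending in each candidate."""
    best = {START: (1.0, [])}
    for cands in candidates:
        nxt = {}
        for word, p_out in cands.items():
            nxt[word] = max(
                ((score * transitions.get((prev, word), floor) * p_out, path + [word])
                 for prev, (score, path) in best.items()),
                key=lambda t: t[0])
        best = nxt
    score, path = max(best.values(), key=lambda t: t[0])
    return path, score


def correct(words: List[str], candidates: Candidates,
            transitions: Transitions, threshold: float = 0.0) -> List[str]:
    """Return the corrected sentence only if p1 - p0 exceeds the threshold."""
    path, p1 = best_path(candidates, transitions)
    if path == words:
        return words
    p0 = path_probability(words, candidates, transitions)
    return path if p1 - p0 > threshold else words
```

The threshold keeps the device conservative: a correction is proposed only when the optimal path is sufficiently more probable than what the user actually typed.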
10. A programmable device, comprising a memory and a processor, wherein the memory is configured to store instructions, and the instructions are used to control the processor to operate so as to perform the method according to claim 4.
CN201610020331.0A 2016-01-12 2016-01-12 Candidate collection computational methods and device, word error correction method and device in word input Pending CN106959977A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610020331.0A CN106959977A (en) 2016-01-12 2016-01-12 Candidate collection computational methods and device, word error correction method and device in word input

Publications (1)

Publication Number Publication Date
CN106959977A true CN106959977A (en) 2017-07-18

Family

ID=59481421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610020331.0A Pending CN106959977A (en) 2016-01-12 2016-01-12 Candidate collection computational methods and device, word error correction method and device in word input

Country Status (1)

Country Link
CN (1) CN106959977A (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107678561A (en) * 2017-09-29 2018-02-09 百度在线网络技术(北京)有限公司 Phonetic entry error correction method and device based on artificial intelligence
CN108197317A (en) * 2018-02-01 2018-06-22 科大讯飞股份有限公司 Document key message extraction system test method and device
CN108491392A (en) * 2018-03-29 2018-09-04 广州视源电子科技股份有限公司 Method, system, computer device and storage medium for correcting character spelling errors
CN108519973A (en) * 2018-03-29 2018-09-11 广州视源电子科技股份有限公司 Character spelling detection method, system, computer equipment and storage medium
CN108563632A (en) * 2018-03-29 2018-09-21 广州视源电子科技股份有限公司 Method, system, computer device and storage medium for correcting character spelling errors
CN109376362A (en) * 2018-11-30 2019-02-22 武汉斗鱼网络科技有限公司 A kind of the determination method and relevant device of corrected text
CN109426357A (en) * 2017-09-01 2019-03-05 百度在线网络技术(北京)有限公司 Data inputting method and device
CN109885180A (en) * 2019-02-21 2019-06-14 北京百度网讯科技有限公司 Error correction method and device, computer-readable medium
CN109977415A (en) * 2019-04-02 2019-07-05 北京奇艺世纪科技有限公司 A kind of text error correction method and device
CN110889028A (en) * 2018-08-15 2020-03-17 北京嘀嘀无限科技发展有限公司 Corpus processing and model training method and system
CN111339757A (en) * 2020-02-13 2020-06-26 上海凯岸信息科技有限公司 Error correction method for voice recognition result in collection scene
CN111353025A (en) * 2018-12-05 2020-06-30 阿里巴巴集团控股有限公司 Parallel corpus processing method and device, storage medium and computer equipment
CN111797614A (en) * 2019-04-03 2020-10-20 阿里巴巴集团控股有限公司 Text processing method and device
CN112445953A (en) * 2019-08-14 2021-03-05 阿里巴巴集团控股有限公司 Information search error correction method, computing device and storage medium
CN112528980A (en) * 2020-12-16 2021-03-19 北京华宇信息技术有限公司 OCR recognition result correction method and terminal and system thereof
CN112861518A (en) * 2020-12-29 2021-05-28 科大讯飞股份有限公司 Text error correction method and device, storage medium and electronic device
CN114168808A (en) * 2021-11-22 2022-03-11 中核核电运行管理有限公司 Regular expression-based document character string coding identification method and device
CN115659958A (en) * 2022-12-27 2023-01-31 中南大学 Chinese spelling error checking method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060048055A1 (en) * 2004-08-25 2006-03-02 Jun Wu Fault-tolerant romanized input method for non-roman characters
CN101350004A (en) * 2008-09-11 2009-01-21 北京搜狗科技发展有限公司 Method for forming personalized error correcting model and input method system of personalized error correcting
CN102156551A (en) * 2011-03-30 2011-08-17 北京搜狗科技发展有限公司 Method and system for correcting error of word input
US20140188460A1 (en) * 2012-10-16 2014-07-03 Google Inc. Feature-based autocorrection
CN101241514B (en) * 2008-03-21 2014-11-05 北京搜狗科技发展有限公司 Method for creating error-correcting database, automatic error correcting method and system
CN104298672A (en) * 2013-07-16 2015-01-21 北京搜狗科技发展有限公司 Error correction method and device for input
US9037967B1 (en) * 2014-02-18 2015-05-19 King Fahd University Of Petroleum And Minerals Arabic spell checking technique
CN107102746A (en) * 2016-02-19 2017-08-29 北京搜狗科技发展有限公司 Candidate word generation method, device and the device generated for candidate word

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060048055A1 (en) * 2004-08-25 2006-03-02 Jun Wu Fault-tolerant romanized input method for non-roman characters
CN101133411A (en) * 2004-08-25 2008-02-27 Google公司 Fault-tolerant romanized input method for non-roman characters
CN101241514B (en) * 2008-03-21 2014-11-05 北京搜狗科技发展有限公司 Method for creating error-correcting database, automatic error correcting method and system
CN101350004A (en) * 2008-09-11 2009-01-21 北京搜狗科技发展有限公司 Method for forming personalized error correcting model and input method system of personalized error correcting
CN102156551A (en) * 2011-03-30 2011-08-17 北京搜狗科技发展有限公司 Method and system for correcting error of word input
US20140188460A1 (en) * 2012-10-16 2014-07-03 Google Inc. Feature-based autocorrection
CN104298672A (en) * 2013-07-16 2015-01-21 北京搜狗科技发展有限公司 Error correction method and device for input
US9037967B1 (en) * 2014-02-18 2015-05-19 King Fahd University Of Petroleum And Minerals Arabic spell checking technique
CN107102746A (en) * 2016-02-19 2017-08-29 北京搜狗科技发展有限公司 Candidate word generation method, device and the device generated for candidate word

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DANIEL JURAFSKY,JAMES H. MARTIN: "《Speech and Language Processing》", 《SPEECH AND LANGUAGE PROCESSING》 *
FANDYWANG: "《斯坦福大学自然语言处理第五课"拼写纠错(Spelling Correction)"》", 《斯坦福大学自然语言处理第五课"拼写纠错(SPELLING CORRECTION)"》 *
弗里德里希著: "《数字媒体中的隐写术原理算法和应用》", 30 April 2014 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109426357A (en) * 2017-09-01 2019-03-05 百度在线网络技术(北京)有限公司 Data inputting method and device
CN109426357B (en) * 2017-09-01 2023-05-12 百度在线网络技术(北京)有限公司 Information input method and device
US10839794B2 (en) 2017-09-29 2020-11-17 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for correcting input speech based on artificial intelligence, and storage medium
CN107678561A (en) * 2017-09-29 2018-02-09 百度在线网络技术(北京)有限公司 Phonetic entry error correction method and device based on artificial intelligence
CN108197317A (en) * 2018-02-01 2018-06-22 科大讯飞股份有限公司 Document key message extraction system test method and device
CN108519973A (en) * 2018-03-29 2018-09-11 广州视源电子科技股份有限公司 Character spelling detection method, system, computer equipment and storage medium
CN108563632A (en) * 2018-03-29 2018-09-21 广州视源电子科技股份有限公司 Method, system, computer device and storage medium for correcting character spelling errors
CN108491392A (en) * 2018-03-29 2018-09-04 广州视源电子科技股份有限公司 Method, system, computer device and storage medium for correcting character spelling errors
CN110889028A (en) * 2018-08-15 2020-03-17 北京嘀嘀无限科技发展有限公司 Corpus processing and model training method and system
CN109376362A (en) * 2018-11-30 2019-02-22 武汉斗鱼网络科技有限公司 A kind of the determination method and relevant device of corrected text
CN111353025B (en) * 2018-12-05 2024-02-27 阿里巴巴集团控股有限公司 Parallel corpus processing method and device, storage medium and computer equipment
CN111353025A (en) * 2018-12-05 2020-06-30 阿里巴巴集团控股有限公司 Parallel corpus processing method and device, storage medium and computer equipment
CN109885180A (en) * 2019-02-21 2019-06-14 北京百度网讯科技有限公司 Error correction method and device, computer-readable medium
CN109977415A (en) * 2019-04-02 2019-07-05 北京奇艺世纪科技有限公司 A kind of text error correction method and device
CN111797614A (en) * 2019-04-03 2020-10-20 阿里巴巴集团控股有限公司 Text processing method and device
CN111797614B (en) * 2019-04-03 2024-05-28 阿里巴巴集团控股有限公司 Text processing method and device
CN112445953A (en) * 2019-08-14 2021-03-05 阿里巴巴集团控股有限公司 Information search error correction method, computing device and storage medium
CN111339757A (en) * 2020-02-13 2020-06-26 上海凯岸信息科技有限公司 Error correction method for voice recognition result in collection scene
CN112528980A (en) * 2020-12-16 2021-03-19 北京华宇信息技术有限公司 OCR recognition result correction method and terminal and system thereof
CN112528980B (en) * 2020-12-16 2022-02-15 北京华宇信息技术有限公司 OCR recognition result correction method and terminal and system thereof
CN112861518A (en) * 2020-12-29 2021-05-28 科大讯飞股份有限公司 Text error correction method and device, storage medium and electronic device
CN112861518B (en) * 2020-12-29 2023-12-01 科大讯飞股份有限公司 Text error correction method and device, storage medium and electronic device
CN114168808A (en) * 2021-11-22 2022-03-11 中核核电运行管理有限公司 Regular expression-based document character string coding identification method and device
CN115659958A (en) * 2022-12-27 2023-01-31 中南大学 Chinese spelling error checking method

Similar Documents

Publication Publication Date Title
CN106959977A (en) Candidate collection computational methods and device, word error correction method and device in word input
CN109887497B (en) Modeling method, device and equipment for speech recognition
CN106534548B (en) Voice error correction method and device
JP6182272B2 (en) Natural expression processing method, processing and response method, apparatus, and system
CN107622054B (en) Text data error correction method and device
CN110033760B (en) Modeling method, device and equipment for speech recognition
CN100489841C (en) Method and integrated development tool for building a natural language understanding application
CN101458681A (en) Voice translation method and voice translation apparatus
CN106598939A (en) Method and device for text error correction, server and storage medium
CN110070855B (en) Voice recognition system and method based on migrating neural network acoustic model
CN108647191B (en) Sentiment dictionary construction method based on supervised sentiment text and word vector
CN107945792A (en) Method of speech processing and device
CN109213861A (en) In conjunction with the tourism evaluation sensibility classification method of At_GRU neural network and sentiment dictionary
CN110147544B (en) Instruction generation method and device based on natural language and related equipment
US11934781B2 (en) Systems and methods for controllable text summarization
CN109460558B (en) Effect judging method of voice translation system
US20180018960A1 (en) Systems and methods for automatic repair of speech recognition engine output
CN106648819A (en) Internationalized code conversion method based on editor
CN112528605B (en) Text style processing method, device, electronic equipment and storage medium
CN113779972A (en) Speech recognition error correction method, system, device and storage medium
CN108304424A (en) Text key word extracting method and text key word extraction element
CN103678271A (en) Text correction method and user equipment
CN113920999A (en) Voice recognition method, device, equipment and storage medium
US11145308B2 (en) Symbol sequence estimation in speech
CN112216284A (en) Training data updating method and system, voice recognition method and system, and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200528

Address after: 310051 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Alibaba (China) Co.,Ltd.

Address before: 510627 14th Floor, Tower B, Guangdian Pingyun Plaza, No. 163 Pingyun Road, Huangpu Avenue West, Tianhe District, Guangzhou, Guangdong Province

Applicant before: GUANGZHOU UCWEB COMPUTER TECHNOLOGY Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20170718