CN106959977A - Candidate set calculation method and device, and input error correction method and device, for text input
- Publication number
- CN106959977A (application number CN201610020331.0A)
- Authority
- CN
- China
- Prior art keywords
- error correction
- word
- probability
- input
- path
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention discloses a candidate set calculation method for text input, comprising the following steps. An extraction step extracts error-correction query pairs from a user log and establishes an error-correction string pair for each query pair; an error-correction query pair is the correspondence between erroneously input text content and the correctly input text content, and an error-correction string pair is the correspondence between the erroneously input character string and the correctly input character string within that query pair. A candidate set calculation step, when a character string in an input word $t_i$ matches an error-correction string pair, generates from the string pair a variant set $V = \{v_1, v_2, \dots, v_n\}$ of the word as the candidate set $C = \{c_1, c_2, \dots, c_n\}$ and calculates the corresponding output probabilities $P = \{p_1, p_2, \dots, p_n\}$. A candidate set calculation device, an input error correction method, and a corresponding device are also disclosed. The invention can improve error correction accuracy while still covering most error correction problems, and it also adapts well to the correction of new words.
Description
Technical field
The present invention relates to the technical field of natural language processing, and in particular to a candidate set calculation method and device, and an input error correction method and device, for text input.
Background art
Error correction is an important step in search. According to statistics in the literature, roughly 10%-15% of search engine queries contain input errors. Among user groups with particular language habits, for example in Indian English or Indian music search projects, erroneous queries account for as much as 30%.
Conventional search error correction methods include the noisy channel model and the hidden Markov model (HMM). The noisy channel model obtains a candidate set by edit distance, then finds the maximum transition probability statistically, and so arrives at the best candidate correction. The HMM treats the query as a sequence of observed states and the corresponding candidate set as a set of hidden states; each observed state has an output probability for each hidden state, and the hidden states have transition probabilities between them, from which the optimal hidden state sequence is calculated. Both methods generally calculate the candidate set and its error probabilities from edit distance alone, which ignores the regularities of the language itself and makes it hard in practice to balance the precision and the coverage of the candidate set.
For example, the inventors found in project practice that input errors in search queries are far more pronounced among Indian users than among general English or Chinese users. A major reason lies in the characteristics of their language. For historical reasons, the main language Indians use on the Internet is Indian English, or Hinglish (https://en.wikipedia.org/wiki/Hinglish), a hybrid of English and native Indian languages (Hindi, Punjabi, etc.). Users transliterate words of their native language (Hindi, Punjabi) into Latin-alphabet spellings. No uniform rule governs this process; the spelling simply follows the pronunciation, so a single Hindi word often has several Latin spellings. For example, the film title "aashiqui 2" may also be spelled "ashiqui 2". This mixing of several native Indian languages therefore produces a large number of search input errors.
In existing HMM-based search error correction, reasonable estimation of the candidate set is a key problem. Two approaches are common. 1) Calculate the edit distance between words and derive the transition probability from it; this considers only character differences, and its accuracy is poor. 2) Mine the relations between error-correction word pairs from web logs and derive the transition probability from them; this depends on very comprehensive user logs, often covers only a limited range of corrections, and cannot handle new words. The inventors found in practice that neither approach corrects errors well enough for input with particular language habits, such as Indian English (Hinglish).
Summary of the invention
An object of the present invention is to provide a new solution for supplying candidate sets for, and correcting errors in, input that has particular language habits and characteristics.
According to a first aspect of the present invention, a candidate set calculation method for text input is provided, comprising the following steps:
An extraction step of extracting error-correction query pairs from a user log and establishing an error-correction string pair for each query pair, where an error-correction query pair is the correspondence between erroneously input text content and the correctly input text content, and an error-correction string pair is the correspondence between the erroneously input character string and the correctly input character string within that query pair;
A candidate set calculation step of, when a character string in an input word $t_i$ matches an error-correction string pair, generating from the string pair a variant set $V = \{v_1, v_2, \dots, v_n\}$ of the word as the candidate set $C = \{c_1, c_2, \dots, c_n\}$ and calculating the corresponding output probabilities $P = \{p_1, p_2, \dots, p_n\}$.
Preferably, calculating the output probabilities of the set V in the candidate set calculation step includes: calculating the output probability of word $v_j$ according to the formula $p_j = r^{(l-\theta)} (1-r)^{\theta}$, where $l$ is the string length of the input word $t_i$, $r$ is the probability that a single character is input correctly, and $\theta$ is a constant between 0 and 1.
Preferably, the extraction step further includes selecting those error-correction string pairs in which the erroneously input string and the correctly input string both fall below a predetermined edit distance.
Preferably, the extraction step further includes counting the occurrences of each error-correction string pair and establishing the pairs whose occurrence count exceeds a predetermined threshold as the final error-correction string pairs.
Preferably, the method further includes:
A rewrite probability calculation step of calculating, from the results of mining the user log, the probability $P_h$ of each type of character rewrite, where a character rewrite is the miswriting, omission, or superfluous insertion of a single character; and
The candidate set calculation step is further used to obtain the set $U = \{u_1, u_2, \dots, u_m\}$ of all words whose edit distance from the word $t_i$ is less than a predetermined edit distance and to calculate the corresponding output probabilities $P = \{p_{n+1}, p_{n+2}, \dots, p_{n+m}\}$, and to merge the set V with the set U, thereby obtaining the candidate set $C = \{c_1, c_2, \dots, c_n, c_{n+1}, \dots, c_{n+m}\}$ of the word $t_i$ and the corresponding output probabilities $P = \{p_1, p_2, \dots, p_n, p_{n+1}, p_{n+2}, \dots, p_{n+m}\}$.
Preferably, calculating the output probabilities of the set U in the candidate set calculation step includes: calculating the output probabilities $P = \{p_{n+1}, p_{n+2}, \dots, p_{n+m}\}$ from the character rewrite probability $P_h$ corresponding to each character operation on the edit conversion path between $t_i$ and the word $u_j$ in the word set U.
Preferably, calculating the output probabilities of the word set U in the candidate set calculation step includes: calculating the output probability of word $u_j$ according to the formula $p_j = r^{(l-k)} (1-r)^{k} \cdot \prod_{m=1}^{k} p_{h_m}$, where $l$ is the string length of the word $t_i$, $k$ is the number of edit steps from $t_i$ to $u_j$, $p_{h_m}$ is the character rewrite probability of the m-th edit step, and $r$ is the probability that a single character is input correctly.
According to a second aspect of the present invention, an input error correction method is provided, comprising the following steps:
A transition probability calculation step of calculating the state transition probabilities P' of the sentences in a corpus;
An input step of inputting a sentence;
A segmentation step of dividing the sentence into words $t_i$;
A candidate set calculation step of calculating, according to the candidate set calculation method described above, the candidate set C and its output probabilities P for each segmented word $t_i$;
An error correction path calculation step of calculating, from the output probabilities P and the transition probabilities P', the optimal error correction path and its probability $p_l$, as well as the probability $p_0$ of the original input path;
A judging step of judging whether the optimal error correction path is equal to the original input path, wherein
if the optimal error correction path is judged to be equal to the original input path, the original input sentence is returned; and wherein
if the optimal error correction path is judged not to be equal to the original input path, the difference between the probability $p_l$ of the optimal error correction path and the probability $p_0$ of the original path is calculated; if the difference is greater than a predetermined difference threshold, the error correction result corresponding to the optimal error correction path is returned; otherwise, the original input sentence is returned.
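The path calculation and judging steps above can be sketched as a small Viterbi-style search. This is only an illustrative sketch under stated assumptions, not the patented implementation: the function names, the `floor` fallback for unseen word pairs, and the requirement that the original word appears among its own candidates (so that $p_0$ can be evaluated) are all hypothetical, and plain probabilities are multiplied directly rather than summed as logarithms.

```python
def best_path(candidates, trans, floor=1e-9):
    """Return (path, probability) of the most probable candidate sequence.

    candidates: one dict per input word, mapping candidate -> output probability.
    trans: dict mapping (previous_word, word) -> transition probability.
    floor: hypothetical fallback probability for word pairs missing from trans.
    """
    # layer maps each candidate to (probability of the best path ending there, path)
    layer = {c: (p, [c]) for c, p in candidates[0].items()}
    for cands in candidates[1:]:
        nxt = {}
        for c, p in cands.items():
            score, path = max(
                ((s * trans.get((prev, c), floor), pth)
                 for prev, (s, pth) in layer.items()),
                key=lambda item: item[0],
            )
            nxt[c] = (score * p, path + [c])
        layer = nxt
    prob, path = max(layer.values(), key=lambda item: item[0])
    return path, prob


def path_probability(words, candidates, trans, floor=1e-9):
    """Probability of one fixed path (assumes each word is among its candidates)."""
    prob = candidates[0][words[0]]
    for i in range(1, len(words)):
        prob *= trans.get((words[i - 1], words[i]), floor) * candidates[i][words[i]]
    return prob


def correct(words, candidates, trans, threshold):
    """Return the corrected word list, or the original if the gain is too small."""
    path, p_l = best_path(candidates, trans)
    if path == words:
        return words
    p_0 = path_probability(words, candidates, trans)
    return path if p_l - p_0 > threshold else words
```

With hypothetical numbers, the query "tere nam" and candidates "nam" and "naam" for the second word, the correction is returned only when the probability gain exceeds the chosen threshold.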
Preferably, the transition probability calculation step includes: taking the sentence as the unit, calculating the transition probability $P'(t_i \mid t_j)$ between every two words in the corpus.
Preferably, the transition probability calculation step includes: calculating the transition probability between every two words in the corpus according to the formula $P'(t_i \mid t_j) = (c(t_i, t_j) + \theta) / (c(t_j) + v)$, where $\theta$ is a constant between 0 and 1; $c(t_j)$ is the number of occurrences of word $t_j$ in the corpus; $c(t_i, t_j)$ is the number of times the two words $t_i$ and $t_j$ appear adjacent to each other; and $v$ is the number of all adjacent word combinations in the corpus.
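The smoothed transition probability formula above can be sketched as follows. This is a minimal sketch under assumptions the text does not settle: sentences are split on whitespace, $t_j$ is taken to be the word immediately preceding $t_i$, and $v$ is taken as the number of distinct adjacent word pairs. The function name and the default value of `theta` are hypothetical.

```python
from collections import Counter


def transition_probabilities(sentences, theta=0.5):
    """Build P'(t_i | t_j) = (c(t_i, t_j) + theta) / (c(t_j) + v) from a corpus."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        # Pairs never cross sentence boundaries: the sentence is the unit.
        bigrams.update(zip(words, words[1:]))
    v = len(bigrams)  # assumption: number of distinct adjacent word pairs

    def prob(t_i, t_j):
        # Assumption: t_j immediately precedes t_i.
        return (bigrams[(t_j, t_i)] + theta) / (unigrams[t_j] + v)

    return prob
```

Because of the additive constant, a word pair that never occurs in the corpus still receives a small non-zero probability, which keeps an unseen correction from being ruled out entirely.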
According to a third aspect of the present invention, a candidate set calculation device for text input is provided, including:
An extraction module for extracting error-correction query pairs from a user log and establishing an error-correction string pair for each query pair, where an error-correction query pair is the correspondence between erroneously input text content and the correctly input text content, and an error-correction string pair is the correspondence between the erroneously input character string and the correctly input character string within that query pair;
A candidate set calculation module for, when a character string in an input word $t_i$ matches an error-correction string pair, generating from the string pair a variant set $V = \{v_1, v_2, \dots, v_n\}$ of the word as the candidate set $C = \{c_1, c_2, \dots, c_n\}$ and calculating the corresponding output probabilities $P = \{p_1, p_2, \dots, p_n\}$.
Preferably, calculating the output probabilities of the set V in the candidate set calculation module includes: calculating the output probability of word $v_j$ according to the formula $p_j = r^{(l-\theta)} (1-r)^{\theta}$, where $l$ is the string length of the input word $t_i$, $r$ is the probability that a single character is input correctly, and $\theta$ is a constant between 0 and 1.
Preferably, the extraction module is further used to select those error-correction string pairs in which the erroneously input string and the correctly input string both fall below a predetermined edit distance.
More preferably, the extraction module is further used to count the occurrences of each error-correction string pair and to establish the pairs whose occurrence count exceeds a predetermined threshold as the final error-correction string pairs.
Preferably, the device further includes: a rewrite probability calculation module for calculating, from log mining results, the probability $P_h$ of each type of character rewrite, where a character rewrite is the miswriting, omission, or superfluous insertion of a single character; and
The candidate set calculation module is further used to obtain the set $U = \{u_1, u_2, \dots, u_m\}$ of all words whose edit distance from the word $t_i$ is less than a specific edit distance and to calculate the corresponding output probabilities $P = \{p_{n+1}, p_{n+2}, \dots, p_{n+m}\}$, and to merge the set V with the set U, thereby obtaining the candidate set $C = \{c_1, c_2, \dots, c_n, c_{n+1}, \dots, c_{n+m}\}$ of the word $t_i$ and the corresponding output probabilities $P = \{p_1, p_2, \dots, p_n, p_{n+1}, p_{n+2}, \dots, p_{n+m}\}$.
Preferably, calculating the output probabilities of the set U in the candidate set calculation module includes: calculating the output probabilities $P = \{p_{n+1}, p_{n+2}, \dots, p_{n+m}\}$ from the character rewrite probability $P_h$ corresponding to each character operation on the edit conversion path between $t_i$ and the word $u_j$ in the word set U.
Preferably, calculating the output probabilities of the word set U in the candidate set calculation module includes: calculating the output probability of word $u_j$ according to the formula $p_j = r^{(l-k)} (1-r)^{k} \cdot \prod_{m=1}^{k} p_{h_m}$, where $l$ is the string length of the word $t_i$, $k$ is the number of edit steps from $t_i$ to $u_j$, $p_{h_m}$ is the character rewrite probability of the m-th edit step, and $r$ is the probability that a single character is input correctly.
According to a fourth aspect of the present invention, a programmable device is provided, including a memory and a processor, wherein the memory stores instructions that control the processor to perform the candidate set calculation method described above.
According to a fifth aspect of the present invention, an input error correction device is provided, including:
A transition probability calculation module for calculating the state transition probabilities P' of the sentences in a corpus;
An input module for inputting a sentence;
A segmentation module for dividing the sentence into words $t_i$;
The candidate set calculation device described above, for calculating the candidate set C and its output probabilities P for each segmented word $t_i$;
An error correction path calculation module for calculating, from the output probabilities P and the transition probabilities P', the optimal error correction path and its probability $p_l$, as well as the probability $p_0$ of the original input path;
A judging module for judging whether the optimal error correction path is equal to the original input path, wherein
if the judging module judges that the optimal error correction path is equal to the original input path, the original input sentence is returned; and wherein
if the judging module judges that the optimal error correction path is not equal to the original input path, the difference between the probability $p_l$ of the optimal error correction path and the probability $p_0$ of the original path is calculated; if the difference is greater than a predetermined difference threshold, the error correction result corresponding to the optimal error correction path is returned; otherwise, the original input sentence is returned.
Preferably, the transition probability calculation module is used to calculate, taking the sentence as the unit, the transition probability $P'(t_i \mid t_j)$ between every two words in the corpus.
Preferably, the transition probability calculation module calculates the transition probability between every two words in the corpus according to the formula $P'(t_i \mid t_j) = (c(t_i, t_j) + \theta) / (c(t_j) + v)$, where $\theta$ is a constant between 0 and 1; $c(t_j)$ is the number of occurrences of word $t_j$ in the corpus; $c(t_i, t_j)$ is the number of times the two words $t_i$ and $t_j$ appear adjacent to each other; and $v$ is the number of all adjacent word combinations in the corpus.
According to a sixth aspect of the present invention, a programmable device is provided, including a memory and a processor, wherein the memory stores instructions that control the processor to perform the input error correction method described above.
The inventors of the present invention found that the prior art has not proposed a candidate set calculation method, or a corresponding error correction method, for input with particular language habits and characteristics. The technical task to be achieved or the technical problem to be solved by the present invention was therefore never conceived of or anticipated by those skilled in the art, and the present invention is thus a new technical solution.
Other features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the present invention and, together with the description, serve to explain its principles.
Fig. 1 is a block diagram showing the hardware configuration of a computer system 1000 that can implement embodiments of the present invention.
Fig. 2 is a flow chart of a candidate set calculation method for text input according to an embodiment of the present invention.
Fig. 3 is a block diagram of a candidate set calculation device according to an embodiment of the present invention.
Fig. 4 is a flow chart of a transition probability calculation method according to an embodiment of the present invention.
Fig. 5 is a flow chart of an input error correction method according to an embodiment of the present invention.
Fig. 6 is a block diagram of an input error correction device according to an embodiment of the present invention.
Detailed description of the embodiments
Various exemplary embodiments of the present invention are now described in detail with reference to the accompanying drawings. Note that, unless specifically stated otherwise, the relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the invention.
The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the invention, its application, or its uses.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail but, where appropriate, should be regarded as part of the specification.
In all examples shown and discussed here, any specific value should be interpreted as merely exemplary and not as a limitation. Other examples of the exemplary embodiments may therefore have different values.
Note that similar reference numerals and letters denote similar items in the following drawings; once an item is defined in one drawing, it need not be discussed further in subsequent drawings.
<Hardware configuration>
Fig. 1 is a block diagram showing the hardware configuration of a computer system 1000 that can implement embodiments of the present invention.
As shown in Fig. 1, the computer system 1000 includes a computer 1110. The computer 1110 includes a processing unit 1120, a system memory 1130, a fixed non-volatile memory interface 1140, a removable non-volatile memory interface 1150, a user input interface 1160, a network interface 1170, a video interface 1190, and a peripheral interface 1195, all connected via a system bus 1121.
The system memory 1130 includes ROM (read-only memory) and RAM (random access memory). The BIOS (basic input/output system) resides in the ROM. The operating system, application programs, other program modules, and some program data reside in the RAM.
A fixed non-volatile memory such as a hard disk is connected to the fixed non-volatile memory interface 1140. The fixed non-volatile memory may store, for example, the operating system, application programs, other program modules, and some program data.
Removable non-volatile memories such as a floppy disk drive and a CD-ROM drive are connected to the removable non-volatile memory interface 1150. For example, a floppy disk can be inserted into the floppy disk drive, and a CD (compact disc) can be inserted into the CD-ROM drive.
Input devices such as a mouse and a keyboard are connected to the user input interface 1160.
The computer 1110 can be connected to a remote computer 1180 through the network interface 1170. For example, the network interface 1170 can be connected to the remote computer over a local area network. Alternatively, the network interface 1170 can be connected to a modem (modulator-demodulator), and the modem is connected to the remote computer 1180 via a wide area network.
The remote computer 1180 may include a memory such as a hard disk, which can store remote application programs.
The video interface 1190 is connected to a monitor.
The peripheral interface 1195 is connected to a printer and a loudspeaker.
The computer system shown in Fig. 1 is merely illustrative and is in no way intended to limit the invention, its application, or its uses.
<First embodiment>
According to the first embodiment of the present invention, a candidate set calculation method for text input is provided, as shown in Fig. 2, comprising the following steps.
First, in step S2100, the user log is mined. The user log may be selected for a specific user group with specific input habits, for example the log of users of a minority language. In particular, Indian English mixes several languages: users typically transliterate words from other native Indian languages into Latin letters, which often yields several spellings that all express the same meaning, because such spellings follow the phonetics of the word's pronunciation. For this kind of situation, the user log for Indian English can be selected and mined, so as to obtain a candidate set matched to Indian English input habits. The user log may also be updated dynamically online; accordingly, the selected log of the specific user group can be mined periodically, at a predetermined statistics period. For example, the user log for Indian English can be mined periodically to update the candidate set matched to Indian English input habits.
In step S2200, the extraction step, error-correction query pairs are extracted from the user log and an error-correction string pair is established for each query pair. An error-correction query pair is the correspondence between erroneously input text content and the correctly input text content; an error-correction string pair is the correspondence between the erroneously input character string and the correctly input character string within that query pair. The extraction step further includes selecting those error-correction string pairs in which the erroneously input string and the correctly input string both fall below a predetermined edit distance. Here, edit distance means the minimum number of edit operations needed to turn one character string into another. The predetermined edit distance can be chosen to suit the application scenario; for example, if the number of edits between the strings is expected to be at most 2, the predetermined edit distance can be set to 2. The occurrences of each error-correction string pair are then counted, and the pairs whose occurrence count exceeds a predetermined threshold are established as the final error-correction string pairs. The predetermined threshold is an integer greater than 0 and can be chosen according to the application scenario and practical experience.
Taking the characteristics of Indian English as an illustration: from the phonetic regularities of how Indian English words are spelled, some obvious error-correction string pairs can be obtained, as shown in Table 1 below.
Error-correction query pair | Error-correction short string pair |
ashiqui 2->aashiqui 2 | a->aa |
tere nam->tere naam | a->aa |
khoobsurat->khubsurat | oo->u |
zaruri tha->zaroori tha | u->oo |
Table 1
Error-correction query pairs can be obtained in several ways. For example, the user log may first be filtered, candidate query pairs then found by methods such as edit distance, and the candidates finally confirmed by manual review. Edit distance is the minimum number of edit operations needed to turn one string into another. Error-correction string pairs such as aa->a are then extracted; one possible extraction procedure is as follows:
First, obtain the error-correction query pairs from the user log. Let p, q be one error-correction query pair, where p is the erroneous input and q is the corresponding correct input. Both p and q consist of a sequence of characters: $p = a_1 a_2 \dots a_n$, $q = b_1 b_2 \dots b_m$.
Let x = 1. If $a_x = b_x$, set x = x + 1. Iterate until $a_x \neq b_x$.
Let y = 0, where n is the string length of p and m is the string length of q. If $a_{n-y} = b_{m-y}$, set y = y + 1. Iterate until $a_{n-y} \neq b_{m-y}$.
Choose an edit distance upper limit b (preferably b <= 2). If 0 < n-y-x < b and 0 < m-y-x < b, a candidate error-correction string pair $[a_{x-1}, a_x, a_{x+1}, \dots, a_{n-y+1}] \to [b_{x-1}, b_x, b_{x+1}, \dots, b_{m-y+1}]$ is obtained and added to the candidate result set R.
After all error-correction query pairs have been scanned, each candidate error-correction string pair in R whose occurrence count is greater than a predetermined threshold becomes a final error-correction string pair. The predetermined threshold is an integer greater than 0 and can be chosen according to the application scenario and practical experience.
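The extraction procedure above can be sketched in a few lines. This is a rough sketch, not the patent's exact indexing: it strips the common prefix and common suffix of a query pair (0-based indices), keeps the differing core, and, as an assumption chosen so that the results reproduce the short string pairs of Table 1, adds one character of left context only when one side of the core is empty (a pure insertion or deletion). The function names, `max_len`, and `min_count` are hypothetical.

```python
from collections import Counter


def extract_string_pair(wrong, right, max_len=2):
    """Return (wrong_sub, right_sub) for one query pair, or None if unusable."""
    shortest = min(len(wrong), len(right))
    # x: length of the common prefix.
    x = 0
    while x < shortest and wrong[x] == right[x]:
        x += 1
    # y: length of the common suffix, not allowed to overlap the prefix.
    y = 0
    while y < shortest - x and wrong[len(wrong) - 1 - y] == right[len(right) - 1 - y]:
        y += 1
    wrong_core = wrong[x:len(wrong) - y]
    right_core = right[x:len(right) - y]
    if not wrong_core and not right_core:
        return None  # the two queries are identical
    if len(wrong_core) > max_len or len(right_core) > max_len:
        return None  # the difference is too large to be a short string pair
    # Assumption: for a pure insertion or deletion, keep one character of
    # left context so that neither side of the pair is empty.
    if (not wrong_core or not right_core) and x > 0:
        wrong_core = wrong[x - 1] + wrong_core
        right_core = right[x - 1] + right_core
    return wrong_core, right_core


def mine_string_pairs(query_pairs, min_count=1, max_len=2):
    """Keep the string pairs that occur more than min_count times."""
    counts = Counter()
    for wrong, right in query_pairs:
        pair = extract_string_pair(wrong, right, max_len)
        if pair is not None:
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n > min_count}
```

Applied to the four query pairs of Table 1, `extract_string_pair` yields a->aa twice, oo->u once, and u->oo once, so with `min_count=1` only a->aa survives the frequency filter.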
Preferably, the method may further include step S2300, a rewrite probability calculation step, which calculates from the results of mining the user log the probability $P_h$ of each type of character rewrite, where a character rewrite is the miswriting, omission, or superfluous insertion of a single character.
Step S2200 handles errors that span more than one character by extracting error-correction string pairs. Single-character errors, that is, cases where a single character is miswritten, omitted, or inserted superfluously, require the rewrite probability to be calculated separately.
In S2300, the probability $P_h$ with which each kind of character rewrite occurs in the user log can be counted on the basis of edit distance. In this step, the permitted edit operations are limited to replacing one character with another, inserting a character, and deleting a character. $P_h$ can be calculated by the following formula:
$P_h = \mathrm{count}(error_i) / \sum \mathrm{count}(error_i)$ (Formula 1)
Here the numerator $\mathrm{count}(error_i)$ is the number of times one specific character rewrite error occurs in the user log, and the denominator $\sum \mathrm{count}(error_i)$ is the total number of times all kinds of character rewrite errors occur in the user log.
These rewrite probabilities supplement the error-correction string pairs of step S2200, as illustrated in Table 2.
Delete letter o | 10 occurrences | Proportion 0.001 |
Insert letter j | 15 occurrences | Proportion 0.0015 |
Letter a miswritten as b | 20 occurrences | Proportion 0.002 |
Letter m miswritten as n | 20 occurrences | Proportion 0.002 |
... | ... | ... |
Table 2
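Formula 1 amounts to normalising the error counts. A minimal sketch, assuming that the single-character errors have already been recovered from the log and are represented as tuples such as `("delete", "o")` or `("replace", "a", "b")`; this representation and the function name are hypothetical.

```python
from collections import Counter


def rewrite_probabilities(observed_errors):
    """Map each kind of single-character rewrite to its share of all rewrites."""
    counts = Counter(observed_errors)
    total = sum(counts.values())
    return {error: n / total for error, n in counts.items()}
```

The resulting proportions sum to 1 over all observed rewrite types, which is the property that the proportion column of Table 2 reflects.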
Then, in step S2400, the candidate set calculation step, after an input word $t_i$ is received, $t_i$ is checked for a match against the error-correction string pairs. If it matches, the variant set $V = \{v_1, v_2, \dots, v_n\}$ of the word is generated from the matching string pairs.
In one embodiment, only error-correction string matching is applied to the word $t_i$; the set V is used as the candidate set $C = \{c_1, c_2, \dots, c_n\}$ and the corresponding output probabilities $P = \{p_1, p_2, \dots, p_n\}$ are calculated.
For example, the output probability of word $v_j$ can be calculated according to the formula:
$p_j = r^{(l-\theta)} (1-r)^{\theta}$ (Formula 2)
where $l$ is the string length of the input word $t_i$, $r$ is the probability that a single character is input correctly, and $\theta$ is a constant between 0 and 1.
$r$ can be obtained from user log statistics as $r = 1 - (\sum \mathrm{count}(error_i) / \text{total number of characters})$. $\theta$ is a parameter that can be chosen from historical experience or by experiment.
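Variant generation and Formula 2 can be sketched together. The sketch assumes that a string pair is applied at one matching position at a time, which the text does not specify; the function names are hypothetical.

```python
def generate_variants(word, string_pairs):
    """Apply each (wrong_sub, right_sub) pair at every single matching position."""
    variants = set()
    for wrong_sub, right_sub in string_pairs:
        start = word.find(wrong_sub)
        while start != -1:
            variants.add(word[:start] + right_sub + word[start + len(wrong_sub):])
            start = word.find(wrong_sub, start + 1)
    variants.discard(word)  # a variant identical to the input is not a correction
    return variants


def variant_probability(length, r, theta):
    """Formula 2: p_j = r^(l - theta) * (1 - r)^theta."""
    return r ** (length - theta) * (1 - r) ** theta
```

Under Formula 2 every variant of the same input word receives the same output probability, since the value depends only on the length of the word and on the parameters r and θ; the ranking among variants then comes from the transition probabilities.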
In another embodiment, the candidate set calculation step applies both error-correction string matching and rewrite probability calculation to the word $t_i$. The step additionally obtains the set $U = \{u_1, u_2, \dots, u_m\}$ of all words whose edit distance from $t_i$ is below a specific value (for example, edit distance <= 2) and calculates the corresponding output probabilities $P = \{p_{n+1}, p_{n+2}, \dots, p_{n+m}\}$. The set V and the set U are merged, thereby obtaining the candidate set $C = \{c_1, c_2, \dots, c_n, c_{n+1}, \dots, c_{n+m}\}$ of the word $t_i$ and the corresponding output probabilities $P = \{p_1, p_2, \dots, p_n, p_{n+1}, p_{n+2}, \dots, p_{n+m}\}$.
Preferably, the output probabilities $P = \{p_{n+1}, p_{n+2}, \ldots, p_{n+m}\}$ can be calculated from the character rewrite probabilities $P_h$ of the character operations on the edit conversion path between $t_i$ and each word $u_j$ in the word set $U$. The edit conversion path is the sequence of edit operations by which one word is converted into another under edit distance. For example, to convert laappy into happy, the edit conversion path is: delete a, replace l with h.
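One way to recover an edit conversion path is to fill the standard edit distance table and trace it back. The sketch below is an assumption about how this could be done, not the procedure prescribed by this document; the function name and the tuple layout are hypothetical. A minimal path is generally not unique, and the tie-breaking order here (match, delete, substitute, insert) is arbitrary.

```python
def edit_path(source, target):
    """Return one minimal list of (operation, source_char, target_char) edits."""
    n, m = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete
                             dist[i][j - 1] + 1,         # insert
                             dist[i - 1][j - 1] + cost)  # match or substitute
    # Trace back from the bottom-right corner to collect the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", source[i - 1], ""))
            i -= 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("substitute", source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        else:
            ops.append(("insert", "", target[j - 1]))
            j -= 1
    ops.reverse()
    return ops

path = edit_path("laappy", "happy")
```

For the laappy example, this returns a two-step path consisting of the substitution of l by h and the deletion of one a, the same operations as in the text.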
More preferably, the output probability of the word $u_j$ can be calculated according to the formula
$p_j = r^{(l-k)} (1-r)^{k} \prod_{m=1}^{k} p_{h_m}$ (Formula 3)
where $l$ is the string length of the word $t_i$; $k$ is the number of edit conversion steps from $t_i$ to $u_j$; $p_{h_m}$ is the character rewrite probability corresponding to the $m$-th edit operation; and $r$ is the probability that a single character is entered correctly.
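Formula 3 can be evaluated directly once the rewrite probabilities along the edit conversion path are known. In the sketch below, the function name, the value of $r$, and the two rewrite probabilities assigned to the laappy to happy path are hypothetical.

```python
import math

def candidate_output_probability(word_length, r, path_rewrite_probs):
    """Formula 3: p_j = r^(l - k) * (1 - r)^k * product of p_h over the k edits."""
    k = len(path_rewrite_probs)
    return r ** (word_length - k) * (1.0 - r) ** k * math.prod(path_rewrite_probs)

# laappy -> happy: two edits (replace l with h, delete a).
# The rewrite probabilities 0.002 and 0.001 are assumed values.
p_u = candidate_output_probability(word_length=6, r=0.95,
                                   path_rewrite_probs=[0.002, 0.001])
```

Longer edit paths are penalized twice: once by the factor $(1-r)^k$ and once by the product of the individual rewrite probabilities.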
Fig. 3 shows a block diagram of a candidate set calculation device 3000 according to an embodiment of the present invention. The candidate set calculation device 3000 can be used to implement the method shown in Fig. 2, so the overlapping parts are not described again in detail.
The candidate set calculation device 3000 includes an extraction module 3010 and a candidate set calculation module 3030, and preferably also includes a rewrite probability calculation module 3020.
The extraction module 3010 is configured to extract error correction query pairs from the user log and to establish an error correction string pair for each error correction query pair. An error correction query pair is the correspondence between incorrectly entered text and correctly entered text; an error correction string pair is the correspondence between the incorrectly entered string and the correctly entered string within an error correction query pair.
The candidate set calculation module 3030 is configured to, when a string in the input word $t_i$ matches an error correction string pair, generate the variant set $V = \{v_1, v_2, \ldots, v_n\}$ of the word from the error correction string pair as the candidate set $C = \{c_1, c_2, \ldots, c_n\}$ and calculate the corresponding output probabilities $P = \{p_1, p_2, \ldots, p_n\}$.
Preferably, calculating the output probabilities of the set $V$ in the candidate set calculation module includes calculating the output probability of the word $v_j$ according to the formula
$p_j = r^{(l-\theta)} (1-r)^{\theta}$ (Formula 2)
where $l$ is the string length of the input word $t_i$; $r$ is the probability that a single character is entered correctly; and $\theta$ is a constant between 0 and 1.
Preferably, the extraction module 3010 is further configured to select the error correction string pairs in which the edit distance between the incorrectly entered string and the correctly entered string is less than a predetermined edit distance, to count the occurrences of each error correction string pair, and to keep the pairs whose occurrence count exceeds a predetermined threshold as the final error correction string pairs. The predetermined threshold is an integer greater than 0, and a suitable value can be chosen according to the application scenario and practical experience.
In particular, the device also includes a rewrite probability calculation module 3020, configured to calculate, from the user log mining results, the probability $P_h$ of each kind of character rewrite, where a character rewrite is a miswritten, omitted, or extra single character; and
when the rewrite probability calculation module 3020 is included, the candidate set calculation module 3030 is further configured to obtain the set $U = \{u_1, u_2, \ldots, u_m\}$ of all words whose edit distance from the word $t_i$ is less than a specific edit distance, to calculate the corresponding output probabilities $P = \{p_{n+1}, p_{n+2}, \ldots, p_{n+m}\}$, and to merge the set $V$ and the set $U$ so as to obtain the candidate set $C = \{c_1, c_2, \ldots, c_n, c_{n+1}, \ldots, c_{n+m}\}$ of the word $t_i$ and the corresponding output probabilities $P = \{p_1, p_2, \ldots, p_n, p_{n+1}, p_{n+2}, \ldots, p_{n+m}\}$.
Preferably, calculating the output probabilities of the set $U$ in the candidate set calculation module 3030 includes calculating the output probabilities $P = \{p_{n+1}, p_{n+2}, \ldots, p_{n+m}\}$ from the character rewrite probabilities $P_h$ of the character operations on the edit conversion path between $t_i$ and each word $u_j$ in the word set $U$.
More preferably, calculating the output probabilities of the word set $U$ in the candidate set calculation module 3030 includes calculating the output probability of the word $u_j$ according to the formula
$p_j = r^{(l-k)} (1-r)^{k} \prod_{m=1}^{k} p_{h_m}$ (Formula 3)
where $l$ is the string length of the word $t_i$; $k$ is the number of edit conversion steps from $t_i$ to $u_j$; $p_{h_m}$ is the character rewrite probability corresponding to the $m$-th edit operation; and $r$ is the probability that a single character is entered correctly.
According to yet another embodiment of the present invention, a programmable device is also provided, including a memory and a processor, wherein the memory is used to store instructions, and the instructions are used to control the processor to operate so as to perform the method shown in Fig. 2.
The first embodiment of the present invention has been described above with reference to the drawings. According to this embodiment, user logs reflecting specific language habits are mined in a targeted way, and error correction string pairs are generated and character rewrite probabilities are calculated accordingly. During input matching and error correction, the variant set that satisfies the error correction string pairs and the character rewrites are included in the candidate set and the corresponding probabilities are calculated, which improves the ability to correct natural language input from specific groups with particular input habits. Good results were achieved in particular on Indian English (Hinglish): error correction accuracy is ensured, most error correction problems are covered, and the method also adapts well to correcting new words.
<Second embodiment>
According to the second embodiment of the present invention, as shown in Figs. 4 and 5, an input error correction method based on the method of the first embodiment is provided. The overlapping parts are therefore not described again in detail.
As shown in Fig. 4, the input error correction method according to this embodiment includes a transition probability calculation step. In traditional pattern recognition theory, user input can be regarded as a sequence of states. Calculating the transition probability between states means finding, from the corpus, the probability that two words form adjacent context. For example, suppose the English corpus is as follows:
it is over
How Sweet It Is
it is time to say goodbye
Ignoring letter case, the transition probability from it to is can then be calculated as $P(\text{is} \mid \text{it}) = 3/3 = 1$, and the transition probability from is to over as $P(\text{over} \mid \text{is}) = 1/3$.
The transition probability can be calculated as follows:
S4100: build a corpus $Y = \{s_1, s_2, \ldots, s_n\}$, where $s$ denotes a short sentence and $n$ denotes the size of the corpus; $s_i = \{t_1, t_2, \ldots, t_m\}$, where $t$ denotes a word. Also generate the global dictionary $D = \{t_1, t_2, \ldots, t_c\}$.
S4200: mark the beginning and end of each short sentence. For example, <s> and </s> can be placed at the beginning and end of each short sentence $s$ to identify the sentence start and sentence end and facilitate automatic recognition.
S4300: calculate the transition probability $P'(t_i \mid t_j)$ between every pair of words.
Preferably, the transition probability between every pair of words in the corpus is calculated according to the formula:
$P'(t_i \mid t_j) = (c(t_i, t_j) + \theta) / (c(t_j) + v)$ (Formula 4)
where $\theta$ is a constant between 0 and 1; $c(t_j)$ is the number of times the word $t_j$ occurs in the corpus; $c(t_i, t_j)$ is the number of times the two words $t_i$ and $t_j$ occur adjacent to each other; and $v$ is the number of distinct adjacent word combinations in the corpus.
For example, take the following corpus:
hello world
world peace
say hello world in python
Because world occurs 3 times in the corpus, $c(\text{world}) = 3$.
hello world occurs 2 times in the corpus, so $c(\text{hello}, \text{world}) = 2$.
The corpus contains five distinct adjacent word pairs, namely hello world, world peace, say hello, world in, and in python, so $v = 5$.
The transition probability $P'(t_i \mid t_j)$ between the words $t_i$ and $t_j$ is then calculated according to the formula $P'(t_i \mid t_j) = (c(t_i, t_j) + \theta) / (c(t_j) + v)$.
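The counts in this example and the smoothed probability of Formula 4 can be reproduced with a short Python sketch. Two points are assumptions of the sketch rather than statements of this document: adjacent pairs are counted in text order and the probability is conditioned on the preceding word (as in the it/is example), and the <s> and </s> markers of step S4200 are left out so that $v$ matches the value 5 given above. The value $\theta = 0.5$ and the function name are also hypothetical.

```python
from collections import Counter

corpus = ["hello world", "world peace", "say hello world in python"]

unigrams = Counter()  # c(t): occurrences of each word
bigrams = Counter()   # c(prev, next): occurrences of each ordered adjacent pair
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

v = len(bigrams)  # number of distinct adjacent word pairs

def transition_probability(prev_word, next_word, theta=0.5):
    """Formula 4, read as P'(next | prev) = (c(prev, next) + theta) / (c(prev) + v)."""
    return (bigrams[(prev_word, next_word)] + theta) / (unigrams[prev_word] + v)

p_hello_world = transition_probability("hello", "world")
```

Because of the additive constant $\theta$, a pair that never occurs in the corpus still receives a small nonzero probability, which keeps unseen word sequences from being ruled out entirely.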
After the transition probability calculation step shown in Fig. 4 and the error correction string extraction and character rewrite probability calculation steps shown in Fig. 2 have been carried out, real-time text input error correction can be provided to users online, as shown in Fig. 5. The method includes:
S5100, an input step, for inputting a sentence;
S5200, a segmentation step, for dividing the sentence into words $t_i$;
S5300, a candidate set calculation step, for calculating, by the method of the first embodiment, the candidate set $C$ and output probabilities $P$ of each segmented word $t_i$;
S5400, an error correction path calculation step, for calculating, from the output probabilities $P$ and the transition probabilities $P'$ obtained by the method shown in Fig. 4, the optimal error correction path and its probability $p_l$, as well as the probability $p_0$ of the original input path. The optimal error correction path is the error correction path selected by probability calculation from the candidate error correction paths obtained from the candidate sets $C$.
S5500, a judging step, for judging whether the optimal error correction path is equal to the original input path, wherein
if the judging step finds that the optimal error correction path is equal to the original input path, the originally input sentence is returned; and wherein
if the judging step finds that the optimal error correction path is not equal to the original input path, the difference between the probability $p_l$ of the optimal error correction path and the probability $p_0$ of the original path is calculated; if the difference is greater than a predetermined difference threshold, the error correction result corresponding to the optimal error correction path is returned; otherwise, the originally input sentence is returned. The predetermined difference threshold is a constant greater than or equal to 0, and a suitable value can be chosen for the application scenario from implementation experience or by usual optimization methods.
In step S5400, the optimal error correction path $l$ and its probability $p_l$ can preferably be calculated by the Viterbi method of the traditional hidden Markov model (HMM). The Viterbi method is a dynamic programming method well known in the prior art and is not described further here.
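Steps S5400 and S5500 can be sketched together as a Viterbi search followed by the threshold test. Everything concrete below is an assumption made for illustration: the function names, the toy output and transition probabilities, the fallback probability for unseen pairs, and the choice to include the original word in each candidate set so that $p_0$ can be computed.

```python
import math

def viterbi(emit, trans):
    """Return (best_path, probability) over per-position candidate dicts."""
    # best[w] = (log-probability of the best path ending in w, that path)
    best = {w: (math.log(p), [w]) for w, p in emit[0].items()}
    for candidates in emit[1:]:
        step = {}
        for w, p in candidates.items():
            score, path = max(
                ((s + math.log(trans(prev, w)), pth)
                 for prev, (s, pth) in best.items()),
                key=lambda item: item[0],
            )
            step[w] = (score + math.log(p), path + [w])
        best = step
    score, path = max(best.values(), key=lambda item: item[0])
    return path, math.exp(score)

def path_probability(path, emit, trans):
    """Probability of one fixed path, for example the original input."""
    prob = emit[0][path[0]]
    for i in range(1, len(path)):
        prob *= trans(path[i - 1], path[i]) * emit[i][path[i]]
    return prob

def correct(words, emit, trans, threshold):
    """S5500: keep the input unless the best path beats it by more than threshold."""
    best_path, p_l = viterbi(emit, trans)
    if best_path == words:
        return words
    p_0 = path_probability(words, emit, trans)
    return best_path if p_l - p_0 > threshold else words

# Hypothetical toy numbers for the input "helo world".
EMIT = [{"helo": 0.1, "hello": 0.6}, {"world": 0.9}]
TRANS = {("hello", "world"): 0.5, ("helo", "world"): 0.01}

def trans(prev, nxt):
    return TRANS.get((prev, nxt), 1e-6)  # small assumed floor for unseen pairs

result = correct(["helo", "world"], EMIT, trans, threshold=0.05)
```

With these numbers the best path has probability 0.27 and the original path 0.0009, so the correction is returned for a threshold of 0.05 and rejected for a threshold of 0.5. Log probabilities are used inside the search only to avoid underflow on longer sentences.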
In addition, as shown in Fig. 6, an input error correction device 6000 is also provided, including:
a transition probability calculation module 6060, for calculating the state transition probabilities $P'$ of sentences in the corpus;
an input module 6040, for inputting a sentence;
a segmentation module 6050, for dividing the sentence into words;
the candidate set calculation device 3000 shown in Fig. 3, for calculating the candidate set $C$ and output probabilities $P$ of each segmented word $t_i$;
an error correction path calculation module 6070, for calculating, from the output probabilities $P$ and the transition probabilities $P'$, the optimal error correction path and its probability $p_l$, as well as the probability $p_0$ of the original input path;
a judging module 6080, for judging whether the optimal error correction path is equal to the original input path, wherein
if the judging module finds that the optimal error correction path is equal to the original input path, the originally input sentence is returned; and wherein
if the judging module finds that the optimal error correction path is not equal to the original input path, the difference between the probability $p_l$ of the optimal error correction path and the probability $p_0$ of the original path is calculated; if the difference is greater than a predetermined difference threshold, the error correction result corresponding to the optimal error correction path is returned; otherwise, the originally input sentence is returned. The predetermined difference threshold is a constant greater than or equal to 0, and a suitable value can be chosen for the application scenario from implementation experience or by usual optimization methods.
According to yet another embodiment of the present invention, a programmable device is also provided, including a memory and a processor, wherein the memory is used to store instructions, and the instructions are used to control the processor to operate so as to perform the method shown in Fig. 5.
The second embodiment of the present invention has been described above with reference to the drawings. This embodiment provides a complete text error correction method and device. Offline, a corpus is built and the state transition probabilities of sentences are calculated; also offline, user logs reflecting specific language habits are mined in a targeted way, and error correction string pairs are generated and character rewrite probabilities are calculated accordingly. Online, after a user's query input is received, the variant set that satisfies the error correction string pairs and the character rewrites are included in the candidate set and the corresponding probabilities are calculated; the optimal error correction path is then calculated from the state transition probabilities computed offline, the candidate sets, and the corresponding probabilities. This scheme improves the ability to correct natural language input from specific groups with particular input habits. Good results were achieved in particular on Indian English (Hinglish): error correction accuracy is ensured, most error correction problems are covered, and the method also adapts well to correcting new words.
Those skilled in the art will appreciate that the candidate set calculation device and the text error correction device can be implemented in various ways. For example, they can be implemented by configuring a processor with instructions. For example, the instructions can be stored in ROM and, when the device starts, read from ROM into a programmable device to implement the candidate set calculation device and the text error correction device. For example, the candidate set calculation device and the text error correction device can be built into a dedicated device (such as an ASIC). The candidate set calculation device and the text error correction device can be divided into mutually independent units, or they can be implemented together. They can be implemented by one of the above implementations, or by a combination of two or more of them.
The present invention can be a system, a method, and/or a computer program product. The computer program product can include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement aspects of the present invention.
A computer-readable storage medium can be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium can be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove with instructions stored thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used here, is not to be construed as a transitory signal per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (for example, light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
The computer-readable program instructions described here can be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network can include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.
The computer program instructions for carrying out the operations of the present invention can be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions can execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, a field-programmable gate array (FPGA), or a programmable logic array (PLA), is personalized by using state information of the computer-readable program instructions, and this electronic circuitry can execute the computer-readable program instructions so as to implement aspects of the present invention.
Aspects of the present invention are described here with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions can also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, so that the computer-readable medium storing the instructions comprises an article of manufacture including instructions that implement aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions can also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other devices so as to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other devices implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the drawings show the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to multiple embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams can represent a module, program segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions noted in the blocks can occur in an order different from that noted in the drawings. For example, two consecutive blocks can in fact be executed substantially in parallel, or they can sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a special-purpose hardware-based system that performs the specified functions or acts, or by a combination of special-purpose hardware and computer instructions. It is well known to those skilled in the art that implementation in hardware, implementation in software, and implementation in a combination of software and hardware are all equivalent.
Embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used here was chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed here. The scope of the present invention is defined by the appended claims.
Claims (10)
1. A candidate set calculation method in text input, characterized in that it comprises the following steps:
an extraction step, for extracting error correction query pairs from a user log and establishing an error correction string pair for each error correction query pair, wherein an error correction query pair is the correspondence between incorrectly entered text and correctly entered text, and an error correction string pair is the correspondence between the incorrectly entered string and the correctly entered string within the error correction query pair;
a candidate set calculation step, for, when a string in the input word $t_i$ matches an error correction string pair, generating the variant set $V = \{v_1, v_2, \ldots, v_n\}$ of the word from the error correction string pair as the candidate set $C = \{c_1, c_2, \ldots, c_n\}$ and calculating the corresponding output probabilities $P = \{p_1, p_2, \ldots, p_n\}$.
2. The method according to claim 1, characterized in that calculating the output probabilities of the set $V$ in the candidate set calculation step comprises:
calculating the output probability of the word $v_j$ according to the formula $p_j = r^{(l-\theta)} (1-r)^{\theta}$, where $l$ is the string length of the input word $t_i$; $r$ is the probability that a single character is entered correctly; and $\theta$ is a constant between 0 and 1.
3. The method according to claim 1, characterized in that the method further comprises:
a rewrite probability calculation step, for calculating, from the user log mining results, the probability $P_h$ of each kind of character rewrite, where a character rewrite is a miswritten, omitted, or extra single character; and
the candidate set calculation step is further used for obtaining the set $U = \{u_1, u_2, \ldots, u_m\}$ of all words whose edit distance from the word $t_i$ is less than a predetermined edit distance, calculating the corresponding output probabilities $P = \{p_{n+1}, p_{n+2}, \ldots, p_{n+m}\}$, and merging the set $V$ and the set $U$ so as to obtain the candidate set $C = \{c_1, c_2, \ldots, c_n, c_{n+1}, \ldots, c_{n+m}\}$ of the word $t_i$ and the corresponding output probabilities $P = \{p_1, p_2, \ldots, p_n, p_{n+1}, p_{n+2}, \ldots, p_{n+m}\}$.
4. An input error correction method, characterized in that it comprises the following steps:
a transition probability calculation step, for calculating the state transition probabilities $P'$ of sentences in a corpus;
an input step, for inputting a sentence;
a segmentation step, for dividing the sentence into words $t_i$;
a candidate set calculation step, for calculating the candidate set $C$ and output probabilities $P$ of each segmented word $t_i$ by the method according to any one of claims 1-3;
an error correction path calculation step, for calculating, from the output probabilities $P$ and the transition probabilities $P'$, the optimal error correction path and its probability $p_l$, as well as the probability $p_0$ of the original input path;
a judging step, for judging whether the optimal error correction path is equal to the original input path, wherein
if the judging step finds that the optimal error correction path is equal to the original input path, the originally input sentence is returned; and wherein
if the judging step finds that the optimal error correction path is not equal to the original input path, the difference between the probability $p_l$ of the optimal error correction path and the probability $p_0$ of the original path is calculated; if the difference is greater than a predetermined difference threshold, the error correction result corresponding to the optimal error correction path is returned; otherwise, the originally input sentence is returned.
5. A candidate set calculation device in text input, comprising:
an extraction module, for extracting error correction query pairs from a user log and establishing an error correction string pair for each error correction query pair, wherein an error correction query pair is the correspondence between incorrectly entered text and correctly entered text, and an error correction string pair is the correspondence between the incorrectly entered string and the correctly entered string within the error correction query pair;
a candidate set calculation module, for, when a string in the input word $t_i$ matches an error correction string pair, generating the variant set $V = \{v_1, v_2, \ldots, v_n\}$ of the word from the error correction string pair as the candidate set $C = \{c_1, c_2, \ldots, c_n\}$ and calculating the corresponding output probabilities $P = \{p_1, p_2, \ldots, p_n\}$.
6. The device according to claim 5, characterized in that calculating the output probabilities of the set $V$ in the candidate set calculation module comprises:
calculating the output probability of the word $v_j$ according to the formula $p_j = r^{(l-\theta)} (1-r)^{\theta}$, where $l$ is the string length of the input word $t_i$; $r$ is the probability that a single character is entered correctly; and $\theta$ is a constant between 0 and 1.
7. The device according to claim 5, characterized in that it further comprises:
a rewrite probability calculation module, for calculating, from the user log mining results, the probability $P_h$ of each kind of character rewrite, where a character rewrite is a miswritten, omitted, or extra single character; and
the candidate set calculation module is further used for obtaining the set $U = \{u_1, u_2, \ldots, u_m\}$ of all words whose edit distance from the word $t_i$ is less than a specific edit distance, calculating the corresponding output probabilities $P = \{p_{n+1}, p_{n+2}, \ldots, p_{n+m}\}$, and merging the set $V$ and the set $U$ so as to obtain the candidate set $C = \{c_1, c_2, \ldots, c_n, c_{n+1}, \ldots, c_{n+m}\}$ of the word $t_i$ and the corresponding output probabilities $P = \{p_1, p_2, \ldots, p_n, p_{n+1}, p_{n+2}, \ldots, p_{n+m}\}$.
8. A programmable device, comprising a memory and a processor, wherein the memory is used to store instructions, and the instructions are used to control the processor to operate so as to perform the method according to any one of claims 1-3.
9. An input error correction device, characterized in that it comprises:
a transition probability calculation module, for calculating the state transition probabilities $P'$ of sentences in a corpus;
an input module, for inputting a sentence;
a segmentation module, for dividing the sentence into words $t_i$;
the candidate set calculation device according to any one of claims 5-7, for calculating the candidate set $C$ and output probabilities $P$ of each segmented word $t_i$;
an error correction path calculation module, for calculating, from the output probabilities $P$ and the transition probabilities $P'$, the optimal error correction path and its probability $p_l$, as well as the probability $p_0$ of the original input path;
a judging module, for judging whether the optimal error correction path is equal to the original input path, wherein
if the judging module finds that the optimal error correction path is equal to the original input path, the originally input sentence is returned; and wherein
if the judging module finds that the optimal error correction path is not equal to the original input path, the difference between the probability $p_l$ of the optimal error correction path and the probability $p_0$ of the original path is calculated; if the difference is greater than a predetermined difference threshold, the error correction result corresponding to the optimal error correction path is returned; otherwise, the originally input sentence is returned.
10. A programmable device, comprising a memory and a processor, wherein the memory is used to store instructions, and the instructions are used to control the processor to operate so as to perform the method according to claim 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610020331.0A CN106959977A (en) | 2016-01-12 | 2016-01-12 | Candidate collection computational methods and device, word error correction method and device in word input |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106959977A (en) | 2017-07-18 |
Family
ID=59481421
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610020331.0A Pending CN106959977A (en) | 2016-01-12 | 2016-01-12 | Candidate collection computational methods and device, word error correction method and device in word input |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106959977A (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060048055A1 (en) * | 2004-08-25 | 2006-03-02 | Jun Wu | Fault-tolerant romanized input method for non-roman characters |
CN101133411A (en) * | 2004-08-25 | 2008-02-27 | Google公司 | Fault-tolerant romanized input method for non-roman characters |
CN101241514B (en) * | 2008-03-21 | 2014-11-05 | 北京搜狗科技发展有限公司 | Method for creating error-correcting database, automatic error correcting method and system |
CN101350004A (en) * | 2008-09-11 | 2009-01-21 | 北京搜狗科技发展有限公司 | Method for forming personalized error correcting model and input method system of personalized error correcting |
CN102156551A (en) * | 2011-03-30 | 2011-08-17 | 北京搜狗科技发展有限公司 | Method and system for correcting error of word input |
US20140188460A1 (en) * | 2012-10-16 | 2014-07-03 | Google Inc. | Feature-based autocorrection |
CN104298672A (en) * | 2013-07-16 | 2015-01-21 | 北京搜狗科技发展有限公司 | Error correction method and device for input |
US9037967B1 (en) * | 2014-02-18 | 2015-05-19 | King Fahd University Of Petroleum And Minerals | Arabic spell checking technique |
CN107102746A (en) * | 2016-02-19 | 2017-08-29 | 北京搜狗科技发展有限公司 | Candidate word generation method, device and the device generated for candidate word |
Non-Patent Citations (3)
Title |
---|
Daniel Jurafsky, James H. Martin: "Speech and Language Processing" * |
FANDYWANG: "Stanford University Natural Language Processing, Lesson 5: Spelling Correction" * |
Fridrich: "Steganography in Digital Media: Principles, Algorithms, and Applications", 30 April 2014 * |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109426357A (en) * | 2017-09-01 | 2019-03-05 | 百度在线网络技术(北京)有限公司 | Data inputting method and device |
CN109426357B (en) * | 2017-09-01 | 2023-05-12 | 百度在线网络技术(北京)有限公司 | Information input method and device |
US10839794B2 (en) | 2017-09-29 | 2020-11-17 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for correcting input speech based on artificial intelligence, and storage medium |
CN107678561A (en) * | 2017-09-29 | 2018-02-09 | 百度在线网络技术(北京)有限公司 | Phonetic entry error correction method and device based on artificial intelligence |
CN108197317A (en) * | 2018-02-01 | 2018-06-22 | 科大讯飞股份有限公司 | Document key message extraction system test method and device |
CN108519973A (en) * | 2018-03-29 | 2018-09-11 | 广州视源电子科技股份有限公司 | Character spelling detection method, system, computer equipment and storage medium |
CN108563632A (en) * | 2018-03-29 | 2018-09-21 | 广州视源电子科技股份有限公司 | Method, system, computer device and storage medium for correcting character spelling errors |
CN108491392A (en) * | 2018-03-29 | 2018-09-04 | 广州视源电子科技股份有限公司 | Method, system, computer device and storage medium for correcting character spelling errors |
CN110889028A (en) * | 2018-08-15 | 2020-03-17 | 北京嘀嘀无限科技发展有限公司 | Corpus processing and model training method and system |
CN109376362A (en) * | 2018-11-30 | 2019-02-22 | 武汉斗鱼网络科技有限公司 | A kind of the determination method and relevant device of corrected text |
CN111353025B (en) * | 2018-12-05 | 2024-02-27 | 阿里巴巴集团控股有限公司 | Parallel corpus processing method and device, storage medium and computer equipment |
CN111353025A (en) * | 2018-12-05 | 2020-06-30 | 阿里巴巴集团控股有限公司 | Parallel corpus processing method and device, storage medium and computer equipment |
CN109885180A (en) * | 2019-02-21 | 2019-06-14 | 北京百度网讯科技有限公司 | Error correction method and device, computer-readable medium |
CN109977415A (en) * | 2019-04-02 | 2019-07-05 | 北京奇艺世纪科技有限公司 | A kind of text error correction method and device |
CN111797614A (en) * | 2019-04-03 | 2020-10-20 | 阿里巴巴集团控股有限公司 | Text processing method and device |
CN111797614B (en) * | 2019-04-03 | 2024-05-28 | 阿里巴巴集团控股有限公司 | Text processing method and device |
CN112445953A (en) * | 2019-08-14 | 2021-03-05 | 阿里巴巴集团控股有限公司 | Information search error correction method, computing device and storage medium |
CN111339757A (en) * | 2020-02-13 | 2020-06-26 | 上海凯岸信息科技有限公司 | Error correction method for voice recognition result in collection scene |
CN112528980A (en) * | 2020-12-16 | 2021-03-19 | 北京华宇信息技术有限公司 | OCR recognition result correction method and terminal and system thereof |
CN112528980B (en) * | 2020-12-16 | 2022-02-15 | 北京华宇信息技术有限公司 | OCR recognition result correction method and terminal and system thereof |
CN112861518A (en) * | 2020-12-29 | 2021-05-28 | 科大讯飞股份有限公司 | Text error correction method and device, storage medium and electronic device |
CN112861518B (en) * | 2020-12-29 | 2023-12-01 | 科大讯飞股份有限公司 | Text error correction method and device, storage medium and electronic device |
CN114168808A (en) * | 2021-11-22 | 2022-03-11 | 中核核电运行管理有限公司 | Regular expression-based document character string coding identification method and device |
CN115659958A (en) * | 2022-12-27 | 2023-01-31 | 中南大学 | Chinese spelling error checking method |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | Effective date of registration: 20200528. Address after: 310051 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province. Applicant after: Alibaba (China) Co.,Ltd. Address before: 510627 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping B radio 14 floor tower square. Applicant before: GUANGZHOU UCWEB COMPUTER TECHNOLOGY Co.,Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170718 |