CN107220381A - Input text automatic error correction method for question answering systems - Google Patents

Input text automatic error correction method for question answering systems

Info

Publication number
CN107220381A
Authority
CN
China
Prior art keywords
phonetic
noun
field
current
field noun
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710507529.6A
Other languages
Chinese (zh)
Other versions
CN107220381B (en)
Inventor
江艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Cloud Network Technology Co Ltd
Original Assignee
Nanjing Cloud Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Cloud Network Technology Co Ltd
Priority to CN201710507529.6A
Publication of CN107220381A
Application granted
Publication of CN107220381B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems

Abstract

The invention provides an automatic input-text error correction method for question answering systems. The method comprises: 1) defining a table of common-error pinyin syllables; 2) building a domain noun dictionary; 3) constructing the domain noun dictionary of step 2) with a deterministic finite automaton; 4) receiving the text input by a user, the text comprising Chinese characters or pinyin, converting it to pinyin with a Chinese-character-to-pinyin tool, and then using backward maximum matching to segment all legal or illegal pinyin other than single-syllable pinyin; 5) matching the pinyin segmented in step 4), one by one, against the domain noun dictionary constructed in step 3), using a fuzzy matching algorithm combined with the common-error pinyin syllable table defined in step 1); 6) scoring each matched domain noun; 7) outputting the domain noun with the highest score. By automatically correcting the question, the invention improves the precision and accuracy of the question answering system and improves the user experience.

Description

Input text automatic error correction method for question answering systems
Technical field
The present invention relates to the field of information retrieval and querying, and in particular to an automatic input-text error correction method for question answering systems.
Background
A question answering (QA) system is an advanced form of information retrieval system that answers questions posed by users in natural language with accurate, concise natural language. Research on QA systems arose mainly from the demand for obtaining information quickly and accurately. QA system design has focused primarily on improving the quality of the answers returned for a user's question, rather than on checking and correcting the user's input.
However, when users pose questions to a question answering system, input errors inevitably occur, and these errors can greatly affect the accuracy of the system's output. Input errors lower the quality of the answers obtained, which reduces the accuracy of the question answering system and degrades the user experience.
In search applications, pinyin retrieval lets users bypass the input method and search directly with raw pinyin, which changes search behavior to some extent. In input method applications, pinyin error correction automatically identifies and corrects erroneous strings typed by the user, ensuring correct Chinese character output and improving the fault tolerance of the input method. Pinyin error correction can effectively improve the extensibility of an application and its user experience.
In view of this, there is an urgent need for an input text automatic error correction method that can solve the above problems.
Summary of the invention
The purpose of the present invention is to solve the problem that errors in the questions users input to existing question answering systems lead to unsatisfactory answers and to low precision and accuracy. To this end, it provides an automatic input-text error correction method for question answering systems that automatically corrects the user's input errors and takes the user's own regional pronunciation errors into account, thereby improving question answering performance.
To achieve the above object, the invention provides an automatic input-text error correction method for question answering systems. The method comprises the following steps:
1) define a common-error pinyin syllable table;
2) build a domain noun dictionary, the domain nouns including nouns and noun phrases;
3) construct the domain noun dictionary of step 2) using a deterministic finite automaton,
The deterministic finite automaton is defined as R = (Q, A, δ, q0, F), where Q is the set of states, A is the set of input pinyin, δ: Q × A → Q is the mapping from the direct product Q × A to Q, q0 ∈ Q is the initial state, and F ⊆ Q is the set of final states;
4) receive the text input by the user, the text including Chinese characters or pinyin; convert it to pinyin using a Chinese-character-to-pinyin tool; then use backward maximum matching to segment all legal or illegal pinyin other than single-syllable pinyin;
5) match the pinyin segmented in step 4), one by one, against the domain noun dictionary constructed in step 3) using a fuzzy matching algorithm, and record the edit distance of the pinyin during matching; then adjust the edit distance of pinyin that fall under a common error according to the common-error pinyin syllable table of step 1); continue matching while the edit distance, or the adjusted edit distance, is less than a set threshold β, so as to match the longest domain noun; finally take all Chinese character combinations along the path;
6) score the one or more domain nouns formed by all the Chinese character combinations in step 5);
7) output the domain noun with the highest score in step 6).
Further, the common-error pinyin syllable table in step 1) includes pinyin with an error at the beginning of the syllable and pinyin with an error at the end of the syllable. Errors at the beginning include: confusing n and l, confusing h and f, and confusing the flat-tongue initials z, c, s with the retroflex initials zh, ch, sh. Errors at the end include: confusing the front nasal finals an, en, in with the back nasal finals ang, eng, ing.
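One possible way to encode the common-error pinyin syllable table is sketched below. This is a minimal illustration, not the patent's implementation: the names `INITIAL_CONFUSIONS`, `FINAL_CONFUSIONS`, and `is_common_confusion` are hypothetical, and the check assumes that a "common error" means two syllables that differ by exactly one of the listed initial or final confusions.

```python
# Hypothetical encoding of the common-error pinyin syllable table.
# The pairs come from the text; all names here are illustrative assumptions.
INITIAL_CONFUSIONS = [("n", "l"), ("h", "f"), ("z", "zh"), ("c", "ch"), ("s", "sh")]
FINAL_CONFUSIONS = [("an", "ang"), ("en", "eng"), ("in", "ing")]


def is_common_confusion(typed: str, target: str) -> bool:
    """Return True if the two syllables differ only by one listed confusion."""
    if typed == target:
        return False
    # Errors at the beginning of the syllable: swap one initial for the other.
    for a, b in INITIAL_CONFUSIONS:
        for x, y in ((a, b), (b, a)):
            if (typed.startswith(x) and target.startswith(y)
                    and typed[len(x):] == target[len(y):]):
                return True
    # Errors at the end of the syllable: swap one nasal final for the other.
    for a, b in FINAL_CONFUSIONS:
        for x, y in ((a, b), (b, a)):
            if (typed.endswith(x) and target.endswith(y)
                    and typed[:-len(x)] == target[:-len(y)]):
                return True
    return False
```

Under these assumptions, `is_common_confusion("zi", "zhi")` is true (z and zh), while `is_common_confusion("zi", "shen")` is false.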
Further, step 2) specifically includes:
201) sort the domain nouns by word frequency;
202) reject the domain nouns whose word frequency is less than a predetermined word frequency threshold α, and add the domain nouns whose word frequency is greater than the threshold α to the domain noun dictionary.
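Steps 201) and 202) can be sketched as a sort followed by a filter. The function name `build_domain_dictionary` and the sample counts are assumptions made for illustration. The text does not say what happens when a frequency equals α, so this sketch drops such nouns.

```python
def build_domain_dictionary(word_freq, alpha):
    """Return domain nouns with frequency above alpha, most frequent first."""
    # Step 201: sort the domain nouns by word frequency (descending here).
    ranked = sorted(word_freq.items(), key=lambda item: item[1], reverse=True)
    # Step 202: keep only nouns whose frequency exceeds the threshold alpha.
    # Assumption: a frequency exactly equal to alpha is rejected.
    return [noun for noun, freq in ranked if freq > alpha]


# Made-up counts, keyed by pinyin purely for illustration.
counts = {"zhi shi": 50, "zhi hui": 3, "zhi shi chan quan": 20}
dictionary = build_domain_dictionary(counts, alpha=5)
```

With these made-up counts, `dictionary` keeps "zhi shi" and "zhi shi chan quan" and drops "zhi hui".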
Further, step 3) specifically includes:
301) read the domain nouns of step 2) one by one;
302) judge whether the domain noun first-character pinyin store contains the first-character pinyin of the current domain noun, that is, whether a tree for the current first-character pinyin already exists; if so, perform step 303); if not, perform step 304);
303) if the first-character pinyin store contains the first-character pinyin of the current domain noun, that is, a tree for the current first-character pinyin already exists, find the existing first-character pinyin tree and obtain its head node;
304) add the first-character pinyin of the current domain noun to the first-character pinyin store, build a domain noun pinyin tree, and put the current pinyin into a node;
305) judge whether the current pinyin is an end node, that is, whether it is the pinyin of the last character of the domain noun; if so, perform step 311); if not, perform step 306);
306) if the current pinyin is not an end node in step 305), that is, it is not the pinyin of the last character of the domain noun, judge whether the flag bit of the current node has already been set to 1; if so, perform step 308); if not, perform step 307);
307) if the pinyin flag bit was not previously set to 1 in step 306), set the pinyin flag bit to 0 here, then perform step 308);
308) read the pinyin of the next character of the current domain noun;
309) judge whether the child nodes of the current node from step 303) or step 304) include the pinyin of step 308); if so, jump to step 305); if not, perform step 310);
310) add the pinyin of step 308) as a child node of the current node, then jump to step 305);
311) if the current pinyin is an end node in step 305), that is, it is the pinyin of the last character of the domain noun, set the flag bit of the pinyin node to 1;
312) judge whether all domain nouns have been read; if so, end the process; if not, return to step 301) and continue reading the next domain noun.
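Steps 301) to 312) amount to inserting each noun's pinyin sequence into a forest of prefix trees keyed by the first-character pinyin. The sketch below is one hedged reading of that procedure. `PinyinNode` and `build_forest` are hypothetical names, nouns are given as lists of pinyin syllables, and the Chinese characters attached to each path are omitted for brevity.

```python
class PinyinNode:
    """One pinyin syllable in a domain noun tree (illustrative structure)."""

    def __init__(self, syllable):
        self.syllable = syllable
        self.children = {}   # next syllable -> PinyinNode
        self.is_end = 0      # flag bit: 1 if a domain noun ends at this node


def build_forest(nouns):
    """Build one tree per first-character pinyin from lists of syllables."""
    roots = {}  # plays the role of the first-character pinyin store
    for syllables in nouns:                      # steps 301 and 312
        first = syllables[0]
        node = roots.get(first)                  # steps 302 and 303
        if node is None:                         # step 304
            node = PinyinNode(first)
            roots[first] = node
        for syllable in syllables[1:]:           # step 308
            child = node.children.get(syllable)  # step 309
            if child is None:                    # step 310
                child = PinyinNode(syllable)
                node.children[syllable] = child
            node = child
        # Step 311: mark the last syllable. The flag is never reset to 0 later,
        # which has the same effect as the guard in steps 306 and 307.
        node.is_end = 1
    return roots


forest = build_forest([
    ["zhi", "shi"],
    ["zhi", "shi", "chan", "quan"],
    ["zhi", "hui"],
])
```

Because new nodes start with `is_end = 0` and the flag is only ever raised, inserting the longer noun after the shorter one leaves the end flag on `shi` intact.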
Further, step 5) specifically includes:
501) receive the n pinyin segments obtained in step 4) by backward maximum matching;
502) assign initial values i = 1 and k = 1, where i is the index of the current pinyin in the sequence and k is the index of the current tree;
503) judge whether a k-th domain noun tree exists; if so, perform step 504); if not, perform step 505);
504) compute the edit distance dis between the current i-th pinyin and the first-character pinyin of the k-th tree, then perform step 506);
505) if the k-th domain noun tree does not exist, that is, all domain noun trees have been tried without a match, the current i-th pinyin is unmatched; move on to the next pinyin, restart matching from the 1st domain noun tree, and then perform step 504);
506) judge whether the currently matched pinyin is a common pinyin error; if so, perform step 507); if not, perform step 508);
507) if it is a common pinyin error, reduce the edit distance dis;
508) judge whether the edit distance dis from step 504) and step 507) is less than the set distance threshold β; if so, perform step 509); if not, perform step 512);
509) if the edit distance dis in step 508) is less than or equal to the set threshold β, record the matched pinyin string;
510) judge whether the flag bit of the current pinyin from step 504) is 1; if so, perform step 511); if not, perform step 514);
511) if the flag bit of the current pinyin in step 510) is 1, the pinyin of a complete domain noun has been matched; set the pinyin flag bit of the successfully matched domain noun to 1, record the matched domain noun pinyin, then perform step 514);
512) if the edit distance dis in step 508) is greater than the set threshold β, perform k++, abandon the current match and move on to match the first-character pinyin of the next tree, then perform step 513);
513) clear the pinyin string of the abandoned match, then jump to step 503) to match the next domain noun tree;
514) if the flag bit of the current pinyin in step 510) is 0, that is, it is not an end flag, perform i++ and continue matching the next pinyin: the next pinyin input by the user is compared with the next-level nodes of the current node of the domain noun tree;
515) judge whether i in step 514) is less than or equal to n, that is, whether the i-th pinyin is still within the pinyin sequence of the user's input text; if so, perform step 516); if not, perform step 517);
516) if i is less than or equal to n, that is, the pinyin sequence of the user's input text has not been fully matched, compute in turn the edit distance between each pinyin in the set of next states of the matched pinyin and the next pinyin converted from the input text, find the minimum edit distance and add it to the previous dis, then jump to step 506) to judge whether it is a common pinyin error, update the edit distance, and compare it with the distance threshold;
517) if i is greater than n, that is, the input pinyin sequence has been fully matched, end the matching process and perform step 6).
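The distance bookkeeping in steps 504) to 508) can be illustrated with a plain Levenshtein distance plus a discount for common errors. This is a sketch under stated assumptions: the text only says the distance is "reduced", so the fixed `discount` of 0.5 is an assumption, the common-error check is passed in as a hypothetical predicate, and `z_zh_only` is a toy stand-in for a full table lookup.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def adjusted_distance(typed, target, is_common_error, discount=0.5):
    """Steps 504 to 507: edit distance, reduced when the error is a common one."""
    dis = edit_distance(typed, target)
    if dis > 0 and is_common_error(typed, target):
        dis -= discount  # assumed amount; the text does not fix a value
    return dis


def z_zh_only(typed, target):
    """Toy predicate: treats only the zi/zhi confusion as a common error."""
    return {typed, target} == {"zi", "zhi"}


BETA = 2  # assumed distance threshold for step 508
d1 = adjusted_distance("zi", "zhi", z_zh_only)
d2 = adjusted_distance("zi", "shen", z_zh_only)
```

Here `d1` is 0.5, which stays within `BETA`, so matching would continue down the tree, while `d2` is 4, so the matcher would move on to the next tree.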
Further, step 6) specifically includes the following steps:
601) compute, one by one, the edit distance between the Chinese characters of each domain noun matched in step 5) and the Chinese characters of the user's input text, and score each noun by combining this Chinese-character edit distance with the pinyin edit distance from the matching process in step 5) and the length of the domain noun;
602) judge whether the score of the matched domain noun is greater than a set score threshold θ; if so, perform step 603); if not, perform step 604);
603) if so, remove the domain nouns whose score is less than or equal to the set threshold θ;
604) if not, retain the domain nouns whose score is greater than the set threshold θ.
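The text does not give a scoring formula; it only says that the score combines the Chinese-character edit distance, the pinyin edit distance, and the noun length. The sketch below uses one assumed combination (length minus both distances) purely to show the shape of steps 601) to 604) and the final selection. The function names and the candidate values are illustrative, not taken from the patent.

```python
def score(length, hanzi_dis, pinyin_dis):
    """Assumed formula: longer nouns and smaller distances score higher."""
    return length - hanzi_dis - pinyin_dis


def best_candidate(candidates, theta):
    """Keep candidates scoring above theta and return the best one, if any.

    Each candidate is (name, length, hanzi_dis, pinyin_dis).
    """
    scored = [(score(n, h, p), name) for name, n, h, p in candidates]
    kept = [(s, name) for s, name in scored if s > theta]  # steps 602 to 604
    return max(kept)[1] if kept else None                  # step 7


# Illustrative values only: a 4-character noun one edit away from the input,
# and a 2-character noun three edits away.
candidates = [
    ("zhi shi chan quan", 4, 1, 0.5),
    ("zhi shi", 2, 3, 0.5),
]
winner = best_candidate(candidates, theta=0)
```

With this assumed formula the longer, closer candidate scores 2.5 and the shorter one scores -1.5, so only the first survives the threshold.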
Beneficial effects of the present invention: compared with the prior art, this automatic input-text error correction method for question answering systems takes the user's regional pronunciation errors into account. It builds the domain dictionary with a finite automaton, organizing the domain nouns into multiple query trees, which greatly reduces the time complexity of the algorithm. It also uses fuzzy matching with an edit distance threshold and adjusts the edit distance according to common user pinyin errors. It can therefore correct both Chinese character errors and pinyin errors in the user's input, and the corrected question improves question answering performance and the user experience.
Brief description of the drawings
Fig. 1 is a flow chart of an input text automatic error correction method for question answering systems according to an embodiment of the present invention;
Fig. 2 is a flow chart of a method for building the domain noun dictionary with a deterministic finite automaton according to an embodiment of the present invention;
Fig. 3 is a flow chart of a fuzzy matching algorithm according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the principle of the deterministic finite automaton according to an embodiment of the present invention;
Fig. 5 is an example of a domain dictionary tree built according to an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solution, and advantages of the present invention clearer, the technical solution in the embodiments is described clearly and completely below with reference to the accompanying drawings. Obviously, the embodiments described below are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the invention.
Referring to Fig. 1, an embodiment of the present invention provides a flow chart of an input text automatic error correction method for question answering systems.
In step 101, a common-error pinyin syllable table is defined.
The common-error pinyin syllable table in this embodiment may include two classes: pinyin with an error at the beginning of the syllable, and pinyin with an error at the end of the syllable. Errors at the beginning may include: confusing n and l, confusing h and f, confusing flat-tongue z with retroflex zh, flat-tongue c with retroflex ch, and flat-tongue s with retroflex sh. Errors at the end may include: confusing front nasal an with back nasal ang, front nasal en with back nasal eng, front nasal in with back nasal ing, and so on.
In step 102, a domain noun dictionary is built; the domain nouns include nouns and noun phrases. Nouns may be common nouns or non-common nouns, for example knowledge or identity card. Noun phrases may be terms such as intellectual property, registered enterprise, business administration, and enterprise real-name authentication.
In step 103, the domain noun dictionary of step 102 is constructed using a deterministic finite automaton.
The deterministic finite automaton (DFA) is defined as R = (Q, A, δ, q0, F), where Q is the set of states, A is the set of input pinyin, δ: Q × A → Q is the mapping from the direct product Q × A to Q, q0 ∈ Q is the initial state, and F ⊆ Q is the set of final states; a schematic diagram is shown in Fig. 4. Error correction should involve as little computation as possible, and the DFA algorithm requires almost no computation, only state transitions. Here a state transition involves no action, only a query (lookup). One can think of it as: from S, look up U and V; from U, look up V and Q; from V, look up U and Q.
For example, suppose the domain thesaurus contains the following nouns: knowledge, instruction, wisdom, intellectual property, commander, uniform, and Alipay. The DFA organizes them into the tree structure shown in Fig. 5, so the domain dictionary becomes a set of trees, one per first-character pinyin. This greatly narrows the search range when judging whether a question input by the user contains a wrong character. For example, to check intellectual property, the pinyin of the first character identifies which tree needs to be searched, and the search then proceeds within that tree. A flag bit is used to judge whether a domain noun has ended: the pinyin of the last character of a domain noun marks the end of the word, so its flag bit is set to isEnd=1, and all others are set to isEnd=0. As shown in Fig. 5 for knowledge and intellectual property, knowledge is a word and intellectual property is also a word, so the flag bits of both shi and quan are set to isEnd=1.
In step 104, the text input by the user, that is, the user's question, is received. The text includes Chinese characters or pinyin, and is converted to pinyin using a Chinese-character-to-pinyin tool. Because the user may type pinyin incorrectly, the result at this point may not consist of valid pinyin syllables; the illegal pinyin parts, other than single-syllable pinyin, are merged in preparation for fuzzy matching. Backward maximum matching is then used to segment all legal or illegal pinyin other than single-syllable pinyin.
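The segmentation in step 104 can be sketched with a generic backward maximum matching loop. The text does not spell out how illegal fragments are handled, so this sketch makes assumptions: `LEGAL` is a tiny hypothetical subset of the legal syllable set, `MAX_LEN` is set to 6 letters (the length of syllables such as zhuang), and adjacent characters that fit no legal syllable are merged into one fragment flagged as illegal.

```python
# Tiny illustrative subset of legal pinyin syllables (assumption).
LEGAL = {"zi", "zhi", "shi", "chan", "quan", "an", "hui", "fu"}
MAX_LEN = 6  # longest syllable length considered, e.g. "zhuang"


def backward_max_match(text, legal=LEGAL, max_len=MAX_LEN):
    """Segment from right to left, always taking the longest legal syllable.

    Returns a list of (piece, is_legal) pairs in left-to-right order.
    """
    pieces = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if candidate in legal:
                pieces.append((candidate, True))
                end -= size
                break
        else:
            # No legal syllable ends here: take one character as an illegal
            # fragment and merge it with an adjacent illegal fragment.
            ch = text[end - 1]
            if pieces and not pieces[-1][1]:
                pieces[-1] = (ch + pieces[-1][0], False)
            else:
                pieces.append((ch, False))
            end -= 1
    pieces.reverse()
    return pieces


segments = backward_max_match("zishichanquan")
```

Scanning from the right, "quan" is preferred over the shorter "an", and the input splits into zi, shi, chan, quan, all flagged as legal.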
In step 105, the pinyin segmented in step 104 are matched one by one, using a fuzzy matching algorithm, against the domain noun dictionary constructed in step 103, and the edit distance of the pinyin is recorded during matching. The edit distance of pinyin that fall under a common error is then adjusted according to the common-error pinyin syllable table of step 101. Matching continues while the edit distance, or the adjusted edit distance, is less than the set threshold β, so as to match the longest domain noun. Finally, all Chinese character combinations along the path are taken. After this step, one or more qualifying domain nouns will have been matched.
In step 106, the one or more domain nouns formed by all the Chinese character combinations in step 105 are scored.
The specific scoring procedure is as follows:
In step 106-1, the edit distance between the Chinese characters of each domain noun matched in step 105 and the Chinese characters of the user's input text is computed one by one, and each noun is scored by combining this Chinese-character edit distance with the pinyin edit distance from the matching process in step 105 and the length of the domain noun. The smaller the edit distance and the longer the domain noun, the higher the score.
In step 106-2, judge whether the score of the matched domain noun is greater than the set score threshold θ; if so, perform step 106-3; if not, perform step 106-4.
In step 106-3, if so, the domain nouns whose score is less than or equal to the set threshold θ are removed.
In step 106-4, if not, the domain nouns whose score is greater than the set threshold θ are retained.
In step 107, the domain noun with the highest score in step 106 is output.
The input text automatic error correction method for question answering systems provided by the embodiment of the present invention has been described in detail above. The construction of the domain noun dictionary with a deterministic finite automaton in step 103 is described in detail below.
Referring to Fig. 2, an embodiment of the present invention provides a flow chart of a method for constructing the domain noun dictionary of step 102 with a deterministic finite automaton. This embodiment mainly elaborates step 103 of Fig. 1 and includes the following steps:
In step 201, the domain nouns of step 102 in Fig. 1 are read one by one.
In step 202, judge whether the domain noun first-character pinyin store contains the first-character pinyin of the current domain noun, that is, whether a tree for the current first-character pinyin already exists; if so, perform step 203; if not, perform step 204.
In step 203, if the first-character pinyin store contains the first-character pinyin of the current domain noun, that is, a tree for the current first-character pinyin already exists, find the existing first-character pinyin tree and obtain its head node.
In step 204, add the first-character pinyin of the current domain noun to the first-character pinyin store, build a domain noun pinyin tree, and put the current pinyin into a node.
In step 205, judge whether the current pinyin is an end node, that is, whether it is the pinyin of the last character of the domain noun; if so, perform step 211; if not, perform step 206.
In step 206, if the current pinyin is not an end node in step 205, that is, it is not the pinyin of the last character of the domain noun, judge whether the flag bit of the current node has already been set to 1; if so, perform step 208; if not, perform step 207.
This check prevents the end flag of a short domain noun from being overwritten by a longer domain noun. For example, suppose the domain nouns knowledge and intellectual property both exist. Knowledge is added to the finite automaton first, so the flag bit of shi is set to 1. When intellectual property is then added, without this check the end flag of shi would be overwritten with 0, and a search could no longer match the domain noun knowledge. This check is therefore important.
In step 207, if the pinyin flag bit was not previously set to 1 in step 206, set the pinyin flag bit to 0 here, then perform step 208.
In step 208, read the pinyin of the next character of the current domain noun.
In step 209, judge whether the child nodes of the current node from step 203 or step 204 include the pinyin of step 208; if so, jump to step 205; if not, perform step 210.
In step 210, add the pinyin of step 208 as a child node of the current node, then jump to step 205.
In step 211, if the current pinyin is an end node in step 205, that is, it is the pinyin of the last character of the domain noun, set the flag bit of the pinyin node to 1.
In step 212, judge whether all domain nouns have been read; if so, end the process; if not, return to step 201 and continue reading the next domain noun.
A specific example: suppose the domain noun read in step 201 is intellectual property, whose pinyin is zhi shi chan quan. First judge whether the first-character pinyin store contains zhi, the first-character pinyin of the current domain noun (intellectual property). If so, find the existing zhi tree. If not, add zhi to the first-character pinyin store, build a domain noun pinyin tree, and put the current pinyin zhi into a node. Next, judge whether the current pinyin zhi is an end node, that is, whether it is the last pinyin of the domain noun. If it were, the flag bit isEnd of the node zhi would be set to 1, and the process would then check whether all domain nouns have been read, ending if so and otherwise reading the next domain noun. Here zhi is clearly not an end node, so judge whether its flag bit has already been set to 1; if not, set the flag bit isEnd of zhi to 0. Then read shi, the pinyin of the next character of the domain noun (intellectual property). Judge whether the child nodes of the current node zhi include shi. If they do, judge whether the current pinyin shi is an end node, in the same way as for zhi. If they do not, add shi as a child node of the zhi tree and then judge whether shi is an end node, again in the same way as for zhi; the details are not repeated here. This process is repeated until all the domain nouns of step 201 have been read.
The method for constructing the domain noun dictionary of step 102 with a deterministic finite automaton has been described in detail above. The fuzzy matching algorithm provided by the embodiment of the present invention is described in detail below.
Referring to Fig. 3, an embodiment of the present invention provides a flow chart of a fuzzy matching algorithm. This embodiment mainly elaborates step 105 of Fig. 1, as follows:
In step 301, receive the n pinyin segments obtained in step 104 of Fig. 1 by backward maximum matching.
In step 302, assign initial values i = 1 and k = 1, where i is the index of the current pinyin in the sequence and k is the index of the current tree.
In step 303, judge whether a k-th domain noun tree exists; if so, perform step 304; if not, perform step 305.
In step 304, compute the edit distance dis between the current i-th pinyin and the first-character pinyin of the k-th tree, then perform step 306.
In step 305, if the k-th domain noun tree does not exist, that is, all domain noun trees have been tried without a match, the current i-th pinyin is unmatched; move on to the next pinyin, restart matching from the 1st domain noun tree, and then perform step 304.
In step 306, judge whether the currently matched pinyin is a common pinyin error; if so, perform step 307; if not, perform step 308.
In step 307, if it is a common pinyin error, reduce the edit distance dis.
In step 308, judge whether the edit distance dis from step 304 and step 307 is less than the set distance threshold β; if so, perform step 309; if not, perform step 312.
In step 309, if the edit distance dis in step 308 is less than or equal to the set threshold β, record the matched pinyin string.
In step 310, judge whether the flag bit of the current pinyin from step 304 is 1; if so, perform step 311; if not, perform step 314.
In step 311, if the flag bit of the current pinyin in step 310 is 1, the pinyin of a complete domain noun has been matched; set the pinyin flag bit of the successfully matched domain noun to 1, record the matched domain noun pinyin, then perform step 314.
In step 312, if the edit distance dis in step 308 is greater than the set threshold β, perform k++, abandon the current match and move on to match the first-character pinyin of the next tree, then perform step 313.
In step 313, clear the pinyin string of the abandoned match, then jump to step 303 to match the next domain noun tree.
In step 314, if the flag bit of the current pinyin in step 310 is 0, that is, it is not an end flag, perform i++ and continue matching the next pinyin: the next pinyin input by the user is compared with the next-level nodes of the current node of the domain noun tree.
In step 315, judge whether i in step 314 is less than or equal to n, that is, whether the i-th pinyin is still within the pinyin sequence of the user's input text; if so, perform step 316; if not, perform step 317.
In step 316, if i is less than or equal to n, that is, the pinyin sequence of the user's input text has not been fully matched, compute in turn the edit distance between each pinyin in the set of next states of the matched pinyin and the next pinyin converted from the input text, find the minimum edit distance and add it to the previous dis, then jump to step 306 to judge whether it is a common pinyin error, update the edit distance, and compare it with the distance threshold.
In step 317, if i is greater than n, that is, the input pinyin sequence has been fully matched, end the matching process and perform step 106 in Fig. 1.
A specific example: suppose the pinyin sequence obtained in step 104 by reverse maximum matching is zi shi chan quan (a mistyped form of zhi shi chan quan, "intellectual property"). The detailed steps are as follows.

The first pinyin zi, produced by the reverse maximum matching segmentation in step 104, is received. It is then judged whether a 1st field noun tree exists (in general a 1st tree always exists). If it exists, the edit distance dis between the current 1st pinyin zi and the first-character pinyin of the 1st tree is calculated; assuming the root of the first tree is zhi, the edit distance between zi and zhi is dis = 1. If the 1st field noun tree does not exist, i.e. all field noun trees have been tried and no pinyin has matched, the current 1st pinyin zi is unmatched; the next pinyin shi is then matched, starting again from the 1st field noun tree.

Next it is judged whether the pinyin zi being matched is a common pinyin error. Converting zi to zhi takes only one step, i.e. the edit distance is 1, so it is judged to be a common error and dis is reduced, for example to 0.5. If instead the root of the first tree were shen, converting zi to shen would take 4 steps, i.e. the edit distance would be 4, and it would be judged not to be a common error. In either case the edit distance of the current pinyin zi is then compared with the predetermined distance threshold β (assumed here to be 2). If it is greater than β, the second tree is looked up, with the same judgment process as for the first tree. If it is less than or equal to β, the matched tree for zi is recorded, and it is judged whether the flag bit of the current pinyin is 1. Here the flag bit is not 1, so i++ is executed and matching continues with the next pinyin shi: the next pinyin shi input by the user is compared with the next-level nodes of the current node zhi of the field noun tree (such as shi, hui, fu in Fig. 5).

It is then judged whether zi is the last pinyin (quan) of the pinyin sequence from step 104. It is not, so the pinyin sequence of the user's input text has not yet been fully matched. The edit distances between the next input pinyin shi and each pinyin in the next-state set of the matched pinyin zhi (shi, hui, fu in Fig. 5) are calculated and the minimum is found, which is the child node shi with edit distance 0. This 0 is added to the previous dis = 0.5 and the sum is compared with β again: dis = 0.5 + 0 = 0.5 < β. The matched pinyin string zhi shi is therefore recorded, and the following pinyin chan and quan are looked up and matched in the same way.

The words finally matched are "intellectual property", "knows and show property right", "knowledge" and "instruction". During scoring, the edit distance between the Chinese characters of each matched field noun and the Chinese characters of the user's input text (zi shi chan quan) is calculated one by one, and the score combines this Chinese-character edit distance with the pinyin edit distance from the matching process and the length of the field noun. Among the matched words, "intellectual property" differs from the input text by only one character, its pinyin zhi shi chan quan differs from the input pinyin zi shi chan quan by only one step, and it is the longest. This field noun therefore scores highest, and the highest-scoring field noun ("intellectual property") is output.
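The arithmetic of the worked example above can be reproduced with a short sketch. It is only an illustration under stated assumptions: the halving factor `DISCOUNT`, the one-entry `COMMON_PAIRS` set and the helper names are hypothetical, chosen so that the zi/zhi distance of 1 becomes 0.5 as in the text; the patent does not specify how the distance is reduced.

```python
def levenshtein(a, b):
    """Plain edit distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for cb_index, cb in enumerate(b, start=1):
            curr.append(min(prev[cb_index] + 1,            # deletion
                            curr[-1] + 1,                  # insertion
                            prev[cb_index - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Assumption: a one-entry stand-in for the common error table.
COMMON_PAIRS = {frozenset(("zi", "zhi"))}
BETA = 2.0       # distance threshold, "assumed to be 2" in the example
DISCOUNT = 0.5   # assumption: common errors cost half as much

def adjusted_distance(typed, node):
    """Edit distance between two syllables, reduced for common errors."""
    dis = levenshtein(typed, node)
    if frozenset((typed, node)) in COMMON_PAIRS:
        dis *= DISCOUNT
    return dis

typed = ["zi", "shi", "chan", "quan"]   # user input after segmentation
path = ["zhi", "shi", "chan", "quan"]   # path followed in the tree
total = sum(adjusted_distance(t, p) for t, p in zip(typed, path))
```

Here `total` stays at 0.5, below `BETA`, so the whole path is accepted, whereas a tree rooted at shen would be rejected at the first syllable.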
The fuzzy matching algorithm of the embodiment of the present invention has been described in detail above. Combined with the input text automatic error correction method for question answering systems described here, it takes users' regional pronunciation errors into account, builds the field word dictionary with a finite automaton, and organises the field nouns into multiple query trees, which greatly reduces the time complexity of the algorithm. At the same time it uses fuzzy matching with an edit distance threshold and adjusts the edit distance according to common user pinyin errors. It can therefore handle both wrong Chinese characters and wrong pinyin in the user's input, and the corrected text improves question answering quality and user experience.
The embodiments above describe the purpose, technical solution and beneficial effects of the present invention in further detail. It should be understood that the foregoing is only an embodiment of the present invention and is not intended to limit its scope of protection; any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (6)

1. An input text automatic error correction method for a question answering system, characterised by comprising the following steps:
1) defining a table of common erroneous pinyin syllables;
2) building a field noun dictionary, the field nouns including nouns and noun phrases;
3) organising the field noun dictionary of step 2) with a finite automaton,
the finite automaton being defined as R = (Q, A, δ, q0, F), where Q is the state set, A is the input pinyin set, δ: Q × A → Q is the mapping from the direct product Q × A to Q, q0 ∈ Q is the initial state, and F ⊆ Q is the set of final states;
4) receiving the text input by the user, the text containing Chinese characters or pinyin; converting the Chinese characters to pinyin with a Chinese-character-to-pinyin tool; then segmenting all pinyin, legal or illegal, other than single-syllable pinyin, by reverse maximum matching;
5) matching the pinyin segmented in step 4) one by one, using a fuzzy matching algorithm, against the field noun dictionary built in step 3), and recording the pinyin edit distance during matching; adjusting the edit distance of commonly mistaken pinyin according to the table of common erroneous pinyin syllables of step 1); continuing to match while the pinyin edit distance, or the adjusted pinyin edit distance, is less than the set threshold β, so as to match the longest field noun; and finally taking all Chinese character combinations along the path;
6) scoring the one or more field nouns formed by the Chinese character combinations of step 5);
7) outputting the field noun with the highest score in step 6).
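Step 4) of claim 1 can be illustrated with a minimal sketch of reverse maximum matching over an unsegmented pinyin string. The syllable set below is a tiny hypothetical subset, `MAX_LEN = 6` assumes that no pinyin syllable is longer than six letters, and emitting a single letter when nothing matches is an assumed way of keeping illegal input; none of these details are given in the patent.

```python
# Assumption: a tiny subset of legal syllables, for illustration only.
SYLLABLES = {"zi", "zhi", "shi", "si", "chan", "quan", "an", "shen"}
MAX_LEN = 6  # assumed upper bound on syllable length (e.g. "zhuang")

def reverse_max_match(text, syllables=SYLLABLES, max_len=MAX_LEN):
    """Split `text` from right to left, always taking the longest legal syllable."""
    pieces = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            # A single letter is emitted as an illegal fragment when nothing matches.
            if piece in syllables or size == 1:
                pieces.append(piece)
                end -= size
                break
    pieces.reverse()
    return pieces

segments = reverse_max_match("zishichanquan")
```

Scanning from the right means that "quan" is preferred over the shorter legal syllable "an" at the end of the string.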
2. The input text automatic error correction method for a question answering system according to claim 1, characterised in that the table of common erroneous pinyin syllables in step 1) includes pinyin with errors at the beginning of the syllable and pinyin with errors at the end of the syllable; the errors at the beginning of the syllable include: confusion of n and l, confusion of h and f, and confusion of the flat-tongue initials z, c, s with the retroflex initials zh, ch, sh; the errors at the end of the syllable include: confusion of the front nasal finals an, en, in with the back nasal finals ang, eng, ing.
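One possible encoding of the table in claim 2 is to map every confusable initial and final to a canonical form and to treat two different syllables as a common error when their canonical forms coincide. This is a sketch under assumptions: the function names are hypothetical, and allowing an initial error and a final error in the same syllable is a design choice the claim does not address.

```python
# Initials, two-letter ones first so that "zh" is not read as "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g",
            "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

# Confusions from claim 2: n/l, h/f, z/zh, c/ch, s/sh.
CANONICAL_INITIAL = {"zh": "z", "ch": "c", "sh": "s", "l": "n", "f": "h"}

def split_syllable(syllable):
    """Return (initial, final); the initial is empty for vowel-initial syllables."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable

def canonical(syllable):
    initial, final = split_syllable(syllable)
    initial = CANONICAL_INITIAL.get(initial, initial)
    # Back nasal endings ang/eng/ing collapse to an/en/in.
    if final.endswith(("ang", "eng", "ing")):
        final = final[:-1]
    return initial, final

def is_common_error(typed, target):
    """True when the two syllables differ only by confusions listed in claim 2."""
    return typed != target and canonical(typed) == canonical(target)
```

For example, `is_common_error("zi", "zhi")` holds because both syllables reduce to the same canonical pair, while `is_common_error("zi", "shen")` does not.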
3. The input text automatic error correction method for a question answering system according to claim 1, characterised in that step 2) specifically comprises:
201) sorting the field nouns by word frequency;
202) discarding the field nouns whose word frequency is less than a predetermined word frequency threshold α, and extracting the field nouns whose word frequency is greater than the set word frequency threshold α and adding them to the field noun dictionary.
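Steps 201) and 202) amount to a frequency sort followed by a cut-off. The sketch below assumes the input is a flat list of noun occurrences already extracted from a corpus; the function name is hypothetical, and keeping nouns whose frequency equals α is an assumption, since the claim only covers the "less than" and "greater than" cases.

```python
from collections import Counter

def build_field_noun_dictionary(noun_occurrences, alpha):
    """Return field nouns sorted by descending frequency, dropping those below alpha."""
    counts = Counter(noun_occurrences)
    ranked = counts.most_common()  # step 201: sort by word frequency
    # Step 202: keep frequent nouns (the equality case is an assumption).
    return [noun for noun, freq in ranked if freq >= alpha]

occurrences = ["知识产权"] * 3 + ["专利"] * 2 + ["罕见词"]
dictionary = build_field_noun_dictionary(occurrences, alpha=2)
```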
4. The input text automatic error correction method for a question answering system according to claim 1, characterised in that step 3) specifically comprises:
301) reading the field nouns of step 2) one by one;
302) judging whether the field noun first-character pinyin library contains the first-character pinyin of the current field noun, i.e. judging whether a tree for the current first-character pinyin already exists; if so, performing step 303); if not, performing step 304);
303) if the field noun first-character pinyin library contains the first-character pinyin of the current field noun, i.e. a tree for the current first-character pinyin already exists, finding the existing first-character pinyin tree and obtaining its first node;
304) adding the first-character pinyin of the current field noun to the field noun first-character pinyin library, building a field noun pinyin tree and putting the current pinyin into a node;
305) judging whether the current pinyin is the terminal node, i.e. whether the current pinyin is the pinyin of the last character of the field noun; if so, performing step 311); if not, performing step 306);
306) if the current pinyin in step 305) is not the terminal node, i.e. the current pinyin is not the pinyin of the last character of the field noun, judging whether the current flag bit has already been set to 1; if so, performing step 308); if not, performing step 307);
307) if the pinyin flag bit in step 306) has not previously been set to 1, setting the pinyin flag bit to 0 here, then performing step 308);
308) reading the pinyin of the next character of the current field noun;
309) judging whether the child nodes of the current node from step 303) or step 304) include the pinyin of step 308); if so, jumping to step 305); if not, performing step 310);
310) adding the pinyin of step 308) as a child node of the current node, then jumping to step 305);
311) if the current pinyin in step 305) is the terminal node, i.e. the pinyin of the last character of the field noun, setting the flag bit of the pinyin node to 1;
312) judging whether all field nouns have been read; if so, ending the process; if not, returning to step 301) to read the next field noun.
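The tree building of steps 301) to 312) can be sketched as a forest of pinyin tries, one per first-character pinyin, with a boolean playing the role of the flag bit. The sketch assumes each field noun is supplied together with its syllable list (so no pinyin conversion library is needed); the `Node` class, `build_forest` and the sample entries are hypothetical, with only the syllables zhi, shi, hui, fu taken from the description of Fig. 5.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    pinyin: str
    is_end: bool = False                           # flag bit: True if a noun ends here
    words: list = field(default_factory=list)      # nouns ending at this node
    children: dict = field(default_factory=dict)   # next syllable -> Node

def build_forest(entries):
    """entries: iterable of (noun, [syllable, ...]). Returns first syllable -> root Node."""
    roots = {}  # the "first-character pinyin library"
    for word, syllables in entries:
        # Steps 302-304: reuse the tree for this first syllable or create it.
        node = roots.setdefault(syllables[0], Node(syllables[0]))
        # Steps 308-310: descend, adding child nodes that are missing.
        for syllable in syllables[1:]:
            node = node.children.setdefault(syllable, Node(syllable))
        # Step 311: mark the last syllable of the noun.
        node.is_end = True
        node.words.append(word)
    return roots

# Assumption: illustrative entries only.
ENTRIES = [
    ("知识产权", ["zhi", "shi", "chan", "quan"]),
    ("知识", ["zhi", "shi"]),
    ("指示", ["zhi", "shi"]),
    ("智慧", ["zhi", "hui"]),
    ("支付", ["zhi", "fu"]),
]
forest = build_forest(ENTRIES)
```

Homophones such as the two zhi shi entries share one node, so a node stores a list of nouns rather than a single one.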
5. The input text automatic error correction method for a question answering system according to claim 1, characterised in that step 5) specifically comprises:
501) receiving the sequence of n pinyin obtained in step 4) by reverse maximum matching segmentation;
502) assigning initial values i = 1 and k = 1, where i denotes the i-th pinyin of the sequence and k denotes the k-th tree;
503) judging whether a k-th field noun tree exists; if it exists, performing step 504); if it does not exist, performing step 505);
504) calculating the edit distance dis between the current i-th pinyin and the first-character pinyin of the k-th tree, then performing step 506);
505) if the k-th field noun tree does not exist, i.e. all field noun trees have been tried and no pinyin has matched, the current i-th pinyin is unmatched; matching the next pinyin, starting again from the 1st field noun tree, then performing step 504);
506) judging whether the pinyin currently being matched is a common pinyin error; if so, performing step 507); if not, performing step 508);
507) if it is a common pinyin error, reducing the edit distance dis;
508) judging whether the edit distance dis from step 504) or step 507) does not exceed the set distance threshold β; if so, performing step 509); if not, performing step 512);
509) if the edit distance dis in step 508) is less than or equal to the set threshold β, recording the matched pinyin string;
510) judging whether the flag bit of the current pinyin in step 504) is 1; if so, performing step 511); if not, performing step 514);
511) if the flag bit of the current pinyin in step 510) is 1, the pinyin of a field noun has been matched; setting the pinyin flag bit of the successfully matched field noun to 1 and recording the matched field noun pinyin, then performing step 514);
512) if the edit distance dis in step 508) is greater than the set threshold β, executing k++, cutting off the match and going on to match the first-character pinyin of the next tree, then performing step 513);
513) clearing the pinyin string of the cut-off match, then jumping to step 503) to match the next field noun tree;
514) if the flag bit of the current pinyin in step 510) is 0, i.e. it is not an end flag, executing i++ and continuing with the next pinyin: the next pinyin input by the user is compared with the next-level nodes of the current node of the field noun tree;
515) judging whether i in step 514) is less than or equal to n, i.e. whether the i-th pinyin is still within the pinyin sequence of the user's input text; if so, performing step 516); if not, performing step 517);
516) if i is less than or equal to n, i.e. the pinyin sequence of the user's input text has not yet been fully matched, calculating in turn the edit distance between each pinyin in the next-state set of the matched pinyin and the next pinyin converted from the input text, finding the minimum edit distance and adding it to the previous dis, then jumping to step 506) to judge whether it is a common pinyin error, update the edit distance and compare it with the distance threshold;
517) if i is greater than n, i.e. the input pinyin sequence has been fully matched, ending the matching process and then performing step 6).
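The control flow of steps 501) to 517) can be approximated by a greedy walk: for each starting pinyin and each tree, follow the child with the smallest edit distance, accumulate the distance, stop once it exceeds β, and collect the nouns stored at flagged nodes along the way. The sketch below is a simplification under assumptions, not the claimed procedure itself: it uses nested dictionaries with a hypothetical `END` key in place of flag bits, tries every input position as a possible start, and halves the distance for common errors.

```python
def levenshtein(a, b):
    """Plain edit distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

END = "$"  # assumed key holding the nouns that end at a node (the "flag bit")

def build_forest(entries):
    forest = {}
    for word, syllables in entries:
        node = forest
        for syllable in syllables:
            node = node.setdefault(syllable, {})
        node.setdefault(END, []).append(word)
    return forest

def fuzzy_match(syllables, forest, beta=2.0, is_common=lambda a, b: False, discount=0.5):
    """Return (noun, matched path, cumulative distance) for every noun reached within beta."""
    def cost(typed, target):
        dis = levenshtein(typed, target)
        return dis * discount if dis and is_common(typed, target) else dis

    found = []
    for start in range(len(syllables)):              # index i in the claim
        for root, node in forest.items():            # tree k in the claim
            dis, i, path = cost(syllables[start], root), start, [root]
            while dis <= beta:                        # steps 508/509
                for word in node.get(END, []):        # steps 510/511
                    found.append((word, tuple(path), dis))
                i += 1                                # step 514
                children = [s for s in node if s != END]
                if i >= len(syllables) or not children:
                    break
                best = min(children, key=lambda s: cost(syllables[i], s))  # step 516
                dis += cost(syllables[i], best)
                node = node[best]
                path.append(best)
    return found

ENTRIES = [  # assumption: illustrative entries only
    ("知识产权", ["zhi", "shi", "chan", "quan"]),
    ("知识", ["zhi", "shi"]),
    ("智慧", ["zhi", "hui"]),
    ("支付", ["zhi", "fu"]),
    ("审查", ["shen", "cha"]),
]
result = fuzzy_match(["zi", "shi", "chan", "quan"], build_forest(ENTRIES),
                     is_common=lambda a, b: {a, b} == {"zi", "zhi"})
```

With these entries the walk along zhi, shi, chan, quan collects the shorter noun at zhi shi and the full noun at the end of the path, both at a cumulative distance of 0.5, while the tree rooted at shen is cut off.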
6. The input text automatic error correction method for a question answering system according to claim 1, characterised in that step 6) comprises the following steps:
601) calculating one by one the edit distance between the Chinese characters of each field noun matched in step 5) and the Chinese characters of the user's input text, and scoring by combining the calculated Chinese-character edit distance with the pinyin edit distance from the matching process of step 5) and the length of the field noun;
602) judging whether the score of the matched field noun is greater than a set score threshold θ; if so, performing step 603); if not, performing step 604);
603) if so, removing the field nouns whose score is less than or equal to the set threshold θ;
604) if not, retaining the field nouns whose score is greater than the set threshold θ.
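The patent states which quantities enter the score (Chinese-character edit distance, pinyin edit distance, noun length) but gives no formula. The sketch below therefore uses a purely hypothetical linear score, length minus both distances, to show how the threshold θ and the final selection could fit together; the input string 子识产权 is likewise an assumed rendering of the mistyped example.

```python
def levenshtein(a, b):
    """Plain edit distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def score(noun, user_text, pinyin_distance):
    """Assumed formula: longer nouns score higher, both edit distances are penalties."""
    return len(noun) - levenshtein(noun, user_text) - pinyin_distance

def best_field_noun(candidates, user_text, theta=0.0):
    """candidates: (noun, pinyin distance) pairs. Returns the top noun above theta, or None."""
    scored = [(score(noun, user_text, dis), noun) for noun, dis in candidates]
    kept = [(s, noun) for s, noun in scored if s > theta]  # steps 602-604
    return max(kept)[1] if kept else None                  # step 7) of claim 1

user_text = "子识产权"  # assumed Chinese form of the mistyped input
candidates = [("知识产权", 0.5), ("知识", 0.5)]
best = best_field_noun(candidates, user_text)
```

Under this assumed formula the four-character noun scores 4 - 1 - 0.5 = 2.5, and the two-character noun scores 2 - 3 - 0.5 = -1.5 and is dropped by the threshold.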
CN201710507529.6A 2017-06-28 2017-06-28 A kind of input text automatic error correction method towards question answering system Active CN107220381B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710507529.6A CN107220381B (en) 2017-06-28 2017-06-28 A kind of input text automatic error correction method towards question answering system

Publications (2)

Publication Number Publication Date
CN107220381A true CN107220381A (en) 2017-09-29
CN107220381B CN107220381B (en) 2019-11-15

Family

ID=59950459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710507529.6A Active CN107220381B (en) 2017-06-28 2017-06-28 A kind of input text automatic error correction method towards question answering system

Country Status (1)

Country Link
CN (1) CN107220381B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399608A (en) * 2019-06-04 2019-11-01 深思考人工智能机器人科技(北京)有限公司 A kind of conversational system text error correction system and method based on phonetic
CN111028834A (en) * 2019-10-30 2020-04-17 支付宝(杭州)信息技术有限公司 Voice message reminding method and device, server and voice message reminding equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102236706A (en) * 2011-06-17 2011-11-09 浙江大学 Fast fuzzy pinyin inquiry method of mass Chinese file names
CN104050255A (en) * 2014-06-13 2014-09-17 上海交通大学 Joint graph model-based error correction method and system
CN106878289A (en) * 2017-01-22 2017-06-20 中国人民解放军信息工程大学 Matching regular expressions method and its device based on multidimensional template finite automata TMFA

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399608A (en) * 2019-06-04 2019-11-01 深思考人工智能机器人科技(北京)有限公司 A kind of conversational system text error correction system and method based on phonetic
CN110399608B (en) * 2019-06-04 2023-04-25 深思考人工智能机器人科技(北京)有限公司 Text error correction system and method for dialogue system based on pinyin
CN111028834A (en) * 2019-10-30 2020-04-17 支付宝(杭州)信息技术有限公司 Voice message reminding method and device, server and voice message reminding equipment
CN111028834B (en) * 2019-10-30 2023-01-20 蚂蚁财富(上海)金融信息服务有限公司 Voice message reminding method and device, server and voice message reminding equipment

Also Published As

Publication number Publication date
CN107220381B (en) 2019-11-15

Similar Documents

Publication Publication Date Title
CN110489760B (en) Text automatic correction method and device based on deep neural network
CN106534548B (en) Voice error correction method and device
US9069753B2 (en) Determining proximity measurements indicating respective intended inputs
JP5870790B2 (en) Sentence proofreading apparatus and proofreading method
CN103365925B (en) Obtain polyphone phonetic, method based on phonetic retrieval and related device thereof
CN102033879B (en) Method and device for identifying Chinese name
CN105975625A (en) Chinglish inquiring correcting method and system oriented to English search engine
US20060004744A1 (en) Method and system for approximate string matching
CN101256557B (en) Self-defining word management apparatus and method
CN102768681A (en) Recommending system and method used for search input
CN108052499A (en) Text error correction method, device and computer-readable medium based on artificial intelligence
WO1998039711A1 (en) Language analysis system and method
CN103049458A (en) Method and system for revising user word bank
CN112541095B (en) Video title generation method and device, electronic equipment and storage medium
CN104199954B (en) A kind of commending system and method for searching for input
CN111985234B (en) Voice text error correction method
CN104268176B (en) A kind of recommendation method based on search keyword
CN107220381B (en) A kind of input text automatic error correction method towards question answering system
CN110134766B (en) Word segmentation method and device for traditional Chinese medical ancient book documents
CN111401012A (en) Text error correction method, electronic device and computer readable storage medium
JP2000298667A (en) Kanji converting device by syntax information
US10650195B2 (en) Translated-clause generating method, translated-clause generating apparatus, and recording medium
US20180011836A1 (en) Tibetan Character Constituent Analysis Method, Tibetan Sorting Method And Corresponding Devices
CN111782892A (en) Similar character recognition method, device, apparatus and storage medium based on prefix tree
CN111079415B (en) Chinese automatic error checking method based on collocation conflict

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant