CN101499056A - Backward reference sentence pattern language analysis method - Google Patents


Info

Publication number
CN101499056A
Authority
CN
China
Prior art keywords
sentence pattern
sentence
character
basic
pattern
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008100053643A
Other languages
Chinese (zh)
Inventor
徐文新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CNA2008100053643A
Publication of CN101499056A

Landscapes

  • Machine Translation (AREA)

Abstract

The inverted reference sentence pattern language analysis method is a language analysis method based on reference sentence patterns. It can be applied to natural language processing, intelligent information processing and similar fields. The method comprises: building a database of the basic sentence patterns S of a language, including processing information; recording for each S its number of character elements k; providing a field j; letting d be the address of the sentence pattern or of its j field, or assigning the sentence pattern a number n; taking all character elements P_i as keywords; and, for each character element P_i, listing the address d or number n of every S that contains P_i, so as to obtain an inverted table. Let P be a character element of the sentence T to be analysed. For each P, following the entries d or n listed under P in the inverted table, the j field of the corresponding S in the database is incremented, which gives the j value of each basic sentence pattern. A sentence pattern S with j = k is a reference sentence pattern of T. T is processed with reference to the information attached to these sentence patterns, generally giving preference to sentence patterns for which j is greater than k while also taking other factors into account. If S or T contains repeated character elements, the situation is more complicated.
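The flow summarised in the abstract can be sketched as follows for the case without repeated character elements. This is a minimal illustration, not the patented implementation: the three patterns are the toy entries used in the example table of the description, sentence pattern numbers n stand in for addresses d, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy basic sentence pattern database (jxk); every pattern is a string of
# character elements without repeats. Pattern numbers n start at 1.
patterns = ["abc", "bd", "ad"]

def build_inverted_table(patterns):
    """Map each character element to the numbers n of the patterns containing it."""
    table = defaultdict(list)
    for n, s in enumerate(patterns, start=1):
        for p in set(s):
            table[p].append(n)
    return table

def reference_patterns(t, patterns, table):
    """Accumulate j for every pattern hit by T and keep those with j == k."""
    j = defaultdict(int)
    for p in set(t):                      # each character element of T
        for n in table.get(p, []):        # every pattern listed under it
            j[n] += 1                     # cumulative marking of field j
    # k is the number of character elements of the pattern (no repeats assumed)
    return sorted(n for n, count in j.items() if count == len(patterns[n - 1]))

table = build_inverted_table(patterns)
print(reference_patterns("abd", patterns, table))  # [2, 3]
```

For T = "abd", pattern 1 ("abc") reaches only j = 2 against k = 3, so it is not a reference sentence pattern, while "bd" and "ad" both reach j = k = 2.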

Description

Backward reference sentence pattern language analysis method
Technical field
The present invention is a language analysis method based on reference sentence patterns (including collocations, phrases, word groups and vocabulary). It can be used for sentence analysis and semantic comparison in fields such as natural language processing and intelligent information processing.
Background technology
Natural language processing and intelligent information processing require sentence analysis and semantic comparison. Of particular concern, improving the word segmentation accuracy of pinyin strings has become the key to raising the level of Chinese speech input. There are many algorithms for the automatic word segmentation of Chinese pinyin strings, such as the maximum matching method (MM), the minimum-segmentation word frequency selection method (FWF) and the word-by-word traversal method. Depending on the scanning direction, the maximum matching method is further divided into the forward maximum matching method (FMM) and the backward maximum matching method (BMM).
However, the accuracy of current segmentation methods cannot meet application needs, and improving segmentation accuracy and speed requires new algorithms. In the amended text of my Chinese application No. 200410067258.X, "prime number replacing character string search technology", and in the text of international application No. PCT/CN2005/001493, I proposed replacing the character units of a character string with prime numbers. The product Ft of the primes for the character units of the search keyword T is taken as the dividend, and the product Fn of the primes for the character units of a basic sentence pattern Sn in the database is taken as the divisor. If the division leaves no remainder, Sn is a sentence pattern that T can refer to. This is called "inverse retrieval", and it gives rise to a language analysis method based on reference sentence patterns. However, the speed of the "prime number replacing character string search technology" cannot meet the needs of language processing. In the amended text of my Chinese application No. 200510023383.5, "bit mark character string retrieval technique", and in the text of international application No. PCT/CN2005/001642, I described a method of fast preselection using the "bit mark character string retrieval technique", which makes language analysis based on reference sentence patterns practical. As an alternative, an inverted file can also be used to carry out language analysis based on reference sentence patterns.
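The divisibility test behind "inverse retrieval" can be sketched as below. This is only an illustration under assumptions: the prime assignment, the syllable list and the helper names are hypothetical, and the earlier applications may assign primes differently.

```python
from math import prod

# Hypothetical assignment of one prime per character unit (here: pinyin syllables).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

def assign_primes(units):
    """Give each distinct character unit its own prime."""
    return {u: PRIMES[i] for i, u in enumerate(units)}

def prime_product(units, prime_of):
    """Product of the primes of the given character units."""
    return prod(prime_of[u] for u in units)

prime_of = assign_primes(["shi", "xian", "li", "xiang", "wei", "da"])

# Ft: keyword T; Fn: a basic sentence pattern Sn from the database.
ft = prime_product(["shi", "xian", "wei", "da", "li", "xiang"], prime_of)
fn = prime_product(["shi", "xian", "li", "xiang"], prime_of)

# Sn is a reference sentence pattern of T when Fn divides Ft exactly.
print(ft % fn == 0)  # True
```

Because multiplication is commutative, the test ignores the order of the character units and any units inserted between them.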
Summary of the invention
Speech input, machine translation, search engines and intelligent information processing require sentence analysis and semantic comparison. What is currently used is word segmentation. This document discloses a language analysis and semantic comparison method based on reference sentence patterns, which operates at a higher level than "word segmentation". The priority document followed the common terminology and called it the "technology for division word in inverted reference sentence"; it is now called the "backward reference sentence pattern language analysis method". Taking the conversion of a pinyin string into a Chinese character string in Chinese speech input as the example, the following explains: 1. the basic principle of the speech-input algorithm based on reference sentence patterns; 2. the general steps of language analysis with the inverted reference sentence pattern method; 3. the inverted reference sentence pattern speech input method. In many respects the backward reference sentence pattern language analysis method is the same as language analysis using the "bit mark" and "prime number replacing character string search" techniques, and the methods can be consulted against one another. If the phonetic symbols of another language are treated like pinyin, and its words, phrases and sentence patterns are treated like those of Chinese, applications in other languages can be implemented by analogy.
1. Basic principle of the speech-input algorithm based on reference sentence patterns
A. The cooperative phenomenon of language and the extraction of basic sentence patterns
As with other languages, the total number of Chinese sentences is hard to bound. At the present level of technology, at least on an ordinary computer, it is not possible to list enough sentences to find the sentence corresponding to a pinyin string with an instantaneous response. Sentences are of course made of words: "he graduates next June" can be decomposed into "he", "graduates", "next year" and "June". The number of words is limited; the "Modern Chinese Dictionary" contains about 60,000 entries, but homophones exist among them. Speech input must cut the phonetic string correctly and choose, among homophones, the one that fits the context.
Chinese has just over 400 toneless syllables and 5,000 to 7,000 commonly used characters, so each syllable corresponds on average to more than 10 common characters, and some syllables correspond to as many as a hundred. Given the single syllable "xue" or "sheng" alone, neither its meaning nor the character it denotes can be determined, but the two syllables used together as "xuesheng" mean "student". Here "xue" and "sheng" corroborate each other, which can be called the "cooperative phenomenon" of language. The number of two-syllable permutations in Chinese is about 400*400 = 160,000. Not all of the 60,000 Chinese entries are disyllabic words, yet among Chinese disyllables there are, on the one hand, sound combinations with no meaning, such as "rexiang", and, on the other hand, homophones: the words pronounced "shixian" include "realize, in advance, fall, sight line, time limit, seasonings, contemporary worthies", and the words pronounced "lixiang" include "ideal" and "lanes and alleys". Nevertheless, if someone says "shixianlixiang", we understand it as "realize an ideal" and not as "sight line lanes and alleys"; here "shixian" and "lixiang" corroborate each other. In short, the more syllables, the more definite the meaning. For Chinese, in three- and four-syllable strings the probability of homophones or homophonic ambiguity becomes smaller and smaller, and meaningless sound combinations appear in large numbers. Precisely because meaningless combinations are so probable among trisyllables, when someone mentions an unfamiliar three-syllable name in conversation, such as "lifuhao", we can usually judge that it is a person's name rather than a word.
Since the probability of homophonic ambiguity becomes smaller and smaller in three- and four-syllable strings, monosyllabic and disyllabic words can be combined by semantic collocation into phrases and sentence patterns of three, four, five or more characters, the corresponding pinyin provided, and a database built. Because sound and meaning then correspond well one to one, converting sound to text according to this database improves accuracy. Suppose the T produced by speech conversion or pinyin input is "shixianlixiangxuyaonuli". Segmenting with the forward maximum matching method finds the pinyin string "shixianlixiang" in the database, whose corresponding Chinese character string is "realize ideal". T is then processed into "realize ideal" followed by the still unconverted "xuyaonuli", which avoids the word-selection errors that can occur when "shixian" and "lixiang" are converted separately.
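The forward maximum matching step in this example can be sketched as below. It is a simplified illustration working on raw pinyin letters; the lexicon entries and their English glosses (standing in for Chinese character strings) are assumptions, not the database of the invention.

```python
def forward_maximum_match(t, lexicon):
    """Scan T left to right, always taking the longest lexicon entry that fits.

    Letters that match no entry are collected into an unconverted run.
    """
    max_len = max(len(key) for key in lexicon)
    pieces, unmatched, i = [], "", 0
    while i < len(t):
        for size in range(min(max_len, len(t) - i), 0, -1):
            if t[i:i + size] in lexicon:
                if unmatched:                 # flush the pending unconverted run
                    pieces.append(unmatched)
                    unmatched = ""
                pieces.append(lexicon[t[i:i + size]])
                i += size
                break
        else:                                 # no entry starts at position i
            unmatched += t[i]
            i += 1
    if unmatched:
        pieces.append(unmatched)
    return pieces

# Hypothetical pinyin-to-text database.
lexicon = {
    "shixianlixiang": "realize ideal",
    "shixian": "realize",
    "lixiang": "ideal",
}
print(forward_maximum_match("shixianlixiangxuyaonuli", lexicon))
# ['realize ideal', 'xuyaonuli']
```

The long entry wins over the two shorter ones, so "shixian" and "lixiang" are never converted separately.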
Language is of course not so simple. If someone says "shixianweidalixiang", we understand "realize a great ideal" rather than "contemporary worthies, great lanes and alleys", which shows that the cooperative phenomenon of language can "span syllables". If someone says "lixiangyijingshixian", we understand "the ideal has been realized" rather than "ideal artistic conception sight line", which shows that the cooperative phenomenon can be "unordered". Ordinary segmentation methods, however, cannot discover that "shixianweidalixiang" and "lixiangyijingshixian" are intrinsically related to the database entry "shixianlixiang"; they are ineffective for the "syllable-spanning" and "unordered" cooperative phenomena. A "skip comparison" of the character strings can also reveal the containment relation between the two, but once the database is somewhat large the response speed cannot meet requirements.
I therefore proposed carrying out language analysis with the "prime number replacing character string search technology". This method turns the "serial comparison" of characters between the main string and the substring into a "parallel division" of the corresponding prime numbers, which reduces the time spent reading data; within a certain range it is 1 to 2 times faster than pattern matching. Moreover, judging by exact division after prime replacement "does not consider the order of character units", which is exactly suited to handling the syllable-spanning and unordered cooperative phenomena. Suppose the database contains the record "realize # ideal", where a symbol such as # indicates that the order of the character units is flexible at that point. Then whether the input is "realize a great ideal" or "the ideal has been realized", a single exact division finds the record. If skip comparison of character strings is used to find the unordered cooperative phenomenon, two skip comparisons are needed when screening reference sentence patterns. The speed of the "prime number replacing character string search technology" still cannot meet the needs of language processing, so the "bit mark character string retrieval technique" is needed for a preliminary selection of sentence patterns.
In Chinese, "five cars" can be regarded as a word or as a numeral-classifier collocation; "Spring Festival Party" is a phrase; "pigeon" is a classifier-noun collocation; "to make a phone call" is a verb-object word group and can also be regarded as a phrase; "he laughs" can be regarded as a phrase or as a subject-predicate sentence; and "I say Chinese" is a sentence with subject, verb and object. A language analysis database can hold extracted sentences, collocations, phrases, word groups, words and so on. For ease of writing, this document refers to all of them loosely as "basic sentence patterns", and their database is called the "basic sentence patterns database", denoted jxk. A sentence pattern screened from jxk whose character units are all contained in the keyword T is called a "reference sentence pattern"; one for which only part of the character units are contained in T is called a "fault-tolerant sentence pattern". After the "reference sentence patterns" and "fault-tolerant sentence patterns" have been analysed and compared, the sentence pattern finally chosen for processing T is called the "basic sentence pattern".
When a "reference sentence pattern" is selected from the database to serve as the "basic sentence pattern", sentence patterns with more character units are preferred; this is abbreviated as "priority of long word". For the sentence "he graduates next June", if the database contains "he graduates" and "next June", there are 2 reference sentence patterns with a larger number of character units. If the database lacks "next June", words such as "next year" and "June" can be found for reference instead. In other words, the speech-input algorithm based on reference sentence patterns is compatible with ordinary word segmentation algorithms. The standard for sentence patterns is flexible, and the number of basic sentence patterns in the database can be adjusted to the hardware: the greater the processing capacity, the longer and more numerous the sentence patterns can be, and the better the results. If hardware is limited, the basic sentence patterns can be classified by subject, for example 1,600,000 general, 400,000 literature and history, 400,000 science and technology, and 400,000 newspapers and periodicals. If the user is entering a scientific or technical article, the database of 1,600,000 + 400,000 is loaded. If no sufficiently credible reference sentence pattern is found in the current library, the sentence pattern libraries of other subjects are searched. Even with good hardware, building libraries by category is still worth considering, since some users seldom use certain specialised sentence patterns. At run time, if the program finds that the user is frequently using the vocabulary of a particular specialty, it can automatically adjust the composition of the current library.
A typical sentence has a subject, a predicate and an object, as in "I say Chinese". If every basic sentence pattern had these three components, "recognition and understanding" by the computer would certainly be easier, but subjects, predicates and objects are so varied that the number of sentence patterns would be very large. It is therefore preferable to decompose "I say Chinese" into "I say" and "say Chinese" and list these in the database as two basic sentence patterns. When the "basic sentence pattern" is chosen for language analysis, factors such as grammar and frequency should be weighed in addition to the principle of "priority of long word". Suppose T is "woshuohanyuhenliuli". By the principle of priority of long word, "say Chinese" turns it into "wo say Chinese henliuli"; if the database contains the collocation "say fluent", it can be further processed into "wo say Chinese hen fluent". Then, using "say" again, the reference sentence patterns are searched for an S whose Chinese text contains "say" and whose sound is "woshuo", namely "I say", as the third "basic sentence pattern", giving "I say Chinese hen fluent"; this method can be called "linking" or "association". Finally, grammar rules and frequency are used to arrive at "I say Chinese very fluently". It is also possible to list "very fluent" in the database as a sentence pattern; this depends on the processing speed of the CPU and the memory size. The larger the number of sentence patterns, the simpler the grammar rules can be; the smaller the number, the more grammar rules are needed for auxiliary judgement.
Under limited hardware, only commonly used subject-verb-object sentences can be listed in jxk, such as "I am on duty". The main body of the database consists of sentence patterns formed by a verb and its object in the given language, such as "tighten management"; verb-complement sentence patterns, such as "write well"; and subject-predicate sentences, such as "it is fine". The database is not limited to basic sentence patterns, however; it can also hold various collocations, such as "university | be", "province | city", "height | building", "bar | ox", "though | ", "with | mode", "copy | all over".
Words and idioms of three or more characters help with sentence cutting and can be included in jxk. Specialised vocabulary is large; as noted above, libraries can be built by subject, or common technical terms can be listed in jxk and uncommon ones in a vocabulary list. Syllable-spanning expressions cannot be found through an ordinary index, but a large vocabulary list can be searched quickly in sequence. If cutting T with the sentence patterns in jxk is highly accurate and no credible S is found in jxk for part of the pinyin string after cutting, the vocabulary list can be searched. In practice, whether to build libraries by subject or to split the data into a basic sentence patterns library and a vocabulary list can be decided by testing the actual results.
Two-character words need attention. For example, the words corresponding to "shengshi" include "provinces and cities" and "momentum". "Provinces and cities" can be kept as a two-character collocation pattern, as in "Hubei | province | Wuhan | city", whereas "momentum" should as far as possible be combined with other words into longer sentence patterns, such as "momentum | great", "make | momentum", "momentum | many".
B. Style analysis, quoted passage information, related information
Language is complex. On top of the reference sentence patterns, style information, quoted passage information, related information and so on can be provided to improve the accuracy of converting phonetic symbol strings into text strings.
Style tendency analysis: a data item L can be used to mark style tendency. Styles are divided into classes such as newspapers and periodicals, official documents, commerce, economics, literature, history, philosophy, mathematics, physics, machinery, electronics, chemistry, biology and education, each corresponding to one bit of L. If an article or sentence in the corpus belongs essentially to n style classes, the corresponding n bits of L are set to "1", giving the style tendency of that article or sentence. Statistical analysis of the corpus yields the style tendency of a basic sentence pattern or word, denoted L_n. For example, the word "foreign minister" is "common" in the 4 style classes "newspapers and periodicals, official documents, literature, history", so the corresponding 4 bits are set to "1", L = 51; the word "buoyancy" is "common" in the 3 style classes "physics, machinery, education", so the corresponding 3 bits are set to "1", L = 8960.
In speech input, the style tendency of a sentence, denoted L_s, is obtained by analysing the L_n of the basic sentence patterns that generate it. The style tendency of a paragraph, chapter or article can be computed by aggregation. Suppose 100 "basic sentence patterns" and words were used to generate a passage. If the 3rd bit denotes "commercial style", take the integer 4, whose 3rd bit is "1", and perform a bitwise AND with the L_n of each of these 100 basic sentence patterns. Records for which the result equals 4, that is, records satisfying 4 & L_n = 4, count; suppose there are 40. If the 13th bit denotes "biology", the integer whose 13th bit is "1" is 4096, and records satisfying 4096 & L_n = 4096 count; suppose there are 10. Since the value 40 for "commercial style" is large, the bits of neighbouring styles such as "official documents", "newspapers and periodicals", "economics" and "history" are then counted as well. If the record count is 45 for "official documents", 40 for "commerce", 37 for "economics" and 15 for "history", and the counts of the 4 classes official documents, newspapers and periodicals, commerce and economics are "clearly" larger than those of the other classes, the corresponding 4 bits are set to 1, denoted L_p = 15. The standards for "common" and "clearly" must be chosen appropriately so that L_n and L_p have a suitable number of bits set; too many or too few produce no effect. Basic sentence patterns and words that are used very frequently and appear in many styles need not be used for style analysis; simply let L_n = 0.
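The bit test described here (for example 4 & L_n = 4 for the 3rd bit) can be sketched as below. The helper names and the sample L_n values are hypothetical and serve only to illustrate the counting.

```python
def style_bit(position):
    """Return the integer whose bit at the given 1-based position is 1."""
    return 1 << (position - 1)

def count_style(ln_values, position):
    """Count the records whose L_n has the bit of this style class set."""
    bit = style_bit(position)
    return sum(1 for ln in ln_values if (ln & bit) == bit)

# Hypothetical L_n values of the patterns used to generate a passage.
ln_values = [4, 5, 4100, 4096, 0, 12]

print(style_bit(3), style_bit(13))   # 4 4096
print(count_style(ln_values, 3))     # 4
print(count_style(ln_values, 13))    # 2
```

A pattern with L_n = 0 never contributes to any count, which matches the suggestion to exclude very frequent patterns from style analysis.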
For a sentence T to be processed, the program may generate several candidate sentences. The L_s of each candidate is obtained by analysing the L_n of the "basic sentence patterns" used to generate it, and is compared with the style tendency L_p of the paragraph, chapter or article level. A sentence satisfying L_s OR L_p = L_p, or an equivalent formula, has high credibility and is preferred. The meaning is that the style tendency of the sentence does not go beyond the overall style tendency of the passage. Otherwise the style tendency does not match, and the sentence is discarded or ranked later among the candidates, while of course also weighing factors such as the j values of the "basic sentence patterns" that generated it. A more ambitious goal is to perform style analysis dynamically in real time: evaluate the sentences already generated and, by comparing L_n with L_p, guide the choice of sentence patterns when subsequent sentences are generated.
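The test L_s OR L_p = L_p can be sketched as below. The ranking helper and the candidate values are hypothetical; a real system would also weigh the j values and other factors mentioned above.

```python
def within_overall_style(ls, lp):
    """True when every style bit set in L_s is also set in L_p."""
    return (ls | lp) == lp

def rank_candidates(candidates, lp):
    """Stable sort that puts candidates whose style fits L_p first."""
    return sorted(candidates, key=lambda c: not within_overall_style(c[1], lp))

# Hypothetical candidate sentences as (text, L_s) pairs; L_p = 15 as in the text.
candidates = [("candidate A", 16), ("candidate B", 5), ("candidate C", 3)]
print(rank_candidates(candidates, lp=15))
# [('candidate B', 5), ('candidate C', 3), ('candidate A', 16)]
```

An equivalent formulation of the same test is (L_s AND NOT L_p) = 0.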
Setting up the quoted passage system: create a quoted passage data bank in which long texts are cut by paragraph and the paragraphs linked to one another, with index numbers given per paragraph. The title of each quoted work, the first sentence of each paragraph and the most notable sentences S are included in jxk with their index numbers. When the pinyin string of T is identical or close to some S, or the text string of the sentence generated from T is identical or close to S, the surrounding text is read out according to the index number, which not only improves accuracy but also saves time. Suppose "the road of university" is included in jxk. If the input T is identical or close to its pinyin, or the sentence generated is "the road of university", the whole paragraph "the road of university, in moral obviously, human-oriented, aim at absolute perfection" and the linked next paragraph "know then has calmly with onlying ..." are read out for the user to confirm. If the user remembers only "in moral obviously", or only the three characters "obviously moral", this quoted passage system fails. However, the quoted passage data bank can be cut by sentence and linked, with a bit mark value W given for each entry and an index built on a bit mark value V. When the user signals "quoted passage", "mingmingde", or the bit mark values V and W of the generated "obviously moral", are used for a "forward retrieval" that screens out R_1, after which fuzzy character matching is carried out and the matching records are read out for the user to confirm. If the CPU is fast enough, the user need not signal "quoted passage"; the program automatically uses T, or the bit mark values V and W of the generated sentence, for a fuzzy search. A fuzzy quoted passage system can also be built in the ordinary inverted file manner, searching the quoted passage data bank fuzzily through an inverted list with T or the generated sentence. The quoted passage data bank can also be cut by sentence and linked front and back, with important sentences incorporated into jxk; after cumulative marking according to the inverted list, a record with j = m may be a quoted passage, and once the user confirms, the following text can be read through the links. Here m is the number of character units of T; the specific method is described below.
In addition, conjunctive words and related information can be provided for the basic sentence patterns in the database. Basic sentence patterns are built mainly around verbs; conjunctive words can be built mainly around nouns and provided according to a knowledge system. Synonymy, near-synonymy, coordination and exclusion relations can be handled through conjunctive words: "liver" and "gall" can be conjunctive words of each other, as can "argon" with "neon" and "radon"; "+" can be a conjunctive word of "add"; "CO2" can be a conjunctive word of "carbon dioxide". If T contains "eryanghuatan", it is converted to "carbon dioxide", but "CO2" is also offered for the user to choose. Multi-level containment relations between concepts can also be handled through conjunctive words, for example pigeon, bird, animal. Listing "pigeon bird", "pigeon animal" and "bird animal" in the database as collocations is possible, but the number of sentence patterns may become too large. Instead, only "pigeon", "bird" and "animal" are listed in the sentence pattern library, with the related concepts listed after "pigeon". If the user inputs T as "gezishiniao", bit mark or prime-replacement inverse retrieval determines that "pigeon" is a reference word; the pinyin "niao" of its conjunctive word "bird" is then matched against T. The match succeeds, so T can be processed into "pigeon shi bird", and further processing with frequency, grammar and the knowledge system may yield "pigeon is a bird". Conjunctive words can also express attribute relations, such as "Shanghai" and "China", "water" and "buoyancy", or "circle", "π" and "radius". Beyond conjunctive words, related information can connect basic sentence patterns to a specialised knowledge system.
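The "gezishiniao" example can be sketched as below. This is a toy illustration only: the dictionaries, the English glosses standing in for Chinese text, and the plain substring replacement are all assumptions, and the retrieval of the reference word itself is taken as already done.

```python
# Hypothetical pinyin of each entry and the conjunctive words listed after "pigeon".
PINYIN = {"pigeon": "gezi", "bird": "niao", "animal": "dongwu"}
CONJUNCTIVE = {"pigeon": ["bird", "animal"]}

def apply_conjunctive_words(t, reference):
    """Convert the reference word, then any conjunctive word whose pinyin occurs in T."""
    out = t.replace(PINYIN[reference], f" {reference} ")
    for word in CONJUNCTIVE.get(reference, []):
        # Only conjunctive words of the reference word are tried, not the whole lexicon.
        out = out.replace(PINYIN[word], f" {word} ")
    return " ".join(out.split())

print(apply_conjunctive_words("gezishiniao", "pigeon"))  # pigeon shi bird
```

The remaining "shi" is left for the later processing with frequency, grammar and the knowledge system.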
Figure A200810005364D00091
Chinese verbs are often used with the auxiliary words "zhe, le, guo", and adjectives with the auxiliary words "de, de, de". These can be handled with grammar rules. Alternatively, verbs or adjectives can be merged with their auxiliary words into forms such as "visit" and "beauty" and included in the database as reference sentence patterns. A third option is to treat the auxiliary words as conjunctive words, which reduces the number of basic sentence patterns. Relations such as classifier and numeral, noun or verb and classifier, adjective or verb and adverb, noun and preposition, noun and locality word, and verb and directional word can be handled similarly, as can the grammatical morpheme changes of various languages; for example, [ist] and the corresponding "est" can be treated as a conjunctive word of adjectives and adverbs to handle the English superlative.
2. Backward reference sentence pattern language analysis method:
Language analysis based on reference sentence patterns can be carried out using "prime number replacing" and the "bit mark character string retrieval technique"; as an alternative, it can also be carried out with the inverted file method. In implementing the "backward reference sentence pattern language analysis method", the character units of jxk, of the inverted list and of the sentence T to be analysed may, depending on the application, be Chinese characters, words of other languages, syllables or phonemes of Chinese pinyin, syllables or phonemes of other languages, phonetic units suited to recognition, and so on. For convenience they are all called "character units", denoted P.
A. The case without repeated character units
This section first explains the basic steps of language analysis with reference sentence patterns when there are no repeated character units. If there are repeated character units, the situation after cumulative marking is more complicated and is explained later. The flow of the backward reference sentence pattern language analysis method can be divided into two stages and four steps:
1. The stage of building the inverted library for language analysis, which belongs to program development, consists of two steps: setting up the basic sentence patterns database and setting up the inverted list. The flow is shown in Figure 1.
Step 1: according to the needs of the application, analyse the language, extract basic sentence patterns (including collocations, phrases, word groups and vocabulary, likewise below) and set up jxk. Count the number k of character units contained in each basic sentence pattern S, provide a field j, and set every j = 0. Provide processing information such as corresponding information, character string structure information, style tendency, frequency, quoted passage information, related information and syntactic information.
In speech input, the character units are Chinese pinyin or the phonetic symbols of the phonetic units suited to acoustic processing in other languages; in machine translation, search engines and intelligent information processing, they are Chinese characters, words of other languages and so on.
Corresponding information: in speech input it is the text string corresponding to the phonetic symbol string of the language; in machine translation it is the corresponding basic sentence pattern of the target language, which need not correspond one to one, so that style tendency analysis can be carried out; in intelligent information processing it can provide instructions, standard sentences, standard vocabulary, headwords, near-synonymous sentence patterns and synonyms.
String structure information states whether the order of the character units forming the basic sentence pattern is fixed or flexible, that is, whether "unordered cooperation" is allowed, and whether other character units may be inserted between the character units, that is, whether "syllable-spanning cooperation" is allowed.
The style tendency indicates in which style classes the basic sentence pattern is common. In machine translation, the style tendency L_n of the source-language sentence pattern and the style tendency L_n of the target-language sentence pattern are both provided. L_n can be used to analyse the L_p of the current document; comparing L_n with L_p, target-language sentence patterns whose L_n agrees with L_p are preferred, which assists sentence generation. Alternatively, the L_n of the sentence patterns used to generate a target sentence are analysed to obtain the L_s of that sentence, which is compared with L_p, and among several candidate sentences a higher rating is given to those whose L_s agrees with L_p.
Frequency: the common term is word frequency, but because the database holds basic sentence patterns, collocations, words and so on, the term frequency is used. It is a statistic over a certain range; for example, if 100,000,000 sentences are counted and a certain basic sentence pattern occurs 2,300 times, its frequency is 0.0023%.
Quoted passage information gives the relevant paragraph number in the quoted passage data bank, or the link information between the cut sentences; machine translation can also use it.
Related information consists of the words, symbols and formulas that often occur together with a basic sentence pattern or word. It can be provided according to a knowledge system or grammar, and can also be linked to a fuller knowledge system. In machine translation it can provide alternative target-language sentences or grammatical morpheme changes; in intelligent information processing, synonyms, near-synonymous sentence patterns and even antonymous sentence patterns can be provided as conjunctive words.
Syntactic information is the syntactic category of a basic sentence pattern, the part of speech of a word and so on. Machine translation must provide not only the syntactic information of the source-language basic sentence pattern and of the corresponding target-language basic sentence pattern, but also needs the support of a grammar system.
The inverted list requires an address d or a sentence pattern number n. In the cumulative marking of step 3, the j value must be read, incremented and written back. If the inverted list of step 2 uses the sentence pattern address, step 3 must compute an offset to obtain the address of the j value; if the inverted list uses the address of the j value, cumulative marking needs no offset computation. If sentence pattern numbers are used, they should be assigned in order to reduce lookup time.
js    k  j  n
abc   3  0  1
bd    2  0  2
ad    2  0  3
Step 2: Build the inverted list of jxk. List in order all character elements P_i (i = 1, 2, 3, ..., w) of the language for this type of application as keywords. Read each basic sentence pattern S from jxk in order; if S contains P_i, look up P_i among the keywords of the inverted list and append the address of the sentence pattern, or the address d of its j value, after P_i. Processing every basic sentence pattern in jxk yields an inverted list called the "inverted d table". Alternatively, give each basic sentence pattern in jxk a number, read each S from jxk in order and, if S contains P_i, append the number of S after P_i; processing all of jxk yields an inverted list called the "inverted n table". For ease of explanation, this document mostly uses sentence pattern numbers. The two kinds of inverted list for the jxk above are illustrated as follows:
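As a minimal sketch of step 2 (not the patent's own code), the "inverted n table" for the three-pattern jxk above (abc, bd, ad) could be built as follows. The names `JXK` and `build_inverted_list` are hypothetical, and an in-memory dictionary stands in for the database.

```python
from collections import defaultdict

# Hypothetical in-memory jxk: sentence pattern number n -> pattern string S.
JXK = {1: "abc", 2: "bd", 3: "ad"}

def build_inverted_list(jxk):
    """Map each character element to the numbers of the patterns containing it.

    Each pattern number is listed once per keyword, even if the element
    occurs several times in the pattern.
    """
    inverted = defaultdict(list)
    for n in sorted(jxk):              # read patterns in numbering order
        for p in sorted(set(jxk[n])):  # each distinct character element of S
            inverted[p].append(n)
    return dict(inverted)

inverted = build_inverted_list(JXK)
print(inverted)
```

Storing numbers rather than addresses keeps the sketch simple; a "d table" would store the location of each j field instead.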
2. Statement analysis phase. It consists of two steps, cumulative marking and statement analysis, and takes place at user run time. See Figure 2 for the flow.
Step 3: Before each T is analyzed, the j field of every record in jxk must be reset to 0. Let the character elements of the sentence T to be analyzed be P_1, P_2, P_3, ..., P_m. Read the first character element P_1 and look it up in the inverted list. For every address d or sentence pattern number n listed after P_1, increase the j value of that record in jxk by 1. Process the remaining character elements P_2, P_3, ..., P_m in the same way. When this is finished, the j value of each record is the number of character elements of T that the basic sentence pattern S contains. This process is called cumulative marking. If T = dca, marking with the inverted list above changes jxk to:
js    k  j  n
abc   3  2  1
bd    2  1  2
ad    2  2  3
If each character element is regarded as a set element and the intersection of the basic sentence pattern S with T is written J, then J = S ∩ T and j is the size of J. The k value is the number of character elements of S, i.e. the size of S. When j = k we have J = S, and since J = T ∩ S it follows that S = T ∩ S. By set algebra, T ∩ S = S ⇔ S ⊆ T. In other words, if j = k, every character element of S appears in T. The 3rd sentence pattern has j = k: all of its character elements, a and d, appear in T. Although "dca" also contains "c" and the order of the character elements differs, the inverted list reveals the connection between "dca" and "ad". This property is the same as the divisibility test of "prime number replacement" and suits the "unordered" and "cross-syllable cooperation" phenomena of language. The 1st and 2nd sentence patterns have 0 < j < k: they share some character elements with T and are fault-tolerant sentence patterns in the broad sense.
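The marking and the j = k query can be sketched as below for T = dca. This is an illustrative sketch only; `cumulative_mark` and `reference_patterns` are hypothetical names, and the data simply mirrors the three-pattern example.

```python
# Hypothetical data mirroring the example: pattern number -> pattern string.
JXK = {1: "abc", 2: "bd", 3: "ad"}
INVERTED = {"a": [1, 3], "b": [1, 2], "c": [1], "d": [2, 3]}

def cumulative_mark(t, inverted, jxk):
    """Return {n: j}: for each pattern, how many elements of T it contains."""
    j = {n: 0 for n in jxk}            # reset every j to 0 before each analysis
    for p in t:                        # each character element of T
        for n in inverted.get(p, []):  # patterns listed after keyword p
            j[n] += 1
    return j

def reference_patterns(j, jxk):
    """Patterns with j == k, i.e. every element of S occurs in T."""
    return [n for n in sorted(jxk) if j[n] == len(jxk[n])]

j = cumulative_mark("dca", INVERTED, JXK)
refs = reference_patterns(j, JXK)
print(j, refs)
```

The sketch assumes neither S nor T repeats a character element; with repeats, j no longer equals the size of the intersection.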
Step 4: Query all sentence patterns S with j = k. These are the "reference sentence patterns" of T, and some of them are selected as the "basic sentence pattern" used to process T. Language exhibits "unordered cooperation", but not arbitrarily: for example, "realize # ideal" cannot be treated as "real # ize ideal". Some sentence patterns allow only "cross-syllable cooperation" and not "unordered cooperation": for example, "in ... manner" cannot be used as "manner ... in". So not every S with j = k can be used to process T. The structure of S must be compared with the character elements of T and invalid S rejected, which requires that character string structure information be provided when the database is built in step 1. In the result set R_1 of the j = k query, by the "cooperation phenomenon" of language, the larger the k value (and hence the j value), the more definite the semantics and the more reliable the processing of T. There is therefore no need to compare the structure of every S in R_1 with T; only the S with large j and k need to be checked. In speech input, the sizes of k and j can be weighed together with style tendency analysis, frequency and so on.
In a search engine, comparing T with a few reference sentence patterns that have large k values completes the segmentation of T. In intelligent information processing, the user's spoken statement is T. Determining the "basic sentence pattern" by cumulative marking makes it possible to discard unimportant information in T and to obtain instructions, standard statements, standard vocabulary, head words, synonymous and near-synonymous sentence patterns, antonymous sentence patterns and so on from the corresponding information and associated words. The computer either executes the instruction directly or synthesizes an instruction by analyzing the standard statements and standard vocabulary, and then carries out the appropriate operation; this can be used for human-computer interaction. Head words and synonymous, near-synonymous and antonymous sentence patterns can be used for intelligent information search. In machine translation, the inverted reference sentence pattern language analysis method ensures that the source statement is segmented correctly and identifies the core and auxiliary constituents of the source sentence, which can be described as letting the computer understand the "sentence". The target statement is then synthesized from the target-language basic sentence patterns and words given in the corresponding information, with the support of a grammar system.
B. The case of repeated character elements
The above does not consider S or T containing repeated character elements. A statement may, however, repeat a character element, and in speech input this happens in a certain proportion of cases. Suppose there are 3 basic sentence patterns: S_1: aabb, k=4; S_2: abc, k=3; S_3: acd, k=3. The inverted list can be built in two ways: 1. However many times character a occurs in basic sentence pattern n, n appears only once after keyword a in the inverted list; this is called the single table, denoted dpb=1. 2. If character a occurs m times in basic sentence pattern n, n appears m times after keyword a; this is called the repeat table, denoted dpb=2.
Keyword  Single table  Repeat table
a        1, 2, 3       1, 1, 2, 3
b        1, 2          1, 1, 2
c        2, 3          2, 3
d        3             3
Suppose there are two keywords to be processed: T_1 = "aabbcef" and T_2 = "abce". The reference sentence patterns of T_1 should be S_1 and S_2; the reference sentence pattern of T_2 should be S_2. If the repeated a and b of T_1 are not removed and the single table is used for cumulative marking, the result is S_1: aabb, k=4, j=4; S_2: abc, k=3, j=5; S_3: acd, k=3, j=3. Querying the sentence patterns with j = k then omits S_2 and lets S_3 slip in. The reason is that a and b are repeated in T_1 and are therefore each marked twice, so every record that contains a or b has its j value increased twice. For S_1 this causes no problem, but for S_2 and S_3 the j value no longer reflects the size of the intersection of the character elements of S and T. Depending on which inverted list is used and on whether T contains repeated character elements, the j value after marking differs. Three remedies can be considered: record in jxk the number h of distinct character elements in S; remove the repeated character elements of T before cumulative marking; modify the query condition.
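The effect described above can be reproduced with a short sketch. The helper names `build_table` and `mark` are hypothetical; the data is the three-pattern example S_1 to S_3.

```python
S = {1: "aabb", 2: "abc", 3: "acd"}

def build_table(patterns, repeat):
    """Inverted list: repeat=False gives the single table (dpb=1),
    repeat=True lists n once per occurrence of the element (dpb=2)."""
    table = {}
    for n in sorted(patterns):
        s = patterns[n]
        for p in (s if repeat else sorted(set(s))):
            table.setdefault(p, []).append(n)
    return table

def mark(t, table, patterns):
    """Cumulative marking: add 1 to j for every listing hit by an element of T."""
    j = {n: 0 for n in patterns}
    for p in t:
        for n in table.get(p, []):
            j[n] += 1
    return j

single = build_table(S, repeat=False)
repeat = build_table(S, repeat=True)
j_raw = mark("aabbcef", single, S)            # T_1 marked without removing repeats
hits = [n for n in S if j_raw[n] == len(S[n])]
print(j_raw, hits)
```

With the repeats of T_1 left in place, the j = k query returns S_1 and S_3 but misses S_2, which is the omission the text describes.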
First analyze the cases that arise after marking with the single inverted list. In the tables, "a[a]b[b]cef" means that the repeated "a, b" were removed before cumulative marking; "abce[]" means that this keyword has no repeated character elements to remove; + means S appears in R_1; - means S does not appear in R_1; y means that the appearance or absence of S matches the intent; n means that it does not match the intent; * means that S appears in R_1 but is redundant.
S     k  j  dpb  T            j=k   j=k or j>k
aabb  4  4  1    aabbcef      +y    +y
abc   3  5  1    aabbcef      -n    +y
acd   3  3  1    aabbcef      +*    +*
aabb  4  2  1    abce         -y    -y
abc   3  3  1    abce         +y    +y
acd   3  2  1    abce         -y    -y
Scheme                        1-No  2-Ok
aabb  4  2  1    a[a]b[b]cef  -n    -n
abc   3  3  1    a[a]b[b]cef  +y    +y
acd   3  2  1    a[a]b[b]cef  -y    -y
aabb  4  2  1    abce[]       -y    -y
abc   3  3  1    abce[]       +y    +y
acd   3  2  1    abce[]       -y    -y
Scheme                        3-No  4-No
S     h  j  dpb  T            j=h   j=h or j>h
aabb  2  4  1    aabbcef      -n    +y
abc   3  5  1    aabbcef      -n    +y
acd   3  3  1    aabbcef      +*    +*
aabb  2  2  1    abce         +*    +*
abc   3  3  1    abce         +y    +y
acd   3  2  1    abce         -y    -y
Scheme                        5-No  6-Ok
aabb  2  2  1    a[a]b[b]cef  +y    +y
abc   3  3  1    a[a]b[b]cef  +y    +y
acd   3  2  1    a[a]b[b]cef  -y    -y
aabb  2  2  1    abce[]       +*    +*
abc   3  3  1    abce[]       +y    +y
acd   3  2  1    abce[]       -y    -y
Scheme                        7-Ok  8-Ok
Summary of the tables: when jxk has the k value and the repeated character elements of T are not removed, the query condition can be relaxed to "j=k or j>k", but R_1 then contains redundant S; if the repeated character elements of T are removed, both j=k and "j=k or j>k" cause omissions. When jxk has the h value and the repeated character elements of T are not removed, j=h cannot be used but "j=h or j>h" can, although R_1 contains redundant S; if the repeated character elements of T are removed, either j=h or "j=h or j>h" can be used, again with redundant S. Of course, "j=k or j>k" and similar conditions can also be written as "j>k-1".
With the 4 feasible schemes 2, 6, 7 and 8, R_1 always contains redundant S after the query. When the "basic sentence pattern" is selected from R_1, the character elements of S and T should be compared and two kinds of unqualified S rejected: if any character element of S cannot be found in T, S is a redundant record and is discarded; if the structures of S and T do not agree, S is also discarded. There are some minor issues here. In some schemes aabb is a redundant reference sentence pattern of abce. If aabb is fixed, it can be rejected by structure comparison. If it is flexible, such as a#a#b#b, and the 2nd character element a of S is again compared starting from the beginning of T, this S cannot be rejected. A reliable way to check the "unordered cooperation" phenomenon is therefore as follows: compare the 1st character element a of S with the 1st character element of T; on success, when # is encountered, record the current position i of T; compare the 2nd character element a of S with the subsequent character elements of T; if this fails, go back and compare from the beginning of T up to i-1; if it still fails, S is unqualified and is discarded.
Judging by the amount of redundant S in R_1 and the simplicity of the query condition, schemes 2 and 7 are the better ones. Scheme 7 has the simplest query condition and the fewest redundant S in R_1. But the h value cannot fully reflect how many character elements S has: when the keyword is T_1 and the "basic sentence pattern" is determined from R_1, "aabb" does not get priority. Schemes 2 and 7 can therefore be combined by providing both k and h in the database: query with j=h to obtain R_1, then prefer the "basic sentence pattern" by the size of k.
Going further, T is analyzed first. If it has no repeated character elements, mark cumulatively, query with j=k, and prefer the "basic sentence pattern" by the size of k. If it has repeated character elements, remove them, mark cumulatively, query with j=h, and prefer the "basic sentence pattern" by the size of k:
S     k  h  j  dpb  T            Query  Result
aabb  4  2  2  1    a[a]b[b]cef  j=h    +y
abc   3  3  3  1    a[a]b[b]cef  j=h    +y
acd   3  3  2  1    a[a]b[b]cef  j=h    -y
aabb  4  2  2  1    abce         j=k    -y
abc   3  3  3  1    abce         j=k    +y
acd   3  3  2  1    abce         j=k    -y
Suppose T_3 = "abccd", with 5 character elements, one of them repeated. After the extra c is removed it becomes "abcd". After cumulative marking S_1 has j=2, so querying with j=h lets "aabb" enter R_1, where it is redundant. In other words, because T_3 has one repeated character element c, the j=h query lets basic sentence patterns formed by repeating the other character elements a, b, d of T_3, such as aad or bbad, enter R_1. Because the character elements in language analysis are syllables, words or Chinese characters, the amount of such redundancy is small.
Next analyze the cases that arise after marking with the repeat inverted list:
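The combined use of k and h with the single table can be sketched as follows. It is an illustration under stated assumptions, not the patent's implementation: `dedupe` and `query` are hypothetical names, h is computed on the fly as the number of distinct elements of S, and ties in k keep numbering order.

```python
S = {1: "aabb", 2: "abc", 3: "acd"}
SINGLE = {"a": [1, 2, 3], "b": [1, 2], "c": [2, 3], "d": [3]}

def dedupe(t):
    """Drop repeated character elements, keeping first occurrences in order."""
    return "".join(dict.fromkeys(t))

def query(t):
    """Return R_1 as pattern numbers sorted by descending k."""
    has_repeats = len(set(t)) < len(t)
    j = {n: 0 for n in S}
    for p in dedupe(t):
        for n in SINGLE.get(p, []):
            j[n] += 1
    # Compare j with k if T has no repeats, otherwise with h (distinct elements of S).
    target = (lambda s: len(set(s))) if has_repeats else len
    r1 = [n for n in S if j[n] == target(S[n])]
    return sorted(r1, key=lambda n: -len(S[n]))

print(query("aabbcef"))  # T_1, repeats removed, j=h query
print(query("abce"))     # T_2, no repeats, j=k query
print(query("abccd"))    # T_3, aabb enters R_1 as a redundant record
```

The redundant aabb returned for T_3 still has to be rejected by the structure comparison described earlier.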
S     k  j  dpb  T            j=k    j=k or j>k
aabb  4  8  2    aabbcef      -n     +y
abc   3  5  2    aabbcef      -n     +y
acd   3  3  2    aabbcef      +*     +*
aabb  4  4  2    abce         +*     +*
abc   3  3  2    abce         +y     +y
acd   3  2  2    abce         -y     -y
Scheme                        9-No   10-Ok
aabb  4  4  2    a[a]b[b]cef  +y     +y
abc   3  3  2    a[a]b[b]cef  +y     +y
acd   3  2  2    a[a]b[b]cef  -y     -y
aabb  4  4  2    abce[]       +*     +*
abc   3  3  2    abce[]       +y     +y
acd   3  2  2    abce[]       -y     -y
Scheme                        11-Ok  12-Ok
S     h  j  dpb  T            j=h    j=h or j>h
aabb  2  8  2    aabbcef      -n     +y
abc   3  5  2    aabbcef      -n     +y
acd   3  3  2    aabbcef      +*     +*
aabb  2  4  2    abce         -y     +*
abc   3  3  2    abce         +y     +y
acd   3  2  2    abce         -y     -y
Scheme                        13-No  14-Ok
aabb  2  4  2    a[a]b[b]cef  -n     +y
abc   3  3  2    a[a]b[b]cef  +y     +y
acd   3  2  2    a[a]b[b]cef  -y     -y
aabb  2  4  2    abce[]       -y     +*
abc   3  3  2    abce[]       +y     +y
acd   3  2  2    abce[]       -y     -y
Scheme                        15-No  16-Ok
The 5 schemes 10, 11, 12, 14 and 16 are feasible, and all of them produce redundant S after the query. k and h can also both be provided in the database. If T has no repeated character elements, mark cumulatively, query with j=h, and determine the "basic sentence pattern" by the k value. If T has repeated character elements, remove them, mark cumulatively, query with j=k, and determine the "basic sentence pattern" by the k value:
S     k  h  j  dpb  T            Query  Result
aabb  4  2  4  2    a[a]b[b]cef  j=k    +y
abc   3  3  3  2    a[a]b[b]cef  j=k    +y
acd   3  3  2  2    a[a]b[b]cef  j=k    -y
aabb  4  2  4  2    abce         j=h    -y
abc   3  3  3  2    abce         j=h    +y
acd   3  3  2  2    abce         j=h    -y
T_3 = "abccd" has 5 character elements, one of them repeated. After the extra c is removed it becomes "abcd". After cumulative marking S_1 has j=4, so querying with j=k lets "aabb" enter R_1, where it is redundant. In other words, because T_3 has one repeated character element c, the j=k query lets basic sentence patterns formed by repeating the other character elements a, b, d of T_3, such as aad or bbad, enter R_1. Because the character elements in language analysis are syllables, words or Chinese characters, the amount of such redundancy is small.
The choice of query condition is influenced by two factors. 1. When a character element P_i of basic sentence pattern S occurs m times, whether the address of S, the address d of its j value, or its number n appears m times or only once after keyword P_i in the inverted list. As stated above, dpb=1 means the address d or number n appears once; dpb=2 means it appears m times. This parameter is of course dispensable, because a program normally uses only one kind of inverted list, and the scheme is fixed at design time. 2. Whether the keyword T to be processed has repeated character elements, and whether they are removed before cumulative marking.
Going further, if the hardware permits, the program can hold both the single table and the repeat table. For T_3 = abccd, in which c is repeated, c is marked according to the repeat table, and only once; this is written (cc) in the table below. The other character elements a, b, d are each marked once according to the single table:
S     k  h  j  dpb  T            j=k  j=h or j>h
aabb  4  2  4  1+2  (aa)(bb)cef  +y   +y
abc   3  3  3  1+2  (aa)(bb)cef  +y   +y
acd   3  3  2  1+2  (aa)(bb)cef  -y   -y
aabb  4  2  2  1+2  abce         -y   +*
abc   3  3  3  1+2  abce         +y   +y
acd   3  3  2  1+2  abce         -y   -y
aabb  4  2  2  1+2  ab(cc)d      -y   +*
abc   3  3  3  1+2  ab(cc)d      +y   +y
acd   3  3  3  1+2  ab(cc)d      +y   +y
Querying with "j=h or j>h" produces redundancy. Querying with j=k appears to eliminate it, but not completely. Suppose there are the following sentence patterns:
n  S        k
4  aab      3
5  aabb     4
6  aaabbb   6
7  aaaabbd  7
8  aaaaacd  7
The single inverted list and the repeat inverted list are:
Keyword  Single table   Repeat table
a        4, 5, 6, 7, 8  4, 4, 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8
b        4, 5, 6, 7     4, 5, 5, 6, 6, 6, 7, 7
c        8              8
d        7, 8           7, 8
After cumulative marking:
n  S        k  j  dpb  T            j=k
4  aab      3  3  1+2  (aa)(bb)cef  +y
5  aabb     4  4  1+2  (aa)(bb)cef  +y
6  aaabbb   6  6  1+2  (aa)(bb)cef  +n
7  aaaabbd  7  6  1+2  (aa)(bb)cef  -y
8  aaaaacd  7  6  1+2  (aa)(bb)cef  -y
4  aab      3  3  1+2  (aa)(bb)cd   +y
5  aabb     4  4  1+2  (aa)(bb)cd   +y
6  aaabbb   6  6  1+2  (aa)(bb)cd   +n
7  aaaabbd  7  7  1+2  (aa)(bb)cd   +n
8  aaaaacd  7  7  1+2  (aa)(bb)cd   +n
For T_1 = "aabbcef", S_6 enters R_1 after marking and is redundant. For T_4 = "aabbcd", S_6, S_7 and S_8 enter R_1 after marking and are redundant. (aa) and (bb) are repeated and are each marked once with the repeat table; c and d are each marked once with the single table. S_8 gets j = 5+0+1+1 = 7 because, although a occurs only 2 times in T_4, sentence pattern 8 is listed 5 times after keyword a in the repeat table. To eliminate redundancy completely, the inverted list must be a grouped table, denoted dpb=3:
Keyword  Group 1  Group 2  Group 3  Group 4  Group 5  Group 6
a        -        4, 5     6        7        8        -
b        4        5, 7     6        -        -        -
c        8        -        -        -        -        -
d        7, 8     -        -        -        -        -
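A grouped table of this kind can be derived by counting how often each character element occurs in each pattern. The sketch below is illustrative; `build_grouped` is a hypothetical name, and nested dictionaries stand in for whatever on-disk layout an implementation would choose.

```python
from collections import Counter

PATTERNS = {4: "aab", 5: "aabb", 6: "aaabbb", 7: "aaaabbd", 8: "aaaaacd"}

def build_grouped(patterns):
    """grouped[p][g] lists the patterns in which element p occurs exactly g times."""
    grouped = {}
    for n in sorted(patterns):
        for p, count in Counter(patterns[n]).items():
            grouped.setdefault(p, {}).setdefault(count, []).append(n)
    return grouped

GROUPED = build_grouped(PATTERNS)
for keyword in sorted(GROUPED):
    print(keyword, dict(sorted(GROUPED[keyword].items())))
```

Empty groups are simply absent from the dictionaries, which corresponds to the "-" cells of the table.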
S_7 = aaaabbd: a occurs 4 times, so group 4 after keyword a lists sentence pattern number 7; b occurs 2 times, so group 2 after keyword b lists number 7; d occurs once, so group 1 after keyword d lists number 7. To save space, the sentence pattern numbers n or addresses d can be stored contiguously by group, with markers inserted between groups, or with the start position and length of each group given after keyword P.
The "basic principle" section mentioned that, in speech input, the important statements of the quotation database are merged into jxk, and that after cumulative marking a record with j=m may be a quotation. With the grouped inverted list, the method of "backward compatible, upward push" eliminates redundancy completely and searches "quotations" and "reference sentence patterns" at the same time. The operation is illustrated below; for the flow see Figure 3:
T_5 = "aaabb", m=5. a occurs 3 times, written q=3. Group 3 of keyword a contains sentence pattern 6, so the j value of pattern 6 is increased by 3. "Backward compatible": group 2 contains patterns 4 and 5, and the j value of each is increased by 2. "Upward push": group 4 contains pattern 7, whose j value is increased by 3; group 5 contains pattern 8, whose j value is increased by 3. b occurs 2 times, written q=2. Group 2 contains patterns 5 and 7, and the j value of each is increased by 2. "Backward compatible": group 1 contains pattern 4, whose j value is increased by 1. "Upward push": group 3 contains pattern 6, whose j value is increased by 2. In a program, let the group number be i and search the groups starting from i=1: when i <= q, j = j + i; when i > q, j = j + q. The marking result is shown in the table below. Records with j=k are "reference sentence patterns"; records with k > j > 0 are "fault-tolerant sentence patterns" in the broad sense, of which those with j = k-1 are the most significant. Records with j=m are "quotations".
n  S        k  j  dpb  T          j=k  j=k-1  j=m
4  aab      3  3  3    (aaa)(bb)  +    -      -
5  aabb     4  4  3    (aaa)(bb)  +    -      -
6  aaabbb   6  5  3    (aaa)(bb)  -    +      +
7  aaaabbd  7  5  3    (aaa)(bb)  -    -      +
8  aaaaacd  7  3  3    (aaa)(bb)  -    -      -
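The "backward compatible, upward push" rule (add i when the group number i is at most q, otherwise add q) amounts to adding min(i, q). A minimal sketch for T_5 = "aaabb" follows; `mark_grouped` and the list names are hypothetical.

```python
from collections import Counter

PATTERNS = {4: "aab", 5: "aabb", 6: "aaabbb", 7: "aaaabbd", 8: "aaaaacd"}
GROUPED = {
    "a": {2: [4, 5], 3: [6], 4: [7], 5: [8]},
    "b": {1: [4], 2: [5, 7], 3: [6]},
    "c": {1: [8]},
    "d": {1: [7, 8]},
}

def mark_grouped(t, grouped, patterns):
    """Cumulative marking with a grouped inverted list."""
    j = {n: 0 for n in patterns}
    for p, q in Counter(t).items():            # element p occurs q times in T
        for i, numbers in grouped.get(p, {}).items():
            for n in numbers:
                j[n] += min(i, q)              # i <= q: add i; i > q: add q
    return j

t5 = "aaabb"
m = len(t5)
j = mark_grouped(t5, GROUPED, PATTERNS)
reference = [n for n in PATTERNS if j[n] == len(PATTERNS[n])]      # j = k
near_miss = [n for n in PATTERNS if j[n] == len(PATTERNS[n]) - 1]  # j = k-1
quotation = [n for n in PATTERNS if j[n] == m]                     # j = m
print(j, reference, near_miss, quotation)
```

With this rule j equals the size of the multiset intersection of S and T, which is why j=k and j=m become exact containment tests.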
With the query condition j=k or j=m, the record set R_1 of "reference sentence patterns" and "quotations" is obtained. It can be sorted by descending j value, and the comparability of the character element structure checked starting from the record with the largest j value. A record whose structure is comparable and whose j=m may be a "quotation": read the corresponding Chinese character string and let the user decide whether to take the text C that corresponds to T or to read the context via the link. If there is no record with j=m, prefer the sentence patterns with large k values among those whose character element structure matches, also weighing factors such as frequency, style tendency and grammar, and use them as the "basic sentence pattern" to analyze T.
To sum up, the inverted list can be a single table, a repeat table or a grouped table, and jxk can provide the k value, the h value, or both; with different query conditions these form multiple schemes for obtaining R_1. If g = k-h, i.e. the number of repeated occurrences of character elements in a basic sentence pattern, then providing the fields k and g, or h and g, or even k, h and g together in jxk forms further schemes for obtaining R_1, but without any essential difference. R_1 contains redundancy. The grouped inverted list eliminates it but is more complex, and comparing the character elements of S and T also rejects the unqualified records in R_1. Applications therefore need not use the grouped inverted list; the scheme should be decided according to the basic sentence patterns S, the keywords T to be processed and the hardware performance.
The basic sentence pattern database jxk contains many kinds of information and is fairly large, so it may not reside entirely in memory. A sub-table jxkcopy can be created in memory that holds n or d together with fields such as k and j; cumulative marking and the query are performed on jxkcopy, and the information of the selected sentence patterns is then read from jxk. Alternatively, a temporary table jxktemp can be created in memory that holds n or d and the j field: whenever a character element of T refers to some n or d in the inverted list, a record is created in jxktemp and marked there; when marking is finished, the results are written to jxk and the query is run.
Building the inverted list by character element in this method is similar to building an inverted index by word in some search engines, and existing inverted list compression techniques can be borrowed to reduce the size of the inverted list. Compared with ordinary inverted-list methods, the main feature of the inverted reference sentence pattern language analysis method is that fields such as k and j are set up in jxk. If j=k, T contains the character elements of S, which achieves the effect of "reverse retrieval" in the "prime number replacement string search technique". If j=m, S contains the character elements of T, which achieves the effect of "forward retrieval" in that technique. If S or T contains repeated character elements the situation is more complex, but it can be handled by providing h in jxk, adjusting the inverted list, analyzing T, and modifying the query condition.
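The temporary-table idea can be sketched with a counter that only creates a record for patterns that T actually touches. This is an assumption-laden illustration: `mark_sparse` is a hypothetical name, `K` stands in for the k field of jxkcopy, and the data is the small abc / bd / ad example.

```python
from collections import Counter

INVERTED = {"a": [1, 3], "b": [1, 2], "c": [1], "d": [2, 3]}
K = {1: 3, 2: 2, 3: 2}   # pattern number -> k, as kept in a small in-memory sub-table

def mark_sparse(t, inverted):
    """Temporary table: a record exists only for patterns referenced by T."""
    jxktemp = Counter()
    for p in t:
        jxktemp.update(inverted.get(p, []))   # add 1 to j for each listed pattern
    return jxktemp

jxktemp = mark_sparse("dca", INVERTED)
references = sorted(n for n, j in jxktemp.items() if j == K[n])
print(dict(jxktemp), references)
```

Because untouched patterns never get a record, there is no need to reset every j to 0 before each analysis.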
3. Speech input method using inverted reference sentence patterns
The previous section explained the general principle and method of inverted reference sentence pattern language analysis. This section uses Chinese speech input to explain the concrete steps and the fault-tolerance handling of the method; applications to other languages and other fields can refer to it. One point should be explained first: in the sound-to-text conversion of speech input, the operation performed most often is "syllable comparison", not "character comparison". So in building the database, inverting, marking and converting, each syllable can be represented by one Chinese character or other symbol. For example, "副" represents "fu", "里" represents "li" and "号" represents "hao"; this is called "syllable substitute characters". The benefit is that matching "号" against "副里号" is easier to position than matching "hao" against "fulihao", saves space and is faster.
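The positioning benefit of one symbol per syllable can be illustrated with a small sketch. The mapping below is an assumption made for illustration (one Chinese character chosen per syllable); `SYLLABLE_TO_CHAR` and `encode` are hypothetical names.

```python
# Hypothetical one-symbol-per-syllable mapping.
SYLLABLE_TO_CHAR = {"fu": "副", "li": "里", "hao": "号"}

def encode(syllables):
    """Encode a syllable sequence as a string with exactly one character per syllable."""
    return "".join(SYLLABLE_TO_CHAR[s] for s in syllables)

encoded = encode(["fu", "li", "hao"])
syllable_index = encoded.index(SYLLABLE_TO_CHAR["hao"])  # index equals the syllable position
letter_offset = "fulihao".find("hao")                    # offset in letters, not syllables
print(encoded, syllable_index, letter_offset)
```

In the encoded form a string index is directly a syllable index, whereas in the raw pinyin string the offset depends on the length of every preceding syllable.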
A. Steps of the inverted reference sentence pattern method in speech input
1. In Chinese speech input, the character element is a toneless or toned Chinese syllable. Speed and accuracy have to be balanced, so the number of basic sentence patterns should be moderate. The table below shows the layout of jxk for speech input; information such as quotation links can also be added:
Number  Pinyin string    k  j  Chinese character string  Syntactic information  Style  Associated word  Frequency %
45886   fuli#hao         3  -  welfare is good           noun-adjective         13     very             0.0157
45893   fuli#cha         3  -  welfare is poor           noun-adjective         13     very             0.0113
88544   hao#gongfu       3  -  the good time             adjective-noun         16     all over         0.0137
98253   yifang&shi       3  -  one side is               subject-copula         8205   in addition      0.0053
98969   yi&fangshi       3  -  in ... manner             preposition-object     8205   -                0.0079
173561  qianghua#guanli  4  -  tighten management        verb-noun              14     -                0.0017
& indicates that other words may be inserted at this position, i.e. the sentence pattern allows "cross-syllable cooperation". Suppose that after the query 98253 is a reference sentence pattern and the user input is "yihefafangshi". Scanning from the first syllable, yi matches, but the second syllable does not match, so the pattern is invalid. Then process 98969: scanning from the first syllable, yi matches; & means that syllables may be skipped, so fang is compared with he, then with fa, and so on until it matches fang; shi then matches, so the pattern is valid and T is processed as "in the hefa manner".
# indicates "unordered cooperation", i.e. the order of the character elements may be exchanged at this position. Suppose that after the query qianghua#guanli is a reference sentence pattern. Start matching qiang from the 1st syllable of T. When a matching character element is found, continue by matching hua against the next syllable; if this fails, the structure is not comparable and the pattern is discarded. If hua matches, # indicates "unordered": record the current position i of T and match guan forward from the position after hua. If this fails, go back and match from the 1st syllable of T up to i-1; if that also fails, discard the pattern. If the match succeeds, the character element structure is comparable.
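The two structure checks can be sketched as follows. The representation is an assumption: each pattern is given as a list of segments (lists of syllables) split at the marker, and a pattern uses only one kind of marker. `find_run`, `match_skip` and `match_unordered` are hypothetical names.

```python
def find_run(t, seg, start, stop):
    """First index i with start <= i and i + len(seg) <= stop where seg occurs contiguously."""
    for i in range(start, stop - len(seg) + 1):
        if t[i:i + len(seg)] == seg:
            return i
    return -1

def match_skip(t, segments):
    """'&' patterns: segments in order; other syllables may sit between them."""
    for first in range(len(t)):
        pos, spans = first, []
        for k, seg in enumerate(segments):
            # The first segment must start exactly at `first`; later ones may skip ahead.
            stop = pos + len(seg) if k == 0 else len(t)
            i = find_run(t, seg, pos, stop)
            if i < 0:
                break
            spans.append(i)
            pos = i + len(seg)
        else:
            return spans
    return None

def match_unordered(t, segments):
    """'#' patterns: a later segment may follow or precede the earlier one."""
    i = find_run(t, segments[0], 0, len(t))
    if i < 0:
        return None
    spans, last = [i], i + len(segments[0]) - 1   # last = position of last matched syllable
    for seg in segments[1:]:
        nxt = find_run(t, seg, last + 1, len(t))  # try forward first
        if nxt < 0:
            nxt = find_run(t, seg, 0, last)       # then from the start of T up to last-1
        if nxt < 0:
            return None
        spans.append(nxt)
        last = nxt + len(seg) - 1
    return spans

T = "zhe jia gong si fu li hao you jin tie".split()
U = "yi he fa fang shi".split()
print(match_unordered(T, [["hao"], ["gong", "si"]]))
print(match_skip(U, [["yi"], ["fang", "shi"]]))
```

Each function returns the start position in T of every segment, or None when the structure is not comparable.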
2. Speech input builds the inverted list by syllable. Chinese has just over 400 toneless syllables, one of which is "fu". Sentence pattern 45886 contains the syllable "fu", so its number 45886 is listed after the keyword "fu" in the inverted list:
Syllable  Sentence pattern numbers
fu 11678,45886,45893,78253,88544,299555
li 45886,45893,78253,173561,204891
yi 98253,98969
3. Suppose the pinyin string T input by the user is "zhe|jia|gong|si|fu|li|hao|you|jin|tie". Cumulative marking gives the following table:
Number  Pinyin string    k  j  Chinese character string  Syntactic information  Style  Associated word  Frequency %
88342   hao#gongsi       3  3  good company              adjective-noun         13     family           0.0237
45886   fuli#hao         3  3  welfare is good           noun-adjective         13     very             0.0157
88544   hao#gongfu       3  3  the good time             adjective-noun         16     all over         0.0137
56894   you#jintie       3  3  has subsidy               noun                   14     -                0.0127
78253   you#fuli         3  3  has buoyancy              verb-noun              8960   not yet          0.0093
45893   tie#jiagong      3  3  ironworking               noun                   520    very             0.0073
45893   fuli#cha         3  2  welfare is poor           noun-adjective         13     very             0.0113
173561  qianghua#guanli  4  1  tighten management        verb-noun              14     -                0.0017
98969   yi&fangshi       3  0  in ... manner             preposition-object     8205   -                0.0079
When there are no repeated character elements, the j value of each record is the number of character elements that the basic sentence pattern S shares with T. The j value of 45886 is 3: the 3 character elements "fu, li, hao" also occur in T. The j value of fuli#cha is 2: the 2 character elements "fu, li" also occur in T. The j value of 173561 is 1: the 1 character element "li" also occurs in T.
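The j values of the table can be reproduced with a syllable-level version of cumulative marking. In this sketch the patterns are keyed by their pinyin strings and their syllable segmentation is written out by hand; both choices are assumptions made for illustration.

```python
# Hypothetical syllable segmentation of the example patterns.
PATTERNS = {
    "hao#gongsi": ["hao", "gong", "si"],
    "fuli#hao": ["fu", "li", "hao"],
    "hao#gongfu": ["hao", "gong", "fu"],
    "you#jintie": ["you", "jin", "tie"],
    "you#fuli": ["you", "fu", "li"],
    "tie#jiagong": ["tie", "jia", "gong"],
    "fuli#cha": ["fu", "li", "cha"],
    "qianghua#guanli": ["qiang", "hua", "guan", "li"],
    "yi&fangshi": ["yi", "fang", "shi"],
}

T = "zhe|jia|gong|si|fu|li|hao|you|jin|tie".split("|")

# Inverted list by syllable (single table: each pattern listed once per syllable).
inverted = {}
for name, syllables in PATTERNS.items():
    for syl in set(syllables):
        inverted.setdefault(syl, []).append(name)

# Cumulative marking.
j = {name: 0 for name in PATTERNS}
for syl in T:
    for name in inverted.get(syl, []):
        j[name] += 1

reference = [name for name in PATTERNS if j[name] == len(PATTERNS[name])]
print(j)
print(reference)
```

Note that hao#gongfu reaches j = k even though "gong" and "fu" are not adjacent in T; only the later structure check rejects it.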
4. Determine the "basic sentence pattern" from the "reference sentence patterns". There are many ways to convert a "phonetic symbol string" into a "text string"; one conversion scheme is described below for reference:
Suppose the pinyin string T converted from the user's speech input is the above "zhejiagongsifulihaoyoujintie", which has 10 character elements, so m=10. After cumulative marking, the reference sentence patterns with the highest combined evaluation of k value and frequency are, in order: "hao#gongsi, fuli#hao, hao#gongfu, you#jintie, you#fuli, tie#jiagong, ...", and the corresponding Chinese character strings mean "good company", "welfare is good", "the good time", "has subsidy", "has buoyancy", "ironworking", and so on. From these, n sentence patterns can be selected as the "basic sentence pattern". In Chinese, a statement has about 8-15 characters and a basic sentence pattern about 3-5 characters, so to generate 2-3 candidate statements an n value of 5-10 is estimated to be enough. First define a character array A[n][m] whose elements can each store one Chinese character; if the system treats a Chinese character as 2 characters, a 3-dimensional array has to be defined. As an example, set n to 5, i.e. define the character array A[5][10]: each row corresponds to one basic sentence pattern and each column corresponds to one character element of T. Because the elements of A[5][10] are indexed from 0, the sentence pattern with the highest evaluation is called sentence pattern u=0 below, followed by sentence patterns 1, 2, 3, 4; the character elements of T are likewise called character elements i = 0, 1, 2, ... in order.
A. Check structural comparability and determine the basic sentence pattern.
First take the sentence pattern u=0 with the highest combined evaluation and check whether its structure is comparable with that of the corresponding character elements of T. If the structure is not comparable, discard the sentence pattern. If it is comparable, then wherever a character element P_i of T equals a character element P_x of sentence pattern 0 (P_i = P_x), and P_x corresponds to the Chinese character C_x (P_x ⇔ C_x), set A[0][i] = C_x. The other sentence patterns are then taken in turn and handled by the same method.
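The filling of array A can be sketched as below for "#" patterns. Everything here is illustrative: the segment lists, the Chinese characters assigned to each syllable, and the names `find_run`, `match_unordered` and `fill_grid` are assumptions, and rows are appended only for comparable patterns, which has the same effect as clearing a row and not advancing u.

```python
T = "zhe jia gong si fu li hao you jin tie".split()

# Candidates in evaluation order: (segments split at '#', one character per syllable).
CANDIDATES = [
    ([["hao"], ["gong", "si"]], "好公司"),
    ([["fu", "li"], ["hao"]], "福利好"),
    ([["hao"], ["gong", "fu"]], "好功夫"),
    ([["you"], ["jin", "tie"]], "有津贴"),
]

def find_run(t, seg, start, stop):
    """First index i in [start, stop - len(seg)] where seg occurs contiguously, else -1."""
    for i in range(start, stop - len(seg) + 1):
        if t[i:i + len(seg)] == seg:
            return i
    return -1

def match_unordered(t, segments):
    """Start position of each segment, searching forward first, then from the start of T."""
    i = find_run(t, segments[0], 0, len(t))
    if i < 0:
        return None
    spans, last = [i], i + len(segments[0]) - 1
    for seg in segments[1:]:
        nxt = find_run(t, seg, last + 1, len(t))
        if nxt < 0:
            nxt = find_run(t, seg, 0, last)
        if nxt < 0:
            return None
        spans.append(nxt)
        last = nxt + len(seg) - 1
    return spans

def fill_grid(t, candidates, rows):
    """Build A: one row per comparable pattern, one column per character element of T."""
    grid = []
    for segments, chars in candidates:
        if len(grid) == rows:
            break
        spans = match_unordered(t, segments)
        if spans is None:
            continue                      # structure not comparable: the row is not used
        row, c = [""] * len(t), 0
        for seg, start in zip(segments, spans):
            for offset in range(len(seg)):
                row[start + offset] = chars[c]
                c += 1
        grid.append(row)
    return grid

A = fill_grid(T, CANDIDATES, rows=5)
for row in A:
    print(row)
```

A list of lists of one-character strings avoids the question of how many bytes a Chinese character occupies, which the fixed-size array in the text has to account for.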
U=0 is " hao#gongsi ", reads " hao ", mate with T, and i=6 success, the corresponding Chinese character C=" good " of P=in the u=0 sentence pattern " hao " makes A[0] [6]=" good "; # indication " unordered collaborative ", record T current location I=6 mates with gong backward, and is unsuccessful up to the last character unit of T, returns, mate from the i=0 character unit of T, i=2 success, the C=of " gong " " public affairs " makes A[0] [2]=" public affairs "; Be right after coupling with si, success makes A[0] [3]=" department ".
U=1 is " fuli#hao ", reads " fu ", mate with T, and i=4 success, the C=of " fu " " good fortune " makes A[1 in this sentence pattern] [4]=" good fortune "; Be right after coupling with li, success makes A[1] [5]=" profit ".# indication " unordered collaborative ", T current location I=5 mates with hao backward, the i=6 success, the C=of " hao " " good " makes A[1] [6]=" good ".
" hao#gongfu " reads " hao ", mate with T, and i=6 success, the C=of " hao " in the u=2 sentence pattern " good " makes A[2] [6]=" good "; # indication " unordered collaborative ", T current location I=6 matches at last with gong backward, and is unsuccessful, returns, mate from the i=0 character unit of T, i=2 success, the C=of " gong " " merit " makes A[2 in the u=2 sentence pattern] [2]=" merit "; Be right after coupling with fu, unsuccessful, illustrate that the character meta structure of " hao#gongfu " is not comparable with T, can not use sentence pattern for referencial use, the element cleaning that the u=2 of array A is capable is for empty, even A[2] [n]=" ".The u value does not increase by 1.
" you#jintie " reads " you ", mate with T, and i=7 success, the C=of " you " " has " in the u=2 sentence pattern, makes A[2] [7]=" having ".# indication " unordered collaborative ", T current location I=7 mates with jin backward, i=8 success, the C=of " jin " " Tianjin " makes A[2 in the u=2 sentence pattern] [8]=" Tianjin "; Be right after coupling with tie, success makes A[2] [9]=" subsides ".
Continue to handle you#fuli ", " tie#jiagong ", character array A[5] [10] become:
zhe jia gong si fu li hao you jin tie
0 1 2 3 4 5 6 7 8 9
0 Public Department Good
1 Good fortune Sharp Good
2 Have Tianjin Paste
3 Floating Power Have
4 Add The worker Iron
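As an illustration only, the matching procedure of step A can be sketched in C++ as follows. The names Element, Pattern and matchPattern are hypothetical and do not come from the original text, and the sketch assumes that a sentence pattern is stored as groups of (pinyin, character) pairs separated at each "#". Each group must match a contiguous run of T; the search for a group starts after the last matched position and wraps around to i=0, as in the walk-through above.

```cpp
#include <string>
#include <utility>
#include <vector>

// One character element of a pattern: its pinyin P and the character C it converts to.
using Element = std::pair<std::string, std::string>;
// A pattern is a list of groups; "#" in the notation of the text separates the groups.
using Pattern = std::vector<std::vector<Element>>;

// Try to place `pat` on T. Returns true and fills `row` (one row of array A) when
// every group is found as a contiguous run in T. Returns false and clears `row`
// when the structure of the pattern is not comparable with T.
bool matchPattern(const Pattern& pat, const std::vector<std::string>& T,
                  std::vector<std::string>& row) {
    const int m = static_cast<int>(T.size());
    row.assign(m, std::string());
    int cur = -1;  // position of the last matched element; -1 before the first group
    for (const auto& group : pat) {
        const int len = static_cast<int>(group.size());
        bool placed = false;
        // Scan forward from cur+1 to the end of T, then wrap around to i = 0.
        for (int step = 0; step < m && !placed; ++step) {
            const int start = (cur + 1 + step) % m;
            if (start + len > m) continue;
            bool ok = true;
            for (int x = 0; x < len && ok; ++x)
                ok = (T[start + x] == group[x].first) && row[start + x].empty();
            if (!ok) continue;
            for (int x = 0; x < len; ++x) row[start + x] = group[x].second;
            cur = start + len - 1;
            placed = true;
        }
        if (!placed) {  // structure not comparable: clear the row
            row.assign(m, std::string());
            return false;
        }
    }
    return true;
}
```

Called with the ten syllables of T and the pattern "hao#gongsi", this sketch fills positions 2, 3 and 6 of the row; for "hao#gongfu" it returns false and leaves the row empty, so the caller would not increase u.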
B. Check the compatibility between sentence patterns and decide the statement generation schemes.
The n selected sentence patterns are those with the largest character element counts j and high frequency. When a statement is generated, the more of them are used, the better the character elements of T are handled and the higher the credibility of the generated statement. However, the selected sentence patterns may be "incompatible": two sentence patterns share a character element P, but the corresponding conversion information C differs. Such sentence patterns cannot be used to generate the same statement; in array A this shows up as different C values in the same column. For example, for P="gong", C=公 in the u=0 sentence pattern and C=工 in the u=4 sentence pattern. The compatibility between sentence patterns therefore has to be checked before the statement generation schemes are decided.
First assume that all 5 sentence patterns can be used together to generate one statement; this is scheme 01234.
In the "zhe" column, every A[u][0] is empty, so the scheme is not modified. In the "jia" column, A[0][1], A[1][1], A[2][1] and A[3][1] are empty and only A[4][1]=加, so the scheme is not modified. In the "gong" column, A[1][2], A[2][2] and A[3][2] are empty, A[0][2]=公 and A[4][2]=工; sentence pattern 0 is incompatible with sentence pattern 4, so the scheme is split into 0123 and 1234.
In the "si" column, only A[0][3]=司, so the schemes are not modified. In the "fu" column, A[0][4], A[2][4] and A[4][4] are empty, A[1][4]=福 and A[3][4]=浮; sentence pattern 1 is incompatible with sentence pattern 3, so the schemes are modified to 012, 023 and 124, 234.
In the "li" column, A[0][5], A[2][5] and A[4][5] are empty, A[1][5]=利 and A[3][5]=力; sentence pattern 1 is incompatible with sentence pattern 3, which is the same conflict as in the "fu" column, so the schemes are not modified.
In the "hao" column, A[2][6], A[3][6] and A[4][6] are empty, A[0][6]=好 and A[1][6]=好; the values are compatible, so the schemes are not modified. In the "you" column, A[2][7]=有 and A[3][7]=有; the values are the same, so the schemes are not modified. In the "jin" column, A[0][8], A[1][8], A[3][8] and A[4][8] are empty and only A[2][8]=津, so the schemes are not modified. In the "tie" column, A[2][9]=贴 and A[4][9]=铁 conflict; sentence pattern 2 is incompatible with sentence pattern 4, so schemes 124 and 234 are modified to 12, 14, 23 and 34. This leaves 6 statement generation schemes: 012, 023 and 12, 14, 23, 34.
Of these 6 schemes, 12, 14, 23 and 34 use only 2 basic sentence patterns each; the statements they generate have low credibility and are discarded. Two alternative statements are generated with schemes 012 and 023, whose style tendencies are determined by sentence patterns 1 and 3 respectively.
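The column-by-column check above can be illustrated with a short C++ sketch. It is not the procedure of the text itself: instead of splitting schemes column by column, it first collects the incompatible pairs of rows and then enumerates, by bitmask, the pairwise-compatible sets of rows that cannot be extended and that use at least a given number of sentence patterns. The names Grid, conflict, incompatiblePairs and maximalSchemes are hypothetical, and n is assumed to be small (fewer than 32 rows).

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Grid = std::vector<std::vector<std::string>>;  // A[u][i]; "" marks an empty element

// Two rows conflict when some column holds different non-empty characters.
bool conflict(const std::vector<std::string>& a, const std::vector<std::string>& b) {
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        if (!a[i].empty() && !b[i].empty() && a[i] != b[i]) return true;
    return false;
}

// All incompatible pairs (u, v) with u < v, in increasing order.
std::vector<std::pair<int, int>> incompatiblePairs(const Grid& A) {
    std::vector<std::pair<int, int>> pairs;
    const int n = static_cast<int>(A.size());
    for (int u = 0; u < n; ++u)
        for (int v = u + 1; v < n; ++v)
            if (conflict(A[u], A[v])) pairs.push_back(std::make_pair(u, v));
    return pairs;
}

// Row sets (bit u set = row u used) that are pairwise compatible, contain at
// least minRows rows, and cannot be extended by any further row.
std::vector<unsigned> maximalSchemes(const Grid& A, int minRows) {
    const int n = static_cast<int>(A.size());
    std::vector<unsigned> bad(n, 0u);  // bad[u]: bitmask of rows that conflict with row u
    for (const auto& p : incompatiblePairs(A)) {
        bad[p.first] |= 1u << p.second;
        bad[p.second] |= 1u << p.first;
    }
    std::vector<unsigned> schemes;
    for (unsigned s = 1; s < (1u << n); ++s) {
        bool ok = true;
        int rows = 0;
        for (int u = 0; u < n && ok; ++u)
            if ((s >> u) & 1u) { ++rows; ok = (bad[u] & s) == 0; }
        if (!ok || rows < minRows) continue;
        bool maximal = true;
        for (int v = 0; v < n && maximal; ++v)
            if (!((s >> v) & 1u) && (bad[v] & s) == 0) maximal = false;
        if (maximal) schemes.push_back(s);
    }
    return schemes;
}
```

For the array A of the example, incompatiblePairs reports the pairs (0,4), (1,3) and (2,4), and maximalSchemes with minRows=3 keeps the row sets {0,1,2} and {0,2,3}.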
C. press 013 scheme, generated statement.
First character array B[m of definition], if in the programming with " phonetic is for word ", can define, 6 letters of long in the Chinese, a character string array of definable B[6 like this if handle with phonetic in the programming] [m].Make the initial value of each element of B equal corresponding syllable: zhe|jia|gong|si|fu|li|hao|you|jin|tie among the T.If A[0] [m] be not empty, then makes B[m]=A[0] [m], B become " the zhe|jia| public affairs | department | fu|li| is good | you|jin|tie ".If A[1] [m] be not empty, then makes B[m]=A[1] [m], B become " the zhe|jia| public affairs | department | good fortune | profit | good | you||jin|tie ".If A[3] [m] be not empty, then makes B[m]=A[3] [m], B become " the zhe|jia| public affairs | department | good fortune | profit | good | have | Tianjin | paste ".
So far, trunk forms, but still has character unit to be untreated, can utilize the conjunctive word of the basic sentence pattern 0,1,3 that generates this statement, and the reference sentence pattern outside n, comprehensive utilization is information such as connection, frequency, grammer, style, knowledge system, and T is continued to handle.Untreated " zhejia " among the T can find " converting into money " " this family " in database; " you ", database medium-high frequency speech is " having "; The conjunctive word of " hao#gongsi " has measure word " family ", sound " jia "; Comprehensive various factors, T can be treated to " this company's welfare is good, and subsidy is arranged ", output statement 1.Its style tendency is decided by " fuli#hao " L s=L 1=13.
By 023 scheme, T is converted to " zhejia company buoyancy is good, and subsidy is arranged ", trunk forms, untreated " zhejia " among the T, can find " converting into money " " this family " in database, the conjunctive word of " hao#gongsi " has measure word " family ", sound " jia ", comprehensive various factors, T can be treated to " this company's buoyancy is good, and subsidy is arranged ", output statement 2.Its style tendency is decided by " you#fuli " L s=L 3=8960.
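The formation of the trunk in step C amounts to overlaying the rows of one scheme onto the syllables of T. A minimal C++ sketch, with the hypothetical names Grid, buildTrunk and join, and assuming that the rows of a scheme have already been checked for compatibility, could look like this:

```cpp
#include <cstddef>
#include <string>
#include <vector>

using Grid = std::vector<std::vector<std::string>>;  // A[u][i]; "" marks an empty element

// Start from the syllables of T and overwrite each position with the character
// supplied by any row of the chosen scheme.
std::vector<std::string> buildTrunk(const std::vector<std::string>& T, const Grid& A,
                                    const std::vector<int>& scheme) {
    std::vector<std::string> B = T;  // initial value: the syllables themselves
    for (int u : scheme)
        for (std::size_t i = 0; i < B.size() && i < A[u].size(); ++i)
            if (!A[u][i].empty()) B[i] = A[u][i];
    return B;
}

// Join the elements of B with a separator, e.g. "|" to mirror the notation of the text.
std::string join(const std::vector<std::string>& B, const std::string& sep) {
    std::string out;
    for (std::size_t i = 0; i < B.size(); ++i) {
        if (i) out += sep;
        out += B[i];
    }
    return out;
}
```

The syllables that remain in the result (here "zhe" and "jia") are the character elements that still have to be handled with associated words, frequency and the other factors described above.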
The generated statements should be scored. In the example above, the difference between the two statements comes from 福利好 ("welfare is good") versus 有浮力 ("has buoyancy"). In terms of style tendency, 浮力 (buoyancy) is a word of education, machinery and physics. If the text is a commercial, economic or official document, then comparing the style tendency of statement 2, L_s = L_3 = 8960, with the style tendency L_p of the section, chapter or article gives L_s Or L_p ≠ L_p, whereas the style tendency of statement 1, determined by 福利好, can conform: L_s Or L_p = L_p. In addition, 福利好 has a high frequency and should be given priority for that reason as well. In voice input, once a basic sentence pattern has been used, the probability that it is repeated later in the text is high, so the frequency should be adjusted dynamically. Grammatical analysis can be carried out while a statement is being generated, or the statement can be scored after it has been generated. In Chinese, 这家公司浮力好，有津贴 does not make sense; if this can be detected by grammar rules, so much the better. Tone, stress, intonation, speech pauses and the like can also be used as auxiliary information. A more reliable method, however, is to enter 公司福利 ("company welfare") into jxk as a phrase; after cumulative marking it has j=k=4, and the statement generated from it receives a higher rating when the confidence is calculated.
In short, on the basis of a good sentence pattern database, it is important to design a good decision process that considers linkage, structural information, style tendency, grammar, frequency, knowledge system and other factors together with the weight of each factor. The aim is not only to improve conversion accuracy; the higher requirement is a certain capacity for fault tolerance and error correction.
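The style-tendency test mentioned above, L_s Or L_p = L_p, is a bitwise subset test: every style bit set for the statement must also be set for the surrounding text. A minimal C++ sketch follows; the function name styleConforms is hypothetical, and the meaning assigned to individual bits is not specified in this section.

```cpp
#include <cstdint>

// True when every style bit set in Ls is also set in Lp, i.e. (Ls Or Lp) = Lp.
bool styleConforms(std::uint32_t Ls, std::uint32_t Lp) {
    return (Ls | Lp) == Lp;
}
```

As a purely assumed example, if the style tendency of the surrounding text were L_p = 15, then L_1 = 13 would conform, while L_3 = 8960 would not, because 8960 sets bits that L_p does not contain.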
B. Fault-tolerant processing
Compared with standard pronunciation, most people's speech contains more or less error. Whether language analysis is done with position marking, with prime number substitution or with the inverted reference sentence pattern method, a certain capacity for fault tolerance and even error correction is desirable. If, after cumulative marking, the j value accurately reflects the size of the intersection of the character elements of S and T, the sentence patterns with 0<j<k have only part of their character elements in common with T and are called fault-tolerant sentence patterns. There may be many of them; the more meaningful ones are those with j=k-1 or j=k-2 and those with large j values. The query condition can be changed to something like j>k-3 and j>1, so that R_1 includes these fault-tolerant sentence patterns. If, after cumulative marking, the j value does not accurately reflect the size of the intersection of the character elements of S and T, reference sentence patterns, fault-tolerant sentence patterns and redundant sentence patterns cannot be clearly separated by the j value. The query condition can then be relaxed, for example to something like j>k-3 and j>1, or j>h-3 and j>1, so that essentially all meaningful fault-tolerant sentence patterns enter R_1. When fault tolerance is considered, the character elements of each S in R_1 are compared with those of T in order to reject two kinds of unsuitable S: 1. S whose structure is not comparable with T; 2. S whose character elements differ greatly from those of T. For the latter, a parameter e can be set to record the number of character elements of S that are not found in T, and the sentence pattern is discarded when, for example, e exceeds 2. In position marking, let W = W_t & W_n; if the number of "1" bits in W is close to the number of "1" bits in W_n, the sentence pattern can tentatively be regarded as a fault-tolerant sentence pattern. This is practical only if a fast instruction for counting bits is available. In retrieval by prime number substitution, because there is no efficient method for eliminating common divisors and factoring, fault-tolerant sentence patterns are not easy to find. Some targeted methods are proposed here according to the specific causes of the errors.
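The position-marking test W = W_t & W_n can be illustrated with a short C++ sketch that counts bits with std::bitset. It assumes, for illustration, that each bit of a 64-bit word marks one character element; the function names and the threshold parameter maxMissing are hypothetical.

```cpp
#include <bitset>
#include <cstdint>

// Number of "1" bits of Wn that are missing from Wt,
// i.e. popcount(Wn) - popcount(Wt & Wn).
int missingElements(std::uint64_t Wt, std::uint64_t Wn) {
    const std::bitset<64> all(Wn);
    const std::bitset<64> common(Wt & Wn);
    return static_cast<int>(all.count() - common.count());
}

// A pattern is treated as a fault-tolerant candidate when it shares at least
// one element with T and misses between 1 and maxMissing of its own elements.
bool isFaultTolerant(std::uint64_t Wt, std::uint64_t Wn, int maxMissing) {
    const int e = missingElements(Wt, Wn);
    return (Wt & Wn) != 0 && e >= 1 && e <= maxMissing;
}
```

A pattern with no missing bits is an ordinary reference sentence pattern rather than a fault-tolerant one, which is why the sketch requires at least one missing element.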
Errors may be caused by dialect. For a major dialect, a dedicated program can be written, but a voice input system for the standard language should also have some tolerance for dialect. For example, quite a few speakers of Chinese do not distinguish n from l: 君子兰 (kaffir lily) should be read "junzilan" but is read "junzinan". After cumulative marking, the sentence pattern "junzilan" has k=3, j=2, j=k-1 and is a fault-tolerant sentence pattern; according to it, "junzinan" is processed into 君子兰. In position marking, n and l can be put into one group when the database is built. After the "inverse retrieval" by position marking yields R_1, the "inverse retrieval" by prime number substitution with "junzinan" finds no suitable reference sentence pattern; following the dialect tolerance rules set in the program, "inverse retrieval" by prime number substitution is performed again with "junzilan", and 君子兰 is found. Alternatively, within R_1, the products of the primes of nan, lan and fang can be used respectively to carry out two "inverse retrievals" and obtain R_2, which is then processed further. Of course, it is also possible to treat lan and nan both as nan when building jxk, the inverted list, the position marks and the prime number substitution; another option is to treat "junzinan" with the corresponding characters 君子兰 as a sentence pattern of its own.
Errors may also arise because the user's pronunciation is occasionally indistinct. For example, when 儒家 ("Confucian school") is read unclearly, it may sound like both "rujia" and "yujia", with the first syllable somewhere between ru and yu. With the inverted method, either "rujia" or "yujia" can be used for cumulative marking; the sentence pattern for the other then has j=k-1 and is a fault-tolerant sentence pattern. When there are many fault-tolerant sentence patterns, the wanted one may not be found. In that case, after cumulative marking with ru and jia, a temporary table temp1 can be created to store the records with j=k; j is then reset to 0 and cumulative marking is repeated with yu and jia to obtain a temporary table temp2. After the two tables are merged and duplicate sentence patterns are deleted, the sentence patterns with j=k and large values are selected as reference sentence patterns. If cumulative marking is done with "ru, yu, jia" at the same time, the sentence patterns made up of ru and yu need to be rejected. In position marking, "ru, yu, jia" can be marked at the same time to obtain W_t, but prime number substitution requires separate "inverse retrievals" with the prime products of "rujia" and "yujia".
Users outside dialect areas also inevitably misread some characters. For example, the 胖 in 心广体胖 ("carefree and contented") should be read "pan", but it is often read "pang". "xinguangtipan" and "xinguangtipang" can each be made to correspond to 心广体胖 and entered into the database as two sentence patterns. If the user reads "xinguangtipang", the program converts it to 心广体胖 but reminds the user of the misreading.
Fault-tolerant sentence pattern may be statement, the word that the user copys.Make " drill excellent then sing " as, user imitative " Officialdom is the natural outler for good scholars ", pinyin string is " yaneryouzechang ", after accumulative total indicates, the j=k=2 of " performance ", T are treated to " drilling eryouze sings ", and the k=5 of " Officialdom is the natural outler for good scholars ", j=3, j=j-2, k, j are bigger, handle by copying statement,, T is treated to " drill excellent then sing " from " Officialdom is the natural outler for good scholars " extraction " then excellent " three words according to " eryouze ".Reach in " prime number replacing " at " position mark ", after handling with " performance ", analyze and find that " eryouze " do not have believable sentence pattern reference, can attempt " just retrieving " being done in the sentence pattern storehouse with the place value and the prime number value of " eryouze ", obtain " Officialdom is the natural outler for good scholars ", therefrom extract " then excellent " and handle.
In multisyllabic languages such as English, linking between words may also complicate syllable division. For example, "a piece of paper" may be pronounced as the 5 character elements [ə|pi:s|əv|pei|pə] or as the 5 character elements [ə|pi:|səv|pei|pə]; both need to be listed in the sentence pattern database. If [səv] is often reduced and unclear in fast speech, the 4 character elements [ə|pi:|pei|pə] can be listed in the sentence pattern database as well. "Red paper" may be pronounced [red|pei|pə] or [re|pei|pə], 3 character elements each, and both are also listed in the sentence pattern database. Only in this way can the reference sentence patterns still be found when the user wants to input "a small piece of red paper": even if linking changes the sounds and [səv] is weakened and unintelligible, this syllable is rejected, and after cumulative marking or "inverse retrieval" the reference sentence patterns are found with good results. The & in the following table indicates that other words can be inserted at that position.
(Table shown as an image in the original; not reproduced here.)
In a word, a large amount of is fault-tolerant, need handle by phonetics, dialectology theory and the experiment effect of various language, also needs cpu that enough processing speeds are arranged.
After database accumulative total indicates, similar following situation also can occur: " colleges and universities " are the abbreviations of " institution of higher learning ", the pinyin string of analyzing as needs is " jiaqianggaoxiaoguanli ", " jiaqiangguanli " is " strengthening management ", and handling the back is " strengthening the gaoxiao management ".If not having pinyin string in the sentence pattern storehouse is the sentence pattern gaoxiao of j=k=2, but have:
Numbering Pinyin string k j Chinese character string Syntactic information
94753 gaodengyuanxiao 4 2 Institution of higher learning The noun phrase
As alternative plan, can from the 94753rd sentence pattern, extract " colleges and universities " two words by gaoxiao, be processed into " strengthening colleges and universities' management ".In " prime number replacing ", need to go to finish with " just retrieving ".
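The extraction of 高校 from sentence pattern 94753 (and likewise of 而优则 from 学而优则仕) can be viewed as an ordered subsequence match on syllables. The following C++ sketch is only one possible reading of that step; the function name extractBySyllables and the representation of a sentence pattern as two parallel lists are assumptions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the characters of the pattern selected by `query` when the query
// syllables occur in the pattern in the same order (not necessarily adjacent);
// returns an empty string otherwise.
std::string extractBySyllables(const std::vector<std::string>& patternPinyin,
                               const std::vector<std::string>& patternChars,
                               const std::vector<std::string>& query) {
    std::string out;
    std::size_t q = 0;  // index of the next query syllable to find
    for (std::size_t x = 0;
         x < patternPinyin.size() && x < patternChars.size() && q < query.size(); ++x) {
        if (patternPinyin[x] == query[q]) {
            out += patternChars[x];
            ++q;
        }
    }
    return q == query.size() ? out : std::string();
}
```

With the syllables gao, deng, yuan, xiao and the query gao, xiao, the sketch yields 高校; a query whose syllables do not all occur in order yields an empty string, in which case the pattern would not be used.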
Description of drawings
Fig. 1 is a flowchart of building the language analysis database and the inverted list.
Fig. 2 is a flowchart of the analysis of a specific user statement.
Fig. 3 is a flowchart of cumulative marking with a grouped inverted list.
Embodiment
The Summary of the Invention has described the inverted reference sentence pattern analysis method as applied to voice input. The implementation of some other applications is described here, and a piece of illustrative code for the inverted reference sentence pattern method is given.
A. Implementation in other applications
1. Build a database of basic sentence patterns S (including collocations, phrases and words; the same applies below). For each basic sentence pattern, give the number of character elements k, or the number of character elements h after repeated elements are removed, or both k and h. Give its address d, or give a sentence pattern number n.
For machine translation, the character element is the Chinese character. Accuracy comes first and response speed is relatively unimportant, so the number of basic sentence patterns can be made as large as possible. The following is a simple form of a jxk for Chinese-English machine translation. The corresponding English sentence must be present; Chinese structural information, Chinese grammar information, English grammar information and so on can also be included as corresponding information:
Number  Chinese sentence pattern  k  j  English sentence  Chinese grammar information  English grammar information
95864   看&电视 (watch ... TV)     3     watch TV          verb-object                  verb-object
For a search engine, a large amount of data has to be processed and response speed is very important. Statements only need to be segmented, and the emphasis is on entering the sentence patterns and collocations that are easily segmented wrongly, so the number of basic sentence patterns should be small. The structural information of the Chinese character string must be present:
Number  Basic sentence pattern      h  j  Frequency %
2895    以&方式 (in a ... manner)    3     0.0075
2. Build a file containing all character elements and, after each character element P_i, list the sentence pattern numbers n or the addresses d of all basic sentence patterns containing P_i, to obtain an inverted list. Inverted lists come in several kinds, such as the single list, the repeat list and the grouped list.
For Chinese machine translation and search engines, the inverted list is arranged by Chinese character. The following table lists the basic sentence pattern numbers n after each Chinese character:
Chinese character  Sentence pattern numbers
说 (say)           28901,45086,67872,75123,900250
流 (flow)          35984,77925,298955,354565
For machine translation of other languages, the inverted list can be arranged by word. The following table lists the addresses d after each English word:
English word  Addresses
watch         00001520,00012640,00091580,00378C20
walk          0000AAC0,0005E20,000E1540,0029E160
3. Whether the basic sentence pattern database jxk gives the k value, the h value or both, and which kind of inverted list is used, affect the method of cumulative marking; these choices in turn affect the query condition and the degree of redundancy in the result set R_1.
With a grouped inverted list, the number of times each character element is repeated in T has to be checked, cumulative marking is done by the "backward compatible" method, and the query uses j=k; R_1 then contains no redundancy.
With a single inverted list: when jxk has the k value and the repeated character elements of T are not removed, the query can use j=k or j>k. When jxk has the h value and the repeated character elements of T are not removed, the query can use j=h or j>h; when the repeated character elements of T are removed, the query can use j=h, or j=h or j>h. With all of these schemes R_1 contains redundancy. When jxk gives both k and h: if T has no repeated character elements, cumulative marking is done directly and the query uses j=k; if T has repeated character elements, they are removed before cumulative marking and the query uses j=h. The number of redundant S is then relatively small.
With a repeat inverted list: when jxk has the k value and the repeated character elements of T are not removed, the query can use j=k or j>k; when the repeated character elements of T are removed, the query can use j=k, or j=k or j>k. When jxk has the h value and the repeated character elements of T are not removed, the query can use j=h or j>h; when the repeated character elements of T are removed, the query can use j=h or j>h. With all of these schemes R_1 contains redundancy. When jxk gives both k and h: if T has no repeated character elements, cumulative marking is done directly and the query uses j=h; if T has repeated character elements, they are removed before cumulative marking and the query uses j=k. The number of redundant S is then relatively small.
If both a single inverted list and a repeat inverted list are available, then during cumulative marking the repeated character elements of T are marked according to the repeat list and the non-repeated character elements according to the single list. A query with j=h or j>h gives redundancy; a query with j=k gives a small amount of redundancy.
4. The record set obtained by the query is loosely called the "reference sentence patterns". It contains redundant sentence patterns as well as sentence patterns whose character element structure is not comparable with T; these have to be rejected. Sentence patterns with large k, h or j values are then selected as the "basic sentence patterns" for analyzing and processing T. Of these values, k reflects the number of character elements of S more accurately than h or j.
For a search engine, suppose the statement is T=以便于理解的方式 ("in a manner that is easy to understand"). After marking, the following table is obtained:
Number  Basic sentence pattern  k  h  j  Frequency %  Syntactic information
5694    以方#理解               4  4  4  0.006        noun-verb
2895    以&方式                 3  3  3  0.025        preposition-object
The 1st character element of 5694, 以, is compared starting from the 1st character element of T and succeeds. The 2nd character element of 5694, 方, is compared with the immediately following character element of T and fails; the structures are not comparable, so this sentence pattern is discarded. The 1st character element of 2895, 以, is compared starting from the 1st character element of T and succeeds. The & indicates that other components can be inserted at that position, so 方 is compared with the 2nd character element of T, which fails, and then in turn with the 3rd, 4th, 5th and 6th character elements until it succeeds at the 7th. 式 is then compared with the immediately following character element of T and succeeds, so the character element structure of 2895 is comparable with T. According to 2895, T is segmented as 以|便于理解的|方式, which avoids the segmentation 以便|于|理解|的|方式 produced by forward maximum matching (FMM).
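The segmentation along a pattern containing "&" can be sketched in C++ as follows. The sketch is an illustration under stated assumptions: T is given as a list of single characters (each stored as a UTF-8 string), a pattern is given as its groups separated at each "&", and the names Chars, concat and cutByPattern are hypothetical. Each group has to match a contiguous run of T, and groups are searched strictly left to right.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using Chars = std::vector<std::string>;  // one UTF-8 encoded character per entry

// Concatenate T[from] .. T[to-1] into one segment.
std::string concat(const Chars& T, std::size_t from, std::size_t to) {
    std::string s;
    for (std::size_t i = from; i < to; ++i) s += T[i];
    return s;
}

// Cut T along a pattern whose groups are separated by "&". Material between two
// groups becomes a segment of its own. Returns an empty vector when the
// structure of the pattern is not comparable with T.
std::vector<std::string> cutByPattern(const Chars& T, const std::vector<Chars>& groups) {
    std::vector<std::string> segments;
    std::size_t pos = 0;  // first character of T not yet assigned to a segment
    for (const Chars& g : groups) {
        bool found = false;
        for (std::size_t start = pos; start + g.size() <= T.size(); ++start) {
            bool ok = true;
            for (std::size_t x = 0; x < g.size() && ok; ++x) ok = (T[start + x] == g[x]);
            if (!ok) continue;
            if (start > pos) segments.push_back(concat(T, pos, start));  // inserted material
            segments.push_back(concat(T, start, start + g.size()));      // the group itself
            pos = start + g.size();
            found = true;
            break;
        }
        if (!found) return {};
    }
    if (pos < T.size()) segments.push_back(concat(T, pos, T.size()));    // trailing material
    return segments;
}
```

For T=以便于理解的方式 and the groups 以 and 方式, the sketch returns the three segments 以, 便于理解的 and 方式; a group such as 以方, which does not occur contiguously in T, makes it return an empty result.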
For machine translation, suppose the Chinese sentence is one meaning "I usually watch TV for an hour". After marking, the following table is obtained:
Number  Chinese sentence pattern  k  h  j  English sentence pattern  Chinese grammar information    English grammar information
786532  (one hour)                5  5  5  for an hour               time                           time
958634  看&电视 (watch ... TV)     3  3  3  watch TV                  verb-object                    verb-object
456826  通常 (usually)            2  2  2  usually                   frequency adverb               frequency adverb
012865  我 (I)                    1  1  1  I                         before the verb                before the verb
012866  我 (I)                    1  1  1  me                        after a verb or preposition    after a verb or preposition
The sentence can be segmented as "I | usually | watch | one hour | 的 | TV". The core of the sentence is the verb. In 958634 both the Chinese and the English are verb-object constructions, so its English sentence pattern "watch TV" is extracted first. "One hour" corresponds to "for an hour"; both the Chinese and the English are time phrases, and according to English grammar the time adverbial is placed after the predicate and the object, giving "watch TV for an hour". The Chinese 我 stands before the verb-object construction 看&电视 and is therefore the subject, so 012865 is selected and it is translated into the English subject form "I"; according to English grammar the subject comes before the predicate, giving "I watch TV for an hour". 通常 is a frequency adverb whose English counterpart is "usually"; an English frequency adverb is normally placed between the subject and the predicate, giving "I usually watch TV for an hour". In other words, besides the syntactic information of the sentence patterns, machine translation also needs the support of a good grammar system.
B. Illustrative VC code
The following code has been run under VC and is for illustration only. It uses a repeat inverted list and does not remove the repeated character elements of T, which corresponds to schemes 9 and 10.
#include <iostream>
using namespace std;

struct Juxing { char js[10]; int k, j; };             // a sentence pattern with its k and j values
struct dpr { char zi; Juxing* dizhi[5]; int kong; };  // one row of the inverted list: key, pattern addresses, next free slot

int main() {
    Juxing jxk[3] = {{"babb", 4, 0}, {"abc", 3, 0}, {"acd", 3, 0}};
    Juxing* jxdz = jxk;
    dpr dpb[4] = {{'a'}, {'b'}, {'c'}, {'d'}};        // dizhi and kong are zero-initialized
    // n: sentence pattern index in jxk; m: character index; i: inverted-list key index;
    // r: slot index within one row; kz: free slot
    int n, m, i, r, kz;
    char gjc, tc;

    // Build the repeat inverted list: a pattern is listed once per occurrence of the key in it.
    for (i = 0; i < 4; i++) {
        gjc = dpb[i].zi;                               // current key letter
        for (n = 0; n < 3; n++) {                      // n: current sentence pattern
            for (m = 0; jxk[n].js[m] != '\0'; m++) {   // m: character position within S
                if (gjc == jxk[n].js[m]) {
                    kz = dpb[i].kong;
                    dpb[i].dizhi[kz] = jxdz + n;
                    dpb[i].kong = kz + 1;
                }
            }
        }
    }

    // Output the inverted list.
    for (i = 0; i < 4; i++) {
        cout << dpb[i].zi;
        for (r = 0; r < dpb[i].kong; r++) cout << "," << dpb[i].dizhi[r]->js;
        cout << endl;
    }

    // Cumulative marking for the statement T to be analyzed.
    char text[] = "abbbcef";
    for (m = 0; text[m] != '\0'; m++) {
        tc = text[m];                                  // take one character of T
        for (i = 0; i < 4; i++) {
            if (dpb[i].zi == tc) {                     // key found
                for (r = 0; r < dpb[i].kong; r++) dpb[i].dizhi[r]->j++;  // mark
                break;
            }
        }
    }

    // Output the marking result of jxk.
    for (n = 0; n < 3; n++)
        cout << jxk[n].js << "," << jxk[n].k << "," << jxk[n].j << endl;
    return 0;
}

Claims (6)

1. a language analysis method is characterized in that, may further comprise the steps:
A. set up the database of basic sentence patterns (containing collocations, phrase, phrase, word, the down together) S of certain language, provide process information; Provide the number k of character unit of each basic sentence patterns or provide the character number h of unit that rejects after repeating or provide k and h simultaneously or provide k and the character multiplicity g of unit or provide h and g or provide k and h and g; Provide j; The address of sentence pattern or j is d, or provides sentence pattern numbering n;
B. list all character element P of this this kind of language application i(i=1,2,3 ... w), to each character element P i, all list and comprise this character element P iAll basic sentence patterns or the address d of j, or sentence pattern numbering n draws inverted list;
C. establishing the sentence that needs to analyze is T, with the character element P of T r(i=1,2,3 ... m), according to inverted list P rD, or n adds up to indicate to the j of basic sentence patterns database respective record, obtains the j value of each basic sentence patterns S;
D. the j by each sentence pattern S relatively and k, h or and the size of g, filter out T and comprise, may comprise the S of its alphabet unit, part character unit, character unit to S and T compares, reject nonconforming S, ordinary priority selects the big sentence pattern of k or h or j value as basic sentence pattern, with reference to these sentence patterns T is carried out analyzing and processing.
2. in accordance with the method for claim 1, it is characterized in that: after accumulative total indicates, if the j value can accurately reflect the size that the character unit of S and T occurs simultaneously, with the sentence pattern S of j=k as the reference sentence pattern, with the sentence pattern S of 0<j<k as fault-tolerant sentence pattern, the sentence pattern S of j=m as possible quoted passage, is selected basic sentence pattern, analyzing and processing T according to qualifications from these sentence patterns; After if accumulative total indicates, the j value can not accurately reflect the size that the character unit of S and T occurs simultaneously, and suitably relaxes querying condition and obtains R 1, from R 1In select basic sentence pattern according to qualifications, analyzing and processing T.
3. in accordance with the method for claim 1, it is characterized in that: in the phonetic entry, accumulative total indicates, after rejecting the record lengthy and jumbled, that structure does not conform to, the preferential big sentence pattern of k or h or j value of selecting is as basic sentence pattern, but take all factors into consideration frequency, grammer, style, be join, the weight of the multiple factor of related information and each factor does selection.
4. it is characterized in that in accordance with the method for claim 1: with a data L nThe style tendency of bit mark S, analyze the L of the S that generates certain statement n, the style that obtains this statement is inclined to L sAmount to the L of the S of a joint literal nOr the L of statement s, analyze the style tendency L that draws this section literal pIf satisfy L sOr L p=L pOr its formula of equal value, then the style of this sentence tendency meets the style tendency of this joint, can give preferential reservation in alternative statement; When generating follow-up statement, preferentially select L nNear L pBasic sentence pattern; In the mechanical translation, provide the L of source language S simultaneously nL with target language S n, utilize L nAnalyze the L of current file p, use L s, L nSame L pCompare, estimate the alternative statement, the aid in later statement that have generated and generate.
5. The method according to claim 1, characterized in that: quotation material is organized and stored, and the title, first sentence and key sentences S of the quotation material are additionally entered into the basic sentence pattern database together with quotation information; when T, or the corresponding information of a sentence generated from T, is identical or close to a sentence pattern S, the preceding and following text is retrieved according to the quotation information, either automatically or at the user's prompt, for the user to confirm.
6. The method according to claim 1, characterized in that: related words and related information of the basic sentence patterns are provided and are used to handle synonym and near-synonym relations, coordinate and exclusion relations, inclusion relations and attribute relations between concepts, or to handle grammatical relations and morphological changes of the language, so as to assist the analysis and processing of T.
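
By way of illustration only, the grouping of claim 2 and the style check of claim 4 might be sketched in Python as below. This is a minimal sketch under stated assumptions, not the claimed implementation: sentence patterns are plain strings whose characters serve as character elements, k is taken as the number of distinct elements of S, m is assumed to be the number of distinct elements of T, the section value L_p is assumed to be formed by a bitwise OR of the sentence values L_s, and all function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_table(patterns):
    """Map each character element to the numbers n of the patterns S containing it."""
    table = defaultdict(set)
    for n, s in enumerate(patterns):
        for element in set(s):
            table[element].add(n)
    return table


def classify(patterns, table, t):
    """Accumulate j for each pattern S against sentence T, then group as in claim 2."""
    j = defaultdict(int)
    for element in set(t):  # distinct elements only, so j equals the intersection size
        for n in table.get(element, ()):
            j[n] += 1
    m = len(set(t))  # assumption: m is the number of distinct elements of T
    groups = {"reference": [], "fault_tolerant": [], "possible_quotation": []}
    for n, count in sorted(j.items()):
        k = len(set(patterns[n]))  # assumption: k is the number of distinct elements of S
        if count == k:
            groups["reference"].append(n)
        elif 0 < count < k:
            groups["fault_tolerant"].append(n)
        if count == m:
            groups["possible_quotation"].append(n)
    return groups


def section_style(sentence_styles):
    """Aggregate sentence style bits L_s into a section value L_p (assumed: bitwise OR)."""
    lp = 0
    for ls in sentence_styles:
        lp |= ls
    return lp


def style_fits(ls, lp):
    """Check the claim-4 condition L_s OR L_p = L_p, read here as a bitwise OR."""
    return (ls | lp) == lp


if __name__ == "__main__":
    patterns = ["abc", "bc", "abcde"]
    table = build_inverted_table(patterns)
    print(classify(patterns, table, "abc"))
    print(style_fits(0b001, section_style([0b001, 0b100])))
```

Repeated character elements in S or T are deliberately collapsed here with `set`, which corresponds to the case in which j reflects the intersection size exactly; the relaxed query that yields R_1 is not modeled.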
CNA2008100053643A 2008-01-28 2008-01-28 Backward reference sentence pattern language analysis method Pending CN101499056A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008100053643A CN101499056A (en) 2008-01-28 2008-01-28 Backward reference sentence pattern language analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2008100053643A CN101499056A (en) 2008-01-28 2008-01-28 Backward reference sentence pattern language analysis method

Publications (1)

Publication Number Publication Date
CN101499056A true CN101499056A (en) 2009-08-05

Family

ID=40946133

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008100053643A Pending CN101499056A (en) 2008-01-28 2008-01-28 Backward reference sentence pattern language analysis method

Country Status (1)

Country Link
CN (1) CN101499056A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102023984B (en) * 2009-09-10 2013-12-04 阿里巴巴集团控股有限公司 Method and device for screening duplicated entity data
CN106095742A (en) * 2016-06-20 2016-11-09 北京金山安全软件有限公司 Text content generation method and server
CN110597082A (en) * 2019-10-23 2019-12-20 北京声智科技有限公司 Intelligent household equipment control method and device, computer equipment and storage medium
CN110892406A (en) * 2017-02-02 2020-03-17 语言探索爱路泰达有限公司 Multi-language exchange system and message transmission method
CN113743054A (en) * 2021-08-17 2021-12-03 上海明略人工智能(集团)有限公司 Alphabet vector learning method, system, storage medium and electronic device

Similar Documents

Publication Publication Date Title
Black et al. Statistically-driven computer grammars of English: The IBM/Lancaster approach
US7035789B2 (en) Supervised automatic text generation based on word classes for language modeling
Rayson Matrix: A statistical method and software tool for linguistic analysis through corpus comparison
Bjarnadóttir The database of modern Icelandic inflection (Beygingarlýsing íslensks nútímamáls)
Sawalha Open-source resources and standards for Arabic word structure analysis: Fine grained morphological analysis of Arabic text corpora
US20110040553A1 (en) Natural language processing
CN101499056A (en) Backward reference sentence pattern language analysis method
Gervás A logic programming application for the analysis of Spanish verse
CN102929864A (en) Syllable-to-character conversion method and device
Onyenwe et al. Toward an effective igbo part-of-speech tagger
Bahadur et al. Architecture of English to Sanskrit machine translation
CN115617965A (en) Rapid retrieval method for language structure big data
KR20200073524A (en) Apparatus and method for extracting key-phrase from patent documents
JP2005063030A (en) Method for expressing concept, method and device for creating expression of concept, program for implementing this method, and recording medium for recording this program
KR100376032B1 (en) Method for recognition and correcting korean word errors using syllable bigram
Maučec et al. Modelling highly inflected Slovenian language
L’haire FipsOrtho: A spell checker for learners of French
JP2005158044A (en) Apparatus, method and program for information retrieval, and computer-readable recording medium stored with this program
Amezian et al. Training an LSTM-based Seq2Seq Model on a Moroccan Biscript Lexicon
JP2005025555A (en) Thesaurus construction system, thesaurus construction method, program for executing the method, and storage medium with the program stored thereon
Loftsson Tagging and parsing Icelandic text
KR20040051426A (en) A Method for the N-gram Language Modeling Based on Keyword
Abdelkader et al. How Existing NLP Tools of Arabic Language Can Serve Hadith Processing
Arısoy et al. Turkish dictation system for broadcast news applications
El Sayed et al. Arabic Information Extraction Methods: A Survey

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20090805
