CN106997245A - A method for building an input method dictionary according to a Chinese language model

Publication number
CN106997245A
Authority
CN
China
Prior art keywords
word
chinese language
module
language model
dictionary
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610066190.6A
Other languages
Chinese (zh)
Inventor
杨文韬
杨景玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/02: Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F3/023: Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F3/0233: Character input methods

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The present invention applies to the field of computer input methods and provides a method for building an input method dictionary according to a Chinese language model. The method consists of a Chinese language model module and a word-making module. The Chinese language model module provides word-formation information and dictionary management information to the word-making module. The word-making module uses the word-formation information supplied by the Chinese language models, together with database management software, to generate input method words in batches. The resulting dictionary uses the dictionary management information supplied by the Chinese language model module, so that entries can easily be added, deleted, retrieved, and sorted. The Chinese language models are distilled according to the speech pauses and semantic completeness required in communication, and fully reflect the characteristics of Chinese. The invention effectively solves key problems of traditional input method dictionaries: the lack of a scientific standard for admitting words and low collection efficiency; content that is incomplete, unsystematic, and cannot be managed effectively; users who cannot know what the dictionary contains; slow input; and the lack of a communicative experience during input.

Description

A method for building an input method dictionary according to a Chinese language model
Technical field
The present invention relates to the field of computer input methods, and in particular to an input method dictionary generated automatically according to Chinese language models.
Background
In the field of Chinese character coded input methods, character encoding technology and dictionary technology are the two core technologies. After more than 30 years of development since the 1980s, character encoding technology has become mature and stable, and the room and potential for innovation in input methods is now concentrated in dictionary technology. Judging by the current state of the art, however, whether an input method is developed for the standard desktop keyboard, for mobile terminals such as phone touchscreens, or for speech recognition, its dictionary has problems in five respects:
First, the dictionary is "small and incomplete". Its entries come mainly from the general vocabulary of reference dictionaries and from words accumulated manually day to day. Such a dictionary usually holds between tens of thousands and somewhat over a hundred thousand entries. Because it is small and consists mostly of standardized, static, "dictionary-style" words, it cannot meet the flexible and varied needs of dynamic communication in Chinese. The typist cannot enter whole word combinations at once, so the input experience is poor and efficiency is low. Representative input methods of this kind include the code-based input methods typified by Wangma Wubi and, among pinyin input methods, the Quanpin (full pinyin) input method, Intelligent ABC, Ziguang Pinyin, and Pinyin JiaJia;
Second, the dictionary is "large but incomplete". Representative input methods include Sogou Pinyin and Baidu Input. Their dictionaries are generally built as "general dictionary-style words + new and hot words discovered by a search engine". The general words come from reference dictionaries and day-to-day accumulation. The discovery of new and hot words by search engine is implemented with existing patented techniques such as "A method and device for obtaining new words, and an input method system" (publication number CN1924858B), "Internet hot word mining method and device" (publication number CN104679738A), "Method and system for providing new words or hot words" (publication number CN102163198A), and "Dictionary generation method and system, input method, and input system" (publication number CN103853746A). The basic principle is that a search engine collects what users type into input boxes on web pages, segments that text into words, compares the result against a back-end corpus, and, once a word reaches a certain frequency threshold, classifies it as a hot word or new word and adds it to the input method dictionary. As new and hot words accumulate, the dictionary typically grows to between several hundred thousand and nearly a million entries; with "cloud" entries added, the dictionary available for lookup can reach several million or even nearly ten million entries. Although the scale is unprecedented, the technique harvests from what users type into search boxes, and what users search for is mostly popular personal names, place names, news events, and internet buzzwords. Few people search for word combinations such as "on this basis", "that is not to say", "speaking faster and faster", "after finishing a meal", "cannot even explain it clearly", "ran off", or "one after another". Search engine techniques therefore struggle to collect these high-quality, "large-chunk" entries, which are closest to the essential core of Chinese and the most widely applicable. The so-called new and hot words also have significant limitations of their own. First, their life cycle is short: a word that is popular now may soon be discarded, for example "My father is Li Gang", "Chinese-style road crossing", "APEC blue", and "Yuanfang style". Second, their scope of use is narrow: the probability of meeting a new or hot word in everyday documents and conversation is very small, so the gain in input efficiency and experience from them is very limited. Third, the population keen on using new and hot words is also limited. The article "Survey of the use of annual new words among upper-grade primary school students" (bang flat, Yang Chuanxin), published on page 2 of Spoken and Written Languages News on January 6, 2016 (issue 953 overall), reports that "in attitude toward using annual new words, 28.76% clearly state that they 'like' them, while 'indifferent' and 'cannot say' account for 45.15% and 19.40% respectively; 6.69% 'dislike' them", and that "in frequency of using annual new words, 17.57% of students use them 'often', 55.85% use them 'occasionally', and 'hardly ever' and 'never' account for 16.72% and 10.37% respectively". In summary, discovering new and hot words by search engine is not suitable for building the core dictionary of an input method; it is suitable only as a supplement. Building an input method dictionary with this technique alone leaves many genuinely valuable entries unmined while admitting large numbers of useless "junk" words, which produces the "large but incomplete" phenomenon;
Third, the dictionary content is hard to remember. Whether the dictionary is "small and incomplete" or "large but incomplete", the user must memorize which words it contains and which it does not in order to type by word. Faced with anywhere from tens of thousands to nearly ten million entries that follow no linguistic rule or pattern, users find this very difficult, which is a considerable obstacle to using the dictionary effectively;
Fourth, entries are "collected but hard to manage". Current input method dictionaries are essentially simple piles of Chinese text. The entries carry no linguistic information or dictionary management information, so they cannot be classified, selected, added, deleted, or sorted as needed. As a result, upgrading the dictionary is difficult, targeted customization is weak, and repeated maintenance is laborious, all of which hinders the development of input method dictionaries;
Fifth, the input experience is poor and efficiency is low. Existing input method dictionaries either copy dictionary words mechanically or chase internet new and hot words; they do not study word-combination patterns from the standpoint of Chinese communication. It is therefore difficult to type according to the speech pauses and semantic completeness of communication, so text input is disconnected from communication and input efficiency is low.
Summary of the invention
The purpose of the invention is to solve the problems described above: input method dictionaries that are "small and incomplete" or "large but incomplete", hard to remember, and hard to manage, and an input experience that is poor and inefficient for the typist.
To achieve this purpose, the invention adopts the following technical solution:
A method for building an input method dictionary according to a Chinese language model, consisting of a Chinese language model module and a word-making module.
The Chinese language model module provides word-formation information to the word-making module when words are generated in batches, and provides dictionary management information for the dictionary that is finally generated. The word-making module automatically generates words in batches according to the word-formation information provided by the Chinese language model module.
The Chinese language model module consists of a model identification submodule and a model word-formation information submodule. The model identification submodule contains the Chinese language models. A Chinese language model is composed of an identifier string that stands for the word-making body, together with prefixes, infixes, and suffixes: an element attached before the word-making body is a prefix, an element attached after it is a suffix, and an element inserted into the middle of it is an infix. A prefix, infix, or suffix may appear alone or together with the others, and a model may contain one or more infixes. The word-making body is a basic word, held in the word-making body submodule of the word-making module, from which new words are made. The model word-formation information submodule consists mainly of the following information data tables: a language-attribute table (spoken, written, dialect, and so on); a word-structure-type table (subject-predicate, verb-object, attributive-head, and so on); a semantic-category table (expressing time, space, quantity, degree, and so on); a mood-type table (interrogative, declarative, imperative, exclamatory); a voice-type table (active, passive, causative, and so on); and tables for modification level, weight order, and similar information. The Chinese language models in the Chinese language model module are distilled according to the speech-pause patterns and semantic-completeness requirements of Chinese.
The word-making module consists of a word-making body submodule, a part-of-speech tagging submodule, and a word-structure tagging submodule. The word-making body submodule contains the basic words used for word making, which are called word-making bodies. The part-of-speech tagging submodule tags these basic words by part of speech and divides them into data tables for nouns, verbs, adjectives, pronouns, adverbs, numerals, classifiers, prepositions, conjunctions, particles, interjections, onomatopoeia, and so on. The word-structure tagging submodule tags the basic words by word structure and divides them into data tables for subject-predicate, predicate-object, predicate-complement, attributive-head, adverbial-head, classifier-head, numeral-classifier, coordinate, reversed-order, reduplicated, sequential, appositive, and blended structures, for prepositional phrases, and for synonyms, antonyms, and parallel words. The data tables in both tagging submodules record breakpoint information for each word-making body, so that during word making an insertion can be performed inside the body and the parts before and after the breakpoint can be processed separately.
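The relationship between a model's affixes and a word-making body with a breakpoint can be sketched in Python. This is a minimal illustration under stated assumptions, not the implementation of the invention: the class name `LanguageModel`, its field names, and the restriction to one infix and one breakpoint are hypothetical, and the Chinese strings are a reconstruction of the verb-object example used in this description (a verb-object word with an inserted "finish" and a suffixed "after").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageModel:
    """One Chinese language model: affixes placed around a word-making body.

    Hypothetical structure; the source only states that a model has an
    identifier string plus optional prefix, infix(es), and suffix.
    """
    label: str        # identifier string for the model
    prefix: str = ""  # attached before the word-making body
    infix: str = ""   # inserted at the body's breakpoint (one infix assumed)
    suffix: str = ""  # attached after the word-making body

    def apply(self, left: str, right: str) -> str:
        # left/right are the parts of the body before and after its breakpoint
        return f"{self.prefix}{left}{self.infix}{right}{self.suffix}"


# Assumed reconstruction: "eat" | "meal" -> "after finishing the meal"
after_finishing = LanguageModel(label="Verb完Obj后", infix="完", suffix="后")
print(after_finishing.apply("吃", "饭"))  # prints 吃完饭后
```

A model with several infixes would need one breakpoint per infix; that extension is left out here to keep the sketch short.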
To achieve the purpose of the invention, the invention provides a method for batch-generating input method dictionary words according to Chinese language models, comprising the following three steps:
Step 1: Distill the Chinese language models and, on that basis, build the model identification submodule and the model word-formation information submodule;
Step 2: Select basic, general-purpose word-making material from reference works such as the Modern Chinese Dictionary and by manual collection, and on that basis build the word-making body submodule, the part-of-speech tagging submodule, and the word-structure tagging submodule;
Step 3: Using database software, associate the information in the model word-formation information submodule of the Chinese language model module with the corresponding data tables in the word-making module, and batch-generate input method dictionary words with database queries.
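Step 3 can be sketched with Python's built-in `sqlite3` module. The source specifies only that model information is associated with the word-making tables in a database and that words are generated by query; the table names, column names, and the `structure` join key below are hypothetical, and the Chinese strings are reconstructed examples.

```python
import sqlite3


def build_lexicon(bodies, models):
    """Batch-generate entries by joining word-making bodies with models.

    bodies: iterable of (left_part, right_part, structure)
    models: iterable of (label, structure, prefix, infix, suffix)
    Returns a list of (entry, model_label) rows.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE body (left_part TEXT, right_part TEXT, structure TEXT)"
    )
    conn.execute(
        "CREATE TABLE model (label TEXT, structure TEXT,"
        " prefix TEXT, infix TEXT, suffix TEXT)"
    )
    conn.executemany("INSERT INTO body VALUES (?, ?, ?)", bodies)
    conn.executemany("INSERT INTO model VALUES (?, ?, ?, ?, ?)", models)
    # One query produces every body/model combination of matching structure.
    rows = conn.execute(
        """
        SELECT m.prefix || b.left_part || m.infix || b.right_part || m.suffix,
               m.label
        FROM body AS b
        JOIN model AS m ON b.structure = m.structure
        ORDER BY b.rowid, m.rowid
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    demo = build_lexicon(
        [("吃", "饭", "verb-object"), ("交", "钱", "verb-object")],
        [("Verb完Obj后", "verb-object", "", "完", "后")],
    )
    for entry, label in demo:
        print(entry, label)
```

Keeping the model label next to each generated entry is what later allows the dictionary to be managed by model rather than entry by entry.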
To achieve the purpose of the invention and to extend it to specialized input fields, the invention also provides a method for building a specialized dictionary according to Chinese language models, comprising the following four steps:
Step 1: Distill the Chinese language models and, on that basis, build the model identification submodule and the model word-formation information submodule;
Step 2: Set up a word-making material database of specialized terms;
Step 3: Based on that material database, build the word-making body submodule, the part-of-speech tagging submodule, and the word-structure tagging submodule;
Step 4: Using database software, associate the information in the model word-formation information submodule of the Chinese language model module with the corresponding data tables in the word-making module, and batch-generate specialized input method words with database queries.
To achieve the purpose of the invention, the invention also provides a method for effectively managing dictionary content based on Chinese language models, consisting of a corpus module and a dictionary information module. The corpus module contains all the words that the word-making module generates under the Chinese language models. The dictionary information module has the same composition as the model word-formation information submodule of the Chinese language model module, and its data is passed to it from that submodule.
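Managing the dictionary by model can be sketched as follows, again with `sqlite3`. This is an illustration under assumptions: the `lexicon` and `model_info` tables, the `register` and `weight` columns, and both function names are hypothetical stand-ins for the corpus module and the dictionary information module.

```python
import sqlite3


def open_lexicon(entries, model_info):
    """entries: (entry, model_label); model_info: (model_label, register, weight)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lexicon (entry TEXT, model_label TEXT)")
    conn.execute(
        "CREATE TABLE model_info ("
        "model_label TEXT PRIMARY KEY, register TEXT, weight INTEGER)"
    )
    conn.executemany("INSERT INTO lexicon VALUES (?, ?)", entries)
    conn.executemany("INSERT INTO model_info VALUES (?, ?, ?)", model_info)
    return conn


def entries_by_register(conn, register):
    """Retrieve entries whose model has the given language attribute, by weight."""
    rows = conn.execute(
        "SELECT l.entry FROM lexicon AS l "
        "JOIN model_info AS i ON l.model_label = i.model_label "
        "WHERE i.register = ? ORDER BY i.weight DESC, l.rowid",
        (register,),
    )
    return [row[0] for row in rows]


def delete_model(conn, model_label):
    """Delete every entry generated by one model; returns the number removed."""
    cursor = conn.execute(
        "DELETE FROM lexicon WHERE model_label = ?", (model_label,)
    )
    return cursor.rowcount
```

Because every entry carries its model label, selection, sorting, and deletion operate on whole groups of entries with a single statement.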
To achieve the purpose of the invention, the invention also provides a method for prompting model words in the input method prompt box, comprising the following three steps:
Step 1: Add Chinese language model information to the input method code table, so that every word in the code table is put in one-to-one correspondence with a Chinese language model;
Step 2: Add to the input method engine a function that, when searching the code table, looks up the corresponding words according to the Chinese language model;
Step 3: Add to the input method prompt box an icon or button for viewing model words, or a similar indicator. When the entered code corresponds to a group of model words, the indicator is activated. When the mouse cursor moves over the indicator, the Chinese language model is displayed. When the indicator is clicked or a predefined key is pressed, all the words corresponding to that Chinese language model are displayed.
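The lookup behind the three steps above can be sketched as a code table that is indexed both by input code and by model. The class `CodeTable`, its method names, and the pinyin codes are hypothetical; the source does not specify how the engine stores the code table or which candidate's model is shown when a code has several candidates (the first is used here).

```python
from collections import defaultdict


class CodeTable:
    """Code table in which every word carries its language model label."""

    def __init__(self):
        self._by_code = defaultdict(list)   # input code -> [(word, model_label)]
        self._by_model = defaultdict(list)  # model_label -> [word]

    def add(self, code, word, model_label):
        self._by_code[code].append((word, model_label))
        self._by_model[model_label].append(word)

    def hint(self, code):
        """Return (model_label, all words of that model), or None if no match.

        A non-None result corresponds to activating the indicator; the label
        is what a hover would show, the word list what a click would show.
        """
        candidates = self._by_code.get(code)
        if not candidates:
            return None
        _, model_label = candidates[0]
        return model_label, list(self._by_model[model_label])


table = CodeTable()
table.add("chiwanfanhou", "吃完饭后", "Verb完Obj后")
table.add("jiaowanqianhou", "交完钱后", "Verb完Obj后")
print(table.hint("chiwanfanhou"))
```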
Beneficial effects
An input method dictionary built by the invention fully reflects the syntactic and morphological characteristics of Chinese, especially the word-combination patterns of communication. The typist can enter text following the speech pauses and semantic completeness of natural communication, which creates a simulated natural-language communication environment and improves the input experience;
An input method dictionary built by the invention takes Chinese language models as its organizing principle and summarizes the word-combination patterns of communication comprehensively and systematically. This establishes a unified standard for admitting words and ensures that every word that should be included is included. It solves the problem that conventional input method dictionaries collect words incompletely, unsystematically, and without a standard, and it prevents the situation in which a user, unsure of what the dictionary contains, types a code that matches no word and then has to delete it;
An input method dictionary built by the invention results from "actively" generating words in a targeted, focused way, grounded in the inherent rules of Chinese and traced back to their source. This is the opposite of, and essentially different from, "fishing in the sea" for new and hot words with a search engine. Accuracy, generality, and practicality improve across the board, and "junk" words are effectively kept out, which saves computing resources when the dictionary is used, improves retrieval efficiency, and spares the typist the interference of junk words;
An input method dictionary built by the invention substantially raises the rate at which Chinese is entered as word combinations, so that single-character entry approaches zero. The typist avoids the old "squeezing toothpaste" style of character-by-character entry and enters "large-chunk" units whole. Input efficiency can improve by more than 30%, which yields considerable gains in work efficiency and social benefit;
In an input method dictionary built by the invention, most words are generated from Chinese language models. Once the user remembers one word, the user can infer that all words with the same pattern are also included. For example, if entering "after finishing a meal" as a whole succeeds, then "after finishing a bath", "after paying the money", "after finishing the quarrel", and "after buying the vegetables" can likewise be entered whole with confidence. The user's task shifts from memorizing a large number of dictionary entries to remembering a limited number of Chinese language models, which greatly reduces the memory load. Combined with the Chinese language model prompt function in the input method prompt box, the user can grasp what the dictionary contains still more easily, which further improves input experience and efficiency;
An input method dictionary built by the invention attaches complete linguistic information and dictionary management information to each Chinese language model and uses database technology, thereby achieving the goal of making words by model and managing the dictionary by model. This solves the problem that dictionaries could not be managed precisely: routine maintenance, fine-grained selection, customization for specific fields, and upgrades all become very simple, which effectively resolves the "collected but hard to manage" problem of conventional input method dictionaries;
In addition, because an input method dictionary developed with the invention is built by mining Chinese language models and so essentially embodies the characteristics and rules of the Chinese language, it is applicable not only to conventional keyboard and touchscreen input methods but also to other fields of Chinese information processing, such as speech-recognition input methods and robot recognition of human language, where it can markedly improve recognition efficiency and accuracy.
Detailed description
The basic idea of the invention is to build an input method dictionary from Chinese language models and to manage it effectively. Following this idea, the corresponding modules of the invention are described in more detail below with reference to embodiments:
1. Distilling the Chinese language models
Chinese language models are distilled mainly along three dimensions.
First, models are distilled according to semantic completeness and speech-pause patterns in communication. Semantic completeness refers to a word combination in a sentence that carries a complete meaning. For example, if "have a meal", "do work", "sing a song", "pay money", and "do homework" are each taken as a semantic unit, then "after finishing the meal", "after finishing the work", "after finishing the song", "after paying the money", and "after finishing the homework" should each be regarded as a complete semantic unit as well. Analyzed in this way, each of them is formed by inserting "finish" into the middle of a verb-object word and then adding the suffix "after". From this, the word-combination model "verb + finish + object + after" can be extracted. With Verb standing for the verb and Obj for the object, the Chinese language model is written "Verb + finish + Obj + after", and it represents a large batch of words of this kind. As for speech pauses in communication, take the sentence "The thousand-year-old temple is located at the foot of Mount Huang" as an example. By a speaker's normal pausing habits, it divides as "thousand-year-old temple / is located at / the foot of Mount Huang". Taking "is located at" as the element to distill, the word combination "verb + at" can be summarized. With Verb standing for the verb, this Chinese language model is written "Verb + at", and it represents a large batch of words such as "is located at", "disappears into", "lost at", "written on", and "hung on";
Second, models are distilled according to the rules by which Chinese parts of speech and word structures generate phrases. With reference to modern Chinese grammar, the parts of speech are mainly nouns, verbs, adjectives, pronouns, adverbs, prepositions, conjunctions, particles, interjections, numerals, classifiers, and onomatopoeia. The word structures are mainly the verb-object, subject-predicate, attributive-head, adverbial-head, predicate-complement, classifier-head, coordinate, reversed-order, reduplicated, blended, and prepositional structures. Different parts of speech generate phrases by different rules. An adjective, for example, combines closely with degree words. With Adj standing for the adjective, Chinese language models such as "very Adj", "extremely Adj", and "Adj very much" can be extracted; they represent, respectively, large batches of words such as "very cold, very difficult, very generous", "extremely good, extremely smart, extremely strained", and "cold very much, fast very much, expensive very much". Different word structures likewise generate phrases by different rules. Take the verb-object structure: aspect particles can be inserted in the middle of the word, and suffixes such as "before" and "after" can be added at the end. With Verb standing for the verb and Obj for the object, related models can be extracted: "Verb + durative particle + Obj", "Verb + perfective particle + Obj", "Verb + experiential particle + Obj", "Verb + experiential particle + Obj + after", and "Verb + Obj + before". They represent, respectively, large batches of words such as "singing a song, eating a meal, going online, drinking tea", "lost money, skipped work, suspended class", "have taken a bath, have sold vegetables, have sung opera", "after having paid money, after having brushed teeth, after having bought vegetables", and "before a meal, before sleeping, before going online".
Third, models are distilled according to the structural patterns of Chinese sentence types, chiefly the four major types: interrogative, declarative, imperative, and exclamatory. Each sentence type has its own structural features, from which different Chinese language models can be extracted. The interrogative type, for example, commonly takes the form "X or not X". With Verb standing for the verb and Adj for the adjective, Chinese language models such as "Verb + not + Verb", "Adj + not + Adj", and "not Verb" can be extracted; they represent, respectively, large batches of words such as "hand in or not, learn or not, eat or not", "cold or not, difficult or not, beautiful or not", and "not hand in, not learn, not write". The imperative type is another example: its structural feature is that it usually demands that someone do or not do something. With Verb standing for the verb, Chinese language models such as "quickly Verb", "don't Verb", and "be sure to Verb" can be extracted; they represent, respectively, large batches of words such as "say it quickly, walk quickly, go quickly", "don't move, don't speak, don't leave", and "be sure to come, be sure to say, be sure to hand in". Likewise, large numbers of Chinese language models can be extracted from the declarative and exclamatory types.
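Prefix-only models such as the imperative ones can be sketched as a mapping from model label to prefix. The function and constant names are hypothetical, and the Chinese prefixes are an assumed reconstruction of the "quickly", "don't", and "be sure to" models from the translated text.

```python
# Assumed reconstruction of three imperative models: label -> prefix.
IMPERATIVE_PREFIXES = {
    "快Verb": "快",      # "quickly Verb"
    "别Verb": "别",      # "don't Verb"
    "一定Verb": "一定",  # "be sure to Verb"
}


def imperative_entries(verbs):
    """Generate one batch of words per imperative model from a verb list."""
    return {
        label: [prefix + verb for verb in verbs]
        for label, prefix in IMPERATIVE_PREFIXES.items()
    }


batches = imperative_entries(["说", "走"])
print(batches["别Verb"])
```

In practice the verb list would first be filtered by subclass, since not every verb is natural in every imperative model.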
After the Chinese language models are extracted, the necessary linguistic information and dictionary management information must be attached to them. This includes: language attribute, mainly whether the model is spoken, written, or dialect; structure type, mainly whether it is a verb-object, subject-predicate, attributive-head, adverbial-head, predicate-complement, classifier-head, coordinate, reversed-order, reduplicated, blended, or prepositional structure; semantic category, mainly whether it expresses time, space, quantity, degree, belonging, judgment, or result; mood type, mainly whether it is interrogative, declarative, imperative, or exclamatory; and voice type, mainly whether it is active, passive, or causative. Modification level and weight-order information are attached as well. The information is attached by setting up the corresponding fields in the database tables.
2. Building the word-making module. The word-making module is the basic platform for automatically batch-generating dictionary words; the words in its data tables are the word-making bodies referred to in the Chinese language models. Both the data tables of the part-of-speech tagging submodule and those of the word-structure tagging submodule should carry detailed subclass information, so that word-making bodies can be selected precisely during word making. For example, with Noun standing for a noun, the Chinese language model "Noun + person" represents a large batch of words such as "Chinese person, American, Canadian, Beijinger, Shanghainese, Anshan native, Northeasterner". Word making draws on the noun table in the part-of-speech tagging submodule for its word-making bodies, but observation shows that Noun here is limited to nouns for places such as countries, regions, and cities, not the whole noun table. The noun table must therefore single out place nouns as a subclass before accurate word making is possible. The subdivisions of the main data tables in the word-making module are listed below:
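The need for subclass information can be sketched with the "Noun + person" model: only nouns tagged as place nouns are eligible. The function name, the subclass tags, and the sample nouns are hypothetical, and the suffix 人 is an assumed reconstruction of "person".

```python
# Hypothetical noun table rows: (word, subclass)
NOUNS = [
    ("中国", "place"),
    ("上海", "place"),
    ("桌子", "material"),
    ("昨天", "time"),
]


def apply_suffix_model(nouns, subclass, suffix):
    """Apply a suffix model only to nouns of the required subclass."""
    return [word + suffix for word, tag in nouns if tag == subclass]


print(apply_suffix_model(NOUNS, "place", "人"))
```

Without the subclass filter, the same model would also produce meaningless combinations from nouns such as "table" or "yesterday".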
In part-of-speech tagging submodule, noun is subdivided into:Material noun, abstract noun, life noun, countable noun, orientation name Word, time noun, place noun, appellation noun;Pronoun is subdivided into:It is personal pronoun, demonstrative pronoun, interrogative pronoun, special Pronoun;Verb is subdivided into:Actional verb, statal verb, transitive verb, directional verb, modal verb;Adjective is subdivided into: Qualifying adjective, state adjective;Measure word is subdivided into:Individual measure word, set measure word, measurement word, indefinite measure word, quasi- measure word, Compound classifier, momentum word, borrow measure word, macroscopical measure word;Adverbial word is subdivided into:Adverb of time, adverb of place, degree adverb, Scope adverbial word, frequency adverbial word, tone adverbial word.Situation is no longer numerous states for other tables of data subdivision in part-of-speech tagging submodule.
In the word-structure tagging submodule, the subject-predicate structure is subdivided into noun-verb, noun-adjective, pronoun-verb, and pronoun-adjective types; the verb-object structure into verb-noun and verb-pronoun types; the predicate-complement structure into verb-adjective, verb-numeral-measure, and verb-"become" types; and the attributive-head ("centering") structure into adjective-noun, noun-noun, verb-noun, and numeral-measure-noun types. The subdivisions of the other data tables in the word-structure tagging submodule are not enumerated here.
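The role of the sub-classification can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table layout, the subclass labels, the sample words, and the reading of the model "Noun people" as a noun followed by the character 人 are hypothetical and are not taken from the patent text.

```python
# Hypothetical noun table: (word-making body, subclass label).
# The rows are illustrative sample data, not the patent's own table.
NOUNS = [
    ("中国", "place"),
    ("上海", "place"),
    ("东北", "place"),
    ("桌子", "countable"),
    ("友谊", "abstract"),
]


def make_words(rows, subclass, suffix):
    """Append `suffix` to every word-making body of the given subclass."""
    return [word + suffix for word, sub in rows if sub == subclass]


# Assumed reading of the model "Noun people": place noun + 人.
place_people = make_words(NOUNS, "place", "人")
print(place_people)
```

Without the subclass filter, the same model would also produce entries from countable or abstract nouns, which is the imprecision the sub-classification is meant to prevent.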
Besides adding detailed sub-classification information to each data table in the word-making module, a breakpoint must also be set for every word (that is, every word-making body) in each table, so that during word making an insertion can be performed at the breakpoint and the two parts of the word-making body before and after it can be processed separately. For words in the data tables of the word-structure tagging submodule, the breakpoint coincides with the boundary between the components of the structure. Taking the verb-object table as an example, its words are "have a meal, catch the train, run on a bank, working, sing ...", and their verb-object boundaries can be written "eat-meal, catch up with-train, run on a bank-people, upper-class, sing-sing ..."; the breakpoints fall at the same positions, which is easy to understand. The principle for setting breakpoints in the part-of-speech tagging submodule needs more emphasis. Most words in these tables have a simple structure with no natural internal boundary, so the breakpoint is simply placed in the middle of the word. Taking the adjective table as an example, its words are "generous, flourishing, flurried, beautiful ..."; with breakpoints set they become "big-square, red-fiery, flurried-, drift-bright ...". Once the breakpoints exist, the Chinese language models AdjLeftAdjLeftAdjRightAdjRight, AdjLeftAdjRightAdjRight, and "AdjLeftAdjLeftAdjRightAdjRight Ground" (where AdjLeft and AdjRight denote the parts of the adjective to the left and right of the breakpoint) can each generate a large batch of words: the first repeats both halves ("naturally and easily, red fire fiery, covered with confusion, float beautiful bright ..."), the second repeats only the right half ("generous side, flourishing fiery, flurried, beautiful bright ..."), and the third repeats both halves and appends a suffix ("naturally and easily, flourishing, in a split of a hurry, float beautiful brightly ...").
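The breakpoint mechanism can be sketched as a simple string substitution. The sketch below assumes that each adjective is stored with a character index for its breakpoint, that the suffix rendered as "Ground" in the third model is the character 地, and that the sample adjectives are acceptable stand-ins for the table contents; all three are assumptions made for illustration.

```python
# Hypothetical adjective table: (word, breakpoint index in characters).
ADJECTIVES = [("大方", 1), ("漂亮", 1), ("慌张", 1)]

# Model identifiers from the description; the trailing 地 is an assumed
# reading of the suffix rendered as "Ground" in the translated text.
MODELS = [
    "AdjLeftAdjLeftAdjRightAdjRight",
    "AdjLeftAdjRightAdjRight",
    "AdjLeftAdjLeftAdjRightAdjRight地",
]


def expand(model: str, word: str, bp: int) -> str:
    """Split `word` at breakpoint `bp` and substitute both halves into `model`."""
    left, right = word[:bp], word[bp:]
    return model.replace("AdjLeft", left).replace("AdjRight", right)


# One batch of generated words per model.
batches = {m: [expand(m, w, bp) for w, bp in ADJECTIVES] for m in MODELS}
for model, words in batches.items():
    print(model, words)
```

Because the model string itself encodes the splice order, adding a new reduplication pattern only requires a new model identifier, not new code.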
3. An embodiment of batch-generating dictionary words according to Chinese language models.
It is divided into three steps:
Step 1: Extract the Chinese language models.
Suppose six Chinese language models have been extracted, such as "Verb not Com", "Verb also Verb not Com", "Adj is Adj point", "Adj return Adj", and "Adj very much" (where VerbCom denotes a verb-complement structure, Verb and Com are its "verb" part and "complement" part respectively, and Adj denotes an adjective). The Chinese language model module is customized on this basis;
Step 2: In the part-of-speech tagging submodule and the word-structure tagging submodule of the word-making module, customize an adjective table and a verb-complement word table respectively. Suppose the adjective table contains 3,000 words such as "difficult, nervous, expensive, remote, long, big, arduous ...", and the verb-complement word table contains 5,000 words such as "see clearly, wash clean, walk fast, eat up ...", each with a breakpoint set in the middle of the word;
Step 3: Using the SQL database query language and the word-building information provided by the Chinese language model module, splice together either the whole words or the parts to the left and right of the breakpoint from the adjective table and the verb-complement word table to generate the required words. For the adjective table, the models "Adj is Adj point", "Adj return Adj", and "Adj very much" use the whole word. Splicing generates "difficult is difficult point, nervous is nervous point, expensive is expensive point, far is far point, long is long point, big is big point, arduous is arduous point ...", "difficult return difficult, nervous return nervous, expensive return expensive, far return far, long return long, big return big, arduous return arduous ...", and "difficult very much, nervous very much, expensive very much, far very much, long very much, big very much, arduous very much ...", 3,000 words for each model. Similarly, for the verb-complement word table, the model "Verb not Com" splices words in the form "left part of the verb-complement word + not + right part of the verb-complement word", generating 5,000 words in one pass ("cannot see clearly, cannot wash clean, cannot walk fast, cannot eat up ..."). The model "Verb also Verb not Com" splices words in the form "left part + also + left part + not + right part", likewise generating 5,000 words in one pass.
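The three steps above can be sketched with Python's built-in `sqlite3` module. The table names, column names, sample rows, and the Chinese characters used for "not" (不), "also" (也), and "very much" (极了) are assumptions chosen for illustration; the patent only states that SQL queries splice whole words or breakpoint halves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: bp is the breakpoint, counted in characters.
    CREATE TABLE adjective (word TEXT);
    CREATE TABLE verb_complement (word TEXT, bp INTEGER);
    INSERT INTO adjective VALUES ('难'), ('贵'), ('紧张');
    INSERT INTO verb_complement VALUES ('看清', 1), ('洗干净', 1), ('吃完', 1);
""")


def column(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]


# "Adj very much": whole word plus suffix (assumed to be 极了).
adj_very = column("SELECT word || '极了' FROM adjective ORDER BY rowid")

# "Verb not Com": left part + 不 + right part.
verb_not_com = column("""
    SELECT substr(word, 1, bp) || '不' || substr(word, bp + 1)
    FROM verb_complement ORDER BY rowid
""")

# "Verb also Verb not Com": left + 也 + left + 不 + right.
verb_also = column("""
    SELECT substr(word, 1, bp) || '也' || substr(word, 1, bp)
           || '不' || substr(word, bp + 1)
    FROM verb_complement ORDER BY rowid
""")

print(adj_very, verb_not_com, verb_also, sep="\n")
```

Each query produces one output row per word-making body, which matches the description's count of one generated word per table entry and model.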
4. An embodiment of a dictionary management system, built on Chinese language models, that manages dictionary content effectively.
It is divided into three steps:
Step 1: Build the corpus module. This module holds all the words generated by the word-making module of the present invention, and each word is mapped to the Chinese language model that generated it;
Step 2: Build the dictionary information module. Its composition is identical to that of the model word-building information submodule in the Chinese language model module of the present invention, from which it receives its data. In addition, retention-time information is attached to measure the length of a word's life cycle; it is graded as long, general, or short so that outdated words can be deleted from the dictionary promptly. Language-block classification information is also attached, dividing words into semantic words and speech-pause words. Furthermore, the weight order defines six single-dimension weights (normativeness, written-language character, versatility, paralogy, associativity, and specific factors) plus a comprehensive weight, to meet the needs of ordering words that share the same code and of customizing specific dictionaries.
Step 3: Once the two modules above have been built, the dictionary can be managed effectively on a database platform. For example, to reduce the code-repetition rate and respect user habits, suppose that among the many dictionary words such as "difficult very much, nervous very much, expensive very much, far very much, long very much, big very much, arduous very much ...", only those formed as "single-character adjective + very much" ("difficult very much, expensive very much, far very much, long very much, big very much ...") are to be kept, while those formed as "adjective of two or more characters + very much" ("nervous very much, arduous very much ...") are to be deleted. It suffices to use the database query language to find, in the corpus module, the entries that come from the "Adj very much" model and whose length is greater than 3, and to delete them from the dictionary. As another example, dictionary words of different modification levels can be selected, so as to control dictionary capacity sensibly, by setting a suitable modification-level value in the query. Suppose the modification-level value is set to 1. Then, when querying among "very well, too beautiful, pretty good, flown, walk ...", the words with two levels of modification ("too beautiful, pretty good ...") are filtered out, leaving only the words with one level of modification ("very well, flown, walk ...").
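Both management operations in Step 3 reduce to short queries against the corpus module. The following is a minimal sketch using `sqlite3`; the `corpus` table, its columns, the model labels, and the modification-level values assigned to the sample words are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical corpus module: each word records the model that made it
    -- and an assumed modification level.
    CREATE TABLE corpus (word TEXT, model TEXT, mod_level INTEGER);
    INSERT INTO corpus VALUES
        ('难极了',   'Adj极了', 1),
        ('贵极了',   'Adj极了', 1),
        ('紧张极了', 'Adj极了', 1),
        ('艰苦极了', 'Adj极了', 1),
        ('很好',     '很Adj',   1),
        ('太漂亮了', '太Adj了', 2);
""")

# Delete words from the "Adj very much" model whose length exceeds 3
# characters, i.e. those built on adjectives of two or more characters.
conn.execute("DELETE FROM corpus WHERE model = 'Adj极了' AND length(word) > 3")

remaining = [r[0] for r in conn.execute(
    "SELECT word FROM corpus WHERE model = 'Adj极了' ORDER BY rowid")]

# Keep only words whose modification level does not exceed the chosen value.
max_level = 1
level_one = [r[0] for r in conn.execute(
    "SELECT word FROM corpus WHERE mod_level <= ? ORDER BY rowid", (max_level,))]

print(remaining)
print(level_one)
```

Recording the generating model alongside every word is what makes the first query possible: the filter targets a whole family of entries without inspecting the words one by one.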
5. An embodiment of automatically completing the Pinyin coding of batch-generated words using Chinese language models.
As is well known, adding Pinyin codes to words is an enormous and tedious job when building the code table of a Pinyin input method. Software can annotate pronunciations automatically, but because Chinese contains a large number of polyphonic characters, manual verification is still required. Cooperation between the Chinese language model module and the word-making module solves this problem easily. First, a phonetic-annotation table is customized in which the components of the Chinese language models, such as prefixes, infixes, and suffixes, are annotated individually; the word-making bodies in each data table of the word-making module are annotated separately. Once both tasks are complete, the database query language splices words and codes together, so that every new word generated in a batch is annotated automatically. Because each generated word inherits the annotations of its components, the workload drops sharply, the stated accuracy is 100%, and manual proofreading is no longer needed.
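The splicing of words and codes can be sketched as follows. The two annotation tables, the space-separated toneless Pinyin format, and the sample entries are assumptions for illustration; the patent does not specify a storage format.

```python
# Hypothetical annotation tables, each verified once by hand.
BODY_PINYIN = {"难": "nan", "贵": "gui", "紧张": "jin zhang"}
# 了 is polyphonic (le / liao); annotating the suffix once fixes its reading.
AFFIX_PINYIN = {"极了": "ji le"}


def generate_with_pinyin(bodies, suffix):
    """Return (word, pinyin) pairs for the model: body + suffix."""
    return [
        (body + suffix, f"{BODY_PINYIN[body]} {AFFIX_PINYIN[suffix]}")
        for body in bodies
    ]


coded_words = generate_with_pinyin(["难", "贵", "紧张"], "极了")
for word, code in coded_words:
    print(word, code)
```

The polyphonic suffix is resolved a single time in the affix table, so none of the generated words needs an individual check.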

Claims (8)

1. A method for building an input method dictionary according to Chinese language models, characterized in that it comprises a Chinese language model module and a word-making module, wherein
the Chinese language model module provides word-building information to the word-making module for batch generation of words, and provides dictionary management information for the finally generated dictionary;
the word-making module automatically generates words in batches according to the word-building information provided by the Chinese language model module.
2. The method for building an input method dictionary according to Chinese language models according to claim 1, characterized in that the Chinese language model module consists of a model identification submodule and a model word-building information submodule,
the model identification submodule contains the Chinese language models; a Chinese language model consists of a character-string identifier representing the word-making body, together with prefixes, infixes, and suffixes; a component attached before the word-making body is a prefix, a component attached after the word-making body is a suffix, and a component inserted in the middle of the word-making body is an infix; the prefix, infix, and suffix may occur singly or together, and there may be one or more infixes; the word-making body refers to the basic words for word making contained in the word-making body submodule of the word-making module;
the model word-building information submodule mainly consists of a language-attribute information data table classified by spoken language, written language, dialect, and so on; a word-structure type information data table classified by subject-predicate, verb-object, attributive-head, and so on; a semantic-domain information data table classified by expression of time, space, quantity, degree, and so on; a tone type information data table classified by interrogative, declarative, imperative, and exclamatory; a voice type information data table classified by active, passive, causative, and so on; and information data tables for modification level, weight order, and so on;
the Chinese language models are extracted according to the pause characteristics of Chinese speech and the requirement of semantic completeness.
3. The method for building an input method dictionary according to Chinese language models according to claim 1, characterized in that the word-making module consists of a word-making body submodule, a part-of-speech tagging submodule, and a word-structure tagging submodule,
the word-making body submodule contains the basic words for word making, and these basic words are referred to as word-making bodies;
the part-of-speech tagging submodule tags the basic words in the word-making body submodule by part of speech, dividing them into specific data tables such as noun, verb, adjective, pronoun, adverb, numeral, measure word, preposition, conjunction, auxiliary word, interjection, and onomatopoeia;
the word-structure tagging submodule tags the basic words in the word-making body submodule by word structure, dividing them into specific data tables such as subject-predicate, verb-object, verb-complement, attributive-head, adverbial-head, measure-head, numeral-measure, coordinate, inverted, reduplicated, sequential, appositive, mixed, and prepositional phrase, as well as synonym, antonym, and parallel word.
4. The method for building an input method dictionary according to Chinese language models according to claim 1, characterized in that the data tables in the part-of-speech tagging submodule and the word-structure tagging submodule of the word-making module set breakpoint information for the word-making bodies, which is used during word making to perform insertion operations on a word-making body and to process the two parts before and after the breakpoint separately.
5. The method for building an input method dictionary according to Chinese language models as claimed in claim 1, characterized in that it comprises the following steps:
Step 1: extract the Chinese language models, and on this basis build the model identification submodule and the model word-building information submodule;
Step 2: through reference books such as the Modern Chinese Dictionary and through manual collection, select basic, general-purpose word-making material, and on this basis build the word-making body submodule, the part-of-speech tagging submodule, and the word-structure tagging submodule;
Step 3: using database processing software, associate the information in the model word-building information submodule of the Chinese language model module with the corresponding data tables in the word-making module, and use database query statements to generate input method dictionary words in batches.
6. The method as claimed in claim 1, applied to building a specialized input method dictionary according to Chinese language models, characterized in that it comprises the following steps:
Step 1: extract the Chinese language models, and on this basis build the model identification submodule and the model word-building information submodule;
Step 2: set up a word-making material database of specialized words;
Step 3: based on the above word-making material database of specialized words, build the word-making body submodule, the part-of-speech tagging submodule, and the word-structure tagging submodule;
Step 4: using database processing software, associate the information in the model word-building information submodule of the Chinese language model module with the corresponding data tables in the word-making module, and use database query statements to generate specialized input method words in batches.
7. A management method for an input method dictionary built according to Chinese language models as claimed in claim 1, characterized in that it comprises a corpus module and a dictionary information module, wherein
the corpus module contains all the words generated by the word-making module described in claim 1;
the composition of the dictionary information module is identical to that of the model word-building information submodule in the Chinese language model module described in claim 1, from which it receives its data.
8. A development method for an input method based on the input method dictionary built according to Chinese language models as claimed in claim 1, characterized in that model words are indicated in the input method prompt box, comprising the following steps:
Step 1: in the input method code table, add Chinese language model information so that every word in the code table is mapped to its Chinese language model;
Step 2: in the input method engine, add a function that, when searching the code table, looks up the corresponding words according to the Chinese language model;
Step 3: in the input method prompt box, add an icon or button for viewing model words, or another similar indicative mark; when the input code corresponds to a group of model words, the mark is activated; when the mouse cursor moves over the mark, the Chinese language model is displayed; and when the mark is clicked with the mouse or a predefined key is pressed, all the words corresponding to that Chinese language model are displayed.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610066190.6A CN106997245A (en) 2016-01-24 2016-01-24 A kind of method that input method dictionary is built according to Chinese language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610066190.6A CN106997245A (en) 2016-01-24 2016-01-24 A kind of method that input method dictionary is built according to Chinese language model

Publications (1)

Publication Number Publication Date
CN106997245A true CN106997245A (en) 2017-08-01

Family

ID=59428750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610066190.6A Pending CN106997245A (en) 2016-01-24 2016-01-24 A kind of method that input method dictionary is built according to Chinese language model

Country Status (1)

Country Link
CN (1) CN106997245A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271495A (en) * 2018-08-14 2019-01-25 阿里巴巴集团控股有限公司 Question and answer recognition effect detection method, device, equipment and readable storage medium storing program for executing

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101556596A (en) * 2007-08-31 2009-10-14 北京搜狗科技发展有限公司 Input method system and intelligent word making method
US20110320468A1 (en) * 2007-11-26 2011-12-29 Warren Daniel Child Modular system and method for managing chinese, japanese and korean linguistic data in electronic form
CN102314439A (en) * 2010-06-30 2012-01-11 百度在线网络技术(北京)有限公司 Input method combined with application interface and equipment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271495A (en) * 2018-08-14 2019-01-25 阿里巴巴集团控股有限公司 Question and answer recognition effect detection method, device, equipment and readable storage medium storing program for executing
CN109271495B (en) * 2018-08-14 2023-02-17 创新先进技术有限公司 Question-answer recognition effect detection method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170801