CN106997245A - Method for building an input method dictionary according to Chinese language models
- Publication number
- CN106997245A (application number CN201610066190.6A)
- Authority
- CN
- China
- Prior art keywords
- word
- chinese language
- module
- language model
- dictionary
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/02—Input arrangements using manually operated switches, e.g. using keyboards or dials
- G06F3/023—Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
- G06F3/0233—Character input methods
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention belongs to the field of computer input methods and provides a method for building an input method dictionary according to Chinese language models. The method comprises a Chinese language model module and a word-making module. The Chinese language model module supplies word-formation information and dictionary management information to the word-making module. Using that word-formation information and database management software, the word-making module generates input method words in batches, and the resulting dictionary uses the dictionary management information to support operations such as adding, deleting, retrieving, and sorting entries. The Chinese language models are tailored to the speech pauses and semantic completeness of real communication, so they reflect the characteristics of Chinese. The invention addresses key problems of traditional input method dictionaries: no principled standard for which words to include, inefficient word collection, incomplete and unsystematic content that cannot be managed effectively, users having no way to know what the dictionary contains, slow input, and an input process that does not feel like natural communication.
Description
Technical field
The invention relates to the field of computer input methods, and in particular to an input method dictionary generated automatically according to Chinese language models.
Background
In the field of Chinese character coded input methods, character encoding and dictionary technology are the two core technologies. After more than 30 years of development since the 1980s, character encoding has become mature and stable, and the room and potential for input method innovation are now concentrated in dictionary technology. Judging by the current state of the art, however, whether an input method is developed for standard desktop keyboards, for mobile terminals such as phone touch screens, or for speech recognition, its dictionary has problems in five respects:
First, the dictionary is "small and incomplete". Its entries come mainly from the general words in reference dictionaries of various kinds, plus words accumulated by hand from day to day. Such a dictionary usually holds between tens of thousands and somewhat over a hundred thousand entries. Because it is small and consists mostly of standardized, static, dictionary-style words, it cannot meet the flexible and dynamic needs of Chinese communication. The typist cannot enter word combinations as whole units, so the input experience is poor and efficiency is low. Representative input methods of this kind include the shape-code input methods typified by Wangma Wubi, and pinyin input methods such as Quanpin, Intelligent ABC, Ziguang Pinyin, and Pinyin Jiajia;
Second, the dictionary is "large but incomplete". Representative products include Sogou Pinyin and Baidu Input Method. Dictionaries of this kind are generally built as "dictionary-style general words + new and hot words discovered by a search engine". The general words come from reference dictionaries and day-to-day accumulation. The discovery of new and hot words is implemented with existing patented techniques such as 《A method and device for obtaining new words, and an input method system》 (publication number: CN1924858B), 《Internet hot word mining method and device》 (publication number: CN104679738A), 《Method and system for providing new words or hot words》 (publication number: CN102163198A), and 《Dictionary generation method and system, input method and input system》 (publication number: CN103853746A). The basic principle is that a search engine collects what users type into input boxes on web pages, segments it into words, compares the result with a back-end corpus, and treats anything that reaches a certain frequency threshold as a hot word or new word to be added to the input method dictionary. As new and hot words accumulate, the dictionary typically grows to between several hundred thousand and nearly a million entries; with "cloud" entries added, the dictionary available for lookup can reach several million or even close to ten million entries.
The scale is unprecedented, but the entries are mass-selected from user input boxes, and what users search for there is mostly popular personal names, place names, news events, and internet buzzwords. Few people search for phrases such as "on this basis", "is it or isn't it", "that is not to say", "the more one talks the faster", "after finishing a meal", "can't explain it clearly", "flew away", "ran off", "one after another", or "counterattack". Search engine techniques therefore struggle to collect the high-quality input method words that lie closest to the core of Chinese, apply most generally, and come in large chunks.
New and hot words are themselves severely limited. First, their life cycle is short: a word that is popular today may soon be discarded, for example "My dad is Li Gang", "Chinese-style road crossing", "APEC blue", or "Yuanfang style". Second, their scope of use is narrow: the probability of meeting them in everyday documents and conversation is very small, so they improve input efficiency and experience only marginally. Third, the population that is keen on them is limited. 《Spoken and Written Languages Report》, page 2, January 6, 2016 (issue 953 overall), published the article 《Survey of the use of annual new words by upper-grade primary school students》 (bang flat, Yang Chuanxin), which reports: "In attitude toward using annual new words, 28.76% explicitly said they 'like' them, while 'indifferent' and 'can't say' accounted for 45.15% and 19.40% respectively; 6.69% said they 'dislike' them", and "In frequency of use, 17.57% of students 'often use' annual new words and 55.85% 'occasionally use' them, while 'hardly ever use' and 'never use' accounted for 16.72% and 10.37% respectively".
In summary, discovering new and hot words with a search engine is not suitable for building the core dictionary of an input method; it is suitable only as a supplement. Relying on this technique alone leaves many genuinely valuable input method words unmined while letting in large amounts of useless "junk" entries, which produces the "large but incomplete" phenomenon;
Third, the dictionary content is hard to remember. With either a "small and incomplete" or a "large but incomplete" dictionary, users must memorize which words the dictionary contains and which it does not before they can enter text word by word. Faced with anything from tens of thousands to nearly ten million entries that follow no linguistic regularity, users cannot remember them, and this is a considerable obstacle to using the dictionary effectively;
Fourth, the words are "collected but hard to manage". Current input method dictionaries are essentially plain piles of Chinese data. The entries carry no linguistic information and no dictionary management information, so they cannot be classified, selected, added, deleted, sorted, or otherwise managed as needed. Upgrading the dictionary is therefore difficult, targeted customization is weak, and repetitive maintenance work is heavy, all of which hinders the development of input method dictionaries;
Fifth, the input experience is poor and efficiency is low. Existing input method dictionaries either copy dictionary words mechanically or chase new and hot words from the internet, and they hardly study word combination rules from the standpoint of how Chinese is actually communicated. Users therefore cannot type according to the speech pauses and semantic completeness of communication, so text entry is disconnected from communication and entry is inefficient.
Summary of the invention
The aim of the invention is to solve the problems described above: input method dictionaries that are small and incomplete, large but incomplete, hard to remember, and hard to manage, and that give the typist a poor experience and low efficiency.
To achieve this aim, the invention adopts the following technical solution:
A method for building an input method dictionary according to Chinese language models comprises a Chinese language model module and a word-making module. The Chinese language model module provides word-formation information to the word-making module when words are generated in batches, and provides dictionary management information for the dictionary that is finally generated. The word-making module automatically generates words in batches according to the word-formation information provided by the Chinese language model module.
The Chinese language model module consists of a model identification submodule and a model word-formation information submodule. The model identification submodule contains the Chinese language models. Each model is composed of a string identifier that stands for the word-making body, together with a prefix, an infix, and a suffix: what is attached before the word-making body is the prefix, what is attached after it is the suffix, and what is inserted into its middle is the infix. The prefix, infix, and suffix may appear singly or together, and a model may contain one or more infixes. The word-making body is a basic word used for word making, held in the word-making body submodule of the word-making module.
The model word-formation information submodule consists mainly of the following information data tables: a language attribute table, classified by spoken language, written language, dialect, and so on; a word structure type table, classified by subject-predicate, verb-object, attributive-head, and so on; a semantic domain table, classified by whether the word expresses time, space, quantity, degree, and so on; a mood type table, classified by interrogative, declarative, imperative, and exclamatory; a voice type table, classified by active, passive, causative, and so on; and tables for modification level, weight order, and the like.
The Chinese language models in the Chinese language model module are developed according to the speech pause features of Chinese and the requirement of semantic completeness.
The word-making module consists of a word-making body submodule, a part-of-speech tagging submodule, and a word-structure tagging submodule. The word-making body submodule holds the basic words used for word making; these basic words are called word-making bodies. The part-of-speech tagging submodule tags the basic words by part of speech and divides them into data tables for nouns, verbs, adjectives, pronouns, adverbs, numerals, measure words, prepositions, conjunctions, particles, interjections, onomatopoeia, and so on. The word-structure tagging submodule tags the basic words by word structure and divides them into data tables for subject-predicate, verb-object, predicate-complement, attributive-head, adverbial-head, measure-head, numeral-measure, coordinate, inverted, reduplicated, sequential, appositive, blended, and prepositional structures, as well as synonyms, antonyms, and parallel words. The data tables in both tagging submodules record breakpoint information for each word-making body, so that during word making an insertion can be performed inside the body and the parts before and after the breakpoint can be processed separately.
To achieve the aim of the invention, a method for generating input method dictionary words in batches according to Chinese language models is provided, comprising the following three steps:
Step 1: Refine the Chinese language models, and build the model identification submodule and the model word-formation information submodule on that basis;
Step 2: Select basic, general-purpose word-making material from reference works such as 《Modern Chinese Dictionary》 and from manual collection, and build the word-making body submodule, the part-of-speech tagging submodule, and the word-structure tagging submodule on that basis;
Step 3: Using database software, associate the information in the model word-formation information submodule of the Chinese language model module with the corresponding data tables in the word-making module, and generate input method dictionary words in batches with database query statements.
To achieve the aim of the invention and extend it to specialized input fields, a method for building a specialized dictionary according to Chinese language models is also provided, comprising the following four steps:
Step 1: Refine the Chinese language models, and build the model identification submodule and the model word-formation information submodule on that basis;
Step 2: Set up a word-making material database of specialized words;
Step 3: Based on that material database, build the word-making body submodule, the part-of-speech tagging submodule, and the word-structure tagging submodule;
Step 4: Using database software, associate the information in the model word-formation information submodule of the Chinese language model module with the corresponding data tables in the word-making module, and generate specialized input method words in batches with database query statements.
To achieve the aim of the invention, a method for managing dictionary content effectively on the basis of Chinese language models is also provided. It comprises a corpus module and a dictionary information module. The corpus module contains all the words that the word-making module generates according to the Chinese language model module. The dictionary information module has the same composition as the model word-formation information submodule of the Chinese language model module, and its data is passed in from that submodule.
To achieve the aim of the invention, a method for prompting model words in the input method's prompt box is also provided, comprising the following three steps:
Step 1: Add Chinese language model information to the input method code table, so that each word in the code table corresponds one-to-one with a Chinese language model;
Step 2: Add to the input method engine, for use when it searches the code table, a function that looks up the corresponding words by Chinese language model;
Step 3: Add to the prompt box an icon, a button, or a similar indicator for viewing model words. When the entered code corresponds to a group of model words, the indicator is activated; when the mouse cursor moves over the indicator, the Chinese language model is displayed; when the user clicks the indicator or presses a predefined key, all the words corresponding to that Chinese language model are displayed.
Beneficial effects
An input method dictionary built by the invention fully reflects the syntax and morphology of Chinese, and in particular the word combination rules of communication. The typist can enter text following the speech pauses and semantic completeness of natural communication, which creates a simulated natural-language communication environment and improves the input experience;
An input method dictionary built by the invention takes Chinese language models as its guiding framework and summarizes the word combination rules of communication comprehensively and systematically. This establishes a unified standard for which words the dictionary includes and ensures that every word that should be included is included. It solves the problem that conventional input method dictionaries collect words incompletely, unsystematically, and without a standard, and it prevents the situation in which users have no grasp of the dictionary's content, find that a typed code matches nothing because the word is missing, and must backspace and delete;
An input method dictionary built by the invention is produced by actively generating words in a targeted and focused way, starting from the inherent regularities of Chinese and tracing them to their source. This is the opposite of "fishing in the sea" for new and hot words with a search engine, and it differs from that approach in kind. Accuracy, generality, and practicality all improve, and "junk" words are effectively kept out. The dictionary therefore saves computing resources in use, improves retrieval efficiency, and spares the typist the interference of junk words;
An input method dictionary built by the invention greatly raises the rate at which Chinese words are entered in combination, so that single-character entry approaches zero. The typist no longer enters text one character at a time, as if squeezing toothpaste from a tube, and can instead enter large chunks as whole units. Input efficiency can improve by more than 30%, which yields considerable gains in working efficiency and social benefit;
In an input method dictionary built by the invention, most words are generated from Chinese language models, so a user who remembers one word can infer that all words sharing the same features are included as well. For example, once 吃完饭后 (after finishing the meal) has been entered successfully as a whole, the user can confidently enter 洗完澡后 (after finishing a bath), 交完钱后 (after paying), 吵完架后 (after an argument), 买完菜后 (after buying groceries), and so on as single words. The user thus moves from memorizing a vast number of dictionary words to memorizing a limited number of Chinese language models, which greatly reduces the memory load. Combined with the Chinese language model prompt function in the prompt box, it becomes easier still to grasp which words the dictionary contains, further improving the input experience and efficiency;
An input method dictionary built by the invention attaches complete language information and dictionary management information to the Chinese language models and uses database technology, so that words are made by model and the dictionary is managed by model. This solves the problem that dictionaries could not be managed precisely: routine maintenance, fine-grained selection, customization for specific fields, and upgrading all become very simple, and the "collected but hard to manage" problem of conventional input method dictionaries is effectively resolved;
In addition, because an input method dictionary developed with the invention is built by mining Chinese language models, it embodies the characteristics and regularities of the Chinese language itself. Besides conventional keyboard and touch-screen input methods, it can therefore be applied to other fields of Chinese information processing, such as speech recognition input and robot recognition of human language, where it can markedly improve recognition efficiency and accuracy.
Detailed description of embodiments
The basic idea of the invention is to use Chinese language models to build an input method dictionary and to manage it effectively. Following this idea, the corresponding modules are described in detail below with reference to embodiments:
1st, Chinese language model is refined
Chinese language model is mainly refined from three dimensions.
One is refined according to semantic integrity during communication and speech pause rule.Semantic integrity is first said, is finger speech
There is the words assembly of complete meaning in sentence.If for example, " will have a meal, work, sing, pay, do one's assignment " respectively
Be used as a semantic unit, then, " after having a meal, finish the work after, sung woman singer, handed over after money, finished after operation " all should
Be regarded as corresponding complete semantic unit, according to this thought, analysis " after having a meal, finish the work after, sung woman singer,
Hand over after money, finished after operation ", it is substantially the word " have a meal, work, sing, pay, do one's assignment " by V-O construction
Centre insertion " End ", then add " rear " formation of suffix, the words group of " verb+complete+object+rear " thus, can be extracted
Matched moulds type, verb is represented with Verb, and Obj represents object, then Chinese language model is just represented by:" after the complete Obj of Verb ",
Represented with this " after having a meal, finish the work after, sung woman singer, handed over after money, finished after operation ... " etc. large quantities of words;
Besides speech pause rule during communication, by taking " an ancient temple is located in Mount Huang underfooting within thousand " one as an example, according to speaker just
Normal speech pause custom, it should as shown in "/an ancient temple/was located in/Mount Huang underfooting in thousand ", so, with " being located in " therein
To refine key element, so that it may summarize the words combination of " verb+", represent verb with " Verb ", then this Chinese
Speech model is just represented by " Verb exists ", it just represent " be located in, disappear in, lose, write on, hang over ... " etc.
Large quantities of words;
Two are refined according to the rule of Chinese parts of speech and word structural generation phrase.It is described with reference to the Modern Chinese knowledge of grammar
Chinese parts of speech refer mainly to noun, verb, adjective, pronoun, adverbial word, preposition, conjunction, auxiliary word, interjection, number, amount
Word, onomatopoeia etc., described word structure refer mainly to predicate-object phrase, subject-predicate phrase, adjective-centre structure, shape core structure, state benefit knot
Structure, amount core structure, parallel construction, reverse structure, overlay structure, mix structure, preposition structure etc..Different Chinese parts of speech
With different generation phrase rules, by taking adjective as an example, it can be combined closely with " very ", " very ", " very much " etc., use Adj
Adjective is represented, thus can extract that " very Adj ", " Chinese language model such as very Adj ", " Adj is very much ", is represented respectively
" terribly cold, highly difficult, very generous ... ", " very good, as smart as a new pin, at full stretch ... ", " be very cold, it is fast very much,
It is expensive very much ... " etc. each large quantities of words;Different word structures equally has different generation phrase rules, is tied with dynamic guest
Exemplified by structure, can be inserted into the middle of word ", cross ", behind can add suffix " forward and backward " etc., if representing verb with Verb,
Obj represents object, can extract " Verb Obj, Verb Obj, Verb cross Obj, Verb cross after Obj, VerbObj
Before " etc. correlation model, represented respectively with this " sing song, eat meal, go up net, drink tea ... ", " lose money, it is spacious
Work, stopped class ... ", " wash bath, sold dish, sang play ... ", " handed over after money, after swiped through tooth, bought after dish ... "
And " before having a meal, sleep before, online before ... " etc. each large quantities of words.
Three are the configuration rules according to Chinese clause to refine.Described clause refers mainly to query, states, pray making, sigh with feeling that four is big
Clause.Every kind of clause has different profiling characteristics, and different Chinese language models can be extracted accordingly.Such as query clause,
In configuration, common are " ... ..., no ... ", verb is represented with Verb, Adj represents adjective,
Accordingly, can extract " Verb, Adj, not Verb " etc. Chinese language model, represented and " handed over respectively with this
, learn, eat ... ", " cold, difficult, beautiful ... ", " do not hand over, do not learn, do not write
... " etc. each large quantities of words;For another example imperative mood, its profiling characteristic is to often require that what someone does or do not do assorted
, verb is represented with Verb, can extract accordingly " Chinese language model such as fast Verb, other Verb, certain Verb ", with
This represent respectively " say, hurry up soon, going ... ", " do not move, do not mentionlet alone, not walking ... ", " it is certain come, it is an accepted argument, certain
The respective large quantities of words of friendship ... " etc..Equally, large quantities of Chineses can also be extracted according to statement clause and exclamation clause
Say model.
Once the Chinese language models have been extracted, the necessary language information and dictionary management information must be attached to them. This includes: the language attribute, mainly whether the model is spoken language, written language, or dialect; the structure type, mainly whether it is a verb-object, subject-predicate, attributive-head, adverbial-head, predicate-complement, measure-head, coordinate, inverted, reduplicated, blended, or prepositional structure; the semantic domain, mainly whether it expresses time, space, quantity, degree, belonging, judgment, or result; the mood type, mainly whether it expresses an interrogative, declarative, imperative, or exclamatory mood; and the voice type, mainly whether it expresses active, passive, or causative voice. Information such as modification level and weight order is attached as well. All of this is attached by setting the corresponding fields in the database tables.
2nd, word making module is built.Word making module is the basic platform that automatic batch generates dictionary word, wherein the word in each tables of data
Language is signified word making main body in Chinese language model.In word making module, the tables of data of part-of-speech tagging submodule is either constituted,
Or the tables of data that word structure marks submodule is constituted, detailed subclassification information all should be further added, to adapt to during word making
Precisely the need for selection word making main body.For example, with Noun representation nouns, what " Noun people " this Chinese language model was represented
It is large quantities of words such as " Chinese, American, Canadian, Pekinese, Shanghai people, people from Anshan, northeasterners ... ",
, it is necessary to carry out word making as word making main body with the noun list in part-of-speech tagging submodule during word making, but make discovery from observation, " Noun "
It is only limitted to represent the noun in the places such as country, area, city, rather than the whole in noun list, so, it is necessary to noun list
Place noun can be segmented out, accurate word making could be realized.Below by the key data table subdivision situation point row in word making module such as
Under:
In the part-of-speech tagging submodule, nouns are subdivided into: material nouns, abstract nouns, animate nouns, countable nouns, locative nouns, time nouns, place nouns, and appellation nouns. Pronouns are subdivided into: personal pronouns, demonstrative pronouns, interrogative pronouns, and special pronouns. Verbs are subdivided into: action verbs, stative verbs, transitive verbs, directional verbs, and modal verbs. Adjectives are subdivided into: qualitative adjectives and stative adjectives. Measure words are subdivided into: individual, collective, measurement, indefinite, quasi-, compound, verbal, borrowed, and macroscopic measure words. Adverbs are subdivided into: adverbs of time, place, degree, scope, frequency, and mood. The subdivisions of the other data tables in the part-of-speech tagging submodule are not enumerated here.
In the word structure marking submodule, the subject-predicate structure is subdivided into noun-verb, noun-adjective, pronoun-verb, and pronoun-adjective types; the verb-object structure into verb-noun and verb-pronoun types; the verb-complement structure into verb-adjective, verb-numeral-measure, and verb-directional types; and the attributive-head structure into adjective-noun, noun-noun, verb-noun, and numeral-measure-noun types. The subdivision of the other data tables in the word structure marking submodule is not enumerated here.
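The effect of a subclassification column on word making can be sketched as follows. This is a minimal illustration in Python with an in-memory SQLite database, not the implementation described in the patent: the table name `noun`, the column `subclass`, the subclass labels, and the sample rows are all assumptions made for the example.

```python
import sqlite3

def build_noun_table():
    """Create a hypothetical noun table with a subclass column (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE noun (id INTEGER PRIMARY KEY, word TEXT, subclass TEXT)"
    )
    conn.executemany(
        "INSERT INTO noun (word, subclass) VALUES (?, ?)",
        [
            ("中国", "place"),    # China
            ("北京", "place"),    # Beijing
            ("水", "material"),   # water
            ("今天", "time"),     # today
        ],
    )
    return conn

def make_noun_person_words(conn):
    """Apply the model "Noun + 人" to place nouns only."""
    rows = conn.execute(
        "SELECT word || '人' FROM noun WHERE subclass = 'place' ORDER BY id"
    )
    return [row[0] for row in rows]

conn = build_noun_table()
print(make_noun_person_words(conn))
```

Without the `WHERE subclass = 'place'` filter, the same query would also splice 人 onto time and material nouns, producing strings such as 今天人 that the model is not meant to cover.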
Besides adding detailed subclassification information to every data table in the word making module, a breakpoint must also be set for each word (that is, each word making main body) in those tables, so that during word making an insertion can be performed at the breakpoint and the two parts before and after it can be processed separately.
For the words in the data tables of the word structure marking submodule, the breakpoint coincides with the boundary of the structure. Take the verb-object data table as an example, with words such as "have a meal, catch a train, pick on someone, go to work, sing a song ...". Their verb-object boundaries can be written as "eat-meal, catch-train, pick on-someone, go to-work, sing-song", and the breakpoints are marked at exactly these positions, which is easy to understand. The principle for setting breakpoints in the data tables of the part-of-speech tagging submodule deserves more attention. Most words in these tables have a simple structure and cannot easily be split, so the breakpoint is simply placed in the middle of the word. Take the adjective table as an example, with words such as "generous (大方), flourishing (红火), flurried (慌张), beautiful (漂亮) ...". After breakpoints are set, they take the form "大-方, 红-火, 慌-张, 漂-亮 ...". With breakpoints in place, the Chinese language models AdjLeftAdjLeftAdjRightAdjRight, AdjLeftAdjRightAdjRight, and AdjLeftAdjLeftAdjRightAdjRight地 (where AdjLeft and AdjRight denote the parts of the adjective to the left and right of the breakpoint, and 地 is the adverbial particle) can batch-generate, respectively, "大大方方, 红红火火, 慌慌张张, 漂漂亮亮 ...", "大方方, 红火火, 慌张张, 漂亮亮 ...", and "大大方方地, 红红火火地, 慌慌张张地, 漂漂亮亮地 ...", each a large batch of words.
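The breakpoint mechanism for adjectives can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hyphen as breakpoint marker, the token-list encoding of a model, and the function name `apply_model` are hypothetical, and the Chinese strings are back-translations of the adjective examples (大方 "generous", 红火 "flourishing", 漂亮 "beautiful").

```python
def apply_model(model, marked_word):
    """Fill a model's AdjLeft/AdjRight slots from a word marked with a breakpoint.

    `model` is a list of tokens; "AdjLeft" and "AdjRight" are slots, and any
    other token is copied literally (for example the particle 地).
    """
    left, right = marked_word.split("-", 1)
    parts = {"AdjLeft": left, "AdjRight": right}
    return "".join(parts.get(token, token) for token in model)

# The three models named in the text, encoded as token lists (assumed encoding).
AABB = ["AdjLeft", "AdjLeft", "AdjRight", "AdjRight"]
ABB = ["AdjLeft", "AdjRight", "AdjRight"]
AABB_DE = AABB + ["地"]

adjectives = ["大-方", "红-火", "漂-亮"]
for model in (AABB, ABB, AABB_DE):
    print([apply_model(model, word) for word in adjectives])
```

Storing the breakpoint with the word keeps the models themselves free of per-word logic: the same token list works for every row of the adjective table.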
3. Embodiment of batch-generating dictionary words according to Chinese language models.
The procedure has three steps:
Step 1: Extract the Chinese language models.
Suppose that six Chinese language models have been extracted, including "Verb not Com" (Verb不Com), "Verb also Verb not Com" (Verb也Verb不Com), "Adj is Adj a bit" (Adj是Adj点), "Adj return Adj" (Adj归Adj), and "Adj very much" (Adj极了), where VerbCom stands for a verb-complement structure, Verb and Com are respectively its verb part and its complement part, and Adj stands for an adjective. The Chinese language model module is customized on this basis.
Step 2: In the part-of-speech tagging submodule and the word structure marking submodule of the word making module, set up an adjective table and a verb-complement word table, respectively. Suppose the adjective table holds 3000 words such as "difficult (难), nervous (紧张), expensive (贵), far (远), long (长), big (大), arduous ...", and the verb-complement word table holds 5000 words such as "see clearly, wash clean, walk fast, eat up ...", each with a breakpoint set in the middle.
Step 3: Using the SQL database query language and the word-building information supplied by the Chinese language model module, combine and splice either the whole words or the two parts to the left and right of the breakpoint, taken from the adjective table and the verb-complement word table, to generate the required words.
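The three steps above can be sketched with Python's built-in `sqlite3` module. This is a minimal sketch under stated assumptions, not the patent's implementation: the table and column names, the `MODELS` mapping from model name to SQL splicing expression, and the sample rows are all hypothetical.

```python
import sqlite3

# Each model maps to (source table, SQL expression that splices one word).
# This encoding is an assumption made for the illustration.
MODELS = {
    "Adj极了": ("adjective", "word || '极了'"),
    "Adj归Adj": ("adjective", "word || '归' || word"),
    "Verb不Com": ("verb_complement", "left_part || '不' || right_part"),
    "Verb也Verb不Com": (
        "verb_complement",
        "left_part || '也' || left_part || '不' || right_part",
    ),
}

def build_tables():
    """Create small stand-ins for the adjective and verb-complement tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE adjective (id INTEGER PRIMARY KEY, word TEXT)")
    conn.execute(
        "CREATE TABLE verb_complement "
        "(id INTEGER PRIMARY KEY, left_part TEXT, right_part TEXT)"
    )
    conn.execute(
        "CREATE TABLE generated (id INTEGER PRIMARY KEY, word TEXT, model TEXT)"
    )
    conn.executemany(
        "INSERT INTO adjective (word) VALUES (?)", [("难",), ("贵",), ("紧张",)]
    )
    # left_part and right_part are the two halves around the breakpoint.
    conn.executemany(
        "INSERT INTO verb_complement (left_part, right_part) VALUES (?, ?)",
        [("洗", "干净"), ("走", "快")],
    )
    return conn

def generate_all(conn):
    """Run one INSERT ... SELECT per model to batch-generate its words."""
    for model, (table, expression) in MODELS.items():
        conn.execute(
            f"INSERT INTO generated (word, model) "
            f"SELECT {expression}, ? FROM {table} ORDER BY id",
            (model,),
        )

def words_for(conn, model):
    """Return the words generated by one model, in generation order."""
    rows = conn.execute(
        "SELECT word FROM generated WHERE model = ? ORDER BY id", (model,)
    )
    return [row[0] for row in rows]

conn = build_tables()
generate_all(conn)
print(words_for(conn, "Verb也Verb不Com"))
```

Recording the model name next to each generated word costs one extra column and is what later allows the dictionary to be queried or pruned by model.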
For example, for the adjective table, when making words with the models "Adj is Adj a bit" (Adj是Adj点), "Adj return Adj" (Adj归Adj), and "Adj very much" (Adj极了), each adjective in the table is used as a whole and spliced into the model. The three models generate, respectively, "难是难点, 紧张是紧张点, 贵是贵点, 远是远点, 长是长点, 大是大点 ...", "难归难, 紧张归紧张, 贵归贵, 远归远, 长归长, 大归大 ...", and "难极了, 紧张极了, 贵极了, 远极了, 长极了, 大极了 ...", 3000 words each. Similarly, for the verb-complement word table, when words are generated with "Verb not Com" (Verb不Com), they are spliced in the form "left part of the verb-complement word + 不 + right part of the verb-complement word", which generates in one pass 5000 words such as "cannot see clearly, cannot wash clean (洗不干净), cannot walk fast (走不快), cannot finish eating (吃不完) ...". When words are generated with "Verb also Verb not Com" (Verb也Verb不Com), they are spliced in the form "left part + 也 + left part + 不 + right part", which generates in one pass 5000 words such as "洗也洗不干净, 走也走不快, 吃也吃不完 ...".
4. Embodiment of building a dictionary management system with Chinese language models to manage dictionary content effectively.
The procedure has three steps:
Step 1: Build the corpus module. This module holds all words generated by the word making module of the present invention, and each word is linked to the Chinese language model that generated it.
Step 2: Build the dictionary information module. This module has exactly the same content as the model word-building information submodule of the Chinese language model module of the present invention, and its data are transferred from that submodule. In addition, retention-time information is attached to measure the length of a word's life cycle. It takes one of three grades (long, ordinary, short), so that outdated words can be deleted from the dictionary in time. Language-block classification information is also attached, dividing words into semantic words and speech-pause words. Furthermore, the sort weight is defined as six single-dimension weights (normativity, written style, generality, analogy, associativity, and a specific factor) plus a comprehensive weight, to meet the needs of ordering words that share a code and of customizing specific dictionaries.
Step 3: Once the two modules above have been set up, the dictionary can be managed effectively on a database platform. For example, to reduce the rate of duplicate codes or to respect user habits, suppose that among the many dictionary words such as "难极了 (extremely difficult), 紧张极了 (extremely nervous), 贵极了 (extremely expensive), 远极了, 长极了, 大极了 ...", only those of the form "single-character adjective + 极了" are to be kept, such as "难极了, 贵极了, 远极了, 长极了, 大极了 ...", while those of the form "adjective of two or more characters + 极了", such as "紧张极了 ...", are to be deleted. It suffices to use the database query language to find, in the corpus module, the entries that come from the "Adj very much" (Adj极了) model and are longer than 3 characters, and to delete them from the dictionary. As another example, dictionary capacity can be controlled by selecting dictionary words of different modification levels, which is done by setting the modification level value appropriately in the query. Suppose the modification level value is set to 1. Then a query over "very good, too beautiful, pretty good, flown, walked ..." filters out the words that contain two levels of modification, such as "too beautiful, pretty good ...", and leaves only the words with one level of modification, such as "very good, flown, walked ...".
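The two management operations in Step 3 can be sketched as database queries. The following is a minimal illustration using Python's `sqlite3`, not the patent's actual schema: the tables `corpus`, `model_info`, and `dictionary`, the column `modify_level`, and the models 很Adj and 很不Adj (used only to obtain words with one and two levels of modification) are hypothetical.

```python
import sqlite3

def build_dictionary():
    """Create assumed corpus, model-information, and dictionary tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE corpus (word TEXT PRIMARY KEY, model TEXT)")
    conn.execute(
        "CREATE TABLE model_info (model TEXT PRIMARY KEY, modify_level INTEGER)"
    )
    conn.execute("CREATE TABLE dictionary (word TEXT PRIMARY KEY)")
    corpus = [
        ("难极了", "Adj极了"),
        ("贵极了", "Adj极了"),
        ("紧张极了", "Adj极了"),
        ("很好", "很Adj"),        # hypothetical one-level model
        ("很不好", "很不Adj"),    # hypothetical two-level model
    ]
    conn.executemany("INSERT INTO corpus VALUES (?, ?)", corpus)
    conn.executemany(
        "INSERT INTO model_info VALUES (?, ?)",
        [("Adj极了", 1), ("很Adj", 1), ("很不Adj", 2)],
    )
    conn.executemany(
        "INSERT INTO dictionary VALUES (?)", [(word,) for word, _ in corpus]
    )
    return conn

def delete_long_entries(conn, model, max_length):
    """Delete dictionary words that come from `model` and exceed `max_length`."""
    conn.execute(
        "DELETE FROM dictionary WHERE word IN ("
        "SELECT word FROM corpus WHERE model = ? AND length(word) > ?)",
        (model, max_length),
    )

def select_by_level(conn, max_level):
    """Return dictionary words whose model has at most `max_level` modifications."""
    rows = conn.execute(
        "SELECT d.word FROM dictionary d "
        "JOIN corpus c ON c.word = d.word "
        "JOIN model_info m ON m.model = c.model "
        "WHERE m.modify_level <= ? ORDER BY d.rowid",
        (max_level,),
    )
    return [row[0] for row in rows]

conn = build_dictionary()
delete_long_entries(conn, "Adj极了", 3)
print(select_by_level(conn, 1))
```

Because every word keeps a link to its model, both operations are plain set queries and need no per-word inspection. SQLite's `length()` counts characters for text values, so a single-character adjective followed by 极了 has length 3.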
5. Embodiment of automatically completing Pinyin coding for batch-generated words using Chinese language models.
As is well known, adding Pinyin codes to words is a vast and laborious task when building the code table of a Pinyin input method. Software can annotate pronunciation automatically, but because Chinese contains a large number of polyphonic characters, manual verification is unavoidable. Cooperation between the Chinese language model module and the word making module solves this problem easily. First, a phonetic annotation table is customized, in which the fixed components of the Chinese language models, such as prefixes, infixes, and suffixes, are annotated separately. Then the word making main bodies in each data table of the word making module are annotated. Once these two tasks are complete, the database query language is used to splice words and codes together, so that every newly batch-generated word receives its phonetic annotation automatically. This greatly reduces the workload, achieves 100% accuracy, and eliminates the drudgery of manual proofreading.
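The splicing of words and codes can be sketched as follows. This is a hedged illustration rather than the patent's procedure: the two annotation tables are represented as Python dictionaries, toneless Pinyin is assumed as the coding, and the names `AFFIX_PINYIN`, `ADJ_PINYIN`, and `generate_with_code` are hypothetical.

```python
# Assumed annotation of the fixed components of the models (suffixes, infixes).
AFFIX_PINYIN = {"极了": "jile", "归": "gui"}

# Assumed annotation of word making main bodies from the adjective table.
# 长 is polyphonic (chang / zhang); in the adjective table it is annotated once
# with the reading that applies to the adjective "long".
ADJ_PINYIN = {"难": "nan", "长": "chang", "紧张": "jinzhang"}

def generate_with_code(model, body_pinyin, affix_pinyin, slot="Adj"):
    """Generate (word, code) pairs for one model.

    `model` is a list of tokens: the slot token is replaced by each main body,
    and every other token is a fixed component looked up in `affix_pinyin`.
    """
    entries = []
    for body, body_code in body_pinyin.items():
        word = "".join(body if token == slot else token for token in model)
        code = "".join(
            body_code if token == slot else affix_pinyin[token] for token in model
        )
        entries.append((word, code))
    return entries

for word, code in generate_with_code(["Adj", "极了"], ADJ_PINYIN, AFFIX_PINYIN):
    print(word, code)
```

Each main body and each fixed component is annotated once, so the code of a generated word is assembled from annotations that were already checked, rather than inferred character by character from the finished word.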
Claims (8)
1. A method for building an input method dictionary according to Chinese language models, characterized in that it comprises a Chinese language model module and a word making module, wherein
the Chinese language model module provides word-building information to the word making module for the batch generation of words, and provides dictionary management information for the dictionary that is finally generated;
the word making module automatically batch-generates words according to the word-building information provided by the Chinese language model module.
2. The method for building an input method dictionary according to Chinese language models according to claim 1, characterized in that the Chinese language model module consists of a model identification submodule and a model word-building information submodule;
the model identification submodule contains the Chinese language models; a Chinese language model consists of a character-string identifier that represents the word making main body, together with prefixes, infixes, and suffixes; a component attached before the word making main body is a prefix, a component attached after it is a suffix, and a component inserted in the middle of it is an infix; a prefix, an infix, or a suffix may occur alone or together with the others, and a model may contain one or more infixes; the word making main body refers to the basic words for word making contained in the word making main body submodule of the word making module;
the model word-building information submodule consists mainly of a language property information data table classified into spoken language, written language, dialect, etc.; a word structure type information data table classified into subject-predicate, verb-object, attributive-head, etc.; a semantic category information data table classified into the expression of time, space, quantity, degree, etc.; a mood type information data table classified into interrogative, declarative, imperative, and exclamatory; a voice type information data table classified into active, passive, causative, etc.; and information data tables for modification level, sort weight, etc.;
the Chinese language models are developed according to the speech-pause characteristics of Chinese and the requirement of semantic integrity.
3. The method for building an input method dictionary according to Chinese language models according to claim 1, characterized in that the word making module consists of a word making main body submodule, a part-of-speech tagging submodule, and a word structure marking submodule;
the word making main body submodule contains the basic words for word making, and these basic words are referred to as word making main bodies;
the part-of-speech tagging submodule tags the basic words in the word making main body submodule with their parts of speech and divides them into specific data tables for nouns, verbs, adjectives, pronouns, adverbs, numerals, measure words, prepositions, conjunctions, auxiliary words, interjections, onomatopoeia, etc.;
the word structure marking submodule marks the basic words in the word making main body submodule with their word structure and divides them into specific data tables for subject-predicate, verb-object, verb-complement, attributive-head, adverbial-head, measure-head, numeral-measure, coordinate, inverted-order, reduplicated, sequential, appositive, mixed, and prepositional-phrase structures, as well as synonyms, antonyms, parallel words, etc.
4. The method for building an input method dictionary according to Chinese language models according to claim 1, characterized in that the data tables in the part-of-speech tagging submodule and the word structure marking submodule of the word making module set breakpoint information for the word making main bodies, which is used during word making to perform insertion operations on a word making main body and to process the two parts before and after the breakpoint separately.
5. The method for building an input method dictionary according to Chinese language models as claimed in claim 1, characterized in that it comprises the following steps:
Step 1: Extract the Chinese language models, and build the model identification submodule and the model word-building information submodule on this basis;
Step 2: Select basic, general-purpose word making material from reference books such as the Modern Chinese Dictionary (《现代汉语词典》) and through manual collection, and build the word making main body submodule, the part-of-speech tagging submodule, and the word structure marking submodule on this basis;
Step 3: Using database processing software, associate the information in the model word-building information submodule of the Chinese language model module with the information in the corresponding data tables of the word making module, and batch-generate input method dictionary words with database query statements.
6. The method as claimed in claim 1, applied to building a specialized input method dictionary according to Chinese language models, characterized in that it comprises the following steps:
Step 1: Extract the Chinese language models, and build the model identification submodule and the model word-building information submodule on this basis;
Step 2: Set up a database of word making material for specialized words;
Step 3: Based on this database of specialized word making material, build the word making main body submodule, the part-of-speech tagging submodule, and the word structure marking submodule;
Step 4: Using database processing software, associate the information in the model word-building information submodule of the Chinese language model module with the information in the corresponding data tables of the word making module, and batch-generate specialized input method words with database query statements.
7. A management method for an input method dictionary built according to Chinese language models as claimed in claim 1, characterized in that it comprises a corpus module and a dictionary information module, wherein
the corpus module contains all words generated by the word making module of claim 1;
the dictionary information module has the same composition as the model word-building information submodule of the Chinese language model module of claim 1, and its data are transferred from that submodule.
8. A method for developing an input method based on the input method dictionary built according to Chinese language models as claimed in claim 1, characterized in that model words are prompted through the input method prompt box, comprising the following steps:
Step 1: Add Chinese language model information to the input method code table, so that every word in the code table corresponds to its Chinese language model;
Step 2: Add to the input method engine a function that, when the code table is searched, looks up the corresponding words according to the Chinese language model;
Step 3: Add to the input method prompt box an icon or button for viewing model words, or another similar indicator; when the entered code corresponds to a group of model words, the indicator is activated; when the mouse cursor is moved over the indicator, the Chinese language model is displayed; when the indicator is clicked with the mouse or a predefined key is pressed, all words corresponding to the Chinese language model are displayed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610066190.6A CN106997245A (en) | 2016-01-24 | 2016-01-24 | A kind of method that input method dictionary is built according to Chinese language model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106997245A true CN106997245A (en) | 2017-08-01 |
Family
ID=59428750
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610066190.6A Pending CN106997245A (en) | 2016-01-24 | 2016-01-24 | A kind of method that input method dictionary is built according to Chinese language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106997245A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101556596A (en) * | 2007-08-31 | 2009-10-14 | 北京搜狗科技发展有限公司 | Input method system and intelligent word making method |
US20110320468A1 (en) * | 2007-11-26 | 2011-12-29 | Warren Daniel Child | Modular system and method for managing chinese, japanese and korean linguistic data in electronic form |
CN102314439A (en) * | 2010-06-30 | 2012-01-11 | 百度在线网络技术(北京)有限公司 | Input method combined with application interface and equipment |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109271495A (en) * | 2018-08-14 | 2019-01-25 | 阿里巴巴集团控股有限公司 | Question and answer recognition effect detection method, device, equipment and readable storage medium storing program for executing |
CN109271495B (en) * | 2018-08-14 | 2023-02-17 | 创新先进技术有限公司 | Question-answer recognition effect detection method, device, equipment and readable storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170801 |