CN101556596B - Input method system and intelligent word making method - Google Patents

Publication number
CN101556596B
Authority
CN
China
Prior art keywords
spoken
template
input
obtains
entry
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2009100051271A
Other languages
Chinese (zh)
Other versions
CN101556596A (en)
Inventor
张扬
郭奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN2009100051271A
Publication of CN101556596A
Application granted
Publication of CN101556596B
Anticipated expiration

Abstract

The invention provides an input method system comprising: a lexicon; a spoken-language template related to the characteristics and criteria of spoken-language entries; an input interface unit for receiving a user's input information; an information conversion unit for searching the lexicon according to the received input information to obtain corresponding candidates; an intelligent word-making unit for forming words according to the spoken-language template to obtain corresponding candidates; and a display output unit for displaying the candidates and outputting the candidate selected by the user. The invention further discloses an intelligent word-making method. With the invention, users can input spoken-language vocabulary more easily, with improved input experience and efficiency and at low cost, and the system can keep pace with the rapid change and renewal of spoken-language vocabulary.

Description

An input method system and an intelligent word-making method
This application is a divisional application of the invention patent application filed on August 31, 2007, with application number 200710121247.9, entitled "A method and device for obtaining spoken-language entries, and an input method system".
Technical field
The invention belongs to the field of information processing, and in particular relates to an input method system and an intelligent word-making method.
Background art
Current input method systems (for Chinese, Japanese, and so on) rank the candidate words offered to the user during input according to their lexicon and the word frequencies stored in it. An important measure of candidate ranking is the user's first-choice hit rate during input. The first-choice hit rate describes how often, after the user's input is received, the top-ranked word or character is the one the user most needs.
The prior art has taken various measures to improve the first-choice hit rate, for example: enlarging the lexicon to store more entries; obtaining recent new words and more accurate word-frequency information in various ways; or loading specialized lexicons to improve the hit rate under particular input conditions. Admittedly, these improvements can raise the first-choice hit rate to some extent, but they are of no help for the spoken-language entries that the present invention is intended to handle.
The spoken-language entries that the invention seeks to obtain fall into two categories: ordinary colloquial expressions and internet language. Colloquial expressions are used more flexibly and less uniformly than written words, for example "try", "walking", "having a meal", "beat can ball", and so on, so existing vocabulary collection methods can hardly obtain them accurately and comprehensively. Internet language has more complex characteristics: Chinese characters, digits, and letters are mixed ("8 mistake", "expectation ing", etc.), sometimes together with symbols; the rate of miswritten words is high ("Wahaha", "heartily", "digging heartily", etc.); and it changes rapidly over time. Existing vocabulary collection methods are therefore even less able to obtain it.
At present, such spoken-language entries are mostly obtained and studied manually, relying on the researchers' own initiative, because manual work copes well with the complex characteristics of these entries. For example, the "Chinese cyberspeak dictionary", edited by Yu Genyuan (于根元), a researcher at the broadcasting arts college of Beijing Broadcasting Institute, was formally published in June 2001. The dictionary contains more than 2,000 entries and about 400,000 characters of body text, and it was compiled by manual collation. Manual collection, however, has defects that are hard to overcome: it is too slow and too costly, and it can hardly keep up with the rate at which spoken-language entries are renewed. Moreover, as language changes ever faster and new internet vocabulary and usages keep emerging, relying purely on manual work will continue to consume large amounts of labor and resources.
With the rise of the internet, the cost of communication between people has dropped sharply, and publishing information has become easier and more frequent, so language is developing at an unprecedented pace. When users post on BBS forums, blogs, and instant messaging tools, they have far more occasion to use spoken-language entries, and existing input methods cannot meet this demand.
Summary of the invention
The technical problem to be solved by the invention is to provide an input method system and an intelligent word-making method that help users input spoken-language vocabulary more easily and improve input experience and input efficiency.
To solve the above problem, the invention discloses an input method system, comprising:
a lexicon;
a spoken-language template related to the characteristics and criteria of spoken-language entries, wherein the spoken-language template is obtained as follows: obtaining spoken-language entries; analyzing the obtained entries and providing feedback information to a preset rule template; and optimizing the rule template according to the feedback information to obtain the spoken-language template; and wherein the spoken-language entries are obtained as follows: obtaining the required internet corpus material in a targeted manner to form a corpus; extracting qualifying entries from the corpus according to the preset rule template; and filtering the extracted entries to obtain the required spoken-language entries;
an input interface unit for receiving the user's input information;
an information conversion unit for searching the lexicon according to the received input information to obtain corresponding candidates;
an intelligent word-making unit for making words according to the spoken-language template to obtain corresponding candidates; and
a display output unit for displaying the candidates and outputting the candidate selected by the user.
According to another embodiment of the invention, a method of inputting information by means of intelligent word making is also disclosed, comprising: receiving the user's input information; making words according to the input information and a preset spoken-language template to obtain corresponding candidates, wherein the spoken-language template is obtained as follows: obtaining spoken-language entries; analyzing the obtained entries and providing feedback information to a preset rule template; and optimizing the rule template according to the feedback information to obtain the spoken-language template; and wherein the spoken-language entries are obtained as follows: obtaining the required internet corpus material in a targeted manner to form a corpus; extracting qualifying entries from the corpus according to the preset rule template; and filtering the extracted entries to obtain the required spoken-language entries; and displaying the candidates and outputting the candidate selected by the user.
Compared with the prior art, the invention has the following advantages:
First, the input method system provided by the invention can perform intelligent word making according to spoken-language templates related to the characteristics and criteria of spoken-language entries, and thereby produce spoken-language entries. This helps users input spoken-language vocabulary more easily and improves input experience and efficiency; it is more efficient and less costly, and it can keep up with the relatively fast change and renewal of spoken-language vocabulary.
Second, through iterative optimization (both refinement and expansion) of the spoken-language templates, the invention can obtain templates that closely reflect actual usage and have high accuracy and coverage. Intelligent word making with such templates is then not limited to the spoken-language entry instances already included in the lexicon.
Brief description of the drawings
Fig. 1 is a flowchart of the steps of an embodiment of a method for obtaining spoken-language entries according to the invention;
Fig. 2 is a flowchart of the steps of a preferred embodiment of a method for obtaining spoken-language entries according to the invention;
Fig. 3 is a structural block diagram of an embodiment of a device for obtaining spoken-language entries according to the invention;
Fig. 4 is a structural block diagram of an embodiment of an input method system according to the invention;
Fig. 5 is a structural block diagram of another embodiment of an input method system according to the invention;
Fig. 6 is a structural block diagram of an embodiment of a word segmentation device according to the invention;
Fig. 7 is a structural block diagram of another embodiment of a word segmentation device according to the invention.
Detailed description
To make the above objects, features, and advantages of the invention clearer and easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.
The method of the invention can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and so on that perform particular tasks or implement particular abstract data types. The invention can also be practiced in distributed computing environments, where tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
Referring to Fig. 1, an embodiment of a method for obtaining spoken-language entries according to the invention is shown, which may specifically comprise:
Step 101: obtain the required internet corpus material in a targeted manner to form a corpus.
Corpus material is generally understood as text samples used to obtain required information or to train text-processing models; its precision and coverage directly determine the quality of the information obtained and the accuracy of the models trained. The spoken-language entries sought by the invention appear frequently in some internet material and rarely in other material, so the invention needs to obtain the required material in a targeted manner. Targeted acquisition improves corpus quality and avoids the situation where, because the corpus is impure, words that are not extraction targets are admitted as meeting the extraction conditions, such as short phrases or abbreviations in job-recruitment postings.
For example, the invention can draw on resources in which spoken-language entries appear frequently, such as BBS forums, blogs, users' personalized spoken-language lexicons, or text and voice chat records. The material can be obtained with a focused web crawler (focused spider), or from a trusted user lexicon or chat-record store, for example through the user cell-lexicon upload function provided on the official homepage of the Sogou input method. For targeted crawling, sites can be selected by specifying particular websites, or pages can be filtered by classifying the content of the crawled web pages. The crawling process itself is well known in the art and is not described in detail here.
Step 102: extract qualifying entries from the corpus according to a preset strategy.
The spoken-language entries addressed by the invention comprise two categories of entries not included in traditional dictionaries (out-of-vocabulary words, also called unregistered words). The first is colloquial derivations of dictionary entries, such as "having a meal", "happy", "trying". The second is internet language widely used in internet applications, such as "bang and lie prone", "dark reddish purple", "8 mistake", "PPMM", and so on. Entries of the first category are common in everyday communication, but because they were carried mainly by speech, relevant corpus material could not be collected. With the rise of the internet, these words appear more and more in the massive resource base of the internet, which is what allows the invention to extract and mine them. At the same time, there is no absolute boundary between internet language and traditional colloquial words; they constantly influence and permeate each other and coexist in the internet's massive resource base. For this reason the invention can extract and study them in large quantities and in good time. To extract qualifying entries, the entries must first be analyzed for their features, and corresponding extraction strategies are then established.
The strategy in step 102 can be set according to this feature analysis of spoken-language entries. In general, extraction strategies fall into two kinds, rule templates and statistical classification, or a mixture of the two. Each is briefly described below:
Mode 1
Entries can be extracted with the following preset strategy: a plurality of rule templates are preset, each describing how the individual characters of an entry are combined; entry extraction is then performed several times according to the rule templates (with few rule templates, a single pass may suffice), and each pass uses one or more rule templates. For example, with templates such as ABC ("having a cigarette"), AAB ("trying"), and ABAB ("joyfully"), one or more templates can be selected for each extraction pass.
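The rule-template extraction of Mode 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `shape` and `extract_by_templates` and the sample strings are assumptions introduced here, and a real template such as ABC would carry extra constraints (for instance on which characters may fill each slot).

```python
def shape(word: str) -> str:
    """Describe a word by its character pattern, e.g. 'AAB' or 'AABB'.

    Each distinct character is given a letter in order of first appearance.
    """
    labels: dict[str, str] = {}
    out = []
    for ch in word:
        if ch not in labels:
            labels[ch] = chr(ord("A") + len(labels))
        out.append(labels[ch])
    return "".join(out)


def extract_by_templates(text: str, templates: set[str]) -> list[str]:
    """Slide a window over the text and keep substrings whose shape is a template."""
    hits = []
    for length in sorted({len(t) for t in templates}):
        for i in range(len(text) - length + 1):
            piece = text[i:i + length]
            if shape(piece) in templates:
                hits.append(piece)
    return hits


# Hypothetical sample sentence; one extraction pass with a single template.
print(extract_by_templates("我试试看", {"AAB"}))
```

Each call corresponds to one extraction pass; passing several templates in the set corresponds to a pass that uses more than one rule template.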
Mode 2
Entries can be extracted with the following preset strategy: a given character string in the corpus is segmented according to a segmentation lexicon; the segmentation fragments are converted into a plurality of candidate entries; and, according to a preset feature library, each candidate entry is judged to be a spoken-language entry or not, and is extracted if it is. Mode 2 is a concrete realization of statistical classification, and its principle rests mainly on classification theory from machine learning. For example, given a Chinese character string of length n, a segmenter first segments the string; spoken-language strings in it become segmentation fragments because they are not present in the segmentation lexicon. The fragments are then converted into a series of possible spoken-language entry candidates, and each candidate is judged to be or not to be a spoken-language entry according to features of spoken-language entries, which completes the classification. The judgment can be based, for example, on the frequency of the entry, contextual features such as punctuation, length, and so on.
Because the spoken-language entries found are likely to appear again in subsequent segmentation, they can be added to the segmentation lexicon dynamically to improve segmentation precision. Mode 2 mainly targets spoken-language entries among segmentation fragments; if Mode 1 is used for extraction, word segmentation is not needed.
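The first half of Mode 2, turning segmentation fragments into candidate entries, can be sketched as below. The segmenter here is plain forward maximum matching, chosen only for brevity; the patent does not name a segmenter, and the lexicon and sentence are hypothetical.

```python
def segment(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Forward maximum matching; characters not covered by the lexicon
    come out as single-character tokens (the 'fragments')."""
    tokens, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if n == 1 or piece in lexicon:
                tokens.append(piece)
                i += n
                break
    return tokens


def fragment_candidates(tokens: list[str], lexicon: set[str]) -> list[str]:
    """Join runs of two or more consecutive out-of-lexicon single characters."""
    candidates, run = [], []
    for tok in tokens + [""]:  # the empty sentinel flushes the final run
        if len(tok) == 1 and tok not in lexicon:
            run.append(tok)
        else:
            if len(run) >= 2:
                candidates.append("".join(run))
            run = []
    return candidates


lexicon = {"我们", "今天"}                   # hypothetical segmentation lexicon
tokens = segment("我们今天酱紫", lexicon)
print(tokens, fragment_candidates(tokens, lexicon))
```

The candidates produced this way would then be passed to a classifier that uses features such as frequency, neighboring punctuation, and length; accepted entries can be added back to `lexicon` so that later segmentation treats them as known words.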
Comparing the two modes: the rule-template scheme of Mode 1 is relatively simple to implement, but the quality and quantity of the entries it extracts are limited by the quality and coverage of the templates themselves. The statistical classification scheme of Mode 2 is more workable, but it needs a large amount of data for its statistics, and in practice the data-sparsity problem often arises. In a preferred embodiment of the invention the two can therefore be mixed: criteria such as rule templates are incorporated into the statistical classification model in the form of features, which often gives better results. See the description of Mode 3 below for details.
Mode 3
Entries can be extracted with the following preset strategy: a given character string in the corpus is segmented according to a segmentation lexicon; the segmentation fragments are converted into a plurality of candidate entries; and entry extraction is performed several times according to a plurality of preset rule templates, each pass using one or more rule templates. The rule templates describe how the individual characters of an entry are combined.
For example, each candidate is given a binary classification according to feature templates, and form features such as AAB, ABC, and AABB are used as one class of classification features. Within the overall framework of a statistical classification model, these are combined with features of other classes to judge whether each candidate is a spoken-language entry, which often yields higher judgment precision.
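One way to picture Mode 3 is a linear scorer in which the template shape is just one feature among others. The sketch below is an assumption-laden illustration: the feature set, the weights in `WEIGHTS`, and the decision threshold are all invented for the example, whereas in practice they would be learned from labeled data.

```python
import math

TEMPLATE_SHAPES = {"AAB", "AABB", "ABAB"}  # hypothetical template set


def shape(word: str) -> str:
    """Character pattern of a word, e.g. 'AAB'."""
    labels: dict[str, str] = {}
    return "".join(labels.setdefault(ch, chr(ord("A") + len(labels))) for ch in word)


def features(word: str, count: int) -> dict[str, float]:
    """Mix a rule-template feature with a statistical one."""
    return {
        "bias": 1.0,
        "template_shape": 1.0 if shape(word) in TEMPLATE_SHAPES else 0.0,
        "log_count": math.log1p(count),
    }


# Hand-set, purely illustrative weights; a trained model would supply these.
WEIGHTS = {"bias": -3.0, "template_shape": 1.5, "log_count": 0.8}


def is_spoken_entry(word: str, count: int) -> bool:
    """Binary decision: positive score means 'spoken-language entry'."""
    f = features(word, count)
    return sum(WEIGHTS[name] * value for name, value in f.items()) > 0
```

With these made-up weights, a template-shaped candidate needs less corpus evidence to be accepted than a candidate with no recognizable shape, which is the intended effect of feeding templates into the classifier as features.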
The following table gives some possible features of spoken-language entries and some possible spoken-language templates:
The above briefly describes entry extraction using rule templates, statistical classification, and the combination of the two. Those skilled in the art will appreciate that other feasible extraction schemes may exist; the invention is not limited to the three extraction modes above, and any extraction based on the characteristics and criteria of spoken-language entries falls within the meaning of the preset strategy of the invention.
Step 103: filter the extracted entries to obtain the required spoken-language entries.
The filtering rules may include, but are not limited to, criteria such as frequency of occurrence, word-formation probability, time, and grammatical and morphological features, and may also follow information-science criteria such as range of occurrence. In particular cases manual filtering can also be used. Preferably, external resources or information theory can also be used to filter out the junk among the entries.
For example, filtering can use a collected junk-word lexicon, lists of the prefixes and suffixes of junk words, and so on, to remove junk vocabulary.
As another example, based on information entropy theory, the frequency of an extracted entry and the number of distinct Chinese characters adjacent to it on the left and on the right are counted to judge whether the entry is a broken word, that is, not a spoken-language entry required by the invention. For instance, for the entry "not only gas but also" extracted by the ABA template, many different characters occur to the left of its first character, whereas very few characters occur to the right of its last character, concentrated on characters such as "hatred" and "angry". The entry can therefore be identified as a broken word; that is, "not only gas but also" is not a spoken-language entry of the kind the invention requires, such as "tasting".
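The neighbor-based test for broken words can be sketched with Shannon entropy over the characters seen next to an entry. The threshold `min_entropy`, the function names, and the toy corpus (Latin letters standing in for Chinese characters) are assumptions for illustration only.

```python
import math
from collections import Counter


def neighbor_entropy(corpus: str, entry: str, side: str) -> float:
    """Shannon entropy (in bits) of the characters adjacent to `entry`
    on the given side ('left' or 'right') across all its occurrences."""
    neighbors: Counter[str] = Counter()
    start = corpus.find(entry)
    while start != -1:
        idx = start - 1 if side == "left" else start + len(entry)
        if 0 <= idx < len(corpus):
            neighbors[corpus[idx]] += 1
        start = corpus.find(entry, start + 1)
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbors.values())


def looks_broken(corpus: str, entry: str, min_entropy: float = 1.0) -> bool:
    """An entry whose left or right context is nearly fixed is probably
    a fragment of a longer word."""
    return min(neighbor_entropy(corpus, entry, "left"),
               neighbor_entropy(corpus, entry, "right")) < min_entropy


toy = "aXYZb cXYZd eXYZf gXYZh"
print(looks_broken(toy, "XY"), looks_broken(toy, "XYZ"))
```

In the toy corpus, "XY" is always followed by the same character, so its right-side entropy is zero and it is flagged as broken, while the full unit "XYZ" has varied neighbors on both sides and passes.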
As a further example, the number of occurrences of each extracted entry in the corpus can be counted, and an entry whose count is greater than or equal to a predetermined threshold is confirmed as a required spoken-language entry.
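The frequency threshold is the simplest of these filters. A minimal sketch, assuming the extraction step yields one list item per occurrence and that `min_count` is a tuning parameter chosen by the implementer:

```python
from collections import Counter


def filter_by_frequency(extracted: list[str], min_count: int) -> set[str]:
    """Keep entries that were extracted at least `min_count` times."""
    counts = Counter(extracted)
    return {entry for entry, n in counts.items() if n >= min_count}
```

In practice this filter would be combined with the junk-word and entropy checks rather than used alone, since frequent fragments also pass a pure count threshold.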
Referring to Fig. 2, a preferred embodiment for obtaining spoken-language entries is shown, which may specifically comprise the following steps; parts similar to the embodiment shown in Fig. 1 are not described again in detail.
Step 201: obtain the required internet corpus material in a targeted manner to form a corpus.
Step 202: perform data-purification preprocessing on the collected internet corpus material.
As noted above, corpus quality directly affects the quality of the entries finally extracted, so this preferred embodiment adds a purification preprocessing step. In terms of form, invalid information such as HTML tags in web pages can be removed. In terms of content, invalid boilerplate on certain types of web pages can be removed, for example the fixed template text on BBS pages. In some cases, interfering input from certain users must also be removed, such as BBS users who arrange many "top" characters into one large "top" character to express strong feeling, or who repeat a phrase or sentence many times. Such cases all disturb the extraction process and can be removed in step 202.
If the corpus source is users' voice chat records, speech must also be converted to Chinese characters, so that the input supplied to the extraction step is uniformly in a text format that a computer can process.
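Two of the purification steps named above, stripping HTML tags and collapsing flood-style repetition, can be sketched with regular expressions. The repetition threshold (a unit repeated four or more times) is an assumption; it would need tuning so that legitimate reduplicated words are not collapsed.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")            # crude HTML tag matcher
REPEAT_RE = re.compile(r"(.+?)\1{3,}")     # a unit repeated 4 or more times


def purify(raw: str) -> str:
    """Remove tags, collapse flooded repetitions, and normalize whitespace."""
    text = TAG_RE.sub("", raw)
    text = REPEAT_RE.sub(r"\1", text)
    return re.sub(r"\s+", " ", text).strip()


print(purify("<p>好顶顶顶顶顶顶</p>"))
```

Site-specific boilerplate (such as fixed BBS page templates) is not handled here; that would require per-site rules, which the feedback step can later supply.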
Step 203: extract qualifying entries from the corpus according to the preset strategy.
Step 204: perform error correction on the entries.
That is, wrongly written characters in the entries are corrected; preferably, the correction is based on contextual similarity. For example, "blog fight" would normally be treated as a miswritten form of "fight", but if the context contains keywords such as blog, fight, or scolding, it can be concluded that it refers to bloggers fighting one another on blogs and is not necessarily a wrong word. Similarly, the title of Faye Wong's album "菲卖品" and the advertising slogan "默默无蚊" can be recognized, by combining deeper contextual analysis, as deliberate forms rather than wrong words, and need no correction.
Step 205: based on the phonetic similarity of entries, convert the variant forms of an entry (for example, forms using digits or English) into a canonical form. The canonical form can generally be determined by which form occurs most frequently. This process is usually called "entry normalization". For example, "88" and "bye bye" are both converted into the canonical form "bye bye"; "Wahaha", "heartily", and "digging heartily" are all converted into the canonical "Wahaha"; and so on. Normalization can, without limitation, be based on a pronunciation model that maps digits and English to Chinese characters; preferably, the normalization process also takes contextual similarity into account.
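Entry normalization as described, grouping variants that sound alike and choosing the most frequent form as canonical, can be sketched as follows. The pronunciation table `SOUND` is a hypothetical stand-in for a real pronunciation model, and the counts are invented sample data.

```python
from collections import defaultdict
from typing import Callable


def canonical_forms(counts: dict[str, int],
                    sound_key: Callable[[str], str]) -> dict[str, str]:
    """Map every variant to the most frequent form with the same sound key."""
    groups: dict[str, list[str]] = defaultdict(list)
    for form in counts:
        groups[sound_key(form)].append(form)
    mapping = {}
    for forms in groups.values():
        best = max(forms, key=lambda f: counts[f])
        for form in forms:
            mapping[form] = best
    return mapping


# Hypothetical pronunciation keys and corpus counts.
SOUND = {"88": "baibai", "拜拜": "baibai", "哈哈": "haha"}
COUNTS = {"88": 5, "拜拜": 9, "哈哈": 3}
mapping = canonical_forms(COUNTS, lambda form: SOUND[form])
print(mapping)
```

For an input method lexicon, where entries must consist of Chinese characters, the choice of `best` would additionally be restricted to all-Chinese forms; that restriction is left out of this sketch.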
It should be noted that steps 204 and 205 need not occur together, because they address different kinds of entry errors. In addition, if the spoken-language entries obtained by the invention are mainly used in an input method, then, because entries in a Chinese input method lexicon must consist strictly of Chinese characters so that they can be annotated with pronunciation, the digits, letters, and symbols contained in entries must be normalized in that application. When the entries are mainly used for Chinese word segmentation, the lexicon entries are not required to consist entirely of Chinese characters, and a good many of them, such as brand names and named entities, contain digits and letters; the original forms of such entries can then be kept and entry normalization is unnecessary.
Step 206: filter the extracted entries to obtain the required spoken-language entries.
It should further be noted that although steps 204, 205, and 206 are described in sequence in this embodiment, the three steps can in fact be carried out simultaneously, that is, in a single step.
Step 207: analyze the obtained spoken-language entries and provide feedback information to the preset strategy; the feedback information is used to improve the original rule templates or features, or to provide new rule templates or new features.
The feedback information provided by step 207 can supply step 202 with invalid page templates, or with improvements to existing invalid templates, to achieve a better purification preprocess. It can also supply step 203 with optimized or new extraction templates, to improve the accuracy and coverage of entry extraction. As step 207 shows, extracting and improving rule templates is an iterative process that gradually approaches the optimum.
Take the iterative optimization of the ABC template as an example, where AC is required to be a word in the lexicon. In the first step, a batch of seed B characters is selected, such as "End", "only", "individual", and a batch of ABC entries is collected with them. In the second step, the set of B characters that co-occur with these AC words is counted, with manual review where necessary, to expand the initial B seeds, and the process returns to the first step. Iterating in this way finds a B set that covers the great majority of entries matching the ABC template.
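The two-step iteration for the ABC template can be sketched as a small bootstrapping loop. The support threshold `min_support` stands in for the manual review mentioned above, and the corpus, lexicon, and seed set are hypothetical sample data.

```python
def expand_b_set(corpus: str, ac_lexicon: set[str], seeds: set[str],
                 min_support: int = 2, max_rounds: int = 5) -> set[str]:
    """Grow the set of B characters for the ABC template, where A+C must
    be a lexicon word, by alternating the two steps until nothing changes."""
    b_set = set(seeds)
    for _ in range(max_rounds):
        # Step 1: AC words observed with an already trusted B character.
        trusted_ac = set()
        for i in range(len(corpus) - 2):
            a, b, c = corpus[i:i + 3]
            if b in b_set and a + c in ac_lexicon:
                trusted_ac.add(a + c)
        # Step 2: other characters found in the B slot of trusted AC words.
        support: dict[str, set[str]] = {}
        for i in range(len(corpus) - 2):
            a, b, c = corpus[i:i + 3]
            if a + c in trusted_ac and b not in b_set:
                support.setdefault(b, set()).add(a + c)
        new_b = {b for b, words in support.items() if len(words) >= min_support}
        if not new_b:
            break
        b_set |= new_b
    return b_set


corpus = "洗个澡。吃个饭。洗了澡。吃了饭。洗完澡。"
print(expand_b_set(corpus, {"洗澡", "吃饭"}, {"个"}))
```

Requiring a new B character to appear with at least two different AC words keeps one-off coincidences out of the set; in the sample, "了" is admitted and "完" is not, because it has been seen with only one AC word.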
Likewise, this iterative process can be used to find new templates among the extracted spoken-language entry instances. Suppose that over some period many entries are obtained that consist of the character 暴 followed by an adjective, with meanings such as "extremely strong" and "extremely rich". Automatically by machine, or with manual intervention, the template "暴 + <adjective>" can then be found, in which 暴 acts as a degree adverb synonymous with "very". The template can then be applied deliberately to extract more entries, such as "extremely handsome". In the same way, templates such as "doubly + <adjective>", "<adjective> + say", and "... spread" can be found automatically, gradually yielding spoken-language templates of relatively high coverage and precision.
As another example, analysis of extracted entries such as "more and more" and "people sees the people" shows that they should not count as spoken-language entries of the "ABA" template but are parts of entries of the ABAC template, so the "ABA" template is optimized by adding qualifying conditions. For extracted entries such as "advanced back", analysis shows that they should not belong to the "ABC" template but to the ABCD template, in which A and C are antonyms of each other. The extraction templates are adjusted and the filter criteria updated accordingly, making extraction more effective.
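Discovering a template such as "暴 + <adjective>" from extracted instances can be sketched as counting how many different adjectives share the same leading character. The adjective list, the sample entries, and the threshold `min_support` are assumptions made for this illustration.

```python
from collections import defaultdict


def discover_prefix_templates(entries: list[str], adjectives: set[str],
                              min_support: int = 3) -> set[str]:
    """Return leading characters that combine with at least `min_support`
    different known adjectives among the extracted entries."""
    support: dict[str, set[str]] = defaultdict(set)
    for entry in entries:
        if len(entry) < 2:
            continue
        head, tail = entry[0], entry[1:]
        if tail in adjectives:
            support[head].add(tail)
    return {head for head, tails in support.items() if len(tails) >= min_support}


entries = ["暴强", "暴帅", "暴冷", "很好"]        # hypothetical extracted entries
adjectives = {"强", "帅", "冷", "好"}             # hypothetical adjective list
print(discover_prefix_templates(entries, adjectives))
```

A discovered head character then defines a new template that can be fed back into the extraction step; suffix templates can be found symmetrically by splitting on the last character instead of the first.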
Step 208: add the obtained spoken-language entries to the input method lexicon; and/or add the rule templates of the extraction strategy, as improved according to the feedback information, to the intelligent word-making rule base of the input method.
Intelligent word making is generally understood as the process in which the input method tool, given the input pinyin, dynamically selects the most probable candidate from a number of possible Chinese character strings and outputs it. It is a known technique in the art and is widely used in existing input methods, but existing intelligent word making generally composes words from connection-probability information between words. The invention proposes that intelligent word making for spoken-language vocabulary can additionally be performed with preset spoken-language templates. For example, one way of using spoken-language templates in intelligent word making is as follows: multiple possible character combinations are obtained from the user's input, and these combinations are then filtered by matching them against the spoken-language templates, so that spoken-language entries not stored in the lexicon can be obtained as candidates.
In fact, for each possible syllable segmentation, a traditional input method always combines adjacent syllable segments to look up the corresponding Chinese character candidates in the lexicon. In the word-making process of the invention, template matching can be carried out across syllables. For example, the segmentation result gao'gao'xing'xing matches the AABB template, so the A and B pinyin can be combined to look up the entry in the lexicon, instead of composing the word character by character as a traditional input method would, which potentially reduces the word-making overhead.
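The cross-syllable AABB match for gao'gao'xing'xing can be sketched as follows. The lexicon layout (a mapping from pinyin tuples to words) and the function name `make_aabb` are assumptions chosen for the example, not the data structures of an actual input method.

```python
def make_aabb(syllables: list[str],
              lexicon: dict[tuple[str, ...], list[str]]) -> list[str]:
    """If the syllables have the AABB shape, look up the two-character
    word AB in the lexicon and expand it to the AABB form."""
    if len(syllables) != 4:
        return []
    a1, a2, b1, b2 = syllables
    if a1 != a2 or b1 != b2 or a1 == b1:
        return []
    words = lexicon.get((a1, b1), [])
    return [w[0] * 2 + w[1] * 2 for w in words if len(w) == 2]


lexicon = {("gao", "xing"): ["高兴"]}     # hypothetical pinyin-keyed lexicon
print(make_aabb(["gao", "gao", "xing", "xing"], lexicon))
```

Only one lexicon lookup is needed (for the AB word), rather than a search over every combination of characters for the four syllables, which is where the potential saving comes from.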
Particularly; In step 208, only the spoken language entries that obtains is added in the input method dictionary, promptly belong to accurate coupling based on the spoken language entries instance; Be equivalent to expand existing dictionary to the spoken language entries instance; Because the present invention can obtain a large amount of spoken language entries instances, thus the input efficiency of user can be improved to a certain extent to spoken language entries, but be difficult to solve the situation of not including entry.And the rule template in the extraction strategy after will improving according to feedback information is added in the input method intelligent word rule base, then belongs to the dynamic construction based on spoken template.Such as existing template ABC, wherein AC is moving guest's phrase that dictionary is included, and the scope of B is limited and can dynamically finds, as " individual,, intact, one ".So when user's input Pinyin string " xi ' ge ' zao ", input method is found that the corresponding candidate of this phonetic " takes a bath " and is mated this template fully, exports as optimum answer thereby can will take a bath.Certainly, these two kinds of methods are not mutual exclusions, can exist simultaneously to satisfy the needs of different occasions.
For example, when the user enters the string "huanle", the candidates show common entries already in the dictionary, such as 欢乐 ("joy"), and spoken entries already in the dictionary, such as 换了 ("have changed"). When the user enters the string "huanleqian", the first candidate is 还了钱 ("have paid back the money"), followed by 换了钱 ("have exchanged the money") and so on. Under the preset spoken template ABC, AC is a verb-object phrase included in the dictionary, 还钱 ("pay back money") or 换钱 ("exchange money"), and 了 belongs to the set B. Intelligent word formation can therefore produce the candidates 还了钱 and 换了钱, which are not in the dictionary, making spoken input more convenient for the user.
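The ABC template from the two examples above can be sketched in a few lines. This is a hedged illustration: the tables `VERB_OBJECT` and `B_FILLERS`, the function `match_abc`, the Chinese characters, and the candidate order are assumptions made for the example, not data from the patent.

```python
from typing import Dict, List, Tuple

# Hypothetical verb-object phrases (the "AC" part), keyed by their pinyin.
VERB_OBJECT: Dict[Tuple[str, str], List[str]] = {
    ("xi", "zao"): ["洗澡"],
    ("huan", "qian"): ["还钱", "换钱"],
}

# Hypothetical limited set for the "B" slot, keyed by pinyin.
B_FILLERS: Dict[str, str] = {"ge": "个", "le": "了", "wan": "完", "yi": "一"}


def match_abc(syllables: List[str]) -> List[str]:
    """Build ABC candidates when A+C is a known phrase and B is an allowed filler."""
    if len(syllables) != 3:
        return []
    a, b, c = syllables
    filler = B_FILLERS.get(b)
    if filler is None:
        return []
    # Insert the filler between the verb and the object of each phrase.
    return [vo[0] + filler + vo[1] for vo in VERB_OBJECT.get((a, c), [])]


bath = match_abc(["xi", "ge", "zao"])
money = match_abc(["huan", "le", "qian"])
```

Neither 洗个澡 nor 还了钱 has to be stored in the dictionary; only the two-character phrases and the small filler set are needed.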
Furthermore, because the spoken templates of the present invention reach good accuracy and coverage after repeated iteration, applying them in an input method system better meets the user's input needs and improves the efficiency of entering spoken-language entries.
The present invention can be applied to input method platforms with various input modes, including keyboard symbols, handwriting and voice input. That is, the input information may be a coded character string, or it may be handwritten or spoken input, because these input modes also rely on a dictionary to rank candidates. Since information conversion for these input modes is known technology, it is not described in detail here. The following description uses coded character string input only as an example.
In addition, in the prior art an input method platform may run on many kinds of computing devices, for example personal computers, personal digital assistants and mobile terminals, so the present invention is also applicable to all of these devices.
The present invention can also be applied to input method systems for Japanese, Korean and other languages in which candidate words need to be ranked. In Japanese, for example, candidate ranking is needed when hiragana and katakana are combined into phrases. Because the invention is applied in a similar way in all of these input method systems, this specification describes only the Chinese case for convenience.
Step 209: add the obtained spoken-language entries to the corpus word segmentation dictionary; and/or add the rule templates of the extraction strategy, as improved according to the feedback information, to the corpus word segmentation rule base.
Taking Chinese as an example, Chinese word segmentation applications, especially in information retrieval, mainly use dictionary-based segmentation methods and therefore depend heavily on the coverage of the dictionary. Including some spoken-type out-of-vocabulary words helps improve segmentation precision. Furthermore, using the high-quality spoken templates obtained by the iteration of the present invention to determine word boundaries dynamically can give a better segmentation result. The concrete segmentation process is not the focus of the present invention and is therefore not described in detail.
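To make the effect of a spoken template on word boundaries concrete, the sketch below combines forward maximum matching, a common dictionary-based technique chosen here as an assumption because the patent does not name a specific algorithm, with an AABB template check. The dictionary contents and function names are hypothetical.

```python
from typing import List, Set

DICTIONARY: Set[str] = {"高兴", "我们", "回家"}  # hypothetical toy dictionary
MAX_WORD_LEN = 4


def is_aabb(chunk: str, dictionary: Set[str]) -> bool:
    """True if chunk has the form AABB and the base word AB is in the dictionary."""
    return (
        len(chunk) == 4
        and chunk[0] == chunk[1]
        and chunk[2] == chunk[3]
        and chunk[0] != chunk[2]
        and chunk[0] + chunk[2] in dictionary
    )


def segment(text: str, use_templates: bool = True) -> List[str]:
    """Forward maximum matching, optionally preceded by a template check."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        if use_templates and is_aabb(text[i:i + 4], DICTIONARY):
            tokens.append(text[i:i + 4])
            i += 4
            continue
        # Try the longest dictionary word first; fall back to one character.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in DICTIONARY:
                tokens.append(piece)
                i += length
                break
    return tokens


with_templates = segment("我们高高兴兴回家")
without_templates = segment("我们高高兴兴回家", use_templates=False)
```

With the template enabled, the reduplicated form stays in one token; without it, the dictionary alone splits it into fragments around the base word.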
For simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions, but those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
With reference to Fig. 3, an embodiment of a device for obtaining spoken-language entries is shown, which may comprise the following components:
A corpus acquisition module 301, for obtaining the required Internet corpus data in a targeted manner to form a corpus;
An entry extraction module 302, for extracting qualifying entries from the corpus according to a preset strategy, the preset strategy being related to the various characteristics of spoken-language entries;
A filtering module 303, for filtering the extracted entries to obtain the required spoken-language entries.
In one embodiment of the present invention, the entry extraction module may further comprise: a plurality of rule templates, each describing a way in which single characters combine within an entry; and a template extraction submodule, for performing multiple rounds of entry extraction according to the rule templates, each round using one or more rule templates.
In another embodiment of the present invention, the entry extraction module may be further subdivided into: a segmenter, for segmenting a given character string in the corpus according to a word segmentation dictionary; a converter, for converting the segmented fragments into a plurality of candidate entries; and a feature extraction submodule, for judging against a preset feature database whether a candidate entry is a spoken-language entry and, if so, extracting it.
In yet another embodiment of the present invention, the entry extraction module may further comprise: a segmenter, for segmenting a given character string in the corpus according to a word segmentation dictionary; a converter, for converting the segmented fragments into a plurality of candidate entries; a plurality of rule templates, each describing a way in which single characters combine within an entry; and a template extraction submodule, for performing multiple rounds of entry extraction according to the preset rule templates, each round using one or more rule templates.
To obtain high-quality spoken templates, one embodiment of the present invention may further comprise an analysis and feedback module 304, for analyzing the obtained spoken-language entries and providing feedback information to the preset strategy; the feedback information is used to improve the original rule templates or features, or to provide new rule templates or new features.
To improve the accuracy of entry acquisition, one embodiment of the present invention may further comprise a correction module 305, for correcting errors in entries before filtering, and an entry normalization module 306, for converting the variants of an entry into a canonical form before filtering, based on the similarity of their pronunciation. These two modules need not both be present; either may be chosen as required.
To improve the quality of the corpus data, this embodiment may further comprise a pre-processing module 307, for performing data cleaning on the collected Internet corpus data.
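The extract-then-filter flow of modules 302 and 303 can be sketched roughly as below. This is only an illustration under stated assumptions: the two regular-expression rule templates, the frequency threshold used as the filter, and every name in the code are hypothetical, since this section does not specify the actual templates or filtering criteria.

```python
import re
from collections import Counter
from typing import Dict, Iterable

HAN = r"[\u4e00-\u9fff]"  # one CJK unified ideograph

# Hypothetical rule templates describing single-character combination patterns.
RULE_TEMPLATES = {
    "AABB": re.compile(rf"({HAN})\1({HAN})\2"),
    "ABAB": re.compile(rf"({HAN})({HAN})\1\2"),
}


def extract_entries(corpus: Iterable[str], min_count: int = 2) -> Dict[str, int]:
    """Extract template matches from the corpus and keep the frequent ones."""
    counts: Counter = Counter()
    for line in corpus:
        for pattern in RULE_TEMPLATES.values():
            for match in pattern.finditer(line):
                counts[match.group(0)] += 1
    # Assumed filter: drop entries seen fewer than min_count times.
    return {entry: n for entry, n in counts.items() if n >= min_count}


corpus = ["今天高高兴兴出门", "大家高高兴兴回家", "研究研究再说", "他说说笑笑"]
frequent = extract_entries(corpus)
everything = extract_entries(corpus, min_count=1)
```

In the feedback loop of module 304, the surviving entries would be analyzed to refine or extend `RULE_TEMPLATES`; that step is left out of this sketch.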
With reference to Fig. 4, an embodiment of an input method system is shown, which may comprise:
A dictionary 401;
Spoken templates 402, obtained in the following manner: obtaining the required Internet corpus data in a targeted manner to form a corpus; extracting qualifying entries from the corpus according to preset rule templates; filtering the extracted entries to obtain the required spoken-language entries; analyzing the obtained spoken-language entries and providing feedback information to the preset rule templates; and optimizing the rule templates in the preset strategy according to the feedback information to obtain the spoken templates. In fact, as shown in the earlier table, the spoken templates of the present invention may also include some spoken-language rules;
An input interface unit 403, for receiving the user's input information;
An information conversion unit 404, for searching the dictionary 401 according to the received input information to obtain corresponding candidates;
An intelligent word formation unit 405, for obtaining corresponding candidates by intelligent word formation according to the spoken templates 402;
A display and output unit 406, for displaying the candidates and outputting the candidate selected by the user.
In fact, the present invention is the first to introduce spoken templates into the intelligent word formation process, so that the user can quickly enter the required spoken-language entries even without a preset dictionary that contains a very complete set of spoken entry instances. The present invention places no restriction on how the spoken templates are established; those skilled in the art may obtain them in various ways, for example by manual definition. The embodiment of Fig. 4 proposes one way of obtaining better spoken templates so as to further improve the user's efficiency in entering spoken entries, but this should not be regarded as the only way of obtaining spoken templates under the present invention.
If spoken templates are used to output spoken vocabulary through intelligent word formation, the observable behaviour is as follows: a spoken-language entry that is not in the dictionary cannot be entered when intelligent word formation is not active, but can be entered when it is active. Alternatively, if a switch for the spoken template function is provided, spoken-language entries missing from the dictionary cannot be entered while the function is switched off and can be entered once it is switched on.
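The observable behaviour of the switch can be illustrated with a small sketch in which dictionary lookup and template-based word formation feed one candidate list. The class `InputMethod`, the flag `spoken_templates_enabled`, the toy matcher and its data are assumptions made for illustration, not the structure of the patented system.

```python
from typing import Callable, Dict, List


def abc_matcher(syllables: List[str]) -> List[str]:
    """Toy ABC template: a verb-object phrase AC with a filler B in between."""
    verb_object = {("xi", "zao"): "洗澡"}  # hypothetical data
    fillers = {"ge": "个", "le": "了"}     # hypothetical data
    if len(syllables) != 3:
        return []
    a, b, c = syllables
    phrase = verb_object.get((a, c))
    filler = fillers.get(b)
    if phrase is None or filler is None:
        return []
    return [phrase[0] + filler + phrase[1]]


class InputMethod:
    def __init__(
        self,
        dictionary: Dict[str, List[str]],
        matcher: Callable[[List[str]], List[str]],
        spoken_templates_enabled: bool = True,
    ) -> None:
        self.dictionary = dictionary
        self.matcher = matcher
        self.spoken_templates_enabled = spoken_templates_enabled

    def candidates(self, pinyin: str) -> List[str]:
        # Dictionary lookup (information conversion) always runs.
        results = list(self.dictionary.get(pinyin, []))
        # Template-based word formation runs only when switched on.
        if self.spoken_templates_enabled:
            for cand in self.matcher(pinyin.split("'")):
                if cand not in results:
                    results.append(cand)
        return results


ime = InputMethod({"xi'zao": ["洗澡"]}, abc_matcher)
switched_on = ime.candidates("xi'ge'zao")
ime.spoken_templates_enabled = False
switched_off = ime.candidates("xi'ge'zao")
```

With the switch on, the entry missing from the dictionary is offered; with it off, the same input yields no candidate, while ordinary dictionary entries are unaffected.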
With reference to Fig. 5, another embodiment of an input method system is shown, comprising:
A dictionary 501 storing spoken-language entries, the spoken-language entries being obtained in the following manner: obtaining the required Internet corpus data in a targeted manner to form a corpus; extracting qualifying entries from the corpus according to a preset strategy; and filtering the extracted entries to obtain the required spoken-language entries; wherein the preset strategy is related to the various characteristics of spoken-language entries;
Spoken templates 502, obtained in the following manner: analyzing the obtained spoken-language entries and providing feedback information to the preset strategy; and optimizing the rule templates in the preset strategy according to the feedback information to obtain the spoken templates;
An input interface unit 503, for receiving the user's input information;
An information conversion unit 504, for searching the dictionary 501 storing spoken-language entries according to the received input information to obtain corresponding candidates;
An intelligent word formation unit 505, for obtaining corresponding candidates by intelligent word formation according to the spoken templates 502;
A display and output unit 506, for displaying the candidates and outputting the candidate selected by the user.
Fig. 4 and Fig. 5 show two closely related embodiments. The main difference is that the embodiment of Fig. 4 applies only the spoken templates obtained by iterative optimization to the input method system and relies mainly on those templates for entering spoken-language entries, whereas the embodiment of Fig. 5 applies both the spoken-language entries and the spoken templates obtained by the present invention, so that coverage by spoken entry instances, supplemented by the spoken templates, can achieve a better input result.
With reference to Fig. 6, an embodiment of a word segmentation device is shown, which may comprise:
A word segmentation dictionary 601;
A word segmentation rule base 602 storing spoken templates, the spoken templates being obtained in the following manner: obtaining the required Internet corpus data in a targeted manner to form a corpus; extracting qualifying entries from the corpus according to preset rule templates; filtering the extracted entries to obtain the required spoken-language entries; analyzing the obtained spoken-language entries and providing feedback information to the preset rule templates; and optimizing the rule templates in the preset strategy according to the feedback information to obtain the spoken templates;
A word segmentation execution module 603, for segmenting corpus data using the entries in the word segmentation dictionary and the rule templates in the word segmentation rule base.
With reference to Fig. 7, another embodiment of a word segmentation device is shown, comprising:
A word segmentation dictionary 701 storing spoken-language entries, the spoken-language entries being obtained in the following manner: obtaining the required Internet corpus data in a targeted manner to form a corpus; extracting qualifying entries from the corpus according to a preset strategy; and filtering the extracted entries to obtain the required spoken-language entries; wherein the preset strategy is related to the various characteristics of spoken-language entries;
A word segmentation rule base 702 storing spoken templates, the spoken templates being obtained in the following manner: analyzing the obtained spoken-language entries and providing feedback information to the preset strategy; and optimizing the rule templates in the preset strategy according to the feedback information to obtain the spoken templates;
A word segmentation execution module 703, for segmenting corpus data using the entries in the word segmentation dictionary and the rule templates in the word segmentation rule base.
Fig. 6 and Fig. 7 show two closely related embodiments. The main difference is that the embodiment of Fig. 6 applies only the spoken templates obtained by iterative optimization to the segmentation process and relies mainly on those templates to improve segmentation, whereas the embodiment of Fig. 7 applies both the spoken-language entries and the spoken templates obtained by the present invention, so that coverage by spoken entry instances, supplemented by the spoken templates, can achieve a more reasonable segmentation result.
Correspondingly, the present invention also discloses an embodiment of an intelligent word formation method, that is, a process of obtaining candidates by intelligent word formation, which comprises:
Step a: receive the user's input information;
Step b: obtain corresponding candidates by intelligent word formation according to the input information and the preset spoken templates;
Step c: display the candidates and output the candidate selected by the user.
Preferably, the spoken templates may be obtained in the following manner: obtaining the required Internet corpus data in a targeted manner to form a corpus; extracting qualifying entries from the corpus according to preset rule templates; filtering the extracted entries to obtain the required spoken-language entries; analyzing the obtained spoken-language entries and providing feedback information to the preset rule templates; and optimizing the rule templates in the preset strategy according to the feedback information to obtain the spoken templates.
The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and identical or similar parts can be understood by cross-reference between embodiments. Because the device embodiments are essentially similar to the method embodiments, their description is brief; for the relevant details, refer to the corresponding parts of the method embodiments.
In short, the input method system is one of the most important steps by which software and Internet companies "capture the user's desktop", and its quality and user-friendliness directly determine whether users are willing to choose this input method or switch to it from another one. As Internet infrastructure keeps improving, ordinary Internet users have more and more ways to communicate with one another, and doing so becomes ever more convenient: they can exchange colloquial text through IM tools such as oicq/icq and live/yahoo/aol messenger, through BBS, through blog comments and in similar forms. This situation highlights the conflict between traditional input method dictionaries, which lack fresh spoken vocabulary, and Internet users' growing use of colloquial online language. The technical solution proposed by the present invention obtains colloquial vocabulary quickly and effectively and keeps discovering the spoken templates within it, achieving the following technical effects:
1. Coverage of a larger range of spoken-language entries. The extracted spoken-language entries are not simply added to the input method dictionary; rule templates and the like also take part in intelligent word formation, which covers more cases and makes the user's input more fluent.
2. More timely and effective vocabulary updates. Internet language is known to change and renew quickly. Because the present invention is an automatic extraction method that needs little manual intervention, it can obtain the latest spoken vocabulary in time and track current trends in spoken language.
On the other hand, faced with massive amounts of data, people need to manage and access the information they require quickly and accurately, including personal data such as e-mail, chat records and multimedia documents. The word segmentation programs used to process this information depend heavily on the coverage of the segmentation dictionary. With the automatic spoken entry extraction method of the present invention, the word segmentation dictionary and the word segmentation rule base can be expanded promptly and on a large scale. Spoken entry instances that have not been extracted can still be handled by template matching.
The foregoing has described in detail the method and device provided by the present invention for extracting spoken-language entries from Internet information, the intelligent word formation method, and the input method system and word segmentation device that apply the spoken-language entries and spoken templates obtained in the foregoing process. Specific examples have been used herein to explain the principles and embodiments of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, this description should not be construed as limiting the present invention.

Claims (2)

1. An input method system, characterized in that it comprises:
A dictionary;
Spoken templates, the spoken templates being related to the various characteristics and criteria of spoken-language entries; wherein the spoken templates are obtained in the following manner: obtaining spoken-language entries; analyzing the obtained spoken-language entries and providing feedback information to preset rule templates; and optimizing the rule templates according to the feedback information to obtain the spoken templates; and wherein the spoken-language entries are obtained in the following manner: obtaining the required Internet corpus data in a targeted manner to form a corpus; extracting qualifying entries from the corpus according to the preset rule templates; and filtering the extracted entries to obtain the required spoken-language entries;
An input interface unit, for receiving the user's input information;
An information conversion unit, for searching the dictionary according to the received input information to obtain corresponding candidates;
An intelligent word formation unit, for obtaining corresponding candidates by intelligent word formation according to the spoken templates;
A display and output unit, for displaying the candidates and outputting the candidate selected by the user.
2. A method of inputting information by means of intelligent word formation, characterized in that it comprises:
Receiving the user's input information;
Obtaining corresponding candidates by intelligent word formation according to the input information and preset spoken templates; wherein the spoken templates are obtained in the following manner: obtaining spoken-language entries; analyzing the obtained spoken-language entries and providing feedback information to preset rule templates; and optimizing the rule templates according to the feedback information to obtain the spoken templates; and wherein the spoken-language entries are obtained in the following manner: obtaining the required Internet corpus data in a targeted manner to form a corpus; extracting qualifying entries from the corpus according to the preset rule templates; and filtering the extracted entries to obtain the required spoken-language entries;
Displaying the candidates and outputting the candidate selected by the user.
CN2009100051271A 2007-08-31 2007-08-31 Input method system and intelligent word making method Active CN101556596B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100051271A CN101556596B (en) 2007-08-31 2007-08-31 Input method system and intelligent word making method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009100051271A CN101556596B (en) 2007-08-31 2007-08-31 Input method system and intelligent word making method

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN200710121247A Division CN100595760C (en) 2007-08-31 2007-08-31 Method for gaining oral vocabulary entry, device and input method system thereof

Publications (2)

Publication Number Publication Date
CN101556596A CN101556596A (en) 2009-10-14
CN101556596B true CN101556596B (en) 2012-04-18

Family

ID=41174713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009100051271A Active CN101556596B (en) 2007-08-31 2007-08-31 Input method system and intelligent word making method

Country Status (1)

Country Link
CN (1) CN101556596B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103246355B (en) * 2012-02-06 2017-04-05 百度在线网络技术(北京)有限公司 On-line input method evaluating method, system and device
CN103092928B (en) * 2012-12-31 2015-12-23 安徽科大讯飞信息科技股份有限公司 Voice inquiry method and system
CN104461042B (en) * 2013-09-16 2017-12-26 百度在线网络技术(北京)有限公司 Based on the Japanese input method and system for retracting key and carrying out automatically error correction
CN104714940A (en) * 2015-02-12 2015-06-17 深圳市前海安测信息技术有限公司 Method and device for identifying unregistered word in intelligent interaction system
CN106997245A (en) * 2016-01-24 2017-08-01 杨文韬 A kind of method that input method dictionary is built according to Chinese language model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1629836A (en) * 2003-12-17 2005-06-22 北京大学 Method and apparatus for learning Chinese new words
CN1912872A (en) * 2006-07-25 2007-02-14 北京搜狗科技发展有限公司 Method and system for abstracting new word
CN101013443A (en) * 2007-02-13 2007-08-08 北京搜狗科技发展有限公司 Intelligent word input method and input method system and updating method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JP特開2006-31143A 2006.02.02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant