CN1912872A - Method and system for abstracting new word - Google Patents

Method and system for abstracting new word

Publication number
CN1912872A
Authority
CN
China
Prior art keywords
character string
neologisms
dictionary
searching keyword
computing equipment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA200610103593XA
Other languages
Chinese (zh)
Other versions
CN100405371C (en)
Inventor
佟子健
郭奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Family has litigation
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CNB200610103593XA (patent CN100405371C)
Publication of CN1912872A
Priority to PCT/CN2007/070260 (patent WO2008014702A1)
Application granted
Publication of CN100405371C
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Abstract

The invention discloses a method for extracting new words, comprising: obtaining the query keyword strings of a search engine; determining the strings that satisfy a preset rule; counting the number of times each such string occurs in a preset web page database; and, if the count is greater than or equal to a preset threshold, outputting the string as a new word. The extracted new words can be used in the search field.

Description

A method and system for extracting new words
Technical field
The present invention relates to the field of Internet information processing, and in particular to a method and system for extracting new words from Internet information.
Background art
The emergence of the Internet has, to a large extent, revolutionized the development of spoken and written language. The sharp increase in text content and the appearance of entirely new content have carried language through a major change. People no longer read only the articles in newspapers and magazines; increasingly they read articles on the Internet. Over time, the text content on the Internet has grown ever richer, to a degree that traditional newspapers and magazines cannot match. As information spreads faster, new words and expressions propagate across the Internet at unusual speed, and a large number of new words can appear within a short time. In the past it was very difficult for an individual to publish an article in a newspaper or magazine. In the Internet era, anyone can publish their own views online, and the text they enter is increasingly personalized. As the number of Internet users grows, the volume of individually written text grows with it, and personalized new words keep emerging. For example, "internet" was not a word some years ago, but it is now widely used as one.
Because the word is the most basic unit of analysis in many language processing technologies, newly emerging words need to be obtained promptly and effectively to ensure the accuracy of those technologies. For example, vocabularies with different attributes are the basis of natural language understanding, machine translation, automatic summarization, and so on. In information retrieval, words are commonly used as the search unit in order to reduce redundancy in the retrieval results. In speech recognition, language models are usually built with the word as the lowest level of linguistic information, in order to resolve acoustic ambiguity at the single-character level.
However, because new words appear continually and are scattered through large and disorderly corpora, it is difficult to identify them promptly and effectively. The prior art generally collects new words manually and adds them to an existing dictionary.
For example, the administrators of a search website collect new words manually and add them to the custom dictionary that the website uses; or dictionary developers collect them manually and include them in the next version of a system dictionary (commonly used in fields such as input methods); or a public dictionary (for example, Purple Light) is set up, and Internet users or other members of the public manually accumulate new words and add them to it, which pools a large amount of manual effort. All of these approaches, however, are time-consuming, laborious, labor-intensive, and inefficient. A method that can obtain new words promptly and effectively from the mass of language in use is therefore urgently needed.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method and system for extracting new words that can automatically obtain the new words appearing on the Internet in a simple, convenient, timely, and effective manner.
To solve the above technical problem, the invention provides a method for extracting new words, comprising the following steps:
Obtaining the query keyword strings of a search engine;
Determining the strings that satisfy a preset rule;
Counting the number of times each string that satisfies the preset rule occurs in a preset web page database;
If the number of occurrences of a string that satisfies the preset rule is greater than or equal to a preset threshold, outputting that string as a new word.
Preferably, the strings that satisfy the preset rule are determined by the following steps: comparing the obtained query keyword strings with the entries recorded in an original dictionary; and removing the query keyword strings that are already recorded in the original dictionary.
Preferably, the step of determining the strings that satisfy the preset rule comprises: removing the query keyword strings whose number of occurrences in the query log is less than or equal to a preset threshold.
The step of determining the strings that satisfy the preset rule may also comprise: removing the query keyword strings whose length is not within a preset range; or removing the query keyword strings that do not conform to word-formation rules.
Preferably, the method for extracting new words further comprises: the search engine judging whether a query keyword string has a corresponding user click; if so, storing the query keyword string; if not, discarding it.
Preferably, the step of determining the strings that satisfy the preset rule comprises: removing invalid characters from the obtained query keyword strings; or splitting the obtained query keyword strings at separators.
Preferably, the method for extracting new words further comprises: generating a new-word dictionary from the output new words, or adding the obtained new words to the original dictionary, thereby obtaining a new-word dictionary or a new version of the full dictionary.
Preferably, the method for extracting new words further comprises: placing an input method system that includes a system dictionary on a first computing device, and placing the new-word dictionary or the new version of the full dictionary on a second computing device; the input method system connects from the first computing device to the second computing device to update its system dictionary.
Preferably, the method for extracting new words further comprises: placing on a first computing device the units of the input method system that receive the user's input and display the corresponding characters; and using the new-word dictionary or the new version of the full dictionary as the system dictionary of the input method system, the system dictionary being located on a second computing device; the input method system obtains the corresponding information from the system dictionary on the second computing device according to the user's input and displays the corresponding characters on the first computing device.
Preferably, the preset web page database is obtained by the following steps: assigning a weight to each web page; and storing in the web page database the web pages whose weight is greater than or equal to a preset threshold.
The present invention also provides a system for extracting new words, comprising:
An interface unit, configured to obtain the query keyword strings of a search engine;
A filter unit, configured to determine the strings that satisfy a preset rule;
A web page database, configured to store web page information;
A statistics unit, configured to count the number of times each string that satisfies the preset rule occurs in the preset web page database;
A new word determining unit, configured to judge whether the number of occurrences of a string that satisfies the preset rule is greater than or equal to a preset threshold and, if so, to output that string as a new word.
The filter unit may comprise any one or more of the following modules:
A comparison module, configured to compare the obtained query keyword strings with the entries recorded in the original dictionary;
An original dictionary filtering module, configured to remove the query keyword strings that are already recorded in the original dictionary;
A frequency filtering module, configured to remove the query keyword strings whose number of occurrences in the query log is less than or equal to a preset threshold;
A length filtering module, configured to remove the query keyword strings whose length is greater than or equal to a preset threshold;
A word-formation filtering module, configured to remove the query keyword strings that do not conform to word-formation rules;
An invalid character filtering module, configured to remove invalid characters from the obtained query keyword strings;
A splitting module, configured to split the obtained query keyword strings at separators.
Preferably, the system for extracting new words further comprises: a dictionary management unit, configured to generate a new-word dictionary from the obtained new words or to add the obtained new words to the original dictionary.
Preferably, the dictionary management unit is located on a second computing device, and the system for extracting new words further comprises: an input method unit, located on a first computing device and provided with a system dictionary; the input method unit connects from the first computing device to the dictionary management unit to update the system dictionary.
Preferably, the dictionary management unit is located on a second computing device, and the system for extracting new words further comprises: an input method receiving module, configured to receive the user's input and located on a first computing device; and an input method display module, configured to display the corresponding characters and located on the first computing device. The input method receiving module, the input method display module, and the dictionary management unit are connected; according to the user's input, the corresponding information is obtained from the dictionary management unit and the corresponding characters are displayed on the first computing device.
Preferably, the system for extracting new words further comprises: a web page database generation unit, configured to assign a weight to each web page and to store in the web page database the web pages whose weight is greater than or equal to a preset threshold.
Compared with the prior art, the present invention has the following advantages:
First, because the invention takes the keywords in the query log of an Internet search engine as the source of new words, it greatly reduces the workload of the initial analysis (especially compared with analyzing a corpus directly). The keywords in a search engine's query log also accurately reflect trends in how people use language, so nearly all new Internet words appear there. Taking these keywords as the source therefore ensures that the new words obtained are representative and comprehensive.
Second, the invention can filter and screen the keyword strings in the query log, for example by removing strings that already exist in the existing dictionary, strings that are searched infrequently, strings whose form does not conform, and strings for which the user did not click anything on the results page after entering them. Some of these filtering and screening rules can be applied directly when the query log is stored, for example by removing invalid characters such as punctuation at storage time, or by storing only the keyword strings for which the user clicked a result. Used alone or in combination, these filtering rules greatly increase the speed and efficiency with which the invention obtains new words.
Third, the invention checks each string that may be a new word against a selected web page database provided by the invention; if the string's frequency in that database is greater than a preset threshold, the string is taken to be a new word. This improves the accuracy of the extracted new words: they are genuinely new words in the linguistic sense, rather than words without general significance or erroneous words.
In addition, if the new words extracted by the method of the invention are applied in the search field, the precision and coverage of the search results improve when a user's query keyword string contains a new word. If they are applied in the input method field, users can enter new words faster and more accurately, obtaining the desired word as the first candidate or on the first page of candidates without a tedious candidate selection process.
Description of the drawings
Fig. 1 is a flowchart of the steps of Embodiment 1 of the invention;
Fig. 2 is a flowchart of the steps of Embodiment 2 of the invention;
Fig. 3 is a flowchart of the steps of Embodiment 3 of the invention;
Fig. 4 is a flowchart of the steps of Embodiment 4, in which the invention is applied to an input method;
Fig. 5 is a flowchart of the steps of Embodiment 5, in which the invention is applied to an input method;
Fig. 6 is a structural block diagram of the new word extraction system of the invention;
Fig. 7a and Fig. 7b are block diagrams of the system of Fig. 6 applied to an input method.
Detailed description of the embodiments
To make the above objects, features, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
The core idea of the invention is to take the query keyword strings of a search engine that people use frequently as the source of new words, to pick out the strings that satisfy certain requirements, and then to count how often those strings occur on the Internet. If a string is used frequently, it is considered a new word. Applying these steps to all query keyword strings in turn yields the new words currently in use on the Internet. Because new words keep appearing as society develops, the automatic extraction needs to be run periodically.
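The core idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `extract_new_words`, the `passes_rules` callback, and the representation of the web page database as a list of page texts are all hypothetical, and occurrences are counted with plain substring matching.

```python
def extract_new_words(query_keywords, passes_rules, page_texts, page_threshold):
    """Return the candidate strings that occur at least page_threshold times in page_texts."""
    # Obtain the distinct query keyword strings that satisfy the preset rule.
    candidates = {kw for kw in query_keywords if passes_rules(kw)}
    new_words = []
    for candidate in sorted(candidates):
        # Count occurrences of the candidate across the (assumed in-memory) page database.
        occurrences = sum(text.count(candidate) for text in page_texts)
        # Output the candidate as a new word if the count reaches the threshold.
        if occurrences >= page_threshold:
            new_words.append(candidate)
    return new_words
```

A frequently queried string that rarely appears in page content is rejected at the last step, which is what keeps typos and one-off strings out of the result.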
The method of the invention can be applied to various languages, for example Chinese, Japanese, Korean, and English, because the extraction process does not need to perform word segmentation. Because the application flow is similar in all of these languages, only the application of the invention to Chinese is described below, for convenience.
Referring to Fig. 1, the flowchart of Embodiment 1 of the invention comprises the following steps:
Step 101: obtain the query keyword strings of a search engine.
Search engines play an extremely important role in spreading information on the Internet. For example, the Baidu search engine can search more than one hundred million Chinese web pages and accepts Chinese search requests from countries all over the world. Each year it responds to billions of searches, and tens of millions of Internet users search through Baidu.
As another example, Google has developed what is currently the largest search engine in the world and provides a very convenient way to query network information. By organizing more than 4 billion web pages, Google can provide suitable search results to users all over the world. Google now serves more than 200 million queries a day.
In short, search engines are used more and more widely on the Internet, and because they face users directly, they are a reliable and accurate source of new words.
When searching, a user enters a query keyword string, and the search engine looks through the vast body of web page information for the pages that best match it. The query keyword string may be a short string, for example "flower"; it may be a long string, for example "what is organization unit's code card"; or it may be several strings joined by separators, for example "bar romantic quiet Beijing", which is a query keyword string formed from four strings joined by spaces.
Each time a user runs a query on the search engine, a record is generated in the query log on the server. The following is a fragment of a query log containing one record, which corresponds to one use of the search engine by a user and comprises:
Date: Sun Dec 1 00:00:05 2002
IP: 61.183.77.20
Cache hit: cache
Search key: the United Nations
Number of results found: 7
In the cache hit field, "cache" indicates a hit and "database" indicates a miss. The query keyword string is "the United Nations". A log record of this kind is generated every time a user uses the search engine. The invention uses these records to extract new words.
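Reading such a record can be sketched as follows. This is only an illustration: the patent does not specify a log file format, so the "Label: value" layout, the function name `parse_log_record`, and the sample text are assumptions based on the fragment shown above.

```python
def parse_log_record(record_text):
    """Parse the 'Label: value' lines of one query-log record into a dict."""
    fields = {}
    for line in record_text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        # Split on the first colon only, because values such as times contain colons.
        label, value = line.split(":", 1)
        fields[label.strip()] = value.strip()
    return fields

# Hypothetical sample using three of the fields from the fragment above.
SAMPLE_RECORD = """Date: Sun Dec 1 00:00:05 2002
IP:61.183.77.20
Search key: the United Nations"""
```

With this sketch, `parse_log_record(SAMPLE_RECORD)["Search key"]` gives the query keyword string that the later steps work on.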
In actual log records, how the keyword string is stored depends on the settings of the particular search engine. For example, the query "what is organization unit's code card?" may be stored as "what is organization unit's code card", with only the punctuation removed; or it may be stored as "organization unit's code card", with characters that are invalid for searching, such as interrogatives, also removed. Likewise, the query "bar romantic quiet Beijing" may be stored as is, or it may be stored as the separate keyword strings "bar", "romantic", "quiet", and "Beijing", because the user joined them with separators.
Preferably, the search engine can also be set to store only the query keyword strings for which the user clicked on one of the search results. If the user does not click on any of the results, the query keyword string may have been mistyped, or there may be no relevant information for it on the Internet, so the string is discarded and not stored. That is, the search engine judges whether the query keyword string has a corresponding user click; if so, it stores the string; if not, it discards the string. This reduces the volume of log storage and the amount of data to be analyzed for new word discovery, and thus improves efficiency.
The step of removing query keyword strings without a corresponding click can also be treated as one case of the preset filtering rules in step 102. For example, the query keyword string and its corresponding click count or clicked links are obtained from the log record; if there is no corresponding click, the string is removed and excluded from the subsequent analysis steps.
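The click-based rule can be sketched as below, assuming each log record has already been parsed into a dict. The record keys `keyword` and `clicks` and the function name are hypothetical; the patent only says that a click count or clicked links are available from the log.

```python
def keywords_with_clicks(log_records):
    """Return the keywords of records whose results page received at least one click."""
    # A missing click count is treated as zero clicks, so the keyword is dropped.
    return [rec["keyword"] for rec in log_records if rec.get("clicks", 0) > 0]
```

Whether this runs at storage time or as a filter in step 102 does not change the logic; it only changes how much data reaches the later steps.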
The above briefly describes some possible ways of storing query keyword strings. Those skilled in the art can configure this according to the specific application, and the invention places no limitation on it.
Step 102: determine the strings that satisfy a preset rule.
The preset rule can be any rule; those skilled in the art can set it as needed or according to experience, and the invention places no limitation on it. Some preferred filtering rules are described below.
Preferably, the invention can determine the strings that satisfy the preset rule by the following steps: comparing the obtained query keyword strings with the entries recorded in the original dictionary; and removing the query keyword strings that are already recorded there, leaving the strings that are not recorded in the original dictionary. Here the preset rule is that the string has no record in the original dictionary. Filtering out existing vocabulary avoids invalid and repeated computation, which saves resources and improves efficiency.
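A minimal sketch of the dictionary comparison follows, assuming the original dictionary fits in memory as a collection of entry strings. The function name `remove_known_entries` is hypothetical, and the deduplication of repeated queries is an added convenience rather than something the patent prescribes.

```python
def remove_known_entries(query_strings, dictionary_entries):
    """Return the distinct query strings that are not recorded in the dictionary."""
    known = set(dictionary_entries)  # a set keeps each membership check cheap
    seen = set()
    unknown = []
    for s in query_strings:
        # Skip strings already in the dictionary and strings already emitted.
        if s not in known and s not in seen:
            seen.add(s)
            unknown.append(s)
    return unknown
```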
Preferably, the invention can also determine the strings that satisfy the preset rule by the following step: removing the query keyword strings whose number of occurrences in the query log is less than or equal to a preset threshold. This rule eliminates queries that do not appear often enough in the query log, so that only sufficiently frequent query keyword strings go on to the next step. Infrequent queries are generally useless or of little value for discovering new words in the linguistic sense, so this rule further reduces the amount of computation.
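The frequency rule can be sketched with a counter over the logged keywords. The function name and the sorted return order are illustrative choices, not part of the patent; the comparison follows the text, which removes strings whose count is less than or equal to the threshold.

```python
from collections import Counter

def frequent_queries(logged_keywords, threshold):
    """Return the keywords that occur more than `threshold` times in the query log."""
    counts = Counter(logged_keywords)
    # Strings with a count <= threshold are removed, so only counts > threshold remain.
    return sorted(kw for kw, n in counts.items() if n > threshold)
```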
Preferably, the invention can also determine the strings that satisfy the preset rule by the following steps: removing invalid characters from the obtained query keyword strings; or splitting the obtained query keyword strings at separators to obtain several strings. In some query logs, the "search key" field stores "what is organization unit's code card" rather than "organization unit's code card", or stores "bar romantic quiet Beijing" rather than "bar", "romantic", "quiet", and "Beijing". In this step, therefore, the invalid characters "what" and "is" can be removed, or "bar romantic quiet Beijing" can be split at the separator (a space) into independent keywords, before the next step is carried out.
Invalid characters are removed because they are not new words of interest to the invention. For example, "a" and "the" in English, and function words in Chinese and Japanese such as the word for "what", are all invalid function characters. These characters usually cannot form the beginning or end of a new word, so they can be removed directly regardless of their word frequency, without affecting the new word extraction of the invention.
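The two operations above can be sketched as follows, using the English renderings of the example queries. The word list `INVALID_WORDS`, the punctuation set, and both function names are hypothetical; a real system for Chinese would use its own lists of function characters.

```python
INVALID_WORDS = {"what", "is", "a", "the"}  # hypothetical list of function words
EDGE_PUNCTUATION = " ?!.,;:"                # hypothetical set of characters to trim

def remove_invalid(query):
    """Trim punctuation, then drop invalid words from the start and end of the query."""
    words = query.strip(EDGE_PUNCTUATION).split()
    # Invalid words cannot begin or end a new word, so strip them from both ends.
    while words and words[0].lower() in INVALID_WORDS:
        words.pop(0)
    while words and words[-1].lower() in INVALID_WORDS:
        words.pop()
    return " ".join(words)

def split_by_separator(query):
    """Split a query into independent keyword strings at whitespace separators."""
    return query.split()
```

In this sketch, `remove_invalid` handles the first example query and `split_by_separator` handles the second; which of the two applies to a given query is left to the caller.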
The preset rules listed above are only a few preferred filters; those skilled in the art can choose among them or define other preset rules. They can use a single filtering rule or several. When several rules are used, the filtering steps can be combined freely and in any order.
For example, other possible filtering rules are: removing the query keyword strings whose length is greater than or equal to a preset threshold, which removes queries that are whole sentences; removing the query keyword strings whose length is less than or equal to a preset threshold, which removes queries that are too short (for example, accidental submissions); removing the query keyword strings that do not conform to word-formation rules, which removes queries that obviously violate general word-formation principles; and even approximate filtering performed manually. By combining and selecting preset rules appropriately for the situation (for example, applying different filtering strategies at different times), those skilled in the art can obtain the best technical effect from the invention.
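Combining several rules in any order can be sketched by treating each rule as a predicate and keeping the strings that every predicate accepts. The helper names `length_rule` and `apply_rules` and the specific bounds in the usage example are assumptions for illustration.

```python
def length_rule(too_short, too_long):
    """Build a rule that rejects strings whose length is <= too_short or >= too_long."""
    return lambda s: too_short < len(s) < too_long

def apply_rules(candidates, rules):
    """Keep the candidates accepted by every rule; rule order does not change the result."""
    return [c for c in candidates if all(rule(c) for rule in rules)]
```

For instance, `apply_rules(strings, [length_rule(1, 6), lambda s: s not in known_words])` would apply a length filter together with a dictionary filter, where `known_words` is a hypothetical set of existing entries.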
In addition, for languages without natural separators, such as Chinese, Japanese, and Korean, the following question may arise: for a query such as "organization unit's code card", should word segmentation also be performed, so that shorter strings (generally four characters or fewer) go on to the next step? The invention places no limitation on this; segmentation may or may not be performed. Segmenting a long string before the next step can reduce the amount of computation, because if every segment is already a known word, the subsequent new word determination can be skipped; but it may make the extraction incomplete. Not segmenting increases the amount of computation but makes the extraction more comprehensive. Those skilled in the art can choose according to their specific needs.
Step 103: count the number of times each string that satisfies the preset rule occurs in the preset web page database.
The preset web page database can be a database of representative web page information selected by those skilled in the art. Preferably, the invention obtains the preset web page database by the following steps: assigning a weight to each web page; and storing in the web page database the web pages whose weight is greater than or equal to a preset threshold. Counting the occurrences of a string that satisfies the preset rule is a word frequency count.
In the weighting step, an important case is assigning the weight according to the time at which the web page was created and the type of the web page. The page time strongly affects the word frequency count, so it also strongly affects the weight: the further the page time is from the time of the count, the lower the weight. If the time difference exceeds a certain value, the page can be given a low weight or even excluded from the word frequency count. The page type also strongly affects the count. Page type here generally means portal sites, forums, or certain other designated pages. These pages receive higher weights because they have many participants and their information is updated quickly, so they reflect the latest trends in word frequency well. Page type can be judged with a rule base that stores the URLs of certain pages; the pages at these URLs are deemed important for the word frequency count, the words on them can be counted preferentially, and they are given larger weights.
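One possible weighting scheme is sketched below. The patent does not give a formula, so the linear decay with age, the fixed bonus for URLs in the rule base, the one-year cutoff, and all names (`page_weight`, `select_pages`, `trusted_urls`, `max_age_days`) are assumptions chosen only to make the idea concrete.

```python
from datetime import date

def page_weight(url, page_date, today, trusted_urls, max_age_days=365):
    """Illustrative weight: newer pages and pages listed in the rule base score higher."""
    age_days = (today - page_date).days
    if age_days > max_age_days:
        return 0.0  # too old: effectively excluded from the word frequency count
    weight = 1.0 - age_days / max_age_days  # assumed linear decay with page age
    if url in trusted_urls:
        weight += 1.0  # assumed bonus for URLs stored in the rule base
    return weight

def select_pages(pages, today, trusted_urls, weight_threshold):
    """Keep the (url, page_date, text) tuples whose weight reaches the threshold."""
    return [p for p in pages
            if page_weight(p[0], p[1], today, trusted_urls) >= weight_threshold]
```

The pages returned by `select_pages` would form the preset web page database over which occurrences are counted in step 103.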
Secondly, the present invention can also remove some repeated pages, yellow webpage and spam page by the mode of giving low weighted value, thereby can further guarantee the accuracy of neologisms checking.
Moreover, owing to want conceivable result more accurate, just need the vocabulary of statistics all is user " input behavior " as far as possible, therefore the present invention can also handle the above selected page that comes out again, for example, remove the redundant information of the page etc., described page redundant information generally all is some invalid informations; Will not increase the calculated amount that neologisms extract if do not remove, and cause the word frequency that comes out not objective, the result is inaccurate.
Step 104: if the number of occurrences of a character string that meets the preset rules is greater than or equal to a preset threshold, the character string is output as a new word. Setting a threshold removes character strings with a low frequency of occurrence, so that words without general significance, or erroneous words, are not extracted as new words. For example, a misspelled form of the word "Yoga" appears very frequently in the query log but rarely in internet page information, because it is a typing error; the correct form, "Yoga", appears very frequently in internet page information. The present invention thus avoids extracting the misspelled form of "Yoga" as a new word.
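The counting and threshold test of step 104 can be sketched as follows. This is only an illustration: the function names, the in-memory list of page texts and the sample strings (an ASCII stand-in for the misspelling example) are assumptions, and a real page database would not be scanned this way.

```python
from collections import Counter


def count_in_pages(candidates, pages):
    """Count occurrences of each candidate string across all page texts."""
    counts = Counter()
    for text in pages:
        for cand in candidates:
            # str.count returns the number of non-overlapping occurrences.
            counts[cand] += text.count(cand)
    return counts


def select_new_words(candidates, pages, threshold):
    """Keep candidates whose page count is greater than or equal to threshold."""
    counts = count_in_pages(candidates, pages)
    return [c for c in candidates if counts[c] >= threshold]


# Toy data: "yogga" stands in for a misspelling that is common in the
# query log but absent from the pages.
pages = ["yoga class", "yoga mat yoga"]
print(select_new_words(["yoga", "yogga"], pages, threshold=2))
```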
With reference to Fig. 2, which is a flow chart of the steps of Embodiment 2 of the present invention, the method comprises the following steps:
Step 201: obtain the query keyword strings of a search engine;
Step 202: remove the query keyword strings whose number of occurrences in the query log is less than or equal to a preset threshold;
Step 203: count the number of times each character string that meets the preset rules occurs in a preset internet page database;
Step 204: if the number of occurrences of a character string that meets the preset rules is greater than or equal to a preset threshold, output the character string to step 205;
Step 205: compare the character strings obtained with the entry records in the original dictionary; remove the character strings already recorded in the original dictionary, and output the character strings not recorded in the original dictionary as new words.
Because the complete flow of steps of the present invention has been described in detail in connection with Fig. 1, for details not given in Embodiment 2, refer to the foregoing description.
The foregoing embodiment also achieves the technical effect of the present invention, although this ordering of steps considerably increases the amount of computation. It is described here because it demonstrates that the filtering rules of the present invention can be selected and ordered arbitrarily. In theory, even if no filtering rule is set at all, the present invention can still achieve its purpose of extracting new words.
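The point about ordering can be illustrated with a small sketch in which the dictionary comparison runs either before or after the page count. The function `extract`, its parameters and the toy data are hypothetical. Both orders give the same output, but the dictionary-last order of Embodiment 2 counts more strings in the page database.

```python
def extract(queries, pages, dictionary, threshold, dictionary_first):
    """Run the dictionary filter and the page-frequency filter in either order."""

    def not_in_dictionary(strings):
        return [s for s in strings if s not in dictionary]

    def frequent_in_pages(strings):
        return [s for s in strings
                if sum(p.count(s) for p in pages) >= threshold]

    steps = ([not_in_dictionary, frequent_in_pages] if dictionary_first
             else [frequent_in_pages, not_in_dictionary])
    result = list(dict.fromkeys(queries))  # de-duplicate, keep first-seen order
    for step in steps:
        result = step(result)
    return result


queries = ["alpha", "beta", "gamma", "beta"]
pages = ["alpha beta beta", "beta gamma alpha"]
first = extract(queries, pages, {"alpha"}, 2, dictionary_first=True)
last = extract(queries, pages, {"alpha"}, 2, dictionary_first=False)
print(first == last)
```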
With reference to Fig. 3, which is a flow chart of the steps of Embodiment 3 of the present invention, the method comprises the following steps:
Step 301: obtain the query keyword strings of a search engine;
Step 302: remove the invalid characters in the query keyword strings obtained; or split the query keyword strings obtained according to separators;
Step 303: remove the query keyword strings whose number of occurrences in the query log is less than or equal to a preset threshold; for example, "eating shrimp", "mountain pigeon", etc. are removed;
Step 304: compare the query keyword strings obtained with the entry records in the original dictionary; remove the query keyword strings already recorded in the original dictionary; for example, "beauty", "chain", "fat-reducing", "South Korean TV soaps", etc. are removed;
Step 305: remove the query keyword strings whose length is greater than or equal to a preset threshold, and remove the query keyword strings that do not conform to the word-formation rules;
Step 306: remove the query keyword strings that the input method can compose through intelligent word formation; for example, "English-Speaking World", etc. are removed;
Intelligent word formation means that, for a word or phrase not in the dictionary, the input method can look up related characters and words and help the user compose the required word or phrase. Because input methods generally have this capability, such strings do not noticeably affect the user's input efficiency. The present invention therefore adds an intelligent word formation filtering rule in order to improve computational efficiency and reduce the amount of computation.
Step 307: count the number of times each character string that meets the preset rules occurs in a preset internet page database;
Step 308: if the number of occurrences of a character string that meets the preset rules is greater than or equal to a preset threshold, output the character string as a new word. For example, "little eyra", "anniversary of the founding of a school speech of congratulation", "Yoga", etc. are removed. With the present invention, the new words finally output are: "know that honor has a sense of shame", "it is permanent to build letter", "Qinghai-Tibet Railway", "stopping flat", "Hao Feier", "Podolski", "nine ministries and commissions", and so on.
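Steps 302 and 303 above can be sketched as follows. The sketch applies both the invalid-character removal and the separator split, whereas step 302 allows either. The separator set, the definition of an invalid character and the name `candidate_strings` are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed separator set and assumed definition of "invalid" characters.
SEPARATORS = re.compile(r"[\s,;+|]+")
INVALID = re.compile(r"[^\w\s,;+|]")


def candidate_strings(raw_queries, log_threshold):
    """Clean and split raw queries, then drop strings that are rare in the log."""
    counts = Counter()
    for raw in raw_queries:
        cleaned = INVALID.sub("", raw)           # step 302: drop invalid characters
        for part in SEPARATORS.split(cleaned):   # step 302: split on separators
            if part:
                counts[part] += 1
    # Step 303 removes strings whose count is less than or equal to the threshold.
    return [s for s, n in counts.items() if n > log_threshold]


print(candidate_strings(["foo bar", "foo!", "bar,baz", "foo"], log_threshold=1))
```

The remaining filters (dictionary comparison, length, word-formation rules) would then be applied to the returned list before the page count of step 307.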
The new words finally output show that the technical effect of the present invention is quite evident: up-to-date words such as "know that honor has a sense of shame" and "Qinghai-Tibet Railway", which appear often in the news, and "Hao Feier" and "Podolski", which appear often in sports and entertainment, have all been extracted. When the present invention is applied to an input method in particular, it can greatly improve the efficiency with which users enter new words. With existing input methods it is difficult to find the correct characters for these new words at the first attempt or on the first page of candidate words. Once the new words extracted by the present invention are added to the input method dictionary, the user can find the correct characters at the first attempt or on the first page of candidate words, which greatly improves the user's input efficiency. The above is a relatively complex implementation of the present invention; for details not given here, refer to the related content above.
With reference to Fig. 4, which is a flow chart of the steps of Embodiment 4, in which the present invention is applied to an input method: assuming that new words have been extracted by the present invention, the update process may comprise the following steps:
Step 401: generate a new dictionary from the new words obtained, or add the new words obtained to the original dictionary, thereby forming a new dictionary or a new-version full dictionary.
Step 402: place the input method system, which includes a system dictionary, in a first computing device, and place the new dictionary or new-version full dictionary obtained in step 401 in a second computing device;
Step 403: the input method system connects from the first computing device to the second computing device and completes the update of the system dictionary.
The second computing device storing the new dictionary or new-version full dictionary may exist in the network as a server and provide a dictionary update service to any client program that needs new-word information for an input method. Of course, it need not take the form of a fixed server. It may also reside in a local computing device and provide the dictionary update service, through P2P (peer-to-peer) technology, to any client program on other terminals that needs new-word information for an input method.
In the above update embodiment, the update may be performed in any of the following ways: the system dictionary is updated together with the input method system when the latter is updated; or the system dictionary is updated online by active push from the server; or the user initiates a request and the server returns data in response, with which the system dictionary is updated. Of course, updating via removable storage or via a version upgrade may also be used. In short, any data update method may be used; the present invention places no limitation on this, and those skilled in the art may choose as needed.
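The request-based update mode can be sketched as follows. This is a hypothetical in-process model, not a network protocol: the classes `DictionaryServer` and `InputMethodClient`, the integer version counter and the method names are all assumptions introduced for illustration.

```python
class DictionaryServer:
    """Stands in for the second computing device holding the new words."""

    def __init__(self):
        self.version = 0
        self._added = []  # list of (version, word) pairs

    def publish(self, words):
        """Record a batch of newly extracted words under a new version number."""
        self.version += 1
        self._added.extend((self.version, w) for w in words)

    def changes_since(self, client_version):
        """Return the current version and the words added after client_version."""
        words = [w for v, w in self._added if v > client_version]
        return self.version, words


class InputMethodClient:
    """Stands in for the input method system on the first computing device."""

    def __init__(self, system_dictionary):
        self.dictionary = set(system_dictionary)
        self.version = 0

    def update_from(self, server):
        """Request the missing words and merge them into the system dictionary."""
        self.version, words = server.changes_since(self.version)
        self.dictionary.update(words)
        return words
```

A push-based update would invert the call direction, with the server invoking the client whenever `publish` runs.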
With reference to Fig. 5, which is a flow chart of the steps of Embodiment 5, in which the present invention is applied to an input method: assuming that new words have been extracted by the present invention, the method comprises the following steps:
Step 501: generate a new dictionary from the new words obtained, or add the new words obtained to the original dictionary, thereby forming a new dictionary or a new-version full dictionary.
Step 502: place the unit of the input method system that receives the user's input information and displays the corresponding characters in a first computing device;
set the new dictionary or new-version full dictionary obtained in step 501 as the system dictionary of the input method system, and place the system dictionary in a second computing device.
Step 503: according to the information entered by the user, the input method system obtains the corresponding information from the system dictionary located in the second computing device and displays the corresponding characters on the first computing device, completing the text input.
In Embodiment 5, the new dictionary or new-version full dictionary obtained by the new-word extraction method of the present invention is used directly as the system dictionary of the input method system, so the dictionary can be used online and no update operation is needed. The input method system is divided into two parts: the receiving and display units are located in the first computing device, and the dictionary information is located in the second computing device, which realizes online use of the input method. Of course, the code matching process required by the input method system may be placed in either computing device as needed.
Preferably, the present invention can also be applied to the search field. When a user's query keyword string contains new words, the string can be segmented accurately according to the dictionary obtained by the new-word extraction method of the present invention, and the search is then performed on the segmentation result, which can improve the precision and coverage of the search results.
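The effect of new words on segmentation can be sketched with forward maximum matching. The present invention does not name a segmentation algorithm, so this choice is an assumption, as are the function name `segment` and the sample sentence, which uses 青藏铁路 (Qinghai-Tibet Railway) as the new word.

```python
def segment(text, dictionary):
    """Forward maximum matching: take the longest dictionary word at each position."""
    max_len = max((len(w) for w in dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible piece first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


old_dictionary = {"铁路", "通车"}
new_dictionary = old_dictionary | {"青藏铁路"}
print(segment("青藏铁路通车", old_dictionary))  # new word is broken into pieces
print(segment("青藏铁路通车", new_dictionary))  # new word is kept whole
```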
With reference to Fig. 6, which is a block diagram of the new-word extraction system of the present invention, the system comprises the following components:
Interface unit 601, used to obtain the query keyword strings of a search engine;
Filter unit 602, used to determine the character strings that meet the preset rules;
Internet page database 603, used to store internet page information;
Statistics unit 604, used to count the number of times each character string that meets the preset rules occurs in the preset internet page database;
New-word determining unit 605, used to judge whether the number of occurrences of a character string that meets the preset rules is greater than or equal to a preset threshold; if so, the character string is output as a new word.
Preferably, the system may further comprise an internet page database generation unit 607, used to assign weights to internet pages and to store the internet pages whose weight is greater than or equal to a preset threshold in the internet page database. Of course, the selected internet page database may also be set up through manual management.
Preferably, the system may further comprise a dictionary management unit 606, used to generate a new dictionary from the new words obtained or to add the new words obtained to the original dictionary, thereby obtaining a new dictionary or a new-version full dictionary.
The filter unit 602 may comprise the following modules:
Comparison module, used to compare the query keyword strings obtained with the entry records in the original dictionary; original dictionary filtering module, used to remove the query keyword strings already recorded in the original dictionary.
Frequency filtering module, used to remove the query keyword strings whose number of occurrences in the query log is less than or equal to a preset threshold.
Length filtering module, used to remove the query keyword strings whose length is not within a preset range.
Word-formation filtering module, used to remove the query keyword strings that do not conform to the word-formation rules.
Invalid-character filtering module, used to remove the invalid characters in the query keyword strings obtained; or splitting module, used to split the query keyword strings obtained according to separators.
The filter unit may use any one of the above modules alone or any combination of them.
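The "one alone or any combination" design of the filter unit can be sketched by modelling each module as a predicate. All names below (`not_in_dictionary`, `length_in_range`, `frequent_in_log`, `apply_filters`) and the sample values are hypothetical.

```python
def not_in_dictionary(dictionary):
    """Original dictionary filter: keep strings not yet recorded."""
    return lambda s: s not in dictionary


def length_in_range(lo, hi):
    """Length filter: keep strings whose length lies within [lo, hi]."""
    return lambda s: lo <= len(s) <= hi


def frequent_in_log(log_counts, threshold):
    """Frequency filter: keep strings seen more than threshold times in the log."""
    return lambda s: log_counts.get(s, 0) > threshold


def apply_filters(strings, filters):
    """Keep the strings accepted by every selected filter (none selected keeps all)."""
    return [s for s in strings if all(f(s) for f in filters)]


strings = ["ab", "abcdefgh", "cd", "ef"]
log_counts = {"ab": 5, "abcdefgh": 9, "cd": 7, "ef": 1}
selected = [not_in_dictionary({"cd"}), length_in_range(2, 4),
            frequent_in_log(log_counts, 1)]
print(apply_filters(strings, selected))
```

Passing a shorter list of predicates corresponds to enabling fewer modules; an empty list corresponds to the case with no filtering rule at all.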
Fig. 7a and Fig. 7b show the application, in an input method, of the new words obtained by the method of the present invention.
With reference to Fig. 7a, the new words can be applied to updating the system dictionary of an input method, wherein the dictionary management unit 701 is located in a second computing device; the input method unit 702 is located in a first computing device and contains a system dictionary; and the input method unit 702 connects from the first computing device to the dictionary management unit and completes the update of the system dictionary.
With reference to Fig. 7b, the new words can be applied to an input method system used online, wherein the dictionary management unit 701 is located in a second computing device;
Input method receiving module 7021, used to receive the user's input information, is located in a first computing device;
Input method display module 7022, used to display the corresponding characters, is located in the first computing device;
The input method receiving module 7021, the input method display module 7022 and the dictionary management unit 701 are connected; according to the information entered by the user, the corresponding information is obtained from the dictionary management unit 701 and the corresponding characters are displayed on the first computing device.
The code matching module 7023 that the input method system also requires may be placed wherever needed. For example, in the embodiment shown in Fig. 7, the input receiving module 7021 and the display module 7022 are located in the user's first computing device, and the other functional modules, including the code matching module 7023, are located in the second computing device; the first computing device connects to the second computing device and the two complete the input process together. In the case shown in Fig. 7b, the new dictionary is not provided to the user locally; instead, the relevant information is obtained from the network each time the user enters input.
Because existing input methods rely on analysis of a closed corpus, they cannot know new words and can only learn and remember them gradually as the user selects them during input. When the user needs to enter new words, letting the input method update the user dictionary through self-learning makes later input faster. However, as information expands, new words appear more and more often, and the tedious word-selection process during input becomes more and more frequent. Applying the new-word extraction method of the present invention to an input method solves this problem well.
Because the present invention uses a word frequency statistics technique based on internet information and takes the query log of a search engine as the source of new words, a large number of new words in use on the internet can be obtained conveniently. These new words are in turn continuously supplied to input method users, so that users keep pace with changes in internet information and can enter new words without going through the tedious word-selection process each time. New words can thus become the user's first-choice candidates, which raises the first-candidate hit rate when the user enters new words.
Because the length of this description is limited, the method is described in more detail and the system in less detail; for the latter, refer to the relevant parts above.
The method and system for extracting new words provided by the present invention have been described in detail above. Specific examples have been used herein to set out the principle and embodiments of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.

Claims (19)

1. A method for extracting new words, characterized by comprising the following steps:
obtaining the query keyword strings of a search engine;
determining the character strings that meet preset rules;
counting the number of times each character string that meets the preset rules occurs in a preset internet page database;
if the number of occurrences of a character string that meets the preset rules is greater than or equal to a preset threshold, outputting the character string as a new word.
2. The method for extracting new words according to claim 1, characterized in that the character strings that meet the preset rules are determined by the following steps:
comparing the query keyword strings obtained with the entry records in an original dictionary;
removing the query keyword strings already recorded in the original dictionary.
3. The method for extracting new words according to claim 1 or 2, characterized in that the step of determining the character strings that meet the preset rules comprises:
removing the query keyword strings whose number of occurrences in the query log is less than or equal to a preset threshold.
4. The method for extracting new words according to claim 2, characterized in that the step of determining the character strings that meet the preset rules further comprises:
removing the query keyword strings whose length is not within a preset range;
or removing the query keyword strings that do not conform to the word-formation rules.
5. The method for extracting new words according to claim 1, characterized by further comprising:
the search engine judging whether the query keyword string has a corresponding user click behavior;
if so, storing the query keyword string; if not, discarding the query keyword string.
6. The method for extracting new words according to claim 1, characterized in that the step of determining the character strings that meet the preset rules comprises:
removing the invalid characters in the query keyword strings obtained;
or splitting the query keyword strings obtained according to separators.
7. The method for extracting new words according to claim 1, characterized by further comprising:
generating a new dictionary from the new words output, or adding the new words obtained to the original dictionary, to obtain a new dictionary or a new-version full dictionary.
8. The method for extracting new words according to claim 7, characterized by further comprising:
placing an input method system that includes a system dictionary in a first computing device, and placing the new dictionary or new-version full dictionary in a second computing device;
the input method system connecting from the first computing device to the second computing device and completing the update of the system dictionary.
9. The method for extracting new words according to claim 7, characterized by further comprising:
placing the unit of an input method system that receives the user's input information and displays the corresponding characters in a first computing device;
setting the new dictionary or new-version full dictionary as the system dictionary of the input method system, the system dictionary being located in a second computing device;
the input method system obtaining, according to the information entered by the user, the corresponding information from the system dictionary located in the second computing device, and displaying the corresponding characters on the first computing device.
10. The method for extracting new words according to claim 1, characterized in that the preset internet page database is obtained by the following steps:
assigning weights to internet pages;
storing the internet pages whose weight is greater than or equal to a preset threshold in the internet page database.
11. A system for extracting new words, characterized by comprising:
an interface unit, used to obtain the query keyword strings of a search engine;
a filter unit, used to determine the character strings that meet preset rules;
an internet page database, used to store internet page information;
a statistics unit, used to count the number of times each character string that meets the preset rules occurs in the preset internet page database;
a new-word determining unit, used to judge whether the number of occurrences of a character string that meets the preset rules is greater than or equal to a preset threshold, and if so, to output the character string as a new word.
12. The system for extracting new words according to claim 11, characterized in that the filter unit comprises the following modules:
a comparison module, used to compare the query keyword strings obtained with the entry records in an original dictionary;
an original dictionary filtering module, used to remove the query keyword strings already recorded in the original dictionary.
13. The system for extracting new words according to claim 11 or 12, characterized in that the filter unit further comprises:
a frequency filtering module, used to remove the query keyword strings whose number of occurrences in the query log is less than or equal to a preset threshold.
14. The system for extracting new words according to claim 12, characterized in that the filter unit further comprises:
a length filtering module, used to remove the query keyword strings whose length is greater than or equal to a preset threshold;
or a word-formation filtering module, used to remove the query keyword strings that do not conform to the word-formation rules.
15. The system for extracting new words according to claim 11, characterized in that the filter unit comprises the following modules:
an invalid-character filtering module, used to remove the invalid characters in the query keyword strings obtained;
or a splitting module, used to split the query keyword strings obtained according to separators.
16. The system for extracting new words according to claim 11, characterized by further comprising:
a dictionary management unit, used to generate a new dictionary from the new words obtained or to add the new words obtained to the original dictionary.
17. The system for extracting new words according to claim 16, characterized in that the dictionary management unit is located in a second computing device, and the system further comprises:
an input method unit, located in a first computing device and containing a system dictionary; the input method unit connects from the first computing device to the dictionary management unit and completes the update of the system dictionary.
18. The system for extracting new words according to claim 16, characterized in that the dictionary management unit is located in a second computing device, and the system further comprises:
an input method receiving module, used to receive the user's input information and located in a first computing device;
an input method display module, used to display the corresponding characters and located in the first computing device;
the input method receiving module, the input method display module and the dictionary management unit being connected, wherein, according to the information entered by the user, the corresponding information is obtained from the dictionary management unit and the corresponding characters are displayed on the first computing device.
19. The system for extracting new words according to claim 11, characterized by further comprising:
an internet page database generation unit, used to assign weights to internet pages and to store the internet pages whose weight is greater than or equal to a preset threshold in the internet page database.
CNB200610103593XA 2006-07-25 2006-07-25 Method and system for abstracting new word Active CN100405371C (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CNB200610103593XA CN100405371C (en) 2006-07-25 2006-07-25 Method and system for abstracting new word
PCT/CN2007/070260 WO2008014702A1 (en) 2006-07-25 2007-07-10 Method and system of extracting new words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB200610103593XA CN100405371C (en) 2006-07-25 2006-07-25 Method and system for abstracting new word

Publications (2)

Publication Number Publication Date
CN1912872A true CN1912872A (en) 2007-02-14
CN100405371C CN100405371C (en) 2008-07-23

Family

ID=37721812

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB200610103593XA Active CN100405371C (en) 2006-07-25 2006-07-25 Method and system for abstracting new word

Country Status (2)

Country Link
CN (1) CN100405371C (en)
WO (1) WO2008014702A1 (en)


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727271B (en) * 2008-10-22 2012-11-14 北京搜狗科技发展有限公司 Method and device for providing error correcting prompt and input method system
CN110245345A (en) * 2018-03-08 2019-09-17 普天信息技术有限公司 Participle processing method and device suitable for network neologisms
CN108984735B (en) * 2018-07-12 2019-08-13 广州资宝科技有限公司 Label Word library updating method, apparatus and electronic equipment
CN111611471B (en) * 2019-02-25 2023-12-26 阿里巴巴集团控股有限公司 Searching method and device and electronic equipment
CN110210028B (en) * 2019-05-30 2023-04-28 杭州远传新业科技股份有限公司 Method, device, equipment and medium for extracting domain feature words aiming at voice translation text
CN111354342B (en) * 2020-02-28 2023-07-25 科大讯飞股份有限公司 Personalized word stock updating method, device, equipment and storage medium
CN111581971B (en) * 2020-06-04 2024-01-23 腾讯科技(深圳)有限公司 Word stock updating method, device, terminal and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1226717C (en) * 2000-08-30 2005-11-09 国际商业机器公司 Automatic new term fetch method and system
CN1128404C (en) * 2000-11-16 2003-11-19 意蓝科技股份有限公司 computer new word learning method and system
JP2003228571A (en) * 2001-11-28 2003-08-15 Kyoji Umemura Method of counting appearance frequency of character string, and device for using the method
CN100485603C (en) * 2003-04-04 2009-05-06 雅虎公司 Systems and methods for generating concept units from search queries
CN100397392C (en) * 2003-12-17 2008-06-25 北京大学 Method and apparatus for learning Chinese new words
CN100555276C (en) * 2004-01-15 2009-10-28 中国科学院计算技术研究所 Method and system for detecting Chinese new words
US7478033B2 (en) * 2004-03-16 2009-01-13 Google Inc. Systems and methods for translating Chinese pinyin to Chinese characters
JP2006031108A (en) * 2004-07-12 2006-02-02 Shinichiro Fujitani System for retrieving merchandise/service on web

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008144964A1 (en) * 2007-06-01 2008-12-04 Google Inc. Detecting name entities and new words
US8010344B2 (en) 2007-06-14 2011-08-30 Google Inc. Dictionary word and phrase determination
WO2008151465A1 (en) * 2007-06-14 2008-12-18 Google Inc. Dictionary word and phrase determination
US8412517B2 (en) 2007-06-14 2013-04-02 Google Inc. Dictionary word and phrase determination
CN102124459B (en) * 2007-06-14 2013-06-12 谷歌股份有限公司 Dictionary word and phrase determination
CN101334774B (en) * 2007-06-29 2013-08-14 北京搜狗科技发展有限公司 Character input method and input method system
CN101556596B (en) * 2007-08-31 2012-04-18 北京搜狗科技发展有限公司 Input method system and intelligent word making method
CN100478961C (en) * 2007-09-17 2009-04-15 中国科学院计算技术研究所 New word of short-text discovering method and system
CN100489863C (en) * 2007-09-27 2009-05-20 中国科学院计算技术研究所 New word discovering method and system thereof
CN101645066B (en) * 2008-08-05 2011-08-24 北京大学 Method for monitoring novel words on Internet
CN102298581A (en) * 2010-06-23 2011-12-28 深圳市腾讯计算机系统有限公司 Method and device for processing input method word stock
CN102298581B (en) * 2010-06-23 2015-11-25 深圳市腾讯计算机系统有限公司 Method and device for processing input method word stock
CN102043843A (en) * 2010-12-08 2011-05-04 百度在线网络技术(北京)有限公司 Method and obtaining device for obtaining target entry based on target application
CN102955825B (en) * 2011-08-30 2016-04-06 北京搜狗科技发展有限公司 Method and system for updating input method lexicon
CN102955825A (en) * 2011-08-30 2013-03-06 北京搜狗科技发展有限公司 Method and system for updating input method lexicon
CN102855285A (en) * 2012-08-07 2013-01-02 网讯电通股份有限公司 Keyword management system and method for consultation service system
CN103870449B (en) * 2012-12-10 2018-06-12 百度国际科技(深圳)有限公司 Online automatic neologism excavating method and electronic device
CN103870449A (en) * 2012-12-10 2014-06-18 百度国际科技(深圳)有限公司 Online automatic neologism excavating method and electronic device
CN104216878A (en) * 2013-05-29 2014-12-17 酷盛(天津)科技有限公司 New word discovery system and method
CN103279565A (en) * 2013-06-14 2013-09-04 北京艾德思奇科技有限公司 Advertisement placement tracking method and system
CN103399766A (en) * 2013-07-29 2013-11-20 百度在线网络技术(北京)有限公司 Method and device for updating input method system
CN103399766B (en) * 2013-07-29 2016-05-11 百度在线网络技术(北京)有限公司 Method and device for updating input method system
US10558327B2 (en) 2013-07-29 2020-02-11 Baidu Online Network Technology (Beijing) Co., Ltd. Methods and devices for updating input method systems
CN104731766A (en) * 2013-12-20 2015-06-24 淘宝(中国)软件有限公司 Alphabetic writing lexicon establishing method, alphabetic writing lexicon establishing device, inputting method and inputting system
CN104091058A (en) * 2014-06-27 2014-10-08 北京君和信达科技有限公司 Safety inspection conclusion submitting method and device
WO2016127459A1 (en) * 2015-02-12 2016-08-18 深圳市前海安测信息技术有限公司 Method and device for recognizing unlogged word in intelligent interaction system
CN105095381A (en) * 2015-06-30 2015-11-25 北京奇虎科技有限公司 Method and device for new word identification
CN105095381B (en) * 2015-06-30 2019-06-25 北京奇虎科技有限公司 New word identification method and device
CN105550285A (en) * 2015-12-10 2016-05-04 北京奇虎科技有限公司 Method and device for building POI dictionary
CN105550285B (en) * 2015-12-10 2018-12-14 北京奇虎科技有限公司 Method and device for building POI dictionary
CN107391504A (en) * 2016-05-16 2017-11-24 华为技术有限公司 New word identification method and device
CN107391504B (en) * 2016-05-16 2021-01-29 华为技术有限公司 New word recognition method and device
CN106294650B (en) * 2016-08-03 2019-08-20 北京金和网络股份有限公司 New word mining method based on search event tracking
CN106294650A (en) * 2016-08-03 2017-01-04 北京金和网络股份有限公司 New word mining method based on search event tracking
CN108664141A (en) * 2017-03-31 2018-10-16 微软技术许可有限责任公司 Input method with document context self-learning function
CN108664141B (en) * 2017-03-31 2022-08-09 微软技术许可有限责任公司 Input method with document context self-learning function
CN108984513A (en) * 2017-06-05 2018-12-11 阿里巴巴集团控股有限公司 Word string recognition method and server
CN108984513B (en) * 2017-06-05 2022-03-04 阿里巴巴集团控股有限公司 Word string recognition method and server
CN107341251A (en) * 2017-07-10 2017-11-10 江西博瑞彤芸科技有限公司 Method for extracting and processing medical folk prescriptions and keywords
CN107423424A (en) * 2017-08-02 2017-12-01 合肥哲诚项目管理有限公司 A kind of work mark generation method and system based on network buzzword
CN109408794A (en) * 2017-08-17 2019-03-01 阿里巴巴集团控股有限公司 Frequency dictionary building method, word segmentation method, server and client device
CN109426357A (en) * 2017-09-01 2019-03-05 百度在线网络技术(北京)有限公司 Data inputting method and device
CN109426357B (en) * 2017-09-01 2023-05-12 百度在线网络技术(北京)有限公司 Information input method and device
CN108628832A (en) * 2018-05-08 2018-10-09 中国联合网络通信集团有限公司 Information keyword acquisition method and device

Also Published As

Publication number Publication date
WO2008014702A1 (en) 2008-02-07
CN100405371C (en) 2008-07-23

Similar Documents

Publication Publication Date Title
CN1912872A (en) Method and system for abstracting new word
CN1924858A (en) Method and device for fetching new words and input method system
CN101398834B (en) Processing method and device for input information and input method system
CN100339855C (en) Content management system
CN1920827A (en) Method for obtaining newly encoded character string, input method system and word stock generation device
CN109522011B (en) Code line recommendation method based on context depth perception of programming site
CN101788988B (en) Information extraction method
WO2008098502A1 (en) Method and device for creating index as well as method and system for retrieving
CN101452470A (en) Method and apparatus for a web search engine generating summary-style search results
JP6355840B2 (en) Stopword identification method and apparatus
CN1664818A (en) Word collection method and system for use in word-breaking
CN1871603A (en) System and method for processing a query
CN1873642A (en) Searching engine with automating sorting function
CN1617134A (en) System for identifying paraphrases using machine translation techniques
CN101196898A (en) Method for applying phrase index technology into internet search engine
CN101079064A (en) Web page sequencing method and device
CN1910573A (en) System for identifying and classifying denomination entity
CN1877583A (en) Accessing identification index system and accessing identification index library generation method
CN102053991A (en) Method and system for multi-language document retrieval
CN1687925A (en) Method for realizing bilingual web page searching
CN1410918A (en) Searching engine based on information extraction technique
CN109948154A (en) Name-based person acquisition and relationship recommendation system and method
CN111488453B (en) Resource grading method, device, equipment and storage medium
CN103226601B (en) Method and apparatus for picture searching
CN1838126A (en) Freely-inputted wireless short message matching and search engine information processing method, and apparatus therefor

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant