WO2008022581A1 - Method and device for obtaining the new words and input method system - Google Patents

Method and device for obtaining the new words and input method system

Info

Publication number
WO2008022581A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
user
words
new
frequency
Prior art date
Application number
PCT/CN2007/070419
Other languages
French (fr)
Chinese (zh)
Inventor
Qi Guo
Zijian Tong
Lei Yang
Original Assignee
Beijing Sogou Technology Development Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co., Ltd.
Publication of WO2008022581A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/242: Dictionaries

Definitions

  • The present invention relates to the field of Internet information processing, and in particular to a method for acquiring new words, a new word acquisition system, a new word acquisition device, and an input method system.
  • Because words are the most basic unit of analysis in many language processing technologies, new and emerging words must be obtained in a timely and effective manner to keep those technologies accurate. For example, vocabularies with different attributes are the basis for natural language understanding, machine translation, automatic abstracting, and so on.
  • In search, words are generally used as the search unit to reduce redundancy in the search results.
  • In speech recognition, words are usually used as the lowest level of linguistic information, and language models are built on words to resolve acoustic uncertainty at the word level.
  • The prior art generally relies on manually collecting new words and adding them to an existing vocabulary.
  • For example, new words are manually collected by the administrator of a search site and then added to the custom vocabulary used by the site; or they are manually collected by the lexicon developer and included in the next version of the system dictionary (usually for use in fields such as input methods); or a public vocabulary is set up (for example, Violet), to which netizens or other members of the public manually contribute new words, pooling a large amount of manual effort.
  • These methods are time-consuming, labor-intensive, and inefficient. There is therefore an urgent need for a method that can acquire new words from complex language use in a timely and effective manner.
  • Summary of the invention
  • The technical problem to be solved by the present invention is to provide a method and system for acquiring new words that can acquire new words frequently used by users in a simple, convenient, timely and effective manner, and that can effectively remove interfering words and provide relatively accurate new word output.
  • Another object of the present invention is to provide an input method system that can automatically acquire the personalized words of a user in a timely, convenient and effective manner, and acquire new words by collecting the personalized words of a plurality of users.
  • Another object of the present invention is to provide a new word acquisition apparatus which can provide a relatively accurate new word output with high efficiency.
  • Another object of the present invention is to provide a vocabulary generating method and a vocabulary generating apparatus which can provide a relatively accurate vocabulary or a new vocabulary with high efficiency.
  • The present invention provides a method for acquiring new words, comprising the steps of: acquiring the words selected by a user during the user input process; comparing the words selected by the user with existing words and obtaining the user's personalized words according to the comparison result; collecting the personalized words of each user; and obtaining new words from the personalized words.
  • The user word frequency may also be recorded during the user input process; the user word frequency is the frequency with which the user inputs the word.
  • The comparison may be: recording the words selected by the user in a user vocabulary, storing the existing words in the input method system vocabulary, and comparing the user vocabulary with the input method system vocabulary; or directly comparing each word the user selects with the existing words.
  • The following steps may be used to obtain the user's personalized words: determining whether the word selected by the user exists among the existing words; if not, determining that the word is a personalized word of the user.
  • The following steps may also be used to obtain the user's personalized words: determining whether the word selected by the user exists among the existing words; if not, further checking the user word frequency of the word; if that user word frequency is greater than or equal to a predetermined threshold, determining that the word is a personalized word.
  • The following steps may also be used to obtain the user's personalized words: determining whether the word selected by the user exists among the existing words; if not, determining that the word is a personalized word of the user; if it exists, further comparing the user word frequency of the word with its system word frequency, the system word frequency being the word frequency information preset in the input method system vocabulary for the existing word; if the ratio of the user word frequency to the system word frequency is greater than or equal to a predetermined threshold, determining that the word is a personalized word.
  • The user's personalized words may also be obtained by the following steps: determining whether the word selected by the user exists among the existing words; if not, further checking the user word frequency of the word, and if that user word frequency is greater than or equal to a predetermined threshold, determining that the word is a personalized word; if it exists, further comparing the user word frequency of the word with its system word frequency, the system word frequency being the word frequency information preset in the input method system vocabulary for the existing word, and if the ratio of the user word frequency to the system word frequency is greater than or equal to a predetermined threshold, determining that the word is a personalized word.
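The last of these variants (an absolute frequency test for unknown words, a ratio test for known words) can be sketched as follows. This is only an illustration: the function name, the parameter names, the default thresholds, and the handling of a zero system frequency are assumptions, and the patent text does not fix the units in which the two frequencies are expressed.

```python
def is_personalized_word(word, user_freq, system_freqs,
                         min_user_freq=3, min_ratio=5.0):
    """Decide whether `word` is a personalized word for one user.

    system_freqs maps each existing word to its preset system word frequency.
    The thresholds are illustrative defaults, not values from the patent.
    """
    system_freq = system_freqs.get(word)
    if system_freq is None or system_freq <= 0:
        # Word is absent from the system vocabulary (or has no usable
        # system frequency, an assumption): apply the absolute test.
        return user_freq >= min_user_freq
    # Word already exists: compare user frequency against system frequency.
    return user_freq / system_freq >= min_ratio
```

Dropping the `min_user_freq` test (that is, setting it to 1) gives the simpler variants in which any unknown word counts as a personalized word.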
  • The method for acquiring new words further includes: counting the number of times the personalized word appears in a preset Internet page database; if the number of occurrences is greater than or equal to a preset threshold, outputting the word as a new word.
  • The preset Internet page database is obtained by the following steps: assigning a weight to each Internet page; and storing the Internet pages whose weight is greater than or equal to a preset threshold in the Internet page database.
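A minimal sketch of these two steps, assuming pages are available as `(text, weight)` pairs and that occurrences are counted by plain substring matching (reasonable for unsegmented text, but an assumption here). How page weights are computed is not specified in this passage, so they are taken as given; all names are hypothetical.

```python
def build_page_database(pages, min_weight):
    """Keep only the text of pages whose weight reaches the threshold."""
    return [text for text, weight in pages if weight >= min_weight]


def select_new_words(personalized_words, page_db, min_count):
    """Output personalized words that occur often enough in the page database."""
    new_words = []
    for word in set(personalized_words):
        # str.count counts non-overlapping occurrences of the word in a page.
        occurrences = sum(page.count(word) for page in page_db)
        if occurrences >= min_count:
            new_words.append(word)
    return sorted(new_words)
```

For example, with pages `[("blog blog vlog", 0.9), ("vlog spam spam spam", 0.1), ("a vlog post", 0.8)]` and a weight threshold of 0.5, the low-weight page is excluded, so "spam" never reaches the count threshold.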
  • The collecting in the method for acquiring new words may be: the input method user's computing device sends the user's personalized words to the word collecting computing device in real time or periodically.
  • The method for acquiring new words further includes: generating a new thesaurus from the output new words, or adding the obtained new words to the original thesaurus, thereby obtaining a new thesaurus or a new version of the full thesaurus.
  • The invention also discloses a method for acquiring new words, comprising: acquiring the words selected by a user during the user input process; collecting the selected words of each user; comparing the words selected by the users with existing words and obtaining the users' personalized words according to the comparison result; and obtaining new words from the personalized words.
  • The invention also discloses a new word acquisition system based on an input method, comprising: a word extraction unit, connected to the input method system, for acquiring the words selected by a user during the user input process; a word comparison unit, connected to the word extraction unit, for comparing the selected words with existing words and obtaining the user's personalized words according to the comparison result; a collecting unit, for collecting the personalized words of each user; and a new word acquiring unit, for obtaining new words from the personalized words.
  • The invention also discloses another new word acquisition system based on an input method, comprising: a word extraction unit, connected to the input method system, for acquiring the words selected by a user during the user input process; a collecting unit, for collecting the selected words of each user; a word comparison unit, connected to the collecting unit, for comparing the selected words with existing words and obtaining the users' personalized words according to the comparison result; and a new word acquiring unit, for obtaining new words from the personalized words.
  • The invention also discloses an input method system, comprising an input interface unit, a display unit and a system vocabulary, and further comprising: a word extraction unit, connected to the input method system, for acquiring the words selected by the user during the user input process; and a word comparison unit, connected to the word extraction unit, for comparing the selected words with existing words and obtaining the user's personalized words according to the comparison result.
  • The input interface unit, the display unit, and the system vocabulary of the input method system may be located in the same computing device; alternatively, the input interface unit and the display unit are located in a first computing device and the system vocabulary is located in a second computing device, in which case the input method system obtains the corresponding information from the second computing device according to the information input by the user and displays the corresponding characters on the first computing device.
  • The input method system may further include: a communication unit, configured to send the personalized words.
  • The input method system may further include: a user vocabulary for storing the words selected by the user.
  • The input method system may further include: a word frequency recording unit, connected to the input method system, configured to record the user word frequency during the user input process, the user word frequency being the frequency with which the user inputs the word.
  • The word comparison unit may include: a first comparison subunit, configured to determine whether a word selected by the user exists among the existing words, and to output the word to the third comparison subunit if it exists or to the second comparison subunit if it does not; a second comparison subunit, configured, when the selected word does not exist among the existing words, to check the user word frequency of the word and, if it is greater than or equal to a predetermined threshold, to determine that the word is a personalized word; and a third comparison subunit, configured, when the selected word exists among the existing words, to compare the user word frequency of the word with its system word frequency, the system word frequency being the word frequency information preset in the input method system vocabulary for the existing word, and, if the ratio of the user word frequency to the system word frequency is greater than or equal to a predetermined threshold, to determine that the word is a personalized word.
  • The invention also discloses a new word obtaining device, comprising: a personalized word collecting unit, configured to collect the personalized words of each user; a statistical unit, configured to count the number of times each personalized word appears in a preset Internet page database; and a new word determining unit, connected to the statistical unit, for determining whether the number of occurrences of the personalized word is greater than or equal to a preset threshold and, if so, outputting the word as a new word.
  • The collecting consists in the user computing device sending the user's personalized words to the personalized word collecting unit in real time or periodically.
  • The new word obtaining device further includes: a thesaurus generating unit, configured to generate a new thesaurus from the output new words, or to add the obtained new words to the original thesaurus, thereby obtaining a new thesaurus or a new version of the full thesaurus.
  • The new word obtaining device further includes: an Internet page database generating unit, configured to assign a weight to each Internet page and to store the Internet pages whose weight is greater than or equal to a preset threshold in the Internet page database.
  • The invention also discloses a new word obtaining device, comprising: a word collecting unit, for collecting the selected words of each user; a word comparison unit, connected to the word collecting unit, for comparing the words selected by the users with existing words and obtaining the users' personalized words according to the comparison result; and a new word obtaining unit, configured to obtain new words from the personalized words.
  • The new word obtaining unit includes: a statistical subunit, configured to count the number of times the personalized word appears in the preset Internet page database; and a new word determining subunit, connected to the statistical subunit, for determining whether the number of occurrences of the personalized word is greater than or equal to a preset threshold and, if so, outputting the word as a new word.
  • The word collecting unit is further configured to collect the user word frequency corresponding to each word selected by the user; the new word acquiring device further includes: a statistical subunit, configured to count the number of occurrences of the personalized word in the preset Internet page database to obtain its Internet word frequency; a weight word frequency determining subunit, configured to weight the user word frequency and the Internet word frequency of the word and obtain its weight word frequency; and a new word determining subunit, configured to determine whether the weight word frequency of the personalized word is greater than or equal to a preset threshold and, if so, to output the word as a new word.
  • The invention also discloses a vocabulary generating method, comprising: collecting the input behavior information of each user, the input behavior information including the words selected during user input and the corresponding user word frequencies; weighting each user's word frequency and calculating the user cumulative word frequency of each word; and generating a vocabulary that includes the words and their corresponding user cumulative word frequencies.
  • The vocabulary generating method further includes: removing words whose user cumulative word frequency is less than or equal to a certain threshold.
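The accumulation and pruning described above can be sketched as below. The data layout (a mapping from user id to that user's word frequencies), the per-user weights, and all names are assumptions made for illustration; the passage does not say how the weights are chosen.

```python
from collections import defaultdict


def build_vocabulary(user_records, user_weights, min_cumulative):
    """Accumulate weighted user word frequencies and prune rare words.

    user_records: {user_id: {word: user word frequency}}
    user_weights: {user_id: weight}; users without an entry get weight 1.0
    """
    totals = defaultdict(float)
    for user_id, freqs in user_records.items():
        weight = user_weights.get(user_id, 1.0)
        for word, freq in freqs.items():
            totals[word] += weight * freq
    # Remove words whose cumulative frequency is <= the threshold.
    return {word: total for word, total in totals.items()
            if total > min_cumulative}
```

A lower weight for a user reduces the influence of that user's habits on the shared vocabulary, while the threshold filters out isolated typing errors.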
  • The method for generating a thesaurus further includes: comparing the generated thesaurus with the existing thesaurus, removing the words that do not conform to the preset rules according to the comparison result, and outputting the users' personalized words; and generating a personalized word thesaurus from those personalized words.
  • The method for generating a thesaurus further includes: comparing the generated thesaurus with the existing thesaurus, removing the words that do not conform to the preset rules according to the comparison result, and outputting the users' personalized words; counting the number of times each personalized word appears in the preset Internet page database to obtain its Internet word frequency; weighting and summing the user cumulative word frequency and the Internet word frequency of the personalized word to obtain its weight word frequency; if the weight word frequency of the word is greater than or equal to the preset threshold, outputting the word as a new word; and generating a new vocabulary from the output new words, the new vocabulary including each new word and its corresponding weight word frequency.
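The weighted sum and threshold test can be written as a short sketch. The two coefficients and the function name are hypothetical, since the text only says that the two frequencies are weighted and summed; a word missing from the page database is assumed to have an Internet word frequency of 0.

```python
def build_new_word_vocabulary(cumulative_freqs, internet_freqs,
                              user_coeff, internet_coeff, threshold):
    """Return {new word: weight word frequency} for words passing the threshold.

    weight word frequency = user_coeff * user cumulative word frequency
                          + internet_coeff * Internet word frequency
    """
    vocabulary = {}
    for word, user_freq in cumulative_freqs.items():
        internet_freq = internet_freqs.get(word, 0)  # assumption: absent means 0
        weighted = user_coeff * user_freq + internet_coeff * internet_freq
        if weighted >= threshold:
            vocabulary[word] = weighted
    return vocabulary
```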
  • The invention also discloses a vocabulary generating device, comprising: a collecting unit, configured to collect the input behavior information of each user, the input behavior information including the words selected during user input and the corresponding user word frequencies; a word frequency calculation unit, configured to weight each user word frequency corresponding to a word and calculate the cumulative word frequency of each word; and a thesaurus generating unit, configured to generate a thesaurus that includes the words and their corresponding cumulative word frequencies.
  • The thesaurus generating device further includes: a personalized word determining unit, configured to compare the generated thesaurus with the existing thesaurus, remove the words that do not conform to the preset rules according to the comparison result, and output the users' personalized words.
  • The thesaurus generating device may further include: a personalized word determining unit, configured to compare the generated thesaurus with the existing thesaurus, remove the words that do not conform to the preset rules according to the comparison result, and output the users' personalized words; a statistical unit, configured to count the number of occurrences of each personalized word in the preset Internet page database to obtain its Internet word frequency; a weight word frequency determining unit, configured to weight and sum the user cumulative word frequency and the Internet word frequency of the personalized word to obtain its weight word frequency; a new word determining unit, which outputs the word as a new word if its weight word frequency is greater than or equal to the preset threshold; and a generating unit, which generates a new vocabulary from the output new words, the new vocabulary including each new word and its corresponding weight word frequency.
  • The present invention has the following advantages:
  • The present invention proposes a distributed architecture comprising multiple users and a collection end; by collecting the input behavior information of multiple users, new words of general significance are obtained from the users' personalized words.
  • The new words found in Internet information or corpora are themselves produced by the usage behavior of individual users; the present invention therefore approaches the problem from the perspective of user input, obtaining more accurate, generally used words in a simple and convenient way.
  • The present invention further collects user word frequency information from the user input behavior, which makes it possible to remove interfering vocabulary such as user input errors, and also to discover new words of sociological significance, for example words identified through their user word frequency.
  • The invention can further check the collected user input behavior information against a selected Internet page database, count the occurrences there, and remove low-frequency vocabulary, thereby obtaining more accurate new words, that is, words that are genuinely new in the linguistic sense, while removing vocabulary that lacks general significance or is erroneous.
  • The invention can also compile the obtained new words into a new vocabulary or a new version of the full vocabulary and provide it to the input method. This improves the hit rate of the preferred word and the input speed, and makes the ordering of candidate words more reasonable, so that users can input new words faster and more accurately: the desired word appears as the first candidate or on the first candidate page, without a cumbersome candidate selection process.
  • The new thesaurus or the new version of the full thesaurus can also be provided to a search engine. When the user's query string includes new words, the accuracy and coverage of the search results can be improved.
  • Figure 1 is a flow chart showing the steps of Embodiment 1 of the present invention.
  • Figure 2 is a flow chart showing the steps of Embodiment 2 of the present invention.
  • Figure 3 is a flow chart showing the steps of Embodiment 3 of the present invention.
  • Figure 4 is a flow chart showing the steps of obtaining new words from the collected user personalized words.
  • Figure 5 is a block diagram showing the structure of an embodiment of an input method system of the present invention.
  • Figure 6 is a structural block diagram of a new word acquisition apparatus of the present invention.
  • Figure 7 is a structural block diagram of another new word acquisition apparatus of the present invention.
  • Figure 8 is a flow chart showing the steps of a method for generating a thesaurus according to the present invention.
  • Referring to Figure 1, a flow chart of the steps of Embodiment 1 of the present invention, the method includes the following steps:
  • Step 101: Obtain the words selected by the user during the user input process.
  • Step 101 records the user's input behavior information, namely the words the user selects.
  • The encoded character string may be a pinyin code or a glyph (character-shape) code; that is, the present invention can be applied to various input methods.
  • The words selected by the user will include some of the user's personalized words.
  • For example, the user needs to input words such as "broad value", "nine ceremonies" or a certain personal name, but the original vocabulary of the input method does not contain them. Such words cannot be displayed directly among the candidate words, and the user has to select the characters one by one to compose the desired personalized word.
  • The user can also use the manual word-creation function provided by the input method to create words that the user needs but that are absent from the original thesaurus, so that the desired personalized word can be selected directly during input.
  • The present invention is capable of picking out the user's personalized words from the words selected by the user.
  • Step 102: Compare the selected words with the existing words, and obtain the user's personalized words according to the comparison result.
  • The comparison may be performed each time the user confirms a selected word: the selected word is compared with the existing words and, if it satisfies the preset judgment rule, it is determined to be a personalized word of the user and is recorded, either in the system vocabulary or in a user personalized vocabulary.
  • In this case, the words selected by the user in step 101 need only be kept in a cache.
  • Alternatively, step 101 first records the user-selected words in the user vocabulary, while the input method system vocabulary stores the existing words. The comparison in step 102 then compares the user vocabulary with the input method system vocabulary at regular intervals and records the identified personalized words in the user personalized vocabulary or marks them in the user vocabulary. This approach reduces the amount of computation during user input, so that extracting the input behavior does not interfere with the input itself.
  • The preset rule for determining the user's personalized words can be set by a person skilled in the art as needed.
  • The user's personalized words are obtained by the following steps: determining whether the word selected by the user exists among the existing words; if not, determining that the word is a personalized word of the user.
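The periodic variant of step 102, using the simple "not among the existing words" rule, might look like the sketch below. The user vocabulary is assumed to be a dictionary of per-word records and the system vocabulary a set of words; both layouts and the `personalized` flag are illustrative choices, not part of the patent.

```python
def mark_personalized(user_vocabulary, system_words):
    """Batch-compare the user vocabulary against the system vocabulary.

    Marks each entry in place and returns the personalized words found.
    Intended to run at intervals, outside the user's typing path.
    """
    personalized = []
    for word, entry in user_vocabulary.items():
        entry["personalized"] = word not in system_words
        if entry["personalized"]:
            personalized.append(word)
    return sorted(personalized)
```

Running the comparison as a batch job rather than on every keystroke reflects the stated goal of keeping extraction from slowing down input.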
  • Step 103: Collect the personalized words of each user.
  • The collecting may be: the input method user's computing device sends the user's personalized words to the word collecting computing device in real time or periodically; that is, the input method computing device has an automatic sending module.
  • The collecting computing device exists in the form of a server.
  • The collecting may also be performed by the input method user sending the personalized words to the collecting end at regular or irregular intervals, that is, the sending is initiated manually by the user; for example, each user sends his or her personalized words to a common email address or to a common server.
  • Alternatively, the vocabulary storing the user's personalized words may be sent to the collecting computing device in real time or periodically; for example, collection can be achieved by each user sending the vocabulary to the server at scheduled or unscheduled times.
  • Where the input method system used by the user itself runs on a server shared by multiple users, collecting personalized words is simpler: the server can collect each user's input behavior information during use.
  • Any approach that achieves the information collection is feasible for the present invention; they are not enumerated one by one here.
  • Step 104: Obtain new words from the personalized words.
  • This step can obtain new words by removing duplicates from all the collected user personalized words. Additional filtering or simplification may also be applied when obtaining the new words.
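As a rough sketch of the collection end, the function below merges the submissions of many users and removes duplicates. The `min_users` filter is an added assumption standing in for the unspecified "additional filtering"; with its default of 1 the function only deduplicates.

```python
def merge_personalized_words(submissions, min_users=1):
    """Merge per-user submissions into a duplicate-free candidate list.

    submissions: iterable of (user_id, list of personalized words)
    min_users:   hypothetical filter, keep words submitted by at least
                 this many distinct users
    """
    supporters = {}
    for user_id, words in submissions:
        for word in words:
            supporters.setdefault(word, set()).add(user_id)
    return sorted(word for word, users in supporters.items()
                  if len(users) >= min_users)
```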
  • The present invention can obtain new words from the collected user personalized words as follows: count the number of occurrences of each personalized word in the preset Internet page database; if the number of occurrences is greater than or equal to the preset threshold, output the word as a new word.
  • Referring to Figure 2, a flow chart of the steps of Embodiment 2 of the present invention, the method includes the following steps:
  • Step 201: Obtain the words selected by the user during the user input process.
  • Step 202: Collect the selected words of each user. Step 202 collects each user's user vocabulary, or the user-selected words recorded in the system vocabulary (when the system vocabulary is not read-only).
  • The collection may be carried out in any of the ways described above, which are not repeated here.
  • Step 203: Compare the words selected by the users with the existing words, and obtain the users' personalized words according to the comparison result.
  • Step 204: Obtain new words from the personalized words.
  • Embodiment 2 is basically similar in concept to Embodiment 1.
  • The main difference is that the selected words of a plurality of users are collected first and then compared in a single pass, and the users' personalized words are obtained from that comparison. This approach reduces the number of comparison operations and the burden on the local input method system; however, because the words selected by a large number of users are compared together, it increases the load on the server.
  • A person skilled in the art can choose between the two according to need.
  • Referring to Figure 3, a flow chart of the steps of Embodiment 3 of the present invention, Embodiment 3 further optimizes Embodiment 1 and includes the following steps:
  • Step 301: During the user input process, record the words selected by the user and their user word frequencies in the user vocabulary.
  • A user vocabulary is created on the user side to record the words selected by the user and their user word frequencies, the user word frequency being the frequency with which the user inputs the word. This step records the user's input behavior completely, regardless of whether a word is a new word.
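A user vocabulary of this kind can be sketched with a simple counter. The class and method names are hypothetical, and the user word frequency is represented here as a raw selection count, which is one possible reading of "frequency information".

```python
from collections import Counter


class UserVocabulary:
    """Client-side record of every word the user selects, with its count."""

    def __init__(self):
        self.freq = Counter()

    def record_selection(self, word):
        # Called each time the user confirms a candidate word, whether or
        # not the word already exists in the system vocabulary.
        self.freq[word] += 1

    def entries(self):
        """Return a plain {word: user word frequency} snapshot."""
        return dict(self.freq)
```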
  • the input system vocabulary can be set to the modifiable mode, and the user-selected words and their user words can be directly recorded to the system vocabulary.
  • Step 302 Compare the user vocabulary and the system vocabulary, and obtain the user's personalized words according to the comparison result; the following methods.
  • the first type determines whether the word selected by the user exists in the existing word; if not, determines that the word is a user's individual word.
  • The second type: determine whether the word selected by the user exists among the existing words. If it does not, further check the user word frequency of the word; if that word frequency is greater than or equal to a predetermined threshold, the word is determined to be a personality word. If the word does exist, it can be determined to be a non-personality word.
  • The third type: determine whether the word selected by the user exists among the existing words. If it does not, the word is determined to be a personality word of the user. If it does, further compare the user word frequency of the word with its system word frequency, where the system word frequency is the word frequency information preset for the existing word in the system vocabulary of the input method system; if the ratio of the user word frequency to the system word frequency is greater than or equal to a predetermined threshold, the word is determined to be a personality word.
  • Using the user word frequency to further judge personality words makes it possible to obtain words that were formerly uncommon but are now in very common use, that is, new words whose scope or context of application has changed.
  • The ratio used in the above method is a preferred example; other feasible parameters can of course also be used for the evaluation.
  • The fourth type: determine whether the word selected by the user exists among the existing words. If it does not, further check the user word frequency of the word; if that word frequency is greater than or equal to a predetermined threshold, the word is determined to be a personality word. If it does, further compare the user word frequency of the word with its system word frequency, where the system word frequency is the word frequency information preset for the existing word in the system vocabulary of the input method system; if the ratio of the user word frequency to the system word frequency is greater than or equal to a predetermined threshold, the word is determined to be a personality word.
  • This mode is a preferred example of the present invention and yields more accurate user personality words.
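  • The fourth judgment type can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the dictionary-based data layout, both threshold values, and the fallback for a system word frequency of zero are assumptions introduced here.

```python
def is_personality_word(word, user_freq, system_lexicon,
                        new_word_threshold=3, ratio_threshold=5.0):
    """Return True if `word` counts as a personality word (fourth type).

    user_freq:      dict mapping word -> how often this user selected it.
    system_lexicon: dict mapping existing word -> preset system word frequency.
    Both thresholds are illustrative values, not taken from the source.
    """
    freq = user_freq.get(word, 0)
    if word not in system_lexicon:
        # Not an existing word: require a minimum user word frequency.
        return freq >= new_word_threshold
    system_freq = system_lexicon[word]
    if system_freq <= 0:
        # Assumption: with no usable system frequency, fall back to the
        # plain frequency threshold instead of dividing by zero.
        return freq >= new_word_threshold
    # Existing word: compare the ratio of user to system word frequency.
    return freq / system_freq >= ratio_threshold
```

  • Setting `new_word_threshold` to 1 for a word absent from the lexicon reduces this sketch to the first branch of the third type; in practice the two frequencies would need to be on comparable scales before the ratio is meaningful.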
  • Step 303: Collect the personality words of each user.
  • Step 304: Obtain new words according to the personality words.
  • This step obtains new words by removing duplicates from all collected user personality words. New words can also be obtained by further filtering and simplification, as detailed later with reference to Figure 4.
  • Step 305: Generate a new thesaurus from the output new words, or add the obtained new words to the original thesaurus to obtain a new version of the full thesaurus.
  • This step organizes the new words obtained in step 304 into a thesaurus, which can be used in an input method system or in the search field.
  • The second computing device that stores the new thesaurus or the new version of the full thesaurus may exist in the network in the form of a server and provide a thesaurus update service to any client program that needs the new word information of the input method.
  • Of course, it need not take the form of a fixed server; it may also be a local computing device that, through P2P (peer-to-peer) technology, provides the thesaurus update service to client programs on other terminals that need the new word information of the input method.
  • The update may be performed in any of the following ways: the system vocabulary is updated together with the input method system when the latter is updated; the system vocabulary is updated online by active push from the server; or the user initiates a request and the server returns data in response to update the system vocabulary.
  • Various data update methods may be used; the present invention is not limited in this respect, and those skilled in the art may choose among them as needed.
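  • The three update modes above (bundled with the software update, server push, and client request) can be illustrated in-process as follows. This is only a sketch under stated assumptions: the class names, the integer version counter, and the subscriber list are hypothetical, and a real deployment would replace the direct method calls with network communication.

```python
class LexiconServer:
    """Hypothetical holder of the new thesaurus on the second computing device."""

    def __init__(self):
        self.version = 0
        self.words = {}          # word -> word frequency
        self.subscribers = []    # clients that accept pushed updates

    def publish(self, new_words):
        # Merge the new words, bump the version, and push to subscribers.
        self.words.update(new_words)
        self.version += 1
        for client in self.subscribers:
            client.apply_update(self.version, dict(self.words))

    def handle_request(self, client_version):
        # Pull mode: return data only if the client is out of date.
        if client_version < self.version:
            return self.version, dict(self.words)
        return None


class InputMethodClient:
    def __init__(self, bundled_version=0, bundled_words=None):
        # Bundled mode: the vocabulary shipped with the input method itself.
        self.version = bundled_version
        self.words = dict(bundled_words or {})

    def apply_update(self, version, words):
        if version > self.version:
            self.version, self.words = version, words

    def request_update(self, server):
        reply = server.handle_request(self.version)
        if reply is None:
            return False
        self.apply_update(*reply)
        return True
```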
  • In this arrangement, the unit of the input method system that receives user input information and displays the corresponding characters is located in a first computing device. The obtained new thesaurus or new version of the full thesaurus serves as the system vocabulary of the input method system and is located in a second computing device. According to the information input by the user, the input method system obtains the corresponding information from the system vocabulary in the second computing device and displays the corresponding characters on the first computing device, completing the text input.
  • The new thesaurus or new version of the full thesaurus obtained by the new word extraction method of the present invention can thus be used directly as the system vocabulary of the input method system, so that the thesaurus is used online and no update operation is needed.
  • The input method system is divided into two parts: the receiving and display unit is located in the first computing device, and the thesaurus information is located in the second computing device, which is well suited to an online application of the input method. Of course, the code matching process required by the input method system can be placed in either computing device as needed.
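  • The split between the two computing devices can be sketched as below. The class names, the `(code, word, frequency)` entry layout, and the choice to perform code matching on the lexicon side are assumptions made for illustration; the direct call to `lookup` stands in for whatever network request a real system would use.

```python
class RemoteLexicon:
    """Stands in for the system vocabulary on the second computing device."""

    def __init__(self, entries):
        # entries: iterable of (input code, word, word frequency) tuples.
        self._by_code = {}
        for code, word, freq in entries:
            self._by_code.setdefault(code, []).append((word, freq))

    def lookup(self, code):
        # Code matching is done here; candidates are ordered by frequency.
        candidates = sorted(self._by_code.get(code, []),
                            key=lambda pair: -pair[1])
        return [word for word, _ in candidates]


class InputClient:
    """Receiving and display unit on the first computing device."""

    def __init__(self, lexicon):
        self.lexicon = lexicon

    def candidates_for(self, typed_code):
        # In a real system this would be a request to the second device.
        return self.lexicon.lookup(typed_code)
```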
  • The present invention is also applicable to the field of search.
  • With the thesaurus obtained by the new word extraction method of the present invention, a query keyword string entered by the user can be segmented accurately; based on the segmentation result, the accuracy and coverage of the search results can then be improved.
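  • To show why an up-to-date thesaurus matters for segmentation, the sketch below uses forward maximum matching. The source does not name a segmentation algorithm, so this technique, the function name, and the `max_word_len` parameter are assumptions chosen only for illustration.

```python
def segment(query, lexicon, max_word_len=4):
    """Split `query` into words by forward maximum matching.

    lexicon: set of known words. Characters that start no known word
    are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(query):
        # Try the longest candidate first, shrinking until a word matches.
        for length in range(min(max_word_len, len(query) - i), 0, -1):
            piece = query[i:i + length]
            if length == 1 or piece in lexicon:
                tokens.append(piece)
                i += length
                break
    return tokens
```

  • With "互联网" in the lexicon, the query "互联网新词" is split into two words; without it, the same query falls apart into single characters, which is the failure the new thesaurus is meant to avoid.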
  • The present invention can obtain new words from the collected user personality words by the following steps:
  • Step 401: Remove duplicate user personality words.
  • Step 402: Assign weights to Internet pages; store the Internet pages whose weight value is greater than or equal to a preset threshold in the Internet page database, thereby obtaining the preset Internet page database.
  • Step 403: Count the number of occurrences of each personality word in the preset Internet page database; if the number of occurrences is greater than or equal to a preset threshold, output the word as a new word.
  • Step 402 is optional; its purpose is to obtain a selected Internet page database and thereby ensure the accuracy of the new word screening.
  • Other methods can also be used to form the preset Internet page database.
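  • Steps 401 and 403 can be sketched as follows, treating the preset Internet page database as a plain list of page texts. The function name, the list-of-strings representation, and the threshold value are assumptions for illustration only.

```python
def select_new_words(personality_words, pages, min_occurrences=2):
    """Deduplicate personality words and keep those seen often enough.

    personality_words: words collected from all users (may repeat).
    pages: list of page texts standing in for the Internet page database.
    """
    # Step 401: remove duplicates while keeping first-seen order.
    unique_words = list(dict.fromkeys(personality_words))
    new_words = []
    for word in unique_words:
        # Step 403: count occurrences across the page database.
        occurrences = sum(page.count(word) for page in pages)
        if occurrences >= min_occurrences:
            new_words.append(word)
    return new_words
```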
  • In the weight assignment of step 402, a particularly important case is assigning the weight value according to the time at which the web page was created and the type of the web page. For word frequency statistics the time of the web page matters most, so it has the greater influence on the weight value: the further the page's time is from the time of the word frequency statistics, the lower the weight value. If the time difference exceeds a certain value, the page can be given a lower weight value or even be excluded from the word frequency statistics. Second, the type of web page also has a great influence on the word frequency statistics.
  • The web page type generally refers to portal websites, forums, or certain other specified web pages. These pages receive higher weight values because they have more participants and more information.
  • In addition, a rule base can be set up that stores the URLs of certain web pages. The pages at these URLs are more important for word frequency statistics and the words appearing on them are preferred, so these pages are given greater weight values.
  • The present invention can further exclude duplicate web pages, pornographic web pages, and spam web pages by giving them lower weight values, thereby further ensuring the accuracy of new word verification.
  • Redundant page information should be removed as far as possible from the text whose vocabulary is to be counted. Such redundant information is generally invalid; if it is not removed, it increases the amount of computation for new word extraction and makes the counted word frequencies less objective and the results less accurate.
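  • The page weighting of step 402 can be sketched as below. The source gives no formula, so the linear time decay, the per-type weights, the boost for URLs in the rule base, the example URL, and all numeric values are assumptions chosen only to make the idea concrete.

```python
from datetime import date

# Hypothetical rule base of preferred URLs and hypothetical type weights.
PREFERRED_URLS = {"http://portal.example/news"}
TYPE_WEIGHT = {"portal": 1.0, "forum": 0.8, "other": 0.5, "spam": 0.0}


def page_weight(url, page_type, published, today, max_age_days=365):
    """Weight a page by its age and type; older pages weigh less."""
    age_days = max(0, (today - published).days)
    if age_days > max_age_days:
        return 0.0  # too old: excluded from the word frequency statistics
    time_factor = 1.0 - age_days / max_age_days
    weight = time_factor * TYPE_WEIGHT.get(page_type, TYPE_WEIGHT["other"])
    if url in PREFERRED_URLS:
        weight = min(1.0, weight * 1.5)  # pages in the rule base are boosted
    return weight


def build_page_database(pages, today, min_weight=0.3):
    """Keep the pages whose weight reaches the preset threshold."""
    return [p for p in pages
            if page_weight(p["url"], p["type"], p["published"], today) >= min_weight]
```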
  • The present invention also proposes two new word acquisition systems based on the input method. Since these systems are used to carry out the foregoing method, only a brief introduction is given below; for details, refer to the related parts above.
  • A first new word acquisition system based on the input method includes:
  • a word extraction unit, connected to the input method system, for acquiring the words selected by the user during user input; a word comparison unit, connected to the word extraction unit, for comparing the selected words with the existing words and obtaining the user's personality words according to the comparison result; a collecting unit, for collecting the personality words of each user; and a new word obtaining unit, for acquiring new words according to the personality words.
  • A second new word acquisition system based on the input method includes:
  • a word extraction unit, connected to the input method system, for acquiring the words selected by the user during user input; a collecting unit, for collecting the selected words of each user; a word comparison unit, connected to the collecting unit, for comparing the user-selected words with the existing words and obtaining the user's personality words according to the comparison result; and a new word obtaining unit, for acquiring new words according to the personality words.
  • the present invention also claims an input method system, including an input interface unit 501, a display unit 502, and a system vocabulary 503, and further includes:
  • a word extraction unit 504 connected to the input method system, for acquiring a word selected by the user during the user input process;
  • a word comparison unit 505, connected to the word extraction unit 504, for comparing the selected words with the existing words and obtaining the user's personality words according to the comparison result.
  • The user personality words may be stored in the user vocabulary 506, stored in the system vocabulary 503 with a mark, or stored in a dedicated vocabulary.
  • In addition to ordinary word input, the input method system can thus be used to extract the user's personality words.
  • The input method system may be an ordinary input method system, in which the input interface unit, the display unit, and the system vocabulary are located in the same computing device; the input method system performs a local matching query according to the coding information input by the user and displays the corresponding characters locally.
  • The input method system may also be a network input method system, in which the input interface unit and the display unit are located in a first computing device and the system vocabulary is located in a second computing device; according to the information input by the user, the input method system obtains the corresponding information from the second computing device and displays the corresponding characters on the first computing device.
  • the input method system may further include: a user vocabulary 506 for storing a word selected by the user; and a communication unit 507, configured to send the personalized word.
  • Each user's input method system can send that user's personality words to a unified collecting computing device, so that a large amount of user input behavior information is gathered and new words that meet public needs and are linguistically meaningful can then be identified.
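  • The collection side can be sketched as a simple aggregation over the reports sent by each user's input method system. The report layout (one dictionary of word to user word frequency per user) and the function name are assumptions; the source does not specify a transfer format.

```python
from collections import Counter


def collect_reports(reports):
    """Aggregate personality-word reports from many users.

    reports: iterable of dicts {word: user word frequency}, one per user.
    Returns (user_counts, total_freq): how many users reported each word,
    and the summed user word frequency of each word.
    """
    user_counts = Counter()
    total_freq = Counter()
    for report in reports:
        for word, freq in report.items():
            user_counts[word] += 1
            total_freq[word] += freq
    return user_counts, total_freq
```

  • A word reported by many independent users is a stronger new word candidate than one reported very often by a single user, which is why the sketch keeps both counts.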
  • the input method system may further include:
  • the word frequency recording unit 508 is connected to the input method system for recording a user word frequency during the user input process, and the user word frequency is frequency information of the user inputting the word.
  • the communication unit 507 can also be used to send user word frequency information related to personal words.
  • the word comparison unit 505 may further include:
  • a first comparison sub-unit 5051, configured to determine whether the word selected by the user exists among the existing words; if it does, the word is output to the third comparison sub-unit, and if it does not, the word is output to the second comparison sub-unit;
  • a second comparison sub-unit 5052, configured to further check the user word frequency of the word when the selected word does not exist among the existing words; if that word frequency is greater than or equal to a predetermined threshold, the word is determined to be a personality word;
  • a third comparison sub-unit 5053, configured to further compare the user word frequency and the system word frequency of the word when the selected word exists among the existing words, where the system word frequency is the word frequency information preset for the existing word in the system vocabulary of the input method system; if the ratio of the user word frequency to the system word frequency is greater than or equal to a predetermined threshold, the word is determined to be a personality word.
  • The above structure of the word comparison unit 505 is a preferred embodiment of the present invention. Other comparison rules may of course be used, and the word comparison unit 505 may include other sub-units, which are not enumerated here.
  • The input interface unit 501 in the above input method system is primarily used to let the user input information and select words. It can also be used to switch among modes, for example: input language switching (such as between Simplified and Traditional Chinese, or between Chinese and English), input mode switching (such as between single-character, word, and sentence input), and input state switching (such as between text, punctuation, and special symbols).
  • Display unit 502 and system vocabulary 503 are well known to those skilled in the art and will not be described in detail herein.
  • the present invention also provides a new word acquiring apparatus, including:
  • a personality word collecting unit 601, configured to collect the personality words of each user. The personality words of a user may be obtained by the input method and sent automatically to the personality word collecting unit; they may be set or organized by the user and then sent to the personality word collecting unit; or each user may place a personality word vocabulary in a fixed network space, from which the personality word collecting unit obtains the personality words of each user. That is, the user's personality words in this embodiment are not necessarily obtained through user input behavior but may also be set or organized by the user;
  • a statistical unit 602, configured to count the number of times each personality word appears in a preset Internet page database;
  • a new word determining unit 603, connected to the statistical unit 602, configured to determine whether the number of occurrences of the personality word is greater than or equal to a preset threshold and, if so, to output the word as a new word.
  • By verifying against Internet information, the new word acquiring apparatus can output relatively accurate new words from the collected personality words of each user.
  • the individual words of each user may be automatically obtained by the user's input behavior, or may be set or organized by the user.
  • the new word obtaining means may further include: a thesaurus generating unit 604, configured to generate a new thesaurus according to the outputted new words or add the obtained new words to the original thesaurus to obtain a new thesaurus or a new version of the whole thesaurus.
  • The new thesaurus or the new version of the full thesaurus can be used to update the system vocabulary of the input method system or for word segmentation in search, thereby improving the user's input accuracy and the accuracy of the search results.
  • The new word acquiring apparatus may further include: an Internet page database generating unit 605, configured to assign weights to Internet pages and to store the Internet pages whose weight value is greater than or equal to a preset threshold in the Internet page database.
  • the present invention also discloses another new word acquiring apparatus, including:
  • a word collecting unit 701 configured to collect selected words of each user
  • The word collecting unit 701 can be connected directly to an existing input method system, for example a network input method, to collect the selected words of each user in real time.
  • The word collecting unit 701 can also receive the user-selected words sent in real time or periodically by each user's input method system, the user-selected words having been extracted by that input method system.
  • The word collecting unit 701 can also collect user-selected words by receiving the user vocabulary or system vocabulary sent by each user's input method system, where the user-selected words have been extracted by the user's input method and stored in the user vocabulary or the system vocabulary.
  • the word matching unit 702 is connected to the word collecting unit, and is configured to compare the selected word with the existing word, and obtain the user's personalized word according to the comparison result;
  • the new word obtaining unit 703 is configured to obtain a new word according to the personalized word.
  • the word comparison unit 702 can further include:
  • a first comparison sub-unit 7021, configured to determine whether the word selected by the user exists among the existing words; if it does, the word is output to the third comparison sub-unit, and if it does not, the word is output to the second comparison sub-unit;
  • a second comparison sub-unit 7022, configured to further check the user word frequency of the word when the selected word does not exist among the existing words; if that word frequency is greater than or equal to a predetermined threshold, the word is determined to be a personality word;
  • a third comparison sub-unit 7023, configured to further compare the user word frequency and the system word frequency of the word when the selected word exists among the existing words, where the system word frequency is the word frequency information preset for the existing word in the system vocabulary of the input method system; if the ratio of the user word frequency to the system word frequency is greater than or equal to a predetermined threshold, the word is determined to be a personality word.
  • the new word obtaining unit 703 may further include:
  • a statistical subunit 7031 configured to count the number of occurrences of the personalized word in a preset Internet page database, thereby obtaining an Internet word frequency of the word;
  • the new word determining subunit 7032 is connected to the statistical subunit for determining whether the internet word frequency is greater than or equal to a preset threshold, and if so, outputting the word as a new word.
  • the new word acquiring device may further include:
  • the thesaurus generating unit 704 is configured to generate a new thesaurus according to the outputted new words or add the obtained new words to the original thesaurus to obtain a new thesaurus or a new version of the full thesaurus.
  • the Internet page database generating unit 705 is configured to perform weighting on the Internet page; and store the Internet page whose weight value is greater than or equal to the preset threshold to the Internet page database.
  • The thesaurus generated by the thesaurus generating unit 704 can further include the user word frequency corresponding to each word.
  • The user word frequency and the Internet word frequency may be weighted and summed to give each user personality word a weight word frequency. Filtering is then performed according to the weight word frequency: for example, if the weight word frequency of a personality word is greater than or equal to a preset threshold, the word is output as a new word.
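  • The weighted combination can be sketched as a linear sum of the two frequencies followed by a threshold test. The weights 0.6 and 0.4, the function names, and the tuple layout are illustrative assumptions; the source only states that the two frequencies are weighted and summed.

```python
def weighted_word_frequency(user_freq, internet_freq,
                            user_weight=0.6, internet_weight=0.4):
    """Combine user and Internet word frequency into a weight word frequency."""
    return user_weight * user_freq + internet_weight * internet_freq


def filter_by_weighted_frequency(candidates, threshold):
    """Keep personality words whose weight word frequency reaches the threshold.

    candidates: dict mapping word -> (user word frequency, Internet word frequency).
    Returns a dict mapping each kept word to its weight word frequency.
    """
    kept = {}
    for word, (user_freq, internet_freq) in candidates.items():
        score = weighted_word_frequency(user_freq, internet_freq)
        if score >= threshold:
            kept[word] = score
    return kept
```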
  • The invention also discloses a thesaurus generating method. Two embodiments of the thesaurus generating method are described below with reference to FIG. 8, FIG. 8a and FIG. 8b, respectively:
  • the thesaurus generation method shown in Figure 8a includes the following steps:
  • Step 801a: Collect input behavior information of each user, where the input behavior information includes the words selected during user input and the corresponding user word frequencies. The collection may be done in any of the ways mentioned above.
  • Step 802a: Perform weight correction on the word frequencies of the individual users for each word, and calculate the user cumulative word frequency of each word. The weight correction may be based on an analysis of the individual users' word frequencies for a given word: for example, first analyze the distribution of these word frequencies to find the trend, and then correct a word frequency value according to its probability of occurrence or according to the average range of the word frequencies.
  • The user cumulative word frequency calculated after this correction suppresses the accidental or malicious behavior of some users and is therefore more objective and accurate, which ensures the accuracy of the thesaurus.
  • Step 803a: Remove the words whose user cumulative word frequency is less than or equal to a certain threshold. This is a preferred step that further improves how widely used the words included in the thesaurus are.
  • Step 804a: Generate a thesaurus that includes the words and their corresponding user cumulative word frequencies. Because the input method has a large number of users, a widely applicable thesaurus can be obtained by collecting the input behavior information of many input method users.
  • The thesaurus can be provided directly to the input method system as a system vocabulary; it can also be imported by the user as a user vocabulary and used together with the system vocabulary.
  • The thesaurus generating method shown in FIG. 8a may further include the following steps. Step 805a: Compare the generated thesaurus with an existing thesaurus, remove the words that do not conform to the preset rules according to the comparison result, and output the user personality words. The preset rules can be set by a person skilled in the art as needed, for example the four ways of obtaining user personality words according to the comparison result described for step 302 above.
  • Step 806a: Generate a personality word thesaurus according to the user personality words.
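  • Steps 801a to 804a can be sketched as follows. The source leaves the weight correction open, so capping each user's word frequency at a multiple of the median across users is a substitute technique chosen here for illustration; the function names, `cap_factor`, and the threshold are likewise assumptions.

```python
from statistics import median


def cumulative_word_frequencies(reports, cap_factor=3.0):
    """Steps 801a-802a: correct per-user frequencies, then accumulate.

    reports: list of dicts {word: user word frequency}, one per user.
    Each user's frequency for a word is capped at cap_factor times the
    median frequency across users, which limits the influence of a single
    user's accidental or malicious input.
    """
    per_word = {}
    for report in reports:
        for word, freq in report.items():
            per_word.setdefault(word, []).append(freq)
    cumulative = {}
    for word, freqs in per_word.items():
        cap = cap_factor * median(freqs)
        cumulative[word] = sum(min(f, cap) for f in freqs)
    return cumulative


def generate_thesaurus(reports, min_cumulative=5):
    """Steps 803a-804a: drop rare words and return word -> cumulative frequency."""
    cumulative = cumulative_word_frequencies(reports)
    # Step 803a removes words at or below the threshold.
    return {w: f for w, f in cumulative.items() if f > min_cumulative}
```

  • Note that with only one reporting user the median equals that user's own frequency, so no capping occurs; the correction only takes effect once several users have reported the same word.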
  • the thesaurus generation method shown in Figure 8b includes the following steps:
  • Step 801b Collect input behavior information of each user, where the input behavior information includes a selected word in the user input process and a corresponding word frequency of the word.
  • Step 802b performing weight correction on the word frequency of each user corresponding to the word, and calculating the cumulative word frequency of the user of each word.
  • Step 803b removing words whose user cumulative word frequency is less than or equal to a certain threshold.
  • Step 804b generating a vocabulary, the vocabulary including words and their corresponding user cumulative word frequency.
  • Step 805b Compare the generated thesaurus with the existing thesaurus, and remove the words that do not conform to the preset rules according to the comparison result, and output the user personalized words;
  • Step 806b: Count the number of occurrences of the personality words in the preset Internet page database to obtain an Internet word frequency.
  • Step 807b: Weight the user cumulative word frequency of each personality word and its Internet word frequency to obtain a weight word frequency of the personality word; if the weight word frequency of the personality word is greater than or equal to a preset threshold, output the word as a new word.
  • Step 808b: Generate a new thesaurus from the output new words, the new thesaurus including the new words and their corresponding weight word frequencies.
  • the invention also discloses a thesaurus generating device, comprising the following components:
  • a collecting unit, configured to collect input behavior information of each user, where the input behavior information includes the words selected during user input and the corresponding user word frequencies;
  • a word frequency calculation unit, configured to perform weight correction on the word frequencies of the individual users for each word and to calculate the user cumulative word frequency of each word;
  • a thesaurus generating unit, configured to generate a thesaurus that includes the words and their corresponding user cumulative word frequencies.
  • The thesaurus generating device may further include: a personality word determining unit, configured to compare the generated thesaurus with an existing thesaurus, remove the words that do not conform to the preset rules according to the comparison result, and output the user personality words.
  • Alternatively, the thesaurus generating device may further include:
  • a personality word determining unit, configured to compare the generated thesaurus with the existing thesaurus, remove the words that do not conform to the preset rules according to the comparison result, and output the user personality words;
  • a statistical unit, configured to count the number of occurrences of the personality words in a preset Internet page database to obtain an Internet word frequency;
  • a weight word frequency determining unit, configured to weight the user cumulative word frequency of the personality word and the Internet word frequency to obtain a weight word frequency of the word;
  • a new word determining unit, configured to output the word as a new word if the weight word frequency of the personality word is greater than or equal to a preset threshold.
  • The thesaurus generating unit then generates a new thesaurus from the output new words, the new thesaurus including the new words and their corresponding weight word frequencies.
  • Since the present invention uses word frequency statistics based on Internet information and takes user input behavior information as the source of new words, a large number of new words frequently used by users can be obtained conveniently and quickly. After unified filtering, these new words are continuously provided to input method users, so that those users can keep up with changes in Internet information and enter new words without going through a tedious word-selection process each time. New words can thus also become the user's preferred candidates, which improves the first-candidate hit rate when users input new words and makes the ordering of candidate words more reasonable.

Abstract

A method for obtaining new words is disclosed, comprising the following steps: obtaining the words selected by a user during user input; comparing the words selected by the user with existing words, and obtaining the user's personality words according to the comparison result; collecting the personality words of every user; and obtaining new words according to the personality words. A method for generating a lexicon is also disclosed, comprising the following steps: collecting input behavior information of every user, which comprises the words selected during user input and the user word frequencies corresponding to the selected words; performing weight correction on the user word frequencies corresponding to each word and computing the user cumulative word frequency of each word; and generating a lexicon that includes the words and their corresponding user cumulative word frequencies. The invention provides a distributed architecture that derives new words of general significance from the personality words of individual users. Approaching the problem from the perspective of user input, it can obtain accurate new words of general significance in a convenient way.

Description

Method and device for acquiring new words, and input method system
This application claims priority to Chinese Patent Application No. 200610109732.X, filed with the Chinese Patent Office on August 9, 2006 and entitled "Method and device for acquiring new words, and input method system", the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to the field of Internet information processing, and in particular to a method for acquiring new words, a new word acquisition system, a new word acquiring apparatus, and an input method system.
Background
The emergence of the Internet has, to a large extent, revolutionized the development of written language: the sharp increase in textual content and the appearance of brand-new content have put language through a major transformation. People no longer read only the articles in newspapers and magazines; more and more, they read articles on the Internet. Over time, the textual content on the Internet has grown richer and richer, to a degree that the text of traditional newspapers and magazines cannot match. Moreover, as the spread of information accelerates, new words propagate across the Internet at extraordinary speed, and a large number of new words appear within a short time. In the past it was difficult for an individual to publish an article in a newspaper or magazine, whereas in the Internet era everyone can publish their views online, and the text they enter becomes increasingly individual. As the number of Internet users keeps growing, so does the volume of personal writing, and new, individual words keep emerging. For example, "互联网" (Internet) was not a word some years ago, but it is now in wide use as a word.
In many language processing technologies the word is the most basic unit of analysis, so newly emerging words need to be acquired promptly and effectively to ensure the accuracy of those technologies. For example, vocabularies with different attributes are the basis of natural language understanding, machine translation, automatic summarization, and so on. In information retrieval, words are always used as the search unit to reduce redundancy in the retrieval results. In speech recognition, words are also usually taken as the lowest level of linguistic information, and language models are built on words to resolve acoustic ambiguity at the single-character level.
However, because new words appear continually and are scattered across vast and varied corpora, it is difficult to identify them promptly and effectively. The prior art generally relies on collecting new words manually and adding them to an existing thesaurus.
For example, new words are collected manually by the administrators of a search website and added to the custom thesaurus used by that website; or they are collected manually by thesaurus developers and included in the system dictionary of the next release (typically used in fields such as input methods); or a public thesaurus is set up (for example, 紫光), to which Internet users or other members of the public manually contribute new words over time, pooling a large amount of human effort. All of these approaches, however, are time-consuming, laborious, labor-intensive, and inefficient. There is therefore an urgent need for a method that can acquire new words promptly and effectively from the vast and varied use of language.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method and system for acquiring new words that can acquire, simply, conveniently, promptly, and effectively, new words that users frequently use, and that can effectively remove interfering words so as to output relatively accurate new words.
Another object of the present invention is to provide an input method system that can automatically acquire a user's personalized words simply, conveniently, promptly, and effectively, so that new words can be obtained by collecting the personalized words of multiple users.
Another object of the present invention is to provide a new word acquisition device that can output relatively accurate new words with high efficiency.
A further object of the present invention is to provide a lexicon generation method and a lexicon generation device that can provide a relatively accurate lexicon or new-word lexicon with high efficiency.
To solve the above technical problem, the present invention provides a method for acquiring new words, comprising the following steps: during user input, obtaining the words selected by the user; comparing the user-selected words with existing words and obtaining the user's personalized words according to the comparison result; collecting the personalized words of each user; and obtaining new words from the personalized words.
Preferably, the user word frequency may also be recorded during user input, the user word frequency being frequency information about how often the user inputs the word.
The comparison may be performed as follows: the user-selected words are recorded in a user lexicon, the existing words are stored in the input method system lexicon, and the user lexicon is compared with the input method system lexicon; alternatively, each word the user selects is compared directly with the existing words.
The user's personalized words may be obtained through the following steps: determining whether the user-selected word exists among the existing words; and if it does not, determining that the word is a personalized word of the user.
The user's personalized words may also be obtained through the following steps: determining whether the user-selected word exists among the existing words; if it does not, further examining the user word frequency of the word; and if that user word frequency is greater than or equal to a predetermined threshold, determining that the word is a personalized word.
The user's personalized words may also be obtained through the following steps: determining whether the user-selected word exists among the existing words; if it does not, determining that the word is a personalized word of the user; if it does, further comparing the user word frequency of the word with its system word frequency, the system word frequency being the word frequency information preset for the existing word in the input method system lexicon; and if the ratio of the user word frequency to the system word frequency is greater than or equal to a predetermined threshold, determining that the word is a personalized word.
Preferably, the user's personalized words may be obtained through the following steps: determining whether the user-selected word exists among the existing words; if it does not, further examining the user word frequency of the word, and if that user word frequency is greater than or equal to a predetermined threshold, determining that the word is a personalized word; if it does, further comparing the user word frequency of the word with its system word frequency, the system word frequency being the word frequency information preset for the existing word in the input method system lexicon, and if the ratio of the user word frequency to the system word frequency is greater than or equal to a predetermined threshold, determining that the word is a personalized word.
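The preferred decision rule above can be sketched as follows. This is only an illustrative sketch: the function name, the parameter names, the default threshold values, and the assumption that user and system frequencies are expressed in comparable units are all hypothetical and are not specified by the source.

```python
from typing import Mapping


def is_personalized_word(
    word: str,
    user_freq: int,
    system_lexicon: Mapping[str, int],
    min_user_freq: int = 3,   # hypothetical threshold for words not in the lexicon
    min_ratio: float = 5.0,   # hypothetical threshold for the user/system ratio
) -> bool:
    """Return True if `word` qualifies as a personalized word of the user."""
    system_freq = system_lexicon.get(word)
    if system_freq is None:
        # Word is not an existing word: require a minimum user frequency,
        # which filters out one-off typing errors.
        return user_freq >= min_user_freq
    if system_freq <= 0:
        # Assumption: a non-positive preset frequency is treated as
        # "any actual use exceeds the ratio".
        return user_freq > 0
    # Word exists: compare the user frequency against the preset system frequency.
    return user_freq / system_freq >= min_ratio
```

Setting `min_user_freq` to 1 reduces the first branch to the simplest rule, in which any word absent from the existing words is treated as personalized.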
Preferably, the method for acquiring new words further comprises: counting the number of times each personalized word appears in a preset Internet page database; and if the number of occurrences of the personalized word is greater than or equal to a preset threshold, outputting the word as a new word. The preset Internet page database is obtained through the following steps: assigning weights to Internet pages; and storing the Internet pages whose weight is greater than or equal to a preset threshold in the Internet page database.
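A minimal sketch of this filtering stage is given below. It assumes that page weights have already been assigned by some external scoring process (the source does not say how), that pages are available as plain text, and that occurrences are counted by simple substring matching; the function names and data shapes are hypothetical.

```python
from typing import Dict, Iterable, List


def build_page_database(
    pages: Dict[str, str],          # hypothetical: page identifier -> page text
    page_weights: Dict[str, float],  # hypothetical: page identifier -> assigned weight
    min_weight: float,
) -> List[str]:
    """Keep only the text of pages whose weight reaches the preset threshold."""
    return [
        text
        for page_id, text in pages.items()
        if page_weights.get(page_id, 0.0) >= min_weight
    ]


def select_new_words(
    personalized_words: Iterable[str],
    page_database: List[str],
    min_count: int,
) -> List[str]:
    """Output the personalized words that occur often enough in the page database."""
    new_words = []
    for word in personalized_words:
        # str.count counts non-overlapping occurrences of the word in each page.
        occurrences = sum(text.count(word) for text in page_database)
        if occurrences >= min_count:
            new_words.append(word)
    return new_words
```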
In the method for acquiring new words, the collecting may be performed as follows: the computing device of the input method user sends the user's personalized words to a word-collecting computing device in real time or at scheduled intervals.
Preferably, the method for acquiring new words further comprises: generating a new-word lexicon from the output new words, or adding the obtained new words to the original lexicon, so as to obtain a new-word lexicon or a new version of the full lexicon.
The present invention also discloses a method for acquiring new words, comprising: during user input, obtaining the words selected by the user; collecting the selected words of each user; comparing the user-selected words with existing words and obtaining the users' personalized words according to the comparison result; and obtaining new words from the personalized words.
The present invention also discloses an input-method-based new word acquisition system, comprising: a word extraction unit, connected to the input method system, for obtaining the words selected by the user during user input; a word comparison unit, connected to the word extraction unit, for comparing the user-selected words with existing words and obtaining the user's personalized words according to the comparison result; a collecting unit for collecting the personalized words of each user; and a new word acquisition unit for obtaining new words from the personalized words.
The present invention also discloses another input-method-based new word acquisition system, comprising: a word extraction unit, connected to the input method system, for obtaining the words selected by the user during user input; a collecting unit for collecting the selected words of each user; a word comparison unit, connected to the collecting unit, for comparing the user-selected words with existing words and obtaining the users' personalized words according to the comparison result; and a new word acquisition unit for obtaining new words from the personalized words.
The present invention also discloses an input method system comprising an input interface unit, a display unit, and a system lexicon, and further comprising: a word extraction unit, connected to the input method system, for obtaining the words selected by the user during user input; and a word comparison unit, connected to the word extraction unit, for comparing the user-selected words with existing words and obtaining the user's personalized words according to the comparison result.
The input interface unit, display unit, and system lexicon of the input method system may be located in the same computing device; alternatively, the input interface unit and display unit are located in a first computing device and the system lexicon is located in a second computing device, in which case the input method system obtains the corresponding information from the second computing device according to the information input by the user and displays the corresponding characters on the first computing device.
The input method system may further comprise: a communication unit for sending the personalized words.
The input method system may further comprise: a user lexicon for storing the words selected by the user.
The input method system may further comprise: a word frequency recording unit, connected to the input method system, for recording the user word frequency during user input, the user word frequency being frequency information about how often the user inputs the word.
The word comparison unit may comprise: a first comparison subunit for determining whether the user-selected word exists among the existing words, outputting the word to a third comparison subunit if it does and to a second comparison subunit if it does not; the second comparison subunit, for further examining the user word frequency of the word when the user-selected word does not exist among the existing words, and determining that the word is a personalized word if that user word frequency is greater than or equal to a predetermined threshold; and the third comparison subunit, for further comparing the user word frequency of the word with its system word frequency when the user-selected word exists among the existing words, the system word frequency being the word frequency information preset for the existing word in the input method system lexicon, and determining that the word is a personalized word if the ratio of the user word frequency to the system word frequency is greater than or equal to a predetermined threshold.
The present invention also discloses a new word acquisition device, comprising: a personalized word collecting unit for collecting the personalized words of each user; a counting unit for counting the number of times each personalized word appears in a preset Internet page database; and a new word determination unit, connected to the counting unit, for determining whether the number of occurrences of the personalized word is greater than or equal to a preset threshold and, if so, outputting the word as a new word.
The collecting consists of the user computing device sending the user's personalized words to the personalized word collecting unit in real time or at scheduled intervals.
The new word acquisition device further comprises: a lexicon generation unit for generating a new-word lexicon from the output new words, or adding the obtained new words to the original lexicon, so as to obtain a new-word lexicon or a new version of the full lexicon.
The new word acquisition device further comprises: an Internet page database generation unit for assigning weights to Internet pages and storing the Internet pages whose weight is greater than or equal to a preset threshold in the Internet page database.
The present invention also discloses a new word acquisition device, comprising: a word collecting unit for collecting the selected words of each user; a word comparison unit, connected to the word collecting unit, for comparing the user-selected words with existing words and obtaining the users' personalized words according to the comparison result; and a new word acquisition unit for obtaining new words from the personalized words.
Preferably, the new word acquisition unit comprises: a counting subunit for counting the number of times each personalized word appears in a preset Internet page database; and a new word determination subunit, connected to the counting subunit, for determining whether the number of occurrences of the personalized word is greater than or equal to a preset threshold and, if so, outputting the word as a new word.
The word collecting unit is further used to collect the user word frequency corresponding to each user-selected word. The new word acquisition device further comprises: a counting subunit for counting the number of times each personalized word appears in the preset Internet page database to obtain an Internet word frequency; a weighted word frequency determination subunit for applying weight correction to the user word frequency and the Internet word frequency of the word and summing them to obtain the weighted word frequency of the word; and a new word determination subunit for determining whether the weighted word frequency of the personalized word is greater than or equal to a preset threshold and, if so, outputting the word as a new word.
The present invention also discloses a lexicon generation method, comprising: collecting the input behavior information of each user, the input behavior information including the words selected during user input and the user word frequency corresponding to each word; applying weight correction to the user word frequencies corresponding to each word and computing the cumulative user word frequency of each word; and generating a lexicon that includes the words and their corresponding cumulative user word frequencies.
The lexicon generation method further comprises: removing the words whose cumulative user word frequency is less than or equal to a certain threshold.
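The accumulation and pruning steps can be sketched as below. The source does not say what the weight correction consists of; this sketch assumes, purely for illustration, one multiplicative weight per user, and the function and parameter names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Mapping


def build_lexicon(
    user_freqs: Mapping[str, Mapping[str, int]],  # user -> {word: user word frequency}
    user_weights: Mapping[str, float],            # hypothetical per-user correction weight
    min_cumulative: float,
) -> Dict[str, float]:
    """Accumulate weighted user word frequencies and prune rare words."""
    cumulative: Dict[str, float] = defaultdict(float)
    for user, freqs in user_freqs.items():
        weight = user_weights.get(user, 1.0)  # assumption: unknown users get weight 1.0
        for word, freq in freqs.items():
            cumulative[word] += weight * freq
    # Remove words whose cumulative frequency is less than or equal to the threshold.
    return {word: freq for word, freq in cumulative.items() if freq > min_cumulative}
```

Note that pruning uses a strict comparison, matching the "less than or equal to" removal condition stated above.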
The lexicon generation method further comprises: comparing the generated lexicon with an existing lexicon, removing the words that do not satisfy preset rules according to the comparison result, and outputting the users' personalized words; and generating a personalized-word lexicon from the users' personalized words.
Alternatively, the lexicon generation method further comprises: comparing the generated lexicon with an existing lexicon, removing the words that do not satisfy preset rules according to the comparison result, and outputting the users' personalized words; counting the number of times each personalized word appears in a preset Internet page database to obtain an Internet word frequency; applying weight correction to the cumulative user word frequency and the Internet word frequency of the personalized word and summing them to obtain the weighted word frequency of the word; if the weighted word frequency of the personalized word is greater than or equal to a preset threshold, outputting the word as a new word; and generating a new-word lexicon from the output new words, the new-word lexicon including the new words and their corresponding weighted word frequencies.
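The weighted word frequency can be illustrated as a weighted sum of the two frequencies. The coefficients below are placeholders chosen for the example, since the source gives no values, and the function names are hypothetical.

```python
from typing import Dict, Mapping, Tuple


def weighted_frequency(
    user_cumulative_freq: float,
    internet_freq: float,
    user_weight: float = 0.5,      # hypothetical coefficient
    internet_weight: float = 0.5,  # hypothetical coefficient
) -> float:
    """Weight-correct both frequencies and sum them."""
    return user_weight * user_cumulative_freq + internet_weight * internet_freq


def build_new_word_lexicon(
    candidates: Mapping[str, Tuple[float, float]],  # word -> (user cumulative freq, Internet freq)
    threshold: float,
) -> Dict[str, float]:
    """Keep the words whose weighted frequency reaches the threshold, with that frequency."""
    lexicon = {}
    for word, (user_freq, internet_freq) in candidates.items():
        score = weighted_frequency(user_freq, internet_freq)
        if score >= threshold:
            lexicon[word] = score
    return lexicon
```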
The present invention also discloses a lexicon generation device, comprising: a collecting unit for collecting the input behavior information of each user, the input behavior information including the words selected during user input and the user word frequency corresponding to each word; a word frequency calculation unit for applying weight correction to the user word frequencies corresponding to each word and computing the cumulative word frequency of each word; and a lexicon generation unit for generating a lexicon that includes the words and their corresponding cumulative word frequencies.
The lexicon generation device further comprises: a personalized word determination unit for comparing the generated lexicon with an existing lexicon, removing the words that do not satisfy preset rules according to the comparison result, and outputting the users' personalized words.
The lexicon generation device further comprises: a personalized word determination unit for comparing the generated lexicon with an existing lexicon, removing the words that do not satisfy preset rules according to the comparison result, and outputting the users' personalized words; a counting unit for counting the number of times each personalized word appears in a preset Internet page database to obtain an Internet word frequency; a weighted word frequency determination unit for applying weight correction to the cumulative user word frequency and the Internet word frequency of the personalized word and summing them to obtain the weighted word frequency of the word; and a new word determination unit for outputting the word as a new word if the weighted word frequency of the personalized word is greater than or equal to a preset threshold. The lexicon generation unit generates a new-word lexicon from the output new words, the new-word lexicon including the new words and their corresponding weighted word frequencies.
Compared with the prior art, the present invention has the following advantages:
First, the present invention proposes a distributed architecture comprising multiple clients and one collection end. By collecting user input behavior information from multiple clients, new words of general significance are obtained by analyzing the personalized words of the individual users. Because the new words found in Internet content or in a corpus are themselves produced by the usage behavior of individual users, the present invention approaches the problem from the perspective of user input and thereby obtains relatively accurate, generally significant new words simply and conveniently.
Second, the present invention further collects the user word frequency information in the user input behavior, which makes it possible to remove some interfering words, such as those arising from user input errors. It also makes it possible to find new words of sociological significance, for example by using the user word frequency to identify words that were formerly uncommon but are now in common use, that is, new words whose scope or context of use has changed. Through this analysis, relatively accurate new words can be obtained.
The present invention can further check the collected user input behavior information against a database of selected Internet pages, counting the occurrences of each word and removing low-frequency words, so as to obtain more accurate new words: new words in the true linguistic sense are identified, while words of no general significance and erroneous words are removed.
The present invention can also compile the obtained new words into a new-word lexicon or a new version of the full lexicon for use by an input method. This can raise the hit rate of the first candidate word and the input speed, and can make the ordering of candidate words more reasonable, so that users can input new words faster and more accurately and can find the intended word as the first candidate or on the first page of candidates without a tedious candidate selection process. The new-word lexicon or the new version of the full lexicon can also be provided to a search engine; when a user's query keyword string contains a new word, the precision and coverage of the search results can be improved.
Brief Description of the Drawings
Figure 1 is a flowchart of the steps of Embodiment 1 of the present invention;
Figure 2 is a flowchart of the steps of Embodiment 2 of the present invention;
Figure 3 is a flowchart of the steps of Embodiment 3 of the present invention;
Figure 4 is a flowchart of the steps for obtaining new words from the collected user personalized words;
Figure 5 is a structural block diagram of an embodiment of an input method system of the present invention;
Figure 6 is a structural block diagram of a new word acquisition device of the present invention;
Figure 7 is a structural block diagram of another new word acquisition device of the present invention;
Figure 8 is a flowchart of the steps of a lexicon generation method of the present invention.
Detailed Description
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
[...] for example, Chinese, Japanese, Korean, and so on. During coded input, the input method system offers candidate words to the user as information is entered, and the user selects the words needed; the words selected by the user can therefore be collected as the information source for acquiring new words. Because the invention is applied in essentially the same way in each of these languages, only its application to Chinese is described below, for convenience.
Referring to Figure 1, which is a flowchart of the steps of Embodiment 1 of the present invention, the method comprises the following steps:
Step 101: During user input, obtain the words selected by the user.
For a language whose text must be entered through codes, the user inputs a code string and selects the desired word from the candidate words to complete the input. Step 101 records one item of the user's input behavior information, namely the words selected by the user. The code string may be a pinyin code or a glyph-based code; that is, the present invention is applicable to all kinds of input methods.
The user-selected words will include some of that user's personalized words. For example, the user may frequently need to input words such as "threshold" (阈值), "nine ministries and commissions" (九部委), or a particular person's name, but the original lexicon of the input method contains no such words, so they cannot be offered directly among the candidate words and the user has to select each character individually to obtain the desired personalized word. As another example, the user can use the manual word-creation function provided by the input method to create new characters and words that are absent from the original lexicon but that the user needs, so that the desired personalized words can then be selected during input. The present invention is able to pick out the user's personalized words from the words the user selects.
Step 102: Compare the user-selected words with existing words, and obtain the user's personalized words according to the comparison result.
The comparison may be performed each time the user confirms a selected word: the selected word is compared with the existing words, and if it satisfies the preset judgment rule it is determined to be a personalized word of the user and is recorded, either in the system lexicon or in a user personalized-word lexicon. In this case, the user-selected words of step 101 need only be held in a cache.
If step 101 first records the user-selected words in a user lexicon, and the input method system lexicon is used to store the existing words, then the comparison of step 102 may instead compare the user lexicon with the input method system lexicon at regular intervals, recording the words determined to be personalized in a user personalized-word lexicon or marking them in the user lexicon. This approach reduces the amount of computation performed during user input, so that extracting the user's input behavior does not interfere with the input behavior itself.
The preset rule for identifying the user's personalized words may be set by those skilled in the art as needed. In the simplest case, for example, the user's personalized words are obtained through the following steps: determining whether the user-selected word exists among the existing words; and if it does not, determining that the word is a personalized word of the user.
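The two comparison timings described for step 102 can be contrasted in a short sketch that uses the simplest rule (a word is personalized if it is absent from the existing words). The class and method names are hypothetical, and a real input method would hook these calls into its own candidate-selection events.

```python
from collections import Counter
from typing import Set


class PersonalizedWordTracker:
    """Tracks personalized words either per selection or in periodic batches."""

    def __init__(self, system_words: Set[str], compare_on_select: bool) -> None:
        self.system_words = system_words
        self.compare_on_select = compare_on_select
        self.user_lexicon: Counter = Counter()  # used only in batch mode
        self.personalized: Set[str] = set()

    def on_word_selected(self, word: str) -> None:
        if self.compare_on_select:
            # Immediate mode: compare right away; the selection itself is not stored.
            if word not in self.system_words:
                self.personalized.add(word)
        else:
            # Batch mode: only record the selection, keeping input-time work minimal.
            self.user_lexicon[word] += 1

    def run_batch_comparison(self) -> None:
        """Compare the whole user lexicon with the system lexicon (e.g. on a timer)."""
        self.personalized.update(
            word for word in self.user_lexicon if word not in self.system_words
        )
```

Both modes yield the same set of personalized words; batch mode simply defers the comparison until `run_batch_comparison` is called.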
Step 103: Collect the personalized words of each user.
The collecting may be performed as follows: the computing device of the input method user sends the user's personalized words to a word-collecting computing device in real time or at scheduled intervals; that is, the input method computing device preferably has an automatic sending module. Preferably, the collecting computing device takes the form of a server.
The collecting may also consist of input method users sending their own personalized words to the collection end at regular or irregular intervals, that is, the sending is initiated manually by the user. For example, each user sends their personalized words to a common e-mail address or a common server.
Of course, where the user's personalized words are stored in the user lexicon or the system lexicon, the lexicon containing those words may be sent to the collecting computing device in real time or at scheduled intervals. For example, collection can be achieved simply by having each user back up their lexicon to the server at regular or irregular intervals.
Furthermore, for a network input method (one that provides the user with only an input interface and a display interface and completes the whole input process by connecting to a server), collecting the users' personalized words is even simpler: the input method system used by the user is itself a server shared by multiple users, so the input behavior information of each user can be collected during use.
In fact, any approach capable of collecting the information is feasible for the present invention, and the possibilities are not enumerated one by one here.
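As one illustration of the automatic-sending variant, the client can serialize its personalized words and the collecting server can merge them per user. This sketch deliberately leaves out the network transport; the JSON message format, the `user` identifier, and all names are assumptions made for the example rather than details given by the source.

```python
import json
from typing import Dict, List, Set


def build_upload_payload(user_id: str, personalized_words: List[str]) -> str:
    """Client side: serialize the user's personalized words for sending."""
    # ensure_ascii=False keeps Chinese words readable in the payload.
    return json.dumps({"user": user_id, "words": personalized_words}, ensure_ascii=False)


class WordCollector:
    """Server side: accumulates the personalized words reported by each user."""

    def __init__(self) -> None:
        self.words_by_user: Dict[str, Set[str]] = {}

    def receive(self, payload: str) -> None:
        message = json.loads(payload)
        # Merge with anything the same user reported earlier.
        self.words_by_user.setdefault(message["user"], set()).update(message["words"])
```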
Step 104: Obtain new words from the personalized words.
In this step, new words can be obtained by removing duplicate words from all of the collected user personalized words. Other filtering or simplification approaches may also be used in this step to obtain new words.
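The duplicate-removal variant of step 104 amounts to taking the union of the words reported by all users. A minimal sketch follows, with a hypothetical function name and with first-seen order preserved purely as a convenience.

```python
from typing import Iterable, List


def merge_personalized_words(reports: Iterable[Iterable[str]]) -> List[str]:
    """Merge the personalized words of all users, dropping duplicates."""
    seen = set()
    merged = []
    for report in reports:          # one report per user
        for word in report:
            if word not in seen:    # keep only the first occurrence of each word
                seen.add(word)
                merged.append(word)
    return merged
```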
Preferably, the present invention obtains new words from the collected user personalized words through the following steps: counting the number of times each personalized word appears in a preset Internet page database; and if the number of occurrences of the personalized word is greater than or equal to a preset threshold, outputting the word as a new word.
Referring to Figure 2, which is a flowchart of the steps of Embodiment 2 of the present invention, the method comprises the following steps:
Step 201: During user input, obtain the words selected by the user;
Step 202: Collect the selected words of each user;
[...] (not read-only), step 202 need only collect the user-selected words from each user's user lexicon or system lexicon. Any of the collection approaches described above may be used, and they are not repeated here.
Step 203: Compare the user-selected words with existing words, and obtain the users' personalized words according to the comparison result;
Step 204: Obtain new words from the personalized words.
Embodiment 2 is conceptually very similar to Embodiment 1. The main difference is that the selected words of multiple users are collected first and then compared centrally, with the users' personalized words obtained from the comparison result. This approach reduces the number of comparison operations and lightens the load on the local input method system, but because the comparison is performed only after a large number of user-selected words have been gathered, it increases the system load on the server. Those skilled in the art may choose between Embodiment 2 and Embodiment 1 as needed.
Referring to FIG. 3, a flowchart of the steps of Embodiment 3 of the present invention: Embodiment 3 is a further refinement of Embodiment 1 and includes the following steps:
Step 301: during user input, record the user-selected words and their user word frequencies in a user lexicon;
A user lexicon is created on the client side to record the words selected by the user and their user word frequencies, where the user word frequency is the frequency with which the user inputs a given word. This step can record the user's input behavior in full, regardless of whether a word is a new word.
Alternatively, no user lexicon need be set up: the system lexicon of the input method can be made modifiable, and the user-selected words and their user word frequencies can be recorded directly in the system lexicon.
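The recording in Step 301 can be sketched as a simple counter. This is only an illustrative sketch: the class name `UserLexicon` and the method `record_selection` are hypothetical, and the user word frequency is assumed here to be a raw selection count, which the text does not specify.

```python
from collections import Counter


class UserLexicon:
    """Hypothetical in-memory user lexicon: word -> number of times selected."""

    def __init__(self):
        self.freq = Counter()

    def record_selection(self, word):
        # Every selected word is recorded, whether or not it is a new word.
        self.freq[word] += 1


# Usage: record the candidates a user picks while typing.
lexicon = UserLexicon()
for chosen in ["sogou", "pinyin", "sogou"]:
    lexicon.record_selection(chosen)
```

A persistent implementation would write the counts to storage; an in-memory counter is used here only to keep the sketch short.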
Step 302: compare the user lexicon with the system lexicon, and obtain the user personalized words from the comparison result. This may be done in any of the following ways.
First way: determine whether the user-selected word exists among the existing words; if it does not, determine that the word is a user personalized word.
Second way: determine whether the user-selected word exists among the existing words. If it does not, further check the word's user word frequency; if that frequency is greater than or equal to a predetermined threshold, determine that the word is a personalized word. If the word does exist, it can be determined to be a non-personalized word.
Third way: determine whether the user-selected word exists among the existing words. If it does not, determine that the word is a user personalized word. If it does, further compare the word's user word frequency with its system word frequency, where the system word frequency is the frequency information preset in the input method's system lexicon for that existing word; if the ratio of the user word frequency to the system word frequency is greater than or equal to a predetermined threshold, determine that the word is a personalized word. Using the user word frequency in this way makes it possible to capture words that were previously uncommon but are now in frequent use, that is, new words whose scope or context of use has changed. The ratio used here is one preferred choice; other suitable parameters may also be used for the evaluation.
Fourth way: determine whether the user-selected word exists among the existing words. If it does not, further check the word's user word frequency; if that frequency is greater than or equal to a predetermined threshold, determine that the word is a personalized word. If it does exist, further compare the word's user word frequency with its system word frequency, where the system word frequency is the frequency information preset in the input method's system lexicon for that existing word; if the ratio of the user word frequency to the system word frequency is greater than or equal to a predetermined threshold, determine that the word is a personalized word. This way is a preferred example of the present invention and yields more accurate user personalized words.
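The fourth way can be sketched as a single decision function. The function name, the parameter names, and both threshold values are hypothetical. The sketch also assumes that user and system word frequencies are expressed on comparable scales, which the text does not state.

```python
def is_personalized(word, user_freq, system_lexicon,
                    abs_threshold=5, ratio_threshold=3.0):
    """Decide whether `word` is a personalized word (fourth way).

    system_lexicon maps each existing word to its preset system word frequency.
    """
    system_freq = system_lexicon.get(word)
    if system_freq is None or system_freq <= 0:
        # Word is absent from the existing words (a non-positive preset
        # frequency is treated the same way, as an assumption of this sketch):
        # require a minimum user word frequency.
        return user_freq >= abs_threshold
    # Word already exists: compare user frequency against system frequency.
    return user_freq / system_freq >= ratio_threshold


system_lexicon = {"hello": 10.0, "cloud": 1.0}
print(is_personalized("blog", 6, system_lexicon))    # absent, frequent enough
print(is_personalized("cloud", 4, system_lexicon))   # present, ratio 4.0
```

Dropping the ratio branch gives the second way, and dropping the absolute threshold in the first branch gives the third way.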
Step 303: collect the personalized words of each user.
Step 304: obtain new words from the personalized words.
In this step, new words may be obtained by removing duplicates from all of the collected user personalized words. Other filtering or simplification methods may also be used to obtain new words. This is described in detail below with reference to FIG. 4.
Step 305: generate a new-word lexicon from the output new words, or add the new words to the original lexicon, to obtain a new-word lexicon or a new version of the full lexicon.
This step organizes the new words obtained in Step 304 into a lexicon that can be used in input method systems or in search.
For example, to update an ordinary input method: the input method system containing the system lexicon resides on a first computing device, and the resulting new-word lexicon or new full lexicon resides on a second computing device; an input method system that needs a lexicon update connects from the first computing device to the second computing device to update its system lexicon.
The second computing device that stores the resulting new-word lexicon or new full lexicon may exist on the network as a server and provide a lexicon update service to any client program that needs new-word information for an input method. It need not take the form of a fixed server, however; it may instead be a local computing device that provides the lexicon update service, via P2P (peer-to-peer) technology, to client programs on other terminals that need such information.
In the update embodiments above, the update may be performed as follows: the system lexicon is updated whenever the input method system itself is updated; or the server actively pushes an online update of the system lexicon; or the user initiates a request and the server returns data in response to update the system lexicon. Updates via removable storage or via a version upgrade may also be used. In short, any data update method may be used; the present invention places no limitation on this, and those skilled in the art may choose as needed.
As another example, to update a network input method: the units of the input method system that receive user input and display the corresponding characters reside on a first computing device; the resulting new-word lexicon or new full lexicon serves as the system lexicon of the input method system and resides on a second computing device; based on the information input by the user, the input method system obtains the corresponding information from the system lexicon on the second computing device and displays the corresponding characters on the first computing device, completing the text input.
In the example above, the new-word lexicon or new full lexicon obtained by the new word extraction method of the present invention can be used directly as the system lexicon of the input method system, so the lexicon is used online and no update operation is needed. The input method system is divided into two parts: the receiving and display units reside on the first computing device and the lexicon resides on the second, which allows the input method to be used online. The code-matching process required by the input method system may be placed on either computing device as needed.
Preferably, the present invention can also be applied to search. When a user's query keyword string contains a new word, the string can be segmented accurately using the lexicon obtained by the new word extraction method of the present invention, and the search is then performed on the segmentation result, which can improve the precision and coverage of the search results.
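To illustrate how an updated lexicon changes query segmentation, the sketch below uses forward maximum matching. The text does not name a segmentation algorithm, so this choice is an assumption made only for illustration; the function name `segment`, the `max_len` parameter, and the sample words are likewise hypothetical. Latin strings without spaces stand in for Chinese text.

```python
def segment(query, lexicon, max_len=8):
    """Forward maximum matching: at each position take the longest lexicon
    entry that matches, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(query):
        for size in range(min(max_len, len(query) - i), 0, -1):
            piece = query[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


old_lexicon = {"pinyin", "download"}
new_lexicon = old_lexicon | {"sogou"}   # lexicon after adding a new word

print(segment("sogoupinyindownload", old_lexicon))
print(segment("sogoupinyindownload", new_lexicon))
```

With the old lexicon the unknown word falls apart into single characters; with the new word added it is kept whole, which is the effect on search segmentation that the paragraph above describes.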
Preferably, referring to FIG. 4, the present invention may obtain new words from the collected user personalized words through the following steps:
Step 401: remove duplicate user personalized words;
Step 402: assign a weight to each Internet page, and store the pages whose weight is greater than or equal to a preset threshold in an Internet page database, thereby obtaining the preset Internet page database;
Step 403: count the number of times each personalized word appears in the preset Internet page database; if the number of occurrences is greater than or equal to a preset threshold, output the word as a new word.
Step 402 is optional. Its purpose is to obtain a curated Internet page database so that new words are screened accurately. Other methods may also be used to build the preset Internet page database.
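Steps 401 to 403 can be sketched as follows. The function name, the page representation (a dict with `text` and `weight` keys), and the threshold values are all hypothetical. Occurrences are counted by plain substring matching, which is a simplifying assumption of this sketch.

```python
def extract_new_words(personalized_words, pages,
                      weight_threshold=0.5, count_threshold=3):
    """Return the personalized words that pass Internet verification."""
    # Step 401: remove duplicate personalized words.
    candidates = set(personalized_words)

    # Step 402 (optional): keep only pages whose weight reaches the threshold.
    page_db = [p["text"] for p in pages if p["weight"] >= weight_threshold]

    # Step 403: count occurrences in the page database and apply the threshold.
    new_words = []
    for word in sorted(candidates):
        occurrences = sum(text.count(word) for text in page_db)
        if occurrences >= count_threshold:
            new_words.append(word)
    return new_words


pages = [
    {"text": "my blog is a blog about blog tools", "weight": 0.9},
    {"text": "podcast podcast podcast", "weight": 0.1},   # filtered out
    {"text": "a podcast and a blog", "weight": 0.8},
]
result = extract_new_words(["blog", "blog", "podcast", "zzz"], pages)
print(result)
```

Because the low-weight page is excluded in Step 402, "podcast" is counted only once and does not reach the count threshold.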
In the weighting of Step 402, an important case is to assign the weight according to when a page was created and what type of page it is. Page time matters greatly for word frequency statistics and therefore has a large effect on the weight: the further a page's time is from the time of the statistics, the lower its weight, and if the gap exceeds a certain value the page may be given a low weight or even excluded from the statistics altogether. Page type also strongly affects the statistics. Page type here generally refers to portals, forums, or other designated pages; these receive higher weights because they have many participants, are updated quickly, and better reflect the latest trends in word frequency. Page type can be determined using a rule base that stores the URLs of certain pages; pages at these URLs are treated as more important for word frequency statistics, words appearing on them are counted preferentially, and such pages are given larger weights.
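One possible weighting scheme for Step 402 is sketched below. The linear decay with age, the one-year cutoff, the doubling for priority hosts, and the host names in `PRIORITY_HOSTS` are all assumptions made for illustration. The text only requires that older pages receive lower weights and that pages listed in the rule base receive higher ones.

```python
from datetime import date
from urllib.parse import urlparse

# Hypothetical rule base of hosts treated as important for word statistics.
PRIORITY_HOSTS = {"news.example.com", "forum.example.org"}


def page_weight(url, published, stats_date,
                max_age_days=365, priority_bonus=2.0):
    """Weight a page by its age relative to the statistics date and by type."""
    age_days = max((stats_date - published).days, 0)
    if age_days > max_age_days:
        return 0.0                     # too old: excluded from the statistics
    recency = 1.0 - age_days / max_age_days    # newer pages weigh more
    host = urlparse(url).hostname
    bonus = priority_bonus if host in PRIORITY_HOSTS else 1.0
    return recency * bonus


stats_day = date(2007, 8, 1)
fresh_portal = page_weight("http://news.example.com/a", date(2007, 8, 1), stats_day)
fresh_other = page_weight("http://other.example.net/b", date(2007, 8, 1), stats_day)
stale_page = page_weight("http://news.example.com/c", date(2005, 1, 1), stats_day)
```

Duplicate or spam pages could be handled in the same function by returning a low weight, so that the threshold in Step 402 filters them out.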
Secondly, the present invention can also remove duplicate pages, pornographic pages, and spam pages by assigning them low weights, further ensuring the accuracy of new word verification.
Furthermore, for the results to be more accurate, the words that are counted should as far as possible be the users' "input [...] redundant page information and the like. Such redundant page information is generally invalid information; if it is not removed, it increases the computation required for new word extraction and makes the resulting word frequencies unobjective and the results inaccurate.
Correspondingly, the present invention also proposes two input-method-based new word acquisition systems. Because these systems carry out the foregoing methods, they are only briefly described below; for details not covered here, refer to the relevant parts above.
An input-method-based new word acquisition system includes:
a word extraction unit, connected to the input method system, for obtaining the words selected by the user during user input; a word comparison unit, connected to the word extraction unit, for comparing the user-selected words with the existing words and obtaining the user personalized words from the comparison result; a collection unit for collecting the personalized words of each user; and a new word acquisition unit for obtaining new words from the personalized words.
Another input-method-based new word acquisition system includes:
a word extraction unit, connected to the input method system, for obtaining the words selected by the user during user input; a collection unit for collecting the selected words of each user; a word comparison unit, connected to the collection unit, for comparing the user-selected words with the existing words and obtaining the user personalized words from the comparison result; and a new word acquisition unit for obtaining new words from the personalized words.
Referring to FIG. 5, the present invention also claims an input method system that includes an input interface unit 501, a display unit 502, and a system lexicon 503, and further includes:
a word extraction unit 504, connected to the input method system, for obtaining the words selected by the user during user input;
a word comparison unit 505, connected to the word extraction unit 504, for comparing the user-selected words with the existing words and obtaining the user personalized words from the comparison result. The user personalized words may be stored in the user lexicon 506, or stored in the system lexicon 503 and marked as such, or stored in a dedicated lexicon.
That is, in addition to ordinary word input, the input method system above can also be used to extract the user's personalized words. It may be an ordinary input method system, for example one in which the input interface unit, the display unit, and the system lexicon reside on the same computing device, and the system displays the corresponding characters locally by looking up the code entered by the user locally. It may also be a network input method system, for example one in which the input interface unit and the display unit reside on a first computing device and the system lexicon resides on a second computing device; based on the information input by the user, the system obtains the corresponding information from the second computing device and displays the corresponding characters on the first computing device.
The input method system may further include: a user lexicon 506 for storing the words selected by the user; and a communication unit 507 for sending the personalized words. Each user's input method system can send that user's personalized words to a common collecting computing device, so that a large amount of user input behavior is gathered and then analyzed to obtain new words that meet general needs and are linguistically meaningful.
To further improve the accuracy with which user personalized words are obtained, the input method system may further include:
a word frequency recording unit 508, connected to the input method system, for recording the user word frequency during user input, where the user word frequency is the frequency with which the user inputs a given word. In this case, the communication unit 507 may also be used to send the user word frequency information associated with the personalized words.
Preferably, the word comparison unit 505 may further include:
a first comparison sub-unit 5051 for determining whether the user-selected word exists among the existing words; if it does, the word is output to the third comparison sub-unit, and if it does not, the word is output to the second comparison sub-unit;
a second comparison sub-unit 5052 for further checking, when the user-selected word does not exist among the existing words, the word's user word frequency; if that frequency is greater than or equal to a predetermined threshold, the word is determined to be a personalized word;
a third comparison sub-unit 5053 for further comparing, when the user-selected word exists among the existing words, the word's user word frequency with its system word frequency, where the system word frequency is the frequency information preset in the input method's system lexicon for that existing word; if the ratio of the user word frequency to the system word frequency is greater than or equal to a predetermined threshold, the word is determined to be a personalized word.
The word comparison unit 505 described above is one preferred embodiment of the present invention. Other comparison rules may also be used, in which case the word comparison unit 505 may include other sub-units; these are not enumerated here.
The input interface unit 501 of the input method system above serves above all to let the user enter information and select words. It can also be used to switch between modes, for example: input language (such as Simplified/Traditional Chinese or Chinese/English), input mode (such as single-character, word, or sentence input), and input state (such as text, punctuation, or special symbols). The display unit 502 and the system lexicon 503 are well known to those skilled in the art and are not described in detail here.
Referring to FIG. 6, the present invention also provides a new word acquisition device, including:
a personalized word collection unit 601 for collecting the personalized words of each user. A user's personalized words may be obtained by the input method and sent automatically to the personalized word collection unit; or they may be set or compiled by the user and then sent to the personalized word collection unit; or the users may gather their personalized words in a fixed network space from which the personalized word collection unit retrieves them. In other words, the user personalized words in this embodiment need not be obtained from user input behavior; they may also be set or compiled by the users themselves.
a statistics unit 602 for counting the number of times each personalized word appears in a preset Internet page database;
a new word determination unit 603, connected to the statistics unit 602, for determining whether the number of occurrences of the personalized word is greater than or equal to a preset threshold and, if so, outputting the word as a new word.
The new word acquisition device above takes the collected personalized words of each user and verifies them against Internet information, thereby producing more accurate new words. Each user's personalized words may be obtained automatically from the user's input behavior, or may be set or compiled by the user.
The new word acquisition device may further include a lexicon generation unit 604 for generating a new-word lexicon from the output new words, or adding the new words to the original lexicon, to obtain a new-word lexicon or a new version of the full lexicon. The new-word lexicon or new full lexicon can be used to update the system lexicon of an input method or for search engine word segmentation, thereby improving the user's input accuracy and the accuracy of search results.
Preferably, the new word acquisition device may further include an Internet page database generation unit 605 for assigning a weight to each Internet page and storing the pages whose weight is greater than or equal to a preset threshold in the Internet page database.
Referring to FIG. 7, the present invention also discloses another new word acquisition device, including:
a word collection unit 701 for collecting the selected words of each user;
The word collection unit 701 may be connected directly to an existing input method system, for example a network input method, and collect each user's selected words in real time. The word collection unit 701 may also receive user-selected words sent in real time or periodically by each user's input method system, these words having been extracted by that user's input method system. The word collection unit 701 may also collect user-selected words by receiving the user lexicon or system lexicon sent by each user's input method system, where the user-selected words have been extracted by that user's input method system and stored in the user lexicon or system lexicon.
a word comparison unit 702, connected to the word collection unit, for comparing the user-selected words with the existing words and obtaining the user personalized words from the comparison result;
a new word acquisition unit 703 for obtaining new words from the personalized words.
Preferably, the word comparison unit 702 may further include:
a first comparison sub-unit 7021 for determining whether the user-selected word exists among the existing words; if it does, the word is output to the third comparison sub-unit, and if it does not, the word is output to the second comparison sub-unit;
a second comparison sub-unit 7022 for further checking, when the user-selected word does not exist among the existing words, the word's user word frequency; if that frequency is greater than or equal to a predetermined threshold, the word is determined to be a personalized word;
a third comparison sub-unit 7023 for further comparing, when the user-selected word exists among the existing words, the word's user word frequency with its system word frequency, where the system word frequency is the frequency information preset in the input method's system lexicon for that existing word; if the ratio of the user word frequency to the system word frequency is greater than or equal to a predetermined threshold, the word is determined to be a personalized word.
Preferably, the new word acquisition unit 703 may further include:
a statistics sub-unit 7031 for counting the number of times each personalized word appears in a preset Internet page database, thereby obtaining the word's Internet word frequency;
a new word determination sub-unit 7032, connected to the statistics sub-unit, for determining whether the Internet word frequency is greater than or equal to a preset threshold and, if so, outputting the word as a new word.
Preferably, the new word acquisition device may further include:
a lexicon generation unit 704 for generating a new-word lexicon from the output new words, or adding the new words to the original lexicon, to obtain a new-word lexicon or a new version of the full lexicon;
an Internet page database generation unit 705 for assigning a weight to each Internet page and storing the pages whose weight is greater than or equal to a preset threshold in the Internet page database.
Because the word collection unit can collect a word's user word frequency at the same time as it collects the user-selected word, the lexicon generated by the lexicon generation unit 704 may additionally include the user word frequency of each word. To ensure that each word in the lexicon corresponds to a single frequency, the user word frequency and the Internet word frequency may each be adjusted by a weight and then summed, giving each user personalized word a weighted word frequency. Filtering and removal are then performed on the weighted word frequency; for example, it is determined whether the weighted word frequency of a personalized word is greater than or equal to a preset threshold, and if so, the word is output as a new word.
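The weighted word frequency can be sketched as a weighted sum followed by a threshold test. The function names, the equal weights of 0.5, and the threshold are hypothetical; the text does not specify how the weights are chosen or how the two frequencies are normalized.

```python
def weighted_frequency(user_freq, internet_freq,
                       user_weight=0.5, internet_weight=0.5):
    """Combine user and Internet word frequencies into one weighted value."""
    return user_weight * user_freq + internet_weight * internet_freq


def select_new_words(candidates, threshold):
    """candidates maps word -> (user_freq, internet_freq).

    Returns word -> weighted frequency for words reaching the threshold."""
    selected = {}
    for word, (user_freq, internet_freq) in candidates.items():
        score = weighted_frequency(user_freq, internet_freq)
        if score >= threshold:
            selected[word] = score
    return selected


candidates = {"blog": (40, 10), "typo": (2, 0)}
selected = select_new_words(candidates, threshold=20)
print(selected)
```

In this example "blog" is heavily used by users but still rare on the Internet; the weighted value lets it pass even though its Internet word frequency alone would be below the threshold.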
An increase in a word's user word frequency takes some time, sometimes a long time, to show up as an increase in Internet word frequency statistics. The weighted word frequency takes both into account, yielding more accurate new words and frequencies and a better input experience for the user.
The present invention also discloses a lexicon generation method. Referring to FIG. 8, FIG. 8a and FIG. 8b respectively depict two embodiments of the lexicon generation method, detailed as follows:
The lexicon generation method shown in FIG. 8a includes the following steps:
Step 801a: collect the input behavior information of each user, where the input behavior information includes the words selected during user input and the corresponding user word frequencies. The collection may use any of the approaches mentioned above.
Step 802a: apply a weight correction to the user word frequencies of each word and compute the word's cumulative user word frequency. The weight correction may be performed after analyzing the user word frequencies reported for a given word; for example, these frequencies are first analyzed to find their distribution, and each value is then corrected according to its probability of occurrence or its distance from the typical range. The cumulative user word frequency computed after this correction filters out the accidental or malicious behavior of some users, giving a more objective and accurate value and thus a more accurate lexicon.
Step 803a: remove words whose cumulative user word frequency is less than or equal to a certain threshold. This is a preferred step, used to further improve the generality of the words included in the lexicon.
Step 804a: generate the lexicon, which includes the words and their corresponding cumulative user word frequencies.
Because an input method has a very large number of users, collecting the input behavior information of many of them yields a lexicon of general significance. The lexicon can be provided directly to an input method system as its system lexicon, or imported by users as a user lexicon and used together with the system lexicon.
Preferably, the lexicon generation method shown in FIG. 8a may further include the following steps:
Step 805a: compare the generated lexicon with an existing lexicon, remove the words that do not satisfy a preset rule according to the comparison result, and output the user personalized words. The preset rule can be set by those skilled in the art as needed, for example using any of the four ways of obtaining user personalized words from the comparison result described in step 302 above.
Step 806a: generate a personalized lexicon from the user personalized words.
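Steps 801a to 806a can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation described by the specification: the function names, the `(user_id, word, frequency)` record layout, and the `cap_factor` parameter are hypothetical, and the weight correction is simplified to clipping each user's frequency at a multiple of the median for that word, which is only one possible way of correcting a value by its distance from the typical range. Step 805a uses the simplest preset rule, dropping words that already exist.

```python
from collections import defaultdict
from statistics import median


def cumulative_user_frequencies(records, cap_factor=3.0):
    """Steps 801a-802a: sum each word's per-user frequencies after correction.

    records: iterable of (user_id, word, user_word_frequency) tuples.
    Assumed correction: clip each value at cap_factor * median for that word,
    so that one user's extreme count cannot dominate the total.
    """
    per_word = defaultdict(list)
    for _user_id, word, freq in records:
        per_word[word].append(freq)
    totals = {}
    for word, freqs in per_word.items():
        cap = cap_factor * median(freqs)
        totals[word] = sum(min(f, cap) for f in freqs)
    return totals


def generate_lexicon(records, threshold, cap_factor=3.0):
    """Steps 803a-804a: keep words whose cumulative frequency exceeds threshold."""
    totals = cumulative_user_frequencies(records, cap_factor)
    return {w: f for w, f in totals.items() if f > threshold}


def personalized_lexicon(lexicon, existing_words):
    """Steps 805a-806a, simplest rule: drop words that already exist."""
    return {w: f for w, f in lexicon.items() if w not in existing_words}


records = [
    ("u1", "geili", 4), ("u2", "geili", 5), ("u3", "geili", 500),  # u3 is an outlier
    ("u1", "nihao", 6), ("u2", "nihao", 7),
    ("u3", "jiong", 1),
]
lexicon = generate_lexicon(records, threshold=5)
personal = personalized_lexicon(lexicon, existing_words={"nihao"})
```

With these sample records, the outlier count of 500 for "geili" is clipped to 15, "jiong" is dropped at step 803a, and only "geili" remains in the personalized lexicon.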
The lexicon generation method shown in FIG. 8b includes the following steps:
Step 801b: collect the input behavior information of each user. The input behavior information includes the words the user selected during input and the user word frequency of each such word.
Step 802b: apply weight correction to the per-user word frequencies of each word and calculate the cumulative user word frequency of each word.
Step 803b: remove words whose cumulative user word frequency is less than or equal to a given threshold.
Step 804b: generate a lexicon that includes the words and their corresponding cumulative user word frequencies.
Step 805b: compare the generated lexicon with an existing lexicon, remove the words that do not satisfy a preset rule according to the comparison result, and output the user personalized words.
Step 806b: count the number of times each personalized word appears in a preset Internet page database to obtain its Internet word frequency.
Step 807b: apply weight correction to the cumulative user word frequency and the Internet word frequency of each personalized word and sum them to obtain the weighted word frequency of that word; if the weighted word frequency of the personalized word is greater than or equal to a preset threshold, output the word as a new word.
Step 808b: generate a new-word lexicon from the output new words, the new-word lexicon including the new words and their corresponding weighted word frequencies.
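Steps 806b to 808b amount to a weighted sum followed by a threshold test. The sketch below is illustrative only: the function names are hypothetical, the page database is modeled as a plain list of page texts, and the equal weights of 0.5 are an assumption, since the specification does not fix the weight values.

```python
def internet_frequency(word, pages):
    """Step 806b: count occurrences of `word` across the stored page texts."""
    return sum(page.count(word) for page in pages)


def weighted_frequency(user_cumulative, net_freq, user_weight=0.5, net_weight=0.5):
    """Step 807b: weight both frequencies and sum them (weights are assumed)."""
    return user_weight * user_cumulative + net_weight * net_freq


def new_word_lexicon(personal_words, pages, threshold,
                     user_weight=0.5, net_weight=0.5):
    """Steps 807b-808b: keep words whose weighted frequency reaches threshold.

    personal_words: dict mapping a personalized word to its cumulative
    user word frequency.
    """
    lexicon = {}
    for word, user_cumulative in personal_words.items():
        net_freq = internet_frequency(word, pages)
        score = weighted_frequency(user_cumulative, net_freq,
                                   user_weight, net_weight)
        if score >= threshold:
            lexicon[word] = score
    return lexicon


pages = ["geili geili", "very geili", "geili!"]
new_words = new_word_lexicon({"geili": 24, "typo": 6}, pages, threshold=10)
```

Here "geili" appears four times in the sample pages, so its weighted word frequency is 0.5 × 24 + 0.5 × 4 = 14 and it is output as a new word, while "typo" scores 3 and is discarded.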
The present invention also discloses a lexicon generation device, including the following components:
a collection unit, configured to collect the input behavior information of each user, the input behavior information including the words the user selected during input and the user word frequency of each such word;
a word frequency calculation unit, configured to apply weight correction to the per-user word frequencies of each word and calculate the cumulative word frequency of each word;
a lexicon generation unit, configured to generate a lexicon that includes the words and their corresponding cumulative word frequencies.
The lexicon generation device may further include: a personalized word determination unit, configured to compare the generated lexicon with an existing lexicon, remove the words that do not satisfy a preset rule according to the comparison result, and output the user personalized words.
Alternatively, the lexicon generation device may further include:
a personalized word determination unit, configured to compare the generated lexicon with an existing lexicon, remove the words that do not satisfy a preset rule according to the comparison result, and output the user personalized words;
a statistics unit, configured to count the number of times each personalized word appears in a preset Internet page database to obtain its Internet word frequency;
a weighted word frequency determination unit, configured to apply weight correction to the cumulative user word frequency and the Internet word frequency of each personalized word and sum them to obtain the weighted word frequency of that word;
a new word determination unit, configured to output the word as a new word if the weighted word frequency of the personalized word is greater than or equal to a preset threshold;
the lexicon generation unit generates a new-word lexicon from the output new words, the new-word lexicon including the new words and their corresponding weighted word frequencies.
Because the present invention uses word frequency statistics based on Internet information and takes user input behavior information as the source of new words, it can quickly and conveniently obtain a large number of new words that users use frequently. After being aggregated and filtered, these new words are continually supplied to input method users, so that users can keep up with changes in Internet information and input new words without going through a tedious word selection process every time. A new word can thus become the user's first-choice candidate, which raises the first-candidate hit rate when users input new words and makes the ordering of candidate words more reasonable.
Because space in this specification is limited, the method is described in more detail; where the description of the system is not exhaustive, refer to the corresponding parts above.
The method for acquiring new words, the new word acquisition system, the new word acquisition device, and the input method system provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the invention.

Claims

1. A method for acquiring new words, characterized by comprising:
during user input, obtaining the words selected by the user;
comparing the words selected by the user with existing words, and obtaining user personalized words according to the comparison result;
collecting the personalized words of each user;
obtaining new words according to the personalized words.
2. The method according to claim 1, characterized by further comprising:
during user input, recording a user word frequency, the user word frequency being frequency information on the user's input of the word.
3. The method according to claim 1, characterized in that the comparing is:
recording the words selected by the user in a user lexicon, storing the existing words in the system lexicon of the input method, and comparing the user lexicon with the system lexicon of the input method;
or directly comparing each word the user selects with the existing words.
4. The method according to claim 1, characterized in that the user personalized words are obtained by the following steps:
determining whether the word selected by the user exists among the existing words;
if it does not exist, determining that the word is a user personalized word.
5. The method according to claim 2, characterized in that the user personalized words are obtained by the following steps:
determining whether the word selected by the user exists among the existing words;
if it does not exist, further examining the user word frequency of the word;
if the user word frequency of the word is greater than or equal to a predetermined threshold, determining that the word is a personalized word.
6. The method according to claim 2, characterized in that the user personalized words are obtained by the following steps:
determining whether the word selected by the user exists among the existing words;
if it does not exist, determining that the word is a user personalized word;
if it exists, further comparing the user word frequency of the word with its system word frequency, the system word frequency being the word frequency information preset for the existing word in the system lexicon of the input method;
if the ratio of the user word frequency to the system word frequency is greater than or equal to a predetermined threshold, determining that the word is a personalized word.
7. The method according to claim 2, characterized in that the user personalized words are obtained by the following steps:
determining whether the word selected by the user exists among the existing words;
if it does not exist, further examining the user word frequency of the word; if the user word frequency of the word is greater than or equal to a predetermined threshold, determining that the word is a personalized word;
if it exists, further comparing the user word frequency of the word with its system word frequency, the system word frequency being the word frequency information preset for the existing word in the system lexicon of the input method; if the ratio of the user word frequency to the system word frequency is greater than or equal to a predetermined threshold, determining that the word is a personalized word.
8. The method according to claim 1, characterized by further comprising:
counting the number of times the personalized word appears in a preset Internet page database;
if the number of occurrences of the personalized word is greater than or equal to a preset threshold, outputting the word as a new word.
9. The method for extracting new words according to claim 8, characterized in that the preset Internet page database is obtained by the following steps:
assigning a weight to each Internet page;
storing the Internet pages whose weight is greater than or equal to a preset threshold in the Internet page database.
10. The method according to claim 1, characterized in that the collecting is: the input method user's computing …
11. The method according to claim 1, characterized by further comprising: generating a new-word lexicon from the output new words, or adding the obtained new words to the original lexicon, to obtain a new-word lexicon or a new version of the full lexicon.
12. A method for acquiring new words, characterized by comprising:
during user input, obtaining the words selected by the user;
collecting the selected words of each user;
comparing the words selected by the users with existing words, and obtaining user personalized words according to the comparison result;
obtaining new words according to the personalized words.
13. An input-method-based new word acquisition system, characterized by comprising:
a word extraction unit, connected to the input method system and configured to obtain the words selected by the user during user input;
a word comparison unit, connected to the word extraction unit and configured to compare the words selected by the user with existing words and obtain user personalized words according to the comparison result;
a collection unit, configured to collect the personalized words of each user;
a new word acquisition unit, configured to obtain new words according to the personalized words.
14. An input-method-based new word acquisition system, characterized by comprising:
a word extraction unit, connected to the input method system and configured to obtain the words selected by the user during user input;
a collection unit, configured to collect the selected words of each user;
a word comparison unit, connected to the collection unit and configured to compare the words selected by the users with existing words and obtain user personalized words according to the comparison result;
a new word acquisition unit, configured to obtain new words according to the personalized words.
15. An input method system, comprising an input interface unit, a display unit, and a system lexicon, characterized by further comprising:
a word extraction unit, connected to the input method system and configured to obtain the words selected by the user during user input;
a word comparison unit, connected to the word extraction unit and configured to compare the words selected by the user with existing words and obtain user personalized words according to the comparison result.
16. The input method system according to claim 15, characterized in that
the input interface unit, the display unit, and the system lexicon of the input method system are located in the same computing device;
or the input interface unit and the display unit of the input method system are located in a first computing device and the system lexicon is located in a second computing device, and the input method system, according to the information input by the user, obtains the corresponding information from the second computing device and displays the corresponding characters on the first computing device.
17. The input method system according to claim 15, characterized by further comprising:
a communication unit, configured to send the personalized words.
18. The input method system according to claim 15, characterized by further comprising:
a user lexicon, configured to store the words selected by the user.
19. The input method system according to claim 15, characterized by further comprising:
a word frequency recording unit, connected to the input method system and configured to record a user word frequency during user input, the user word frequency being frequency information on the user's input of the word.
20. The input method system according to claim 19, characterized in that the word comparison unit comprises:
a first comparison subunit, configured to determine whether the word selected by the user exists among the existing words; if it exists, the word is output to a third comparison subunit, and if it does not exist, the word is output to a second comparison subunit;
the second comparison subunit, configured to, when the word selected by the user does not exist among the existing words, further examine the user word frequency of the word; if the user word frequency of the word is greater than or equal to a predetermined threshold, the word is determined to be a personalized word;
the third comparison subunit, configured to, when the word selected by the user exists among the existing words, further compare the user word frequency of the word with its system word frequency, the system word frequency being the word frequency information preset for the existing word in the system lexicon of the input method; if the ratio of the user word frequency to the system word frequency is greater than or equal to a predetermined threshold, the word is determined to be a personalized word.
21. A new word acquisition device, characterized by comprising:
a personalized word collection unit, configured to collect the personalized words of each user;
a statistics unit, configured to count the number of times each personalized word appears in a preset Internet page database;
a new word determination unit, connected to the statistics unit and configured to determine whether the number of occurrences of the personalized word is greater than or equal to a preset threshold and, if so, output the word as a new word.
22. The new word acquisition device according to claim 21, characterized in that the collecting consists of the user's computing device sending the user personalized words to the personalized word collection unit in real time or at scheduled times.
23. The new word acquisition device according to claim 21, characterized by further comprising:
a lexicon generation unit, configured to generate a new-word lexicon from the output new words, or add the obtained new words to the original lexicon, to obtain a new-word lexicon or a new version of the full lexicon.
24. The new word acquisition device according to claim 21, characterized by further comprising:
an Internet page database generation unit, configured to assign a weight to each Internet page and store the Internet pages whose weight is greater than or equal to a preset threshold in the Internet page database.
25. A new word acquisition device, characterized by comprising:
a word collection unit, configured to collect the selected words of each user;
a word comparison unit, connected to the word collection unit and configured to compare the words selected by the users with existing words and obtain user personalized words according to the comparison result;
a new word acquisition unit, configured to obtain new words according to the personalized words.
26. The new word acquisition device according to claim 25, characterized in that the new word acquisition unit comprises:
a statistics subunit, configured to count the number of times each personalized word appears in a preset Internet page database;
a new word determination subunit, connected to the statistics subunit and configured to determine whether the number of occurrences of the personalized word is greater than or equal to a preset threshold and, if so, output the word as a new word.
27. The new word acquisition device according to claim 25, characterized by further comprising:
the word collection unit being further configured to collect the user word frequency corresponding to each word selected by the user;
a statistics subunit, configured to count the number of times each personalized word appears in a preset Internet page database to obtain its Internet word frequency;
a weighted word frequency determination subunit, configured to apply weight correction to the user word frequency and the Internet word frequency of the new word and sum them to obtain the weighted word frequency of the new word;
a new word determination subunit, configured to determine whether the weighted word frequency of the personalized word is greater than or equal to a preset threshold and, if so, output the word as a new word.
28. A lexicon generation method, characterized by comprising:
collecting the input behavior information of each user, the input behavior information including the words the user selected during input and the user word frequency of each such word;
applying weight correction to the per-user word frequencies of each word and calculating the cumulative user word frequency of each word;
generating a lexicon that includes the words and their corresponding cumulative user word frequencies.
29. The method according to claim 28, characterized by further comprising:
removing words whose cumulative user word frequency is less than or equal to a given threshold.
30. The method according to claim 28, characterized by further comprising:
comparing the generated lexicon with an existing lexicon, removing the words that do not satisfy a preset rule according to the comparison result, and outputting the user personalized words;
generating a personalized lexicon from the user personalized words.
31. The method according to claim 28, characterized by further comprising:
comparing the generated lexicon with an existing lexicon, removing the words that do not satisfy a preset rule according to the comparison result, and outputting the user personalized words;
counting the number of times each personalized word appears in a preset Internet page database to obtain its Internet word frequency;
applying weight correction to the cumulative user word frequency and the Internet word frequency of the personalized word and summing them to obtain the weighted word frequency of the new word;
if the weighted word frequency of the personalized word is greater than or equal to a preset threshold, outputting the word as a new word;
generating a new-word lexicon from the output new words, the new-word lexicon including the new words and their corresponding weighted word frequencies.
32. A lexicon generation device, characterized by comprising:
a collection unit, configured to collect the input behavior information of each user, the input behavior information including the words the user selected during input and the user word frequency of each such word;
a word frequency calculation unit, configured to apply weight correction to the per-user word frequencies of each word and calculate the cumulative word frequency of each word;
a lexicon generation unit, configured to generate a lexicon that includes the words and their corresponding cumulative word frequencies.
33. The device according to claim 32, characterized by further comprising:
a personalized word determination unit, configured to compare the generated lexicon with an existing lexicon, remove the words that do not satisfy a preset rule according to the comparison result, and output the user personalized words.
34. The device according to claim 32, characterized by further comprising:
a personalized word determination unit, configured to compare the generated lexicon with an existing lexicon, remove the words that do not satisfy a preset rule according to the comparison result, and output the user personalized words;
a statistics unit, configured to count the number of times each personalized word appears in a preset Internet page database to obtain its Internet word frequency;
a weighted word frequency determination unit, configured to apply weight correction to the cumulative user word frequency and the Internet word frequency of each personalized word and sum them to obtain the weighted word frequency of that word;
a new word determination unit, configured to output the word as a new word if the weighted word frequency of the personalized word is greater than or equal to a preset threshold;
the lexicon generation unit generates a new-word lexicon from the output new words, the new-word lexicon including the new words and their corresponding weighted word frequencies.
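
As an informal illustration that is not part of the claims, the two-branch decision recited in claims 7 and 20 can be sketched as below. The function name, the parameter names, and the default threshold values are hypothetical, and the sketch assumes that user and system word frequencies are expressed on comparable scales.

```python
def is_personalized_word(word, user_freq, system_freqs,
                         min_user_freq=3, min_ratio=2.0):
    """Return True if `word` counts as a user personalized word.

    system_freqs maps each existing word to its preset system word frequency.
    """
    if word not in system_freqs:
        # Unknown word: require that the user has used it often enough.
        return user_freq >= min_user_freq
    # Known word: the ratio test user_freq / system_freq >= min_ratio is
    # written as a product so a system frequency of zero cannot divide by zero.
    return user_freq >= min_ratio * system_freqs[word]


system_freqs = {"nihao": 10}
checks = [
    is_personalized_word("geili", 5, system_freqs),   # unknown, used often
    is_personalized_word("geili", 2, system_freqs),   # unknown, used rarely
    is_personalized_word("nihao", 25, system_freqs),  # known, ratio 2.5
    is_personalized_word("nihao", 15, system_freqs),  # known, ratio 1.5
]
```

Claim 4 corresponds to setting `min_user_freq` to 0, so that every word absent from the existing words is treated as personalized.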
PCT/CN2007/070419 2006-08-09 2007-08-06 Method and device for obtaining the new words and input method system WO2008022581A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200610109732.X 2006-08-09
CN200610109732A CN1924858B (en) 2006-08-09 2006-08-09 Method and device for fetching new words and input method system

Publications (1)

Publication Number Publication Date
WO2008022581A1 true WO2008022581A1 (en) 2008-02-28

Family

ID=37817498

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2007/070419 WO2008022581A1 (en) 2006-08-09 2007-08-06 Method and device for obtaining the new words and input method system

Country Status (2)

Country Link
CN (1) CN1924858B (en)
WO (1) WO2008022581A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109254972A (en) * 2018-07-23 2019-01-22 努比亚技术有限公司 A kind of offline order Word library updating method, terminal and computer readable storage medium
CN109472022A (en) * 2018-10-15 2019-03-15 平安科技(深圳)有限公司 New word identification method and terminal device based on machine learning

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398834B (en) * 2007-09-29 2010-08-11 北京搜狗科技发展有限公司 Processing method and device for input information and input method system
CN101470732B (en) * 2007-12-26 2012-04-18 北京搜狗科技发展有限公司 Auxiliary word stock generation method and apparatus
CN101290632B (en) * 2008-05-30 2011-09-14 北京搜狗科技发展有限公司 Input method for user words participating in intelligent word-making and input method system
CN101533310A (en) * 2009-04-02 2009-09-16 孙强国 Pinyin character word input and selection method
CN102163198B (en) * 2010-02-24 2014-10-22 北京搜狗科技发展有限公司 A method and a system for providing new or popular terms
CN102193920B (en) * 2010-03-04 2016-01-20 深圳市世纪光速信息技术有限公司 A kind of name word stock generating method, device and character input system
CN102270048B (en) * 2010-06-03 2016-04-20 北京搜狗科技发展有限公司 A kind of method and system of noun input
CN102298581B (en) * 2010-06-23 2015-11-25 深圳市腾讯计算机系统有限公司 A kind of disposal route of input method dictionary and device
CN102508554A (en) * 2011-10-02 2012-06-20 上海量明科技发展有限公司 Input method with communication association, personal repertoire and system
CN103324627A (en) * 2012-03-21 2013-09-25 宇龙计算机通信科技(深圳)有限公司 Terminal and input processing method
CN102982070A (en) * 2012-10-26 2013-03-20 北京百度网讯科技有限公司 Word bank updating method and system and cloud server used for input method application program
CN104345899B (en) * 2013-08-08 2018-01-19 阿里巴巴集团控股有限公司 Field conversion method and client for input method
WO2016058138A1 (en) 2014-10-15 2016-04-21 Microsoft Technology Licensing, Llc Construction of lexicon for selected context
CN105069064B (en) * 2015-07-29 2019-04-30 百度在线网络技术(北京)有限公司 Acquisition methods and device, the method for pushing and device of vocabulary
KR102462365B1 (en) * 2016-02-29 2022-11-04 삼성전자주식회사 Method and apparatus for predicting text input based on user demographic information and context information
CN105956158B (en) * 2016-05-17 2019-08-09 清华大学 The method that network neologisms based on massive micro-blog text and user information automatically extract
CN107544685A (en) * 2016-06-29 2018-01-05 百度在线网络技术(北京)有限公司 Information-pushing method and device
CN106294650B (en) * 2016-08-03 2019-08-20 北京金和网络股份有限公司 Neologisms method for digging a little is buried based on search
CN109426356B (en) * 2017-09-01 2022-07-15 百度在线网络技术(北京)有限公司 Information input method and device
CN108733650B (en) * 2018-05-14 2022-06-07 科大讯飞股份有限公司 Personalized word obtaining method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1570901A (en) * 2003-07-23 2005-01-26 台达电子工业股份有限公司 Hand-held interactive dictionary enquiry device and method
CN1629836A (en) * 2003-12-17 2005-06-22 北京大学 Method and apparatus for learning Chinese new words

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7478033B2 (en) * 2004-03-16 2009-01-13 Google Inc. Systems and methods for translating Chinese pinyin to Chinese characters

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109254972A (en) * 2018-07-23 2019-01-22 努比亚技术有限公司 Offline command word bank updating method, terminal and computer readable storage medium
CN109254972B (en) * 2018-07-23 2022-09-13 上海法本信息技术有限公司 Offline command word bank updating method, terminal and computer readable storage medium
CN109472022A (en) * 2018-10-15 2019-03-15 平安科技(深圳)有限公司 New word identification method and terminal device based on machine learning

Also Published As

Publication number Publication date
CN1924858A (en) 2007-03-07
CN1924858B (en) 2010-05-12

Similar Documents

Publication Publication Date Title
WO2008022581A1 (en) Method and device for obtaining the new words and input method system
CN108304375B (en) Information identification method and equipment, storage medium and terminal thereof
JP5647508B2 (en) System and method for identifying short text communication topics
WO2008014702A1 (en) Method and system of extracting new words
CN109726274B (en) Question generation method, device and storage medium
CN111831802B (en) Urban domain knowledge detection system and method based on LDA topic model
KR102170206B1 (en) Information Search System and Method using keyword and relation information
WO2007143914A1 (en) Method, device and inputting system for creating word frequency database based on web information
CN109783631B (en) Community question-answer data verification method and device, computer equipment and storage medium
KR20080068825A (en) Selecting high quality reviews for display
US8793120B1 (en) Behavior-driven multilingual stemming
WO2008028421A1 (en) Method for obtaining new encode character string, inputting method system and word base generation device
CN107688616A (en) Show unique fact of entity
WO2023108980A1 (en) Information push method and device based on text adversarial sample
Bykau et al. Fine-grained controversy detection in Wikipedia
CN113204953A (en) Text matching method and device based on semantic recognition and device readable storage medium
JP5302614B2 (en) Facility related information search database formation method and facility related information search system
CN112597768B (en) Text auditing method, device, electronic equipment, storage medium and program product
CN110019763B (en) Text filtering method, system, equipment and computer readable storage medium
CN111488453A (en) Resource grading method, device, equipment and storage medium
CN103226601A (en) Method and device for image search
JP6942759B2 (en) Information processing equipment, programs and information processing methods
JP5179564B2 (en) Query segment position determination device
KR101758555B1 (en) Method and system for extracting topic expression
CN110209804B (en) Target corpus determining method and device, storage medium and electronic device

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 07800906; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase (Ref country code: DE)

NENP Non-entry into the national phase (Ref country code: RU)

122 EP: PCT application non-entry in European phase (Ref document number: 07800906; Country of ref document: EP; Kind code of ref document: A1)