CN101364220A - Method for generating word frequency database based on user personality - Google Patents

Publication number
CN101364220A
Authority
CN
China
Legal status
Pending
Application number
CNA2007101707166A
Other languages
Chinese (zh)
Inventor
林正昱
王正明
林国栋
Current Assignee
SHANGHAI APE-TECH CORP
Original Assignee
SHANGHAI APE-TECH CORP
Priority date
2007-11-21
Filing date
2007-11-21
Publication date
2009-02-11
Application filed by SHANGHAI APE-TECH CORP
Priority to CNA2007101707166A
Publication of CN101364220A

Abstract

The invention relates to a method for generating a word frequency database based on user characteristics, and in particular to a method for generating a personalized word frequency database that is updated in real time. The method comprises the following steps: the user visits a web page through a browser; the browser calls a word-capture plug-in; the web page information is segmented into words; word frequency statistics are computed for the entries; and the word frequency database is saved and updated in real time. The database is updated in real time without any download or update operation, and only the web pages browsed by the individual user need to be analyzed rather than all web pages. The resulting personalized word frequency database is low in cost and highly practical.

Description

Method for generating a word frequency database based on user characteristics
Technical field
The present invention relates to a method for generating a word frequency database based on user characteristics, and in particular to a method for generating a personalized word frequency database that is updated in real time.
Background art
To improve input speed, input methods all support phrase-based input, so the size of the lexicon and the usage frequency of its words have become key factors affecting input speed. In early input methods, the lexicon was fixed at first installation and could not be updated automatically. With the arrival of the information age, however, new phrases constantly enter daily communication, and these words cannot appear automatically in the lexicons of such input methods.
To address these problems, the input methods of Google and Sogou both provide automatic lexicon updates. Because both companies operate their own search engines, they can collect the words users search for most often, periodically compile these words into a lexicon, and place it on a server for their input methods to download at regular intervals. This approach solves the problem that the lexicon cannot be updated automatically, but the updated words are all popular words and are not necessarily what the current input method user cares about. Moreover, the new words must be downloaded over the network, so a user with a poor connection may be unable to obtain the latest lexicon.
Chinese invention patent application 200610086577.4, "Method and system for generating an input-method word frequency database based on Internet information", discloses a method that obtains Internet web pages with web crawler technology, segments the pages into words, computes statistics, and saves the results to a word frequency database. The word frequency database obtained this way is based on an enormous number of Internet pages, and each website needs a crawler assigned to monitor it continuously, which implies a very large workload and excessive investment. The words it provides are again popular words of general interest, and its word frequency database is updated by periodic download.
Every trade and profession has its own particularities, and every user would like the words he or she cares about most to be reachable as quickly as possible. Patent practitioners, for example, want patent-related words ranked as far forward as possible, such as the word "background", which is common in patent applications. "background" and "Beijing" are both entered with the pinyin "beijing", and after that pinyin is typed nearly every input method ranks "Beijing" first. The statistics of the prior art described above will certainly also show "Beijing" as more frequent than "background". Yet patent practitioners outside Beijing may use "background" far more often than "Beijing", and they want "background" to be the first candidate. In addition, a user visits many web pages every day, and the keywords of each page are all recorded in the lexicon. Some of these words are what the user needs, but not all of them, so when the browser plug-in adds these words to the word frequency database it does not move them to the front of the candidate list. For example, for someone in the wedding industry it is correct for "bridegroom" to be added to the word frequency database and ranked first. If that person repeatedly visits the Sina website to read news, "Sina" will also enter the word frequency database, but it would be inappropriate for "Sina" to overtake "bridegroom" and take first place.
In summary, the prior art provides only popular, frequently used words that reflect general usage. These new words must also be downloaded periodically over the network, which hinders users with poor connections from obtaining the latest lexicon and requires frequent update operations. Every user wants the word frequency database to contain what he or she actually cares about, which requires the database to be personalized and to be updated immediately.
Summary of the invention
The purpose of the present invention is to provide a method for generating a word frequency database based on user characteristics, namely a method for generating a personalized word frequency database that is updated in real time. To solve the lexicon-update problem of current input methods, the present invention combines the browser with a word-capture plug-in: while the user browses web pages, the page content is analyzed automatically, the content the user pays attention to is obtained, and that content is added to the input method lexicon as words.
The invention solves the problem that a word frequency database must be updated by periodic download and cannot be updated in real time, and it remedies the deficiency of the prior art, which serves only popular usage and cannot reflect personal preferences.
In the method of the present invention, the user browses an Internet web page or a local page, and the word-capture plug-in calls an HTML analyzer to analyze the content of the page the user is visiting. The word-capture plug-in is a browser extension: it can communicate with the browser, obtain the browser's current state, and change the browser's default behavior. Through it, the pages the user visits are analyzed and the result is recorded at a designated location. The HTML analyzer calls the HTML interpreter to request the abstract syntax tree of the page, and the HTML interpreter returns that tree. The browser then calls the HTML analyzer to analyze the obtained abstract syntax tree; the HTML analyzer picks out the nodes that hold the page's key content, such as Meta and Title, and returns the page keywords. Next, the browser calls the HTML analyzer to segment the obtained keywords, and the HTML analyzer returns the list of phrases produced by segmentation. Finally, the browser calls the input method's add-phrase method for every phrase found, and the add-phrase method stores each phrase it receives in the lexicon.
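The extraction of key content from Meta and Title nodes can be sketched as follows. This is only a minimal illustration, assuming that Python's standard `html.parser` stands in for the HTML interpreter and that the key content lives in the `title` element and in `meta` elements named `keywords` or `description`. The names `KeyContentParser` and `extract_key_content` are hypothetical and do not come from the patent.

```python
from html.parser import HTMLParser


class KeyContentParser(HTMLParser):
    """Collects the <title> text and selected <meta> content attributes."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Attribute names are lowercased by HTMLParser; values are not.
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_key_content(html):
    """Return a dict with the page title and any keyword/description meta content."""
    parser = KeyContentParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), **parser.meta}
```

In a real browser extension the page's document tree would already be available, so a separate parsing pass like this one would probably not be needed; the sketch only shows which nodes are consulted.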
As a preferred technical solution:
The method for generating a word frequency database based on user characteristics comprises the following steps:
(1) the user accesses a web page through a browser;
(2) the browser calls the word-capture plug-in;
(3) the web page information is segmented into words;
(4) word frequency statistics are computed for the entries;
(5) the word frequency database is saved and updated in real time.
The web pages include Internet web pages and local pages.
The local pages include local area network pages and pages on the local hard disk.
The user is an individual user of a single terminal.
The word-capture plug-in analyzes only each web page that is activated.
The steps are repeated in an uninterrupted cycle.
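Steps (3) to (5) can be sketched as a small store that is updated on every page visit. This is an illustrative sketch under stated assumptions, not the patent's implementation: the database is assumed to be a JSON file, `WordFrequencyStore` and `on_page_visited` are hypothetical names, and `segment` is a placeholder that splits on whitespace and punctuation rather than performing real Chinese word segmentation.

```python
import json
import re
from collections import Counter
from pathlib import Path


def segment(text):
    """Placeholder segmenter (assumption): split on whitespace and separators."""
    return [w for w in re.split(r"[\s,;|、，；]+", text) if w]


class WordFrequencyStore:
    """Keeps per-user word counts and persists them after every update."""

    def __init__(self, path):
        self.path = Path(path)
        self.counts = Counter()
        if self.path.exists():
            self.counts.update(json.loads(self.path.read_text(encoding="utf-8")))

    def on_page_visited(self, key_text):
        words = segment(key_text)      # step (3): word segmentation
        self.counts.update(words)      # step (4): word frequency statistics
        self.path.write_text(          # step (5): save immediately
            json.dumps(self.counts, ensure_ascii=False), encoding="utf-8"
        )
        return words
```

Calling `on_page_visited` once per activated page gives the uninterrupted cycle described above, and no download is involved because the counts are built locally.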
The beneficial effects of the invention are:
1. real-time updating, with no download and no update operation required;
2. there is no need to process all web pages; only the pages browsed by the individual user are analyzed;
3. a personalized word frequency database reflecting user characteristics;
4. low cost and high practicability;
5. coverage of local area network and local page content that search engines cannot reach.
Description of drawings
Fig. 1 is the abstract syntax tree diagram of an embodiment of the invention.
Detailed description of the embodiments
The present invention is further described below with reference to embodiments. It should be understood that these embodiments serve only to illustrate the invention and not to limit its scope. It should further be understood that, after reading the teaching of the present invention, those skilled in the art may make various changes or modifications to it, and such equivalent forms likewise fall within the scope defined by the claims appended to this application.
Embodiment 1
Suppose the word frequency database contains only one phrase with the pinyin "Xing'Lang", namely "bridegroom". One day the user visits www.sina.com.cn, and the phrase "Sina", which has the same pinyin "Xing'Lang", is added to the word frequency database, but "Sina" is ranked after "bridegroom". The reason is that the user visits many web pages every day and the keywords of each page are all recorded in the lexicon. Some of these words are what the user needs, but not all of them, so when the browser plug-in adds these words to the word frequency database it does not move them to the front. One day the user needs to type "Sina": he enters "Xing'Lang", the input method lists "bridegroom" before "Sina", and the user chooses the second candidate, "Sina". When the user later needs to type "Sina" again, he enters "Xing'Lang" and the order listed by the input method has changed to "Sina", "bridegroom", so the first candidate is the "Sina" he needs. The converse also holds: the browser plug-in will never move "Sina" to first place merely because the user often visits the Sina website.
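The candidate ordering in this embodiment can be illustrated with a short sketch. The patent states only the resulting order, so the move-to-front rule used in `select` is an assumption, and `CandidateList`, `add_from_page`, and `select` are hypothetical names. The English glosses "bridegroom" and "Sina" stand in for the Chinese phrases.

```python
class CandidateList:
    """Candidates per pinyin key; only a user selection changes their order."""

    def __init__(self):
        self._by_pinyin = {}

    def add_from_page(self, pinyin, word):
        """Called by the plug-in: append a new word, never promote it."""
        entries = self._by_pinyin.setdefault(pinyin, [])
        if word not in entries:
            entries.append(word)

    def candidates(self, pinyin):
        """Return the candidates in the order the input method would list them."""
        return list(self._by_pinyin.get(pinyin, []))

    def select(self, pinyin, word):
        """Called when the user picks a candidate: move it to the front (assumed policy)."""
        entries = self._by_pinyin.get(pinyin, [])
        if word in entries:
            entries.remove(word)
            entries.insert(0, word)
```

Repeated visits to the same site call `add_from_page` again, which leaves the order unchanged; this matches the statement that frequent visits alone never put "Sina" first.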
Embodiment 2
To obtain the keywords of a web page, take the following page as an example:
Figure A200710170716D00071
The abstract syntax tree is obtained through the HTML interpreter (see Fig. 1).
The Meta node of the abstract syntax tree contains two attributes, Name and Content. Most websites use Content to express the key content of the page, which usually consists of words or phrases separated by symbols. The HTML analyzer is responsible for taking out the attributes that hold key content from the relevant nodes (the Meta node or similar nodes that contain the page's key content) and further segmenting this content into phrases acceptable to the user.
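Splitting the Content attribute into phrases can be sketched as below. The patent does not name the separator symbols, so the set used here (commas, semicolons, vertical bars, and their full-width Chinese counterparts) is an assumption, and `split_key_content` is a hypothetical helper name.

```python
import re

# Assumed separators: ASCII and full-width punctuation commonly used in keyword lists.
_SEPARATORS = re.compile(r"\s*[,;|，；、]+\s*")


def split_key_content(content):
    """Split a Meta Content string into unique phrases, keeping first-seen order."""
    seen = set()
    phrases = []
    for part in _SEPARATORS.split(content.strip()):
        if part and part not in seen:
            seen.add(part)
            phrases.append(part)
    return phrases
```

Content without separators, such as a page title written as running text, would still need a proper word segmenter, which this sketch does not attempt.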
Repeated tests show that the method of the present invention for generating a word frequency database based on user characteristics is indeed updated in real time, with no download and no update operation required. The invention does not need to process all web pages; it analyzes only the pages browsed by the individual user. It yields a personalized word frequency database reflecting user characteristics, is low in cost and highly practical, and covers local area network and local page content that search engines cannot reach.

Claims (6)

1. A method for generating a word frequency database based on user characteristics, comprising the following steps:
(1) the user accesses a web page through a browser;
(2) the browser calls the word-capture plug-in;
(3) the web page information is segmented into words;
(4) word frequency statistics are computed for the entries;
(5) the word frequency database is saved and updated in real time.
2. The method for generating a word frequency database according to claim 1, characterized in that the web pages include Internet web pages and local pages.
3. The web page according to claim 1 or 2, characterized in that the local pages include local area network pages and pages on the local hard disk.
4. The method for generating a word frequency database according to claim 1, characterized in that the user is an individual user of a single terminal.
5. The method for generating a word frequency database according to claim 1, characterized in that the word-capture plug-in analyzes only each web page that is activated.
6. The method for generating a word frequency database according to claim 1, characterized in that the steps are repeated in an uninterrupted cycle.
CNA2007101707166A 2007-11-21 2007-11-21 Method for generating word frequency database based on user personality Pending CN101364220A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2007101707166A CN101364220A (en) 2007-11-21 2007-11-21 Method for generating word frequency database based on user personality

Publications (1)

Publication Number Publication Date
CN101364220A true CN101364220A (en) 2009-02-11

Family

ID=40390591

Country Status (1)

Country Link
CN (1) CN101364220A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
DD01 Delivery of document by public notice

Addressee: Shanghai Ape-Tech Corp.

Document name: the First Notification of an Office Action

DD01 Delivery of document by public notice

Addressee: Shanghai Ape-Tech Corp.

Document name: Notification that Application Deemed to be Withdrawn

C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20090211