CN1306258A - Method for judging position correlation of a group of query keys or words on network page - Google Patents
- Publication number
- CN1306258A (application number CN 01109132)
- Authority
- CN
- China
- Prior art keywords
- word
- webpage
- speech
- adjacent
- position correlation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
A method for judging the position correlation of a group of query keywords or phrases on a web page is disclosed. For each keyword or phrase on a web page, the search engine computes its forward adjacent word or phrase and its backward adjacent word or phrase. Based on this information, it judges whether words or phrases that are adjacent in the user's query term are also adjacent on the web page. If they are, the weight of the web page is raised before the query results are output. The advantages of the method are low time and space complexity and high accuracy.
Description
The present invention relates to the technical field of information retrieval, and in particular to information retrieval techniques for Chinese and English Web search engine systems.
To improve service quality, a search engine system usually takes into account, when it outputs results, the position correlation within each retrieved web page of the keywords (characters or words) contained in the query term. For example, if these keywords are judged to occur together in a web page, that is, the web page contains them in the same order as the query term, the web page is placed near the front of the query results. Some search engine systems go further and output only web pages that match the user's query term exactly. Two approaches are commonly used to judge whether the keywords occur together in a web page:
1. full string matching;
2. recording, during web page analysis, every position at which each keyword occurs in the web page, and then judging position correlation from this position information.
The first approach is simple, but it requires the entire content of every web page to be stored. This wastes storage space and is far too inefficient: more than 1 billion web pages are already stored on the WWW, and if the query term had to be string-matched against all of them, queries would be intolerably slow. The search engine system that has so far published its technique for analyzing the position correlation of query terms is the Google system of the United States (see S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, in Proceedings of the 7th World Wide Web Conference, 1998). Google maintains the largest Web information database in the world and is currently one of the best-known search engines. To judge the position correlation between the keywords of a query term, Google adopts the second approach.
The Google system consists of three parts: a web page gatherer, an indexer and a searcher. The gatherer collects web pages and analyzes them. When analyzing a web page, it records which keywords (characters or words) occur in the page, how many times each occurs, and the position of each occurrence. This yields the forward index shown in Fig. 1. From the forward index the indexer then generates the inverted index shown in Fig. 2. When a user submits a query term, Google's searcher first decomposes it into several keywords (unless the query term is itself a single keyword) and uses the inverted index to find the web pages that contain all of them; it then computes the weight of each of these web pages and outputs them sorted by weight. When computing a weight, it also computes the position correlation from the keyword positions recorded in the inverted index: the higher the correlation, the higher the additional weight, and the more likely the web page is to be ranked near the front.
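The position-recording approach described above can be sketched as follows. This is a minimal in-memory illustration, not Google's implementation: the function names `build_forward_index`, `invert` and `adjacent_in_doc` and the sample documents are assumptions made for the example.

```python
from collections import defaultdict

def build_forward_index(docs):
    """docs: {docid: [token, ...]} -> {docid: {token: [positions]}}."""
    forward = {}
    for docid, tokens in docs.items():
        hits = defaultdict(list)
        for pos, tok in enumerate(tokens):
            hits[tok].append(pos)          # one hit per occurrence
        forward[docid] = dict(hits)
    return forward

def invert(forward):
    """{docid: {token: hits}} -> {token: {docid: hits}}."""
    inverted = defaultdict(dict)
    for docid, hits in forward.items():
        for tok, positions in hits.items():
            inverted[tok][docid] = positions
    return dict(inverted)

def adjacent_in_doc(inverted, docid, first, second):
    """True if `second` occurs immediately after `first` in page `docid`."""
    a = inverted.get(first, {}).get(docid, [])
    b = set(inverted.get(second, {}).get(docid, []))
    return any(p + 1 in b for p in a)      # compares every hit of `first`

# Hypothetical sample pages, already segmented into tokens.
docs = {1: ["search", "engine", "project", "search", "group"],
        2: ["engine", "search"]}
inverted = invert(build_forward_index(docs))
```

Every occurrence adds one stored position, and the adjacency test has to walk the hit list, which is the cost the following paragraphs criticize.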
The second approach, which Google adopts, also suffers from excessive space and time complexity. First, every position at which each keyword occurs in a web page must be recorded, so the space complexity is very high. Second, when the searcher uses these positions to judge whether the keywords of the query term occur together, it must perform a large number of comparisons, so the time complexity is also very high and system performance suffers. In fact, to reduce space and time complexity Google limits the position information it records: it considers only the positions of the first 4K keywords of each web page. Even so, its space and time complexity remain high, and the limit introduces a further shortcoming: the position correlation of keywords that occur after the first 4K keywords of a web page cannot be judged, which degrades retrieval quality. The problem we set out to solve is to store as little information as possible, so as to reduce space complexity, while having little effect on the accuracy of the position correlation judgment, and at the same time to let that information support a judgment of position correlation in a very short time, that is, with low time complexity.
To avoid the large time and space complexity that the Google system incurs in judging position correlation, we have designed another method for judging the position correlation of a group of query keywords (characters or words) in a web page.
The content and technical solution of the present invention are as follows:
When the search engine system analyzes a web page, it first extracts the key characters and key words. In our method, the position of every occurrence of these keywords in the web page is no longer recorded; instead, for each keyword we determine only one adjacent character (or word) in front of it and one adjacent character (or word) behind it. When a query request is submitted, the searcher uses the position information recorded when the web page was gathered to judge whether characters (or words) that are adjacent in the user's query term are also adjacent in the web page. If they are, the weight of the web page is raised appropriately. In this way, web pages that preserve the adjacency relations of the user's query term are ranked at the front of the query results.
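The ranking effect described above can be illustrated with a small sketch. The text only says the weight is raised appropriately, so the multiplicative `boost` factor, the function name `rank_pages` and the sample weights are assumptions chosen for illustration.

```python
def rank_pages(candidates, boost=1.5):
    """candidates: list of (docid, base_weight, fully_adjacent).

    Pages whose query keywords are judged adjacent get their weight raised;
    the result is sorted by final weight, highest first.
    """
    ranked = []
    for docid, weight, fully_adjacent in candidates:
        if fully_adjacent:
            weight *= boost            # assumed form of the weight increase
        ranked.append((docid, weight))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked

# Hypothetical candidates: page "b" has a lower base weight than "a"
# but keeps the query's adjacency, so it overtakes "a".
results = rank_pages([("a", 1.0, False), ("b", 0.8, True), ("c", 0.5, True)])
```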
The forward adjacent character (or word) and backward adjacent character (or word) of a keyword in a web page are determined mainly by frequency. Although many keywords may occur immediately before a given keyword, there is usually one that is adjacent to it most often, and we take that one as the keyword's forward adjacent character (or word). The backward adjacent character (or word) of a keyword is computed in the same way.
Specifically, the forward and backward adjacent characters (or words) of a keyword are determined as follows. The gatherer first scans the web page from beginning to end and performs word segmentation on the sentences it contains, obtaining the set of key characters/words that occur in the web page and assigning each a number. It records the order in which each character/word first occurs in the web page, and records the adjacency information between neighboring characters/words, that is, the numbers of the forward adjacent and backward adjacent characters (or words). When the gatherer's scan ends, then for each key character/word, based on the recorded forward/backward adjacent characters/words and the number of times each of them occurred, the forward/backward adjacent character/word with the highest count is taken as its final forward/backward adjacent character/word. The gatherer uses the final forward/backward adjacency information to construct the forward index with position correlation information (Fig. 3). The indexer generates the inverted index with position correlation information (Fig. 4) from this forward index. At retrieval time, the searcher uses the forward/backward adjacency information of each keyword in the inverted index generated by the indexer to judge the position correlation of the queried keywords in a web page.
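The gatherer's scan can be sketched roughly as below. Several details are assumptions not fixed by the text: the page is treated as a single already-segmented token stream, numbering starts at 1 in order of first appearance, 0 stands for "no neighbor", and ties between equally frequent neighbors go to the one seen first. The function name `build_neighbor_records` is hypothetical.

```python
from collections import Counter, defaultdict

def build_neighbor_records(tokens):
    """Return {token: (my_no, prev_no, next_no)} for one segmented page."""
    number = {}                              # token -> number by first appearance
    prev_counts = defaultdict(Counter)       # token -> counts of tokens seen before it
    next_counts = defaultdict(Counter)       # token -> counts of tokens seen after it
    for i, tok in enumerate(tokens):
        number.setdefault(tok, len(number) + 1)
        if i > 0:
            before = tokens[i - 1]
            prev_counts[tok][before] += 1
            next_counts[before][tok] += 1

    def winner(counter):
        # Number of the most frequent neighbor, or 0 if there is none.
        if not counter:
            return 0
        return number[counter.most_common(1)[0][0]]

    return {tok: (no, winner(prev_counts[tok]), winner(next_counts[tok]))
            for tok, no in number.items()}

# "a" is followed by "b" twice and by "c" once, so its next_no is b's number.
records = build_neighbor_records(["a", "b", "a", "b", "a", "c"])
```

Only one record per distinct keyword survives the scan; the per-occurrence counts are discarded once the most frequent neighbors have been chosen.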
Description of the drawings:
Fig. 1: forward index of the Google search engine system
Fig. 2: inverted index of the Google search engine system
In Fig. 1 and Fig. 2, docid is the web page identifier; wordid is the identifier of a keyword (character or word); hit is a position (occupying 2 bytes) of the keyword corresponding to wordid within the web page corresponding to docid; nbits is the number of times that keyword occurs in that web page (that is, it indicates how many hits there are); ndocs is the number of web pages that contain the keyword corresponding to wordid.
Fig. 3: forward index with position correlation information
Fig. 4: inverted index with position correlation information
In Fig. 3 and Fig. 4, docid is the number of the corresponding web page; wordid is the number of a keyword (character or word) in the dictionary; my_no is the number within the web page of the keyword whose dictionary number is wordid; prev_no is the number within the web page of this keyword's forward adjacent character (or word); and next_no is the number within the web page of this keyword's backward adjacent character (or word).
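The fields of Fig. 3 and Fig. 4 suggest a record layout along the following lines. The figures themselves are not reproduced in the text, so the `Posting` class, the `invert` function and the sample numbers are assumptions; only the field names docid, wordid, my_no, prev_no and next_no come from the description.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Posting:
    docid: int      # web page number
    my_no: int      # keyword's number within that page
    prev_no: int    # in-page number of its forward adjacent keyword (0 = none, assumed)
    next_no: int    # in-page number of its backward adjacent keyword (0 = none, assumed)

def invert(forward):
    """forward: {docid: {wordid: (my_no, prev_no, next_no)}} -> {wordid: [Posting]}."""
    inverted = defaultdict(list)
    for docid in sorted(forward):
        for wordid, (my_no, prev_no, next_no) in forward[docid].items():
            inverted[wordid].append(Posting(docid, my_no, prev_no, next_no))
    return dict(inverted)

# Hypothetical forward index for two pages and two dictionary words.
forward = {7: {501: (1, 0, 2), 502: (2, 1, 0)},
           9: {502: (1, 0, 0)}}
inverted = invert(forward)
```

In this layout the length of each posting list plays the role of ndocs, and each posting has a fixed size regardless of how often the keyword occurs in the page.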
Fig. 5: structure of a typical Web search engine system using this method
In the figure, module (1) is the web page gatherer, module (2) is the raw database, module (3) is the indexer, module (4) is the index database, module (5) is the searcher, and module (6) is the user interface.
Fig. 6: example of a forward index with position correlation information
The invention is further described below with reference to an embodiment.
Suppose the following passage is the content of a certain web page.
Members of the new "Tianwang" search engine research group
Field leader: Li Xiaoming
Project leaders: Li Xiaoming
Wang Jianyong
Project developers:
Dan Songwei
Xie Zhengmao
Zhao Jianghua
Yan Hongfei
Chen Hua
Luo Chang
Guo Lin
Gong Bihong
After word segmentation, the gatherer (module 1 in Fig. 5) obtains the keyword (character or word) sequence {Tianwang, search engine, new, topic, group, member, field, responsible, person, Li, Xiao, Ming, project, responsible, person, Li, Xiao, Ming, Wang, Jian, Yong, project, develop, person, personnel, Dan, Song, Wei, Xie, Zheng, Mao, Zhao, Jiang, Hua, Yan, Hong, Fei, Chen, Hua, Luo, Chang, Guo, Lin, Gong, Bi, Hong}, in which the distinct keywords and their numbers are {Tianwang (1), search engine (2), new (3), topic (4), group (5), member (6), field (7), responsible (8), person (9), Li (10), Xiao (11), Ming (12), project (13), Wang (14), Jian (15), Yong (16), develop (17), personnel (18), Dan (19), Song (20), Wei (21), Xie (22), Zheng (23), Mao (24), Zhao (25), Jiang (26), Hua (27), Yan (28), Hong (29), Fei (30), Chen (31), Luo (32), Chang (33), Guo (34), Lin (35), Gong (36), Bi (37)}. From this information the gatherer constructs the forward index of the web page (Fig. 6) and stores it in module 2 of Fig. 5, the raw database. Given the forward index shown in Fig. 6, the indexer (module 3 of Fig. 5) can generate the inverted index very simply and stores it in module 4 of Fig. 5, the index database. When the user submits a query request, the user interface (module 6 of Fig. 5) intercepts the request and forwards it to the searcher (module 5 of Fig. 5), which uses the position correlation information to judge whether the keywords into which the query term is decomposed are adjacent in the web page, and then lets the degree of position correlation influence the ranking of the web page in the output.
Some examples illustrate how the searcher judges position correlation. If the user searches for "Chen Hua": because the number of the backward adjacent character of "Chen" is 27, and "Hua" is numbered 27, "Chen" and "Hua" can be judged to be adjacent. If the user searches for "Gong Bihong": the forward adjacent character of "Bi" is numbered 36 (the number of "Gong") and its backward adjacent character is numbered 29 (the number of "Hong"), so it can be inferred that the three characters "Gong", "Bi" and "Hong" are fully position-correlated.
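The two lookups above can be reproduced with a short sketch. Only the numbers stated in the text are filled in; Fig. 6 is not reproduced, so every other field is left as `None`. The rule that a pair counts as adjacent when either the first keyword's next_no or the second keyword's prev_no matches is an assumption that is consistent with both examples, and the syllable labels and function names are chosen for illustration.

```python
# token -> (my_no, prev_no, next_no); None marks values the text does not state.
page = {
    "Chen": (31, None, 27),
    "Hua":  (27, None, None),
    "Gong": (36, None, None),
    "Bi":   (37, 36, 29),
    "Hong": (29, None, None),
}

def pair_adjacent(page, first, second):
    """True if the stored neighbor numbers say `second` follows `first`."""
    a, b = page.get(first), page.get(second)
    if a is None or b is None:
        return False
    return a[2] == b[0] or b[1] == a[0]   # next_no of first, or prev_no of second

def fully_adjacent(page, query_tokens):
    """True if every consecutive pair of query tokens is adjacent on the page."""
    return all(pair_adjacent(page, x, y)
               for x, y in zip(query_tokens, query_tokens[1:]))

chen_hua = fully_adjacent(page, ["Chen", "Hua"])
gong_bi_hong = fully_adjacent(page, ["Gong", "Bi", "Hong"])
```

Each pair is decided with a constant number of integer comparisons, independent of how often the keywords occur in the page.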
The advantages and positive effects of the present invention are:
Compared with existing methods for judging position correlation, the method we propose for judging the position correlation of a group of query keywords (characters or words) in a web page has the following advantages and positive effects:
1. Lower space complexity, which saves storage space. Under this method, only 3 items of position correlation information need to be recorded for each keyword extracted from a web page. In Google's method, by contrast, every position at which a keyword occurs in the web page must be recorded. A selected keyword may occur very frequently in a web page, in some cases hundreds or even thousands of times, and recording each occurrence of a keyword requires more than 13 bits of storage (Google uses 16 bits). The space overhead of Google's method is therefore much larger than that of this method.
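A back-of-the-envelope comparison makes the space argument concrete. The 16 bits per hit is the figure quoted above for Google; the width of the three neighbor fields is not given in the text, so `FIELD_BITS = 16` is an assumption, as are the function names.

```python
HIT_BITS = 16      # bits per recorded position, as quoted for Google
FIELD_BITS = 16    # assumed width of each of my_no, prev_no, next_no

def positional_bits(occurrences):
    """Storage for one keyword in one page when every position is recorded."""
    return occurrences * HIT_BITS

def neighbor_bits(occurrences):
    """Storage for one keyword in one page under the neighbor scheme."""
    return 3 * FIELD_BITS          # does not grow with the occurrence count

def saving_ratio(occurrences):
    return positional_bits(occurrences) / neighbor_bits(occurrences)

ratio_100 = saving_ratio(100)      # keyword occurring 100 times in a page
ratio_1000 = saving_ratio(1000)    # keyword occurring 1000 times in a page
```

Under these assumptions the neighbor record is the larger of the two for a keyword that occurs only once or twice in a page, so the saving comes from the frequently occurring keywords the paragraph refers to.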
2. Lower time complexity, which improves query response speed. Under this method, whether two query keywords are adjacent can be judged very quickly from the forward adjacent and backward adjacent characters (or words) of each keyword that occurs in the web page. Under Google's method, all the positions of the two query keywords in the web page must be fetched and position correlation then judged from those positions, which requires a large number of comparisons and slows down queries.
3. It readily handles the large-index problem caused by a small number of high-frequency characters (or words). Both Chinese and English web pages contain some high-frequency characters or words, such as "in" (the final character of the name "Jin Dazong" used below), and the probability that they occur in a web page is very high. According to statistics, more than 2,000,000 of every 3,000,000 web pages contain one such character. In other words, the number of index entries for that character in Fig. 2 exceeds 2,000,000 (that is, its ndocs > 2,000,000), and once a user queries such a keyword the time consumed becomes intolerable. One solution is to configure this small set of keywords as ignored characters (or words). This has its reasonable side, because users seldom query such high-frequency characters on their own. Handling them this simply, however, creates a new problem. For example, if the user searches for "Jin Dazong", then because the character "in" is ignored the search engine returns every web page in which "Jin Da" occurs, so pages containing other names, such as "golden ocean", may be ranked at the front of the output and query accuracy drops sharply. Position correlation solves this problem well: although the character "in" is ignored, the backward adjacent character (or word) Qafter of the character "Da" shows that "Da" is followed by "in", so the weight of the web page can be raised and the page moved forward, improving query accuracy. Under Google's method, if high-frequency characters such as "in" are configured as ignored, no position information for "in" is available, the full position correlation of "Jin Dazong" cannot be judged, and query accuracy falls.
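The ignored-character case can be sketched as follows. The text does not spell out how the stored next_no of "Da" is compared with a character that has no index entry of its own, so this sketch assumes the neighbor is stored in a form that can be compared directly with the query token. The token labels, the second page and its name "X", and the function name are all hypothetical.

```python
STOP_WORDS = {"Zong"}   # the ignored high-frequency character (no postings of its own)

# Hypothetical per-page records: token -> (prev_token, next_token).
pages = {
    "page_a": {"Jin": (None, "Da"), "Da": ("Jin", "Zong")},   # contains "Jin Da Zong"
    "page_b": {"Jin": (None, "Da"), "Da": ("Jin", "X")},      # contains "Jin Da X"
}

def matches_with_stop_words(record, query):
    """Check the query's adjacency using only the indexed (non-ignored) tokens."""
    for i, tok in enumerate(query):
        if tok in STOP_WORDS:
            continue                       # nothing is stored for ignored tokens
        if tok not in record:
            return False
        _prev_tok, next_tok = record[tok]
        # The indexed token's backward neighbor must equal the next query
        # token, even when that next token is itself ignored.
        if i + 1 < len(query) and next_tok != query[i + 1]:
            return False
    return True

query = ["Jin", "Da", "Zong"]
boosted = [name for name, record in pages.items()
           if matches_with_stop_words(record, query)]
```

Both pages would be returned by a lookup on "Jin" and "Da" alone; only the page whose "Da" record points at the ignored character receives the raised weight.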
4. Higher accuracy. This method is derived from an analysis of statistical regularities and can reflect the position correlation of the overwhelming majority of query keywords. Google, to save space, records position information only for the first 4096 keywords of a web page, so when the keywords of a user's query fall after the first 4096 keywords of the web page, position correlation cannot be judged accurately.
To test the effect of our method, part of the verification report is listed in Table 1 (taken from test results of the Peking University "Tianwang" search engine on 18 December 2000). As can be seen, a search for "Jin Dazong" retrieved 106607 articles in total, of which the first 1777 are fully position-correlated; a search for "A Night At Moscow Suburb" found 66 articles, of which the first 57 are fully position-correlated; a search for "Dawn 1000" returned 248 articles, of which the first 32 are fully position-correlated; and a search for "Five road junctions" found 4075 articles, of which only the first 758 are fully position-correlated. The example query terms are, respectively, a personal name, a song title, a product name and a place name; users always want such query terms to appear as a unit in a web page, since otherwise the match is meaningless. Our method finds the fully position-correlated articles and places them at the very front of the output, which makes the ordering of the query results more reasonable.
Table 1: Partial test results for position correlation
Query term | Total articles returned | Fully position-correlated articles |
---|---|---|
Jin Dazong | 106607 | 1777 |
A Night At Moscow Suburb | 66 | 57 |
Dawn 1000 | 248 | 32 |
Five road junctions | 4075 | 758 |
Claims (7)
1. A method for judging the position correlation of a group of query key characters or words in a web page, wherein the search engine system using the method mainly comprises three parts, namely a web page gatherer, an indexer and a searcher, characterized in that: for each key character or key word, one adjacent character/word in front of it and one adjacent character/word behind it are computed; when a query request is submitted, the searcher judges whether characters/words that are adjacent among the key characters/words of the user's query are also adjacent in the web page; if they are fully adjacent, the weight of the web page is raised appropriately, and the query results are output according to weight.
2. The method for judging the position correlation of a group of query key characters or words in a web page according to claim 1, characterized in that: the forward adjacent character/word and the backward adjacent character/word of a key character or key word are determined according to frequency.
3. The method for judging the position correlation of a group of query key characters or words in a web page according to claim 1 or 2, characterized in that: the gatherer first scans the web page from beginning to end, performs word segmentation on the sentences that occur, obtains the set of key characters/words that occur in the web page, records the order in which each character/word first occurs in the web page, and records the adjacency information between neighboring characters/words.
4. The method for judging the position correlation of a group of query key characters or words in a web page according to claim 3, characterized in that: when the gatherer's scan ends, then for each key character/word, based on its recorded forward/backward adjacent characters/words and the number of times each of them occurred, the forward/backward adjacent character/word with the highest count is taken as its final forward/backward adjacent character/word.
5. The method for judging the position correlation of a group of query key characters or words in a web page according to claim 4, characterized in that: the gatherer uses the final forward/backward adjacency information to construct the forward index with position correlation information.
6. The method for judging the position correlation of a group of query key characters or words in a web page according to claim 5, characterized in that: the indexer generates the inverted index with position correlation information from the forward index with position correlation information.
7. The method for judging the position correlation of a group of query key characters or words in a web page according to claim 6, characterized in that: when the user submits a query term, the searcher first decomposes the query term into several key characters or words, then uses the inverted index with position correlation information generated by the indexer to find the web pages that contain all of these key characters/words, and computes the weights of these web pages; it judges, from the position correlation information in the inverted index with position correlation information, whether the query characters/words are fully adjacent in these web pages, raises the weight of a web page in which they are adjacent, and finally outputs the query results in sorted order.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN 01109132 CN1227611C (en) | 2001-03-09 | 2001-03-09 | Method for judging position correlation of a group of query keys or words on network page |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1306258A true CN1306258A (en) | 2001-08-01 |
CN1227611C CN1227611C (en) | 2005-11-16 |
Family
ID=4657739
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN 01109132 Expired - Fee Related CN1227611C (en) | 2001-03-09 | 2001-03-09 | Method for judging position correlation of a group of query keys or words on network page |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1227611C (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100458797C (en) * | 2007-06-20 | 2009-02-04 | 精实万维软件(北京)有限公司 | Process for ordering network advertisement |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100390792C (en) * | 2003-09-08 | 2008-05-28 | 国际商业机器公司 | Uniform search system and method for selectively sharing distributed access-controlled documents |
CN1637741B (en) * | 2003-09-10 | 2010-07-21 | 微软公司 | Annotation management in pen-based computing system |
CN102708115B (en) * | 2004-08-09 | 2015-12-09 | 亚马逊技术股份有限公司 | Identify the method and system of the keyword for placing keyword target advertisement |
CN102708115A (en) * | 2004-08-09 | 2012-10-03 | 亚马逊技术股份有限公司 | Method and system for identifying keywords for use in placing keyword-targeted advertisements |
US9489449B1 (en) | 2004-08-09 | 2016-11-08 | Amazon Technologies, Inc. | Method and system for identifying keywords for use in placing keyword-targeted advertisements |
US10402431B2 (en) | 2004-08-09 | 2019-09-03 | Amazon Technologies, Inc. | Method and system for identifying keywords for use in placing keyword-targeted advertisements |
WO2008098467A1 (en) * | 2007-02-15 | 2008-08-21 | Erzhong Liu | Convenient method and system of electric text processing and retrieve |
CN103793418A (en) * | 2012-10-31 | 2014-05-14 | 珠海富讯网络科技有限公司 | Search method of real-time vertical search engine for security industry |
CN104778262A (en) * | 2015-04-21 | 2015-07-15 | 无锡天脉聚源传媒科技有限公司 | Searching method and searching device |
CN104778262B (en) * | 2015-04-21 | 2018-07-24 | 无锡天脉聚源传媒科技有限公司 | A kind of searching method and device |
CN106095779A (en) * | 2016-05-26 | 2016-11-09 | 达而观信息科技(上海)有限公司 | A kind of search method based on key word position and device |
CN110334269A (en) * | 2019-07-11 | 2019-10-15 | 中国船舶工业综合技术经济研究院 | A kind of information retrieval method and system |
CN110334269B (en) * | 2019-07-11 | 2021-05-07 | 中国船舶工业综合技术经济研究院 | Information retrieval method and system |
Also Published As
Publication number | Publication date |
---|---|
CN1227611C (en) | 2005-11-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10671676B2 (en) | Multiple index based information retrieval system | |
US9990421B2 (en) | Phrase-based searching in an information retrieval system | |
CN100405371C (en) | Method and system for abstracting new word | |
US9384224B2 (en) | Information retrieval system for archiving multiple document versions | |
CA2513850C (en) | Phrase identification in an information retrieval system | |
CN101246499B (en) | Network information search method and system | |
CA2813644C (en) | Phrase-based searching in an information retrieval system | |
CN1389811A (en) | Intelligent search method of search engine | |
US20060020571A1 (en) | Phrase-based generation of document descriptions | |
CN1335574A (en) | Intelligent semantic searching method | |
CN100562713C (en) | The information retrieval method of electronic navigation system and device | |
CN104375992A (en) | Address matching method and device | |
CN1227611C (en) | Method for judging position correlation of a group of query keys or words on network page | |
Chen et al. | Template detection for large scale search engines | |
CN101079064A (en) | Web page sequencing method and device | |
CN1818908A (en) | Feedbakc information use of searcher in search engine | |
CN1145899C (en) | Method for automatic generating abstract from word or file | |
CN102955812B (en) | A kind of method of index building storehouse, device and querying method and device | |
CN101876979B (en) | Query expansion method and equipment | |
CN102339294A (en) | Searching method and system for preprocessing keywords | |
CN103064847A (en) | Indexing equipment, indexing method, search device, search method and search system | |
CN111782699A (en) | Intelligent interest point searching method based on user history tile browsing records | |
CN110245275A (en) | A kind of extensive similar quick method for normalizing of headline | |
Kim et al. | Efficient processing of substring match queries with inverted q-gram indexes | |
Lo et al. | The numeric indexing for music data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C17 | Cessation of patent right | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20051116 Termination date: 20140309 |