CN101923556A - Method and device for searching webpages according to sentence serial numbers - Google Patents

Publication number
CN101923556A
CN101923556A CN2010101103153A CN201010110315A CN101923556A CN 101923556 A CN101923556 A CN 101923556A CN 2010101103153 A CN2010101103153 A CN 2010101103153A CN 201010110315 A CN201010110315 A CN 201010110315A CN 101923556 A CN101923556 A CN 101923556A
Authority
CN
China
Prior art keywords
sentence
webpage
punctuation mark
word
serial numbers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010101103153A
Other languages
Chinese (zh)
Other versions
CN101923556B (en)
Inventor
杜一华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yutian Information Technology Co.,Ltd.
Original Assignee
SHANGHAI LAISEEK INFORMATION TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI LAISEEK INFORMATION TECHNOLOGY CO LTD
Priority to CN 201010110315
Publication of CN101923556A
Application granted
Publication of CN101923556B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and device for searching web pages according to sentence serial numbers. The method comprises the following steps: A, obtaining a plurality of web pages and downloading them to a web page database; B, segmenting the web pages into sentences and assigning serial numbers to the sentences of each web page; C, building a forward index table that includes the sentence serial numbers; D, building an inverted index table that includes the sentence serial numbers; E, receiving a search term and decomposing it into at least one key character, keyword, or punctuation mark; and F, calculating, according to the inverted index table, the ranking weight of each web page that contains the key character, keyword, or punctuation mark, and outputting the search results. With this method and device, the ranking weight of web pages in which the sentences containing the key characters, keywords, or punctuation marks are at zero or small distance from one another can be increased, which moves those pages up in the ranking and improves users' satisfaction with the search.

Description

Method and apparatus for searching web pages according to sentence serial numbers
Technical field
The present invention relates to the fields of information retrieval and natural language processing, and in particular to a method and apparatus for searching web pages according to sentence serial numbers.
Background art
Existing mainstream search engines such as Google, Yahoo, and Baidu all search by key character or keyword. The index structures of these search engines must all contain key characters or keywords.
At the 7th World Wide Web Conference in 1998, Sergey Brin and Lawrence Page presented the paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine", which disclosed the index structure of the Google search engine. Both the forward index table and the backward index table of the Google search engine contain the positions within the page of the first 4K characters, words, or punctuation marks of each page the search engine downloads.
The invention patent No. ZL01109132.0, entitled "Method for judging the positional correlation of a group of query key characters or words in a web page", discloses the index structure of another search engine. Its forward index table and backward index table both contain, for each page the search engine downloads, the position in the page of each character, word, or punctuation mark, together with the positions in the page of the adjacent preceding and adjacent following character, word, or punctuation mark.
In existing index structures, neither the forward index table (Forward Index) nor the inverted index table (Inverted Index) contains any sentence information about the pages the search engine downloads. Existing search engines are therefore quite likely to return pages in which the key characters, keywords, or punctuation marks decomposed from the search term are scattered across several different sentences. For example, Yu Dafu's "An Evening Intoxicated by the Spring Breeze" contains the sentence "At her question, the hardships I had suffered over the past half year came back to me, one deck after another." With an existing mainstream search engine and the search term "one deck over the past half year", none of the top-ranked result pages has anything to do with this work by Yu Dafu. Among the results an existing search engine returns, a page in which "over the past half year" and "one deck" lie at the very beginning and the very end of the text may, with some probability, be given a high ranking weight, that is, be ranked near the top. For example, it might return a page with the following content: "On the evening of November 11, northern Guangdong welcomed its first timely rain in over the past half year. The rain fell from 6 p.m. until 6 a.m. and is still falling, only a little more lightly, and the air quality has also dropped slightly. For local people who have endured nearly six months of drought and were close to running out of drinking water, the arrival of this rain can truly be called timely! It has not only washed the city streets clean and brought fresh air, but also given great hope for the people's crops; everyone's joy is beyond words! Make the most of the rain to relax hearts that have been wound so tight!
The heavy rain has covered the beautiful night view of the Beijiang, a tributary of the Pearl River, with one deck of shadow." In this page, the keyword "over the past half year" is at the beginning and the keyword "one deck" is at the end. Clearly the two keywords are only loosely related in this page, and the page is not what the user is looking for.
Existing search engines do not segment downloaded pages into sentences and hold no sentence information about them. An existing search engine can therefore obtain only the positional distance between the decomposed key characters, keywords, or punctuation marks within a page, for example how many bytes apart they are. It cannot directly obtain their sentence distance within the page, that is, the absolute value of the difference between their sentence serial numbers. Consequently, existing search engines cannot guarantee that pages with a sentence distance of zero (the key characters, keywords, or punctuation marks are in the same sentence) or a small sentence distance (they are in adjacent or nearby sentences) are ranked near the top.
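The sentence distance defined above can be illustrated with a minimal Python sketch. The function names and the three labels are assumptions made for illustration; the text itself only defines the distance as the absolute difference of two sentence serial numbers.

```python
def sentence_distance(sentence_id_a, sentence_id_b):
    """Sentence distance: absolute difference of two sentence serial numbers."""
    return abs(sentence_id_a - sentence_id_b)


def describe(distance):
    """Hypothetical labels for the cases distinguished in the text."""
    if distance == 0:
        return "same sentence"
    if distance == 1:
        return "adjacent sentences"
    return "separate sentences"


# Two keywords that both occur in sentence 1 of a page.
print(describe(sentence_distance(1, 1)))
# One keyword in sentence 0, the other in sentence 7 of the same page.
print(describe(sentence_distance(0, 7)))
```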
Summary of the invention
In view of the above defects of the prior art, the technical problem to be solved by the present invention is to provide a method and apparatus for searching web pages according to sentence serial numbers that raise the ranking weight of pages in which the sentence distance of the key characters, keywords, or punctuation marks is zero or small, so that such pages rank higher and users are more satisfied with the search.
The invention discloses a method for searching web pages according to sentence serial numbers, comprising the following steps:
A) obtaining a number of web pages and downloading them to a web page database;
B) segmenting the web pages into sentences and assigning sentence serial numbers separately for each web page;
C) building a forward index table, the forward index table including the sentence serial numbers;
D) building an inverted index table, the inverted index table including the sentence serial numbers;
E) receiving a search term and decomposing it into at least one key character, keyword, or punctuation mark;
F) calculating, according to the inverted index table, the ranking weight of each web page that contains the key character, keyword, or punctuation mark, and outputting the search results.
Further, step B) comprises the following steps:
B1) the indexer scans each web page, tokenizes it, and records the position in the page of each word, character, or punctuation mark;
B2) sentences are segmented according to the position in the page of each word, character, or punctuation mark and the position in the page of the adjacent following punctuation mark;
B3) a serial number is assigned to each sentence, which determines the sentence serial number of each word, character, or punctuation mark.
Preferably, the sentence segmentation rule is: if a full stop, question mark, ellipsis, or exclamation mark is inside quotation marks and at the end of a paragraph, the sentence ends with that mark together with the closing quotation mark; if the full stop, question mark, ellipsis, or exclamation mark is outside quotation marks, the sentence ends with that mark.
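The segmentation rule can be sketched in Python as follows. This is only one possible reading of the rule, not the patented implementation: the function name, the choice of full-width punctuation characters, and the sample text (English words with full-width punctuation, invented for illustration) are all assumptions. Following the worked example in the embodiment, a paragraph break alone is assumed not to end a sentence, so a line ending in a colon joins the quoted line that follows it.

```python
TERMINATORS = "。？！…"          # full stop, question mark, exclamation mark, ellipsis
OPEN_QUOTE, CLOSE_QUOTE = "“", "”"


def split_sentences(text):
    """Split text into sentences; paragraphs are separated by newlines."""
    sentences, buf, depth = [], [], 0
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        i += 1
        if ch == "\n":
            continue  # assumption: a paragraph break alone does not end a sentence
        buf.append(ch)
        if ch == OPEN_QUOTE:
            depth += 1
        elif ch == CLOSE_QUOTE:
            depth = max(0, depth - 1)
        elif ch in TERMINATORS:
            # Absorb runs of terminators such as a two-character ellipsis.
            while i < n and text[i] in TERMINATORS:
                buf.append(text[i])
                i += 1
            if depth == 0:
                end = True  # terminator outside quotation marks
            else:
                # Inside quotation marks: the sentence ends only when the closing
                # quotation mark follows directly and the paragraph ends after it.
                end = (i < n and text[i] == CLOSE_QUOTE
                       and (i + 1 == n or text[i + 1] == "\n"))
                if end:
                    buf.append(CLOSE_QUOTE)
                    i += 1
                    depth = max(0, depth - 1)
            if end:
                sentences.append("".join(buf))
                buf = []
    if buf:
        sentences.append("".join(buf))
    return sentences


sample = ("I forgot “who am I？” entirely。She sighed：\n"
          "“Ah！Are you like me？”\n"
          "She fell silent。")
for serial_number, sentence in enumerate(split_sentences(sample)):
    print(serial_number, sentence)
```

Numbering the result with `enumerate` gives each sentence a serial number starting from zero, which corresponds to step B3.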
Preferably, the forward index table comprises the page serial number of each word, character, or punctuation mark; the word, character, or punctuation mark itself; its word serial number; and its sentence serial number.
Preferably, the inverted index table comprises each word, character, or punctuation mark; its word serial number; the number of web pages that contain it; the serial numbers of those pages; and its sentence serial number.
Further, in step F), for each web page that contains the key characters, keywords, or punctuation marks, the inverted index table is used to judge whether they belong to the same sentence. If they do, the ranking weight of the page is raised. If they do not, their sentence distance is calculated: if the sentence distance is large, the ranking weight of the page is lowered; if the sentence distance is small, the ranking weight of the page is raised.
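The adjustment in step F) can be sketched as a simple scaling function. The text does not give a formula, so the function name, the multiplicative form, and the values of `boost`, `near`, and `penalty` below are all hypothetical choices made only to show the direction of the adjustment.

```python
def adjust_weight(base_weight, distance, boost=2.0, near=1, penalty=0.5):
    """Scale a page's ranking weight by sentence distance (all factors assumed)."""
    if distance == 0:
        return base_weight * boost                               # same sentence: largest increase
    if distance <= near:
        return base_weight * (1 + (boost - 1) / (distance + 1))  # small distance: smaller increase
    return base_weight * penalty                                 # large distance: decrease


for d in (0, 1, 5):
    print(d, adjust_weight(1.0, d))
```

In a full ranking function this factor would be only one input among those listed for the ranking weight, such as domain authority or click-through rate.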
Preferably, the ranking weight of a web page is determined jointly by the sentence distance of the key characters, keywords, or punctuation marks; the authority of the domain hosting the page; the popularity of the page; whether the key characters, keywords, or punctuation marks appear in the URL, title, anchor text, or meta tags; the traffic and click-through rate of the page; and the registration data and public site data of the website hosting the page.
Preferably, if the key characters, keywords, or punctuation marks belong to the same sentence, further natural language processing is applied to that sentence.
The invention also discloses an apparatus for searching web pages according to sentence serial numbers, comprising:
a web page getter, used to obtain and download a number of web pages;
a web page database, used to store the downloaded web pages;
an indexer, used to segment the web pages into sentences, assign sentence serial numbers separately for each web page, and build a forward index table and an inverted index table that include the sentence serial numbers;
an index database, used to store the forward index table and the inverted index table;
a searcher, used to decompose a search term into at least one key character, keyword, or punctuation mark, calculate according to the inverted index table the ranking weight of each web page that contains the key character, keyword, or punctuation mark, and output the search results.
The web page getter, the web page database, the indexer, the index database, and the searcher are connected in sequence.
The beneficial effects of the present invention are as follows:
In the method and apparatus of the present invention, both the forward index table and the inverted index table include the sentence serial numbers of the web pages. By querying the sentence serial number information, the search engine can raise the ranking weight of pages in which the sentence distance of the key characters, keywords, or punctuation marks is zero or small, so that such pages rank higher and users are more satisfied with the search.
The method and apparatus of the present invention can judge directly from the sentence serial numbers of each character, word, or punctuation mark in each page whether two or more queried key characters, keywords, or punctuation marks belong to the same sentence or to nearby sentences, without a large number of comparison operations. The method and apparatus therefore have lower time complexity, which improves the response speed of the search and gives users a more efficient search experience.
The method and apparatus of the present invention can also provide a precondition for subsequent natural language processing. If two or more queried key characters, keywords, or punctuation marks belong to the same sentence, the search engine can apply further, deeper natural language processing to that sentence. For example, it can perform various kinds of syntactic analysis on the sentence, such as dependency parsing, to obtain the dependency relations between the words of the sentence and their heads; or it can perform sentence-level sentiment classification (polarity analysis) to learn the orientation of the sentence.
Description of drawings
Fig. 1 is a flow chart of the method for searching web pages according to sentence serial numbers of the present invention;
Fig. 2 is a schematic structural diagram of the forward index table of the method and apparatus of the present invention;
Fig. 3 is a schematic structural diagram of the inverted index table of the method and apparatus of the present invention;
Fig. 4 is a schematic structural diagram of the apparatus for searching web pages according to sentence serial numbers of the present invention.
Embodiment
The concept, specific structure, and technical effects of the present invention are further described below with reference to the accompanying drawings, so that its purpose, features, and effects can be fully understood.
As shown in Fig. 1, the invention discloses a method for searching web pages according to sentence serial numbers, comprising the following steps:
Step 101: obtain a number of web pages and download them to a web page database.
A search engine company obtains a number of web pages from the Internet through the web page getter and downloads them to the company's computers, that is, into the web page database.
Step 102: segment the web pages into sentences and assign sentence serial numbers separately for each web page.
First, the indexer scans each web page, tokenizes it, and records the position in the page of each word, character, or punctuation mark.
Second, sentences are segmented according to the position in the page of each word, character, or punctuation mark and the position in the page of the adjacent following punctuation mark.
Finally, a serial number is assigned to each sentence, which determines the sentence serial number of each word, character, or punctuation mark. Sentences are numbered independently within each web page.
Step 103: build a forward index table that includes the sentence serial numbers.
The forward index table comprises the page serial number of each word, character, or punctuation mark; the word, character, or punctuation mark itself; its word serial number; and its sentence serial number. The forward index table may also include information such as the position of each word, character, or punctuation mark in the page, that is, its offset.
Step 104: build an inverted index table that includes the sentence serial numbers.
The inverted index table comprises each word, character, or punctuation mark; its word serial number; the number of web pages that contain it; the serial numbers of those pages; and its sentence serial number. The inverted index table may also include information such as the position of each word, character, or punctuation mark in the page, that is, its offset.
Step 105: receive a search term and decompose it into at least one key character, keyword, or punctuation mark.
The user enters a search term, and the searcher decomposes it into several key characters, keywords, or punctuation marks. Of course, the search term the user enters may itself be a single key character, keyword, or punctuation mark, in which case the searcher does not need to decompose it.
Step 106: calculate, according to the inverted index table, the ranking weight of each web page that contains the key characters, keywords, or punctuation marks, and output the search results.
For each web page that contains the key characters, keywords, or punctuation marks, the inverted index table is used to judge whether they belong to the same sentence. If they do, the ranking weight of the page is raised. If they do not, their sentence distance is calculated: if the sentence distance is large, the ranking weight of the page is lowered; if the sentence distance is small, the ranking weight of the page is raised.
Referring to Fig. 2, the forward index table comprises the page serial number docid of each word, character, or punctuation mark; the words, characters, or punctuation marks word1, word2, word3, ...; their word serial numbers word id1, word id2, word id3, ...; and their sentence serial numbers sentence id1, sentence id2, sentence id3, .... The page serial number, the word, character, or punctuation mark itself, and its word serial number are each unique. The sentence serial number of a word, character, or punctuation mark, however, may be one or more, because a word, character, or punctuation mark can occur in several sentences of a page.
Of course, the forward index table may also include information such as the position of each word, character, or punctuation mark in the page, that is, its offset. Because offsets and similar information are already widely used in existing search engines, they are not described further here.
Referring to Fig. 3, the inverted index table comprises the words, characters, or punctuation marks word1, word2, word3, ...; their word serial numbers word id1, word id2, word id3, ...; the numbers of web pages that contain them, ndocs1, ndocs2, ndocs3, ...; their page serial numbers docid1, docid2, docid3, docid4, docid5, docid6, ...; and their sentence serial numbers sentence id1, sentence id2, sentence id3, sentence id4, sentence id5, sentence id6, .... The word, character, or punctuation mark itself, its word serial number, the number of pages that contain it, and each of its page serial numbers are unique. The sentence serial number of a word, character, or punctuation mark, however, may be one or more, because a word, character, or punctuation mark can occur in several sentences of a page.
Of course, the inverted index table may also include information such as the position of each word, character, or punctuation mark in the page, that is, its offset. Because offsets and similar information are already widely used in existing search engines, they are not described further here.
In the first embodiment of the present invention, the full content of the first web page is as follows (excerpted from Yu Dafu's "An Evening Intoxicated by the Spring Breeze"):
Because I had grown more dispirited day by day since last year, ideas such as "Who am I?", "What circumstances am I in now?", and "Am I sad at heart, or happy?" had almost all been forgotten. At her question, the hardships I had suffered over the past half year came back to me, one deck after another. So after hearing her question, I looked at her blankly and could not say a word for quite a while. Seeing me like this, she assumed that I too was a homeless outcast. At once a lonely expression came over her face, and with a slight sigh she said:
"Ah! Are you the same as I am?"
After this slight sigh, she said nothing more.
Referring to Fig. 4, the apparatus for searching web pages according to sentence serial numbers of the present invention, that is, search engine 40, downloads the first web page through web page getter 401 to the search engine company's computers, that is, to web page database 402.
Indexer 403 scans the first web page, tokenizes it, and records the position in the page of each word, character, or punctuation mark. Indexer 403 then segments sentences according to the position in the page of each word, character, or punctuation mark and the position in the page of the adjacent following punctuation mark.
A sentence is a grammatical unit, made up of words and phrases, that can stand as an independent utterance. In Chinese, a sentence should end with a full stop, question mark, ellipsis, or exclamation mark. If one of these marks appears inside quotation marks and at the end of a paragraph, the mark together with the closing quotation mark is defined as the end of the sentence. Of course, the segmentation rule of the present invention is not limited to this; the rule can be configured in indexer 403. For example, if a full stop, question mark, ellipsis, or exclamation mark appears inside quotation marks, the mark together with the closing quotation mark may also be defined as the end of the sentence even when it is at the beginning or in the middle of a paragraph.
After segmentation, a serial number is assigned to each sentence, which determines the sentence serial number of each word, character, or punctuation mark. Preferably, the sentence serial numbers are 0, 1, 2, 3, 4, .... The present invention is not limited to this, however: the sentence serial numbers may be 1, 2, 3, 4, ..., or 2, 3, 4, ..., and so on. The starting sentence serial number may be any integer.
In another embodiment of the present invention, the sentence serial numbers may be 1, 3, 5, 7, ..., or 2, 6, 10, 14, ..., and so on. The difference between consecutive sentence serial numbers may be any natural number.
In another embodiment of the present invention, the sentence serial numbers may be ..., 4, 3, 2, 1. That is, the sentence serial numbers may also decrease.
The sentence serial numbers need only be assigned uniformly according to a set rule to be applicable to the present invention.
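The numbering schemes listed above can all be produced by one start-and-step rule, sketched here in Python. The function names are hypothetical, and dividing the distance by the step size is an added assumption (not stated in the text) so that adjacent sentences are always at distance 1 regardless of the scheme.

```python
def assign_sentence_ids(count, start=0, step=1):
    """Serial numbers for `count` sentences under a uniform start/step rule."""
    if step == 0:
        raise ValueError("step must be non-zero so that serial numbers differ")
    return [start + i * step for i in range(count)]


def normalized_distance(id_a, id_b, step=1):
    """Sentence distance in units of sentences (assumed normalization)."""
    return abs(id_a - id_b) // abs(step)


print(assign_sentence_ids(4))                    # preferred scheme
print(assign_sentence_ids(4, start=1, step=2))   # odd numbers
print(assign_sentence_ids(4, start=2, step=4))   # step of four
print(assign_sentence_ids(4, start=4, step=-1))  # decreasing
```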
The first web page can be split into the following six sentences:
[0] Because I had grown more dispirited day by day since last year, ideas such as "Who am I?", "What circumstances am I in now?", and "Am I sad at heart, or happy?" had almost all been forgotten.
[1] At her question, the hardships I had suffered over the past half year came back to me, one deck after another.
[2] So after hearing her question, I looked at her blankly and could not say a word for quite a while.
[3] Seeing me like this, she assumed that I too was a homeless outcast.
[4] At once a lonely expression came over her face, and with a slight sigh she said: "Ah! Are you the same as I am?"
[5] After this slight sigh, she said nothing more.
Of course, depending on the segmentation rule configured in indexer 403, the first web page may be divided into fewer or more than six sentences. For example, the sentence with serial number zero could be further divided into four sentences.
Indexer 403 builds the forward index table and stores it in index database 404. The forward index table of the first web page is shown in Table 1, where docid is the page serial number of each word, character, or punctuation mark, word is the word, character, or punctuation mark itself, word id is its word serial number, and sentence id is its sentence serial number.
Table 1. Forward index table of the first web page
docid word word id sentence id
0 Because 0 0
0 From 1 0
0 Last year 2 0
0 Since 3 0
0 I 4 0,1,2,3,4
0 Just 5 0,2
0 One day 6 0
0 {。##.##1}, 7 0,1,2,3,4,5
0 Dispirited 8 0
0 Go down 9 0
0 10 0,1,2,3
0 Almost 11 0
0 {。##.##1}, 12 0
0 13 0
0 Be 14 0
0 What 15 0
0 The people 16 0
0 17 0
0 18 0,4
0 Now 19 0
0 The institute 20 0
0 The place 21 0
0 How 22 0
0 A kind of 23 0,4
0 Circumstances 24 0
0 At heart 25 0
0 Still 26 0
0 Sad 27 0
0 Happiness 28 0
0 These 29 0
0 Idea 30 0
0 All 31 0
0 Forget about 32 0
0 33 0,1,3,4,5
0 34 0,1,2,3,5
0 Warp 35 1
0 She 36 1,2,3,5
0 This 37 1
0 One asks 38 1
0 Again 39 1
0 40 1
0 Over the past half year 41 1
0 In privation 42 1
0 Situation 43 1
0 One deck 44 1
0 Think 45 1
0 Come out 46 1
0 So 47 2
0 Listen 48 2
0 Question 49 2
0 After 50 2
0 Dull 51 2
0 See 52 2,3
0 Quite a while 53 2
0 Say 54 2,4
0 No 55 2
0 Go out 56 2
0 Words 57 2
0 Come 58 1,2
0 This 59 3
0 Appearance 60 3
0 Think 61 3
0 Also be 62 3,4
0 One 63 3
0 Homeless 64 3
0 The outcast 65 3
0 On the face 66 4
0 Just 67 4,5
0 Immediately 68 4
0 Rise 69 4
0 Lonely 70 4
0 Expression 71 4
0 Slightly 72 4,5
0 Sighing 73 4
0 74 4
0 Sound of sighing 75 4
0 76 4
0 You 77 4
0 With 78 4
0 The same 79 4
0 {。##.##1}, 80 4
0 Sigh 81 5
0 One 82 5
0 Afterwards 83 5
0 No 84 5
0 Speak 85 5
In the second embodiment of the present invention, the full content of the second web page is as follows (excerpted from Wang Zhihuan's "On the Stork Tower"):
The white sun sets behind the mountains, and the Yellow River flows into the sea. To see a thousand li further, climb one deck higher in the tower.
Likewise, the second web page can be downloaded through web page getter 401 to the search engine company's computers, that is, to web page database 402. Indexer 403 segments the second web page into sentences and assigns sentence serial numbers.
The second web page can be split into the following two sentences:
[0] The white sun sets behind the mountains, and the Yellow River flows into the sea.
[1] To see a thousand li further, climb one deck higher in the tower.
Indexer 403 builds the forward index table and stores it in index database 404. The forward index table of the second web page is shown in Table 2.
Table 2. Forward index table of the second web page
Docid Word Word id Sentence id
1 Daytime 86 0
1 Comply with 87 0
1 The mountain 88 0
1 To the greatest extent 89 0
1 , 10 0,1
1 The Yellow River 90 0
1 Go into 91 0
1 The sea 92 0
1 Stream 93 0
1 34 0,1
1 Desire poor 94 1
1 A thousand li 95 1
1 Order 96 1
1 More go up 97 1
1 One deck 44 1
1 The building 98 1
As shown in Table 2, the sentences of each web page are numbered independently: in the second embodiment, sentence numbering restarts from zero. The page serial number docid and the word serial number word id, however, continue the numbering from Table 1. Note that ",", "。", and "one deck" were already assigned word ids 10, 34, and 44 respectively in Table 1, so Table 2 retains those word ids. It follows that, within the whole of search engine 40, the word serial number of each word, character, or punctuation mark is unique.
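The numbering behaviour described above (sentence serial numbers restart per page while word ids are shared across pages) can be sketched in Python. The function name, the row layout, and the tokenized sample pages are assumptions for illustration and do not reproduce Tables 1 and 2.

```python
def build_forward_index(docid, sentences, vocab):
    """Return rows (docid, word, word_id, sentence_ids) for one tokenized page.

    `vocab` maps token -> global word id and is shared by all pages, so a
    token keeps the word id it received on the first page that contained it.
    """
    occurrences = {}
    for sentence_id, tokens in enumerate(sentences):  # numbering restarts per page
        for token in tokens:
            vocab.setdefault(token, len(vocab))       # assign a new word id only once
            ids = occurrences.setdefault(token, [])
            if sentence_id not in ids:
                ids.append(sentence_id)
    return [(docid, token, vocab[token], ids) for token, ids in occurrences.items()]


vocab = {}
# Hypothetical pages, already segmented into sentences and tokenized.
page0 = [["she", "asked", "."], ["one deck", "came", "back", "."]]
page1 = [["sun", "sets", "."], ["one deck", "higher", "."]]
table_a = build_forward_index(0, page0, vocab)
table_b = build_forward_index(1, page1, vocab)
for row in table_a + table_b:
    print(row)
```

Concatenating the per-page row lists, as in the final loop, is the simplest form of merging them into one overall forward index.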
After Table 1 and Table 2 are completed, indexer 403 merges them into one overall forward index table. That is, indexer 403 builds a separate forward index table for each web page and then merges these tables into one overall forward index table. Merging forward index tables is prior art and is not described further here.
From Table 1 and Table 2, indexer 403 builds the inverted index table and stores it in index database 404. The inverted index table of the first and second web pages is shown in Table 3, where word is each word, character, or punctuation mark, word id is its word serial number, ndocs is the number of web pages that contain it, docid is the page serial number, and sentence id is its sentence serial number.
Table 3. Inverted index table of the first and second web pages
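word word id ndocs docid sentence id
[Rows for word id 0 to 45 appear only as images in the source: Figure GSA00000033598900141, Figure GSA00000033598900151]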
Come out 46 1 0 1
So 47 1 0 2
Listen 48 1 0 2
Question 49 1 0 2
After 50 1 0 2
Dull 51 1 0 2
See 52 1 0 2,3
Quite a while 53 1 0 2
Say 54 1 0 2,4
No 55 1 0 2
Go out 56 1 0 2
Words 57 1 0 2
Come 58 1 0 1,2
This 59 1 0 3
Appearance 60 1 0 3
Think 61 1 0 3
Also be 62 1 0 3,4
One 63 1 0 3
Homeless 64 1 0 3
The outcast 65 1 0 3
On the face 66 1 0 4
Just 67 1 0 4,5
Immediately 68 1 0 4
Rise 69 1 0 4
Lonely 70 1 0 4
Expression 71 1 0 4
Slightly 72 1 0 4,5
Sighing 73 1 0 4
74 1 0 4
Sound of sighing 75 1 0 4
76 1 0 4
You 77 1 0 4
With 78 1 0 4
The same 79 1 0 4
{。##.##1}, 80 1 0 4
Sigh 81 1 0 5
One 82 1 0 5
Afterwards 83 1 0 5
No 84 1 0 5
Speak 85 1 0 5
Daytime 86 1 1 0
Comply with 87 1 1 0
The mountain 88 1 1 0
To the greatest extent 89 1 1 0
The Yellow River 90 1 1 0
Go into 91 1 1 0
The sea 92 1 1 0
Stream 93 1 1 0
Desire poor 94 1 1 1
A thousand li 95 1 1 1
Order 96 1 1 1
More go up 97 1 1 1
The building 98 1 1 1
Note that ",", "." and "one deck" occur in both the first webpage and the second webpage. For each of them, the corresponding webpage count ndocs is therefore 2.
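The structure of Table 3 can be sketched as follows. This is an illustrative sketch only: the function name `build_inverted_index` and the dictionary layout are assumptions, and the sample rows are a small subset of the forward index entries listed in the tables (docid, word, word id, sentence ids).

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str, int, List[int]]  # (docid, word, word_id, sentence_ids)

# A few forward-index rows taken from the example tables.
forward_rows: List[Row] = [
    (0, "Lonely", 70, [4]),
    (0, "Expression", 71, [4]),
    (0, "One deck", 44, [1]),
    (1, "One deck", 44, [1]),
    (1, "The building", 98, [1]),
]


def build_inverted_index(rows: List[Row]) -> Dict[str, dict]:
    """Map each word to its word id, ndocs, and per-page sentence ids."""
    inverted: Dict[str, dict] = {}
    for docid, word, word_id, sentence_ids in rows:
        entry = inverted.setdefault(word, {"word_id": word_id, "postings": {}})
        entry["postings"].setdefault(docid, []).extend(sentence_ids)
    for entry in inverted.values():
        # ndocs is the number of distinct webpages containing the word.
        entry["ndocs"] = len(entry["postings"])
    return inverted


inverted = build_inverted_index(forward_rows)
```

Keeping the sentence ids grouped by docid is one possible layout; it lets the searcher compare sentence serial numbers of different keywords within the same webpage without scanning the page text.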
After the search user 406 enters a search term, the searcher 405 decomposes the search term into several key characters, keywords, or punctuation marks. Of course, the search term entered by the search user 406 may itself be a single key character, keyword, or punctuation mark, in which case the searcher 405 does not need to decompose it.
Using the sentence serial number information in Table 3, the searcher 405 determines whether the key characters, keywords, or punctuation marks obtained from the search term belong to the same sentence of a webpage, or to sentences that are a small sentence distance apart (for example, a sentence distance of 1, that is, adjacent sentences).
For example, suppose the search term entered by the search user 406 is "one deck over the past half year"; it is decomposed into the keywords "over the past half year" and "one deck". The searcher 405 looks up Table 3: the page sequence number docid of both keywords is 0 and the sentence serial number sentence id of both is 1, so the two keywords "over the past half year" and "one deck" can be judged to belong to the same sentence. As another example, suppose the search term is "a lonely expression"; it is decomposed into the keywords "lonely" and "expression". The searcher 405 looks up Table 3: the page sequence number docid of both keywords is 0 and the sentence serial number sentence id of both is 4, so the two keywords "lonely" and "expression" can be judged to belong to the same sentence.
Clearly, key characters, keywords, or punctuation marks that belong to the same sentence are more closely related. Other ranking conditions being equal, the ranking weight of the webpage that contains them should be raised (that is, the webpage should rank higher).
For webpages in which the key characters, keywords, or punctuation marks do not belong to the same sentence, the sentence distance between them (the absolute value of the difference of their sentence serial numbers) can be calculated. The ranking weight of a webpage with a small sentence distance should be raised, and the ranking weight of a webpage with a large sentence distance should be lowered.
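The sentence distance described above can be sketched as follows. The text does not specify how a distance is converted into a weight, so `proximity_factor` and its 1 / (1 + distance) form are purely illustrative assumptions, as is taking the minimum distance when a keyword occurs in several sentences. The sentence ids in the example come from Table 3.

```python
from typing import List, Optional


def min_sentence_distance(ids_a: List[int], ids_b: List[int]) -> Optional[int]:
    """Smallest absolute difference between two keywords' sentence ids.

    Both lists must refer to the same webpage (same docid).
    Returns None if either keyword does not occur on that page.
    """
    if not ids_a or not ids_b:
        return None
    return min(abs(a - b) for a in ids_a for b in ids_b)


def proximity_factor(distance: Optional[int]) -> float:
    """Hypothetical factor: 1.0 for the same sentence, smaller when farther."""
    if distance is None:
        return 0.0
    return 1.0 / (1 + distance)


# "Say" occurs in sentences 2 and 4, "Slightly" in 4 and 5 (webpage 0).
same_sentence = min_sentence_distance([2, 4], [4, 5])
# "Come out" occurs in sentence 1, "This" in sentence 3 (webpage 0).
two_apart = min_sentence_distance([1], [3])
```

Because the lookup only compares small lists of integers taken from the inverted index, no scan of the page text is needed. In a real ranking function this factor would be only one of several inputs.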
Of course, the ranking weight of a webpage is determined by many factors together. Besides the sentence distance of the key characters, keywords, or punctuation marks, these factors include the authority of the domain hosting the webpage; the popularity of the webpage; whether the key characters, keywords, or punctuation marks appear in the URL, title, anchor text, or meta tags; the traffic and click-through rate of the webpage; and the registration data and public site data of the website hosting the webpage.
In addition, if several key characters, keywords, or punctuation marks belong to the same sentence, natural language processing can further be applied to that sentence. For example, various kinds of syntactic analysis, such as dependency parsing, can be performed on the sentence to obtain the dependency relations between its words and the head of the sentence. As another example, sentiment classification (positive/negative analysis) can be performed on the sentence to determine its polarity. The results of these analyses can be displayed together with the search results, providing the search user 406 with a more complete value-added service.
As shown in Figure 4, the present invention also provides a device for searching webpages according to sentence serial numbers, namely the search engine 40, comprising: a webpage getter 401 for obtaining and downloading a plurality of webpages; a webpage database 402 for storing the downloaded webpages; an indexer 403 for performing sentence segmentation on the webpages, assigning sentence serial numbers to the sentences of each webpage, and building the forward index table and the inverted index table that include the sentence serial numbers; an index database 404 for storing the forward index table and the inverted index table; and a searcher 405 for decomposing a search term into at least one key character, keyword, or punctuation mark, calculating, according to the inverted index table, the ranking weight of each webpage that contains the key character, keyword, or punctuation mark, and outputting the search results. The webpage getter 401, webpage database 402, indexer 403, index database 404, and searcher 405 are connected in sequence. The search engine 40 returns the final search results to the search user 406.
The first and second embodiments use Chinese webpages as examples to explain the method and device of the present invention for searching webpages according to sentence serial numbers. The present invention is not limited to this, however: the method and device can also be applied to information retrieval in any natural language that uses punctuation marks, such as English, German, Russian, Japanese, and Spanish. The present invention can be applied to searching webpages, e-books, structured text, and the like.
In the method and device of the present invention, the inverted index table includes the sentence serial numbers of the webpages. By looking up the sentence serial number information, the search engine can raise the ranking weight of webpages in which the sentence distance between the key characters, keywords, or punctuation marks is zero or small, so that these webpages rank higher and the user's search satisfaction improves.
The method and device of the present invention can use the sentence serial number of each character, word, or punctuation mark in each webpage to determine quickly and directly whether two or more queried key characters, keywords, or punctuation marks belong to the same sentence or to nearby sentences, without a large number of comparison operations. The method and device therefore have low time complexity, which improves the response speed of the search and gives the user a more efficient search experience.
The method and device of the present invention can also provide a precondition for subsequent natural language processing. If two or more queried key characters, keywords, or punctuation marks belong to the same sentence, the search engine can apply further, deeper natural language processing to that sentence.
The preferred embodiments of the present invention have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations according to the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain on the basis of the prior art through logical analysis, reasoning, or limited experimentation in accordance with the concept of the present invention shall fall within the scope of protection of the claims.

Claims (15)

1. A method for searching webpages according to sentence serial numbers, characterized in that it comprises the following steps:
A) obtaining a plurality of webpages and downloading them to a webpage database;
B) performing sentence segmentation on the plurality of webpages and assigning sentence serial numbers to the sentences of each webpage;
C) building a forward index table, the forward index table including the sentence serial numbers;
D) building an inverted index table, the inverted index table including the sentence serial numbers;
E) inputting a search term and decomposing the search term into at least one key character, keyword, or punctuation mark;
F) calculating, according to the inverted index table, the ranking weight of each webpage that contains the key character, keyword, or punctuation mark, and outputting the search results.
2. The method for searching webpages according to sentence serial numbers as claimed in claim 1, characterized in that step B) further comprises the following steps:
B1) an indexer scans each webpage, performs word segmentation on each webpage, and records the position of each word, character, or punctuation mark in the webpage;
B2) performing sentence segmentation according to the position of each word, character, or punctuation mark in the webpage and the position in the webpage of the punctuation mark adjacent after it;
B3) assigning a sentence serial number to each sentence and determining the sentence serial number of each word, character, or punctuation mark.
3. The method for searching webpages according to sentence serial numbers as claimed in claim 2, characterized in that the rule for sentence segmentation is: if a period, question mark, ellipsis, or exclamation mark is inside quotation marks and is located at the end of a paragraph, the sentence ends with the period, question mark, ellipsis, or exclamation mark together with the closing quotation mark; if the period, question mark, ellipsis, or exclamation mark is outside quotation marks, the sentence ends with the period, question mark, ellipsis, or exclamation mark.
4. The method for searching webpages according to sentence serial numbers as claimed in claim 2, characterized in that the forward index table comprises the page sequence number of each word, character, or punctuation mark; each word, character, or punctuation mark itself; the sequence number of each word, character, or punctuation mark; and the sentence serial number of each word, character, or punctuation mark.
5. The method for searching webpages according to sentence serial numbers as claimed in claim 2, characterized in that the inverted index table comprises each word, character, or punctuation mark; the sequence number of each word, character, or punctuation mark; the number of webpages containing each word, character, or punctuation mark; the page sequence number of each word, character, or punctuation mark; and the sentence serial number of each word, character, or punctuation mark.
6. The method for searching webpages according to sentence serial numbers as claimed in any one of claims 1 to 5, characterized in that step F) further comprises: in the webpages that contain the key characters, keywords, or punctuation marks, determining according to the inverted index table whether the key characters, keywords, or punctuation marks belong to the same sentence; if they belong to the same sentence, raising the ranking weight of the webpage that contains them; if they do not belong to the same sentence, calculating the sentence distance between the key characters, keywords, or punctuation marks, lowering the ranking weight of the webpage that contains them if the sentence distance is large, and raising the ranking weight of the webpage that contains them if the sentence distance is small.
7. The method for searching webpages according to sentence serial numbers as claimed in claim 6, characterized in that the ranking weight of the webpage is determined jointly by the sentence distance of the key characters, keywords, or punctuation marks; the authority of the domain hosting the webpage; the popularity of the webpage; whether the key characters, keywords, or punctuation marks appear in the URL, title, anchor text, or meta tags; the traffic and click-through rate of the webpage; and the registration data and public site data of the website hosting the webpage.
8. The method for searching webpages according to sentence serial numbers as claimed in claim 6, characterized in that, if the key characters, keywords, or punctuation marks belong to the same sentence, natural language processing is further performed on that sentence.
9. A device for searching webpages according to sentence serial numbers, comprising:
a webpage getter for obtaining and downloading a plurality of webpages;
a webpage database for storing the downloaded plurality of webpages;
an indexer for performing sentence segmentation on the plurality of webpages, assigning sentence serial numbers to the sentences of each webpage, and building a forward index table and an inverted index table that include the sentence serial numbers;
an index database for storing the forward index table and the inverted index table;
a searcher for decomposing a search term into at least one key character, keyword, or punctuation mark, calculating, according to the inverted index table, the ranking weight of each webpage that contains the key character, keyword, or punctuation mark, and outputting the search results;
wherein the webpage getter, the webpage database, the indexer, the index database, and the searcher are connected in sequence.
10. The device for searching webpages according to sentence serial numbers as claimed in claim 9, characterized in that the forward index table comprises the page sequence number of each word, character, or punctuation mark of the plurality of webpages; each word, character, or punctuation mark itself; the sequence number of each word, character, or punctuation mark; and the sentence serial number of each word, character, or punctuation mark.
11. The device for searching webpages according to sentence serial numbers as claimed in claim 9, characterized in that the inverted index table comprises each word, character, or punctuation mark of the plurality of webpages; the sequence number of each word, character, or punctuation mark; the number of webpages containing each word, character, or punctuation mark; the page sequence number of each word, character, or punctuation mark; and the sentence serial number of each word, character, or punctuation mark.
12. The device for searching webpages according to sentence serial numbers as claimed in claim 9, characterized in that the rule for sentence segmentation is: if a period, question mark, ellipsis, or exclamation mark is inside quotation marks and is located at the end of a paragraph, the sentence ends with the period, question mark, ellipsis, or exclamation mark together with the closing quotation mark; if the period, question mark, ellipsis, or exclamation mark is outside quotation marks, the sentence ends with the period, question mark, ellipsis, or exclamation mark.
13. The device for searching webpages according to sentence serial numbers as claimed in any one of claims 9 to 12, characterized in that the searcher is further used, in the webpages that contain the key characters, keywords, or punctuation marks, to determine according to the inverted index table whether the key characters, keywords, or punctuation marks belong to the same sentence; if they belong to the same sentence, to raise the ranking weight of the webpage that contains them; if they do not belong to the same sentence, to calculate the sentence distance between the key characters, keywords, or punctuation marks, to lower the ranking weight of the webpage that contains them if the sentence distance is large, and to raise the ranking weight of the webpage that contains them if the sentence distance is small.
14. The device for searching webpages according to sentence serial numbers as claimed in claim 9, characterized in that the ranking weight of the webpage is determined jointly by the sentence distance of the key characters, keywords, or punctuation marks; the authority of the domain hosting the webpage; the popularity of the webpage; whether the key characters, keywords, or punctuation marks appear in the URL, title, anchor text, or meta tags; the traffic and click-through rate of the webpage; and the registration data and public site data of the website hosting the webpage.
15. The device for searching webpages according to sentence serial numbers as claimed in claim 9, characterized in that, if the key characters, keywords, or punctuation marks belong to the same sentence, the searcher is further used to perform natural language processing on that sentence.
CN 201010110315 2010-02-09 2010-02-09 Method and device for searching webpages according to sentence serial numbers Expired - Fee Related CN101923556B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010110315 CN101923556B (en) 2010-02-09 2010-02-09 Method and device for searching webpages according to sentence serial numbers

Publications (2)

Publication Number Publication Date
CN101923556A true CN101923556A (en) 2010-12-22
CN101923556B CN101923556B (en) 2013-01-02

Family

ID=43338494

Country Status (1)

Country Link
CN (1) CN101923556B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: SHANGHAI YUTIAN INFORMATION TECHNOLOGY CO.,LTD.

Free format text: FORMER OWNER: SHANGHAI LAISEEK INFORMATION TECHNOLOGY CO., LTD.

Effective date: 20140911

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 201112 MINHANG, SHANGHAI TO: 200120 PUDONG NEW AREA, SHANGHAI

TR01 Transfer of patent right

Effective date of registration: 20140911

Address after: Room 202, No. 6, Lane 289, Bisheng Road, Zhangjiang Hi-Tech Park, Pudong New Area, Shanghai 200120

Patentee after: Shanghai Yutian Information Technology Co.,Ltd.

Address before: 201112, room 505, building 1, building 1588, union airway, Shanghai, Minhang District

Patentee before: Shanghai Laiseek Information Technology Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130102

Termination date: 20170209