CN107220307A

CN107220307A - Web search method and device

Info

Publication number: CN107220307A
Application number: CN201710326803.XA
Authority: CN
Inventors: 黄永峰; 刘俊鑫; 吴方照; 刘佳伟; 袁志刚; 吴思行
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2017-05-10
Filing date: 2017-05-10
Publication date: 2017-09-29
Anticipated expiration: 2037-05-10
Also published as: CN107220307B

Abstract

The present invention proposes a web search method and device. The web search method includes: searching according to a keyword to obtain matching webpages and their ranking; obtaining the labels that a user assigns to first webpages selected from the matching webpages; estimating the degree of correlation between a second webpage and the search intention according to the degree of correlation between the second webpage in the matching webpages and the first webpages labeled as related and/or unrelated; and reordering the second webpages accordingly. On the one hand, the method takes full account of the user's search intention and helps the user quickly locate useful information. On the other hand, the user only needs to label the relevance of a small number of webpages in the search results according to the search intention in order to raise the ranking of the second webpages related to that intention, which reduces the time the user needs to obtain the target information and improves the user experience.

Description

Web search method and device

Technical field

The present invention relates to the technical field of information retrieval, and in particular to a web search method and device.

Background

With the popularization and development of the Internet, more and more people use the network to retrieve information. The amount of information on the network is now enormous, and people retrieve information on the Internet using search engines. A search engine organizes and ranks the search results and then returns them to the user.

At present, when a user retrieves information with a search engine, the interaction between the search engine and the user is limited to the search keyword entered by the user, or the search results are only partly optimized using information such as the user's browsing history.

In the prior art, in the search results presented to the user, the webpages the user needs are often ranked low. As a result, the user cannot effectively locate the required webpages, the time needed to obtain the target information increases, and the user experience is degraded.

Summary of the invention

The present invention aims to solve, at least to some extent, one of the technical problems in the related art.

To this end, a first object of the present invention is to propose a web search method that reorders the search results by calculating the correlation between the second webpages and the user's search intention, so that the information the user needs can be located quickly. This solves the technical problem in the prior art that, because the user's search intention is not fully taken into account, the user needs a long time to obtain the required information and the user experience is poor.

A second object of the present invention is to propose a web search device.

A third object of the present invention is to propose a computer device.

A fourth object of the present invention is to propose a non-transitory computer-readable storage medium.

To achieve the above objects, an embodiment of the first aspect of the present invention proposes a web search method, including:

obtaining a search keyword;

searching according to the keyword to obtain matching webpages and the ranking of the matching webpages;

after the user labels first webpages selected from the matching webpages, obtaining the labels of the first webpages, wherein a label indicates the correlation between a first webpage and the search intention behind the keyword search;

estimating the degree of correlation between a second webpage and the search intention according to the degree of correlation between the second webpage in the matching webpages and the first webpages labeled as related, and/or according to the degree of correlation between the second webpage and the first webpages labeled as unrelated;

reordering the second webpages according to the degree of correlation between each second webpage and the search intention.

In the web search method of the embodiment of the present invention, the user labels matching webpages in the search results according to the search intention, and the second webpages are reordered according to the calculated degree of correlation. The method takes full account of the user's search intention, helps the user quickly locate useful information, reduces the time the user needs to obtain useful information, and improves the user experience.

To achieve the above objects, an embodiment of the second aspect of the present invention proposes a web search device, including:

an acquisition module, configured to obtain a search keyword;

a search module, configured to search according to the keyword to obtain matching webpages and the ranking of the matching webpages;

a labeling module, configured to obtain the labels of the first webpages after the user labels first webpages selected from the matching webpages, wherein a label indicates the correlation between a first webpage and the search intention behind the keyword search;

a computing module, configured to estimate the degree of correlation between a second webpage and the search intention according to the degree of correlation between the second webpage in the matching webpages and the first webpages labeled as related, and/or according to the degree of correlation between the second webpage and the first webpages labeled as unrelated;

a reordering module, configured to reorder the second webpages according to the degree of correlation between each second webpage and the search intention.

In the web search device of the embodiment of the present invention, the user labels matching webpages in the search results according to the search intention, and the second webpages are reordered according to the calculated degree of correlation. The device takes full account of the user's search intention, helps the user quickly locate useful information, reduces the time the user needs to obtain useful information, and improves the user experience.

To achieve the above objects, an embodiment of the third aspect of the present invention proposes a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the web search method described in the first aspect is performed.

To achieve the above objects, an embodiment of the fourth aspect of the present invention proposes a non-transitory computer-readable storage medium on which a computer program is stored. When the program is executed by a processor, the web search method described in the first aspect is performed.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a schematic flowchart of a web search method provided by an embodiment of the present invention;

Fig. 2 is a schematic flowchart of a method for estimating the degree of correlation between a webpage and the search intention provided by an embodiment of the present invention;

Fig. 3 is a schematic flowchart of a webpage body text extraction method proposed in this embodiment;

Fig. 4 is a schematic flowchart of a method for removing nodes containing a large number of links proposed in this embodiment;

Fig. 5 is a schematic structural diagram of a web search device provided by an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of another web search device provided by an embodiment of the present invention; and

Fig. 7 is a schematic structural diagram of the extraction unit 341 provided by an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.

The web search method and device of the embodiments of the present invention are described below with reference to the accompanying drawings.

Fig. 1 is a schematic flowchart of a web search method provided by an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:

Step S1, obtains the keyword of search.

Specifically, in the user input interface of the search engine, the user enters a keyword to query information as needed, and the search engine identifies the keyword entered by the user.

Step S2, is scanned for according to keyword, obtains matching webpage, and match the sequence of webpage.

Specifically, according to the keyword entered by the user, the search engine queries information in its database using web crawler technology. After finding the webpages that match the keyword entered by the user, it calculates the correlation between each webpage and the user's search intention to obtain the ranking of the matching webpages, records the initial rank of each matching webpage obtained, and displays the matching webpages in the user's display interface in their initial order.

The search engine searches for target webpages using web crawler technology according to the keyword entered by the user. As one possible implementation, the search engine is called using the GET request method: the parameters sent to the server are appended to the request URL (Uniform Resource Locator) after a question mark (?), and multiple parameters are separated by the symbol &. The search engine parses the parameters submitted in the URL and returns the corresponding search results. For example, to search for the keyword "Tsinghua University" with Baidu, the following URL is accessed: "http://www.baidu.com/s?q1=Tsinghua University". Advanced search parameters, such as those limiting the update time of webpages, can likewise be implemented by appending the corresponding parameters to the URL. For example, to limit the time range of the webpages searched and the number of results shown per page with Baidu, the following URL is accessed: http://www.baidu.com/s?q1=Tsinghua University&lm=7&rn=5, that is, webpages about Tsinghua University from the last week are searched, with 5 results displayed per page.
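The construction of such a GET request URL can be sketched as follows. This is a minimal illustration only: the parameter names q1, lm and rn are taken from the example above, the helper name build_search_url is hypothetical, and whether a given search engine currently accepts these parameters is not confirmed here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(base, keyword, **extra):
    """Append the keyword and optional advanced parameters after '?', joined by '&'.

    The parameter name 'q1' follows the example in the text (an assumption
    about the search engine's interface, not a documented API).
    """
    params = {"q1": keyword}
    params.update(extra)  # e.g. lm=7 (last week), rn=5 (results per page)
    return base + "?" + urlencode(params)


url = build_search_url("http://www.baidu.com/s", "Tsinghua University", lm=7, rn=5)
# The server side can recover the parameters by parsing the query string.
parsed = parse_qs(urlsplit(url).query)
```

Note that urlencode percent-encodes the keyword (spaces become "+"), which is what a browser would also do before sending the request.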

Step S3, after the first webpage that user is chosen in matching webpage is labeled, obtains the mark of the first webpage.

Specifically, the user labels the matching webpages returned to the user according to their degree of correlation with the search need, and the information on the labeled webpages is returned to the search engine. A webpage labeled by the user is referred to as a first webpage. It should be noted that the first webpages labeled by the user include both webpages related to the user's search need and webpages unrelated to it. For ease of distinction, a matching webpage related to the user's search need is called a related first webpage, and a matching webpage unrelated to the user's search need is called an unrelated first webpage.

Further, the user labels the matching webpages in an interaction page generated and displayed by the search engine system. The interaction page shows the links of all matching pages together with check boxes for labeling. Two selection buttons, "related" and "unrelated", are provided next to each matching page. If the page is related to the search need, the "related" button is selected and the page is set as a related first webpage; if the page is unrelated to the search need, the "unrelated" button is selected and the page is set as an unrelated first webpage. It should be noted that the related first webpages and the unrelated first webpages are selected by the user according to the search need, and there may be one or more of each.

Step S4, estimate the degree of correlation between the second webpage and the search intention according to the degree of correlation between the second webpage in the matching webpages and the first webpages labeled as related, and/or according to the degree of correlation between the second webpage and the first webpages labeled as unrelated.

Specifically, the keyword entered by the user corresponds to the user's search intention, and it is the body text of a webpage that corresponds to the user's search intention. Therefore, to calculate the degree of correlation between a webpage and the user's search intention, the body text of the webpage needs to be extracted.

First, the body text of the first webpage is extracted, and the document of the first webpage is generated according to the body text, title and brief introduction of the first webpage. The documents of the first webpages include the documents of the first webpages labeled as related and the documents of the first webpages labeled as unrelated.

Second, the title and brief introduction of the second webpage are extracted to generate the document of the second webpage.

Finally, the degree of correlation between the second webpage and the search intention is estimated according to the degree of correlation between the documents of the first webpages labeled as related and the document of the second webpage, and/or the degree of correlation between the documents of the first webpages labeled as unrelated and the document of the second webpage.

It should be noted that the second webpages are generally the webpages, among the matching webpages obtained by the search, other than the first webpages. Those skilled in the art will appreciate that the second webpages may be all or only some of the matching webpages other than the first webpages, which is not limited in this embodiment.

Step S5, according to the degree of correlation between the second webpage and search intention, resequences to the second webpage.

Specifically, the second webpages are reordered on the principle that the greater the degree of correlation between a second webpage and the search intention, the higher the second webpage is ranked after reordering.

In the web search method of this embodiment, the user labels matching webpages in the search results according to the search intention, and the second webpages are reordered according to the calculated degree of correlation. The method takes full account of the user's search intention, helps the user quickly locate useful information, reduces the time the user needs to obtain useful information, and improves the user experience.
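The reordering principle can be sketched in a few lines. The function name rerank and the choice to keep the original order for equal scores are assumptions made for illustration; the text only requires that a higher degree of correlation yields a higher rank.

```python
def rerank(second_pages, scores):
    """Return the second webpages sorted by descending relevance score.

    Python's sort is stable, so pages with equal scores keep their
    initial relative order (an illustrative choice, not from the source).
    """
    paired = zip(second_pages, scores)
    return [page for page, _ in sorted(paired, key=lambda item: -item[1])]


reordered = rerank(["page_a", "page_b", "page_c"], [0.2, 0.9, 0.2])
```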

To further clarify step S4 in the previous embodiment, as one possible implementation, Fig. 2 is a schematic flowchart of a method for estimating the degree of correlation between a webpage and the search intention provided by an embodiment of the present invention.

As shown in Fig. 2, step S4 includes the following steps:

Step S41, extracts the body text of the first webpage.

Specifically, a webpage contains a large amount of text, some of which is related to the user's search need and some of which is not. For ease of distinction, the text related to the user's need is referred to as body text, and the text that is unrelated to the user's need and cannot reflect the user's search intention, such as navigation bars, external links and advertisements, is referred to as noise text. To characterize the user's search intention more accurately, the unrelated noise text needs to be removed and only the body text retained.

Step S42, according to the body text, title and brief introduction of the first webpage, generates the document of the first webpage.

Specifically, for the results page returned by the search engine, a separate lookup pattern is defined for each field, and the title and brief introduction are extracted from the results page. Together with the body text extracted from the first webpage, they form the document of the first webpage. The documents of the first webpages include the documents of the first webpages labeled as related and the documents of the first webpages labeled as unrelated.

Step S43, extracts title and the brief introduction of the second webpage, generates the document of the second webpage.

Specifically, for the results page returned by the search engine, a separate lookup pattern is defined for each field, and the title and brief introduction are extracted from the results page to generate the document of the second webpage.

It should be understood that, for the second webpage, the body text is not extracted; only the title and brief introduction are extracted. This reduces the number of network connections initiated and thus the background processing time of the system, so that the system can respond quickly to the user's search need while ensuring sufficient accuracy, improving the user experience.

Step S44, according to the degree of correlation being labeled as between the document of the first webpage and the document of the second webpage of correlation, And/or be labeled as the degree of correlation between the document of the first unrelated webpage and the document of the second webpage, the second webpage of estimation with Degree of correlation between search intention.

Specifically, to calculate the degree of similarity, that is, the degree of correlation, between two documents, the most common approach is to map each document to a vector and then measure the similarity between the two documents by the similarity between their corresponding vectors.

As one possible implementation, documents are mapped to vectors using the bag-of-words model, a method commonly used in natural language processing to map document data to vectors. Suppose the dictionary contains N words and every document is composed of these N words. Any document can then be mapped to an N-dimensional vector using the bag-of-words model, where the k-th dimension of the vector corresponds to the weight of the k-th word of the dictionary in the document. The weight of a word may be the frequency with which the word occurs in the document; the most commonly used weighting is the term frequency-inverse document frequency (TF-IDF) value of the word in the document. Term frequency (TF) is the frequency with which a word occurs in a document: the higher the frequency, the more important the word and the larger its TF value. Inverse document frequency (IDF) assigns smaller weights to common words and larger weights to uncommon words, that is, the IDF of a word decreases as the word becomes more common. The TF-IDF value of a word is its TF multiplied by its IDF; the larger the value, the higher the weight of the word in the document.

It should be understood that, before the TF-IDF values are calculated, the document first needs to be segmented into words. An existing word segmentation tool converts the document into a set of words, the number of times each word occurs in the document is counted, and the TF-IDF value of each word is calculated, which yields the vector representation of the document.

It should be noted that, in practical applications, the dictionary used by the system contains about 300,000 words, which covers most Chinese words. However, because new words keep emerging on the network and the segmentation tool may make segmentation errors, the segmentation result may contain a small number of words that are not in the dictionary. The system simply discards these words.
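The bag-of-words mapping with TF-IDF weights described above can be sketched as follows. This is a minimal illustration under stated assumptions: documents are already segmented into word lists, TF is the count of a word divided by the number of in-dictionary words in the document, and IDF is log(number of documents / number of documents containing the word). The text does not give the exact TF or IDF formulas, so these are one common variant rather than the method of the invention. Words outside the dictionary are discarded, as described above.

```python
import math
from collections import Counter


def tfidf_vectors(docs, dictionary):
    """Map each tokenized document to an N-dimensional TF-IDF vector.

    docs: list of token lists (already segmented).
    dictionary: list of the N known words; position k gives dimension k.
    """
    index = {word: k for k, word in enumerate(dictionary)}
    n_docs = len(docs)

    # Document frequency: in how many documents each dictionary word occurs.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens) & index.keys())

    vectors = []
    for tokens in docs:
        # Words that are not in the dictionary are discarded.
        counts = Counter(t for t in tokens if t in index)
        total = sum(counts.values())
        vec = [0.0] * len(dictionary)
        for word, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[word])  # assumed IDF variant
            vec[index[word]] = tf * idf
        vectors.append(vec)
    return vectors


docs = [["search", "engine", "search"], ["engine", "ranking", "newword"]]
vectors = tfidf_vectors(docs, ["search", "engine", "ranking"])
```

In this toy example "engine" occurs in every document, so its IDF and therefore its weight is zero, which matches the intent that common words carry little weight.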

As another possible implementation, documents are mapped to vectors using a distributed vector representation model. The distributed representation of a word, commonly known as a word vector, maps each word in the dictionary to a real-valued vector in a vector space. It can generally be obtained by training a neural language model, and the dimension of the vectors can be set as needed at training time. Word vectors obtained by training characterize the semantic information of words well: words with similar semantics are close to each other in the vector space. Once words have vector representations, there are several ways to extend them to a vector representation of a document. As one possible implementation, the system obtains the vector representation of a document as a weighted average of the vector representations of its words. First, word vectors are trained on a corpus collected in advance. Then the document is segmented with a word segmentation tool and the TF-IDF value of each word is calculated. Finally, the word vectors of these words are averaged, weighted by their TF-IDF values, to obtain the document vector.
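The weighted-average step can be sketched as follows. The function name, the dictionary-based inputs and the handling of words without a trained vector (they are skipped) are assumptions made for illustration; the word vectors themselves would come from a separately trained model, which is not shown.

```python
def document_vector(weights, word_vectors, dim):
    """Weighted average of word vectors, using TF-IDF values as weights.

    weights: dict mapping each word of the document to its TF-IDF value.
    word_vectors: dict mapping words to pre-trained vectors of length dim.
    Words without a trained vector are skipped (illustrative choice).
    """
    acc = [0.0] * dim
    total_weight = 0.0
    for word, weight in weights.items():
        vec = word_vectors.get(word)
        if vec is None:
            continue
        for i in range(dim):
            acc[i] += weight * vec[i]
        total_weight += weight
    if total_weight == 0.0:
        return acc  # no usable words: return the zero vector
    return [x / total_weight for x in acc]


toy_vectors = {"web": [1.0, 0.0], "page": [0.0, 1.0]}
doc_vec = document_vector({"web": 3.0, "page": 1.0, "unseen": 2.0}, toy_vectors, 2)
```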

Further, using the bag-of-words model, the document of a first webpage labeled as related can be vectorized as V_{related,1}, the document of a first webpage labeled as unrelated as V_{unrelated,1}, and the document of the second webpage as V_{result,1}. Using the distributed vector representation model, the document of a first webpage labeled as related can be vectorized as V_{related,2}, the document of a first webpage labeled as unrelated as V_{unrelated,2}, and the document of the second webpage as V_{result,2}.

Specifically, once the documents of the first webpages and of the second webpage have been vectorized, the similarity between documents can be measured by the similarity between their vectors. As one possible implementation, the system uses the cosine distance to represent the similarity between vectors, and thus calculates the cosine similarity S_{related,1} between V_{related,1} and V_{result,1}, the cosine similarity S_{related,2} between V_{related,2} and V_{result,2}, the cosine similarity S_{unrelated,1} between V_{unrelated,1} and V_{result,1}, and the cosine similarity S_{unrelated,2} between V_{unrelated,2} and V_{result,2}.
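A minimal sketch of the cosine similarity between two document vectors is given below. Returning 0.0 when either vector is all zeros is an assumption made so that the function is defined for empty documents; the text does not address that case.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)


s_same = cosine_similarity([1.0, 0.0], [2.0, 0.0])
s_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
s_partial = cosine_similarity([3.0, 4.0], [4.0, 3.0])
```

The same function applies to both the bag-of-words vectors and the distributed document vectors, since only the vectors differ between the two representations.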

Further, according to the cosine similarities obtained between the documents of the first webpages and the document of the second webpage, the degree of correlation Score between the second webpage and the search intention is calculated using a formula.

Wherein, α, β and γ are preset weights, and n is the rank of the second webpage in the matching webpages.

Finally, the second webpages are reordered according to the calculated value of the degree of correlation Score between each second webpage and the search intention, and the reordered webpages are presented to the user. The larger the Score value, the higher the corresponding second webpage is ranked after reordering.

In this embodiment of the present invention, the body text of the first webpage is extracted; the document of the first webpage is generated according to the body text, title and brief introduction of the first webpage; the title and brief introduction of the second webpage are extracted to generate the document of the second webpage; and the degree of correlation between the second webpage and the search intention is estimated according to the degree of correlation between the documents of the first webpages labeled as related and the document of the second webpage, and/or the degree of correlation between the documents of the first webpages labeled as unrelated and the document of the second webpage. The user labels matching webpages in the search results according to the search intention, and the second webpages are reordered according to the calculated degree of correlation. This takes full account of the user's search intention, helps the user quickly locate useful information, reduces the time the user needs to obtain useful information, and improves the user experience.

In the previous embodiment, when the text contained in a webpage is extracted, the useless noise text needs to be removed and only the body text retained. However, because webpage layouts differ greatly, there is no unified body text parsing algorithm. As one possible implementation, this embodiment proposes a webpage body text extraction method.

Specifically, to further clarify step S41 in the above embodiment, Fig. 3 is a schematic flowchart of a webpage body text extraction method proposed in this embodiment. As shown in Fig. 3, step S41 includes the following steps:

Step S411, obtains the html files of the first webpage.

Specifically, the html file of the first webpage labeled by the user is obtained. According to the Document Object Model (DOM), every component of an html file is a node, and all the information in the html file is stored in its various nodes. The whole document is a document node; each html tag is an element node; each html attribute is an attribute node; and each comment is a comment node.

Step S412, remove the nodes in the html file that correspond to controls, display styles and/or comments.

Specifically, the nodes corresponding to controls, display styles and/or comments contain no body text and are unrelated to the user's search need. When the body text is extracted, these nodes need to be removed. The nodes to be removed include at least the following:

Remove <script> nodes, which are used to define client-side scripts;

Remove <noscript> nodes, which are used to define the alternative content (text) shown when scripts are not executed;

Remove <style> nodes, which are used to define style information for the html document;

Remove <span> nodes, which are used to group inline elements in the document;

Remove <meta> nodes, which are used to provide meta-information about the page, such as the description and keywords for search engines and the update frequency;

Remove nodes containing the attribute "style=display:none", which is used to hide an object in the webpage without reserving physical space for the hidden object;

Remove comment nodes, that is, the comments inserted in the source code.
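The node removal in step S412 can be sketched with the Python standard-library HTML parser. This is an illustration under stated assumptions rather than the patented implementation: the class and function names are hypothetical, the markup is assumed to be properly nested apart from void elements, and the text of a removed node is dropped together with all of its descendants. The set of skipped tags follows the list above, including <span>.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "noscript", "style", "span"}
# Void elements have no closing tag, so they must not affect nesting depth.
VOID_TAGS = {"meta", "br", "img", "input", "link", "hr", "area", "base",
             "col", "embed", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a removed node
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return            # <meta> and similar carry no text
        if self.skip_depth:
            self.skip_depth += 1
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if tag in SKIPPED_TAGS or "display:none" in style:
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())
    # Comments are dropped because handle_comment is not overridden.


def extract_visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = ('<html><head><meta name="keywords" content="x"><style>p{}</style>'
          '<script>var a = 1;</script></head><body><!-- note --><p>Body text</p>'
          '<div style="display: none"><p>hidden</p></div>'
          '<noscript><p>enable scripts</p></noscript>'
          '<p>More <span>inline</span> text</p></body></html>')
visible = extract_visible_text(sample)
```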

Step S413, determine the nodes containing body text according to the number of links contained in each node of the html file.

Specifically, parts of the page such as navigation bars and advertisements generally contain many links, whereas body text generally contains few or no links. Whether the text of a node belongs to the body text can therefore be judged from the number of link nodes in and around the node. If the node and its surroundings contain many links, the text of the node is judged not to be body text and needs to be deleted; otherwise, the text of the node is judged to be body text and needs to be retained.

Step S414, according to the node comprising content of text, generates the body text of the first webpage.

Specifically, after the nodes corresponding to controls, display styles and/or comments and the nodes containing many links have been removed from the html file, the redundant spaces and tags in the html file are deleted, and the text content of the remaining nodes is extracted to obtain the body text of the page.

To further clarify the process of removing nodes containing a large number of links in step S413, as one possible implementation, Fig. 4 is a schematic flowchart of a method for removing nodes containing a large number of links proposed in this embodiment, which includes the following steps:

Step S4131, finds out all in html files<a>Node.

Wherein,<a>Node refers to hyperlink node.For each<a>Node, performs the operation of following steps.

Step S4132, finds out this<a>The father node of node, is designated as parent, and initializing variable temp=null.

Specifically, for each<a>For node, there is a unique father node, temp refers to temporary variable.

Step S4133, finds out all under parent<a>Node, calculates under parent and owns<a>The character length of node Summation l_a, and calculate parent character length l_p。

Specifically, there may be multiple child nodes under each parent father node<a>, multiple child nodes<a>Between be same The relation of level, by counting all child nodes under parent father nodes<a>Character length summation, whether judge parent nodes Include more hyperlink node, if to need the noise node deleted.

Step S4134, judges l_a/l_pWhether a pre-set threshold value t is more than, if it is, performing step S4135, otherwise performs step S4136.

Specifically, if l_a/l_pValue be more than default threshold value t, then it is assumed that the parent nodes are comprising the section that largely links Point, belongs to the noise node of needs deletion, parent is assigned into temp, while the father that pointer is pointed into parent nodes saves Point；If l_a/l_pValue be less than or equal to default threshold value t, then it is assumed that the node belongs to text node, it is not necessary to delete.

Step S4135, temp is assigned to by parent, and makes parent point to the father node of parent nodes, is returned simultaneously Receipt row step S4133.

Step S4136, if temp is not null, deletes temp, if temp is null, retains temp.It is then back to Step S4132.

It should be noted that regardless of whether deleting temp, return performs step S4132, handles next<a>Node, It is all until handled that step S4131 finds out<a>Untill node.
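Steps S4131 to S4136 can be sketched in Python as follows. This is a hedged illustration, not the implementation of the embodiment: the function names, the default threshold `t=0.5`, and the sample markup are hypothetical. The sketch also makes choices the text leaves open: "<a> nodes under parent" is read as all <a> descendants, a parent with zero text length is treated as not link-heavy, <a> nodes that were already deleted together with an ancestor are skipped, and the document root is never deleted.

```python
import xml.etree.ElementTree as ET


def text_length(node):
    """Character length of all text under node, outer whitespace trimmed."""
    return len("".join(node.itertext()).strip())


def remove_link_heavy_nodes(root, t=0.5):
    """Delete subtrees whose text is dominated by <a> text."""
    parent_of = {child: parent for parent in root.iter() for child in parent}

    def is_attached(node):
        # parent_of is not updated on deletion, so verify each link to the root.
        while node is not root:
            parent = parent_of.get(node)
            if parent is None or not any(c is node for c in parent):
                return False
            node = parent
        return True

    for a in list(root.iter("a")):                        # S4131
        if not is_attached(a):
            continue                                      # removed with an ancestor
        parent, temp = parent_of.get(a), None             # S4132
        while parent is not None:
            l_a = sum(text_length(x) for x in parent.iter("a"))  # S4133
            l_p = text_length(parent)
            if l_p == 0 or l_a / l_p <= t:                # S4134: not link-heavy
                break
            temp, parent = parent, parent_of.get(parent)  # S4135: climb one level
        if temp is not None and temp is not root:         # S4136
            parent_of[temp].remove(temp)
    return root


page = ET.fromstring(
    "<html><body>"
    "<div id='nav'><a href='/'>Home</a><a href='/news'>News</a>"
    "<a href='/about'>About</a></div>"
    "<div id='main'><p>This paragraph is the real body text of the page "
    "and it is long. <a href='/x'>one link</a></p></div>"
    "</body></html>"
)
remove_link_heavy_nodes(page, t=0.5)
```

In the sample, the navigation block consists entirely of link text, so it exceeds the threshold and is deleted, while the paragraph with a single short link is kept.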

In the embodiments corresponding to Fig. 3 and Fig. 4, body text information is extracted from the obtained html files to obtain the documents of the related first webpages and the documents of the unrelated first webpages. In addition, the title and brief introduction of the second webpage are extracted to obtain the document of the second webpage. The degree of correlation between the second webpage and the search intention is estimated according to the degree of correlation between the document of the first webpage labeled as related and the document of the second webpage, and/or according to the degree of correlation between the document of the first webpage labeled as unrelated and the document of the second webpage, and the second webpages are reordered according to the calculated degree of correlation. This fully takes the user's search intention into account, helps the user quickly locate useful information, reduces the time the user needs to obtain useful information, and improves the user experience.

In order to implement the above embodiments, the present invention also proposes a webpage search device.

Fig. 5 is a schematic structural diagram of a webpage search device provided by an embodiment of the present invention.

As shown in Fig. 5, the webpage search device includes: an acquisition module 31, a search module 32, a labeling module 33, a computing module 34, and a reordering module 35.

Acquisition module 31, configured to obtain the search keyword.

Search module 32, configured to search according to the keyword to obtain matching webpages and the ranking of the matching webpages.

Labeling module 33, configured to obtain the label of a first webpage after the user selects the first webpage from the matching webpages and labels it; wherein the label indicates the correlation between the first webpage and the search intention behind the keyword search.

Computing module 34, configured to estimate the degree of correlation between a second webpage and the search intention according to the degree of correlation between the second webpage in the matching webpages and the first webpage labeled as related, and/or according to the degree of correlation between the second webpage and the first webpage labeled as unrelated.

Reordering module 35, configured to reorder the second webpages according to the degree of correlation between the second webpage and the search intention.

Wherein, as a possible implementation, the reordering module 35 is specifically configured to reorder the second webpages on the principle that the greater the degree of correlation between a second webpage and the search intention, the higher the second webpage ranks after reordering.
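The reordering principle can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of the embodiment: the function name `reorder_second_pages`, the page identifiers, and the score values are hypothetical, and the scores are assumed to have been produced already by the computing module 34.

```python
def reorder_second_pages(pages, scores):
    """Return the second webpages sorted by descending degree of correlation.

    sorted() is stable, so pages with equal scores keep their relative
    order from the original search-engine ranking.
    """
    return sorted(pages, key=lambda page: scores[page], reverse=True)


# Hypothetical second webpages, in their original search-engine order.
pages = ["page-a", "page-b", "page-c"]
# Hypothetical degrees of correlation with the search intention.
scores = {"page-a": 0.12, "page-b": 0.87, "page-c": 0.45}

reordered = reorder_second_pages(pages, scores)
```

Relying on a stable sort means the original ranking acts as a tie-breaker when two second webpages receive the same estimated degree of correlation.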

In this embodiment of the present invention, the acquisition module is configured to obtain the search keyword; the search module is configured to search according to the keyword to obtain matching webpages and the ranking of the matching webpages; the labeling module is configured to obtain the label of a first webpage after the user selects the first webpage from the matching webpages and labels it, wherein the label indicates the correlation between the first webpage and the search intention behind the keyword search; the computing module is configured to estimate the degree of correlation between a second webpage and the search intention according to the degree of correlation between the second webpage in the matching webpages and the first webpage labeled as related, and/or according to the degree of correlation between the second webpage and the first webpage labeled as unrelated; and the reordering module is configured to reorder the second webpages according to the degree of correlation between the second webpage and the search intention. The user labels matching webpages in the search results according to the search intention, and the second webpages are reordered according to the calculated degree of correlation. This fully takes the user's search intention into account, helps the user quickly locate useful information, reduces the time the user needs to obtain useful information, and improves the user experience.

It should be noted that the foregoing explanation of the method embodiments also applies to the device of this embodiment and is not repeated here.

Based on the above embodiments, an embodiment of the present invention further provides another possible implementation of the webpage search device. Fig. 6 is a schematic structural diagram of another webpage search device provided by an embodiment of the present invention. On the basis of the previous embodiment, the webpage search device further includes: an interactive display module 36, configured to generate and display an interaction page, wherein the interaction page displays links to the matching webpages and check boxes for obtaining the labels.

In addition, on the basis of the previous embodiment, the computing module 34 includes: an extraction unit 341, a generation unit 342, an extraction and generation unit 343, and a computing unit 344.

Extraction unit 341, configured to extract the body text of the first webpage.

Generation unit 342, configured to generate the document of the first webpage according to the body text, title and brief introduction of the first webpage; wherein the document of the first webpage includes the document of the first webpage labeled as related and the document of the first webpage labeled as unrelated.

Extraction and generation unit 343, configured to extract the title and brief introduction of the second webpage and generate the document of the second webpage.

Computing unit 344, configured to estimate the degree of correlation between the second webpage and the search intention according to the degree of correlation between the document of the first webpage labeled as related and the document of the second webpage, and/or the degree of correlation between the document of the first webpage labeled as unrelated and the document of the second webpage.

Wherein, as a possible implementation, the computing unit 344 is specifically configured to calculate, according to the formula, the degree of correlation Score between the second webpage and the search intention.
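The claims define the similarity terms that enter this calculation as cosine similarities between vectorized documents, using a bag-of-words representation and a distributed word-vector representation. The Python sketch below illustrates only the bag-of-words cosine similarity; it does not reproduce the formula for Score. The function names, the regular-expression tokenizer, and the sample documents are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Vectorize a document as word counts (tokenizer is an assumption)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical documents: one labeled related, one labeled unrelated,
# and one second webpage built from its title and brief introduction.
related_doc = bag_of_words("python list sort tutorial")
unrelated_doc = bag_of_words("python snake habitat")
second_page_doc = bag_of_words("sort a python list")

s_related = cosine_similarity(related_doc, second_page_doc)
s_unrelated = cosine_similarity(unrelated_doc, second_page_doc)
```

With these sample documents the second webpage is closer to the related document than to the unrelated one, which is the signal the reordering relies on.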

In this embodiment of the present invention, the user labels matching webpages in the search results according to the search intention, the degree of correlation between the second webpage and the search intention is calculated, and the second webpages are reordered according to the calculated degree of correlation. This fully takes the user's search intention into account, helps the user quickly locate useful information, reduces the time the user needs to obtain useful information, and improves the user experience.

It should be noted that the foregoing explanation of the method embodiments also applies to the device of this embodiment of the present invention and is not repeated here.

Based on the above embodiments, as a possible implementation, Fig. 7 is a schematic structural diagram of the extraction unit 341 provided by an embodiment of the present invention. As shown in Fig. 7, the extraction unit 341 includes: an obtaining subunit 3411, a removal subunit 3412, a determination subunit 3413, and a generation subunit 3414.

Obtaining subunit 3411, configured to obtain the html file of the first webpage.

Removal subunit 3412, configured to remove the nodes corresponding to controls, display styles and/or comments.

Determination subunit 3413, configured to determine the nodes containing text content according to the number of links contained in each node of the html file.

Generation subunit 3414, configured to generate the body text of the first webpage according to the nodes containing text content.

Wherein, as a possible implementation, the determination subunit 3413 is specifically configured to determine, as the nodes containing text content, the nodes in the html file whose number of contained links is less than a preset threshold.

In order to implement the above embodiments, the present invention also proposes a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the computer program is executed by the processor, the web search method of the foregoing embodiments is performed.

In order to implement the above embodiments, the present invention also proposes a non-transitory computer-readable storage medium; when the program in the storage medium is executed by a processor, the web search method of the foregoing embodiments is performed.

In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those different embodiments or examples.

In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "multiple" means at least two, for example two or three, unless specifically defined otherwise.

Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a custom logic function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.

The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions that may be considered to implement logic functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) with one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that each part of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following technologies well known in the art may be used: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.

Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be completed by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention; those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.

Claims

1. A web search method, characterized by comprising the following steps:

obtaining a search keyword;

searching according to the keyword to obtain matching webpages and a ranking of the matching webpages;

after a user selects a first webpage from the matching webpages and labels it, obtaining the label of the first webpage; wherein the label indicates the correlation between the first webpage and the search intention behind the keyword search;

estimating the degree of correlation between a second webpage in the matching webpages and the search intention according to the degree of correlation between the second webpage and the first webpage labeled as related, and/or according to the degree of correlation between the second webpage and the first webpage labeled as unrelated;

reordering the second webpages according to the degree of correlation between the second webpage and the search intention.

2. The web search method according to claim 1, characterized in that said estimating the degree of correlation between the second webpage and the search intention, according to the degree of correlation between the second webpage in the matching webpages and the first webpage labeled as related, and/or according to the degree of correlation between the second webpage and the first webpage labeled as unrelated, comprises:

extracting the body text of the first webpage;

generating the document of the first webpage according to the body text, title and brief introduction of the first webpage; wherein the document of the first webpage includes the document of the first webpage labeled as related and the document of the first webpage labeled as unrelated;

extracting the title and brief introduction of the second webpage, and generating the document of the second webpage;

estimating the degree of correlation between the second webpage and the search intention according to the degree of correlation between the document of the first webpage labeled as related and the document of the second webpage, and/or the degree of correlation between the document of the first webpage labeled as unrelated and the document of the second webpage.

3. The web search method according to claim 2, characterized in that said estimating the degree of correlation between the second webpage and the search intention comprises:

calculating, according to the formula, the degree of correlation Score between the second webpage and the search intention;

wherein α, β and γ are preset weights, and n is the rank of the second webpage in the matching webpages;

the document of the first webpage labeled as related, vectorized based on the bag-of-words model, is denoted V_{related 1}; the document of the first webpage labeled as unrelated, vectorized based on the bag-of-words model, is denoted V_{unrelated 1}; the document of the first webpage labeled as related, vectorized based on the distributed word-vector representation model, is denoted V_{related 2}; the document of the first webpage labeled as unrelated, vectorized based on the distributed word-vector representation model, is denoted V_{unrelated 2};

S_{related 1} is the cosine similarity between V_{related 1} and the bag-of-words vectorized representation V_{result 1} of the second webpage; S_{related 2} is the cosine similarity between V_{related 2} and the distributed word-vector representation V_{result 2} of the second webpage; S_{unrelated 1} is the cosine similarity between V_{unrelated 1} and the bag-of-words vectorized representation V_{result 1} of the second webpage; S_{unrelated 2} is the cosine similarity between V_{unrelated 2} and the distributed word-vector representation V_{result 2} of the second webpage.

4. The web search method according to claim 2, characterized in that said extracting the body text of the first webpage comprises:

obtaining the html file of the first webpage;

determining the nodes containing text content according to the number of links contained in each node of the html file;

generating the body text of the first webpage according to the nodes containing text content.

5. The web search method according to claim 4, characterized in that, after said obtaining the html file of the first webpage, the method further comprises:

removing the nodes corresponding to controls, display styles and/or comments.

6. The web search method according to claim 4, characterized in that said determining the nodes containing text content according to the number of links contained in each node of the html file comprises:

determining, as the nodes containing text content, the nodes in the html file whose number of contained links is less than a preset threshold.

7. The web search method according to any one of claims 1-6, characterized in that said reordering the second webpages according to the degree of correlation between the second webpage and the search intention comprises:

reordering the second webpages on the principle that the greater the degree of correlation between a second webpage and the search intention, the higher the second webpage ranks after reordering.

8. The web search method according to any one of claims 1-6, characterized in that, before said obtaining the label of the first webpage, the method further comprises:

generating and displaying an interaction page, wherein the interaction page displays links to the matching webpages and check boxes for obtaining the labels.

9. A webpage search device, characterized by comprising:

an acquisition module, configured to obtain a search keyword;

a search module, configured to search according to the keyword to obtain matching webpages and a ranking of the matching webpages;

a labeling module, configured to obtain the label of a first webpage after a user selects the first webpage from the matching webpages and labels it; wherein the label indicates the correlation between the first webpage and the search intention behind the keyword search;

a computing module, configured to estimate the degree of correlation between a second webpage in the matching webpages and the search intention according to the degree of correlation between the second webpage and the first webpage labeled as related, and/or according to the degree of correlation between the second webpage and the first webpage labeled as unrelated;

a reordering module, configured to reorder the second webpages according to the degree of correlation between the second webpage and the search intention.

10. The webpage search device according to claim 9, characterized in that the computing module comprises:

an extraction unit, configured to extract the body text of the first webpage;

a generation unit, configured to generate the document of the first webpage according to the body text, title and brief introduction of the first webpage; wherein the document of the first webpage includes the document of the first webpage labeled as related and the document of the first webpage labeled as unrelated;

an extraction and generation unit, configured to extract the title and brief introduction of the second webpage and generate the document of the second webpage;

a computing unit, configured to estimate the degree of correlation between the second webpage and the search intention according to the degree of correlation between the document of the first webpage labeled as related and the document of the second webpage, and/or the degree of correlation between the document of the first webpage labeled as unrelated and the document of the second webpage.

11. The webpage search device according to claim 10, characterized in that the computing unit is specifically configured to:

12. The webpage search device according to claim 10, characterized in that the extraction unit comprises:

an obtaining subunit, configured to obtain the html file of the first webpage;

a determination subunit, configured to determine the nodes containing text content according to the number of links contained in each node of the html file;

a generation subunit, configured to generate the body text of the first webpage according to the nodes containing text content.

13. The webpage search device according to claim 12, characterized in that the extraction unit further comprises:

a removal subunit, configured to remove the nodes corresponding to controls, display styles and/or comments.

14. The webpage search device according to claim 12, characterized in that the determination subunit is specifically configured to:

15. The webpage search device according to any one of claims 9-14, characterized in that the reordering module is specifically configured to:

16. The webpage search device according to any one of claims 9-14, characterized in that the webpage search device further comprises:

an interactive display module, configured to generate and display an interaction page, wherein the interaction page displays links to the matching webpages and check boxes for obtaining the labels.

17. A computer device, characterized by comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the web search method according to any one of claims 1-8.

18. A non-transitory computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the web search method according to any one of claims 1-8.