CN107220307B - Webpage searching method and device - Google Patents

Info

Publication number
CN107220307B
Authority
CN
China
Prior art keywords
webpage
correlation
web page
degree
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710326803.XA
Other languages
Chinese (zh)
Other versions
CN107220307A (en
Inventor
黄永峰
刘俊鑫
吴方照
刘佳伟
袁志刚
吴思行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201710326803.XA
Publication of CN107220307A
Application granted
Publication of CN107220307B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/194: Calculation of difference between files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention provides a webpage searching method and device. The method comprises: searching according to keywords to obtain matched webpages and their ranking; having the user select first webpages among the matched webpages and label them; estimating the degree of correlation between a second webpage and the search intention according to the degree of correlation between the second webpage and the first webpages labeled as relevant and/or irrelevant; and reordering the second webpages according to that degree of correlation. On the one hand, the method takes the user's search intention fully into account and helps the user quickly locate useful information. On the other hand, the user only needs to label the relevance of a small number of webpages in the search results to raise the ranking of second webpages related to the search intention, which reduces the time needed to obtain the target information and improves the user experience.

Description

Webpage searching method and device
Technical Field
The invention relates to the technical field of information retrieval, in particular to a webpage searching method and device.
Background
With the popularization and development of the internet, more and more people use the network to search for information. The amount of information on the internet is now enormous. People use search engines to retrieve information, and the search engine collates and ranks the search results before returning them to the user.
Currently, when a user retrieves information with a search engine, the interaction between the search engine and the user is limited to the search keywords entered by the user, or the search results are optimized using only part of the available information, such as the user's browsing log.
In the prior art, the webpages the user actually needs are often ranked relatively low in the search results presented by the search engine, so the user cannot effectively locate them. This increases the time needed to obtain the target information and degrades the user experience.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, a first objective of the present invention is to provide a webpage searching method that reorders the search results by calculating the degree of correlation between a second webpage and the user's search intention, so that the information the user needs can be located quickly. This addresses the technical problems in the prior art that, because the user's search intention is insufficiently considered, it takes the user a long time to acquire the required information and the user experience is poor.
The second objective of the present invention is to provide a web page searching apparatus.
A third object of the invention is to propose a computer device.
A fourth object of the invention is to propose a non-transitory computer-readable storage medium.
In order to achieve the above object, an embodiment of a first aspect of the present invention provides a method for searching a web page, including:
acquiring a searched keyword;
searching according to the keywords to obtain matched webpages and the sequence of the matched webpages;
after a user selects a first webpage in the matched webpages for labeling, acquiring the label of the first webpage; wherein the label is used for indicating the correlation between the first webpage and the search intention searched by the keywords;
estimating the degree of correlation between the second webpage and the search intention according to the degree of correlation between the second webpage and the first webpage marked as relevant in the matched webpages and/or according to the degree of correlation between the second webpage and the first webpage marked as irrelevant;
and reordering the second webpage according to the degree of correlation between the second webpage and the search intention.
The embodiment of the invention provides a webpage searching method, wherein search keywords are obtained; searching according to the keywords to obtain matched webpages and the sequence of the matched webpages; a user selects a first webpage in the matched webpages for labeling; estimating the degree of correlation between the second webpage and the search intention according to the degree of correlation between the second webpage and the first webpage marked as relevant in the matched webpages and/or according to the degree of correlation between the second webpage and the first webpage marked as irrelevant; and reordering the second webpage according to the degree of correlation. The user marks the matched web pages in the search result according to the search intention, and reorders the second web pages according to the calculated correlation degree, the search intention of the user is fully considered, the user can be helped to quickly locate useful information, the time for the user to obtain the useful information is shortened, and the experience degree of the user is improved.
In order to achieve the above object, a second embodiment of the present invention provides a web page searching apparatus, including:
the acquisition module is used for acquiring searched keywords;
the searching module is used for searching according to the keywords to obtain matched webpages and the sequence of the matched webpages;
the labeling module is used for acquiring the label of the first webpage after the user selects the first webpage in the matched webpages for labeling; wherein the label is used for indicating the correlation between the first webpage and the search intention searched by the keywords;
the calculation module is used for estimating the degree of correlation between the second webpage and the search intention according to the degree of correlation between the second webpage and the first webpage marked as relevant in the matched webpages and/or according to the degree of correlation between the second webpage and the first webpage marked as irrelevant;
and the reordering module is used for reordering the second webpage according to the correlation degree between the second webpage and the search intention.
The webpage searching device of the embodiment of the invention comprises an acquisition module, a searching module, a labeling module, a calculation module and a reordering module. The acquisition module is used for acquiring searched keywords; the searching module is used for searching according to the keywords to obtain matched webpages and the sequence of the matched webpages; the labeling module is used for acquiring the label of the first webpage after the user selects the first webpage in the matched webpages for labeling, wherein the label is used for indicating the correlation between the first webpage and the search intention searched by the keywords; the calculation module is used for estimating the degree of correlation between the second webpage and the search intention according to the degree of correlation between the second webpage and the first webpage marked as relevant in the matched webpages and/or according to the degree of correlation between the second webpage and the first webpage marked as irrelevant; and the reordering module is used for reordering the second webpage according to the correlation degree between the second webpage and the search intention. Because the user labels matched webpages in the search result according to the search intention and the second webpages are reordered according to the calculated degree of correlation, the search intention of the user is fully considered, the user can quickly locate useful information, the time needed to obtain that information is shortened, and the user experience is improved.
To achieve the above object, a third embodiment of the present invention provides a computer device, including: a memory, a processor and a computer program stored in the memory and executable on the processor for performing the web page search method of the first aspect when the computer program is executed by the processor.
In order to achieve the above object, a fourth embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, is configured to perform the web page search method according to the first aspect.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Fig. 1 is a schematic flowchart of a webpage searching method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a method for estimating the degree of correlation between a webpage and a search intention according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of a method for extracting the body text of a webpage according to this embodiment;
Fig. 4 is a schematic flowchart of a method for removing nodes containing a large number of links according to this embodiment;
Fig. 5 is a schematic structural diagram of a webpage searching apparatus according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of another webpage searching apparatus according to an embodiment of the present invention; and
Fig. 7 is a schematic structural diagram of an extraction unit 341 according to an embodiment of the present invention.
Detailed Description
A web page search method and apparatus according to an embodiment of the present invention will be described with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a web page search method according to an embodiment of the present invention, and as shown in fig. 1, the method includes the following steps:
Step S1: a search keyword is acquired.
Specifically, in the input interface of the search engine, the user enters keywords according to the information need, and the search engine identifies the keywords entered by the user.
Step S2: a search is performed according to the keywords to obtain the matched webpages and the ranking of the matched webpages.
Specifically, according to the keywords entered by the user, the search engine uses web crawler technology to query information in a database. After finding the webpages that match the keywords, it calculates the correlation between each webpage and the user's search intention, obtains the ranking of the matched webpages, records the initial rank of each matched webpage, and displays the matched webpages in the user's display interface in that initial order.
The search engine uses web crawler technology to search for target webpages according to the keywords entered by the user. As a possible implementation manner, the search engine is called with a query (GET) request, in which the parameters sent to the server are appended to the request Uniform Resource Locator (URL) after a question mark (?), for example: "http://www.baidu.com/s?q1=清华大学" (a search for Tsinghua University). Other advanced search parameters, such as a limit on the update time of the webpages, can likewise be set by appending the corresponding parameters to the URL. For example, to limit both the time range of the webpages and the number of results displayed per page in a Baidu search, the request becomes: "http://www.baidu.com/s?q1=清华大学&lm=7&rn=5", which searches for webpages about Tsinghua University from the most recent week and displays 5 results per page.
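The GET-style call described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `build_search_url` is hypothetical, the parameter names `q1`, `lm` and `rn` are copied from the example in the text, and whether the live search engine still accepts them has not been verified.

```python
from urllib.parse import urlencode

# Base address taken from the example in the text.
BASE_URL = "http://www.baidu.com/s"

def build_search_url(keyword, extra_params=None):
    """Append the query parameters to the request URL after a '?'.

    `q1` carries the keyword; `extra_params` may hold further options such
    as `lm` (time limit in days) and `rn` (results per page), as in the text.
    """
    params = {"q1": keyword}
    params.update(extra_params or {})
    # urlencode percent-encodes non-ASCII keywords and joins pairs with '&'.
    return BASE_URL + "?" + urlencode(params)

if __name__ == "__main__":
    print(build_search_url("清华大学", {"lm": 7, "rn": 5}))
```

Building the query string with `urlencode` rather than by string concatenation keeps non-ASCII keywords and reserved characters correctly escaped.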
Step S3: after the user selects a first webpage among the matched webpages and labels it, the label of the first webpage is acquired.
Specifically, the user labels the returned matched webpages according to how closely they relate to the search requirement, and the labeled webpage information is returned to the search engine. The webpages labeled by the user are called first webpages. It should be noted that the first webpages labeled by the user include both webpages related to the user's search requirement and webpages unrelated to it. For convenience of distinction, the matched webpages related to the search requirement are called relevant first webpages, and those unrelated to the search requirement are called irrelevant first webpages.
Further, the user labels the matched webpages in an interactive page generated and displayed by the search engine system. The interactive page displays the links of all matched pages together with check boxes for labeling. Two check buttons, "relevant" and "irrelevant", are placed beside each matched page. If the page is related to the search requirement, the "relevant" button is checked and the page is set as a relevant first webpage; if the page is unrelated to the search requirement, the "irrelevant" button is checked and the page is set as an irrelevant first webpage. It should be noted that the relevant and irrelevant first webpages are determined by the user according to the search requirement, and there may be one or more of each.
Step S4: the degree of correlation between the second webpage and the search intention is estimated according to the degree of correlation between the second webpage and the first webpage marked as relevant in the matched webpages and/or according to the degree of correlation between the second webpage and the first webpage marked as irrelevant.
Specifically, the keywords input by the user correspond to the search intention of the user, and the text information of the body in the web page corresponds to the search intention of the user, so that the body text of the web page needs to be extracted to calculate the degree of correlation between the web page and the search intention of the user.
Firstly, text of a first webpage is extracted, and documents of the first webpage are generated according to the text, the title and the brief introduction of the first webpage, wherein the documents of the first webpage comprise the documents marked as related first webpages and the documents marked as unrelated first webpages.
Then, the title and the brief introduction of the second webpage are extracted, and a document of the second webpage is generated.
Finally, the degree of correlation between the second webpage and the search intention is estimated according to the degree of correlation between the document marked as the related first webpage and the document of the second webpage and/or the degree of correlation between the document marked as the unrelated first webpage and the document of the second webpage.
The second web page is generally a web page other than the first web page among the searched matching web pages. Those skilled in the art can know that, specifically, the second web page may be all web pages in the matching web pages except the first web page, or may be a part of web pages, which is not limited in this embodiment.
Step S5: the second webpages are reordered according to the degree of correlation between each second webpage and the search intention.
Specifically, the second webpages are reordered on the principle that the greater the degree of correlation between a second webpage and the search intention, the higher that webpage is ranked after reordering.
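The reordering rule can be sketched as a descending sort on the estimated degree of correlation. This is a minimal illustration, not the patented implementation: the function name `reorder` and the example scores are hypothetical, and keeping the initial order for equal scores is an assumption the text does not state.

```python
def reorder(second_pages, scores):
    """Return the pages sorted so that a larger score ranks earlier.

    `second_pages` is given in its initial order and `scores[i]` is the
    estimated degree of correlation of `second_pages[i]`. Python's sort is
    stable, so pages with equal scores keep their initial relative order
    (an assumption made for this sketch).
    """
    order = sorted(range(len(second_pages)), key=lambda i: -scores[i])
    return [second_pages[i] for i in order]

if __name__ == "__main__":
    pages = ["page-a", "page-b", "page-c"]
    print(reorder(pages, [0.2, 0.9, 0.2]))
```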
In the webpage searching method, the searched keywords are obtained, searching is carried out according to the keywords, and matched webpages and the sequence of the matched webpages are obtained; a user selects a first webpage in the matched webpages for labeling; estimating the degree of correlation between the second webpage and the search intention according to the degree of correlation between the second webpage and the first webpage marked as relevant in the matched webpages and/or according to the degree of correlation between the second webpage and the first webpage marked as irrelevant; and reordering the second webpage according to the degree of correlation. The user marks the matched web pages in the search result according to the search intention, and reorders the second web pages according to the calculated correlation degree, the search intention of the user is fully considered, the user can be helped to quickly locate useful information, the time for the user to obtain the useful information is shortened, and the experience degree of the user is improved.
To further clarify step S4 in the previous embodiment, as a possible implementation manner, fig. 2 is a flowchart illustrating a method for estimating a degree of correlation between a web page and a search intention according to an embodiment of the present invention.
As shown in fig. 2, step S4 includes the following steps:
Step S41: the body text of the first webpage is extracted.
Specifically, a webpage contains a large amount of text, some related to the user's search requirement and some unrelated to it. For convenience of distinction, the text related to the user's requirement is called body text, and the text that is unrelated to the requirement and cannot reflect the user's search intention, such as navigation bars, external links and advertisements, is called noise text. To represent the user's search intention more accurately, the irrelevant noise text must be removed and only the body text retained.
Step S42: a document of the first webpage is generated from the body text, title and brief introduction of the first webpage.
Specifically, a separate lookup pattern is defined for the result pages returned by each search engine, and the title and brief introduction are extracted from the result page. Together with the extracted body text of the first webpage, they form the document of the first webpage. The documents of the first webpages comprise documents of first webpages marked as relevant and documents of first webpages marked as irrelevant.
Step S43: the title and brief introduction of the second webpage are extracted to generate a document of the second webpage.
Specifically, a separate lookup pattern is defined for the result pages returned by each search engine, and the title and brief introduction are extracted from the result page to generate the document of the second webpage.
It should be understood that, for the second webpage, the body text is not extracted. Only the title and brief introduction are used, which reduces the number of network connections that must be initiated and therefore the background processing time, so that the system can respond quickly to the user's search while remaining sufficiently accurate, improving the user experience.
Step S44, estimating the degree of correlation between the second webpage and the search intention according to the degree of correlation between the document marked as relevant first webpage and the document of the second webpage and/or the degree of correlation between the document marked as irrelevant first webpage and the document of the second webpage.
Specifically, the degree of similarity between two documents, i.e., the degree of correlation, is calculated, and the most common method is to map the documents into vectors and measure the degree of similarity between the two documents using the degree of similarity between the corresponding vectors of the two documents.
As a possible implementation, the documents are mapped into vectors by using a bag-of-words model, which is a method for mapping document data into vectors commonly used in natural language processing. Assuming that the dictionary has N words, and all documents are composed of the N words in the dictionary, any one document can be mapped into an N-dimensional vector by using the bag-of-words model, and the kth dimension of the vector corresponds to the weight of the kth word in the dictionary in the document. The weight of a word may be the frequency of the word appearing in the document, and the most common weight judgment method is to judge according to the term frequency-inverse document frequency (TF-IDF) value of the word in the document. The Term Frequency (TF) refers to the frequency of a certain term appearing in a document, and the higher the frequency of appearance is, the more important the term is, the higher the weight is, and the larger the value of TF is; the Inverse Document Frequency (IDF) means that a smaller weight is given to a common word and a larger weight is given to an uncommon word, that is, the size of the IDF is inversely proportional to the degree of commonness of one word. The TF-IDF value of a word, i.e., multiplying TF by IDF, a larger value indicates a higher weight for the word in the document.
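The bag-of-words mapping with TF-IDF weights can be sketched as below. The text does not give the exact TF or IDF formula, so the variant used here (relative term frequency and a smoothed logarithmic IDF) is an assumption, as are the names `tfidf_vector`, `dictionary` and `corpus`. Documents are assumed to be already split into word lists.

```python
import math
from collections import Counter

def tfidf_vector(doc_tokens, dictionary, corpus):
    """Map a tokenized document to an N-dimensional TF-IDF vector.

    dictionary: list of N words; dimension k corresponds to dictionary[k].
    corpus: list of tokenized documents used to estimate how common a word is.
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens) or 1
    n_docs = len(corpus)
    vector = []
    for word in dictionary:
        tf = counts[word] / total                      # term frequency
        df = sum(1 for doc in corpus if word in doc)   # document frequency
        # Smoothed IDF (an assumed variant): common words get a smaller
        # weight, rare words a larger one.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        vector.append(tf * idf)
    return vector

if __name__ == "__main__":
    corpus = [["web", "search"], ["web", "page"], ["web", "ranking"]]
    dictionary = ["search", "web", "page"]
    print(tfidf_vector(["web", "search"], dictionary, corpus))
```

In this toy corpus "web" occurs in every document while "search" occurs in one, so "search" receives the larger weight even though both appear once in the example document.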
It should be understood that, before the TF-IDF value is calculated, the document must first be segmented into words. An existing word segmentation tool converts the document into a set of words, the number of times each word appears in the document is counted, and the TF-IDF value of each word is calculated, which yields the vectorized representation of the document.
It should be noted that, in practical applications, the dictionary used in the system contains about 300,000 words, which covers most Chinese words. However, because new words continually appear on the network and the word segmentation tool may make segmentation errors, a small number of words that do not exist in the dictionary may appear in the segmentation result. The system simply discards these words.
As another possible implementation, a distributed vector representation model is used to map a document into a vector. The distributed representation of a word, commonly called a word vector, maps each word in the dictionary to a real-valued vector in a vector space. Word vectors can be obtained by training a "neural language" model, and the dimension of the vectors can be set as needed during training. Trained word vectors represent the semantic information of words well: words with similar meanings lie close to one another in the vector space. Given vector representations of words, there are several ways to extend them to vector representations of documents. As a possible implementation, the system obtains the document vector as a weighted average of the word vectors. First, word vectors are trained in advance on a collected corpus. Then the document is segmented with a word segmentation tool and the TF-IDF value of each word is computed. Finally, the TF-IDF values are used as weights in a weighted average of the word vectors, which gives the document vector.
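The weighted-average step can be sketched as follows. The word vectors and weights in the example are made-up toy values, and the function name `document_vector` is hypothetical. Skipping words that have no trained vector is an assumption made for this sketch.

```python
def document_vector(weights, word_vectors):
    """Weighted average of word vectors, using TF-IDF values as weights.

    weights: {word: tfidf_value} for the words of one document.
    word_vectors: {word: list of floats}, all of the same dimension.
    """
    dim = len(next(iter(word_vectors.values())))
    acc = [0.0] * dim
    total = 0.0
    for word, weight in weights.items():
        vec = word_vectors.get(word)
        if vec is None:
            continue  # no trained vector for this word (assumed: skip it)
        for k in range(dim):
            acc[k] += weight * vec[k]
        total += weight
    # Divide by the sum of the weights actually used; all-zero if none.
    return [x / total for x in acc] if total else acc

if __name__ == "__main__":
    vectors = {"search": [1.0, 0.0], "engine": [0.0, 1.0]}
    print(document_vector({"search": 1.0, "engine": 3.0}, vectors))
```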
Further, using the bag-of-words model, the first-webpage document labeled as relevant can be vectorized as $V_{\text{relevant},1}$, the first-webpage document labeled as irrelevant as $V_{\text{irrelevant},1}$, and the second-webpage document as $V_{\text{result},1}$. Using the distributed vector representation model, the first-webpage document labeled as relevant can be vectorized as $V_{\text{relevant},2}$, the first-webpage document labeled as irrelevant as $V_{\text{irrelevant},2}$, and the second-webpage document as $V_{\text{result},2}$.
Specifically, once the first-webpage documents and the second-webpage document have been vectorized, the similarity between documents can be measured by the similarity between their vectors. As a possible implementation manner, the system uses cosine similarity to express the similarity between vectors. It computes the cosine similarity $S_{\text{relevant},1}$ between $V_{\text{relevant},1}$ and $V_{\text{result},1}$; the cosine similarity $S_{\text{relevant},2}$ between $V_{\text{relevant},2}$ and $V_{\text{result},2}$; the cosine similarity $S_{\text{irrelevant},1}$ between $V_{\text{irrelevant},1}$ and $V_{\text{result},1}$; and the cosine similarity $S_{\text{irrelevant},2}$ between $V_{\text{irrelevant},2}$ and $V_{\text{result},2}$.
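Cosine similarity between two document vectors is the dot product divided by the product of the vector lengths. A minimal sketch in plain Python follows. The function name is hypothetical, and returning 0.0 for an all-zero vector is a convention chosen for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for empty documents (an assumption)
    return dot / (norm_u * norm_v)

if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
    print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))  # identical vectors
```

The same function would be applied to each of the four vector pairs named above, once per representation model and per label.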
Furthermore, from the cosine similarities between the first-webpage documents and the second-webpage document, the degree of correlation Score between the second webpage and the search intention is calculated using the formula
[Formula image: Figure GDA0002551053020000071, not reproduced in this text]
where α, β and γ are preset weights, and n is the rank of the second webpage among the matched webpages.
Finally, the second webpages are reordered according to the calculated Score values and the reordered webpages are displayed to the user. The larger the Score value, the higher the corresponding second webpage is ranked after reordering.
In the embodiment of the invention, the text of the first webpage is extracted; generating a first webpage document according to the text, the title and the brief introduction of the first webpage; extracting the title and the brief introduction of the second webpage to generate a document of the second webpage; the degree of relevance between the second webpage and the search intent is estimated based on the degree of relevance between the documents of the first webpage labeled as relevant and the documents of the second webpage, and/or based on the degree of relevance between the documents of the first webpage labeled as irrelevant and the documents of the second webpage. The user marks the matched web pages in the search result according to the search intention, and reorders the second web pages according to the calculated correlation degree, the search intention of the user is fully considered, the user can be helped to quickly locate useful information, the time for the user to obtain the useful information is shortened, and the experience degree of the user is improved.
In the above embodiment, when the text contained in a webpage is extracted, the useless noise text must be removed and only the body text retained. However, because webpage layouts differ greatly from one another, there is no single uniform algorithm for parsing out the body text.
Specifically, in order to further clearly describe the process of step S41 in the foregoing embodiment, fig. 3 is a schematic flow chart of a method for extracting text from a body of a web page according to this embodiment, as shown in fig. 3, step S41 includes the following steps:
step S411, obtaining the html file of the first webpage.
Specifically, an html file of a first webpage labeled by a user is obtained, each component in the html file is a node according to a Document Object Model (DOM), and all information is stored in different nodes in the html file. Wherein, the whole document is a document node; each html tag is an element node; each html attribute is an attribute node; the annotations are annotation nodes.
And step S412, removing nodes corresponding to control, display styles and/or comments in the html file.
Specifically, when extracting the body text, the nodes corresponding to the control, display style and/or annotation do not contain text information, and are irrelevant to the search requirement of the user, and when extracting the body text, the nodes not containing the body text information need to be removed, and the nodes needing to be removed at least comprise the following nodes:
removing <script> nodes, which define client-side scripts;
removing <noscript> nodes, which define alternative content (text) shown when scripts are not executed;
removing <style> nodes, which define style information for the html document;
removing <span> nodes, which group inline elements in the document;
removing <meta> nodes, which provide meta-information about the page, such as a description and keywords for search engines and the update frequency;
removing nodes containing the attribute style="display:none", which hides an object in the webpage without reserving physical space for the hidden object;
removing comment nodes, i.e. comments inserted in the source code.
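The removal rules listed above can be sketched in Python with the standard-library html.parser module. This is only an illustrative sketch under stated assumptions, not the patented implementation: the names NoiseStripper and strip_noise are hypothetical, and instead of editing a DOM tree the sketch simply skips the text inside the removed node types while streaming through the file.

```python
from html.parser import HTMLParser

# Element types removed in step S412. <meta> is covered by VOID_TAGS:
# it is a void element with no text content, so ignoring it is enough.
SKIPPED_TAGS = {"script", "noscript", "style", "span"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NoiseStripper(HTMLParser):
    """Collects the text that lies outside the removed node types."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_elements = []  # (tag, starts_skipped_subtree) per open element
        self.skip_depth = 0      # > 0 while inside a removed subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        skipped = tag in SKIPPED_TAGS or "display:none" in style
        self.open_elements.append((tag, skipped))
        self.skip_depth += skipped

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or all(t != tag for t, _ in self.open_elements):
            return  # stray end tag: ignore it
        while self.open_elements:
            open_tag, skipped = self.open_elements.pop()
            self.skip_depth -= skipped
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

    # Comments are delivered to handle_comment(), which HTMLParser ignores
    # by default, so comment nodes are dropped without extra code.


def strip_noise(html):
    """Return the text of html with the step S412 node types removed."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Note that, following the list above literally, text inside <span> nodes is discarded as well; an implementation that wants to keep inline body text would need to treat <span> differently.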
Step S413, determining the nodes containing body-text content according to the number of links contained in each node in the html file.
Specifically, parts of a page such as navigation bars and advertisements generally contain many links, whereas the body text of a page generally contains few or no links. Whether the text of a node belongs to the body text of the page can therefore be judged from the number of link nodes in and around the node: if the node contains many links, its text is judged not to be body text and is deleted; otherwise, its text is judged to be body text and is retained.
Step S414, generating the body text of the first webpage according to the nodes containing text content.
Specifically, after the nodes corresponding to controls, display styles and/or comments and the nodes containing many links have been removed, redundant spaces and tags in the html file are deleted, and the text content of the remaining nodes is extracted to obtain the body text of the page.
To further explain the process of removing nodes containing a large number of links in step S413, as a possible implementation manner, fig. 4 is a schematic flowchart of a method for removing nodes containing a large number of links proposed in this embodiment. The method includes the following steps:
Step S4131, finding all <a> nodes in the html file.
Here, an <a> node is a hyperlink node. The following steps are performed for each <a> node.
Step S4132, finding the parent node of the <a> node, denoting it as parent, and initializing the variable temp to null.
Specifically, each <a> node has a unique parent node, and temp is a temporary variable.
Step S4133, finding all <a> nodes under parent, computing the sum $l_a$ of the character lengths of all <a> nodes under parent, and computing the character length $l_p$ of parent.
Specifically, each parent node may have multiple <a> child nodes at the same level. The sum of the character lengths of all <a> nodes under the parent node is used to judge whether the parent node contains many hyperlink nodes and is therefore a noise node to be deleted.
Step S4134, determining whether $l_a/l_p$ is greater than a preset threshold t; if so, performing step S4135, otherwise performing step S4136.
Specifically, if the value of $l_a/l_p$ is greater than the preset threshold t, parent is considered to be a node containing a large number of links, i.e. a noise node to be deleted; parent is assigned to temp, and parent is then pointed to its own parent node. If the value of $l_a/l_p$ is less than or equal to the preset threshold t, the node is considered to be a body-text node and does not need to be deleted.
Step S4135, assigning parent to temp, pointing parent to its own parent node, and returning to step S4133.
Step S4136, if temp is not null, deleting the node referenced by temp; if temp is null, deleting nothing. The process then returns to step S4132.
It should be noted that, regardless of whether a node is deleted, the process returns to step S4132 and proceeds to the next <a> node, until all <a> nodes found in step S4131 have been processed.
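Steps S4131 to S4136 can be sketched as follows. This is a minimal illustration under several assumptions that are not in the original text: the input is well-formed markup that xml.etree.ElementTree can parse, the threshold value t=0.5 is arbitrary, and the helper names (text_len, detach, prune_link_heavy) are hypothetical. The sketch also never deletes the root node and skips <a> nodes that were already deleted together with an earlier ancestor, two details the flow above leaves open.

```python
import xml.etree.ElementTree as ET


def text_len(node):
    """Character length of all text inside node (its tail excluded)."""
    return len("".join(node.itertext()))


def detach(parent, node):
    """Remove node from parent but keep the text that follows it."""
    index = list(parent).index(node)
    if node.tail:
        if index > 0:
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + node.tail
        else:
            parent.text = (parent.text or "") + node.tail
    parent.remove(node)


def prune_link_heavy(root, t=0.5):
    """Delete subtrees whose link text dominates their text (t is assumed)."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    removed = set()

    def is_attached(node):
        while node is not None:
            if node in removed:
                return False
            node = parent_of.get(node)
        return True

    for a in list(root.iter("a")):                  # S4131: all <a> nodes
        if not is_attached(a):
            continue                                # deleted with an ancestor
        parent, temp = parent_of.get(a), None       # S4132
        while parent is not None:
            l_a = sum(text_len(x) for x in parent.iter("a"))   # S4133
            l_p = text_len(parent)
            if l_p == 0 or l_a / l_p <= t:          # S4134: not link-heavy
                break
            temp, parent = parent, parent_of.get(parent)       # S4135
        if temp is not None and temp in parent_of:  # S4136: delete temp
            detach(parent_of[temp], temp)
            removed.add(temp)
    return root
```

For example, a navigation block consisting only of links has $l_a/l_p = 1$ at every level up to the page body, so the whole block is deleted, while a paragraph with one short link in a long sentence stays well below the threshold and is retained.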
In the embodiments corresponding to fig. 3 and fig. 4, body text is extracted from the obtained html files to obtain the documents of the first webpages labeled as relevant and of the first webpages labeled as irrelevant. In addition, the title and the introduction of the second webpage are extracted to obtain the document of the second webpage. The degree of correlation between the second webpage and the search intention is then estimated according to the degree of correlation between the documents of the first webpages labeled as relevant and the document of the second webpage and/or the degree of correlation between the documents of the first webpages labeled as irrelevant and the document of the second webpage, and the second webpages are reordered according to the calculated degree of correlation. In this way the user's search intention is fully considered, the user is helped to quickly locate useful information, the time needed to obtain useful information is shortened, and the user experience is improved.
In order to implement the above embodiment, the present invention further provides a device for web page search.
Fig. 5 is a schematic structural diagram of a web page search apparatus according to an embodiment of the present invention.
As shown in fig. 5, the web page search apparatus includes: an obtaining module 31, a search module 32, a labeling module 33, a calculation module 34 and a reordering module 35.
The obtaining module 31 is configured to obtain a keyword for a search.
The search module 32 is configured to search according to the keyword to obtain matched webpages and the order of the matched webpages.
The labeling module 33 is configured to obtain the label of a first webpage after the user selects the first webpage from the matched webpages for labeling, wherein the label indicates the correlation between the first webpage and the search intention of the keyword search.
The calculation module 34 is configured to estimate the degree of correlation between a second webpage among the matched webpages and the search intention according to the degree of correlation between the second webpage and the first webpage labeled as relevant and/or the degree of correlation between the second webpage and the first webpage labeled as irrelevant.
The reordering module 35 is configured to reorder the second webpages according to the degree of correlation between each second webpage and the search intention.
As a possible implementation manner, the reordering module 35 is specifically configured to reorder the second webpages on the principle that the greater the degree of correlation between a second webpage and the search intention, the nearer the front that second webpage is placed after reordering.
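The reordering rule can be illustrated with a short Python sketch. The function name reorder and the parallel-list inputs are assumptions made for the example; the scores stand for the estimated degrees of correlation between each second webpage and the search intention.

```python
def reorder(pages, scores):
    """Return pages sorted so that higher-scoring pages come first.

    sorted() is stable, so pages with equal scores keep the relative
    order they had in the original search results.
    """
    ranked = sorted(zip(pages, scores), key=lambda pair: pair[1], reverse=True)
    return [page for page, _ in ranked]
```

Keeping ties in their original order is a design choice of this sketch: it lets the search engine's own ranking act as the tie-breaker.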
In the embodiment of the invention, the obtaining module obtains the keyword for the search; the search module searches according to the keyword to obtain the matched webpages and their order; the labeling module obtains the label of the first webpage after the user selects the first webpage from the matched webpages for labeling, wherein the label indicates the correlation between the first webpage and the search intention of the keyword search; the calculation module estimates the degree of correlation between the second webpage and the search intention according to the degree of correlation between the second webpage and the first webpage labeled as relevant and/or the degree of correlation between the second webpage and the first webpage labeled as irrelevant; and the reordering module reorders the second webpages according to the degree of correlation between each second webpage and the search intention. Because the user labels matched webpages in the search results according to the search intention and the second webpages are reordered according to the calculated degree of correlation, the user's search intention is fully considered, the user is helped to quickly locate useful information, the time needed to obtain useful information is shortened, and the user experience is improved.
It should be noted that the foregoing explanations of the method embodiments also apply to the apparatus of this embodiment, and are not repeated herein.
Based on the foregoing embodiment, an embodiment of the present invention further provides a possible implementation manner of another web page search apparatus. Fig. 6 is a schematic structural diagram of another web page search apparatus provided in an embodiment of the present invention. On the basis of the foregoing embodiment, the web page search apparatus further includes an interactive display module 36, configured to generate and display an interactive page, wherein the interactive page displays links to the matched webpages and check boxes for obtaining the labels.
In addition, on the basis of the above embodiment, the calculation module 34 includes: an extraction unit 341, a generation unit 342, an extraction generation unit 343, and a calculation unit 344.
The extracting unit 341 is configured to extract the body text of the first web page.
A generating unit 342 for generating a document of the first web page based on the body text, title, and introduction of the first web page; the documents of the first webpage comprise documents marked as related first webpages and documents marked as unrelated first webpages.
The extracting and generating unit 343 is configured to extract the title and the introduction of the second web page, and generate a document of the second web page.
A calculating unit 344 for estimating a degree of correlation between the second webpage and the search intention according to a degree of correlation between the document of the first webpage labeled as relevant and the document of the second webpage, and/or a degree of correlation between the document of the first webpage labeled as irrelevant and the document of the second webpage.
As a possible implementation manner, the calculating unit 344 is specifically configured to, according to the formula
Figure GDA0002551053020000111
calculate the degree of correlation Score between the second webpage and the search intention.
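According to the claims, the similarity terms that enter the Score are cosine similarities between vectorized documents. The following Python sketch shows one such term for the bag-of-words representation only. It is an illustration under assumptions: the function names are hypothetical, the regular-expression tokenizer assumes text whose words are delimited by spaces or punctuation (Chinese text would first need word segmentation), and neither the word-based distributed vector representation nor the weighted combination in the formula is shown.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Vectorize a document as word counts (assumed simple tokenizer)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[word] * v[word] for word in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as unrelated
    return dot / (norm_u * norm_v)
```

In use, the document of a first webpage labeled as relevant (body text, title and introduction) and the document of a second webpage (title and introduction) would each be passed through bag_of_words, and cosine_similarity of the two vectors would give the corresponding bag-of-words similarity term.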
In the embodiment of the invention, the user labels matched webpages in the search results according to the search intention, the degree of correlation between the second webpage and the search intention is calculated, and the second webpages are reordered according to the calculated degree of correlation, so that the user's search intention is fully considered, the user is helped to quickly locate useful information, the time needed to obtain useful information is reduced, and the user experience is improved.
It should be noted that the foregoing explanations of the method embodiments also apply to the apparatus according to the embodiments of the present invention, and are not repeated herein.
Based on the foregoing embodiment, as a possible implementation manner, fig. 7 is a schematic structural diagram of the extracting unit 341 provided in the embodiment of the present invention. As shown in fig. 7, the extracting unit 341 includes: an obtaining subunit 3411, a removal subunit 3412, a determining subunit 3413 and a generating subunit 3414.
The obtaining subunit 3411 is configured to obtain an html file of the first web page.
A removal subunit 3412 is used to remove nodes corresponding to controls, display styles, and/or annotations.
The determining subunit 3413 is configured to determine a node containing text content according to the number of links contained in each node in the html file.
The generating subunit 3414 is configured to generate a body text of the first web page according to the node containing the text content.
As a possible implementation manner, the determining subunit 3413 is specifically configured to determine, as nodes containing text content, those nodes in the html file whose number of contained links is lower than a preset threshold.
It should be noted that the foregoing explanations of the method embodiments also apply to the apparatus of this embodiment, and are not repeated herein.
In the embodiment of the invention, the user labels matched webpages in the search results according to the search intention, the degree of correlation between the second webpage and the search intention is calculated, and the second webpages are reordered according to the calculated degree of correlation, so that the user's search intention is fully considered, the user is helped to quickly locate useful information, the time needed to obtain useful information is reduced, and the user experience is improved.
In order to implement the foregoing embodiments, the present invention also provides a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein when the computer program is executed by the processor, the web page search method in the foregoing embodiments is performed.
In order to implement the above embodiments, the present invention also proposes a non-transitory computer-readable storage medium; when a program stored in the storage medium is executed by a processor, the web page search method in the foregoing embodiments is performed.
In the description herein, a description referring to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, those skilled in the art can combine the various embodiments or examples, and the features of different embodiments or examples, described in this specification, provided they do not contradict one another.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing related hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (12)

1. A web page search method is characterized by comprising the following steps:
acquiring a searched keyword;
searching according to the keywords to obtain matched webpages and the sequence of the matched webpages;
after a user selects a first webpage in the matched webpages for labeling, acquiring the label of the first webpage; wherein the label is used for indicating the correlation between the first webpage and the search intention searched by the keyword;
estimating the degree of correlation between the second webpage and the search intention according to the degree of correlation between the second webpage and the first webpage labeled as relevant in the matching webpages and/or according to the degree of correlation between the second webpage and the first webpage labeled as irrelevant, specifically comprising:
extracting a body text of the first webpage, wherein the extracting the body text of the first webpage comprises: acquiring an html file of the first webpage; determining nodes containing text content according to the number of links contained in each node in the html file; generating a text of the first webpage according to the nodes containing the text content; removing nodes corresponding to the controls, display styles and/or annotations;
generating a document of the first webpage according to the text, the title and the brief introduction of the first webpage; wherein the documents of the first webpage comprise documents labeled as related first webpages and documents labeled as unrelated first webpages;
extracting the title and the brief introduction of the second webpage to generate a document of the second webpage;
estimating the degree of correlation between the second webpage and the search intention according to the degree of correlation between the document marked as the related first webpage and the document of the second webpage and/or the degree of correlation between the document marked as the unrelated first webpage and the document of the second webpage;
reordering the second web page according to a degree of correlation between the second web page and the search intent.
2. The method of searching for web pages according to claim 1, wherein said estimating a degree of correlation between the second web page and the search intention comprises:
according to the formula
Figure FDA0002551053010000011
Calculating to obtain a correlation degree Score between the second webpage and the search intention;
wherein α, β, γ are preset weights, and n is the ranking of the second web page in the matched web page;
the bag-of-words-model vectorized representation of the document labeled as the relevant first webpage is $V_{\text{relevant},1}$; the bag-of-words-model vectorized representation of the document labeled as the irrelevant first webpage is $V_{\text{irrelevant},1}$; the vectorized representation of the document labeled as the relevant first webpage based on the word-based distributed vector representation model is $V_{\text{relevant},2}$; the vectorized representation of the document labeled as the irrelevant first webpage based on the word-based distributed vector representation model is $V_{\text{irrelevant},2}$;
$S_{\text{relevant},1}$ is the cosine similarity between $V_{\text{relevant},1}$ and $V_{\text{result},1}$, the bag-of-words-model vectorized representation of the second webpage; $S_{\text{relevant},2}$ is the cosine similarity between $V_{\text{relevant},2}$ and $V_{\text{result},2}$, the vectorized representation of the second webpage based on the word-based distributed vector representation model; $S_{\text{irrelevant},1}$ is the cosine similarity between $V_{\text{irrelevant},1}$ and $V_{\text{result},1}$; $S_{\text{irrelevant},2}$ is the cosine similarity between $V_{\text{irrelevant},2}$ and $V_{\text{result},2}$.
3. The method for searching web pages according to claim 1, wherein the determining the nodes containing text content according to the number of links contained in each node in the html file comprises:
and determining the nodes with the link quantity lower than a preset threshold value in each node in the html file as the nodes containing the text content.
4. The web page searching method according to any one of claims 1 to 3, wherein the reordering of the second web page according to the degree of correlation between the second web page and the search intention comprises:
and reordering the second webpages on the principle that the greater the degree of correlation between a second webpage and the search intention, the nearer the front that second webpage is ranked after reordering.
5. The web page searching method according to any one of claims 1 to 3, wherein before the obtaining the label of the first web page, the method further comprises:
and generating and displaying an interactive page, wherein the interactive page displays a link of a matched page and a check box for acquiring the label.
6. A web page search apparatus, comprising:
the acquisition module is used for acquiring searched keywords;
the searching module is used for searching according to the keywords to obtain matched webpages and the sequence of the matched webpages;
the labeling module is used for acquiring the label of the first webpage after the user selects the first webpage in the matched webpages for labeling; wherein the label is used for indicating the correlation between the first webpage and the search intention searched by the keyword;
a calculating module, configured to estimate a degree of correlation between a second webpage and the search intention according to a degree of correlation between the second webpage and a first webpage labeled as relevant in the matching webpages and/or according to a degree of correlation between the second webpage and a first webpage labeled as irrelevant, wherein the calculating module specifically comprises:
an extracting unit, configured to extract a body text of the first webpage, where the extracting unit includes: the acquiring subunit is used for acquiring the html file of the first webpage; the determining subunit is used for determining the nodes containing the text content according to the number of links contained in each node in the html file; the generating subunit is used for generating a text of the first webpage according to the nodes containing text contents; a removal subunit, configured to remove nodes corresponding to the control, display style, and/or annotation;
the generating unit is used for generating a document of the first webpage according to the text, the title and the brief introduction of the first webpage; wherein the documents of the first webpage comprise documents labeled as related first webpages and documents labeled as unrelated first webpages;
the extraction generating unit is used for extracting the title and the brief introduction of the second webpage and generating a document of the second webpage;
a calculating unit, configured to estimate a degree of correlation between the second webpage and the search intention according to a degree of correlation between the document labeled as the relevant first webpage and the document of the second webpage, and/or a degree of correlation between the document labeled as the irrelevant first webpage and the document of the second webpage;
and the reordering module is used for reordering the second webpage according to the correlation degree between the second webpage and the search intention.
7. The web page search device according to claim 6, wherein the calculating unit is specifically configured to:
according to the formula
Figure FDA0002551053010000031
Calculating to obtain a correlation degree Score between the second webpage and the search intention;
wherein α, β, γ are preset weights, and n is the ranking of the second web page in the matched web page;
the bag-of-words-model vectorized representation of the document labeled as the relevant first webpage is $V_{\text{relevant},1}$; the bag-of-words-model vectorized representation of the document labeled as the irrelevant first webpage is $V_{\text{irrelevant},1}$; the vectorized representation of the document labeled as the relevant first webpage based on the word-based distributed vector representation model is $V_{\text{relevant},2}$; the vectorized representation of the document labeled as the irrelevant first webpage based on the word-based distributed vector representation model is $V_{\text{irrelevant},2}$;
$S_{\text{relevant},1}$ is the cosine similarity between $V_{\text{relevant},1}$ and $V_{\text{result},1}$, the bag-of-words-model vectorized representation of the second webpage; $S_{\text{relevant},2}$ is the cosine similarity between $V_{\text{relevant},2}$ and $V_{\text{result},2}$, the vectorized representation of the second webpage based on the word-based distributed vector representation model; $S_{\text{irrelevant},1}$ is the cosine similarity between $V_{\text{irrelevant},1}$ and $V_{\text{result},1}$; $S_{\text{irrelevant},2}$ is the cosine similarity between $V_{\text{irrelevant},2}$ and $V_{\text{result},2}$.
8. The apparatus according to claim 6, wherein the determining subunit is specifically configured to:
and determining the nodes with the link quantity lower than a preset threshold value in each node in the html file as the nodes containing the text content.
9. The apparatus for web page search according to any one of claims 6-8, wherein the reordering module is specifically configured to:
and reordering the second webpages on the principle that the greater the degree of correlation between a second webpage and the search intention, the nearer the front that second webpage is ranked after reordering.
10. The apparatus for web page search according to any one of claims 6 to 8, further comprising:
and the interactive display module is used for generating and displaying an interactive page, wherein the interactive page displays a link of a matched page and a check box for acquiring a label.
11. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the web page search method of any one of claims 1-5.
12. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the web page search method according to any one of claims 1 to 5.
CN201710326803.XA 2017-05-10 2017-05-10 Webpage searching method and device Active CN107220307B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710326803.XA CN107220307B (en) 2017-05-10 2017-05-10 Webpage searching method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710326803.XA CN107220307B (en) 2017-05-10 2017-05-10 Webpage searching method and device

Publications (2)

Publication Number Publication Date
CN107220307A CN107220307A (en) 2017-09-29
CN107220307B true CN107220307B (en) 2020-09-25

Family

ID=59944267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710326803.XA Active CN107220307B (en) 2017-05-10 2017-05-10 Webpage searching method and device

Country Status (1)

Country Link
CN (1) CN107220307B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832432A (en) * 2017-11-15 2018-03-23 北京百度网讯科技有限公司 A kind of search result ordering method, device, server and storage medium
CN111177514B (en) * 2019-12-31 2023-06-09 沈阳航空航天大学 Information source evaluation method and device based on website feature analysis, storage device and program
CN111552879B (en) * 2020-04-29 2023-10-03 百度在线网络技术(北京)有限公司 Data processing method and device
CN111813930B (en) * 2020-06-15 2024-02-20 语联网(武汉)信息技术有限公司 Similar document retrieval method and device
CN115495636A (en) * 2021-06-18 2022-12-20 华为技术有限公司 Webpage searching method, device and storage medium
CN115034388B (en) * 2022-07-07 2023-04-28 北京百度网讯科技有限公司 Determination method and device for quantization parameters of ranking model and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1996316A (en) * 2007-01-09 2007-07-11 天津大学 Search engine searching method based on web page correlation
CN101281545A (en) * 2008-05-30 2008-10-08 清华大学 Three-dimensional model search method based on multiple characteristic related feedback
CN101359331A (en) * 2008-05-04 2009-02-04 索意互动(北京)信息技术有限公司 Method and system for reordering search result
US8032535B2 (en) * 2009-04-21 2011-10-04 Yahoo! Inc. Personalized web search ranking

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
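"A retrieval result re-ranking method based on document similarity"; Zhou Bo et al.; Journal of Chinese Information Processing (《中文信息学报》); 2010-05-31; pp. 19-23, 36 *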

Also Published As

Publication number Publication date
CN107220307A (en) 2017-09-29

Similar Documents

Publication Publication Date Title
CN107220307B (en) Webpage searching method and device
CN110674429B (en) Method, apparatus, device and computer readable storage medium for information retrieval
US9183281B2 (en) Context-based document unit recommendation for sensemaking tasks
US8190601B2 (en) Identifying task groups for organizing search results
US8321424B2 (en) Bipartite graph reinforcement modeling to annotate web images
CN103955529B (en) Internet information search aggregation and presentation method
KR101377341B1 (en) Training a ranking function using propagated document relevance
US8457416B2 (en) Estimating word correlations from images
TWI557664B (en) Product information publishing method and device
US8983965B2 (en) Document rating calculation system, document rating calculation method and program
US20120179667A1 (en) Searching through content which is accessible through web-based forms
US20090276414A1 (en) Ranking model adaptation for searching
US8788494B2 (en) Method, device and system for processing, browsing and searching an electronic documents
CN106095738B (en) Recommending form fragments
WO2008106667A1 (en) Searching heterogeneous interrelated entities
CN102236640A (en) Disambiguation of named entities
JP2008535095A (en) Fact Query Engine user interface with snippets from information sources including query terms and response terms
US20100010982A1 (en) Web content characterization based on semantic folksonomies associated with user generated content
CN107870915B (en) Indication of search results
US9262510B2 (en) Document tagging and retrieval using per-subject dictionaries including subject-determining-power scores for entries
US8700624B1 (en) Collaborative search apps platform for web search
TW201415254A (en) Method and system for recommending semantic annotations
WO2018013400A1 (en) Contextual based image search results
JP5146108B2 (en) Document importance calculation system, document importance calculation method, and program
CN114141384A (en) Method, apparatus and medium for retrieving medical data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant