CN107220307A - Web search method and device - Google Patents
- Publication number
- CN107220307A (application number CN201710326803.X)
- Authority
- CN
- China
- Prior art keywords
- webpage
- correlation
- search
- document
- degree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention proposes a web search method and device. The web search method includes: searching according to a keyword to obtain matching webpages and their ranking; having the user select first webpages among the matching webpages and label them; estimating the degree of correlation between each second webpage and the search intention according to the degree of correlation between that second webpage among the matching webpages and the first webpages labeled as relevant and/or irrelevant; and re-ranking the second webpages accordingly. On the one hand, the method takes the user's search intention fully into account and helps the user locate useful information quickly. On the other hand, the user only needs to label the relevance of a small number of webpages in the search results according to the search intention in order to raise the ranking of the second webpages related to that intention, which reduces the time the user needs to obtain the target information and improves the user experience.
Description
Technical field
The present invention relates to the technical field of information retrieval, and in particular to a web search method and device.
Background
With the popularization and development of the Internet, more and more people use the network to retrieve information. The amount of information on the network is now enormous. People retrieve information on the Internet using search engines, which organize and rank the search results and then return them to the user.
At present, when a user retrieves information with a search engine, the interaction between the search engine and the user is limited to the search keyword entered by the user, or the engine partly uses information such as the user's browsing log to optimize the search results.
In the prior art, the webpages the user needs are often ranked low in the search results presented by the search engine, so the user cannot locate the required webpages efficiently. This increases the time the user needs to obtain the target information and degrades the user experience.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, a first object of the present invention is to propose a web search method that re-ranks the search results by calculating the correlation between the second webpages and the user's search intention, so that the information the user needs can be located quickly. This solves the technical problem in the prior art that, because the user's search intention is not fully taken into account, the user needs a long time to obtain the required information and the user experience is poor.
A second object of the present invention is to propose a web search device.
A third object of the present invention is to propose a computer device.
A fourth object of the present invention is to propose a non-transitory computer-readable storage medium.
To achieve the above objects, an embodiment of the first aspect of the present invention proposes a web search method, including:
obtaining a search keyword;
searching according to the keyword to obtain matching webpages and the ranking of the matching webpages;
after the user selects first webpages among the matching webpages and labels them, obtaining the labels of the first webpages, where a label indicates the relevance between a first webpage and the search intention behind the keyword search;
estimating the degree of correlation between a second webpage and the search intention according to the degree of correlation between the second webpage among the matching webpages and the first webpages labeled as relevant, and/or according to the degree of correlation between the second webpage and the first webpages labeled as irrelevant;
re-ranking the second webpages according to the degree of correlation between each second webpage and the search intention.
In the web search method of the embodiment of the present invention, a search keyword is obtained; a search is performed according to the keyword to obtain matching webpages and their ranking; the first webpages selected by the user among the matching webpages are labeled; the degree of correlation between a second webpage and the search intention is estimated according to the degree of correlation between the second webpage among the matching webpages and the first webpages labeled as relevant, and/or according to the degree of correlation between the second webpage and the first webpages labeled as irrelevant; and the second webpages are re-ranked according to that degree of correlation. Because the user labels matching webpages in the search results according to the search intention, and the second webpages are re-ranked according to the calculated degree of correlation, the method takes the user's search intention fully into account, helps the user locate useful information quickly, reduces the time the user needs to obtain useful information, and improves the user experience.
To achieve the above objects, an embodiment of the second aspect of the present invention proposes a web search device, including:
an acquisition module, configured to obtain a search keyword;
a search module, configured to search according to the keyword to obtain matching webpages and the ranking of the matching webpages;
a labeling module, configured to obtain the labels of the first webpages after the user selects first webpages among the matching webpages and labels them, where a label indicates the relevance between a first webpage and the search intention behind the keyword search;
a computing module, configured to estimate the degree of correlation between a second webpage and the search intention according to the degree of correlation between the second webpage among the matching webpages and the first webpages labeled as relevant, and/or according to the degree of correlation between the second webpage and the first webpages labeled as irrelevant;
a re-ranking module, configured to re-rank the second webpages according to the degree of correlation between each second webpage and the search intention.
In the web search device of the embodiment of the present invention, the acquisition module obtains a search keyword; the search module searches according to the keyword to obtain matching webpages and their ranking; the labeling module obtains the labels of the first webpages after the user selects first webpages among the matching webpages and labels them, where a label indicates the relevance between a first webpage and the search intention behind the keyword search; the computing module estimates the degree of correlation between a second webpage and the search intention according to the degree of correlation between the second webpage among the matching webpages and the first webpages labeled as relevant, and/or according to the degree of correlation between the second webpage and the first webpages labeled as irrelevant; and the re-ranking module re-ranks the second webpages according to the degree of correlation between each second webpage and the search intention. Because the user labels matching webpages in the search results according to the search intention, and the second webpages are re-ranked according to the calculated degree of correlation, the device takes the user's search intention fully into account, helps the user locate useful information quickly, reduces the time the user needs to obtain useful information, and improves the user experience.
To achieve the above objects, an embodiment of the third aspect of the present invention proposes a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the web search method of the first aspect is performed.
To achieve the above objects, an embodiment of the fourth aspect of the present invention proposes a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the web search method of the first aspect is performed.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a schematic flowchart of a web search method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a method for estimating the degree of correlation between a webpage and the search intention provided by an embodiment of the present invention;
Fig. 3 is a schematic flowchart of a webpage body text extraction method proposed in this embodiment;
Fig. 4 is a schematic flowchart of a method for removing nodes containing a large number of links proposed in this embodiment;
Fig. 5 is a schematic structural diagram of a web search device provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of another web search device provided by an embodiment of the present invention; and
Fig. 7 is a schematic structural diagram of the extraction unit 341 provided by an embodiment of the present invention.
Detailed description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the accompanying drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.
The web search method and device of the embodiments of the present invention are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a web search method provided by an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S1: obtain a search keyword.
Specifically, in the user input interface of the search engine, the user enters a keyword to query information according to his or her needs, and the search engine identifies the keyword entered by the user.
Step S2: search according to the keyword to obtain matching webpages and the ranking of the matching webpages.
Specifically, according to the keyword entered by the user, the search engine queries information in a database using web crawler technology, finds the webpages that match the keyword, calculates the degree of correlation between each webpage and the user's search intention to obtain the ranking of the matching webpages, records the initial rank of each matching webpage, and displays the matching webpages in the user's display interface in their initial order.
The search engine searches for target webpages using web crawler technology according to the keyword entered by the user. As one possible implementation, the search engine is called using the GET request method: the parameters sent to the server are appended to the request URL (Uniform Resource Locator) after a question mark (?), and multiple parameters are separated by the symbol &. The search engine parses the parameters submitted in the URL and returns the corresponding search results. For example, to search for the keyword "Tsinghua University" with Baidu, access: "http://www.baidu.com/s?q1=Tsinghua University". Other advanced search parameters, such as a limit on webpage update time, can also be implemented by appending the corresponding parameters to the URL. For example, to limit the time range of the webpages searched and the number of results displayed per page with Baidu, access: http://www.baidu.com/s?q1=Tsinghua University&lm=7&rn=5, that is, search for webpages about Tsinghua University from the last week and display 5 results per page.
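The GET-style URL construction described above can be sketched in Python as follows. This is a minimal illustration rather than the search engine's actual API: the helper name `build_search_url` is hypothetical, and the parameter names `q1`, `lm`, and `rn` are read from the example URLs in the text, assuming the request path is `/s` followed by a question mark.

```python
from urllib.parse import urlencode


def build_search_url(keyword, extra_params=None, base="http://www.baidu.com/s"):
    """Append search parameters to the base URL as a GET query string.

    The keyword goes into the q1 parameter (name taken from the example
    URLs in the text); extra_params holds optional advanced parameters.
    """
    params = {"q1": keyword}
    if extra_params:
        params.update(extra_params)
    # urlencode joins the parameters with "&" and percent-encodes the values
    return base + "?" + urlencode(params)


url = build_search_url("tsinghua", {"lm": 7, "rn": 5})
print(url)
```

Note that `urlencode` percent-encodes spaces and non-ASCII characters, so a keyword such as "Tsinghua University" would appear in encoded form in the resulting URL.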
Step S3: after the user selects first webpages among the matching webpages and labels them, obtain the labels of the first webpages.
Specifically, the user labels the matching webpages returned to him or her according to how relevant they are to the search need, and the labeled webpage information is returned to the search engine. A webpage labeled by the user is called a first webpage. It should be noted that the first webpages labeled by the user include both webpages relevant to the user's search need and webpages irrelevant to it. For ease of distinction, a matching webpage relevant to the user's search need is called a relevant first webpage, and a matching webpage irrelevant to the user's search need is called an irrelevant first webpage.
Further, the user labels the matching webpages in an interaction page generated and displayed by the search engine system. The interaction page shows the links of all matching pages together with check boxes for labeling. Next to each matching page there are two selection buttons, "relevant" and "irrelevant". If the page is relevant to the search need, the user selects the "relevant" button and the page is set as a relevant first webpage; if the page is irrelevant to the search need, the user selects the "irrelevant" button and the page is set as an irrelevant first webpage. It should be noted that the relevant first webpages and irrelevant first webpages are selected by the user according to the search need, and there may be one or more of each.
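One way to picture the outcome of the labeling step is a small record per matching webpage that carries its initial rank and an optional label. The sketch below is only an assumed data layout: the class name `MatchedPage`, its fields, and the label strings are hypothetical, and every unlabeled page is treated here as a second webpage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatchedPage:
    url: str
    rank: int                    # initial rank from the search engine, 1 = top
    label: Optional[str] = None  # "relevant", "irrelevant", or None if unlabeled


def split_by_label(pages):
    """Split matching pages into relevant first, irrelevant first, and second webpages."""
    relevant = [p for p in pages if p.label == "relevant"]
    irrelevant = [p for p in pages if p.label == "irrelevant"]
    second = [p for p in pages if p.label is None]  # assumption: all unlabeled pages
    return relevant, irrelevant, second


pages = [
    MatchedPage("http://example.com/a", 1, "irrelevant"),
    MatchedPage("http://example.com/b", 2),
    MatchedPage("http://example.com/c", 3, "relevant"),
    MatchedPage("http://example.com/d", 4),
]
relevant, irrelevant, second = split_by_label(pages)
```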
Step S4: estimate the degree of correlation between a second webpage and the search intention according to the degree of correlation between the second webpage among the matching webpages and the first webpages labeled as relevant, and/or according to the degree of correlation between the second webpage and the first webpages labeled as irrelevant.
Specifically, the keyword entered by the user corresponds to the user's search intention, and the body text of a webpage is what corresponds to that intention. Therefore, to calculate the degree of correlation between a webpage and the user's search intention, the body text of the webpage needs to be extracted.
First, the body text of each first webpage is extracted, and a document of the first webpage is generated from its body text, title, and summary. The documents of the first webpages include the documents of first webpages labeled as relevant and the documents of first webpages labeled as irrelevant.
Second, the title and summary of each second webpage are extracted to generate a document of the second webpage.
Finally, the degree of correlation between the second webpage and the search intention is estimated according to the degree of correlation between the documents of the first webpages labeled as relevant and the document of the second webpage, and/or the degree of correlation between the documents of the first webpages labeled as irrelevant and the document of the second webpage.
It should be noted that the second webpages are typically the matching webpages obtained by the search other than the first webpages. Those skilled in the art will appreciate that the second webpages may be all or only some of the matching webpages other than the first webpages, which is not limited in this embodiment.
Step S5: re-rank the second webpages according to the degree of correlation between each second webpage and the search intention.
Specifically, the second webpages are re-ranked on the principle that the greater the degree of correlation between a second webpage and the search intention, the higher that webpage ranks after re-ranking.
In the web search method of the embodiment of the present invention, a search keyword is obtained; a search is performed according to the keyword to obtain matching webpages and their ranking; the first webpages selected by the user among the matching webpages are labeled; the degree of correlation between a second webpage and the search intention is estimated according to the degree of correlation between the second webpage among the matching webpages and the first webpages labeled as relevant, and/or according to the degree of correlation between the second webpage and the first webpages labeled as irrelevant; and the second webpages are re-ranked according to that degree of correlation. Because the user labels matching webpages in the search results according to the search intention, and the second webpages are re-ranked according to the calculated degree of correlation, the method takes the user's search intention fully into account, helps the user locate useful information quickly, reduces the time the user needs to obtain useful information, and improves the user experience.
To explain step S4 of the previous embodiment more clearly, as one possible implementation, Fig. 2 is a schematic flowchart of a method for estimating the degree of correlation between a webpage and the search intention provided by an embodiment of the present invention.
As shown in Fig. 2, step S4 includes the following steps:
Step S41: extract the body text of the first webpage.
Specifically, a webpage contains many pieces of text, some relevant to the user's search need and some not. For ease of distinction, text relevant to the user's need is called body text, while text that is irrelevant to the user's need and cannot reflect the user's search intention, such as navigation bars, external links, and advertisements, is called noise text. To characterize the user's search intention more accurately, the irrelevant noise text must be removed and only the body text retained.
Step S42: generate the document of the first webpage from its body text, title, and summary.
Specifically, for the results page returned by the search engine, separate lookup patterns are defined to extract the title and the summary from the results page; these are combined with the extracted body text of the first webpage to generate the document of the first webpage. The documents of the first webpages include the documents of first webpages labeled as relevant and the documents of first webpages labeled as irrelevant.
Step S43: extract the title and summary of the second webpage to generate the document of the second webpage.
Specifically, for the results page returned by the search engine, separate lookup patterns are defined to extract the title and the summary from the results page and generate the document of the second webpage.
It should be understood that for a second webpage, the body text is not extracted; only the title and summary are. This reduces the number of network connections initiated and therefore the background processing time of the system, so that the system can respond quickly to the user's search need while maintaining sufficient accuracy, improving the user experience.
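The difference between the two kinds of document can be sketched as follows. The text does not say how the parts are combined, so joining them with spaces is an assumption, and both function names are hypothetical.

```python
def build_first_document(title, summary, body_text):
    """Document of a first (labeled) webpage: title, summary, and body text."""
    # Assumption: the parts are simply concatenated; empty parts are skipped.
    return " ".join(part for part in (title, summary, body_text) if part)


def build_second_document(title, summary):
    """Document of a second webpage: title and summary only.

    No body text is used, so the page itself never has to be fetched.
    """
    return " ".join(part for part in (title, summary) if part)


first_doc = build_first_document("Campus map", "Map of the campus", "The main gate is on the east side.")
second_doc = build_second_document("Admissions", "How to apply")
```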
Step S44: estimate the degree of correlation between the second webpage and the search intention according to the degree of correlation between the documents of the first webpages labeled as relevant and the document of the second webpage, and/or the degree of correlation between the documents of the first webpages labeled as irrelevant and the document of the second webpage.
Specifically, to calculate the degree of similarity, i.e. the degree of correlation, between two documents, the most common approach is to map each document to a vector and then measure the similarity between the two documents by the similarity between their corresponding vectors.
As one possible implementation, documents are mapped to vectors using the bag-of-words model, a method commonly used in natural language processing to map document data to vectors. Assume the dictionary contains N words and every document is composed of these N words. Using the bag-of-words model, any document can be mapped to an N-dimensional vector whose k-th dimension is the weight of the k-th dictionary word in that document. The weight of a word can be the frequency with which it occurs in the document; the most commonly used weight is the word's term frequency-inverse document frequency (TF-IDF) value in the document. Term frequency (TF) is the frequency with which a word occurs in a document: the more frequently the word occurs, the more important it is, the larger its weight, and the larger its TF value. Inverse document frequency (IDF) assigns smaller weights to common words and larger weights to uncommon words, that is, the IDF of a word decreases as the word becomes more common. The TF-IDF value of a word is its TF multiplied by its IDF; the larger the value, the higher the weight of the word in the document.
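The text does not fix exact formulas for TF and IDF. One common formulation, given here only as an illustration, is shown below; the symbols $D$ (number of documents in the collection) and $\mathrm{df}(w)$ (number of documents containing word $w$) are assumptions introduced for this sketch.

```latex
\mathrm{tf}(w, d) = \frac{\text{number of occurrences of } w \text{ in } d}{\text{number of words in } d},
\qquad
\mathrm{idf}(w) = \log \frac{D}{\mathrm{df}(w)},
\qquad
\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)
```

With this choice, a word that occurs in every document has $\mathrm{idf}(w) = \log 1 = 0$ and therefore contributes nothing to the document vector.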
It should be understood that before the TF-IDF values are calculated, the document must first be segmented into words. An existing word segmentation tool converts the document into a set of words; the number of times each word occurs in the document is then counted and the TF-IDF value of each word is calculated, yielding the vector representation of the document.
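The mapping from a segmented document to a TF-IDF vector can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and its arguments are hypothetical, IDF is taken as log(num_docs / doc_freq), tokens not found in the dictionary are skipped, and the whitespace split in the usage lines merely stands in for a real word segmentation tool.

```python
import math
from collections import Counter


def tfidf_vector(doc_tokens, dictionary, doc_freq, num_docs):
    """Map a tokenized document to a vector with one TF-IDF weight per dictionary word.

    dictionary: list of words, fixing the order of the vector dimensions
    doc_freq:   dict word -> number of documents containing the word (>= 1)
    num_docs:   total number of documents in the collection
    """
    # Count only tokens that are in the dictionary
    counts = Counter(t for t in doc_tokens if t in doc_freq)
    total = sum(counts.values())
    vector = []
    for word in dictionary:
        tf = counts[word] / total if total else 0.0
        idf = math.log(num_docs / doc_freq[word])  # assumed IDF variant
        vector.append(tf * idf)
    return vector


dictionary = ["search", "web", "page"]
doc_freq = {"search": 1, "web": 2, "page": 4}
vec = tfidf_vector("web search web unknownword".split(), dictionary, doc_freq, num_docs=4)
```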
It should be noted that in practice the dictionary used by the system contains about 300,000 words, which covers most Chinese words. However, because new words keep emerging on the network, and because the word segmentation tool may make segmentation errors, a small number of words that are not in the dictionary may appear in the segmentation results. The system simply discards these words.
As another possible implementation, documents are mapped to vectors using a distributed vector representation model. The distributed representation of a word, commonly called a word vector, maps each word in the dictionary to a real-valued vector in a vector space. It is generally obtained by training a neural language model, and the dimension of the vectors can be set as needed during training. Word vectors obtained by training characterize the semantic information of words well: words with similar semantics are close to each other in the vector space. Once words have vector representations, there are several ways to extend them to a vector representation of a document. As one possible implementation, the system obtains the vector representation of a document as a weighted average of the vector representations of its words: first, word vectors are trained on a corpus collected in advance; then the document is segmented with a word segmentation tool and the TF-IDF value of each word is computed; finally, the word vectors of these words are averaged with their TF-IDF values as weights to obtain the document vector.
Further, using the bag-of-words model, the document of a first webpage labeled as relevant can be represented as the vector $V_{\mathrm{rel},1}$, the document of a first webpage labeled as irrelevant as $V_{\mathrm{irr},1}$, and the document of a second webpage as $V_{\mathrm{res},1}$. Using the distributed vector representation model, the document of a first webpage labeled as relevant can be represented as $V_{\mathrm{rel},2}$, the document of a first webpage labeled as irrelevant as $V_{\mathrm{irr},2}$, and the document of a second webpage as $V_{\mathrm{res},2}$.
Specifically, once the first webpage documents and the second webpage documents have been represented as vectors, the similarity between documents can be measured by the similarity between their vectors. As one possible implementation, the system uses the cosine distance to express the similarity between vectors, so cosine similarity is used to calculate: $S_{\mathrm{rel},1}$ between $V_{\mathrm{rel},1}$ and $V_{\mathrm{res},1}$; $S_{\mathrm{rel},2}$ between $V_{\mathrm{rel},2}$ and $V_{\mathrm{res},2}$; $S_{\mathrm{irr},1}$ between $V_{\mathrm{irr},1}$ and $V_{\mathrm{res},1}$; and $S_{\mathrm{irr},2}$ between $V_{\mathrm{irr},2}$ and $V_{\mathrm{res},2}$.
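Cosine similarity between two document vectors can be computed as in the sketch below. The function name is hypothetical, and returning 0.0 for an all-zero vector is a convention chosen here, not something the text specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The same function applies to both kinds of document vector, since it only depends on the direction of the vectors and not on how they were produced.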
Further, from the cosine similarities obtained between the first webpage documents and the second webpage document, the degree of correlation Score between the second webpage and the search intention is calculated using a formula [formula not reproduced in this text].
Here, α, β, and γ are preset weights, and n is the rank of the second webpage among the matching webpages.
Finally, the second webpages are re-ranked according to the calculated degree of correlation Score between each second webpage and the search intention, and the re-ranked webpages are presented to the user. The larger the Score value, the higher the corresponding second webpage ranks after re-ranking.
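Given a Score value for each second webpage, the re-ranking itself is a sort in descending order of Score. In the sketch below the scores are supplied directly as example inputs rather than computed, and breaking ties by the initial rank is an assumption, not something the text states.

```python
def rerank(pages_with_scores):
    """Sort second webpages so that larger Score values come first.

    pages_with_scores: list of (url, score, initial_rank) tuples.
    Ties are broken by the initial rank (assumption).
    """
    return sorted(pages_with_scores, key=lambda p: (-p[1], p[2]))


reranked = rerank([
    ("http://example.com/b", 0.20, 2),
    ("http://example.com/d", 0.75, 4),
    ("http://example.com/e", 0.20, 5),
])
```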
In the embodiment of the present invention, the body text of the first webpage is extracted; the document of the first webpage is generated from its body text, title, and summary; the title and summary of the second webpage are extracted to generate the document of the second webpage; and the degree of correlation between the second webpage and the search intention is estimated according to the degree of correlation between the documents of the first webpages labeled as relevant and the document of the second webpage, and/or the degree of correlation between the documents of the first webpages labeled as irrelevant and the document of the second webpage. Because the user labels matching webpages in the search results according to the search intention, and the second webpages are re-ranked according to the calculated degree of correlation, the method takes the user's search intention fully into account, helps the user locate useful information quickly, reduces the time the user needs to obtain useful information, and improves the user experience.
In the previous embodiment, when the text contained in a webpage is extracted, useless noise text must be removed and only the body text retained. However, because the styles of different webpages differ greatly, there is no unified body text parsing algorithm. As one possible implementation, this embodiment proposes a webpage body text extraction method.
Specifically, to explain step S41 of the above embodiment more clearly, Fig. 3 is a schematic flowchart of a webpage body text extraction method proposed in this embodiment. As shown in Fig. 3, step S41 includes the following steps:
Step S411: obtain the html file of the first webpage.
Specifically, the html file of a first webpage labeled by the user is obtained. According to the Document Object Model (DOM), every component of an html file is a node, and all information in the html file is stored in different nodes: the whole document is a document node; each html tag is an element node; each html attribute is an attribute node; and each comment is a comment node.
Step S412: remove the nodes in the html file that correspond to controls, display styles, and/or comments.
Specifically, nodes corresponding to controls, display styles, and/or comments contain no body text and are irrelevant to the user's search need, so these nodes must be removed when the body text is extracted. The nodes to be removed include at least the following:
<script> nodes, which define client-side scripts;
<noscript> nodes, which define alternative content (text) for when scripts are not executed;
<style> nodes, which define style information for the html document;
<span> nodes, which group inline elements in the document;
<meta> nodes, which provide meta-information about the page, such as descriptions and keywords for search engines and the update frequency;
nodes with the attribute "style=display:none", which hide an object in the webpage without reserving physical space for it;
comment nodes, i.e. comments inserted in the source code.
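The node removal listed above can be sketched with Python's standard `html.parser` module. This is an illustrative sketch, not the patented implementation: the class and function names are hypothetical, the set of void tags is an assumption, and the text of a removed node (including `<span>`) is dropped together with the node. Comments are ignored because `HTMLParser` does nothing with them by default.

```python
from html.parser import HTMLParser

REMOVED_TAGS = {"script", "noscript", "style", "span"}
VOID_TAGS = {"meta", "br", "hr", "img", "input", "link"}  # no closing tag, no text


class VisibleTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []     # stack of (tag, removed?) for currently open elements
        self.removed_depth = 0  # > 0 while inside a removed subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # <meta> and similar carry no text content
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        removed = tag in REMOVED_TAGS or "display:none" in style
        self.open_tags.append((tag, removed))
        if removed:
            self.removed_depth += 1

    def handle_endtag(self, tag):
        if not any(t == tag for t, _ in self.open_tags):
            return  # stray closing tag, ignore
        while self.open_tags:
            open_tag, removed = self.open_tags.pop()
            if removed:
                self.removed_depth -= 1
            if open_tag == tag:
                break

    def handle_data(self, data):
        # Keep text only when no enclosing element has been removed
        if self.removed_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    '<html><head><meta name="k" content="x"><style>p{}</style>'
    "<script>var a=1;</script></head><body><!-- note --><p>Hello</p>"
    '<div style="display: none">hidden</div><span>inline</span>'
    "<p>World</p></body></html>"
)
text = extract_visible_text(sample)
```

A stack of open elements is used so that text nested anywhere inside a removed node is skipped, not only its direct text.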
Step S413: determine the nodes that contain text content according to the number of links contained in each node of the html file.
Specifically, page parts such as navigation bars and advertisements usually contain many links, whereas body text usually contains few links or none. Whether the text of a node belongs to the body text can therefore be judged from the number of link nodes in and around that node: if the node and its surroundings contain many links, the text of the node is judged not to be body text and is deleted; otherwise the text of the node is judged to be body text and is retained.
Step S414: generate the body text of the first webpage according to the nodes that contain text content.
Specifically, after the nodes corresponding to control, display style and/or comments and the nodes containing many links have been removed from the html file, the redundant whitespace and tags in the html file are deleted, and the text content of the remaining nodes is extracted as the body text of the page.
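A minimal sketch of the clean-up in step S414, assuming the text of the remaining nodes has already been collected as a list of strings. The function name build_body_text and the choice of one line per node are assumptions made for illustration.

```python
import re


def build_body_text(fragments):
    """Join node texts into body text, dropping redundant whitespace."""
    lines = []
    for fragment in fragments:
        # Collapse runs of spaces, tabs and newlines into a single space.
        cleaned = re.sub(r"\s+", " ", fragment).strip()
        if cleaned:  # skip nodes that held only whitespace
            lines.append(cleaned)
    return "\n".join(lines)


body = build_body_text(["  Hello   world \n", "\t", "Second\tparagraph"])
print(body)
```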
To explain more clearly how the nodes containing many links are removed in step S413, as one possible implementation, Fig. 4 is a schematic flow chart of a method, proposed in this embodiment, for removing nodes that contain a large number of links. The method includes the following steps:
Step S4131: find all <a> nodes in the html file.
Here an <a> node is a hyperlink node. The following steps are performed for each <a> node.
Step S4132: find the parent node of the <a> node, denote it parent, and initialize the variable temp = null.
Specifically, each <a> node has exactly one parent node; temp is a temporary variable.
Step S4133: find all <a> nodes under parent, calculate the total character length l_a of all <a> nodes under parent, and calculate the character length l_p of parent.
Specifically, a parent node may have several <a> child nodes, which are siblings of one another. The total character length of all <a> child nodes under parent is used to judge whether parent contains many hyperlink nodes, that is, whether it is a noise node that needs to be deleted.
Step S4134: judge whether l_a / l_p is greater than a preset threshold t; if so, perform step S4135, otherwise perform step S4136.
Specifically, if l_a / l_p is greater than the preset threshold t, parent is considered to contain a large number of links and to be a noise node that needs to be deleted; parent is assigned to temp, and the pointer is moved to the parent node of parent. If l_a / l_p is less than or equal to the preset threshold t, the node is considered a text node and does not need to be deleted.
Step S4135: assign parent to temp, make parent point to the parent node of parent, and return to step S4133.
Step S4136: if temp is not null, delete temp; if temp is null, nothing is deleted. Then return to step S4132.
It should be noted that, whether or not temp is deleted, the process returns to step S4132 to handle the next <a> node, until all <a> nodes found in step S4131 have been processed.
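The loop of steps S4131 to S4136 can be sketched as follows. This is an illustration under stated assumptions rather than the patent's implementation: Node is a hypothetical minimal tree model, "all <a> nodes under parent" is read as all descendant <a> nodes, the root is never deleted, a node with zero character length is treated as not exceeding the threshold, and the threshold value 0.5 is arbitrary.

```python
class Node:
    """Minimal DOM-like node; `text` is the node's own text (hypothetical)."""

    def __init__(self, tag, text="", children=()):
        self.tag = tag
        self.text = text
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

    def text_length(self):
        # Character length of this node including all of its descendants.
        return len(self.text) + sum(c.text_length() for c in self.children)

    def descendants(self, tag):
        # All nodes with the given tag below this node, in document order.
        found = []
        for child in self.children:
            if child.tag == tag:
                found.append(child)
            found.extend(child.descendants(tag))
        return found


def is_attached(node, root):
    # A node is still in the tree if walking up its parents reaches root.
    while node is not None:
        if node is root:
            return True
        node = node.parent
    return False


def prune_link_heavy_nodes(root, t=0.5):
    for anchor in root.descendants("a"):                  # S4131
        if not is_attached(anchor, root):
            continue               # already removed with an earlier subtree
        parent, temp = anchor.parent, None                # S4132
        while parent is not None and parent is not root:
            l_a = sum(a.text_length() for a in parent.descendants("a"))
            l_p = parent.text_length()                    # S4133
            if l_p > 0 and l_a / l_p > t:                 # S4134
                temp, parent = parent, parent.parent      # S4135
            else:
                break
        if temp is not None:                              # S4136
            temp.parent.children.remove(temp)
            temp.parent = None
    return root


nav = Node("ul", children=[
    Node("li", children=[Node("a", "Home")]),
    Node("li", children=[Node("a", "News")]),
    Node("li", children=[Node("a", "About")]),
])
article = Node("p", "A long paragraph of body text that mentions ",
               children=[Node("a", "one link")])
page = Node("body", children=[nav, article])
prune_link_heavy_nodes(page)
print([child.tag for child in page.children])
```

In this example the navigation list consists entirely of link text, so the climb continues from the first <li> to the <ul>, which is then deleted as a whole; the paragraph keeps its single link because the link text is a small share of the paragraph.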
In the embodiments corresponding to Fig. 3 and Fig. 4, the body text is extracted from the obtained html files to obtain the document of the first webpage labeled as relevant and the document of the first webpage labeled as irrelevant. In addition, the title and summary of the second webpage are extracted to obtain the document of the second webpage. The degree of relevance between the second webpage and the search intent is then estimated according to the degree of relevance between the document of the first webpage labeled as relevant and the document of the second webpage, and/or according to the degree of relevance between the document of the first webpage labeled as irrelevant and the document of the second webpage, and the second webpages are reordered according to the calculated degree of relevance. This takes the user's search intent fully into account, helps the user locate useful information quickly, reduces the time the user needs to obtain useful information, and improves the user experience.
To implement the above embodiments, the present invention further proposes a web search device.
Fig. 5 is a schematic structural diagram of a web search device provided by an embodiment of the present invention.
As shown in Fig. 5, the web search device includes an acquisition module 31, a search module 32, a labeling module 33, a computing module 34 and a reordering module 35.
The acquisition module 31 is configured to obtain the search keyword.
The search module 32 is configured to search according to the keyword and obtain the matching webpages and the ranking of the matching webpages.
The labeling module 33 is configured to obtain the label of the first webpage after the user labels the first webpage selected from the matching webpages, where the label indicates the relevance between the first webpage and the search intent of the search using the keyword.
The computing module 34 is configured to estimate the degree of relevance between the second webpage and the search intent according to the degree of relevance between a second webpage among the matching webpages and the first webpage labeled as relevant, and/or according to the degree of relevance between the second webpage and the first webpage labeled as irrelevant.
The reordering module 35 is configured to reorder the second webpages according to the degree of relevance between the second webpage and the search intent.
As one possible implementation, the reordering module 35 is specifically configured to reorder the second webpages on the principle that the greater the degree of relevance between a second webpage and the search intent, the higher that second webpage ranks after reordering.
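The reordering principle amounts to a descending sort by relevance score. A minimal sketch follows; the function name reorder and the rule that ties keep their original search order are assumptions, not statements of the patent.

```python
def reorder(second_pages, scores):
    """Return the pages sorted so that higher scores come first."""
    ranked = sorted(zip(second_pages, scores),
                    key=lambda pair: pair[1], reverse=True)
    # Python's sort is stable, so pages with equal scores keep
    # the relative order they had in the original ranking.
    return [page for page, _ in ranked]


print(reorder(["a", "b", "c"], [0.2, 0.9, 0.2]))
```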
In this embodiment of the present invention, the acquisition module obtains the search keyword; the search module searches according to the keyword and obtains the matching webpages and their ranking; the labeling module obtains the label of the first webpage after the user labels the first webpage selected from the matching webpages, where the label indicates the relevance between the first webpage and the search intent of the search using the keyword; the computing module estimates the degree of relevance between the second webpage and the search intent according to the degree of relevance between a second webpage among the matching webpages and the first webpage labeled as relevant, and/or according to the degree of relevance between the second webpage and the first webpage labeled as irrelevant; and the reordering module reorders the second webpages according to the degree of relevance between the second webpage and the search intent. Because the user labels webpages in the search results according to the search intent and the second webpages are reordered according to the calculated degree of relevance, the user's search intent is fully taken into account, which helps the user locate useful information quickly, reduces the time the user needs to obtain useful information, and improves the user experience.
It should be noted that the foregoing explanation of the method embodiments also applies to the device of this embodiment and is not repeated here.
Based on the above embodiment, an embodiment of the present invention further provides another possible implementation of the web search device. Fig. 6 is a schematic structural diagram of another web search device provided by an embodiment of the present invention. On the basis of the previous embodiment, the web search device further includes an interactive display module 36, configured to generate and display an interaction page, where the interaction page shows the links of the matching webpages and check boxes for obtaining the labels.
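One way such an interaction page might be generated is sketched below. The text only states that the page shows links and check boxes; the use of two check boxes per result, the field names relevant and irrelevant, and the function render_interaction_page are assumptions made for illustration.

```python
from html import escape


def render_interaction_page(results):
    """Build an html form with a link and two check boxes per result.

    `results` is a list of (title, url) pairs in search order.
    """
    rows = []
    for index, (title, url) in enumerate(results):
        rows.append(
            f'<li><a href="{escape(url, quote=True)}">{escape(title)}</a> '
            f'<label><input type="checkbox" name="relevant" '
            f'value="{index}"> relevant</label> '
            f'<label><input type="checkbox" name="irrelevant" '
            f'value="{index}"> irrelevant</label></li>'
        )
    return ('<form method="post"><ol>' + "".join(rows) +
            '</ol><button type="submit">Reorder</button></form>')


page_html = render_interaction_page(
    [("A & B", "http://example.com/?q=1&r=2")])
print(page_html)
```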
In addition, on the basis of the previous embodiment, the computing module 34 includes an extraction unit 341, a generation unit 342, an extraction-generation unit 343 and a computing unit 344.
The extraction unit 341 is configured to extract the body text of the first webpage.
The generation unit 342 is configured to generate the document of the first webpage according to the body text, title and summary of the first webpage, where the document of the first webpage includes the document of the first webpage labeled as relevant and the document of the first webpage labeled as irrelevant.
The extraction-generation unit 343 is configured to extract the title and summary of the second webpage and generate the document of the second webpage.
The computing unit 344 is configured to estimate the degree of relevance between the second webpage and the search intent according to the degree of relevance between the document of the first webpage labeled as relevant and the document of the second webpage, and/or the degree of relevance between the document of the first webpage labeled as irrelevant and the document of the second webpage.
As one possible implementation, the computing unit 344 is specifically configured to calculate, according to the formula, the degree of relevance Score between the second webpage and the search intent.
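The formula for Score does not survive in this text, so the sketch below does not attempt to reproduce it. It only illustrates one of the inputs the claims name: the cosine similarity between bag-of-words vectors of a labeled first-webpage document and a second-webpage document. Whitespace tokenization and the function names are assumptions; Chinese text would need word segmentation first.

```python
import math
from collections import Counter


def bag_of_words(document):
    """Term-frequency vector of a document (whitespace tokens, lowercased)."""
    return Counter(document.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(u[word] * v[word] for word in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_u * norm_v)


relevant_doc = bag_of_words("python web search ranking")
irrelevant_doc = bag_of_words("snake python care")
result_doc = bag_of_words("web search engine")

s_relevant = cosine_similarity(relevant_doc, result_doc)
s_irrelevant = cosine_similarity(irrelevant_doc, result_doc)
print(s_relevant > s_irrelevant)
```

The same cosine function could be applied to distributed word-vector representations once the documents have been embedded by a suitable model, which is outside the scope of this sketch.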
In this embodiment of the present invention, the user labels the matching webpages in the search results according to the search intent, the degree of relevance between the second webpage and the search intent is calculated, and the second webpages are reordered according to the calculated degree of relevance. This takes the user's search intent fully into account, helps the user locate useful information quickly, reduces the time the user needs to obtain useful information, and improves the user experience.
It should be noted that the foregoing explanation of the method embodiments also applies to the device of this embodiment of the present invention and is not repeated here.
Based on the above embodiments, as one possible implementation, Fig. 7 is a schematic structural diagram of the extraction unit 341 provided by an embodiment of the present invention. As shown in Fig. 7, the extraction unit 341 includes an obtaining subunit 3411, a removal subunit 3412, a determination subunit 3413 and a generation subunit 3414.
The obtaining subunit 3411 is configured to obtain the html file of the first webpage.
The removal subunit 3412 is configured to remove the nodes corresponding to control, display style and/or comments.
The determination subunit 3413 is configured to determine the nodes containing text content according to the number of links contained in each node of the html file.
The generation subunit 3414 is configured to generate the body text of the first webpage according to the nodes containing text content.
As one possible implementation, the determination subunit 3413 is specifically configured to determine, as the nodes containing text content, the nodes in the html file whose number of contained links is less than a preset threshold.
It should be noted that the foregoing explanation of the method embodiments also applies to the device of this embodiment and is not repeated here.
In this embodiment of the present invention, the user labels the matching webpages in the search results according to the search intent, the degree of relevance between the second webpage and the search intent is calculated, and the second webpages are reordered according to the calculated degree of relevance. This takes the user's search intent fully into account, helps the user locate useful information quickly, reduces the time the user needs to obtain useful information, and improves the user experience.
To implement the above embodiments, the present invention further proposes a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the computer program is executed by the processor, the web search method of the foregoing embodiments is performed.
To implement the above embodiments, the present invention further proposes a non-transitory computer-readable storage medium; when the program in the storage medium is executed by a processor, the web search method of the foregoing embodiments is performed.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature defined as "first" or "second" may thus explicitly or implicitly include at least one such feature. In the description of the present invention, "multiple" means at least two, for example two or three, unless specifically defined otherwise.
Any process or method description in a flow chart, or otherwise described herein, may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a custom logic function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in a flow chart or otherwise described herein, which may for example be regarded as an ordered list of executable instructions for implementing logic functions, may be embodied in any computer-readable medium for use by, or in combination with, an instruction execution system, apparatus or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any apparatus that can contain, store, communicate, propagate or transmit a program for use by, or in combination with, an instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) with one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). The computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or, if necessary, otherwise suitably processing it, and then stored in a computer memory.
It should be understood that each part of the present invention may be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques known in the art: a discrete logic circuit with logic gates for implementing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art can understand that all or part of the steps of the above method embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, each unit may exist physically on its own, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and shall not be construed as limiting the present invention; those of ordinary skill in the art may change, modify, replace and vary the above embodiments within the scope of the present invention.
Claims (18)
1. A web search method, characterized by comprising the following steps:
obtaining a search keyword;
searching according to the keyword to obtain matching webpages and a ranking of the matching webpages;
after a user labels a first webpage selected from the matching webpages, obtaining the label of the first webpage, wherein the label indicates the relevance between the first webpage and the search intent of the search using the keyword;
estimating a degree of relevance between a second webpage among the matching webpages and the search intent according to the degree of relevance between the second webpage and the first webpage labeled as relevant, and/or according to the degree of relevance between the second webpage and the first webpage labeled as irrelevant;
reordering the second webpage according to the degree of relevance between the second webpage and the search intent.
2. The web search method according to claim 1, characterized in that estimating the degree of relevance between the second webpage and the search intent according to the degree of relevance between the second webpage among the matching webpages and the first webpage labeled as relevant, and/or according to the degree of relevance between the second webpage and the first webpage labeled as irrelevant, comprises:
extracting the body text of the first webpage;
generating the document of the first webpage according to the body text, title and summary of the first webpage, wherein the document of the first webpage includes the document of the first webpage labeled as relevant and the document of the first webpage labeled as irrelevant;
extracting the title and summary of the second webpage and generating the document of the second webpage;
estimating the degree of relevance between the second webpage and the search intent according to the degree of relevance between the document of the first webpage labeled as relevant and the document of the second webpage, and/or the degree of relevance between the document of the first webpage labeled as irrelevant and the document of the second webpage.
3. The web search method according to claim 2, characterized in that estimating the degree of relevance between the second webpage and the search intent comprises:
calculating, according to the formula, the degree of relevance Score between the second webpage and the search intent;
wherein α, β and γ are preset weights, and n is the rank of the second webpage among the matching webpages;
the document of the first webpage labeled as relevant is vectorized based on the bag-of-words model as V_relevant1; the document of the first webpage labeled as irrelevant is vectorized based on the bag-of-words model as V_irrelevant1; the document of the first webpage labeled as relevant is vectorized based on a distributed word-vector representation model as V_relevant2; the document of the first webpage labeled as irrelevant is vectorized based on the distributed word-vector representation model as V_irrelevant2;
S_relevant1 is the cosine similarity between V_relevant1 and the bag-of-words vector representation V_result1 of the second webpage; S_relevant2 is the cosine similarity between V_relevant2 and the distributed word-vector representation V_result2 of the second webpage; S_irrelevant1 is the cosine similarity between V_irrelevant1 and V_result1; S_irrelevant2 is the cosine similarity between V_irrelevant2 and V_result2.
4. The web search method according to claim 2, characterized in that extracting the body text of the first webpage comprises:
obtaining the html file of the first webpage;
determining the nodes containing text content according to the number of links contained in each node of the html file;
generating the body text of the first webpage according to the nodes containing text content.
5. The web search method according to claim 4, characterized in that, after obtaining the html file of the first webpage, the method further comprises:
removing the nodes corresponding to control, display style and/or comments.
6. The web search method according to claim 4, characterized in that determining the nodes containing text content according to the number of links contained in each node of the html file comprises:
determining, as the nodes containing text content, the nodes in the html file whose number of contained links is less than a preset threshold.
7. The web search method according to any one of claims 1-6, characterized in that reordering the second webpage according to the degree of relevance between the second webpage and the search intent comprises:
reordering the second webpage on the principle that the greater the degree of relevance between the second webpage and the search intent, the higher the second webpage ranks after reordering.
8. The web search method according to any one of claims 1-6, characterized in that, before obtaining the label of the first webpage, the method further comprises:
generating and displaying an interaction page, wherein the interaction page shows the links of the matching webpages and check boxes for obtaining the label.
9. A web search device, characterized by comprising:
an acquisition module, configured to obtain a search keyword;
a search module, configured to search according to the keyword to obtain matching webpages and a ranking of the matching webpages;
a labeling module, configured to obtain the label of a first webpage after a user labels the first webpage selected from the matching webpages, wherein the label indicates the relevance between the first webpage and the search intent of the search using the keyword;
a computing module, configured to estimate a degree of relevance between a second webpage among the matching webpages and the search intent according to the degree of relevance between the second webpage and the first webpage labeled as relevant, and/or according to the degree of relevance between the second webpage and the first webpage labeled as irrelevant;
a reordering module, configured to reorder the second webpage according to the degree of relevance between the second webpage and the search intent.
10. The web search device according to claim 9, characterized in that the computing module comprises:
an extraction unit, configured to extract the body text of the first webpage;
a generation unit, configured to generate the document of the first webpage according to the body text, title and summary of the first webpage, wherein the document of the first webpage includes the document of the first webpage labeled as relevant and the document of the first webpage labeled as irrelevant;
an extraction-generation unit, configured to extract the title and summary of the second webpage and generate the document of the second webpage;
a computing unit, configured to estimate the degree of relevance between the second webpage and the search intent according to the degree of relevance between the document of the first webpage labeled as relevant and the document of the second webpage, and/or the degree of relevance between the document of the first webpage labeled as irrelevant and the document of the second webpage.
11. The web search device according to claim 10, characterized in that the computing unit is specifically configured to:
calculate, according to the formula, the degree of relevance Score between the second webpage and the search intent;
wherein α, β and γ are preset weights, and n is the rank of the second webpage among the matching webpages;
the document of the first webpage labeled as relevant is vectorized based on the bag-of-words model as V_relevant1; the document of the first webpage labeled as irrelevant is vectorized based on the bag-of-words model as V_irrelevant1; the document of the first webpage labeled as relevant is vectorized based on a distributed word-vector representation model as V_relevant2; the document of the first webpage labeled as irrelevant is vectorized based on the distributed word-vector representation model as V_irrelevant2;
S_relevant1 is the cosine similarity between V_relevant1 and the bag-of-words vector representation V_result1 of the second webpage; S_relevant2 is the cosine similarity between V_relevant2 and the distributed word-vector representation V_result2 of the second webpage; S_irrelevant1 is the cosine similarity between V_irrelevant1 and V_result1; S_irrelevant2 is the cosine similarity between V_irrelevant2 and V_result2.
12. The web search device according to claim 10, characterized in that the extraction unit comprises:
an obtaining subunit, configured to obtain the html file of the first webpage;
a determination subunit, configured to determine the nodes containing text content according to the number of links contained in each node of the html file;
a generation subunit, configured to generate the body text of the first webpage according to the nodes containing text content.
13. The web search device according to claim 12, characterized in that the extraction unit further comprises:
a removal subunit, configured to remove the nodes corresponding to control, display style and/or comments.
14. The web search device according to claim 12, characterized in that the determination subunit is specifically configured to:
determine, as the nodes containing text content, the nodes in the html file whose number of contained links is less than a preset threshold.
15. The web search device according to any one of claims 9-14, characterized in that the reordering module is specifically configured to:
reorder the second webpage on the principle that the greater the degree of relevance between the second webpage and the search intent, the higher the second webpage ranks after reordering.
16. The web search device according to any one of claims 9-14, characterized in that the web search device further comprises:
an interactive display module, configured to generate and display an interaction page, wherein the interaction page shows the links of the matching webpages and check boxes for obtaining the label.
17. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the web search method according to any one of claims 1-8.
18. A non-transitory computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the web search method according to any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710326803.XA CN107220307B (en) | 2017-05-10 | 2017-05-10 | Webpage searching method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710326803.XA CN107220307B (en) | 2017-05-10 | 2017-05-10 | Webpage searching method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107220307A true CN107220307A (en) | 2017-09-29 |
CN107220307B CN107220307B (en) | 2020-09-25 |
Family
ID=59944267
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710326803.XA Active CN107220307B (en) | 2017-05-10 | 2017-05-10 | Webpage searching method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107220307B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107832432A (en) * | 2017-11-15 | 2018-03-23 | 北京百度网讯科技有限公司 | A kind of search result ordering method, device, server and storage medium |
CN111177514A (en) * | 2019-12-31 | 2020-05-19 | 沈阳航空航天大学 | Information source evaluation method and device based on website characteristic analysis, storage equipment and program |
CN111552879A (en) * | 2020-04-29 | 2020-08-18 | 百度在线网络技术(北京)有限公司 | Data processing method and device |
WO2021253873A1 (en) * | 2020-06-15 | 2021-12-23 | 语联网(武汉)信息技术有限公司 | Method and apparatus for retrieving similar document |
CN114817639A (en) * | 2022-05-18 | 2022-07-29 | 山东大学 | Webpage graph convolution document ordering method and system based on comparison learning |
CN115034388A (en) * | 2022-07-07 | 2022-09-09 | 北京百度网讯科技有限公司 | Method and device for determining quantization parameters of sequencing model and electronic equipment |
WO2022262632A1 (en) * | 2021-06-18 | 2022-12-22 | 华为技术有限公司 | Webpage search method and apparatus, and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1996316A (en) * | 2007-01-09 | 2007-07-11 | 天津大学 | Search engine searching method based on web page correlation |
CN101281545A (en) * | 2008-05-30 | 2008-10-08 | 清华大学 | Three-dimensional model search method based on multiple characteristic related feedback |
CN101359331A (en) * | 2008-05-04 | 2009-02-04 | 索意互动(北京)信息技术有限公司 | Method and system for reordering search result |
US8032535B2 (en) * | 2009-04-21 | 2011-10-04 | Yahoo! Inc. | Personalized web search ranking |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1996316A (en) * | 2007-01-09 | 2007-07-11 | 天津大学 | Search engine searching method based on web page correlation |
CN101359331A (en) * | 2008-05-04 | 2009-02-04 | 索意互动(北京)信息技术有限公司 | Method and system for reordering search result |
CN101281545A (en) * | 2008-05-30 | 2008-10-08 | 清华大学 | Three-dimensional model search method based on multiple characteristic related feedback |
US8032535B2 (en) * | 2009-04-21 | 2011-10-04 | Yahoo! Inc. | Personalized web search ranking |
Non-Patent Citations (1)
Title |
---|
Zhou Bo et al.: "A retrieval result re-ranking method based on document similarity", Journal of Chinese Information Processing (《中文信息学报》) * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107832432A (en) * | 2017-11-15 | 2018-03-23 | 北京百度网讯科技有限公司 | Search result ordering method, device, server and storage medium |
CN111177514A (en) * | 2019-12-31 | 2020-05-19 | 沈阳航空航天大学 | Information source evaluation method and device based on website characteristic analysis, storage equipment and program |
CN111177514B (en) * | 2019-12-31 | 2023-06-09 | 沈阳航空航天大学 | Information source evaluation method and device based on website feature analysis, storage device and program |
CN111552879A (en) * | 2020-04-29 | 2020-08-18 | 百度在线网络技术(北京)有限公司 | Data processing method and device |
CN111552879B (en) * | 2020-04-29 | 2023-10-03 | 百度在线网络技术(北京)有限公司 | Data processing method and device |
WO2021253873A1 (en) * | 2020-06-15 | 2021-12-23 | 语联网(武汉)信息技术有限公司 | Method and apparatus for retrieving similar document |
WO2022262632A1 (en) * | 2021-06-18 | 2022-12-22 | 华为技术有限公司 | Webpage search method and apparatus, and storage medium |
CN114817639A (en) * | 2022-05-18 | 2022-07-29 | 山东大学 | Webpage graph convolution document ordering method and system based on comparison learning |
CN114817639B (en) * | 2022-05-18 | 2024-05-10 | 山东大学 | Webpage diagram convolution document ordering method and system based on contrast learning |
CN115034388A (en) * | 2022-07-07 | 2022-09-09 | 北京百度网讯科技有限公司 | Method and device for determining quantization parameters of sequencing model and electronic equipment |
CN115034388B (en) * | 2022-07-07 | 2023-04-28 | 北京百度网讯科技有限公司 | Determination method and device for quantization parameters of ranking model and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN107220307B (en) | 2020-09-25 |
Similar Documents
Publication | Title |
---|---|
CN107220307A (en) | Web search method and device |
CN103678576B (en) | Text retrieval system based on dynamic semantic analysis |
Weninger et al. | CETR: content extraction via tag ratios |
US7680778B2 (en) | Support for reverse and stemmed hit-highlighting |
CN101454750B (en) | Disambiguation of named entities |
US8630972B2 (en) | Providing context for web articles |
CN108763321B (en) | Related entity recommendation method based on large-scale related entity network |
US20090300046A1 (en) | Method and system for document classification based on document structure and written style |
CN107590219A (en) | Webpage personage subject correlation message extracting method |
CN108415902A (en) | Named entity linking method based on search engine |
US9031898B2 (en) | Presentation of search results based on document structure |
US8788494B2 (en) | Method, device and system for processing, browsing and searching an electronic documents |
US20090265330A1 (en) | Context-based document unit recommendation for sensemaking tasks |
EP2019361A1 (en) | A method and apparatus for extraction of textual content from hypertext web documents |
CN103955529A (en) | Internet information searching and aggregating presentation method |
CN103886020B (en) | Method for fast searching of real estate information |
US20100257177A1 (en) | Document rating calculation system, document rating calculation method and program |
JP2005122295A (en) | Relationship figure creation program, relationship figure creation method, and relationship figure generation device |
US20080168049A1 (en) | Automatic acquisition of a parallel corpus from a network |
CN108228612B (en) | Method and device for extracting network event keywords and emotional tendency |
CN106844482B (en) | Search engine-based retrieval information matching method and device |
Carey et al. | HTML web content extraction using paragraph tags |
Uzun et al. | An effective and efficient Web content extractor for optimizing the crawling process |
Versley et al. | Not just bigger: Towards better-quality Web corpora |
JP5629976B2 (en) | Patent specification evaluation / creation work support apparatus, method and program |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |