CN105095386A - Device and method for determining web page quality - Google Patents

Info

Publication number
CN105095386A
Authority
CN
China
Prior art keywords
url
classification
module
web page
anchor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510377033.2A
Other languages
Chinese (zh)
Inventor
王智广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Priority to CN201510377033.2A
Publication of CN105095386A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Abstract

The invention relates to the technical field of network data communication and discloses a device and a method for determining web page quality. The device comprises an acquisition module, a classification module, a search module, and a determination module. The acquisition module is adapted to acquire the URL of a web page to be determined and the anchor text corresponding to the URL; the classification module is adapted to divide the URLs into categories according to the URL patterns to which they belong; the search module is adapted to find, within each category, a plurality of URLs with similar content according to the anchor text corresponding to each URL; and the determination module is adapted to determine, for the plurality of URLs with similar content, the quality of the web page corresponding to each URL according to a preset feature of the URL. The device and method address the low efficiency of existing operations for determining web page quality and make that determination more efficient. Pages can then be crawled according to their quality, which saves crawl traffic and improves the coverage and timeliness of search engine indexing.

Description

Device and method for determining web page quality
Technical field
The present invention relates to the technical field of network data communication, and in particular to a device and a method for determining web page quality.
Background art
Every day, a search engine's spider discovers a large number of newly created URLs on the Internet. Because the number of URLs a search engine can actually crawl each day is limited, the spider must choose among the discovered URLs before crawling and select the high-quality ones. In the prior art there are two methods of judging URL quality. The first determines the quality of the web page corresponding to a URL from the content of the page. The second determines it from the number of already crawled pages whose content is similar to that page. For example, if many pages with similar content have already been crawled, the probability of crawling the page corresponding to the URL is reduced.
However, determining web page quality from page content or from the crawl volume of related pages requires a large amount of information to be obtained, and processing that information is complicated, so the operation of determining web page quality is inefficient.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a device and a method for determining web page quality that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a device for determining web page quality is provided, the device comprising:
an acquisition module, adapted to obtain the URL of a web page to be determined and the anchor text corresponding to the URL;
a classification module, adapted to divide the URLs into categories according to the URL pattern to which each URL belongs;
a search module, adapted to find, within each category, a plurality of URLs with similar content according to the anchor text corresponding to each URL;
a determination module, adapted to determine, for the plurality of URLs with similar content, the quality of the web page corresponding to each URL according to a preset feature of the URL.
Optionally, the search module further comprises:
a calculation submodule, adapted to calculate the similarity between the URLs in a category according to the anchor text corresponding to each URL;
a search submodule, adapted to find, within the category, a plurality of URLs with similar content according to the similarity between URLs.
Optionally, the determination module is further adapted to select URLs from the plurality of URLs with similar content according to the attention level of each URL, and to crawl the web pages of the selected URLs.
Optionally, the classification module is further adapted to place URLs that belong to the same website and have the same URL pattern in the same category.
Optionally, the determination module is further adapted to sort the plurality of URLs with similar content by attention level and to select the top N URLs in the ranking for crawling, where N is a preset positive integer.
Optionally, the calculation submodule is further adapted to segment the anchor text corresponding to each URL in the same category into words, determine keywords from the terms obtained by segmentation, and calculate the similarity between URLs according to the keywords in the anchor text.
Optionally, the search module further comprises:
a removal submodule, adapted to, after the anchor text corresponding to each URL in the same category has been segmented into words, use a preset word list to remove from the resulting terms those that are irrelevant to the characteristics of the anchor text.
Optionally, the calculation submodule is further adapted to determine the keywords from the terms obtained by segmentation according to the frequency of each term.
According to another aspect of the present invention, a method for determining web page quality is provided, the method comprising:
obtaining the URL of a web page to be determined and the anchor text corresponding to the URL;
dividing the URLs into categories according to the URL pattern to which each URL belongs;
finding, within each category, a plurality of URLs with similar content according to the anchor text corresponding to each URL;
determining, for the plurality of URLs with similar content, the quality of the web page corresponding to each URL according to a preset feature of the URL.
Optionally, finding a plurality of URLs with similar content according to the anchor text corresponding to each URL further comprises:
calculating the similarity between the URLs in a category according to the anchor text corresponding to each URL;
finding, within the category, a plurality of URLs with similar content according to the similarity between URLs.
Optionally, determining the quality of the web page corresponding to a URL according to a preset feature of the URL further comprises:
selecting URLs from the plurality of URLs with similar content according to the attention level of each URL, and crawling the web pages of the selected URLs.
Optionally, dividing the URLs into categories according to the URL pattern to which each URL belongs further comprises:
placing URLs that belong to the same website and have the same URL pattern in the same category.
Optionally, selecting URLs from the plurality of URLs with similar content according to the attention level of each URL and crawling the web pages of the selected URLs further comprises:
sorting the plurality of URLs with similar content by attention level;
selecting the top N URLs in the ranking for crawling, where N is a preset positive integer.
Optionally, calculating the similarity between the URLs in a category according to the anchor text corresponding to each URL further comprises:
segmenting the anchor text corresponding to each URL in the same category into words, and determining keywords from the terms obtained by segmentation;
calculating the similarity between URLs according to the keywords in the anchor text.
Optionally, the method further comprises:
after the anchor text corresponding to each URL in the same category has been segmented into words, using a preset word list to remove from the resulting terms those that are irrelevant to the characteristics of the anchor text.
Optionally, determining keywords from the terms obtained by segmentation further comprises:
determining the keywords from the terms obtained by segmentation according to the frequency of each term.
According to the technical solution of the present invention, the URL of a web page to be determined and the anchor text corresponding to the URL are obtained; the URLs are divided into categories according to the URL pattern to which each belongs; within each category, a plurality of URLs with similar content are found according to the anchor text corresponding to each URL; and, for the plurality of URLs with similar content, the quality of the web page corresponding to each URL is determined according to a preset feature of the URL. This solves the problem that determining web page quality from page content or from the crawl volume of related pages requires a large amount of information and complicated processing, which makes the operation inefficient. The solution makes determining web page quality more efficient; pages can then be crawled according to their quality, which saves crawl traffic and improves the coverage and timeliness of search engine indexing.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the content of the specification, and so that the above and other objects, features, and advantages of the present invention become more apparent, specific embodiments of the present invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
Fig. 1 shows a flowchart of a method for determining web page quality according to an embodiment of the present invention;
Fig. 2 shows a flowchart of the operation of finding URLs with similar content according to an embodiment of the present invention;
Fig. 3 shows a flowchart of the operation of selecting URLs for crawling according to an embodiment of the present invention;
Fig. 4 shows a flowchart of a method for determining web page quality according to an embodiment of the present invention;
Fig. 5 shows a structural diagram of a device for determining web page quality according to an embodiment of the present invention; and
Fig. 6 shows a structural diagram of a device for determining web page quality according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Fig. 1 shows a flowchart of a method for determining web page quality according to an embodiment of the present invention. The method is suitable for various servers related to search engines. As shown in Fig. 1, the method comprises the following steps.
In step S110, the URL of a web page to be determined and the anchor text corresponding to the URL are obtained.
Here, the anchor text (anchor) is the text of a link source; it is a passage of text corresponding to the web page and serves to describe the web page.
For example, when a spider discovers a new URL on the network, that URL is the URL of a web page to be determined. Before the spider crawls it, the URL and its corresponding anchor text are obtained. The quality of the web page is then determined by analyzing the URL and its anchor text.
In step S120, the URLs are divided into categories according to the URL pattern to which each URL belongs.
In one embodiment, dividing the URLs into categories according to the URL pattern to which each URL belongs further comprises: placing URLs that belong to the same website and have the same URL pattern in the same category.
For example, on a question-and-answer website (such as zhidao.baidu.com), users may post a large number of questions every day; some of these are answered by other users and some are not. Of these questions, a search engine will index only some of the pages: those that have been answered and whose answers to the same question are relatively satisfactory.
The following pages on the zhidao.baidu.com website have the same URL pattern and are therefore placed in the same category.
http://zhidao.baidu.com/question/433737807751460604.html
http://zhidao.baidu.com/question/1605209362191413347.html
http://zhidao.baidu.com/question/618238863630856372.html
http://zhidao.baidu.com/question/625161396233610844.html
http://zhidao.baidu.com/question/1367620128259860259.html
http://zhidao.baidu.com/question/2139209187911446788.html
http://zhidao.baidu.com/question/584108667629594845.html
If the first-level addresses of the above URLs are identical, the URLs are judged to belong to the same website. Reading from front to back, if the number of levels at which the URLs share identical addresses exceeds a preset threshold, or if the URLs have the same number of levels and differ only in the last level, the URLs are judged to have the same URL pattern. URLs that belong to the same website and have the same URL pattern are placed in the same category.
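The classification rule above can be illustrated with a minimal Python sketch. It covers only one branch of the rule, namely URLs with the same host and the same number of path levels that differ only in the last level; the threshold-based branch is not implemented. The function names `url_pattern` and `classify` and the wildcard notation are assumptions made for illustration and do not come from the patent.

```python
from collections import defaultdict
from urllib.parse import urlsplit
import posixpath


def url_pattern(url: str) -> str:
    """Build a pattern key: host plus all path levels, with the last level wildcarded.

    Assumption: only the "same level count, last level differs" branch is modelled.
    """
    parts = urlsplit(url)
    levels = [seg for seg in parts.path.split("/") if seg]
    if not levels:
        return f"{parts.scheme}://{parts.netloc}/"
    # Keep the file extension of the last level so *.html and *.php stay apart.
    _, ext = posixpath.splitext(levels[-1])
    prefix = "/".join(levels[:-1])
    prefix = f"/{prefix}" if prefix else ""
    return f"{parts.scheme}://{parts.netloc}{prefix}/*{ext}"


def classify(urls):
    """Group URLs into categories keyed by their pattern."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_pattern(url)].append(url)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "http://zhidao.baidu.com/question/433737807751460604.html",
        "http://zhidao.baidu.com/question/1605209362191413347.html",
    ]
    # Both URLs fall under http://zhidao.baidu.com/question/*.html
    for pattern, members in classify(sample).items():
        print(pattern, len(members))
```

Keying on the host first mirrors the "same website" test, and wildcarding the last level mirrors the "only the last level differs" test.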
In step S130, within each category, a plurality of URLs with similar content are found according to the anchor text corresponding to each URL.
In one embodiment, as shown in Fig. 2, finding a plurality of URLs with similar content according to the anchor text corresponding to each URL comprises the following steps S210 and S220.
In step S210, the similarity between the URLs in a category is calculated according to the anchor text corresponding to each URL.
Further, calculating the similarity between the URLs in a category according to the anchor text corresponding to each URL may comprise: segmenting the anchor text corresponding to each URL in the same category into words, and determining keywords from the terms obtained by segmentation; and calculating the similarity between URLs according to the keywords in the anchor text.
Here, determining keywords from the terms obtained by segmentation may comprise: determining the keywords from the terms obtained by segmentation according to the frequency of each term.
In addition, to avoid interference from irrelevant terms, the terms obtained by segmentation may be filtered. The method may further comprise: after the anchor text corresponding to each URL in the same category has been segmented into words, using a preset word list to remove from the resulting terms those that are irrelevant to the characteristics of the anchor text.
In step S220, a plurality of URLs with similar content are found within the category according to the similarity between URLs.
For example, within the same category, the URLs whose similarity to a given URL is greater than a preset similarity threshold are found, and the URLs found together with the given URL form the plurality of URLs with similar content.
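The threshold lookup in step S220 might be sketched as follows. The similarity function is passed in as a parameter, and the Jaccard measure used in the demonstration is only a stand-in chosen to keep the example short; the helper names, the sample anchor terms, and the threshold value 0.5 are all hypothetical.

```python
def find_similar(target, candidates, similarity, threshold=0.5):
    """Return the target URL plus every candidate whose similarity exceeds the threshold."""
    group = [target]
    for url in candidates:
        if url != target and similarity(target, url) > threshold:
            group.append(url)
    return group


# Hypothetical anchor-text keyword sets for three URLs in one category.
ANCHORS = {
    "u1": {"reset", "router", "password"},
    "u2": {"reset", "router", "password", "forgot"},
    "u3": {"cook", "rice"},
}


def jaccard(a, b):
    """Stand-in similarity: shared keywords divided by all keywords."""
    sa, sb = ANCHORS[a], ANCHORS[b]
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    print(find_similar("u1", list(ANCHORS), jaccard, threshold=0.5))
```

Any similarity measure with the same call shape could be substituted for `jaccard` without changing `find_similar`.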
For example, the newly discovered URLs and their corresponding anchor text are obtained as shown in the table below. These URLs belong to the same website and have the same URL pattern (the pattern is http://zhidao.baidu.com/question/*.html), so they are placed in the same category.
The similarity of the anchor text corresponding to the URLs is calculated, and the anchor text similarity is used as the similarity of the URLs. Anchor text similarity can be calculated in various ways, for example in the manner described in the document http://www.cnblogs.com/liangxiaxu/archive/2012/05/05/2484972.html, as follows.
Step 1: given two text strings A and B, that is, two anchor texts A and B, word segmentation is performed on each. Words, symbols, punctuation, garbled characters, and the like that occur frequently but contribute little to identifying the text content are then removed from A and B according to the words in the preset word list. Words such as "this", "and", "will", and "for" appear in almost any Chinese text yet contribute almost nothing to the meaning the text expresses; such words are configured in the preset word list, and any of them contained in A and B are removed.
Step 2: after the segmented words have been filtered, a number of keywords are determined according to the frequency of the remaining words. Frequency is calculated with reference to the TF (term frequency) formula. Weights are assigned because different keywords contribute to the characteristics of the text to different degrees; the weights are calculated with reference to the IDF (inverse document frequency) formula.
Step 3: each anchor text string is represented as an n-dimensional vector whose components are the keyword weights, for example A = (a1, a2, a3, ..., an) and B = (b1, b2, b3, ..., bn). The cosine of the angle between the vectors (a1, a2, a3, ..., an) and (b1, b2, b3, ..., bn) is then calculated and used as the similarity of A and B. The cosine of the angle is calculated by Formula 1:
Formula 1: cos(A, B) = (a1 × b1 + a2 × b2 + ... + an × bn) / (sqrt(a1 × a1 + a2 × a2 + ... + an × an) × sqrt(b1 × b1 + b2 × b2 + ... + bn × bn)), where the function sqrt() denotes the square root.
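The three steps above can be tried out with the following Python sketch. Several parts are assumptions rather than the patent's method: whitespace splitting stands in for a real Chinese word segmenter, `STOP_WORDS` is a hypothetical preset word list, and the smoothed IDF variant `log((1 + n) / (1 + df)) + 1` is one common choice, since the text only says that weights are computed "with reference to" the IDF formula.

```python
import math
from collections import Counter

# Hypothetical preset word list; the patent does not enumerate one in full.
STOP_WORDS = {"the", "a", "of", "to", "and", "is", "how"}


def keywords(text):
    """Step 1: segment (here: split on whitespace) and drop words in the preset list."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Step 2: weight each remaining term by term frequency times smoothed IDF."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Step 3: Formula 1, with sparse vectors stored as {term: weight} dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an anchor text with no keywords is treated as dissimilar
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    anchors = [
        "how to reset the router password",
        "reset router password",
        "how to cook rice",
    ]
    vecs = tfidf_vectors([keywords(a) for a in anchors])
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Storing the vectors as dictionaries avoids building an explicit n-dimensional array; terms missing from one anchor text simply contribute zero to the dot product.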
In step S140, for the plurality of URLs with similar content, the quality of the web page corresponding to each URL is determined according to a preset feature of the URL.
In one embodiment, determining the quality of the web page corresponding to a URL according to a preset feature of the URL may comprise: selecting URLs from the plurality of URLs with similar content according to the attention level of each URL, and crawling the web pages of the selected URLs.
Further, as shown in Fig. 3, selecting URLs from the plurality of URLs with similar content according to the attention level of each URL and crawling the web pages of the selected URLs may comprise:
Step S310: sorting the plurality of URLs with similar content by attention level.
Step S320: selecting the top N URLs in the ranking for crawling, where N is a preset positive integer.
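Steps S310 and S320 amount to a sort followed by a slice, as in the sketch below. The patent does not say how the attention level is measured, so the `attention` mapping and its scores are purely hypothetical, as is the choice to treat a URL with no score as having attention 0.

```python
def select_top_n(similar_urls, attention, n):
    """Sort URLs by attention level (highest first) and keep the first n for crawling."""
    ranked = sorted(similar_urls, key=lambda url: attention.get(url, 0), reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    # Hypothetical attention scores for a group of URLs with similar content.
    attention = {"u1": 12, "u2": 87, "u3": 40}
    print(select_top_n(["u1", "u2", "u3"], attention, n=2))
```

Because `sorted` is stable, URLs with equal attention keep their original relative order.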
This embodiment solves the problem that determining web page quality from page content or from the crawl volume of related pages requires a large amount of information and complicated processing, which makes the operation inefficient. It makes determining web page quality more efficient; pages can then be crawled according to their quality, which saves crawl traffic and improves the coverage and timeliness of search engine indexing.
Fig. 4 shows a flowchart of a method for determining web page quality according to an embodiment of the present invention. The method is suitable for various servers related to search engines. As shown in Fig. 4, the method comprises the following steps.
In step S402, the URL of a web page to be determined and the anchor text corresponding to the URL are obtained.
In step S404, URLs that belong to the same website and have the same URL pattern are placed in the same category.
In step S406, the anchor text corresponding to each URL in the same category is segmented into words.
In step S408, a preset word list is used to remove from the terms obtained by segmentation those that are irrelevant to the characteristics of the anchor text.
In step S410, keywords are determined from the remaining terms, and the similarity between URLs is calculated according to the keywords in the anchor text.
In step S412, a plurality of URLs with similar content are found within the category according to the similarity between URLs.
In step S414, the plurality of URLs with similar content are sorted by attention level.
In step S416, the top N URLs in the ranking are selected for crawling.
Here, N is a preset positive integer.
The above is merely an exemplary description of the method of the present invention, and the present invention is not limited thereto. Any modification, equivalent replacement, improvement, or the like made within the spirit or principle of the present invention shall fall within the protection scope of the present invention.
Fig. 5 shows a structural diagram of a device for determining web page quality according to an embodiment of the present invention. The device is suitable for various servers related to search engines. As shown in Fig. 5, the device comprises the following modules.
An acquisition module 510, adapted to obtain the URL of a web page to be determined and the anchor text corresponding to the URL;
a classification module 520, adapted to divide the URLs into categories according to the URL pattern to which each URL belongs;
a search module 530, adapted to find, within each category, a plurality of URLs with similar content according to the anchor text corresponding to each URL;
a determination module 540, adapted to determine, for the plurality of URLs with similar content, the quality of the web page corresponding to each URL according to a preset feature of the URL.
For example, when a spider discovers a new URL on the network, that URL is the URL of a web page to be determined. Before the spider crawls it, the acquisition module 510 obtains the URL and its corresponding anchor text. The classification module 520 then divides the URLs into categories according to the URL pattern to which each belongs; the search module 530 finds, within each category, a plurality of URLs with similar content according to the anchor text corresponding to each URL; and the determination module 540 determines, for the plurality of URLs with similar content, the quality of the web page corresponding to each URL according to a preset feature of the URL.
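One way to picture how the four modules hand data to one another is the following Python sketch. The class name `QualityDevice`, the callable signatures, and the `run` method are assumptions made for illustration; the patent describes the modules only functionally and, as it notes, is not tied to any programming language.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QualityDevice:
    """Hypothetical wiring of modules 510 to 540 as plain callables."""

    acquire: Callable[[], Dict[str, str]]                           # 510: URL -> anchor text
    classify: Callable[[Dict[str, str]], Dict[str, List[str]]]      # 520: category -> URLs
    search: Callable[[List[str], Dict[str, str]], List[List[str]]]  # 530: groups of similar URLs
    determine: Callable[[List[str]], List[str]]                     # 540: URLs chosen for crawling

    def run(self) -> List[str]:
        """Pass data through the modules in the order described above."""
        anchors = self.acquire()
        selected: List[str] = []
        for urls in self.classify(anchors).values():
            for group in self.search(urls, anchors):
                selected.extend(self.determine(group))
        return selected


if __name__ == "__main__":
    device = QualityDevice(
        acquire=lambda: {"u1": "reset router", "u2": "reset router", "u3": "cook rice"},
        classify=lambda anchors: {"only-category": list(anchors)},
        search=lambda urls, anchors: [[u for u in urls if anchors[u] == "reset router"]],
        determine=lambda group: group[:1],
    )
    print(device.run())
```

Passing the modules in as callables keeps each one replaceable, which matches the statement later in the description that modules may be combined or split.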
In one embodiment, the classification module 520 is specifically adapted to place URLs that belong to the same website and have the same URL pattern in the same category.
For example, on a question-and-answer website (such as zhidao.baidu.com), users may post a large number of questions every day; some of these are answered by other users and some are not. Of these questions, a search engine will index only some of the pages: those that have been answered and whose answers to the same question are relatively satisfactory.
The following pages on the zhidao.baidu.com website have the same URL pattern and are therefore placed in the same category.
http://zhidao.baidu.com/question/433737807751460604.html
http://zhidao.baidu.com/question/1605209362191413347.html
http://zhidao.baidu.com/question/618238863630856372.html
http://zhidao.baidu.com/question/625161396233610844.html
http://zhidao.baidu.com/question/1367620128259860259.html
http://zhidao.baidu.com/question/2139209187911446788.html
http://zhidao.baidu.com/question/584108667629594845.html
If the classification module 520 determines that the first-level addresses of the above URLs are identical, it judges that the URLs belong to the same website. Reading from front to back, if the number of levels at which the URLs share identical addresses exceeds a preset threshold, or if the URLs have the same number of levels and differ only in the last level, it judges that the URLs have the same URL pattern. The classification module 520 places URLs that belong to the same website and have the same URL pattern in the same category.
In one embodiment, as shown in Fig. 6, the search module 530 may comprise:
a calculation submodule 532, adapted to calculate the similarity between the URLs in a category according to the anchor text corresponding to each URL;
a search submodule 534, adapted to find, within the category, a plurality of URLs with similar content according to the similarity between URLs.
Further, the calculation submodule 532 is specifically adapted to segment the anchor text corresponding to each URL in the same category into words, determine keywords from the terms obtained by segmentation, and calculate the similarity between URLs according to the keywords in the anchor text.
Here, the calculation submodule 532 is further adapted to determine the keywords from the terms obtained by segmentation according to the frequency of each term.
Further, as shown in Fig. 6, the search module 530 also comprises a removal submodule 536, adapted to, after the anchor text corresponding to each URL in the same category has been segmented into words, use a preset word list to remove from the resulting terms those that are irrelevant to the characteristics of the anchor text.
For example, the newly discovered URLs and their corresponding anchor text are obtained as shown in the table below. These URLs belong to the same website and have the same URL pattern (the pattern is http://zhidao.baidu.com/question/*.html), so the classification module 520 places them in the same category.
The calculation submodule 532 calculates the similarity of the anchor text corresponding to the URLs and uses the anchor text similarity as the similarity of the URLs. Anchor text similarity can be calculated in various ways, for example in the manner described in the document http://www.cnblogs.com/liangxiaxu/archive/2012/05/05/2484972.html, as follows.
Given two text strings A and B, that is, two anchor texts A and B, the calculation submodule 532 performs word segmentation on each. The removal submodule 536 then removes from A and B, according to the words in the preset word list, the words, symbols, punctuation, garbled characters, and the like that occur frequently but contribute little to identifying the text content. Words such as "this", "and", "will", and "for" appear in almost any Chinese text yet contribute almost nothing to the meaning the text expresses; such words are configured in the preset word list, and any of them contained in A and B are removed.
After the removal submodule 536 has filtered the segmented words, the calculation submodule 532 determines a number of keywords according to the frequency of the remaining words. Frequency is calculated with reference to the TF (term frequency) formula. Weights are assigned because different keywords contribute to the characteristics of the text to different degrees; the weights are calculated with reference to the IDF (inverse document frequency) formula.
The calculation submodule 532 represents each anchor text string as an n-dimensional vector whose components are the keyword weights, for example A = (a1, a2, a3, ..., an) and B = (b1, b2, b3, ..., bn). It then calculates the cosine of the angle between the vectors (a1, a2, a3, ..., an) and (b1, b2, b3, ..., bn) and uses it as the similarity of A and B. The cosine of the angle is calculated by Formula 1:
Formula 1: cos(A, B) = (a1 × b1 + a2 × b2 + ... + an × bn) / (sqrt(a1 × a1 + a2 × a2 + ... + an × an) × sqrt(b1 × b1 + b2 × b2 + ... + bn × bn)), where the function sqrt() denotes the square root.
Within the same category, the search submodule 534 finds the URLs whose similarity to a given URL is greater than a preset similarity threshold; the URLs found together with the given URL form the plurality of URLs with similar content.
In one embodiment, the determination module 540 is specifically adapted to select URLs from the plurality of URLs with similar content according to the attention level of each URL, and to crawl the web pages of the selected URLs.
Further, the determination module 540 is specifically adapted to sort the plurality of URLs with similar content by attention level and to select the top N URLs in the ranking for crawling, where N is a preset positive integer.
For an exemplary description of the above device, refer to the detailed description of the corresponding parts of the above method, which is not repeated here.
This embodiment solves the problem that determining web page quality from page content or from the crawl volume of related pages requires a large amount of information and complicated processing, which makes the operation inefficient. It makes determining web page quality more efficient; pages can then be crawled according to their quality, which saves crawl traffic and improves the coverage and timeliness of search engine indexing.
It should be noted that:
Intrinsic not relevant to any certain computer, virtual bench or miscellaneous equipment with display at this algorithm provided.Various fexible unit also can with use based on together with this teaching.According to description above, the structure constructed required by this kind of device is apparent.In addition, the present invention is not also for any certain programmed language.It should be understood that and various programming language can be utilized to realize content of the present invention described here, and the description done language-specific is above to disclose preferred forms of the present invention.
In instructions provided herein, describe a large amount of detail.But can understand, embodiments of the invention can be put into practice when not having these details.In some instances, be not shown specifically known method, structure and technology, so that not fuzzy understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single previously disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into a single module, unit, or component, and may furthermore be divided into multiple submodules, subunits, or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for determining web page quality according to embodiments of the invention. The invention may also be implemented as an apparatus or device program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such a signal may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Claims (10)

1. A device for determining web page quality, the device comprising:
an acquisition module, adapted to obtain the URLs of the web pages to be determined and the anchor text corresponding to each URL;
a classification module, adapted to divide the URLs into categories according to the URL pattern to which each URL belongs;
a search module, adapted to find, within each category, multiple URLs having similar content according to the anchor text corresponding to the URLs;
a determination module, adapted to determine, for the multiple URLs having similar content, the quality of the web page corresponding to each URL according to a preset feature of the URL.
2. The device according to claim 1, wherein the search module further comprises:
a calculation submodule, adapted to calculate the similarity between the URLs in a category according to the anchor text corresponding to the URLs;
a search submodule, adapted to find, within the category, multiple URLs having similar content according to the similarity between the URLs.
3. The device according to claim 1 or 2, wherein the determination module is further adapted to select URLs from the multiple URLs having similar content according to the attention of each URL, and to crawl the web pages of the selected URLs.
4. The device according to any one of claims 1 to 3, wherein the classification module is further adapted to place URLs that belong to the same website and have the same URL pattern in the same category.
5. The device according to any one of claims 1 to 4, wherein the determination module is further adapted to sort the multiple URLs having similar content by URL attention and to crawl the web pages of the top N URLs in the ranking, where N is a preset positive integer.
6. The device according to any one of claims 1 to 5, wherein the calculation submodule is further adapted to segment the anchor text corresponding to each URL in the same category into words, to determine keywords from the terms obtained by the segmentation, and to calculate the similarity between URLs according to the keywords in the anchor text.
7. The device according to any one of claims 1 to 6, wherein the search module further comprises:
a removal submodule, adapted to, after the anchor text corresponding to each URL in the same category has been segmented into words, use a preset vocabulary to remove, from the terms obtained by the segmentation, the terms that are irrelevant to the features of the anchor text.
8. The device according to any one of claims 1 to 7, wherein the calculation submodule is further adapted to determine keywords from the terms obtained by the segmentation according to the frequency of each term.
9. A method for determining web page quality, the method comprising:
obtaining the URLs of the web pages to be determined and the anchor text corresponding to each URL;
dividing the URLs into categories according to the URL pattern to which each URL belongs;
finding, within each category, multiple URLs having similar content according to the anchor text corresponding to the URLs;
determining, for the multiple URLs having similar content, the quality of the web page corresponding to each URL according to a preset feature of the URL.
10. The method according to claim 9, wherein finding multiple URLs having similar content according to the anchor text corresponding to the URLs further comprises:
calculating the similarity between the URLs in a category according to the anchor text corresponding to the URLs;
finding, within the category, multiple URLs having similar content according to the similarity between the URLs.
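As an illustration of the steps recited in claims 9 and 10, together with the keyword handling of claims 6 to 8, the following Python sketch groups URLs by pattern and then finds URLs with similar anchor text inside one category. It is a sketch under stated assumptions, not the claimed implementation. The claims do not define how a URL pattern is derived, which similarity measure is used, or what the preset vocabulary contains, so every such choice here is hypothetical: digit runs in the path are replaced by a placeholder, a regular expression stands in for word segmentation (Chinese anchor text would need a real segmenter), `STOPWORDS` plays the role of the preset vocabulary, and Jaccard similarity with a threshold of 0.5 and a greedy grouping are used.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, List, Set
from urllib.parse import urlsplit

# Hypothetical preset vocabulary of terms unrelated to anchor-text features.
STOPWORDS = {"the", "a", "of", "click", "here"}


def url_pattern(url: str) -> str:
    """Assumed pattern: host plus path with each digit run replaced."""
    parts = urlsplit(url)
    return parts.netloc.lower() + re.sub(r"\d+", "{num}", parts.path)


def classify(urls: List[str]) -> Dict[str, List[str]]:
    """Place URLs of the same site and same pattern in one category."""
    classes: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        classes[url_pattern(url)].append(url)
    return dict(classes)


def keywords(anchor_text: str, top_k: int = 5) -> Set[str]:
    """Tokenize, drop preset-vocabulary terms, keep the most frequent terms."""
    terms = [t for t in re.findall(r"\w+", anchor_text.lower())
             if t not in STOPWORDS]
    return {term for term, _ in Counter(terms).most_common(top_k)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Keyword-set overlap; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_similar(urls: List[str], anchors: Dict[str, str],
                 threshold: float = 0.5) -> List[List[str]]:
    """Greedily group URLs of one category whose keywords are similar."""
    kw = {u: keywords(anchors.get(u, "")) for u in urls}
    groups: List[List[str]] = []
    for u in urls:
        for group in groups:
            # Compare against the first URL of the group as representative.
            if jaccard(kw[u], kw[group[0]]) >= threshold:
                group.append(u)
                break
        else:
            groups.append([u])
    # Only groups with more than one URL count as "similar content".
    return [g for g in groups if len(g) > 1]
```

A caller would run `classify` once over all URLs and then call `find_similar` on each category separately, so similarity is only computed between URLs that share a pattern. Each resulting group could then be ranked by a preset feature of the URL to decide which pages to crawl.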
CN201510377033.2A 2015-06-30 2015-06-30 Device and method for determining web page quality Pending CN105095386A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510377033.2A CN105095386A (en) 2015-06-30 2015-06-30 Device and method for determining web page quality

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510377033.2A CN105095386A (en) 2015-06-30 2015-06-30 Device and method for determining web page quality

Publications (1)

Publication Number Publication Date
CN105095386A true CN105095386A (en) 2015-11-25

Family

ID=54575823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510377033.2A Pending CN105095386A (en) 2015-06-30 2015-06-30 Device and method for determining web page quality

Country Status (1)

Country Link
CN (1) CN105095386A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106886554A (en) * 2016-12-27 2017-06-23 苏州思杰马克丁软件有限公司 A kind of determination method and device of article quality
CN106886532A (en) * 2015-12-16 2017-06-23 北京奇虎科技有限公司 Mode and device based on Authoritative Web pages checking POI data accuracy
CN109522345A (en) * 2018-11-19 2019-03-26 百度在线网络技术(北京)有限公司 For showing the method and device of information

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402619A (en) * 2011-12-23 2012-04-04 广东威创视讯科技股份有限公司 Search method and device
CN102654861A (en) * 2011-03-01 2012-09-05 腾讯科技(深圳)有限公司 Method and system for calculating webpage extraction accuracy
CN103631906A (en) * 2013-11-25 2014-03-12 北京奇虎科技有限公司 Method and device for recognizing page number identification in webpage URL
CN104063506A (en) * 2014-07-08 2014-09-24 百度在线网络技术(北京)有限公司 Method and device for identifying repeated web pages

Similar Documents

Publication Publication Date Title
CN103631887B (en) Browser side carries out the method and browser of web search
US10216848B2 (en) Method and system for recommending cloud websites based on terminal access statistics
KR101315554B1 (en) Keyword assignment to a web page
US20040260695A1 (en) Systems and methods to tune a general-purpose search engine for a search entry point
CN103617241B (en) Search information processing method, browser terminal and server
CN104965905A (en) Web page classifying method and apparatus
CN105574062A (en) File retrieval method and apparatus and terminal
CN104216881A (en) Method and device for recommending individual labels
CN103365904B (en) A kind of advertising message searching method and system
CN105005564A (en) Data processing method and apparatus based on question-and-answer platform
CN104715064A (en) Method and server for marking keywords on webpage
CN105488221A (en) Method and system for recommending query terms for conducting searching in search interface
CN105095391A (en) Device and method for identifying organization name by word segmentation program
CN104077415A (en) Searching method and device
CN103514282A (en) Method and device for displaying search results of videos
CN104699837B (en) Method, device and server for selecting illustrated pictures of web pages
CN104317931A (en) Webpage title determining method and device
CN103744887A (en) Method and device for people search and computer equipment
CN106603490A (en) Phishing website detecting method and system
CN103942264A (en) Method and device for pushing webpages containing news information
CN105095386A (en) Device and method for determining web page quality
CN102902794A (en) Web page classification system and method
CN105159898A (en) Searching method and searching device
CN105159921A (en) Method and apparatus for de-duplicating point-of-interest (POI) data in map
CN104778233A (en) Searching method and device based on click rate

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20151125