CN104462492A - Method and device for grabbing question and answer webpages - Google Patents

Method and device for grabbing question and answer webpages

Publication number
CN104462492A
Authority
CN
China
Prior art keywords
answer
webpage
question
target question
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410801976.9A
Other languages
Chinese (zh)
Other versions
CN104462492B (en)
Inventor
王智广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Priority to CN201410801976.9A
Publication of CN104462492A
Application granted
Publication of CN104462492B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558 Details of hyperlinks; Management of linked annotations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention provides a method and device for grabbing question and answer webpages. The method comprises the following steps: a target question and answer webpage of a preset content type is identified among the question and answer webpages already grabbed; user access data of the target question and answer webpage is acquired; and when the user access data meets a preset condition, the target question and answer webpage is grabbed again. With this scheme, valid answers appearing on question and answer webpages can be indexed promptly, and the efficiency of grabbing question and answer webpages is improved.

Description

Method and apparatus for crawling question-and-answer webpages
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and apparatus for crawling (grabbing) question-and-answer (Q&A) webpages.
Background
The webpages indexed by a search engine need to stay consistent with the webpages on the Internet, so that the content presented to users matches the actual content on the network. That is, when the content of a webpage on the Internet changes, the search engine should also update its indexed copy; otherwise the user experience suffers directly. Search engines therefore generally scan their indexed webpages periodically and crawl a page again when an update is found.
However, some types of webpage are updated at irregular times. Applying the existing periodic scanning to such pages wastes substantial resources (for example, it consumes a large amount of network traffic).
Q&A webpages are one such type with random update times. A Q&A webpage is a dedicated page on which a user of a website posts a question and waits for other users of that website to answer it. Existing Q&A sites such as 360 Q&A have grown quickly and attracted a large number of users. After a Q&A webpage is published, the time at which the question receives a valid answer is not fixed: some questions are answered immediately after posting, some need several days or even a month to receive an answer, and some are never answered at all.
Q&A webpages thus have random update times on the one hand and are enormous in number on the other. Using a short scan cycle consumes a large amount of resources, while using a long scan cycle means that valid answers cannot be indexed promptly, which results in a poor user experience. The prior art therefore lacks a crawling scheme that handles Q&A webpages effectively.
Summary of the invention
In view of the above problems, the present invention is proposed to provide an apparatus for crawling Q&A webpages, and a corresponding method, that overcome the above problems or at least partially solve them. A further object of the present invention is to crawl updated webpages more effectively.
Another further object of the present invention is to use crawl traffic more efficiently and avoid wasting resources.
According to one aspect of the present invention, a method for crawling Q&A webpages is provided. The method comprises: identifying, among the crawled Q&A webpages, a target Q&A webpage of a preset content type; obtaining user access data of the target Q&A webpage; and, when the user access data meets a preset condition, crawling the target Q&A webpage again.
Optionally, the preset content type comprises Q&A pages that contain no answer.
Optionally, identifying the target Q&A webpage of the preset content type among the crawled Q&A webpages comprises: performing a content scan on the Q&A webpages that were crawled within a preset time period and whose uniform resource locators (URLs) belong to Q&A websites, to determine the number of answers each Q&A webpage contains; and taking the Q&A webpages that contain no answer as target Q&A webpages.
Optionally, after identifying the target Q&A webpage of the preset content type among the crawled Q&A webpages, the method further comprises: obtaining the publication time of the target Q&A webpage and the crawl time of the target Q&A webpage; calculating the time difference between the publication time and the crawl time; and, when the time difference is determined to be greater than or equal to a preset threshold, performing the step of obtaining the user access data of the target Q&A webpage.
Optionally, obtaining the user access data of the target Q&A webpage comprises: obtaining unique visitor data of the target Q&A webpage. The preset condition comprises: the number of new unique visitors to the target Q&A webpage within the period of the time difference reaches a preset number.
Optionally, obtaining the user access data of the target Q&A webpage comprises: obtaining information on whether the target Q&A webpage has been accessed as a hyperlink from other webpages. The preset condition comprises: within the period of the time difference, the target Q&A webpage has been accessed as a hyperlink from other webpages.
Optionally, when the time difference is determined to be less than the preset threshold, the method further comprises: crawling the target Q&A webpage again directly.
Optionally, the publication time of the target Q&A webpage comprises: the creation time of the target Q&A webpage, or the time at which the search engine discovered the target Q&A webpage.
According to another aspect of the present invention, an apparatus for crawling Q&A webpages is also provided. The apparatus comprises: an identification module, configured to identify, among the crawled Q&A webpages, a target Q&A webpage of a preset content type, the preset content type comprising Q&A pages that contain no answer; an access data acquisition module, configured to obtain user access data of the target Q&A webpage; and a crawling module, configured to crawl the target Q&A webpage again when the user access data meets a preset condition.
Optionally, the identification module is further configured to: perform a content scan on the Q&A webpages that were crawled within a preset time period and whose uniform resource locators (URLs) belong to Q&A websites, to determine the number of answers each Q&A webpage contains; and take the Q&A webpages that contain no answer as target Q&A webpages.
Optionally, the apparatus further comprises: a time acquisition module, configured to obtain the publication time of the target Q&A webpage and the crawl time of the target Q&A webpage; and a calculation module, configured to calculate the time difference between the publication time and the crawl time. The access data acquisition module is further configured to perform the step of obtaining the user access data of the target Q&A webpage when the time difference is determined to be greater than or equal to a preset threshold.
Optionally, the access data acquisition module is further configured to obtain unique visitor data of the target Q&A webpage; and the crawling module is further configured to crawl the target Q&A webpage again when the number of new unique visitors to the target Q&A webpage within the period of the time difference reaches a preset number.
Optionally, the access data acquisition module is further configured to obtain information on whether the target Q&A webpage has been accessed as a hyperlink from other webpages; and the crawling module is further configured to crawl the target Q&A webpage again when, within the period of the time difference, the target Q&A webpage has been accessed as a hyperlink from other webpages.
Optionally, the crawling module is further configured to: crawl the target Q&A webpage again directly when the time difference is determined to be less than the preset threshold.
Optionally, the time acquisition module is further configured to: obtain the creation time of the target Q&A webpage, or the time at which the search engine discovered the target Q&A webpage, as the publication time of the target Q&A webpage.
With the method and apparatus of the present invention, a crawling strategy is formulated according to the access characteristics of the Q&A webpages to be crawled. The user access data of the crawled Q&A webpages serves as the basis for judgment: when the user access data meets the preset condition, it is determined that a valid answer has appeared on the target Q&A webpage, and the page is crawled again. Valid answers appearing on Q&A webpages are thus indexed promptly, which is convenient for users and improves the efficiency of crawling Q&A webpages.
Further, the method and apparatus of the present invention screen Q&A webpages before crawling and crawl again only the target Q&A pages that contain no answer, which avoids occupying a large amount of crawl bandwidth and reduces resource consumption.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
From the following detailed description of specific embodiments of the invention, taken together with the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages and features of the present invention.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 is a schematic diagram of a method for crawling Q&A webpages according to one embodiment of the present invention;
Fig. 2 is a flowchart of a method for crawling Q&A webpages according to one embodiment of the present invention;
Fig. 3 is a schematic diagram of an apparatus for crawling Q&A webpages according to one embodiment of the present invention; and
Fig. 4 is a schematic diagram of an apparatus for crawling Q&A webpages according to another embodiment of the present invention.
Detailed description of the embodiments
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other device. Various general-purpose systems may also be used with the teachings herein. The structure required to construct such systems is apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented in a variety of programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.
An embodiment of the present invention provides a method for crawling Q&A webpages. Fig. 1 is a schematic diagram of the method according to one embodiment of the present invention. In general, the method may comprise:
Step S102: identify, among the crawled Q&A webpages, a target Q&A webpage of a preset content type;
Step S104: obtain the user access data of the target Q&A webpage;
Step S106: judge whether the user access data meets a preset condition;
Step S108: if so, crawl the target Q&A webpage again.
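Steps S102 to S108 can be sketched as a single selection pass over already-crawled pages. This is only an illustrative sketch: the `QAPage` record, the function name `select_recrawl_targets` and the two callables passed in are hypothetical, and the preset content type is assumed to be "page with no answers", as the description states for step S102.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class QAPage:
    """Hypothetical record for a crawled Q&A page."""
    url: str
    answer_count: int  # number of answers seen at the last crawl


def select_recrawl_targets(
    pages: Iterable[QAPage],
    get_access_data: Callable[[QAPage], Any],
    meets_condition: Callable[[Any], bool],
) -> List[QAPage]:
    # S102: keep only pages of the preset content type (assumed: no answers).
    targets = [page for page in pages if page.answer_count == 0]
    to_recrawl = []
    for page in targets:
        access_data = get_access_data(page)   # S104: fetch user access data
        if meets_condition(access_data):      # S106: test the preset condition
            to_recrawl.append(page)           # S108: schedule a second crawl
    return to_recrawl
```

The access-data source and the condition are passed in as callables so that either of the conditions described later (unique visitors or hyperlink access) can be plugged in without changing the pass itself.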
The answers on a target Q&A webpage are written and uploaded by viewers of the page in response to the posted question, and the time at which a valid answer appears is not fixed. In the course of completing the present invention, the inventor analyzed and summarized a large amount of data and found that user access data directly reflects whether a target Q&A webpage contains a valid answer. That is, if a large number of new users visit a Q&A webpage, that target Q&A webpage generally contains an answer that users need. User access data can therefore serve as the basis for deciding whether to crawl the webpage again.
The method of this embodiment is preferably used in the crawling process for Q&A websites such as 360 Q&A. The preset content type in step S102 above is a Q&A page that contains no answer. For example, when a new Q&A page appears on the 360 Q&A website, the search engine's web spider crawls the new page promptly and records the crawl time. Content recognition can then determine whether the Q&A page contains an accepted answer, as well as the publication time of the page, and target Q&A webpages are selected accordingly.
One optional flow by which step S102 selects target Q&A webpages is: perform a content scan on the Q&A webpages that were crawled within a preset time period and whose uniform resource locators (URLs) belong to Q&A websites (for example http://wenda.so.com), to determine the number of answers each Q&A webpage contains; and take the Q&A webpages that contain no answer as target Q&A webpages.
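A minimal sketch of this content scan follows, using only the Python standard library. The host `wenda.so.com` comes from the text; the CSS class `answer-item` used to recognize an answer block is purely an assumption for illustration, since the source does not describe the page markup.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

QA_HOSTS = {"wenda.so.com"}       # Q&A site named in the description
ANSWER_CLASS = "answer-item"      # hypothetical marker for one answer block


class AnswerCounter(HTMLParser):
    """Counts start tags whose class attribute contains ANSWER_CLASS."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if ANSWER_CLASS in classes:
            self.count += 1


def count_answers(html: str) -> int:
    parser = AnswerCounter()
    parser.feed(html)
    return parser.count


def is_target_page(url: str, html: str) -> bool:
    """True if the URL belongs to a Q&A site and the page holds no answers."""
    return urlparse(url).hostname in QA_HOSTS and count_answers(html) == 0
```

A real crawler would need a site-specific extraction rule in place of the class-name check; the point here is only the two-part test (URL belongs to a Q&A site, answer count is zero).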
The above user access data may comprise unique visitor (UV) data of the target Q&A webpage, which represents the number of visits to the page from distinct addresses. If the UV data of the target Q&A webpage reaches a preset number, it can be concluded that the page contains information that visiting users need, and the page can then be crawled again. For example: the unique visitor data of the target Q&A webpage is obtained; when the number of new unique visitors to the target Q&A webpage within the period of the time difference reaches the preset number, the user access data is considered to meet the preset condition and the target Q&A webpage is crawled again.
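The unique-visitor condition can be sketched as follows. The visit log format (pairs of visitor address and timestamp), the function names and the window bounds `since` and `until` are assumptions for illustration; the source only says that new unique visitors are counted within the period of the time difference and compared against a preset number.

```python
from datetime import datetime
from typing import Iterable, Tuple


def new_unique_visitors(
    visits: Iterable[Tuple[str, datetime]],
    since: datetime,
    until: datetime,
) -> int:
    """Count distinct visitor addresses seen within [since, until]."""
    return len({address for address, ts in visits if since <= ts <= until})


def uv_condition_met(
    visits: Iterable[Tuple[str, datetime]],
    since: datetime,
    until: datetime,
    preset_number: int,
) -> bool:
    # "Reaches the preset number" is read here as greater than or equal to.
    return new_unique_visitors(visits, since, until) >= preset_number
```

Repeat visits from the same address count once, which is what distinguishes UV from a raw page-view count.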
The above user access data may also comprise information on whether the target Q&A webpage has been accessed as a hyperlink from other webpages. If the target Q&A webpage is recommended as a hyperlink on another webpage and users follow that link, it can likewise be concluded that the page contains content that visiting users need, and in that case the target Q&A webpage can be crawled again. For example: information on whether the target Q&A webpage has been accessed as a hyperlink from other webpages is obtained; if, within the period of the time difference, the target Q&A webpage has been accessed as a hyperlink from other webpages, the user access data is considered to meet the above preset condition and the target Q&A webpage is crawled again.
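One way to approximate the hyperlink condition is to look for visits that arrived with a referrer pointing at a different page. The source does not say how hyperlink access is recorded, so the `AccessRecord` structure, its `referrer` field and the numeric timestamps below are all hypothetical.

```python
from typing import Iterable, NamedTuple, Optional


class AccessRecord(NamedTuple):
    """Hypothetical access-log entry."""
    url: str                  # page that was visited
    referrer: Optional[str]   # page the visitor came from, if known
    timestamp: float          # seconds since some epoch


def accessed_via_hyperlink(
    records: Iterable[AccessRecord],
    target_url: str,
    since: float,
    until: float,
) -> bool:
    """True if the target page was reached from another page in [since, until]."""
    for rec in records:
        if rec.url != target_url:
            continue
        if not (since <= rec.timestamp <= until):
            continue
        if rec.referrer and rec.referrer != target_url:
            return True
    return False
```

A single qualifying visit is enough here, matching the wording that the page "has been accessed" as a hyperlink rather than a count threshold.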
In another embodiment of the invention, after step S102 the publication time of the target Q&A webpage and the crawl time of the target Q&A webpage may first be obtained, and the time difference between the publication time and the crawl time calculated. Step S104 is then performed only when the time difference is determined to be greater than or equal to a preset threshold.
If the time difference is determined to be less than the preset threshold, the target Q&A webpage can be crawled again directly, because valid answers to a Q&A webpage often appear shortly after the page is published. For example, a newly posted question is more likely to appear on the home page or on pushed pages and so is more easily seen by visitors; in addition, visitors are more willing to answer, motivated by incentives such as reward points. A page that has gone unanswered for some time after publication may pose a hard problem, because of its difficulty or its field, and the probability that it receives an answer decreases. The timing characteristics of answers on Q&A webpages can therefore be exploited: repeat crawls are scheduled according to the time elapsed between publication and the crawl, so that valid answers are indexed promptly. Once the preset time period has been exceeded, the page's access data is used to decide whether to crawl again.
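The time-difference gate described above can be sketched as a small routing function. The function name, the returned labels and the one-day default threshold are assumptions for illustration; the document mentions one day only as an example setting.

```python
from datetime import datetime, timedelta


def next_step(
    published_at: datetime,
    crawled_at: datetime,
    threshold: timedelta = timedelta(days=1),  # illustrative value only
) -> str:
    """Route a target page by how long after publication it was crawled."""
    time_difference = crawled_at - published_at
    if time_difference < threshold:
        # Recently published: answers tend to arrive soon, so crawl again directly.
        return "recrawl"
    # Older page: decide from user access data instead (step S104 onward).
    return "check_access_data"
```

A difference exactly equal to the threshold falls on the access-data side, following the "greater than or equal to" wording of the embodiment.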
The publication time of the above target Q&A webpage may be either the creation time of the target Q&A webpage or the time at which the search engine discovered the target Q&A webpage. Some webpages do not record their original creation time and record only their update time; in that case, this embodiment uses the time at which the search engine discovered the target Q&A webpage as its publication time.
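The fallback rule for the publication time amounts to a one-line choice. The function and parameter names below are hypothetical.

```python
from datetime import datetime
from typing import Optional


def publication_time(created_at: Optional[datetime], discovered_at: datetime) -> datetime:
    """Use the page's creation time if recorded, else the discovery time."""
    return created_at if created_at is not None else discovered_at
```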
The more recent the publication time of a webpage, the more frequently the Q&A webpage can be crawled. As time passes, the interval between crawls of the target Q&A webpage is gradually increased. Q&A webpage content can thus be crawled more effectively with less crawl traffic. In another embodiment of the invention, once the time difference between the crawl time and the publication time exceeds the preset threshold, whether another crawl is needed can additionally be decided from the access data of the target Q&A webpage, which is more accurate. For example, after the time difference is determined to be greater than or equal to the preset threshold, the user access data of the target Q&A webpage is obtained; when the user access data meets the preset condition, the target Q&A webpage is crawled again.
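The source does not specify how the crawl interval grows with page age. One possible schedule, shown here purely as an assumption, doubles the interval until it covers the page's age and caps it at a maximum; the `base` and `cap` values are illustrative.

```python
from datetime import timedelta


def recrawl_interval(
    age: timedelta,
    base: timedelta = timedelta(hours=1),   # illustrative shortest interval
    cap: timedelta = timedelta(days=7),     # illustrative longest interval
) -> timedelta:
    """Return a crawl interval that grows with the time since publication."""
    interval = base
    # Double the interval until it is at least the page's age or hits the cap.
    while interval < age and interval < cap:
        interval *= 2
    return min(interval, cap)
```

With these defaults a page published 30 minutes ago is revisited hourly, a page 5 hours old every 8 hours, and anything older than about a week once every 7 days. Any other non-decreasing function of age would fit the description equally well.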
Fig. 2 is a flowchart of a method for crawling Q&A webpages according to one embodiment of the present invention. After the search engine has crawled a batch of Q&A webpages, the following steps are performed:
Step S202: perform content recognition on the crawled Q&A webpages;
Step S204: judge whether a crawled Q&A webpage contains no valid answer; if so, perform step S206;
Step S206: obtain the publication time and the crawl time of the webpage, and calculate the time difference between the crawl time and the publication time;
Step S208: judge whether the time difference is less than a preset time (which can be set flexibly, for example to 1 day); if so, perform step S214; if not, perform step S210;
Step S210: obtain the UV data of the webpage and judge whether the UV count is greater than a preset number; if so, perform step S214; if not, perform step S212;
Step S212: obtain information on whether the webpage has been accessed as a hyperlink; if it has, perform step S214;
Step S214: crawl the webpage again.
The execution order of steps S210 and S212 can be arranged flexibly; it is only necessary that, once the time difference is judged to be not less than the preset time, the UV data and the hyperlink access information are both evaluated. In some embodiments only one of the UV data and the hyperlink access information is evaluated, and the two judgments need not both be made.
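The decision flow of Fig. 2 can be condensed into one function, sketched below. The `PageState` record, its field names and the default thresholds (1 day, 10 unique visitors) are assumptions for illustration; only the 1-day value appears in the text, and only as an example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PageState:
    """Hypothetical summary of what is known about one crawled Q&A page."""
    has_valid_answer: bool
    published_at: datetime
    crawled_at: datetime
    uv: int                        # new unique visitors since the last crawl
    accessed_via_hyperlink: bool   # reached through a link on another page


def should_recrawl(
    page: PageState,
    preset_time: timedelta = timedelta(days=1),  # example value from the text
    preset_uv: int = 10,                         # illustrative value
) -> bool:
    if page.has_valid_answer:                    # S204: answered pages are skipped
        return False
    time_difference = page.crawled_at - page.published_at   # S206
    if time_difference < preset_time:            # S208: recently published
        return True                              # S214
    if page.uv > preset_uv:                      # S210: enough new visitors
        return True                              # S214
    return page.accessed_via_hyperlink           # S212, then S214 if True
```

Swapping the order of the last two checks, or dropping one of them, gives the variants the paragraph above allows.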
Based on analysis and mining of a large amount of data, the inventor concluded that answers on Q&A pages tend to appear shortly after the question is posed. That is, if a question is going to be answered, it is mostly answered within a short time after it is posed; beyond a certain time, it is less likely to be answered. The above flow is therefore performed as follows. For a newly crawled page that has a question but no answer, if the most recent crawl time is close to the publication time (for example, less than 1 day apart), the page needs to be crawled once more. Where the gap between the most recent crawl time and the current time exceeds a certain threshold, whether to crawl again is decided by whether users have paid attention to the Q&A webpage recently: if users have visited it recently, another crawl is scheduled to update it.
In other words, each newly crawled webpage is checked to identify whether it is a Q&A page and how many answers it contains, and pages with a question but no answer (the target Q&A webpages above) are marked. The publication time of each such unanswered page is extracted (for a page with no publication time, the time at which the search engine discovered the page can be used instead), and the difference between the most recent crawl time and the publication time is calculated; if the difference is less than a certain threshold, the page needs to be crawled again and updated. For unanswered pages whose most recent crawl time is further from the publication time, the update time is set according to whether the page has recently been accessed by users. Whether users have paid attention to the page can be judged from the page's UV data, or alternatively from whether the page's URL has been recommended via hyperlinks on other webpages.
The above method of this embodiment uses crawl traffic efficiently, finds new answers on Q&A webpages promptly, and effectively keeps the webpages indexed by the search engine consistent with the content of webpages on the Internet.
The present invention also provides an apparatus for crawling Q&A webpages, which performs the method of the above embodiments and can be deployed on a server of a web search engine to crawl the above Q&A webpages. It saves the resources spent on crawling Q&A webpages while making the crawling more effective, thereby keeping the crawled Q&A webpages consistent with the webpages on the network.
Fig. 3 is a schematic diagram of an apparatus for crawling Q&A webpages according to one embodiment of the present invention. In general, the apparatus 300 may comprise: an identification module 310, an access data acquisition module 320 and a crawling module 330.
Among the above components, the identification module 310 is configured to identify, among the crawled Q&A webpages, a target Q&A webpage of a preset content type. The preset content type may be a Q&A page that contains no answer.
The access data acquisition module 320 is configured to obtain the user access data of the target Q&A webpage, and the crawling module 330 crawls the target Q&A webpage again when the user access data meets a preset condition.
The access data acquisition module 320 may use the unique visitor data of the target Q&A webpage as the above user access data. The access data acquisition module 320 thus obtains the unique visitor data of the target Q&A webpage, so that the crawling module 330 crawls the target Q&A webpage again when the number of new unique visitors to the page within the period of the time difference reaches a preset number.
The access data acquisition module 320 may also use information on whether the page has been accessed as a hyperlink from other webpages as the above user access data. The access data acquisition module 320 thus obtains information on whether the target Q&A webpage has been accessed as a hyperlink from other webpages, so that the crawling module 330 crawls the target Q&A webpage again when, within the period of the time difference, the page has been accessed as a hyperlink from other webpages.
The apparatus 300 of the above embodiment uses user access data as the basis for deciding whether to crawl a webpage again, indexes valid answers appearing on Q&A webpages promptly, and improves the efficiency of crawling Q&A webpages.
Fig. 4 is a schematic diagram of an apparatus for crawling Q&A webpages according to another embodiment of the present invention, which adds a time acquisition module 340 and a calculation module 350 to the above embodiment.
The time acquisition module 340 is configured to obtain the publication time of the target Q&A webpage and the crawl time of the target Q&A webpage. The publication time of the target Q&A webpage may be either the creation time of the target Q&A webpage or the time at which the search engine discovered the target Q&A webpage. Some webpages do not record their original creation time and record only their update time; in that case, this embodiment uses the time at which the search engine discovered the target Q&A webpage as its publication time.
The calculation module 350 is configured to calculate the time difference between the publication time and the crawl time. The crawling module 330 crawls the target Q&A webpage again when the time difference is determined to be less than the preset threshold. This exploits the fact that answers on Q&A webpages generally appear within a period shortly after the question is posted, so that Q&A webpage content is crawled more effectively with less crawl traffic.
Correspondingly, once the time difference between the crawl time and the publication time exceeds the preset threshold, the access data acquisition module 320 obtains the user access data and uses it as the basis for judgment. When the user access data meets the preset condition, it is determined that a valid answer has appeared on the target Q&A webpage, and the crawling module 330 crawls the target Q&A webpage again.
The apparatus 300 of this embodiment formulates a crawling strategy according to the characteristics of the target Q&A webpages to be crawled, and uses the user access data of the crawled Q&A webpages as the basis for deciding whether to crawl again. Valid content appearing on the webpages is indexed promptly for users, which improves the efficiency of webpage crawling. In addition, Q&A webpages are screened before crawling and only the Q&A pages that contain no answer are crawled again, which avoids occupying a large amount of network bandwidth and reduces network resource consumption.
Numerous specific details are set forth in the description provided here. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment can be combined into one module, unit, or component, and can furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described here include certain features that are included in other embodiments and not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for crawling Q&A webpages according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described here. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
Thus, those skilled in the art will recognize that although multiple exemplary embodiments of the invention have been shown and described in detail here, many other variations or modifications consistent with the principles of the invention can still be directly determined or derived from the content disclosed here without departing from the spirit and scope of the invention. The scope of the invention should therefore be understood and deemed to cover all such other variations or modifications.
An embodiment of the present invention further provides A1. A method for crawling question-and-answer (Q&A) webpages, comprising:
Identifying target Q&A webpages of a predetermined content type among the crawled Q&A webpages;
Obtaining user access data of the target Q&A webpage;
Crawling the target Q&A webpage again when the user access data meets a preset condition.
A2. The method according to A1, wherein
The predetermined content type comprises pages among the Q&A webpages that contain no answer.
A3. The method according to A1 or A2, wherein identifying the target Q&A webpages of the predetermined content type among the crawled Q&A webpages comprises:
Performing content scanning on Q&A webpages that were crawled within a preset time period and whose uniform resource locators (URLs) belong to Q&A websites, to determine the number of answers contained in each Q&A webpage;
Taking the Q&A webpages that contain no answer as the target Q&A webpages.
A4. The method according to any one of A1 to A3, further comprising, after identifying the target Q&A webpages of the predetermined content type among the crawled Q&A webpages:
Obtaining the publication time of the target Q&A webpage and the crawl time of the target Q&A webpage;
Calculating the time difference between the publication time and the crawl time;
Performing the step of obtaining the user access data of the target Q&A webpage when the time difference is determined to be greater than or equal to a preset threshold.
A5. The method according to any one of A1 to A4, wherein
Obtaining the user access data of the target Q&A webpage comprises: obtaining unique visitor data of the target Q&A webpage;
The preset condition comprises: the number of new unique visitors to the target Q&A webpage within the time difference reaches a preset number.
A6. The method according to any one of A1 to A5, wherein
Obtaining the user access data of the target Q&A webpage comprises: obtaining information on the target Q&A webpage being accessed as a hyperlink from other webpages;
The preset condition comprises: the target Q&A webpage has been accessed as a hyperlink from another webpage within the time difference.
A7. The method according to any one of A1 to A6, wherein, when the time difference is determined to be less than the preset threshold, the method further comprises:
Crawling the target Q&A webpage again directly.
A8. The method according to any one of A1 to A7, wherein
The publication time of the target Q&A webpage comprises: the creation time of the target Q&A webpage, or the time at which a search engine discovered the target Q&A webpage.
An embodiment of the present invention further provides B9. A device for crawling Q&A webpages, comprising:
An identification module, configured to identify target Q&A webpages of a predetermined content type among the crawled Q&A webpages, the predetermined content type comprising pages among the Q&A webpages that contain no answer;
An access data acquisition module, configured to obtain user access data of the target Q&A webpage;
A crawl module, configured to crawl the target Q&A webpage again when the user access data meets a preset condition.
B10. The device according to B9, wherein the identification module is further configured to:
Perform content scanning on Q&A webpages that were crawled within a preset time period and whose uniform resource locators (URLs) belong to Q&A websites, to determine the number of answers contained in each Q&A webpage;
Take the Q&A webpages that contain no answer as the target Q&A webpages.
B11. The device according to B9 or B10, further comprising:
A time acquisition module, configured to obtain the publication time of the target Q&A webpage and the crawl time of the target Q&A webpage;
A computing module, configured to calculate the time difference between the publication time and the crawl time;
The access data acquisition module being further configured to perform the step of obtaining the user access data of the target Q&A webpage when the time difference is determined to be greater than or equal to a preset threshold.
B12. The device according to B11, wherein
The access data acquisition module is further configured to obtain unique visitor data of the target Q&A webpage;
The crawl module is further configured to crawl the target Q&A webpage again when the number of new unique visitors to the target Q&A webpage within the time difference reaches a preset number.
B13. The device according to B11, wherein
The access data acquisition module is further configured to obtain information on the target Q&A webpage being accessed as a hyperlink from other webpages;
The crawl module is further configured to crawl the target Q&A webpage again when the target Q&A webpage has been accessed as a hyperlink from another webpage within the time difference.
B14. The device according to any one of B11 to B13, wherein the crawl module is further configured to:
Crawl the target Q&A webpage again directly when the time difference is determined to be less than the preset threshold.
B15. The device according to any one of B11 to B14, wherein the time acquisition module is further configured to:
Obtain the creation time of the target Q&A webpage, or the time at which a search engine discovered the target Q&A webpage, as the publication time of the target Q&A webpage.
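One possible way to wire the modules of B9 to B14 together is sketched below. Every name here is an assumption made for illustration: the `RecrawlDevice` class and its callable fields stand in for the identification, time acquisition, access data acquisition, and crawl modules, and only the unique-visitor condition of B12 is modelled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Tuple


@dataclass
class RecrawlDevice:
    """Hypothetical composition of the modules; each field stands in for one module."""
    count_answers: Callable[[str], int]                    # identification module
    get_times: Callable[[str], Tuple[datetime, datetime]]  # time acquisition module
    get_new_unique_visitors: Callable[[str], int]          # access data acquisition module
    crawl: Callable[[str], None]                           # crawl module
    threshold: timedelta
    min_new_visitors: int

    def process(self, url: str) -> bool:
        """Return True if the page at `url` was crawled again."""
        if self.count_answers(url) > 0:
            return False  # not a target page: it already contains an answer
        publish_time, crawl_time = self.get_times(url)
        time_diff = crawl_time - publish_time  # role of the computing module
        # Short-circuit evaluation: visitor data is only requested once the
        # time difference has reached the threshold (B11).
        if (time_diff < self.threshold
                or self.get_new_unique_visitors(url) >= self.min_new_visitors):
            self.crawl(url)
            return True
        return False


recrawled: List[str] = []
device = RecrawlDevice(
    count_answers=lambda url: 0,
    get_times=lambda url: (datetime(2014, 12, 1), datetime(2014, 12, 11)),
    get_new_unique_visitors=lambda url: 8,
    crawl=recrawled.append,
    threshold=timedelta(days=7),
    min_new_visitors=5,
)
device.process("https://qa.example.com/q/1")
print(recrawled)
```

Passing the modules in as callables keeps the sketch independent of any particular crawler or analytics backend, in line with the statement that the modules may be combined or divided freely.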

Claims (10)

1. A method for crawling question-and-answer (Q&A) webpages, comprising:
Identifying target Q&A webpages of a predetermined content type among the crawled Q&A webpages;
Obtaining user access data of the target Q&A webpage;
Crawling the target Q&A webpage again when the user access data meets a preset condition.
2. The method according to claim 1, wherein
The predetermined content type comprises pages among the Q&A webpages that contain no answer.
3. The method according to claim 1 or 2, wherein identifying the target Q&A webpages of the predetermined content type among the crawled Q&A webpages comprises:
Performing content scanning on Q&A webpages that were crawled within a preset time period and whose uniform resource locators (URLs) belong to Q&A websites, to determine the number of answers contained in each Q&A webpage;
Taking the Q&A webpages that contain no answer as the target Q&A webpages.
4. The method according to any one of claims 1 to 3, further comprising, after identifying the target Q&A webpages of the predetermined content type among the crawled Q&A webpages:
Obtaining the publication time of the target Q&A webpage and the crawl time of the target Q&A webpage;
Calculating the time difference between the publication time and the crawl time;
Performing the step of obtaining the user access data of the target Q&A webpage when the time difference is determined to be greater than or equal to a preset threshold.
5. The method according to any one of claims 1 to 4, wherein
Obtaining the user access data of the target Q&A webpage comprises: obtaining unique visitor data of the target Q&A webpage;
The preset condition comprises: the number of new unique visitors to the target Q&A webpage within the time difference reaches a preset number.
6. The method according to any one of claims 1 to 5, wherein
Obtaining the user access data of the target Q&A webpage comprises: obtaining information on the target Q&A webpage being accessed as a hyperlink from other webpages;
The preset condition comprises: the target Q&A webpage has been accessed as a hyperlink from another webpage within the time difference.
7. The method according to any one of claims 1 to 6, wherein, when the time difference is determined to be less than the preset threshold, the method further comprises:
Crawling the target Q&A webpage again directly.
8. The method according to any one of claims 1 to 7, wherein
The publication time of the target Q&A webpage comprises: the creation time of the target Q&A webpage, or the time at which a search engine discovered the target Q&A webpage.
9. A device for crawling Q&A webpages, comprising:
An identification module, configured to identify target Q&A webpages of a predetermined content type among the crawled Q&A webpages, the predetermined content type comprising pages among the Q&A webpages that contain no answer;
An access data acquisition module, configured to obtain user access data of the target Q&A webpage;
A crawl module, configured to crawl the target Q&A webpage again when the user access data meets a preset condition.
10. The device according to claim 9, wherein the identification module is further configured to:
Perform content scanning on Q&A webpages that were crawled within a preset time period and whose uniform resource locators (URLs) belong to Q&A websites, to determine the number of answers contained in each Q&A webpage;
Take the Q&A webpages that contain no answer as the target Q&A webpages.
CN201410801976.9A 2014-12-18 2014-12-18 The method and apparatus for capturing question and answer class webpage Active CN104462492B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410801976.9A CN104462492B (en) 2014-12-18 2014-12-18 The method and apparatus for capturing question and answer class webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410801976.9A CN104462492B (en) 2014-12-18 2014-12-18 The method and apparatus for capturing question and answer class webpage

Publications (2)

Publication Number Publication Date
CN104462492A true CN104462492A (en) 2015-03-25
CN104462492B CN104462492B (en) 2018-01-16

Family

ID=52908527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410801976.9A Active CN104462492B (en) 2014-12-18 2014-12-18 The method and apparatus for capturing question and answer class webpage

Country Status (1)

Country Link
CN (1) CN104462492B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033262A (en) * 2018-07-09 2018-12-18 北京寻领科技有限公司 Question and answer knowledge base update method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178736A (en) * 2007-12-11 2008-05-14 腾讯科技(深圳)有限公司 Web page collecting method and web page collecting server
US20130144858A1 (en) * 2011-01-21 2013-06-06 Google Inc. Scheduling resource crawls
CN102637170A (en) * 2011-02-10 2012-08-15 北京百度网讯科技有限公司 Question pushing method and system
CN103577557A (en) * 2013-10-21 2014-02-12 北京奇虎科技有限公司 Device and method for determining capturing frequency of network resource point
CN103577558A (en) * 2013-10-21 2014-02-12 北京奇虎科技有限公司 Device and method for optimizing search ranking of frequently asked question and answer pairs

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王彤: "《数字媒体内容管理技术与实践》", 31 May 2014, 中国传媒大学出版社 *

Also Published As

Publication number Publication date
CN104462492B (en) 2018-01-16

Similar Documents

Publication Publication Date Title
US10069857B2 (en) Performing rule-based actions based on accessed domain name registrations
US10216848B2 (en) Method and system for recommending cloud websites based on terminal access statistics
US7860971B2 (en) Anti-spam tool for browser
CN104536973B (en) The method and browser client of picture recognition
CN105243159A (en) Visual script editor-based distributed web crawler system
CN102932207B (en) The method of monitoring website access information and server
CN108304410A (en) A kind of detection method, device and the data analysing method of the abnormal access page
CN103617241B (en) Search information processing method, browser terminal and server
US10073886B2 (en) Search results based on a search history
CA3120833C (en) Identifying equivalent links on a page
CN106411639A (en) Method and system for monitoring access data
Steinmetz et al. Web service search on large scale
CN104391953B (en) Detect the method and device of webpage renewal
CN110069693A (en) Method and apparatus for determining target pages
CN110535974A (en) Method for pushing, driving means, equipment and the storage medium of resource to be put
CN102902784B (en) Web page classification storage system and method
Mehta et al. A comparative study of various approaches to adaptive web scraping
CN103354546A (en) Message filtering method and message filtering apparatus
KR20120071827A (en) Seed information collecting device for detecting landing, hopping and distribution sites of malicious code and seed information collecting method for the same
Chabot et al. Event reconstruction: A state of the art
CN104462492A (en) Method and device for grabbing question and answer webpages
CN105763530A (en) Web-based threat information acquisition system and method
CN104462493A (en) Method and device for grabbing question and answer webpages
Kargaran et al. On detecting hidden third-party web trackers with a wide dependency chain graph: A representation learning approach
CN110825976B (en) Website page detection method and device, electronic equipment and medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220725

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.