CN106156200B - Webpage content updating speed comparison method and device - Google Patents
Webpage content updating speed comparison method and device
- Publication number
- CN106156200B (application number CN201510194529.6A)
- Authority
- CN
- China
- Prior art keywords
- keywords
- content
- content item
- webpage
- keyword
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention relates to a method for comparing the update speed of webpage content, comprising the following steps: acquiring keywords to be compared; cyclically requesting, from each target website, the search result webpage corresponding to the keywords; extracting, for each target website separately, the content items corresponding to the keywords that have been updated in the search result webpage requested in the current cycle relative to the search result webpage requested in the previous cycle; if an updated content item is extracted, taking the processing time of the current cycle as the update time of the content item, and recording the correspondence among the target website, the keyword, the content item and the update time; comparing, between the target websites, the update times of the same updated content items corresponding to the keywords; and generating, according to the comparison result, update difference data describing how the target websites differ in updating the same content items corresponding to the keywords. The method can obtain an accurate comparison result. A device for comparing the update speed of webpage content is also provided.
Description
Technical Field
The invention relates to the technical field of networks, in particular to a method and a device for comparing webpage content updating speed.
Background
With the development of network technology and mobile terminal technology, various network service applications provide various network services for people, for example, a video website provides a video viewing service, a novel website provides a novel browsing service, a news website provides a current news browsing service, and the like.
The time at which a network service goes online is very important: it bears on user experience and therefore affects how many users a website retains. For example, users tend to prefer the video website that is first to update an episode of a television series, the novel website that is first to update a novel chapter, the news website that is first to publish a news item, and so on.
The fierce competition among websites means that each website needs to evaluate how its speed of updating network service content differs from that of its competitors, so that it can make decisions based on the difference, for example to improve the performance of its application services. Users also need to know the difference, so that they can go to the website that updates its network service content faster.
The conventional method generally determines the update time of webpage content by extracting the publication time explicitly stated in the webpage content. Because that publication time is set manually by website staff, it may be inaccurate, which in turn makes the comparison of webpage content update speeds inaccurate.
Disclosure of Invention
Based on this, there is a need to provide a method and an apparatus for comparing update speed of web page content, which can obtain accurate comparison result.
A method for comparing the update speed of web page contents comprises the following steps:
acquiring keywords to be compared;
cyclically requesting, from each target website, the search result webpage corresponding to the keywords;
extracting, for each target website separately, the content items corresponding to the keywords that have been updated in the search result webpage requested in the current cycle relative to the search result webpage requested in the previous cycle;
if an updated content item is extracted, taking the current cycle processing time as the update time of the content item, and recording the correspondence among the target website, the keyword, the content item and the update time;
comparing, between the target websites, the update times of the same updated content items corresponding to the keywords;
and generating, according to the comparison result, update difference data for the same updated content items corresponding to the keywords among the target websites.
A device for comparing the update speed of web page contents comprises:
a comparison keyword acquisition module, configured to acquire keywords to be compared;
a keyword search web page request module, configured to cyclically request, from each target website, the search result web page corresponding to the keywords;
an updated content item extraction module, configured to extract, for each target website separately, the content items corresponding to the keywords that have been updated in the search result web page requested by the keyword search web page request module in the current cycle relative to the search result web page requested by the keyword search web page request module in the previous cycle;
a recording module, configured to, if an updated content item is extracted, take the current cycle processing time as the update time of the content item and record the correspondence among the target website, the keyword, the content item and the update time;
an update time comparison module, configured to compare, between the target websites, the update times of the same updated content items corresponding to the keywords;
and an update difference data generation module, configured to generate, according to the comparison result, update difference data for the same updated content items corresponding to the keywords among the target websites.
The above method and device cyclically request the search result webpage corresponding to the keyword from each target website and extract the updated content items from it, and can therefore monitor whether a target website has updated its content items and which content items were updated. The processing time of the current cycle is taken as the update time of an updated content item; this time is in effect the time at which the update was detected. Because the search result webpages are requested cyclically, the detection time is very close to the actual release time of the updated content item, which is its actual update time. The method and device can therefore obtain the update time of a content item accurately and, by comparing webpage content update speed between the target websites on the basis of these accurate update times, obtain an accurate comparison result.
Drawings
FIG. 1 is a block diagram illustrating part of the structure of a terminal or a server capable of running the method for comparing update speed of web page content according to the present application;
FIG. 2 is a flowchart illustrating a method for comparing update speed of web page content according to an embodiment;
FIG. 3 is a flowchart illustrating the process of setting and storing the keywords to be compared in one embodiment;
FIG. 4 is a flowchart illustrating step S206 in FIG. 2 according to an embodiment;
FIG. 5 is a flowchart illustrating a method for comparing update speed of web page content according to an embodiment;
FIG. 6 is a schematic structural diagram of a device for comparing update speed of web page content in an embodiment;
FIG. 7 is a schematic structural diagram of a device for comparing update speed of web page content in an embodiment;
FIG. 8 is a block diagram illustrating an updated content item extraction module according to one embodiment;
FIG. 9 is a schematic structural diagram of a device for comparing update speed of web page content in an embodiment;
FIG. 10 is a schematic structural diagram of a device for comparing update speed of web page content in an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
FIG. 1 is a block diagram of part of the structure of a terminal or a server capable of running the method for comparing update speed of web page content according to the present application in one embodiment. As shown in FIG. 1, in one embodiment, the server includes a processor, a storage medium, a memory, and a network interface connected by a system bus. The network interface is used for communicating with a network, and the memory is used for caching data. The storage medium stores an operating system, a database, and software instructions for implementing the webpage content update speed comparison method. The database may store the keywords to be compared and other data that the method requires or generates during intermediate processing. The processor coordinates the operation of the components and executes the instructions to implement the webpage content update speed comparison method described herein.
Those skilled in the art will appreciate that the architecture shown in FIG. 1 is a block diagram of only the portion of the architecture relevant to the present application, and does not limit the terminals or servers to which the present application applies; a particular terminal or server may include more or fewer components than those shown, combine certain components, or arrange the components differently.
As shown in fig. 2, in an embodiment, a method for comparing update speed of web page content includes the following steps:
Step S202, acquiring keywords to be compared.
In one embodiment, there may be one or more keywords to be compared.
When there are a plurality of keywords to be compared, the processing described in the following steps S204 to S212 may be performed separately for each keyword. Whenever two content items are compared in the following process, they are content items corresponding to the same keyword.
In one embodiment, the keywords to be compared may be preset and stored.
In one embodiment, keywords in a specified domain or specified category whose popularity exceeds a threshold may be searched for and used as the keywords to be compared. For example, video names in the video category whose popularity exceeds a threshold may be searched for and used as keywords to be compared, where a video name may be, but is not limited to, the name of a TV show or the name of a movie.
In another embodiment, keywords ranked within a preset number of top positions in a specified online ranking list may be obtained as the keywords to be compared. For example, the music names ranked within the preset top positions of a music ranking list may be used as keywords to be compared, and so on.
In an embodiment, before step S202, the method for comparing the update speed of the web page content further includes a process of setting and storing the keywords to be compared; as shown in FIG. 3, in one embodiment, the process includes the following steps:
Step S302, acquiring preset webpage content categories.
In one embodiment, the webpage content categories include, but are not limited to, video, novel, music, news, and the like.
Step S304, requesting web page content from each target website.
In one embodiment, the URL information of the target website may be preset and stored.
In one embodiment, the web page content of the website's home page may be requested from various target websites. Specifically, a web page pull request including a URL address of a website home page may be sent to each target website, and a source code corresponding to the home page returned by each target website may be received.
Step S306, crawling the keywords corresponding to each category in the webpage content of each target website.
In one embodiment, the URL addresses of the web pages corresponding to the various categories may be extracted from the source code of the home page of the various target web sites. Furthermore, a webpage pulling request containing a webpage URL address corresponding to the category can be sent to the target website, and a webpage source code corresponding to the category returned by the target website is received.
The process is equivalent to that a user opens a home page of the target website and further clicks a link corresponding to the category on the home page, so that the browser receives a webpage source code corresponding to the category returned by the target website.
In one embodiment, the URL address of the web page corresponding to the href field in the html tag corresponding to the category may be extracted from the source code.
For example, in the following source code, the webpage URL address corresponding to the href field in the html tag for the category "entertainment" is "http://yule.iqiyi.com/", so it can be extracted as the webpage URL address corresponding to the category "entertainment":
<h3><a href="http://yule.iqiyi.com/"><span>entertainment<em></em></span></a></h3>
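The href extraction described above can be sketched with Python's standard html.parser module. This is only an illustrative sketch: the class name CategoryLinkParser, the helper category_url and the simplified sample markup are assumptions made for the example, not part of the described method, and a real target website would need its own extraction rule.

```python
from html.parser import HTMLParser


class CategoryLinkParser(HTMLParser):
    """Collect the href of each <a> tag, keyed by the text the tag encloses."""

    def __init__(self):
        super().__init__()
        self.links = {}      # visible link text -> href value
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).strip()
            if label:
                self.links[label] = self._href
            self._href = None


def category_url(source_code, category):
    """Return the URL linked from the given category label, or None."""
    parser = CategoryLinkParser()
    parser.feed(source_code)
    parser.close()
    return parser.links.get(category)


# Simplified home page fragment (hypothetical markup for illustration).
home_page = '<h3><a href="http://yule.iqiyi.com/"><span>entertainment</span></a></h3>'
print(category_url(home_page, "entertainment"))
```

Matching on the visible link text is one possible rule; a preset rule could equally match on a tag attribute or a CSS class, depending on the page.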
Further, keywords corresponding to specified fields in html tags meeting preset rules can be extracted from the webpage source codes corresponding to the categories, and therefore keywords corresponding to the categories are obtained.
Furthermore, the method can also extract the webpage links corresponding to the sub-categories contained in the categories from the webpage source codes corresponding to the categories, send a webpage pulling request containing the webpage links to the corresponding target website, and receive the corresponding webpage source codes returned by the target website; and circulating the process until the webpage source codes corresponding to all the descendant categories contained in the categories are crawled, and extracting keywords corresponding to specified fields in the html tags which accord with preset rules from the received webpage source codes in the crawling process as the keywords corresponding to the categories.
The descendant categories contained in the categories comprise: subcategories and all categories branching from subcategories.
Step S308, filtering out duplicate keywords from the extracted keywords.
Step S310, storing the keywords remaining after filtering as the keywords to be compared.
In one embodiment, the keywords to be compared may be grouped according to the category to which they correspond.
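Steps S308 and S310 can be sketched as follows. The function name dedupe_by_category and the (category, keyword) input shape are hypothetical, and the sketch assumes that a keyword counts as a duplicate only within its own category; the source does not say whether duplicates are filtered per category or globally.

```python
def dedupe_by_category(crawled):
    """Filter duplicate keywords and group the remainder by category.

    crawled: iterable of (category, keyword) pairs in crawl order.
    Returns a dict mapping each category to its keywords, first occurrence kept.
    """
    seen = set()     # (category, keyword) pairs already stored
    stored = {}      # category -> list of keywords to be compared
    for category, keyword in crawled:
        keyword = keyword.strip()
        if not keyword or (category, keyword) in seen:
            continue  # skip blanks and repeats
        seen.add((category, keyword))
        stored.setdefault(category, []).append(keyword)
    return stored


crawled = [("video", "A"), ("video", "B"), ("video", "A"), ("novel", "A"), ("novel", " ")]
print(dedupe_by_category(crawled))
```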
Step S204, circularly requesting search result webpages corresponding to the keywords from each target website.
The search result web page corresponding to the keyword is the web page obtained by searching for the keyword on the target website. For example, if a user enters the keyword in the search input box of the target website and clicks the search button, the result webpage presented to the user is the webpage obtained by searching for the keyword.
In one embodiment, in the process of requesting a search result webpage corresponding to a keyword from a certain target website, a webpage link can be generated according to a preset rule, and the webpage link represents a search result webpage corresponding to the keyword requested from the target website; and further sending a webpage pulling request containing the webpage link to the target website, and receiving a webpage source code returned by the target website, namely a source code corresponding to the search result webpage corresponding to the keyword.
For example, to request the search result webpage corresponding to the keyword "running bar brother" from the target website whose URL address is "http://www.iqiyi.com", the following webpage link may be generated: http://so.iqiyi.com/so/q_running bar brother?source=input; and to request the search result webpage corresponding to the keyword "bear is absent" from the same target website, the following webpage link may be generated: http://so.iqiyi.com/so/q_bear is absent?source=input. The two webpage links differ only in the keyword they contain.
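Generating the webpage link from a preset rule can be sketched as a per-site URL template. The template string follows the link pattern quoted in the example above; the names IQIYI_TEMPLATE and search_url are hypothetical, and percent-encoding the keyword is an assumption of this sketch rather than something the source specifies.

```python
from urllib.parse import quote

# Hypothetical preset rule for one target website, following the example link.
IQIYI_TEMPLATE = "http://so.iqiyi.com/so/q_{keyword}?source=input"


def search_url(template, keyword):
    """Fill the site's link template with the percent-encoded keyword."""
    return template.format(keyword=quote(keyword, safe=""))


print(search_url(IQIYI_TEMPLATE, "running bar brother"))
print(search_url(IQIYI_TEMPLATE, "bear is absent"))
```

Keeping one template per target website means that adding a website only requires adding a rule, not changing the request logic.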
After the search result webpage corresponding to the keyword has been requested from each target website, the processing of steps S206 and S208 is performed on those search result webpages. The search result webpage corresponding to the keyword is then requested from each target website again, and steps S206 and S208 are performed again on the newly requested webpages. This request and processing sequence repeats until a preset cycle end condition is triggered.
In one embodiment, requesting the search result webpage corresponding to the keyword once from every target website and performing steps S206 and S208 on those webpages is referred to as one cycle, or one round of cycle processing.
In one embodiment, the step S204 includes the steps of:
requesting the search result webpage corresponding to the keywords from each target website at intervals of a preset duration, where the preset duration does not exceed a threshold.
For example, the search result webpage corresponding to the keyword is requested from each target website every 1 minute, and so on.
In one embodiment, the threshold that the preset duration must not exceed is a small value, so that the effect approximates monitoring the target websites for updated webpage content in real time.
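The cyclic request with a preset interval can be sketched as a simple polling loop. The function poll_search_pages and its parameters are hypothetical; fetch and handle stand in for the request step and for steps S206 and S208, and max_cycles stands in for the preset cycle end condition. The sleep and clock callables are injectable so the loop can be exercised without waiting or touching the network.

```python
import time


def poll_search_pages(sites, keyword, fetch, handle, interval_seconds=60.0,
                      max_cycles=None, sleep=time.sleep, clock=time.time):
    """Request the keyword's search result page from every site, cycle after cycle.

    fetch(site, keyword) returns the page source code.
    handle(site, keyword, page, cycle_time) processes one page.
    Returns the number of completed cycles.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for site in sites:
            page = fetch(site, keyword)
            handle(site, keyword, page, clock())
        cycles += 1
        # Wait for the preset duration before the next cycle, if there is one.
        if max_cycles is None or cycles < max_cycles:
            sleep(interval_seconds)
    return cycles


# Dry run with stand-in callables instead of real network requests.
seen = []
poll_search_pages(["siteA", "siteB"], "kw",
                  fetch=lambda site, kw: f"<html>{site}:{kw}</html>",
                  handle=lambda site, kw, page, t: seen.append((site, page)),
                  interval_seconds=60, max_cycles=2, sleep=lambda s: None)
print(len(seen))
```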
Step S206, extracting, for each target website separately, the content items corresponding to the keywords that have been updated in the search result webpage requested in the current cycle relative to the search result webpage requested in the previous cycle.
In an embodiment, in step S206, for each target website, the search result webpage requested in the current cycle may be compared with the search result webpage requested in the previous cycle, and the content items corresponding to the keyword that the target website has updated are extracted.
As shown in fig. 4, in one embodiment, step S206 includes the steps of:
Step S402, extracting the latest content item corresponding to the keywords in each search result webpage requested in the current cycle.
In one embodiment, the latest content item corresponding to the keyword can be extracted at a preset specified position of the source code of the search result webpage.
The search result web page corresponding to the keyword comprises a search result list corresponding to the keyword.
The search result list may contain both exact match search results and fuzzy match search results. In one embodiment, search results that are fuzzy matches to the keywords in the search result list may be filtered out, and search results that are exact matches may be retained.
On some websites, the latest content item corresponding to the keyword is placed ahead of the other content items in the search result list. Thus, in one embodiment, after the fuzzy-match search results have been filtered out of the search result list data corresponding to the keyword in the search result webpage source code, the content item at the first position of the remaining list data may be extracted as the latest content item corresponding to the keyword.
In other websites, the latest content entry corresponding to the keyword includes a predetermined designated field, such as "update to" or the like, indicating the latest content. Therefore, in one embodiment, the content item corresponding to the preset specified field can be extracted from the search result list data corresponding to the keyword in the search result webpage source code as the latest content item corresponding to the keyword.
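The two extraction strategies above (first position after filtering fuzzy matches, or a preset designated field such as "update to") can be sketched over an already-parsed result list. The list-of-dicts structure with "name" and "item" keys and the function latest_content_item are assumptions made for this sketch; the source does not describe how the search result list is represented after parsing.

```python
def latest_content_item(results, keyword, marker=None):
    """Pick the latest content item for a keyword from a parsed result list.

    results: list of dicts with hypothetical keys "name" (matched title)
             and "item" (content item text), in page order.
    marker:  optional designated field text, e.g. "update to"; when given,
             the first exact-match item containing it is returned.
    """
    # Keep exact matches only, dropping fuzzy-match search results.
    exact = [r for r in results if r["name"] == keyword]
    if not exact:
        return None
    if marker is None:
        return exact[0]["item"]          # first position in the list
    for r in exact:
        if marker in r["item"]:
            return r["item"]
    return None


results = [
    {"name": "Show", "item": "update to episode 12"},
    {"name": "Show 2", "item": "episode 3"},
    {"name": "Show", "item": "episode 11"},
]
print(latest_content_item(results, "Show"))
print(latest_content_item(results, "Show", marker="update to"))
```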
Step S404, comparing, for each target website separately, whether the extracted latest content item is the same as the most recently recorded content item.
For the currently extracted latest content item corresponding to the keyword on a given target website, it is determined whether that item is the same as the most recently recorded content item corresponding to the keyword on that target website.
Step S406, acquiring each extracted latest content item that differs from the most recently recorded content item as the updated content item of the corresponding target website.
If the latest content item corresponding to the keyword of a certain target website extracted currently is different from the content item corresponding to the keyword of the target website recorded recently, the latest content item extracted currently can be obtained as the updated content item of the target website corresponding to the keyword.
In an embodiment, each time the latest content item corresponding to the keyword of one target website is extracted, the currently extracted latest content item is compared with the content item corresponding to the keyword of the target website which is recorded recently, whether the two content items are the same or not is judged, and if the two content items are different, the currently extracted latest content item is obtained as the content item corresponding to the updated keyword of the target website.
In another embodiment, the latest content items corresponding to the keywords of all the target websites can be extracted, and then the extracted latest content items of each target website are compared with the content items corresponding to the keywords of the corresponding target websites which are recorded recently.
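The comparison in steps S404 and S406 can be sketched as a lookup against the most recently recorded item per target website and keyword. The function detect_updates and the dictionary layout are hypothetical. Note one assumption: with an empty record store, every item extracted in the first cycle is treated as updated; the source does not say how the first cycle is handled, and an implementation might instead seed a baseline without recording update times.

```python
def detect_updates(latest_by_site, last_recorded, keyword):
    """Return {site: item} for sites whose latest item differs from the record.

    latest_by_site: site -> latest content item extracted this cycle (or None).
    last_recorded:  (site, keyword) -> most recently recorded content item.
    """
    updated = {}
    for site, item in latest_by_site.items():
        if item is None:
            continue  # nothing could be extracted for this site in this cycle
        if last_recorded.get((site, keyword)) != item:
            updated[site] = item
    return updated


latest = {"siteA": "episode 2", "siteB": "episode 1", "siteC": None}
recorded = {("siteA", "kw"): "episode 1", ("siteB", "kw"): "episode 1"}
print(detect_updates(latest, recorded, "kw"))
```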
In step S208, if a content item corresponding to the updated keyword of the target website is extracted, the current loop processing time is used as the update time of the content item, and the corresponding relationship among the corresponding target website, the keyword, the content item, and the update time is recorded.
In an embodiment, if a content item corresponding to the keyword is extracted as updated in the search result webpage of a target website requested in the current cycle relative to the search result webpage of that target website requested in the previous cycle, the current cycle processing time of the content item may be the time at which the search result webpage was received in the current cycle, or the current time, or a time that deviates only slightly from the reception time or the current time.
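Recording the correspondence among target website, keyword, content item and update time can be sketched with a small record type. UpdateRecord and record_update are hypothetical names, and the sketch assumes the cycle processing time is passed in as a timestamp (for example the reception time of the page), which is one of the options described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateRecord:
    site: str           # target website
    keyword: str        # keyword being compared
    item: str           # updated content item
    update_time: float  # current cycle processing time, as a timestamp


def record_update(log, last_recorded, site, keyword, item, cycle_time):
    """Append one update record and remember the item as the latest recorded."""
    rec = UpdateRecord(site, keyword, item, cycle_time)
    log.append(rec)
    last_recorded[(site, keyword)] = item
    return rec


log, last_recorded = [], {}
record_update(log, last_recorded, "siteA", "kw", "episode 2", 1000.0)
print(log[0])
```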
Step S210, comparing, between the target websites, the update times of the same updated content item corresponding to the keyword.
In one embodiment, the update times at which two target websites updated the same content item corresponding to the keyword may be subtracted from one another to obtain the time difference.
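The subtraction of update times can be sketched as a pairwise comparison over recorded updates. The function update_time_differences and the tuple layout are hypothetical; the sketch assumes that if a website records the same item more than once, its earliest update time is used, and it reports the difference as the first site's time minus the second's, so a negative value means the first site updated earlier.

```python
def update_time_differences(records):
    """Compute pairwise update-time differences for items seen on 2+ sites.

    records: iterable of (site, keyword, item, update_time) tuples.
    Returns a list of (keyword, item, site_a, site_b, time_a - time_b).
    """
    first_seen = {}  # (keyword, item) -> {site: earliest update time}
    for site, keyword, item, t in records:
        by_site = first_seen.setdefault((keyword, item), {})
        by_site[site] = min(t, by_site.get(site, t))

    diffs = []
    for (keyword, item), by_site in first_seen.items():
        sites = sorted(by_site)
        for i, a in enumerate(sites):
            for b in sites[i + 1:]:
                diffs.append((keyword, item, a, b, by_site[a] - by_site[b]))
    return diffs


records = [
    ("siteA", "kw", "episode 2", 100.0),
    ("siteB", "kw", "episode 2", 160.0),
    ("siteA", "kw", "episode 3", 500.0),
]
print(update_time_differences(records))
```

Items recorded on only one website produce no row, since there is nothing to compare them against yet.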
Step S212, generating, according to the comparison result, the update difference data for the same updated content items corresponding to the keyword between the target websites.
The update difference data includes, but is not limited to, chart data in forms such as tables, graphs and bar charts.
In one embodiment, the keywords to be compared comprise a plurality of keywords of different categories, and the update difference data can be generated by category.
In an embodiment, the method for comparing the update speed of the web page content further includes the following steps: the update difference data is sent to a designated mailbox or a designated application program interface.
Automatically sending the update difference data to a designated mailbox can be used to notify the relevant users of the update difference data.
Automatically sending the update difference data to a designated application program interface allows that interface to apply preset processing logic to the update difference data.
In an embodiment, the method for comparing the update speed of the web page content further includes the following steps: displaying the updated difference data according to the expression form corresponding to the updated difference data; and so on.
For example, if the update difference data is table data, the update difference data is shown in a table representation form, and so on.
In an embodiment, the method for comparing the update speed of the web page content further includes the following steps: extracting feature identifiers of the content items; in the step of recording the content items and comparing the content items, the recording and comparing are performed based on the feature identifiers of the content items.
In one embodiment, the feature identifier of the content item may be extracted at a preset specified position in the content item. For example, the content corresponding to the title field may be extracted from the html tag corresponding to the content item as the feature identifier of the content item. In one embodiment, the feature identifier may be further formatted according to a preset processing logic, so that the feature identifier conforms to a preset format.
For example, html source code corresponding to a content item is as follows:
<a class="album_link" data-playsrc-elem="firstlink" target="_blank" data-searchpingback-elem="link" data-searchpingback-param="ptype=1-3-1"
href="http://www.iqiyi.com/v_19rro0o1ds.html#vfrm=2-3-0-1"
data-pb="rtgt=iqiyi&p2=9000" title="2015-02-24: excessive task of finger board of brother of running bar" data-tvlist-elem="...">
The content "2015-02-24: excessive task of finger board of brother of running bar" corresponding to the title field can be extracted as the feature identifier of the content item, and "2015-02-24" can further be formatted as "20150224" to conform to the unified format.
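Extracting the title field and formatting the date can be sketched with html.parser and a regular expression. The names feature_identifier and _TitleParser, the simplified sample tag and the rule of compacting YYYY-MM-DD dates are assumptions for illustration; the source only states that the identifier is formatted according to preset processing logic.

```python
import re
from html.parser import HTMLParser

_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


class _TitleParser(HTMLParser):
    """Remember the title attribute of the first <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.title = None
        self._seen_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and not self._seen_link:
            self._seen_link = True
            self.title = dict(attrs).get("title")


def feature_identifier(item_html):
    """Return the formatted title of a content item, or None if it has none."""
    parser = _TitleParser()
    parser.feed(item_html)
    parser.close()
    if parser.title is None:
        return None
    compact = _DATE.sub(r"\1\2\3", parser.title)  # 2015-02-24 -> 20150224
    return " ".join(compact.split())              # collapse stray whitespace


# Simplified content item markup (hypothetical title text).
sample = ('<a class="album_link" href="http://www.iqiyi.com/v_19rro0o1ds.html" '
          'title="2015-02-24: sample episode title">link text</a>')
print(feature_identifier(sample))
```

Normalizing the identifier before recording it makes the same episode comparable across target websites that format dates differently.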
In one embodiment, the content item may be semantically analyzed to obtain a feature identification of the content item. For example, the content item may be semantically analyzed by a semantic analysis tool.
For example, the target website is a news website, and after a content item corresponding to a keyword updated by a certain target website is extracted, semantic analysis can be performed on the content item to obtain a feature identifier of the content item.
Therefore, when the updating speeds of the two target websites for the same content items corresponding to the same keyword are compared, the updating speeds of the two target websites for the same feature identification corresponding to the same keyword can be compared.
FIG. 5 is a flowchart illustrating a method for comparing update speeds of web page contents in an embodiment. As shown in fig. 5, the method for comparing the update speed of the web page content includes the following steps:
Step S502, acquiring keywords to be compared.
Step S504, circularly requesting search result webpages corresponding to the keywords from each target website.
Step S506, the latest content item corresponding to the keyword in each search result webpage requested by the current cycle is extracted.
Step S508, comparing, for each target website separately, whether the extracted latest feature identifier is the same as the most recently recorded feature identifier.
Step S510, acquiring each extracted latest feature identifier that differs from the most recently recorded feature identifier as the updated feature identifier of the corresponding target website.
Step S512, if an updated feature identifier is extracted, taking the current cycle processing time as the update time of the feature identifier, and recording the correspondence among the target website, the keyword, the feature identifier, and the update time.
Step S514, comparing, between the target websites, the update times of the same updated feature identifier corresponding to the keyword.
Step S516, generating, according to the comparison result, the update difference data for the same updated feature identifier corresponding to the keyword between the target websites.
In one embodiment, the content items corresponding to the feature identifiers may be obtained, and the update difference data for those content items between the target websites is generated according to the result of comparing, between the target websites, the update times of the same updated feature identifiers corresponding to the keyword.
The above web content update speed comparison method is described below with reference to a specific application scenario. In one embodiment, the method is used for comparing how quickly the target websites update video series and novels.
A video series often comprises multiple episodes or installments, so the times at which the target websites update the series need to be compared. Likewise, a novel often comprises multiple chapters, so the times at which the target websites update novel chapters need to be compared. The specific process is as follows:
(1) Acquiring the pre-stored keywords to be compared for the video series category and for the novel category. The obtained keywords to be compared are shown in Table 1 below.
TABLE 1
Video episode name | Name of novel book |
Speak out loud 2015 | Living people forbidden land |
Where happiness is | Is pure and ambiguous |
Great difference in health | Teacher legend |
Huaxia micro-shadow | Immortal adverse flow of qi |
I maybe some time... | Super doctor |
Green swordsman | Killing spirit |
Salad | Anti-solar blood soul |
Big card driver | Super island owner |
Orange road theater edition 1 wish to return to the past | Medicine for city immortal |
Food to run | Mountain village stranger biography |
Each keyword in Table 1 is processed as in steps (2) to (7) below; steps (2) to (7) take the keyword "Speak out loud 2015" as an example.
(2) Request the search result webpage corresponding to "Speak out loud 2015" from a target website: generate a webpage link according to a preset rule, where the webpage link denotes a request to the target website for the search result webpage corresponding to "Speak out loud 2015"; then send a webpage pull request containing the webpage link to the target website, and receive the webpage source code returned by the target website, namely the source code of the search result webpage corresponding to "Speak out loud 2015".
Taking a target website whose home page URL is "http://www.iqiyi.com" as an example, the following webpage link can be generated: http://so.iqiyi.com/so/q_Speak out loud 2015?source=input. This webpage link denotes a request to the target website for the search result webpage corresponding to "Speak out loud 2015".
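The link generation in step (2) can be sketched as follows. This is only a minimal illustration: the template string is modelled on the example link above and is an assumption, as are the names SEARCH_URL_TEMPLATE and build_search_url; a real target website may use a different preset rule, and percent-encoding the keyword is likewise an assumption.

```python
from urllib.parse import quote

# Hypothetical preset rule, modelled on the example link in the text:
# http://so.iqiyi.com/so/q_<keyword>?source=input
SEARCH_URL_TEMPLATE = "http://so.iqiyi.com/so/q_{keyword}?source=input"


def build_search_url(keyword: str, template: str = SEARCH_URL_TEMPLATE) -> str:
    """Insert the percent-encoded keyword into the site's search link template."""
    return template.format(keyword=quote(keyword, safe=""))


url = build_search_url("Speak out loud 2015")
print(url)
```

Keeping one template per target website lets the same keyword list be reused for every site being compared.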
(3) Filter the fuzzy-match search results out of the search result list data contained in the source code of the search result webpage for "Speak out loud 2015" received from the target website, and extract the content item in the first position of the filtered search result list data as the latest content item corresponding to "Speak out loud 2015" in the target website.
(4) Extract the content of the title field from the html tag of the latest content item corresponding to "Speak out loud 2015" as the feature identifier of that content item, namely the latest feature identifier corresponding to "Speak out loud 2015".
For example, the html of the latest content item corresponding to "Speak out loud 2015" is as follows:
<li class="album_item"><a class="album_link" data-playsrc-elem="firstlink" target="_blank"
data-searchpingback-elem="link" data-searchpingback-param="ptype=1-3-1" href="http://vod.kankan.com/v/70/70367/470137.shtml?id=731100" data-pb="rtgt=kankan&p2=9000"
title="2015-03-22: prank by girlfriend causes boyfriend to jump from building and suffer fracture" data-tvlist-elem="...">
2015-03-22: prank by girlfriend causes boyfriend to jump from building and suffer fracture
The content of the title field, "2015-03-22: prank by girlfriend causes boyfriend to jump from building and suffer fracture", can be extracted from this html as the feature identifier of the latest content item, which gives the latest feature identifier corresponding to "Speak out loud 2015".
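The title extraction in step (4) might be sketched with the Python standard library as follows. The class and function names are hypothetical, and the sample entry is a simplified stand-in rather than the html of any real target website; taking the title of the first link tag is an assumption about where the title field sits.

```python
from html.parser import HTMLParser
from typing import Optional


class TitleAttrParser(HTMLParser):
    """Remember the title attribute of the first <a> tag that carries one."""

    def __init__(self) -> None:
        super().__init__()
        self.title: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a" and self.title is None:
            self.title = dict(attrs).get("title")


def extract_feature_identifier(entry_html: str) -> Optional[str]:
    """Return the title field of a content item, or None if it has none."""
    parser = TitleAttrParser()
    parser.feed(entry_html)
    return parser.title


# Simplified, hypothetical content item used only for illustration.
entry = (
    '<li class="album_item"><a class="album_link" href="http://example.com/v/1" '
    'title="2015-03-22: sample episode title">2015-03-22: sample episode title</a></li>'
)
feature_id = extract_feature_identifier(entry)
print(feature_id)
```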
(5) Format the feature identifier according to preset processing logic so that it conforms to a preset format.
For example, after removing the hyphens in the date and removing preset special symbols contained in the feature identifier, such as colons and quotation marks, the formatted feature identifier is: "20150322 prank by girlfriend causes boyfriend to jump from building and suffer fracture".
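One possible form of the formatting logic in step (5) is sketched below. The exact set of special symbols and the choice to replace them with a space and then collapse whitespace are assumptions made for illustration; the text only states that hyphens in the date and symbols such as colons and quotation marks are removed.

```python
import re

# Dates written as YYYY-MM-DD lose their hyphens.
_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
# Assumed set of "preset special symbols": ASCII and full-width colons, quotation marks.
_SPECIAL = str.maketrans({ch: " " for ch in ':："\'“”‘’'})


def format_feature_identifier(raw: str) -> str:
    """Normalise a raw feature identifier so that identifiers from different sites compare equal."""
    text = _DATE.sub(r"\1\2\3", raw)   # 2015-03-22 -> 20150322
    text = text.translate(_SPECIAL)    # drop colons and quotation marks
    return " ".join(text.split())      # collapse the whitespace left behind


formatted = format_feature_identifier("2015-03-22: sample episode title")
print(formatted)
```

Normalising before comparison matters because two target websites may punctuate the same episode title differently.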
(6) Compare whether the feature identifier most recently recorded for the target website is the same as "20150322 prank by girlfriend causes boyfriend to jump from building and suffer fracture"; if not, take either the time at which the source code was received in step (3) or the current time as the update time. For example, if the update time is 16:29 on March 26, 2015, then "http://www.iqiyi.com", "Speak out loud 2015", "20150322 prank by girlfriend causes boyfriend to jump from building and suffer fracture" and "2015-03-26 16:29" are taken as the URL of the target website, the keyword, the feature identifier and the update time respectively, and the correspondence among them is recorded.
Steps (2) to (6) are executed cyclically until a preset loop end condition is triggered.
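The comparison and recording in step (6) can be illustrated with a small in-memory log. The class name UpdateLog and its methods are hypothetical, and treating the very first observation of a keyword on a site as an update is an assumption; an implementation could equally discard the first cycle as a baseline.

```python
from datetime import datetime
from typing import Dict, List, Tuple


class UpdateLog:
    """Keeps the most recent feature identifier per (site, keyword) and every recorded update."""

    def __init__(self) -> None:
        self.latest: Dict[Tuple[str, str], str] = {}
        self.records: List[Tuple[str, str, str, datetime]] = []

    def observe(self, site: str, keyword: str, feature_id: str, cycle_time: datetime) -> bool:
        """Record an update if feature_id differs from the most recent record; return True if recorded."""
        key = (site, keyword)
        if self.latest.get(key) == feature_id:
            return False  # nothing new in this cycle
        self.latest[key] = feature_id
        self.records.append((site, keyword, feature_id, cycle_time))
        return True


log = UpdateLog()
site, keyword = "http://www.iqiyi.com", "Speak out loud 2015"
first = log.observe(site, keyword, "20150322 sample episode title", datetime(2015, 3, 26, 16, 29))
repeat = log.observe(site, keyword, "20150322 sample episode title", datetime(2015, 3, 26, 16, 30))
print(first, repeat, len(log.records))
```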
(7) Compare, between the target websites, the update times of the same feature identifier corresponding to the keyword, and generate, according to the comparison result, update difference data for that feature identifier between the target websites.
In one embodiment, the update time of a feature identifier at the second target website may be subtracted from the update time of the same feature identifier at the first target website to obtain the update time difference between the two target websites.
For example, an update difference data table as shown below may be generated.
If the update time difference is positive, the first target website updated later than the second target website; if the update time difference is negative, the first target website updated earlier than the second target website.
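The sign convention described above can be made concrete with a short sketch. The function name and the two sample timestamps are hypothetical; expressing the difference in seconds is an assumption, since the text does not fix a unit.

```python
from datetime import datetime


def update_time_difference(first_site_time: datetime, second_site_time: datetime) -> float:
    """Return first minus second in seconds; a positive value means the first site updated later."""
    return (first_site_time - second_site_time).total_seconds()


# Hypothetical update times of the same feature identifier at two target websites.
first_site = datetime(2015, 3, 26, 16, 29)
second_site = datetime(2015, 3, 26, 16, 12)
diff = update_time_difference(first_site, second_site)
print(diff)  # positive: the first site was 17 minutes slower
```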
As shown in fig. 6, in one embodiment, a web page content update speed comparison apparatus includes a comparison keyword obtaining module 602, a keyword search web page request module 604, an updated content item extracting module 606, a recording module 608, an update time comparison module 610, and an update difference data generating module 612, wherein:
the comparison keyword obtaining module 602 is configured to obtain a keyword to be compared.
In one embodiment, there may be one or more keywords to be compared.
The keyword search web page request module 604, the updated content item extraction module 606, the recording module 608, the update time comparison module 610 and the update difference data generation module 612 may process each keyword to be compared separately; when content items are compared, only content items corresponding to the same keyword are compared with each other.
As shown in fig. 7, in an embodiment, the apparatus for comparing update speed of web page content further includes a comparison keyword setting storage module 702, configured to set and store a keyword to be compared.
In one embodiment, the comparison keyword setting storage module 702 may search a specified domain or a specified category for keywords whose popularity exceeds a threshold and use them as the keywords to be compared. For example, video names in the video category whose popularity exceeds a threshold may be found and used as keywords to be compared, where a video name may be, but is not limited to, the name of a TV series or the name of a movie.
In one embodiment, the comparison keyword setting storage module 702 may obtain the keywords ranked within a preset number of top positions in a specified network ranking list as the keywords to be compared. For example, the music names ranked within the preset top positions of a music ranking list may be used as keywords to be compared.
In one embodiment, the comparison keyword setting storage module 702 is configured to obtain preset classification categories of web page contents, request the web page contents from each target website, crawl keywords corresponding to each category in the web page contents of each target website, filter repeated keywords from the extracted keywords, and store the filtered remaining keywords as keywords to be compared.
In one embodiment, the web page content classification categories include, but are not limited to, video, novel, music, news, and the like.
In one embodiment, the URL information of the target website may be preset and stored.
In one embodiment, the comparison keyword setting storage module 702 may request the web page content of the website's home page from each target website. Specifically, a web page pull request including a URL address of a website home page may be sent to each target website, and a source code corresponding to the home page returned by each target website may be received.
In one embodiment, the comparison keyword setting storage module 702 may extract the URL addresses of the web pages corresponding to the categories from the source codes of the home pages of the target websites. Further, the comparison keyword setting storage module 702 may send a web page pull request containing the URL address of the web page corresponding to a category to the target website, and receive the web page source code corresponding to the category returned by the target website.
In one embodiment, the comparison keyword setting storage module 702 may extract the URL address of the web page corresponding to the href field in the html tag corresponding to the category from the source code.
Further, the comparison keyword setting storage module 702 may extract, from the web page source code corresponding to a category, the keywords corresponding to the specified fields in the html tags that meet the preset rules, so as to obtain the keywords corresponding to the category.
Further, the comparison keyword setting storage module 702 may also extract, from the web page source code corresponding to a category, the web page links corresponding to the sub-categories contained in the category, send web page pull requests containing those web page links to the corresponding target website, and receive the corresponding web page source codes returned by the target website. This process is repeated until the web page source codes corresponding to all descendant categories contained in the category have been crawled, and the keywords corresponding to the specified fields in the html tags that meet the preset rules are extracted from the web page source codes received during crawling as the keywords corresponding to the category.
The descendant categories contained in the categories comprise: subcategories and all categories branching from subcategories.
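The crawl over a category and all of its descendant categories, together with the filtering of repeated keywords, can be sketched as a breadth-first traversal. This is an illustrative sketch only: the fetch callable is a hypothetical stand-in for requesting a page and parsing its keywords and sub-category links, and the in-memory site used in the example is invented.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

# A parsed page: (keywords found on the page, links to its sub-category pages).
Page = Tuple[List[str], List[str]]


def crawl_category_keywords(root_url: str, fetch: Callable[[str], Page]) -> List[str]:
    """Visit a category page and all descendant category pages, returning de-duplicated keywords."""
    seen_urls = {root_url}
    queue = deque([root_url])
    keywords: Dict[str, None] = {}  # dict keeps first-seen order and drops repeats
    while queue:
        url = queue.popleft()
        page_keywords, sub_links = fetch(url)
        for kw in page_keywords:
            keywords.setdefault(kw, None)
        for link in sub_links:
            if link not in seen_urls:  # guard against pages that link back to an ancestor
                seen_urls.add(link)
                queue.append(link)
    return list(keywords)


# Hypothetical in-memory site standing in for real page requests.
fake_site: Dict[str, Page] = {
    "/video": (["A", "B"], ["/video/tv", "/video/movie"]),
    "/video/tv": (["B", "C"], ["/video/tv/drama"]),
    "/video/movie": (["D"], []),
    "/video/tv/drama": (["E", "A"], ["/video"]),
}
video_keywords = crawl_category_keywords("/video", fake_site.__getitem__)
print(video_keywords)
```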
In one embodiment, the comparison keyword setting storage module 702 may store the keywords to be compared in categories according to the categories corresponding to the keywords to be compared.
The keyword search web page request module 604 is configured to request search result web pages corresponding to the keywords from each target website in a looping manner.
The search result web page corresponding to the keyword is the web page obtained by searching for the keyword in the target website. For example, if a user enters the keyword in the search input box of the target website and clicks the search button, a result web page is presented to the user; that web page is the search result web page corresponding to the keyword.
In one embodiment, the keyword search web page request module 604 may generate a web page link according to a preset rule in a process of requesting a search result web page corresponding to a keyword from a certain target website, where the web page link represents a search result web page corresponding to the keyword requested from the target website; and further sending a webpage pulling request containing the webpage link to the target website, and receiving a webpage source code returned by the target website, namely a source code corresponding to the search result webpage corresponding to the keyword.
After the keyword search web page request module 604 requests search result web pages corresponding to the keywords from each target website, the updated content item extraction module 606 and the recording module 608 may perform corresponding processing on the search result web pages; further, the keyword search web page request module 604 may continue to request search result web pages corresponding to the keywords from each target website, and the updated content item extraction module 606 and the recording module 608 may continue to perform corresponding processing on the search result web pages until a preset loop end condition is triggered.
In one embodiment, one pass in which the keyword search web page request module 604 requests the search result web pages corresponding to the keywords from all the target websites, and the updated content item extraction module 606 and the recording module 608 process those search result web pages, is called one cycle or one cycle of processing.
The keyword search web page request module 604 is configured to request search result web pages corresponding to keywords from each target website every preset time interval, where the preset time interval does not exceed a threshold.
For example, a search result web page corresponding to the keyword is requested from each target website every 1 minute, and so on.
In one embodiment, the threshold that the preset time interval must not exceed is a small value, so that the effect approximates real-time monitoring of content updates on the target websites.
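The interval-based cycling can be sketched as a simple loop. The function name run_cycles is hypothetical, and using a maximum number of cycles as the preset loop end condition is an assumption; the sleep function is passed in so the loop can be exercised without actually waiting.

```python
import time
from typing import Callable, List


def run_cycles(process_cycle: Callable[[int], None],
               interval_seconds: float,
               max_cycles: int,
               sleep: Callable[[float], None] = time.sleep) -> int:
    """Run process_cycle once per interval until max_cycles cycles have completed."""
    completed = 0
    while completed < max_cycles:
        process_cycle(completed)      # request pages, extract and record updates
        completed += 1
        if completed < max_cycles:    # no need to wait after the final cycle
            sleep(interval_seconds)
    return completed


# Dry run with a recording stand-in for time.sleep, using the 1-minute interval from the text.
cycles_seen: List[int] = []
naps: List[float] = []
total = run_cycles(cycles_seen.append, 60, 3, sleep=naps.append)
print(total, cycles_seen, naps)
```

The shorter the interval, the closer the recorded update time is to the moment the target website actually published the content, at the cost of more requests.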
The updated content item extraction module 606 is configured to extract, separately for each target website, the content item corresponding to the keyword that is updated in the search result webpage requested in the current cycle relative to the search result webpage requested in the previous cycle.
In one embodiment, the updated content item extraction module 606 may compare the search result webpage requested in the current cycle with the search result webpage requested in the previous cycle for the same target website, and extract the updated content item corresponding to the keyword for that target website.
As shown in FIG. 8, in one embodiment, the updated content item extraction module 606 includes a latest item extraction module 802, a comparison module 804, and an updated entry obtaining module 806, wherein:
the latest item extraction module 802 is configured to extract the latest content items corresponding to the keywords in each search result webpage requested by the current cycle.
In one embodiment, the latest item extraction module 802 may extract the latest content item corresponding to the keyword at a preset designated position of the source code of the search result web page.
The search result web page corresponding to the keyword comprises a search result list corresponding to the keyword.
The search result list may contain both exact match search results and fuzzy match search results. In one embodiment, the latest entry extraction module 802 may filter out search results in the search result list that are fuzzy matches to the keyword, and retain search results that are exact matches.
In some websites, the latest content item corresponding to the keyword is placed ahead of the other content items in the search result list. Thus, in one embodiment, the latest item extraction module 802 may take the search result list data corresponding to the keyword in the source code of the search result web page, filter out the fuzzy-match search results, and extract the content item in the first position of the remaining data as the latest content item corresponding to the keyword.
In other websites, the latest content entry corresponding to the keyword includes a predetermined designated field, such as "update to" or the like, indicating the latest content. Therefore, in one embodiment, the latest item extraction module 802 may extract a content item corresponding to a preset specified field from the search result list data corresponding to the keyword in the search result web page source code as the latest content item corresponding to the keyword.
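The two extraction strategies described above can be sketched side by side. The SearchResult structure and its field names are hypothetical, and treating an exact match as string equality between the series title and the keyword is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SearchResult:
    series_title: str      # title of the series or novel the result belongs to
    entry_title: str       # title of the individual episode or chapter
    status_text: str = ""  # e.g. "update to 2015-03-22" on sites that mark the latest entry


def latest_by_position(results: List[SearchResult], keyword: str) -> Optional[SearchResult]:
    """Drop fuzzy matches, then take the first remaining result as the latest content item."""
    exact = [r for r in results if r.series_title == keyword]
    return exact[0] if exact else None


def latest_by_field(results: List[SearchResult], keyword: str,
                    marker: str = "update to") -> Optional[SearchResult]:
    """Take the first exact-match result that carries the preset marker field."""
    for r in results:
        if r.series_title == keyword and marker in r.status_text:
            return r
    return None


# Hypothetical search result list for illustration.
results = [
    SearchResult("Speak out loud 2015 highlights", "clip"),
    SearchResult("Speak out loud 2015", "2015-03-22 episode", "update to 2015-03-22"),
    SearchResult("Speak out loud 2015", "2015-03-15 episode"),
]
print(latest_by_position(results, "Speak out loud 2015").entry_title)
print(latest_by_field(results, "Speak out loud 2015").entry_title)
```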
The comparison module 804 is used for comparing whether the extracted latest content item is the same as the content item recorded most recently by the target website.
For the latest content item corresponding to the keyword of a certain currently extracted target website, the comparing module 804 may determine whether the latest content item is the same as the content item corresponding to the keyword of the target website recorded recently.
The updated entry obtaining module 806 is configured to obtain the latest extracted content entry different from the most recently recorded content entry as the updated content entry of the corresponding target website.
If the latest content item corresponding to the currently extracted keyword of a certain target website is different from the content item corresponding to the recently recorded keyword of the target website, the updated item obtaining module 806 may obtain the currently extracted latest content item as the content item updated by the target website corresponding to the keyword.
In one embodiment, each time the latest item extraction module 802 extracts the latest content item corresponding to the keyword for one target website, the comparison module 804 compares the currently extracted latest content item with the most recently recorded content item corresponding to the keyword for that target website; if the two are different, the updated entry obtaining module 806 may take the currently extracted latest content item as the updated content item corresponding to the keyword for that target website.
In another embodiment, the latest item extraction module 802 may first extract the latest content items corresponding to the keywords of all the target websites, and then the comparison module 804 compares the extracted latest content items of each target website with the content items corresponding to the keywords of the corresponding target website recorded recently.
The recording module 608 is configured to, if a content item corresponding to a keyword updated by a target website is extracted, take the current cycle processing time as the update time of the content item, and record a corresponding relationship among the corresponding target website, the keyword, the content item, and the update time.
In one embodiment, if a content item corresponding to the keyword is extracted that is updated in the search result webpage of a target website requested in the current cycle relative to the search result webpage of that target website requested in the previous cycle, the current cycle processing time of the content item may be the time at which the search result webpage was received in the current cycle, or the current time, or a time within a slight fluctuation range of the receiving time or the current time, that is, a time that differs only slightly from the receiving time or the current time.
The update time comparison module 610 is configured to compare, between the target websites, the update times of the same updated content item corresponding to the keyword.
In one embodiment, the update time comparison module 610 may subtract the update times, at two target websites, of the same updated content item corresponding to the keyword to obtain the update time difference.
The update difference data generation module 612 is configured to generate, according to the comparison result, update difference data between the target websites for the same updated content item corresponding to the keyword.
The update difference data includes, but is not limited to, chart data in any of several representations, such as tables, graphs, and bar charts.
In one embodiment, the keywords to be compared include a plurality of keywords of different categories, and the update difference data generation module 612 may generate the update difference data by category.
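Generating update difference data by category might look like the following sketch. The record layout (site, keyword, feature identifier, update time), the category_of mapping and the function name are assumptions made for illustration; only content items recorded at both target websites produce a row.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Tuple

Record = Tuple[str, str, str, datetime]  # (site, keyword, feature identifier, update time)
Row = Tuple[str, str, float]             # (keyword, feature identifier, difference in seconds)


def build_difference_rows(records: List[Record], category_of: Dict[str, str],
                          first_site: str, second_site: str) -> Dict[str, List[Row]]:
    """Group first-minus-second update time differences by the category of each keyword."""
    times: Dict[Tuple[str, str], Dict[str, datetime]] = defaultdict(dict)
    for site, keyword, feature_id, when in records:
        times[(keyword, feature_id)].setdefault(site, when)  # keep the first recorded time
    by_category: Dict[str, List[Row]] = defaultdict(list)
    for (keyword, feature_id), per_site in times.items():
        if first_site in per_site and second_site in per_site:
            diff = (per_site[first_site] - per_site[second_site]).total_seconds()
            by_category[category_of[keyword]].append((keyword, feature_id, diff))
    return dict(by_category)


# Hypothetical records for two target websites.
records = [
    ("site-a", "Speak out loud 2015", "20150322 sample", datetime(2015, 3, 26, 16, 29)),
    ("site-b", "Speak out loud 2015", "20150322 sample", datetime(2015, 3, 26, 16, 12)),
    ("site-a", "Super doctor", "chapter 10", datetime(2015, 3, 26, 9, 0)),
]
category_of = {"Speak out loud 2015": "video", "Super doctor": "novel"}
rows = build_difference_rows(records, category_of, "site-a", "site-b")
print(rows)
```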
As shown in fig. 9, in an embodiment, the apparatus for comparing update speed of web page content further includes an update difference data processing module 902, configured to send the update difference data to a specified mailbox or a specified application program interface.
Automatically sending the update difference data to a designated mailbox may be used to notify the associated user of the update difference data.
Automatically sending the update difference data to a designated application program interface allows the application program interface to perform preset logic processing on the update difference data.
In one embodiment, the update difference data processing module 902 is configured to display the update difference data in the representation corresponding to the update difference data.
Wherein, for example, if the update difference data is table data, the update difference data processing module 902 may present the update difference data in a table representation form, and so on.
As shown in fig. 10, in an embodiment, the apparatus for comparing update speed of web page content further includes a feature module extracting module 1002, configured to extract the feature identifier of a content item; wherever the above modules record or compare content items, the recording and the comparison are performed on the basis of the feature identifiers of the content items.
In one embodiment, the feature module extraction module 1002 may extract the feature identifier of the content item at a preset specified position in the content item. For example, the content corresponding to the title field may be extracted from the html tag corresponding to the content item as the feature identifier of the content item. In one embodiment, the feature module extracting module 1002 is further configured to format the feature identifier according to a preset processing logic, so that the feature identifier conforms to a preset format.
In one embodiment, the feature module extraction module 1002 is configured to perform semantic analysis on the content item to obtain a feature identifier of the content item. For example, the content item may be semantically analyzed by a semantic analysis tool.
In one embodiment, the comparison keyword obtaining module 602 is configured to obtain a keyword to be compared.
The keyword search web page request module 604 is configured to request search result web pages corresponding to the keywords from each target website in a looping manner.
The latest item extraction module 802 is configured to extract the latest content items corresponding to the keywords in each search result webpage requested by the current cycle.
The comparing module 804 is configured to compare whether the extracted latest feature identifier is the same as the feature identifier recorded most recently.
The updated entry obtaining module 806 is configured to obtain the latest extracted feature identifier different from the most recently recorded feature identifier as the updated feature identifier of the corresponding target website.
The recording module 608 is configured to, if an updated feature identifier is extracted, take the current cycle processing time as the update time of the feature identifier, and record a corresponding relationship between the corresponding target website, the keyword, the feature identifier, and the update time.
The update time comparison module 610 is configured to compare update times of the same feature identifiers corresponding to the update keywords between the target websites.
The update difference data generation module 612 is configured to generate update difference data of the same feature identifier corresponding to the update keyword between the target websites according to the comparison result.
In an embodiment, the update difference data generating module 612 may further obtain content items corresponding to the feature identifiers, and generate update difference data of the content items corresponding to the feature identifiers corresponding to the update keywords between the target websites according to a comparison result of update times of the same feature identifiers corresponding to the update keywords between the target websites.
The above method and device for comparing web page content update speeds cyclically request the search result webpage corresponding to the keyword from each target website and extract the content items updated in the search result webpage, so they can monitor whether a target website has updated its content items and which content items have been updated. The method and device take the current cycle processing time as the update time of an updated content item; this time is in effect the time at which the update was detected. Because the search result webpages are requested cyclically from each target website, the time at which an update is detected is very close to the time at which the updated content item was actually released, and the actual release time is the actual update time. The method and device can therefore obtain the update times of content items accurately and, based on these accurate update times, compare the web page content update speeds of the target websites to obtain an accurate comparison result.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present invention, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (12)
1. A method for comparing the update speed of web page contents comprises the following steps:
acquiring keywords to be compared;
circularly requesting search result webpages corresponding to the keywords from each target website;
extracting, separately for each target website, the content item corresponding to the keyword that is updated in the search result webpage requested in the current cycle relative to the search result webpage requested in the previous cycle;
if the updated content item is extracted, taking the current cycle processing time as the updating time of the content item, and recording the corresponding relation among the corresponding target website, the keyword, the content item and the updating time;
comparing the updating time of the same content items corresponding to the updated keywords between the target websites;
and generating update difference data of the same content items corresponding to the updated keywords among the target websites according to the comparison result.
2. The method for comparing update speeds of web pages as claimed in claim 1, wherein the step of obtaining the keywords to be compared comprises:
acquiring preset webpage content classification categories;
requesting web page content from each target website;
crawling keywords corresponding to various categories in the webpage content of each target website;
filtering repeated keywords in the extracted keywords;
and storing the remaining keywords after filtering as the keywords to be compared.
3. The method for comparing webpage content update speeds according to claim 1, wherein the step of circularly requesting search result webpages corresponding to the keywords from each target website comprises:
and requesting search result webpages corresponding to the keywords from each target website at intervals of preset time, wherein the preset time does not exceed a threshold value.
4. The method for comparing webpage content updating speeds according to claim 1, wherein the step of extracting, separately for each target website, the content item corresponding to the keyword that is updated in the search result webpage requested in the current cycle relative to the search result webpage requested in the previous cycle comprises:
extracting the latest content items corresponding to the keywords in each search result webpage requested in the current cycle;
comparing, separately for each target website, whether the extracted latest content item is the same as the most recently recorded content item;
acquiring the extracted latest content item different from the most recently recorded content item as the updated content item of the corresponding target website.
5. The method for comparing update speed of web page content according to claim 4, further comprising the steps of:
extracting feature identifiers of the content items;
in the steps of recording the content items and comparing the content items, the recording and comparing are performed based on the feature identifiers of the content items.
6. A device for comparing the update speed of web page contents, comprising:
a comparison keyword acquisition module, used for acquiring keywords to be compared;
a keyword search web page request module, configured to request search result web pages corresponding to the keywords from each target website in a circulating manner;
an updated content item extraction module, configured to extract, separately for each target website, the content item corresponding to the keyword that is updated in the search result webpage requested by the keyword search web page request module in the current cycle relative to the search result webpage requested in the last cycle;
the recording module is used for taking the current cycle processing time as the updating time of the content item and recording the corresponding relation among the corresponding target website, the keywords, the content item and the updating time if the updated content item is extracted;
the updating time comparison module is used for comparing the updating time of the same content items corresponding to the updated keywords between the target websites;
and the update difference data generation module is used for generating update difference data of the same content items corresponding to the updated keywords among the target websites according to the comparison result.
7. The apparatus for comparing web page content update speed according to claim 6, further comprising:
and the comparison keyword setting and storing module is used for acquiring preset webpage content classification categories, requesting webpage content from each target website, crawling keywords corresponding to each category in the webpage content of each target website, filtering repeated keywords in the extracted keywords, and storing the filtered residual keywords as the keywords to be compared.
8. The apparatus for comparing update speed of web page content according to claim 6, wherein the keyword search web page request module is configured to request a search result web page corresponding to the keyword from each target website every preset time interval, and the preset time interval does not exceed a threshold.
9. The apparatus for comparing update speed of web page content according to claim 6, wherein the update content item extracting module comprises:
the latest item extraction module is used for extracting the latest content items corresponding to the keywords in each search result webpage requested in the current cycle;
the comparison module is used for comparing, separately for each target website, whether the extracted latest content item is the same as the most recently recorded content item;
an updated entry obtaining module, configured to obtain the latest extracted content entry different from the most recently recorded content entry as the updated content entry of the corresponding target website.
10. The apparatus for comparing web page content update speed according to claim 9, further comprising:
the characteristic module extraction module is used for extracting the characteristic identification of the content item;
when the recording module records the content item, recording by taking the characteristic identifier of the content item as a standard;
when the comparison module compares the content items, the comparison module compares the content items by taking the feature identification of the content items as a standard;
in the steps of recording the content items and comparing the content items, the recording and comparing are performed based on the feature identifiers of the content items.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 5 are implemented by the processor when executing the computer program.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
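The polling and update-detection behaviour described in claims 8 to 10 can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `feature_id`, `UpdateTracker`, `poll` and `fetch_latest`, the SHA-1 hash used as the feature identifier, and the 60-second default threshold are all assumptions introduced for illustration. The sketch also assumes that the first item seen for a website and keyword only sets a baseline and is not counted as an update.

```python
import hashlib
import time
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def feature_id(item_text: str) -> str:
    """Hypothetical feature identifier: hash of the whitespace- and case-normalized text."""
    normalized = " ".join(item_text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class UpdateTracker:
    """Keeps the most recently recorded item per (site, keyword) and logs updates."""

    def __init__(self) -> None:
        # (site, keyword) -> feature identifier of the most recently recorded item
        self.last_seen: Dict[Tuple[str, str], str] = {}
        # (site, keyword, feature identifier, update time)
        self.records: List[Tuple[str, str, str, float]] = []

    def observe(self, site: str, keyword: str, latest_item: str, now: float) -> bool:
        """Return True if latest_item differs from the most recently recorded item."""
        fid = feature_id(latest_item)
        key = (site, keyword)
        previous = self.last_seen.get(key)
        self.last_seen[key] = fid
        # Assumption: the first observation is a baseline, not an update.
        if previous is None or previous == fid:
            return False
        self.records.append((site, keyword, fid, now))
        return True


def poll(
    tracker: UpdateTracker,
    sites: Iterable[str],
    keywords: Iterable[str],
    fetch_latest: Callable[[str, str], Optional[str]],  # hypothetical fetcher
    interval_s: float,
    cycles: int,
    max_interval_s: float = 60.0,  # assumed threshold, not from the patent
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.time,
) -> None:
    """Request the latest item for every site and keyword once per cycle."""
    if interval_s > max_interval_s:
        raise ValueError("interval exceeds the allowed threshold")
    sites, keywords = list(sites), list(keywords)
    for cycle in range(cycles):
        now = clock()  # the cycle's processing time serves as the update time
        for site in sites:
            for keyword in keywords:
                item = fetch_latest(site, keyword)
                if item is not None:
                    tracker.observe(site, keyword, item, now)
        if cycle < cycles - 1:
            sleep(interval_s)
```

With records in this form, the update times that different websites logged for the same feature identifier can be subtracted to obtain the update difference data mentioned in the abstract. A shorter polling interval bounds the error of each recorded update time more tightly, which is presumably why claim 8 caps the interval with a threshold.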
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510194529.6A CN106156200B (en) | 2015-04-22 | 2015-04-22 | Webpage content updating speed comparison method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510194529.6A CN106156200B (en) | 2015-04-22 | 2015-04-22 | Webpage content updating speed comparison method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106156200A CN106156200A (en) | 2016-11-23 |
CN106156200B CN106156200B (en) | 2020-09-08 |
Family
ID=57346295
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510194529.6A Active CN106156200B (en) | 2015-04-22 | 2015-04-22 | Webpage content updating speed comparison method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106156200B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109298987B (en) * | 2017-07-25 | 2021-10-15 | 北京国双科技有限公司 | Method and device for detecting running state of web crawler |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103049451B (en) * | 2011-10-14 | 2016-08-10 | 腾讯科技(深圳)有限公司 | The tracking of network content update and device |
CN102521295A (en) * | 2011-11-30 | 2012-06-27 | 深圳市五巨科技有限公司 | Method and device for automatically acquiring content updating on designated page |
US8751459B2 (en) * | 2012-05-24 | 2014-06-10 | App Tsunami, Inc. | Method and system to analyze email addresses |
CN104050273B (en) * | 2014-06-24 | 2018-07-10 | 北京奇虎科技有限公司 | For recording newest network file, the installation method for changing search result |
Also Published As
Publication number | Publication date |
---|---|
CN106156200A (en) | 2016-11-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10430425B2 (en) | Generating suggested queries based on social graph information | |
US9390184B2 (en) | Search and retrieval of objects in a social networking system | |
KR101527259B1 (en) | Providing posts to discussion threads in response to a search query | |
US8572129B1 (en) | Automatically generating nodes and edges in an integrated social graph | |
CN112015949A (en) | Video generation method and device, storage medium and electronic equipment | |
CN102054003B (en) | Methods and systems for recommending network information and creating network resource index | |
US10489473B2 (en) | Generating information describing interactions with a content item presented in multiple collections of content | |
CN103020123B (en) | A kind of method searching for bad video website | |
GB2509773A (en) | Automatic genre determination of web content | |
US8645363B2 (en) | Spreading comments to other documents | |
CN103631794A (en) | Method, device and equipment for sorting search results | |
CN107566906B (en) | Video comment processing method and device | |
CN111104583A (en) | Live broadcast room recommendation method, storage medium, electronic device and system | |
CN103605808A (en) | Search-based UGC (user generated content) recommendation method and search-based UGC recommendation system | |
US12069090B2 (en) | Illegal content search device, illegal content search method, and program | |
CN105574030A (en) | Information search method and device | |
EP3706014A1 (en) | Methods, apparatuses, devices, and storage media for content retrieval | |
CN109635072B (en) | Public opinion data distributed storage method, public opinion data distributed storage device, storage medium and terminal equipment | |
US20210342393A1 (en) | Artificial intelligence for content discovery | |
CN113254665A (en) | Knowledge graph expansion method and device, electronic equipment and storage medium | |
CN106156200B (en) | Webpage content updating speed comparison method and device | |
CN106454546A (en) | Caption file processing method and caption file processing device | |
CN107807964A (en) | Digital content sort method, device and computer-readable recording medium | |
WO2015043322A1 (en) | Method and apparatus for performing capturing and authentication by engine, and method and apparatus for providing webpage open abstract | |
CN106156024B (en) | Information processing method and server |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 2022-11-17; Address after: 1402, Floor 14, Block A, Haina Baichuan Headquarters Building, No. 6, Baoxing Road, Haibin Community, Xin'an Street, Bao'an District, Shenzhen, Guangdong 518100; Patentee after: Shenzhen Yayue Technology Co.,Ltd.; Address before: 2, 518000, East 403 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District; Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd. |