CN111125482A - Method and device for adjusting data crawling frequency, storage medium and processor - Google Patents

Info

Publication number
CN111125482A
Authority
CN
China
Prior art keywords
crawling
information source
data
frequency
adjusting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811290140.1A
Other languages
Chinese (zh)
Other versions
CN111125482B (en)
Inventor
陈发发
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201811290140.1A
Publication of CN111125482A
Application granted
Publication of CN111125482B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method and a device for adjusting data crawling frequency, a storage medium and a processor. The method comprises: performing data crawling on a plurality of information sources at a first crawling frequency through a crawler platform; sorting the information sources according to the amount of data crawled; and adjusting the crawling frequency for each information source based on its position in the sorted order. This solves the problem in the related art that, because the data crawling frequency of the crawler platform is fixed, crawler platform resources are wasted when data is crawled from different information sources.

Description

Method and device for adjusting data crawling frequency, storage medium and processor
Technical Field
The application relates to the field of data processing, in particular to a method and a device for adjusting data crawling frequency, a storage medium and a processor.
Background
At present, the frequency at which a crawler platform crawls each information source is fixed. A parameter, namely the crawling interval, is set when an information source is entered, and every subsequent crawl of that information source takes place at this fixed interval. The crawling interval defaults to 3 hours and is generally not modified, so most information sources are crawled once every 3 hours. The crawling interval is therefore the same for every information source on the crawler platform, regardless of the quality of the source or the amount of data it produces, which is clearly unreasonable.
For example, suppose one information source is a website whose news is updated once a day. If the crawling interval is set to 3 hours, a large amount of resources is wasted, because the crawler platform repeatedly does useless work. Conversely, suppose a website is very active and updates its content every few minutes. The default 3-hour interval may then cause data to be missed, because each information source is also configured with a maximum number of pages to turn, 5 by default; if the website publishes more than 5 pages of updates within those 3 hours, data is lost. Given that the current crawler platform has information sources on the order of hundreds of thousands, both the resource waste and the loss of captured data are substantial.
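The arithmetic behind both failure modes can be illustrated with a minimal sketch. The function names and the simplifying assumptions (one crawl is enough to capture one update, and anything beyond the page limit is lost) are illustrative only and do not come from the application.

```python
def crawls_per_day(interval_hours: float) -> int:
    """Number of crawls performed in 24 hours at a fixed interval."""
    return int(24 // interval_hours)


def useless_crawls_per_day(interval_hours: float, updates_per_day: int) -> int:
    """Crawls that find nothing new, assuming one crawl captures one update."""
    return max(0, crawls_per_day(interval_hours) - updates_per_day)


def pages_missed(pages_published_per_interval: int, max_pages: int = 5) -> int:
    """Pages lost in one interval when the site publishes more than max_pages."""
    return max(0, pages_published_per_interval - max_pages)


# A site updated once a day, crawled every 3 hours: 8 crawls, 7 of them useless.
print(useless_crawls_per_day(3, 1))
# An active site publishing 8 pages within one 3-hour interval: 3 pages lost.
print(pages_missed(8))
```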
No effective solution has yet been proposed for the problem in the related art that, because the data crawling frequency of the crawler platform is fixed, crawler platform resources are wasted when data is crawled from different information sources.
Disclosure of Invention
The main purpose of the present application is to provide a method and an apparatus for adjusting data crawling frequency, a storage medium, and a processor, so as to solve the problem of resource waste of a crawler platform caused by fixed data crawling frequency of the crawler platform when data crawling is performed on different information sources in the related art.
In order to achieve the above object, according to one aspect of the present application, a method for adjusting data crawling frequency is provided. The method comprises: performing data crawling on a plurality of information sources at a first crawling frequency through a crawler platform; sorting the information sources according to the amount of data crawled; and adjusting the crawling frequency for each information source based on its position in the sorted order.
Further, adjusting the crawling frequency for each information source based on its position in the sorted order comprises: determining the position range in which the sorted position of each information source falls, wherein the position ranges are obtained by grouping, according to a preset rule, the positions produced by sorting the information sources by the amount of data crawled, and different position ranges correspond to different adjusted crawling frequencies; and adjusting the crawling frequency for each information source based on the position range in which its position falls.
Further, after data crawling is performed on the plurality of information sources at the first crawling frequency through the crawler platform, the method further comprises: storing the data crawled from each information source in a platform layer of a database; and/or, before the crawling frequency for each information source is adjusted based on its position in the sorted order, the method further comprises: extracting data from the data crawled from each information source, and storing the extracted data in a vertical layer of the database; calculating, for each information source, the ratio of the amount of data stored in the vertical layer to the amount of data stored in the platform layer; and, if the ratio of an information source is greater than or equal to a preset ratio, adjusting the crawling frequency for that information source based on its position in the sorted order.
Further, if there is a target information source whose ratio is smaller than the preset ratio, it is determined whether the target information source is in a preset information source library; if the target information source is in the information source library, its crawling frequency is adjusted to a second crawling frequency, wherein the second crawling frequency is smaller than the first crawling frequency; and if the target information source is not in the information source library, a reminder is triggered asking whether data crawling of the target information source should continue.
Further, determining the position range in which the sorted position of each information source falls, and adjusting the crawling frequency for each information source based on that position range, comprises: if the position of the target information source falls in a first position range, adjusting the crawling frequency of the target information source to a third crawling frequency, wherein the third crawling frequency is smaller than the first crawling frequency; if the position of the target information source falls in a second position range, adjusting the crawling frequency of the target information source to a fourth crawling frequency, wherein the fourth crawling frequency is smaller than the first crawling frequency and larger than the third crawling frequency, and the minimum position in the second position range is larger than the maximum position in the first position range; if the position of the target information source falls in a third position range, adjusting the crawling frequency of the target information source to a fifth crawling frequency, wherein the fifth crawling frequency is smaller than the first crawling frequency and larger than the fourth crawling frequency, and the minimum position in the third position range is larger than the maximum position in the second position range; and if the position of the target information source falls in a fourth position range, adjusting the crawling frequency of the target information source to a sixth crawling frequency, wherein the sixth crawling frequency is larger than the first crawling frequency and larger than the third crawling frequency, and the minimum position in the fourth position range is larger than the maximum position in the third position range.
In order to achieve the above object, according to another aspect of the present application, a device for adjusting data crawling frequency is provided. The device includes: a first crawling unit, used for performing data crawling on a plurality of information sources at a first crawling frequency through a crawler platform; a sorting unit, used for sorting the information sources according to the amount of data crawled; and a first adjusting unit, used for adjusting the crawling frequency for each information source based on its position in the sorted order.
Further, the first adjusting unit includes: a determining subunit, used for determining the position range in which the sorted position of each information source falls, wherein the position ranges are obtained by grouping, according to a preset rule, the positions produced by sorting the information sources by the amount of data crawled, and different position ranges correspond to different adjusted crawling frequencies; and an adjusting subunit, used for adjusting the crawling frequency for each information source according to the position range in which its position falls.
Further, the device includes: a storage unit, used for storing the data crawled from each information source in a platform layer of a database after data crawling is performed on the plurality of information sources at the first crawling frequency through the crawler platform; and/or an extraction unit, used for extracting data from the data crawled from each information source, before the crawling frequency for each information source is adjusted based on its position in the sorted order, and storing the extracted data in a vertical layer of the database; a calculation unit, used for calculating, for each information source, the ratio of the amount of data stored in the vertical layer to the amount of data stored in the platform layer; and a second adjusting unit, used for adjusting the crawling frequency for an information source according to its position in the sorted order when the ratio of that information source is greater than or equal to a preset ratio.
The present application adopts the following steps: performing data crawling on a plurality of information sources at a first crawling frequency through a crawler platform; sorting the information sources according to the amount of data crawled; and adjusting the crawling frequency for each information source based on its position in the sorted order. This solves the problem in the related art that, because the data crawling frequency of the crawler platform is fixed, crawler platform resources are wasted when data is crawled from different information sources. By adjusting the crawling frequency of each information source based on the amount of data crawled from it, platform resource waste is reduced and data is crawled intelligently and effectively.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
FIG. 1 is a flow chart of a method for adjusting data crawling frequency provided according to an embodiment of the present application; and
FIG. 2 is a schematic diagram of an apparatus for adjusting data crawling frequency according to an embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an embodiment of the application, a method for adjusting data crawling frequency is provided.
Fig. 1 is a flowchart of a method for adjusting data crawling frequency according to an embodiment of the present application. As shown in fig. 1, the method comprises the steps of:
Step S101: performing data crawling on a plurality of information sources at a first crawling frequency through a crawler platform.
Specifically, once an information source has been put into operation, the crawler platform crawls it normally, where normal crawling means that the crawler platform crawls data at the first crawling frequency. In general, the default data crawling frequency of the crawler platform is once every 3 hours; that is, the crawler platform performs one data crawling operation on every information source every 3 hours. In this embodiment the first crawling frequency is therefore once every 3 hours, but a crawler platform whose first crawling frequency is not once every 3 hours also falls within the scope of this embodiment.
Specifically, the crawler platform contains a plurality of information sources on which data crawling needs to be performed, and in general their number is large. The configuration of each information source contains basic web page information, such as the website name (name) and the website address (url), as well as crawler technical parameters, such as the maximum number of pages to turn, the crawling interval and the crawling mode. The crawler platform crawls web page information from the corresponding url according to the crawler technical parameters of each information source.
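An information source configuration of this kind might be represented as follows. This is a minimal sketch: the class name, the field names other than `name` and `url`, and the `crawl_mode` value are hypothetical, while the defaults of 5 pages and 3 hours are the ones stated in the background.

```python
from dataclasses import dataclass


@dataclass
class InformationSource:
    """One crawl target and its crawler technical parameters (illustrative)."""
    name: str                          # website name
    url: str                           # website address
    max_pages: int = 5                 # maximum number of pages to turn
    crawl_interval_hours: float = 3.0  # default crawling interval
    crawl_mode: str = "list"           # hypothetical placeholder value

    def __post_init__(self) -> None:
        # Reject configurations that could never be scheduled.
        if self.crawl_interval_hours <= 0:
            raise ValueError("crawl_interval_hours must be positive")
        if self.max_pages < 1:
            raise ValueError("max_pages must be at least 1")


source = InformationSource(name="Example News", url="https://example.com/news")
print(source.crawl_interval_hours, source.max_pages)
```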
Step S102: sorting the information sources according to the amount of data crawled.
Optionally, after data crawling is performed on the plurality of information sources at the first crawling frequency through the crawler platform, the method further includes: storing the data crawled from each information source in a platform layer of a database; and/or, before the crawling frequency for each information source is adjusted based on its position in the sorted order, the method further includes: extracting data from the data crawled from each information source, and storing the extracted data in a vertical layer of the database; calculating, for each information source, the ratio of the amount of data stored in the vertical layer to the amount of data stored in the platform layer; and, if the ratio of an information source is greater than or equal to a preset ratio, adjusting the crawling frequency for that information source based on its position in the sorted order.
Specifically, the crawler platform crawls data from web pages and stores it in the platform layer of the database. The crawled data contains web page content, such as the title, text, time and author, as well as information source metadata, such as the information source id and the information source label, so that the data crawled from a specified information source can be queried more conveniently.
Specifically, the query conditions for the data to be extracted may be configured in the front-end service module. The query conditions are configured according to project requirements and generally include labels, keywords (matched against the title or the text, as selected), time and the like. Data in the platform layer is extracted to the vertical layer according to these extraction conditions and displayed at the service front end. The ratio of the amount of data stored in the vertical layer to the amount of data stored in the platform layer is then calculated for each information source, and the operation of adjusting the data crawling frequency of an information source according to its position in the sorted order is executed only if the ratio reaches the preset ratio.
For example, the preset ratio is 10%. If the ratio of the vertical layer data amount to the platform layer data amount of a certain information source is lower than 10%, that information source provides little effective information, and adjusting its crawling frequency by rank has little value. The adjustment of the data crawling frequency according to position is therefore performed only on information sources whose ratio is at least 10%.
It should be noted that the preset ratio in the present embodiment may be set according to actual situations.
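The ratio check described above can be sketched as follows. The function names are hypothetical, the treatment of a source with no platform layer data (ratio taken as 0) is an assumption, and the 10% default is the example value from the text.

```python
def extraction_ratio(vertical_count: int, platform_count: int) -> float:
    """Ratio of vertical-layer records to platform-layer records for one source."""
    if platform_count == 0:
        # Assumption: a source with nothing in the platform layer has ratio 0.
        return 0.0
    return vertical_count / platform_count


def eligible_for_rank_adjustment(
    vertical_count: int, platform_count: int, preset_ratio: float = 0.10
) -> bool:
    """True if the source's ratio is greater than or equal to the preset ratio."""
    return extraction_ratio(vertical_count, platform_count) >= preset_ratio


print(eligible_for_rank_adjustment(25, 100))  # ratio 0.25, adjusted by rank
print(eligible_for_rank_adjustment(5, 100))   # ratio 0.05, handled separately
```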
Optionally, if there is a target information source whose ratio is smaller than the preset ratio, it is determined whether the target information source is in a preset information source library; if the target information source is in the information source library, its crawling frequency is adjusted to a second crawling frequency, wherein the second crawling frequency is smaller than the first crawling frequency; and if the target information source is not in the information source library, a reminder is triggered asking whether data crawling of the target information source should continue.
Specifically, if the ratio of an information source is smaller than the preset ratio, the effective information that this target information source can provide is considered too little, and it is necessary to determine whether the target information source is in a preset information source library. The information sources in the preset information source library are the information sources from which target information needs to be extracted. If the target information source is in the information source library, it is regarded as a source that can provide effective information, even if only a little, so its data crawling frequency can be reduced to the second crawling frequency. If the target information source is not in the information source library, a reminder can be sent to the system administrator asking whether data crawling of the target information source should continue; alternatively, if the website is highly influential, the administrator may consider entering a more specific page of the site, or filtering the data crawled from the target information source by keyword search, and adjust the configuration accordingly.
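The branch for low-ratio sources can be sketched as below. The function name, the action labels and the 12-hour value for the second crawling frequency are hypothetical; the application states only that the second frequency is lower than the first.

```python
def handle_low_ratio_source(
    source_id: str,
    source_library: set[str],
    current_interval_hours: float = 3.0,
    second_interval_hours: float = 12.0,  # hypothetical value, not from the text
) -> tuple[str, float]:
    """Decide what to do with a source whose ratio is below the preset ratio.

    Returns an (action, interval_hours) pair.
    """
    if source_id in source_library:
        # Still a wanted source: crawl it less often (a longer interval).
        return "lower_frequency", max(current_interval_hours, second_interval_hours)
    # Not a wanted source: keep the interval and ask the administrator.
    return "notify_admin", current_interval_hours


library = {"src-001", "src-002"}
print(handle_low_ratio_source("src-001", library))
print(handle_low_ratio_source("src-999", library))
```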
Step S103: adjusting the crawling frequency for each information source based on its position in the sorted order.
Optionally, adjusting the crawling frequency for each information source based on its position in the sorted order includes: determining the position range in which the sorted position of each information source falls, wherein the position ranges are obtained by grouping, according to a preset rule, the positions produced by sorting the information sources by the amount of data crawled, and different position ranges correspond to different adjusted crawling frequencies; and adjusting the crawling frequency for each information source based on the position range in which its position falls.
Specifically, all information sources are sorted by their data amount in the platform layer in ascending order, and the sorted information sources are divided into position ranges as follows: the first range covers the first 0-20% of the overall ranking (the sources with the least data), the second range covers 20%-50% of the overall ranking, the third range covers 50%-80% of the overall ranking, and the fourth range covers 80%-100% of the overall ranking (the sources with the most data).
It should be noted that the data amount of each information source in the platform layer is the total amount of data crawled from that source and stored in the platform layer within a specific time period. The time period is generally a 7-day cycle, but periods other than 7 days also fall within the scope of this embodiment.
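The sorting and grouping step can be sketched as follows. The function name is hypothetical, and two details are assumptions not fixed by the text: ties are broken by source id, and a source sitting exactly on a boundary (for example at 20%) is placed in the lower range.

```python
def assign_position_ranges(volumes: dict[str, int]) -> dict[str, int]:
    """Map each source id to position range 1-4 from its platform-layer volume.

    volumes holds the number of records stored per source over the chosen
    period (for example the last 7 days).
    """
    # Ascending order: the sources with the least data come first.
    ordered = sorted(volumes, key=lambda sid: (volumes[sid], sid))
    total = len(ordered)
    ranges: dict[str, int] = {}
    for position, sid in enumerate(ordered, start=1):
        # Integer arithmetic avoids floating-point surprises at the boundaries.
        percent_times_total = position * 100
        if percent_times_total <= 20 * total:
            ranges[sid] = 1
        elif percent_times_total <= 50 * total:
            ranges[sid] = 2
        elif percent_times_total <= 80 * total:
            ranges[sid] = 3
        else:
            ranges[sid] = 4
    return ranges


volumes = {f"s{i}": i * 10 for i in range(10)}  # s0 has the least data
print(assign_position_ranges(volumes))
```

With ten sources this places two in the first range, three in the second, three in the third and two in the fourth.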
Optionally, determining the position range in which the sorted position of each information source falls, and adjusting the crawling frequency for each information source based on that position range, includes: if the position of the target information source falls in the first position range, adjusting the crawling frequency of the target information source to a third crawling frequency, wherein the third crawling frequency is smaller than the first crawling frequency.
For example, the first position range covers the first 20% of the ranking. With a first crawling frequency of once every 3 hours, if the position of an information source, after sorting by platform layer data amount, falls within the first 0-20% of the ranking, the amount of new data that the source produces each day is extremely small, and the crawling interval of such sources is adjusted to 10 hours.
If the position of the target information source falls in the second position range, the crawling frequency of the target information source is adjusted to a fourth crawling frequency, wherein the fourth crawling frequency is smaller than the first crawling frequency and larger than the third crawling frequency, and the minimum position in the second position range is larger than the maximum position in the first position range.
For example, the second position range covers 20%-50% of the ranking. With a first crawling frequency of once every 3 hours, if the position of an information source, after sorting by platform layer data amount, falls within 20%-50% of the ranking, the source produces relatively little data each day, and its crawling interval is automatically adjusted to the fourth crawling frequency; that is, the crawling interval of sources ranked within 20%-50% is adjusted to between 6 and 10 hours.
If the position of the target information source falls in the third position range, the crawling frequency of the target information source is adjusted to a fifth crawling frequency, wherein the fifth crawling frequency is smaller than the first crawling frequency and larger than the fourth crawling frequency, and the minimum position in the third position range is larger than the maximum position in the second position range.
For example, the third position range covers 50%-80% of the ranking. With a first crawling frequency of once every 3 hours, if the position of an information source, after sorting by platform layer data amount, falls within 50%-80% of the ranking, the source produces a relatively large amount of data each day, and its crawling interval is automatically adjusted to the fifth crawling frequency; that is, the crawling interval of sources ranked within 50%-80% is adjusted to between 3 and 6 hours.
If the position of the target information source falls in the fourth position range, the crawling frequency of the target information source is adjusted to a sixth crawling frequency, wherein the sixth crawling frequency is larger than the first crawling frequency and larger than the third crawling frequency, and the minimum position in the fourth position range is larger than the maximum position in the third position range.
For example, the fourth position range covers 80%-100% of the ranking. With a first crawling frequency of once every 3 hours, if the position of an information source, after sorting by platform layer data amount, falls within 80%-100% of the ranking, the source produces a very large amount of data each day, and its crawling interval is automatically adjusted to the sixth crawling frequency; that is, the crawling interval of sources ranked within 80%-100% is adjusted to between 1 and 3 hours.
It should be noted that the specific value of the adjusted crawling interval of an information source may be determined according to the actual importance of the website.
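The mapping from position range to crawling interval can be sketched as below, using the interval bounds from the examples above. The `importance` parameter and the linear interpolation between the bounds are assumptions introduced for illustration; the application says only that the specific value depends on the importance of the website.

```python
# (shortest, longest) crawling interval in hours for each position range,
# taken from the examples in the text.
INTERVAL_BOUNDS_HOURS = {
    1: (10.0, 10.0),  # least data: crawl every 10 hours
    2: (6.0, 10.0),
    3: (3.0, 6.0),
    4: (1.0, 3.0),    # most data: crawl every 1 to 3 hours
}


def interval_for_range(position_range: int, importance: float = 0.0) -> float:
    """Pick a crawling interval (hours) for a source in the given position range.

    importance is a hypothetical score in [0, 1]; a more important site gets
    a shorter interval within the bounds of its range.
    """
    if position_range not in INTERVAL_BOUNDS_HOURS:
        raise ValueError(f"unknown position range: {position_range}")
    if not 0.0 <= importance <= 1.0:
        raise ValueError("importance must be between 0 and 1")
    shortest, longest = INTERVAL_BOUNDS_HOURS[position_range]
    return longest - importance * (longest - shortest)


for rng in (1, 2, 3, 4):
    print(rng, interval_for_range(rng, importance=0.5))
```

The bounds are treated as inclusive in this sketch, so the extreme importance values can yield exactly 3 hours for the third and fourth ranges; a configuration that must keep the strict ordering of frequencies stated above would choose values strictly inside the bounds.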
In this way, crawling frequencies are adjusted according to the ranking of the information sources by the amount of data crawled. The crawling frequency of information sources whose data is updated relatively slowly is lowered, which avoids wasting the crawling resources of the crawler platform; at the same time, the crawling frequency of information sources whose data is updated relatively quickly is raised, which avoids data being missed because the crawling frequency is too low.
In summary, the method for adjusting data crawling frequency provided by the embodiment of the application performs data crawling on a plurality of information sources at a first crawling frequency through a crawler platform, sorts the information sources according to the amount of data crawled, and adjusts the crawling frequency for each information source based on its position in the sorted order. This solves the problem in the related art that, because the data crawling frequency of the crawler platform is fixed, crawler platform resources are wasted when data is crawled from different information sources. By adjusting the crawling frequency of each information source based on the amount of data crawled from it, platform resource waste is reduced and data is crawled intelligently and effectively.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
The embodiment of the present application further provides an adjusting device of data crawling frequency, and it should be noted that the adjusting device of data crawling frequency of the embodiment of the present application can be used for executing the adjusting method for data crawling frequency provided by the embodiment of the present application. The following describes an apparatus for adjusting data crawling frequency according to an embodiment of the present application.
Fig. 2 is a schematic diagram of an apparatus for adjusting data crawling frequency according to an embodiment of the present application. As shown in fig. 2, the apparatus includes: the first crawling unit 201 is used for crawling data of a plurality of information sources through a crawler platform at a first crawling frequency; the sorting unit 202 is used for sorting each information source according to the crawled data volume; and the first adjusting unit 203 is configured to adjust a crawling frequency for data crawling on each source based on the position order of each source after the ranking.
According to the apparatus for adjusting the data crawling frequency provided by the embodiment of the present application, the first crawling unit 201 performs data crawling on a plurality of information sources at the first crawling frequency through the crawler platform; the sorting unit 202 sorts each information source according to the crawled data volume; and the first adjusting unit 203 adjusts the crawling frequency for data crawling on each information source based on the position order of each information source after sorting. This solves the problem in the related art that crawler platform resources are wasted when data is crawled from different information sources, because the data crawling frequency of the crawler platform is fixed. By adjusting the crawling frequency of each information source based on the data volume crawled from it, the apparatus reduces the waste of platform resources and crawls data intelligently and effectively.
Optionally, the first adjusting unit 203 includes: a determining subunit, configured to determine the position range to which the position order of each sorted information source belongs, wherein the position ranges are obtained by grouping, according to a preset rule, the position orders obtained by sorting the information sources by crawled data volume, and different position ranges correspond to different crawling frequencies for crawling data from the information sources; and an adjusting subunit, configured to adjust the crawling frequency for data crawling on each information source according to the position range to which the position order of each information source belongs.
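The grouping performed by the determining subunit can be sketched as follows. The application only states that position orders are grouped "according to a preset rule"; the equal-sized grouping used here, the function name `position_range` and the default of four groups are assumptions chosen for illustration.

```python
def position_range(position: int, total: int, groups: int = 4) -> int:
    """Map a 1-based position order to a 1-based position range index.

    Assumed preset rule: split the sorted list into `groups` ranges of
    (nearly) equal size, so that lower positions fall into lower ranges.
    """
    if total < 1 or groups < 1:
        raise ValueError("total and groups must be positive")
    if not 1 <= position <= total:
        raise ValueError("position must be between 1 and total")
    return (position - 1) * groups // total + 1


if __name__ == "__main__":
    # Eight sorted sources split into four position ranges of two sources each.
    print([position_range(p, 8) for p in range(1, 9)])
```

Any other monotone rule (fixed cut-off positions, percentiles and so on) would fit the description equally well, as long as every position in a higher range is greater than every position in a lower range.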
Optionally, the apparatus further includes a storage unit, configured to store, in the database platform layer, the data information crawled for each information source after the crawler platform performs data crawling on the plurality of information sources at the first crawling frequency; and/or the extraction unit is used for extracting the data amount crawled by each information source before adjusting the crawling frequency for crawling the data of each information source based on the position sequence of each information source after sequencing, and storing the extracted data into the vertical layer of the database; the calculation unit is used for calculating the ratio of the data quantity stored in the vertical layer of the database and the data quantity stored in the platform layer of the database of each information source; and the second adjusting unit is used for adjusting the crawling frequency for crawling the data of each information source according to the position sequence of each information source after sequencing under the condition that the ratio of each information source is greater than or equal to the preset ratio.
Optionally, the apparatus further includes: a determining unit, configured to determine, when there is a target information source whose ratio is smaller than the preset ratio, whether the target information source is in a preset information source library; a third adjusting unit, configured to adjust the crawling frequency of the target information source to a second crawling frequency if the target information source is in the information source library, wherein the second crawling frequency is smaller than the first crawling frequency; and a triggering unit, configured to trigger a reminder asking whether data crawling on the target information source needs to continue if the target information source is not in the information source library.
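The ratio check and the handling of a low-ratio target information source can be sketched as a small decision function. All names (`Action`, `extraction_ratio`, `decide`), the per-source evaluation and the treatment of an empty platform layer as ratio 0 are assumptions for illustration; the application does not specify these details.

```python
from enum import Enum
from typing import Optional, Set, Tuple


class Action(Enum):
    ADJUST_BY_RANK = "adjust_by_rank"              # ratio >= preset ratio
    USE_SECOND_FREQUENCY = "use_second_frequency"  # low ratio, source in library
    REMIND_OPERATOR = "remind_operator"            # low ratio, source not in library


def extraction_ratio(vertical_count: int, platform_count: int) -> float:
    """Ratio of data stored in the vertical layer to data stored in the platform layer."""
    if platform_count == 0:
        return 0.0  # assumption: a source with no crawled data has ratio 0
    return vertical_count / platform_count


def decide(
    source: str,
    vertical_count: int,
    platform_count: int,
    preset_ratio: float,
    source_library: Set[str],
    second_frequency: float,
) -> Tuple[Action, Optional[float]]:
    """Return the action for one source and, where applicable, its new frequency."""
    if extraction_ratio(vertical_count, platform_count) >= preset_ratio:
        return Action.ADJUST_BY_RANK, None
    if source in source_library:
        # The second crawling frequency is smaller than the first crawling frequency.
        return Action.USE_SECOND_FREQUENCY, second_frequency
    return Action.REMIND_OPERATOR, None


if __name__ == "__main__":
    library = {"blog"}
    print(decide("news", 80, 100, 0.5, library, 2.0))
    print(decide("blog", 10, 100, 0.5, library, 2.0))
    print(decide("forum", 10, 100, 0.5, library, 2.0))
```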
Optionally, the adjusting subunit further comprises: a first adjusting module, configured to adjust the crawling frequency of the target information source to a third crawling frequency if the position range to which the position order of the target information source belongs is a first position range, wherein the third crawling frequency is smaller than the first crawling frequency; a second adjusting module, configured to adjust the crawling frequency of the target information source to a fourth crawling frequency if the position range to which the position order of the target information source belongs is a second position range, wherein the fourth crawling frequency is smaller than the first crawling frequency and larger than the third crawling frequency, and the minimum value of the position order in the second position range is larger than the maximum value of the position order in the first position range; a third adjusting module, configured to adjust the crawling frequency of the target information source to a fifth crawling frequency if the position range to which the position order of the target information source belongs is a third position range, wherein the fifth crawling frequency is smaller than the first crawling frequency and larger than the fourth crawling frequency, and the minimum value of the position order in the third position range is larger than the maximum value of the position order in the second position range; and a fourth adjusting module, configured to adjust the crawling frequency of the target information source to a sixth crawling frequency if the position range to which the position order of the target information source belongs is a fourth position range, wherein the sixth crawling frequency is larger than the first crawling frequency, the sixth crawling frequency is larger than the third crawling frequency, and the minimum value of the position order in the fourth position range is larger than the maximum value of the position order in the third position range.
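The relationship between the four position ranges and the adjusted frequencies can be sketched with a lookup table. The multipliers below are hypothetical values chosen only so that the stated ordering holds (third < fourth < fifth < first < sixth); the application does not give concrete numbers.

```python
# Hypothetical multipliers applied to the first crawling frequency.
# Range 1 -> third frequency, 2 -> fourth, 3 -> fifth, 4 -> sixth.
RANGE_MULTIPLIERS = {1: 0.25, 2: 0.5, 3: 0.75, 4: 1.5}


def frequency_for_range(range_index: int, first_frequency: float) -> float:
    """Return the adjusted crawling frequency for a 1-based position range."""
    if range_index not in RANGE_MULTIPLIERS:
        raise ValueError("range_index must be 1, 2, 3 or 4")
    return first_frequency * RANGE_MULTIPLIERS[range_index]


if __name__ == "__main__":
    first = 8.0  # first crawling frequency, e.g. crawls per day (unit assumed)
    third, fourth, fifth, sixth = (frequency_for_range(i, first) for i in (1, 2, 3, 4))
    print(third, fourth, fifth, sixth)
```

Only the sources in the highest position range are crawled more often than before; all other ranges are throttled to different degrees, which is where the resource saving comes from.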
The above-mentioned adjusting device for data crawling frequency comprises a processor and a memory, the above-mentioned first crawling unit 201, the sorting unit 202, the first adjusting unit 203, etc. are all stored in the memory as program units, and the processor executes the above-mentioned program units stored in the memory to implement corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided. By adjusting kernel parameters, the problem in the related art that crawler platform resources are wasted when data is crawled from different information sources, because the data crawling frequency of the crawler platform is fixed, is solved.
The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing a method of adjusting a data crawling frequency when executed by a processor.
The embodiment of the invention provides a processor, which is used for running a program, wherein the program executes a method for adjusting data crawling frequency during running.
The embodiment of the invention provides equipment, which comprises a processor, a memory and a program which is stored on the memory and can run on the processor, wherein the processor executes the program and realizes the following steps: performing data crawling on a plurality of information sources at a first crawling frequency through a crawler platform; sequencing each information source according to the crawled data volume; and adjusting the crawling frequency for crawling data of each information source based on the position sequence of each information source after sequencing.
Optionally, based on the position order of each source after sorting, adjusting the crawling frequency of data crawling on each source includes: determining the position range of the position sequence of each sequenced information source, wherein the position range is obtained by grouping the position sequences obtained by sequencing the data amount crawled by the information sources according to a preset rule, and different position ranges correspond to different crawling frequencies for adjusting the data crawled by the information sources; and adjusting the crawling frequency for crawling data of each information source based on the position range of the position sequence of each information source.
Optionally, after performing data crawling on the plurality of information sources at the first crawling frequency through the crawler platform, the method further includes: storing the data information crawled from each information source into a database platform layer; and/or, before adjusting the crawling frequency of data crawling on each source based on the position order of each source after sorting, the method further comprises the following steps: extracting the data amount crawled by each information source, and storing the extracted data in a vertical layer of a database; calculating the ratio of the data quantity stored in the vertical layer of the database and the data quantity stored in the platform layer of the database of each information source; and if the ratio of each information source is larger than or equal to the preset ratio, adjusting the crawling frequency for crawling the data of each information source based on the position sequence of each information source after sequencing.
Optionally, if there is a target information source whose ratio is smaller than the preset ratio, it is determined whether the target information source is in a preset information source library; if the target information source is in the information source library, the crawling frequency of the target information source is adjusted to a second crawling frequency, wherein the second crawling frequency is smaller than the first crawling frequency; and if the target information source is not in the information source library, a reminder is triggered asking whether data crawling on the target information source needs to continue.
Optionally, determining a position range of the position order of each source after the ranking, and adjusting a crawling frequency for data crawling on each source based on the position range of the position order of each source includes: if the position range of the position order of the target information source is a first position range, adjusting the crawling frequency of the target information source to be a third crawling frequency, wherein the third crawling frequency is smaller than the first crawling frequency; if the position range of the position order of the target information source is a second position range, adjusting the crawling frequency of the target information source to be a fourth crawling frequency, wherein the fourth crawling frequency is smaller than the first crawling frequency and larger than the third crawling frequency, and the minimum value of the position order in the second position range is larger than the maximum value of the position order in the first position range; if the position range of the position order of the target information source is a third position range, adjusting the crawling frequency of the target information source to be a fifth crawling frequency, wherein the fifth crawling frequency is smaller than the first crawling frequency and larger than the fourth crawling frequency, and the minimum value of the position order in the third position range is larger than the maximum value of the position order in the second position range; if the position range of the position order of the target information source is a fourth position range, adjusting the crawling frequency of the target information source to be a sixth crawling frequency, wherein the sixth crawling frequency is larger than the first crawling frequency, the sixth crawling frequency is larger than the third crawling frequency, and the minimum value of the position order in the fourth position range is larger than the maximum value of the position order in the third position range.
The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: performing data crawling on a plurality of information sources at a first crawling frequency through a crawler platform; sorting each information source according to the crawled data volume; and adjusting the crawling frequency for data crawling on each information source based on the position order of each information source after sorting.
Optionally, based on the position order of each source after sorting, adjusting the crawling frequency of data crawling on each source includes: determining the position range of the position sequence of each sequenced information source, wherein the position range is obtained by grouping the position sequences obtained by sequencing the data amount crawled by the information sources according to a preset rule, and different position ranges correspond to different crawling frequencies for adjusting the data crawled by the information sources; and adjusting the crawling frequency for crawling data of each information source based on the position range of the position sequence of each information source.
Optionally, after performing data crawling on the plurality of information sources at the first crawling frequency through the crawler platform, the method further includes: storing the data information crawled from each information source into a database platform layer; and/or, before adjusting the crawling frequency of data crawling on each source based on the position order of each source after sorting, the method further comprises the following steps: extracting the data amount crawled by each information source, and storing the extracted data in a vertical layer of a database; calculating the ratio of the data quantity stored in the vertical layer of the database and the data quantity stored in the platform layer of the database of each information source; and if the ratio of each information source is larger than or equal to the preset ratio, adjusting the crawling frequency for crawling the data of each information source based on the position sequence of each information source after sequencing.
Optionally, if there is a target information source whose ratio is smaller than the preset ratio, it is determined whether the target information source is in a preset information source library; if the target information source is in the information source library, the crawling frequency of the target information source is adjusted to a second crawling frequency, wherein the second crawling frequency is smaller than the first crawling frequency; and if the target information source is not in the information source library, a reminder is triggered asking whether data crawling on the target information source needs to continue.
Optionally, determining a position range of the position order of each source after the ranking, and adjusting a crawling frequency for data crawling on each source based on the position range of the position order of each source includes: if the position range of the position order of the target information source is a first position range, adjusting the crawling frequency of the target information source to be a third crawling frequency, wherein the third crawling frequency is smaller than the first crawling frequency; if the position range of the position order of the target information source is a second position range, adjusting the crawling frequency of the target information source to be a fourth crawling frequency, wherein the fourth crawling frequency is smaller than the first crawling frequency and larger than the third crawling frequency, and the minimum value of the position order in the second position range is larger than the maximum value of the position order in the first position range; if the position range of the position order of the target information source is a third position range, adjusting the crawling frequency of the target information source to be a fifth crawling frequency, wherein the fifth crawling frequency is smaller than the first crawling frequency and larger than the fourth crawling frequency, and the minimum value of the position order in the third position range is larger than the maximum value of the position order in the second position range; if the position range of the position order of the target information source is a fourth position range, adjusting the crawling frequency of the target information source to be a sixth crawling frequency, wherein the sixth crawling frequency is larger than the first crawling frequency, the sixth crawling frequency is larger than the third crawling frequency, and the minimum value of the position order in the fourth position range is larger than the maximum value of the position order in the third position range.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for adjusting data crawling frequency is characterized by comprising the following steps:
performing data crawling on a plurality of information sources at a first crawling frequency through a crawler platform;
sequencing each information source according to the crawled data volume;
and adjusting the crawling frequency for crawling data of each information source based on the position sequence of each information source after sequencing.
2. The method of claim 1, wherein adjusting a crawl frequency for data crawl of each source based on the ranked position order of each source comprises:
determining the position range of the position sequence of each sequenced information source, wherein the position range is obtained by grouping the position sequences obtained by sequencing the data amount crawled by the information sources according to a preset rule, and different position ranges correspond to different crawling frequencies for adjusting the data crawled by the information sources;
and adjusting the crawling frequency for crawling data of each information source based on the position range of the position sequence of each information source.
3. The method of claim 2,
after performing data crawling on the plurality of information sources at the first crawling frequency through the crawler platform, the method further comprises: storing the data information crawled from each information source into a database platform layer; and/or,
before adjusting the crawling frequency for data crawling on each information source based on the position order of each information source after sorting, the method further comprises the following steps:
extracting the data amount crawled by each information source, and storing the extracted data in a vertical layer of a database;
calculating a ratio of an amount of data stored by each source in the database vertical tier to an amount of data stored in the database platform tier;
and if the ratio of each information source is larger than or equal to a preset ratio, adjusting the crawling frequency for crawling data of each information source based on the position sequence of each information source after sequencing.
4. The method of claim 3, further comprising:
if the target information source with the ratio smaller than the preset ratio exists, judging whether the target information source is in a preset information source library or not;
if the target information source is in the information source library, the crawling frequency of the target information source is adjusted to be a second crawling frequency, wherein the second crawling frequency is smaller than the first crawling frequency;
and if the target information source is not in the information source library, triggering a reminder asking whether data crawling on the target information source needs to continue.
5. The method of claim 2, wherein determining a position range of the position order of each source after the ranking, and adjusting the crawling frequency for data crawling on each source based on the position range of the position order of each source comprises:
if the position range of the position sequence of the target information source is a first position range, adjusting the crawling frequency of the target information source to be a third crawling frequency, wherein the third crawling frequency is smaller than the first crawling frequency;
if the position range of the position order of the target information source is a second position range, adjusting the crawling frequency of the target information source to be a fourth crawling frequency, wherein the fourth crawling frequency is smaller than the first crawling frequency and larger than the third crawling frequency, and the minimum value of the position order in the second position range is larger than the maximum value of the position order in the first position range;
if the position range of the position order of the target information source is a third position range, adjusting the crawling frequency of the target information source to be a fifth crawling frequency, wherein the fifth crawling frequency is smaller than the first crawling frequency and larger than the fourth crawling frequency, and the minimum value of the position order in the third position range is larger than the maximum value of the position order in the second position range;
if the position range of the position order of the target information source is a fourth position range, the crawling frequency of the target information source is adjusted to be a sixth crawling frequency, wherein the sixth crawling frequency is larger than the first crawling frequency, the sixth crawling frequency is larger than the third crawling frequency, and the minimum value of the position order in the fourth position range is larger than the maximum value of the position order in the third position range.
6. An apparatus for adjusting data crawling frequency, comprising:
the first crawling unit is used for crawling data of the plurality of information sources at a first crawling frequency through the crawler platform;
the sorting unit is used for sorting each information source according to the crawled data volume;
and the first adjusting unit is used for adjusting the crawling frequency for crawling the data of each information source based on the position sequence of each information source after sequencing.
7. The apparatus of claim 6, wherein the first adjusting unit comprises:
the determining subunit is used for determining the position range of the position order of each sorted information source, wherein the position range is obtained by grouping, according to a preset rule, the position orders obtained by sorting the information sources by crawled data volume, and different position ranges correspond to different crawling frequencies for crawling data from the information sources;
and the adjusting subunit is used for adjusting the crawling frequency for crawling the data of each information source according to the position range of the position sequence of each information source.
8. The apparatus of claim 7, further comprising:
the storage unit is used for storing the data information crawled from each information source into a database platform layer after the crawler platform performs data crawling on the plurality of information sources at the first crawling frequency; and/or,
the extraction unit is used for extracting the data amount crawled by each information source before adjusting the crawling frequency for crawling the data of each information source based on the position sequence of each information source after sequencing, and storing the extracted data into a vertical layer of a database;
a calculation unit for calculating a ratio of an amount of data stored in the database vertical layer to an amount of data stored in the database platform layer for each source;
and a second adjusting unit which performs a step of adjusting a crawling frequency for data crawling on each information source according to the position order of each information source after sorting under the condition that the ratio of each information source is greater than or equal to a preset ratio.
9. A storage medium characterized by comprising a stored program, wherein the program executes a method of adjusting data crawling frequency according to any one of claims 1 to 5.
10. A processor, configured to execute a program, wherein the program executes a method for adjusting data crawling frequency according to any one of claims 1 to 5.
CN201811290140.1A 2018-10-31 2018-10-31 Method and device for adjusting data crawling frequency, storage medium and processor Active CN111125482B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811290140.1A CN111125482B (en) 2018-10-31 2018-10-31 Method and device for adjusting data crawling frequency, storage medium and processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811290140.1A CN111125482B (en) 2018-10-31 2018-10-31 Method and device for adjusting data crawling frequency, storage medium and processor

Publications (2)

Publication Number Publication Date
CN111125482A true CN111125482A (en) 2020-05-08
CN111125482B CN111125482B (en) 2023-04-07

Family

ID=70494308

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811290140.1A Active CN111125482B (en) 2018-10-31 2018-10-31 Method and device for adjusting data crawling frequency, storage medium and processor

Country Status (1)

Country Link
CN (1) CN111125482B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090006362A1 (en) * 2007-06-28 2009-01-01 International Business Machines Corporation Hierarchical seedlists for application data
CN102929671A (en) * 2012-10-31 2013-02-13 北京奇虎科技有限公司 Server, application upgrade method and application upgrade system
CN106127503A (en) * 2016-06-06 2016-11-16 广州市邦富软件有限公司 Network information analysis method based on real social relationships and big data
CN107590188A (en) * 2017-08-08 2018-01-16 杭州灵皓科技有限公司 Crawler crawling method for automated vertical subdivided fields and management system thereof

Also Published As

Publication number Publication date
CN111125482B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN106897334B (en) Question pushing method and equipment
CN104346354B (en) Method and device for providing recommended words
US8380680B2 (en) Piecemeal list prefetch
CN108549569B (en) Method and equipment for searching information in application program
CN109669776B (en) Detection task processing method, device and system
CN106202092B (en) Data processing method and system
CN110020339B (en) Webpage data acquisition method and device based on non-buried point
KR102391839B1 (en) Method and device for processing user personal, server and storage medium
CN112749863A (en) Keyword price adjusting method and device and electronic equipment
CN106648839B (en) Data processing method and device
CN108121712B (en) Keyword storage method and device
CN111784468A (en) Account association method and device and electronic equipment
CN106610989B (en) Search keyword clustering method and device
CN109992470B (en) Threshold value adjusting method and device
CN109947713B (en) Log monitoring method and device
CN110147473B (en) Crawling method and device for crawler
CN112148972A (en) Method and device for screening information to be recommended
CN111125482B (en) Method and device for adjusting data crawling frequency, storage medium and processor
CN113794727B (en) Threat information feature library generation method, threat information feature library generation device, storage medium and processor
CN110889065B (en) Page stay time determination method, device and equipment
CN109597743B (en) Page circling method, click rate statistical method and related equipment
CN110019295B (en) Database retrieval method, device, system and storage medium
CN110968993A (en) Information processing method and device, storage medium and processor
CN106776654B (en) Data searching method and device
CN111125087A (en) Data storage method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant