CN112231538A - Method, device, equipment and storage medium for updating scheduling task queue

Info

Publication number
CN112231538A
CN112231538A (application CN202011470038.7A)
Authority
CN
China
Prior art keywords
scheduling
seed
text data
seed page
data
Prior art date
Legal status
Granted
Application number
CN202011470038.7A
Other languages
Chinese (zh)
Other versions
CN112231538B (en)
Inventor
宋为刚
王轶
刘永超
Current Assignee
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Suzhou Software Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by China Mobile Suzhou Software Technology Co Ltd filed Critical China Mobile Suzhou Software Technology Co Ltd
Priority to CN202011470038.7A
Publication of CN112231538A
Application granted
Publication of CN112231538B
Status: Active

Classifications

    • G06F16/951 Indexing; Web crawling techniques (Information retrieval; Retrieval from the web)
    • G06F16/35 Clustering; Classification (Information retrieval of unstructured textual data)
    • G06F9/546 Message passing systems or structures, e.g. queues (Program control; Multiprogramming arrangements; Interprogram communication)
    • G06F2209/548 Queue (Indexing scheme relating to G06F9/54)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method, a device, equipment, and a storage medium for updating a scheduling task queue. The method comprises the following steps: acquiring historical text data generated by at least one seed page to be crawled; predicting, according to the historical text data, scheduling parameters for crawling each seed page; analyzing the content of the historical text data to determine an information source score for each seed page, where the source score characterizes the quality of the content contained in that seed page; dynamically adjusting the scheduling parameters according to the source scores; and updating the scheduling task queue according to the adjusted scheduling parameters. The method and the device solve the problems in the related art of excessive resource consumption and poor scheduling performance caused by failing to distinguish high-quality, highly relevant information sources from ordinary ones.

Description

Method, device, equipment and storage medium for updating scheduling task queue
Technical Field
The present application relates to the field of computer network technology, and relates to, but is not limited to, a method, an apparatus, a device, and a storage medium for updating a scheduling task queue.
Background
With the explosive growth of internet information, the traditional way web crawlers collect data has gradually shown its disadvantages. In a first related technical scheme, the weight of a crawled site is set in advance, a refresh interval is set according to that weight, and the scheduling task queue is updated according to the refresh interval. Because this scheme schedules with a preset refresh interval, the scheduling parameters are fixed and resources cannot be allocated flexibly. A second scheme acquires the historical data parameters and processed parameters of a seed page to be crawled (a list page containing multiple pieces of link information, each corresponding to a content page), determines the scheduling parameters for the next crawl through a trained prediction model or a certain prediction rule, and updates the scheduling task queue according to those parameters. This scheme predicts the current scheduling parameters only from the historical data generated by the scheduling crawler: it focuses on scheduling data, does not mine the value of the text information captured from the content pages, and therefore cannot reflect that text information in the scheduling of the seed pages. As a result, high-quality, highly relevant information sources are not distinguished from ordinary ones, which leads to excessive resource consumption and poor scheduling performance.
Disclosure of Invention
In view of this, embodiments of the present application provide a method, an apparatus, a device, and a storage medium for updating a scheduling task queue to solve at least one problem in the related art, in particular the problems of excessive resource consumption and poor scheduling performance caused by failing to distinguish high-quality, highly relevant information sources from ordinary ones.
The technical scheme of the application is realized as follows:
in a first aspect, the present application provides a method for updating a scheduling task queue, where the method includes:
acquiring historical text data generated by at least one seed page to be crawled;
predicting, according to the historical text data, scheduling parameters for crawling each seed page;
analyzing the content of the historical text data, and determining an information source score for each seed page, where the source score characterizes the quality of the content contained in that seed page;
dynamically adjusting the scheduling parameters according to the information source scores;
and updating the scheduling task queue according to the adjusted scheduling parameters.
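For orientation only, the following Python sketch strings the five steps above together; every callable it takes after `seeds` (load_history, predict, score, adjust, reschedule) is an illustrative placeholder rather than an interface defined by this application.

```python
def update_scheduling_task_queue(seeds, load_history, predict, score, adjust, reschedule):
    """Walk every seed page through the five steps of the first aspect.
    All arguments after `seeds` are injected placeholder callables."""
    for seed in seeds:
        history = load_history(seed)                  # step 1: historical text data
        n_schedules = predict(history)                # step 2: predicted scheduling parameter
        src_score = score(history)                    # step 3: content analysis -> source score
        n_schedules = adjust(n_schedules, src_score)  # step 4: dynamic adjustment
        reschedule(seed, n_schedules)                 # step 5: update the scheduling task queue
```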
In a second aspect, the present application provides an apparatus for updating a scheduling task queue, including an obtaining module, a predicting module, a first determining module, an adjusting module, and an updating module, where:
the acquisition module is used for acquiring historical text data generated by at least one seed page to be crawled;
the prediction module is used for predicting, according to the historical text data, the scheduling parameter for crawling each seed page;
the first determining module is used for analyzing the content of the historical text data and determining the information source score of each seed page, where the source score characterizes the quality of the content contained in each seed page;
the adjusting module is used for dynamically adjusting the scheduling parameters according to the information source scores;
and the updating module is used for updating the scheduling task queue according to the adjusted scheduling parameters.
In a third aspect, the present application provides an apparatus for updating a scheduled task queue, including a memory and a processor, where the memory stores a computer program operable on the processor, and the processor implements the steps in the method for updating a scheduled task queue when executing the program.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps in the above method of updating a scheduled task queue.
The beneficial effect that technical scheme that this application provided brought includes at least:
In the application, historical text data generated by at least one seed page to be crawled is first obtained; scheduling parameters for crawling each seed page are then predicted from the historical text data; content analysis is performed on the historical text data to determine the information source score of each seed page; the scheduling parameters are dynamically adjusted according to the source scores; and finally the scheduling task queue is updated according to the adjusted scheduling parameters. In this way, the scheduling parameters of the seed pages are predicted by mining the value of the historical text data, the content value of the text data generated by the crawler is mined to obtain the source score of each seed page, and the predicted scheduling parameters are dynamically adjusted based on those scores, achieving hierarchical collection and timely, efficient scheduling and improving resource utilization and the effective collection rate.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from them without inventive effort, wherein:
fig. 1 is a schematic flowchart of a method for updating a scheduling task queue according to an embodiment of the present application;
fig. 2 is a flowchart illustrating another method for updating a scheduling task queue according to an embodiment of the present application;
fig. 3 is a flowchart illustrating a further method for updating a scheduling task queue according to an embodiment of the present application;
fig. 4 is a flowchart illustrating a method for updating a scheduling task queue according to an embodiment of the present application;
FIG. 5 is a block diagram of a method for updating a scheduling task queue according to an embodiment of the present disclosure;
fig. 6 is a flowchart illustrating a scheduling parameter prediction method according to an embodiment of the present application;
fig. 7 is a schematic flowchart of dynamically adjusting scheduling parameters according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating a crawler scheduling process according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram illustrating a component structure of an apparatus for updating a scheduling task queue according to an embodiment of the present application;
fig. 10 is a hardware entity diagram of an apparatus for updating a scheduled task queue according to an embodiment of the present application.
Detailed Description
To make the purpose, technical solutions, and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are some, but not all, of the embodiments of the present application. The following examples are intended to illustrate the present application but not to limit its scope. All other embodiments obtained by a person skilled in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
It should be noted that the terms "first/second/third" in the embodiments of the present application are only used to distinguish similar objects and do not imply a specific ordering of those objects. It should be understood that "first/second/third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the present application described herein can be implemented in orders other than those illustrated or described herein.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which embodiments of the present application belong. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
A search engine typically provides a minute-level real-time index so that strongly time-sensitive web page information, such as news items on news websites, video updates on video websites, and popular posts on forums, can be shown to users in time. To obtain such time-sensitive web pages promptly, the search engine needs to maintain a batch of seed pages (also called list pages). A seed page is effectively an index page for content pages; taking a news website as an example, the seed page displays a number of news titles, each title links to one content page, and the content page displays the specific news content corresponding to that title. Generally, a search engine captures the content page data required by a seed page at a preset refresh interval through a web crawler system and updates the last captured seed page, so that new content pages can be shown to users in time and the timeliness of the related website is guaranteed.
In a first related technical scheme, the weight of a crawled site is set in advance, a refresh interval is set according to that weight, and the scheduling task queue is updated according to the refresh interval. This approach captures web page information at a fixed frequency; once the crawler is started, the scheduling parameters remain static.
The second scheme is to acquire historical data parameters and processed parameters of a seed page to be crawled (a list page containing a plurality of link information, the link information corresponds to a content page), such as time and resources consumed in capturing webpage data, generated data volume, acquired page parameters and the like, then determine scheduling parameters of the next crawled webpage through a trained prediction model or a certain prediction rule, and update a scheduling task queue according to the scheduling parameters.
However, the first scheme schedules with a preset refresh interval, so the scheduling parameters are fixed and resources cannot be allocated flexibly. The second scheme predicts the current scheduling parameters only from the historical data generated by the scheduling crawler: it focuses on scheduling data, does not mine the value of the text information captured from the content pages, and therefore cannot reflect that text information in the scheduling of the seed pages, so high-quality, highly relevant information sources are not distinguished from ordinary ones.
People publish a huge amount of information on the internet every day, and traditional websites such as the various web portals account for only part of it. With the arrival of the mobile internet era, everyone has become their own media channel, and self-media platforms such as Tieba, Toutiao, and Weibo have already become important venues for publishing information on the internet. The quality of this information varies widely, and capturing it without distinction wastes resources.
The embodiment of the application provides a method for updating a scheduling task queue, which can be applied to a device. The functions implemented by the method may be implemented by a processor in the device calling program code; of course, the program code may be stored in a computer storage medium, so the device at least includes a processor and a storage medium. The processor may be used to handle the crawler scheduling process, and the memory may be used to store the data required and generated during crawler scheduling.
Fig. 1 is a schematic flowchart of a method for updating a scheduling task queue according to an embodiment of the present application, where as shown in fig. 1, the method at least includes the following steps:
step S110, obtaining historical text data generated by at least one seed page to be crawled.
Here, the seed page is a seed resource to be crawled, for example, a list page including many pieces of link information, and the pieces of link information respectively have corresponding detail pages.
Here, the historical text data is data information obtained by crawling one or more seed pages in a historical period, and may include detail page link information obtained by parsing the seed pages and accessed detail page contents, or may include scheduling data information generated in a scheduling process.
Step S120, predicting, according to the historical text data, the scheduling parameter for crawling each seed page.
Here, the scheduling parameter is scheduling data information generated in a scheduling process, such as information of scheduling times, scheduling time, repetition rate interval, and the like in each period.
Here, the prediction may be performed before each scheduling period starts, predicting the scheduling parameter of the current period from the scheduling parameters of the same period in past days.
Illustratively, linear fitting is performed on the scheduling counts of the same time period over the past N days to obtain the weight coefficients of the prediction model's linear function, and the output value of that linear function on day N + 1 is then calculated and used as the number of schedules for the current time period. Considering that the distribution of source information generally follows a weekly pattern, the parameter N can be set to 7 empirically.
Step S130, performing content analysis on the historical text data, and determining an information source score of each seed page.
Here, the source score characterizes the quality of the content contained in each of the seed pages.
It can be understood that content mining is performed on the text data obtained from historical crawls of the seed pages, so that the text information contained in the detail pages related to a seed page is reflected in the scheduling of that seed page. High-quality, highly relevant source seeds and ordinary source seeds are thereby scored to form the information source score of each seed page, and capture is carried out hierarchically.
And step S140, dynamically adjusting the scheduling parameters according to the information source scores.
Here, on the basis of the scheduling parameters predicted from the historical text data, the information source score is also taken into account, further mining the value of the scheduling history data, and the scheduling parameters are dynamically adjusted for seed pages with different source scores. This achieves hierarchical collection and timely, efficient scheduling, improves the quality and speed of content capture, and saves a large amount of resources.
And step S150, updating the scheduling task queue according to the adjusted scheduling parameters.
Here, the scheduling task queue includes seed pages to be crawled and scheduling time corresponding to each seed page. And when the scheduling times or the scheduling time corresponding to the seed page to be crawled are changed, correspondingly changing the scheduling time in the scheduling task queue. Therefore, the scheduling program selects the seed page to be scheduled at the current moment according to the updated scheduling task queue.
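As an illustration of such a queue, the sketch below keeps seed pages in a min-heap ordered by their next scheduled crawl time; the class and method names are assumptions made for this example, not terms defined by the application.

```python
import heapq
import time

class SchedulingTaskQueue:
    """Seed pages ordered by the wall-clock time of their next scheduled crawl."""

    def __init__(self):
        self._heap = []  # entries are (scheduled_time, seed_url) tuples

    def reschedule(self, seed_url, scheduled_times):
        """Register a seed page under its (possibly adjusted) schedule."""
        for t in scheduled_times:
            heapq.heappush(self._heap, (t, seed_url))

    def pop_due(self, now=None):
        """Return every seed page whose scheduled time has arrived."""
        now = time.time() if now is None else now
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due
```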
In the embodiment of the application, historical text data generated by at least one seed page to be crawled is first obtained; scheduling parameters for crawling each seed page are then predicted from the historical text data; content analysis is performed on the historical text data to determine the information source score of each seed page; the scheduling parameters are dynamically adjusted according to the source scores; and finally the scheduling task queue is updated according to the adjusted scheduling parameters. In this way, the scheduling parameters of the seed pages are predicted by mining the value of the historical text data, the content value of the text data generated by the crawler is mined to obtain the source score of each seed page, and the predicted scheduling parameters are dynamically adjusted based on those scores, achieving hierarchical collection and timely, efficient scheduling and improving resource utilization and the effective collection rate.
In some embodiments, the scheduling parameter is a scheduling number in a unit time period, and fig. 2 is a flowchart of another method for updating a scheduling task queue according to an embodiment of the present application, as shown in fig. 2, where the method at least includes the following steps:
step S210, obtaining historical text data generated by at least one seed page to be crawled.
Step S220, determining a historical scheduling frequency in a time period corresponding to the current time period from the historical text data.
Here, the historical scheduling count for the same time period as the current one may be acquired from the historical database. Taking seed-page maintenance for an information website as an example: 7 a.m. to 11 a.m. (or 2 p.m. to 5 p.m.) on working days is the peak period for editing and publishing news on the site, which means new detail-page data appears frequently and with a certain regularity during that period. The number of schedules for the same period on Friday can therefore be predicted from the historical scheduling counts from 7 a.m. to 11 a.m. on Monday through Thursday, saving capture resources and improving data timeliness overall.
Step S230, predicting the scheduling times of each seed page in the current time period according to the historical scheduling times.
For example, let the data set of scheduling counts for this time slot over the past N days in the database be $D = \{(i, y_i) \mid i = 1, 2, \ldots, N\}$, where $i$ denotes the $i$-th day (the current day corresponds to $i = N + 1$) and $y_i$ denotes the number of schedules within that time slot on day $i$. The linear function of the prediction model, $y = w \cdot i + b$, is then fitted by the least squares method to obtain the weights $w$ and $b$. Finally, the prediction model is used to predict the number of schedules for the current day, i.e. the output value of the linear function at $i = N + 1$. Considering that the distribution of source information generally follows a weekly pattern, the parameter N can be set to 7 empirically.
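A plain-Python sketch of this least-squares step, assuming the history list holds the schedule counts for the same time slot on consecutive past days (function and variable names are illustrative):

```python
def predict_schedule_count(history, n=7):
    """Least-squares fit of y = w*i + b to the schedule counts of the same
    time slot over the past n days, evaluated at day n + 1."""
    history = history[-n:]                       # y_1 ... y_N
    if not history:
        return 0
    xs = list(range(1, len(history) + 1))        # day indices i
    pts = len(xs)
    mean_x = sum(xs) / pts
    mean_y = sum(history) / pts
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
    w = sxy / sxx if sxx else 0.0
    b = mean_y - w * mean_x
    return max(0, round(w * (pts + 1) + b))      # predicted count for day N + 1

# e.g. counts for the 07:00-11:00 slot over the past week:
print(predict_schedule_count([12, 14, 13, 15, 16, 9, 10]))
```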
Step S240, performing content analysis on the historical text data, and determining an information source score of each seed page.
Here, the source score characterizes the quality of the content contained in each of the seed pages.
It can be understood that content mining is performed on the text data obtained from historical crawls of the seed pages, so that the text information contained in the detail pages related to a seed page is reflected in the scheduling of that seed page. High-quality, highly relevant source seeds and ordinary source seeds are thereby scored to form the information source score of each seed page, and capture is carried out hierarchically.
And step S250, dynamically adjusting the scheduling parameters according to the information source score.
Here, on the basis of the scheduling parameters predicted from the historical text data, the information source score is also taken into account, further mining the value of the scheduling history data, and the scheduling parameters are dynamically adjusted for seed pages with different source scores. This achieves hierarchical collection and timely, efficient scheduling, improves the quality and speed of content capture, and saves a large amount of resources.
And step S260, determining to crawl a scheduling timetable of each seed page according to the adjusted scheduling times.
Here, the scheduling schedule is calculated from the time remaining before each schedule within the current time period and the number of remaining schedules. For example, when the remaining time is 1 hour and the remaining number of schedules equals 1, the scheduling schedule of the seed for the current time is [00:00] (representing 00 min : 00 sec); when the remaining number equals 2, the schedule is [00:00, 30:00], and so on.
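A minimal sketch of this calculation, assuming the schedule is expressed as second offsets from the current moment:

```python
def build_schedule(remaining_seconds, remaining_count):
    """Spread the remaining schedules evenly across the time left in the slot,
    e.g. 1 hour with 2 schedules left gives offsets [0, 1800] (00:00, 30:00)."""
    if remaining_count <= 0:
        return []
    step = remaining_seconds / remaining_count
    return [round(k * step) for k in range(remaining_count)]

print(build_schedule(3600, 2))  # [0, 1800]
```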
And step S270, at the next scheduling time, selecting a seed page to be scheduled at the current time to be added into the scheduling task queue according to the scheduling schedule.
Here, the scheduling task queue includes seed pages to be crawled and scheduling time corresponding to each seed page. And when the scheduling times or the scheduling time corresponding to the seed page to be crawled are changed, correspondingly changing the scheduling time in the scheduling task queue. Therefore, the scheduling program selects the seed page to be scheduled at the current moment according to the updated scheduling task queue.
In some embodiments, the method further comprises: taking out each seed page from the scheduling task queue through a crawler machine; analyzing each seed page to obtain text data generated by each seed page; determining the scheduling times and the average repetition rate of scheduling generation in the current time period of each seed page; and storing the text data, the scheduling times and the average repetition rate in a historical database.
In the embodiment of the application, before each scheduling is started, determining historical scheduling times in a time period corresponding to a current time period from historical text data, predicting the scheduling times in the current time period of each sub-page according to the historical scheduling times, simultaneously performing content analysis on the historical text data, determining an information source score of each sub-page, adjusting the predicted scheduling times according to the information source scores, determining a scheduling schedule for crawling each sub-page according to the adjusted scheduling times, and selecting the sub-page to be scheduled at the current time to be added into a scheduling task queue at the next scheduling time according to the scheduling schedule; therefore, the scheduling parameters of the seed pages are predicted by mining the value of the historical text data, the content value of the historical text data generated by the crawler is mined to obtain the information source score of each seed page, the predicted scheduling parameters are dynamically adjusted based on the information source scores, the scheduling timetable is dynamically adjusted according to the seed pages with different information source scores, the hierarchical grabbing is carried out, the quality of the grabbed content and the grabbing speed are improved, and a large number of resources are saved.
In some embodiments, the text classification model includes a first classification model and a second classification model, the scheduling parameter is a scheduling number of times in a unit time period, fig. 3 is a flowchart of another method for updating a scheduling task queue provided in this embodiment of the present application, and as shown in fig. 3, the step S130 or the step S240 "performing content analysis on the historical text data and determining the source score of each of the seed pages" may be implemented by:
step S310, filtering the historical text data through the first classification model to obtain the effective text data.
Here, the first classification model is obtained by training preset garbage data and valid data corresponding to the garbage data. The model is adopted to analyze the historical text data generated by the seed page, and the junk data is filtered out to obtain effective historical text data, namely effective text data.
Step S320, performing type labeling on the valid text data through the second classification model to obtain a data type of each valid text data.
Here, the second classification model is obtained by training preset type data. The preset type data can be classified according to more classical categories in the portal website, including news, entertainment, sports, finance, science and technology, real estate, education, culture and the like. And analyzing the effective historical text data by adopting the model to obtain the data type of each data.
Step S330, counting the data type of each valid text data through the second classification model, and determining the type ratio of each data type in the valid text data.
After the valid text data obtained by filtering with the first classification model has been type-labeled, the type ratio of each category in the valid historical text data of the relevant seed is calculated and recorded as $C = \{c_1, c_2, \ldots, c_m\}$, where $m$ denotes the number of categories and $c_i$ denotes the proportion of the $i$-th category in the seed.
The above steps S310 to S330 implement a process of "processing the historical text data through the trained text classification model to obtain valid text data generated by each seed page and a type ratio of each category data in each seed page".
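A sketch of steps S310 to S330, with the two trained models passed in as plain callables (is_spam and classify_topic are placeholder names, not components defined by the application):

```python
from collections import Counter

def filter_and_type_ratios(texts, is_spam, classify_topic):
    """Drop junk text, label what remains, and compute the per-category
    share of the valid text for one seed page."""
    valid = [t for t in texts if not is_spam(t)]      # S310: filter with model 1
    labels = [classify_topic(t) for t in valid]       # S320: label with model 2
    total = len(labels)
    ratios = {cat: count / total                      # S330: type ratio c_i
              for cat, count in Counter(labels).items()} if total else {}
    return valid, ratios
```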
Step S340, sorting the data size of the effective text data generated by each seed page to obtain the data size score of each seed page.
Here, the data volume score of each seed page is obtained by sorting the seed pages by the amount of valid text data they generate, dividing the sorted results evenly into 100 bins, and scoring by percentile (seeds ranked higher receive higher scores).
Step S350, determining a percentage score of a type included in each seed page according to a preset type weight and the type ratio.
Here, type weights are set according to the emphasis points of different crawler capture data, and the type proportion of each type of data is weighted and summed to obtain the percentage score of the type contained in each seed page.
Step S360, calculating an arithmetic square root of the product of the data volume fraction and the percentile fraction to obtain the information source score of each seed page.
Here, the data amount fraction and the percentile fraction of each seed page are multiplied and the arithmetic square root is calculated to obtain the source score of the seed page.
The above steps S340 to S360 implement a process of "determining the source score of each of the seed pages according to the data amount of the valid text data and the type ratio".
In some embodiments, the scheduling parameter is a scheduling number in a unit time period, fig. 4 is a schematic flowchart of another method for updating a scheduling task queue provided in the embodiment of the present application, and as shown in fig. 4, the step S140 or the step S250 "dynamically adjusting the scheduling parameter according to the source score" may be implemented by:
and step S410, setting a default repetition rate interval of each seed page according to the information source score.
Here, a default repetition rate interval may be set according to the requirements of the service and the available resources; in general, a higher collection frequency for a seed page corresponds to a higher repetition rate.
For example, if the source score is greater than 90 points, the default repetition rate interval is set to [90%, 100%): that is, if the web page contains 100 pieces of URL (Uniform Resource Locator) data, the crawling frequency should be kept in a state where each crawl finds 1 to 10 newly updated items. If the source score is between 80 and 90 points, the default repetition rate interval is set to [80%, 90%), and so on; if the source score is less than 60 points, the interval is set to [50%, 60%).
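A possible mapping from source score to default repetition rate interval, following the bands given above; the 60 to 90 point bands are filled in by analogy ("and so on") and are therefore an assumption:

```python
def default_repetition_interval(source_score):
    """Map a source score (0-100) to its default repetition rate interval."""
    if source_score > 90:
        return (0.90, 1.00)   # [90%, 100%)
    if source_score > 80:
        return (0.80, 0.90)
    if source_score > 70:
        return (0.70, 0.80)   # assumed by analogy with the bands above
    if source_score > 60:
        return (0.60, 0.70)   # assumed by analogy with the bands above
    return (0.50, 0.60)       # below 60 points
```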
Step S420, determining an average repetition rate of each seed page in the current time period.
Here, the average repetition rate is the average of the repetition rates produced by the schedules in the current time period. The repetition rate represents the proportion of links parsed from the current seed page this time that have already been parsed historically; its main function is to prevent repeated capture.
Information is published with fluctuations, so the prediction result deviates somewhat from the actual situation. To handle this, the scheduling is dynamically adjusted using the average repetition rate.
And step S430, dynamically adjusting the scheduling times by comparing the position relationship of the average repetition rate and the default repetition rate interval on a numerical axis.
Here, after one scheduling of the seed page is completed, the number of remaining scheduling times of the seed page in the current time period is adaptively adjusted by comparing whether the current average repetition rate is in the default repetition rate interval.
In some possible embodiments, in case the average repetition rate is higher than an upper limit of the default repetition rate interval, reducing the number of schedules; or, in case that the average repetition rate is lower than the lower limit of the default repetition rate interval, increasing the number of scheduling times; or, maintaining the scheduling number of times when the average repetition rate is in the default repetition rate interval. And repeating the process until the scheduling time in the current time period is finished or the scheduling times are reduced to zero.
In some possible embodiments, before the next scheduling time of each seed page, the scheduling times of each seed page are judged; and stopping adjusting the scheduling times under the condition that the scheduling times of each seed page is zero or the remaining time from the next scheduling time is zero.
It should be noted that default parameters are set at implementation time according to the service requirements and resource configuration, such as the number of schedules (an integer greater than zero) and the average repetition rate interval; after the system has run for a period of time, scheduling proceeds according to the scheduling parameters predicted from the crawled historical text data.
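A sketch of the adjustment rule in steps S410 to S430, assuming the interval is given as a (lower, upper) pair of fractions:

```python
def adjust_remaining_schedules(remaining, avg_repetition_rate, interval):
    """Nudge the remaining schedule count for the current time period after a
    crawl finishes, based on where the average repetition rate falls."""
    low, high = interval
    if avg_repetition_rate > high:    # too much duplication: crawl less often
        return max(0, remaining - 1)
    if avg_repetition_rate < low:     # mostly new links: crawl more often
        return remaining + 1
    return remaining                  # within the default interval: keep as is
```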
The method for updating the scheduling task queue is described below with reference to a specific embodiment, however, it should be noted that the specific embodiment is only for better describing the present application and is not to be construed as a limitation to the present application.
With the advent of the mobile internet era, everyone has become their own media channel, and self-media platforms such as Tieba, Toutiao, and Weibo are important venues for publishing information on the internet. The quality of this information is uneven, and capturing it without distinction wastes a great deal of resources. If the information source seeds are screened, the sources are scored according to their content to form a source profile, and capture is performed hierarchically, the quality and speed of content capture are improved and a large amount of resources is saved.
As shown in fig. 5, the method for updating a scheduling task queue provided in the embodiment of the present application mainly includes two stages: a scheduling parameter prediction phase 51 and a crawler scheduling phase 52.
The first stage, the scheduling parameter prediction stage 51.
This stage dynamically adjusts fine-grained scheduling parameters by analyzing both text and scheduling data, where the scheduling data comprise the scheduling history of the past N days and the average repetition rate in the current time period. Fig. 6 is a flowchart illustrating a scheduling parameter prediction method according to an embodiment of the present application; as shown in fig. 6, the method includes the following steps:
step S601, obtaining the effective data volume and the type ratio of the seed page by adopting a classification model.
Here, the classification model is used to perform data type analysis on the text data of the seed page, and the effective data volume and the type ratio of the seed page are obtained. In the implementation process, data marking and classification model training are required.
Firstly, a binary text classification model (TextCNN) is trained to separate junk text from valid text. This model is used to analyze the historical text data generated by the seed page and filter out junk information, yielding the valid historical text data, i.e. the valid data volume of the seed page.
The second step is to train a multi-class text classification model; the classes can follow the classic portal categories, including news, entertainment, sports, finance, science and technology, real estate, education, culture, and the like. This model is used to analyze the valid historical text data to obtain the category of each item, and the type ratio of each category in the valid historical text data of the related seed page is calculated and recorded as $C = \{c_1, c_2, \ldots, c_m\}$, where $m$ denotes the number of categories and $c_i$ denotes the proportion of the $i$-th category in the seed page.
Step S602, obtaining the source score of the seed page according to the effective data volume and the type ratio.
Firstly, the seed pages are sorted by the amount of valid data each generates, the sorted results are divided evenly into 100 bins, and each seed page is scored by percentile (higher-ranked seeds receive higher scores), giving the data volume score $S_n$.
Then, type weights $W = \{w_1, w_2, \ldots, w_m\}$ are set according to the focus of different crawling tasks, where $m$ denotes the number of categories and $w_i$ is the weight of the $i$-th category ($w_i = 1$ if the category is wanted, otherwise $w_i = 0$). From the type ratios $c_i$ and the type weights $w_i$, the type percentile score is computed as $S_c = \sum_{i=1}^{m} w_i \, c_i$, with the ratios taken as percentages.
The data volume score $S_n$ and the percentile score $S_c$ are then multiplied and the arithmetic square root is taken to obtain the final score of the source seed, $S = \sqrt{S_n \cdot S_c}$.
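A sketch of this scoring step under stated assumptions: the data volume score is approximated as a simple percentile of the rank, and the weighted type ratio is scaled to 0 to 100 so that the final score is comparable with the point thresholds used below (the exact scaling in the original formula images is not recoverable, so the factor of 100 is an assumption):

```python
import math

def source_scores(valid_counts, type_ratios, type_weights):
    """valid_counts: seed -> amount of valid text it produced.
    type_ratios:  seed -> {category: share of that category, summing to 1}.
    type_weights: {category: 1 if the crawl cares about it, else 0}."""
    ranked = sorted(valid_counts, key=valid_counts.get, reverse=True)
    n = len(ranked)
    scores = {}
    for rank, seed in enumerate(ranked):
        s_n = 100.0 * (n - rank) / n                          # data volume score S_n
        s_c = 100.0 * sum(type_weights.get(cat, 0) * share    # weighted type share,
                          for cat, share in type_ratios[seed].items())  # scaled to 0-100
        scores[seed] = math.sqrt(s_n * s_c)                   # S = sqrt(S_n * S_c)
    return scores
```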
Finally, a repetition rate interval is set according to the different source scores. A default repetition rate interval may be set according to the requirements of the service and the available resources (a higher repetition rate corresponds to a higher seed collection frequency). For example, if the source score is greater than 90 points, the repetition rate interval is set to [90%, 100%): that is, if the web page contains 100 pieces of URL data, the crawling frequency should be kept in a state where each crawl finds 1 to 10 newly updated items. For scores between 80 and 90 points, the interval is set to [80%, 90%), and so on; below 60 points, it is set to [50%, 60%).
Step S603, predicting the scheduling parameter for the current time period based on historical data.
Here, before each scheduling period starts, the scheduling parameter for the current period, namely the number of schedules, is predicted from historical data.
When predicting the number of schedules for a certain time slot, let the data set of scheduling counts for that slot over the past N days in the database be $D = \{(i, y_i) \mid i = 1, 2, \ldots, N\}$, where $i$ denotes the $i$-th day (the current day corresponds to $i = N + 1$) and $y_i$ denotes the number of schedules within that time slot on day $i$. The linear function of the prediction model, $y = w \cdot i + b$, is then fitted by the least squares method to obtain the weights $w$ and $b$. Finally, the prediction model is used to predict the number of schedules for the current day, i.e. the output value of the linear function at $i = N + 1$. The distribution of source information generally follows a weekly pattern, so the parameter N is set to 7 empirically.
And step S604, dynamically adjusting the scheduling time according to the predicted scheduling parameters and the information source score.
Here, the scheduling schedule of the seed is calculated from the prediction result of step S603 and is dynamically adjusted according to the repetition rate interval, taking the fluctuation of information publication into account. With the scheduling parameter predicted in the previous step as a reference, the scheduling schedule of the seed for the current time is calculated and dynamically adjusted; as shown in fig. 7, the method includes the following steps:
step S701, acquiring the predicted scheduling times at the current time.
Step S702, determining a scheduling schedule according to the scheduling times.
And calculating a scheduling time table according to the time left before each scheduling and the scheduling times. For example, when the remaining time is 1 hour and the remaining number of times of scheduling is equal to 1, the schedule of scheduling at the current time of the seed is [00:00] (representing 00 min: 00 sec), and when the number of times is equal to 2, the schedule is [00:00, 30:00], and so on.
And step S703, performing crawler scheduling according to the scheduling schedule.
The scheduler decides which seed to schedule and when according to the scheduling schedule. Information is published with fluctuations, so the prediction result deviates somewhat from the actual situation. To handle this, the scheduling is dynamically adjusted using the average repetition rate.
Step S704, determine whether the average repetition rate is within a preset interval.
Here, the preset interval is a default repetition rate interval, and if the average repetition rate in the scheduling database is in the preset interval, the step S705 is executed; if the average repetition rate is lower than the preset interval lower limit value, executing step S706; if the average repetition rate is higher than the preset interval upper limit value, step S707 is executed.
Step S705 waits for the next scheduling time.
Here, when one scheduling of the seeds is completed, if the average repetition rate in the scheduling database is in the seed default repetition rate interval, no change is made.
Step S706, increasing the number of schedules by one.
Here, if the average repetition rate in the scheduling database is lower than the lower limit of the default repetition rate interval, the remaining number of schedules is increased by one.
Step S707, decreasing the number of schedules by one.
Here, if the average repetition rate in the scheduling database is higher than the upper limit of the default repetition rate interval, the remaining number of schedules is decreased by one.
Steps S702 to S707 are repeated until the time is over or the number of schedules is reduced to zero.
The second phase, crawler scheduling phase 52.
Fig. 8 is a schematic flowchart of a crawler scheduling process provided in an embodiment of the present application, and as shown in fig. 8, the flowchart includes the following steps:
step S801, in the scheduling schedule, a seed page to be scheduled at the current time is taken out and added to the scheduling task queue.
Here, the scheduler takes out the seed pages to be scheduled at the current time according to the scheduling schedule and adds them to a scheduling task queue (also referred to as a message queue). Each queued seed page carries its link information and scheduling parameters such as the number of schedules and the repetition rate interval.
And step S802, taking out the seed page to be scheduled from the scheduling task queue through the crawler machine, and analyzing to obtain the webpage content.
Here, the crawler machine takes a seed page to be scheduled out of the scheduling task queue, parses the seed page to obtain the detail page links, and then accesses and stores the detail page contents.
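A sketch of one pass of the crawler machine, with the HTTP client, parser, and storage layer injected as placeholder callables (none of these names come from the application); task_queue is assumed to expose the pop_due method sketched earlier:

```python
def crawl_once(task_queue, fetch, parse_links, store):
    """Take due seed pages off the queue, parse their detail-page links,
    then fetch and store each detail page."""
    for seed_url in task_queue.pop_due():
        seed_html = fetch(seed_url)
        for detail_url in parse_links(seed_html):
            store(detail_url, fetch(detail_url))
```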
Step S803, the scheduling data of the current time of the seed page is stored.
The scheduling data generated by the schedules in the current time period of each seed page, such as the number of schedules and the average repetition rate, are recorded, calculated, and stored in the historical scheduling database. The number of schedules is the count of schedules in the current time period, and the average repetition rate is the mean of the repetition rates produced by those schedules. The repetition rate is calculated as
$$\text{repetition rate} = \frac{link_{exist}}{link_{grasp}} \times 100\%$$
where $link_{grasp}$ denotes the number of links parsed from the current seed page in this schedule and $link_{exist}$ denotes how many of those links are already present in the de-duplication store. The de-duplication store is a link database containing the data collected while the crawler runs; its main function is to prevent repeated capture.
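A minimal sketch of this repetition-rate calculation against an in-memory de-duplication set (the real de-duplication store would be a database; the names are illustrative):

```python
def repetition_rate(parsed_links, dedup_store):
    """Fraction of links parsed from the seed page this time that already
    exist in the de-duplication store (link_exist / link_grasp)."""
    link_grasp = len(parsed_links)
    if link_grasp == 0:
        return 0.0
    link_exist = sum(1 for url in parsed_links if url in dedup_store)
    return link_exist / link_grasp

print(repetition_rate(["/a", "/b", "/c", "/d"], {"/a", "/b", "/c"}))  # 0.75
```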
In the embodiment of the application, the historical data generated by the crawler is mined along two dimensions: text and scheduling data. First, the stock data is analyzed along the content dimension: a text classification model performs binary junk classification of the captured text (separating valid information from junk information such as advertisements and spam posts), and then multi-class content classification is performed on the valid information. Next, a source profile is formed from the valid data volume and the type ratios, and the source score is calculated. Finally, on the basis of the source score, the value of the scheduling history data is mined, and the fine-grained number of schedules is predicted and dynamically adjusted. In this way, fine-grained scheduling times are dynamically adjusted for sources with different scores, achieving hierarchical collection and timely, efficient scheduling, reducing latency, and improving resource utilization and the effective collection rate.
The method and the device fully mine the value of the text content and the scheduling data generated while crawling the seed pages, rather than relying on scheduling data alone. First, a source profile is formed by mining the value of the content, and a source score is calculated. Then, by combining the source score with the mined value of the scheduling data, the scheduling achieves hierarchical collection by source, reduced latency, improved resource utilization, and an improved effective collection rate.
Based on the foregoing embodiments, an apparatus for updating a scheduling task queue is further provided in an embodiment of the present application. The apparatus includes the modules described below and the units included in those modules, which may be implemented by a processor in a device or, of course, by specific logic circuits. In implementation, the processor may be a Central Processing Unit (CPU), a Microprocessor Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
Fig. 9 is a schematic structural diagram of an apparatus for updating a scheduling task queue according to an embodiment of the present application, and as shown in fig. 9, the apparatus 900 includes an obtaining module 910, a predicting module 920, a first determining module 930, an adjusting module 940, and an updating module 950, where:
the obtaining module 910 is configured to obtain historical text data generated by at least one seed page to be crawled;
the predicting module 920 is configured to predict, according to the historical text data, a scheduling parameter for crawling each seed page;
the first determining module 930, configured to perform content analysis on the historical text data, and determine an information source score of each seed page; wherein, the source score characterizes the quality condition of the content contained in each seed page;
the adjusting module 940 is configured to dynamically adjust the scheduling parameter according to the source score;
the updating module 950 is configured to update the scheduling task queue according to the adjusted scheduling parameter.
In some possible embodiments, the first determining module 930 comprises a processing sub-module and a first determining sub-module, wherein: the processing submodule is used for processing the historical text data through a trained text classification model to obtain the effective text data generated by each seed page and the type ratio of each category data in each seed page; the first determining submodule is used for determining the information source score of each seed page according to the data volume of the effective text data and the type ratio.
In some possible embodiments, the text classification model comprises a first classification model and a second classification model, and the processing submodule comprises a filtering unit, a type labeling unit, and a statistical unit, wherein: the filtering unit is used for filtering the historical text data through the first classification model to obtain the effective text data; the first classification model is obtained by training preset junk data and valid data corresponding to the junk data; the type labeling unit is used for performing type labeling on the effective text data through the second classification model to obtain the data type of each effective text data; the second classification model is obtained by training through preset type data; the statistical unit is used for performing statistics on the data type of each effective text data through the second classification model, and determining the type proportion of each data type in the effective text data.
In some possible embodiments, the number of the seed pages to be crawled is at least two, and the first determining submodule includes a sorting unit, a first determining unit, and a second determining unit, where: the sorting unit is used for sorting the data size of the effective text data generated by each seed page to obtain the data size fraction of each seed page; the first determining unit is used for determining the percentage score of the type contained in each seed page according to a preset type weight and the type proportion; and the second determining unit is used for calculating an arithmetic square root of the product of the data volume fraction and the percentile fraction to obtain the source score of each seed page.
In some possible embodiments, the scheduling parameter is a number of times of scheduling in a unit time period, and the prediction module 920 includes a second determination sub-module and a prediction sub-module, where: the second determining submodule is used for determining the historical scheduling times in the time period corresponding to the current time period from the historical text data; and the prediction submodule is used for predicting the scheduling times of each seed page in the current time period according to the historical scheduling times.
In some possible embodiments, the scheduling parameter is a number of times of scheduling in a unit time period, and the adjusting module 940 includes a setting sub-module, a second determining sub-module, and an adjusting sub-module, wherein: the setting submodule is used for setting a default repetition rate interval of each seed page according to the information source score; the second determining submodule is used for determining the average repetition rate of each seed page in the current time period; wherein, the average repetition rate is the average value of the repetition rates generated by scheduling in the current time period; and the adjusting submodule is used for dynamically adjusting the scheduling times by comparing the position relation of the average repetition rate and the default repetition rate interval on a numerical axis.
In some possible embodiments, the adjusting sub-module is further configured to decrease the number of schedules if the average repetition rate is higher than the upper limit of the default repetition rate interval after one schedule for each of the seed pages is completed.
In some possible embodiments, after one scheduling for each of the seed pages is completed, the number of scheduling times is increased if the average repetition rate is lower than the lower limit of the default repetition rate interval.
In some possible embodiments, the apparatus 900 further comprises a determining module and a stop adjusting module, wherein: the judging module is used for judging the scheduling times of each seed page before the next scheduling time of each seed page; and the adjustment stopping module is used for stopping adjusting the scheduling times under the condition that the scheduling times of each seed page is zero or the remaining time from the next scheduling time is zero.
In some possible embodiments, the scheduling parameter is a number of times of scheduling in a unit time period, and the updating module includes a third determining submodule and an updating submodule, wherein: the third determining submodule is used for determining a scheduling timetable for crawling each seed page according to the adjusted scheduling times; and the updating submodule is used for selecting a seed page to be scheduled at the current time to be added into the scheduling task queue at the next scheduling time according to the scheduling time table.
In some possible embodiments, the apparatus 900 further includes a crawling module, a parsing module, a second determining module, and a storage module, where: the crawling module is configured to take each seed page out of the scheduling task queue through a crawler machine; the parsing module is configured to parse each seed page to obtain the text data generated by each seed page; the second determining module is configured to determine, for each seed page, the number of scheduling times and the average repetition rate produced by scheduling within the current time period; and the storage module is configured to store the text data, the number of scheduling times, and the average repetition rate in a historical database.
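A hedged sketch of this crawl-and-store side is given below; fetch_page, parse_text, and the dictionary-style history store stand in for whatever crawler, parser, and database are actually deployed, and the exact-match repetition rate is a deliberately naive illustration.

```python
from collections import deque

def crawl_and_record(task_queue: deque, fetch_page, parse_text, history_db, period_key):
    """Take seed pages off the scheduling task queue, parse the text they
    generated, and record the text, the scheduling count, and the repetition
    rate for the current time period."""
    while task_queue:
        page = task_queue.popleft()
        text = parse_text(fetch_page(page))            # crawler fetches and parses the seed page
        record = history_db.setdefault((page, period_key),
                                       {"texts": [], "schedule_count": 0, "repetition_rates": []})
        seen = record["texts"]
        repetition_rate = (sum(t == text for t in seen) / len(seen)) if seen else 0.0
        seen.append(text)
        record["schedule_count"] += 1
        record["repetition_rates"].append(repetition_rate)
```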
Here, it should be noted that the above description of the apparatus embodiments is similar to the description of the method embodiments and yields similar beneficial effects. For technical details not disclosed in the apparatus embodiments of the present application, reference is made to the description of the method embodiments of the present application.
It should be noted that, in the embodiments of the present application, if the method for updating the scheduling task queue is implemented in the form of a software functional module and is sold or used as a stand-alone product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling a device (which may be a smartphone with a camera, a tablet computer, etc.) to execute all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disc. Thus, the embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in any of the methods for updating a scheduling task queue described in the foregoing embodiments.
Correspondingly, in an embodiment of the present application, a chip is further provided, where the chip includes a programmable logic circuit and/or a program instruction, and when the chip runs, the chip is configured to implement the steps in any of the methods for updating a scheduling task queue in the foregoing embodiments.
Correspondingly, in an embodiment of the present application, there is further provided a computer program product, which is used to implement the steps in the method for updating a scheduling task queue in any of the above embodiments when the computer program product is executed by a processor of a device.
Based on the same technical concept, the embodiment of the present application provides a device for updating a scheduling task queue, which is used for implementing the method for updating a scheduling task queue described in the above method embodiment. Fig. 10 is a hardware entity diagram of an apparatus for updating a scheduled task queue according to an embodiment of the present application, as shown in fig. 10, the apparatus 1000 includes a memory 1010 and a processor 1020, where the memory 1010 stores a computer program that can run on the processor 1020, and the processor 1020 executes the computer program to implement steps in any method for updating a scheduled task queue according to the embodiment of the present application.
The Memory 1010 is configured to store instructions and applications executable by the processor 1020, and may also buffer data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or already processed by the processor 1020 and modules in the device, and may be implemented by a FLASH Memory (FLASH) or a Random Access Memory (RAM).
The steps of any of the above methods for updating a scheduled task queue are performed by processor 1020 when executing a program. The processor 1020 generally controls the overall operation of the device 1000.
The Processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It is understood that the electronic device implementing the above processor functions may also be another type of electronic device, which is not specifically limited in the embodiments of the present application.
The computer storage medium/Memory may be a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic Random Access Memory (FRAM), a Flash Memory (Flash Memory), a magnetic surface Memory, an optical Disc, or a Compact Disc Read-Only Memory (CD-ROM), and the like; or may be a variety of devices including one or any combination of the above memories, such as a mobile phone, computer, tablet device, personal digital assistant, etc.
Here, it should be noted that the above description of the storage medium and device embodiments is similar to the description of the method embodiments and yields similar beneficial effects. For technical details not disclosed in the storage medium and device embodiments of the present application, reference is made to the description of the method embodiments of the present application.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a device to perform all or part of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only for the embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method of updating a queue of scheduled tasks, the method comprising:
acquiring historical text data generated by at least one seed page to be crawled;
predicting, according to the historical text data, scheduling parameters for crawling each of the seed pages;
analyzing the content of the historical text data, and determining a source score of each seed page; wherein the source score characterizes the quality of the content contained in each seed page;
dynamically adjusting the scheduling parameters according to the source scores;
and updating the scheduling task queue according to the adjusted scheduling parameters.
2. The method of claim 1, wherein the analyzing the content of the historical text data and determining the source score of each seed page comprises:
processing the historical text data through a trained text classification model, to obtain the effective text data generated by each seed page and the type proportion of each category of data in each seed page;
and determining the source score of each seed page according to the data volume of the effective text data and the type proportion.
3. The method of claim 2, wherein the text classification model comprises a first classification model and a second classification model, and the processing the historical text data through the trained text classification model to obtain the effective text data generated by each seed page and the type proportion of each category of data in each seed page comprises:
filtering the historical text data through the first classification model to obtain the effective text data; the first classification model is obtained by training preset junk data and valid data corresponding to the junk data;
performing type marking on the effective text data through the second classification model to obtain the data type of each effective text data; the second classification model is obtained by training through preset type data;
and counting the data type of each effective text data through the second classification model, and determining the type proportion of each data type in the effective text data.
4. The method of claim 2, wherein the number of seed pages to be crawled is at least two, and the determining the source score of each seed page according to the data volume of the effective text data and the type proportion comprises:
ranking the seed pages by the data volume of the effective text data generated by each seed page, to obtain a data volume score for each seed page;
determining a percentile score for the types contained in each seed page according to preset type weights and the type proportion;
and calculating the arithmetic square root of the product of the data volume score and the percentile score to obtain the source score of each seed page.
5. The method of any one of claims 1 to 4, wherein the scheduling parameter is a number of times scheduled per unit time period, and predicting the scheduling parameter for crawling each of the seed pages according to the historical text data comprises:
determining historical scheduling times in a time period corresponding to the current time period from the historical text data;
and predicting the scheduling times of each seed page in the current time period according to the historical scheduling times.
6. The method of any of claims 1 to 4, wherein the scheduling parameter is a number of scheduling times per unit time period, and the dynamically adjusting the scheduling parameter according to the source score comprises:
setting a default repetition rate interval of each seed page according to the source score;
determining an average repetition rate of each seed page in a current time period; wherein, the average repetition rate is the average value of the repetition rates generated by scheduling in the current time period;
and dynamically adjusting the scheduling times by comparing the position relationship of the average repetition rate and the default repetition rate interval on a numerical axis.
7. The method of claim 6, wherein dynamically adjusting the number of schedules by comparing the average repetition rate to the default repetition rate interval in a position relationship on a number axis comprises:
after one scheduling for each of the seed pages is completed, reducing the scheduling times if the average repetition rate is higher than the upper limit of the default repetition rate interval.
8. The method of claim 6, wherein dynamically adjusting the number of schedules by comparing the average repetition rate to the default repetition rate interval in a position relationship on a number axis comprises:
after one scheduling for each of the seed pages is completed, increasing the number of scheduling times in case that the average repetition rate is lower than the lower limit of the default repetition rate interval.
9. The method of claim 8, wherein the method further comprises:
before the next scheduling time of each seed page, judging the scheduling times of each seed page;
and stopping adjusting the scheduling times under the condition that the scheduling times of each seed page is zero or the remaining time from the next scheduling time is zero.
10. The method according to any one of claims 1 to 4, wherein the scheduling parameter is a scheduling number of times in a unit time period, and the updating the scheduling task queue according to the adjusted scheduling parameter comprises:
determining a scheduling timetable for crawling each seed page according to the adjusted scheduling times;
and at the next scheduling time, selecting a seed page to be scheduled at the current time to be added into the scheduling task queue according to the scheduling time table.
11. The method of claim 10, wherein the method further comprises:
taking out each seed page from the scheduling task queue through a crawler machine;
analyzing each seed page to obtain text data generated by each seed page;
determining, for each seed page, the scheduling times and the average repetition rate produced by scheduling in the current time period;
and storing the text data, the scheduling times and the average repetition rate in a historical database.
12. An apparatus for updating a queue of scheduled tasks, the apparatus comprising an obtaining module, a predicting module, a first determining module, an adjusting module, and an updating module, wherein:
the acquisition module is used for acquiring historical text data generated by at least one seed page to be crawled;
the prediction module is used for predicting, according to the historical text data, the scheduling parameter for crawling each seed page;
the first determining module is used for analyzing the content of the historical text data and determining the source score of each seed page; wherein the source score characterizes the quality of the content contained in each seed page;
the adjusting module is used for dynamically adjusting the scheduling parameters according to the source scores;
and the updating module is used for updating the scheduling task queue according to the adjusted scheduling parameters.
13. An apparatus for updating a queue of scheduled tasks, comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor implements the steps of the method of any of claims 1 to 11 when executing the program.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 11.
CN202011470038.7A 2020-12-15 2020-12-15 Method, device, equipment and storage medium for updating scheduling task queue Active CN112231538B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011470038.7A CN112231538B (en) 2020-12-15 2020-12-15 Method, device, equipment and storage medium for updating scheduling task queue

Publications (2)

Publication Number Publication Date
CN112231538A true CN112231538A (en) 2021-01-15
CN112231538B CN112231538B (en) 2021-05-14

Family

ID=74123593

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704323A (en) * 2017-11-07 2018-02-16 广州探迹科技有限公司 A kind of web crawlers method for scheduling task and device
CN109522469A (en) * 2018-12-28 2019-03-26 浪潮软件集团有限公司 Scheduling management method of distributed crawlers
CN109670101A (en) * 2018-12-28 2019-04-23 北京奇安信科技有限公司 Crawler dispatching method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112231538B (en) 2021-05-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant