CN113536085B

CN113536085B - Method and system for scheduling subject term search crawlers based on combined prediction method

Info

Publication number: CN113536085B
Application number: CN202110701204.8A
Authority: CN
Inventors: 陈智超; 裴峥; 孔明明
Original assignee: Xihua University
Current assignee: Xihua University
Priority date: 2021-06-23
Filing date: 2021-06-23
Publication date: 2023-05-19
Anticipated expiration: 2041-06-23
Also published as: CN113536085A

Abstract

The invention relates to the technical field of crawler scheduling, and in particular to a method and a system for scheduling subject term search crawlers based on a combined prediction method. The system comprises a first acquisition module, a data preprocessing module, a vector space model, a clustering module, a subject term extraction module, a second acquisition module, a real heat index weight calculation module, a real heat value calculation module, an updating module, a predicted heat value module and a CPU distribution module. The method comprises the following steps: step 1, acquiring data from a data source; step 2, preprocessing the data; step 3, obtaining topic data and calculating the real heat indexes and index weights; step 4, calculating the real heat value; step 5, calculating the predicted heat value of each topic in the next period; step 6, extracting new subject terms and updating the database; and step 7, allocating the upper limit of the CPU occupancy rate so that more data related to high-heat topics is acquired. The aim of preferentially tracking high-heat topics under limited resources is thereby achieved.

Description

Method and system for scheduling subject term search crawlers based on combined prediction method

Technical Field

The invention relates to the technical field of crawler scheduling methods, in particular to a method and a system for scheduling a subject term search crawler based on a combined prediction method.

Background

Tracking a topic requires crawlers to continuously acquire data related to that topic. To track hot topics preferentially when server resources are limited, the crawlers must be scheduled autonomously so that data related to hot topics is acquired first. Current crawler scheduling methods mainly include methods based on the update frequency of website data, methods based on URL allocation, methods based on network distance, and methods based on node task allocation. The method based on website data update frequency schedules crawlers according to the update frequency of the data source website; it reduces the resource cost of the crawler server to some extent and suits websites that update slowly. The method based on URL allocation judges the similarity between the web page text and the topic set by the user and preferentially assigns URLs with high similarity to the crawlers; it cannot give priority to topics whose future heat is high. The method based on node task allocation mainly addresses load balancing among crawler servers: a large number of URLs are mapped onto a hash ring, each crawling node corresponds to one segment of the ring so that tasks are distributed reasonably among the nodes, and virtual nodes are added to improve the robustness of the crawler system; it cannot allocate tasks according to heat in order to track topics.

Disclosure of Invention

In view of the above problems, the invention provides a method and a system for scheduling subject term search crawlers based on a combined prediction method. By predicting the future heat of each topic and scheduling the crawlers corresponding to high-heat topics accordingly, more high-heat data is acquired, and high-heat topics are tracked preferentially under limited resources.

In order to solve the technical problems, the invention adopts the following technical scheme:

A subject term search crawler scheduling system based on a combined prediction method comprises:

the first acquisition module, which uses a crawler to acquire data from a data source according to the keywords set by the user;

the data preprocessing module is used for preprocessing the data acquired by the first acquisition module;

a vector space model for changing the preprocessed text data into a multidimensional vector composed of feature word weights;

the clustering module is used for carrying out clustering processing on the text data, and each cluster obtained is used as a theme;

the subject term extraction module is used for respectively extracting the subject term of each cluster and storing the subject term into the database;

the second acquisition module is used for extracting the subject words in the database and acquiring data from a data source by utilizing the crawler according to the extracted subject words;

the analysis module is used for analyzing the data acquired by the second acquisition module to obtain the corresponding forwarding quantity, praise quantity and comment quantity of each text data;

the real heat index weight calculation module is used for taking the forwarding quantity, the praise quantity and the comment quantity analyzed by the analysis module as the real heat index of each piece of text data and calculating each index weight;

the updating module is used for processing the data acquired by the second acquisition module through the data preprocessing module, the vector space model and the clustering module to obtain characteristic words of each cluster, and taking part of the characteristic words contained in each cluster as subject words and updating the subject words in the original database;

the real heat value calculation module calculates the real heat of each text data by using the corresponding forwarding quantity, praise quantity and comment quantity of each text data analyzed by the analysis module and the index weight obtained by the real heat index weight calculation module, and then calculates the average value of the real heat of the text data contained in each topic according to the topics obtained by the clustering module, wherein the obtained average value result is used as the real heat value of each topic;

the predicted heat value module is used for predicting the predicted heat value of each subject term in the next period;

and the CPU distribution module is used for giving corresponding upper limit of CPU occupancy rate to the crawlers corresponding to each theme according to the predicted heat value by the server and starting the corresponding number of processes.

The method for scheduling the subject term search crawler based on the combined prediction method comprises the following steps:

step 1, setting keywords, and acquiring data from a data source by utilizing a crawler according to the keywords;

step 2, preprocessing data, converting the preprocessed data into multidimensional vectors formed by weights of feature words, dividing the multidimensional vectors into clusters, manually marking each cluster as a theme, and storing the feature words contained in each theme as the theme words into a database to form a theme word database;

step 3, extracting the subject terms from the subject term database, writing crawlers that acquire data from the data source according to the subject terms, parsing the forwarding quantity, the praise quantity and the comment quantity from the crawled data as the real heat indexes, and determining the weights of the real heat indexes by the analytic hierarchy process;

step 4, calculating the real heat value of each subject term according to the data and the real heat index obtained in the step 3;

step 5, predicting the predicted heat value of each subject term in the next period by using a combined prediction method;

step 6, after the data obtained in the step 3 are processed in the steps 1 to 2, updating a subject term database;

step 7, assigning a weight value to the crawler corresponding to each subject term in the updated subject term database according to the predicted heat value, the server adjusting the CPU occupancy rate upper limit and the number of started processes of each crawler according to the weight value, and repeating steps 3 to 7.

Further, in the step 2, the method further includes the following steps:

step 21, data cleaning, namely removing characters except Chinese characters in the data by using a regular expression;

step 22, Chinese word segmentation, namely segmenting each acquired text into words;

step 23, stop word removal, namely removing the stop words from the words segmented in step 22;

and step 24, using a vector space model, changing one piece of data into a multidimensional vector formed by the weights of the feature words.
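Steps 21 to 23 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the stop-word set, the function names and the idea of passing the word segmenter in as a callable are all assumptions made here, and the character range used for "Chinese characters" is the basic CJK Unified Ideographs block only.

```python
from __future__ import annotations

import re
from typing import Callable

# Hypothetical stop-word list for illustration; a real system would load a full dictionary.
STOP_WORDS = {"的", "了", "是"}


def clean_text(raw: str) -> str:
    """Step 21: keep only Chinese characters (CJK Unified Ideographs U+4E00 to U+9FFF)."""
    return re.sub(r"[^\u4e00-\u9fff]", "", raw)


def remove_stop_words(words: list[str]) -> list[str]:
    """Step 23: drop every word that appears in the stop-word list."""
    return [w for w in words if w not in STOP_WORDS]


def preprocess(raw: str, segment: Callable[[str], list[str]]) -> list[str]:
    """Steps 21 to 23; `segment` is any word segmenter supplied by the caller (step 22)."""
    return remove_stop_words(segment(clean_text(raw)))
```

For a quick check, `preprocess("今天的天气!", list)` uses Python's `list` as a stand-in segmenter that splits the cleaned text into single characters; a real deployment would pass a proper Chinese word segmenter instead.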

Further, in the step 2, the method further includes the following steps:

using an agglomerative cluster analysis method to divide the data into clusters: each piece of data initially forms its own cluster, the clusters with the highest similarity under the similarity measure are merged first, and merging continues in order of decreasing similarity; the similarity between clusters decreases as clusters are merged, and merging stops when the similarity threshold is reached. Each resulting cluster is a topic, and the feature words contained in each topic are stored in a database as subject terms to form the subject term database.

Further, in step 5, the combined prediction method comprises an exponential smoothing method, a back propagation neural network and an entropy method: the predicted heat value of each topic is calculated separately by the exponential smoothing method and by the back propagation neural network, and the two results are weighted according to the entropy method to obtain the combined predicted heat value of the topic.

Further, the exponential smoothing method uses double (second-order) exponential smoothing to obtain the predicted heat value.
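As an illustration of second-order exponential smoothing, the sketch below uses Brown's linear form, which is one common reading of the term; the text does not give its exact recurrences, so the initialisation (both smoothed series start at the first observation) and the function name are assumptions.

```python
def double_exp_forecast(series, alpha, d=1):
    """Forecast the value d periods ahead with Brown's double exponential smoothing.

    series: observed heat values in period order; alpha: smoothing coefficient, 0 < alpha < 1.
    Assumption: both smoothed series are initialised with the first observation.
    """
    s1 = s2 = series[0]
    for x in series[1:]:
        s1 = alpha * x + (1 - alpha) * s1   # first smoothing
        s2 = alpha * s1 + (1 - alpha) * s2  # second smoothing of the smoothed series
    level = 2 * s1 - s2
    trend = alpha / (1 - alpha) * (s1 - s2)
    return level + trend * d
```

A small coefficient such as 0.3 reacts slowly to recent changes and suits stable series; a larger one such as 0.8 follows recent values more closely.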

Further, the back propagation neural network is trained on sample data, continuously correcting the network weights and thresholds so that the error function decreases along the negative gradient direction, until the error falls to the allowed limit or the preset number of iterations is reached, which yields the weights of the input layer and the output layer; finally, the real heat values of the preceding periods are input into the trained back propagation neural network to obtain the predicted heat value.

Further, the entropy method determines the index weights from the entropy of each index's observed values, the entropy being used to judge the degree of dispersion of the predicted heat values.
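The text does not spell out how the entropy values are turned into combination weights. The sketch below shows one commonly used variant of entropy-based forecast combination, offered only as an assumption: the relative errors of each method are normalised, their entropy is computed, and a method whose errors vary more receives a smaller weight. The function names and the input layout are hypothetical.

```python
import math


def entropy_weights(errors):
    """errors[i][t]: relative error of method i in period t (needs >= 2 methods, >= 2 periods).

    Returns one weight per method; the weights sum to 1.
    """
    m, n = len(errors), len(errors[0])
    variation = []
    for row in errors:
        total = sum(row)
        probs = [e / total for e in row]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        variation.append(1 - entropy)  # degree of variation of this method's errors
    total_variation = sum(variation)
    if abs(total_variation) < 1e-12:   # all methods equally stable: fall back to equal weights
        return [1 / m] * m
    return [(1 - v / total_variation) / (m - 1) for v in variation]


def combine(forecasts, weights):
    """Weighted sum of the individual forecasts for one topic and one period."""
    return sum(w * f for w, f in zip(weights, forecasts))
```

With two methods, this variant gives each method a weight proportional to the other method's degree of variation, so the more erratic predictor contributes less to the combined heat value.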

Further, in step 7, CPU resources are allocated to the crawlers using a multi-process approach: a weight value is assigned to the crawler corresponding to each subject term in the updated subject term database according to the predicted heat value, and the server adjusts the CPU upper limit and the number of started processes of each crawler according to the weight value.

Compared with the prior art, the invention has the following beneficial effects: the method integrates agglomerative hierarchical clustering, exponential smoothing, a back propagation neural network and the entropy method, and provides a way of calculating the predicted heat value of a topic. The future heat of each topic is predicted and the crawlers corresponding to high-heat topics are scheduled accordingly, so that more high-heat data is acquired. High-heat topics are thus tracked preferentially and effectively under limited resources, more data related to them is obtained, and their heat is grasped more comprehensively and promptly.

Drawings

FIG. 1 is a flow chart of the present embodiment;

FIG. 2 is a graph showing the distribution of true heat values in the present embodiment;

FIG. 3 is a fitted curve of true heat;

FIG. 4 shows the real heat trend of each topic over periods 1 to 7 in this embodiment;

FIG. 5 is a graph showing the data volume and heat value of each topic in period 8;

FIG. 6 is a graph of the heat of the first category of topics versus the period number;

FIG. 7 is a graph of the heat of the second category of topics versus the period number;

FIG. 8 is a graph of the heat of the third category of topics versus the period number;

FIG. 9 is a graph of the heat of the fourth category of topics versus the period number;

FIG. 10 is a graph of the heat of the fifth category of topics versus the period number;

FIG. 11 is a graph of the heat of the sixth category of topics versus the period number;

FIG. 12 is a graph of the heat of the seventh category of topics versus the period number;

FIG. 13 is a graph of the heat of the eighth category of topics versus the period number;

FIG. 14 is a graph of the data volume of each category of topics versus the period number.

Detailed Description

The invention is further described below with reference to the accompanying drawings. Embodiments of the present invention include, but are not limited to, the following examples.

The system comprises a first acquisition module, a data preprocessing module, a vector space model, a clustering module, a second acquisition module, a real heat index weight calculation module, a real heat value calculation module, an updating module, a predicted heat value module and a CPU distribution module.

Further, the first acquisition module uses a crawler to acquire text data from the data source according to the keywords set by the user; in this embodiment the data source is a microblog platform.

Further, the data preprocessing module preprocesses the data acquired by the first acquisition module; preprocessing comprises data cleaning, Chinese word segmentation and stop word removal. Data cleaning mainly uses a regular expression to remove all characters other than Chinese characters; characters such as @ and #, as well as emoticons, are cleaned. Chinese word segmentation splits a Chinese character sequence, which contains no spaces, into meaningful words: a prefix dictionary enables efficient word-graph scanning, a directed acyclic graph of all possible word formations of the Chinese characters in a sentence is generated, and dynamic programming then finds the maximum-probability path and hence the best segmentation, so that each piece of acquired text data is split into words. Stop word removal deletes words that cannot represent text features by means of a constructed stop word list.

Further, the vector space model is a commonly used text representation model. The words in any text data can be separated by a word segmentation technique, and each word is taken as a component in segmentation order, so that the text data is represented as a word vector. More generally, let the text data set be $T = \{t_1, t_2, \ldots, t_n\}$, where $t_i$ ($i = 1, 2, \ldots, n$) is a piece of text data. Through de-duplication, word frequency thresholding, stop word removal and similar processing, the word vectors of all text data can be combined into a new word vector, called the feature word library (or feature word space) of the text data set $T$. Based on whether, or how many times, the words of a piece of text data appear in the feature word space, any text data $t_i$ can be represented as a vector in the feature word space, which is called the vector space model of the text data. Given the feature word library $K = (k_1, k_2, \ldots, k_m)$, where each $k_j$ ($j = 1, 2, \ldots, m$) is a feature word, the vector space model of the text data in this embodiment is expressed as

$$t_i = (w_{i1}, w_{i2}, \ldots, w_{im}), \qquad (1)$$

where each $w_{ij}$ is calculated as follows:

[formula (2)]

Here $r_{ij}$ is the number of occurrences of the feature word $k_j$ in the text data $t_i$, $k_j \in t_i$ denotes that the feature word $k_j$ appears in the text data $t_i$, and $|T|$ is the number of elements in the set $T$, i.e. $|T| = n$; and, for any feature word $k_j$ in this embodiment, […]
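Formula (2) itself is not legible in this text. The quantities it is defined with (the occurrence count $r_{ij}$, the set size $|T|$, and the texts that contain $k_j$) are those of a TF-IDF-style weight, so the sketch below assumes a plain TF-IDF weighting purely for illustration; the function name, the `min_df` threshold and the exact logarithmic form are assumptions, not the patent's formula.

```python
import math
from collections import Counter


def build_vectors(docs, min_df=1):
    """docs: list of token lists (already segmented, stop words removed).

    Returns (feature_words, vectors) where vectors[i][j] is an assumed TF-IDF weight:
    occurrences of word j in text i times log(|T| / number of texts containing word j).
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per text
    features = sorted(w for w, c in doc_freq.items() if c >= min_df)
    n = len(docs)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[k] * math.log(n / doc_freq[k]) for k in features])
    return features, vectors
```

Under this assumed weighting, a word that occurs in every text receives weight 0, which matches the intuition that such words do not help to separate topics.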

Further, the clustering module performs cluster analysis on the text data set $T$ based on the vector space model of the text data, dividing $T$ into different clusters. Cluster analysis rests on a similarity measure between the clustered objects; in this embodiment the similarity measure between two pieces of text data is the cosine similarity $d(t_i, t_l)$, calculated as follows:

[formula (3)]

where $w_{ij}$ is calculated according to formula (2), and $t_i$ and $t_l$ are two pieces of text data in the set $T$. $d(t_i, t_l)$ lies in the range 0 to 1; a smaller $d(t_i, t_l)$ means that $t_i$ and $t_l$ are less similar, and a larger $d(t_i, t_l)$ means that they are more similar. Based on this cosine similarity measure, this embodiment uses agglomerative hierarchical clustering: according to the cosine similarity between text data and a set similarity threshold, the text data with the highest similarity are merged into clusters first, until all text data have been merged into clusters.
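The clustering described above can be sketched as follows, assuming cosine similarity between text vectors and a similarity threshold as the stopping rule. The text does not state how similarity between multi-member clusters is computed, so average linkage is an assumption here, as are the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two weight vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def agglomerative(vectors, threshold):
    """Merge the most similar clusters first; stop when the best similarity < threshold.

    Returns clusters as lists of indices into `vectors`. Average linkage is assumed.
    """
    clusters = [[i] for i in range(len(vectors))]

    def similarity(a, b):
        return sum(cosine(vectors[i], vectors[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = similarity(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

Each returned cluster corresponds to one topic. The pairwise search is quadratic per merge, which is adequate for a few thousand texts but would need an optimised library for larger sets.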

Further, the subject term extraction module treats each cluster produced by the clustering module as a topic. Let the cluster set be $T' = \{T_1, T_2, \ldots, T_p\}$, where each cluster $T_q$ ($q = 1, 2, \ldots, p$) is a topic. According to the vector space model of formula (1), the topic $T_q$ can also be represented as the following matrix:

[matrix]

According to formula (2), $w_{ij}$ ($i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m$) can also be understood as the weight of the feature word $k_j$ in the text data $t_i$. Accordingly, the weight $W_{qj}$ of the feature word $k_j$ in the topic $T_q$ is calculated as follows:

[formula]

where $|T_q|$ is the number of pieces of text data in the topic $T_q$. Based on the weights of the feature words in the topic $T_q$, a weight threshold $\beta_q$ is set; the feature words in $\{k_1, k_2, \ldots, k_m\}$ whose weight in $T_q$ is less than $\beta_q$ are excluded, and the remaining feature words, i.e. those $k_j$ with $W_{qj} \ge \beta_q$, are taken as the subject terms of the topic $T_q$ and stored in the subject term database.

Further, the second acquisition module extracts the subject terms of the topic $T_q$ from the subject term database and uses a crawler to acquire data from the data source according to the extracted subject terms. Let the text data set acquired by the crawler according to the subject terms of $T_q$ be $T'_q = \{t'_1, t'_2, \ldots, t'_{n'}\}$, where $n'$ is the number of pieces of text data crawled according to the subject terms of $T_q$;

further, the parsing module parses the forwarding quantity, the praise quantity and the comment quantity from each piece of text data $t'_{i'}$ ($i' = 1, 2, \ldots, n'$) acquired by the second acquisition module;

further, the real heat index weight calculation module takes the forwarding quantity, the praise quantity and the comment quantity as the real heat calculation indexes and determines the index weights $\omega_{i''}$ ($i'' = 1, 2, 3$) by a weight analysis method.

Further, in this embodiment, an analytic hierarchy process is used to determine the index weight;

further, the real heat value calculation module calculates the real heat value of the topic $T_q$ in period $\tau$ from the text data acquired by the second acquisition module, the data parsed by the parsing module and the weights obtained by the real heat index weight calculation module.

Let the forwarding quantity, praise quantity and comment quantity of the text data $t'_{i'}$ be $b_{i'1}$, $b_{i'2}$ and $b_{i'3}$ respectively, where $t'_{i'}$ ($i' = 1, 2, \ldots, n'$) is text data acquired by the crawler in period $\tau$. The real heat value of $t'_{i'}$ is calculated by formula (7), and the real heat value of the topic $T_q$ in period $\tau$ is calculated by formula (8):

[formula (7)]

[formula (8)]
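Formulas (7) and (8) are not legible in this text. The description elsewhere states that the heat of each text is computed from its forwarding, praise and comment quantities with the index weights, and that the topic heat is the mean over the topic's texts. The sketch below assumes the per-text heat is a plain weighted sum, which is an assumption; the function names are hypothetical, and the weights are the ones reported in the embodiment.

```python
# Weights reported in the embodiment: forwarding 0.1596, praise 0.1390, comment 0.7012.
WEIGHTS = (0.1596, 0.1390, 0.7012)


def text_heat(forwards, likes, comments, weights=WEIGHTS):
    """Heat of one text as a weighted sum of its three indexes (assumed form of formula (7))."""
    w_forward, w_like, w_comment = weights
    return w_forward * forwards + w_like * likes + w_comment * comments


def topic_heat(texts, weights=WEIGHTS):
    """Real heat of a topic in one period: mean heat of its texts.

    texts: list of (forwards, likes, comments) tuples crawled in that period.
    """
    return sum(text_heat(f, l, c, weights) for f, l, c in texts) / len(texts)
```

For example, a topic with one text at (10, 20, 5) and one text with no interactions gets half the heat of the first text.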

Further, the updating module updates the subject term database after the text data acquired by the second acquisition module has been processed by the data preprocessing module and the vector space model.

Further, the predicted heat value module obtains the predicted heat value of each topic from the real heat values obtained by the real heat value calculation module. For the topic $T_q$, the correspondence between the period number $\tau$ and the real heat value can be constructed as shown in Table 1 below:

TABLE 1

The distribution of the real heat value over the period number $\tau$ is shown in FIG. 2. From this distribution, a real heat value curve can be fitted; the fitted curve for period number $\tau = 7$ is shown in FIG. 3. An existing prediction method is then used to predict the heat value of the topic $T_q$ in period $\tau + d$.

Further, in this embodiment the predicted heat value of the topic is calculated by both the exponential smoothing method and the back propagation neural network, and the two results are then weighted according to the entropy method to obtain the combined predicted heat value of the topic $T_q$.

Further, the CPU distribution module assigns a weight value to the crawler corresponding to each topic according to the combined predicted heat value, and the server adjusts the CPU occupancy rate upper limit of the crawler corresponding to the topic $T_q$ according to the weight value.

The crawlers are deployed on the same Linux server, and the CPU upper limit of the crawler corresponding to the topic $T_q$ is enforced as a constraint by a CPU limit command on the Linux system. While a crawler stays below its specified CPU upper limit, no constraint is applied to it; when it is about to exceed the upper limit, the server dynamically adjusts it so that its usage floats around the upper limit. The CPU usage upper limit of the crawler responsible for the topic $T_q$ in period $\tau + d$ is calculated by formula (18):

[formula (18)]

In the formula, the left-hand side is the CPU upper limit of the crawler corresponding to the topic $T_q$ in period $\tau + d$, $M$ is the CPU resource percentage available to the server when idle, $C'$ is the CPU percentage currently in use on the server, the heat term is the combined predicted heat value of the topic $T_q$ in period $\tau + d$, and $p$ is the total number of topics. If the CPU consumed by the crawler corresponding to the topic $T_q$ cannot reach the upper limit, additional processes can be started so that the crawler makes full use of its CPU resources. The number of processes started for the crawler corresponding to the topic $T_q$ in period $\tau + d$ is calculated by formula (19):

[formula (19)]

The quantities in the formula are the CPU percentage occupied by the crawler corresponding to the topic $T_q$ in period $\tau + d$, the multiple of the remaining CPU resources of that crawler in period $\tau + d$ relative to the CPU resources it occupies, and the number of processes the crawler needs to start in period $\tau + d$. The formula prescribes a lower bound on the number of processes, so that at least one process is running in the crawler corresponding to each topic, and an upper bound, to prevent problems such as insufficient memory caused by starting too many processes.
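Formulas (18) and (19) are not legible in this text, so the sketch below is only one plausible reading of the surrounding definitions: the spare CPU budget $M - C'$ is shared among the topics in proportion to their combined predicted heat values, with a 1% floor as mentioned in the embodiment, and the process count is the cap divided by the CPU one process consumes, clamped to a range. The function names, `per_process_pct` and `max_procs` are assumptions.

```python
import math


def cpu_caps(heats, idle_capacity, in_use, floor_pct=1.0):
    """CPU upper limit (percent) per topic crawler, proportional to predicted heat.

    heats: combined predicted heat value of each topic; idle_capacity: M; in_use: C'.
    """
    budget = idle_capacity - in_use
    total = sum(heats)
    return [max(floor_pct, budget * h / total) for h in heats]


def process_count(cap, per_process_pct, max_procs):
    """Number of processes to start: at least 1, at most max_procs (assumed bounds)."""
    return max(1, min(max_procs, math.floor(cap / per_process_pct)))
```

With the embodiment's figures of 1600% idle capacity and about 1100% already in use, the budget to divide is 500%; a topic holding half of the total predicted heat would then receive a cap of 250%.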

Further, if the current period number exceeds the predicted period number τ+d, the steps 3 to 7 are repeated.

As shown in fig. 1, the method for scheduling the subject term search crawler based on the combined prediction method comprises the following steps:

step 1, setting keywords, and acquiring text data from a data source by utilizing a crawler according to the keywords;

step 2, preprocessing the text data, converting the preprocessed text data into multidimensional vectors formed by weights of feature words, dividing the multidimensional vectors into clusters, wherein each cluster is called a theme, and storing part of feature words contained in each theme as a subject word into a database to form a subject word database;

step 3, writing crawlers with corresponding numbers according to the number of the topics, extracting the topic words in a topic word database, acquiring data from a data source by the crawlers according to the topic words, and establishing a real heat index of the crawled text data according to the forwarding quantity, the praise quantity and the comment quantity;

step 4, calculating the true heat value of each theme according to the data and the true heat index obtained in the step 3;

step 5, predicting the predicted heat value of each theme in the next period by using a combined prediction method;

step 6, after the text data obtained in the step 3 are processed in the steps 1 to 2, updating a subject word database;

step 7, assigning a weight value to the crawler corresponding to each topic in the updated subject term database according to the predicted heat value, the server adjusting the CPU occupancy rate upper limit and the number of started processes of each crawler according to the weight value, and repeating steps 3 to 7.

Examples

According to a keyword library initially set by the user, the crawlers corresponding to the topics crawl real-time text data from Sina Weibo and store it. 3000 pieces of text data are extracted as training samples and subjected to Chinese word segmentation and stop word removal; the vector space is constructed using formula (1), the feature item weights are obtained using formula (2), hierarchical clustering is performed using formula (3), and the hierarchical clustering tree is cut using 100 as the threshold. The stop words are taken from the Harbin Institute of Technology stop word list, and a dictionary of words that must not be split is added manually. The resulting topics are shown in Table 2 below:

TABLE 2

The table above contains eight topics. According to the subject terms, crawlers crawl the corresponding text data on Weibo and parse the forwarding quantity, praise quantity and comment quantity of each piece of text data; during this process, the CPU usage upper limit of each crawler is set to one eighth of the remaining CPU percentage of the server. In this embodiment, the weights of the forwarding quantity, the praise quantity and the comment quantity are determined by the analytic hierarchy process, for which a judgment matrix J is first constructed:

[judgment matrix J]

Using the arithmetic average method, the weights of the comment quantity, the forwarding quantity and the praise quantity are approximately 0.7012, 0.1596 and 0.1390 respectively. Taking one day as a period, data is crawled for one week, and the real heat value of each topic in each period (day) is obtained according to formulas (7) and (8). The heat trend of each topic over periods 1 to 7 is shown in FIG. 4, where the horizontal axis is the period number and the vertical axis is the real heat value. The exponential smoothing predicted heat value of each topic for period 8 is then calculated by double exponential smoothing. Because the real heat of every topic except the seventh fluctuates little, the smoothing coefficient for those topics should be a small value between 0 and 1; in this embodiment 0.3 is used for them, and 0.8 for the seventh topic.
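The arithmetic average method used with the analytic hierarchy process can be sketched as follows: each column of the judgment matrix is normalised to sum to 1, and the weight of an index is the mean of its row in the normalised matrix. The judgment matrix J of the embodiment is not legible in this text, so the matrix in the usage note is hypothetical and does not reproduce the reported weights.

```python
def ahp_weights(matrix):
    """Index weights from a pairwise judgment matrix by the arithmetic average method.

    matrix[i][j] states how much more important index i is than index j.
    """
    n = len(matrix)
    col_sums = [sum(row[j] for row in matrix) for j in range(n)]
    # Normalise each column, then average across each row.
    return [sum(row[j] / col_sums[j] for j in range(n)) / n for row in matrix]
```

For the hypothetical, perfectly consistent matrix `[[1, 2, 4], [1/2, 1, 2], [1/4, 1/2, 1]]` the weights come out in the ratio 4 : 2 : 1, i.e. 4/7, 2/7 and 1/7.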

When predicting topic heat with the back propagation (BP) neural network, the hidden layer and output layer use the ReLU activation function, the loss function is the cross-entropy loss, and the optimizer is Adam; there is 1 hidden layer with 3 nodes, the input layer has 3 nodes, the output layer has 1 node, and the learning rate is 0.01. The earlier real heat value sequences of topics one to eight are divided into groups of 4 real heat values, each group serving as one sample: the last heat value of the group is the output and the remaining values are the input. Each sample is learned in order to update the connection weights between the input layer and the hidden layer and between the hidden layer and the output layer. The maximum number of training iterations is set to 1000, and learning ends when the error falls within the allowed limit of 0.0001. The trained network is then used to predict the heat value of each topic in period 8: the real heat values of periods 5 to 7 are used as input, and the trained network outputs the predicted heat value of each topic for period 8. Finally, the two methods are combined by the entropy method, i.e. the period 8 predicted heat value obtained by double exponential smoothing and the period 8 predicted heat value obtained by the BP neural network are weighted and summed; the resulting period 8 combined predicted heat values (the entropy method predicted heat values, rounded) are shown in Table 3 below:

TABLE 3

The entropy and weight of the entropy method are shown in table 4 below:

TABLE 4
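The sample construction described for the BP neural network (groups of 4 real heat values, the last one as output and the first three as input) can be sketched as below. Whether consecutive groups overlap is not stated; a sliding window with step 1 is assumed here, which gives four samples per topic from seven periods. The function name is hypothetical.

```python
def make_samples(series, window=4):
    """Split a heat value sequence into (inputs, target) training samples.

    Each sample covers `window` consecutive periods: the last value is the target,
    the preceding values are the inputs. Overlapping windows are an assumption.
    """
    samples = []
    for start in range(len(series) - window + 1):
        chunk = series[start:start + window]
        samples.append((chunk[:-1], chunk[-1]))
    return samples
```

For prediction, the last three real heat values (periods 5 to 7 in the embodiment) form the input for period 8, i.e. `series[-3:]`.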

The server used in this embodiment has 16 CPUs, so that 1600% CPU is available when the server is idle; other processes already occupy about 1100% CPU. The CPU upper limit percentage (at least 1%) that the server allocates to each crawler in period 8 is calculated from the prediction results and formula (18), and the number of processes started by each crawler in period 8 is obtained from formula (19). The CPU upper limits and process counts are shown in Table 5 below:

TABLE 5

According to the process counts for period 8 given in the table above, each subject term search crawler crawls data from Weibo according to the subject terms contained in its topic, and the amount of data crawled by each crawler is counted. The amounts crawled without scheduling and with scheduling according to the predicted heat value are compared in FIG. 5, where the abscissa is the topic category and the ordinate is the amount of data crawled according to the subject terms of each topic and the predicted heat value; the first and second columns represent the data volume of each topic without and with scheduling respectively, and the third column represents the heat value of each topic in period 8.

With scheduling, the data volumes obtained for the first, fifth, seventh and eighth topic categories, which have higher heat in period 8, are 3572, 4026, 2338 and 3274 respectively; without scheduling they are 762, 1285, 594 and 827 respectively. For the topics with higher heat in the next period, the crawlers run in this embodiment therefore obtain in total about 380.9% of the data volume obtained without scheduling (13210 versus 3468). The data volume obtained by the crawler of each scheduled topic is positively correlated with the predicted heat value: the higher the predicted heat value, the more CPU resources the crawler receives and the more data it obtains.
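The 380.9% figure matches the ratio of the summed data volumes, which can be checked directly from the eight numbers given above. Only the variable names below are assumptions.

```python
# Data volumes for the first, fifth, seventh and eighth topic categories.
scheduled = [3572, 4026, 2338, 3274]
unscheduled = [762, 1285, 594, 827]

# Ratio of the totals, expressed as a percentage of the unscheduled volume.
ratio_pct = 100.0 * sum(scheduled) / sum(unscheduled)

# Relative increase over the unscheduled volume.
increase_pct = ratio_pct - 100.0

print(sum(scheduled), sum(unscheduled))  # 13210 3468
print(round(ratio_pct, 1))               # 380.9
print(round(increase_pct, 1))            # 280.9
```

Read this way, the scheduled crawlers collect roughly 3.8 times as much data for these four topics, which is an increase of about 280.9% over the unscheduled run.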

When the current period number exceeds period 8, the process is repeated: the data of each topic in the most recent seven periods are extracted, the subject terms are updated, the real heat values are calculated, the heat of period 9 is predicted, and the crawlers are scheduled. Using this process, the data volume, predicted heat value and real heat value of each topic in periods 10 to 17 are obtained. The relation between the real and predicted heat values of each topic and their change over time are shown in Figs. 6 to 13.

In Figs. 6 to 13 the abscissa is the period number and the ordinate is the heat value; the line marked with crosses represents the predicted heat value and the line marked with dots represents the real heat value of the corresponding period. The average relative error of each topic over periods 9 to 17 is calculated according to the average relative error formula:

$$\bar{e} = \frac{1}{n}\sum_{t=1}^{n}\frac{\left|\hat{y}_t - y_t\right|}{y_t}$$

where $\hat{y}_t$ is the predicted heat value, $y_t$ is the true heat value, and $n$ is the number of periods.
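For illustration, the average relative error can be computed as below, assuming the usual definition: the mean over periods of the absolute difference between predicted and true heat, divided by the true heat. The function name `mean_relative_error` and the sample values are hypothetical.

```python
from typing import Sequence


def mean_relative_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean of |predicted - actual| / actual over all periods (assumed definition)."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical predicted and real heat values for three periods of one topic.
error = mean_relative_error([110.0, 90.0, 100.0], [100.0, 100.0, 100.0])
```

The true heat values are assumed to be strictly positive, since a zero true value would make the relative error undefined.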

The calculation results are shown in table 6 below:

TABLE 6

As shown in Figs. 6 to 13 and the table above, the combined predicted heat value in each period is close to the real heat value, so scheduling the crawler corresponding to each topic's feature words by the combined predicted heat value is reasonable. The change in the data volume of each topic with the period number is shown in Fig. 14, where the abscissa is the period number and the ordinate is the amount of data crawled by the crawler using the subject terms contained in each topic.

Taking the fifth and seventh topic categories as examples, Figs. 10 and 12 show that the real heat of the fifth category is far higher than that of the seventh in every period (about 20000 for the fifth category and about 4000 for the seventh). Correspondingly, as shown in Fig. 14, the data volume obtained for the fifth category under scheduling by its predicted value is far higher than that for the seventh, which is consistent with the goal that high-heat topics should obtain more data. The combined prediction method thus allocates more resources to the subject-term search crawlers of high-heat topics, so that those topics obtain more data and the aim of preferentially tracking high-heat topics is achieved.

The above is an embodiment of the present invention. The embodiments and their specific parameters serve only to describe the inventors' verification process clearly and are not intended to limit the scope of protection of the invention, which is defined by the claims; all equivalent structural changes made using the description and drawings of the invention likewise fall within the scope of protection of the invention.

Claims

1. A subject-term search crawler scheduling system based on a combined prediction method, characterized by comprising:

a second acquisition module for extracting the subject terms from the database and, according to the extracted subject terms, using subject-term search crawlers to acquire data from a data source;

a real heat value calculation module, which calculates the real heat of each piece of text data from the forwarding count, like count and comment count of that piece of text data as parsed by the analysis module, together with the index weights obtained by the real heat index weight calculation module, and then, according to the topics obtained by the clustering module, calculates the average real heat of the text data contained in each topic, the resulting average being taken as the real heat value of that topic;

a predicted heat value module for predicting the heat value of each subject term in the next period, the predicted heat value module obtaining the predicted heat value of each topic from the real heat value of each topic obtained by the real heat value calculation module; and

a CPU distribution module for assigning, according to the predicted heat value, a corresponding upper limit of CPU occupancy to the subject-term search crawler of each topic and starting the corresponding number of processes, wherein the CPU distribution module assigns a weight value to the crawler of each topic according to the combined predicted heat value, and the server adjusts the upper limit of CPU occupancy of the crawler of each topic according to the weight value.

2. The method for scheduling the subject term search crawler based on the combined prediction method is characterized by comprising the following steps:

step 1, setting keywords, and using subject-term search crawlers to obtain data from a data source according to the keywords;

step 2, preprocessing the data, converting the preprocessed text data into multidimensional vectors formed by the weights of feature words, dividing the multidimensional vectors into clusters, defining each cluster as a topic, and storing some of the feature words contained in each topic in a database as subject terms;

step 3, extracting the subject terms from the database, writing a corresponding number of subject-term search crawlers according to the number of subject terms to obtain topic data from the data source, parsing the forwarding count, like count and comment count from the crawled data as real heat indexes, and determining the weight of each index by the analytic hierarchy process;

step 4, calculating the real heat value of each piece of text data from its forwarding count, like count and comment count and the real heat index weights obtained in step 3, and then, according to the topics obtained in step 2, calculating the average real heat of the text data contained in each topic, the resulting average being taken as the real heat value of that topic;

step 5, fitting the real heat values of each topic obtained in step 4 to obtain a curve of the real heat value of each topic against the period number, and obtaining the predicted heat value of each topic in the next period by the combined prediction method;

step 6, processing the data obtained in step 3 through steps 1 to 2, extracting new subject terms and updating the database;

step 7, updating the weight value of the corresponding subject-term search crawler according to the predicted heat value of the subject term, the server adjusting the upper limit of CPU occupancy of the subject-term search crawler corresponding to each subject term according to the weight value, and repeating steps 3 to 7.

3. The method for scheduling a subject term search crawler based on a combined prediction method according to claim 2, wherein step 2 further comprises the following steps:

step 23, removing the stop words from the words segmented in step 22;

step 24, converting the text data into a multidimensional vector formed by the feature word weights by using a vector space model.

4. The method for scheduling a subject term search crawler based on a combined prediction method according to claim 2, wherein: in the step 2, the method further includes the following steps:

using cluster analysis, treating each piece of data initially as its own cluster, merging the data with the highest similarity according to a similarity measure, and continuing to merge clusters in descending order of similarity, the similarity between clusters decreasing as clusters are merged, until a similarity threshold is reached; each resulting cluster is called a topic, and the feature words contained in each topic are stored in a database as subject terms to form a subject-term database.

5. The method for scheduling a subject term search crawler based on a combined prediction method according to claim 2, wherein: in the step 5, the combined prediction algorithm includes an exponential smoothing method, a back propagation neural network and an entropy value method, the prediction heat value of the subject is calculated by using the exponential smoothing method and the back propagation neural network respectively, and then a weight is given to the calculation result of the prediction heat value of the subject by using the exponential smoothing method and the back propagation neural network according to the entropy value method, so as to obtain the combined prediction heat value of the subject.

6. The method for scheduling a subject term search crawler based on a combined prediction method according to claim 5, wherein: the exponential smoothing method uses secondary exponential smoothing to obtain the predicted heat value.

7. The method for scheduling a subject term search crawler based on a combined prediction method according to claim 5, wherein: the back propagation neural network, through training on sample data, continuously corrects the network weights and thresholds so that the error function descends along the negative gradient direction, until the error function falls to the threshold or the preset number of iterations is reached, thereby obtaining the weights of the input layer and the output layer; finally, the early-period true values are input into the trained back propagation neural network to obtain the predicted heat value.

8. The method for scheduling a subject term search crawler based on a combined prediction method according to claim 5, wherein: the entropy method determines index weights according to the entropy provided by the observed values of each index; the degree of dispersion of the two groups of predicted heat values is obtained from the entropy, and corresponding weights are accordingly assigned to the two predicted heat values, which are then summed.

9. The method for scheduling a subject term search crawler based on a combined prediction method according to claim 5, wherein: in the step 7, a multi-process method is adopted to distribute the CPU of the crawler, the updated weight value of the crawler corresponding to each subject word in the subject word database is given according to the predicted heat value, and the server adjusts the CPU upper limit value and the process starting number of the crawler corresponding to each subject word according to the weight value.