CN113536085A

CN113536085A - Topic word search crawler scheduling method and system based on combined prediction method

Info

Publication number: CN113536085A
Application number: CN202110701204.8A
Authority: CN
Inventors: 陈智超; 裴峥; 孔明明
Original assignee: Xihua University
Current assignee: Xihua University
Priority date: 2021-06-23
Filing date: 2021-06-23
Publication date: 2021-10-22
Anticipated expiration: 2041-06-23
Also published as: CN113536085B

Abstract

The invention relates to the technical field of crawler scheduling, and in particular to a topic word search crawler scheduling method and system based on a combined prediction method. The system comprises a first acquisition module, a data preprocessing module, a vector space model, a clustering module, a topic word extraction module, a second acquisition module, a real heat index weight calculation module, a real heat value calculation module, an update module, a heat value prediction module and a CPU allocation module. The method comprises the following steps: step 1, acquiring data from a data source; step 2, preprocessing the data; step 3, obtaining topic data and calculating the real heat indexes and the index weights; step 4, calculating the real heat value; step 5, calculating the predicted heat value of each topic in the next period; step 6, extracting new topic words and updating the database; and step 7, allocating CPU occupancy upper limits so that more data related to high-heat topics is acquired. The method achieves preferential tracking of high-heat topics under limited resources.

Description

Topic word search crawler scheduling method and system based on combined prediction method

Technical Field

The invention relates to the technical field of crawler scheduling methods, in particular to a topic word search crawler scheduling method and a topic word search crawler scheduling system based on a combined prediction method.

Background

Tracking a topic requires a crawler to continuously acquire data related to that topic. If hot topics are to be tracked preferentially when server resources are limited, the crawlers must be scheduled autonomously so that data related to hot topics is acquired first. Current crawler scheduling methods mainly include methods based on website data update frequency, on URL allocation, on network distance, and on node task allocation. The method based on website data update frequency schedules crawlers according to the update frequency of the data source website; it reduces the resource cost of the crawler server to a certain extent and is suitable for websites with a low update frequency. The method based on URL allocation judges the similarity between the web page text and the topic set by the user and preferentially assigns highly similar URLs to the crawler; it cannot meet the requirement of preferentially crawling topics whose future heat is high. The method based on node task allocation mainly addresses load balancing among crawler servers: a large number of URLs are mapped onto a hash ring, each crawler node corresponds to a segment of the ring so that tasks are allocated reasonably among crawler nodes, and virtual nodes are added to improve the robustness of the crawler system; it cannot allocate tasks according to heat in order to track topics.

Disclosure of Invention

In view of the above problems, the invention provides a topic word search crawler scheduling method and system based on a combined prediction method. By predicting the future heat of each topic, the crawlers corresponding to high-heat topics are scheduled so that more data on high-heat topics is obtained, which achieves preferential tracking of high-heat topics under limited resources.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows:

The topic word search crawler scheduling system based on the combined prediction method comprises:

a first acquisition module, which acquires data from a data source by using a topic word search crawler according to keywords set by a user;

a data preprocessing module, which preprocesses the data acquired by the first acquisition module;

a vector space model, which converts the preprocessed text data into multi-dimensional vectors formed by the weights of feature words;

a clustering module, which clusters the text data and takes each resulting cluster as a topic;

a topic word extraction module, which extracts the topic words of each cluster and stores them in a database;

a second acquisition module, which extracts the topic words from the database and acquires data from the data source by using crawlers according to the extracted topic words;

an analysis module, which parses the data acquired by the second acquisition module to obtain the forwarding amount, praise amount and comment amount of each piece of text data;

a real heat index weight calculation module, which takes the forwarding amount, praise amount and comment amount parsed by the analysis module as the real heat indexes of each piece of text data and calculates the weight of each index;

an update module, which passes the data acquired by the second acquisition module through the data preprocessing module, the vector space model and the clustering module to obtain the feature words of each cluster, and updates the original database with part of the feature words of each cluster as topic words;

a real heat value calculation module, which calculates the real heat of each piece of text data from its forwarding amount, praise amount and comment amount and the index weights obtained by the real heat index weight calculation module, then averages the real heat of the text data contained in each topic obtained by the clustering module, and takes the mean as the real heat value of that topic;

a heat value prediction module, which predicts the heat value of each topic in the next period; and

a CPU allocation module, by which the server assigns the crawler of each topic a CPU occupancy upper limit according to the predicted heat value and starts a corresponding number of processes.

The topic word search crawler scheduling method based on the combined prediction method comprises the following steps:

step 1, setting keywords, and acquiring data from a data source by using a crawler according to the keywords;

step 2, preprocessing the data, converting the preprocessed data into multi-dimensional vectors formed by the weights of feature words, dividing the vectors into clusters, manually labeling each cluster as a topic, and storing the feature words contained in each topic as topic words in a database to form a topic word database;

step 3, extracting the topic words from the topic word database, writing crawlers that acquire data from the data source according to the topic words, parsing the forwarding amount, praise amount and comment amount from the crawled data as real heat indexes, and determining the weights of the real heat indexes by the analytic hierarchy process;

step 4, calculating the real heat value of each topic according to the data acquired in step 3 and the real heat indexes;

step 5, predicting the heat value of each topic in the next period by using a combined prediction method;

step 6, processing the data acquired in step 3 as in steps 1 to 2, and updating the topic word database;

step 7, assigning a weight value to the crawler corresponding to each topic in the updated topic word database according to the predicted heat value, the server adjusting the CPU occupancy upper limit and the number of started processes of each topic's crawler according to the weight value, and repeating steps 3 to 7.

Further, step 2 includes the following steps:

step 21, data cleaning: removing all characters other than Chinese characters from the data by using a regular expression;

step 22, Chinese word segmentation: segmenting each acquired data text into words;

step 23, stop word removal: removing the stop words from the words segmented in step 22;

step 24, using a vector space model to convert each piece of data into a multi-dimensional vector formed by the weights of feature words.
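Steps 21 to 23 can be illustrated with a minimal Python sketch. The regular expression keeps only CJK unified ideographs, which is one reasonable reading of "characters other than Chinese characters". The `segment` function is a simple forward maximum matching stand-in chosen to keep the example self-contained; it is not the segmentation technique of the embodiment, and the dictionary, stop word list and sample text are hypothetical.

```python
import re

# Step 21: anything outside the CJK unified ideograph block is dropped.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")

def clean(text):
    """Remove every character that is not a Chinese character."""
    return NON_CHINESE.sub("", text)

def segment(text, dictionary, max_len=4):
    """Step 22 stand-in: forward maximum matching against a word list."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

def remove_stop_words(words, stop_words):
    """Step 23: drop the words listed in the stop word list."""
    return [w for w in words if w not in stop_words]

# Hypothetical dictionary, stop words and post text.
dictionary = {"今天", "天气", "很好"}
stop_words = {"的", "了"}
raw = "@user 今天的天气很好! #weather#"
tokens = remove_stop_words(segment(clean(raw), dictionary), stop_words)
```

A production pipeline would normally replace `segment` with a dedicated Chinese segmentation library and load the stop word list from a file.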

Further, step 2 further includes the following step:

clustering by a cluster analysis method: each piece of data is first treated as its own cluster; the clusters with the highest similarity under the similarity measure are merged first, and merging proceeds in order of decreasing similarity; the similarity between clusters decreases as clusters are merged, and merging stops when the similarity threshold is reached; each resulting cluster is called a topic, and the feature words contained in each topic are stored as topic words in a database to form a topic word database.
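The merging procedure described here can be sketched as a small bottom-up clustering loop over a precomputed similarity matrix. The linkage rule is an assumption: the text does not say how the similarity between two multi-item clusters is computed, so average linkage is used for illustration. The matrix values and the threshold are hypothetical.

```python
def agglomerate(sim, threshold):
    """Merge clusters greedily until no pair is at least `threshold` similar.

    sim is a symmetric matrix of pairwise similarities between items.
    Average linkage is an assumption, not taken from the text.
    """
    clusters = [[i] for i in range(len(sim))]  # every item starts alone

    def linkage(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = linkage(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break  # the most similar pair is already too dissimilar
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Hypothetical similarities between four pieces of text data.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
topics = agglomerate(sim, threshold=0.5)
```

With these values the loop merges items 0 and 1, then items 2 and 3, and stops because the two remaining clusters have an average similarity of 0.15.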

Further, in the step 5, the combined prediction algorithm includes an exponential smoothing method, a back propagation neural network and an entropy method, the exponential smoothing method and the back propagation neural network are used to calculate the predicted heat value of the topic respectively, and then weights are given to the calculation results of the exponential smoothing method and the back propagation neural network according to the entropy method, so as to obtain the predicted heat value of the topic.

Further, the exponential smoothing method is double (second-order) exponential smoothing, which is used to obtain the predicted heat value.
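A minimal sketch of second-order exponential smoothing in Brown's form is given below. The initialisation (both smoothed series start at the first observation) and the function name are assumptions; the text does not state how the smoothing is initialised.

```python
def double_exponential_forecast(series, alpha, horizon=1):
    """Brown's double exponential smoothing forecast `horizon` periods ahead."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    s1 = s2 = series[0]  # assumed initialisation: start at the first value
    for y in series[1:]:
        s1 = alpha * y + (1 - alpha) * s1   # first smoothing pass
        s2 = alpha * s1 + (1 - alpha) * s2  # second pass over the first
    level = 2 * s1 - s2
    trend = alpha / (1 - alpha) * (s1 - s2)
    return level + trend * horizon

# Hypothetical heat values for three periods.
forecast = double_exponential_forecast([1.0, 2.0, 3.0], alpha=0.5)
```

A larger `alpha` reacts faster to recent changes, which suits a strongly fluctuating series; a smaller `alpha` gives a smoother forecast.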

Further, the back propagation neural network continuously corrects the network weights and thresholds through training on sample data so that the error function decreases along the negative gradient direction; training stops when the error falls to the error threshold or a preset number of iterations is reached, which yields the weights of the input layer and the output layer; finally, the earlier real heat values are input into the trained back propagation neural network to obtain a predicted heat value.

Further, the entropy method determines the index weights according to the amount of information (entropy) provided by the observed values of each index, and uses the entropy to judge the degree of dispersion of the predicted heat values.
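One common formulation of entropy weighting for combining forecasts is sketched below. It is an assumption, not the formula of the invention: each method's past relative errors are normalised into a distribution, the entropy of that distribution is computed, and a method whose error series varies more receives a smaller weight. The error values and forecasts are hypothetical.

```python
import math

def entropy_weights(errors):
    """Weight each forecasting method from its past relative errors.

    errors[i][t] is the non-negative relative error of method i in period t.
    Needs at least two methods and at least two periods.
    """
    m, n = len(errors), len(errors[0])
    degrees = []
    for row in errors:
        total = sum(row)
        # A method with no error at all is treated as perfectly uniform.
        probs = [e / total for e in row] if total else [1 / n] * n
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        degrees.append(max(0.0, 1 - entropy))  # degree of variation
    spread = sum(degrees)
    if spread <= 1e-12:  # every method equally dispersed: use equal weights
        return [1 / m] * m
    return [(1 - d / spread) / (m - 1) for d in degrees]

def combine(forecasts, weights):
    """Weighted sum of the individual forecasts."""
    return sum(f * w for f, w in zip(forecasts, weights))

# Hypothetical relative errors of two methods over four past periods.
errors = [
    [0.10, 0.12, 0.09, 0.11],  # e.g. exponential smoothing
    [0.05, 0.30, 0.02, 0.20],  # e.g. back propagation neural network
]
weights = entropy_weights(errors)
combined = combine([20500.0, 19800.0], weights)
```

The weights always sum to 1, so the combined value lies between the individual forecasts.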

Further, in step 7, CPU is allocated to the crawlers by a multi-process method: a weight value is assigned to the crawler corresponding to each topic in the updated topic word database according to the predicted heat value, and the server adjusts the CPU upper limit and the number of started processes of each topic's crawler according to the weight value.

Compared with the prior art, the invention has the following beneficial effects: by integrating agglomerative hierarchical clustering, exponential smoothing, the back propagation neural network and the entropy method, a calculation method for the predicted heat value of a topic is provided; by predicting the future heat of each topic, the crawlers corresponding to high-heat topics are scheduled so that more data on high-heat topics is obtained. Preferential tracking of high-heat topics under limited resources is thereby achieved, more data related to hot topics is obtained, and hot topic trends are grasped more comprehensively and in a timely manner.

Drawings

FIG. 1 is a flow chart of the present embodiment;

FIG. 2 is a graph showing the distribution of real heat values in the present embodiment;

FIG. 3 is a fitted curve of true heat;

FIG. 4 is a graph showing the real heat trend of each subject in the periods 1 to 7 in the present embodiment;

FIG. 5 is a graph of data volume versus heat value for each category of topic at stage 8;

FIG. 6 is a graph of heat of a first category of topics versus number of periods;

FIG. 7 is a graph of heat with number of periods for a second category of topics;

FIG. 8 is a graph of heat with number of periods for a third category of topics;

FIG. 9 is a graph of heat with number of periods for a fourth category of topics;

FIG. 10 is a graph of heat with number of periods for a fifth category of topics;

FIG. 11 is a chart of heat with number of periods for a sixth category of topics;

FIG. 12 is a graph of heat with number of periods for a seventh category of topics;

FIG. 13 is a chart of heat with number of periods for a subject of the eighth category;

FIG. 14 is a graph showing the variation of the data amount of each category of topic with the number of periods.

Detailed Description

The invention will be further described with reference to the accompanying drawings. Embodiments of the present invention include, but are not limited to, the following examples.

The topic word search crawler scheduling system based on the combined prediction method comprises a first acquisition module, a data preprocessing module, a vector space model, a clustering module, a second acquisition module, a real heat index weight calculation module, a real heat value calculation module, an update module, a heat value prediction module and a CPU allocation module.

Further, the first acquisition module acquires text data from a data source by using a crawler according to keywords set by a user; in this embodiment, the data source is a microblog platform.

Further, the data preprocessing module preprocesses the data acquired by the first acquisition module; preprocessing comprises data cleaning, Chinese word segmentation and stop word removal. Data cleaning removes all characters other than Chinese characters by using a regular expression; characters such as @ and #, as well as emoticons, are all cleaned. Chinese word segmentation splits a Chinese character sequence without spaces into meaningful words: an efficient word-graph scan based on a prefix dictionary generates a directed acyclic graph of all the possible word formations of the Chinese characters in a sentence, and dynamic programming then finds the maximum-probability path, which gives the maximum segmentation combination, so that each piece of acquired text data is segmented into words. Stop word removal deletes the words that cannot express text features by means of a constructed stop word list.

Furthermore, the vector space model is a commonly used text representation model. Any piece of text data can be segmented into words by a word segmentation technique, and the text data can then be represented as a word vector that takes each word as a component in segmentation order. More generally, let the text data set be $T = \{t_1, t_2, \ldots, t_n\}$, where $t_i$ ($i = 1, 2, \ldots, n$) is a piece of text data. After deduplication, word frequency thresholding, stop word removal and similar processing, the word vectors of all the text data can be merged into a new word vector, called the feature word library (or feature word space) of the text data set $T$. According to whether, or how many times, the words of a piece of text data appear in the feature word space, any text data $t_i$ can be represented as a vector in the feature word space; this representation is called the vector space model of the text data. Given the feature word library $K = (k_1, k_2, \ldots, k_m)$, where each $k_j$ ($j = 1, 2, \ldots, m$) is a feature word, the vector space model of the text data in this embodiment is expressed as

$$t_i = (w_{i1}, w_{i2}, \ldots, w_{im}) \tag{1}$$

where each $w_{ij}$ is calculated as follows:

[formula (2)]

where $r_{ij}$ is the number of occurrences of the feature word $k_j$ in the text data $t_i$, $k_j \in t_i$ denotes that the feature word $k_j$ appears in the text data $t_i$, and $|T|$ is the number of elements in the set $T$, i.e. $|T| = n$. In this embodiment, for any feature word $k_j$, [formula].

Further, the clustering module performs cluster analysis on the text data set $T$ based on the vector space model of the text data and divides $T$ into different clusters. Cluster analysis rests on a similarity measure between the clustered objects; in this embodiment the similarity between two pieces of text data is measured by cosine similarity, calculated as

$$d(t_i, t_l) = \frac{\sum_{j=1}^{m} w_{ij} w_{lj}}{\sqrt{\sum_{j=1}^{m} w_{ij}^2}\,\sqrt{\sum_{j=1}^{m} w_{lj}^2}} \tag{3}$$

where $w_{ij}$ is calculated according to formula (2), and $t_i$ and $t_l$ are two pieces of text data in the set $T$. $d(t_i, t_l)$ lies between 0 and 1: the smaller $d(t_i, t_l)$, the more dissimilar the text data $t_i$ and $t_l$; the larger $d(t_i, t_l)$, the more similar they are. Based on this cosine similarity measure, this embodiment adopts agglomerative hierarchical clustering: according to the cosine similarity between text data and a set similarity threshold, the text data with the highest similarity are merged into clusters first, until all the text data have been merged into clusters.
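The cosine similarity measure named in this paragraph can be written as a short Python function. Returning 0 for an all-zero vector is a convention chosen here for illustration; the text does not cover that case.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # assumed convention for zero vectors

same = cosine_similarity([1.0, 0.0], [1.0, 0.0])
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

For vectors with non-negative weights the result always falls between 0 and 1, matching the range stated in the text.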

Further, based on the result of the clustering module, the topic word extraction module treats each cluster as a topic. Let the set of clusters be $T' = \{T_1, T_2, \ldots, T_p\}$; each cluster $T_q$ ($q = 1, 2, \ldots, p$) is a topic. Based on the vector space model of formula (1), the topic $T_q$ can also be represented as a matrix:

[formula]

According to formula (2), $w_{ij}$ ($i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m$) can also be understood as the weight of the feature word $k_j$ in the text data $t_i$. Accordingly, the weight $W_{qj}$ of the feature word $k_j$ in the topic $T_q$ is calculated as follows:

[formula]

where $|T_q|$ is the number of pieces of text data in the topic $T_q$. Based on the weights of the feature words in the topic $T_q$, a weight threshold $\beta_q$ is set; the feature words in $\{k_1, k_2, \ldots, k_m\}$ whose weight is below the threshold $\beta_q$ are excluded, and the remaining feature words are stored in the topic word database as the topic words of $T_q$, i.e.

[formula]
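The threshold-based selection of topic words can be sketched as follows. The topic-level weight is assumed to be the mean of a feature word's weights over the texts of the topic, which fits the mention of the number of texts in the topic but is not confirmed by this text. The vectors, feature words and threshold are hypothetical.

```python
def topic_words(vectors, features, threshold):
    """Average each feature weight over the topic's texts and keep the heavy ones.

    vectors holds one weight vector per text in the topic.
    Using the mean as the topic-level weight is an assumption.
    """
    size = len(vectors)
    weights = [sum(row[j] for row in vectors) / size for j in range(len(features))]
    return [k for k, w in zip(features, weights) if w >= threshold]

# Hypothetical topic with two texts and three feature words.
vectors = [[0.4, 0.0, 2.2], [0.4, 0.1, 1.1]]
features = ["weather", "sunny", "rain"]
selected = topic_words(vectors, features, threshold=0.3)
```

Raising the threshold keeps fewer, more characteristic words, which narrows what the crawler searches for.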

Further, the second acquisition module extracts the topic words of the topic $T_q$ from the topic word database and acquires data from the data source by using a crawler according to the extracted topic words. The text data set obtained by the crawler according to the topic words of $T_q$ is $T'_q = \{t'_1, t'_2, \ldots, t'_{n'}\}$, where $n'$ is the number of pieces of text data crawled according to the topic words of $T_q$.

Further, the analysis module parses the forwarding amount, praise amount and comment amount of each piece of text data $t'_{i'}$ ($i' = 1, 2, \ldots, n'$) acquired by the second acquisition module.

Further, the real heat index weight calculation module takes the forwarding amount, praise amount and comment amount as the real heat calculation indexes and determines the index weights $\omega_{i''}$ ($i'' = 1, 2, 3$) by a weight analysis method.

Further, in this embodiment, the analytic hierarchy process is used to determine the index weights.

Furthermore, the real heat value calculation module calculates the real heat value of the topic $T_q$ in period $\tau$ from the text data acquired by the second acquisition module, the data parsed by the analysis module and the weights obtained by the real heat index weight calculation module.

Let the forwarding amount, praise amount and comment amount of the text data $t'_{i'}$ be $b_{i'1}$, $b_{i'2}$ and $b_{i'3}$ respectively, and let $t'_{i'}$ ($i' = 1, 2, \ldots, n'$) be the text data acquired by the crawler in period $\tau$. The real heat value of $t'_{i'}$ is calculated by formula (7), and the real heat value of the topic $T_q$ in period $\tau$ is calculated by formula (8):

[formula (7)]

[formula (8)]
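The two-level calculation (heat of one piece of text data, then heat of a topic) can be sketched as below. The per-text heat is assumed to be a weighted sum of the three indexes; the topic heat is the mean over the topic's texts, as the system description states. The weights are the values quoted in the example of this document, read as forwarding, praise and comment weights; the post counts are hypothetical.

```python
def text_heat(forwards, praises, comments, weights):
    """Weighted sum of the three indexes (an assumed form of formula (7))."""
    w_forward, w_praise, w_comment = weights
    return w_forward * forwards + w_praise * praises + w_comment * comments

def topic_heat(posts, weights):
    """Mean heat of the texts crawled for one topic in one period."""
    return sum(text_heat(*post, weights) for post in posts) / len(posts)

# (forwarding, praise, comment) weights as quoted in the example section.
weights = (0.1596, 0.1390, 0.7012)
# Hypothetical posts: (forwarding amount, praise amount, comment amount).
posts = [(10, 20, 5), (0, 0, 0)]
heat = topic_heat(posts, weights)
```

Because the topic value is a mean, a topic with many low-engagement posts can score lower than a topic with a few heavily discussed posts.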

Further, the update module updates the topic word database after the text data acquired by the second acquisition module has been processed by the data preprocessing module and the vector space model.

Furthermore, the heat value prediction module obtains the predicted heat value of each topic from the real heat values of that topic obtained by the real heat value calculation module. For the topic $T_q$, the correspondence between the period number $\tau$ and the real heat value can be constructed as shown in Table 1:

TABLE 1

The distribution of the real heat value over the period number $\tau$ is shown in FIG. 2.

From the constructed correspondence between the period number $\tau$ and the real heat value, the real heat value curve fitted for $\tau = 7$ is shown in FIG. 3, and the combined predicted heat value of the topic $T_q$ in period $\tau + d$ is predicted by using existing prediction methods.

Further, in this embodiment, the exponential smoothing method and the back propagation neural network are each used to calculate a predicted heat value of the topic, and the entropy method then assigns weights to the two results to obtain the combined predicted heat value of the topic $T_q$.

Further, the CPU allocation module assigns a weight value to the crawler corresponding to each topic according to the combined predicted heat value, and the server adjusts the CPU occupancy upper limit of the crawler corresponding to the topic $T_q$ according to the weight value.

The crawlers are deployed on the same Linux server, and the CPU upper limit of the crawler corresponding to the topic $T_q$ is constrained with the cpulimit command on the Linux system. While a crawler stays below its specified CPU upper limit, no restriction is imposed on it; if a crawler is about to exceed the limit, the server adjusts it dynamically so that its CPU usage floats around the upper limit. The CPU upper limit of the crawler responsible for the topic $T_q$ in period $\tau + d$ is calculated as follows:

[formula (18)]

In formula (18), the result is the CPU upper limit of the crawler corresponding to $T_q$ in period $\tau + d$, $M$ is the percentage of CPU resources available when the server is unloaded, $C'$ is the percentage of CPU currently used by the server, the heat term is the combined predicted heat value of $T_q$ in period $\tau + d$, and $p$ is the total number of topics. If the CPU consumed by the crawler of $T_q$ cannot reach the upper limit, additional processes can be started so that the crawler makes full use of its CPU resources. The number of processes started by the crawler corresponding to $T_q$ in period $\tau + d$ is given by the following formula:

[formula (19)]

The terms of formula (19) are, in order: the percentage of CPU occupied by the crawler of $T_q$ itself in period $\tau + d$; the multiple by which the CPU resources remaining for the crawler of $T_q$ in period $\tau + d$ exceed the CPU occupied by the crawler itself; and the number of processes the crawler of $T_q$ needs to start in period $\tau + d$. A lower bound is prescribed so that the crawler of each topic runs at least one process, and an upper bound is prescribed to prevent problems such as insufficient memory caused by starting too many processes.
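The allocation logic can be illustrated with a short sketch. The formulas are assumptions built from the term definitions in this section: the CPU left over after current usage is shared among the crawlers in proportion to their combined predicted heat values, with a 1% floor, and the process count is the whole number of times one process fits into the limit, clamped between 1 and a cap. The cap `max_processes`, the heat values and the process ID are hypothetical; `-p` and `-l` are the process ID and limit options of the cpulimit tool.

```python
def cpu_limits(predicted_heat, idle_capacity, in_use, floor=1.0):
    """Share the spare CPU percentage in proportion to predicted heat."""
    available = idle_capacity - in_use
    total = sum(predicted_heat)
    return [max(floor, available * h / total) for h in predicted_heat]

def process_count(limit, per_process, max_processes):
    """Processes needed to fill `limit`, kept between 1 and `max_processes`."""
    return max(1, min(max_processes, int(limit // per_process)))

def cpulimit_command(pid, limit):
    """Argument list that caps an already running process with cpulimit."""
    return ["cpulimit", "-p", str(pid), "-l", str(int(limit))]

# Hypothetical predicted heat values for three topics on a 16-CPU server.
limits = cpu_limits([100, 300, 600], idle_capacity=1600, in_use=1100)
counts = [process_count(x, per_process=3, max_processes=40) for x in limits]
command = cpulimit_command(1234, limits[0])
```

The argument list can be passed to `subprocess.run` on a server where cpulimit is installed; the sketch only builds it.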

Further, if the current period exceeds the predicted period $\tau + d$, steps 3 to 7 are repeated.

As shown in fig. 1, the topic word search crawler scheduling method based on the combined prediction method includes the following steps:

step 1, setting keywords, and acquiring text data from a data source by using a crawler according to the keywords;

step 2, preprocessing the text data, converting the preprocessed text data into multi-dimensional vectors formed by the weights of feature words, and dividing the vectors into clusters, each cluster being called a topic, and storing part of the feature words contained in each topic as topic words in a database to form a topic word database;

step 3, writing as many crawlers as there are topics, extracting the topic words from the topic word database, the crawlers acquiring data from the data source according to the topic words, and establishing real heat indexes for the crawled text data from the forwarding amount, praise amount and comment amount;

step 4, calculating the real heat value of each topic according to the data acquired in step 3 and the real heat indexes;

step 5, predicting the heat value of each topic in the next period by using a combined prediction method;

step 6, processing the text data acquired in step 3 as in steps 1 to 2, and updating the topic word database;

step 7, assigning a weight value to the crawler corresponding to each topic in the updated topic word database according to the predicted heat value, the server adjusting the CPU occupancy upper limit and the number of started processes of each topic's crawler according to the weight value, and repeating steps 3 to 7.

Examples

According to a keyword library initially set by the user, the crawlers corresponding to the topics crawl and store real-time text data from Sina Weibo. 3000 pieces of text data are extracted as training samples and subjected to Chinese word segmentation and stop word removal; the vector space is constructed with formula (1), the feature weights are calculated with formula (2), hierarchical clustering is performed with formula (3), and the hierarchical clustering tree is cut with 100 as the threshold. The stop words come from the Harbin Institute of Technology (HIT) stop word list, and a dictionary of words that must not be split is added manually. The resulting topics are shown in Table 2:

TABLE 2

Table 2 contains eight topics. According to the topic words, the crawlers crawl the corresponding text data on the microblog and parse the forwarding amount, praise amount and comment amount of each piece of text data; during this process, the CPU upper limit of each crawler is set to one eighth of the server's remaining CPU percentage. In this embodiment, the weights of the forwarding amount, praise amount and comment amount are determined by the analytic hierarchy process. First, a judgment matrix J is constructed:

[judgment matrix J]

The weights of the comment amount, forwarding amount and praise amount obtained by the arithmetic mean method are approximately 0.7012, 0.1596 and 0.1390 respectively. Taking one day as a period and crawling for one week, the real heat value of each topic in each period (day) of the week is obtained from formulas (7) and (8). The heat trend of each topic in periods 1 to 7 is shown in FIG. 4, where the horizontal axis is the period number and the vertical axis is the real heat value. The exponential smoothing predicted heat value of each topic for period 8 is calculated by double exponential smoothing. Because the real heat of every topic other than topic 7 fluctuates little, the smoothing coefficient of those topics is a small number between 0 and 1, taken as 0.3 in this embodiment; for topic 7 it is 0.8.
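The arithmetic mean method for deriving weights from an analytic hierarchy process judgment matrix can be sketched as follows: each column is normalised to sum to 1, and each row is then averaged. The matrix below is hypothetical and is not the judgment matrix J of this embodiment, which is not reproduced in this text.

```python
def ahp_weights(judgment):
    """Arithmetic mean method: normalise each column, then average each row."""
    n = len(judgment)
    col_sums = [sum(judgment[i][j] for i in range(n)) for j in range(n)]
    return [
        sum(judgment[i][j] / col_sums[j] for j in range(n)) / n
        for i in range(n)
    ]

# Hypothetical pairwise comparisons in the order comment, forwarding, praise:
# comments are judged 5 times as important as either other index.
judgment = [
    [1.0, 5.0, 5.0],
    [1 / 5, 1.0, 1.0],
    [1 / 5, 1.0, 1.0],
]
weights = ahp_weights(judgment)
```

For this perfectly consistent matrix the weights are 5/7, 1/7 and 1/7; a real judgment matrix would normally also be checked for consistency before its weights are used.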

When the back propagation (BP) neural network is used to predict topic heat, the hidden layer and output layer use the ReLU activation function, the loss function is the cross-entropy loss, and the optimizer is Adam. The network has 1 hidden layer with 3 nodes, 3 input nodes and 1 output node, and the learning rate is 0.01. The earlier real heat value sequences of topics one to eight are each divided into groups of 4 real heat values; each group is one sample, in which the last heat value is the output and the remaining values are the input, and each sample is learned to update the connection weights between the input layer and the hidden layer and between the hidden layer and the output layer. The maximum number of training iterations is 1000, and learning ends when the error falls to the allowable limit of 0.0001. The trained network is then used to predict the heat value of each topic for period 8: the real heat values of periods 5 to 7 are the input, and the output is the predicted heat value of each topic in period 8. Finally, the two methods are combined by the entropy method: the period 8 predicted heat value from double exponential smoothing and the period 8 predicted heat value from the BP neural network are weighted and summed to give the combined predicted heat value of period 8 (the entropy method predicted heat value). The rounded results are shown in Table 3:

TABLE 3

The entropy and weight of the entropy method are shown in table 4 below:

TABLE 4
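The grouping of a real heat value sequence into training samples for the neural network (groups of 4 values, the first 3 as input and the last as output) can be sketched as below. Overlapping sliding windows are an assumption; the text only says the sequence is divided into groups of 4. The heat values are hypothetical.

```python
def make_samples(series, window=4):
    """Split a heat sequence into (inputs, target) pairs of `window` values."""
    samples = []
    for start in range(len(series) - window + 1):
        group = series[start:start + window]
        samples.append((group[:-1], group[-1]))  # last value is the target
    return samples

# Hypothetical real heat values of one topic for periods 1 to 7.
heat = [10, 12, 11, 13, 15, 14, 16]
samples = make_samples(heat)
# Input for predicting period 8: the real values of periods 5 to 7.
next_input = heat[-3:]
```

Seven periods yield four overlapping samples under this assumption, each matching the 3-input, 1-output network shape described in the example.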

The server used in this embodiment has 16 CPUs, so the CPU percentage available when the server is unloaded is 1600%, of which other processes already occupy nearly 1100%. From the prediction results and formula (18), the CPU upper limit percentage (at least 1%) that the server needs to allocate to each crawler in period 8 is calculated. Each crawler is found to occupy 3% of CPU, so the number of processes each crawler starts in period 8 follows from formula (19). The CPU upper limits and process counts are shown in Table 5:

TABLE 5

According to the period 8 process counts in Table 5, each topic word search crawler crawls data on the microblog using the topic words of its topic, and the amount crawled by each crawler is counted. The amounts crawled without scheduling and with scheduling according to the predicted heat value are counted separately, as shown in FIG. 5, where the abscissa is the topic category and the ordinate is the amount of data crawled for each topic and the predicted heat value; the first and second columns are the amounts of data crawled for each topic without and with scheduling respectively, and the third column is the heat value of each topic in period 8.

In period 8, the higher-heat first, fifth, seventh and eighth topics obtained 3572, 4026, 2338 and 3274 pieces of data with scheduling, against 762, 1285, 594 and 827 without scheduling. For the higher-heat topics of the next period, the crawlers scheduled by this embodiment therefore obtained in total about 380.9% of the data obtained without scheduling (13210 against 3468 pieces, roughly 3.8 times as much). The amount of data obtained by each scheduled crawler is positively correlated with the predicted heat value: the higher the predicted heat value, the more CPU resources the crawler receives and the more data it obtains.
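The 380.9% figure can be checked directly from the eight data volumes quoted in this paragraph. The short calculation below shows that it is the ratio of the scheduled total to the unscheduled total, so the scheduled crawlers obtained about 3.8 times as much data, which is an increase of about 280.9%.

```python
# Data volumes of topics 1, 5, 7 and 8 in period 8, as quoted in the text.
scheduled = [3572, 4026, 2338, 3274]
unscheduled = [762, 1285, 594, 827]

ratio = sum(scheduled) / sum(unscheduled)
percent_of = round(ratio * 100, 1)          # scheduled as a share of unscheduled
percent_more = round((ratio - 1) * 100, 1)  # increase over unscheduled
```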

Once the current period number exceeds the 8th period, the data of each topic in the most recent seven periods are extracted repeatedly, the topic words are updated, the real heat value is calculated, the heat value of the 9th period is predicted, and the crawlers are scheduled. This process yields the data volume, predicted heat value and real heat value of each topic in the 10th to 17th periods. The changes over time of the real and predicted heat values of each topic are shown in Figs. 6 to 13.

In Figs. 6 to 13, the abscissa is the period number and the ordinate is the heat value; the line marked with crosses represents the predicted heat value and the line marked with dots represents the true heat value of the corresponding period. The average relative error of each topic over periods 9 to 17 is calculated from the predicted heat value and the true heat value of each period according to the average relative error formula.
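The average relative error formula itself is not reproduced in this text. As a sketch, assuming the conventional definition (the mean over periods of the absolute difference between predicted and true value, divided by the true value), the calculation could look as follows. The function name and the sample values are hypothetical.

```python
def mean_relative_error(predicted, actual):
    """Mean of |predicted - actual| / |actual| over all periods.

    Assumption: this is the conventional definition; the patent's own
    formula is not shown in this passage. True values must be non-zero.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - a) / abs(a) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)

# Hypothetical heat values for two periods, not taken from Table 6.
example = mean_relative_error([110.0, 90.0], [100.0, 100.0])
print(example)
```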

The calculation results are shown in table 6 below:

TABLE 6

As Figs. 6 to 13 and the table above show, the combined predicted heat value in each period is close to the true heat value, so it is reasonable to schedule the crawler corresponding to each topic's feature words using the combined predicted heat value. The change in the data volume of each topic over the periods is shown in Fig. 14, where the abscissa is the period number and the ordinate is the amount of data crawled by the crawler according to the topic words contained in each topic.

Taking the fifth and seventh topics as examples, Figs. 10 and 12 show that the real heat of the fifth topic in each period is much higher than that of the seventh topic (about 20000 for the fifth topic and about 4000 for the seventh). Correspondingly, as shown in Fig. 14, the data volume obtained for the fifth topic by scheduling according to the predicted value is also much higher than that of the seventh topic, which is consistent with the goal that high-heat topics should obtain more data. Therefore, using the combined prediction method to allocate more resources to the topic word search crawlers corresponding to high-heat topics enables those topics to obtain more data, achieving the purpose of preferentially tracking high-heat topics.

The above is an embodiment of the present invention. The embodiments and the specific parameters therein are intended only to clearly illustrate the inventor's verification process and do not limit the scope of protection of the invention, which is defined by the claims. All equivalent structural changes made using the contents of the specification and drawings of the present invention shall likewise fall within the scope of protection of the invention.

Claims

1. A topic word search crawler scheduling system based on a combined prediction method, characterized by comprising:

a second acquisition module, used for extracting topic words from the database and acquiring data from the data source with the topic word search crawler according to the extracted topic words;

and a CPU distribution module, used by the server to assign a corresponding CPU occupancy upper limit to the topic word search crawler of each topic according to the predicted heat value and to start the corresponding number of processes.

2. A topic word search crawler scheduling method based on a combined prediction method, characterized by comprising the following steps:

step 1, setting keywords, and acquiring data from a data source with the topic word search crawler according to the keywords;

step 2, preprocessing the data, converting the preprocessed text data into multidimensional vectors composed of feature word weights, dividing the multidimensional vectors into clusters, defining each cluster as a topic, and storing some of the feature words contained in each topic in a database as topic words;

step 3, extracting the topic words from the database, creating a corresponding number of topic word search crawlers according to the number of topics to acquire topic data from the data source, parsing the repost count, like count and comment count from the crawled data as real heat indexes, and determining the weight of each index by the analytic hierarchy process;

step 4, calculating the real heat value of each text item from its repost count, like count and comment count and the real heat index weights obtained in step 3, then, according to the topics obtained in step 2, averaging the real heat of the text items contained in each topic and taking the average as the real heat value of that topic;

step 5, fitting a curve of the real heat value of each topic against the period number using the real heat values obtained in step 4, and obtaining the predicted heat value of each topic in the next period by the combined prediction method;

step 6, processing the data obtained in step 3 as in steps 1 to 2, extracting new topic words, and updating the database;

step 7, updating the weight value of the topic word search crawler corresponding to each topic word according to its predicted heat value, the server adjusting the CPU occupancy upper limit of the topic word search crawler corresponding to each topic word according to the weight value, and repeating steps 3 to 7.
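By way of a non-limiting illustration of step 4, the sketch below computes the real heat value of each text as a weighted sum of its three indexes and averages those values per topic. The weighted-sum form is an assumption, since the heat formula is not reproduced in this passage; the weights shown are hypothetical placeholders for the values that the analytic hierarchy process would produce, and the field names are illustrative.

```python
from statistics import mean

# Hypothetical index weights; in the method they are determined by the
# analytic hierarchy process and are not given in this passage.
WEIGHTS = {"reposts": 0.5, "likes": 0.2, "comments": 0.3}

def text_heat(post, weights=WEIGHTS):
    """Real heat of one text: weighted sum of its indexes (assumed form)."""
    return sum(weights[name] * post[name] for name in weights)

def topic_heat(posts, weights=WEIGHTS):
    """Real heat of a topic: average heat of the texts it contains."""
    return mean(text_heat(post, weights) for post in posts)

# Hypothetical crawled texts belonging to one topic.
posts = [
    {"reposts": 10, "likes": 50, "comments": 20},
    {"reposts": 30, "likes": 10, "comments": 0},
]
heat_value = topic_heat(posts)
print(heat_value)
```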

3. The topic word search crawler scheduling method and system based on the combined prediction method according to claim 2, wherein: in the step 2, the method further comprises the following steps:

step 24, converting the text data into a multi-dimensional vector composed of feature word weights by using a vector space model.
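By way of a non-limiting illustration, a vector space model can be sketched as below. The claim does not state how the feature word weights are computed, so the TF-IDF weighting used here is an assumption, and the tokenized documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized, non-empty documents into TF-IDF weight vectors.

    Assumption: TF-IDF is used as the feature word weight; the claim only
    requires a vector of feature word weights.
    """
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append([
            term_freq[word] / len(doc) * math.log(n_docs / doc_freq[word])
            for word in vocab
        ])
    return vocab, vectors

# Hypothetical preprocessed texts (already segmented into feature words).
docs = [["a", "b"], ["a", "c"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab, vectors)
```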

4. The topic word search crawler scheduling method and system based on the combined prediction method according to claim 2, wherein: in the step 2, the method further comprises the following steps:

(3) treating each data item as a separate cluster by a cluster analysis method, merging the data with the highest similarity according to a similarity measurement standard, merging the data into clusters in order of similarity from high to low, the similarity between clusters decreasing as clusters are merged, until a similarity threshold is reached, calling each resulting cluster a topic, and storing the feature words contained in each topic in a database as topic words to form a topic word database.
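By way of a non-limiting illustration, the bottom-up merging described in this claim can be sketched as below. Cosine similarity and average linkage are assumptions, since the claim names only "a similarity measurement standard"; the vectors and the threshold are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_similarity(c1, c2, vectors):
    """Average-linkage similarity between two clusters (an assumption)."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)

def agglomerate(vectors, threshold):
    """Start with one cluster per item and merge the most similar pair
    until the best available similarity falls below the threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        (a, b), best = max(
            (((a, b), cluster_similarity(clusters[a], clusters[b], vectors))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters

# Hypothetical feature word weight vectors for four texts.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
topics = agglomerate(vectors, threshold=0.8)
print(topics)
```

Each inner list holds the indexes of the texts assigned to one topic.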

5. The topic word search crawler scheduling method and system based on the combined prediction method according to claim 2, wherein: in step 5, the combined prediction algorithm comprises an exponential smoothing method, a back propagation neural network and an entropy method; the exponential smoothing method and the back propagation neural network are each used to calculate a predicted heat value of the topic, and the two predicted heat values are then weighted according to the entropy method to obtain the combined predicted heat value of the topic.

6. The topic word search crawler scheduling method and system based on the combined prediction method according to claim 5, wherein: the exponential smoothing method adopts a second-order (double) exponential smoothing method to obtain the predicted heat value.
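By way of a non-limiting illustration, second-order exponential smoothing can be sketched in its common Brown form, in which the series is smoothed twice and a level and a trend are derived from the two smoothed values. The smoothing coefficient, the initialization with the first observation, and the sample series are assumptions; the patent's own parameters are not given in this passage.

```python
def double_exponential_forecast(series, alpha=0.5, horizon=1):
    """Brown's double exponential smoothing forecast.

    Assumptions: both smoothed series start at the first observation and
    alpha is a hypothetical smoothing coefficient with 0 < alpha < 1.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    s1 = s2 = series[0]
    for value in series[1:]:
        s1 = alpha * value + (1 - alpha) * s1   # first smoothing
        s2 = alpha * s1 + (1 - alpha) * s2      # second smoothing
    level = 2 * s1 - s2
    trend = alpha / (1 - alpha) * (s1 - s2)
    return level + trend * horizon

# Hypothetical real heat values of one topic over four periods.
forecast = double_exponential_forecast([1.0, 2.0, 3.0, 4.0], alpha=0.5)
print(forecast)
```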

7. The topic word search crawler scheduling method and system based on the combined prediction method according to claim 5, wherein: the back propagation neural network continuously corrects the network weights and thresholds through training on sample data so that the error function descends along the negative gradient direction; training continues until the error function falls to a threshold or a preset number of iterations is reached, yielding the weights of the input layer and the output layer; finally, the true values of earlier periods are input into the trained back propagation neural network to obtain the predicted heat value.

8. The topic word search crawler scheduling method and system based on the combined prediction method according to claim 5, wherein: the entropy method determines the index weights according to the entropy provided by the observed values of each index; the degree of dispersion of the two groups of predicted heat values is obtained from the entropy, and the two predicted heat values are assigned corresponding weights accordingly and summed.
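By way of a non-limiting illustration, one common way of applying the entropy method to combine two forecasts is sketched below: the entropy of each model's past relative errors is computed, the degree of variation is taken as one minus the entropy, and the model whose errors vary more receives the smaller weight. This particular weighting form is an assumption, since the patent's formulas are not reproduced in this passage, and all numbers are hypothetical.

```python
import math

def entropy(errors):
    """Normalized entropy (0 to 1) of a model's non-negative relative errors.

    Requires at least two error values.
    """
    total = sum(errors)
    if total == 0:
        return 1.0  # no error at all: treat as perfectly uniform
    probs = [e / total for e in errors]
    return -sum(p * math.log(p) for p in probs if p > 0) / math.log(len(errors))

def combination_weights(errors_by_model):
    """Weights that sum to 1; larger error variation gives a smaller weight.

    Requires at least two models.
    """
    variation = [1 - entropy(errors) for errors in errors_by_model]
    total = sum(variation)
    count = len(variation)
    if total == 0:
        return [1 / count] * count
    return [(1 - d / total) / (count - 1) for d in variation]

def combine(predictions, weights):
    """Weighted sum of the individual predicted heat values."""
    return sum(w * p for w, p in zip(weights, predictions))

# Hypothetical past relative errors of the two models.
smoothing_errors = [0.10, 0.12, 0.11]
network_errors = [0.01, 0.30, 0.05]
weights = combination_weights([smoothing_errors, network_errors])
combined = combine([100.0, 200.0], weights)
print(weights, combined)
```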

9. The topic word search crawler scheduling method and system based on the combined prediction method according to claim 5, wherein: in step 7, the CPU of the crawlers is allocated using a multi-process method; a weight value is assigned to the crawler corresponding to each topic word in the updated topic word database according to the predicted heat value, and the server adjusts the CPU upper limit and the number of started processes of the crawler corresponding to each topic word according to the weight value.