CN113536085A - Topic word search crawler scheduling method and system based on combined prediction method - Google Patents

Publication number
CN113536085A
Authority
CN
China
Prior art keywords
data
module
topic
value
theme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110701204.8A
Other languages
Chinese (zh)
Other versions
CN113536085B (en)
Inventor
陈智超
裴峥
孔明明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xihua University
Original Assignee
Xihua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xihua University
Priority to CN202110701204.8A
Publication of CN113536085A
Application granted
Publication of CN113536085B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of crawler scheduling, and in particular to a topic word search crawler scheduling method and system based on a combined prediction method. The system comprises a first acquisition module, a data preprocessing module, a vector space model, a clustering module, a topic word extraction module, a second acquisition module, a real heat index weight calculation module, a real heat value calculation module, an update module, a heat value prediction module and a CPU distribution module. The method comprises the following steps: step 1, acquiring data from a data source; step 2, preprocessing the data; step 3, obtaining topic data and calculating the real heat indexes and index weights; step 4, calculating the real heat value; step 5, calculating the predicted heat value of each topic in the next period; step 6, extracting new topic words and updating the database; and step 7, allocating CPU occupancy upper limits so that more data related to high-heat topics is acquired. The method thereby tracks high-heat topics preferentially under limited resources.

Description

Topic word search crawler scheduling method and system based on combined prediction method
Technical Field
The invention relates to the technical field of crawler scheduling methods, in particular to a topic word search crawler scheduling method and a topic word search crawler scheduling system based on a combined prediction method.
Background
Tracking a topic requires crawlers to acquire topic-related data continuously. To track hot topics preferentially when server resources are limited, the crawlers must be scheduled autonomously so that data related to hot topics is acquired first. Current crawler scheduling methods mainly include methods based on website data update frequency, on URL distribution, on network distance, and on node task allocation. The method based on website data update frequency schedules crawlers according to how often the data source website is updated; it reduces the resource cost of the crawler server to some extent and suits websites with a low update frequency. The method based on URL distribution judges the similarity between web page text and the topic set by the user and preferentially assigns highly similar URLs to the crawlers; it cannot preferentially crawl topics that will have high future heat. The method based on node task allocation mainly addresses load balancing among crawler servers: a large number of URLs are mapped onto a hash ring, each crawler node corresponds to a segment of the ring so that tasks are allocated reasonably among the nodes, and virtual nodes are added to improve the robustness of the crawler system; it cannot allocate tasks according to topic heat.
Disclosure of Invention
In view of these problems, the invention provides a topic word search crawler scheduling method and system based on a combined prediction method. By predicting the future heat of each topic and scheduling the crawlers that correspond to high-heat topics, more high-heat data is obtained, and high-heat topics are tracked preferentially under limited resources.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
The topic word search crawler scheduling system based on the combined prediction method comprises:
a first acquisition module, which acquires data from a data source using a topic word search crawler according to keywords set by a user;
a data preprocessing module, which preprocesses the data acquired by the first acquisition module;
a vector space model, which converts the preprocessed text data into multi-dimensional vectors formed by the weights of feature words;
a clustering module, which clusters the text data and takes each resulting cluster as a topic;
a topic word extraction module, which extracts the topic words of each cluster and stores them in a database;
a second acquisition module, which extracts the topic words from the database and uses a crawler to acquire data from the data source according to the extracted topic words;
an analysis module, which parses the data acquired by the second acquisition module to obtain the forwarding amount, praise (like) amount and comment amount of each piece of text data;
a real heat index weight calculation module, which takes the forwarding amount, praise amount and comment amount parsed by the analysis module as the real heat indexes of each piece of text data and calculates the weight of each index;
an updating module, which passes the data acquired by the second acquisition module through the data preprocessing module, the vector space model and the clustering module to obtain the feature words of each cluster, and updates the original database with some of the feature words of each cluster as topic words;
a real heat value calculation module, which calculates the real heat of each piece of text data from its forwarding amount, praise amount and comment amount and the index weights obtained by the real heat index weight calculation module, then calculates the mean real heat of the text data contained in each topic obtained by the clustering module, and takes that mean as the real heat value of the topic;
a prediction heat value module, which predicts the heat value of each topic word in the next period; and
a CPU distribution module, by which the server assigns the crawler corresponding to each topic a CPU occupancy upper limit according to the predicted heat value and starts the corresponding number of processes.
The topic word search crawler scheduling method based on the combined prediction method comprises the following steps:
step 1, setting keywords, and acquiring data from a data source using a crawler according to the keywords;
step 2, preprocessing the data, converting the preprocessed data into multi-dimensional vectors formed by the weights of feature words, dividing the vectors into clusters, manually labeling each cluster as a topic, and storing the feature words contained in each topic in a database as topic words to form a topic word database;
step 3, extracting the topic words from the topic word database, writing crawlers that acquire data from the data source according to the topic words, parsing the forwarding amount, praise amount and comment amount from the crawled data as real heat indexes, and determining the weights of the real heat indexes by the analytic hierarchy process;
step 4, calculating the real heat value of each topic word according to the data acquired in step 3 and the real heat indexes;
step 5, predicting the heat value of each topic word in the next period using the combined prediction method;
step 6, updating the topic word database after the data acquired in step 3 has been processed as in steps 1 to 2;
step 7, assigning a weight to the crawler corresponding to each topic word in the updated topic word database according to the predicted heat value, the server adjusting the CPU occupancy upper limit and the number of started processes of each crawler according to that weight, and repeating steps 3 to 7.
Further, step 2 includes the following steps:
step 21, data cleaning: removing all characters other than Chinese characters from the data using a regular expression;
step 22, Chinese word segmentation: segmenting each acquired text into words;
step 23, stop-word removal: removing stop words from the words segmented in step 22;
step 24, using the vector space model to convert each piece of data into a multi-dimensional vector formed by the weights of feature words.
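Steps 21 and 23 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the stop-word set is a hypothetical three-word stand-in for a real stop-word list, the character range `\u4e00-\u9fff` is assumed to be an adequate definition of "Chinese characters", and the segmenter of step 22 is passed in as any callable.

```python
import re

# Hypothetical stop-word list; a real system would load a published list.
STOP_WORDS = {"的", "了", "是"}

def clean(text):
    """Step 21: keep only Chinese characters (assumed range U+4E00..U+9FFF)."""
    return re.sub(r"[^\u4e00-\u9fff]", "", text)

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Step 23: drop every word that appears in the stop-word list."""
    return [w for w in words if w not in stop_words]

def preprocess(text, segment):
    """Steps 21 to 23; `segment` is any callable that splits a string into words."""
    return remove_stop_words(segment(clean(text)))

# A character-level split stands in for a real segmenter (step 22) here.
print(preprocess("今天的天气@user #好# :)", list))
```

Passing the segmenter as a parameter keeps the cleaning and stop-word logic independent of whichever word segmentation library is actually used.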
Further, step 2 also includes the following:
using a cluster analysis method, each piece of data initially forms its own cluster; the data with the highest similarity under the similarity measure are merged first, and clusters are then merged in order of decreasing similarity. The similarity between clusters falls as merging proceeds, and merging stops when the similarity threshold is reached. Each resulting cluster is called a topic, and the feature words contained in each topic are stored in a database as topic words to form the topic word database.
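The merging procedure described here is bottom-up (agglomerative) clustering with a stopping threshold. The sketch below illustrates it under two assumptions that the text does not fix: cosine similarity as the measure, and average linkage for the similarity between two clusters. The function names and the sample vectors are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_similarity(c1, c2, vectors):
    """Average pairwise similarity between two clusters (average linkage, an assumption)."""
    total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(vectors, threshold):
    """Merge the most similar clusters until the best similarity drops below threshold."""
    clusters = [[i] for i in range(len(vectors))]  # every item starts as its own cluster
    while len(clusters) > 1:
        best, pair = -math.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cluster_similarity(clusters[a], clusters[b], vectors)
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(agglomerate(vectors, threshold=0.8))  # [[0, 1], [2, 3]]
```

The exhaustive pair search is quadratic per merge, which is acceptable for a sketch; a production system would more likely rely on an established clustering library.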
Further, in the step 5, the combined prediction algorithm includes an exponential smoothing method, a back propagation neural network and an entropy method, the exponential smoothing method and the back propagation neural network are used to calculate the predicted heat value of the topic respectively, and then weights are given to the calculation results of the exponential smoothing method and the back propagation neural network according to the entropy method, so as to obtain the predicted heat value of the topic.
Further, the exponential smoothing method adopts a quadratic exponential smoothing method to obtain the predicted heat value.
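A quadratic (second-order) exponential smoothing forecast can be sketched as follows. The sketch assumes Brown's formulation and initializes both smoothed series with the first observation; neither choice is stated in the text, and the function name and sample series are illustrative.

```python
def double_exponential_forecast(series, alpha, horizon=1):
    """Brown's second-order exponential smoothing forecast `horizon` periods ahead."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    s1 = s2 = series[0]  # initial smoothed values (an assumption)
    for x in series[1:]:
        s1 = alpha * x + (1 - alpha) * s1   # first smoothing
        s2 = alpha * s1 + (1 - alpha) * s2  # second smoothing
    level = 2 * s1 - s2
    trend = alpha / (1 - alpha) * (s1 - s2)
    return level + trend * horizon

# Seven periods of hypothetical heat values, forecasting period 8.
heat = [120.0, 125.0, 123.0, 130.0, 128.0, 135.0, 140.0]
print(double_exponential_forecast(heat, alpha=0.3))
```

A larger smoothing coefficient weights recent periods more heavily, which suits series that fluctuate strongly.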
Further, the back propagation neural network repeatedly corrects its network weights and thresholds by training on sample data, so that the error function decreases along the negative gradient direction until it falls to a preset threshold or a preset number of iterations is reached, which yields the weights of the input layer and the output layer. Finally, the earlier real heat values are input into the trained back propagation neural network to obtain the predicted heat value.
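The training loop described here (gradient descent on an error function until a tolerance or an iteration limit is reached) can be illustrated with a small pure-Python network. This is a sketch, not the configuration of the embodiment: it assumes one sigmoid hidden layer, a linear output, squared error and plain per-sample gradient descent, and all names and the sample series are hypothetical.

```python
import math
import random

def train_bp(samples, hidden=3, lr=0.1, max_epochs=1000, tol=1e-4, seed=0):
    """Train a one-hidden-layer network; returns (predict, loss_history)."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        # Sigmoid hidden layer followed by a linear output unit.
        h = [1 / (1 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
             for row, b in zip(w1, b1)]
        return h, sum(w * hi for w, hi in zip(w2, h)) + b2

    history = []
    for _ in range(max_epochs):
        loss = 0.0
        for x, y in samples:
            h, out = forward(x)
            err = out - y
            loss += 0.5 * err * err
            for j in range(hidden):
                # Hidden-layer gradient uses the output weight before it is updated.
                grad_h = err * w2[j] * h[j] * (1 - h[j])
                w2[j] -= lr * err * h[j]
                for i in range(n_in):
                    w1[j][i] -= lr * grad_h * x[i]
                b1[j] -= lr * grad_h
            b2 -= lr * err
        history.append(loss / len(samples))
        if history[-1] < tol:  # stop once the mean error is within tolerance
            break
    return (lambda x: forward(x)[1]), history

# Hypothetical normalized heat series: three inputs predict the next value.
series = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
samples = [(series[i:i + 3], series[i + 3]) for i in range(4)]
predict, history = train_bp(samples)
print(len(history), history[-1], predict([0.5, 0.6, 0.7]))
```

Inputs are assumed to be scaled to a small range such as [0, 1]; unscaled heat values would need normalization before training.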
Further, the entropy method determines index weight according to the size of the entropy provided by each index observation value, and judges the discrete degree of the predicted heat value according to the entropy.
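The general entropy weight method can be sketched as below. The text does not specify which observations form the input matrix when the two forecasts are combined, so the matrix layout (rows are observations, columns are indexes), the non-negativity requirement and the function name are assumptions; the sketch shows only the generic technique, in which an index whose observations are more dispersed has lower entropy and receives a larger weight.

```python
import math

def entropy_weights(matrix):
    """Entropy weight method: one weight per column, weights sum to 1.

    matrix[t][j] is the non-negative observation of index j in row t.
    """
    n, m = len(matrix), len(matrix[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        if total == 0:
            divergences.append(0.0)  # an all-zero column carries no information
            continue
        probs = [v / total for v in column]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        divergences.append(max(0.0, 1 - entropy))
    s = sum(divergences)
    if s == 0:
        return [1 / m] * m  # all columns uniform: fall back to equal weights
    return [d / s for d in divergences]

# Column 0 is uniform (no dispersion); column 1 varies and gets the weight.
print(entropy_weights([[1, 1], [1, 3], [1, 8]]))
```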
Further, in step 7, CPU resources are allocated to the crawlers by running multiple processes: a weight is assigned to the crawler corresponding to each topic word in the updated topic word database according to the predicted heat value, and the server adjusts the CPU upper limit and the number of started processes of each crawler according to that weight.
Compared with the prior art, the invention has the following beneficial effects. A method for calculating the predicted heat value of a topic is provided by combining agglomerative hierarchical clustering, exponential smoothing, a back propagation neural network and the entropy method. By predicting the future heat of each topic and scheduling the crawlers corresponding to high-heat topics, more high-heat data is obtained, so that high-heat topics are tracked preferentially under limited resources, more data related to hot topics is acquired, and hot topic trends are grasped more comprehensively and promptly.
Drawings
FIG. 1 is a flow chart of the present embodiment;
FIG. 2 is a distribution graph of the real heat values in the present embodiment;
FIG. 3 is a fitted curve of the real heat values;
FIG. 4 is a graph of the real heat trend of each topic in periods 1 to 7 in the present embodiment;
FIG. 5 is a graph of data volume versus heat value for each category of topic in period 8;
FIG. 6 is a graph of heat versus period number for the first category of topics;
FIG. 7 is a graph of heat versus period number for the second category of topics;
FIG. 8 is a graph of heat versus period number for the third category of topics;
FIG. 9 is a graph of heat versus period number for the fourth category of topics;
FIG. 10 is a graph of heat versus period number for the fifth category of topics;
FIG. 11 is a graph of heat versus period number for the sixth category of topics;
FIG. 12 is a graph of heat versus period number for the seventh category of topics;
FIG. 13 is a graph of heat versus period number for the eighth category of topics;
FIG. 14 is a graph of the data volume of each category of topic versus period number.
Detailed Description
The invention will be further described with reference to the accompanying drawings. Embodiments of the present invention include, but are not limited to, the following examples.
The topic word searching crawler scheduling system based on the combined prediction method comprises a first obtaining module, a data preprocessing module, a vector space model, a clustering module, a second obtaining module, a real heat index weight calculating module, a real heat value calculating module, an updating module, a heat value predicting module and a CPU (central processing unit) distributing module.
Further, the first obtaining module obtains text data from a data source using a crawler according to keywords set by a user; in this embodiment, the data source is a microblog platform.
Further, the data preprocessing module preprocesses the data acquired by the first obtaining module; preprocessing comprises data cleaning, Chinese word segmentation and stop-word removal. Data cleaning mainly uses a regular expression to remove all characters other than Chinese characters; characters such as @ and #, as well as emoticons, are all cleaned. Chinese word segmentation splits a Chinese character sequence that contains no spaces into meaningful words: efficient word-graph scanning over a prefix dictionary generates a directed acyclic graph of all the ways the characters in a sentence can form words, and dynamic programming then finds the maximum-probability path, which gives the most probable segmentation and splits each piece of acquired text data into words. Stop-word removal deletes words that do not express text characteristics, using a constructed stop-word list.
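The word-graph scan and maximum-probability path search can be illustrated with a toy dictionary. This is a minimal sketch: the word frequencies are hypothetical, unknown single characters are given a frequency of 1 as a fallback, and a production segmenter would use a far larger dictionary and handle out-of-vocabulary words more carefully.

```python
import math

# Hypothetical prefix dictionary: word -> frequency.
FREQ = {"研究": 5, "研究生": 3, "生命": 4, "的": 10, "起源": 4,
        "研": 1, "究": 1, "生": 1, "命": 1, "起": 1, "源": 1}
TOTAL = sum(FREQ.values())

def build_dag(sentence, freq=FREQ):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i, len(sentence)) if sentence[i:j + 1] in freq]
        dag[i] = ends or [i]  # fall back to a single character
    return dag

def segment(sentence, freq=FREQ, total=TOTAL):
    """Pick the maximum-probability path through the word graph by dynamic programming."""
    n = len(sentence)
    dag = build_dag(sentence, freq)
    best = {n: (0.0, n)}  # index -> (best log-probability, end index of first word)
    for i in range(n - 1, -1, -1):  # work from the end of the sentence backwards
        best[i] = max(
            (math.log(freq.get(sentence[i:j + 1], 1) / total) + best[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:  # follow the stored choices from the start
        j = best[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words

print(segment("研究生命的起源"))  # ['研究', '生命', '的', '起源']
```

With these frequencies the path through "研究" and "生命" has a higher probability than the path through "研究生" and "命", which is why the ambiguous prefix is resolved in favor of the shorter word.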
Furthermore, the vector space model is a commonly used text representation model. Any text data can be split into words by word segmentation, and the text data can then be represented as a word vector whose components are the words in segmentation order. More generally, let the text data set be $T = \{t_1, t_2, \dots, t_n\}$, where each $t_i$ ($i = 1, 2, \dots, n$) is a piece of text data. After deduplication, word-frequency thresholding, stop-word removal and similar processing, the word vectors of all the text data can be combined into a new word vector, called the feature word library (or feature word space) of the text data set $T$. According to whether, or how many times, its words appear in the feature word space, any text data $t_i$ can be represented as a vector in the feature word space; this representation is called the vector space model of the text data. For a given feature word library $K = (k_1, k_2, \dots, k_m)$, where each $k_j$ ($j = 1, 2, \dots, m$) is a feature word, the vector space model of the text data in this embodiment is expressed as
$$t_i = (w_{i1}, w_{i2}, \dots, w_{im}), \quad (1)$$
where each $w_{ij}$ is calculated as follows:
[Formula (2): equation image not reproduced in this text]
where $r_{ij}$ is the number of occurrences of feature word $k_j$ in text data $t_i$, $k_j \in t_i$ indicates that feature word $k_j$ appears in text data $t_i$, and $|T|$ is the number of elements in the set $T$, i.e. $|T| = n$. In this embodiment, for any feature word $k_j$,
[Formula: equation image not reproduced in this text]
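The weighting formula itself is not reproduced in this text, so the sketch below assumes a standard TF-IDF weight, $w_{ij} = r_{ij} \cdot \log(|T| / |\{t : k_j \in t\}|)$, which is consistent with the quantities defined above ($r_{ij}$, $k_j \in t_i$ and $|T|$) but may differ from the patented formula. The function name and the sample documents are illustrative.

```python
import math

def build_vectors(docs):
    """docs: list of word lists. Returns (feature_words, vectors).

    Each weight is term count times log(N / document frequency),
    an assumed TF-IDF form.
    """
    features = sorted({w for doc in docs for w in doc})
    n = len(docs)
    doc_freq = {k: sum(1 for doc in docs if k in doc) for k in features}
    vectors = [
        [doc.count(k) * math.log(n / doc_freq[k]) for k in features]
        for doc in docs
    ]
    return features, vectors

docs = [["疫情", "防控", "疫情"], ["防控", "经济"]]
features, vectors = build_vectors(docs)
print(features)
print(vectors)
```

Under this weighting, a word that appears in every document gets weight 0, so it contributes nothing to the similarity between documents.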
Further, based on the vector space model of the text data, the clustering module applies a cluster analysis method to the text data set $T$ and divides it into different clusters. Cluster analysis rests on a similarity measure between the objects being clustered; in this embodiment the similarity measure between two pieces of text data is the Euclidean distance, calculated as follows:
[Formula (3): equation image not reproduced in this text]
where $w_{ij}$ can be calculated according to formula (2), and $t_i$ and $t_l$ are two pieces of text data in the set $T$. $d(t_i, t_l)$ lies between 0 and 1; the smaller $d(t_i, t_l)$ is, the more dissimilar $t_i$ and $t_l$ are, and the larger it is, the more similar they are. Based on the cosine similarity measure between text data, this embodiment adopts agglomerative hierarchical clustering: according to the cosine similarity between text data and a set similarity threshold, the text data with the highest similarity are merged into clusters first, until all the text data have been merged into clusters.
Further, the topic word extraction module treats each cluster produced by the clustering module as a topic. Let the clusters be $T' = \{T_1, T_2, \dots, T_p\}$; each cluster $T_q$ ($q = 1, 2, \dots, p$) is a topic. Based on the vector space model of formula (1), topic $T_q$ can also be represented as a matrix:
[Formula: matrix image not reproduced in this text]
According to formula (2), $w_{ij}$ ($i = 1, 2, \dots, n$; $j = 1, 2, \dots, m$) can also be understood as the weight of feature word $k_j$ in text data $t_i$. Accordingly, the weight $W_{qj}$ of feature word $k_j$ in topic $T_q$ is calculated as follows:
[Formula: equation image not reproduced in this text]
where $|T_q|$ is the number of pieces of text data in topic $T_q$. A weight threshold $\beta_q$ is set for topic $T_q$; the feature words in $\{k_1, k_2, \dots, k_m\}$ whose weight in $T_q$ is less than $\beta_q$ are excluded, and the remaining feature words are taken as the topic words of $T_q$ and stored in the topic word database, i.e.
[Formula: equation image not reproduced in this text]
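The threshold-based selection of topic words can be sketched as follows. Because the formula for $W_{qj}$ is not reproduced in this text, the sketch assumes it is the mean of $w_{ij}$ over the text data in $T_q$, which matches the role of $|T_q|$ in the description; the function name and sample data are hypothetical.

```python
def topic_words(features, vectors, cluster, beta):
    """Return the feature words whose mean weight within the cluster is at least beta.

    features: list of feature words k_1..k_m
    vectors:  vectors[i][j] is the weight of feature j in text i
    cluster:  indices of the texts that belong to the topic
    """
    size = len(cluster)
    selected = []
    for j, word in enumerate(features):
        mean_weight = sum(vectors[i][j] for i in cluster) / size  # assumed form of W_qj
        if mean_weight >= beta:
            selected.append(word)
    return selected

features = ["疫情", "防控", "经济"]
vectors = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0], [0.0, 0.0, 1.0]]
print(topic_words(features, vectors, cluster=[0, 1], beta=0.5))  # ['疫情']
```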
Further, the second acquisition module extracts the topic words of topic $T_q$ from the topic word database and uses a crawler to acquire data from the data source according to the extracted topic words. Let the text data set obtained by the crawler from the topic words of $T_q$ be $T'_q = \{t'_1, t'_2, \dots, t'_{n'}\}$, where $n'$ is the number of pieces of text data crawled according to the topic words of $T_q$.
Further, the parsing module parses the forwarding amount, praise amount and comment amount from each piece of text data $t'_{i'}$ ($i' = 1, 2, \dots, n'$) obtained by the second acquisition module.
Further, the real heat index weight calculation module takes the forwarding amount, praise amount and comment amount as the real heat calculation indexes and determines the index weights $\omega_{i''}$ ($i'' = 1, 2, 3$) by a weight analysis method.
Further, in this embodiment, the analytic hierarchy process is used to determine the index weights.
Furthermore, the real heat value calculation module calculates the real heat value of topic $T_q$ in period $\tau$ from the text data acquired by the second acquisition module, the data parsed by the analysis module and the weights obtained by the real heat index weight calculation module. Let the forwarding amount, praise amount and comment amount of text data $t'_{i'}$ be $b_{i'1}$, $b_{i'2}$ and $b_{i'3}$ respectively, where $t'_{i'}$ ($i' = 1, 2, \dots, n'$) is text data acquired by the crawler in period $\tau$. The real heat value of $t'_{i'}$ is calculated by formula (7), and the real heat value of topic $T_q$ in period $\tau$ is calculated by formula (8):
[Formula (7): equation image not reproduced in this text]
[Formula (8): equation image not reproduced in this text]
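Formulas (7) and (8) are not reproduced in this text. Following the module description, which computes the heat of each text from its three indexes and the index weights and then takes the mean over the texts of a topic, a minimal sketch could look like this. The weighted-sum form, the use of raw counts and all names and numbers are assumptions.

```python
def text_heat(counts, weights):
    """Weighted sum of (forwarding, praise, comment) counts for one text."""
    return sum(w * b for w, b in zip(weights, counts))

def topic_heat(texts, weights):
    """Mean heat over all texts crawled for one topic in one period."""
    return sum(text_heat(counts, weights) for counts in texts) / len(texts)

# Hypothetical index weights and two crawled texts for one topic.
weights = (0.5, 0.25, 0.25)
texts = [(2, 4, 4), (4, 8, 8)]
print(topic_heat(texts, weights))  # 4.5
```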
The updating module updates the topic word database after the text data acquired by the second acquisition module has been processed by the data preprocessing module and the vector space model.
Furthermore, the prediction heat value module obtains the predicted heat value of each topic from the real heat values produced by the real heat value calculation module. For topic $T_q$, the correspondence between the period number $\tau$ and the real heat value of $T_q$ in that period can be constructed as shown in Table 1:
TABLE 1
[Table 1: image not reproduced in this text]
The relationship between the period number $\tau$ and the real heat value is shown in FIG. 2. The curve fitted to the real heat values for period numbers up to $\tau = 7$ is shown in FIG. 3, and an existing prediction method is used to predict the combined predicted heat value of topic $T_q$ in period $\tau + d$.
Further, in this embodiment, the exponential smoothing method and the back propagation neural network each calculate a predicted heat value of the topic, and the entropy method then assigns weights to the two results to obtain the combined predicted heat value of topic $T_q$.
Further, the CPU distribution module assigns a weight to the crawler corresponding to each topic according to the combined predicted heat value, and the server adjusts the CPU occupancy upper limit of the crawler corresponding to topic $T_q$ according to that weight.
The crawlers are deployed on the same Linux server, and the CPU upper limit of the crawler corresponding to topic $T_q$ is constrained with the cpulimit command on the Linux system. While a crawler stays below its specified CPU usage upper limit, no restriction is applied to it; when it is about to exceed the limit, the server adjusts dynamically so that its usage stays around the limit. The CPU usage upper limit of the crawler responsible for topic $T_q$ in period $\tau + d$ is calculated as follows:
[Formula: equation image not reproduced in this text]
In the formula, the left-hand side is the CPU upper limit of the crawler corresponding to topic $T_q$ in period $\tau + d$, $M$ is the percentage of CPU resources available when the server is idle, $C'$ is the percentage of CPU currently in use on the server, the heat term is the combined predicted heat value of topic $T_q$ in period $\tau + d$, and $p$ is the total number of topics. If the CPU consumed by the crawler of $T_q$ cannot reach the upper limit, additional processes can be started so that the crawler makes full use of its CPU resources. The number of processes started by the crawler corresponding to topic $T_q$ in period $\tau + d$ is given by:
[Formula: equation image not reproduced in this text]
In the formula, the terms are the percentage of CPU occupied by the crawler corresponding to topic $T_q$ itself in period $\tau + d$, the multiple by which the CPU resources available to that crawler in period $\tau + d$ exceed the CPU it occupies itself, and the number of processes the crawler needs to start in period $\tau + d$. Under the first condition of the formula, the number of processes is fixed so that the crawler for every topic runs at least one process; under the second condition, it is fixed so as to prevent problems such as insufficient memory caused by starting too many processes.
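The allocation formulas are not reproduced in this text. The sketch below shows one plausible reading of the description, assuming that the spare CPU percentage ($M - C'$) is split among topics in proportion to their combined predicted heat, and that the process count is the floor of the limit divided by the per-process CPU share, clamped between 1 and a maximum. The proportional split, the floor and the `max_processes` parameter are all assumptions.

```python
import math

def cpu_limits(predicted_heat, idle_capacity, in_use):
    """Split the spare CPU percentage among topics in proportion to predicted heat."""
    spare = max(idle_capacity - in_use, 0.0)
    total = sum(predicted_heat)
    return [spare * h / total for h in predicted_heat]

def process_count(cpu_limit, per_process_cpu, max_processes):
    """Number of crawler processes to start, clamped to [1, max_processes]."""
    count = math.floor(cpu_limit / per_process_cpu)
    return min(max(count, 1), max_processes)

# Two topics with hypothetical predicted heat values 30 and 10;
# 90% of the CPU is usable when idle and 10% is currently in use.
limits = cpu_limits([30, 10], idle_capacity=90, in_use=10)
print(limits)  # [60.0, 20.0]
print([process_count(limit, per_process_cpu=8, max_processes=5) for limit in limits])  # [5, 2]
```

The lower clamp keeps at least one process running for every topic, and the upper clamp guards against memory exhaustion from starting too many processes.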
Further, if the current period exceeds the predicted period τ + d, repeating the steps 3-7.
As shown in fig. 1, the topic word search crawler scheduling method based on the combined prediction method includes the following steps:
step 1, setting keywords, and acquiring text data from a data source by using a crawler according to the keywords;
step 2, preprocessing the text data, changing the preprocessed text data into a multidimensional vector formed by the weights of the feature words, dividing the multidimensional vector into clusters, wherein each cluster is called a theme, and partial feature words contained in each theme are stored in a database as theme words to form a theme word database;
step 3, compiling a corresponding number of crawlers according to the number of topics, extracting topic words in a topic word database, acquiring data from a data source by the crawlers according to the topic words, and establishing a real heat index for the crawled text data according to forwarding amount, praise amount and comment amount;
step 4, calculating the real heat value of each theme according to the data acquired in the step 3 and the real heat index;
step 5, predicting the predicted heat value of each topic in the next period by using a combined prediction method;
step 6, updating the subject term database after the text data acquired in the step 3 is processed in the steps 1 to 2;
step 7, assigning a weight to the crawler corresponding to each topic in the updated topic word database according to the predicted heat value, the server adjusting the CPU occupancy upper limit and the number of started processes of each crawler according to that weight, and repeating steps 3 to 7.
Examples
According to a keyword library initially set by the user, the crawlers corresponding to the topics crawl and store real-time text data from Sina Weibo. 3000 pieces of text data are extracted as training samples and undergo Chinese word segmentation and stop-word removal. The vector space is constructed using formula (1), the feature word weights are calculated using formula (2), hierarchical clustering is performed using formula (3), and the hierarchical clustering tree is cut at a threshold of 100. The stop words come from the Harbin Institute of Technology (HIT) stop-word list, and a dictionary of words that must not be split is added manually. The resulting topics are shown in Table 2:
TABLE 2
[Table 2: image not reproduced in this text]
The table contains eight topics. According to the topic words, crawlers crawl the corresponding text data on the microblog platform and parse the forwarding amount, praise amount and comment amount of each piece of text data; during this process, the CPU usage upper limit of each crawler is set to one eighth of the server's remaining CPU percentage. In this embodiment, the weights of the forwarding amount, praise amount and comment amount are determined by the analytic hierarchy process. First, a judgment matrix $J$ is constructed:
[Judgment matrix J: image not reproduced in this text]
The weights of the comment amount, forwarding amount and praise amount obtained by the arithmetic mean method are approximately 0.7012, 0.1596 and 0.1390 respectively. Taking one day as a period, after one week of crawling the real heat value of each topic in each period (day) of the week is obtained from formulas (7) and (8). The heat trend of each topic in periods 1 to 7 is shown in FIG. 3, where the horizontal axis is the period number and the vertical axis is the real heat value. The exponential-smoothing predicted heat value of each category of topic for period 8 is calculated by the quadratic exponential smoothing method. Because the real heat of every topic except the category 7 topic fluctuates little, a small smoothing coefficient between 0 and 1 is used for those topics, 0.3 in this embodiment, while 0.8 is used for the category 7 topic.
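The arithmetic mean method for deriving weights from a judgment matrix normalizes each column and then averages each row. The sketch below uses a hypothetical, perfectly consistent 3×3 matrix, since the judgment matrix $J$ of the embodiment is not reproduced in this text; its output is therefore not the embodiment's weights.

```python
def ahp_weights(judgment):
    """Arithmetic mean method: normalize each column, then average each row."""
    n = len(judgment)
    col_sums = [sum(judgment[i][j] for i in range(n)) for j in range(n)]
    return [
        sum(judgment[i][j] / col_sums[j] for j in range(n)) / n
        for i in range(n)
    ]

# Hypothetical pairwise comparisons (row i is judged J[i][j] times as important as j).
J = [
    [1.0, 2.0, 6.0],
    [1 / 2, 1.0, 3.0],
    [1 / 6, 1 / 3, 1.0],
]
print(ahp_weights(J))  # approximately [0.6, 0.3, 0.1]
```

For a perfectly consistent matrix such as this one, the method recovers the underlying weights exactly; for an inconsistent matrix it gives an approximation, and a consistency check is normally applied as well.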
When the back propagation (BP) neural network is used to predict topic heat, the hidden layer and the output layer use the ReLU activation function, the loss function is the cross-entropy loss, and the optimizer is Adam. The network has 1 hidden layer with 3 nodes, 3 input nodes and 1 output node, and the learning rate is set to 0.01. The earlier real heat value sequences of topics one to eight are each divided into groups of 4 real heat values, each group serving as one sample: the last heat value of the group is the output and the remaining values are the input. Each sample is learned and used to update the connection weights between the input layer and the hidden layer and between the hidden layer and the output layer. The maximum number of training iterations is set to 1000, and learning ends once the error falls to the allowable limit of 0.0001. The trained network is then used to predict the heat value of each topic in period 8, taking the real heat values of periods 5 to 7 as input. Finally, the two methods are integrated by the entropy method: the period-8 predicted heat value from quadratic exponential smoothing and the period-8 predicted heat value from the BP neural network are weighted and summed to obtain the combined predicted heat value (the entropy-method predicted heat value) for period 8. The rounded results are shown in Table 3 below:
TABLE 3
Figure BDA0003128324030000101
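The sample construction and the 3-3-1 network shape described above can be sketched as follows. The sketch assumes the groups of 4 are overlapping sliding windows (with seven periods, non-overlapping groups would yield a single sample), and it shows only a forward pass with placeholder weights, not the training procedure. All names are hypothetical.

```python
def make_samples(series, window=4):
    """Slide a window over the series: first 3 values in, last value out."""
    samples = []
    for start in range(len(series) - window + 1):
        group = series[start:start + window]
        samples.append((group[:-1], group[-1]))
    return samples


def relu(x):
    return x if x > 0 else 0.0


def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a 3-3-1 network with ReLU on both layers."""
    hidden = [
        relu(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return relu(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# Real heat values of periods 1 to 7 (hypothetical numbers).
history = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
training_samples = make_samples(history)
prediction_input = history[-3:]  # periods 5 to 7 feed the period-8 forecast

# Placeholder weights; a trained network would supply these.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
output = forward(prediction_input, identity, [0.0] * 3, [1.0, 1.0, 1.0], 0.0)
```

In practice the heat values would normally be scaled before training, since raw values in the thousands interact poorly with a learning rate of 0.01; that scaling step is a suggestion, not something stated in the embodiment.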
The entropy values and weights obtained by the entropy method are shown in Table 4 below:
TABLE 4
Figure BDA0003128324030000102
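One way to derive such weights is the standard entropy weight method, sketched below. It treats each prediction method as a column of values across topics, computes the normalized entropy of each column, and weights columns by their degree of divergence (1 minus entropy). Whether the embodiment applies the method to the predicted values themselves or to another quantity is an assumption here, and the function names and numbers are hypothetical.

```python
import math


def entropy_weights(columns):
    """Standard entropy weight method over columns of positive values."""
    n = len(columns[0])
    divergences = []
    for col in columns:
        total = sum(col)
        probs = [x / total for x in col]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        divergences.append(max(0.0, 1.0 - entropy))
    d_sum = sum(divergences)
    if d_sum == 0:
        # All columns are uniform: fall back to equal weights.
        return [1.0 / len(columns)] * len(columns)
    return [d / d_sum for d in divergences]


def combine(columns, weights):
    """Weighted sum of the per-method predictions for each topic."""
    n = len(columns[0])
    return [sum(w * col[i] for w, col in zip(weights, columns)) for i in range(n)]


# Hypothetical period-8 predictions for four topics from the two methods.
smoothing_pred = [5200.0, 3100.0, 20500.0, 4100.0]
bp_pred = [5000.0, 3300.0, 19800.0, 3900.0]
weights = entropy_weights([smoothing_pred, bp_pred])
combined_pred = combine([smoothing_pred, bp_pred], weights)
```

Because the weights sum to 1, each combined value lies between the two individual predictions for that topic.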
The server used in this embodiment has 16 CPUs, so the CPU percentage available on the server when idle is 1600%. Other processes already occupy nearly 1100%. For period 8, the upper limit of the CPU percentage (at least 1%) that the server needs to allocate to each crawler is calculated from the prediction results and formula (18). A single crawler process is found to occupy 3% of CPU, so the number of processes to start for each crawler in period 8 can be obtained according to formula (19). The CPU upper limits and process counts are shown in Table 5 below:
TABLE 5
Figure BDA0003128324030000103
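The allocation step can be sketched under explicit assumptions. Formulas (18) and (19) are not reproduced in this passage, so the sketch assumes that each crawler's CPU upper limit is the remaining CPU percentage split in proportion to predicted heat, floored at 1%, and that the process count is the limit divided by the 3% per-process cost, rounded down and floored at 1. The function name and heat values are hypothetical.

```python
import math


def allocate_cpu(predicted_heat, remaining_cpu_pct, per_process_pct=3.0, min_pct=1.0):
    """Return {topic: (cpu_upper_limit_pct, process_count)}."""
    total_heat = sum(predicted_heat.values())
    plan = {}
    for topic, heat in predicted_heat.items():
        # Assumed stand-in for formula (18): proportional share, at least min_pct.
        limit = max(min_pct, remaining_cpu_pct * heat / total_heat)
        # Assumed stand-in for formula (19): whole processes that fit in the limit.
        processes = max(1, math.floor(limit / per_process_pct))
        plan[topic] = (limit, processes)
    return plan


# 16 CPUs give 1600%; roughly 1100% is already in use.
remaining = 1600.0 - 1100.0
plan = allocate_cpu({"topic_1": 300.0, "topic_2": 100.0, "topic_3": 100.0}, remaining)
```

The 1% floor can push the sum of the limits slightly above the remaining capacity when some topics have very low heat, so a production scheduler would likely renormalize after applying it.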
Using the period-8 process counts for each topic word search crawler given in the table, data is crawled from Weibo according to the topic words contained in each topic, and the amount crawled by each topic's crawler is counted. The amounts crawled without scheduling and with scheduling according to the predicted heat value are counted separately, as shown in Fig. 4. In Fig. 4 the abscissa is the topic category and the ordinate is the amount of data crawled for each topic's words together with the predicted heat value: the first and second columns represent the data volume crawled for each topic without and with scheduling, respectively, and the third column represents the heat value of each topic in period 8.
In period 8, the higher-heat topics 1, 5, 7 and 8 obtained 3572, 4026, 2338 and 3274 pieces of data respectively with scheduling, against 762, 1285, 594 and 827 without scheduling. For these higher-heat topics, the crawlers run in this embodiment therefore obtained in total about 3.81 times (380.9% of) the data obtained without scheduling. The data volume obtained by the crawler of each scheduled topic is positively correlated with the predicted heat value: the higher the predicted heat value, the more CPU resources the crawler receives and the more data it obtains.
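The 380.9% figure can be checked directly from the eight data volumes quoted above. The short calculation below shows that it corresponds to the ratio of the scheduled total to the non-scheduled total; the per-topic ratios are included for comparison. Variable names are illustrative.

```python
scheduled = [3572, 4026, 2338, 3274]    # topics 1, 5, 7, 8 with scheduling
unscheduled = [762, 1285, 594, 827]     # the same topics without scheduling

# Ratio of totals, expressed as a percentage.
total_ratio_pct = sum(scheduled) / sum(unscheduled) * 100

# Per-topic ratios, for comparison with the aggregate figure.
per_topic_ratios = [s / u for s, u in zip(scheduled, unscheduled)]
```

Rounded to one decimal place, `total_ratio_pct` is 380.9, i.e. the scheduled crawlers collected about 3.81 times as much data for these topics.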
Once the current period passes period 8, the process is repeated: the data of each topic in the most recent seven periods is extracted, the topic words are updated, the real heat value is calculated, the heat value of period 9 is predicted, and the crawlers are scheduled. Using this process, the data volume, predicted heat value and real heat value of each topic in periods 10 to 17 are obtained. The changes over time of the real and predicted heat values of each topic are shown in Figs. 6 to 13.
In Figs. 6 to 13, the abscissa is the period number and the ordinate is the heat value; the line with crosses represents the predicted heat value and the line with dots represents the real heat value for the corresponding period. The average relative error of each topic over periods 9 to 17 is calculated according to the average relative error formula:
Figure BDA0003128324030000111
where
Figure BDA0003128324030000112
is the predicted heat value and
Figure BDA0003128324030000113
is the real heat value.
The calculation results are shown in table 6 below:
TABLE 6
Figure BDA0003128324030000114
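The average relative error reported in Table 6 can be sketched with the usual definition: the mean over periods of the absolute difference between predicted and real heat, divided by the real heat. The patent's own formula appears only as an image, so this standard form is an assumption, and the function name and sample numbers are hypothetical.

```python
def mean_relative_error(predicted, actual):
    """Mean of |predicted - actual| / |actual| over all periods."""
    errors = [abs(p - a) / abs(a) for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)


# Hypothetical predicted and real heat values for two periods.
example_error = mean_relative_error([110.0, 90.0], [100.0, 100.0])
```

A value near 0 indicates that the combined prediction tracks the real heat closely.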
As Figs. 6 to 13 and Table 6 show, the combined predicted heat value in each period is close to the real heat value, so scheduling the crawler for each topic's feature words by the combined predicted heat value is reasonable. The change in each topic's data volume over the periods is shown in Fig. 14, where the abscissa is the period number and the ordinate is the data volume crawled by the crawler according to the topic words contained in each topic.
Take topics 5 and 7 as examples. Figs. 10 and 12 show that the real heat of topic 5 in every period is much higher than that of topic 7 (about 20000 versus about 4000). Correspondingly, Fig. 14 shows that the data volume obtained for topic 5 under scheduling by predicted value is also much higher than for topic 7, which is consistent with the goal that high-heat topics should obtain more data. Using the combined prediction method to allocate more resources to the topic word search crawlers of high-heat topics therefore lets those topics obtain more data and achieves the purpose of preferentially tracking high-heat topics.
The above is an embodiment of the present invention. The embodiment and its specific parameters serve only to show clearly the inventors' verification process and are not intended to limit the scope of the invention, which is defined by the claims; all equivalent structural changes made using the contents of the specification and drawings of the present invention fall within that scope.

Claims (9)

1. A topic word search crawler scheduling system based on a combined prediction method, characterized by comprising:
a first acquisition module, which acquires data from a data source using a topic word search crawler according to keywords set by a user;
a data preprocessing module, used for preprocessing the data acquired by the first acquisition module;
a vector space model, used for converting the preprocessed text data into multi-dimensional vectors formed by the weights of the feature words;
a clustering module, used for clustering the text data, each resulting cluster being a topic;
a topic word extraction module, used for extracting the topic words of each cluster and storing them in a database;
a second acquisition module, used for extracting the topic words from the database and acquiring data from the data source using the topic word search crawlers according to the extracted topic words;
an analysis module, used for parsing the data acquired by the second acquisition module to obtain the forwarding amount, praise amount and comment amount of each piece of text data;
a real heat index weight calculation module, used for taking the forwarding amount, praise amount and comment amount parsed by the analysis module as the real heat indexes of each piece of text data and calculating the weight of each index;
an updating module, used for processing the data acquired by the second acquisition module through the data preprocessing module, the vector space model and the clustering module to obtain the feature words of each cluster, and updating the original database with some of the feature words contained in each cluster as topic words;
a real heat value calculation module, which calculates the real heat of each piece of text data from the forwarding amount, praise amount and comment amount parsed by the analysis module and the index weights obtained by the real heat index weight calculation module, then calculates, for the topics obtained by the clustering module, the mean of the real heat of the text data contained in each topic, and takes that mean as the real heat value of each topic;
a predicted heat value module, used for predicting the heat value of each topic word in the next period;
and a CPU allocation module, used for having the server assign, according to the predicted heat value, a corresponding upper limit of CPU occupancy to the topic word search crawler of each topic and start the corresponding number of processes.
2. A topic word search crawler scheduling method based on a combined prediction method, characterized by comprising the following steps:
step 1, setting keywords, and using topic word search crawlers to acquire data from a data source according to the keywords;
step 2, preprocessing the data, converting the preprocessed text data into multi-dimensional vectors formed by the weights of the feature words, dividing the vectors into clusters, defining each cluster as a topic, and storing some of the feature words contained in each topic in a database as topic words;
step 3, extracting the topic words from the database, writing a corresponding number of topic word search crawlers according to the number of topics to obtain topic data from the data source, parsing the forwarding amount, praise amount and comment amount from the crawled data as real heat indexes, and determining the weight of each index using the analytic hierarchy process;
step 4, calculating the real heat value of each piece of text data from its forwarding amount, praise amount and comment amount and the real heat index weights obtained in step 3, then, for the topics obtained in step 2, averaging the real heat of the text data contained in each topic and taking the average as the real heat value of each topic;
step 5, fitting the curve of each topic's real heat value against the period number from the real heat values obtained in step 4, and obtaining the predicted heat value of each topic in the next period using a combined prediction method;
step 6, processing the data obtained in step 3 through steps 1 to 2, extracting new topic words and updating the database;
and step 7, updating the weight of each topic word search crawler according to the predicted heat value of the corresponding topic words, having the server adjust the upper limit of CPU occupancy of each topic word search crawler according to that weight, and repeating steps 3 to 7.
3. The topic word search crawler scheduling method and system based on the combined prediction method according to claim 2, wherein step 2 further comprises the following steps:
step 21, data cleaning: removing all characters other than Chinese characters from the data using a regular expression;
step 22, Chinese word segmentation: segmenting each acquired data text into words;
step 23, stop-word removal: removing the stop words from the words segmented in step 22;
and step 24, converting the text data into multi-dimensional vectors formed by the feature word weights using a vector space model.
4. The topic word search crawler scheduling method and system based on the combined prediction method according to claim 2, wherein step 2 further comprises the following step:
using a cluster analysis method, initially treating each piece of data as its own cluster, merging the data with the highest similarity according to a similarity measure, and merging data into clusters in order of similarity from high to low, the similarity between clusters decreasing as clusters are merged, until a similarity threshold is reached; each cluster is called a topic, and the feature words contained in each topic are stored in a database as topic words to form a topic word database.
5. The topic word search crawler scheduling method and system based on the combined prediction method according to claim 2, wherein: in step 5, the combined prediction algorithm comprises an exponential smoothing method, a back propagation neural network and an entropy method; the exponential smoothing method and the back propagation neural network each calculate a predicted heat value for the topic, and the two predicted heat values are then weighted according to the entropy method to obtain the combined predicted heat value of the topic.
6. The topic word search crawler scheduling method and system based on the combined prediction method according to claim 5, wherein: the exponential smoothing method adopts a quadratic exponential smoothing method to obtain a predicted heat value.
7. The topic word search crawler scheduling method and system based on the combined prediction method according to claim 5, wherein: the back propagation neural network continuously corrects the network weights and thresholds through training on sample data so that the error function descends along the negative gradient direction; training continues until the error function falls to the threshold or the preset number of iterations is reached, yielding the weights of the input layer and the output layer; finally, the earlier real values are input into the trained back propagation neural network to obtain the predicted heat value.
8. The topic word search crawler scheduling method and system based on the combined prediction method according to claim 5, wherein: the entropy method determines the index weights according to the entropy of the observed values of each index, and uses the entropy to obtain the degree of dispersion of the two groups of predicted heat values, so that the two predicted heat values are given corresponding weights and summed.
9. The topic word search crawler scheduling method and system based on the combined prediction method according to claim 5, wherein: in step 7, CPU is allocated to the crawlers by a multi-process method; each topic word in the updated topic word database is given a weight according to its predicted heat value, and the server adjusts the CPU upper limit and the number of started processes of the crawler corresponding to each topic word according to that weight.
CN202110701204.8A 2021-06-23 2021-06-23 Method and system for scheduling subject term search crawlers based on combined prediction method Active CN113536085B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110701204.8A CN113536085B (en) 2021-06-23 2021-06-23 Method and system for scheduling subject term search crawlers based on combined prediction method

Publications (2)

Publication Number Publication Date
CN113536085A true CN113536085A (en) 2021-10-22
CN113536085B CN113536085B (en) 2023-05-19

Family

ID=78096566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110701204.8A Active CN113536085B (en) 2021-06-23 2021-06-23 Method and system for scheduling subject term search crawlers based on combined prediction method

Country Status (1)

Country Link
CN (1) CN113536085B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024078070A1 (en) * 2022-10-14 2024-04-18 卡奥斯工业智能研究院(青岛)有限公司 Data collection resource quantity control method and apparatus, and device and storage medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090132493A1 (en) * 2007-08-10 2009-05-21 Scott Decker Method for retrieving and editing HTML documents
US20090319484A1 (en) * 2008-06-23 2009-12-24 Nadav Golbandi Using Web Feed Information in Information Retrieval
CN104933164A (en) * 2015-06-26 2015-09-23 华南理工大学 Method for extracting relations among named entities in Internet massive data and system thereof
CN106709052A (en) * 2017-01-06 2017-05-24 电子科技大学 Keyword based topic-focused web crawler design method
CN106777043A (en) * 2016-12-09 2017-05-31 宁波大学 A kind of academic resources acquisition methods based on LDA
CN106815297A (en) * 2016-12-09 2017-06-09 宁波大学 A kind of academic resources recommendation service system and method
US20180060437A1 (en) * 2016-08-29 2018-03-01 EverString Innovation Technology Keyword and business tag extraction
CN107885793A (en) * 2017-10-20 2018-04-06 江苏大学 A kind of hot microblog topic analyzing and predicting method and system
CN111324797A (en) * 2020-02-20 2020-06-23 民生科技有限责任公司 Method and device for acquiring data accurately at high speed
CN112200674A (en) * 2020-10-14 2021-01-08 上海谦璞投资管理有限公司 Stock market emotion index intelligent calculation information system
CN112650848A (en) * 2020-12-30 2021-04-13 交控科技股份有限公司 Urban railway public opinion information analysis method based on text semantic related passenger evaluation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAOTIAN DIAO: "Research of focused crawler for financial social network" *
王杰: "基于微博大数据的舆情监测系统的设计与实现" *

Also Published As

Publication number Publication date
CN113536085B (en) 2023-05-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant