CN113536085B - Method and system for scheduling subject term search crawlers based on combined prediction method - Google Patents

Publication number
CN113536085B
Authority
CN
China
Prior art keywords
data
subject
theme
module
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110701204.8A
Other languages
Chinese (zh)
Other versions
CN113536085A (en)
Inventor
陈智超
裴峥
孔明明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xihua University
Original Assignee
Xihua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xihua University
Priority to CN202110701204.8A
Publication of CN113536085A
Application granted
Publication of CN113536085B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of crawler scheduling methods, in particular to a method and a system for scheduling subject term search crawlers based on a combined prediction method. The system comprises a first acquisition module, a data preprocessing module, a vector space model, a clustering module, a subject term extraction module, a second acquisition module, a real heat index weight calculation module, a real heat value calculation module, an updating module, a predicted heat value module and a CPU distribution module. The method comprises the following steps: step 1, acquiring data from a data source; step 2, preprocessing the data; step 3, obtaining topic data and calculating the true heat indexes and the index weights; step 4, calculating the true heat value; step 5, calculating the predicted heat value of each topic in the next period; step 6, extracting new subject terms and updating the database; and step 7, allocating the upper limit of CPU occupancy so that more data related to high-heat topics is acquired. High-heat topics are thereby tracked preferentially under limited resources.

Description

Method and system for scheduling subject term search crawlers based on combined prediction method
Technical Field
The invention relates to the technical field of crawler scheduling methods, in particular to a method and a system for scheduling a subject term search crawler based on a combined prediction method.
Background
Tracking a topic requires a crawler to continuously acquire data related to that topic. To track hot topics preferentially when server resources are limited, the crawlers must be scheduled autonomously so that data related to hot topics is acquired first. Current crawler scheduling methods mainly include scheduling based on the update frequency of website data, scheduling based on URL distribution, scheduling based on network distance, and scheduling based on node task allocation. Scheduling based on website update frequency schedules crawlers according to how often the data source website is updated; it reduces the resource cost of the crawler server to some extent and suits crawlers for websites that update slowly. Scheduling based on URL distribution judges the similarity between the web page text and the topic set by the user and preferentially assigns highly similar URLs to crawlers; it cannot give priority to topics whose future heat will be high. Scheduling based on node task allocation mainly addresses load balancing among crawler servers: a large number of URLs are mapped onto a hash ring, each crawling node corresponds to one segment of the ring so that tasks are allocated reasonably among the nodes, and virtual nodes are added to improve the robustness of the crawler system; it cannot allocate tasks by heat in order to track topics.
Disclosure of Invention
To address these problems, the invention provides a method and a system for scheduling subject term search crawlers based on a combined prediction method. By predicting the future heat of each topic and scheduling the crawlers corresponding to high-heat topics, more data on high-heat topics is acquired, so that high-heat topics are tracked preferentially under limited resources.
In order to solve the technical problems, the invention adopts the following technical scheme:
a subject term search crawler scheduling system based on a combined prediction method comprises:
the first acquisition module, used for acquiring data from a data source with a subject term search crawler according to the keywords set by the user;
the data preprocessing module, used for preprocessing the data acquired by the first acquisition module;
the vector space model, used for converting the preprocessed text data into multidimensional vectors composed of feature word weights;
the clustering module, used for clustering the text data, each resulting cluster being taken as a topic;
the subject term extraction module, used for extracting the subject terms of each cluster and storing them in the database;
the second acquisition module, used for extracting the subject terms from the database and acquiring data from the data source with the crawler according to the extracted subject terms;
the parsing module, used for parsing the data acquired by the second acquisition module to obtain the numbers of forwards, likes, and comments of each text data;
the real heat index weight calculation module, used for taking the numbers of forwards, likes, and comments obtained by the parsing module as the true heat indexes of each text data and calculating the weight of each index;
the updating module, used for processing the data acquired by the second acquisition module through the data preprocessing module, the vector space model and the clustering module to obtain the feature words of each cluster, taking part of the feature words of each cluster as subject terms, and updating the subject terms in the original database;
the real heat value calculation module, used for calculating the true heat of each text data from its numbers of forwards, likes, and comments obtained by the parsing module and the index weights obtained by the real heat index weight calculation module, and then, for the topics obtained by the clustering module, averaging the true heat of the text data contained in each topic, the average being the true heat value of that topic;
the predicted heat value module, used for predicting the heat value of each subject term in the next period;
and the CPU distribution module, used by the server to assign, according to the predicted heat value, a corresponding upper limit of CPU occupancy to the crawler of each topic and to start the corresponding number of processes.
The method for scheduling subject term search crawlers based on the combined prediction method comprises the following steps:
step 1, setting keywords, and acquiring data from a data source with a crawler according to the keywords;
step 2, preprocessing the data, converting the preprocessed data into multidimensional vectors composed of feature word weights, dividing the vectors into clusters, manually labeling each cluster as a topic, and storing the feature words contained in each topic in a database as subject terms to form a subject term database;
step 3, extracting the subject terms from the subject term database, writing crawlers that acquire data from the data source according to the subject terms, parsing the numbers of forwards, likes, and comments from the crawled data as true heat indexes, and determining the weights of the true heat indexes with the analytic hierarchy process;
step 4, calculating the true heat value of each subject term from the data and the true heat indexes obtained in step 3;
step 5, predicting the heat value of each subject term in the next period with the combined prediction method;
step 6, processing the data obtained in step 3 as in steps 1 to 2, and then updating the subject term database;
step 7, assigning a weight to the crawler corresponding to each subject term in the updated subject term database according to the predicted heat value, the server adjusting the upper limit of CPU occupancy and the number of started processes of each crawler according to that weight, and repeating steps 3 to 7.
Further, step 2 further comprises the following steps:
step 21, data cleaning: removing all characters other than Chinese characters from the data with a regular expression;
step 22, Chinese word segmentation: segmenting each acquired text into words;
step 23, stop-word removal: removing the stop words from the words segmented in step 22;
step 24, using a vector space model to convert each piece of data into a multidimensional vector composed of feature word weights.
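Steps 21 and 23 can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the patented implementation: "Chinese characters" is taken to mean the basic CJK Unified Ideographs block, the function names `clean` and `remove_stop_words` are hypothetical, and the segmentation of step 22 is written out by hand instead of calling a real segmenter.

```python
import re

# Assumption: "Chinese characters" means the basic CJK Unified Ideographs
# block (U+4E00 to U+9FFF); rarer extension blocks are ignored here.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]")


def clean(text):
    """Step 21: drop every character that is not a Chinese character."""
    return NON_CHINESE.sub("", text)


def remove_stop_words(words, stop_words):
    """Step 23: keep only the words that are not in the stop-word list."""
    return [w for w in words if w not in stop_words]


raw = "@用户 今天的天气真好! #话题# abc"
cleaned = clean(raw)

# Step 22 would segment `cleaned` with a real word segmenter; the word list
# below is written by hand purely for illustration.
words = ["用户", "今天", "的", "天气", "真", "好", "话题"]
kept = remove_stop_words(words, {"的", "真"})
print(cleaned)
print(kept)
```

Symbols such as @ and #, spaces, punctuation, and Latin letters are all removed by the single character-class substitution, which is why the cleaning step can run before segmentation.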
Further, step 2 further comprises the following step:
using cluster analysis, each piece of data initially forms its own cluster; according to a similarity measure, the most similar data are merged first, and merging proceeds in order of decreasing similarity. As clusters merge, the similarity between the remaining clusters decreases, and merging stops when the similarity threshold is reached. Each resulting cluster is a topic, and the feature words contained in each topic are stored in a database as subject terms, forming the subject term database.
Further, in step 5, the combined prediction method comprises an exponential smoothing method, a back-propagation neural network, and an entropy method: the exponential smoothing method and the back-propagation neural network each calculate a predicted heat value of the topic, and the entropy method assigns weights to the two results to obtain the predicted heat value of the topic.
Further, the exponential smoothing method uses double (second-order) exponential smoothing to obtain its predicted heat value.
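As an illustration of double exponential smoothing, the sketch below uses Brown's linear formulation. The text does not spell out which variant or initialization is used, so Brown's form, the initialization of both smoothed series with the first observation, and the function name `double_exp_forecast` are all assumptions.

```python
def double_exp_forecast(series, alpha, d=1):
    """Forecast `d` periods ahead with Brown's double exponential smoothing.

    Assumption: both smoothed series start at the first observation.
    """
    s1 = s2 = series[0]
    for x in series[1:]:
        s1 = alpha * x + (1 - alpha) * s1   # first smoothing pass
        s2 = alpha * s1 + (1 - alpha) * s2  # second pass over the smoothed values
    level = 2 * s1 - s2
    trend = alpha / (1 - alpha) * (s1 - s2)
    return level + trend * d


# Hypothetical true heat values for periods 1 to 7 of one topic.
heat = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
print(double_exp_forecast(heat, alpha=0.8))
```

A smaller smoothing coefficient `alpha` reacts more slowly to recent values, which suits series that fluctuate little; a larger one follows recent changes more closely.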
Further, the back-propagation neural network is trained on sample data, repeatedly correcting the network weights and thresholds so that the error function descends along the negative gradient direction. Training continues until the error falls to a set threshold or a preset number of iterations is reached, which yields the weights between the input layer and the output layer. Finally, the true values of earlier periods are input into the trained network to obtain the predicted heat value.
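The training loop described here can be sketched in plain Python. The sketch is illustrative only: the 3-3-1 layout, ReLU units, learning rate, iteration cap, and error limit follow the values given later in the embodiment, but it swaps in a squared-error loss and plain per-sample gradient descent in place of the cross-entropy loss and Adam optimizer named there. The class name `TinyBP`, the helper `make_samples`, and the heat values are hypothetical.

```python
import random


def make_samples(series, window=3):
    """Slide over the series: `window` inputs followed by one target value."""
    return [(series[i:i + window], series[i + window])
            for i in range(len(series) - window)]


def relu(x):
    return x if x > 0 else 0.0


class TinyBP:
    """3-3-1 network with ReLU units, trained by plain gradient descent."""

    def __init__(self, n_in=3, n_hidden=3, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(0.0, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(0.0, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        z1 = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(self.w1, self.b1)]
        h = [relu(z) for z in z1]
        z2 = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2
        return z1, h, z2, relu(z2)

    def predict(self, x):
        return self.forward(x)[3]

    def train(self, samples, lr=0.01, max_epochs=1000, tol=1e-4):
        for _ in range(max_epochs):
            total = 0.0
            for x, y in samples:
                z1, h, z2, out = self.forward(x)
                err = out - y
                total += 0.5 * err * err
                d2 = err if z2 > 0 else 0.0  # gradient through the output ReLU
                for k in range(len(self.w2)):
                    # Hidden-unit gradient, computed before w2[k] is updated.
                    d1 = d2 * self.w2[k] if z1[k] > 0 else 0.0
                    self.w2[k] -= lr * d2 * h[k]
                    for i in range(len(x)):
                        self.w1[k][i] -= lr * d1 * x[i]
                    self.b1[k] -= lr * d1
                self.b2 -= lr * d2
            if total / len(samples) < tol:  # error within the allowed limit
                break


# Hypothetical, already normalized true heat values for periods 1 to 7.
heat = [0.2, 0.3, 0.35, 0.5, 0.55, 0.6, 0.7]
net = TinyBP()
net.train(make_samples(heat))
prediction = net.predict(heat[-3:])  # periods 5 to 7 predict period 8
print(prediction)
```

With seven periods and a window of three, `make_samples` yields four training samples per topic, each pairing three consecutive heat values with the value that follows them.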
Further, the entropy method determines each index weight from the entropy provided by the observed values of that index, and uses the entropy to judge the degree of dispersion of the predicted heat values.
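One common way to apply the entropy method to combined forecasting is sketched below. The text does not give the exact formulas, so everything here is an assumption: each method's past relative errors are normalized into a distribution, the entropy of that distribution measures how evenly the errors are spread, and a method whose errors are more dispersed receives a smaller weight. The function name `entropy_weights` and all numbers are hypothetical.

```python
import math


def entropy_weights(errors_by_method):
    """Weights for combining forecasts, one weight per prediction method.

    errors_by_method: one list of positive relative errors per method,
    all lists of the same length (at least 2). Assumed formulation.
    """
    m = len(errors_by_method)
    n = len(errors_by_method[0])
    variation = []
    for errs in errors_by_method:
        total = sum(errs)
        probs = [e / total for e in errs]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        variation.append(1 - entropy)  # degree of dispersion of the errors
    vsum = sum(variation)
    return [(1 - v / vsum) / (m - 1) for v in variation]


# Hypothetical relative errors of the two methods over three past periods.
smoothing_errors = [0.10, 0.12, 0.11]
bp_errors = [0.02, 0.30, 0.05]
weights = entropy_weights([smoothing_errors, bp_errors])

# Hypothetical period-8 predictions of the two methods.
combined = weights[0] * 42.0 + weights[1] * 40.0
print(weights, combined)
```

The weights always sum to 1, so the combined value stays between the two individual predictions.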
Further, in step 7, CPU resources are allocated to the crawlers using multiple processes: the crawler corresponding to each subject term in the updated subject term database is given a weight according to the predicted heat value, and the server adjusts the CPU upper limit and the number of started processes of each crawler according to that weight.
Compared with the prior art, the invention has the following beneficial effects: it integrates agglomerative hierarchical clustering, exponential smoothing, a back-propagation neural network, and the entropy method, and provides a method for calculating the predicted heat value of a topic. By predicting the future heat of each topic and scheduling the crawlers corresponding to high-heat topics, it acquires more data on those topics, so that high-heat topics are tracked preferentially and effectively under limited resources and their heat is grasped more comprehensively and promptly.
Drawings
FIG. 1 is a flow chart of the present embodiment;
FIG. 2 is a graph showing the distribution of true heat values in the present embodiment;
FIG. 3 is a fitted curve of true heat;
FIG. 4 shows the true heat trend of each topic over periods 1 to 7 in the present embodiment;
FIG. 5 is a graph showing the data volume and heat value of each topic in period 8;
FIG. 6 is a graph of the heat of the first category of topics over time;
FIG. 7 is a graph of the heat of the second category of topics versus the number of days;
FIG. 8 is a graph of the heat of the third category of topics over time;
FIG. 9 is a graph of the heat of the fourth category of topics versus the number of days;
FIG. 10 is a graph of the heat of the fifth category of topics versus the number of days;
FIG. 11 is a graph of the heat of the sixth category of topics versus the number of days;
FIG. 12 is a graph of the heat of the seventh category of topics versus the number of days;
FIG. 13 is a graph of the heat of the eighth category of topics versus the number of days;
FIG. 14 is a graph of the data volume of each category of topics versus the number of days.
Detailed Description
The invention is further described below with reference to the accompanying drawings. Embodiments of the present invention include, but are not limited to, the following examples.
The system comprises a first acquisition module, a data preprocessing module, a vector space model, a clustering module, a second acquisition module, a real heat index weight calculation module, a real heat value calculation module, an updating module, a predicted heat value module and a CPU distribution module.
Further, the first acquisition module uses the crawler to acquire text data from the data source according to the keywords set by the user; in this embodiment, the data source is a microblog platform.
Further, the data preprocessing module preprocesses the data acquired by the acquisition module; preprocessing comprises data cleaning, Chinese word segmentation, and stop-word removal. Data cleaning mainly uses a regular expression to remove all characters other than Chinese characters; symbols such as @ and #, as well as emoticons, are among the objects to be cleaned. Chinese word segmentation splits a Chinese character sequence that contains no spaces into meaningful words: an efficient word-graph scan based on a prefix dictionary generates a directed acyclic graph of all possible word formations of the Chinese characters in a sentence, and dynamic programming then finds the maximum-probability path, which gives the best segmentation, so that each acquired text is split into words. Stop-word removal deletes words that cannot represent the features of the text by means of a constructed stop-word list.
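The segmentation idea described above (dictionary lookup, a graph of all candidate words, then a maximum-probability path found by dynamic programming) can be sketched as follows. The dictionary `FREQ` and its counts are hypothetical toy values, the function names are illustrative, and a production segmenter would use a large frequency dictionary and handle unknown words more carefully.

```python
import math

# Toy word-frequency dictionary (hypothetical counts, for illustration only).
FREQ = {"研究": 5, "研究生": 3, "生命": 4, "命": 1, "生": 1, "的": 8, "起源": 4}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """For each start index, list every end index that closes a dictionary word."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i, len(sentence)) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]  # an unknown character stands alone
    return dag


def segment(sentence):
    """Pick the maximum-probability path through the graph, right to left."""
    n = len(sentence)
    dag = build_dag(sentence)
    # best[i] = (log-probability of the best split of sentence[i:],
    #            end index of the first word in that split)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1) / TOTAL) + best[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


print(segment("研究生命的起源"))
```

With these toy counts, the path through "研究" and "生命" scores higher than the path that starts with the longer word "研究生", which shows why the choice is made over whole paths instead of greedily taking the longest match.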
Further, the vector space model is a commonly used text representation model. The words of any text data can be segmented by a word segmentation technique, and each word is taken as a component in segmentation order, so that the text data is represented as a word vector. More generally, let the text data set be T = {t_1, t_2, …, t_n}, where t_i (i = 1, 2, …, n) is one text data. Through processes such as deduplication, word-frequency thresholding, and word removal, the word vectors of all text data can be combined into a new word vector, called the feature word library (or feature word space) of the text data set T. Based on whether, or how many times, its words appear in the feature word space, any text data t_i can be represented as a vector in the feature word space, which is called the vector space model of the text data. Given the feature word library K = (k_1, k_2, …, k_m), where each k_j (j = 1, 2, …, m) is a feature word, the vector space model of the text data in this embodiment is expressed as
t_i = (w_i1, w_i2, …, w_im),    (1)
where each w_ij is calculated as follows:
Figure GDA0003217904770000051
where r_ij is the number of times feature word k_j occurs in text data t_i, k_j ∈ t_i means that feature word k_j appears in text data t_i, and |T| is the number of elements in the set T, i.e., |T| = n. In this embodiment, for any feature word k_j,
Figure GDA0003217904770000052
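The construction of the feature word space and of the vectors of formula (1) can be sketched as below. Formula (2) is available here only as an image, so the weight used in the sketch is an assumption: the occurrence count r_ij multiplied by the logarithm of |T| divided by the number of texts containing k_j, which is one common TF-IDF form built from the quantities defined above. The function name `build_vectors` and the sample texts are hypothetical.

```python
import math


def build_vectors(texts):
    """texts: list of word lists. Returns (feature_words, weight_vectors).

    Assumed weight: r_ij * log(|T| / number of texts containing k_j).
    """
    features = sorted({w for t in texts for w in t})
    n = len(texts)
    doc_freq = {k: sum(1 for t in texts if k in t) for k in features}
    vectors = [
        [t.count(k) * math.log(n / doc_freq[k]) for k in features]
        for t in texts
    ]
    return features, vectors


# Hypothetical segmented texts.
texts = [["天气", "好"], ["天气", "冷", "冷"], ["股票", "涨"]]
features, vectors = build_vectors(texts)
for text, vec in zip(texts, vectors):
    print(text, [round(w, 3) for w in vec])
```

Under this assumed weighting, a word that appears in every text gets weight 0, and a word repeated inside a single text gets a proportionally larger weight.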
Further, based on the vector space model of the text data, the clustering module performs cluster analysis on the text data set T and divides it into different clusters. Cluster analysis rests on a similarity measure between the clustered objects. In this embodiment, the similarity measure between two text data is the Euclidean distance, calculated as follows:
Figure GDA0003217904770000053
where w_ij is calculated according to formula (2), and t_i and t_l are two text data in the set T. d(t_i, t_l) lies in the range 0 to 1; the smaller d(t_i, t_l) is, the more dissimilar t_i and t_l are, and the larger it is, the more similar they are. Based on the cosine similarity measure between text data, this embodiment adopts agglomerative hierarchical clustering: according to the cosine similarity between text data and a set similarity threshold, the text data with the highest similarity are merged first, and merging continues until all text data have been merged into clusters.
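A minimal sketch of threshold-stopped agglomerative clustering follows. It assumes cosine similarity between vectors and average linkage between clusters; the text names both Euclidean distance and cosine similarity and does not state a linkage rule, so these choices, the function names, and the sample vectors are assumptions.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def agglomerate(vectors, threshold):
    """Merge the two most similar clusters until no pair reaches `threshold`.

    Cluster similarity is the average pairwise cosine (average linkage, assumed).
    Returns clusters as lists of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def sim(a, b):
        pairs = [cosine(vectors[i], vectors[j]) for i in a for j in b]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        s, x, y = max(
            (sim(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if s < threshold:  # remaining clusters are too dissimilar to merge
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical two-dimensional weight vectors for four texts.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(agglomerate(vectors, threshold=0.8))
```

Each surviving cluster corresponds to one topic; lowering the threshold lets merging continue further and produces fewer, broader topics.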
Further, based on the result of the clustering module, the subject term extraction module treats each cluster as a topic. The cluster set is T′ = {T_1, T_2, …, T_p}, and each cluster T_q (q = 1, 2, …, p) is a topic. According to the vector space model of formula (1), topic T_q can also be represented as the following matrix:
Figure GDA0003217904770000054
According to formula (2), w_ij (i = 1, 2, …, n; j = 1, 2, …, m) can also be understood as the weight of feature word k_j in text data t_i. Accordingly, the weight W_qj of feature word k_j in topic T_q is calculated as follows:
Figure GDA0003217904770000055
where |T_q| is the number of text data in topic T_q. Based on the weights of the feature words in topic T_q, a weight threshold β_q is set, the feature words in {k_1, k_2, …, k_m} whose weight is less than β_q are excluded, and the remaining feature words are taken as the subject terms of topic T_q and stored in the subject term database, i.e.,
Figure GDA0003217904770000061
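The threshold-based selection of subject terms can be sketched as follows. The formula for W_qj is available here only as an image, so the sketch assumes it is the average of w_ij over the |T_q| texts of the topic; the function name `topic_words`, the feature words, and the weights are hypothetical.

```python
def topic_words(features, topic_vectors, beta):
    """Keep the feature words whose topic weight is not below `beta`.

    Assumption: the topic weight of a feature word is the mean of its
    weights over the texts that belong to the topic.
    """
    size = len(topic_vectors)
    weights = [sum(v[j] for v in topic_vectors) / size
               for j in range(len(features))]
    return [k for k, w in zip(features, weights) if w >= beta]


# Hypothetical feature words and weight vectors of the two texts in one topic.
features = ["疫苗", "接种", "天气"]
topic_vectors = [[0.8, 0.6, 0.0], [0.6, 0.2, 0.1]]
print(topic_words(features, topic_vectors, beta=0.3))
```

The returned words are what the second acquisition module would later use as search terms for this topic.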
Further, the second acquisition module extracts the subject terms of topic T_q from the subject term database and uses the crawler to acquire data from the data source according to them. Let the text data set acquired by the crawler according to the subject terms of topic T_q be T′_q = {t′_1, t′_2, …, t′_n′}, where n′ is the number of text data crawled according to the subject terms of T_q;
further, the parsing module parses the numbers of forwards, likes, and comments from each text data t′_i′ (i′ = 1, 2, …, n′) acquired by the second acquisition module;
further, the real heat index weight calculation module takes the numbers of forwards, likes, and comments as the indexes for calculating true heat and determines the index weights ω_i″ (i″ = 1, 2, 3);
further, in this embodiment, the analytic hierarchy process is used to determine the index weights;
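The arithmetic mean variant of the analytic hierarchy process, which the example later in this description applies, can be sketched as follows: each column of the pairwise judgment matrix is normalized to sum to 1, and each row is then averaged. The matrix below is a hypothetical reciprocal matrix chosen for illustration; it is not the judgment matrix J of the embodiment, and the function name `ahp_weights` is illustrative.

```python
def ahp_weights(matrix):
    """Arithmetic mean method: normalize each column, then average each row."""
    n = len(matrix)
    col_sums = [sum(row[j] for row in matrix) for j in range(n)]
    normalized = [[row[j] / col_sums[j] for j in range(n)] for row in matrix]
    return [sum(r) / n for r in normalized]


# Hypothetical pairwise comparisons of three indexes (NOT the patent's matrix J):
# entry [a][b] states how much more important index a is than index b.
judgment = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]
weights = ahp_weights(judgment)
print([round(w, 4) for w in weights])
```

Because every normalized column sums to 1, the resulting weights always sum to 1 as well.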
further, the real heat value calculation module calculates, from the text data acquired by the second acquisition module, the data parsed by the parsing module, and the weights obtained by the real heat index weight calculation module, the true heat value of topic T_q in period τ:
Figure GDA0003217904770000062
Let the numbers of forwards, likes, and comments of text data t′_i′ be b_i′1, b_i′2, and b_i′3, respectively, where t′_i′ (i′ = 1, 2, …, n′) is text data acquired by the crawler in period τ. The true heat value of t′_i′ is calculated by formula (7), and the true heat value of topic T_q in period τ,
Figure GDA0003217904770000063
is calculated by formula (8):
Figure GDA0003217904770000064
Figure GDA0003217904770000065
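The two-level calculation can be sketched as below. Formulas (7) and (8) are available here only as images, so the sketch assumes a weighted sum of the three counts for each text and a plain average over the texts of a topic, which matches the description of the real heat value calculation module. The weights are the approximate values reported in the example later in this description; the function names and counts are hypothetical.

```python
# Index weights reported in the embodiment's example, in the order
# forwards, likes, comments.
WEIGHTS = (0.1596, 0.1390, 0.7012)


def text_heat(forwards, likes, comments, weights=WEIGHTS):
    """True heat of one text: weighted sum of its three counts (assumed form)."""
    return weights[0] * forwards + weights[1] * likes + weights[2] * comments


def topic_heat(texts):
    """True heat of a topic in one period: mean heat of its crawled texts."""
    return sum(text_heat(*t) for t in texts) / len(texts)


# Hypothetical (forwards, likes, comments) of two texts crawled in one period.
crawled = [(10, 20, 5), (0, 0, 0)]
print(text_heat(10, 20, 5), topic_heat(crawled))
```

Averaging over the texts keeps topics with many low-engagement texts from outranking topics with fewer but more heavily discussed texts.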
Further, the updating module updates the subject term database after the text data acquired by the second acquisition module have been processed by the data preprocessing module and the vector space model.
Further, the predicted heat value module obtains the predicted heat value of each topic from the true heat values produced by the real heat value calculation module. For topic T_q, the correspondence between the period number τ and the true heat value
Figure GDA0003217904770000066
can be constructed as shown in Table 1 below:
TABLE 1
Figure GDA0003217904770000067
Figure GDA0003217904770000071
The distribution of the period number τ and the true heat value
Figure GDA0003217904770000072
is shown in FIG. 2. By constructing the distribution relation between the period number τ and the true heat value
Figure GDA0003217904770000073
the fitted true heat curve for τ = 7 periods is shown in FIG. 3, and an existing prediction method is used to predict the heat value of topic T_q in period τ + d:
Figure GDA0003217904770000074
Further, this embodiment uses the exponential smoothing method and the back-propagation neural network to calculate predicted heat values of the topic, and then weights the two results according to the entropy method to obtain the combined predicted heat value of topic T_q:
Figure GDA0003217904770000075
Further, the CPU distribution module assigns a weight to the crawler of each topic according to the combined predicted heat value, and the server adjusts the upper limit of CPU occupancy of the crawler corresponding to topic T_q according to that weight;
the crawlers are deployed on the same Linux server, and the CPU upper limit of the crawler corresponding to topic T_q is enforced as a constraint by a CPU-limiting command available on the Linux system. While a crawler stays below its specified CPU upper limit, no constraint is applied to it; if the crawler is about to exceed the limit, the server dynamically adjusts it so that its usage floats around the upper limit. The upper limit of CPU usage, in period τ + d, of the crawler responsible for topic T_q,
Figure GDA0003217904770000076
is calculated as follows:
Figure GDA0003217904770000077
In the formula,
Figure GDA0003217904770000078
is the CPU upper limit of the crawler corresponding to topic T_q in period τ + d, M is the percentage of CPU resources available when the server is idle, C′ is the percentage of CPU currently in use on the server,
Figure GDA0003217904770000079
is the combined predicted heat value of topic T_q in period τ + d, and p is the total number of topics. If the CPU consumed by the crawler corresponding to topic T_q cannot reach the upper limit, additional processes can be started so that the crawler makes full use of its CPU resources. The number of processes started in period τ + d by the crawler corresponding to topic T_q,
Figure GDA00032179047700000710
is given by the following formula:
Figure GDA00032179047700000711
In the formula,
Figure GDA00032179047700000712
is the percentage of CPU occupied by the crawler corresponding to topic T_q in period τ + d,
Figure GDA00032179047700000713
is the ratio of the remaining CPU resources of that crawler in period τ + d to the CPU resources it occupies, and
Figure GDA00032179047700000714
is the number of processes the crawler corresponding to topic T_q needs to start in period τ + d. In the formula, when
Figure GDA00032179047700000715
it is stipulated that
Figure GDA00032179047700000716
which ensures that at least one process is running for the crawler of each topic; when
Figure GDA00032179047700000717
it is stipulated that
Figure GDA00032179047700000718
which prevents problems such as insufficient memory caused by starting too many processes.
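The allocation logic can be sketched as follows. The two formulas are available here only as images, so the sketch rests on assumptions: the free CPU share (M minus C′) is split among the topics in proportion to their combined predicted heat values, and the process count is the whole number of processes that fit under a crawler's limit, clamped between 1 and a cap. The function names, the cap `max_procs`, and all numbers are hypothetical.

```python
import math


def cpu_limits(pred_heat, idle_capacity, in_use):
    """Split the free CPU share among topics in proportion to predicted heat.

    idle_capacity plays the role of M and in_use the role of C' (percentages).
    """
    free = idle_capacity - in_use
    total = sum(pred_heat)
    return [free * h / total for h in pred_heat]


def process_count(limit, per_process, max_procs=8):
    """Processes that fit under `limit`, clamped to the range [1, max_procs].

    `max_procs` is a hypothetical cap standing in for the upper bound that
    guards against running out of memory.
    """
    ratio = math.floor(limit / per_process)
    return max(1, min(max_procs, ratio))


# Hypothetical combined predicted heat values of three topics.
limits = cpu_limits([30, 10, 10], idle_capacity=90, in_use=10)
counts = [process_count(limit, per_process=5.0) for limit in limits]
print(limits, counts)
```

The lower clamp keeps every topic crawled by at least one process even when its predicted heat is very low, and the upper clamp bounds the number of processes started for the hottest topic.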
Further, once the current period number exceeds the predicted period number τ + d, steps 3 to 7 are repeated.
As shown in FIG. 1, the method for scheduling subject term search crawlers based on the combined prediction method comprises the following steps:
step 1, setting keywords, and acquiring text data from a data source with a crawler according to the keywords;
step 2, preprocessing the text data, converting the preprocessed text data into multidimensional vectors composed of feature word weights, and dividing the vectors into clusters, each cluster being called a topic; part of the feature words contained in each topic are stored in a database as subject terms to form a subject term database;
step 3, writing as many crawlers as there are topics, extracting the subject terms from the subject term database, acquiring data from the data source with the crawlers according to the subject terms, and establishing the true heat indexes of the crawled text data from the numbers of forwards, likes, and comments;
step 4, calculating the true heat value of each topic from the data and the true heat indexes obtained in step 3;
step 5, predicting the heat value of each topic in the next period with the combined prediction method;
step 6, processing the text data obtained in step 3 as in steps 1 to 2, and then updating the subject term database;
step 7, assigning a weight to the crawler corresponding to each topic in the updated subject term database according to the predicted heat value, the server adjusting the upper limit of CPU occupancy and the number of started processes of each crawler according to that weight, and repeating steps 3 to 7.
Examples
According to a keyword library initially set by the user, the crawlers corresponding to the topics crawl real-time text data from Sina Weibo and store it. 3,000 text data are extracted as training samples and undergo Chinese word segmentation and stop-word removal; the vector space is constructed with formula (1), the feature weights are computed with formula (2), hierarchical clustering is performed with formula (3), and the hierarchical clustering tree is cut at a threshold of 100. Stop-word removal uses the Harbin Institute of Technology stop-word list, and a dictionary of words that must not be split is added manually. The resulting topics are shown in Table 2:
TABLE 2
Figure GDA0003217904770000081
The table above contains eight topics. The crawlers crawl the corresponding text data on Weibo according to the subject words, and the forwarding count, like count and comment count of each piece of text data are parsed. During this process, the CPU usage upper limit of each crawler is set to one eighth of the server's remaining CPU percentage. In this embodiment, the weights of the forwarding count, like count and comment count are determined by the analytic hierarchy process, for which a judgment matrix J is first constructed:
Figure GDA0003217904770000091
Using the arithmetic mean method, the weights of the comment count, forwarding count and like count are approximately 0.7012, 0.1596 and 0.1390, respectively. Taking one day as a period and crawling for one week, the true heat value of each topic in each period (day) is obtained according to formulas (7) and (8). The heat trend of each topic over periods 1 to 7 is shown in Fig. 3, where the horizontal axis is the period number and the vertical axis is the true heat value. The exponential-smoothing predicted heat value of each topic for period 8 is then calculated by the double exponential smoothing method. Because the true heat of every topic other than the seventh fluctuates little, their smoothing coefficient should take a small value between 0 and 1; this embodiment uses 0.3 for them and 0.8 for the seventh topic.
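The period-8 forecast by double exponential smoothing can be sketched as follows. This is a minimal illustration that assumes Brown's linear (double) exponential smoothing with both smoothed series initialised to the first observation; the patent's own smoothing formulas are not reproduced in this passage, so the exact form and the initialisation are assumptions. The heat series is hypothetical; 0.3 is one of the two smoothing coefficients quoted in the text.

```python
def double_exponential_forecast(series, alpha, horizon=1):
    """One-step (or m-step) forecast with Brown's double exponential smoothing."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not series:
        raise ValueError("series must not be empty")
    # Assumption: both smoothed values start at the first observation.
    s1 = s2 = series[0]
    for x in series[1:]:
        s1 = alpha * x + (1 - alpha) * s1    # first smoothing
        s2 = alpha * s1 + (1 - alpha) * s2   # second smoothing
    level = 2 * s1 - s2
    trend = alpha / (1 - alpha) * (s1 - s2)
    return level + trend * horizon


# Hypothetical true heat values of one topic for periods 1 to 7.
heat_series = [120.0, 135.0, 128.0, 150.0, 160.0, 155.0, 170.0]
forecast_period_8 = double_exponential_forecast(heat_series, alpha=0.3)
```

A small coefficient such as 0.3 reacts slowly to recent changes, which suits a stable series; a large one such as 0.8 follows recent values closely, which suits a strongly fluctuating series.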
When the back propagation (BP) neural network is used to predict topic heat, the hidden layer and the output layer use the ReLU activation function, the loss function is the cross-entropy loss, and the optimizer is Adam. The network has 1 hidden layer with 3 nodes, 3 input-layer nodes and 1 output-layer node, and the learning rate is 0.01. The original true heat value sequences of topics one to eight are divided into groups of 4 true heat values, each group forming one sample: the last heat value of the group is the output and the remaining three values are the input. Each sample is learned in order to update the connection weights between the input layer and the hidden layer and between the hidden layer and the output layer. The learning process ends once the error falls to the tolerance of 0.0001 or the maximum of 1000 training iterations is reached. The trained network is then used to predict the heat value of each topic in period 8: the true heat values of periods 5 to 7 are the input, and the output is the predicted heat value of the topic in period 8. Finally, the two methods are combined by the entropy method, that is, the period-8 predicted heat value obtained by double exponential smoothing and the period-8 predicted heat value obtained by the BP neural network are weighted and summed. The resulting period-8 combined predicted heat values (the entropy-method predicted heat values, rounded) are shown in Table 3:
TABLE 3
Figure GDA0003217904770000092
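The way a seven-period heat sequence is turned into training samples for the BP network (three inputs, one output) can be sketched as below. The sketch assumes overlapping sliding windows, since the text does not say whether the groups of four overlap; the heat values and the function name `make_samples` are hypothetical. Network training itself is not shown.

```python
def make_samples(series, window=4):
    """Split a heat sequence into (inputs, target) pairs.

    Each group holds `window` consecutive values: the last one is the
    target and the preceding ones are the inputs. Overlapping windows
    are an assumption of this sketch.
    """
    samples = []
    for start in range(len(series) - window + 1):
        group = series[start:start + window]
        samples.append((group[:-1], group[-1]))
    return samples


# Hypothetical true heat values of one topic for periods 1 to 7.
heat_series = [120.0, 135.0, 128.0, 150.0, 160.0, 155.0, 170.0]

training_samples = make_samples(heat_series)

# Input for predicting period 8: the true heat values of periods 5 to 7.
prediction_input = heat_series[-3:]
```

With seven periods and groups of four, this yields four samples per topic (periods 1 to 4, 2 to 5, 3 to 6 and 4 to 7).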
The entropy values and weights obtained by the entropy method are shown in Table 4 below:
TABLE 4
Figure GDA0003217904770000101
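The entropy-based combination of the two forecasts can be sketched as follows. The sketch assumes the standard entropy weight method applied to the two columns of forecasts across topics (normalise each column to shares, compute its entropy, and weight each method by one minus its entropy); the patent's own entropy formulas are not reproduced in this passage, so this exact form is an assumption. All numbers and names are hypothetical.

```python
import math


def entropy_weights(columns):
    """Entropy weight method for several equally long columns of positive values.

    Requires at least two values per column. A column whose values are more
    dispersed has lower entropy and therefore receives a larger weight.
    """
    n = len(columns[0])
    divergences = []
    for column in columns:
        total = sum(column)
        shares = [value / total for value in column]
        entropy = -sum(p * math.log(p) for p in shares if p > 0) / math.log(n)
        # Clamp at zero to absorb floating-point round-off.
        divergences.append(max(0.0, 1.0 - entropy))
    spread = sum(divergences)
    if spread == 0:
        # All columns are uniform: fall back to equal weights.
        return [1.0 / len(columns)] * len(columns)
    return [d / spread for d in divergences]


def combine(forecast_a, forecast_b, weights):
    """Weighted sum of two forecasts, topic by topic."""
    return [weights[0] * a + weights[1] * b for a, b in zip(forecast_a, forecast_b)]


# Hypothetical period-8 forecasts for four topics from the two methods.
smoothing_forecast = [100.0, 200.0, 300.0, 400.0]
bp_forecast = [240.0, 250.0, 260.0, 250.0]

w = entropy_weights([smoothing_forecast, bp_forecast])
combined_forecast = combine(smoothing_forecast, bp_forecast, w)
```

Because the weights sum to 1, each combined value lies between the two individual forecasts for that topic.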
The server used in this embodiment has 16 CPUs, so up to 1600% CPU is available when the server is idle; other processes already occupy about 1100%. The CPU upper limit percentage (at least 1%) that the server allocates to each crawler in period 8 is calculated from the prediction results according to formula (18), and the number of processes started by each crawler in period 8 is obtained according to formula (19). The CPU upper limits and process counts are shown in Table 5 below:
TABLE 5
Figure GDA0003217904770000102
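The allocation described above can be sketched in Python. Formulas (18) and (19) are not reproduced in this passage, so the sketch makes two explicit assumptions: the free CPU (1600% minus about 1100%) is shared in proportion to predicted heat with a floor of 1%, and one process is started per started 100% of the cap. Both rules, the function names and the heat values are hypothetical.

```python
import math


def allocate_cpu(predicted_heat, total_cpu=1600.0, used_cpu=1100.0, min_share=1.0):
    """Split the free CPU percentage across topics in proportion to heat.

    Assumed rule, not the patent's formula (18). The 1% floor can push the
    sum of caps slightly above the free CPU when some heats are near zero.
    """
    free_cpu = total_cpu - used_cpu
    total_heat = sum(predicted_heat.values())
    if total_heat <= 0:
        equal = free_cpu / len(predicted_heat)
        return {topic: max(min_share, equal) for topic in predicted_heat}
    return {
        topic: max(min_share, free_cpu * heat / total_heat)
        for topic, heat in predicted_heat.items()
    }


def process_count(cap_percent):
    """Assumed rule, not formula (19): one process per started 100% of cap."""
    return max(1, math.ceil(cap_percent / 100.0))


# Hypothetical combined predicted heat values for period 8.
predicted = {"topic_1": 300.0, "topic_2": 100.0, "topic_3": 100.0, "topic_4": 0.0}
caps = allocate_cpu(predicted)
processes = {topic: process_count(cap) for topic, cap in caps.items()}
```

On Linux, a cap of this kind could be enforced with a mechanism such as control groups, and the process count could drive a process pool; the source does not state which mechanism the server uses.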
With the period-8 process counts of each subject word search crawler set as in the table above, data are crawled from Weibo according to the subject words contained in each topic, and the amount crawled by each topic's crawler is counted. The amounts crawled without scheduling and with scheduling according to the predicted heat value are compared in Fig. 4, where the abscissa is the topic category and the ordinate is the crawled data volume and the predicted heat value. For each topic, the first bar is the data volume crawled without scheduling, the second bar is the data volume crawled with scheduling, and the third bar is the heat value of the topic in period 8.
For the first, fifth, seventh and eighth topics, which have the higher heat in period 8, the data volumes obtained with scheduling are 3572, 4026, 2338 and 3274 respectively, while those obtained without scheduling are 762, 1285, 594 and 827 respectively. When acquiring data for the topics predicted to be hot in the next period, the crawlers in this embodiment therefore obtained about 380.9% of the unscheduled data volume (13210 versus 3468 items in total, roughly 3.8 times as much). The data volume obtained by each scheduled crawler is positively correlated with the predicted heat value: the higher the predicted heat value of a topic, the more CPU resources its crawler receives and the more data it obtains.
Once the current period passes period 8, the process is repeated: the data of each topic from the most recent seven periods are extracted, the subject words are updated, the true heat values are calculated, the heat of the next period (starting with period 9) is predicted, and the crawlers are scheduled. The data volume, predicted heat value and true heat value of each topic in periods 10 to 17 are obtained with this process; the relation between the true and predicted heat values of each topic and their change over time are shown in Figs. 6 to 13.
In Figs. 6 to 13 the abscissa is the period number and the ordinate is the heat value; the line marked with crosses is the predicted heat value and the line marked with dots is the true heat value of the corresponding period. The mean relative error of each topic over periods 9 to 17 is calculated with the following formula:
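The 380.9% figure can be checked directly from the eight data volumes quoted above. The short calculation below shows that it corresponds to the ratio of the totals, so the scheduled volume is about 3.8 times the unscheduled volume, that is, about 280.9% more.

```python
# Data volumes quoted in the text for topics 1, 5, 7 and 8 in period 8.
scheduled = [3572, 4026, 2338, 3274]
unscheduled = [762, 1285, 594, 827]

total_scheduled = sum(scheduled)        # 13210
total_unscheduled = sum(unscheduled)    # 3468

ratio_percent = 100 * total_scheduled / total_unscheduled
increase_percent = ratio_percent - 100

print(f"scheduled / unscheduled = {ratio_percent:.1f}%")   # 380.9%
print(f"relative increase       = {increase_percent:.1f}%")  # 280.9%
```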
Figure GDA0003217904770000103
wherein the symbol in Figure GDA0003217904770000111 is the predicted heat value and the symbol in Figure GDA0003217904770000112 is the true heat value.
The calculation results are shown in table 6 below:
TABLE 6
Figure GDA0003217904770000113
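The error measure used for Table 6 can be sketched as follows. The formula itself survives only as an image reference, so the sketch assumes the usual definition of mean relative error, the average over periods of |predicted - true| / true; the sample values and the function name are hypothetical.

```python
def mean_relative_error(predicted, actual):
    """Average of |predicted - actual| / |actual| over all periods.

    Assumes the usual definition of mean relative error; actual values
    must be non-zero.
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) / abs(a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical predicted and true heat values of one topic over three periods.
predicted_heat = [110.0, 90.0, 105.0]
true_heat = [100.0, 100.0, 100.0]
error = mean_relative_error(predicted_heat, true_heat)
```

A value close to 0 means the combined prediction tracks the true heat closely; multiplying by 100 expresses it as a percentage.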
As shown in Figs. 6 to 13 and the table above, the combined predicted heat value of each period is close to the true heat value, so scheduling the crawlers for each topic's feature words according to the combined predicted heat value is reasonable. The data volume of each topic changes with the period number as shown in Fig. 14, where the abscissa is the period number and the ordinate is the data volume crawled by the crawlers according to the subject words contained in each topic.
Take the fifth and seventh topics as examples. Figs. 10 and 12 show that the true heat of the fifth topic in each period is far higher than that of the seventh (about 20000 for the fifth and about 4000 for the seventh). Correspondingly, as shown in Fig. 14, the data volume obtained for the fifth topic under scheduling by predicted value is far higher than that for the seventh, which is consistent with the goal that high-heat topics should obtain more data. The combined prediction method thus allocates more resources to the subject word search crawlers of high-heat topics, so that these topics obtain more data and the aim of preferentially tracking high-heat topics is achieved.
The above is an embodiment of the present invention. The embodiment and its specific parameters serve only to describe the inventors' verification process clearly and are not intended to limit the scope of protection, which is defined by the claims; all equivalent structural changes made using the description and drawings of the invention fall within the scope of protection of the invention.

Claims (9)

1. A subject word search crawler scheduling system based on a combined prediction method, characterized by comprising:
a first acquisition module, configured to acquire data from a data source with subject word search crawlers according to keywords set by a user;
a data preprocessing module, configured to preprocess the data acquired by the first acquisition module;
a vector space model, configured to convert the preprocessed text data into multidimensional vectors composed of feature word weights;
a clustering module, configured to cluster the text data, each resulting cluster being taken as a topic;
a subject word extraction module, configured to extract the subject words of each cluster and store them in the database;
a second acquisition module, configured to extract the subject words from the database and acquire data from the data source with subject word search crawlers according to the extracted subject words;
an analysis module, configured to parse the data acquired by the second acquisition module to obtain the forwarding count, like count and comment count of each piece of text data;
a true heat index weight calculation module, configured to take the forwarding count, like count and comment count parsed by the analysis module as the true heat indexes of each piece of text data and to calculate the weight of each index;
an updating module, configured to process the data acquired by the second acquisition module through the data preprocessing module, the vector space model and the clustering module to obtain the feature words of each cluster, and to take some of the feature words contained in each cluster as subject words and update the subject words in the original database;
a true heat value calculation module, configured to calculate the true heat of each piece of text data from the forwarding count, like count and comment count parsed by the analysis module and the index weights obtained by the true heat index weight calculation module, and then, according to the topics obtained by the clustering module, to average the true heat of the text data contained in each topic, the resulting mean being taken as the true heat value of that topic;
a predicted heat value module, configured to predict the heat value of each subject word in the next period, the predicted heat value module deriving the predicted heat value of each topic from the true heat value of each topic obtained by the true heat value calculation module;
a CPU allocation module, configured to assign, according to the predicted heat values, a corresponding CPU occupancy upper limit to the subject word search crawler of each topic and to start a corresponding number of processes, wherein the CPU allocation module assigns a weight value to the crawler of each topic according to the combined predicted heat value, and the server adjusts the CPU occupancy upper limit of the crawler of each topic according to the weight values.
2. A subject word search crawler scheduling method based on a combined prediction method, characterized by comprising the following steps:
step 1, setting keywords, and acquiring data from a data source with subject word search crawlers according to the keywords;
step 2, preprocessing the data, converting the preprocessed text data into multidimensional vectors composed of feature word weights, dividing the multidimensional vectors into clusters, defining each cluster as a topic, and storing some of the feature words contained in each topic in a database as subject words;
step 3, extracting the subject words from the database, writing a corresponding number of subject word search crawlers according to the number of subject words to acquire topic data from the data source, parsing the forwarding count, like count and comment count from the crawled data as true heat indexes, and determining the weight of each index by the analytic hierarchy process;
step 4, calculating the true heat value of each piece of text data from its forwarding count, like count and comment count and the true heat index weights obtained in step 3, and then, according to the topics obtained in step 2, averaging the true heat of the text data contained in each topic, the resulting mean being taken as the true heat value of that topic;
step 5, fitting the true heat values of each topic obtained in step 4 to obtain a curve of the true heat value of each topic against the period number, and obtaining the predicted heat value of each topic in the next period by using a combined prediction method;
step 6, processing the data obtained in step 3 as in steps 1 to 2, extracting new subject words and updating the database;
step 7, updating the weight value of the corresponding subject word search crawler according to the predicted heat value of the subject word, the server adjusting the CPU occupancy upper limit of the subject word search crawler of each subject word according to the weight value, and repeating steps 3 to 7.
3. The method for scheduling a subject term search crawler based on a combined prediction method according to claim 2, wherein: step 2 further includes the following steps:
step 21, data cleaning, namely removing all characters other than Chinese characters from the data by using a regular expression;
step 22, Chinese word segmentation, namely segmenting each acquired data text into words;
step 23, stop word removal, namely removing the stop words from the words segmented in step 22;
step 24, converting the text data into multidimensional vectors composed of feature word weights by using a vector space model.
4. The method for scheduling a subject term search crawler based on a combined prediction method according to claim 2, wherein: step 2 further includes the following steps:
using a cluster analysis method, treating each data item as a separate cluster, merging the data with the highest similarity according to a similarity measure, and continuing to merge clusters in descending order of similarity, the similarity between clusters decreasing as clusters are merged, until a similarity threshold is reached; each resulting cluster is called a topic, and the feature words contained in each topic are stored in a database as subject words to form a subject word database.
5. The method for scheduling a subject term search crawler based on a combined prediction method according to claim 2, wherein: in the step 5, the combined prediction algorithm includes an exponential smoothing method, a back propagation neural network and an entropy value method, the prediction heat value of the subject is calculated by using the exponential smoothing method and the back propagation neural network respectively, and then a weight is given to the calculation result of the prediction heat value of the subject by using the exponential smoothing method and the back propagation neural network according to the entropy value method, so as to obtain the combined prediction heat value of the subject.
6. The method for scheduling a subject term search crawler based on a combined prediction method according to claim 5, wherein: the exponential smoothing method is a double exponential smoothing method, which is used to obtain a predicted heat value.
7. The method for scheduling a subject term search crawler based on a combined prediction method according to claim 5, wherein: through training on sample data, the back propagation neural network continuously corrects the network weights and thresholds so that the error function descends along the negative gradient direction, until the error function falls to the threshold or the preset number of iterations is reached, whereby the weights of the input layer and the output layer are obtained; finally, the true values of the preceding periods are input into the trained back propagation neural network to obtain the predicted heat value.
8. The method for scheduling a subject term search crawler based on a combined prediction method according to claim 5, wherein: the entropy method determines index weight according to the entropy provided by each index observation value, and obtains the discrete degree of two groups of prediction heat values according to the entropy, so that corresponding weight is given to the two prediction heat values and the two prediction heat values are summed.
9. The method for scheduling a subject term search crawler based on a combined prediction method according to claim 5, wherein: in the step 7, a multi-process method is adopted to distribute the CPU of the crawler, the updated weight value of the crawler corresponding to each subject word in the subject word database is given according to the predicted heat value, and the server adjusts the CPU upper limit value and the process starting number of the crawler corresponding to each subject word according to the weight value.
CN202110701204.8A 2021-06-23 2021-06-23 Method and system for scheduling subject term search crawlers based on combined prediction method Active CN113536085B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110701204.8A CN113536085B (en) 2021-06-23 2021-06-23 Method and system for scheduling subject term search crawlers based on combined prediction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110701204.8A CN113536085B (en) 2021-06-23 2021-06-23 Method and system for scheduling subject term search crawlers based on combined prediction method

Publications (2)

Publication Number Publication Date
CN113536085A CN113536085A (en) 2021-10-22
CN113536085B true CN113536085B (en) 2023-05-19

Family

ID=78096566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110701204.8A Active CN113536085B (en) 2021-06-23 2021-06-23 Method and system for scheduling subject term search crawlers based on combined prediction method

Country Status (1)

Country Link
CN (1) CN113536085B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329179B (en) * 2022-10-14 2023-04-28 卡奥斯工业智能研究院(青岛)有限公司 Data acquisition resource amount control method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200674A (en) * 2020-10-14 2021-01-08 上海谦璞投资管理有限公司 Stock market emotion index intelligent calculation information system
CN112650848A (en) * 2020-12-30 2021-04-13 交控科技股份有限公司 Urban railway public opinion information analysis method based on text semantic related passenger evaluation

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090132493A1 (en) * 2007-08-10 2009-05-21 Scott Decker Method for retrieving and editing HTML documents
US20090319484A1 (en) * 2008-06-23 2009-12-24 Nadav Golbandi Using Web Feed Information in Information Retrieval
CN104933164B (en) * 2015-06-26 2018-10-09 华南理工大学 In internet mass data name entity between relationship extracting method and its system
US11093557B2 (en) * 2016-08-29 2021-08-17 Zoominfo Apollo Llc Keyword and business tag extraction
CN106777043A (en) * 2016-12-09 2017-05-31 宁波大学 A kind of academic resources acquisition methods based on LDA
CN106815297B (en) * 2016-12-09 2020-04-10 宁波大学 Academic resource recommendation service system and method
CN106709052B (en) * 2017-01-06 2020-09-04 电子科技大学 Topic web crawler design method based on keywords
CN107885793A (en) * 2017-10-20 2018-04-06 江苏大学 A kind of hot microblog topic analyzing and predicting method and system
CN111324797B (en) * 2020-02-20 2023-08-11 民生科技有限责任公司 Method and device for precisely acquiring data at high speed

Also Published As

Publication number Publication date
CN113536085A (en) 2021-10-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant