CN111260145A

CN111260145A - Method and system for predicting reading amount of WeChat public number article

Info

Publication number: CN111260145A
Application number: CN202010065180.7A
Authority: CN
Inventors: 窦志成; 文继荣
Original assignee: Renmin University of China
Current assignee: Renmin University of China
Priority date: 2020-01-20
Filing date: 2020-01-20
Publication date: 2020-06-09

Abstract

The invention relates to a method and a system for predicting the reading amount of WeChat public number articles. The method comprises the following steps: 1) training an XGBoost classification model and an XGBoost regression model, respectively, on a WeChat data set; 2) acquiring the article characteristics of an article to be detected; 3) using the trained XGBoost classification model to judge, from the article characteristics of the article to be detected, whether the article is a super article; if so, the predicted reading amount of the article is more than 100,000; if not, proceeding to step 4); 4) using the trained XGBoost regression model to determine the predicted reading amount of the article to be detected from its article characteristics. The method can reduce the time an author spends revising an article, improve the working efficiency of authors and related staff, help articles obtain a higher reading amount, and can be widely applied in the field of data prediction.

Description

Method and system for predicting reading amount of WeChat public number article

Technical Field

The invention relates to a prediction method, and in particular to a method and a system for predicting the reading amount of WeChat public number articles.

Background

Since the Web 2.0 era, research on the popularity of specific kinds of online content has increased. This research mainly covers online news, online videos, and content posted by users on social platforms. For online news, existing work generally uses the number of comments as the popularity measure and divides the prediction task into two stages: first judging whether a news item will receive any comments, and then qualitatively predicting the number of comments. To further estimate the number of comments, the number of comments observed shortly after a news item is released is used to predict the distribution of the total number of comments it eventually receives. For online video, most work uses the play count as the measure and predicts the play count of the current video from historical play-count information. Other work focuses on content posted by users on social platforms such as Facebook and Twitter, and predicts the attention the posted content will receive from the friend relationships on the platform and the network structure of the social network. These prior efforts have achieved certain results.

However, existing popularity prediction methods mainly target web news, videos, and user-posted content on social platforms, and cannot be applied to predicting the reading amount of WeChat public number articles, for two main reasons: 1) the reading amount of a WeChat public number article should be predicted before the article is published, whereas almost all current methods predict after publication and rely on information observed after publication; 2) in the available data, the follow relationships between WeChat users and public numbers cannot be obtained and the social friend relationships of users are unknown, so no social network can be constructed for predicting the reading amount. Therefore, a method is needed that predicts the reading amount before an article is published, based only on limited information.

Disclosure of Invention

In view of the above problems, it is an object of the present invention to provide a method and a system for predicting the reading amount of a WeChat public number article before the article is published.

In order to achieve this purpose, the invention adopts the following technical scheme: a method for predicting the reading amount of WeChat public number articles, comprising the following steps: 1) training an XGBoost classification model and an XGBoost regression model, respectively, on a WeChat data set; 2) acquiring the article characteristics of an article to be detected; 3) using the trained XGBoost classification model to judge, from the article characteristics of the article to be detected, whether the article is a super article; if so, the predicted reading amount of the article is more than 100,000; if not, proceeding to step 4); 4) using the trained XGBoost regression model to determine the predicted reading amount of the article to be detected from its article characteristics.

Further, the specific process of step 1) is as follows: 1.1) dividing the WeChat data set into a training set, a verification set and a test set according to the release time of the articles, wherein the sets do not overlap; 1.2) determining whether each WeChat article in the WeChat data set is a positive or a negative sample, wherein a WeChat article that is a super article is represented as a positive sample, and a WeChat article that is a non-super article is represented as a negative sample; 1.3) training an XGBoost classification model on the WeChat data set; 1.4) training an XGBoost regression model on the WeChat data set.

Further, the specific process of step 1.3) is as follows: 1.3.1) constructing an XGBoost classification model, wherein the evaluation indexes adopted for the classification task comprise accuracy, precision, recall and F1 score; 1.3.2) training the XGBoost classification model using the WeChat articles in the training set represented as positive samples and a portion of the WeChat articles represented as negative samples; 1.3.3) adjusting the parameters of the XGBoost classification model on the verification set, and testing the XGBoost classification model on the test set to obtain the trained XGBoost classification model.

Further, the specific process of step 1.4) is as follows: 1.4.1) constructing an XGBoost regression model, wherein the evaluation indexes of the XGBoost regression model comprise the mean absolute error MAE, the root mean square error RMSE and the coefficient of determination $R^2$:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

$$R^2 = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\mathrm{Variance}}, \qquad \mathrm{Variance} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$

wherein $y_i$ represents the target value of the ith WeChat article; $\hat{y}_i$ represents the predicted value of the ith WeChat article; $n$ represents the number of WeChat articles; $\bar{y}$ represents the average of the target values; and $\mathrm{Variance}$ represents the variance of the target values of all WeChat articles; 1.4.2) training the XGBoost regression model by taking the article characteristics of the WeChat articles in the training set as samples and the reading amounts of the WeChat articles as labels; 1.4.3) adjusting the parameters of the XGBoost regression model on the verification set, and testing the XGBoost regression model on the test set to obtain the trained XGBoost regression model.

Further, the article characteristics include historical information characteristics, which include the historical publishing frequency and the historical reading amount of the public number to which the article to be detected belongs, wherein: the historical publishing frequency is the total number of articles published by public number o within time t before article a; the historical reading amount is the total, mean, variance and median of the reading amounts obtained by public number o within time t.

Further, the article features comprise title features, which comprise the title basic composition, the emotional attribute and the title entities, wherein: the title basic composition comprises the length, the number of words and the number of digits of the article title; the emotional attribute is obtained by classifying the emotion of the article title with an emotion classification model and is one of positive, negative and neutral; the title entities are the place names, person names and organization names appearing in the article title.

Further, the article features include text features, which include the text basic composition, text entities, composition elements, average paragraph length, and the topic to which the article belongs, wherein: the text basic composition comprises the length, the number of words and the number of digits of the article text; the text entities are the place names, person names and organization names appearing in the article text; the composition elements are the numbers of paragraphs, pictures, web page links, and audio and video items in the article text; the average paragraph length is the average number of words per paragraph in the article text; the topic to which the article belongs is the topic category obtained by classifying the article text with a classification model.

Further, the article features include "title party" (clickbait) features, which include whether the title is ambiguous, the punctuation marks, the number of interrogative words, the number of referring words, the number of degree adverbs, and the number of emotional words, wherein: whether the title is ambiguous is whether an ambiguous pronoun exists in the article title; the punctuation marks are the numbers of "?" and "!" in the article title; the numbers of interrogative words, referring words, degree adverbs and emotional words are the numbers of such words appearing in the article title.

Further, the article characteristics include time characteristics, which include the article release time, the time reading amount and the capture interval, wherein: the article release time comprises the month, day, hour and day of the week on which the article is released; the time reading amount is the mean and variance of the reading amounts of articles released in the same hour and on the same day of the week as the article; the capture interval is the time interval between the release time of the article and the time at which its reading amount is captured.

A system for predicting the reading amount of WeChat public number articles, the system comprising: a model training module for training an XGBoost classification model and an XGBoost regression model, respectively, on the WeChat data set; a data acquisition module for acquiring the article characteristics of the article to be detected; a super article prediction module for judging, with the trained XGBoost classification model and from the article characteristics of the article to be detected, whether the article is a super article; and a reading amount prediction module for determining, with the trained XGBoost regression model and from the article characteristics of the article to be detected, the predicted reading amount of the article.

Due to the adoption of the technical scheme, the invention has the following advantages:

1. Using the XGBoost models and the article characteristics of the article to be detected, the method can obtain a predicted reading amount before the article is released and thereby give the author a basis for revising it, so that the author works more efficiently and the article can obtain a higher reading amount.

2. The method divides the prediction of the reading amount into two parts: predicting whether an article is a super article, and accurately predicting the reading amount of a non-super article. An XGBoost classification model is used for the former and an XGBoost regression model for the latter. Based on the acquired article characteristics, the method analyzes the structure, content and other aspects of the article to be detected in detail and determines its predicted reading amount. It can thus give the author guidance for revising the article, improve the working efficiency of authors and related staff, help articles obtain a higher reading amount, and be widely applied in the field of data prediction.

Drawings

Fig. 1 is a schematic diagram of the basic information of the WeChat data set obtained in the present invention, wherein Fig. 1(a) shows the title information of the data set, the abscissa being the title length and the ordinate the number of articles, and Fig. 1(b) shows the body text information of the data set, the abscissa being the text length and the ordinate the number of articles;

Fig. 2 is a bar chart of the number of articles published per month by the public numbers in the WeChat data set, wherein the abscissa is the number of articles and the ordinate is the number of public numbers;

Fig. 3 is a statistical diagram of the number and the reading amount of the WeChat articles released in the WeChat data set, wherein Fig. 3(a) is the percentage of WeChat articles released in each hour of the day, the abscissa being the hour and the ordinate the percentage; Fig. 3(b) is the share of WeChat articles released on each day of the week, the abscissa being the day of the week and the ordinate the percentage; Fig. 3(c) is the average reading amount of WeChat articles released in each hour of the day, the abscissa being the hour and the ordinate the average reading amount; and Fig. 3(d) is the average reading amount of WeChat articles released on each day of the week, the abscissa being the day of the week and the ordinate the average reading amount;

Fig. 4 is a graph of the reading amount growth rate and the average daily reading amount after the articles in the WeChat data set are published.

Detailed Description

The present invention is described in detail below with reference to the attached drawings. It is to be understood, however, that the drawings are provided solely for the purposes of promoting an understanding of the invention and that they are not to be construed as limiting the invention.

The method for predicting the reading amount of WeChat public number articles provided by the invention involves WeChat public numbers and reading amount prediction. Both are introduced below so that the method is clear to those skilled in the art.

WeChat is one of the social platforms with the largest number of users in China. It provides two types of accounts, personal accounts and public numbers; a public number is oriented to a broad audience, and one of its main functions is to publish articles. An individual user can follow a public number, receive and read the articles it publishes, and forward those articles through a personal account to share them with more individual users. Under this mechanism, pushing articles to users through public numbers has become an important channel of information dissemination, and obtaining a high reading amount has become a goal pursued by public numbers. To determine reasonable features for predicting the reading amount of an article, the invention obtains a WeChat data set and observes and analyzes it as follows:

1. Basic information

22,554,445 articles published by 1,692,012 public numbers in one month (February 10, 2018 to March 10, 2018) are acquired as the WeChat data set, with a total reading amount of 26,460,684. Basic information such as the title length and body length of the articles is counted, and the results are shown in Figs. 1(a) and (b). The body text of some articles contains only pictures or other non-text information.

2. Public number information

The publishing frequency of the public numbers in the acquired WeChat data set differs markedly: the most active public number published 695 articles in the month (about 23 per day), while the least active public numbers published only one article in the month. The overall frequency distribution is shown in Fig. 2. Most public numbers (72.32%) published about 10 articles in the month, and 11.81% of public numbers published more than 30 (more than one per day). The invention considers that the activity (publishing frequency) of a public number influences the reading amount of its articles.

3. Time information

The invention observes that the number of articles published by public numbers differs across the hours of the day. The specific distribution is shown in Figs. 3(a) to (d) and has several notable characteristics: a) there are several publishing peaks during the day, at 11 AM, 5 PM and 8 PM, which coincide with typical off-work hours; the invention therefore speculates that public numbers choose these times to publish in order to increase the reading amount of their articles; b) the number of articles published varies by day of the week, peaking on Saturdays. Based on these observations, the invention uses the mean and variance of the reading amount at different times as part of the features for predicting the reading amount.

4. Reading amount information

The reading amount of a WeChat article keeps growing dynamically over time. To study how it grows, 1,600 public number articles published on July 4, 2018 are selected, and their reading amounts are captured every 24 hours from the time of publication. The changes in the captured reading amount are shown in Fig. 4: both the growth rate and the absolute daily increase of the reading amount gradually decline, and after one week the daily increase is almost always below 10.

Articles whose reading amount exceeds 100,000 are regarded as super articles. Because their reading amount is displayed only as "100,000+", the exact value cannot be obtained, and directly predicting the reading amount cannot be treated as a pure regression problem. The invention therefore divides the prediction task into two consecutive sub-tasks: predicting whether an article is a super article; and, if it is not, predicting the specific reading amount it is likely to obtain.

Example one

Based on the above description, the method for predicting the reading amount of the WeChat public articles provided by the invention comprises the following steps:

1) training an XGBoost classification model and an XGBoost regression model, respectively, on a WeChat data set, specifically:

1.1) dividing the WeChat data set into a training set, a verification set and a test set according to the release time of the articles, wherein the training set is used for training the XGBoost models, the verification set is used for adjusting the parameters of the XGBoost models (such as max_depth, learning_rate and n_estimators), and the test set is used for testing the XGBoost models.
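As a non-limiting illustration, the chronological division of step 1.1) might be sketched in Python as follows. The field name `released` and the two cut-off dates are assumptions made for this example only; the description does not state where the boundaries between the three sets fall.

```python
from datetime import datetime

def split_by_release_time(articles, train_end, valid_end):
    """Split articles chronologically into non-overlapping train/verification/test sets."""
    train, valid, test = [], [], []
    for article in articles:
        released = article["released"]
        if released < train_end:
            train.append(article)      # earliest articles: used for training
        elif released < valid_end:
            valid.append(article)      # middle period: used for parameter adjustment
        else:
            test.append(article)       # latest articles: used for testing
    return train, valid, test

# Hypothetical articles and cut-off dates, chosen only to exercise the function.
articles = [
    {"id": 1, "released": datetime(2018, 2, 12)},
    {"id": 2, "released": datetime(2018, 2, 28)},
    {"id": 3, "released": datetime(2018, 3, 8)},
]
train, valid, test = split_by_release_time(
    articles, train_end=datetime(2018, 2, 24), valid_end=datetime(2018, 3, 3)
)
```

Because every article falls into exactly one branch, the three sets cannot overlap, and splitting by time keeps later articles out of the training data.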

1.2) determining the label (positive or negative sample) of each WeChat article in the WeChat data set, wherein a super article is an article whose reading amount exceeds 100,000; a WeChat article that is a super article is represented as a positive sample, and a WeChat article that is a non-super article is represented as a negative sample.

1.3) training an XGBoost classification model on the WeChat data set:

1.3.1) constructing an XGBoost classification model, wherein the evaluation indexes adopted for the classification task comprise accuracy, precision, recall and F1 score. The construction of the XGBoost classification model is a method disclosed in the prior art, and the specific process is not repeated here.

1.3.2) training the XGBoost classification model using the WeChat articles in the training set represented as positive samples and a portion of the WeChat articles represented as negative samples, to obtain a super article classification model. Because non-super articles far outnumber super articles in the WeChat data set, only a portion of the non-super articles are selected and labeled as negative samples, so that the ratio of negative samples to positive samples used for model training is 1:1.

1.3.3) adjusting the parameters of the super article classification model on the verification set, and testing the super article classification model on the test set to obtain the trained super article classification model.
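The 1:1 sampling of step 1.3.2) and the four evaluation indexes of step 1.3.1) might be sketched as follows. This is an illustration under stated assumptions, not the implementation used by the invention: the field name `reads`, the random choice of negatives and the fixed seed are all hypothetical, since the description does not say how the retained negative samples are chosen.

```python
import random

SUPER_THRESHOLD = 100_000  # reads above this mark an article as "super" (positive)

def balanced_training_samples(articles, seed=0):
    """Keep all positives and an equally sized random subset of negatives (1:1)."""
    positives = [a for a in articles if a["reads"] > SUPER_THRESHOLD]
    negatives = [a for a in articles if a["reads"] <= SUPER_THRESHOLD]
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    kept = rng.sample(negatives, k=min(len(positives), len(negatives)))
    return positives, kept

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = super article)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical reading amounts, used only to exercise the two functions.
articles = [{"reads": r} for r in (150_000, 200, 3_000, 120_000, 50, 999)]
positives, negatives = balanced_training_samples(articles)
metrics = classification_metrics([1, 1, 0, 0], [1, 0, 0, 1])
```

Balancing the classes in this way keeps the classifier from simply predicting "non-super" for every article, which would otherwise score a high accuracy on the heavily imbalanced data.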

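1.4) training an XGBoost regression model on the WeChat data set:

1.4.1) constructing an XGBoost regression model, wherein the evaluation indexes of the XGBoost regression model comprise the mean absolute error MAE, the root mean square error RMSE and the coefficient of determination $R^2$. The construction of the XGBoost regression model is a method disclosed in the prior art, and the specific process is not described herein.

The mean absolute error MAE, the root mean square error RMSE and the coefficient of determination $R^2$ are respectively:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

$$R^2 = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\mathrm{Variance}}, \qquad \mathrm{Variance} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$

wherein $y_i$ represents the target value of the ith WeChat article; $\hat{y}_i$ represents the predicted value of the ith WeChat article; $n$ represents the number of WeChat articles; $\bar{y}$ represents the average of the target values; and $\mathrm{Variance}$ represents the variance of the target values of all WeChat articles.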
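For illustration, the three regression indexes can be computed as in the following Python sketch. It assumes the standard textbook definitions of MAE, RMSE and the coefficient of determination (one minus the mean squared error divided by the variance of the target values); the sample values are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired target and predicted values."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(mse)
    mean = sum(y_true) / n
    variance = sum((t - mean) ** 2 for t in y_true) / n  # variance of the targets
    r2 = 1 - mse / variance
    return mae, rmse, r2

# Hypothetical reading amounts: every prediction is off by exactly 10.
mae, rmse, r2 = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
```

Here the mean squared error is 100 and the variance of the targets is 12,500, so the sketch yields an MAE and RMSE of 10 and an $R^2$ of 0.992.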

1.4.2) training the XGBoost regression model by taking the article characteristics of the WeChat articles in the training set as samples and the reading amounts of the WeChat articles as labels, to obtain a regression model for predicting the reading amount of WeChat articles.

1.4.3) adjusting the parameters of the reading amount regression model on the verification set, and testing the reading amount regression model on the test set to obtain the trained reading amount regression model.

2) acquiring the article to be detected and its article characteristics, wherein the article characteristics comprise historical information characteristics, title characteristics, text characteristics, "title party" (clickbait) characteristics, time characteristics and other characteristics, specifically:

2.1) obtaining the historical information characteristics of the article to be detected, including the historical publishing frequency and the historical reading amount of the public number to which the article to be detected belongs:

2.1.1) given the current article a to be detected and the public number o to which it belongs, defining the historical publishing frequency as the total number of articles published by public number o within time t before article a, with t set to one month, two weeks, one week, 3 days and 1 day respectively. This characteristic mainly measures the activity of the public number.

2.1.2) obtaining the total, mean, variance and median of the reading amounts obtained by public number o within time t as the historical reading amount. This characteristic mainly reflects the popularity of the public number itself.

Experiments show that the historical information characteristics are very effective for predicting the reading amount. The likely reasons are as follows: a historically popular public number has more followers, so the articles it publishes are seen by more users; and the historical information reflects the quality of the public number's articles, which indirectly indicates the quality of the current article.
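The historical information characteristics of step 2.1) might be computed as in the following sketch. The representation of the history as `(release_time, reads)` pairs, the use of 30 days for "one month", the population variance, and the zero values returned for an empty window are assumptions made for this example.

```python
from datetime import datetime, timedelta
from statistics import mean, median, pvariance

# The five look-back windows named in step 2.1.1); 30 days stands in for one month.
WINDOWS = {
    "1_month": timedelta(days=30),
    "2_weeks": timedelta(days=14),
    "1_week": timedelta(days=7),
    "3_days": timedelta(days=3),
    "1_day": timedelta(days=1),
}

def history_features(history, release_time, window):
    """Publishing count and reading statistics of one public number within a window."""
    start = release_time - window
    reads = [r for t, r in history if start <= t < release_time]
    if not reads:
        return {"count": 0, "total": 0, "mean": 0.0, "variance": 0.0, "median": 0.0}
    return {
        "count": len(reads),            # historical publishing frequency
        "total": sum(reads),            # historical reading amount statistics
        "mean": mean(reads),
        "variance": pvariance(reads),
        "median": median(reads),
    }

# Hypothetical history of public number o before the article to be detected.
history = [
    (datetime(2018, 1, 1), 5000),
    (datetime(2018, 3, 1), 100),
    (datetime(2018, 3, 5), 300),
    (datetime(2018, 3, 9), 200),
]
release_time = datetime(2018, 3, 10)
week = history_features(history, release_time, WINDOWS["1_week"])
month = history_features(history, release_time, WINDOWS["1_month"])
```

Only articles released strictly before the article to be detected are counted, so the features remain available before publication.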

2.2) acquiring the title characteristics of the article to be detected, including the title basic composition, the emotional attribute and the title entities:

2.2.1) determining the title basic composition of the title of the article to be detected, including the title length, the number of words and the number of digits.

2.2.2) classifying the emotion of the title of the article to be detected with an emotion classification model to obtain the emotional attribute of the title, which is one of positive, negative and neutral. The emotion classification model may be one disclosed in the prior art, and the specific classification process is not repeated here.

2.2.3) determining the number of entities, such as place names, person names and organization names, appearing in the title of the article to be detected.
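The title basic composition of step 2.2.1) might be extracted as in the sketch below. Two points are assumptions: the tokenizer defaults to whitespace splitting, whereas Chinese titles would need a word segmenter supplied through the hypothetical `tokenize` parameter, and "number of digits" is read as the count of digit characters. The emotional attribute and the title entities depend on trained models and are not shown.

```python
def title_basic_features(title, tokenize=str.split):
    """Length, word count and digit count of an article title."""
    return {
        "length": len(title),                              # characters in the title
        "num_words": len(tokenize(title)),                 # tokens from the tokenizer
        "num_digits": sum(ch.isdigit() for ch in title),   # digit characters
    }

# Hypothetical English title used only to exercise the function.
features = title_basic_features("Top 10 phones of 2018")
```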

2.3) obtaining the text characteristics of the article to be detected, including the text basic composition, text entities, composition elements, average paragraph length and the topic to which the article belongs:

2.3.1) determining the text basic composition of the text of the article to be detected, including the length of the text, the number of words and the number of digits.

2.3.2) determining the number of entities, such as place names, person names and organization names, appearing in the text of the article to be detected.

2.3.3) determining the composition elements of the text of the article to be detected, including the numbers of paragraphs, pictures, web page links, and audio and video items, and determining the average paragraph length of the text, namely the average number of words per paragraph.

2.3.4) classifying the topic of the text of the article to be detected with a classification model to determine the topic category of the text, wherein the topic categories comprise literature, education, entertainment, culture, science and technology, military, history, society and law. The classification model may be one disclosed in the prior art, and the specific classification process is not repeated here.
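The composition elements and average paragraph length of step 2.3.3) might be counted as in the following sketch, which uses only the Python standard library. It assumes, purely for illustration, that the article text is available as HTML in which paragraphs are `<p>` elements, pictures are `<img>` tags, links are `<a>` tags and media are `<audio>` or `<video>` tags; the description does not specify the storage format. Paragraph length is measured in characters here.

```python
from html.parser import HTMLParser

class BodyStats(HTMLParser):
    """Collects paragraph texts and counts of pictures, links and media tags."""

    def __init__(self):
        super().__init__()
        self.counts = {"img": 0, "a": 0, "audio": 0, "video": 0}
        self.paragraphs = []
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1
        elif tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self.paragraphs[-1] += data

def body_features(html):
    stats = BodyStats()
    stats.feed(html)
    stats.close()
    texts = [p.strip() for p in stats.paragraphs if p.strip()]  # skip picture-only paragraphs
    return {
        "num_paragraphs": len(texts),
        "avg_paragraph_length": sum(len(t) for t in texts) / len(texts) if texts else 0.0,
        "num_pictures": stats.counts["img"],
        "num_links": stats.counts["a"],
        "num_media": stats.counts["audio"] + stats.counts["video"],
    }

# Hypothetical article body used only to exercise the function.
html = ('<p>Hello world</p><p><img src="a.png"></p>'
        '<p>Read <a href="https://example.com">more</a> here</p>')
features = body_features(html)
```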

2.4) acquiring the "title party" (clickbait) characteristics of the article to be detected, including whether the title is ambiguous, the punctuation marks, the number of interrogative words, the number of referring words, the number of degree adverbs and the number of emotional words:

2.4.1) determining whether the title of the article to be detected is ambiguous, that is, whether an ambiguous pronoun exists in the title. For example, the pronoun "he" in "He is the biggest winner of the skill circle this year" has no specific referent, so that title is ambiguous. Specifically, a classification model is trained on a small number of labeled article titles and then used to classify the title of the article to be detected as ambiguous or not.

2.4.2) determining the numbers of the punctuation marks "?" and "!" in the title of the article to be detected.

2.4.3) determining the numbers of interrogative words, referring words, degree adverbs (such as "extraordinarily") and emotional words in the title of the article to be detected, wherein the emotional words are words expressing a positive or negative evaluation or a positive or negative emotion.
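The counting features of steps 2.4.2) and 2.4.3) might be sketched as follows. The four word lists are small hypothetical lexicons invented for this example, and whitespace tokenization is an assumption that would be replaced by a word segmenter for Chinese titles. The ambiguity classifier of step 2.4.1) requires labeled data and is not shown.

```python
# Hypothetical lexicons; a real system would use curated word lists.
INTERROGATIVE_WORDS = {"why", "how", "what", "who"}
REFERRING_WORDS = {"he", "she", "it", "this", "that", "they"}
DEGREE_ADVERBS = {"very", "extremely", "most", "extraordinarily"}
EMOTIONAL_WORDS = {"shocking", "amazing", "terrible", "love", "hate"}

def title_party_features(title):
    """Punctuation and lexicon counts for a title."""
    # Strip surrounding punctuation (half- and full-width) and lowercase each token.
    tokens = [tok.strip("?!？！,.，。\"'").lower() for tok in title.split()]
    return {
        "num_question_marks": sum(title.count(c) for c in "?？"),
        "num_exclamation_marks": sum(title.count(c) for c in "!！"),
        "num_interrogative_words": sum(t in INTERROGATIVE_WORDS for t in tokens),
        "num_referring_words": sum(t in REFERRING_WORDS for t in tokens),
        "num_degree_adverbs": sum(t in DEGREE_ADVERBS for t in tokens),
        "num_emotional_words": sum(t in EMOTIONAL_WORDS for t in tokens),
    }

# Hypothetical title used only to exercise the function.
features = title_party_features("Why is he the most amazing winner this year?!")
```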

2.5) acquiring the time characteristics of the article to be detected, including the article release time, the time reading amount and the capture interval:

2.5.1) acquiring the release time of the article to be detected, including the month, day, hour and day of the week on which the article is released.

2.5.2) obtaining, as the time reading amount, the mean and variance of the reading amounts of articles released in the same hour and on the same day of the week as the article to be detected.

2.5.3) obtaining the capture interval of the article to be detected, namely the time interval between the release time of the article and the time at which its reading amount is captured.
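The three time characteristics of step 2.5) might be derived as in the sketch below. It assumes that the time reading amount is grouped jointly by day of the week and hour, that the population variance is used, and that the capture interval is expressed in hours; the description leaves these details open. The history values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pvariance

def release_time_features(released):
    """Month, day, hour and ISO day of the week (Monday = 1) of the release time."""
    return {
        "month": released.month,
        "day": released.day,
        "hour": released.hour,
        "weekday": released.isoweekday(),
    }

def slot_reading_stats(history):
    """Mean and variance of reading amounts per (weekday, hour) release slot."""
    slots = defaultdict(list)
    for released, reads in history:
        slots[(released.isoweekday(), released.hour)].append(reads)
    return {slot: (mean(values), pvariance(values)) for slot, values in slots.items()}

def capture_interval_hours(released, captured):
    """Hours between releasing an article and capturing its reading amount."""
    return (captured - released).total_seconds() / 3600

# Hypothetical history: two Saturday 11 AM articles and one Monday 8 PM article.
history = [
    (datetime(2018, 2, 10, 11, 5), 100),
    (datetime(2018, 2, 17, 11, 40), 300),
    (datetime(2018, 2, 12, 20, 0), 50),
]
stats = slot_reading_stats(history)
features = release_time_features(datetime(2018, 2, 10, 11, 5))
interval = capture_interval_hours(datetime(2018, 2, 10, 11, 5), datetime(2018, 2, 11, 11, 5))
```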

2.6) acquiring other characteristics of the article to be detected, including the ranking position, namely the position at which the article to be detected is ranked in the list when it is released.

3) using the trained XGBoost classification model to judge, from the article characteristics of the article to be detected, whether the article is a super article; if so, the predicted reading amount of the article is more than 100,000; if not, proceeding to step 4).

4) using the trained XGBoost regression model to determine the predicted reading amount of the article to be detected from its article characteristics.
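The two-stage decision of steps 3) and 4) might be sketched as follows. The function accepts any pair of trained models that expose a `predict` method, which is the interface offered by `xgboost.XGBClassifier` and `xgboost.XGBRegressor`; the two stand-in classes below are hypothetical stubs used so that the example runs without trained models, and their decision rules are invented. With real XGBoost models the feature row would typically be passed as a NumPy array.

```python
def predict_reading_amount(features, classifier, regressor):
    """Stage 1: super article or not. Stage 2: regression for non-super articles."""
    is_super = int(classifier.predict([features])[0]) == 1
    if is_super:
        # Only "more than 100,000" is known for a super article, so no exact value.
        return {"super_article": True, "reading_amount": None}
    predicted = float(regressor.predict([features])[0])
    return {"super_article": False, "reading_amount": predicted}

class StubClassifier:
    """Hypothetical stand-in for a trained classifier."""
    def predict(self, rows):
        return [1 if row[0] > 50_000 else 0 for row in rows]

class StubRegressor:
    """Hypothetical stand-in for a trained regressor."""
    def predict(self, rows):
        return [0.8 * row[0] for row in rows]

# The single feature here plays the role of a historical mean reading amount.
popular = predict_reading_amount([80_000], StubClassifier(), StubRegressor())
ordinary = predict_reading_amount([1_000], StubClassifier(), StubRegressor())
```

The regression model is consulted only when the classifier answers "not super", which mirrors the order of steps 3) and 4).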

Example two

This embodiment provides a system for predicting the reading amount of WeChat public number articles, which comprises:

a model training module for training an XGBoost classification model and an XGBoost regression model, respectively, on the WeChat data set;

a data acquisition module for acquiring the article characteristics of the article to be detected;

a super article prediction module for judging, with the trained XGBoost classification model and from the article characteristics of the article to be detected, whether the article is a super article; and

a reading amount prediction module for determining, with the trained XGBoost regression model and from the article characteristics of the article to be detected, the predicted reading amount of the article.

The above embodiments are only used for illustrating the present invention, and the structure, connection mode, manufacturing process, etc. of the components may be changed, and all equivalent changes and modifications performed on the basis of the technical solution of the present invention should not be excluded from the protection scope of the present invention.

Claims

1. A method for predicting the reading amount of WeChat public number articles, characterized by comprising the following steps:

1) training an XGBoost classification model and an XGBoost regression model, respectively, on a WeChat data set;

2) acquiring the article characteristics of an article to be detected;

3) using the trained XGBoost classification model to judge, from the article characteristics of the article to be detected, whether the article is a super article; if so, the predicted reading amount of the article is more than 100,000; if not, proceeding to step 4);

4) using the trained XGBoost regression model to determine the predicted reading amount of the article to be detected from its article characteristics.

2. The method for predicting the reading amount of WeChat public number articles as claimed in claim 1, wherein the specific process of step 1) is as follows:

1.1) dividing the WeChat data set into a training set, a verification set and a test set according to the release time of the articles, wherein the sets do not overlap;

1.2) determining whether each WeChat article in the WeChat data set is a positive or a negative sample, wherein a WeChat article that is a super article is represented as a positive sample, and a WeChat article that is a non-super article is represented as a negative sample;

1.3) training an XGBoost classification model on the WeChat data set;

1.4) training an XGBoost regression model on the WeChat data set.

3. The method for predicting the reading amount of WeChat public number articles as claimed in claim 2, wherein the specific process of step 1.3) is as follows:

1.3.1) constructing an XGBoost classification model, wherein the evaluation indexes adopted for the classification task of the XGBoost classification model comprise accuracy, precision, recall and F1 score;

1.3.2) training the XGBoost classification model with the WeChat articles represented as positive samples in the training set and a part of the WeChat articles represented as negative samples;

1.3.3) adjusting the parameters of the XGBoost classification model on the verification set, and testing the XGBoost classification model on the test set to obtain the trained XGBoost classification model.

4. The method for predicting the reading amount of WeChat public number articles as claimed in claim 2, wherein the specific process of step 1.4) is as follows:

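1.4.1) constructing an XGBoost regression model, wherein the evaluation indexes of the XGBoost regression model comprise the mean absolute error MAE, the root mean square error RMSE and the coefficient of determination R²:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

$$R^2 = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\mathrm{Variance}}, \qquad \mathrm{Variance} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$

wherein n represents the number of WeChat articles; y_i represents the target value of the i-th WeChat article; \hat{y}_i represents the predicted value of the i-th WeChat article; \bar{y} represents the average of the target values; and Variance represents the variance of all WeChat target values;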

1.4.2) training the XGBoost regression model by taking the article characteristics of the WeChat articles in the training set as samples and the reading amounts of the WeChat articles as labels;

1.4.3) adjusting the parameters of the XGBoost regression model on the verification set, and testing the XGBoost regression model on the test set to obtain the trained XGBoost regression model.

5. The method for predicting the reading amount of WeChat public number articles as claimed in claim 1, wherein the article characteristics comprise historical information characteristics, and the historical information characteristics comprise the historical posting frequency and the historical reading amount of the public number to which the article to be detected belongs, wherein:

the historical posting frequency is the total number of articles published by the public number o within the time t before the article a;

the historical reading amount is the sum, mean, variance and median of the reading amounts obtained by the public number o within the time t.

6. The method for predicting the reading amount of WeChat public number articles as claimed in claim 1, wherein the article characteristics comprise title characteristics, and the title characteristics comprise title basic composition, emotional attributes and title entities, wherein:

the title basic composition is the length, the number of words and the number of digits of the article title;

the emotional attributes are obtained by performing emotion classification on the article title with an emotion classification model, and comprise positive, negative and neutral;

the title entities are the place names, person names and organization names appearing in the article title.

7. The method for predicting the reading amount of WeChat public number articles as claimed in claim 1, wherein the article characteristics comprise text characteristics, and the text characteristics comprise text basic composition, text entities, composition elements, average paragraph length and the topic to which the article belongs, wherein:

the text basic composition is the length, the number of words and the number of digits of the article text;

the text entities are the place names, person names and organization names appearing in the article text;

the composition elements are the number of paragraphs, the number of pictures, the number of webpage links and the number of music and video items in the article text;

the average paragraph length is the average word number of each paragraph in the article text;

the topic to which the article belongs is a topic category obtained by classifying the topic of the article text by adopting a classification model.

8. The method for predicting the reading amount of WeChat public number articles as claimed in claim 1, wherein the article characteristics comprise "headliner" (clickbait title) characteristics, and the "headliner" characteristics comprise whether the title is ambiguous, punctuation marks, the number of interrogative words, the number of referring words, the number of degree adverbs and the number of emotional words, wherein:

whether the title is ambiguous is whether ambiguous pronouns appear in the article title;

the punctuation marks characteristic is the number of "?" and "!" marks in the article title;

the number of interrogative words, the number of referring words, the number of degree adverbs and the number of emotional words are the respective numbers of interrogative words, referring words, degree adverbs and emotional words appearing in the article title.

9. The method for predicting the reading amount of WeChat public number articles as claimed in claim 1, wherein the article characteristics comprise time characteristics, and the time characteristics comprise the article release time, the time-based reading amount and the capture interval, wherein:

the article release time comprises the month, the day, the hour and the day of the week on which the article is released;

the time-based reading amount is the mean and variance of the reading amounts of articles released in the same hour and on the same day of the week as the article;

the capture interval is the time interval between the time the article is released and the time its reading amount is captured.

10. A system for predicting the reading amount of WeChat public number articles, characterized in that the system comprises:

a model training module, used for respectively training an XGBoost classification model and an XGBoost regression model on the WeChat data set;

a data acquisition module, used for acquiring article characteristics of the article to be detected;

a super article prediction module, used for judging, with the trained XGBoost classification model and according to the article characteristics of the article to be detected, whether the article to be detected is a super article;