CN111260145A - Method and system for predicting reading amount of WeChat public number article - Google Patents

Info

Publication number
CN111260145A
Authority
CN
China
Prior art keywords
article
wechat
xgboost
reading
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010065180.7A
Other languages
Chinese (zh)
Inventor
窦志成
文继荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Renmin University of China
Original Assignee
Renmin University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Renmin University of China
Priority to CN202010065180.7A
Publication of CN111260145A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Evolutionary Computation (AREA)
  • Marketing (AREA)
  • Computing Systems (AREA)
  • Primary Health Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Development Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and a system for predicting the reading amount of WeChat public number articles. The method comprises the following steps: 1) training an XGBoost classification model and an XGBoost regression model, respectively, on a WeChat data set; 2) acquiring the article features of an article to be tested; 3) using the trained XGBoost classification model to judge, from the article features of the article to be tested, whether it is a super article; if so, the predicted reading amount of the article is more than 100,000; if not, proceeding to step 4); 4) using the trained XGBoost regression model to determine the predicted reading amount of the article to be tested from its article features. The method can reduce the time an author spends revising an article, improve the working efficiency of authors and related staff, and help articles obtain a higher reading amount, and it can be widely applied in the field of data prediction.

Description

Method and system for predicting reading amount of WeChat public number article
Technical Field
The invention relates to a prediction method, and in particular to a method and a system for predicting the reading amount of WeChat public number articles.
Background
Since the Web 2.0 era, research on the popularity of specific kinds of online content has grown steadily. This work mainly studies online news, online videos, and content published by users on social platforms. For online news, existing work generally uses the number of comments as the measure of popularity and divides the prediction task into two stages: first judging whether a news item will receive any comments, and then qualitatively predicting how many. To refine the estimate, the number of comments observed shortly after a news item is released is used to predict the distribution of the total number of comments it will eventually receive. For online video, most work uses the play count as the measure and predicts it for the current video from historical play count information. Other work focuses on content posted by users on social platforms such as Facebook and Twitter, and predicts the attention a post will receive from the friend relationships on the platform and the structure of the social network. These prior efforts have achieved certain results.
However, the existing methods for predicting the popularity of content mainly address web news, videos and content published by users on social platforms, and cannot be applied to predicting the reading amount of WeChat public number articles, for two main reasons: 1) the reading amount of a WeChat public number article should be predicted before the article is published, whereas almost all current methods predict after the content is published and rely on information observed after publication; 2) in the available data, the follow relationships between WeChat users and public numbers cannot be obtained and the users' social friend relationships are unknown, so no social network can be constructed for predicting the reading amount. Therefore, there is a need for a method that predicts the reading amount before an article is published, based only on limited information.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a method and a system for predicting the reading amount of a WeChat public number article before the article is published.
To achieve this object, the invention adopts the following technical solution: a method for predicting the reading amount of WeChat public number articles, characterized by comprising the following steps: 1) training an XGBoost classification model and an XGBoost regression model, respectively, on a WeChat data set; 2) acquiring the article features of an article to be tested; 3) using the trained XGBoost classification model to judge, from the article features of the article to be tested, whether it is a super article; if so, the predicted reading amount of the article is more than 100,000; if not, proceeding to step 4); 4) using the trained XGBoost regression model to determine the predicted reading amount of the article to be tested from its article features.
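The two-stage decision in steps 3) and 4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `predict_reading_amount`, `toy_classifier` and `toy_regressor` are hypothetical names, and the two toy functions merely stand in for the trained XGBoost classification and regression models.

```python
from typing import Callable, Sequence, Union

SUPER_THRESHOLD = 100_000  # reading amount above which an article is a "super article"

Features = Sequence[float]


def predict_reading_amount(
    features: Features,
    is_super: Callable[[Features], bool],
    estimate_reads: Callable[[Features], float],
) -> Union[str, float]:
    """Classify first; run the regression only for non-super articles."""
    if is_super(features):
        # Step 3: only "more than 100,000" is reported, not an exact count.
        return f"{SUPER_THRESHOLD:,}+"
    # Step 4: the regression model gives the specific predicted reading amount.
    return float(estimate_reads(features))


# Toy stand-ins for the trained models (hypothetical, for illustration only).
def toy_classifier(x: Features) -> bool:
    return x[0] > 0.9


def toy_regressor(x: Features) -> float:
    return 1000.0 * x[0]


print(predict_reading_amount([0.95], toy_classifier, toy_regressor))  # 100,000+
print(predict_reading_amount([0.25], toy_classifier, toy_regressor))  # 250.0
```

In practice the two callables would wrap the `predict` methods of trained models such as `xgboost.XGBClassifier` and `xgboost.XGBRegressor`.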
Further, the specific process of step 1) is as follows: 1.1) dividing the WeChat data set into a training set, a validation set and a test set according to the release time of the articles, wherein the sets do not overlap; 1.2) determining whether each WeChat article in the WeChat data set is a positive or a negative sample: a WeChat article that is a super article is a positive sample, and a WeChat article that is a non-super article is a negative sample; 1.3) training the XGBoost classification model on the WeChat data set; 1.4) training the XGBoost regression model on the WeChat data set.
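Step 1.1) can be sketched as a chronological split. The source does not state the cut-off dates or split proportions, so `train_end` and `valid_end` below are assumed parameters, and the `release_time` field name is hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Tuple

Article = Dict[str, object]


def split_by_release_time(
    articles: List[Article],
    train_end: datetime,
    valid_end: datetime,
) -> Tuple[List[Article], List[Article], List[Article]]:
    """Split articles chronologically into non-overlapping train/validation/test sets."""
    train, valid, test = [], [], []
    for article in articles:
        released = article["release_time"]
        if released < train_end:
            train.append(article)   # earliest articles: training
        elif released < valid_end:
            valid.append(article)   # middle period: parameter tuning
        else:
            test.append(article)    # latest articles: final testing
    return train, valid, test


articles = [
    {"id": 1, "release_time": datetime(2018, 2, 12, 9, 0)},
    {"id": 2, "release_time": datetime(2018, 3, 2, 20, 0)},
    {"id": 3, "release_time": datetime(2018, 3, 8, 11, 0)},
]
train, valid, test = split_by_release_time(
    articles, train_end=datetime(2018, 3, 1), valid_end=datetime(2018, 3, 6)
)
print(len(train), len(valid), len(test))  # 1 1 1
```

Splitting on time rather than at random keeps every validation and test article later than the training articles, which matches the goal of predicting before publication.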
Further, the specific process of step 1.3) is as follows: 1.3.1) constructing an XGBoost classification model, wherein the evaluation metrics used for the classification task comprise accuracy, precision, recall and F1 score; 1.3.2) training the XGBoost classification model on the WeChat articles labeled as positive samples in the training set and a portion of the WeChat articles labeled as negative samples; 1.3.3) tuning the parameters of the XGBoost classification model on the validation set, and testing the XGBoost classification model on the test set, to obtain the trained XGBoost classification model.
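The four classification metrics named in step 1.3.1) can be computed from the confusion counts. The sketch below assumes their standard definitions, with label 1 for a super article and 0 otherwise; the function name and the choice to return 0.0 for an undefined ratio are illustrative.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = super article)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


scores = classification_metrics([1, 1, 0, 0], [1, 0, 0, 0])
print(scores)
```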
Further, the specific process of step 1.4) is as follows: 1.4.1) constructing an XGBoost regression model, wherein the evaluation metrics of the XGBoost regression model comprise the mean absolute error MAE, the root mean square error RMSE and the coefficient of determination R²:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{\sum_{i=1}^{n}\left(\bar{y} - y_i\right)^2} = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{\mathrm{Variance}}$$

wherein $y_i$ represents the target value of the i-th WeChat article; $\hat{y}_i$ represents the predicted value of the i-th WeChat article; $n$ represents the number of WeChat articles; $\bar{y}$ represents the average of the target values; and Variance represents the variance of the target values of all WeChat articles; 1.4.2) training the XGBoost regression model with the article features of the WeChat articles in the training set as samples and the reading amounts of the WeChat articles as labels; 1.4.3) tuning the parameters of the XGBoost regression model on the validation set, and testing the XGBoost regression model on the test set, to obtain the trained XGBoost regression model.
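The three regression metrics of step 1.4.1) can be computed directly from target and predicted values. The sketch below assumes the standard definitions of MAE, RMSE and R², with R² taken as one minus the mean squared error divided by the population variance of the targets; the function name is illustrative.

```python
import math
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """MAE, RMSE and coefficient of determination R^2."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    mae = sum(abs(p - t) for t, p in zip(y_true, y_pred)) / n
    mse = sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n
    mean_true = sum(y_true) / n
    variance = sum((t - mean_true) ** 2 for t in y_true) / n
    # R^2 is undefined when all targets are identical.
    r2 = 1.0 - mse / variance if variance > 0 else float("nan")
    return {"mae": mae, "rmse": math.sqrt(mse), "r2": r2}


metrics = regression_metrics([100, 200, 300], [110, 190, 330])
print(metrics)
```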
Further, the article features comprise historical information features, which comprise the historical posting frequency and the historical reading amount of the public number to which the article to be tested belongs, wherein: the historical posting frequency is the total number of articles posted by public number o within time t before article a; the historical reading amount is the total, mean, variance and median of the reading amounts obtained by public number o within time t.
Further, the article features comprise title features, which comprise the basic composition of the title, its sentiment attribute and its title entities, wherein: the basic composition of the title comprises the length, the number of words and the number of digits of the article title; the sentiment attribute is obtained by classifying the sentiment of the article title with a sentiment classification model and is positive, negative or neutral; the title entities are the place names, person names and organization names appearing in the article title.
Further, the article features comprise text features, which comprise the basic composition of the text, text entities, composition elements, average paragraph length and the topic to which the article belongs, wherein: the basic composition of the text comprises the length, the number of words and the number of digits of the article text; the text entities are the place names, person names and organization names appearing in the article text; the composition elements are the number of paragraphs, pictures, web page links, and audio and video items in the article text; the average paragraph length is the average number of words per paragraph in the article text; the topic to which the article belongs is the topic category obtained by classifying the article text with a classification model.
Further, the article features comprise clickbait ("title party") features, which comprise whether the title is ambiguous, punctuation marks, the number of interrogative words, the number of referring words, the number of degree adverbs and the number of emotional words, wherein: whether the title is ambiguous means whether the article title contains a pronoun with no clear referent; the punctuation marks feature is the number of "?" and "!" marks in the article title; the numbers of interrogative words, referring words, degree adverbs and emotional words are the numbers of such words appearing in the article title.
Further, the article features comprise time features, which comprise the article release time, the time-slot reading amount and the crawl interval, wherein: the article release time comprises the month, day, hour and day of the week on which the article is released; the time-slot reading amount is the mean and variance of the reading amounts of articles released in the same hour and on the same day of the week as the release time; the crawl interval is the time between the release of the article and the moment its reading amount is crawled.
A system for predicting the reading amount of WeChat public number articles, the system comprising: a model training module for training an XGBoost classification model and an XGBoost regression model, respectively, on a WeChat data set; a data acquisition module for acquiring the article features of an article to be tested; a super article prediction module for using the trained XGBoost classification model to judge, from the article features of the article to be tested, whether it is a super article; and a reading amount prediction module for using the trained XGBoost regression model to determine the predicted reading amount of the article to be tested from its article features.
Due to the adoption of the technical scheme, the invention has the following advantages:
1. With the method, a predicted reading amount can be obtained from the XGBoost models before the article is released, based on the article features of the article to be tested. This gives the author suggestions for revision, improving the author's working efficiency while helping the article obtain a higher reading amount.
2. The method divides the prediction of the article reading amount into two parts: predicting whether the article is a super article, and accurately predicting the reading amount of a non-super article. An XGBoost classification model is used for the former and an XGBoost regression model for the latter. The article features of the article to be tested allow its structure, content and other aspects to be analyzed in detail when determining the predicted reading amount, which can give the author guidance for revising the article, improve the working efficiency of authors and related staff, and help obtain a higher reading amount. The method can be widely applied in the field of data prediction.
Drawings
Fig. 1 is a schematic diagram of basic information of the WeChat data set obtained in the present invention, wherein Fig. 1(a) shows basic information on the titles, with title length on the abscissa and number of articles on the ordinate, and Fig. 1(b) shows basic information on the body texts, with text length on the abscissa and number of articles on the ordinate;
Fig. 2 is a bar graph of the monthly number of articles posted by the public numbers in the WeChat data set, with number of articles on the abscissa and number of public numbers on the ordinate;
Fig. 3 shows statistics on the number of WeChat articles released and their reading amounts in the WeChat data set, wherein Fig. 3(a) shows the percentage of WeChat articles released in each hour of the day, with hour on the abscissa and percentage on the ordinate; Fig. 3(b) shows the percentage of WeChat articles released on each day of the week, with day of the week on the abscissa and percentage on the ordinate; Fig. 3(c) shows the average reading amount of WeChat articles released in each hour of the day, with hour on the abscissa and average reading amount on the ordinate; and Fig. 3(d) shows the average reading amount of WeChat articles released on each day of the week, with day of the week on the abscissa and average reading amount on the ordinate;
Fig. 4 is a graph of the growth rate of the reading amount and the average daily reading amount after the articles in the WeChat data set are published.
Detailed Description
The present invention is described in detail below with reference to the attached drawings. It is to be understood, however, that the drawings are provided solely for the purposes of promoting an understanding of the invention and that they are not to be construed as limiting the invention.
The method for predicting the reading amount of WeChat public number articles provided by the invention involves WeChat public numbers and reading amount prediction. Both are introduced below so that the method is clearer to those skilled in the art.
WeChat is one of the social platforms with the largest number of users in China. It provides two types of accounts, personal accounts and public numbers. A public number is aimed at a broad audience, and one of its main functions is publishing articles. Individual users can follow a public number, receive and read the articles it publishes, and forward those articles from their personal accounts to share them with more users. Under this mechanism, pushing articles to users through public numbers has become an important channel of information dissemination, and obtaining a high reading amount has become a goal that public numbers pursue. To determine reasonable features for predicting the reading amount of an article, the invention obtains a WeChat data set and observes and analyzes it as follows:
1. Basic information
A total of 22,554,445 articles published by 1,692,012 public numbers over one month (February 10 to March 10, 2018) were acquired as the WeChat data set, with a total reading amount of 26,460,684. Basic information such as the title length and body length of the articles was counted; the results are shown in Fig. 1(a) and (b). The body text of some articles contains only pictures or other non-text information.
2. Public information
The posting frequency differs markedly among the public numbers in the acquired WeChat data set: the most active public number published 695 articles in the month (about 23 per day), while the least active published only one. The overall frequency distribution is shown in Fig. 2. Most public numbers (72.32%) published about 10 articles in the month, and 11.81% published more than 30 (more than one per day). The invention therefore considers that the activity level (posting frequency) of a public number influences the reading amount of its articles.
3. Time information
The invention observes that the number of articles posted by public numbers varies with the time of day; the distributions are shown in Fig. 3(a) to (d). Several features stand out: a) there are several posting peaks during the day, at 11 AM, 5 PM and 8 PM, which coincide with typical off-work hours; the invention therefore speculates that public numbers choose these times in order to increase the reading amount of their articles; b) the number of articles posted varies by day of the week, with the peak on Saturday. Based on these features, the invention uses the mean and variance of the reading amounts at different times as part of the features for predicting the reading amount.
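The per-time-slot mean and variance mentioned above can be sketched as a group-by over (hour, day of week). The record layout and the use of the population variance are assumptions; the source does not specify either.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pvariance
from typing import Dict, Iterable, Tuple

Slot = Tuple[int, int]  # (hour of day, weekday with Monday == 0)


def time_slot_stats(
    records: Iterable[Tuple[datetime, int]],
) -> Dict[Slot, Tuple[float, float]]:
    """Map each (hour, weekday) slot to the (mean, variance) of its reading amounts."""
    buckets = defaultdict(list)
    for released, reads in records:
        buckets[(released.hour, released.weekday())].append(reads)
    return {slot: (mean(v), pvariance(v)) for slot, v in buckets.items()}


records = [
    (datetime(2018, 2, 10, 11, 0), 100),   # Saturday, 11 AM
    (datetime(2018, 2, 17, 11, 30), 300),  # Saturday, 11 AM
    (datetime(2018, 2, 12, 20, 0), 50),    # Monday, 8 PM
]
stats = time_slot_stats(records)
print(stats[(11, 5)])  # mean 200, variance 10000
```

To avoid leaking future information, the statistics would be computed from training-period articles only.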
4. Reading amount information
The reading amount of a WeChat article keeps growing over time. To study this growth, 1,600 public number articles published on July 4, 2018 were selected, and their reading amounts were collected every 24 hours from the time of publication. The changes in the collected reading amounts are shown in Fig. 4: both the growth rate and the absolute daily increase of the reading amount decline steadily, and after one week the daily reading amount is almost always below 10.
Articles with a reading amount exceeding 100,000 are regarded as super articles. Because their reading amount is displayed only as "100,000+", the exact value cannot be obtained, so directly predicting the reading amount cannot be treated as a pure regression problem. The invention therefore divides the prediction task into two consecutive sub-tasks: predicting whether an article is a super article and, if it is not, predicting the specific reading amount it is likely to obtain.
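Because the label is censored at "100,000+", each crawled reading amount yields either a class label only or an exact regression target. A minimal sketch of that labeling follows; the displayed string formats ("100,000+" and plain integers with optional thousands separators) are assumptions, and `parse_reading_amount` is a hypothetical helper.

```python
from typing import Optional, Tuple


def parse_reading_amount(displayed: str) -> Tuple[bool, Optional[int]]:
    """Return (is_super, exact_reads); exact_reads is None when the count is censored."""
    text = displayed.strip().replace(",", "")
    if text.endswith("+"):
        # Super article: only the lower bound is known, so there is no regression target.
        return True, None
    return False, int(text)


print(parse_reading_amount("100,000+"))  # (True, None)
print(parse_reading_amount("8,421"))     # (False, 8421)
```

Super articles then serve as positive samples for the classifier, while only articles with an exact count can be used as regression samples.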
Example one
Based on the above description, the method for predicting the reading amount of WeChat public number articles provided by the invention comprises the following steps:
1) Training an XGBoost classification model and an XGBoost regression model, respectively, on the WeChat data set, specifically comprising the following steps:
1.1) Dividing the WeChat data set into a training set, a validation set and a test set according to the release time of the articles, wherein the training set is used for training the XGBoost models, the validation set is used for tuning their parameters (such as max_depth, learning_rate and n_estimators), and the test set is used for testing them.
1.2) Determining the sample label (positive or negative) of each WeChat article in the WeChat data set, wherein a super article is an article whose reading amount exceeds 100,000: a WeChat article that is a super article is a positive sample, and a WeChat article that is a non-super article is a negative sample.
1.3) Training the XGBoost classification model on the WeChat data set:
1.3.1) Constructing an XGBoost classification model, wherein the evaluation metrics used for the classification task comprise accuracy, precision, recall and F1 score. Constructing an XGBoost classification model is a method disclosed in the prior art, and the specific process is not repeated here.
1.3.2) Training the XGBoost classification model on the WeChat articles labeled as positive samples in the training set and a portion of the WeChat articles labeled as negative samples, to obtain a super article classification model. Because non-super articles far outnumber super articles in the WeChat data set, only a portion of the non-super articles are selected as negative samples, so that the ratio of negative to positive samples used for model training is 1:1.
1.3.3) Tuning the parameters of the super article classification model on the validation set, and testing it on the test set, to obtain the trained super article classification model.
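The 1:1 balancing in step 1.3.2) can be sketched as undersampling of the negative class. The source says only that a portion of the non-super articles is selected; choosing them uniformly at random with a fixed seed is an assumption, and the function name is illustrative.

```python
import random
from typing import List, Sequence, Tuple


def balance_by_undersampling(
    samples: Sequence[object], labels: Sequence[int], seed: int = 0
) -> Tuple[List[object], List[int]]:
    """Keep every positive sample and an equal number of randomly chosen negatives."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    kept_negatives = rng.sample(negatives, min(len(positives), len(negatives)))
    keep = sorted(positives + kept_negatives)  # preserve the original order
    return [samples[i] for i in keep], [labels[i] for i in keep]


labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
samples = [f"article_{i}" for i in range(len(labels))]
balanced_samples, balanced_labels = balance_by_undersampling(samples, labels)
print(len(balanced_labels), sum(balanced_labels))  # 4 2
```

Only the training set is balanced this way in the sketch; the validation and test sets would keep their natural class ratio so the reported metrics reflect real conditions.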
1.4) Training the XGBoost regression model on the WeChat data set:
1.4.1) Constructing an XGBoost regression model, wherein the evaluation metrics of the XGBoost regression model comprise the mean absolute error MAE, the root mean square error RMSE and the coefficient of determination R². Constructing an XGBoost regression model is a method disclosed in the prior art, and the specific process is not repeated here.
The mean absolute error MAE, the root mean square error RMSE and the coefficient of determination R² are, respectively:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{\sum_{i=1}^{n}\left(\bar{y} - y_i\right)^2} = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{\mathrm{Variance}}$$

wherein $y_i$ represents the target value of the i-th WeChat article; $\hat{y}_i$ represents the predicted value of the i-th WeChat article; $n$ represents the number of WeChat articles; $\bar{y}$ represents the average of the target values; and Variance represents the variance of the target values of all WeChat articles.
1.4.2) Training the XGBoost regression model with the article features of the WeChat articles in the training set as samples and the reading amounts of the WeChat articles as labels, to obtain a regression model for predicting the reading amount of WeChat articles.
1.4.3) Tuning the parameters of the reading amount regression model on the validation set, and testing it on the test set, to obtain the trained reading amount regression model.
2) Acquiring the article to be tested and its article features, wherein the article features comprise historical information features, title features, text features, clickbait ("title party") features, time features and the like, specifically:
2.1) Acquiring the historical information features of the article to be tested, comprising the historical posting frequency and the historical reading amount of the public number to which the article belongs:
2.1.1) Given the current article a to be tested and the public number o to which it belongs, the historical posting frequency is defined as the total number of articles posted by public number o within time t before article a, with t set to one month, two weeks, one week, 3 days and 1 day, respectively. This feature mainly measures the activity level of the public number.
2.1.2) Obtaining the total, mean, variance and median of the reading amounts obtained by public number o within time t. This information serves as the historical reading amount and mainly reflects the popularity of the public number itself.
Experiments show that the historical information features are very effective for predicting the reading amount. The likely reasons are: a historically popular public number has more followers, so its articles are seen by more users; and historical information reflects the quality of the public number's past articles, which indirectly indicates the quality of the current article.
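Steps 2.1.1) and 2.1.2) can be sketched as window statistics over one public number's past articles. Treating "one month" as 30 days, filling empty windows with zeros, and using the population variance are all assumptions; the feature names are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean, median, pvariance
from typing import Dict, List, Tuple

# Look-back windows t from step 2.1.1); "one month" is assumed to be 30 days.
WINDOWS = {
    "1m": timedelta(days=30),
    "2w": timedelta(weeks=2),
    "1w": timedelta(weeks=1),
    "3d": timedelta(days=3),
    "1d": timedelta(days=1),
}


def history_features(
    history: List[Tuple[datetime, int]], publish_time: datetime
) -> Dict[str, float]:
    """Posting frequency and reading statistics of one public number per window."""
    feats: Dict[str, float] = {}
    for name, window in WINDOWS.items():
        reads = [r for t, r in history if publish_time - window <= t < publish_time]
        feats[f"post_count_{name}"] = len(reads)
        feats[f"reads_total_{name}"] = sum(reads)
        feats[f"reads_mean_{name}"] = mean(reads) if reads else 0.0
        feats[f"reads_var_{name}"] = pvariance(reads) if reads else 0.0
        feats[f"reads_median_{name}"] = median(reads) if reads else 0.0
    return feats


history = [
    (datetime(2018, 3, 9, 18, 0), 500),
    (datetime(2018, 3, 5, 8, 0), 300),
    (datetime(2018, 2, 20, 8, 0), 100),
]
feats = history_features(history, publish_time=datetime(2018, 3, 10, 12, 0))
print(feats["post_count_1w"], feats["reads_mean_1w"])  # 2 400
```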
2.2) Acquiring the title features of the article to be tested, comprising the basic composition of the title, its sentiment attribute and its title entities:
2.2.1) Determining the basic composition of the title of the article to be tested, comprising the title length, the number of words and the number of digits.
2.2.2) Classifying the sentiment of the title of the article to be tested with a sentiment classification model to obtain the sentiment attribute of the title, which is positive, negative or neutral. The sentiment classification model may be one disclosed in the prior art, and the specific classification process is not repeated here.
2.2.3) Determining the number of entities, such as place names, person names and organization names, appearing in the title of the article to be tested.
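The basic composition in step 2.2.1) can be sketched as three counts. Whitespace tokenization is an assumption that suits the English example only; Chinese titles would need a word segmenter passed in as `tokenize`. The sentiment and entity features of steps 2.2.2) and 2.2.3) require trained models and are not shown.

```python
from typing import Callable, Dict, List


def basic_composition(
    text: str, tokenize: Callable[[str], List[str]] = str.split
) -> Dict[str, int]:
    """Length in characters, number of words and number of digit characters."""
    return {
        "length": len(text),
        "word_count": len(tokenize(text)),
        "digit_count": sum(ch.isdigit() for ch in text),
    }


title_feats = basic_composition("Top 10 tips for 2018")
print(title_feats)  # {'length': 20, 'word_count': 5, 'digit_count': 6}
```

The same function could be applied to the body text for the basic text composition.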
2.3) Acquiring the text features of the article to be tested, comprising the basic composition of the text, text entities, composition elements, average paragraph length and the topic to which the article belongs:
2.3.1) Determining the basic composition of the text of the article to be tested, comprising the text length, the number of words and the number of digits.
2.3.2) Determining the number of entities, such as place names, person names and organization names, appearing in the text of the article to be tested.
2.3.3) Determining the composition elements of the text of the article to be tested, comprising the number of paragraphs, pictures, web page links, and audio and video items, and determining the average paragraph length of the text, namely the average number of words per paragraph.
2.3.4) Classifying the topic of the text of the article to be tested with a classification model to determine its topic category, wherein the topic categories comprise literature, education, entertainment, culture, science and technology, military, history, society and law. The classification model may be one disclosed in the prior art, and the specific classification process is not repeated here.
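Part of step 2.3.3) can be sketched on plain text. The sketch assumes paragraphs are separated by line breaks, links appear as literal `http(s)://` URLs, and words are whitespace-separated; counting pictures and audio or video items would need the article's original markup, which the source does not describe, so they are omitted.

```python
import re
from typing import Callable, Dict, List

URL_PATTERN = re.compile(r"https?://\S+")


def body_structure_features(
    body: str, tokenize: Callable[[str], List[str]] = str.split
) -> Dict[str, float]:
    """Paragraph count, link count and average words per paragraph."""
    paragraphs = [p for p in body.split("\n") if p.strip()]
    word_counts = [len(tokenize(p)) for p in paragraphs]
    return {
        "paragraph_count": len(paragraphs),
        "link_count": len(URL_PATTERN.findall(body)),
        "avg_paragraph_length": sum(word_counts) / len(paragraphs) if paragraphs else 0.0,
    }


body = "First paragraph has five words.\n\nSecond one links to https://example.com here.\n"
body_feats = body_structure_features(body)
print(body_feats)
```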
2.4) Obtaining the "title party" (clickbait) features of the article to be tested, including whether the title is ambiguous, the punctuation marks, the number of interrogative words, the number of referring words, the number of degree adverbs and the number of emotional words:
2.4.1) Determining whether the title of the article to be tested is ambiguous, for example by determining whether the title contains a pronoun without a clear referent. For instance, the pronoun "he" in "he is the largest winner of the skill circle in this year" has no specific referent, so the title is ambiguous. Specifically, a classification model is trained on a small number of labeled article titles, and the classification model is then used to classify the title of the article to be tested and determine whether it is ambiguous.
2.4.2) Determining the number of the punctuation marks "?" and "!" in the title of the article to be tested.
2.4.3) Determining the number of interrogative words, the number of referring words, the number of degree adverbs (such as "extraordinarily") and the number of emotional words in the title of the article to be tested, wherein the emotional words are words expressing a positive or negative evaluation or a positive or negative emotion.
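The counting in steps 2.4.2) and 2.4.3) can be sketched as below. This is only an illustration under stated assumptions: the four word lists are tiny hypothetical lexicons, the function name `title_clickbait_features` is invented for the example, and the whitespace-style tokenization only suits English text (a Chinese title would first need word segmentation). Both ASCII and full-width punctuation marks are counted.

```python
import re

# Tiny illustrative lexicons; a real system would load curated word lists (assumption).
INTERROGATIVES = {"why", "how", "what"}
REFERRING_WORDS = {"he", "she", "this", "it"}
DEGREE_ADVERBS = {"very", "extremely"}
EMOTION_WORDS = {"amazing", "terrible", "love"}

def title_clickbait_features(title: str) -> dict:
    """Count punctuation marks and lexicon hits in a title (hypothetical helper)."""
    tokens = re.findall(r"[a-z']+", title.lower())
    return {
        "question_marks": title.count("?") + title.count("？"),
        "exclamation_marks": title.count("!") + title.count("！"),
        "interrogative_count": sum(t in INTERROGATIVES for t in tokens),
        "referring_count": sum(t in REFERRING_WORDS for t in tokens),
        "degree_adverb_count": sum(t in DEGREE_ADVERBS for t in tokens),
        "emotion_word_count": sum(t in EMOTION_WORDS for t in tokens),
    }

title_features = title_clickbait_features("Why is he extremely happy? This is amazing!!")
```

The ambiguity flag of step 2.4.1) is not covered here, since the text obtains it from a separately trained classification model.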
2.5) Obtaining the time features of the article to be tested, including the article release time, the time reading amount and the grabbing interval:
2.5.1) Obtaining the release time of the article to be tested, including the month, the day, the hour and the day of the week on which the article was released.
2.5.2) Obtaining, as the time reading amount, the mean and the variance of the reading amounts of articles released in the same hour and on the same day of the week as the article to be tested.
2.5.3) Obtaining the grabbing interval of the article to be tested, namely the time interval between the release time of the article and the time at which its reading amount was grabbed.
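The time features of steps 2.5.1) to 2.5.3) might be derived as in the following sketch. The function names, the representation of past articles as `(release time, reading amount)` pairs, the choice of hours as the unit of the grabbing interval and the use of the population variance are assumptions of this example; the text does not fix them.

```python
from datetime import datetime
from statistics import mean, pvariance

def time_features(released: datetime, grabbed: datetime) -> dict:
    """Release-time components and grabbing interval (hypothetical helper)."""
    return {
        "month": released.month,
        "day": released.day,
        "hour": released.hour,
        "weekday": released.weekday(),  # Monday == 0
        "grab_interval_hours": (grabbed - released).total_seconds() / 3600.0,
    }

def slot_reading_stats(history, released: datetime):
    """Mean and variance of reading amounts in the same hour and weekday slot.

    history: iterable of (release datetime, reading amount) pairs (assumed layout).
    """
    reads = [r for t, r in history
             if t.hour == released.hour and t.weekday() == released.weekday()]
    if not reads:
        return 0.0, 0.0  # assumption: fall back to zeros when the slot is empty
    return mean(reads), pvariance(reads)

released = datetime(2020, 1, 20, 8, 30)   # a Monday
grabbed = datetime(2020, 1, 22, 8, 30)
history = [
    (datetime(2020, 1, 13, 8, 5), 100),   # Monday, 08:xx
    (datetime(2020, 1, 6, 8, 45), 300),   # Monday, 08:xx
    (datetime(2020, 1, 14, 8, 0), 999),   # Tuesday, ignored
]
tf = time_features(released, grabbed)
slot_mean, slot_var = slot_reading_stats(history, released)
```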
2.6) Obtaining other features of the article to be tested, including the ranking position, namely the position of the article to be tested in the list at the time of its release.
3) Using the trained XGBoost classification model to judge, according to the article features of the article to be tested, whether the article is a super article. If yes, the predicted reading amount of the article to be tested is more than 100,000; if not, proceeding to step 4).
4) Using the trained XGBoost regression model to determine the predicted reading amount of the article to be tested according to its article features.
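The two-stage decision of steps 3) and 4) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: `predict_reading_amount` is a hypothetical helper that accepts any fitted models exposing a `predict` method (for example `xgboost.XGBClassifier` and `xgboost.XGBRegressor`, which may need the feature rows converted to an array first), and the two stub classes exist only so that the example runs without a trained model.

```python
def predict_reading_amount(features, classifier, regressor):
    """Classify first; run the regressor only for non-super articles."""
    rows = [list(features)]  # a batch containing a single article
    if int(classifier.predict(rows)[0]) == 1:
        # Step 3): super article, reported as "more than 100,000" without a number.
        return {"super_article": True, "reading_amount": None}
    # Step 4): ordinary article, numeric prediction from the regression model.
    return {"super_article": False,
            "reading_amount": float(regressor.predict(rows)[0])}

class StubClassifier:
    """Stand-in for a trained classifier (assumption for the demo)."""
    def predict(self, rows):
        return [1 if row[0] > 0.5 else 0 for row in rows]

class StubRegressor:
    """Stand-in for a trained regressor (assumption for the demo)."""
    def predict(self, rows):
        return [1234.0 for _ in rows]

result_hot = predict_reading_amount([0.9, 3.0], StubClassifier(), StubRegressor())
result_normal = predict_reading_amount([0.1, 3.0], StubClassifier(), StubRegressor())
```

Keeping the classifier in front means the regressor is never asked to produce a value for articles whose reading amount is only known to exceed 100,000.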
Embodiment two
This embodiment provides a system for predicting the reading amount of WeChat public number articles, comprising:
a model training module, used for respectively training an XGBoost classification model and an XGBoost regression model on the WeChat data set;
a data acquisition module, used for acquiring the article features of the article to be tested;
a super article prediction module, used for judging, with the trained XGBoost classification model, whether the article to be tested is a super article according to its article features;
a reading amount prediction module, used for determining, with the trained XGBoost regression model, the predicted reading amount of the article to be tested according to its article features.
The above embodiments are only used for illustrating the present invention, and the structure, connection mode, manufacturing process, etc. of the components may be changed, and all equivalent changes and modifications performed on the basis of the technical solution of the present invention should not be excluded from the protection scope of the present invention.

Claims (10)

1. A method for predicting the reading amount of WeChat public number articles, characterized by comprising the following steps:
1) respectively training an XGBoost classification model and an XGBoost regression model on a WeChat data set;
2) acquiring article features of an article to be tested;
3) using the trained XGBoost classification model to judge, according to the article features of the article to be tested, whether the article is a super article, wherein if yes, the predicted reading amount of the article to be tested is more than 100,000; if not, proceeding to step 4);
4) using the trained XGBoost regression model to determine the predicted reading amount of the article to be tested according to its article features.
2. The method for predicting the reading amount of WeChat public number articles as claimed in claim 1, wherein the specific process of step 1) is as follows:
1.1) dividing the WeChat data set into a training set, a verification set and a test set according to the release time of the articles, wherein the sets do not overlap;
1.2) determining whether each WeChat article in the WeChat data set is a positive or a negative sample, wherein a WeChat article that is a super article is a positive sample, and a WeChat article that is a non-super article is a negative sample;
1.3) training the XGBoost classification model on the WeChat data set;
1.4) training the XGBoost regression model on the WeChat data set.
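By way of illustration only, steps 1.1) and 1.2) could be realized along the following lines. The dictionary keys `released` and `reads`, the two cut-off dates and the helper names are assumptions of this sketch, as is treating a reading amount of at least 100,000 as the mark of a super article.

```python
from datetime import datetime

def split_by_release_time(articles, train_end, valid_end):
    """Chronological, non-overlapping split into training, verification and test sets."""
    train = [a for a in articles if a["released"] < train_end]
    valid = [a for a in articles if train_end <= a["released"] < valid_end]
    test = [a for a in articles if a["released"] >= valid_end]
    return train, valid, test

def is_positive_sample(article, threshold=100_000):
    """Assumption: a super article is one whose reading amount reaches the threshold."""
    return article["reads"] >= threshold

articles = [
    {"released": datetime(2019, 3, 1), "reads": 150_000},
    {"released": datetime(2019, 9, 1), "reads": 2_000},
    {"released": datetime(2019, 12, 1), "reads": 40_000},
]
train, valid, test = split_by_release_time(
    articles, train_end=datetime(2019, 7, 1), valid_end=datetime(2019, 11, 1))
labels = [1 if is_positive_sample(a) else 0 for a in articles]
```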
3. The method for predicting the reading amount of WeChat public number articles as claimed in claim 2, wherein the specific process of step 1.3) is as follows:
1.3.1) constructing an XGBoost classification model, wherein the evaluation indexes adopted for the classification task of the XGBoost classification model comprise accuracy, precision, recall and F1 score;
1.3.2) training the XGBoost classification model with the WeChat articles that are positive samples in the training set and a part of the WeChat articles that are negative samples;
1.3.3) tuning the parameters of the XGBoost classification model on the verification set, and testing the XGBoost classification model on the test set to obtain the trained XGBoost classification model.
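For illustration, the selection of part of the negative samples in step 1.3.2) and the four evaluation indexes of step 1.3.1) can be sketched as below. The helper names, the one-to-one ratio of negatives to positives and the fixed random seed are assumptions of this example; the text only states that part of the negative samples is used.

```python
import random

def subsample_negatives(samples, labels, ratio=1.0, seed=0):
    """Keep all positives and a random part of the negatives (ratio is an assumption)."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng = random.Random(seed)
    kept_neg = rng.sample(neg, min(len(neg), int(len(pos) * ratio)))
    keep = sorted(pos + kept_neg)
    return [samples[i] for i in keep], [labels[i] for i in keep]

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = super article)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

xs, ys = subsample_negatives(list("abcdef"), [1, 0, 0, 0, 1, 0])
metrics = classification_metrics([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
```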
4. The method for predicting the reading amount of WeChat public number articles as claimed in claim 2, wherein the specific process of step 1.4) is as follows:
1.4.1) constructing an XGBoost regression model, wherein the evaluation indexes of the XGBoost regression model comprise the mean absolute error MAE, the root mean square error RMSE and the coefficient of determination R²:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
wherein $y_i$ represents the target value of the i-th WeChat article; $\hat{y}_i$ represents the predicted value of the i-th WeChat article; n represents the number of WeChat articles;
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
$$R^2 = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\mathrm{Variance}}, \qquad \mathrm{Variance} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$
wherein $\bar{y}$ represents the average of the target values; Variance represents the variance of the target values of all WeChat articles;
1.4.2) training the XGBoost regression model by taking the article features of the WeChat articles in the training set as samples and the reading amounts of the WeChat articles as labels;
1.4.3) tuning the parameters of the XGBoost regression model on the verification set, and testing the XGBoost regression model on the test set to obtain the trained XGBoost regression model.
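The three evaluation indexes named in step 1.4.1) follow their standard definitions and can be computed as in this sketch. The function name `regression_metrics` and the sample values are made up for the example, and the variance is taken as the population variance of the target values.

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, RMSE and R^2 for paired target and predicted reading amounts."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_y = sum(y_true) / n
    variance = sum((t - mean_y) ** 2 for t in y_true) / n
    # R^2 is undefined when all targets are equal (variance == 0).
    return {"mae": mae, "rmse": math.sqrt(mse), "r2": 1.0 - mse / variance}

scores = regression_metrics([100, 200, 300], [110, 190, 330])
```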
5. The method for predicting the reading amount of WeChat public number articles as claimed in claim 1, wherein the article features comprise historical information features, and the historical information features comprise the historical posting frequency and the historical reading amount of the public number to which the article to be tested belongs, wherein:
the historical posting frequency is the total number of articles released by the public number o within the time t before the article a;
the historical reading amount is the total, mean, variance and median of the reading amounts obtained by the public number o within the time t.
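As a hedged illustration of the historical information features, the sketch below aggregates the past posts of one public number inside a look-back window. The 30-day window, the `(release time, reading amount)` pair layout, the helper name and the use of the population variance are assumptions made for the example.

```python
from datetime import datetime, timedelta
from statistics import mean, median, pvariance

def history_features(account_posts, article_time, window):
    """Posting frequency and reading statistics of one account within the window."""
    reads = [r for t, r in account_posts
             if article_time - window <= t < article_time]
    if not reads:
        return {"post_count": 0, "read_total": 0, "read_mean": 0.0,
                "read_variance": 0.0, "read_median": 0.0}
    return {"post_count": len(reads), "read_total": sum(reads),
            "read_mean": mean(reads), "read_variance": pvariance(reads),
            "read_median": median(reads)}

now = datetime(2020, 1, 20, 8, 0)
posts = [
    (now - timedelta(days=1), 100),
    (now - timedelta(days=3), 300),
    (now - timedelta(days=40), 5000),  # outside the 30-day window
    (now + timedelta(days=1), 999),    # released after the article, ignored
]
hist = history_features(posts, now, window=timedelta(days=30))
```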
6. The method for predicting the reading amount of WeChat public number articles as claimed in claim 1, wherein the article features comprise title features, and the title features comprise the basic title composition, the emotion attribute and the title entities, wherein:
the basic title composition comprises the title length, the number of words and the number of digits of the article title;
the emotion attribute is obtained by performing emotion classification on the article title with an emotion classification model, and is positive, negative or neutral;
the title entities are the place names, person names and organization names appearing in the article title.
7. The method for predicting the reading amount of WeChat public number articles as claimed in claim 1, wherein the article features comprise body text features, and the body text features comprise the basic body composition, the body entities, the composition elements, the average paragraph length and the topic of the article, wherein:
the basic body composition comprises the length of the article, the number of words and the number of digits of the body text;
the body entities are the place names, person names and organization names appearing in the body text of the article;
the composition elements are the number of paragraphs, the number of pictures, the number of web page links and the number of audio and video items in the body text of the article;
the average paragraph length is the average number of words per paragraph in the body text of the article;
the topic of the article is the topic category obtained by classifying the topic of the body text with a classification model.
8. The method for predicting the reading amount of WeChat public number articles as claimed in claim 1, wherein the article features comprise "title party" (clickbait) features, and the "title party" features comprise whether the title is ambiguous, the punctuation marks, the number of interrogative words, the number of referring words, the number of degree adverbs and the number of emotional words, wherein:
whether the title is ambiguous is whether an ambiguous pronoun exists in the article title;
the punctuation marks are the number of the punctuation marks "?" and "!" in the article title;
the number of interrogative words, the number of referring words, the number of degree adverbs and the number of emotional words are the respective numbers of such words appearing in the article title.
9. The method for predicting the reading amount of WeChat public number articles as claimed in claim 1, wherein the article features comprise time features, and the time features comprise the article release time, the time reading amount and the grabbing interval, wherein:
the article release time comprises the month, the day, the hour and the day of the week on which the article was released;
the time reading amount is the mean and the variance of the reading amounts of articles released in the same hour and on the same day of the week as the article;
the grabbing interval is the time interval between the release time of the article and the time at which its reading amount was grabbed.
10. A system for predicting the reading amount of WeChat public number articles, characterized by comprising:
a model training module, used for respectively training an XGBoost classification model and an XGBoost regression model on a WeChat data set;
a data acquisition module, used for acquiring article features of an article to be tested;
a super article prediction module, used for judging, with the trained XGBoost classification model, whether the article to be tested is a super article according to its article features;
a reading amount prediction module, used for determining, with the trained XGBoost regression model, the predicted reading amount of the article to be tested according to its article features.
CN202010065180.7A 2020-01-20 2020-01-20 Method and system for predicting reading amount of WeChat public number article Pending CN111260145A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010065180.7A CN111260145A (en) 2020-01-20 2020-01-20 Method and system for predicting reading amount of WeChat public number article

Publications (1)

Publication Number Publication Date
CN111260145A true CN111260145A (en) 2020-06-09

Family

ID=70947101

Country Status (1)

Country Link
CN (1) CN111260145A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination