CN111260145A - Method and system for predicting reading amount of WeChat public number article - Google Patents
Method and system for predicting reading amount of WeChat public number article
- Publication number
- CN111260145A (application CN202010065180.7A)
- Authority
- CN
- China
- Prior art keywords
- article
- xgboost
- reading
- detected
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
The invention relates to a method and a system for predicting the reading amount of WeChat public number articles, the method comprising the following steps: 1) training an XGBoost classification model and an XGBoost regression model, respectively, on a WeChat data set; 2) acquiring article features of an article to be detected; 3) using the trained XGBoost classification model to judge, from the article features, whether the article to be detected is a super article; if yes, the predicted reading amount of the article to be detected is more than 100,000; if not, proceeding to step 4); 4) using the trained XGBoost regression model to determine the predicted reading amount of the article to be detected from its article features. The method can reduce the time an author spends revising an article, improve the working efficiency of authors and related staff, help articles obtain a higher reading amount, and can be widely applied in the field of data prediction.
Description
Technical Field
The invention relates to a prediction method, and in particular to a method and a system for predicting the reading amount of WeChat public number articles.
Background
Since the Web 2.0 era, a growing body of research has studied the popularity of specific content on the Internet, focusing mainly on online news, online videos, and content published by users on social platforms. For online news, existing work generally uses the number of comments as the popularity measure and divides the prediction task into two stages: first judging whether a news item will receive comments at all, and then qualitatively predicting the number of comments. To further estimate the number of comments, the number observed shortly after a news item is released is used to predict the distribution of the total number of comments it will eventually receive. For online video, most work uses the play count as the measure and predicts the popularity of the current video from historical play-count information. Other work focuses on content posted by users on social platforms such as Facebook and Twitter, and predicts the attention that posted content will receive from the friend relationships and the network structure of the social network. These prior efforts have achieved certain results.
However, existing popularity prediction methods focus mainly on web news, videos, and user-published content on social platforms, and cannot be applied to predicting the reading amount of WeChat public number articles, for two main reasons: 1) the reading amount of a WeChat public number article should be predicted before the article is published, whereas almost all current methods predict after publication and rely on information observed after publication; 2) in the available data, the association between WeChat users and public numbers cannot be obtained and the users' social friend relationships are unknown, so a social network cannot be constructed to predict the reading amount. Therefore, a method is needed that predicts the reading amount before an article is published, based only on limited information.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a method and a system for predicting the reading amount of a WeChat public number article before the article is published.
In order to achieve this object, the invention adopts the following technical scheme: a method for predicting the reading amount of WeChat public number articles, comprising the following steps: 1) training an XGBoost classification model and an XGBoost regression model, respectively, on a WeChat data set; 2) acquiring article features of an article to be detected; 3) using the trained XGBoost classification model to judge, from the article features, whether the article to be detected is a super article; if yes, the predicted reading amount of the article to be detected is more than 100,000; if not, proceeding to step 4); 4) using the trained XGBoost regression model to determine the predicted reading amount of the article to be detected from its article features.
Further, the specific process of step 1) is as follows: 1.1) dividing the WeChat data set into a training set, a verification set, and a test set according to the release time of the articles, wherein the sets do not overlap; 1.2) determining the sample label (positive or negative) of each WeChat article in the WeChat data set, wherein a WeChat article that is a super article is labeled as a positive sample and a WeChat article that is a non-super article is labeled as a negative sample; 1.3) training an XGBoost classification model on the WeChat data set; 1.4) training an XGBoost regression model on the WeChat data set.
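Steps 1.1) and 1.2) can be illustrated with a minimal Python sketch. The field names `publish_time` and `reads`, the 80/10/10 split fractions, and the assumption that capped counts are stored as a value above 100,000 are illustrative choices that the source does not specify.

```python
from datetime import datetime

SUPER_THRESHOLD = 100_000  # reading amounts above this are shown as "100,000+"


def chronological_split(articles, train_frac=0.8, val_frac=0.1):
    """Sort articles by release time and cut them into three
    non-overlapping sets: training, verification, test.
    The fractions are assumptions, not values from the source."""
    ordered = sorted(articles, key=lambda a: a["publish_time"])
    n_train = int(len(ordered) * train_frac)
    n_val = int(len(ordered) * val_frac)
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test


def label_super(article):
    """Return 1 (positive sample) for a super article, else 0 (negative).
    Assumes a capped count is stored as some value above the threshold."""
    return 1 if article["reads"] > SUPER_THRESHOLD else 0
```

Splitting by release time rather than at random keeps every verification and test article later than the training articles, which matches the goal of predicting before publication.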
Further, the specific process of step 1.3) is as follows: 1.3.1) constructing an XGBoost classification model, wherein the evaluation indexes adopted for the classification task comprise accuracy, precision, recall, and F1 score; 1.3.2) training the XGBoost classification model with the WeChat articles labeled as positive samples in the training set and a portion of the WeChat articles labeled as negative samples; 1.3.3) adjusting the parameters of the XGBoost classification model on the verification set, and testing the XGBoost classification model on the test set, to obtain the trained XGBoost classification model.
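The four evaluation indexes named in step 1.3.1) follow their standard definitions. A minimal sketch, assuming binary labels where 1 marks a super article; the function name and the zero fallback for empty denominators are illustrative choices:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = super article)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Fall back to 0.0 when a denominator is empty (an illustrative convention).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```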
Further, the specific process of step 1.4) is as follows: 1.4.1) constructing an XGBoost regression model, wherein the evaluation indexes of the XGBoost regression model comprise the mean absolute error MAE, the root mean square error RMSE, and the coefficient of determination $R^2$:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
wherein $y_i$ represents the target value of the i-th WeChat article; $\hat{y}_i$ represents the predicted value of the i-th WeChat article; $n$ represents the number of WeChat articles;
$$R^2 = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\mathrm{Var}(y)}, \qquad \mathrm{Var}(y) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$
wherein $\bar{y}$ represents the average of the target values; $\mathrm{Var}(y)$ represents the variance of the target values of all WeChat articles; 1.4.2) training the XGBoost regression model by taking the article features of the WeChat articles in the training set as samples and the reading amounts of the WeChat articles as labels; 1.4.3) adjusting the parameters of the XGBoost regression model on the verification set, and testing the XGBoost regression model on the test set, to obtain the trained XGBoost regression model.
Further, the article features include historical information features, which include the historical posting frequency and the historical reading amount of the public number to which the article to be detected belongs, wherein: the historical posting frequency is the total number of articles published by the public number o within the time t before the article a; the historical reading amount is the total, mean, variance, and median of the reading amounts obtained by the public number o within the time t.
Further, the article features include title features, which include the title basic composition, the emotional attribute, and the title entities, wherein: the title basic composition comprises the title length, the number of words, and the number of digits of the article title; the emotional attribute is obtained by performing emotion classification on the article title with an emotion classification model and is one of positive, negative, and neutral; the title entities are the place names, person names, and organization names appearing in the article title.
Further, the article features include text features, which include the text basic composition, the text entities, the composition elements, the average paragraph length, and the topic to which the article belongs, wherein: the text basic composition comprises the length, the number of words, and the number of digits of the article text; the text entities are the place names, person names, and organization names appearing in the article text; the composition elements are the number of paragraphs, pictures, web page links, and audio and video items in the article text; the average paragraph length is the average number of words per paragraph in the article text; the topic to which the article belongs is the topic category obtained by classifying the topic of the article text with a classification model.
Further, the article features include "title party" (clickbait) features, which include whether the title is ambiguous, the punctuation marks, the number of interrogative words, the number of referring words, the number of degree adverbs, and the number of emotional words, wherein: whether the title is ambiguous means whether there are ambiguous pronouns in the article title; the punctuation feature is the number of the punctuation marks "?" and "!" in the article title; the number of interrogative words, the number of referring words, the number of degree adverbs, and the number of emotional words are the numbers of such words appearing in the article title.
Further, the article features include time features, which include the article release time, the time reading amount, and the capturing interval, wherein: the article release time comprises the month, the day, the hour, and the day of the week on which the article is released; the time reading amount is the mean and variance of the reading amounts of articles released in the same hour and on the same day of the week as the article; the capturing interval is the time interval between the release time of the article and the time at which its reading amount is captured.
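The time features described above can be sketched in Python as follows. The function names, the `(release_time, reads)` history format, and the use of population variance are assumptions made for illustration only.

```python
from datetime import datetime
from statistics import mean, pvariance


def time_features(release_time, capture_time):
    """Release-time components plus the capturing interval in hours."""
    interval = capture_time - release_time
    return {
        "month": release_time.month,
        "day": release_time.day,
        "hour": release_time.hour,
        "weekday": release_time.weekday(),  # Monday = 0 ... Sunday = 6
        "capture_interval_hours": interval.total_seconds() / 3600,
    }


def slot_read_stats(history, release_time):
    """Mean and variance of reads for past articles released in the same
    hour and on the same day of the week. `history` is a list of
    (release_time, reads) pairs; population variance is an assumption."""
    reads = [r for t, r in history
             if t.hour == release_time.hour
             and t.weekday() == release_time.weekday()]
    if not reads:
        return 0.0, 0.0
    return mean(reads), pvariance(reads)
```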
A system for predicting the reading amount of WeChat public number articles, the system comprising: a model training module for training an XGBoost classification model and an XGBoost regression model, respectively, on a WeChat data set; a data acquisition module for acquiring article features of an article to be detected; a super article prediction module for judging, with the trained XGBoost classification model and from the article features, whether the article to be detected is a super article; and a reading amount prediction module for determining, with the trained XGBoost regression model and from the article features, the predicted reading amount of the article to be detected.
Due to the adoption of the technical scheme, the invention has the following advantages:
1. With the method, a predicted reading amount can be obtained from the XGBoost models before the article is released, based on the article features of the article to be detected. This gives the author of the article guidance for revision, improving the author's working efficiency while helping the article obtain a higher reading amount.
2. The method divides the prediction of the article reading amount into two parts: predicting whether the article is a super article, for which an XGBoost classification model is adopted, and accurately predicting the reading amount of a non-super article, for which an XGBoost regression model is adopted. The structure, content, and other aspects of the article are analyzed in detail through the acquired article features to determine the predicted reading amount of the article to be detected. The method can provide guidance to the author for revising the article, improves the working efficiency of authors and related staff, helps articles obtain a higher reading amount, and can be widely applied in the field of data prediction.
Drawings
Fig. 1 is a schematic diagram of basic information of the WeChat data set obtained in the present invention, wherein Fig. 1(a) shows the title information of the WeChat data set (abscissa: title length; ordinate: number of articles), and Fig. 1(b) shows the body text information of the WeChat data set (abscissa: text length; ordinate: number of articles);
Fig. 2 is a bar graph of the monthly number of articles published by the public numbers in the WeChat data set (abscissa: number of articles; ordinate: number of public numbers);
Fig. 3 is a statistical schematic diagram of the number of articles released and their reading amounts in the WeChat data set, wherein Fig. 3(a) shows the percentage of WeChat articles released in each hour of the day (abscissa: hour; ordinate: percentage), Fig. 3(b) shows the share of WeChat articles released on each day of the week (abscissa: day of the week; ordinate: percentage), Fig. 3(c) shows the average reading amount of WeChat articles released in each hour of the day (abscissa: hour; ordinate: average reading amount), and Fig. 3(d) shows the average reading amount of WeChat articles released on each day of the week (abscissa: day of the week; ordinate: average reading amount);
Fig. 4 is a graph of the reading amount growth rate and the average daily reading amount after the articles in the WeChat data set are published.
Detailed Description
The present invention is described in detail below with reference to the attached drawings. It is to be understood, however, that the drawings are provided solely for the purposes of promoting an understanding of the invention and that they are not to be construed as limiting the invention.
The method for predicting the reading amount of the WeChat public number article provided by the invention relates to the relevant contents of the WeChat public number and the reading amount prediction, and the relevant contents are introduced below so that the contents of the method are more clear to the technical personnel in the field.
WeChat is one of the social platforms with the largest number of users in China. It provides users with two types of accounts, personal accounts and public numbers, wherein a public number is oriented to the general public and one of its main functions is to publish articles. An individual user can follow a public number, receive and read the articles it publishes, and forward those articles through a personal account to share them with more individual users. Under this mechanism, pushing articles to users through public numbers has become an important channel of information dissemination, and obtaining a high reading amount has become a goal that public numbers pursue. In order to determine reasonable features for predicting the reading amount of an article, the invention obtains a WeChat data set and observes and analyzes it as follows:
1. Basic information
22,554,445 articles published by 1,692,012 public numbers in one month (2/10/2018 to 3/10) are acquired as the WeChat data set, with a total reading amount of 26,460,684. Basic information such as the title length and body text length of the articles is counted, and the results are shown in Fig. 1(a) and (b). The body text of some articles contains only pictures or other forms of information.
2. Public number information
The publishing frequency of the public numbers in the acquired WeChat data set varies widely: the most active public number published 695 articles in the month (about 23 per day), while the least active published only one. The overall frequency distribution is shown in Fig. 2. Most public numbers (72.32%) published about 10 articles in the month, and 11.81% of public numbers published more than 30 (more than one per day). The invention therefore considers that the activity level (publishing frequency) of a public number influences the reading amount of its articles.
3. Time information
The invention observes that the number of articles published by public numbers differs across time points in a day. The specific distributions are shown in Figs. 3(a) to (d) and exhibit several notable features: a) there are several publishing peaks during a day, at 11 AM, 5 PM, and 8 PM, which coincide with typical off-work hours; the invention therefore speculates that public numbers choose these times to publish in order to increase the reading amount of their articles. b) The number of articles published varies across the days of the week, with the peak occurring on Saturday. Based on these features, the invention uses the mean and variance of the reading amounts at different times as part of the features for predicting the reading amount.
4. Reading amount information
The reading amount of a WeChat article grows dynamically over time. To study how it grows, 1,600 public number articles published on July 4, 2018 were selected, and their reading amounts were collected every 24 hours from the publication time. The resulting changes are shown in Fig. 4: both the growth rate and the absolute daily increase of the reading amount gradually decrease, and after one week the daily reading amount is generally below 10.
Since articles whose reading amount exceeds 100,000 are regarded as super articles and their reading amount is displayed only as "100,000+", the exact reading amount of such articles cannot be obtained, and directly predicting the reading amount cannot be treated as a pure regression problem. The invention therefore divides the prediction task into two consecutive sub-tasks: first predicting whether an article is a super article; and, if it is not, predicting the specific reading amount it is likely to obtain.
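The two consecutive sub-tasks can be sketched as a small dispatch function. This is a minimal illustration, not the patented implementation: `classifier` and `regressor` are assumed to be any fitted objects exposing a scikit-learn-style `predict` method, and `ConstantModel` is a hypothetical stand-in used only to make the example runnable.

```python
class ConstantModel:
    """Hypothetical stand-in for a trained model; always predicts `value`."""

    def __init__(self, value):
        self.value = value

    def predict(self, rows):
        return [self.value for _ in rows]


def predict_reading_amount(features, classifier, regressor):
    """Two-stage prediction: classify first, regress only non-super articles."""
    # Stage 1: label 1 means the article is predicted to be a super article.
    if int(classifier.predict([features])[0]) == 1:
        # Exact counts above 100,000 are not observable, so none is returned.
        return {"is_super": True, "reads": None}
    # Stage 2: estimate the specific reading amount of a non-super article.
    reads = float(regressor.predict([features])[0])
    return {"is_super": False, "reads": reads}
```

Returning no numeric value for a super article mirrors the fact that the platform only displays "100,000+" for such articles.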
Example one
Based on the above description, the method for predicting the reading amount of the WeChat public articles provided by the invention comprises the following steps:
1) training an XGBoost classification model and an XGBoost regression model, respectively, on a WeChat data set, specifically comprising the following steps:
1.1) according to the release time of the articles, dividing the WeChat data set into a training set, a verification set, and a test set, wherein the training set is used for training the XGBoost models, the verification set is used for adjusting the parameters of the XGBoost models (such as max_depth, learning_rate, n_estimators, and the like), and the test set is used for testing the XGBoost models.
1.2) determining the sample label (positive or negative) of each WeChat article in the WeChat data set, wherein a super article is an article whose reading amount exceeds 100,000; a WeChat article that is a super article is labeled as a positive sample, and a WeChat article that is a non-super article is labeled as a negative sample.
1.3) training an XGBoost classification model on the WeChat data set:
1.3.1) constructing an XGBoost classification model, wherein the evaluation indexes adopted for the classification task comprise accuracy, precision, recall, and F1 score. The construction of the XGBoost classification model is a method disclosed in the prior art, and the specific process is not repeated here.
1.3.2) training the XGBoost classification model with the WeChat articles labeled as positive samples in the training set and a portion of the WeChat articles labeled as negative samples, to obtain a super article classification model. Because non-super articles in the WeChat data set far outnumber super articles, only a portion of the non-super articles are selected as negative samples, so that the ratio of negative samples to positive samples used for model training is 1:1.
1.3.3) adjusting the parameters of the super article classification model on the verification set, and testing the super article classification model on the test set, to obtain the trained super article classification model.
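The 1:1 selection of negative samples in step 1.3.2) can be sketched as follows. The source does not say how the negative subset is chosen; uniform random sampling with a fixed seed is an assumption here, as is the function name.

```python
import random


def balance_negatives(samples, labels, seed=0):
    """Keep all positive samples and an equally sized random subset of
    negatives, giving a 1:1 ratio. Random selection is an assumption."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    kept_neg = rng.sample(neg, min(len(pos), len(neg)))
    idx = pos + kept_neg
    rng.shuffle(idx)  # avoid feeding all positives before all negatives
    return [samples[i] for i in idx], [labels[i] for i in idx]
```

The balanced lists would then be passed to the classifier's training call; the verification and test sets are left untouched.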
1.4) training an XGBoost regression model on the WeChat data set:
1.4.1) constructing an XGBoost regression model, wherein the evaluation indexes of the XGBoost regression model comprise the mean absolute error MAE, the root mean square error RMSE, and the coefficient of determination $R^2$. The construction of the XGBoost regression model is a method disclosed in the prior art, and the specific process is not described here.
The mean absolute error MAE, the root mean square error RMSE, and the coefficient of determination $R^2$ are respectively:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
$$R^2 = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\mathrm{Var}(y)}, \qquad \mathrm{Var}(y) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$
wherein $y_i$ represents the target value of the i-th WeChat article; $\hat{y}_i$ represents the predicted value of the i-th WeChat article; $n$ represents the number of WeChat articles; $\bar{y}$ represents the average of the target values; $\mathrm{Var}(y)$ represents the variance of the target values of all WeChat articles.
1.4.2) training the XGBoost regression model by taking the article features of the WeChat articles in the training set as samples and the reading amounts of the WeChat articles as labels, to obtain a regression model for predicting the reading amount of WeChat articles.
1.4.3) adjusting the parameters of the reading amount regression model on the verification set, and testing the reading amount regression model on the test set, to obtain the trained reading amount regression model.
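The three regression indexes named in step 1.4.1) can be computed directly from their standard definitions. A minimal sketch; the function name and the NaN fallback for constant targets are illustrative choices:

```python
import math


def regression_metrics(y_true, y_pred):
    """MAE, RMSE and R^2, with R^2 = 1 - MSE / Var(y) (population variance)."""
    n = len(y_true)
    mae = sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n
    mse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n
    mean_y = sum(y_true) / n
    var_y = sum((y - mean_y) ** 2 for y in y_true) / n
    # R^2 is undefined when all targets are identical.
    r2 = 1 - mse / var_y if var_y else float("nan")
    return {"mae": mae, "rmse": math.sqrt(mse), "r2": r2}
```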
2) acquiring the article to be detected and its article features, wherein the article features comprise historical information features, title features, text features, "title party" features, time features, and other features, specifically:
2.1) obtaining the historical information characteristics of the article to be detected, including the historical sending frequency and the historical reading quantity of the public number to which the article to be detected belongs:
2.1.1) given the current article a to be detected and the public number o to which it belongs, defining the historical posting frequency as the total number of articles published by the public number o within the time t before the article a, with the time t set to one month, two weeks, one week, 3 days, and 1 day respectively; this feature mainly measures the activity level of the public number.
2.1.2) obtaining the total, mean, variance, and median of the reading amounts obtained by the public number o within the time t; this information is used as the historical reading amount and mainly reflects the popularity of the public number itself.
Experiments show that the historical information features are very effective for predicting the reading amount, for the following reasons: a historically popular public number has more followers, so its published articles can be seen by more users; and the historical information reflects the quality of the public number's past articles, which indirectly indicates the quality of the current article.
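Steps 2.1.1) and 2.1.2) can be illustrated with a short sketch that computes the posting count and reading statistics over each window t. Treating one month as 30 days, using population variance, returning zeros for empty windows, and the feature naming are all assumptions of this example.

```python
from datetime import datetime, timedelta
from statistics import mean, median, pvariance

# Window lengths in days; "one month" is taken as 30 days (an assumption).
WINDOWS = {"1m": 30, "2w": 14, "1w": 7, "3d": 3, "1d": 1}


def history_features(account_posts, release_time):
    """Historical features of public number o before article a.
    `account_posts` is a list of (release_time, reads) pairs for o."""
    feats = {}
    for name, days in WINDOWS.items():
        start = release_time - timedelta(days=days)
        reads = [r for t, r in account_posts if start <= t < release_time]
        feats["count_" + name] = len(reads)  # historical posting frequency
        feats["sum_" + name] = sum(reads)
        feats["mean_" + name] = mean(reads) if reads else 0.0
        feats["var_" + name] = pvariance(reads) if reads else 0.0
        feats["median_" + name] = median(reads) if reads else 0.0
    return feats
```

Only posts strictly earlier than the release time are counted, so the features remain available before the article is published.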
2.2) acquiring title characteristics of the article to be detected, including title basic composition, emotional attribute and title entity:
2.2.1) determining the basic title components of the article title to be detected, including the title length, the number of words and the number of digits.
2.2.2) adopting an emotion classification model to carry out emotion classification on the title of the article to be detected, and acquiring the emotion attributes of the title of the article to be detected, wherein the emotion attributes comprise positive, negative and neutral, the emotion classification model can adopt an emotion classification model disclosed in the prior art, and the specific classification process is not repeated herein.
2.2.3) determining the number of entities such as place names, person names, organization names and the like appearing in the titles of the articles to be tested.
2.3) obtaining the text characteristics of the article to be detected, including text basic composition, text entity, composition elements, average paragraph length and topic of the article:
2.3.1) determining the basic text composition of the text of the article to be detected, wherein the basic text composition comprises the length of the article, the number of words and the number of digits.
2.3.2) determining the number of entities such as place names, person names, organization names and the like appearing in the text of the article to be detected.
2.3.3) determining the composition elements of the text of the article to be detected, including the number of paragraphs, pictures, web page links, and audio and video items, and determining the average paragraph length of the text, namely the average number of words per paragraph.
2.3.4) classifying the subjects of the texts of the articles to be detected by adopting a classification model, and determining the topic categories of the texts of the articles to be detected, wherein the topic categories comprise literature, education, entertainment, culture, science and technology, military, history, society and law, the classification model can adopt the classification model disclosed in the prior art, and the specific classification process is not repeated herein.
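Part of step 2.3.3) can be sketched on plain text as shown below. The sketch assumes that paragraphs are separated by line breaks, that links are recognisable by an `http(s)://` prefix, and that words are whitespace tokens; the function name and keys are hypothetical. Counting pictures and audio or video items would require the article markup and is omitted.

```python
import re


def body_composition(text):
    """Count paragraphs and links, and compute the average paragraph length."""
    # Non-empty lines are treated as paragraphs (an assumption).
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    link_count = len(re.findall(r"https?://\S+", text))
    word_counts = [len(p.split()) for p in paragraphs]
    avg_len = sum(word_counts) / len(paragraphs) if paragraphs else 0.0
    return {
        "paragraph_count": len(paragraphs),
        "link_count": link_count,
        "avg_paragraph_length": avg_len,   # average words per paragraph
    }


example = body_composition("First paragraph here.\n\nSee https://example.com now.\n")
```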
2.4) acquiring the "title party" (clickbait) features of the article to be tested, including whether the title is ambiguous, punctuation marks, the number of interrogative words, the number of referring words, the number of degree adverbs and the number of emotional words:
2.4.1) determining whether the title of the article to be tested is ambiguous, for example by determining whether an ambiguous pronoun exists in the title. For instance, in the title "he is the largest winner of the skill circle in this year", the pronoun "he" has no specific referent, so the title is ambiguous. Specifically, a classification model is trained on a small number of labeled article titles and is then used to classify the title of the article to be tested as ambiguous or not.
2.4.2) determining the number of the punctuation marks "?" and "!" in the title of the article to be tested.
2.4.3) determining the number of interrogative words, the number of referring words, the number of degree adverbs (such as "extraordinarily") and the number of emotional words in the title of the article to be tested, wherein the emotional words are words expressing positive or negative evaluations or positive or negative emotions.
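The counting features of steps 2.4.2) and 2.4.3) might look like the following sketch. The three word lists are tiny hypothetical English lexicons used purely for illustration; the source does not specify any lexicon, and a real system would use Chinese word lists and a segmenter. The ambiguity classifier of step 2.4.1) and the emotional-word count are not covered.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
INTERROGATIVES = {"what", "why", "how", "who"}
REFERRING_WORDS = {"he", "she", "it", "this", "they"}
DEGREE_ADVERBS = {"extremely", "very", "most"}


def clickbait_counts(title):
    """Count punctuation marks and lexicon hits in a title."""
    tokens = re.findall(r"[a-z']+", title.lower())
    return {
        # Both half-width and full-width marks are counted.
        "question_marks": sum(title.count(c) for c in "?？"),
        "exclamation_marks": sum(title.count(c) for c in "!！"),
        "interrogative_count": sum(t in INTERROGATIVES for t in tokens),
        "referring_count": sum(t in REFERRING_WORDS for t in tokens),
        "degree_adverb_count": sum(t in DEGREE_ADVERBS for t in tokens),
    }


example = clickbait_counts("Why is he the most extremely lucky winner?!")
```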
2.5) acquiring the time features of the article to be tested, including the article release time, the time reading amount and the capturing interval:
2.5.1) acquiring the release time of the article to be tested, including the month, day, hour and day of the week of release.
2.5.2) obtaining, as the time reading amount, the mean and variance of the reading amounts of articles released in the same hour and on the same day of the week as the article to be tested.
2.5.3) obtaining the capturing interval of the article to be tested, namely the time interval between its release time and the time at which its reading amount was captured.
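The time features of step 2.5) can be sketched with the standard `datetime` module. The function names, the use of hours as the interval unit, the `(timestamp, reads)` history format and the population variance are all assumptions made for this illustration.

```python
from datetime import datetime
from statistics import mean, pvariance


def release_time_features(published_at, captured_at):
    """Release-time components and the capturing interval in hours."""
    return {
        "month": published_at.month,
        "day": published_at.day,
        "hour": published_at.hour,
        "weekday": published_at.weekday(),   # Monday = 0
        "capture_interval_hours":
            (captured_at - published_at).total_seconds() / 3600,
    }


def slot_reading_stats(history, hour, weekday):
    """Mean and variance of reads for articles released in the same
    hour and on the same weekday. history: list of (datetime, reads)."""
    reads = [r for ts, r in history
             if ts.hour == hour and ts.weekday() == weekday]
    if not reads:
        return {"mean": 0.0, "variance": 0.0}
    return {"mean": mean(reads), "variance": pvariance(reads)}


example = release_time_features(datetime(2020, 1, 20, 8, 30),
                                datetime(2020, 1, 21, 8, 30))
```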
2.6) acquiring other characteristics of the article to be detected, including the ranking position, namely acquiring the ranking position of the article to be detected in the list when the article is released.
3) Judging, with the trained XGBoost classification model and according to the article features of the article to be tested, whether the article is a super article; if yes, the predicted reading amount of the article is more than 100,000; if not, going to step 4).
4) Determining the predicted reading amount of the article to be tested with the trained XGBoost regression model according to its article features.
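The two-stage decision of steps 3) and 4) can be sketched as follows. The sketch assumes only that both models expose a scikit-learn style `predict` method taking a list of feature rows, as the `XGBClassifier` and `XGBRegressor` wrappers of the xgboost package do (input conversion to arrays may be needed in practice). The stub classes below are hypothetical stand-ins so that the example runs without trained models; they are not the models of the invention.

```python
def predict_reading_amount(features, classifier, regressor):
    """Classify first; run the regressor only for non-super articles."""
    is_super = int(classifier.predict([features])[0]) == 1
    if is_super:
        # Step 3): a super article receives no point estimate, only the
        # outcome "above the super-article threshold".
        return {"super_article": True, "predicted_reads": None}
    # Step 4): otherwise the regression model supplies the estimate.
    reads = float(regressor.predict([features])[0])
    return {"super_article": False, "predicted_reads": reads}


class StubClassifier:
    """Hypothetical stand-in for a trained classification model."""
    def predict(self, rows):
        return [1 if row[0] > 50_000 else 0 for row in rows]


class StubRegressor:
    """Hypothetical stand-in for a trained regression model."""
    def predict(self, rows):
        return [0.5 * row[0] for row in rows]


example = predict_reading_amount([1000.0], StubClassifier(), StubRegressor())
```

Keeping the classifier in front means that the regressor never has to produce a value for articles whose reading amount exceeds the super-article threshold.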
Embodiment 2
This embodiment provides a system for predicting the reading amount of WeChat public number articles, which comprises:
a model training module, used for training an XGBoost classification model and an XGBoost regression model on the WeChat data set;
a data acquisition module, used for acquiring the article features of the article to be tested;
a super article prediction module, used for judging, with the trained XGBoost classification model and according to the article features of the article to be tested, whether the article is a super article;
and a reading amount prediction module, used for determining the predicted reading amount of the article to be tested with the trained XGBoost regression model according to its article features.
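Before the model training module fits either model, the WeChat data set is divided by article release time into non-overlapping training, verification and test sets. A minimal sketch of such a chronological split is given below; the 80/10/10 proportions, the function name and the `published_at` key are assumptions, since the source does not state them.

```python
def chronological_split(articles, train_frac=0.8, val_frac=0.1):
    """Split articles into train / verification / test sets by release
    time, so that later articles never leak into earlier sets."""
    ordered = sorted(articles, key=lambda a: a["published_at"])
    n = len(ordered)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# ISO date strings sort chronologically, so they work as sort keys here.
sample = [{"id": i, "published_at": f"2020-01-{i:02d}"}
          for i in range(10, 0, -1)]
train, val, test = chronological_split(sample)
```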
The above embodiments are only used for illustrating the present invention, and the structure, connection mode, manufacturing process, etc. of the components may be changed, and all equivalent changes and modifications performed on the basis of the technical solution of the present invention should not be excluded from the protection scope of the present invention.
Claims (10)
1. A method for predicting the reading amount of WeChat public number articles, characterized by comprising the following steps:
1) respectively training an XGBoost classification model and an XGBoost regression model on the WeChat data set;
2) acquiring article characteristics of an article to be detected;
3) judging whether the article to be detected is a super article or not according to the article characteristics of the article to be detected by adopting the trained XGBoost classification model, wherein if yes, the predicted value of the reading amount of the article to be detected is more than 100,000; if not, entering step 4);
4) determining a predicted value of the reading amount of the article to be detected according to the article characteristics of the article to be detected by adopting the trained XGBoost regression model.
2. The method for predicting the reading amount of WeChat public number articles as claimed in claim 1, wherein the specific process of step 1) is as follows:
1.1) dividing the WeChat data set into a training set, a verification set and a test set according to the release time of the articles, wherein the sets do not overlap;
1.2) determining whether each WeChat article in the WeChat data set is a positive or negative sample, wherein a WeChat article that is a super article is represented as a positive sample, and a WeChat article that is a non-super article is represented as a negative sample;
1.3) training an XGBoost classification model on the WeChat data set;
1.4) training an XGBoost regression model on the WeChat data set.
3. The method for predicting the reading amount of WeChat public number articles as claimed in claim 2, wherein the specific process of step 1.3) is as follows:
1.3.1) constructing an XGBoost classification model, wherein the evaluation indexes adopted for the classification task of the XGBoost classification model comprise accuracy, precision, recall and F1 score;
1.3.2) training the XGBoost classification model with the WeChat articles represented as positive samples in the training set and a part of the WeChat articles represented as negative samples;
1.3.3) adjusting the parameters of the XGBoost classification model on the verification set, and testing the XGBoost classification model on the test set to obtain the trained XGBoost classification model.
4. The method for predicting the reading amount of WeChat public number articles as claimed in claim 2, wherein the specific process of step 1.4) is as follows:
1.4.1) constructing an XGBoost regression model, wherein the evaluation indexes of the XGBoost regression model comprise the mean absolute error MAE, the root mean square error RMSE and the coefficient of determination $R^2$:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
wherein $y_i$ represents the target value of the i-th WeChat article; $\hat{y}_i$ represents the predicted value of the i-th WeChat article; $n$ represents the number of WeChat articles;
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} = 1 - \frac{\mathrm{RMSE}^2}{\mathrm{Var}(y)}$$
wherein $\bar{y}$ represents the average of the target values; $\mathrm{Var}(y) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$ represents the variance of the target values of all WeChat articles;
1.4.2) training the XGBoost regression model by taking the article characteristics of the WeChat articles in the training set as samples and the reading amounts of the WeChat articles as labels;
1.4.3) adjusting the parameters of the XGBoost regression model on the verification set, and testing the XGBoost regression model on the test set to obtain the trained XGBoost regression model.
5. The method for predicting the reading amount of WeChat public number articles according to claim 1, wherein the article characteristics comprise historical information characteristics, and the historical information characteristics comprise the historical issuance frequency and the historical reading amount of the public number to which the article to be tested belongs, wherein:
the historical issuance frequency is the total number of articles issued by the public number o within the time t before the article a;
the historical reading amount is the total, mean, variance and median of the reading amounts obtained by the public number o within the time t.
6. The method for predicting the reading amount of WeChat public number articles as claimed in claim 1, wherein the article characteristics comprise title characteristics, the title characteristics comprise title basic composition, emotional attributes and title entities, and wherein:
the title basic composition comprises the title length, the number of words and the number of digits of the article title;
the emotional attributes are obtained by performing emotion classification on the title of the article with an emotion classification model and comprise positive, negative and neutral;
the title entities are place names, person names and organization names appearing in the article title.
7. The method for predicting the reading amount of WeChat public number articles as claimed in claim 1, wherein the article characteristics comprise text characteristics, the text characteristics comprise text basic composition, text entities, composition elements, average paragraph length and the topic of the article, and wherein:
the text basic composition comprises the length of the article, the number of words and the number of digits of the text;
the text entities are place names, person names and organization names appearing in the text of the article;
the composition elements are the number of paragraphs, the number of pictures, the number of web page links and the number of audio and video items in the text of the article;
the average paragraph length is the average number of words per paragraph in the article text;
the topic of the article is a topic category obtained by classifying the topic of the article text with a classification model.
8. The method of claim 1, wherein the article characteristics include "title party" (clickbait) characteristics, the "title party" characteristics including whether the title is ambiguous, punctuation marks, the number of interrogative words, the number of referring words, the number of degree adverbs, and the number of emotional words, wherein:
whether the title is ambiguous is whether there are ambiguous pronouns in the title of the article;
the punctuation marks are the number of the marks "?" and "!" in the title of the article;
the number of interrogative words, the number of referring words, the number of degree adverbs, and the number of emotional words are the numbers of interrogative words, referring words, degree adverbs, and emotional words appearing in the title of the article.
9. The method for predicting the reading amount of WeChat public number articles as claimed in claim 1, wherein the article characteristics include time characteristics, the time characteristics including article release time, time reading amount and capturing interval, and wherein:
the article release time comprises the month, day, hour and day of the week of article release;
the time reading amount is the mean and variance of the reading amounts of articles released in the same hour and on the same day of the week as the article;
the capturing interval is the time interval between the release time of the article and the time at which its reading amount was captured.
10. A system for predicting the reading amount of WeChat public number articles, the system comprising:
the model training module is used for respectively training an XGBoost classification model and an XGBoost regression model on the WeChat data set;
the data acquisition module is used for acquiring article characteristics of the article to be detected;
the super article prediction module is used for judging whether the article to be detected is a super article or not according to the article characteristics of the article to be detected by adopting the trained XGBoost classification model;
and the reading amount prediction module is used for determining the predicted value of the reading amount of the article to be detected according to the article characteristics of the article to be detected by adopting the trained XGBoost regression model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010065180.7A CN111260145A (en) | 2020-01-20 | 2020-01-20 | Method and system for predicting reading amount of WeChat public number article |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111260145A (en) | 2020-06-09 |
Family
ID=70947101
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010065180.7A Pending CN111260145A (en) | 2020-01-20 | 2020-01-20 | Method and system for predicting reading amount of WeChat public number article |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111260145A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170140291A1 (en) * | 2015-11-18 | 2017-05-18 | Institute For Information Industry | Method of predicting social article influence and device using the same |
CN110019776A (en) * | 2017-09-05 | 2019-07-16 | 腾讯科技(北京)有限公司 | Article classification method and device, storage medium |
Non-Patent Citations (2)
Title |
---|
刘佐: "基于微信公众平台的数据挖掘与可视化研究" * |
曾颖;李志涛;周燕;: "基于文本挖掘的文章特征提取及流量控制", 电子技术与软件工程 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112015965A (en) * | 2020-08-27 | 2020-12-01 | 中国搜索信息科技股份有限公司 | New media manuscript heat degree calculation method |
US20220318488A1 (en) * | 2021-03-31 | 2022-10-06 | Storyroom Inc. | System and method of content brief generation using machine learning |
WO2022212773A1 (en) * | 2021-03-31 | 2022-10-06 | Storyroom Inc. | System and method of content brief generation using machine learning |
US11947898B2 (en) * | 2021-03-31 | 2024-04-02 | Storyroom Inc. | System and method of content brief generation using machine learning |
CN114580373A (en) * | 2022-02-22 | 2022-06-03 | 四川大学 | Intelligent environment-friendly propaganda and education method combining text theme analysis, emotion analysis and GSVM |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||