CN114065054A - Method and device for pushing information - Google Patents
- Publication number
- CN114065054A, CN202111429883.4A
- Authority
- CN
- China
- Prior art keywords
- user
- nums
- item
- content
- click
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9536—Search customisation based on social or collaborative filtering
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides a method and a device for pushing information. The method mainly comprises the following steps: a user pulls down to refresh on the home page of a client, which initiates a request; the client transmits the request to the server interface together with the user and user context features, and also sends the user behavior log to a message queue; after receiving the request sent by the client, the server calls the feature pushing system interface and sends the features required by the feature pushing algorithm model to the feature pushing system; after receiving the request sent by the server, the recommendation system requests recommended content from the recall system; and after receiving the content sets respectively returned by the recall system, the server ranks them.
Description
Technical Field
The invention relates to the field of information technology, in particular to a method and a device for determining suitable user feature information through data mining or big data, and more particularly to a method and a device for pushing information.
Background
Currently, the internet has become the main channel through which most people obtain information, and the amount of information on the internet has accordingly grown exponentially. In particular, with the popularization and development of mobile devices, the amount of information keeps increasing and users' attention is spread too thin, so that it is difficult for them to obtain useful information.
However, prior-art search engines rely on the user entering keywords for active retrieval. On mobile internet terminals in particular, this gives a poor user experience, and it does not solve the problem of having the system automatically learn the user's interests and push content accordingly.
Therefore, providing valuable and interesting information to users, centered on their points of interest, is an urgent problem. Two of the more acute challenges are the cold start of the recommendation system and the continuous updating of recommendation information. The cold-start problem is: when a platform acquires a new user, how can it recommend content of interest to that user while lacking sufficient data? A common solution is to obtain, in a suitable manner, the user's information on various social networks, including but not limited to social accounts such as WeChat, QQ, Paibao, and the like, and to analyze this data to derive effective recommendations.
However, the content a user browses and publishes on social networks is often messy and mixes many kinds of information. It also contains a large amount of new internet slang, such as yyds, and many homophone-based words, which prior-art NLP networks often have difficulty understanding, so recommendation accuracy is greatly restricted.
The challenge of continuously updating recommendation information is that, even after a user has accumulated a certain amount of usage behavior on the platform and the platform itself has obtained fuller information, the user's interests are often reflected on other social platforms that the user has used for longer, for example platforms with higher user stickiness such as WeChat. For some specialized platforms, relying only on their own data inevitably deviates from the user's real interests.
No good solution to the above problems in the prior art is currently available.
Disclosure of Invention
The invention discloses a method and a device for pushing information.
The invention discloses a device for pushing information, characterized in that: the device comprises a feature pushing system, a recall system, a ranking system and/or a reordering system, wherein the recall system acquires candidate information related to a user from a material library;
the ranking system processes the information provided by the recall system;
the reordering system processes the output of the ranking system;
the material library comprises bright news contents.
The feature pushing system further comprises a content data dimension tag subsystem, which processes first user data obtained from social network software to obtain a feature description of the user.
The content data dimension tag subsystem processes the first user data according to specific dimensions, which comprise a time dimension, an interaction quantity dimension, a content dimension and/or a user dimension.
The time dimension further includes a publication time, a last reply time, and/or a last operation time.
The interaction quantity dimension further includes the numbers of reads, replies, collections, likes, reply likes, and/or shares.
The content dimension further includes a content length, an average reply length, and/or a number of pictures.
The user dimension further includes user interest, liveness of content, and/or reputation of posting/replying users, etc.
The interaction quantity dimension is determined by weighting the data. Preferably, the quantities carry different weights: the weight of the read count is usually the lowest, while the weights of the reply, like, collection and share counts are relatively higher and are determined according to the service scenario.
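The weighting described above can be illustrated with a minimal Python sketch. The weight values and the names `ASSUMED_WEIGHTS` and `interaction_score` are hypothetical; the text only states that the read count has the lowest weight and that the other weights depend on the service scenario.

```python
from typing import Dict

# Hypothetical weights (not from the source): reads weigh least, the other
# interaction types weigh more and would be tuned per service scenario.
ASSUMED_WEIGHTS: Dict[str, float] = {
    "reads": 0.1,
    "replies": 1.0,
    "likes": 0.5,
    "collections": 1.5,
    "shares": 2.0,
}


def interaction_score(counts: Dict[str, int],
                      weights: Dict[str, float] = ASSUMED_WEIGHTS) -> float:
    """Weighted sum of interaction counts; missing counts are treated as 0."""
    return sum(w * counts.get(name, 0) for name, w in weights.items())


score = interaction_score({"reads": 1000, "replies": 12, "likes": 40, "shares": 3})
print(score)
```

With such a scheme, a post with many reads but few replies or shares ranks below a post that is read less but discussed and shared more.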
A method for pushing information, comprising the steps of:
step A1: a user pulls down to refresh on the home page of a client, initiating a request;
step A2: the client transmits the request to the server interface; at the same time, the client sends the user and user context features to the server interface and sends the user behavior log to a message queue;
step A3: after receiving the request sent by the client, the server calls the feature pushing system interface and sends the features required by the feature pushing algorithm model to the feature pushing system;
step A4: after receiving the request sent by the server, the recommendation system requests recommended content from the recall system;
step A5: after receiving the content sets respectively returned by the recall system, the server ranks them.
The user context features include: user ID, device model, system, request time, request IP, location, etc.;
the step A3 further includes:
step A3-1: storing the user behavior data in the subscription message database into a data warehouse, partitioned according to specific time characteristics;
step A3-2: looking up, from the wide table of the data warehouse, the user's activity features for each day in a specific past period, and storing them in a first data structure;
step A3-3: computing a weight score by decay-weighted accumulation, using a user-defined function (UDF), from the user's behavior counts in each channel over the last 15 days;
the calculation formula of the UDF is: $\text{interest score} = \sum_{i} n_i \times a^{i}$, where $n_i$ is the number of behaviors over i days and $a$ represents how quickly the user's interest in the channel content decays; $a$ usually ranges from 0.5 to 0.99, and the optimal value of $a$ was finally determined to be 0.95 through multiple tests;
step A3-4: performing outlier processing on the weight scores, namely setting a null weight value to 0 and rounding a non-null weight value to two decimal places.
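Steps A3-3 and A3-4 can be sketched in Python as follows. This is an illustration, not the UDF itself: the function names are hypothetical, and the choice that the first entry of `daily_counts` corresponds to i = 1 is an assumption, since the text does not state where the index starts.

```python
from typing import Optional, Sequence


def interest_score(daily_counts: Sequence[int], a: float = 0.95) -> float:
    """Sum of n_i * a**i over the supplied days.

    Assumption: daily_counts[0] is day i = 1 (the most recent day).
    """
    return sum(n * a ** i for i, n in enumerate(daily_counts, start=1))


def clean_score(score: Optional[float]) -> float:
    """Step A3-4: null becomes 0; any other value is rounded to two decimals."""
    return 0.0 if score is None else round(score, 2)


raw = interest_score([10, 0, 4])  # 10 behaviors on day 1, 4 on day 3
print(clean_score(raw), clean_score(None))
```

A smaller `a` makes older behavior fade faster; with `a = 0.95`, a behavior from 15 days back still contributes a little under half of its original weight.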
The wide table is a data table that summarizes user-side information, content-side information and context features, and is used in preparation for training a machine learning model;
preferably, the user-side information includes user ID, gender, age, mobile phone model, mobile phone system, most recently clicked content, exposed content, shared content, collected content, the interest scores for the content of different channels recorded in the portrait, and the like;
the portrait is a detailed description of the user, generated by a real-time computation program that collects the user's various behaviors in real time; the description includes the user's behavior footprint, from which the user's interest score for the content of each channel is calculated;
the content-side information includes content ID, content title, content type, click count, exposure count (exposure means the content appeared on the current screen of the mobile phone), share count, secondary-share count (the number of secondary shares), collection count, comment count, click count in WeChat Moments, click rate, yield rate (the number of off-site clicks divided by the exposure count), sharing rate, secondary-share rate (the proportion of secondary shares), average reading time, and the like;
the context features include the user behavior time, the channel the user is currently in, the position of the article on the current screen, the sliding direction of the user's finger, and the like;
preferably, the step A3-4 further includes splicing the user's click, exposure, share, collection, click rate, sharing rate and other features for the past 1, 2, 3, ..., 15 days together and storing them as a single field;
the features are spliced in step A3-4 to avoid dimension explosion;
step A3-5: generating day-level features of the content side, specifically: finding the exposure, click, share, collection and comment counts of the content over the past 1, 3, 5 and 7 days from the data warehouse, and calculating a composite score through a second UDF, which is computed as follows:
step A3-5-a: first, min-max normalize the click rate, propagation rate, yield rate, secondary-share rate and sharing rate: normalized value = (current value - minimum)/(maximum - minimum);
step A3-5-b: calculate the normalized score = 5 × normalized propagation rate + 40 × normalized yield rate + 45 × normalized click rate + 5 × normalized secondary-share rate + 5 × normalized sharing rate;
step A3-5-c: calculate the weight score of each content = (exposure count + click count + share count + WeChat Moments click count + secondary-share count)/exposure count;
step A3-5-d: finally, obtain the composite score of the content from the weight score and the normalized score;
the features required by the feature pushing algorithm model include: request information, user-side information, and/or article-side information;
the subscription message database is preferably Kafka;
preferably, the storing into a data warehouse in step A3-1 is developed based on the Alibaba Cloud DataWorks framework;
preferably, hours and days are selected as the specific time characteristics;
preferably, the specific period is 10 to 20 days; more preferably, the specific period is 15 days;
preferably, the activity features of each day in step A3-2 are the activity features of each channel on each day; channels are the classified display of different types of content in the APP, intended to improve the user's reading experience, and each type of display is called a channel.
Preferably, the activity features include the numbers of exposures, clicks, shares, comments and collections;
preferably, the fields of the first data structure include: user ID, channel ID (e.g., american, health, hot), behavior type (e.g., exposure, click, share), number of corresponding behaviors over the past 1 day, over the past 2 days, over the past 3 days, ..., over the past 15 days;
the step A3 further includes real-time feature generation, comprising:
step A3-6: aggregating article data at minute level, specifically: the real-time feature generation computing framework consumes the user behavior log stream from the subscription message database, aggregates the content's exposure, click, share, collection, comment and other data over one minute, and stores the result in a second database;
step A3-7: sampling and determining the sample label, specifically: in the sampling stage, the label of each sample is determined, and the features and the label together form a complete sample. Generally, content that was exposed to the user and clicked is a positive sample, and content that was exposed but not clicked is a negative sample. Preferably, in negative sampling, in addition to exposed-but-not-clicked content, the two items below the last click in a request are also taken as negative samples, even if they were not exposed. After the label is determined, the features preceding the label are looked up in the database and the data warehouse by content ID and user ID and spliced into a complete sample.
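The labeling rule of this step can be sketched as follows. The function name and arguments are hypothetical, and two points are assumptions: "last click" is taken to mean the clicked item in the lowest on-screen position, and every clicked item is assumed to have been exposed.

```python
from typing import Dict, List, Set, Tuple


def label_request(items: List[str], exposed: Set[str], clicked: Set[str],
                  extra_below: int = 2) -> List[Tuple[str, int]]:
    """Label the items of one request; `items` is in on-screen order.

    Exposed and clicked -> 1, exposed and not clicked -> 0. The
    `extra_below` items right under the last clicked item are also
    labeled 0, even if they were never exposed.
    """
    labels: Dict[str, int] = {}
    for item in items:
        if item in exposed:
            labels[item] = 1 if item in clicked else 0
    clicked_positions = [i for i, item in enumerate(items) if item in clicked]
    if clicked_positions:
        last = clicked_positions[-1]
        for item in items[last + 1:last + 1 + extra_below]:
            labels.setdefault(item, 0)  # keep labels already assigned
    return [(item, labels[item]) for item in items if item in labels]


samples = label_request(["a", "b", "c", "d", "e", "f"],
                        exposed={"a", "b", "c"}, clicked={"b"})
print(samples)
```

In this example item `d` was never exposed, yet it becomes a negative sample because it sits within two positions below the last click.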
Preferably, the aggregation result in step A3-6 has the form: content ID, behavior type (exposure, click, share, comment, or collection), number of behaviors in one minute (e.g., for 100 clicks in one minute, the value of this column is 100), statistics start time, statistics end time (identifying which minute of the content the statistics cover); this result is stored in the database;
preferably, the database is HBase;
preferably, the real-time feature generation computing framework is Flink.
The aggregation is: accumulating and summing the number of user behaviors on the content within a certain time;
based on the minute-level aggregated data in HBase, values such as the exposure, click, share, collection and comment counts, click rate and sharing rate of the content over the last 1, 2, 4, 8 and 12 hours can be derived; these can be stored in a separate HBase table so that samples drawn later can conveniently obtain the corresponding features.
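The one-minute aggregation of step A3-6 can be sketched in plain Python as a stand-in for the streaming job. It does not use the Flink or HBase APIs; the event tuple layout and the function name are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed event layout: (content_id, behavior_type, unix_timestamp_seconds)
Event = Tuple[str, str, int]
# Output row: (content_id, behavior_type, count, start_time, end_time)
MinuteRow = Tuple[str, str, int, int, int]


def aggregate_by_minute(events: Iterable[Event]) -> List[MinuteRow]:
    """Count behaviors per content, behavior type and one-minute window."""
    counts: Counter = Counter()
    for content_id, behavior, ts in events:
        start = ts - ts % 60  # floor the timestamp to the start of its minute
        counts[(content_id, behavior, start)] += 1
    return [(cid, beh, n, start, start + 60)
            for (cid, beh, start), n in sorted(counts.items())]


minute_rows = aggregate_by_minute([
    ("c1", "click", 120), ("c1", "click", 150),
    ("c1", "exposure", 130), ("c1", "click", 185),
])
print(minute_rows)
```

Each output row matches the result form described above: content ID, behavior type, count within the minute, statistics start time and statistics end time.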
The invention has the beneficial effects that:
based on the user portrait, the user's historical clicks are known, so content-based collaborative filtering (similar to the beer-and-diapers example) can be used to recommend also-viewed content; the portrait also holds the user's interest scores for each category of content (calculated by combining behaviors such as clicks, exposures, shares, comments, likes and collections), so the currently hottest content in the category the user is most interested in can be recalled;
the portrait also shows which content has never been exposed to the user, enabling interest exploration (for example, if sports and entertainment content has always been recommended to a user, science content can be pushed to test whether the user is interested; this expands the user's interests on the one hand and alleviates the narrow long-tail problem on the other);
separating user portraits into short-term and long-term portraits solves the problem that an overly large portrait is hard to query and store.
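The two portrait-based recall ideas described above (hottest content in the most interesting category, plus exploration of categories never exposed to the user) can be illustrated with a small sketch. All names and data structures here are hypothetical; the source does not specify how the recall system is implemented.

```python
from typing import Dict, List, Set


def recall(interest: Dict[str, float],
           hot_by_category: Dict[str, List[str]],
           exposed_categories: Set[str],
           k: int = 3) -> Dict[str, List[str]]:
    """Return hot items from the top-interest category and one exploratory item.

    `hot_by_category` maps a category to its items, hottest first.
    """
    top = max(interest, key=interest.get)
    unexplored = sorted(c for c in hot_by_category if c not in exposed_categories)
    explore = [hot_by_category[c][0] for c in unexplored if hot_by_category[c]]
    return {"interest": hot_by_category.get(top, [])[:k], "explore": explore[:1]}


result = recall(
    interest={"sports": 0.8, "entertainment": 0.5},
    hot_by_category={"sports": ["s1", "s2", "s3", "s4"],
                     "entertainment": ["e1"],
                     "science": ["sc1"]},
    exposed_categories={"sports", "entertainment"},
)
print(result)
```

Here the user has only ever seen sports and entertainment, so one science item is added to probe a possible new interest.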
Drawings
FIG. 1 is a typical feature mining implementation
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The first embodiment is as follows: evaluation platform and competitive product system dynamic integration
The invention discloses a method and a device for pushing information.
The invention discloses a device for pushing information, characterized in that: the device comprises a user tag library system, a recall system, a ranking system, and/or a reordering system, wherein the recall system acquires candidate information related to a user from a material library;
the ranking system processes the information provided by the recall system;
the reordering system processes the output of the ranking system;
the material library comprises bright news contents.
The user tag library system further comprises a content data dimension tag subsystem, which processes first user data obtained from social network software to obtain a feature description of the user.
The content data dimension tag subsystem processes the first user data according to specific dimensions, which comprise a time dimension, an interaction quantity dimension, a content dimension and/or a user dimension.
The time dimension further includes a publication time, a last reply time, and/or a last operation time.
The interaction quantity dimension further includes the numbers of reads, replies, collections, likes, reply likes, and/or shares.
The content dimension further includes a content length, an average reply length, and/or a number of pictures.
The user dimension further includes user interest, liveness of content, and/or reputation of posting/replying users, etc.
The interaction quantity dimension is determined by weighting the data. Preferably, the quantities carry different weights: the weight of the read count is usually the lowest, while the weights of the reply, like, collection and share counts are relatively higher and are determined according to the service scenario.
Example two: evaluating a literary artist or literary work of art
A method for pushing information, comprising the following steps:
step A1: a user pulls down to refresh on the home page of a client, initiating a request;
step A2: the client transmits the request to the server interface; at the same time, the client sends the user context features to the server interface;
step A3: after receiving the request sent by the client, the server calls the recommendation system interface and sends the features required by the recommendation algorithm model to the recommendation system;
step A4: after receiving the request sent by the server, the recommendation system obtains, according to the user ID, the user portrait and/or the offline-trained user vector stored in a database and a list of other users whose interests are similar to the user's, and sends the user portrait, the user vector and the list of other users respectively to the recall system;
step A5: after receiving the content sets respectively returned by the recall system, the server ranks them;
the user context features include: user ID, device model, system, request time, request IP, location, etc.;
the features required by the recommendation algorithm model include: request information, user-side information, and/or article-side information;
preferably, the features are as shown in the following table:
the step A3 further includes:
step A3-1: storing the user behavior data in the subscription message database into a data warehouse, partitioned according to specific time characteristics;
step A3-2: looking up, from the wide table of the data warehouse, the user's activity features in each channel for each day in a specific past period, and storing them in a first data structure;
step A3-3: computing a weight score by decay-weighted accumulation, using a user-defined function (UDF), from the user's behavior counts in each category over the last 15 days;
the calculation formula of the UDF is: $\text{ag\_score} = \sum_{i} n_i \times 0.95^{i}$, where $n_i$ is the number of behaviors over i days;
step A3-4: performing outlier processing on the weight scores, namely setting a null weight value to 0 and rounding a non-null weight value to two decimal places;
preferably, the step A3-4 further includes splicing the user's click, exposure, share, collection, ctr, sharing rate and other features for the past 1, 2, 3, ..., 15 days together and storing them as a single field;
the features are spliced in step A3-4 to avoid dimension explosion;
step A3-5: generating day-level features of the content, specifically: finding the exposure, click, share, collection and comment counts of the content over the past 1, 3, 5 and 7 days from the data warehouse, and calculating a composite score through a second UDF, which is computed as follows:
step A3-5-a: first, min-max normalize the CTR, propagation rate, CPM, secondary-share rate and sharing rate: normalized value = (current value - minimum)/(maximum - minimum);
step A3-5-b: calculate the normalized score = 0 × normalized propagation rate + 45 × normalized CPM + 45 × normalized CTR + 5 × normalized secondary-share rate + 5 × normalized sharing rate;
step A3-5-c: calculate the weight score of each content = (current exposure UV + click UV + share UV + off-site UV + secondary-share UV of the content)/exposure UV;
step A3-5-d: finally, obtain the composite score of the content from the weight score and the normalized score;
the features required by the recommendation algorithm model include: request information, user-side information, and/or article-side information;
the subscription message database is preferably Kafka;
preferably, the storing into a data warehouse in step A3-1 is developed based on the Alibaba Cloud DataWorks framework;
preferably, hours and days are selected as the specific time characteristics;
preferably, the specific period is 10 to 20 days; more preferably, the specific period is 15 days;
channels are the classified display of different types of content in the APP, intended to improve the user's reading experience, and each type of display is called a channel, for example a news channel, a makeup channel, a science and technology channel and the like in the APP software;
preferably, the activity features include the numbers of exposures, clicks, shares, comments and collections;
preferably, the fields of the first data structure include: user ID, channel ID (e.g., american, health, hot), behavior type (e.g., exposure, click, share), number of corresponding behaviors over the past 1 day, over the past 2 days, over the past 3 days, ..., over the past 15 days;
the step A3 further includes real-time feature generation, comprising:
step A3-6: aggregating article data at minute level, specifically: the real-time feature generation computing framework consumes the user behavior log stream from the subscription message database, aggregates the content's exposure, click, share, collection, comment and other data over one minute, and stores the result in a second database;
step A3-7: sampling and determining the sample label, specifically: in the sampling stage, the label of each sample is determined, and the features and the label together form a complete sample. Generally, content that was exposed to the user and clicked is a positive sample, and content that was exposed but not clicked is a negative sample; this also holds in our business. In addition to exposed-but-not-clicked content, the two items below the last click in a request are also taken as negative samples, even if they were not exposed. After the label is determined, the preceding features are looked up in HBase and the data warehouse by content ID and user ID and spliced into a complete sample.
Preferably, the aggregation result in step A3-6 has the form: content ID, behavior type (exposure, click, share, comment, or collection), number of behaviors in one minute (e.g., for 100 clicks in one minute, the value of this column is 100), statistics start time, statistics end time (identifying which minute of the content the statistics cover); this result is stored in HBase;
preferably, the real-time feature generation computing framework is Flink.
The aggregation is: accumulating and summing the number of user behaviors on the content within a certain time;
based on the minute-level aggregated data in HBase, values such as the exposure, click, share, collection and comment counts, ctr and sharing rate of the content over the last 1, 2, 4, 8 and 12 hours can be derived; these can be stored in a separate HBase table so that samples drawn later can conveniently obtain the corresponding features.
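Deriving the 1, 2, 4, 8 and 12-hour values from minute-level rows can be sketched as below. The row layout follows the aggregation result form given above; the function name, the feature-key naming and the rule that ctr is 0 when there is no exposure are assumptions, and the sketch reads rows from memory rather than from HBase.

```python
from typing import Dict, Iterable, Tuple

# (content_id, behavior_type, count, start_time, end_time), times in seconds
MinuteRow = Tuple[str, str, int, int, int]
WINDOW_HOURS = (1, 2, 4, 8, 12)


def window_features(rows: Iterable[MinuteRow], content_id: str,
                    now: int) -> Dict[str, float]:
    """Sum minute-level counts into trailing windows and derive ctr."""
    own_rows = [r for r in rows if r[0] == content_id]
    feats: Dict[str, float] = {}
    for hours in WINDOW_HOURS:
        since = now - hours * 3600
        totals: Dict[str, int] = {}
        for _, behavior, count, start, end in own_rows:
            if start >= since and end <= now:
                totals[behavior] = totals.get(behavior, 0) + count
        for behavior, total in totals.items():
            feats[f"{behavior}_{hours}h"] = total
        exposure = totals.get("exposure", 0)
        # ctr = clicks / exposures; assumed to be 0 when nothing was exposed
        feats[f"ctr_{hours}h"] = totals.get("click", 0) / exposure if exposure else 0.0
    return feats


feats = window_features(
    [("c1", "exposure", 100, 35940, 36000),
     ("c1", "click", 10, 35940, 36000),
     ("c1", "exposure", 100, 30000, 30060)],
    content_id="c1", now=36000,
)
print(feats["ctr_1h"], feats["exposure_2h"], feats["ctr_2h"])
```

Because the windows are all built from the same minute rows, adding another window length only requires extending `WINDOW_HOURS`.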
Example three: evaluation of economic or physical objects
For the case where the subscription message database is Kafka and the data warehouse is developed based on the Alibaba Cloud DataWorks framework, the scheme has the following preferred embodiment. It should be noted that this embodiment is merely a preferred implementation of the present invention, and other similar implementations are all within the scope of the present invention.
The specific embodiment is as follows.
1. User behavior at the client is reported through event tracking to a topic of the open-source stream processing platform Kafka. The data in Kafka is, on the one hand, landed in the data warehouse hourly and, on the other hand, subscribed to and consumed by a Flink job to generate real-time features.
2. User and content day-level data
(1) The user behavior data in Kafka is landed in the data warehouse through the Alibaba Cloud DataWorks framework and partitioned by hour and day. First, the numbers of exposures, clicks, shares, comments and collections of the user in each channel (tag_id) over the past 15 days are looked up from the wide table of the data warehouse; the result has the following form:
the fields are: user ID, channel ID (e.g., american, health, hot), behavior type (e.g., exposure, click, share), number of corresponding behaviors over the past 1 day, over the past 2 days, over the past 3 days, ..., over the past 15 days.
(2) From the user's behavior counts in each category over the last 15 days, a weight score is computed by decay-weighted accumulation through a user-defined function (UDF), whose formula is: $\text{ag\_score} = \sum_{i} n_i \times 0.95^{i}$, where $n_i$ is the number of behaviors over i days.
The result has the form:
each column represents: user ID, channel ID, behavior type, weight score;
This calculation yields the user's weight score for the content of each category. Outliers are then lightly processed: null is set to 0, and non-null values are rounded to two decimal places. The user's click, exposure, share, collection, ctr, sharing rate and other features for the past 1, 2, 3, ..., 15 days are spliced together and stored as one field, in order to avoid dimension explosion. Finally, some static attributes of the user are spliced on.
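Splicing per-channel values into a single field can be sketched as follows. The `prefix_channel:value` layout with comma separators follows the example values listed in this embodiment (for instance `ctr_20008:0.009849`); the function names are hypothetical.

```python
from typing import Dict


def splice(prefix: str, per_channel: Dict[int, float]) -> str:
    """Join per-channel values into one 'prefix_channel:value,...' string."""
    return ",".join(f"{prefix}_{ch}:{v}" for ch, v in sorted(per_channel.items()))


def parse(field: str) -> Dict[int, float]:
    """Inverse of splice: recover {channel_id: value} from the stored field."""
    out: Dict[int, float] = {}
    for part in field.split(","):
        if not part:
            continue
        key, value = part.rsplit(":", 1)
        out[int(key.rsplit("_", 1)[1])] = float(value)
    return out


clicks_field = splice("contentclick", {20008: 9, 20006: 1})
ctr_by_channel = parse("ctr_20008:0.009849,ctr_20030:0.020207")
print(clicks_field, ctr_by_channel)
```

Storing one string per feature type keeps the table at a fixed number of columns regardless of how many channels exist, which is the dimension explosion the text aims to avoid.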
Preferably, the finally generated user features can be expressed in the following form:
+-------------------------------------------------------------------+
|Field|Type|Comment|Field|Value
+-------------------------------------------------------------------+
|user_age|bigint|age|user|23
|user_sex|bigint|gender|user|2
|user_pregnant|bigint|pregnancy|user|ventilation
|user_phone|string|mobile phone model|user|iPhone
|user_os|string|mobile phone system|user|iOS
|user_conn_type|string|connection type|user|WIFI
|user_duration|double|dwell time|user|
|user_ctr_video|double|video ctr|user|0.15894039735099338
|user_avg_duration|double|average dwell time|user|
|||user share number|user|
|user_click|bigint|user click number|user|
|||video user sharing number|user|4
|sv_user_frt|double|video playing completion rate|user|
|video_user_srt|double|video sharing rate|user|0.29449495366961065
|user_srt||user sharing rate|user|
|user_read_nums||user reading|user|
|video_user_click|bigint|user video click number|user|141
|contentshow|string|exposure of user to channel|user|contentshow_20006:11,contentshow_20007:27,contentshow_20008:20,contentshow_20011:1,contentshow_20013:3,contentshow_20014:2,contentshow_20016:2,contentshow_20017:4,contentshow_20023:3,contentshow_20026:2,contentshow_20030:57,contentshow_20058:4,contentshow_20059:9,contentshow_20066:2,contentshow_20076:3,contentshow_20079:1,contentshow_20080:1,contentshow_20089:2,contentshow_20093:5,contentshow_20102:1,contentshow_20116:1,contentshow_20144:39,contentshow_20605:1
|contentclick|string|click of user on channel|user|contentclick_20006:1,contentclick_20007:3,contentclick_20008:9,contentclick_20013:2,contentclick_20014:1,contentclick_20016:2,contentclick_20030:10,contentclick_20059:2,contentclick_20076:2,contentclick_20079:5,contentclick_20080:1,contentclick_20093:1,contentclick_20144:3
|||sharing of user to channel|user|share_20008:1,share_20030:2
|ctr|string|ctr of user to channel|user|ctr_20008:0.009849,ctr_20030:0.020207,ctr_20059:0.230724,ctr_20080:0.094531,ctr_20094:0.206549,ctr_20102:0.094531,ctr_20104:0.045587,ctr_20116:0.206549
|||share rate of user to channel|user|sharehrarate_20008:0.003159,sharehrate_20030:0.010821
|||user video click frequency|user|94
(3) Generating day-level features of content
The exposure, click, share, favorite and comment counts of the content over the past 1, 3, 5 and 7 days are retrieved from the data warehouse, and a composite score is calculated through a UDF as follows:
a. First, normalize the CTR, propagation rate, CPM, secondary-share rate and sharing rate with min-max normalization: (current - minimum)/(maximum - minimum)
b. Calculate the normalized score = (0 × propagation rate) + (45 × CPM) + (45 × CTR) + (5 × secondary-share rate) + (5 × sharing rate)
c. Calculate the weight score of each content item = (exposure UV + click UV + share UV + off-site UV + secondary-share UV of that content)/exposure UV
d. Finally, the composite score of the content = weight score × normalized score
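Steps a to d can be illustrated with a minimal Python sketch. It is not the UDF of the invention: the dictionary keys, the handling of a constant column and of zero exposure are assumptions made for illustration, and step d is assumed to multiply the weight score by the normalized score. The weights are those listed in step b.

```python
from typing import Dict, List

# Weights from step b (they sum to 100). The key names are hypothetical.
WEIGHTS = {
    "propagation_rate": 0,
    "cpm": 45,
    "ctr": 45,
    "secondary_share_rate": 5,
    "share_rate": 5,
}


def min_max(values: List[float]) -> List[float]:
    """Step a: (current - minimum) / (maximum - minimum)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no signal and maps to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def composite_scores(items: List[Dict[str, float]]) -> List[float]:
    """Return one composite score per content item."""
    normalized = {k: min_max([it[k] for it in items]) for k in WEIGHTS}
    scores = []
    for idx, it in enumerate(items):
        # Step b: weighted sum of the normalized indicators.
        norm_score = sum(w * normalized[k][idx] for k, w in WEIGHTS.items())
        # Step c: behaviour UV relative to exposure UV.
        if it["exposure_uv"] > 0:
            weight = (
                it["exposure_uv"] + it["click_uv"] + it["share_uv"]
                + it["offsite_uv"] + it["secondary_share_uv"]
            ) / it["exposure_uv"]
        else:
            weight = 0.0  # assumption: unexposed content scores 0
        # Step d (assumed): weight score times normalized score.
        scores.append(weight * norm_score)
    return scores
```

Because the normalization is relative to the minimum and maximum of the batch, the score of an item depends on which other items are scored together with it.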
Then, static attributes of the content are spliced in, using a processing method similar to that used for the user features, to finally generate the content features; one preferred form is as follows:
+-------------------------------------------------------------------+
|Field|Type|||Comment|Domain|Value
+-------------------------------------------------------------------+
article ID item 34021769
Article title | item | gold view | blowing wheat wave |
The | absstract | string | | | article describes | item | south-Henan journal client reporter have an shine of 6 months and 2 days, and in a wheat field in the ninth division of farm in the yellow pan of Xihua county, a farmer operates a large-scale harvester to harvest wheat. In the three summers, the China and the China are in the same order, the wheat field in the middle and the original land is golden yellow, and the wheat harvest on the farm in the yellow pan meets the war and enters the peak. In the wheat field, the machine sound is loud, the smoke and dust fly, and the scene is rich and harvested.
Keywords | item | wheat field, 6 months, 2 days, wheat, harvest, yellow pan, farm, 10 ten thousand
Type | bigint | content type | item |1
|item_tag_id|bigint|||channel classification ID|item|20059
|item_status|bigint|||article status|item|1
|source|bigint|||content source|item|2
|item_publish_time|bigint|||publishing time|item|1591138870
Update time bigint update time item 1591138870
|publication_hours|bigint|||number of hours since publication|item|3
|is_timeiiness|bigint|||whether time-sensitive|item|3
Cumulative read | item |78
|repin_count|bigint|||cumulative favorites|item|7
|digg_count|bigint|||cumulative likes|item|7
|share_count|bigint|||cumulative shares|item|7
|bury_count|bigint|||cumulative dislikes|item|7
|has_image|bigint|||whether it has a picture|item|1
|has_video|bigint|||whether it has video|item|0
|has_audio|bigint|||whether it has audio|item|0
|duration|bigint|||duration of video or audio|item|0
|area_id|bigint|||citycode|user|843
|item_contentshow|string|||exposure PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:468,uv_15d_nums:213,pv_7d_nums:221,uv_7d_nums:99,pv_5d_nums:147,uv_5d_nums:68,pv_3d_nums:83,uv_3d_nums:42,pv_1d_nums:48,uv_1d_nums:20
|item_contentclick|string|||click PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:63,uv_15d_nums:45,pv_7d_nums:40,uv_7d_nums:30,pv_5d_nums:24,uv_5d_nums:20,pv_3d_nums:17,uv_3d_nums:14,pv_1d_nums:4,uv_1d_nums:3
|item_share|string|||share PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:63,uv_15d_nums:45,pv_7d_nums:40,uv_7d_nums:30,pv_5d_nums:24,uv_5d_nums:20,pv_3d_nums:17,uv_3d_nums:14,pv_1d_nums:4,uv_1d_nums:3
|item_favorite|string|||favorite PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:63,uv_15d_nums:45,pv_7d_nums:40,uv_7d_nums:30,pv_5d_nums:24,uv_5d_nums:20,pv_3d_nums:17,uv_3d_nums:14,pv_1d_nums:4,uv_1d_nums:3
Video playing PV and UV item of 1, 3, 5,7, 15 days past
|pv_15d_nums:5,uv_15d_nums:3,pv_7d_nums:2,uv_7d_nums:1,pv_5d_nums:2,uv_5d_nums:1
Complete string PV and UV item of video over 1, 3, 5,7, 15 days past
Postcontent | string | past 1, 3, 5,7, 15 days video comments PV and UV | item
|pv_15d_nums:3,uv_15d_nums:3,pv_7d_nums:3,uv_7d_nums:3,pv_5d_nums:3,uv_5d_nums:3,pv_3d_nums:3,uv_3d_nums:3,pv_1d_nums:3,uv_1d_nums:3
|item_ctr|string|||click rate PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:0.1493,uv_15d_nums:0.3004,pv_7d_nums:0.1493,uv_7d_nums:0.3004,pv_5d_nums:0.1493,uv_5d_nums:0.3004,pv_3d_nums:0.1493,uv_3d_nums:0.3004,pv_1d_nums:0.1493,uv_1d_nums:0.3004
|share_show|string|||share/exposure PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:0.0052,uv_15d_nums:0.0107,pv_7d_nums:0.0052,uv_7d_nums:0.0107,pv_5d_nums:0.0052,uv_5d_nums:0.0107,pv_3d_nums:0.0052,uv_3d_nums:0.0107,pv_1d_nums:0.0052,uv_1d_nums:0.0107
|vplay_show|string|||play/exposure PV and UV over the past 1, 3, 5, 7, 15 days|item|
|favorite_show|string|||favorite/exposure PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:0.0002,uv_15d_nums:0.0006,pv_7d_nums:0.0002,uv_7d_nums:0.0006,pv_5d_nums:0.0002,uv_5d_nums:0.0006,pv_3d_nums:0.0002,uv_3d_nums:0.0006,pv_1d_nums:0.0002,uv_1d_nums:0.0006
|comment_show|string|||comment/exposure PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:0.0001,uv_15d_nums:0.0002,pv_7d_nums:0.0001,uv_7d_nums:0.0002,pv_5d_nums:0.0001,uv_5d_nums:0.0002,pv_3d_nums:0.0001,uv_3d_nums:0.0002,pv_1d_nums:0.0001,uv_1d_nums:0.0002
|share_click|string|||share/click PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:0.0346,uv_15d_nums:0.0356,pv_7d_nums:0.0346,uv_7d_nums:0.0356,pv_5d_nums:0.0346,uv_5d_nums:0.0356,pv_3d_nums:0.0346,uv_3d_nums:0.0356,pv_1d_nums:0.0346,uv_1d_nums:0.0356
|vplay_click|string|||play/click PV and UV over the past 1, 3, 5, 7, 15 days|item|
|favorite_click|string|||favorite/click PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:0.0017,uv_15d_nums:0.0021,pv_7d_nums:0.0017,uv_7d_nums:0.0021,pv_5d_nums:0.0017,uv_5d_nums:0.0021,pv_3d_nums:0.0017,uv_3d_nums:0.0021,pv_1d_nums:0.0017,uv_1d_nums:0.0021
|comment_click|string|||comment/click PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:0.0006,uv_15d_nums:0.0008,pv_7d_nums:0.0006,uv_7d_nums:0.0008,pv_5d_nums:0.0006,uv_5d_nums:0.0008,pv_3d_nums:0.0006,uv_3d_nums:0.0008,pv_1d_nums:0.0006,uv_1d_nums:0.0008
|svplay_acomplesh_ratio|string|||playback completion rate|item|
|label|bigint|||sample label|item|0
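The string-typed features in the table above pack several `key:value` pairs into one comma-separated value. As a hedged illustration of how such a value might be unpacked before model training, the following Python sketch parses one of these strings into a dictionary; the function names are hypothetical and are not part of the described system.

```python
from typing import Dict


def parse_feature_string(raw: str) -> Dict[str, float]:
    """Parse 'pv_15d_nums:468,uv_15d_nums:213' into {'pv_15d_nums': 468.0, ...}."""
    features: Dict[str, float] = {}
    for pair in raw.strip().strip("|").split(","):
        key, sep, value = pair.rpartition(":")
        # Skip fragments without a key, a separator or a value.
        if not sep or not key.strip() or not value.strip():
            continue
        features[key.strip()] = float(value)
    return features


def ratio(numerator: float, denominator: float) -> float:
    """Rate between two counters, with 0.0 when the denominator is empty."""
    return numerator / denominator if denominator else 0.0
```

For example, parsing the exposure and click values of one content item and dividing the two `uv_1d_nums` entries gives a one-day UV click rate for that item.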
3. Generation of real-time features
(1) Computing framework: Flink is selected as the real-time computing framework;
Flink is preferred for the following reasons. Available real-time frameworks include Storm, Spark and Flink. Storm can only process streaming data and has no batch-processing capability. Flink also provides a number of high-level APIs: for example, its DataStream API offers operators such as Map, GroupBy, Window and Join, which replace the receiving and emitting logic that one or more Storm bolts would otherwise carry, whereas with Storm the programmer has to implement these functions by hand. Compared with Spark, Flink offers high throughput and low latency; in addition, Flink supports millisecond-level computation while Spark supports only second-level computation, so Flink is true stream computing;
(2) Code logic: article data is first aggregated at the minute level. The user behavior log stream stored in Kafka serves as the data input of the Flink real-time computing framework. Within Flink, the exposure, click, share, favorite and comment events of each content item are aggregated over one minute. Each aggregation result has the form: content ID, behavior type (exposure, click, share, comment or favorite), number of behaviors within the minute (for example, the value of this column is 100 if there were 100 clicks in that minute), statistics start time, and statistics end time (identifying which minute of the content the statistics cover). The results are stored in HBase.
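The one-minute aggregation described above can be sketched in plain Python. This is not Flink code and makes no claim about the actual job: it only illustrates the tumbling one-minute grouping and the shape of the result rows. The event tuple layout and the function name are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical event layout: (content_id, behavior_type, unix_timestamp_seconds)
Event = Tuple[str, str, int]
# Result row: (content_id, behavior_type, count, window_start, window_end)
MinuteRow = Tuple[str, str, int, int, int]


def aggregate_by_minute(events: Iterable[Event]) -> List[MinuteRow]:
    """Count behaviours per content item, behaviour type and one-minute window."""
    counts: Counter = Counter()
    for content_id, behavior, ts in events:
        window_start = ts - ts % 60  # align to the start of the minute
        counts[(content_id, behavior, window_start)] += 1
    return [
        (content_id, behavior, n, start, start + 60)
        for (content_id, behavior, start), n in sorted(counts.items())
    ]
```

Each returned row carries the five columns named in the text (content ID, behavior type, count, start time, end time), so it could be written to a store such as HBase as one record.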
Based on the minute-level aggregates in HBase, the exposure, click, share, favorite and comment counts, the CTR, the sharing rate and similar values of the content over the last 1, 2, 4, 8 and 12 hours can be derived. These values can be stored in a separate HBase table so that the samples produced later by sampling can conveniently look up the corresponding features.
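As a hedged sketch of this roll-up, the Python function below sums minute-level rows into 1, 2, 4, 8 and 12 hour windows and derives a CTR per window. The row layout (content ID, behavior type, count, window start, window end), the behavior names `show` and `click`, and the function name are assumptions for illustration; reading from and writing to HBase is omitted.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Assumed minute-level row: (content_id, behavior_type, count, start, end)
MinuteRow = Tuple[str, str, int, int, int]


def hourly_features(
    rows: Iterable[MinuteRow],
    content_id: str,
    now: int,
    hours: Tuple[int, ...] = (1, 2, 4, 8, 12),
) -> Dict[int, Dict[str, float]]:
    """Sum minute rows of one content item into trailing windows of N hours."""
    totals = {h: Counter() for h in hours}
    for cid, behavior, n, start, _end in rows:
        if cid != content_id:
            continue
        for h in hours:
            # A minute row belongs to the window if it started within it.
            if now - h * 3600 <= start < now:
                totals[h][behavior] += n
    features: Dict[int, Dict[str, float]] = {}
    for h in hours:
        window = dict(totals[h])
        shows = window.get("show", 0)
        # CTR = clicks / exposures, 0.0 when there was no exposure.
        window["ctr"] = window.get("click", 0) / shows if shows else 0.0
        features[h] = window
    return features
```

The same pattern extends to the sharing rate or any other ratio of two counters within a window.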
HBase was chosen for the following reasons. Common databases include HBase, Hive, Redis and MySQL. MySQL is a relational database; the business data volume involved in the present invention is large and requires linear scalability, automated operation and maintenance, and a simple usage pattern, which MySQL clearly cannot satisfy. Redis resembles HBase in that both are key-value NoSQL databases, but Redis is better suited for use as a cache; the business involved in the present invention requires that data must not be lost, which Redis cannot satisfy, and Redis costs much more to use than HBase. Hive is a data warehouse tool built on MapReduce and cannot be used for interactive, user-facing storage. HBase offers large capacity, column-oriented storage, strong scalability and high reliability. HBase can serve real-time computation mainly because of its underlying architecture and data structures, namely LSM-Tree + HTable + Cache, which give it a clear advantage over other database systems: the client can directly locate the HRegion server that holds the data being looked up and then search for matching data within a region on that server, and part of that data is held in the cache.
In addition, HBase storage is more economical than databases such as MySQL and Redis. For example, an Aliyun HBase cluster core node with 8 cores, 32 GB of memory and 1800 GB of disk costs only about 1,000 yuan per month, MySQL with the same configuration costs about 3,000 yuan, and Redis, as a cache database, is more expensive still.
(3) Sampling: the sampling stage determines the label of each sample; the features and the label together make up a complete sample. Generally, content that was exposed to and clicked by the user is a positive sample, and content that was exposed but not clicked is a negative sample. The same holds in this business, except that, in addition to exposed-but-unclicked content, the two content items below the last click in a request are also taken as negative samples even if they were not exposed. After the label is determined, the previously generated features are looked up in HBase and the data warehouse by content ID and user ID and spliced into a complete sample. The final features are as follows:
+-------------------------------------------------------------------+
|Field|Type|||Comment|Domain|Value
+-------------------------------------------------------------------+
article ID item 34021769
Article title | item | gold view | blowing wheat wave |
The | absstract | string | | | article describes | item | south-Henan journal client reporter have an shine of 6 months and 2 days, and in a wheat field in the ninth division of farm in the yellow pan of Xihua county, a farmer operates a large-scale harvester to harvest wheat. In the three summers, the China and the China are in the same order, the wheat field in the middle and the original land is golden yellow, and the wheat harvest on the farm in the yellow pan meets the war and enters the peak. In the wheat field, the machine sound is loud, the smoke and dust fly, and the scene is rich and harvested.
Keywords | item | wheat field, 6 months, 2 days, wheat, harvest, yellow pan, farm, 10 ten thousand
Type | bigint | content type | item |1
|item_tag_id|bigint|||channel classification ID|item|20059
|item_status|bigint|||article status|item|1
|source|bigint|||content source|item|2
|item_publish_time|bigint|||publishing time|item|1591138870
Update time bigint update time item 1591138870
|publication_hours|bigint|||number of hours since publication|item|3
|is_timeiiness|bigint|||whether time-sensitive|item|3
Cumulative read | item |78
|repin_count|bigint|||cumulative favorites|item|7
|digg_count|bigint|||cumulative likes|item|7
|share_count|bigint|||cumulative shares|item|7
|bury_count|bigint|||cumulative dislikes|item|7
|has_image|bigint|||whether it has a picture|item|1
|has_video|bigint|||whether it has video|item|0
|has_audio|bigint|||whether it has audio|item|0
|duration|bigint|||duration of video or audio|item|0
|area_id|bigint|||citycode|user|843
|item_contentshow|string|||exposure PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:468,uv_15d_nums:213,pv_7d_nums:221,uv_7d_nums:99,pv_5d_nums:147,uv_5d_nums:68,pv_3d_nums:83,uv_3d_nums:42,pv_1d_nums:48,uv_1d_nums:20
|item_contentclick|string|||click PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:63,uv_15d_nums:45,pv_7d_nums:40,uv_7d_nums:30,pv_5d_nums:24,uv_5d_nums:20,pv_3d_nums:17,uv_3d_nums:14,pv_1d_nums:4,uv_1d_nums:3
|item_share|string|||share PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:63,uv_15d_nums:45,pv_7d_nums:40,uv_7d_nums:30,pv_5d_nums:24,uv_5d_nums:20,pv_3d_nums:17,uv_3d_nums:14,pv_1d_nums:4,uv_1d_nums:3
|item_favorite|string|||favorite PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:63,uv_15d_nums:45,pv_7d_nums:40,uv_7d_nums:30,pv_5d_nums:24,uv_5d_nums:20,pv_3d_nums:17,uv_3d_nums:14,pv_1d_nums:4,uv_1d_nums:3
Video playing PV and UV item of 1, 3, 5,7, 15 days past
|pv_15d_nums:5,uv_15d_nums:3,pv_7d_nums:2,uv_7d_nums:1,pv_5d_nums:2,uv_5d_nums:1
Complete string PV and UV item of video over 1, 3, 5,7, 15 days past
Postcontent | string | past 1, 3, 5,7, 15 days video comments PV and UV | item
|pv_15d_nums:3,uv_15d_nums:3,pv_7d_nums:3,uv_7d_nums:3,pv_5d_nums:3,uv_5d_nums:3,pv_3d_nums:3,uv_3d_nums:3,pv_1d_nums:3,uv_1d_nums:3
|item_ctr|string|||click rate PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:0.1493,uv_15d_nums:0.3004,pv_7d_nums:0.1493,uv_7d_nums:0.3004,pv_5d_nums:0.1493,uv_5d_nums:0.3004,pv_3d_nums:0.1493,uv_3d_nums:0.3004,pv_1d_nums:0.1493,uv_1d_nums:0.3004
|share_show|string|||share/exposure PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:0.0052,uv_15d_nums:0.0107,pv_7d_nums:0.0052,uv_7d_nums:0.0107,pv_5d_nums:0.0052,uv_5d_nums:0.0107,pv_3d_nums:0.0052,uv_3d_nums:0.0107,pv_1d_nums:0.0052,uv_1d_nums:0.0107
|vplay_show|string|||play/exposure PV and UV over the past 1, 3, 5, 7, 15 days|item|
|favorite_show|string|||favorite/exposure PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:0.0002,uv_15d_nums:0.0006,pv_7d_nums:0.0002,uv_7d_nums:0.0006,pv_5d_nums:0.0002,uv_5d_nums:0.0006,pv_3d_nums:0.0002,uv_3d_nums:0.0006,pv_1d_nums:0.0002,uv_1d_nums:0.0006
|comment_show|string|||comment/exposure PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:0.0001,uv_15d_nums:0.0002,pv_7d_nums:0.0001,uv_7d_nums:0.0002,pv_5d_nums:0.0001,uv_5d_nums:0.0002,pv_3d_nums:0.0001,uv_3d_nums:0.0002,pv_1d_nums:0.0001,uv_1d_nums:0.0002
|share_click|string|||share/click PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:0.0346,uv_15d_nums:0.0356,pv_7d_nums:0.0346,uv_7d_nums:0.0356,pv_5d_nums:0.0346,uv_5d_nums:0.0356,pv_3d_nums:0.0346,uv_3d_nums:0.0356,pv_1d_nums:0.0346,uv_1d_nums:0.0356
|vplay_click|string|||play/click PV and UV over the past 1, 3, 5, 7, 15 days|item|
|favorite_click|string|||favorite/click PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:0.0017,uv_15d_nums:0.0021,pv_7d_nums:0.0017,uv_7d_nums:0.0021,pv_5d_nums:0.0017,uv_5d_nums:0.0021,pv_3d_nums:0.0017,uv_3d_nums:0.0021,pv_1d_nums:0.0017,uv_1d_nums:0.0021
|comment_click|string|||comment/click PV and UV over the past 1, 3, 5, 7, 15 days|item|
|pv_15d_nums:0.0006,uv_15d_nums:0.0008,pv_7d_nums:0.0006,uv_7d_nums:0.0008,pv_5d_nums:0.0006,uv_5d_nums:0.0008,pv_3d_nums:0.0006,uv_3d_nums:0.0008,pv_1d_nums:0.0006,uv_1d_nums:0.0008
|svplay_acomplesh_ratio|string|||playback completion rate|item|
|label|bigint|||sample label|item|0
|uid|string|||user ID|user|25311456
|requestid|string|||request ID|user|501359071604462886836102
|requesttime|string|||request time|user|1604462886
|showlist|string|||exposed-but-not-clicked sequence|user|33732005,33783349,33837299,33838753,33912723,33928278
Perssessessence string click sequence user 33732005
Work day string workday user 1
|user_age|bigint|||age|user|23
|user_sex|bigint|||gender|user|2
|user_pregnant|bigint|||pregnancy|user|
|user_phone|string|||mobile phone model|user|iPhone
|user_os|string|||mobile phone system|user|iOS
|user_conn_type|string|||connection type|user|WIFI
|user_duration|double|||dwell time|user|
|user_ctr_video|double|||video CTR|user|0.15894039735099338
|user_avg_duration|double|||average dwell time|user|
User share number of user
|user_click|bigint|||user click number|user|
Video user sharing number | user |4 | video user sharing number | user |
|sv_user_frt|double|||video playback completion rate|user|
|video_user_srt|double|||video sharing rate|user|0.29449495366961065
User srt rate user sharing rate
User read nums user reading
|video_user_click|bigint|||user video click number|user|141
|contentshow|string|||user exposures per channel|user|
|contentshow_20006:11,contentshow_20007:27,contentshow_20008:20,contentshow_20011:1,contentshow_20013:3,contentshow_20014:2,contentshow_20016:2,contentshow_20017:4,contentshow_20023:3,contentshow_20026:2,contentshow_20030:57,contentshow_20058:4,contentshow_20059:9,contentshow_20066:2,contentshow_20076:3,contentshow_20079:1,contentshow_20080:1,contentshow_20089:2,contentshow_20093:5,contentshow_20102:1,contentshow_20116:1,contentshow_20144:39,contentshow_20605:1
|contentclick|string|||user clicks per channel|user|
|contentclick_20006:1,contentclick_20007:3,contentclick_20008:9,contentclick_20013:2,contentclick_20014:1,contentclick_20016:2,contentclick_20030:10,contentclick_20059:2,contentclick_20076:2,contentclick_20079:5,contentclick_20080:1,contentclick_20093:1,contentclick_20144:3
Sharing of channel by share 20008:1, share 20030:2
|ctr|string|||user CTR per channel|user|
|ctr_20008:0.009849,ctr_20030:0.020207,ctr_20059:0.230724,ctr_20080:0.094531,ctr_20094:0.206549,ctr_20102:0.094531,ctr_20104:0.045587,ctr_20116:0.206549
Share rate of user to channel | user | sharehrarate _20008:0.003159, sharehrate _20030:0.010821
User click video frequency | user |94 | |
|eventtype|string|||event type|user|contentclick
|user_show_video|string|||number of videos exposed to the user|user|806
|id|string|||item_id|item|3746
| retrieveid | string | | | distinguishes scene ID | item | arm _ relationship
|item_title_len|bigint|||article title length|item|6
|item_content_len|bigint|||article content length|item|56
|click_offset|bigint|||click position|item|12
|item_type|bigint|||media type|item|2
|item_apply_time|bigint|||time from article release to the current time|item|16
|item_muid|bigint|||media number ID|item|2
|item_ty|bigint|||media type|item|20010
|publish_time|bigint|||publishing time|item|1593590495
|tag_id|bigint|||channel ID|item|20010
|is_time|bigint|||whether time-sensitive|item|0
|item_img_count_show|bigint|||number of article pictures|item|3
|category|bigint|||channel ID|item|20010
|status|bigint|||article status|item|1
|is_up|double|||whether taken off the shelf|item|
|is_hot|double|||whether hot|item|
|is_original|double|||whether original|item|
|video_click|double|||video click number|item|37
Video share number | item |7
|video_srt|double|||video sharing rate|item|0.09347644411829928
Click number | item |37 of | inventory _ click | double | | article
|article_ctr|double|||article CTR|item|0.20833333333333334
Exposure number | item |92 of | articule _ show | double | | article
|user_tag_ctr|double|||user CTR for the channel|item|
|user_tag_show|double|||user exposures for the channel|item|
|user_tag_click|double|||user clicks for the channel|item|
|comment_count|double|||comment number|item|
|source_read_count|double|||source read count|item|
Wxsharetwice _ source | double | | | | hour out-of-site binary | item
|contentshow_hour|double|||hourly exposure number|item|1
Share number | item | hour | | | share _ hour | share _ double | | |
|shareback_hour|double|||off-site share click number in the past 1 hour|item|
|wxcontenctclick_hour|double|||hourly off-site click number|item|
| exitcontent dwell _ hour | double | | | hour dwell time | item |2357
|contentclick_hour|double|||hourly click number|item|1
| facade _ hour | hour collection number | item
|ctr_hour|double|||hourly CTR|item|0.063225105
Number of comments | item of | | postcontent _ hour | double | | | hour | postcontent _ hour |
|contentclick_12hour|double|||12-hour click number|item|12
|contentshow_12hour|double|||12-hour exposure number|item|
12h ctr | item |0.09453121
[ exitcontent tdetail _12hour [ double ] 12hour stay time [ item ]
Wxcontenctclick _12hour double 12hour off-site click number | item
Wxsharetiwice _12hour double 12hour off-site sharing number item
Share number | item 12 hours | | | share _12hour | | share _12 | share | item
|favorite_12hour|double|||12-hour favorite number|item|
|postcontentcomment_12hour|double|||12-hour comment number|item|
|shareback_12hour|double|||share click number in the past 12 hours|item|
|sex|bigint|||gender|user|1
|age|string|||age|user|over 45
Rivalry | string | province | user | Shandong province
City, user, and city
|score|bigint|||score|user|64542
|reg_days|bigint|||registration days|user|699
|active_days|bigint|||active days|user|698
If flag bigint is logged in user 1 by third party or mobile phone number
|last_location_time|bigint|||last login time|user|1604332800
The time of the last clean up notification | user | 0|
|last_sys_time|bigint|||time the system notification was last read|user|1544079479
|last_rep_time|bigint|||time a comment reply was last read|user|1604383947
Class _ region _ type | bigint | finally login platform | user |2 (last login platform, 1: ios; 2: android)
Registration type | user |2 (registration platform, 1: ios; 2: android; 3: pc)
|source_id|string|||source ID|user|c1006
|device_type|string|||device type|user|OPPO PBBT00
|hist_share_num|bigint|||number of shares in the time period|user|4374
|reg_ip|bigint|||registration IP|user|1894108192
|hist_show|string|||15-day exposure categories and scores|user|
|cs_20007:1551.01,cs_20086:0.54,cs_20087:34.97,cs_20097:3.41,cs_20108:29.93,cs_20128:0.95,cs_20144:68.29,cs_20006:328.61,cs_20058:118.86,cs_20069:14.37,cs_20080:192.07,cs_20093:64.22,cs_20110:5.04,cs_20604:2.44,cs_20017:15.14,cs_20056:15.76,cs_20059:244.21,cs_20066:202.37,cs_20102:171.87,cs_20104:23.07,cs_20109:65.21,cs_20605:5.78,cs_20011:237.72,cs_20014:139.36,cs_20016:162.04,cs_20021:2.81,cs_20022:13.16,cs_20023:89.82,cs_20026:135.82,cs_20030:3524.74,cs_20076:28.03,cs_20083:16.63,cs_20094:9.97,cs_20095:1.08,cs_20116:13.88,cs_20603:8.37,cs_20008:3105.84,cs_20010:25.01,cs_20013:80.84,cs_20015:362.75,cs_20078:2.56,cs_20079:210.8,cs_20103:4.25,cs_20589:0.57
Click category and score for 15 days
|cc_20017:1.55,cc_20056:1.03,cc_20059:44.27,cc_20066:28.14,cc_20102:90.49,cc_20104:1.61,cc_20109:27.78,cc_20006:29.73,cc_20058:9.24,cc_20080:65.87,cc_20093:34.91,cc_20110:0.86,cc_20007:263.99,cc_20087:15.68,cc_20108:21.55,cc_20144:31.77,cc_20011:69.81,cc_20014:5.46,cc_20016:26.26,cc_20022:0.63,cc_20023:2.83,cc_20026:15.17,cc_20030:353.16,cc_20083:6.67,cc_20094:7.22,cc_20116:4.25,cc_20603:3.1,cc_20008:564.64,cc_20010:2.71,cc_20013:2.42,cc_20015:28.1,cc_20078:1.99,cc_20079:69.84,cc_20103:6.29
|hist_share|string|||15-day share categories and scores|user|
|ss_20011:10.2,ss_20016:3.01,ss_20026:4.54,ss_20030:75.36,ss_20094:0.95,ss_20603:0.77,ss_20008:141.08,ss_20015:0.49,ss_20079:12.57,ss_20059:9.26,ss_20066:3.95,ss_20102:6.08,ss_20109:2.1,ss_20007:60.41,ss_20087:0.57,ss_20144:4.61,ss_20058:1.3,ss_20080:6.34,ss_20093:5.14
|hist_favorite|string|||15-day favorite categories and scores|user|fs_20102:1.63,fs_20079:0.54
|total_score|string|||15-day category total score|user|
|ts_20017:1.55,ts_20056:1.03,ts_20059:72.05000000000001,ts_20066:39.99,ts_20102:121.76999999999998,ts_20104:1.61,ts_20109:34.08,ts_20006:29.73,ts_20058:13.14,ts_20080:84.89,ts_20093:50.33,ts_20110:0.86,ts_20007:445.22,ts_20087:17.39,ts_20108:21.55,ts_20144:45.6,ts_20008:987.88,ts_20010:2.71,ts_20013:2.42,ts_20015:29.57,ts_20078:1.99,ts_20079:111.87,ts_20103:6.29,ts_20011:100.41,ts_20014:5.46,ts_20016:35.29,ts_20022:0.63,ts_20023:2.83,ts_20026:28.79,ts_20030:579.24,ts_20083:6.67,ts_20094:10.07,ts_20116:4.25,ts_20603:5.41
(ii) hit _ ctr | string | | | 7-day category click rate: click/show | user
|7d_ctr_20007:0.25,7d_ctr_20008:0.2,7d_ctr_20014:0.08,7d_ctr_20030:0.23,7d_ctr_20013:0.29,7d_ctr_20066:0.25,7d_ctr_20016:0.04,7d_ctr_20109:0.5,7d_ctr_20006:0.17,7d_ctr_20011:0.33,7d_ctr_20026:0.33,7d_ctr_20058:0.18,7d_ctr_20059:0.25
The collection rate of 7-day category of the list/click 7d ctr 20007:0.25,7d ctr 20008:0.2,7d ctr 20011:0.33,7d ctr 20058:0.18
|up_score|string|||offline preference + real-time preference score|user|
|20030:54.736000,20008:45.776000,20007:32.544000,20011:17.100000,20014:4.884000,20058:4.332000,20013:3.504000,20016:3.156000,20015:3.000000,20066:2.568000
|prefer_category_score|double|||user preference feature|user|203.614
|prefer_category_hist_ctr|double|||user preference feature|user|0.22
|prefer_best_share|double|||user preference feature|user|0.46
|prefer_best_share|double|||user preference feature|user|0.66121438629008
|pv_15d_nums|double|||15-day reading amount|item|3.0
|pv_7d_nums|double|||7-day reading amount|item|2.0
|pv_5d_nums|double|||5-day reading amount|item|1.0
|pv_3d_nums|double|||3-day reading amount|item|1.0
Reading amount [ item ] 1 day of [ pv _1d _ nums ] double | | | |1
|pv_15d_nums_click|double|||15-day pv/click|item|1.0
|pv_7d_nums_click|double|||7-day pv/click|item|1.0
|pv_5d_nums_click|double|||5-day pv/click|item|1.0
|pv_3d_nums_click|double|||3-day pv/click|item|1.0
|pv_1d_nums_click|double|||1-day pv/click|item|
|pv_15d_nums_show|double|||15-day pv/show|item|3.0
|pv_7d_nums_show|double|||7-day pv/show|item|3.0
|pv_5d_nums_show|double|||5-day pv/show|item|3.0
|pv_3d_nums_show|double|||3-day pv/show|item|3.0
|pv_1d_nums_show|double|||1-day pv/show|item|
Type | retrieve _ type | string | | | recall type | item | arm _ register
|recall_score|double|||recall score|item|0.04230078826
|item_ctr_15d|double|||15-day article CTR|item|0.038834951456310676
|item_ctr_7d|double|||7-day article CTR|item|0.038834951456310676
|item_ctr_5d|double|||5-day article CTR|item|0.038834951456310676
|item_ctr_3d|double|||3-day article CTR|item|0.038834951456310676
|item_ctr_1d|double|||1-day article CTR|item|0.0
| usercacategories | string | | | user class | user
Content | string | content column | item | recommendation |
Display tag of "showtag" "string |"
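The labeling rule of the sampling stage (step (3) above) can be illustrated with a short Python sketch. It is only one reading of the rule under stated assumptions: a request is represented by the ranked list of content IDs returned for it, and "the two content items below the last click" are taken to be the two items ranked directly after the last clicked item. The function and parameter names are hypothetical.

```python
from typing import Dict, List, Set


def label_request(
    ranked_ids: List[str],
    exposed_ids: Set[str],
    clicked_ids: Set[str],
    extra_below: int = 2,
) -> Dict[str, int]:
    """Return {content_id: label} for one request (1 = positive, 0 = negative)."""
    labels: Dict[str, int] = {}
    for cid in ranked_ids:
        if cid in clicked_ids:
            labels[cid] = 1  # exposed and clicked: positive sample
        elif cid in exposed_ids:
            labels[cid] = 0  # exposed but not clicked: negative sample
    click_positions = [i for i, cid in enumerate(ranked_ids) if cid in clicked_ids]
    if click_positions:
        last = click_positions[-1]
        # Items directly below the last click count as negatives even if unexposed.
        for cid in ranked_ids[last + 1 : last + 1 + extra_below]:
            labels.setdefault(cid, 0)
    return labels
```

Each labeled content ID would then be joined with the user and content features by user ID and content ID to form a complete sample.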
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present invention, and these should be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.
Claims (10)
1. A method for pushing information, characterized in that the method comprises the following steps:
step A1: a user initiates a request by pulling down to refresh on the home page of a client;
step A2: the client transmits the request to the server interface; at the same time, the client sends the user and user context features to the server interface, and sends the user behavior log to a message queue;
step A3: after receiving the request sent by the client, the server calls the feature pushing system interface and sends the features required by the feature pushing algorithm model to the feature pushing system;
step A4: after receiving the request sent by the server, the recommendation system requests recommended content from a recall system;
step A5: after receiving the content sets respectively returned by the recall system, the server performs ranking.
The user context features include: user ID, phone model, system, request time, request IP, location, etc.
2. The method of claim 1, wherein step A3 further includes:
step A3-1: storing the user behavior data from the subscription message database into a data warehouse, partitioned according to specific time characteristics;
step A3-2: looking up the user's daily activity features over a specific period from the wide table of the data warehouse, and storing them in a first data structure;
step A3-3: performing a decay-weighted accumulation through a user-defined function (UDF) over the user's behavior counts in each channel for the last 15 days to obtain a weight score;
step A3-4: performing abnormal value processing on the weight score, wherein a null weight value is set to 0 and a non-null weight value is kept to two decimal places.
3. The method of claim 2, wherein the UDF is calculated as: interest score = Σ_i (number of behaviors on day i × a^i), where a ranges from 0.5 to 0.99 and represents the decay rate of the user's interest in the channel content.
4. The method of claim 2, wherein: the step A3 further includes a step A3-5, which specifically includes:
generating the day-level features of the content side, specifically comprising: looking up the exposure, click, share, favorite and comment counts of the content over the past 1, 3, 5 and 7 days from the data warehouse, and calculating a composite score through a second UDF function, wherein the formula of the second UDF function is as follows:
step A3-5-a: first, normalizing the click rate, propagation rate, yield, second score and share rate, the normalization method being: (current value - minimum) / (maximum - minimum);
step A3-5-b: calculating a composite normalized score = (normalized propagation rate × 5) + (normalized yield × 40) + (normalized click rate × 45) + (normalized second score × 5) + (normalized share rate × 5);
step A3-5-c: calculating a weight score of each content = (exposure count of the content + click count + share count + WeChat Moments click count + second score) / exposure count;
step A3-5-d: finally, the composite score of the content is the weight score.
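Steps A3-5-a to A3-5-c can be illustrated with the following minimal sketch. The dictionary keys, the function names, and the handling of a zero range or zero exposures are assumptions for illustration; the weights 5, 40, 45, 5 and 5 are taken from step A3-5-b. Because step A3-5-d does not spell out how the two scores are combined, the sketch computes them separately.

```python
from typing import Dict, List

# Weights from step A3-5-b; the key names are hypothetical.
WEIGHTS: Dict[str, float] = {
    "propagation_rate": 5,
    "yield": 40,
    "click_rate": 45,
    "second_score": 5,
    "share_rate": 5,
}


def min_max(values: List[float]) -> List[float]:
    """Step A3-5-a: (current value - minimum) / (maximum - minimum)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column normalizes to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def composite_normalized_scores(rows: List[Dict[str, float]]) -> List[float]:
    """Step A3-5-b: weighted sum of the normalized metrics, one score per content."""
    normalized = {key: min_max([row[key] for row in rows]) for key in WEIGHTS}
    return [
        sum(WEIGHTS[key] * normalized[key][i] for key in WEIGHTS)
        for i in range(len(rows))
    ]


def weight_score(exposures: float, clicks: float, shares: float,
                 moments_clicks: float, second_score: float) -> float:
    """Step A3-5-c: sum of the counts divided by the exposure count."""
    if exposures == 0:
        # Assumption: content with no exposure gets a weight score of 0.
        return 0.0
    return (exposures + clicks + shares + moments_clicks + second_score) / exposures
```

Since the weights sum to 100, the composite normalized score of any content lies between 0 and 100.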
5. The method of claim 2, wherein: step A3 further includes real-time feature generation, further including:
step A3-6: performing minute-level data aggregation for articles, specifically comprising: a real-time feature generation computing framework selects the user behavior log stream from the subscription message database, aggregates the exposure, click, share, favorite and comment counts of each content per minute, and stores the result in the second database;
step A3-7: sampling and determining sample labels, specifically comprising: in the sampling stage, the label of each sample is determined, and the features and the label form a complete sample; generally, content that was exposed to and clicked by a user is a positive sample, and content that was exposed but not clicked is a negative sample. Preferably, in negative sampling, in addition to exposed-but-not-clicked content, the two contents immediately below the last clicked content in a request are also treated as negative samples even if they were not exposed. After the label is determined, the features that precede the label are looked up from the database and the data warehouse by content ID and user ID and joined into a complete sample.
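The labeling rule of step A3-7 can be sketched as below, under the assumption that each request carries the ranked list of returned content IDs together with the sets of exposed and clicked IDs. The function name `label_request` and its parameters are hypothetical.

```python
from typing import List, Set, Tuple


def label_request(ranked_ids: List[str], exposed: Set[str], clicked: Set[str],
                  extra_below: int = 2) -> List[Tuple[str, int]]:
    """Return (content_id, label) pairs for one request; 1 is positive, 0 is negative."""
    samples: List[Tuple[str, int]] = []
    labelled: Set[str] = set()

    # Exposed and clicked -> positive; exposed and not clicked -> negative.
    for cid in ranked_ids:
        if cid in exposed:
            samples.append((cid, 1 if cid in clicked else 0))
            labelled.add(cid)

    # The contents just below the last click are negatives even if not exposed.
    click_positions = [i for i, cid in enumerate(ranked_ids)
                       if cid in clicked and cid in exposed]
    if click_positions:
        last = click_positions[-1]
        for cid in ranked_ids[last + 1:last + 1 + extra_below]:
            if cid not in labelled:
                samples.append((cid, 0))
                labelled.add(cid)
    return samples


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d", "e", "f"]
    print(label_request(ranked, exposed={"a", "b", "c"}, clicked={"b"}))
```

In the usage example, content `d` was never exposed but sits within two positions below the last click, so it is added as a negative sample; each labeled pair would then be joined with its features by content ID and user ID.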
6. The method of claim 2, wherein: the wide table is a data table summarizing user side information, content side information and context characteristics and is used for preparation before training of the machine learning model.
7. The method of claim 5, wherein: the aggregation result in step A3-6 has the form: content ID, behavior type (exposure, click, share, comment, or favorite), number of behaviors within the minute (e.g., for 100 clicks in the minute, the value of this column is 100), statistics start time, and statistics end time (identifying which minute the statistics cover); this result is stored in the database.
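A minimal in-memory sketch of this minute-level aggregation is given below. It assumes each log event is a tuple of content ID, behavior type and Unix timestamp in seconds; a production system would run this in a streaming framework and write the rows to the database, which is not shown. The function name and the row keys are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, str, int]  # (content_id, behavior_type, unix_timestamp_seconds)


def aggregate_by_minute(events: Iterable[Event]) -> List[Dict[str, object]]:
    """Count behaviors per (content ID, behavior type, one-minute window)."""
    counts: Counter = Counter()
    for content_id, behavior, ts in events:
        start = ts - ts % 60  # start of the minute that contains the event
        counts[(content_id, behavior, start)] += 1
    return [
        {"content_id": c, "behavior": b, "count": n, "start": s, "end": s + 60}
        for (c, b, s), n in sorted(counts.items())
    ]


if __name__ == "__main__":
    log = [("c1", "click", 120), ("c1", "click", 150),
           ("c1", "exposure", 130), ("c1", "click", 185)]
    for row in aggregate_by_minute(log):
        print(row)
```

Each output row carries the five fields named in the claim: content ID, behavior type, count, start time and end time.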
8. The method of claim 5, wherein: the database is HBase; and values such as the exposure, click, share, favorite and comment counts, click rate and share rate of the content over the last 1, 2, 4, 8 and 12 hours can be derived from the minute-level aggregated data in HBase.
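The hour-level expansion can be sketched as a roll-up over minute-level rows. The sketch below works on plain Python dictionaries rather than an actual HBase client, and it assumes that click rate means clicks divided by exposures and share rate means shares divided by exposures; the claim does not define these ratios. The function name `rollup` and the row keys are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def rollup(rows: Iterable[Dict[str, object]], content_id: str, now: int,
           hours: Iterable[int] = (1, 2, 4, 8, 12)) -> Dict[int, Dict[str, float]]:
    """Sum minute-level counts of one content over the last N hours before `now`."""
    rows = list(rows)
    result: Dict[int, Dict[str, float]] = {}
    for h in hours:
        window_start = now - h * 3600
        totals: Counter = Counter()
        for r in rows:
            if r["content_id"] == content_id and window_start <= r["start"] < now:
                totals[r["behavior"]] += r["count"]
        exposures = totals["exposure"]
        result[h] = {
            "exposure": exposures,
            "click": totals["click"],
            "share": totals["share"],
            # Assumed definitions: rate = count / exposures, 0 when there is no exposure.
            "click_rate": totals["click"] / exposures if exposures else 0.0,
            "share_rate": totals["share"] / exposures if exposures else 0.0,
        }
    return result
```

Keeping only minute-level rows in storage lets any longer window be recomputed on demand, at the cost of scanning up to 720 rows per behavior type for the 12-hour window.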
9. An apparatus for pushing information using the method of claim 1, wherein: the apparatus comprises a feature pushing system, a recall system, a ranking system and/or a re-ranking system, wherein the recall system acquires candidate information related to a user from a material library;
the ranking system processes the information provided by the recall system;
the re-ranking system processes the processing result of the ranking system;
the material library comprises bright news contents.
10. The apparatus of claim 9, wherein: the feature pushing system further comprises: a data dimension label subsystem, configured to process first user data obtained from social network software to obtain a feature description of the user.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111429883.4A CN114065054A (en) | 2021-11-29 | 2021-11-29 | Method and device for pushing information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111429883.4A CN114065054A (en) | 2021-11-29 | 2021-11-29 | Method and device for pushing information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114065054A (en) | 2022-02-18 |
Family
ID=80277021
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111429883.4A Pending CN114065054A (en) | 2021-11-29 | 2021-11-29 | Method and device for pushing information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114065054A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116385102A (en) * | 2023-03-15 | 2023-07-04 | 中电金信软件有限公司 | Information recommendation method, device, computer equipment and storage medium |
CN116385102B (en) * | 2023-03-15 | 2024-05-31 | 中电金信软件有限公司 | Information recommendation method, device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||