CN114065054A

CN114065054A - A method and device for pushing information

Info

Publication number: CN114065054A
Application number: CN202111429883.4A
Authority: CN
Inventors: 吴德超; 蔚钊; 王朋涛
Original assignee: Beijing Lida Zhisheng Technology Co ltd
Current assignee: Beijing Lida Zhisheng Technology Co ltd
Priority date: 2021-11-29
Filing date: 2021-11-29
Publication date: 2022-02-18
Anticipated expiration: 2041-11-29
Also published as: CN114065054B

Abstract

The invention provides a method and a device for pushing information, mainly comprising the following steps: a user initiates a request by pulling down to refresh the home page of a client; the client transmits the request to the server interface, also sends the user and the user context features to the server interface, and sends the user behavior log to a message queue; after receiving the request from the client, the server calls the feature pushing system interface and sends the features required by the feature pushing algorithm model to the feature pushing system; after receiving the request from the server, the recommendation system requests recommended content from the recall system; and after receiving the content sets returned by the recall system, the server ranks them.

Description

Method and device for pushing information

Technical Field

The invention relates to the field of information technology, in particular to a method and a device for determining suitable user feature information through data mining or big data, and more particularly to a method and a device for pushing information.

Background

Currently, the internet has become the main channel through which most people obtain information, and the amount of information on the internet has accordingly grown exponentially. In particular, with the popularization and development of mobile devices, information has become ever more abundant while users' attention has become increasingly fragmented, making it difficult to obtain useful information.

However, in the prior art, search engines rely on users entering keywords for active retrieval. On mobile internet terminals in particular, this gives a poor user experience, and it does not solve the problem of having the system automatically learn user interests and push content accordingly.

Therefore, providing users with valuable and interesting information centered on their points of interest is an urgent problem. Two of the more acute challenges are the cold start of the recommendation system and the continuous updating of recommended information. The cold-start problem is: when a platform acquires a new user, how can it recommend content of interest to that user when the platform lacks sufficient data? A common solution is to obtain, in a suitable manner, information from the user's various social networks, including but not limited to social accounts such as WeChat, QQ, Paibao, and the like, and to analyze that data to derive effective recommendations.

However, the content that users browse and publish on social networks is often messy and mixes many kinds of information. It also includes a large amount of new internet slang, such as "yyds", and many homophonic substitutions, which prior-art NLP networks often struggle to understand, so recommendation accuracy is greatly restricted.

The challenge of continuously updating recommendations is that, even after a user has accumulated some usage behavior on the platform and the platform has obtained more information, the user's interests are often better reflected on other social platforms that the user has used for longer, for example high-stickiness platforms such as WeChat. For some specialized platforms, relying only on their own data inevitably leads to deviation from the user's real interests.

In view of the above problems in the prior art, no good solution is available at present.

Disclosure of Invention

The invention discloses a method and a device for pushing information.

The invention discloses a device for pushing information, characterized in that: the device comprises a feature pushing system, a recall system, a ranking system and/or a reordering system, wherein the recall system acquires candidate information related to a user from a material library;

the ranking system processes the information provided by the recall system;

the reordering system processes the output of the ranking system;

the material library comprises bright news contents.

The feature pushing system further comprises a data dimension label subsystem, which processes first user data obtained from social network software to obtain a feature description of the user.

The data dimension label subsystem processes the first user data along specific dimensions, which include a time dimension, an interaction quantity dimension, a content dimension and/or a user dimension.

The time dimension further includes a publication time, a last reply time, and/or a last operation time.

The interaction quantity dimension further includes reading quantity, reply quantity, collection quantity, like quantity, reply like quantity, and/or sharing quantity.

The content dimension further includes a content length, an average reply length, and/or a number of pictures.

The user dimension further includes user interest, content liveness, and/or the reputation of posting/replying users, etc.

The interaction quantity dimension is determined by data weighting. Preferably, the weights differ: the weight of the reading quantity is usually the lowest, while the weights of the reply quantity, like quantity, collection quantity and sharing quantity are relatively higher and are determined according to the service scenario.
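As a minimal sketch of the weighting described above, the following Python function computes a weighted interaction score. The concrete weight values and the key names are assumptions chosen only for illustration; the text states only that the reading quantity has the lowest weight and that the other weights depend on the service scenario.

```python
from typing import Dict

# ASSUMED weights: only the ordering (reading lowest) comes from the text.
ASSUMED_WEIGHTS: Dict[str, float] = {
    "read": 0.1,
    "reply": 1.0,
    "like": 0.5,
    "collect": 1.0,
    "share": 1.5,
}


def interaction_score(counts: Dict[str, int],
                      weights: Dict[str, float] = ASSUMED_WEIGHTS) -> float:
    """Weighted sum of interaction counts; missing counts are treated as 0."""
    return sum(weight * counts.get(name, 0) for name, weight in weights.items())


if __name__ == "__main__":
    example = {"read": 1000, "reply": 12, "like": 40, "collect": 5, "share": 3}
    print(interaction_score(example))
```

In practice the weights would be tuned per service scenario rather than fixed as above.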

A method for pushing information, comprising the steps of:

step A1: a user initiates a request by pulling down to refresh the home page of a client;

step A2: the client transmits the request to the server interface, also sends the user and the user context features to the server interface, and sends the user behavior log to a message queue;

step A3: after receiving the request from the client, the server calls the feature pushing system interface and sends the features required by the feature pushing algorithm model to the feature pushing system;

step A4: after receiving the request from the server, the recommendation system requests recommended content from the recall system;

step A5: after receiving the content sets returned by the recall system, the server ranks them.

The user context characteristics include: user id, machine type, system, request time, request IP, location, etc.;

step A3 further includes:

step A3-1: storing the user behavior data from the subscription message database into a data warehouse, partitioned by specific time features;

step A3-2: looking up, from the wide table of the data warehouse, the user's daily activity features over a specific past period, and storing them in a first data structure;

step A3-3: performing a decayed accumulation through a user-defined function (UDF) on the number of the user's behaviors in each channel over the last 15 days to obtain a weight score;

the calculation formula of the UDF is: interest score = Σᵢ (number of behaviors on the i-th day) × aⁱ, where i indexes the days in the period; a usually ranges from 0.5 to 0.99 and represents how quickly the user's interest in the channel content decays, and the optimal value of a was finally determined to be 0.95 through multiple tests;

step A3-4: performing abnormal value processing on the weight scores, wherein a null weight value is set to 0 and a non-null weight value is rounded to two decimal places.

The wide table is a data table that summarizes user-side information, content-side information and context features, and is used for data preparation before training a machine learning model;

preferably, the user-side information includes the user ID, gender, age, mobile phone model, mobile phone system, most recently clicked content, exposed content, shared content, collected content, the interest scores for the content of different channels recorded in the portrait, and the like;

the portrait is a detailed description of the user generated by collecting the user's various behaviors in real time through a real-time computation program; the description includes the user's behavior footprint, from which the user's interest score for the content of each channel is calculated;

the content-side information includes the content ID, title, type, click count, exposure count (exposure means the content appears on the current screen of the mobile phone), share count, secondary share count, collection count, comment count, click count in WeChat Moments, click rate, yield rate (the number of off-site clicks divided by the exposure count), sharing rate, secondary share rate (the proportion of secondary sharing), average reading time, and the like;

the context features include the time of the user behavior, the channel the user is currently in, the position of the article on the current screen, the sliding direction of the user's finger, and the like;

preferably, step A3-4 further includes splicing the user's click, exposure, sharing, collection, click rate and sharing rate features over the past 1, 2, 3, ..., 15 days together and storing them as a single field;

the splicing in step A3-4 is done to avoid dimension explosion;

step A3-5: generating the day-level features of the content side, specifically: looking up the exposure, click, share, collection and comment counts of the content over the past 1, 3, 5 and 7 days from the data warehouse, and calculating a composite score through a second UDF, whose formula is as follows:

step A3-5-a: first, min-max normalize the click rate, propagation rate, yield rate, secondary share rate and sharing rate: normalized value = (current value - minimum) / (maximum - minimum);

step A3-5-b: calculate the composite normalized score = (5 × normalized propagation rate) + (40 × normalized yield rate) + (45 × normalized click rate) + (5 × normalized secondary share rate) + (5 × normalized sharing rate);

step A3-5-c: calculate the weight score of each content = (exposure count + click count + share count + WeChat Moments click count + secondary share count of the content) / exposure count;

step A3-5-d: finally, compute the composite score of the content from the normalized score and the weight score;

the features required by the feature push algorithm model include: request information, user-side information, and/or article-side information;

the subscription message database is preferably Kafka;

preferably, the storing into a data warehouse in step A3-1 is developed on the Alibaba Cloud DataWorks framework;

preferably, hours and days are selected as the specific time features;

preferably, the specific period is 10 to 20 days, and more preferably 15 days;

preferably, the daily activity features in step A3-2 are the daily activity features of each channel; channels are the categories under which different types of content are displayed in the APP to improve the user's reading experience, each category being called a channel.

Preferably, the activity features include the number of exposures, clicks, shares, comments, and collections;

preferably, the fields of the first data structure include: user id, channel id (e.g., american, health, hot), type of behavior (e.g., exposure, click, share), number of corresponding behaviors over the past 1 day, number of corresponding behaviors over the past 2 days, number of corresponding behaviors over the past 3 days, number of corresponding behaviors over the past 15 days;

step A3 further includes real-time feature generation, which further includes:

step A3-6: performing minute-level aggregation of article data, specifically: the real-time feature generation computing framework reads the user behavior log stream from the subscription message database, aggregates the exposure, click, share, collection, comment and other data of each content item per minute, and stores the result in a second database;

step A3-7: sampling and determining sample labels, specifically: in the sampling stage, the label of each sample is determined, and the features and the label together form a complete sample. Generally, content that was exposed to and clicked by the user is a positive sample, and content that was exposed but not clicked is a negative sample. Preferably, in negative sampling, in addition to exposed-but-unclicked content, the two content items immediately below the last clicked item in a request are also taken as negative samples, even if they were not exposed. After the label is determined, the features preceding the label are looked up in the database and the data warehouse by content ID and user ID and spliced into a complete sample.

Preferably, the aggregation result in step A3-6 has the form: content ID, behavior type (exposure, click, share, comment, or collection), number of behaviors in the minute (e.g., for 100 clicks in one minute, the value of this column is 100), statistics start time, statistics end time (identifying which minute of the content is counted); this result is stored in the database;

preferably, the database is HBase;

preferably, the real-time feature generation computing framework is Flink.

Aggregation means accumulating and summing the number of user behaviors on a content item within a given time;

based on the minute-level aggregated data in HBase, values such as the exposure, click, share, collection and comment counts, click rate and sharing rate of the content over the last 1, 2, 4, 8 and 12 hours can be derived; these data can be stored in a single HBase table so that samples drawn later can conveniently obtain the corresponding features.

The invention has the beneficial effects that:

based on the user portrait, the user's historical clicks are known, so content-based collaborative filtering (similar to the "beer and diapers" example) can be used to recommend "viewed-also-viewed" content; the portrait also contains the user's interest scores for each content category (calculated from the user's clicks, exposures, shares, comments, likes, collections and other behaviors), so the currently hottest content in the user's most-interested category can be recalled;

the portrait also shows which content has never been exposed to the user, which enables interest exploration (for example, if sports and entertainment have always been recommended to a user, science content can be pushed to test whether the user is interested; this expands the user's interests on the one hand and alleviates the long-tail narrowing problem on the other);

splitting the user portrait into a short-term portrait and a long-term portrait solves the problem that the portrait itself is too large to query and store.

Drawings

FIG. 1 is a typical feature mining implementation

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The first embodiment is as follows: evaluation platform and competitive product system dynamic integration

The invention discloses a method and a device for pushing information.

The invention discloses a device for pushing information, which is characterized in that: the apparatus includes a user tag library system, a recall system, a ranking system, and/or a reordering system, the recall system obtaining candidate information about a user from a material library;

the ranking system processes the information provided by the recall system;

the reordering system processes the output of the ranking system;

the material library comprises bright news contents.

The user tag library system further comprises a data dimension label subsystem, which processes first user data obtained from social network software to obtain a feature description of the user.
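As a rough illustration of how the recall, ranking and reordering systems hand results to one another, the following Python sketch chains three hypothetical functions. The item fields, the channel-matching recall rule, the scoring function and the per-channel cap used for reordering are all assumptions made for the example; the source does not specify them.

```python
from typing import Callable, Dict, Iterable, List, Set

Item = Dict[str, object]  # hypothetical record, e.g. {"id": 1, "channel": "health", "score": 0.9}


def recall(material_library: Iterable[Item], user_channels: Set[str]) -> List[Item]:
    """Recall stage (assumed rule): keep items from channels the user is interested in."""
    return [item for item in material_library if item["channel"] in user_channels]


def rank(candidates: List[Item], score_fn: Callable[[Item], float]) -> List[Item]:
    """Ranking stage: order the recalled candidates by a model score, highest first."""
    return sorted(candidates, key=score_fn, reverse=True)


def rerank(ranked: List[Item], max_per_channel: int = 2) -> List[Item]:
    """Reordering stage (assumed rule): cap the number of items per channel."""
    kept: List[Item] = []
    per_channel: Dict[object, int] = {}
    for item in ranked:
        channel = item["channel"]
        if per_channel.get(channel, 0) < max_per_channel:
            kept.append(item)
            per_channel[channel] = per_channel.get(channel, 0) + 1
    return kept


def push(material_library: Iterable[Item], user_channels: Set[str],
         score_fn: Callable[[Item], float]) -> List[Item]:
    """Run recall, then ranking, then reordering."""
    return rerank(rank(recall(material_library, user_channels), score_fn))
```

Each stage only consumes the previous stage's output, so any one of them can be swapped for a different strategy without touching the others.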

Example two: evaluating a literary artist or literary work of art

A method for pushing information, comprising the following steps:

step A2: the client transmits the request to a server interface and also sends the user context features to the server interface;

step A3: after receiving the request from the client, the server calls the recommendation system interface and sends the features required by the recommendation algorithm model to the recommendation system;

step A4: after receiving the request from the server, the recommendation system obtains, according to the user ID, the user portrait and/or the offline-trained user vector stored in a database and a list of other users with similar interests, and sends them respectively to the recall system;

step A5: after receiving the content sets respectively returned by the recall system, the server ranks them;

the features required for the recommendation algorithm model include: request information, user-side information, and/or article-side information;

preferably, the characteristics are as shown in the following table:

step A3 further includes:

step A3-2: looking up, from the wide table of the data warehouse, the activity features of each channel of the user over a specific past period, and storing them in a first data structure;

step A3-3: performing a decayed accumulation through a user-defined function (UDF) on the number of the user's behaviors in each category over the last 15 days to obtain a weight score;

the calculation formula of the UDF is: ag_score = Σᵢ (number of behaviors on the i-th day) × 0.95ⁱ;

step A3-4: performing abnormal value processing on the weight scores, wherein a null weight value is set to 0 and a non-null weight value is rounded to two decimal places;

preferably, step A3-4 further includes splicing the user's click, exposure, share, collection, ctr, sharing rate and other features over the past 1, 2, 3, ..., 15 days together and storing them as a single field;

the splicing in step A3-4 is done to avoid dimension explosion;

step A3-5: generating the day-level features of the content, specifically: looking up the exposure, click, share, collection and comment counts of the content over the past 1, 3, 5 and 7 days from the data warehouse, and calculating a composite score through a second UDF, whose formula is as follows:

step A3-5-a: first, min-max normalize CTR, propagation rate, CPM, secondary share rate and sharing rate: normalized value = (current value - minimum) / (maximum - minimum);

step A3-5-b: calculate the composite normalized score = (0 × normalized propagation rate) + (45 × normalized CPM) + (45 × normalized CTR) + (5 × normalized secondary share rate) + (5 × normalized sharing rate);

step A3-5-c: calculate the weight score of each content = (exposure UV + click UV + share UV + off-site UV + secondary share UV of the content) / exposure UV;

the subscription message database is preferably Kafka;

channels are the categories under which different types of content are displayed in the APP to improve the user's reading experience, each category being called a channel, for example a news channel, a makeup channel, a science and technology channel and the like in the APP;

step A3 further includes real-time feature generation, which further includes:

step A3-7: sampling and determining sample labels, specifically: the sampling stage determines the label of each sample, and the features and the label together form a complete sample. Content that the user was exposed to and clicked is generally a positive sample, and content that was exposed but not clicked is a negative sample; this also holds in the present business. In addition to exposed-but-unclicked negative samples, the two content items immediately below the last click in a request are also negative samples, even if they were not exposed. After the label is determined, the earlier features are looked up in HBase and the data warehouse by content ID and user ID and spliced into a complete sample.

Preferably, the aggregation result in step A3-6 has the form: content ID, behavior type (exposure, click, share, comment, or collection), number of behaviors in the minute (e.g., for 100 clicks in one minute, the value of this column is 100), statistics start time, statistics end time (identifying which minute of the content is counted); this result is stored in HBase;

based on the minute-level aggregated data in HBase, values such as the exposure, click, share, collection and comment counts, ctr and sharing rate of the content over the last 1, 2, 4, 8 and 12 hours can be derived; these data can be stored in a single HBase table so that samples drawn later can conveniently obtain the corresponding features.
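The labeling rule of step A3-7 can be sketched in Python as follows. This is an illustrative sketch only: the function name, the representation of a request as an ordered list of content IDs, and the assumption that every clicked item was also exposed are not taken from the source.

```python
from typing import Dict, List, Set


def label_samples(request_items: List[str], exposed: Set[str],
                  clicked: Set[str]) -> Dict[str, int]:
    """Return {content_id: label} for one request (1 = positive, 0 = negative).

    request_items holds the content IDs in the order they were returned.
    Clicked items are assumed to be a subset of exposed items.
    """
    labels: Dict[str, int] = {}

    # Exposed and clicked -> positive; exposed but not clicked -> negative.
    for content_id in request_items:
        if content_id in exposed:
            labels[content_id] = 1 if content_id in clicked else 0

    # The two items directly below the last click are negatives even if unexposed.
    click_positions = [i for i, cid in enumerate(request_items) if cid in clicked]
    if click_positions:
        last = click_positions[-1]
        for content_id in request_items[last + 1:last + 3]:
            labels.setdefault(content_id, 0)

    return labels
```

Items that were neither exposed nor within two positions below the last click receive no label and are left out of the sample set.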

Example three: evaluation of economic or physical objects

For the case where Kafka is selected as the subscription message database and the data warehouse is developed on the Alibaba Cloud DataWorks framework, the scheme has the following preferred embodiment. It should be noted that this embodiment is merely a preferred implementation of the present invention, and other similar implementations are all within the scope of the present invention.

The specific embodiment is as follows.

1. User behavior at the client is reported through event-tracking instrumentation to a topic on the open-source stream-processing platform Kafka. On the one hand, the data in Kafka lands in the data warehouse in hourly units; on the other hand, a Flink job subscribes to and consumes it to generate real-time features.

2. User and content day-level data

(1) The user behavior data in Kafka lands in the data warehouse through the Alibaba Cloud DataWorks framework and is partitioned by hour and day. First, the numbers of exposures, clicks, shares, comments and collections for each channel (tag_id) of the user over the past 15 days are looked up from the wide table of the data warehouse; the result has the following form:

The fields are: user ID, channel ID (e.g., american, health, hot), behavior type (e.g., exposure, click, share), number of corresponding behaviors over the past 1 day, over the past 2 days, over the past 3 days, ..., over the past 15 days.

(2) A weight score is calculated by decayed accumulation through a user-defined function (UDF) from the number of the user's behaviors in each category over the last 15 days. The UDF formula is: ag_score = Σᵢ (number of behaviors on the i-th day) × 0.95ⁱ

The result is the form:

each column represents: user ID, channel ID, behavior type, weight score;

Through this calculation, the user's weight score for the content of each category is obtained. Some abnormal values are then lightly processed: null is set to 0, and non-null values are rounded to two decimal places. The user's click, exposure, sharing, collection, ctr, sharing rate and other features over the past 1, 2, 3, ..., 15 days are spliced together and stored as a single field in order to avoid dimension explosion. Finally, some static attributes of the user are spliced in.
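The decayed accumulation, the null handling and the splicing into one field can be sketched in Python as below. This is a sketch under stated assumptions rather than the actual UDF: the function names are hypothetical, the day index i is assumed to start at 1 for the most recent day, and the `prefix_channel:value` string layout is inferred from the example values such as `ctr_20008:0.009849`.

```python
from typing import Dict, Optional, Sequence


def decayed_score(daily_counts: Sequence[Optional[int]], a: float = 0.95) -> float:
    """Sum of count_i * a**i, with i = 1 for the most recent day (assumed indexing).

    A missing (None) daily count contributes 0.
    """
    return sum((count or 0) * a ** i for i, count in enumerate(daily_counts, start=1))


def clean_score(score: Optional[float]) -> float:
    """Abnormal value handling: null becomes 0, otherwise round to two decimals."""
    return 0.0 if score is None else round(score, 2)


def splice(prefix: str, values: Dict[str, float]) -> str:
    """Join per-channel values into one string field to avoid one column per channel."""
    return ",".join(f"{prefix}_{channel}:{value}" for channel, value in values.items())


if __name__ == "__main__":
    raw = decayed_score([10, 0, 5])          # 10*0.95 + 0 + 5*0.95**3
    print(clean_score(raw))
    print(splice("ctr", {"20008": 0.009849, "20030": 0.020207}))
```

Storing the per-channel values as one delimited string keeps the wide table at a fixed number of columns even as channels are added.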

Preferably, the finally generated user features can be expressed in the following form. Each row lists the field name, data type, description, feature side and an example value; cells that are not available are left empty:

+-------------------------------------------------------------------+
| user_pregnant | bigint | pregnancy | user | |
| user_os | string | mobile phone system | user | iOS |
| user_ctr_video | double | video ctr | user | 0.15894039735099338 |
| | | user share number | user | |
| user_click | bigint | user click number | user | |
| video_user_srt | double | video sharing rate | user | 0.29449495366961065 |
| user_srt | | user sharing rate | user | |
| user_read_nums | | user reading number | user | |
| video_user_click | bigint | user video click number | user | 141 |
| | | | | contentshow_20006:11,contentshow_20007:27,contentshow_20008:20,contentshow_20011:1,contentshow_20013:3,contentshow_20014:2,contentshow_20016:2,contentshow_20017:4,contentshow_20023:3,contentshow_20026:2,contentshow_20030:57,contentshow_20058:4,contentshow_20059:9,contentshow_20066:2,contentshow_20076:3,contentshow_20079:1,contentshow_20080:1,contentshow_20089:2,contentshow_20093:5,contentshow_20102:1,contentshow_20116:1,contentshow_20144:39,contentshow_20605:1 |
| | | | | contentclick_20006:1,contentclick_20007:3,contentclick_20008:9,contentclick_20013:2,contentclick_20014:1,contentclick_20016:2,contentclick_20030:10,contentclick_20059:2,contentclick_20076:2,contentclick_20079:5,contentclick_20080:1,contentclick_20093:1,contentclick_20144:3 |
| | | sharing of user to channel | | share_20008:1,share_20030:2 |
| | string | ctr of user to channel | user | ctr_20008:0.009849,ctr_20030:0.020207,ctr_20059:0.230724,ctr_20080:0.094531,ctr_20094:0.206549,ctr_20102:0.094531,ctr_20104:0.045587,ctr_20116:0.206549 |
| | | share rate of user to channel | user | sharerate_20008:0.003159,sharerate_20030:0.010821 |
| | | user click video frequency | user | 94 |
+-------------------------------------------------------------------+

(3) Generating day-level features of content

The exposure, click, share, collection and comment counts of the content over the past 1, 3, 5 and 7 days are looked up from the data warehouse, and a composite score is calculated through a UDF as follows:

a. First, min-max normalize CTR, propagation rate, CPM, secondary share rate and sharing rate respectively: normalized value = (current value - minimum) / (maximum - minimum)

b. Calculate the composite normalized score = (0 × propagation rate) + (45 × CPM) + (45 × CTR) + (5 × secondary share rate) + (5 × sharing rate)

c. Calculate the weight score of each content = (exposure UV + click UV + share UV + off-site UV + secondary share UV of that content) / exposure UV

d. Finally, the composite score of the content is computed from the weight score and the normalized score
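Steps a to d can be sketched in Python as follows. The function and key names are hypothetical. Two points are assumptions and not stated in the source: a metric whose maximum equals its minimum is normalized to 0, and step d combines the two scores by multiplication (the operator is not legible in the source).

```python
from typing import Dict

# Coefficients from step b; the key names are illustrative.
COEFFICIENTS: Dict[str, float] = {
    "propagation_rate": 0,
    "cpm": 45,
    "ctr": 45,
    "secondary_share_rate": 5,
    "sharing_rate": 5,
}


def min_max(value: float, minimum: float, maximum: float) -> float:
    """Step a: (current - minimum) / (maximum - minimum); 0 if the range is empty (assumed)."""
    if maximum == minimum:
        return 0.0
    return (value - minimum) / (maximum - minimum)


def normalized_score(normalized: Dict[str, float]) -> float:
    """Step b: weighted sum of the already-normalized metrics."""
    return sum(coef * normalized[name] for name, coef in COEFFICIENTS.items())


def weight_score(exposure_uv: int, click_uv: int, share_uv: int,
                 offsite_uv: int, secondary_share_uv: int) -> float:
    """Step c: total UV across behaviors divided by exposure UV."""
    if exposure_uv == 0:
        return 0.0  # assumed guard against division by zero
    total = exposure_uv + click_uv + share_uv + offsite_uv + secondary_share_uv
    return total / exposure_uv


def composite_score(weight: float, normalized: float) -> float:
    """Step d. ASSUMPTION: the two scores are multiplied."""
    return weight * normalized
```

With all five normalized metrics equal to 1, the normalized score is 100, so the coefficients can be read as percentage weights.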

Then, splicing some static attributes of the content by a processing method similar to the user characteristics to finally generate the user characteristics; one preferred mode is as follows:

+-------------------------------------------------------------------+

+-------------------------------------------------------------------+

article ID item 34021769

Article title | item | gold view | blowing wheat wave |

The | absstract | string | | | article describes | item | south-Henan journal client reporter have an shine of 6 months and 2 days, and in a wheat field in the ninth division of farm in the yellow pan of Xihua county, a farmer operates a large-scale harvester to harvest wheat. In the three summers, the China and the China are in the same order, the wheat field in the middle and the original land is golden yellow, and the wheat harvest on the farm in the yellow pan meets the war and enters the peak. In the wheat field, the machine sound is loud, the smoke and dust fly, and the scene is rich and harvested.

Keywords | item | wheat field, 6 months, 2 days, wheat, harvest, yellow pan, farm, 10 ten thousand

Type | bigint | content type | item |1

Update time bigint update time item 1591138870

Whether is _ timeiiness | bigint | is effective | item |3

Cumulative read | item |78

Duration | bigint | duration | item |0 of video or audio

|area_id|bigint|||citycode|user|843

| item _ contentshow | string | PV and UV | item exposure for 1, 3, 5,7, 15 days past

|pv_15d_nums:468,uv_15d_nums:213,pv_7d_nums:221,uv_7d_nums:99,pv_5d_nums:147,uv_5d_nums:68,pv_3d_nums:83,uv_3d_nums:42,pv_1d_nums:48,uv_1d_nums:20

I item _ content click string past 1, 3, 5,7, 15 days PV and UV item

|pv_15d_nums:63,uv_15d_nums:45,pv_7d_nums:40,uv_7d_nums:30,pv_5d_nums:24,uv_5d_nums:20,pv_3d_nums:17,uv_3d_nums:14,pv_1d_nums:4,uv_1d_nums:3

| item _ share | string | sharing PV and UV | item 1, 3, 5,7, 15 days in the past

I item _ favorite string previous 1, 3, 5,7, 15 days to collect PV and UV I item

Video playing PV and UV item of 1, 3, 5,7, 15 days past

|pv_15d_nums:5,uv_15d_nums:3,pv_7d_nums:2,uv_7d_nums:1,pv_5d_nums:2,uv_5d_nums:1

Complete string PV and UV item of video over 1, 3, 5,7, 15 days past

Postcontent | string | past 1, 3, 5,7, 15 days video comments PV and UV | item

|pv_15d_nums:3,uv_15d_nums:3,pv_7d_nums:3,uv_7d_nums:3,pv_5d_nums:3,uv_5d_nums:3,pv_3d_nums:3,uv_3d_nums:3,pv_1d_nums:3,uv_1d_nums:3

[ item _ ctr ] string ] past 1, 3, 5,7, 15 day click rate PV and UV [ item | ]

|pv_15d_nums:0.1493,uv_15d_nums:0.3004,pv_7d_nums:0.1493,uv_7d_nums:0.3004,pv_5d_nums:0.1493,uv_5d_nums:0.3004,pv_3d_nums:0.1493,uv_3d_nums:0.3004,pv_1d_nums:0.1493,uv_1d_nums:0.3004

Share _ show | string | share/expose PV and UV | item 1, 3, 5,7, 15 days past

|pv_15d_nums:0.0052,uv_15d_nums:0.0107,pv_7d_nums:0.0052,uv_7d_nums:0.0107,pv_5d_nums:0.0052,uv_5d_nums:0.0107,pv_3d_nums:0.0052,uv_3d_nums:0.0107,pv_1d_nums:0.0052,uv_1d_nums:0.0107

Favorite _ show | string | Collection/Exposure PV and UV | item 1, 3, 5,7, 15 days past

|pv_15d_nums:0.0002,uv_15d_nums:0.0006,pv_7d_nums:0.0002,uv_7d_nums:0.0006,pv_5d_nums:0.0002,uv_5d_nums:0.0006,pv_3d_nums:0.0002,uv_3d_nums:0.0006,pv_1d_nums:0.0002,uv_1d_nums:0.0006

Comment _ show | string | comment/expose PV and UV | item 1, 3, 5,7, 15 days past

|pv_15d_nums:0.0001,uv_15d_nums:0.0002,pv_7d_nums:0.0001,uv_7d_nums:0.0002,pv_5d_nums:0.0001,uv_5d_nums:0.0002,pv_3d_nums:0.0001,uv_3d_nums:0.0002,pv_1d_nums:0.0001,uv_1d_nums:0.0002

Share _ click | string | share/click PV and UV | item over 1, 3, 5,7, 15 days

|pv_15d_nums:0.0346,uv_15d_nums:0.0356,pv_7d_nums:0.0346,uv_7d_nums:0.0356,pv_5d_nums:0.0346,uv_5d_nums:0.0356,pv_3d_nums:0.0346,uv_3d_nums:0.0356,pv_1d_nums:0.0346,uv_1d_nums:0.0356

| vplay _ click | string | play/click PV and UV | item 1, 3, 5,7, 15 days in the past

Favorite _ click string I past 1, 3, 5,7, 15 days Collection/click PV and UV item

|pv_15d_nums:0.0017,uv_15d_nums:0.0021,pv_7d_nums:0.0017,uv_7d_nums:0.0021,pv_5d_nums:0.0017,uv_5d_nums:0.0021,pv_3d_nums:0.0017,uv_3d_nums:0.0021,pv_1d_nums:0.0017,uv_1d_nums:0.0021

| comment_click | string | comment/click (PV and UV) over the past 1, 3, 5, 7, 15 days | item |

|pv_15d_nums:0.0006,uv_15d_nums:0.0008,pv_7d_nums:0.0006,uv_7d_nums:0.0008,pv_5d_nums:0.0006,uv_5d_nums:0.0008,pv_3d_nums:0.0006,uv_3d_nums:0.0008,pv_1d_nums:0.0006,uv_1d_nums:0.0008
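The string-typed features above pack several window statistics into one comma-separated list of `key:value` pairs. The following is a minimal sketch of how such a value could be unpacked before model training; the function name `parse_window_stats` is hypothetical and is not part of the described system.

```python
from typing import Dict


def parse_window_stats(raw: str) -> Dict[str, float]:
    """Parse a value such as '|pv_15d_nums:5,uv_15d_nums:3' into a dict.

    Assumption: pairs are comma-separated, key and value are separated by
    a colon, and an optional leading '|' is table residue to be dropped.
    """
    stats: Dict[str, float] = {}
    for pair in raw.strip().lstrip("|").split(","):
        if not pair:
            continue  # tolerate empty input or trailing commas
        key, _, value = pair.partition(":")
        stats[key.strip()] = float(value)
    return stats


example = "|pv_15d_nums:0.1493,uv_15d_nums:0.3004,pv_7d_nums:0.1493,uv_7d_nums:0.3004"
stats = parse_window_stats(example)
```

Each key then becomes one numeric column of the sample, for example `stats["uv_7d_nums"]`.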

3. Generation of real-time features

(1) Computing framework: Flink is selected as the real-time computing framework;

Flink is preferred for the following reasons. Real-time frameworks include Storm, Spark, and Flink. Storm can only process streaming data and has no batch-processing capability. Flink also provides many high-level APIs: for example, Flink's DataStream API offers operators such as Map, GroupBy, Window, and Join, which take the place of the bolts that a programmer would otherwise have to write by hand in Storm to implement the same functions. Compared with Spark, Flink has higher throughput and lower latency; in addition, Flink supports millisecond-level computation whereas Spark supports only second-level computation, and Flink performs true stream computing;

(2) Code logic: minute-level aggregation of content data is performed first. The user behavior log stream stored in Kafka serves as the data input of the Flink real-time computing framework. Within Flink, the exposure, click, share, favorite, and comment events of each content item are aggregated over one-minute windows. Each aggregation result has the following form: content ID, behavior type (exposure, click, share, comment, or favorite), number of behaviors in the minute (e.g., for 100 clicks in one minute, the value of this column is 100), statistics start time, and statistics end time (identifying which minute the statistics cover). The result is stored in HBase.
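The minute-level aggregation can be illustrated with a minimal sketch in plain Python. It is not Flink code: it only mimics a one-minute tumbling window over an in-memory list of events. The `Event` type, its field names, and `aggregate_per_minute` are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, NamedTuple


class Event(NamedTuple):
    content_id: str
    behavior: str  # "exposure", "click", "share", "favorite", or "comment"
    ts: int        # event time in Unix seconds (assumed field)


def aggregate_per_minute(events: Iterable[Event]) -> List[Dict]:
    """Count behaviors per (content ID, behavior type, one-minute window)."""
    counts: Counter = Counter()
    for e in events:
        start = e.ts - e.ts % 60  # align to the start of the minute
        counts[(e.content_id, e.behavior, start)] += 1
    return [
        {"content_id": cid, "behavior": b, "count": n,
         "start": start, "end": start + 60}
        for (cid, b, start), n in sorted(counts.items())
    ]


events = [
    Event("34021769", "click", 1591138870),
    Event("34021769", "click", 1591138899),
    Event("34021769", "exposure", 1591138925),
]
rows = aggregate_per_minute(events)
```

Each element of `rows` corresponds to one record of the form described above (content ID, behavior type, count, start time, end time) that would be written to HBase.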

Based on the minute-level aggregated data in HBase, values such as the exposure, click, share, favorite, and comment counts, the CTR, and the sharing rate of a content item over the most recent 1, 2, 4, 8, and 12 hours can be derived. These values can be stored in a single HBase table, so that the corresponding features can be conveniently retrieved for each sample after the later sampling step.
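A minimal sketch of this roll-up, assuming the minute-level rows are available as dictionaries with `content_id`, `behavior`, `count`, and `start` keys (a hypothetical in-memory stand-in for the HBase table). Only counts and CTR are computed here; CTR is taken as clicks divided by exposures.

```python
from collections import Counter
from typing import Dict, Iterable, List


def rollup(minute_rows: List[Dict], content_id: str, now: int,
           hours: Iterable[int] = (1, 2, 4, 8, 12)) -> Dict[str, Dict]:
    """Sum minute-level counts into trailing windows ending at `now`."""
    features: Dict[str, Dict] = {}
    for h in hours:
        window_start = now - h * 3600
        totals: Counter = Counter()
        for row in minute_rows:
            if row["content_id"] == content_id and window_start <= row["start"] < now:
                totals[row["behavior"]] += row["count"]
        exposure = totals["exposure"]
        features[f"{h}h"] = {
            "exposure": exposure,
            "click": totals["click"],
            # Guard against division by zero when nothing was exposed.
            "ctr": totals["click"] / exposure if exposure else 0.0,
        }
    return features


now = 1_600_000_000
minute_rows = [
    {"content_id": "a", "behavior": "exposure", "count": 100, "start": now - 1800},
    {"content_id": "a", "behavior": "click", "count": 10, "start": now - 1800},
    {"content_id": "a", "behavior": "exposure", "count": 100, "start": now - 3 * 3600},
    {"content_id": "a", "behavior": "click", "count": 30, "start": now - 3 * 3600},
]
features = rollup(minute_rows, "a", now)
```

In this example the rows from three hours ago fall outside the 1-hour and 2-hour windows but inside the 4-, 8-, and 12-hour windows.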

The reasons for choosing HBase are as follows. Common databases include HBase, Hive, Redis, and MySQL. MySQL is a relational database; the service of the invention involves a large data volume and requires linear scalability, automated operation and maintenance, and a simple application model, which MySQL clearly cannot satisfy. Redis is functionally similar to HBase in that it is a key-value NoSQL database, but Redis is better suited to serving as a cache; the service of the invention requires that data not be lost, which Redis cannot meet, and Redis costs much more to use than HBase. Hive is a data warehouse tool whose underlying layer is MapReduce, and it cannot be used for interactive, user-facing storage. HBase offers large capacity, columnar storage, strong scalability, and high reliability. HBase can serve real-time computation mainly because of its underlying architecture and data structures, namely LSM-Tree + HTable + Cache, which give it a clear advantage over other database systems: the client can directly locate the HRegionServer holding the data to be queried and then search for the matching data in a region on that server, and part of this data is held in the cache.

In addition, HBase storage is more economical than databases such as MySQL and Redis. For example, on Alibaba Cloud (Aliyun), an 8-core, 32 GB HBase cluster core node with 1800 GB of disk costs only about 1,000 yuan per month, MySQL with the same configuration costs about 3,000 yuan, and the cache database Redis is more expensive still.

(3) Sampling: the sampling stage determines the label of each sample; the features and the label together constitute a complete sample. Generally, content that is exposed to and clicked by the user is a positive sample, and content that is exposed but not clicked is a negative sample. The same holds in our service, except that, besides the exposed-but-not-clicked negatives, the two content items below the last click in a request are also taken as negative samples even if they were not exposed. After the label is determined, the previously generated features are looked up in HBase and the data warehouse by content ID and user ID and spliced into a complete sample. The final features are as follows:

+-------------------------------------------------------------------+

| field | type | description | side | example |

+-------------------------------------------------------------------+

| | | article ID | item | 34021769 |

| | | article title | item | gold view | blowing wheat wave |

| type | bigint | content type | item | 1 |

| update_time | bigint | update time | item | 1591138870 |

| is_timeliness | bigint | whether effective | item | 3 |

| | | cumulative reads | item | 78 |

| duration | bigint | duration of video or audio | item | 0 |

| area_id | bigint | citycode | user | 843 |

| item_contentclick | string | click PV and UV over the past 1, 3, 5, 7, 15 days | item | |

| | | video play PV and UV over the past 1, 3, 5, 7, 15 days | item | |

| Complete | string | video completion PV and UV over the past 1, 3, 5, 7, 15 days | item | |

| Postcontent | string | video comment PV and UV over the past 1, 3, 5, 7, 15 days | item | |

| item_ctr | string | click rate (PV and UV) over the past 1, 3, 5, 7, 15 days | item | |

| share_show | string | share/exposure (PV and UV) over the past 1, 3, 5, 7, 15 days | item | |

| share_click | string | share/click (PV and UV) over the past 1, 3, 5, 7, 15 days | item | |

| uid | string | user ID | user | 25311456 |

| requestid | string | request ID | user | 501359071604462886836102 |

| Perssessessence | string | click sequence | user | 33732005 |

| workday | string | workday | user | 1 |

| user_pregnant | bigint | pregnancy | user | ventilation |

| user_os | string | mobile phone system | user | iOS |

| user_ctr_video | double | video CTR | user | 0.15894039735099338 |

| | | user share count | user | |

| user_click | bigint | user click count | user | |

| video_user_srt | double | video sharing rate | user | 0.29449495366961065 |

| user_srt | | user sharing rate | user | |

| user_read_nums | | user reading count | user | |

| video_user_click | bigint | user video click count | user | 141 |

|contentclick_20006:1,contentclick_20007:3,contentclick_20008:9,contentclick_20013:2,contentclick_20014:1,contentclick_20016:2,contentclick_20030:10,contentclick_20059:2,contentclick_20076:2,contentclick_20079:5,contentclick_20080:1,contentclick_20093:1,contentclick_20144:3

| | | user shares per channel | | share_20008:1,share_20030:2 |

| ctr | string | user CTR per channel | user | |

| | | user video click count | user | 94 |

| user_show_video | string | number of videos exposed to the user | user | 806 |

| id | string | item_id | item | 3746 |

| click_offset | bigint | click position | item | 12 |

| is_time | bigint | whether effective | item | 0 |

| status | bigint | article state | item | 1 |

| is_hot | double | whether hot | item | |

| is_original | double | whether original | item | |

| | | video share count | item | 7 |

| video_srt | double | video sharing rate | item | 0.09347644411829928 |

| user_tag_ctr | double | user CTR for the channel | item | |

| wxcontenctclick_hour | double | hourly off-site click count | item | |

| facade_hour | | hourly favorite count | item | |

| | | 12-hour CTR | item | 0.09453121 |

| exitcontentdetail_12hour | double | 12-hour stay time | item | |

| wxcontenctclick_12hour | double | 12-hour off-site click count | item | |

| wxsharetiwice_12hour | double | 12-hour off-site share count | item | |

| age | string | age | user | over 45 |

| Rivalry | string | province | user | Shandong province |

| city | | city | user | |

| active_days | bigint | active days | user | 698 |

| flag | bigint | whether logged in via a third party or a mobile phone number | user | 1 |

| last_location_time | bigint | last login time | user | 1604332800 |

| | | time of the last notification clearing | user | 0 |

| class_region_type | bigint | last login platform (1: ios; 2: android) | user | 2 |

| | | registration platform (1: ios; 2: android; 3: pc) | user | 2 |

| source_id | string | source ID | user | c1006 |

| device_type | string | device model | user | OPPO PBBT00 |

cs_20007:1551.01,cs_20086:0.54,cs_20087:34.97,cs_20097:3.41,cs_20108:29.93,cs_20128:0.95,cs_20144:68.29,cs_20006:328.61,cs_20058:118.86,cs_20069:14.37,cs_20080:192.07,cs_20093:64.22,cs_20110:5.04,cs_20604:2.44,cs_20017:15.14,cs_20056:15.76,cs_20059:244.21,cs_20066:202.37,cs_20102:171.87,cs_20104:23.07,cs_20109:65.21,cs_20605:5.78,cs_20011:237.72,cs_20014:139.36,cs_20016:162.04,cs_20021:2.81,cs_20022:13.16,cs_20023:89.82,cs_20026:135.82,cs_20030:3524.74,cs_20076:28.03,cs_20083:16.63,cs_20094:9.97,cs_20095:1.08,cs_20116:13.88,cs_20603:8.37,cs_20008:3105.84,cs_20010:25.01,cs_20013:80.84,cs_20015:362.75,cs_20078:2.56,cs_20079:210.8,cs_20103:4.25,cs_20589:0.57

| | | click categories and scores over 15 days | | |

|cc_20017:1.55,cc_20056:1.03,cc_20059:44.27,cc_20066:28.14,cc_20102:90.49,cc_20104:1.61,cc_20109:27.78,cc_20006:29.73,cc_20058:9.24,cc_20080:65.87,cc_20093:34.91,cc_20110:0.86,cc_20007:263.99,cc_20087:15.68,cc_20108:21.55,cc_20144:31.77,cc_20011:69.81,cc_20014:5.46,cc_20016:26.26,cc_20022:0.63,cc_20023:2.83,cc_20026:15.17,cc_20030:353.16,cc_20083:6.67,cc_20094:7.22,cc_20116:4.25,cc_20603:3.1,cc_20008:564.64,cc_20010:2.71,cc_20013:2.42,cc_20015:28.1,cc_20078:1.99,cc_20079:69.84,cc_20103:6.29

| hist_share | string | share categories and scores over 15 days | | |

|ss_20011:10.2,ss_20016:3.01,ss_20026:4.54,ss_20030:75.36,ss_20094:0.95,ss_20603:0.77,ss_20008:141.08,ss_20015:0.49,ss_20079:12.57,ss_20059:9.26,ss_20066:3.95,ss_20102:6.08,ss_20109:2.1,ss_20007:60.41,ss_20087:0.57,ss_20144:4.61,ss_20058:1.3,ss_20080:6.34,ss_20093:5.14

| hist_favorite | string | favorite categories and scores over 15 days | | fs_20102:1.63,fs_20079:0.54 |

| total_score | string | total category scores over 15 days | user | |

|ts_20017:1.55,ts_20056:1.03,ts_20059:72.05000000000001,ts_20066:39.99,ts_20102:121.76999999999998,ts_20104:1.61,ts_20109:34.08,ts_20006:29.73,ts_20058:13.14,ts_20080:84.89,ts_20093:50.33,ts_20110:0.86,ts_20007:445.22,ts_20087:17.39,ts_20108:21.55,ts_20144:45.6,ts_20008:987.88,ts_20010:2.71,ts_20013:2.42,ts_20015:29.57,ts_20078:1.99,ts_20079:111.87,ts_20103:6.29,ts_20011:100.41,ts_20014:5.46,ts_20016:35.29,ts_20022:0.63,ts_20023:2.83,ts_20026:28.79,ts_20030:579.24,ts_20083:6.67,ts_20094:10.07,ts_20116:4.25,ts_20603:5.41

|7d_ctr_20007:0.25,7d_ctr_20008:0.2,7d_ctr_20014:0.08,7d_ctr_20030:0.23,7d_ctr_20013:0.29,7d_ctr_20066:0.25,7d_ctr_20016:0.04,7d_ctr_20109:0.5,7d_ctr_20006:0.17,7d_ctr_20011:0.33,7d_ctr_20026:0.33,7d_ctr_20058:0.18,7d_ctr_20059:0.25

| | | 7-day category favorite/click rate | | 7d_ctr_20007:0.25,7d_ctr_20008:0.2,7d_ctr_20011:0.33,7d_ctr_20058:0.18 |

|20030:54.736000,20008:45.776000,20007:32.544000,20011:17.100000,20014:4.884000,20058:4.332000,20013:3.504000,20016:3.156000,20015:3.000000,20066:2.568000

| pv_1d_nums | double | reading count over 1 day | item | 1 |

| pv_15d_nums_click | double | 15-day PV/click | item | 1.0 |

| pv_3d_nums_click | | 3-day PV/click | item | 1.0 |

| showtag | string | display tag | | |
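The labeling rule of the sampling stage (clicked content is positive, exposed-but-not-clicked content is negative, and the two items below the last click are negative even if not exposed) can be sketched as follows. The function `label_request` and its argument names are hypothetical; the sketch assumes the request's content IDs are available in ranked order.

```python
from typing import Iterable, List, Tuple


def label_request(ranked_ids: List[str], exposed_ids: Iterable[str],
                  clicked_ids: Iterable[str],
                  extra_negatives: int = 2) -> List[Tuple[str, int]]:
    """Return (content_id, label) pairs for one request; 1 = positive, 0 = negative."""
    exposed, clicked = set(exposed_ids), set(clicked_ids)
    click_positions = [i for i, cid in enumerate(ranked_ids) if cid in clicked]
    last_click = click_positions[-1] if click_positions else None

    samples: List[Tuple[str, int]] = []
    for i, cid in enumerate(ranked_ids):
        if cid in clicked:
            samples.append((cid, 1))   # exposed and clicked
        elif cid in exposed:
            samples.append((cid, 0))   # exposed but not clicked
        elif last_click is not None and last_click < i <= last_click + extra_negatives:
            samples.append((cid, 0))   # just below the last click, not exposed
    return samples


samples = label_request(
    ranked_ids=["a", "b", "c", "d", "e", "f"],
    exposed_ids=["a", "b", "c"],
    clicked_ids=["c"],
)
```

Each labeled pair would then be joined with the user-side and item-side features listed above, looked up by user ID and content ID, to form one training sample.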

The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present invention, and these should be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A method for pushing information, characterized in that the method comprises the following steps:

The user context characteristics include: user id, model, system, request time, request IP, location, etc.

2. The method of claim 1, wherein: the step A3 further includes:

3. The method of claim 2, wherein: the calculation formula of the UDF function is as follows: $\text{interest score} = \sum_{i} (\text{number of behaviors on day } i) \times a^{i}$, where a is in the range 0.5-0.99 and represents the decay rate of the user's interest in the channel content.

4. The method of claim 2, wherein: the step A3 further includes a step A3-5, which specifically includes:

generating the day-level characteristics of the content side, specifically comprising: finding out the exposure, click, share, collection and comment numbers of the content in the past 1, 3, 5 and 7 days from the data warehouse, and calculating a comprehensive score through a second UDF function, wherein the formula of the second UDF function is as follows:

Step A3-5-d: finally, the composite score of the content is the weighted score.

5. The method of claim 2, wherein: the step A3 further includes real-time feature generation, further including:

6. The method of claim 2, wherein: the wide table is a data table summarizing user side information, content side information and context characteristics and is used for preparation before training of the machine learning model.

7. The method of claim 5, wherein: the aggregation result in step A3-6 has the following form: content ID, behavior type (exposure, click, share, comment, or favorite), number of behaviors in one minute (e.g., for 100 clicks in one minute, the value of this column is 100), statistics start time, and statistics end time (identifying which minute the statistics cover); the result is stored in the database.

8. The method of claim 5, wherein: HBase is selected as the database; and values such as the exposure, click, share, favorite, and comment counts, the click rate, and the share rate of the content over the most recent 1, 2, 4, 8, and 12 hours can be derived from the minute-level aggregated data in HBase.

9. An apparatus for pushing information using the method of claim 1, wherein: the apparatus comprises a feature pushing system, a recall system, a ranking system and/or a re-ranking system, wherein the recall system acquires candidate information related to a user from a material library;

the ranking system processes the information provided by the recall system;

the re-ranking system processes the output of the ranking system;

the material library comprises bright news contents.

10. The apparatus of claim 9, wherein: the feature pushing system further comprises a data dimension label subsystem, configured to process first user data obtained from social network software to obtain a feature description of the user.