CN112100432A

CN112100432A - Sample data acquisition method, feature extraction method, processing device and storage medium

Info

Publication number: CN112100432A
Application number: CN202010981752.6A
Authority: CN
Inventors: 陈强
Original assignee: China Mobile Communications Group Co Ltd; MIGU Culture Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; MIGU Culture Technology Co Ltd
Priority date: 2020-09-17
Filing date: 2020-09-17
Publication date: 2020-12-18
Anticipated expiration: 2040-09-17
Also published as: CN112100432B

Abstract

The embodiment of the invention relates to the technical field of data processing and discloses a sample data acquisition method. The method separately processes the user's long-term interest preference and the user's short-term attention to instant hotspots to obtain a long-term preference tag and an instant hotspot tag; divides the content tags consumed by the user into positive sample tags and negative sample tags according to the user's consumption integrity; and divides the positive sample tags, the negative sample tags, the long-term preference tag and the instant hotspot tag into positive sample data and negative sample data, which serve as sample data for extracting the features of the user. The invention also provides a feature extraction method, a processing device and a storage medium. Using the positive sample data and negative sample data obtained by the sample data acquisition method as sample data for extracting user features makes it possible to acquire user features that accurately represent the interest preference of the user.

Description

Sample data acquisition method, feature extraction method, processing device and storage medium

Technical Field

The embodiment of the invention relates to the technical field of data processing, in particular to a sample data acquisition method, a feature extraction method, a processing device and a storage medium.

Background

In order to find the user's points of interest, consumption data covering different time spans of the audio, video and reading content consumed by the user are used to calculate changes in the user's points of interest and to discover new ones. The points of interest are described with a plurality of tags, and their changes are described with a tag-value attenuation method.

However, the inventor has found that the prior art directly uses the tags of the content consumed by the user as sample data, so that the interest tags obtained for the user are simply equal to the tags of the consumed content and cannot accurately represent the interest preference of the user.

Disclosure of Invention

The embodiment of the invention aims to provide a sample data acquisition method, a feature extraction method, a processing device and a storage medium.

In order to solve the above technical problem, an embodiment of the present invention provides a sample data obtaining method, including: acquiring a content tag of content to be consumed, consumption content data of the content to be consumed by a user and consumption integrity of each content to be consumed by the user; determining a long-term preference tag for the user based on a frequency of occurrence of the content tag in the consumed content data; determining an instant hot spot label of the user according to the consumed data of the content label and the attention of the user to the content label; dividing the content labels of the content to be consumed into positive sample labels and negative sample labels according to the consumption integrity; and dividing the positive sample label, the negative sample label, the long-term preference label and the instant hotspot label into positive sample data and negative sample data to be used as sample data for extracting the characteristics of the user.

Additionally, the dividing the positive sample label, the negative sample label, the long-term preference label, and the instant hotspot label into positive sample data and negative sample data comprises: taking the positive sample label, the long-term preference label and the instant hotspot label as the positive sample data, and taking the negative sample label, the long-term preference label and the instant hotspot label as the negative sample data.

Additionally, the determining a long term preference tag for the user based on the frequency of occurrence of the content tag in the consumed content data comprises: determining a frequency of occurrence of each of the content tags in the consumed content data; filtering the frequency of each content tag appearing in the consumption content data to obtain a filtered frequency; and taking the content label corresponding to the filtered frequency as a long-term preference label of the user.

In addition, the determining the instant hotspot tag of the user according to the consumed data of the content tag and the attention of the user to the content tag comprises: determining a first trend of change of the influence of each content tag along with time according to the consumed data of the content tag; determining a second variation trend of the influence of each content label on the user along with time according to the first variation trend and the attention degree of the user on each content label; and determining the instant hot spot label of the user according to the second variation trend.

In addition, the determining a first trend of the influence of each content tag over time according to the consumed data of the content tag comprises: acquiring a plurality of content tags of which the total number of consumed users is greater than a preset value according to the consumed data of the content tags; and determining a first change trend of the influence of each content label along with time according to the intermediate value of the total consumption user number corresponding to the plurality of content labels and the total consumption user number corresponding to each content label.

In addition, the first trend of the influence of each content tag over time is calculated by the following formula:

$$f_j(t) = \frac{P_j}{P_{mid}} \times 0.5^{t}$$

where $f_j(t)$ is the first trend, $j$ is the content tag, $P_j$ is the total number of consuming users corresponding to the content tag, $P_{mid}$ is the median value, and $t$ is time.

In addition, a second trend of the influence of each content tag on the user over time is calculated by the following formula:

$$f_{ij}(t) = f_j(t) \times g_i(j)$$

where $i$ represents the user, $j$ is the content tag, $f_j(t)$ is the first trend, $f_{ij}(t)$ is the second trend, and $g_i(j)$ is the attention of the user to the $j$-th content tag: $g_i(j) = 1$ when the user has consumed the $j$-th content tag, and $g_i(j) = 0$ when the user has not consumed the $j$-th content tag.

The embodiment of the invention also provides a feature extraction method, which comprises the following steps: obtaining the sample data by using the sample data acquisition method of any one of claims 1 to 7; performing model training by using the positive sample data and the negative sample data to obtain a trained tree model; and obtaining the user characteristics according to the trained tree model.

An embodiment of the present invention further provides a processing device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above sample data acquisition method or the above feature extraction method.

The embodiment of the invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the above sample data acquisition method or the above feature extraction method.

Compared with the prior art, the embodiment of the invention provides a sample data acquisition method that acquires the content tags of the content to be consumed, the consumption content data of the user for the content to be consumed, and the user's consumption integrity for each content to be consumed; determines the long-term preference tag of the user according to the frequency with which the content tags appear in the consumption content data; and determines the instant hotspot tag of the user according to the consumed data of the content tags and the attention of the user to the content tags. In this embodiment, the user's long-term interest preference and the user's short-term attention to instant hotspots are processed separately to obtain the long-term preference tag and the instant hotspot tag; the content tags consumed by the user are divided into positive sample tags and negative sample tags according to the user's consumption integrity; and the positive sample tags, the negative sample tags, the long-term preference tag and the instant hotspot tag are divided into positive sample data and negative sample data, which serve as sample data for extracting the features of the user. Compared with the prior-art approach of directly using the tags of the content consumed by the user as sample data, using the positive sample data and negative sample data obtained by this method as sample data for extracting user features makes it possible to acquire user features that accurately represent the interest preference of the user.

Drawings

One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals refer to similar elements; the figures are not to scale unless otherwise specified.

Fig. 1 is a schematic flow chart of a sample data acquisition method according to a first embodiment of the present invention;

Fig. 2 is a schematic flow chart of a feature extraction method according to a second embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a sample data acquisition apparatus according to a third embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a feature extraction device according to a fourth embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a processing device according to a fifth embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the various embodiments in order to provide a better understanding of the present application. Nevertheless, the technical solution claimed in the present application can be implemented without these technical details and with various changes and modifications based on the following embodiments.

The core of this embodiment lies in acquiring the content tags of the content to be consumed, the consumption content data of the user for the content to be consumed, and the user's consumption integrity for each content to be consumed, so that the long-term preference tag of the user is determined according to the frequency with which the content tags appear in the consumption content data, and the instant hotspot tag of the user is determined according to the consumed data of the content tags and the attention of the user to the content tags. In this embodiment, the user's long-term interest preference and the user's short-term attention to instant hotspots are processed separately to obtain the long-term preference tag and the instant hotspot tag; the content tags consumed by the user are divided into positive sample tags and negative sample tags according to the user's consumption integrity; and the positive sample tags, the negative sample tags, the long-term preference tag and the instant hotspot tag are divided into positive sample data and negative sample data, which serve as sample data for extracting the features of the user. Compared with the prior-art approach of directly using the tags of the content consumed by the user as sample data, using the positive sample data and negative sample data obtained by this method as sample data for extracting user features makes it possible to acquire user features that accurately represent the interest preference of the user.

The following describes implementation details of the sample data obtaining method of the present embodiment in detail, and the following is only provided for facilitating understanding of the implementation details and is not necessary for implementing the present embodiment.

A schematic flow diagram of the sample data acquisition method in this embodiment is shown in fig. 1:

Step 101: Obtain the content tags of the content to be consumed, the consumption content data of the user for the content to be consumed, and the user's consumption integrity for each content to be consumed.

Specifically, the content to be consumed includes videos (e.g., long videos such as movies, television series and variety shows, as well as short videos), music, and books (e.g., novels, history, training, etc.).

The content tags of the content to be consumed that can be selectively used in this embodiment include:

(1) The ID of the content to be consumed, i.e., the unique identification code of the content.

(2) Content modalities of the content to be consumed, for example: long video, short video, music, books.

(3) The content name of the content to be consumed, for example: movie names, TV series names, variety show names, music names, book names, and content names related to short videos.

(4) Content category tags of the content to be consumed, including content type tags (e.g., sports, entertainment, military, economic, educational, scientific, etc.), short video type tags, type tags for movies, TV series and books (e.g., martial arts, antiques, fantasy, history, employment, etc.), and music types (e.g., hormons, melancholy, thanksgiving, inspirations, etc.).

(5) Content keyword tags of the content to be consumed, including the character tags (e.g., directors, actors, singers, athletes, political figures, etc.), entity tags (e.g., organization names, city names, etc.) and event tags (e.g., hot events, earthquakes, volcanoes, epidemics, etc.) to which the content relates.

(6) The content shelf time of the content to be consumed, that is, the earliest point in time at which the user can consume the content, for example: the release time of short videos, the showing time of movies and TV series, and the release time of music and book works.

(7) The content quality score of the content to be consumed, which is an index measuring how popular the content is with users and is calculated from user behavior. For example, the quality score of a content is comprehensively calculated from indexes such as the number of users of the content, the average playing integrity, the average number of plays and the total playing duration.

In this embodiment, the consumption content data of the content to be consumed by the user records, for example, from what time to what time the user watched a video, listened to a piece of music or read a book. The consumption content data can accurately reflect the preference of the user. For example, if the user watches many movies or reads many books of a certain type, listens to many songs by a certain singer, or follows short videos related to a certain keyword, this indicates that the user is interested in the content tags of that content. Conversely, if the user switches away from some content after playing only a little of it or only a few seconds, this indicates that the user may not be interested in the content tags of that content.

In addition, according to the consumption content data of the content to be consumed by the user, the following can be obtained:

(1) Content consumption period preferences. According to the time at which the user consumes content, such as the time at which the user starts to watch long and short videos, listen to music or read electronic books, it is measured whether the user prefers to consume content in certain time periods of the day (such as in the morning or before sleep).

(2) Content novelty consumption preferences. The smaller the difference between the content consumption time and the content shelf time, the higher the novelty; the larger the difference, the lower the novelty. The user's preference for content novelty can be judged from the distribution of the novelty of the content the user consumes.

(3) Content consumption integrity. For example: the completeness of the video or music consumed by the user can be obtained by dividing the time length of the video or music played by the user by the total time length of the video or music, and the completeness of the reading of the user can be obtained by dividing the number of pages of the book read by the user by the total number of pages of the book.
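The two ratios described in items (2) and (3) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the clamping of the ratio to [0, 1] and the use of days as the unit of the novelty gap are assumptions made here.

```python
from datetime import datetime


def consumption_integrity(consumed: float, total: float) -> float:
    """Fraction of an item consumed: seconds played / total seconds,
    or pages read / total pages."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Assumption: replays or seeking may push the raw ratio above 1,
    # so the result is clamped to the range [0, 1].
    return max(0.0, min(consumed / total, 1.0))


def novelty_gap_days(consumed_at: datetime, shelved_at: datetime) -> float:
    """Days between the content shelf time and the consumption time.
    A smaller gap means higher novelty."""
    return (consumed_at - shelved_at).total_seconds() / 86400.0


# A 600-second video played for 540 seconds, and a 200-page book read to page 30.
video_integrity = consumption_integrity(consumed=540, total=600)
book_integrity = consumption_integrity(consumed=30, total=200)
gap = novelty_gap_days(datetime(2020, 9, 17, 12), datetime(2020, 9, 15, 12))
```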

Step 102: Determine the long-term preference tag of the user according to the frequency with which the content tags appear in the consumption content data.

In this embodiment, the long-term preference tags reflecting the user's interests are screened from the consumption content data of the user according to the frequency with which the content tags appear in the consumption content data; the long-term preference tags are used for representing the long-term interests of the user.

Determining the long-term preference tag of the user according to the frequency with which the content tags appear in the consumption content data comprises: determining the frequency with which each content tag appears in the consumption content data; filtering the frequency of each content tag appearing in the consumption content data to obtain a filtered frequency; and taking the content tag corresponding to the filtered frequency as a long-term preference tag of the user.

Specifically, according to the 80/20 principle (the Pareto principle), 80% of the content consumed by a user is concentrated in about 20% of the entire content area, i.e., a small portion of the tags can cover most of the content viewed by the user. Reflected in the data, for each particular user, a small proportion of tags appear with high frequency in the content he or she has consumed, while a large proportion of tags appear with low frequency or with a frequency of 0. The high-frequency tags represent the long-term interest preferences of the user. Based on this, the inventor designs a frequency domain filter to attenuate tags of different frequencies to different degrees.

Let the $n$ contents to be consumed in total (videos, music, novels and so on) form the set $C$ of content to be consumed, as shown in formula (1):

$$C = [c_1 \; c_2 \; c_3 \; \dots \; c_n] \quad (1)$$

The tag set $L$ of the $m$ content tags contained in the set $C$ of content to be consumed is shown in formula (2):

$$L = [l_1 \; l_2 \; l_3 \; \dots \; l_m] \quad (2)$$

Assume that the set $C_i$ of content that user $i$ has consumed is as shown in formula (3):

$$C_i = [c_1 \; c_2 \; c_3 \; \dots \; c_p \; \dots] \quad (3)$$

where $c_p$ is the $p$-th content consumed by the user, and $p \le n$.

The set of all content tags contained in the consumed content set $C_i$ is denoted $L_j$, as shown in formula (4):

$$L_j = [l_1 \; l_2 \; l_3 \; \dots \; l_j \; \dots] \quad (4)$$

where $l_j$ is the value of the $j$-th content tag consumed by user $i$.

The frequency with which each tag in the tag set $L_j$ appears in the content consumed by user $i$ is computed to generate the tag frequency vector of user $i$, $[F_{i0} \; F_{i1} \; F_{i2} \; \dots \; F_{ij} \; \dots \; F_{im}]$, where $F_{ij}$ represents the frequency of occurrence of tag $j$ in the content consumed by user $i$. The maximum frequency $F_{max}$ in the tag frequency vector is then determined, as shown in formula (5):

$$F_{max} = \max([F_{i0} \; F_{i1} \; F_{i2} \; \dots \; F_{ij} \; \dots \; F_{im}]) \quad (5)$$
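As a rough sketch of formulas (4) and (5), the tag frequency vector and its maximum could be computed as below. The text does not say whether a frequency is a raw count or a normalized ratio; raw counts are assumed here, and the tag names and the `tag_frequencies` helper are hypothetical.

```python
from collections import Counter


def tag_frequencies(consumed_items, all_tags):
    """Return the tag frequency vector of one user.

    consumed_items: one list of tags per consumed content item.
    all_tags: the full tag set L; tags never consumed get frequency 0.
    Assumption: frequency is the raw number of consumed items carrying the tag.
    """
    counts = Counter(tag for item in consumed_items for tag in item)
    return [counts.get(tag, 0) for tag in all_tags]


all_tags = ["drama", "costume", "gossip", "sports"]  # hypothetical tag set L
consumed = [["drama", "costume"], ["drama"], ["drama", "gossip"]]

freq_vector = tag_frequencies(consumed, all_tags)  # [F_i0, F_i1, ...]
f_max = max(freq_vector)                           # formula (5)
```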

The frequencies of all the content tags consumed by user $i$ are substituted into the filter whose expression is shown in formula (6):

$$s = \tan\left(\frac{\pi f_c}{f_s}\right) \cdot \frac{1 + z^{-1}}{1 - z^{-1}} \quad (6)$$

where the sampling frequency is $f_s = F_{max}$ and the cut-off frequency is taken as $f_c = 0.7 f_s$ (the frequency corresponding to 3 dB attenuation). The high-frequency tags of user $i$ are attenuated hardly or only slightly, while the low-frequency tags are greatly attenuated, so that the high-frequency tags representing the long-term interest preference of the user can be screened out. It should be noted that the filter used in this embodiment is a Butterworth filter, a type of electronic filter also called a maximally flat filter. In practical applications, other filters capable of implementing the frequency domain filtering may be used, which is not limited in this embodiment.
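The attenuation behaviour described above can be illustrated with a simplified sketch. It does not implement the digital form of formula (6). Instead it assumes the standard analog Butterworth high-pass magnitude response, $|H(f)| = 1/\sqrt{1 + (f_c/f)^{2n}}$, applied directly to each tag frequency. The filter order `order=2`, the selection threshold of half the largest filtered value and all function names are assumptions made for illustration.

```python
import math


def butterworth_highpass_gain(freq, cutoff, order=2):
    """Magnitude of an analog Butterworth high-pass filter at `freq`.
    The gain is 1/sqrt(2) (about -3 dB) at the cutoff frequency."""
    if freq <= 0:
        return 0.0  # a tag that never appears is fully suppressed
    return 1.0 / math.sqrt(1.0 + (cutoff / freq) ** (2 * order))


def filter_tag_frequencies(freqs, cutoff_ratio=0.7, order=2):
    """Attenuate each tag frequency, with f_s = F_max and f_c = 0.7 * f_s."""
    f_max = max(freqs.values())
    cutoff = cutoff_ratio * f_max
    return {
        tag: f * butterworth_highpass_gain(f, cutoff, order)
        for tag, f in freqs.items()
    }


freqs = {"drama": 40, "costume": 12, "gossip": 3, "sports": 0}  # hypothetical
filtered = filter_tag_frequencies(freqs)

# Assumption: keep tags whose filtered value is at least half of the largest one.
threshold = 0.5 * max(filtered.values())
long_term_tags = [tag for tag, value in filtered.items() if value >= threshold]
```

With these hypothetical counts, the dominant tag keeps most of its weight while the rarely seen tags are reduced to a small fraction of their original frequency, which matches the qualitative behaviour the embodiment describes.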

For example, a fan of the series "celebration year" watches each episode update of the series, shows obvious playing behavior for its previews, highlight clips and the like, clicks on and partially reads the novel of the same name, and also watches a small amount of content related to 'ancient dress crossing', 'stretching', 'stalking' and 'lace life'. The appearance frequencies of content tags such as 'celebration year', 'ancient dress crossing', 'stretching', 'stalking' and 'lace life' are calculated. The frequency of 'celebration year' is markedly high, so it is attenuated only slightly by the high-frequency filtering; 'ancient dress crossing', 'stretching' and 'stalking' are attenuated to a certain extent; and 'lace life' is greatly attenuated because its frequency is markedly low. In this way, the resulting content tag vector can highlight the user's long-term interest preferences.

Step 103: Determine the instant hotspot tag of the user according to the consumed data of the content tags and the attention of the user to the content tags.

In addition to long-term preferences for content types, forms and the like, a considerable share of the content consumed by users is driven by hotspots. For example, many people have no interest in hospitals and infectious diseases, but still pay attention to hot content related to medical alarm, coronavirus public opinion and the like. In this embodiment, the instant hotspot tag of the user is screened from the content tags consumed by the user according to the consumed data of the content tags and the attention of the user to the content tags; the instant hotspot tag represents an instant hotspot that the user pays attention to in the short term.

Specifically, determining an instant hot spot tag of a user according to consumed data of the content tag and the attention of the user to the content tag includes: determining a first trend of change of the influence of each content tag over time according to consumed data of the content tag; determining a second variation trend of the influence of each content label on the user along with time according to the first variation trend and the attention of the user to each content label; and determining the instant hot spot label of the user according to the second variation trend.

Wherein determining a first trend of change of the influence of each content tag over time from consumed data of the content tag comprises: acquiring a plurality of content labels of which the total consumption user number is greater than a preset value according to the consumed data of the content labels; and determining a first change trend of the influence of each content label along with time according to the intermediate value of the total consumption user number corresponding to the plurality of content labels and the total consumption user number corresponding to each content label.

For example, in this embodiment, a time-domain filtering method may be adopted to calculate how the influence of a hotspot on the user changes over time. According to the consumed data of the content tags, the 100 content tags with the largest numbers of consuming users in the last half year are obtained. Let the median of the numbers of consuming users of these 100 content tags be $P_{mid}$, and let the number of consuming users of content tag $j$ be $P_j$. The influence of content tag $j$ is defined as $A_j$, as shown in formula (7):

$$A_j = \frac{P_j}{P_{mid}} \quad (7)$$

where, when $A_j$ is greater than 1, $A_j$ is taken to be equal to 1.
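Formula (7) with its cap at 1 can be sketched as follows. The helper name `tag_influence`, the dictionary input format and the example counts are hypothetical; the embodiment's figure of 100 top tags appears only as the default value of `top_k`.

```python
from statistics import median


def tag_influence(user_counts, top_k=100):
    """Compute A_j = min(P_j / P_mid, 1) for the top_k most-consumed tags.

    user_counts maps each content tag to its total number of consuming
    users over the observation window (half a year in the embodiment).
    """
    top = sorted(user_counts.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    p_mid = median(count for _, count in top)  # P_mid over the selected tags
    return {tag: min(count / p_mid, 1.0) for tag, count in top}


# Hypothetical counts; top_k is reduced to 5 to keep the example small.
counts = {"a": 1000, "b": 400, "c": 200, "d": 100, "e": 50}
influence = tag_influence(counts, top_k=5)
```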

Specifically, the first trend of the influence of each content tag with time is calculated by the following formula (8):

$$f_j(t) = \frac{P_j}{P_{mid}} \times 0.5^{t} \quad (8)$$

where $f_j(t)$ is the first trend, $j$ is the content tag, $P_j$ is the total number of consuming users corresponding to the content tag, $P_{mid}$ is the median value, and $t$ is time.

Specifically, the second trend of the influence of each content tag on the user over time is calculated by the following formula (9):

$$f_{ij}(t) = f_j(t) \times g_i(j) \quad (9)$$

where $i$ denotes the user, $j$ is the content tag, $f_j(t)$ is the first trend, $f_{ij}(t)$ is the second trend, and $g_i(j)$ is the attention of the user to the $j$-th content tag: $g_i(j) = 1$ when the user has consumed the $j$-th content tag, and $g_i(j) = 0$ when the user has not consumed the $j$-th content tag.

It should be noted that the time t in the above formula (8) and formula (9) is discretely calculated by day.

The current influence and the change trend of any content label j in all users can be calculated according to the formula (8), and the current influence and the change trend of any label j on the user i can be calculated according to the formula (9).
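Formulas (8) and (9) translate directly into code. In the sketch below, `t_days` is assumed to count whole days since the reference point of the hotspot, which the text leaves unspecified; the function names and the sample numbers are hypothetical.

```python
def first_trend(p_j, p_mid, t_days):
    """Formula (8): the influence of tag j halves with every day that passes."""
    return (p_j / p_mid) * 0.5 ** t_days


def second_trend(p_j, p_mid, t_days, consumed_by_user):
    """Formula (9): influence of tag j on user i, gated by g_i(j)."""
    g = 1 if consumed_by_user else 0  # g_i(j) is 1 only if the user consumed tag j
    return first_trend(p_j, p_mid, t_days) * g


# Hypothetical tag with 300 consuming users against a median of 200.
decay = [first_trend(300, 200, t) for t in range(4)]
for_consumer = second_trend(300, 200, 1, consumed_by_user=True)
for_non_consumer = second_trend(300, 200, 1, consumed_by_user=False)
```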

Step 104: Divide the content tags of the content to be consumed into positive sample tags and negative sample tags according to the consumption integrity.

Specifically, the consumption integrity of each content to be consumed by the user is obtained in step 101. In this embodiment, a video whose playing integrity by the user exceeds 90% is taken as a video consumption positive sample, a book whose reading integrity exceeds 50% is taken as a book consumption positive sample, and a song that is listened to in full more than once a day is taken as a music consumption positive sample. A video whose playing integrity by the user is below 10% is taken as a video consumption negative sample, a book whose reading integrity is below 10% is taken as a book consumption negative sample, and a song whose listening duration is less than 10 seconds is taken as a music consumption negative sample. It should be noted that the above integrity percentages for dividing the positive and negative samples are only examples, and in practical applications they can be set according to actual requirements.
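The example thresholds of this step can be collected into a small rule function. The sketch below hard-codes the embodiment's example values; the function name, its keyword arguments and the choice to return `None` for consumption that falls between the thresholds are assumptions.

```python
def label_consumption(kind, integrity=None, full_listens_per_day=None,
                      listen_seconds=None):
    """Return "positive", "negative" or None for one consumption record,
    using the example thresholds of the embodiment."""
    if kind == "video" and integrity is not None:
        if integrity > 0.9:
            return "positive"
        if integrity < 0.1:
            return "negative"
    elif kind == "book" and integrity is not None:
        if integrity > 0.5:
            return "positive"
        if integrity < 0.1:
            return "negative"
    elif kind == "music":
        # Positive: the song is listened to in full more than once a day.
        if full_listens_per_day is not None and full_listens_per_day > 1:
            return "positive"
        # Negative: the song is abandoned within 10 seconds.
        if listen_seconds is not None and listen_seconds < 10:
            return "negative"
    return None  # assumption: records between the thresholds are left unlabeled


video_label = label_consumption("video", integrity=0.95)
book_label = label_consumption("book", integrity=0.05)
music_label = label_consumption("music", full_listens_per_day=0, listen_seconds=5)
```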

Step 105: Divide the positive sample tag, the negative sample tag, the long-term preference tag and the instant hotspot tag into positive sample data and negative sample data to be used as sample data for extracting the features of the user.

Specifically, the long-term preference tag of the user after the frequency domain filtering is obtained in the step 102, and the instant hotspot tag of the user is obtained after the time domain filtering is performed in the step 103.

In this embodiment, dividing the positive sample tag, the negative sample tag, the long-term preference tag and the instant hotspot tag into positive sample data and negative sample data includes: taking the positive sample tag, the long-term preference tag and the instant hotspot tag as positive sample data, and taking the negative sample tag, the long-term preference tag and the instant hotspot tag as negative sample data.

Specifically, the positive sample tag, the long-term preference tag of the user and the instant hotspot tag of the user are marked as 1 and used as positive sample data; the negative sample tag, the long-term preference tag of the user and the instant hotspot tag of the user are marked as 0 and used as negative sample data; and the positive sample data and the negative sample data are used as sample data for extracting the user features. Compared with the prior-art approach of directly using the tags of the content consumed by the user as sample data, using the positive sample data and negative sample data obtained by the sample data acquisition method of this embodiment as sample data for extracting user features helps to acquire user features that accurately represent the interest preference of the user.
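One possible way to assemble the labeled sample data is sketched below. The text does not specify the row layout, so the sketch assumes one sample per consumed tag, each carrying the user's long-term preference tags and instant hotspot tags as shared context. The function name, the dictionary layout and the tag names are hypothetical.

```python
def build_samples(positive_tags, negative_tags, long_term_tags, hotspot_tags):
    """Combine the four tag groups into labeled rows: 1 = positive, 0 = negative."""
    context = list(long_term_tags) + list(hotspot_tags)
    positives = [{"tags": [tag] + context, "label": 1} for tag in positive_tags]
    negatives = [{"tags": [tag] + context, "label": 0} for tag in negative_tags]
    return positives + negatives


samples = build_samples(
    positive_tags=["drama"],
    negative_tags=["sports"],
    long_term_tags=["costume"],
    hotspot_tags=["epidemic"],
)
```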

Compared with the prior art, the embodiment of the invention provides a sample data acquisition method that separately processes the user's long-term interest preference and the user's short-term attention to instant hotspots to obtain a long-term preference tag and an instant hotspot tag, divides the content tags consumed by the user into positive sample tags and negative sample tags according to the user's consumption integrity, and divides the positive sample tags, the negative sample tags, the long-term preference tag and the instant hotspot tag into positive sample data and negative sample data to be used as sample data for extracting the features of the user.

The steps of the above methods are divided only for clarity of description. In implementation, steps may be combined into one step, or a step may be split into multiple steps; as long as the same logical relationship is included, such variants are all within the protection scope of this patent. Adding insignificant modifications to the algorithms or processes, or introducing insignificant design changes, without changing the core design of the algorithms or processes is also within the protection scope of this patent.

In the prior art, in order to find the interest preference of a user, various keywords are extracted by clustering the historical data of the content consumed by the user, and some of the keywords are taken as user features to describe the interest preference of the user. However, a user often consumes a piece of content not because the user likes that type of content or a character in it, but because it is a hotspot that everyone is paying attention to at the moment; once the hot period has passed, the user will no longer consume content of the same type. The existing methods for describing user interest preference equate the keywords of the content consumed by the user with the user's interest preference, so that content that does not reflect the user's interest preference, such as hot content, is also treated as content the user prefers, and its keywords are mistakenly taken as the user's points of interest. As a result, the user's points of interest cannot be accurately represented.

A second embodiment of the present invention relates to a feature extraction method. A schematic flow chart of the feature extraction method in this embodiment is shown in fig. 2, and specifically includes:

Step 201: Obtain the sample data by using the sample data acquisition method of the first embodiment.

The sample data acquisition method used in this step has been described in detail in the first embodiment and is not described again here. The sample data acquired by the sample data acquisition method of the first embodiment includes: positive sample data (positive sample tag, long-term preference tag and instant hotspot tag) and negative sample data (negative sample tag, long-term preference tag and instant hotspot tag).

Step 202: Perform model training by using the positive sample data and the negative sample data to obtain a trained tree model.

Specifically, a Gradient Boosting Decision Tree (GBDT) algorithm is trained on the sample data to generate the tree model. During model training, the number of trees in the tree model is set according to the number of long-term preference tags and the number of instant hotspot tags, and the maximum depth of each tree is 3. It should be noted that this embodiment gives one implementation of training a tree model; it is to be understood that other tree model training methods in the prior art may also be adopted, and details are not described in this embodiment.
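As a non-limiting illustration, gradient boosting with trees of maximum depth 3 may be sketched in plain Python as follows. The sketch assumes binary feature rows, a squared-error loss, and a learning rate of 0.5; it further assumes, purely for illustration, that the number of trees equals the number of long-term preference tags plus the number of instant hotspot tags. None of these choices is specified in this embodiment, and all function names are hypothetical.

```python
from statistics import mean


def build_tree(rows, targets, depth=0, max_depth=3):
    """Fit a regression tree on binary features; returns a nested dict."""
    if depth == max_depth or len(set(targets)) <= 1:
        return {"leaf": mean(targets)}
    best = None
    for f in range(len(rows[0])):
        left = [i for i, r in enumerate(rows) if r[f] == 0]
        right = [i for i, r in enumerate(rows) if r[f] == 1]
        if not left or not right:
            continue
        sse = 0.0  # squared error left after splitting on feature f
        for idx in (left, right):
            m = mean(targets[i] for i in idx)
            sse += sum((targets[i] - m) ** 2 for i in idx)
        if best is None or sse < best[0]:
            best = (sse, f, left, right)
    if best is None:
        return {"leaf": mean(targets)}
    _, f, left, right = best
    return {
        "feature": f,
        "left": build_tree([rows[i] for i in left],
                           [targets[i] for i in left], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right],
                            [targets[i] for i in right], depth + 1, max_depth),
    }


def tree_predict(tree, row):
    while "leaf" not in tree:
        tree = tree["right"] if row[tree["feature"]] == 1 else tree["left"]
    return tree["leaf"]


def train_gbdt(rows, labels, n_trees, learning_rate=0.5, max_depth=3):
    """Each new tree is fitted to the residuals of the current ensemble."""
    base = mean(labels)
    preds = [base] * len(rows)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(labels, preds)]
        tree = build_tree(rows, residuals, 0, max_depth)
        trees.append(tree)
        preds = [p + learning_rate * tree_predict(tree, r)
                 for p, r in zip(preds, rows)]
    return {"base": base, "trees": trees, "lr": learning_rate}


def gbdt_predict(model, row):
    boost = sum(tree_predict(t, row) for t in model["trees"])
    return model["base"] + model["lr"] * boost


# Hypothetical binary features, e.g. [tag is long-term, tag is hotspot].
rows = [[1, 0], [1, 1], [0, 1], [0, 0]]
labels = [1, 1, 0, 0]
n_long_term, n_hotspot = 2, 1  # assumed tag counts
model = train_gbdt(rows, labels, n_trees=n_long_term + n_hotspot)
```

In practice an existing GBDT library would normally be used instead; the sketch only makes the roles of the tree count and the depth limit concrete.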

Step 203: Obtain the user features according to the trained tree model.

Specifically, in one implementation, the trained tree model is used directly wherever the interest features of the user need to be extracted: the prediction result output by the GBDT algorithm is obtained directly, which gives a prediction of whether the user is interested in a given piece of content. In another implementation, the outputs of the tree nodes in the trained tree model are combined into a feature vector, and the feature vector is used to describe the features of the user. The feature vector contains deeper knowledge, such as the relationships between tags, and can be used as input to other algorithms, such as user prediction and identification algorithms, so that its application range is wider and its effect in use is better.
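As a non-limiting illustration of the second implementation, one common way to combine tree outputs into a feature vector is to one-hot encode the leaf that each tree routes a sample to and to concatenate the encodings. This embodiment does not specify the encoding, so the leaf-index scheme, the hand-written trees, and the function names below are assumptions made for this sketch.

```python
def route_to_leaf(tree, row):
    """Follow binary feature tests until a leaf is reached; return its id."""
    node = tree
    while "leaf_id" not in node:
        node = node["right"] if row[node["feature"]] == 1 else node["left"]
    return node["leaf_id"]


def leaf_feature_vector(trees, row):
    """Concatenate one one-hot block per tree, marking the leaf that was hit."""
    vector = []
    for tree, n_leaves in trees:
        one_hot = [0] * n_leaves
        one_hot[route_to_leaf(tree, row)] = 1
        vector.extend(one_hot)
    return vector


# Two hypothetical trained trees, each paired with its number of leaves.
tree_a = {"feature": 0,
          "left": {"leaf_id": 0},
          "right": {"feature": 1,
                    "left": {"leaf_id": 1},
                    "right": {"leaf_id": 2}}}
tree_b = {"feature": 1, "left": {"leaf_id": 0}, "right": {"leaf_id": 1}}
trees = [(tree_a, 3), (tree_b, 2)]

vector = leaf_feature_vector(trees, [1, 0])
```

Because each leaf corresponds to a conjunction of tag conditions along its path, a vector of this kind can carry relationships between tags into a downstream model.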

Experiments show that, in a short video recommendation scenario, using the feature vector of this proposal as input yields a higher AUC value than using other data as input: the AUC of recommendation using a Logistic Regression (LR) algorithm is 12% higher, the AUC of recommendation using a GBDT algorithm is 5% higher, and the AUC of recommendation using an LR + GBDT algorithm is 3% higher. AUC (Area Under Curve) is an evaluation index for measuring the quality of a binary classification model and represents the probability that the model ranks a positive case ahead of a negative case.
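As a non-limiting illustration of the AUC definition given above, the index may be computed directly as the fraction of (positive, negative) pairs in which the positive case receives the higher predicted score, with ties counted as one half. The function name and the example scores are hypothetical and are not taken from the experiments.

```python
def auc(scores, labels):
    """Probability that a positive case is scored above a negative case."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ordered pair
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted scores and true labels.
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In the example, three of the four positive-negative pairs are ordered correctly, so the value is 0.75.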

Compared with the prior art, the feature extraction method provided by this embodiment uses the positive sample data and the negative sample data obtained by the sample data acquisition method in the first embodiment as the sample data for extracting the user features, so that the user features finally obtained can accurately represent the interest preference of the user.

A third embodiment of the present invention relates to a sample data acquisition apparatus 1 which, as shown in fig. 3, includes: a data extraction module 11, configured to obtain a content tag of content to be consumed, consumption content data of the content to be consumed by a user, and a consumption integrity of each content to be consumed by the user.

The sample data acquisition apparatus 1 further includes: a first tag extraction module 12, a second tag extraction module 13, and a third tag extraction module 14, each connected to the data extraction module 11.

The first tag extraction module 12 is configured to extract data from the data extraction module 11 and to determine the long-term preference tag of the user according to the frequency with which the content tag appears in the consumed content data.

The first tag extraction module 12 specifically includes: a frequency determination sub-module 121 connected to the data extraction module 11, a filtering sub-module 122 connected to the frequency determination sub-module 121, and a long-term tag generation sub-module 123 connected to the sample generation module 15.

Specifically, the frequency determination sub-module 121 is configured to extract data from the data extraction module 11 and to determine the frequency of occurrence of each content tag in the consumed content data. The filtering sub-module 122 is configured to filter the frequencies of occurrence of the content tags in the consumed content data to obtain the filtered frequencies. The long-term tag generation sub-module 123 is configured to use the content tags corresponding to the filtered frequencies as the long-term preference tags of the user.
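As a non-limiting illustration, the frequency determination, filtering, and long-term tag generation described above may be sketched as follows. The filtering rule used here, which keeps the tags whose share of all tag occurrences is at least `min_share`, is an assumption made for this sketch, as are the function name and the example data; this embodiment refers to the first embodiment for the actual filtering.

```python
from collections import Counter


def long_term_preference_tags(consumed_items, min_share=0.2):
    """Count tag occurrences over consumed content and keep the frequent tags."""
    counts = Counter(tag for tags in consumed_items for tag in tags)
    total = sum(counts.values())
    if total == 0:
        return []
    # Assumed filter: keep tags whose relative frequency reaches min_share.
    return sorted(tag for tag, c in counts.items() if c / total >= min_share)


# Hypothetical consumption history: one tag list per consumed content item.
history = [
    ["football", "sports"],
    ["football", "news"],
    ["football", "sports"],
    ["cooking"],
]
preferred = long_term_preference_tags(history)
```

With this history, "football" and "sports" pass the assumed threshold, while "news" and "cooking", each seen once, are filtered out.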

The second tag extraction module 13 is configured to determine the instant hotspot tag of the user according to the consumed data of the content tag and the attention of the user to the content tag.

The second tag extraction module 13 specifically includes: a first change trend determination sub-module 131 connected to the data extraction module 11, a second change trend determination sub-module 132 connected to the first change trend determination sub-module 131 and the data extraction module 11, and a hotspot tag generation sub-module 133 connected to the second change trend determination sub-module 132.

Specifically, the first change trend determination sub-module 131 is configured to determine a first change trend of the influence of each content tag over time according to the consumed data of the content tag. The second change trend determination sub-module 132 is configured to determine, according to the first change trend and the attention of the user to each content tag, a second change trend of the influence of each content tag on the user over time. The hotspot tag generation sub-module 133 is configured to determine the instant hotspot tag of the user according to the second change trend.
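As a non-limiting illustration, the two change trends may be sketched using the formulas recited in claims 6 and 7, f_j(t) = (P_j/P_mid) × 0.5^t and f_ij(t) = f_j(t) × g_i(j). The sketch assumes that P_j is the total number of users who consumed tag j, that P_mid is the median of those totals over the tags exceeding the preset value, that t is the time elapsed in days, and that g_i(j) is the attention of user i to tag j. The final selection by a fixed threshold, the function names, and the example data are likewise assumptions.

```python
from statistics import median


def tag_influence(users_per_tag, elapsed, min_users):
    """First change trend: f_j(t) = (P_j / P_mid) * 0.5 ** t."""
    popular = {tag: n for tag, n in users_per_tag.items() if n > min_users}
    if not popular:
        return {}
    p_mid = median(popular.values())
    return {tag: (n / p_mid) * 0.5 ** elapsed[tag] for tag, n in popular.items()}


def user_tag_influence(influence, attention):
    """Second change trend: f_ij(t) = f_j(t) * g_i(j)."""
    return {tag: f * attention.get(tag, 0.0) for tag, f in influence.items()}


def instant_hotspot_tags(user_influence, threshold):
    """Assumed selection rule: keep tags whose influence reaches the threshold."""
    return sorted(tag for tag, v in user_influence.items() if v >= threshold)


# Hypothetical data: consuming users per tag, days elapsed, one user's attention.
users_per_tag = {"a": 100, "b": 200, "c": 400, "d": 5}
elapsed = {"a": 0, "b": 0, "c": 3, "d": 0}
attention = {"a": 1.0, "b": 0.5, "c": 1.0}

influence = tag_influence(users_per_tag, elapsed, min_users=10)
per_user = user_tag_influence(influence, attention)
hotspots = instant_hotspot_tags(per_user, threshold=0.4)
```

Tag "c" has the most consuming users, but its influence has halved three times, so under these assumptions it no longer qualifies as an instant hotspot for this user.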

The third tag extraction module 14 is configured to extract data from the data extraction module 11 and to divide the content tags of the content to be consumed into positive sample tags and negative sample tags according to the consumption integrity.
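As a non-limiting illustration, dividing content tags by consumption integrity may be sketched as follows. The sketch assumes that the consumption integrity is a completion ratio between 0 and 1 and that a threshold of 0.8 separates positive from negative; the threshold, the function name, and the example records are hypothetical.

```python
def split_by_integrity(records, threshold=0.8):
    """Tags of content consumed at or above the threshold become positive
    sample tags; tags of the remaining content become negative sample tags."""
    positive, negative = set(), set()
    for tags, integrity in records:
        target = positive if integrity >= threshold else negative
        target.update(tags)
    return positive, negative


# Hypothetical records: (content tags, fraction of the content consumed).
records = [
    (["football", "sports"], 0.95),
    (["cooking"], 0.10),
    (["news"], 0.80),
]
positive_tags, negative_tags = split_by_integrity(records)
```

Under this simple rule a tag may appear in both sets when it is attached to both a fully consumed and a barely consumed content item; how such a case is resolved is left open here.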

The sample data acquisition apparatus 1 further includes: a sample generation module 15, connected to the first tag extraction module 12, the second tag extraction module 13, and the third tag extraction module 14, respectively.

The sample generating module 15 is configured to divide the positive sample tag, the negative sample tag, the long-term preference tag, and the instant hotspot tag into positive sample data and negative sample data, which are used as sample data for extracting features of the user.

It should be noted that the sample data acquisition apparatus 1 in this embodiment is an apparatus embodiment corresponding to the sample data acquisition method in the first embodiment, and the implementation details in the first embodiment may be applied to this embodiment, and are not described herein again.

A fourth embodiment of the present invention relates to a feature extraction device which, as shown in fig. 4, includes: the sample data acquisition apparatus 1 of the third embodiment, and a model generation device 2 connected to the sample data acquisition apparatus 1. The model generation device 2 is configured to perform model training using the positive sample data and the negative sample data obtained by the sample data acquisition apparatus 1 to obtain a trained tree model, and to obtain the user features according to the trained tree model.

It should be noted that the feature extraction device in this embodiment is a device embodiment corresponding to the feature extraction method in the second embodiment, and the implementation details in the second embodiment may be applied to this embodiment, and are not described herein again.

A fifth embodiment of the present invention relates to a processing apparatus which, as shown in fig. 5, includes at least one processor 501 and a memory 502 communicatively coupled to the at least one processor 501. The memory 502 stores instructions executable by the at least one processor 501, and the instructions are executed by the at least one processor 501 to enable the at least one processor 501 to execute the sample data acquisition method in the first embodiment or the feature extraction method in the second embodiment.

The memory 502 and the processor 501 are coupled by a bus, which may include any number of interconnected buses and bridges that couple together the various circuits of the one or more processors 501 and the memory 502. The bus may also connect various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore are not described further herein. A bus interface provides an interface between the bus and a transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Data processed by the processor 501 is transmitted over a wireless medium through an antenna; the antenna also receives data and passes the data to the processor 501.

The processor 501 is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 502 may be used to store data used by the processor 501 in performing operations.

A sixth embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the sample data acquisition method in the first embodiment or the feature extraction method in the second embodiment.

That is, as can be understood by those skilled in the art, all or part of the steps of the methods in the embodiments described above may be implemented by a program instructing related hardware. The program is stored in a storage medium and includes several instructions to enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that, in practical applications, various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims

1. A sample data acquisition method is characterized by comprising the following steps:

acquiring a content tag of content to be consumed, consumption content data of the content to be consumed by a user and consumption integrity of each content to be consumed by the user;

determining a long-term preference tag for the user based on a frequency of occurrence of the content tag in the consumed content data;

determining an instant hot spot label of the user according to the consumed data of the content label and the attention of the user to the content label;

dividing the content labels of the content to be consumed into positive sample labels and negative sample labels according to the consumption integrity;

and dividing the positive sample label, the negative sample label, the long-term preference label and the instant hotspot label into positive sample data and negative sample data to be used as sample data for extracting the characteristics of the user.

2. The method according to claim 1, wherein the dividing the positive sample label, the negative sample label, the long-term preference label, and the instant hotspot label into positive sample data and negative sample data comprises:

taking the positive sample label, the long-term preference label and the instant hotspot label as the positive sample data, and taking the negative sample label, the long-term preference label and the instant hotspot label as the negative sample data.

3. The method of claim 1, wherein the determining the long-term preference tag of the user according to the frequency of occurrence of the content tag in the consumed content data comprises:

determining a frequency of occurrence of each of the content tags in the consumed content data;

filtering the frequency of each content tag appearing in the consumption content data to obtain a filtered frequency;

and taking the content label corresponding to the filtered frequency as a long-term preference label of the user.

4. The method of claim 1, wherein the determining the instant hotspot tag of the user according to the consumed data of the content tag and the attention of the user to the content tag comprises:

determining a first trend of change of the influence of each content tag along with time according to the consumed data of the content tag;

determining a second variation trend of the influence of each content label on the user along with time according to the first variation trend and the attention degree of the user on each content label;

and determining the instant hot spot label of the user according to the second variation trend.

5. The method of claim 4, wherein the determining a first trend of influence of each of the content tags over time from the consumed data of the content tags comprises:

acquiring a plurality of content tags of which the total number of consumed users is greater than a preset value according to the consumed data of the content tags;

and determining a first change trend of the influence of each content label along with time according to the intermediate value of the total consumption user number corresponding to the plurality of content labels and the total consumption user number corresponding to each content label.

6. The sample data acquisition method according to claim 5, wherein the first trend of change of the influence of each content tag with time is calculated by the following formula:

f_j(t) = (P_j / P_mid) × 0.5^t

7. The sample data acquisition method according to claim 4, wherein the second trend of the influence of each content tag on the user over time is calculated by the following formula:

f_ij(t) = f_j(t) × g_i(j)

8. A method of feature extraction, comprising:

obtaining the sample data by using the sample data acquisition method of any one of claims 1 to 7;

performing model training by using the positive sample data and the negative sample data to obtain a trained tree model;

and obtaining the user characteristics according to the trained tree model.

9. A processing apparatus, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the sample data acquisition method of any one of claims 1 to 7, or to perform the feature extraction method according to claim 8.

10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the sample data acquisition method according to any one of claims 1 to 7, or implements the feature extraction method according to claim 8.