CN112100432B

CN112100432B - Sample data acquisition method, feature extraction method, processing device and storage medium

Info

Publication number: CN112100432B
Application number: CN202010981752.6A
Authority: CN
Inventors: 陈强
Original assignee: China Mobile Communications Group Co Ltd; MIGU Culture Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; MIGU Culture Technology Co Ltd
Priority date: 2020-09-17
Filing date: 2020-09-17
Publication date: 2024-04-09
Anticipated expiration: 2040-09-17
Also published as: CN112100432A

Abstract

Embodiments of the invention relate to the technical field of data processing and disclose a sample data acquisition method. The method processes a user's long-term interest preferences separately from the user's short-term attention to instant hot spots, obtaining long-term preference tags and instant hot spot tags; divides the tags of the content consumed by the user into positive sample tags and negative sample tags according to the user's consumption integrity; and divides the positive sample tags, negative sample tags, long-term preference tags and instant hot spot tags into positive sample data and negative sample data, which serve as the sample data for extracting the user's features. The invention provides a sample data acquisition method, a feature extraction method, a processing device and a storage medium. Using the positive and negative sample data obtained by the sample data acquisition method as the sample data for extracting user features helps to obtain user features that accurately represent the user's interest preferences.

Description

Sample data acquisition method, feature extraction method, processing device and storage medium

Technical Field

Embodiments of the invention relate to the technical field of data processing, and in particular to a sample data acquisition method, a feature extraction method, a processing device and a storage medium.

Background

To discover a user's interests and preferences, existing approaches use consumption data over different time spans for the audio, video and reading content consumed by the user to calculate how the user's points of interest change and to discover new points of interest. The points of interest are then described with a plurality of tags, and changes in the points of interest are described by attenuating the tag values.

However, the inventor has found that the prior art directly uses the interest tags of the content consumed by the user as sample data, so the resulting interest tags of the user are simply equal to the tags of the content the user consumed and therefore cannot accurately represent the user's interest preferences.

Disclosure of Invention

An object of an embodiment of the present invention is to provide a sample data acquisition method, a feature extraction method, a processing device, and a storage medium, which utilize positive sample data and negative sample data obtained by the sample data acquisition method in this embodiment as sample data for extracting features of a user, so as to facilitate acquisition of user features capable of accurately characterizing interest preferences of the user.

In order to solve the above technical problems, an embodiment of the present invention provides a sample data acquisition method, including: acquiring a content tag of a content to be consumed, consumption content data of the content to be consumed by a user, and consumption integrity of each content to be consumed by the user; determining a long-term preference tag of the user according to the frequency of occurrence of the content tag in the consumed content data; determining an instant hot spot label of the user according to the consumed data of the content label and the attention degree of the user to the content label; dividing the content label of the content to be consumed into a positive sample label and a negative sample label according to the consumption integrity; dividing the positive sample tag, the negative sample tag, the long-term preference tag and the instant hot tag into positive sample data and negative sample data as sample data for extracting the characteristics of the user.

In addition, the dividing the positive sample tag, the negative sample tag, the long-term preference tag, and the instant hot tag into positive sample data and negative sample data includes: and taking the positive sample label, the long-term preference label and the instant hot label as the positive sample data, and taking the negative sample label, the long-term preference label and the instant hot label as the negative sample data.

In addition, the determining the long-term preference tag of the user according to the frequency of occurrence of the content tag in the consumption content data includes: determining the frequency of occurrence of each of the content tags in the consumed content data; filtering the frequency of each content tag in the consumption content data to obtain a filtered frequency; and taking the content label corresponding to the filtered frequency as a long-term preference label of the user.

In addition, the determining the instant hot spot label of the user according to the consumed data of the content label and the attention degree of the user to the content label comprises the following steps: determining a first change trend of influence of each content label along with time according to consumed data of the content labels; determining a second change trend of influence of each content label on the user along with time according to the first change trend and the attention degree of the user on each content label; and determining the instant hot spot label of the user according to the second change trend.

In addition, the determining a first trend of the influence of each content tag over time according to the consumed data of the content tag includes: acquiring a plurality of content tags with the total consumption user number larger than a preset value according to the consumed data of the content tags; and determining a first change trend of the influence of each content tag along with time according to the intermediate value of the total consumption user number corresponding to the content tags and the total consumption user number corresponding to each content tag.

In addition, the first trend of the influence of each content tag over time is calculated by the following formula:

$f_j(t) = (P_j / P_{mid}) \times 0.5^{t}$

wherein $f_j(t)$ is the first trend, $j$ is the content tag, $P_j$ is the total consumption user number corresponding to the content tag, $P_{mid}$ is the intermediate value, and $t$ is time.

In addition, the second trend of the influence of each content tag on the user over time is calculated by the following formula:

$f_{ij}(t) = f_j(t) \times g_i(j)$

wherein $i$ represents the user, $j$ is the content tag, $f_j(t)$ is the first trend, $f_{ij}(t)$ is the second trend, and $g_i(j)$ is the attention of the user to the j-th content tag: $g_i(j) = 1$ when the user has consumed the j-th content tag, and $g_i(j) = 0$ when the user has not consumed the j-th content tag.

The embodiment of the invention also provides a feature extraction method, which comprises the following steps: obtaining the sample data using the sample data obtaining method according to any one of claims 1 to 7; model training is carried out by utilizing the positive sample data and the negative sample data to obtain a trained tree model; and obtaining user characteristics according to the trained tree model.

The embodiment of the invention also provides a processing device, which comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the sample data acquisition method described above; alternatively, the above feature extraction method is performed.

The embodiment of the invention also provides a computer readable storage medium storing a computer program which when executed by a processor realizes the sample data acquisition method; alternatively, the above feature extraction method is implemented.

Compared with the prior art, the embodiment of the invention provides a sample data acquisition method that acquires the content tags of the content to be consumed, the consumption content data of the user for the content to be consumed, and the user's consumption integrity for each content to be consumed; determines the long-term preference tags of the user according to the frequency with which the content tags appear in the consumption content data; and determines the instant hot spot tags of the user according to the consumed data of the content tags and the user's attention to the content tags. In this embodiment, the user's long-term interest preferences and the user's short-term attention to instant hot spots are processed separately to obtain the long-term preference tags and the instant hot spot tags; the content tags consumed by the user are divided into positive sample tags and negative sample tags according to the user's consumption integrity; and the positive sample tags, negative sample tags, long-term preference tags and instant hot spot tags are divided into positive sample data and negative sample data. Compared with the prior-art approach of directly using the interest tags consumed by the user as sample data, using the positive and negative sample data obtained in this way as the sample data for extracting user features helps to obtain user features that accurately represent the user's interest preferences.

Drawings

One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements, and in which the figures of the drawings are not to be taken in a limiting sense, unless otherwise indicated.

Fig. 1 is a flow chart of a sample data acquisition method according to a first embodiment of the present invention;

Fig. 2 is a flow chart of a feature extraction method according to a second embodiment of the present invention;

Fig. 3 is a schematic structural view of a sample data acquisition device according to a third embodiment of the present invention;

Fig. 4 is a schematic structural view of a feature extraction device according to a fourth embodiment of the present invention;

Fig. 5 is a schematic structural view of a processing device according to a fifth embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the following detailed description of the embodiments of the present invention will be given with reference to the accompanying drawings. However, those of ordinary skill in the art will understand that in various embodiments of the present invention, numerous technical details have been set forth in order to provide a better understanding of the present application. However, the technical solutions claimed in the present application can be implemented without these technical details and with various changes and modifications based on the following embodiments.

The first embodiment of the present invention relates to a sample data acquisition method. The core of this embodiment is to acquire the content tags of the content to be consumed, the consumption content data of the user for the content to be consumed, and the user's consumption integrity for each content to be consumed; to determine the long-term preference tags of the user according to the frequency with which the content tags appear in the consumption content data; and to determine the instant hot spot tags of the user according to the consumed data of the content tags and the user's attention to the content tags. In this embodiment, the user's long-term interest preferences and the user's short-term attention to instant hot spots are processed separately to obtain the long-term preference tags and the instant hot spot tags; the content tags consumed by the user are divided into positive sample tags and negative sample tags according to the user's consumption integrity; and the positive sample tags, negative sample tags, long-term preference tags and instant hot spot tags are divided into positive sample data and negative sample data. Compared with the prior-art approach of directly using the interest tags consumed by the user as sample data, using the positive and negative sample data obtained in this way as the sample data for extracting user features helps to obtain user features that accurately represent the user's interest preferences.

The implementation details of the sample data acquisition method of this embodiment are described below. The following is provided only to aid understanding and is not required to implement this embodiment.

A schematic flow chart of the sample data acquisition method in this embodiment is shown in Fig. 1:

Step 101: Acquire the content tags of the content to be consumed, the consumption content data of the user for the content to be consumed, and the user's consumption integrity for each content to be consumed.

Specifically, the content to be consumed includes: video (for example, movies, television shows and short videos), music, and books (for example, novels, history and management training).

The content tags of the content to be consumed that may optionally be used in this embodiment include:

(1) Content ID of the content to be consumed, i.e. the unique identification code of the content.

(2) Content form of the content to be consumed, for example: long video, short video, music, books.

(3) Content name of the content to be consumed, for example: movie names, television series names, variety show names, music names, book names, and the names of content featured in short videos.

(4) Content classification tags of the content to be consumed, including content type tags (e.g., sports, entertainment, military, economy, education, science and technology), short video type tags, type tags for films, television dramas and books (e.g., martial arts, costume drama, fantasy, history, workplace), and music type tags (e.g., rousing, melancholy, gratitude, inspirational).

(5) Content keyword tags of the content to be consumed, including person tags related to the content (e.g., director, actor, singer, athlete, political figure), entity tags (e.g., organization name, city name), and event tags (e.g., hot events, earthquakes, volcanic eruptions, epidemics).

(6) On-shelf time of the content to be consumed, i.e. the earliest point in time at which a user can consume the content, for example: the release time of a short video, the premiere time of a film or television drama, or the publication time of a music or book work.

(7) Content quality score of the content to be consumed. The content quality score is an index measuring how popular the content is with users and is calculated from user behavior, for example from indicators such as the total number of users, the average playback integrity, the average number of plays and the total playback duration.

In this embodiment, the consumption content data of the user for the content to be consumed includes, for example, the time at which the user starts and the time at which the user stops watching a video, listening to music or reading a book. This data can accurately reflect the user's preferences. For example, if a user watches many film or television works or reads many books of a certain type, listens to many tracks by a certain singer, or follows short videos related to a certain keyword, the user is interested in the corresponding content tags. Conversely, if the user switches away from some content after playing little of it or only a few seconds, the user is probably not interested in the content tags of that content.

In addition, the following can be derived from the consumption content data of the user for the content to be consumed:

(1) Content consumption period preference. Based on the times at which the user consumes content, such as the times at which the user starts watching long and short videos, listening to music or reading an e-book, it is measured whether the user prefers to consume content in certain periods of the day (for example, in the morning or before going to sleep).

(2) Content novelty consumption preference. Novelty is measured by the difference between the time at which the user consumes the content and the on-shelf time of the content: the smaller the difference, the higher the novelty; the greater the difference, the lower the novelty. The user's preference for content novelty can be judged from the novelty distribution of the content the user consumes.

(3) Content consumption integrity. For example, the integrity of a user's consumption of a video or piece of music is obtained by dividing the duration played by the total duration of the video or music, and the user's reading integrity is obtained by dividing the number of pages read by the total number of pages of the book.
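The integrity calculation in item (3) can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and clamping the result to the range [0, 1] (for example when a track is replayed) is an assumption that the description does not state.

```python
def playback_integrity(played_seconds: float, total_seconds: float) -> float:
    """Played duration divided by total duration, clamped to [0, 1] (assumption)."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return max(0.0, min(1.0, played_seconds / total_seconds))


def reading_integrity(pages_read: int, total_pages: int) -> float:
    """Pages read divided by total pages, clamped to [0, 1] (assumption)."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return max(0.0, min(1.0, pages_read / total_pages))


# Hypothetical records: 45 s of a 60 s video, 30 pages of a 120-page book.
video_integrity = playback_integrity(45, 60)
book_integrity = reading_integrity(30, 120)
```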

Step 102: the long-term preference tags of the user are determined based on the frequency with which the content tags appear in the consumed content data.

In this embodiment, the long-term preference tags reflecting the user's interests are selected from the user's consumed content data according to the frequency with which the content tags appear in that data; the long-term preference tags characterize the user's long-term interests.

Determining a long-term preference tag of the user based on the frequency of occurrence of the content tag in the consumed content data, comprising: determining a frequency of occurrence of each content tag in the consumed content data; filtering the frequency of each content tag in the consumed content data to obtain a filtered frequency; and taking the content label corresponding to the filtered frequency as a long-term preference label of the user.

Specifically, according to the 80/20 rule, 80% of the content consumed by a user is concentrated in about 20% of the total content areas; that is, a small portion of the tags covers most of the content the user views. Reflected in the data, for each particular user and across all the content that user has consumed, a small fraction of the tags appear at high frequency while most tags appear at low frequency or not at all. The high-frequency tags represent the user's long-term interest preferences. Based on this, the inventor designed a frequency-domain filter that attenuates tags of different frequencies to different extents.

Let the n contents to be consumed, comprising videos, music and novels, form a content set $C$ to be consumed, as shown in formula (1):

$C = [c_1\ c_2\ c_3\ \dots\ c_n]$ (1)

The tag set $L$ of the m content tags contained in the content set $C$ to be consumed is shown in formula (2):

$L = [l_1\ l_2\ l_3\ \dots\ l_m]$ (2)

Assume that the set $C_i$ of content consumed by user $i$ is as shown in formula (3):

$C_i = [c_1\ c_2\ c_3\ \dots\ c_p\ \dots]$ (3)

wherein $c_p$ is the p-th content to be consumed that has been consumed by the user, and $p \le n$.

All content tags contained in the consumed content set $C_i$, i.e. all content tags consumed by the user, form a list $L_j$, as shown in formula (4):

$L_j = [l_1\ l_2\ l_3\ \dots\ l_j\ \dots]$ (4)

wherein $l_j$ is the value of the j-th content tag consumed by user $i$.

The occurrence frequency of every tag in the tag set $L_j$ within the content consumed by user $i$ is computed to generate the tag frequency vector of user $i$: $[F_{i0}\ F_{i1}\ F_{i2}\ \dots\ F_{ij}\ \dots\ F_{im}]$, wherein $F_{ij}$ represents the frequency of occurrence of tag $j$ in the content consumed by user $i$. The maximum frequency $F_{max}$ in the tag frequency vector is determined as shown in formula (5):

$F_{max} = \max([F_{i0}\ F_{i1}\ F_{i2}\ \dots\ F_{ij}\ \dots\ F_{im}])$ (5)
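The tag frequency vector and its maximum can be illustrated with a short sketch. It assumes that "frequency" means the raw number of consumed contents carrying a tag, which the description does not specify; the function name and the example tags are hypothetical.

```python
from collections import Counter


def tag_frequencies(consumed_contents, all_tags):
    """Return the frequency vector aligned with all_tags, and its maximum.

    consumed_contents: one list of tags per content consumed by the user.
    all_tags: the ordered tag set that fixes the positions in the vector.
    Frequency is taken to be a raw occurrence count (assumption).
    """
    counts = Counter(tag for tags in consumed_contents for tag in tags)
    freqs = [counts.get(tag, 0) for tag in all_tags]
    return freqs, max(freqs, default=0)


# Hypothetical data: three consumed contents and a four-tag vocabulary.
all_tags = ["drama", "costume", "sports", "music"]
consumed = [["drama", "costume"], ["drama"], ["drama", "music"]]
freqs, f_max = tag_frequencies(consumed, all_tags)
```

Tags that the user never consumed keep a frequency of 0, matching the observation that most tags appear at low frequency or not at all.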

The frequencies of all content tags consumed by user $i$ are fed into a filter whose expression is shown in formula (6):

$s = \tan(\pi f_c / f_s)\,\dfrac{1 + z^{-1}}{1 - z^{-1}}$ (6)

wherein the sampling frequency is $f_s = F_{max}$ and the cut-off frequency is taken as $f_c = 0.7 f_s$ (the frequency corresponding to 3 dB attenuation). The high-frequency tags of the user are attenuated little or not at all, while the low-frequency tags are attenuated significantly, so the high-frequency tags representing the user's long-term interest preferences can be screened out. It should be noted that the filter used in this embodiment is a Butterworth filter, a type of electronic filter also known as a maximally flat filter. In practical applications, other filters that can realize the above frequency-domain filtering may also be used, and this embodiment is not limited in this respect.

For example, a user following the drama "Joy of Life" watches every episode, shows clear playback behavior on its trailers and highlight clips, clicks into and partially reads the novel of the same name, and also views a small amount of content related to costume time-travel dramas, Zhang Ruoyun, Chen Daoming, and gossip and lifestyle topics. The occurrence frequencies of content tags such as "Joy of Life", "costume time-travel", "Zhang Ruoyun", "Chen Daoming" and "gossip and lifestyle" are calculated. The frequency of "Joy of Life" is markedly high, so it is attenuated only slightly by the high-pass filtering; "costume time-travel", "Zhang Ruoyun" and "Chen Daoming" are attenuated to some extent; and "gossip and lifestyle" is attenuated heavily because its frequency is markedly low. In this way, the resulting content tag vector highlights the user's long-term interest preferences.
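The effect described in this example can be sketched in code. The sketch is a simplification and not the digital filter of formula (6): it weights each tag frequency by the magnitude response of an analog Butterworth high-pass filter, $1/\sqrt{1 + (f_c/f)^{2N}}$, with the cut-off at 0.7 times the maximum frequency. The filter order, the function names and the tag counts are all hypothetical.

```python
import math


def highpass_gain(freq, cutoff, order=1):
    """Magnitude response of an analog Butterworth high-pass filter."""
    if freq <= 0:
        return 0.0  # a tag that never occurs is fully suppressed
    return 1.0 / math.sqrt(1.0 + (cutoff / freq) ** (2 * order))


def filter_tag_frequencies(tag_counts, cutoff_ratio=0.7, order=1):
    """Weight each tag frequency by the high-pass gain at that frequency.

    The cut-off is cutoff_ratio * F_max, following f_c = 0.7 f_s with
    f_s = F_max; applying the gain to the raw count is an assumption.
    """
    if not tag_counts:
        return {}
    cutoff = cutoff_ratio * max(tag_counts.values())
    return {
        tag: count * highpass_gain(count, cutoff, order)
        for tag, count in tag_counts.items()
    }


# Hypothetical counts: one heavily watched title tag and two rarer tags.
counts = {"title": 60, "genre": 12, "gossip": 2}
filtered = filter_tag_frequencies(counts)
```

With these counts the title tag keeps roughly 82% of its frequency, the genre tag roughly 27%, and the rarest tag under 5%, so the ordering of long-term interest is sharpened rather than changed.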

Step 103: Determine the instant hot spot tags of the user according to the consumed data of the content tags and the user's attention to the content tags.

In addition to content reflecting the user's long-term preferences, such as clear preferences for particular content types and forms, a considerable share of the content that users consume is hot-spot content. In this embodiment, the instant hot spot tags of the user are selected from the content tags consumed by the user according to the consumed data of the content tags and the user's attention to the content tags; the instant hot spot tags characterize the instant hot spots that attract the user's short-term attention.

Specifically, determining the instant hot spot label of the user according to the consumed data of the content label and the attention degree of the user to the content label comprises the following steps: determining a first change trend of influence of each content label along with time according to consumed data of the content labels; determining a second change trend of influence of each content label on the user along with time according to the first change trend and the attention degree of the user on each content label; and determining the instant hot spot label of the user according to the second change trend.

Wherein determining a first trend of the influence of each content tag over time according to the consumed data of the content tag comprises: acquiring a plurality of content tags with the total consumption user number larger than a preset value according to the consumed data of the content tags; and determining a first change trend of the influence of each content tag along with time according to the intermediate value of the total consumption user number corresponding to the content tags and the total consumption user number corresponding to each content tag.

For example, in this embodiment, time-domain filtering may be used to calculate how the influence of a hot spot on the user changes over time. According to the consumed data of the content tags, the 100 content tags with the largest numbers of consuming users in the last half year are obtained. Let the median of the numbers of consuming users of these 100 content tags be $P_{mid}$ and the number of consuming users of content tag $j$ be $P_j$. The influence of content tag $j$ is defined as $A_j$, as shown in formula (7):

$A_j = P_j / P_{mid}$ (7)

wherein, when $A_j$ is greater than 1, $A_j$ is taken to be 1.
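Formula (7) and its cap can be sketched as follows. The sketch assumes that the per-tag user counts have already been restricted to the last half year and that the "intermediate value" is the statistical median; the function name and the example counts are hypothetical.

```python
from statistics import median


def tag_influence(user_counts, top_n=100):
    """Compute A_j = min(P_j / P_mid, 1) for the top_n most consumed tags.

    user_counts maps each tag to its number of consuming users, assumed to
    be counted over the last half year already.
    """
    top = dict(
        sorted(user_counts.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    )
    if not top:
        return {}
    p_mid = median(top.values())
    if p_mid <= 0:
        raise ValueError("median user count must be positive")
    return {tag: min(p_j / p_mid, 1.0) for tag, p_j in top.items()}


# Hypothetical counts for three tags; the median is 200.
influence = tag_influence({"tag_a": 400, "tag_b": 200, "tag_c": 100})
```

Capping at 1 means every tag at or above the median counts as fully influential, while tags below the median are scaled down proportionally.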

Specifically, the first trend of the influence of each content tag over time is calculated by the following formula (8):

$f_j(t) = (P_j / P_{mid}) \times 0.5^{t}$ (8)

wherein $f_j(t)$ is the first trend, $j$ is the content tag, $P_j$ is the total consumption user number corresponding to the content tag, $P_{mid}$ is the intermediate value, and $t$ is time.

Specifically, the second trend of the influence of each content tag on the user over time is calculated by the following formula (9):

$f_{ij}(t) = f_j(t) \times g_i(j)$ (9)

wherein $i$ represents the user, $j$ is the content tag, $f_j(t)$ is the first trend, $f_{ij}(t)$ is the second trend, and $g_i(j)$ is the attention of the user to the j-th content tag: $g_i(j) = 1$ when the user has consumed the j-th content tag, and $g_i(j) = 0$ when the user has not consumed the j-th content tag.

It should be noted that the time t in formulas (8) and (9) above is a discrete value in days.

The current influence of any content tag j among all users, and its trend over time, can be calculated according to formula (8); the current influence of any tag j on user i, and its trend over time, can be calculated according to formula (9).
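Formulas (8) and (9) can be evaluated with a short sketch. It follows the formulas as written, without the cap applied to $A_j$ in formula (7), and treats t as a whole number of elapsed days; the reference point from which t is counted is not stated in the description and is left to the caller. The function names and the numbers are hypothetical.

```python
def first_trend(p_j, p_mid, t_days):
    """Formula (8): tag influence among all users, halving every day."""
    return (p_j / p_mid) * 0.5 ** t_days


def second_trend(p_j, p_mid, t_days, consumed):
    """Formula (9): influence on one user; g_i(j) is 1 if consumed, else 0."""
    g = 1 if consumed else 0
    return first_trend(p_j, p_mid, t_days) * g


# Hypothetical tag with 3000 consuming users against a median of 1000.
decay = [first_trend(3000, 1000, t) for t in range(4)]
for_consumer = second_trend(3000, 1000, 3, consumed=True)
for_non_consumer = second_trend(3000, 1000, 3, consumed=False)
```

The factor $0.5^{t}$ gives the influence a half-life of one day, so a tag that starts at three times the median falls to 0.375 after three days.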

Step 104: Divide the content tags of the content to be consumed into positive sample tags and negative sample tags according to the consumption integrity.

Specifically, the consumption integrity of each content consumed by the user has been obtained in step 101. In this embodiment, videos with a playback integrity greater than 90% are taken as positive samples of video consumption, books with a reading integrity greater than 50% are taken as positive samples of book consumption, and songs listened to in full more than once per day are taken as positive samples of music consumption. Videos with a playback integrity less than 10% are taken as negative samples of video consumption, books with a reading integrity less than 10% are taken as negative samples of book consumption, and songs listened to for less than 10 seconds are taken as negative samples of music consumption. It should be noted that the above integrity thresholds for positive and negative samples are merely examples and can be set according to actual requirements in practical applications.
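The example thresholds of step 104 can be sketched as a classification function. The record layout (a dictionary with hypothetical keys `kind`, `integrity`, `full_listens_per_day` and `listened_seconds`) is an assumption, as is leaving records that fall between the thresholds unlabeled, which the description does not address.

```python
def classify_record(record):
    """Return 'positive', 'negative', or None for one consumption record."""
    kind = record["kind"]
    if kind == "video":
        if record["integrity"] > 0.9:
            return "positive"
        if record["integrity"] < 0.1:
            return "negative"
    elif kind == "book":
        if record["integrity"] > 0.5:
            return "positive"
        if record["integrity"] < 0.1:
            return "negative"
    elif kind == "music":
        # Positive: listened to in full more than once per day.
        if record.get("full_listens_per_day", 0) > 1:
            return "positive"
        # Negative: abandoned after less than 10 seconds.
        if record.get("listened_seconds", float("inf")) < 10:
            return "negative"
    return None  # between thresholds: left unlabeled (assumption)


# Hypothetical records.
labels = [
    classify_record({"kind": "video", "integrity": 0.95}),
    classify_record({"kind": "book", "integrity": 0.05}),
    classify_record({"kind": "music", "full_listens_per_day": 2}),
    classify_record({"kind": "video", "integrity": 0.5}),
]
```

The tags of a record classified as positive become positive sample tags, and those of a negative record become negative sample tags.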

Step 105: positive sample tags, negative sample tags, long-term preference tags, and instant hot tags are divided into positive sample data and negative sample data as sample data for extracting features of a user.

Specifically, the frequency-domain-filtered long-term preference tags of the user are obtained in step 102, and the time-domain-filtered instant hot spot tags of the user are obtained in step 103.

In this embodiment, the positive sample tag, the negative sample tag, the long-term preference tag, and the instant hot tag are divided into positive sample data and negative sample data, including: positive sample tags, long-term preference tags, and instant hot tags are used as positive sample data, and negative sample tags, long-term preference tags, and instant hot tags are used as negative sample data.

Specifically, the positive sample tags, the user's long-term preference tags and the user's instant hot spot tags are labeled 1 and used as positive sample data; the negative sample tags, the user's long-term preference tags and the user's instant hot spot tags are labeled 0 and used as negative sample data; and the positive sample data and negative sample data are used as the sample data for extracting the user's features.

Compared with the prior-art approach of directly using the interest tags consumed by the user as sample data, using the positive sample data and negative sample data obtained by the sample data acquisition method of this embodiment as the sample data for extracting user features helps to obtain user features that accurately represent the user's interest preferences.
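The assembly of sample data in step 105 can be sketched as follows. The sketch assumes one positive and one negative sample per user and represents each sample as a dictionary with hypothetical keys `tags` and `label`; the description does not fix the granularity or the data layout.

```python
def build_samples(positive_tags, negative_tags, long_term_tags, hot_tags):
    """Combine the four tag groups into labeled positive and negative samples.

    Both samples share the user's long-term preference tags and instant hot
    spot tags; they differ in the consumed-content tags and the 1/0 label.
    """
    shared = list(long_term_tags) + list(hot_tags)
    positive = {"tags": list(positive_tags) + shared, "label": 1}
    negative = {"tags": list(negative_tags) + shared, "label": 0}
    return positive, negative


# Hypothetical tag groups for one user.
positive, negative = build_samples(
    positive_tags=["drama"],
    negative_tags=["sports"],
    long_term_tags=["costume"],
    hot_tags=["news_event"],
)
```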

The steps of the above methods are divided only for clarity of description. In implementation, they may be combined into one step, or a step may be split into multiple steps; as long as the same logical relationship is included, such variations fall within the protection scope of this patent. Adding insignificant modifications to the algorithm or flow, or introducing insignificant designs, without altering the core design of the algorithm and flow also falls within the protection scope of this patent.

In the prior art, in order to discover a user's interest preferences, the historical data of the content consumed by the user is clustered to extract various keywords, and some of these keywords are taken as user features to describe the user's interest preferences. However, a user often consumes a piece of content not because the user likes the content or a person in it, but because it is a hot spot: the user pays attention to content of the same type or about the same person during the hot period, and no longer consumes such content once the hot period has passed. The existing methods equate the keywords of the content consumed by the user with the user's interest preferences, so content such as hot-spot content that does not reflect the user's interest preferences is treated as preferred content, and its keywords are mistaken for the user's points of interest. As a result, the user's points of interest cannot be accurately represented.

A second embodiment of the present invention relates to a feature extraction method. A schematic flow chart of a feature extraction method in this embodiment is shown in fig. 2, and specifically includes:

Step 201: Obtain sample data using the sample data acquisition method of the above embodiment.

The sample data acquisition method used in this step has been described in detail in the first embodiment and is not repeated here. The sample data acquired by the sample data acquisition method of the first embodiment includes: positive sample data (positive sample tags, long-term preference tags and instant hot spot tags) and negative sample data (negative sample tags, long-term preference tags and instant hot spot tags).

Step 202: Model training is performed using the positive sample data and the negative sample data to obtain a trained tree model.

Specifically, a tree model is generated by training on the sample data with a gradient boosting decision tree (GBDT, Gradient Boosting Decision Tree) algorithm. During model training, the number of trees in the tree model is set according to the number of long-term preference tags and instant hot tags, and the maximum depth of each tree is 3. It should be noted that this embodiment provides one way of training a tree model; it can be understood that other tree model training methods in the prior art may also be used, and they are not described in detail in this embodiment.
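To make the training step concrete, the following is a small, self-contained sketch of gradient boosting with depth-limited regression trees. It is a simplification under stated assumptions, not the patented implementation: it uses a squared-error loss on 0/1 labels, binary (multi-hot) tag features, and a fixed learning rate, none of which are specified in the text. All function names and the toy data are hypothetical. The text sets the number of trees from the tag counts without giving the rule here, so `n_trees` is simply passed in.

```python
def fit_tree(rows, residuals, depth=0, max_depth=3):
    """Fit a regression tree on binary features by minimising squared error."""
    mean = sum(residuals) / len(residuals)
    if depth == max_depth or len(set(residuals)) == 1:
        return {"value": mean}
    best = None
    for f in range(len(rows[0])):
        left = [i for i, r in enumerate(rows) if r[f] == 0]
        right = [i for i, r in enumerate(rows) if r[f] == 1]
        if not left or not right:
            continue  # this feature does not split the data
        sse = 0.0
        for idx in (left, right):
            m = sum(residuals[i] for i in idx) / len(idx)
            sse += sum((residuals[i] - m) ** 2 for i in idx)
        if best is None or sse < best[0]:
            best = (sse, f, left, right)
    if best is None:
        return {"value": mean}
    _, f, left, right = best
    return {
        "feature": f,
        "left": fit_tree([rows[i] for i in left],
                         [residuals[i] for i in left], depth + 1, max_depth),
        "right": fit_tree([rows[i] for i in right],
                          [residuals[i] for i in right], depth + 1, max_depth),
    }


def predict_tree(tree, row):
    """Walk from the root to a leaf and return the leaf value."""
    while "value" not in tree:
        tree = tree["right"] if row[tree["feature"]] == 1 else tree["left"]
    return tree["value"]


def fit_gbdt(rows, labels, n_trees, learning_rate=0.5, max_depth=3):
    """Each new tree is fitted to the residuals left by the previous trees."""
    base = sum(labels) / len(labels)
    preds = [base] * len(rows)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(labels, preds)]
        tree = fit_tree(rows, residuals, 0, max_depth)
        trees.append(tree)
        preds = [p + learning_rate * predict_tree(tree, r)
                 for p, r in zip(preds, rows)]
    return base, trees


def predict_gbdt(model, row, learning_rate=0.5):
    base, trees = model
    return base + learning_rate * sum(predict_tree(t, row) for t in trees)


# Toy data: each row is a multi-hot tag vector; label 1 = positive sample.
rows = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
labels = [1, 1, 0, 0]
model = fit_gbdt(rows, labels, n_trees=3)  # n_trees chosen arbitrarily here
```

A production system would more likely rely on an established gradient boosting library; the sketch only shows how the depth limit of 3 and the tree count enter the training loop.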

Step 203: User features are obtained according to the trained tree model.

Specifically, the user's interest features can be extracted from the trained tree model for use in other scenarios. In one implementation, the prediction result output by the GBDT algorithm is obtained directly, giving a prediction of whether the user is interested in a piece of content. In another implementation, the outputs on the tree nodes of the trained tree model are composed into a feature vector that describes the user's features. The feature vector contains deeper knowledge, such as the relationships among tags, and can be used as input to other algorithms, for example for user prediction or identification, which gives it a wider range of application and a better effect in use.
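One common way to turn tree outputs into a feature vector is leaf-index encoding, as used in GBDT+LR pipelines: for each tree, the leaf that a sample falls into is one-hot encoded, and the per-tree encodings are concatenated. The text does not state which encoding is used, so the sketch below is an assumption; the nested-dictionary tree format, the function names, and the two hand-built trees are hypothetical.

```python
def count_leaves(tree):
    """Number of leaves under a node."""
    if "value" in tree:
        return 1
    return count_leaves(tree["left"]) + count_leaves(tree["right"])


def leaf_position(tree, row):
    """Left-to-right index of the leaf that `row` falls into."""
    position = 0
    while "value" not in tree:
        if row[tree["feature"]] == 1:
            # Going right skips every leaf in the left subtree.
            position += count_leaves(tree["left"])
            tree = tree["right"]
        else:
            tree = tree["left"]
    return position


def leaf_feature_vector(trees, row):
    """Concatenate one one-hot leaf indicator per tree."""
    vector = []
    for tree in trees:
        one_hot = [0] * count_leaves(tree)
        one_hot[leaf_position(tree, row)] = 1
        vector.extend(one_hot)
    return vector


# Two hand-built trees standing in for a trained model (depth <= 3).
trees = [
    {"feature": 0, "left": {"value": -0.5}, "right": {"value": 0.5}},
    {"feature": 1,
     "left": {"value": -0.2},
     "right": {"feature": 2, "left": {"value": 0.1}, "right": {"value": 0.3}}},
]
user_vector = leaf_feature_vector(trees, [1, 1, 0])
```

Because each leaf corresponds to a conjunction of tag conditions along its path, such a vector captures combinations of tags rather than single tags, which matches the remark that the vector carries relationships among tags.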

Experiments show that, in a short-video recommendation scenario, using the feature vector of this proposal as input, compared with using other data as input, raises the AUC (Area Under Curve, an evaluation index measuring the quality of a binary classification model) by 12% for recommendation with a logistic regression (LR, Logistic Regression) algorithm, by 5% for recommendation with the GBDT algorithm, and by 3% for recommendation with an LR+GBDT algorithm.

Compared with the prior art, the embodiment of the invention provides a feature extraction method, which uses the positive sample data and the negative sample data obtained by the sample data acquisition method in the first embodiment as sample data for extracting the user features, so that the finally obtained user features can accurately represent the interest preference of the user.

A third embodiment of the present invention relates to a sample data acquisition device 1 which, as shown in Fig. 3, includes: a data extraction module 11 configured to obtain a content tag of a content to be consumed, consumption content data of the content to be consumed by a user, and consumption integrity of each content to be consumed by the user.

The sample data acquisition device 1 further includes: a first tag extraction module 12, a second tag extraction module 13, and a third tag extraction module 14, each of which is connected to the data extraction module 11.

The first tag extraction module 12 is configured to extract data from the data extraction module 11 and determine a long-term preference tag of the user according to the frequency of occurrence of the content tag in the consumed content data.

The first tag extraction module 12 specifically includes: a frequency determination sub-module 121 connected to the data extraction module 11, a filtering sub-module 122 connected to the frequency determination sub-module 121, and a long-term tag generation sub-module 123 connected to the sample generation module 15.

Specifically, the frequency determination sub-module 121 is configured to extract data from the data extraction module 11 and determine the frequency of occurrence of each content tag in the consumed content data. The filtering sub-module 122 is configured to filter the frequencies of occurrence of the content tags in the consumed content data to obtain filtered frequencies. The long-term tag generation sub-module 123 is configured to take the content tags corresponding to the filtered frequencies as long-term preference tags of the user.
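The three sub-modules above can be illustrated with a short sketch. The filtering rule is defined in the first embodiment and is not repeated in this part of the text, so a simple minimum-frequency threshold is assumed here. Frequency is also assumed to mean the share of consumed items that carry a tag. The function name, the `min_frequency` parameter, and the example history are hypothetical.

```python
from collections import Counter


def long_term_preference_tags(consumed_items, min_frequency=0.5):
    """Return tags whose frequency in the consumption history passes the filter.

    consumed_items: one list of content tags per consumed item.
    min_frequency: assumed filter, the minimum share of items carrying the tag.
    """
    # Frequency determination: count each tag at most once per item.
    counts = Counter(tag for tags in consumed_items for tag in set(tags))
    total = len(consumed_items)
    frequencies = {tag: n / total for tag, n in counts.items()}
    # Filtering, then long-term tag generation.
    return sorted(tag for tag, freq in frequencies.items()
                  if freq >= min_frequency)


history = [
    ["basketball", "nba"],
    ["basketball", "comedy"],
    ["cooking"],
    ["basketball"],
]
long_term = long_term_preference_tags(history, min_frequency=0.5)
```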

The second tag extraction module 13 is configured to determine an instant hot tag of the user from the consumed data of the content tag and the attention of the user to the content tag.

The second tag extraction module 13 specifically includes: a first trend determination sub-module 131 connected to the data extraction module 11, a second trend determination sub-module 132 connected to the first trend determination sub-module 131 and the data extraction module 11, and a hot tag generation sub-module 133 connected to the second trend determination sub-module 132.

Specifically, the first trend determination sub-module 131 is configured to determine a first change trend of the influence of each content tag over time according to the consumed data of the content tag. The second trend determination sub-module 132 is configured to determine a second change trend of the influence of each content tag on the user over time according to the first change trend and the user's attention to each content tag. The hot tag generation sub-module 133 is configured to determine the instant hot tag of the user according to the second change trend.
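The two trends can be sketched using the formulas recited in claims 4 and 5, f_j(t) = (P_j / P_mid) × 0.5^t and f_ij(t) = f_j(t) × g_i(j). Following claim 3, P_j is read here as the total number of users who consumed tag j and P_mid as the median of those totals over the tags that exceed a preset value; this reading, the unit of t, the function names, and the example numbers are assumptions. The rule that turns the second trend into an instant hot tag is not restated in this part of the text and is left out.

```python
from statistics import median


def first_trend(total_users, t, min_users=0):
    """f_j(t) = (P_j / P_mid) * 0.5**t for tags with more than min_users users."""
    popular = {tag: p for tag, p in total_users.items() if p > min_users}
    p_mid = median(popular.values())  # intermediate value of the user totals
    return {tag: (p / p_mid) * 0.5 ** t for tag, p in popular.items()}


def second_trend(first, consumed_tags):
    """f_ij(t) = f_j(t) * g_i(j), where g_i(j) is 1 if user i consumed tag j, else 0."""
    return {tag: value * (1 if tag in consumed_tags else 0)
            for tag, value in first.items()}


# Hypothetical totals of consuming users per tag.
total_users = {"world_cup": 400, "comedy": 200, "cooking": 100, "chess": 5}
f_j = first_trend(total_users, t=1, min_users=50)
f_ij = second_trend(f_j, consumed_tags={"world_cup", "cooking"})
```

The factor 0.5^t halves a tag's influence with every unit of time, so a tag needs a user total well above the median to stay influential for long.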

The third tag extraction module 14 is configured to extract data from the data extraction module and divide content tags of the content to be consumed into positive sample tags and negative sample tags according to the consumption integrity.
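A minimal sketch of this split is given below. The criterion applied to the consumption integrity is defined in the first embodiment and is not repeated here, so a fixed threshold on a 0 to 1 completion ratio is assumed; the function name, the `threshold` value, and the example records are hypothetical.

```python
def split_by_integrity(consumptions, threshold=0.8):
    """Split content tags into positive and negative sample tags.

    consumptions: (tags, integrity) pairs, where integrity is assumed to be
    the share of the content the user actually consumed (0.0 to 1.0).
    """
    positive, negative = set(), set()
    for tags, integrity in consumptions:
        target = positive if integrity >= threshold else negative
        target.update(tags)
    return sorted(positive), sorted(negative)


records = [
    (["basketball", "nba"], 0.95),  # watched almost to the end
    (["cooking"], 0.10),            # abandoned early
    (["comedy"], 0.80),
]
positive_tags, negative_tags = split_by_integrity(records)
```

In this sketch a tag that appears on both a completed and an abandoned item ends up in both groups; how such conflicts are resolved is not specified in this part of the text.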

The sample data acquisition device 1 further includes a sample generation module 15, which is connected to the first tag extraction module 12, the second tag extraction module 13, and the third tag extraction module 14, respectively.

The sample generation module 15 is configured to divide the positive sample tag, the negative sample tag, the long-term preference tag, and the instant hot tag into positive sample data and negative sample data as sample data for extracting the features of the user.

It should be noted that, the sample data obtaining apparatus 1 in this embodiment is an apparatus embodiment corresponding to the sample data obtaining method in the first embodiment, and implementation details in the first embodiment may be applied to this embodiment, which is not described herein again.

A fourth embodiment of the present invention relates to a feature extraction device which, as shown in Fig. 4, includes: the sample data acquisition device 1 of the third embodiment and a model generation device 2 connected to the sample data acquisition device 1. The model generation device 2 is configured to perform model training using the positive sample data and the negative sample data obtained by the sample data acquisition device 1 to obtain a trained tree model, and to obtain user features according to the trained tree model.

It should be noted that, the feature extraction device in this embodiment is an embodiment of a device corresponding to the feature extraction method in the second embodiment, and implementation details in the second embodiment may be applied to this embodiment, which is not described herein again.

A fifth embodiment of the present invention relates to a processing device which, as shown in Fig. 5, includes at least one processor 501 and a memory 502 communicatively coupled to the at least one processor 501. The memory 502 stores instructions executable by the at least one processor 501, and the instructions are executed by the at least one processor 501 to enable the at least one processor 501 to perform the sample data acquisition method of the first embodiment or the feature extraction method of the second embodiment.

The memory 502 and the processor 501 are connected by a bus. The bus may include any number of interconnected buses and bridges, and connects the various circuits of the one or more processors 501 and the memory 502. The bus may also connect various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and are therefore not described further here. A bus interface provides an interface between the bus and a transceiver. The transceiver may be a single element or a plurality of elements, such as a plurality of receivers and transmitters, and provides a means for communicating with various other apparatuses over a transmission medium. Data processed by the processor 501 is transmitted over a wireless medium via an antenna; the antenna also receives data and passes it to the processor 501.

The processor 501 is responsible for managing the bus and for general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory 502 may be used to store data used by the processor 501 in performing operations.

A sixth embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the sample data acquisition method of the first embodiment or the feature extraction method of the second embodiment.

That is, those skilled in the art will understand that all or part of the steps of the methods in the above embodiments may be implemented by a program stored in a storage medium, the program including several instructions for causing a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments described herein. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, an optical disk, or any other medium capable of storing program code.

It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the invention and that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims

1. A method of sample data acquisition, comprising:

acquiring a content tag of a content to be consumed, consumption content data of the content to be consumed by a user, and consumption integrity of each content to be consumed by the user;

determining a long-term preference tag for the user based on the frequency of occurrence of the content tag in the consumed content data, comprising:

determining the frequency of occurrence of each of the content tags in the consumed content data;

filtering the frequency of each content tag in the consumption content data to obtain a filtered frequency;

taking the content label corresponding to the filtered frequency as a long-term preference label of the user;

determining an instant hot tag of the user according to the consumed data of the content tag and the attention degree of the user to the content tag, comprising:

determining a first change trend of influence of each content label along with time according to consumed data of the content labels;

determining a second change trend of influence of each content label on the user along with time according to the first change trend and the attention degree of the user on each content label;

determining an instant hot spot label of the user according to the second change trend;

dividing the content label of the content to be consumed into a positive sample label and a negative sample label according to the consumption integrity;

dividing the positive sample tag, the negative sample tag, the long-term preference tag and the instant hot tag into positive sample data and negative sample data as sample data for extracting the characteristics of the user.

2. The sample data acquisition method of claim 1, wherein the dividing the positive sample tag, the negative sample tag, the long-term preference tag, and the instant hot tag into positive sample data and negative sample data comprises:

and taking the positive sample label, the long-term preference label and the instant hot label as the positive sample data, and taking the negative sample label, the long-term preference label and the instant hot label as the negative sample data.

3. The sample data acquisition method according to claim 1, wherein the determining a first trend of change in influence of each of the content tags over time from the consumed data of the content tags includes:

acquiring a plurality of content tags with the total consumption user number larger than a preset value according to the consumed data of the content tags;

and determining a first change trend of the influence of each content tag along with time according to the intermediate value of the total consumption user number corresponding to the content tags and the total consumption user number corresponding to each content tag.

4. The sample data obtaining method according to claim 3, wherein the first trend of the influence of each content tag over time is calculated by the following formula:

$f_j(t) = \frac{P_j}{P_{mid}} \times 0.5^{t}$

5. The sample data obtaining method according to claim 1, wherein the second trend of the influence of each content tag on the user over time is calculated by the following formula:

$f_{ij}(t) = f_j(t) \times g_i(j)$

wherein $i$ denotes the user, $j$ denotes the content tag, $f_j(t)$ is the first change trend, $f_{ij}(t)$ is the second change trend, and $g_i(j)$ is the attention degree of the user to the j-th content tag; $g_i(j) = 1$ when the user has consumed the j-th content tag, and $g_i(j) = 0$ when the user has not consumed the j-th content tag.

6. A feature extraction method, comprising:

obtaining the sample data using the sample data obtaining method of any one of claims 1 to 5;

model training is carried out by utilizing the positive sample data and the negative sample data to obtain a trained tree model;

and obtaining user characteristics according to the trained tree model.

7. A processing apparatus, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the sample data acquisition method of any one of claims 1 to 5, or to perform the feature extraction method of claim 6.

8. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the sample data acquisition method of any one of claims 1 to 5, or implements the feature extraction method of claim 6.