CN112100432A - Sample data acquisition method, feature extraction method, processing device and storage medium - Google Patents

Info

Publication number
CN112100432A
Authority
CN
China
Prior art keywords
content
user
tag
label
sample data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010981752.6A
Other languages
Chinese (zh)
Other versions
CN112100432B (en
Inventor
陈强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, MIGU Culture Technology Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202010981752.6A priority Critical patent/CN112100432B/en
Publication of CN112100432A publication Critical patent/CN112100432A/en
Application granted granted Critical
Publication of CN112100432B publication Critical patent/CN112100432B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63 Querying
    • G06F16/635 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/735 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The embodiment of the invention relates to the technical field of data processing and discloses a sample data acquisition method. The method processes the user's long-term interest preference and the user's short-term attention to instant hot spots separately to obtain a long-term preference label and an instant hot spot label, divides the content labels consumed by the user into positive sample labels and negative sample labels according to the user's consumption integrity, and divides the positive sample labels, the negative sample labels, the long-term preference label and the instant hot spot label into positive sample data and negative sample data to be used as sample data for extracting the features of the user. The invention provides a sample data acquisition method, a feature extraction method, a processing device and a storage medium. Using the positive sample data and negative sample data obtained by the sample data acquisition method of the embodiment as sample data for extracting the features of the user makes it possible to acquire user features that accurately represent the user's interest preference.

Description

Sample data acquisition method, feature extraction method, processing device and storage medium
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to a sample data acquisition method, a feature extraction method, a processing device and a storage medium.
Background
In order to find the user's points of interest, the prior art uses consumption data over different time spans for the audio, video and reading content consumed by the user to calculate changes in the user's points of interest and to find new ones. The points of interest are described by a plurality of labels, and their changes are described by a label value attenuation method.
However, the inventor finds that the prior art directly uses the interest tags of the content consumed by the user as sample data, so that the resulting interest tags of the user are simply equal to the tags of the consumed content and cannot accurately represent the user's interest preference.
Disclosure of Invention
The embodiment of the invention aims to provide a sample data acquisition method, a feature extraction method, a processing device and a storage medium.
In order to solve the above technical problem, an embodiment of the present invention provides a sample data obtaining method, including: acquiring a content tag of content to be consumed, consumption content data of the content to be consumed by a user and consumption integrity of each content to be consumed by the user; determining a long-term preference tag for the user based on a frequency of occurrence of the content tag in the consumed content data; determining an instant hot spot label of the user according to the consumed data of the content label and the attention of the user to the content label; dividing the content labels of the content to be consumed into positive sample labels and negative sample labels according to the consumption integrity; and dividing the positive sample label, the negative sample label, the long-term preference label and the instant hotspot label into positive sample data and negative sample data to be used as sample data for extracting the characteristics of the user.
Additionally, the dividing of the positive sample tag, the negative sample tag, the long-term preference tag, and the instant hotspot tag into positive sample data and negative sample data comprises: taking the positive sample tag, the long-term preference tag and the instant hotspot tag as the positive sample data, and taking the negative sample tag, the long-term preference tag and the instant hotspot tag as the negative sample data.
Additionally, the determining a long term preference tag for the user based on the frequency of occurrence of the content tag in the consumed content data comprises: determining a frequency of occurrence of each of the content tags in the consumed content data; filtering the frequency of each content tag appearing in the consumption content data to obtain a filtered frequency; and taking the content label corresponding to the filtered frequency as a long-term preference label of the user.
In addition, the determining the instant hotspot tag of the user according to the consumed data of the content tag and the attention of the user to the content tag comprises: determining a first trend of change of the influence of each content tag along with time according to the consumed data of the content tag; determining a second variation trend of the influence of each content label on the user along with time according to the first variation trend and the attention degree of the user on each content label; and determining the instant hot spot label of the user according to the second variation trend.
In addition, the determining a first trend of the influence of each content tag over time according to the consumed data of the content tag comprises: acquiring a plurality of content tags of which the total number of consumed users is greater than a preset value according to the consumed data of the content tags; and determining a first change trend of the influence of each content label along with time according to the intermediate value of the total consumption user number corresponding to the plurality of content labels and the total consumption user number corresponding to each content label.
In addition, the first trend of the influence of each content tag over time is calculated by the following formula:
$f_j(t) = (P_j / P_{mid}) \times 0.5^{t}$
where $f_j(t)$ is the first trend, j is the content tag, $P_j$ is the total number of users corresponding to the content tag, $P_{mid}$ is the median value, and t is time.
In addition, a second trend of the influence of each content tag on the user over time is calculated by the following formula:
$f_{ij}(t) = f_j(t) \times g_i(j)$
where i represents the user, j is the content tag, $f_j(t)$ is the first trend, $f_{ij}(t)$ is the second trend, and $g_i(j)$ is the attention of the user to the j-th content tag: $g_i(j) = 1$ when the user has consumed the j-th content tag, and $g_i(j) = 0$ when the user has not consumed the j-th content tag.
The embodiment of the invention also provides a feature extraction method, which comprises the following steps: obtaining the sample data by using the sample data acquisition method of any one of claims 1 to 7; performing model training by using the positive sample data and the negative sample data to obtain a trained tree model; and obtaining the user characteristics according to the trained tree model.
An embodiment of the present invention further provides a processing apparatus, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the sample data acquisition method described above or the feature extraction method described above.
The embodiment of the invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the sample data acquisition method described above or the feature extraction method described above.
Compared with the prior art, the embodiment of the invention provides a sample data acquisition method that acquires the content tags of the content to be consumed, the consumption content data of the user for that content, and the user's consumption integrity for each content to be consumed; determines the long-term preference tag of the user according to the frequency with which the content tags appear in the consumption content data; and determines the instant hot spot tag of the user according to the consumed data of the content tags and the user's attention to them. In the embodiment, the user's long-term interest preference and short-term attention to instant hot spots are processed separately to obtain the long-term preference tag and the instant hot spot tag; the content tags consumed by the user are divided into positive sample tags and negative sample tags according to the user's consumption integrity; and the positive sample tags, the negative sample tags, the long-term preference tag and the instant hot spot tag are divided into positive sample data and negative sample data, which serve as sample data for extracting the features of the user. Compared with the prior-art approach of directly using the interest tags consumed by the user as sample data, using these positive and negative sample data as the sample data for feature extraction makes it possible to obtain user features that accurately represent the user's interest preference.
Drawings
One or more embodiments are illustrated by way of example in the corresponding figures of the accompanying drawings, in which like reference numerals refer to similar elements; the figures are not to scale unless otherwise specified.
Fig. 1 is a schematic flow chart of a sample data acquisition method according to a first embodiment of the present invention;
Fig. 2 is a schematic flow chart of a feature extraction method according to a second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a sample data acquisition apparatus according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a feature extraction device according to a fourth embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a processing device according to a fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments to help the reader better understand the present application. However, the technical solution claimed in the present application can be implemented even without these technical details and with various changes and modifications based on the following embodiments.
The core of the embodiment is to acquire the content tags of the content to be consumed, the consumption content data of the user for that content, and the user's consumption integrity for each content to be consumed, so that the long-term preference tag of the user is determined according to the frequency with which the content tags appear in the consumption content data, and the instant hot spot tag of the user is determined according to the consumed data of the content tags and the user's attention to them. In the embodiment, the user's long-term interest preference and short-term attention to instant hot spots are processed separately to obtain the long-term preference tag and the instant hot spot tag; the content tags consumed by the user are divided into positive sample tags and negative sample tags according to the user's consumption integrity; and the positive sample tags, the negative sample tags, the long-term preference tag and the instant hot spot tag are divided into positive sample data and negative sample data, which serve as sample data for extracting the features of the user. Compared with the prior-art approach of directly using the interest tags consumed by the user as sample data, using these positive and negative sample data as the sample data for feature extraction makes it possible to obtain user features that accurately represent the user's interest preference.
The following describes implementation details of the sample data obtaining method of the present embodiment in detail, and the following is only provided for facilitating understanding of the implementation details and is not necessary for implementing the present embodiment.
A schematic flow diagram of the sample data acquisition method in this embodiment is shown in fig. 1:
Step 101: acquire the content tags of the content to be consumed, the consumption content data of the user for the content to be consumed, and the user's consumption integrity for each content to be consumed.
Specifically, the content to be consumed includes videos (e.g., long videos such as movies, television series and variety shows, as well as short videos), music, and books (e.g., novels, history, training, etc.).
The content tags of the content to be consumed that can be selectively used in this embodiment include:
(1) The ID of the content to be consumed, i.e., the unique identification code of the content.
(2) Content modalities of the content to be consumed, for example: long video, short video, music, books.
(3) The content name of the content to be consumed, for example: movie names, TV series names, variety show names, music names, book names, and the content names associated with short videos.
(4) Content category tags of the content to be consumed, including content type tags (e.g., sports, entertainment, military, economic, educational, scientific, etc.), short video type tags, type tags for films, TV series and books (e.g., martial arts, antiques, fantasy, history, employment, etc.), and music types (e.g., hormons, melancholy, thanksgiving, inspirations, etc.).
(5) Content keyword tags of the content to be consumed, including character tags (e.g., directors, actors, singers, athletes, political figures, etc.), entity tags (e.g., organization names, city names, etc.), and event tags (e.g., hot events, earthquakes, volcanoes, epidemics, etc.) related to the content.
(6) The content shelf time of the content to be consumed, that is, the earliest point in time at which the user can consume the content, for example: the release time of short videos, the premiere time of movies and TV series, and the publication time of music and books.
(7) The content quality score of the content to be consumed, an index that measures the popularity of the content among users and is calculated from user behavior. For example, the quality score of a piece of content is calculated comprehensively from indexes such as the number of users of the content, the average playing integrity, the average number of plays, and the total playing duration.
In this embodiment, the consumption content data of the user for the content to be consumed records, for example, when the user starts and when the user stops watching a video, listening to a piece of music, or reading a book. The consumption content data can accurately reflect the user's preferences. For example, if the user watches many films or reads many books of a certain type, listens to many songs by a certain singer, or follows short videos related to a certain keyword, the user is interested in the content tags of that content. Conversely, if the user switches away from some content after playing only a small part or a few seconds of it, the user may not be interested in the content tags of that content.
In addition, according to the consumption content data of the content to be consumed by the user, the following can be obtained:
(1) Content consumption period preferences. Based on the times at which the user consumes content, such as when the user starts watching long or short videos, listening to music, or reading e-books, it is measured whether the user prefers to consume content in certain periods of the day (such as in the morning or before sleep).
(2) Content novelty consumption preferences. The smaller the difference between the content consumption time and the content shelf time, the higher the novelty; the larger the difference, the lower the novelty. The user's preference for content novelty can be judged from the novelty distribution of the content the user consumes.
(3) Content consumption integrity. For example, the integrity of a video or piece of music consumed by the user is obtained by dividing the duration played by the user by the total duration of the video or music, and the user's reading integrity is obtained by dividing the number of pages read by the total number of pages of the book.
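The two ratio-style quantities above (novelty gap and consumption integrity) can be sketched as follows. This is only an illustrative sketch: the function names, the clamping of the ratio to [0, 1] (for example when a track is replayed), and the use of days as the unit of the novelty gap are assumptions, not details given in the text.

```python
from datetime import datetime


def consumption_integrity(consumed: float, total: float) -> float:
    """Fraction of a content item consumed (played seconds / total seconds,
    or pages read / total pages), clamped to [0, 1] as an assumption."""
    if total <= 0:
        raise ValueError("total must be positive")
    return max(0.0, min(1.0, consumed / total))


def novelty_gap_days(consumed_at: datetime, shelved_at: datetime) -> float:
    """Days between shelf time and consumption time; a smaller gap means higher novelty."""
    return (consumed_at - shelved_at).total_seconds() / 86400.0


video_integrity = consumption_integrity(27 * 60, 30 * 60)  # 27 of 30 minutes played
book_integrity = consumption_integrity(120, 400)           # 120 of 400 pages read
gap = novelty_gap_days(datetime(2020, 9, 3), datetime(2020, 9, 1))
```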
Step 102: determine the long-term preference tag of the user according to the frequency with which the content tags appear in the consumption content data.
In this embodiment, the long-term preference tags that reflect the user's interests are screened out of the user's consumption content data according to the frequency with which the content tags appear in that data; the long-term preference tags represent the user's long-term interests.
Determining the long-term preference tag of the user according to the frequency of occurrence of the content tags in the consumption content data includes: determining the frequency of occurrence of each content tag in the consumption content data; filtering these frequencies to obtain filtered frequencies; and taking the content tags corresponding to the filtered frequencies as the long-term preference tags of the user.
Specifically, according to the "80/20 principle" (the "two-eight principle"), 80% of the content a user consumes is concentrated in about 20% of the whole content domain; that is, a small portion of the tags covers most of the content viewed by the user. Reflected in the data, for each particular user, a small proportion of tags appear at high frequency in the content the user has consumed, while a large proportion of tags appear at low frequency or not at all. The high-frequency tags represent the user's long-term interest preferences. Based on this, the inventor designs a frequency domain filter that attenuates tags of different frequencies to different degrees.
Let the n contents to be consumed in total (video, music, novels) form a content set C to be consumed, as shown in formula (1):
$C = [c_1\ c_2\ c_3\ \dots\ c_n]$ (1)
The tag set L of the m content tags contained in the content set C to be consumed is shown in formula (2):
$L = [l_1\ l_2\ l_3\ \dots\ l_m]$ (2)
Assume that the set $C_i$ of content that user i has consumed is as shown in formula (3):
$C_i = [c_1\ c_2\ c_3\ \dots\ c_p\ \dots]$ (3)
where $c_p$ is the p-th content consumed by the user, and $p \le n$.
The set of all content tags contained in the consumed content set $C_i$ is denoted as the list $L_j$, as shown in formula (4):
$L_j = [l_1\ l_2\ l_3\ \dots\ l_j\ \dots]$ (4)
where $l_j$ is the value of the j-th content tag consumed by user i.
The frequency with which every tag in the tag set $L_j$ appears in the content consumed by user i is computed, generating the tag frequency vector of user i: $[F_{i0}\ F_{i1}\ F_{i2}\ \dots\ F_{ij}\ \dots\ F_{im}]$, where $F_{ij}$ represents the frequency of occurrence of tag j in the content consumed by user i. The maximum frequency $F_{max}$ in the tag frequency vector is then determined, as shown in formula (5):
$F_{max} = \max([F_{i0}\ F_{i1}\ F_{i2}\ \dots\ F_{ij}\ \dots\ F_{im}])$ (5)
The frequencies of all the content tags consumed by user i are substituted into a filter whose expression is shown in formula (6):
$s = \tan(\pi f_c / f_s)\,\dfrac{1 + z^{-1}}{1 - z^{-1}}$ (6)
where the sampling frequency is $f_s = F_{max}$ and the cut-off frequency is taken as $f_c = 0.7 f_s$ (the frequency corresponding to 3 dB attenuation). The user's high-frequency tags are attenuated little or not at all, while the low-frequency tags are strongly attenuated, so that the high-frequency tags representing the user's long-term interest preference can be screened out. It should be noted that the filter used in this embodiment is a Butterworth filter, a type of electronic filter also known as a maximally flat filter. In practical applications, other filters capable of implementing the frequency domain filtering may be used, which is not limited in this embodiment.
For example, a user who follows the series "celebration year" watches every newly released episode, shows clear playback behavior for the series' trailers, highlight clips and the like, clicks on and reads part of the novel of the same name, and also watches a small amount of "lace life" content related to "ancient dress crossing", "stretching" and "stalking". The appearance frequencies of content labels such as "celebration year", "ancient dress crossing", "stretching", "stalking" and "lace life" are calculated. The frequency of "celebration year" is markedly high and is attenuated only slightly after high-frequency filtering; "ancient dress crossing", "stretching" and "stalking" are attenuated to some extent; and "lace life" is strongly attenuated because its frequency is markedly low. In this way, the resulting content tag vector highlights the user's long-term interest preferences.
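The frequency-domain screening of step 102 can be sketched as follows. This is a simplified illustration, not the embodiment's digital filter: it weights each tag's count by the magnitude response of an analog Butterworth high-pass filter with cut-off 0.7 × F_max, instead of implementing the transform in formula (6). The filter order, the `keep_ratio` threshold, and all function and tag names are hypothetical.

```python
from collections import Counter
import math


def tag_frequencies(consumed_items):
    """Count how often each tag appears across the items a user consumed."""
    counts = Counter()
    for tags in consumed_items:
        counts.update(tags)
    return counts


def highpass_gain(f, fc, order=2):
    """Magnitude response of an analog Butterworth high-pass filter at frequency f."""
    if f <= 0:
        return 0.0
    return 1.0 / math.sqrt(1.0 + (fc / f) ** (2 * order))


def long_term_tags(consumed_items, keep_ratio=0.5, order=2):
    """Attenuate tag frequencies and keep the tags that survive the filter.

    keep_ratio is a hypothetical threshold relative to F_max; the text does not
    specify how the filtered frequencies are turned into a tag selection.
    """
    counts = tag_frequencies(consumed_items)
    if not counts:
        return {}
    f_max = max(counts.values())
    fc = 0.7 * f_max  # cut-off frequency as stated in the text
    filtered = {tag: f * highpass_gain(f, fc, order) for tag, f in counts.items()}
    return {tag: v for tag, v in filtered.items() if v >= keep_ratio * f_max}


# Ten episodes of one series, three of them also tagged "costume", plus one gossip clip.
items = [["series_A", "costume"]] * 3 + [["series_A"]] * 7 + [["gossip"]]
kept = long_term_tags(items)
```

With these inputs the dominant tag keeps roughly 90% of its count, while the rarer tags are attenuated far below the threshold and dropped.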
Step 103: and determining the instant hot spot label of the user according to the consumed data of the content label and the attention degree of the user to the content label.
In addition to long-term preferences for content types, forms and the like, a user's content consumption is also noticeably affected by instant hot spots. For example, many people have no particular interest in hospitals or infectious diseases, but still pay attention to hot content such as medical alerts and public opinion about the coronavirus. In this embodiment, the instant hot spot tag of the user is screened out of the content tags consumed by the user according to the consumed data of the content tags and the user's attention to them; the instant hot spot tag represents an instant hot spot that the user follows in the short term.
Specifically, determining an instant hot spot tag of a user according to consumed data of the content tag and the attention of the user to the content tag includes: determining a first trend of change of the influence of each content tag over time according to consumed data of the content tag; determining a second variation trend of the influence of each content label on the user along with time according to the first variation trend and the attention of the user to each content label; and determining the instant hot spot label of the user according to the second variation trend.
Wherein determining a first trend of change of the influence of each content tag over time from consumed data of the content tag comprises: acquiring a plurality of content labels of which the total consumption user number is greater than a preset value according to the consumed data of the content labels; and determining a first change trend of the influence of each content label along with time according to the intermediate value of the total consumption user number corresponding to the plurality of content labels and the total consumption user number corresponding to each content label.
For example, in this embodiment, a time-domain filtering method may be used to calculate how the influence of a hot spot on the user changes over time. According to the consumed data of the content tags, the 100 content tags with the largest number of consuming users in the last half year are obtained. Let the median number of users of these 100 content tags be $P_{mid}$ and the number of consuming users of content tag j be $P_j$. The influence of content tag j is defined as $A_j$, as shown in formula (7):
$A_j = P_j / P_{mid}$ (7)
where $A_j$ is taken to be equal to 1 when $A_j$ is greater than 1.
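The influence calculation of formula (7) can be sketched as follows, assuming the per-tag consuming-user counts are already available as a dictionary. The function name and the sample counts are hypothetical; `top_k` corresponds to the 100 tags mentioned in the text.

```python
from statistics import median


def tag_influence(user_counts, top_k=100):
    """Influence A_j = P_j / P_mid, capped at 1, for the top_k most-consumed tags."""
    top = sorted(user_counts.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    p_mid = median(count for _, count in top)
    if p_mid <= 0:
        raise ValueError("median user count must be positive")
    return {tag: min(1.0, count / p_mid) for tag, count in top}


influence = tag_influence({"a": 900, "b": 300, "c": 100}, top_k=3)
```

Here the median is 300, so tags "a" and "b" are capped at an influence of 1 and tag "c" receives one third.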
Specifically, the first trend of the influence of each content tag with time is calculated by the following formula (8):
$f_j(t) = (P_j / P_{mid}) \times 0.5^{t}$ (8)
where $f_j(t)$ is the first trend, j is the content tag, $P_j$ is the total number of users corresponding to the content tag, $P_{mid}$ is the median value, and t is time.
Specifically, the second trend of the influence of each content tag on the user over time is calculated by the following formula (9):
$f_{ij}(t) = f_j(t) \times g_i(j)$ (9)
where i denotes the user, j is the content tag, $f_j(t)$ is the first trend, $f_{ij}(t)$ is the second trend, and $g_i(j)$ is the attention of the user to the j-th content tag: $g_i(j) = 1$ when the user has consumed the j-th content tag, and $g_i(j) = 0$ when the user has not consumed the j-th content tag.
It should be noted that the time t in the above formula (8) and formula (9) is discretely calculated by day.
The current influence and the change trend of any content label j in all users can be calculated according to the formula (8), and the current influence and the change trend of any label j on the user i can be calculated according to the formula (9).
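Formulas (8) and (9) can be sketched as follows, assuming that the decay factor in formula (8) is the exponential 0.5 raised to the power t, with t counted in whole days. The text does not state how the instant hot spot tag is chosen from the second trend, so the `threshold` parameter below is a hypothetical selection rule, as are the function names and sample counts.

```python
def first_trend(p_j, p_mid, t_days):
    """f_j(t): influence of tag j across all users after t days (assumed halving per day)."""
    return (p_j / p_mid) * 0.5 ** t_days


def second_trend(p_j, p_mid, t_days, consumed):
    """f_ij(t) = f_j(t) * g_i(j), where g_i(j) is 1 if user i consumed tag j, else 0."""
    g = 1 if consumed else 0
    return first_trend(p_j, p_mid, t_days) * g


def instant_hot_tags(tag_users, p_mid, consumed_tags, t_days, threshold=0.5):
    """Keep tags whose influence on this user is still at or above a hypothetical threshold."""
    return [
        tag
        for tag, p_j in tag_users.items()
        if second_trend(p_j, p_mid, t_days, tag in consumed_tags) >= threshold
    ]


hot = instant_hot_tags(
    {"epidemic": 400, "earthquake": 150, "volcano": 500},
    p_mid=200,
    consumed_tags={"epidemic", "earthquake"},
    t_days=1,
)
```

In this example "volcano" is excluded because the user never consumed it, and "earthquake" is excluded because its influence has already decayed below the threshold after one day.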
Step 104: and dividing the content labels of the content to be consumed into positive sample labels and negative sample labels according to the consumption integrity.
Specifically, the consumption integrity of each content to be consumed by the user is obtained in step 101. In this embodiment, a video whose playing integrity by the user is more than 90% is taken as a video consumption positive sample, a book whose reading integrity is more than 50% is taken as a book consumption positive sample, and a song that is listened to in full more than once per day is taken as a music consumption positive sample. A video whose playing integrity by the user is less than 10% is taken as a video consumption negative sample, a book whose reading integrity is less than 10% is taken as a book consumption negative sample, and a song whose listening duration is less than 10 seconds is taken as a music consumption negative sample. It should be noted that the above integrity thresholds for dividing positive and negative samples are only examples; in practical applications, they can be set according to actual requirements.
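The threshold rules of step 104 can be sketched as follows, using the example thresholds given above as defaults. The function names, the record layout, and the choice to leave records between the two thresholds unlabelled are assumptions made for illustration.

```python
def label_by_integrity(integrity, pos_threshold, neg_threshold=0.10):
    """Return 'positive', 'negative', or None when the record falls between the thresholds."""
    if integrity > pos_threshold:
        return "positive"
    if integrity < neg_threshold:
        return "negative"
    return None


def label_video(play_integrity):
    return label_by_integrity(play_integrity, pos_threshold=0.90)


def label_book(read_integrity):
    return label_by_integrity(read_integrity, pos_threshold=0.50)


def label_music(full_listens_per_day, listen_seconds):
    """Positive if fully listened to more than once a day; negative if under 10 seconds."""
    if full_listens_per_day > 1:
        return "positive"
    if listen_seconds < 10:
        return "negative"
    return None


def split_tags(records):
    """records: iterable of (tags, label). Returns (positive_tags, negative_tags)."""
    positive, negative = set(), set()
    for tags, label in records:
        if label == "positive":
            positive.update(tags)
        elif label == "negative":
            negative.update(tags)
    return positive, negative


records = [
    (["costume drama"], label_video(0.95)),
    (["gossip"], label_video(0.05)),
    (["history"], label_book(0.30)),
    (["ballad"], label_music(2, 240)),
]
positive_tags, negative_tags = split_tags(records)
```

Note that in this sketch a tag attached to both a fully watched item and an abandoned item would land in both sets; how such conflicts are resolved is not specified in the text.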
Step 105: and dividing the positive sample label, the negative sample label, the long-term preference label and the instant hotspot label into positive sample data and negative sample data to be used as sample data for extracting the characteristics of the user.
Specifically, the long-term preference tag of the user after the frequency domain filtering is obtained in the step 102, and the instant hotspot tag of the user is obtained after the time domain filtering is performed in the step 103.
In this embodiment, the positive sample tag, the negative sample tag, the long-term preference tag, and the instant hotspot tag are divided into positive sample data and negative sample data, which includes: and taking the positive sample tag, the long-term preference tag and the instant hot spot tag as positive sample data, and taking the negative sample tag, the long-term preference tag and the instant hot spot tag as negative sample data.
Specifically, the positive sample tag, the long-term preference tag of the user and the instant hot spot tag of the user are marked as 1 and used as positive sample data; the negative sample tag, the long-term preference tag of the user and the instant hot spot tag of the user are marked as 0 and used as negative sample data. The positive sample data and the negative sample data are then used as sample data for extracting the user features. Compared with the prior-art approach of directly using the interest tags consumed by the user as sample data, using the positive and negative sample data obtained by the sample data acquisition method of this embodiment helps to obtain user features that accurately represent the user's interest preference.
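The assembly of step 105 can be sketched as follows. The row layout (one row per consumed content tag, carrying the user's long-term and instant hot spot tags alongside a 1/0 mark) is an assumed representation chosen for illustration; the text only states which tags are grouped together and how they are marked.

```python
def build_samples(positive_tags, negative_tags, long_term_tags, hot_tags):
    """Combine each sample tag with the user's long-term and hot spot tags.

    Rows built from positive sample tags are marked 1, rows built from
    negative sample tags are marked 0.
    """
    long_term = tuple(sorted(long_term_tags))
    hot = tuple(sorted(hot_tags))
    rows = []
    for tag in sorted(positive_tags):
        rows.append({"tag": tag, "long_term": long_term, "hot": hot, "label": 1})
    for tag in sorted(negative_tags):
        rows.append({"tag": tag, "long_term": long_term, "hot": hot, "label": 0})
    return rows


samples = build_samples(
    positive_tags={"costume drama", "ballad"},
    negative_tags={"gossip"},
    long_term_tags={"series_A"},
    hot_tags={"epidemic"},
)
```

Rows in this form could then be encoded (for example one-hot) before being passed to the tree model training described for the feature extraction method.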
Compared with the prior art, this embodiment of the invention provides a sample data acquisition method that processes the user's long-term interest preference and the user's short-term attention to instant hotspots separately, obtaining a long-term preference tag and an instant hotspot tag. The content tags consumed by the user are divided into positive sample tags and negative sample tags according to the user's consumption integrity, and the positive sample tags, negative sample tags, long-term preference tags and instant hotspot tags are then divided into positive sample data and negative sample data, which serve as the sample data for extracting the features of the user.
The steps of the above methods are divided for clarity of description. In implementation, several steps may be combined into one step, or one step may be split into multiple steps; as long as the same logical relationship is included, these variations are all within the protection scope of this patent. Adding insignificant modifications to an algorithm or process, or introducing insignificant design changes, without changing the core design of the algorithm or process, is also within the protection scope of this patent.
In the prior art, in order to find the interest preference of a user, various keywords are extracted by clustering the historical data of the content consumed by the user, and some of these keywords are taken as user features to describe the interest preference of the user. However, a user often consumes a piece of content not because the user likes that type of content or a character in it, but because the content is a hotspot that has attracted wide attention over the past couple of days; once the hot period has passed, the user will not consume the same type of content again. The existing methods for describing user interest preference equate the keywords of the content consumed by the user with the user's interest preference. As a result, content that does not reflect the user's interest preference, such as hotspot content, is also treated as preferred content, and its keywords are mistakenly taken as points of interest of the user, so that the user's points of interest cannot be accurately represented.
A second embodiment of the present invention relates to a feature extraction method. A schematic flow chart of the feature extraction method in this embodiment is shown in fig. 2, and specifically includes:
Step 201: obtaining the sample data by using the sample data acquisition method in the above embodiment.
The sample data acquisition method used in this step has been described in detail in the first embodiment and is not described again here. The sample data acquired by the sample data acquisition method in the first embodiment includes: positive sample data (positive sample tag, long-term preference tag and instant hotspot tag) and negative sample data (negative sample tag, long-term preference tag and instant hotspot tag).
Step 202: performing model training by using the positive sample data and the negative sample data to obtain a trained tree model.
Specifically, a Gradient Boosting Decision Tree (GBDT) algorithm is trained on the sample data to generate a tree model. During model training, the number of trees in the tree model is set according to the number of long-term preference tags and the number of instant hotspot tags, and the maximum depth of each tree is 3. It should be noted that this embodiment gives one implementation of training a tree model, but it is to be understood that other prior-art ways of training a tree model may also be adopted; they are not described in detail in this embodiment.
Step 203: acquiring the user features according to the trained tree model.
Specifically, in one implementation, the trained tree model is used directly to extract the interest features of the user: the prediction result output by the GBDT algorithm is obtained, giving a prediction of whether the user is interested in a given content. In another implementation, the outputs of the tree nodes in the trained tree model are combined into a feature vector, and the feature vector is used to describe the features of the user. The feature vector contains deeper knowledge, such as the relationships between tags, and can be used as input to other algorithms such as user prediction and identification, so that its range of application is wider and its effect in use is better.
Experiments show that, in a short-video recommendation scenario, using the feature vector of this proposal as input, compared with using other data as input, raises the AUC (Area Under Curve, an evaluation index for measuring the quality of a binary classification model, representing the probability that a predicted positive example is ranked ahead of a negative example) by 12% for recommendation with a Logistic Regression (LR) algorithm, by 5% for recommendation with the GBDT algorithm, and by 3% for recommendation with an LR + GBDT algorithm.
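The training step can be sketched in plain Python as follows. This is a deliberately simplified gradient-boosting loop (log-loss residuals fitted by small regression trees, with no Newton step or regularisation), not the implementation used in the patent. The encoding of each sample as a numeric row, the learning rate, and all function names are assumptions; only the maximum depth of 3 comes from the text.

```python
import math


def fit_tree(X, residuals, depth=0, max_depth=3):
    """Fit a small regression tree (nested dicts) to the residuals."""
    mean = sum(residuals) / len(residuals)
    if depth == max_depth or len(set(residuals)) == 1:
        return {"leaf": mean}
    best = None
    for f in range(len(X[0])):
        # Skip the largest value so that both sides of a split are non-empty.
        for thr in sorted({row[f] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[f] <= thr]
            right = [i for i, row in enumerate(X) if row[f] > thr]
            sse = 0.0
            for idx in (left, right):
                m = sum(residuals[i] for i in idx) / len(idx)
                sse += sum((residuals[i] - m) ** 2 for i in idx)
            if best is None or sse < best[0]:
                best = (sse, f, thr, left, right)
    if best is None:  # no feature can split these rows
        return {"leaf": mean}
    _, f, thr, left, right = best
    return {
        "feature": f,
        "threshold": thr,
        "left": fit_tree([X[i] for i in left], [residuals[i] for i in left],
                         depth + 1, max_depth),
        "right": fit_tree([X[i] for i in right], [residuals[i] for i in right],
                          depth + 1, max_depth),
    }


def predict_tree(tree, row):
    """Walk one tree from the root to a leaf and return the leaf value."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_gbdt(X, y, n_trees, learning_rate=0.5, max_depth=3):
    """Train n_trees trees, each fitted to the current log-loss residuals."""
    trees, scores = [], [0.0] * len(X)
    for _ in range(n_trees):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        tree = fit_tree(X, residuals, 0, max_depth)
        trees.append(tree)
        scores = [s + learning_rate * predict_tree(tree, row)
                  for s, row in zip(scores, X)]
    return trees


def predict_proba(trees, row, learning_rate=0.5):
    """Predicted probability that the user is interested in this row."""
    return sigmoid(sum(learning_rate * predict_tree(t, row) for t in trees))
```

One possible reading of "the number of trees is set according to the number of tags" is `n_trees = len(long_term_tags) + len(hotspot_tags)`; that rule is a guess, since the text does not give the exact relationship. In practice a library implementation of GBDT would normally replace this hand-written loop.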
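One common way to turn tree outputs into a feature vector is to one-hot encode the leaf that a sample reaches in each tree and concatenate the encodings. The sketch below illustrates that reading; the text does not specify the exact encoding, so this is an assumption. The nested-dict tree layout and the two hand-written trees are hypothetical stand-ins for a trained model.

```python
def leaves(tree):
    """Return the leaf nodes of a nested-dict tree in left-to-right order."""
    if "leaf" in tree:
        return [tree]
    return leaves(tree["left"]) + leaves(tree["right"])


def reached_leaf(tree, row):
    """Return the leaf node that `row` falls into."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


def leaf_feature_vector(trees, row):
    """Concatenate, over all trees, a one-hot encoding of the reached leaf."""
    vector = []
    for tree in trees:
        hit = reached_leaf(tree, row)
        vector.extend(1 if leaf is hit else 0 for leaf in leaves(tree))
    return vector


# Two hand-written trees standing in for a trained tree model.
tree_a = {"feature": 0, "threshold": 0,
          "left": {"leaf": -0.4}, "right": {"leaf": 0.4}}
tree_b = {"feature": 1, "threshold": 0,
          "left": {"leaf": -0.1},
          "right": {"feature": 2, "threshold": 0,
                    "left": {"leaf": 0.0}, "right": {"leaf": 0.3}}}

user_vector = leaf_feature_vector([tree_a, tree_b], [1, 1, 0])
```

The resulting vector has one position per leaf across all trees, and it can be passed to a downstream model such as logistic regression.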
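The probabilistic reading of AUC given above can be computed directly by comparing every positive example with every negative example. The sketch below does so; the function name `auc` and the convention of counting ties as one half are illustrative choices, and the numbers in the usage example are made up rather than taken from the experiments.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. Runs in O(P * N) time, which is fine for a
    small illustration.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


example_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

In this made-up example three of the four positive-negative pairs are ordered correctly, so `example_auc` is 0.75.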
Compared with the prior art, the embodiment of the invention provides the feature extraction method, and the positive sample data and the negative sample data obtained by the sample data acquisition method in the first embodiment are used as the sample data for extracting the user features, so that the finally obtained user features can accurately represent the interest preference of the user.
The steps of the above methods are divided for clarity of description. In implementation, several steps may be combined into one step, or one step may be split into multiple steps; as long as the same logical relationship is included, these variations are all within the protection scope of this patent. Adding insignificant modifications to an algorithm or process, or introducing insignificant design changes, without changing the core design of the algorithm or process, is also within the protection scope of this patent.
A third embodiment of the present invention relates to a sample data acquiring apparatus, as shown in fig. 3, including: the data extraction module 11 is configured to obtain a content tag of a content to be consumed, consumption content data of the content to be consumed by a user, and a consumption integrity of each content to be consumed by the user;
the sample data acquiring apparatus 1 further includes: a first tag extraction module 12, a second tag extraction module 13 and a third tag extraction module 14 connected to the data extraction module 11.
Wherein, the first tag extraction module 12 is configured to extract data from the data extraction module 11, and determine the long-term preference tag of the user according to the frequency of the content tag appearing in the consumed content data.
The first tag extraction module 12 specifically includes: a frequency determination submodule 121 connected to the data extraction module, a filtering submodule 122 connected to the frequency determination submodule, and a long-term tag generation submodule 123 connected to the sample generation module;
Specifically, the frequency determination submodule 121 is configured to extract data from the data extraction module and determine the frequency with which each content tag appears in the consumed content data. The filtering submodule 122 is configured to filter the frequencies with which the content tags appear in the consumed content data to obtain the filtered frequencies. The long-term tag generation submodule 123 is configured to take the content tags corresponding to the filtered frequencies as the long-term preference tags of the user.
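The three submodules can be sketched as a count, filter, and select pipeline. The filter used here is a plain share threshold chosen only for illustration; the filtering operation itself is not detailed in this passage, so `min_share` and the function name are assumptions.

```python
from collections import Counter


def long_term_preference_tags(consumed_items, min_share=0.2):
    """Pick the tags that appear often enough in a user's consumed content.

    `consumed_items` is a list of tag lists, one per consumed content.
    A tag is kept when its share of all tag occurrences reaches `min_share`
    (a stand-in for the filtering step).
    """
    # Frequency determination: count how often each tag occurs.
    counts = Counter(tag for item in consumed_items for tag in item)
    total = sum(counts.values())
    if total == 0:
        return []
    # Filtering and long-term tag generation: keep tags above the threshold,
    # most frequent first.
    return [tag for tag, n in counts.most_common() if n / total >= min_share]


history = [["sci-fi", "action"], ["sci-fi"], ["sci-fi", "drama"], ["action"]]
preferred = long_term_preference_tags(history)
```

With this made-up history, `sci-fi` and `action` pass the threshold and `drama` does not.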
The second tag extraction module 13 is configured to determine an instant hot tag of the user from the consumed data of the content tag and the attention of the user to the content tag.
The second tag extraction module 13 specifically includes: a first change trend determining sub-module 131 connected with the data extraction module 11, a second change trend determining sub-module 132 connected with the first change trend determining sub-module 131 and the data extraction module 11, and a hot spot tag generating sub-module 133 connected with the second change trend determining sub-module 132;
Specifically, the first variation trend determining submodule 131 is configured to determine a first variation trend of the influence of each content tag over time according to the consumed data of the content tag. The second variation trend determining submodule 132 is configured to determine, according to the first variation trend and the attention of the user to each content tag, a second variation trend of the influence of each content tag on the user over time. The hotspot tag generation submodule 133 is configured to determine the instant hotspot tag of the user according to the second variation trend.
And the third label extraction module 14 is configured to extract data from the data extraction module, and divide the content labels of the content to be consumed into positive sample labels and negative sample labels according to the consumption integrity.
The sample data acquiring apparatus 1 further includes: a sample generation module 15, connected to the first tag extraction module 12, the second tag extraction module 13, and the third tag extraction module 14 respectively.
The sample generating module 15 is configured to divide the positive sample tag, the negative sample tag, the long-term preference tag, and the instant hotspot tag into positive sample data and negative sample data, which are used as sample data for extracting features of the user.
It should be noted that the sample data acquisition apparatus 1 in this embodiment is an apparatus embodiment corresponding to the sample data acquisition method in the first embodiment, and the implementation details in the first embodiment may be applied to this embodiment, and are not described herein again.
A fourth embodiment of the present invention relates to a feature extraction device, as shown in fig. 4, including: a sample data acquisition device 1 according to the third embodiment, a model generation device 2 connected to the sample data acquisition device 1; the model generating device 2 is configured to perform model training using the positive sample data and the negative sample data obtained by the sample data obtaining device 1 to obtain a trained tree model, and obtain user characteristics according to the trained tree model.
It should be noted that the feature extraction device in this embodiment is a device embodiment corresponding to the feature extraction method in the second embodiment, and the implementation details in the second embodiment may be applied to this embodiment, and are not described herein again.
A fifth embodiment of the present invention relates to a processing apparatus, as shown in fig. 5, including at least one processor 501; and a memory 502 communicatively coupled to the at least one processor 501; the memory 502 stores instructions executable by the at least one processor 501, and the instructions are executed by the at least one processor 501 to enable the at least one processor 501 to execute the sample data acquisition method in the first embodiment; alternatively, the feature extraction method of the second embodiment is performed.
The memory 502 and the processor 501 are coupled by a bus, which may include any number of interconnected buses and bridges that couple one or more of the various circuits of the processor 501 and the memory 502 together. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor 501 is transmitted over a wireless medium through an antenna, which further receives the data and transmits the data to the processor 501.
The processor 501 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 502 may be used to store data used by processor 501 in performing operations.
The sixth embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and when being executed by a processor, the computer program implements the sample data acquiring method in the first embodiment; alternatively, the feature extraction method of the second embodiment is implemented.
That is, as can be understood by those skilled in the art, all or part of the steps of the methods in the embodiments described above may be implemented by a program instructing the related hardware. The program is stored in a storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.

Claims (10)

1. A sample data acquisition method is characterized by comprising the following steps:
acquiring a content tag of content to be consumed, consumption content data of the content to be consumed by a user and consumption integrity of each content to be consumed by the user;
determining a long-term preference tag for the user based on a frequency of occurrence of the content tag in the consumed content data;
determining an instant hot spot label of the user according to the consumed data of the content label and the attention of the user to the content label;
dividing the content labels of the content to be consumed into positive sample labels and negative sample labels according to the consumption integrity;
and dividing the positive sample label, the negative sample label, the long-term preference label and the instant hotspot label into positive sample data and negative sample data to be used as sample data for extracting the characteristics of the user.
2. The method according to claim 1, wherein the dividing the positive exemplar label, the negative exemplar label, the long-term preference label, and the instant hotspot label into positive and negative sample data comprises:
and taking the positive sample tag, the long-term preference tag and the instant hot tag as the positive sample data, and taking the negative sample tag, the long-term preference tag and the instant hot tag as the negative sample data.
3. The method of claim 1, wherein the determining the long-term preference tag of the user according to the frequency of occurrence of the content tag in the consumed content data comprises:
determining a frequency of occurrence of each of the content tags in the consumed content data;
filtering the frequency of each content tag appearing in the consumption content data to obtain a filtered frequency;
and taking the content label corresponding to the filtered frequency as a long-term preference label of the user.
4. The method of claim 1, wherein the determining the instant hotspot tag of the user according to the consumed data of the content tag and the attention of the user to the content tag comprises:
determining a first trend of change of the influence of each content tag along with time according to the consumed data of the content tag;
determining a second variation trend of the influence of each content label on the user along with time according to the first variation trend and the attention degree of the user on each content label;
and determining the instant hot spot label of the user according to the second variation trend.
5. The method of claim 4, wherein the determining a first trend of influence of each of the content tags over time from the consumed data of the content tags comprises:
acquiring a plurality of content tags of which the total number of consumed users is greater than a preset value according to the consumed data of the content tags;
and determining a first change trend of the influence of each content label along with time according to the intermediate value of the total consumption user number corresponding to the plurality of content labels and the total consumption user number corresponding to each content label.
6. The sample data acquisition method according to claim 5, wherein the first trend of change of the influence of each content tag with time is calculated by the following formula:
$f_j(t) = (P_j / P_{mid}) \times 0.5^{t}$
wherein $f_j(t)$ is the first variation trend, $j$ is the content tag, $P_j$ is the total number of consuming users corresponding to the content tag, $P_{mid}$ is the intermediate value, and $t$ is time.
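The formula of claim 6 can be sketched as follows. Two readings are assumed here: the flattened `0.5t` in the extracted text is taken as the exponent 0.5^t (influence halving per unit of time), and the "intermediate value" is taken as the median of the user totals of the tags that pass the preset value. The function names and the sample counts are hypothetical.

```python
from statistics import median


def popular_tags(user_counts, preset_value):
    """Keep the tags whose total number of consuming users exceeds the preset value."""
    return {tag: n for tag, n in user_counts.items() if n > preset_value}


def first_trend(p_j, p_mid, t):
    """f_j(t) = (P_j / P_mid) * 0.5 ** t, under the exponent reading above."""
    return (p_j / p_mid) * 0.5 ** t


counts = {"world-cup": 400, "drama": 200, "election": 100, "niche": 5}
popular = popular_tags(counts, preset_value=10)
p_mid = median(popular.values())  # raises StatisticsError if no tag qualifies
trend_now = {tag: first_trend(n, p_mid, t=0) for tag, n in popular.items()}
```

At `t = 0` a tag with the median user total has influence 1, tags above the median start higher, and every tag's influence halves with each further unit of time.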
7. The sample data acquisition method according to claim 4, wherein the second trend of the influence of each content tag on the user over time is calculated by the following formula:
$f_{ij}(t) = f_j(t) \times g_i(j)$
wherein $i$ represents the user, $j$ is the content tag, $f_j(t)$ is the first variation trend, $f_{ij}(t)$ is the second variation trend, and $g_i(j)$ is the attention of the user to the $j$-th content tag: $g_i(j) = 1$ when the user has consumed the $j$-th content tag, and $g_i(j) = 0$ when the user has not consumed the $j$-th content tag.
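The second variation trend of claim 7 is the first trend masked by whether the user has consumed the tag. The sketch below computes it for one user at one point in time and then selects instant hotspot tags with a simple threshold. The threshold rule, the function names, and the sample values are assumptions; the claims only state that the hotspot tag is determined according to the second variation trend.

```python
def attention(user_consumed_tags, tag):
    """g_i(j): 1 if the user has consumed the tag, otherwise 0."""
    return 1 if tag in user_consumed_tags else 0


def second_trend(first_trend_values, user_consumed_tags):
    """f_ij(t) = f_j(t) * g_i(j) for every tag, at one fixed time t."""
    return {
        tag: value * attention(user_consumed_tags, tag)
        for tag, value in first_trend_values.items()
    }


def instant_hotspot_tags(first_trend_values, user_consumed_tags, threshold):
    """Tags whose influence on this user is still above a hypothetical threshold."""
    trend = second_trend(first_trend_values, user_consumed_tags)
    return sorted(tag for tag, value in trend.items() if value > threshold)


# First-trend values f_j(t) at some time t (made-up numbers).
f_values = {"world-cup": 1.5, "drama": 0.2, "election": 2.0}
consumed = {"world-cup", "drama"}
hot = instant_hotspot_tags(f_values, consumed, threshold=0.5)
```

In this made-up case `election` is influential overall but the user never consumed it, and `drama` was consumed but has little influence, so only `world-cup` is selected.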
8. A method of feature extraction, comprising:
obtaining the sample data by using the sample data acquisition method of any one of claims 1 to 7;
performing model training by using the positive sample data and the negative sample data to obtain a trained tree model;
and obtaining the user characteristics according to the trained tree model.
9. A processing apparatus, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the sample data acquisition method of any one of claims 1 to 7; or, the feature extraction method according to claim 8 is performed.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the sample data acquisition method according to any one of claims 1 to 7; or, the feature extraction method according to claim 8 is performed.
CN202010981752.6A 2020-09-17 2020-09-17 Sample data acquisition method, feature extraction method, processing device and storage medium Active CN112100432B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010981752.6A CN112100432B (en) 2020-09-17 2020-09-17 Sample data acquisition method, feature extraction method, processing device and storage medium

Publications (2)

Publication Number Publication Date
CN112100432A true CN112100432A (en) 2020-12-18
CN112100432B CN112100432B (en) 2024-04-09

Family

ID=73759545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010981752.6A Active CN112100432B (en) 2020-09-17 2020-09-17 Sample data acquisition method, feature extraction method, processing device and storage medium

Country Status (1)

Country Link
CN (1) CN112100432B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004090780A2 (en) * 2003-04-05 2004-10-21 Agilent Technologies, Inc. Determining the quality of biomolecule samples
CN104636406A (en) * 2013-11-15 2015-05-20 华为技术有限公司 Method and device for pushing information according to user behaviors
CN107992531A (en) * 2017-11-21 2018-05-04 吉浦斯信息咨询(深圳)有限公司 News personalization intelligent recommendation method and system based on deep learning
CN108804619A (en) * 2018-05-31 2018-11-13 腾讯科技(深圳)有限公司 Interest preference prediction technique, device, computer equipment and storage medium
CN111026908A (en) * 2019-12-10 2020-04-17 腾讯科技(深圳)有限公司 Song label determination method and device, computer equipment and storage medium
CN111666450A (en) * 2020-06-04 2020-09-15 北京奇艺世纪科技有限公司 Video recall method and device, electronic equipment and computer-readable storage medium

Also Published As

Publication number Publication date
CN112100432B (en) 2024-04-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant