CN111861550B - Family portrait construction method and system based on OTT equipment - Google Patents

Info

Publication number
CN111861550B
Authority
CN
China
Prior art keywords
identity
family
tags
data
tag
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010652016.6A
Other languages
Chinese (zh)
Other versions
CN111861550A (en
Inventor
叶凤
王翔
张玉新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Shijiu Information Technology Co ltd
Original Assignee
Shanghai Shijiu Information Technology Co ltd
Application filed by Shanghai Shijiu Information Technology Co ltd
Priority to CN202010652016.6A
Publication of CN111861550A
Application granted
Publication of CN111861550B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Development Economics (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Game Theory and Decision Science (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a family portrait construction method and system based on OTT equipment. The method comprises the following steps: establishing scenario description tags for videos by cluster analysis of the data on videos watched by television households; establishing an identity tag model based on preference characteristics required by preset rules, to obtain known identity tags and the feature data of the known identity tags; establishing, in a semi-supervised learning manner, a batch identity tag model of the family portrait from the known identity tags and their feature data together with the unknown identity tags and their feature data, and periodically training the batch identity tag model with a plurality of machine learning algorithms; and counting the identity tags of family members according to the results computed by the trained batch identity tag model, and marking the individual tags and combined tags of the family members. The invention adjusts the actual recommendation strategy based on the viewing preferences of non-primary members in the family portrait tag system, and markedly improves the recommendation effect for the portion of users that can be mined.

Description

Family portrait construction method and system based on OTT equipment
Technical Field
The invention relates to the technical field of intelligent content recommendation, in particular to a family portrait construction method and system based on OTT equipment.
Background
In recent years, internet television has developed rapidly. According to the 44th Statistical Report on Internet Development in China, as of June 2019 the number of online video users in China had reached 759 million, an increase of 32.65 million over 2018, accounting for 88.8% of all internet users, and 33.1% of internet users in China went online through a television. Watching online video on an internet television has become an integral part of life for a great many households. Internet television media have also changed markedly: media resources are shared, media content increasingly follows audience demand, and the ways content is gathered and edited have diversified. Internet television operating systems keep improving and offer more open application platforms. Television media are no longer limited to traditional editing models and are instead connected with multimedia resources such as mobile phones and computers. Despite competition from many other platforms, television remains an integral part of most households. Internet Protocol television (IPTV) makes the viewing experience more personalized and interactive than traditional terrestrial, satellite and cable television. Video content is richer, and users can choose content that matches their preferences. Watching television no longer means waiting in front of the screen at a fixed time; a user who does not find anything appealing can choose freely through video on demand. A user who misses the broadcast of a favorite video, or wants to watch a film again, can catch up through online video recommendation. Internet television is more flexible to operate, its content is richer and more varied, and it fits user experience preferences more and more closely.
The advent of internet television has made the viewing experience more personalized. On the search page, the user can search for a desired film by voice or text. When the user opens a film's detail page, the film's description and popular recommendations related to the current film are displayed. The user can quickly find video categories of interest on the main page. In the viewing history the user can find recently opened applications, and lower on the page can see recent popular films related to the user. Analyzing the viewing time of household users is critical to improving the user experience. Within a household, each member watches at different times and has different viewing preferences. At the household level, users have different viewing habits for traditional television and video on demand. The viewing habits of household users are analyzed separately for weekdays, weekends and holidays. From the viewing content and viewing habits, the number of people in the household watching at the same time and the household's viewing preferences in different time periods are summarized. Internet television groups users by observing their specific operating behavior and picks out key target groups. Studying the specific identity and preferences of the household's users in each time period and modeling the viewing habits of specific users can improve the viewing experience and increase user stickiness.
A recommendation algorithm that better matches the user experience is designed according to users' viewing preferences and the family identity portrait. In actual business, a "thousand people, thousand faces" video recommendation algorithm, which personalizes results for each user, is more reasonable than a "thousand people, ten faces" algorithm. Recommending films that the user likes is more user-friendly than recommending films that are merely popular, and improves the user experience.
Patent document CN107124653A (application number 2017103433272) discloses a method for constructing a television user portrait, the method comprising the following steps: step one, acquiring data of television terminal users through a data platform, and analyzing and classifying the data; step two, predefining television user portrait tags; step three, classifying the class B data to construct first-level television user portrait tags for the class B data; step four, classifying the class C data to construct first-level television user portrait tags for the class C data; step five, constructing second-level television user portrait tags for the class B and class C data; step six, merging and counting the first-level and second-level tags of each class of the television user portrait; step seven, analyzing the program type preference attribute data of television users to construct television user portrait tags; and step eight, updating the constructed television user portrait tags with the predefined television user portrait tags.
Patent document CN110430471A (application number 201910672136X) discloses a television recommendation method and system based on instantaneous computation, the method comprising: acquiring item content data and user behavior data, and constructing a content matrix and a user matrix; performing hierarchical text classification on the content matrix and establishing a knowledge graph; performing family portrait modeling on the user matrix; establishing a recommendation model based on the content matrix and the user matrix; making an initial program recommendation to the user according to the recommendation model, based on the current item content data; and receiving the user's behavior data in response to the recommendation result, performing instantaneous computation on the video recommendation model based on reinforcement learning, correcting the recommendation model, and updating the recommendation result.
Patent document CN108769809A (application number 2018105206969) discloses a method, a device and a computer-readable storage medium for collecting household user behavior data based on a smart TV. In that method, the user behavior data of the smart TV is collected together with the data of online devices on the home network where the smart TV is located, and the collected data is processed to generate household user behavior log data for use by a cloud server.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a family portrait construction method and system based on OTT equipment.
The invention provides a family portrait construction method based on OTT equipment, which comprises the following steps:
step M1: establishing a scenario description tag of the video by utilizing cluster analysis according to the data of the video watched by the television family;
step M2: establishing an identity tag model based on preference characteristics required by a preset rule to obtain known identity tags and characteristic data of the known identity features;
step M3: according to a semi-supervised learning mode, a batch identity tag model of the family portrait is built according to the known identity tag, the feature data of the known identity feature, the unknown identity tag and the feature data of the unknown identity tag, and learning training is carried out on the batch identity tag model of the family portrait periodically by utilizing a plurality of machine learning algorithms;
step M4: counting the identity tags of family members according to the calculated result of the batch identity tag model of the trained family portrait, and marking out individual tags and combined tags of the family members;
the identity tag model is a rule-based tag model formulated according to the interest preference, the viewing time preference, the regional preference and the year of production preference of family members and the actual business requirements; the identity label of the family member can be primarily judged through the identity label model;
And the batch identity tag model of the family portrait learns the identity tags of unknown family members in batches according to the behavior characteristic data of the known identity tags.
Preferably, the step M1 includes:
step M1.1: screening original family data according to preset requirements according to the data of video watching of television families, and extracting all scenario classification words meeting the user-defined standardized requirements from screened video watching records;
step M1.2: according to actual business requirements, the scenario classification words are split into a plurality of single words through a database segmentation technology and a K-means clustering algorithm, and scenario description tags of videos are built.
Preferably, the step M2 includes:
step M2.1: processing the collected original family data into a unified family member film-viewing record format through a database technology;
step M2.2: based on the processed family member viewing records, classifying users into preset classes according to age and gender, and mapping scenario description tags of videos onto each family member;
step M2.3: dividing the viewing time of the family members into preset sections according to actual demands, and establishing an identity tag model based on the mapping relation between the scenario description tags of the processed video and the family members and combining information comprising the divided viewing time, viewing areas, production years, actors and video large classifications to obtain identity tags of the known members and characteristic data of the known identity tags.
Preferably, the plurality of machine learning algorithms in the step M3 include: the K-nearest neighbor (KNN) algorithm, the logistic regression multi-classification algorithm, and/or the GBDT+LR algorithm.
Preferably, the step M3 includes:
step M3.1: the characteristic data of all known and unknown identity tags are standardized, K nearest neighbor algorithm calculation is carried out, and probability prediction of all family members is obtained;
step M3.2: feature data of all known and unknown identity tags are standardized, and all the standardized mixed identity feature data are put into a logistic regression multi-classification algorithm to carry out identity prediction;
step M3.3: standardizing the feature data of all known and unknown identity tags, splitting all the standardized mixed identity feature data into preset groups, performing GBDT feature extraction on each group, and then performing LR (logistic regression) binary-classification identity prediction.
Preferably, the step M4 includes: counting the identity tags of family members by comparing and analyzing the results computed by the multiple machine learning algorithms, namely the K-nearest neighbor algorithm, the logistic regression multi-classification algorithm and the GBDT+LR binary-classification algorithm, and marking the individual tags and combined tags of the family members.
According to the present invention, a family portrait construction system based on OTT equipment includes:
Module M1: establishing a scenario description tag of the video by utilizing cluster analysis according to the data of the video watched by the television family;
module M2: establishing an identity tag model based on preference characteristics required by a preset rule to obtain known identity tags and characteristic data of the known identity features;
module M3: according to a semi-supervised learning mode, a batch identity tag model of the family portrait is built according to the known identity tag, the feature data of the known identity feature, the unknown identity tag and the feature data of the unknown identity tag, and learning training is carried out on the batch identity tag model of the family portrait periodically by utilizing a plurality of machine learning algorithms;
module M4: counting the identity tags of family members according to the calculated result of the batch identity tag model of the trained family portrait, and marking out individual tags and combined tags of the family members;
the identity tag model is a rule-based tag model formulated according to the interest preference, the viewing time preference, the regional preference and the year of production preference of family members and the actual business requirements; the identity label of the family member can be primarily judged through the identity label model;
and the batch identity tag model of the family portrait learns the identity tags of unknown family members in batches according to the behavior characteristic data of the known identity tags.
Preferably, the module M1 comprises:
module M1.1: screening original family data according to preset requirements according to the data of video watching of television families, and extracting all scenario classification words meeting the user-defined standardized requirements from screened video watching records;
module M1.2: according to actual business requirements, the scenario classification words are split into a plurality of single words through a database segmentation technology and a K-means clustering algorithm, and scenario description tags of videos are built.
Preferably, the module M2 comprises:
module M2.1: processing the collected original family data into a unified family member film-viewing record format through a database technology;
module M2.2: based on the processed family member viewing records, classifying users into preset classes according to age and gender, and mapping scenario description tags of videos onto each family member;
module M2.3: dividing the viewing time of the family members into preset sections according to actual demands, and establishing an identity tag model based on the mapping relation between the scenario description tags of the processed video and the family members and combining information comprising the divided viewing time, viewing areas, production years, actors and video large classifications to obtain identity tags of the known members and characteristic data of the known identity tags.
Preferably, the plurality of machine learning algorithms in the module M3 include: k neighbor algorithm, logistic regression multi-classification algorithm and/or GBDT+LR algorithm;
the module M3 includes:
module M3.1: the characteristic data of all known and unknown identity tags are standardized, K nearest neighbor algorithm calculation is carried out, and probability prediction of all family members is obtained;
module M3.2: feature data of all known and unknown identity tags are standardized, and all the standardized mixed identity feature data are put into a logistic regression multi-classification algorithm to carry out identity prediction;
module M3.3: the feature data of all known and unknown identity tags are standardized, all the standardized mixed identity feature data are split into preset groups, GBDT feature extraction is carried out respectively, and then LR logistic regression two-class identity prediction is carried out;
the module M4 includes: counting the identity tags of family members by comparing and analyzing the results computed by the multiple machine learning algorithms, namely the K-nearest neighbor algorithm, the logistic regression multi-classification algorithm and the GBDT+LR binary-classification algorithm, and marking the individual tags and combined tags of the family members.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention combines three machine learning algorithms, namely KNN, logistic regression multi-classification and GBDT+LR, with a clustering algorithm to construct the family portrait identity tag model FIT (family_identity_tag). The FIT model addresses the low efficiency and poor accuracy of using the KNN algorithm alone and the overfitting of the logistic regression multi-classification algorithm, and handles the multi-classification problem with an iterative binary-classification GBDT+LR algorithm;
2. viewing-time analysis for each age group not only improves the efficiency of the recommendation system but also increases the accuracy of family portrait construction;
3. adjusting the actual recommendation strategy based on the viewing preferences of non-primary members in the family portrait tag system markedly improves the recommendation effect for that portion of users.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is a schematic diagram of family membership partitioning;
FIG. 2 is a schematic view of the scatter concentration of viewing interests;
FIG. 3 is a schematic view of a view time pattern division;
FIG. 4 is a flow chart of an identity tag for a family member;
FIG. 5 is a schematic diagram of the GBDT+LR machine learning algorithm;
FIG. 6 is a word cloud of the original scenario categories of the video library;
FIG. 7 is a representation of the scenario categories of the video library after vocabulary splitting;
FIG. 8 is a diagram of a family portrait tag hierarchy modeling framework.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
The invention takes the scenario field type of the video as a starting point, and maps the processed type field to family members through a K-means clustering algorithm and a database segmentation technology. Based on the multi-dimensional mapping relation, combining a plurality of machine learning algorithms, and generating family member identity labels in batches.
A television is often shared by a household, which typically consists of one or more members. Determining which family members are currently watching and making accurate recommendations to them is a major difficulty in the internet television industry. The family portrait tag system constructed herein addresses this problem.
Example 1
The invention provides a family portrait construction method based on OTT equipment, which comprises the following steps:
step M1: establishing a scenario description tag of the video by utilizing cluster analysis according to the data of the video watched by the television family;
step M2: establishing an identity tag model based on preference characteristics required by a preset rule to obtain known identity tags and characteristic data of the known identity features;
step M3: according to a semi-supervised learning mode, a batch identity tag model of the family portrait is built according to the known identity tag, the feature data of the known identity feature, the unknown identity tag and the feature data of the unknown identity tag, and learning training is carried out on the batch identity tag model of the family portrait periodically by utilizing a plurality of machine learning algorithms;
Step M4: counting the identity tags of family members according to the calculated result of the batch identity tag model of the trained family portrait, and marking out individual tags and combined tags of the family members;
the identity tag model is a rule-based tag model formulated according to the interest preference, the viewing time preference, the regional preference and the year of production preference of family members and the actual business requirements; the identity label of the family member can be primarily judged through the identity label model;
the batch identity tag model of the family portrait computes the viewing time of family members under the scenario tag dimensions of all videos, combines the known and unknown identity tags into a tag column alongside standardized feature columns, and predicts family member identities with the KNN, logistic regression multi-classification and iterative GBDT+LR machine learning models.
The batch identity tag model of the family portrait can learn the identity tags of unknown family members in batches from the behavior feature data of the known identity tags. The OTT equipment includes smart TVs and OTT boxes, together with all videos and applications on those devices.
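As a non-limiting illustration, the construction of the tag column and the standardized feature columns might be sketched as follows. The log layout, the tag names, the member identifiers and the use of z-score standardization are all assumptions made for this example; the description does not fix any of them.

```python
from statistics import mean, pstdev

TAGS = ["cartoon", "romance", "war"]  # hypothetical scenario description tags

# Hypothetical viewing log rows: (member_id, scenario tag, minutes watched).
log = [
    ("m1", "cartoon", 120), ("m1", "romance", 10),
    ("m2", "romance", 90), ("m2", "war", 30),
    ("m3", "war", 150), ("m3", "cartoon", 15),
]
# Identity tags from the rule model; None marks a member whose tag is unknown.
known_tags = {"m1": "child", "m2": "young_woman", "m3": None}


def build_feature_rows(log, tags):
    """Sum viewing minutes per member under each scenario tag dimension."""
    totals = {}
    for member, tag, minutes in log:
        row = totals.setdefault(member, dict.fromkeys(tags, 0.0))
        row[tag] += minutes
    return {m: [row[t] for t in tags] for m, row in totals.items()}


def standardize(rows):
    """Z-score each feature column; a constant column is mapped to zeros."""
    members = list(rows)
    columns = list(zip(*(rows[m] for m in members)))
    scaled_cols = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        scaled_cols.append([(v - mu) / sigma if sigma else 0.0 for v in col])
    return {m: [col[i] for col in scaled_cols] for i, m in enumerate(members)}


features = standardize(build_feature_rows(log, TAGS))
label_column = [known_tags[m] for m in features]  # known and unknown tags together
print(features, label_column)
```

Known and unknown members are standardized together here, so that both sets share one feature scale before any model is applied.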
Specifically, the step M1 includes:
step M1.1: screening original family data according to preset requirements according to the data of video watching of television families, and extracting all scenario classification words meeting the user-defined standardized requirements from screened video watching records;
To judge the members of a household accurately, the collected viewing records are first screened; the purpose of the screening is to select typical household compositions;
first, combining actual business requirements with expert opinion, films whose total viewing time is less than 10 minutes are filtered out. A viewing time that is too short cannot indicate the user's viewing interest: while watching television, users skim or accidentally open many films, which leaves a very short total viewing time. To keep such operations from distorting the viewing interests of the family members, only videos viewed for longer than 10 minutes are selected.
The scenario classification of a video consists of several descriptors separated by slashes, such as drama/adventure/comedy/family/funny/romance/urban/rural. We split the content descriptors of each film, remove the slashes, and represent each film in the film library by multiple single descriptors. Because the film library data is complex, only the fields that help build the household user portrait are extracted, namely the scenario classification words that meet the custom rules and basic information such as actors, regions, years and categories, as shown in FIGS. 6 and 7.
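As a non-limiting illustration, the screening and splitting described above might be sketched as follows. The record fields `minutes` and `genres` are hypothetical names chosen for this example, and keeping records of exactly 10 minutes is an assumption about the boundary case.

```python
from collections import Counter

MIN_MINUTES = 10  # threshold from the description; keeping exactly 10 is an assumption


def screen_records(records):
    """Keep only viewing records whose total viewing time reaches the threshold."""
    return [r for r in records if r["minutes"] >= MIN_MINUTES]


def split_descriptors(genres):
    """Split a slash-separated scenario classification into single descriptors."""
    return [word.strip() for word in genres.split("/") if word.strip()]


# Hypothetical viewing records, for illustration only.
records = [
    {"title": "A", "minutes": 95, "genres": "drama/adventure/comedy"},
    {"title": "B", "minutes": 3, "genres": "romance/urban"},
    {"title": "C", "minutes": 42, "genres": "comedy/family/ romance"},
]

kept = screen_records(records)
descriptor_counts = Counter(
    word for r in kept for word in split_descriptors(r["genres"])
)
print(descriptor_counts.most_common())
```

Counting the single descriptors after the split gives the kind of frequency data from which a word cloud such as the one in FIG. 7 could be drawn.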
Step M1.2: according to actual business requirements, the scenario classification words are split into a plurality of single words through a database segmentation technology and a K-means clustering algorithm, and scenario description tags of videos are built.
From the data shown, we find that after the scenario classifications of the videos are split into single descriptors, the descriptors need to be clustered to find the most representative ones.
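As a non-limiting illustration, the clustering of single descriptors might be sketched with a plain K-means (Lloyd's algorithm) as follows. The description does not state which features are clustered; representing each descriptor by the films in which it appears, the farthest-point seeding and the choice of k = 2 are assumptions made only for this example.

```python
import math


def farthest_point_seeds(points, k):
    """Deterministic seeding: start at the first point, then repeatedly take
    the point farthest from the seeds chosen so far (an illustrative choice)."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return [list(s) for s in seeds]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = farthest_point_seeds(points, k)
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical film library: each film is a list of single descriptors.
films = [
    ["comedy", "family"],
    ["comedy", "family", "funny"],
    ["war", "history"],
    ["war", "history", "spy"],
]
vocabulary = sorted({w for film in films for w in film})
# Each descriptor is represented by the films it appears in (assumed features).
vectors = [[1.0 if w in film else 0.0 for film in films] for w in vocabulary]

assignment, centroids = kmeans(vectors, k=2)
clusters = {}
for word, cluster_id in zip(vocabulary, assignment):
    clusters.setdefault(cluster_id, []).append(word)

# The descriptor nearest each centroid serves as the cluster's representative.
representatives = {
    c: min(words, key=lambda w: math.dist(vectors[vocabulary.index(w)], centroids[c]))
    for c, words in clusters.items()
}
print(clusters, representatives)
```

The representative of each cluster is one candidate for a scenario description tag; in practice the final tags would also reflect business requirements, as the text states.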
Specifically, the step M2 includes:
step M2.1: processing the collected original family data into a unified family member film-viewing record format through a database technology;
step M2.2: based on the processed family member viewing records, classifying users into preset classes according to age and gender, and mapping scenario description tags of videos onto each family member;
according to actual business requirements, as shown in FIG. 1, family members are divided into 13 categories by age and gender; by clustering the scenario categories, analyzing the videos watched by each category of family member and incorporating the opinions of the expert group, the main video preference characteristics of each category of family member are compiled.
step M2.3: dividing the viewing time of the family members into preset sections according to actual demands, as shown in FIG. 3, and establishing an identity tag model based on the mapping between the processed scenario description tags of the videos and the family members, combined with information including the divided viewing time, viewing region, year of production, actors and broad video category, to obtain the identity tags of the known members and the feature data of the known identity tags.
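As a non-limiting illustration, a rule-based identity tag model of this kind might be sketched as follows. The time sections, identity names and rules shown are hypothetical; the actual sections (FIG. 3) and rules are set according to business requirements and expert opinion and are not reproduced here. Only two of the listed dimensions (scenario tags and viewing time) are used, for brevity.

```python
# Hypothetical time sections (start hour inclusive, end hour exclusive).
TIME_SECTIONS = [
    ("morning", 6, 12),
    ("afternoon", 12, 18),
    ("evening", 18, 23),
]


def time_section(hour):
    """Map an hour of day (0-23) to a preset viewing-time section."""
    for name, start, end in TIME_SECTIONS:
        if start <= hour < end:
            return name
    return "late_night"  # hours outside the listed sections (23:00 to 06:00)


# Hypothetical identity rules: (identity tag, scenario tags, allowed sections).
RULES = [
    ("child", {"cartoon", "early_education"}, {"morning", "afternoon", "evening"}),
    ("elderly", {"opera", "health"}, {"morning", "afternoon"}),
]


def rule_identity(scenario_tags, hour):
    """Return the first identity tag whose rule matches, or None if unknown."""
    section = time_section(hour)
    for identity, tags, sections in RULES:
        if section in sections and tags & set(scenario_tags):
            return identity
    return None


print(rule_identity(["cartoon"], 17))  # matches the hypothetical "child" rule
print(rule_identity(["opera"], 20))    # no rule matches, so the tag stays unknown
```

Members for whom no rule fires keep an unknown identity tag, which is the population that the semi-supervised batch model is later asked to label.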
First, the film library table, the log table and the manual tag table are synchronized. The film library table is crawled periodically every day using the Python programming language; the log table is synchronized from the logs of user-triggered tracking events; the manual tag table is synchronized from manual labeling, in which user viewing data is inspected by hand and identity tags are assigned manually. Because the manually assigned tags are somewhat inconsistent, they are given only simple processing and analysis. Based on the film library and viewing log data, the basic information of users' viewing records is processed with database technology, and the scenario descriptors are then analyzed by segmentation and clustering. Combining the viewing interests exhibited by family members, identity rules are specified from the experience of experts and industry practitioners, actual business conditions and the cluster analysis results, and identity tags that conform to the specified rules are assigned automatically every day. Based on the existing identity tags and manual tags, feature sets such as social attributes, family occupancy, interest preferences and time preferences are computed with the KNN machine learning algorithm. The identity tags of unknown households are learned, and the learning process is iterated continuously so that the results become more accurate, as shown in FIG. 8.
Specifically, the plurality of machine learning algorithms in the step M3 include: the K-nearest neighbor (KNN) algorithm, the logistic regression multi-classification algorithm, and/or the GBDT+LR algorithm.
Based on the scenario clustering results and the rule model specified by the expert group, a certain volume of family portrait identity tags is generated from the viewing data. These rule-generated family identity tags and the manually assigned identity tags serve as the base training data, and the remaining identity tags are learned by semi-supervised machine learning algorithms.
Specifically, the step M3 includes:
step M3.1: the characteristic data of all known and unknown identity tags are standardized, K nearest neighbor algorithm calculation is carried out, and probability prediction of all family members is obtained;
step M3.2: feature data of all known and unknown identity tags are standardized, and all the standardized mixed identity feature data are put into a logistic regression multi-classification algorithm to carry out identity prediction;
step M3.3: standardizing the feature data of all known and unknown identity tags, splitting all the standardized mixed identity feature data into preset groups, performing GBDT feature extraction on each group, and then performing LR (logistic regression) binary-classification identity prediction.
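As a non-limiting illustration, step M3.1 might be sketched as follows: known and unknown rows are standardized together, and each unknown member receives, for every identity tag, the share of that tag among its k nearest labeled neighbours. The feature values, the identity names, the Euclidean distance and k = 3 are assumptions made for this example.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def standardize(rows):
    """Z-score every feature column across known and unknown members together."""
    columns = list(zip(*rows))
    stats = [(mean(c), pstdev(c)) for c in columns]
    return [
        [(v - mu) / sigma if sigma else 0.0 for v, (mu, sigma) in zip(row, stats)]
        for row in rows
    ]


def knn_probabilities(known, labels, query, k=3):
    """Share of each identity tag among the k nearest labeled neighbours."""
    nearest = sorted(range(len(known)), key=lambda i: math.dist(known[i], query))[:k]
    votes = Counter(labels[i] for i in nearest)
    return {tag: count / k for tag, count in votes.items()}


# Hypothetical features: minutes under (cartoon, romance, war) tag dimensions.
rows = [
    [120, 5, 0], [110, 10, 5], [100, 0, 10],  # known identity tag: child
    [5, 90, 10], [0, 100, 20], [10, 80, 5],   # known identity tag: young_woman
    [105, 8, 2],                              # unknown identity tag
]
labels = ["child"] * 3 + ["young_woman"] * 3 + [None]

scaled = standardize(rows)
known = [r for r, y in zip(scaled, labels) if y is not None]
known_labels = [y for y in labels if y is not None]
unknown = [r for r, y in zip(scaled, labels) if y is None]

for row in unknown:
    print(knn_probabilities(known, known_labels, row))
```

Steps M3.2 and M3.3 would consume the same standardized rows; they are not sketched here because a faithful GBDT+LR example depends on a gradient boosting library that the description does not name.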
Specifically, the step M4 includes: and counting the identity tags of family members by comparing and analyzing the calculation results of a K neighbor algorithm, a logistic regression multi-classification algorithm and a GBDT+LR two-class multi-class machine learning algorithm, and marking out individual tags and combined tags of the family members.
According to the present invention, a family portrait construction system based on OTT equipment includes:
module M1: establishing a scenario description tag of the video by utilizing cluster analysis according to the data of the video watched by the television family;
module M2: establishing an identity tag model based on preference characteristics required by a preset rule to obtain known identity tags and characteristic data of the known identity features;
module M3: according to a semi-supervised learning mode, a batch identity tag model of the family portrait is built according to the known identity tag, the feature data of the known identity feature, the unknown identity tag and the feature data of the unknown identity tag, and learning training is carried out on the batch identity tag model of the family portrait periodically by utilizing a plurality of machine learning algorithms;
module M4: counting the identity tags of family members according to the calculated result of the batch identity tag model of the trained family portrait, and marking out individual tags and combined tags of the family members;
the identity tag model is a rule-based tag model formulated according to the interest preference, the viewing time preference, the regional preference and the year of production preference of family members and the actual business requirements; the identity label of the family member can be primarily judged through the identity label model;
The batch identity tag model of the family portrait calculates the viewing time of family members along the scenario tag type dimension of all videos, and combines the known and unknown identity tags into a tag column and standardized feature columns, on which the KNN, logistic regression multi-classification, and iterative GBDT+LR machine learning models predict family member identities.
The batch identity tag model of the family portrait can learn the identity tags of unknown family members in batches according to the behavior feature data of the known identity tags. OTT equipment includes smart TVs and OTT boxes, as well as all videos and applications on these devices.
Specifically, the module M1 includes:
module M1.1: screening original family data according to preset requirements according to the data of video watching of television families, and extracting all scenario classification words meeting the user-defined standardized requirements from screened video watching records;
to determine family members accurately, the collected viewing records are first screened, with the aim of selecting typical family compositions;
first, in line with actual business requirements and expert opinion, films with a total viewing time of less than 10 minutes are filtered out. Such a short viewing time cannot reveal the user's viewing interest: when watching television, users skim or accidentally open many films, which produces very short total viewing times. To keep this kind of operation from distorting the viewing interests attributed to family members, only videos with a viewing time longer than 10 minutes are selected.
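The screening rule above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the record layout `(family_id, film_id, seconds)` and the function name are assumptions, and because the text does not say how a total of exactly 10 minutes is handled, the sketch keeps it.

```python
from collections import defaultdict

MIN_TOTAL_SECONDS = 10 * 60  # the 10-minute threshold stated in the text


def filter_short_views(records):
    """Keep records whose (family, film) total viewing time reaches the threshold.

    `records` is a list of (family_id, film_id, seconds) tuples; this layout
    is hypothetical and is not taken from the patent.
    """
    totals = defaultdict(int)
    for family_id, film_id, seconds in records:
        totals[(family_id, film_id)] += seconds
    return [r for r in records if totals[(r[0], r[1])] >= MIN_TOTAL_SECONDS]


records = [
    ("f1", "film_a", 400),
    ("f1", "film_a", 300),  # film_a total is 700 s, so both rows are kept
    ("f1", "film_b", 120),  # film_b total is 120 s, so this row is dropped
]
kept = filter_short_views(records)
```

Summing per family and film before filtering means that several short sessions of the same film still count toward the threshold, which is one plausible reading of "total viewing time".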
The scenario classification of a video consists of several descriptive words separated by slashes, such as scenario/adventure/comedy/family/fun/love/city/country. The content descriptors of each film are split on the slashes, so that each film in the film library is represented by multiple single descriptive words. Because the film library data are complex, only the fields that help build the family user portrait are extracted, namely the scenario classification words that meet the custom rules and basic information such as actors, regions, years, and classifications, as shown in FIGS. 6-7.
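The splitting of the slash-separated classification can be illustrated as follows. The function name is hypothetical, and stripping whitespace and skipping empty fragments are assumptions about how messy input might be handled.

```python
def split_scenario_words(classification):
    """Split a slash-separated scenario classification into single words.

    Empty fragments (for example from a trailing slash) are skipped.
    """
    return [word.strip() for word in classification.split("/") if word.strip()]


words = split_scenario_words(
    "scenario/adventure/comedy/family/fun/love/city/country"
)
```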
Module M1.2: according to actual business requirements, the scenario classification words are split into a plurality of single words through a database segmentation technology and a K-means clustering algorithm, and scenario description tags of videos are built.
The data show that once the scenario classification of a video has been split into single descriptive words, these words need to be clustered to find the most representative ones.
Specifically, the module M2 includes:
module M2.1: processing the collected original family data into a unified family member film-viewing record format through a database technology;
module M2.2: based on the processed family member viewing records, classifying users into preset classes according to age and gender, and mapping scenario description tags of videos onto each family member;
According to actual business requirements, family members are classified into 13 categories by age and gender, as shown in FIG. 1. The main video preference features of each family member are then determined by clustering the scenario categories, analyzing the videos each family member watches, and taking the expert group's opinions into account.
Module M2.3: dividing the viewing time of the family members into preset sections according to actual demands, and based on the mapping relation between the scenario description labels of the processed video and the family members, combining information comprising the divided viewing time, as shown in fig. 3, the viewing areas, the year of production, actors and the large classification of the video, and establishing an identity label model to obtain the identity labels of the known members and the characteristic data of the known identity labels.
First, a film library table, a log table, and a manual tag table are synchronized. The film library table is crawled daily by a Python program; the log table is synchronized from the logs of user-triggered buried-point (event tracking) events; and the manual tag table is synchronized from the manual tags, which are identity tags assigned by hand after the users' portrait data have been inspected manually. Because the manually assigned tags are somewhat inconsistent, they are first processed and analyzed. Based on the film library and viewing log data, the basic information of the users' viewing records is processed with database technology, and the scenario description words are then analyzed by segmentation and clustering. Taking the viewing interests of family members into account, experts and industry workers formulate identity rules from their experience, the actual business conditions, and the clustering results, and identity tags that satisfy these rules are applied automatically every day. Based on the existing identity tags and the manual tags, a KNN machine learning algorithm is applied to a feature set (for example, social attributes, family occupancy, interest preferences, and time preferences) to learn the identity tags of unknown families. The learning process is iterated continuously so that the results become more accurate, as shown in FIG. 8.
Specifically, the plurality of machine learning algorithms in the module M3 include: a K-nearest neighbor algorithm, a logistic regression multi-classification algorithm, and/or a GBDT+LR algorithm.
Based on the scenario clustering results and the rule model specified by the expert group, a certain volume of family portrait identity tags is generated from the viewing data. These rule-based family identity tags, together with the manually marked identity tags, serve as the base learning data, and the remaining identity tags are learned by a semi-supervised machine learning algorithm.
Specifically, the module M3 includes:
module M3.1: the feature data of all known and unknown identity tags are standardized, and the K-nearest neighbor algorithm is applied to obtain probability predictions for all family members;
module M3.2: the feature data of all known and unknown identity tags are standardized, and all the standardized mixed identity feature data are fed into a logistic regression multi-classification algorithm for identity prediction;
module M3.3: the feature data of all known and unknown identity tags are standardized, all the standardized mixed identity feature data are split into preset groups, GBDT feature extraction is performed on each group, and LR (logistic regression) binary-classification identity prediction is then performed.
Specifically, the module M4 includes: counting the identity tags of family members by comparing and analyzing the results of the multiple machine learning algorithms, namely the K-nearest neighbor algorithm, the logistic regression multi-classification algorithm, and the GBDT+LR binary-classification algorithm, and marking out the individual tags and combined tags of the family members.
Example 2
Example 2 is a modification of example 1
The invention provides a family portrait construction method based on OTT equipment, which comprises the following steps:
step 1: a data requirement document (DRD) is designed for the data on videos watched by television families, and buried points (event tracking points) are added to the set-top box client. When a user triggers a buried-point event, the front end collects the data and sends them through an API to the Alibaba Cloud Log Service (SLS) platform. The original family data logs collected by the SLS platform are then synchronized to MaxCompute, the Alibaba Cloud big data platform, for analysis and processing.
Step 2: the collected original family data logs are processed into a unified family member viewing record format through database technology.
User viewing record basic information
Step 3: based on the processed family member viewing records, users are classified into 13 categories according to age and gender, as shown in FIG. 1. The messy original video scenario type field is split into single descriptive words by a database segmentation technique and the K-means clustering method, and the words are mapped to each family member, as shown in FIG. 2.
Step 4: the viewing time of the family members is divided into 8 segments according to actual requirements, as shown in FIG. 3. Based on the mapping between the processed type field and the family members, combined with the segmented viewing time, the viewing region, the year of production, the actors, and the major video category, family member identity tag rules are formulated, and a first portion of the family member identity tags (family_identity_tag) is marked.
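As an illustration of segmenting viewing time, the sketch below maps a timestamp to one of 8 segments. The actual segment boundaries are defined in FIG. 3 and are not reproduced in the text, so the equal 3-hour buckets used here are purely an assumption, as is the function name.

```python
from datetime import datetime

SEGMENTS = 8                       # number of segments stated in Step 4
HOURS_PER_SEGMENT = 24 // SEGMENTS  # assumed equal-width buckets of 3 hours


def viewing_segment(timestamp):
    """Return the 0-based index of the viewing-time segment for a timestamp."""
    return timestamp.hour // HOURS_PER_SEGMENT


evening = viewing_segment(datetime(2020, 7, 8, 20, 30))
```

With uneven boundaries (for example a longer overnight segment), the integer division would be replaced by a lookup against a sorted list of boundary hours.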
Step 5: based on the partial identity tags of known family members and the feature data of the family members, the family member identity tags are learned in batches using the K-nearest neighbor algorithm together with the GBDT+LR (Gradient Boosting Decision Tree + Logistic Regression) algorithm.
The feature data of all known and unknown identity tags are standardized, and the KNN (k-nearest neighbor) identity learning algorithm is first applied to obtain probability predictions for all possible family members. In parallel, all the standardized mixed identity feature data are fed into a logistic regression multi-classification algorithm for identity prediction. Also in parallel, all the standardized mixed identity feature data are split into 8 groups, GBDT feature extraction is performed on each group, and logistic regression binary-classification identity prediction is then performed, as shown in FIG. 5. Finally, the results of all the machine learning models are evaluated.
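The standardization and KNN probability prediction described above can be sketched in pure Python. This is an illustrative sketch under stated assumptions: the patent does not give its standardization formula, so min-max scaling into [-1, 1] is assumed; the feature columns, tag strings, and value of k are hypothetical; and the "probability" is simply the share of the k nearest known families that carry each tag.

```python
import math
from collections import Counter


def scale_columns(rows):
    """Min-max scale each feature column into [-1, 1] (assumed formula)."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            0.0 if hi == lo else 2.0 * (value - lo) / (hi - lo) - 1.0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


def knn_probabilities(known, labels, query, k=3):
    """Return {tag: share of the k nearest known rows that carry this tag}."""
    order = sorted(range(len(known)), key=lambda i: math.dist(known[i], query))
    votes = Counter(labels[i] for i in order[:k])
    return {tag: count / k for tag, count in votes.items()}


# Hypothetical features per family: [hours of cartoons, hours of news].
features = [[10, 0], [8, 1], [9, 2], [0, 9], [1, 10], [7, 1]]
labels = ["child", "child", "child", "elderly", "elderly"]  # last row is unknown
scaled = scale_columns(features)  # known and unknown rows are scaled together
probs = knn_probabilities(scaled[:5], labels, scaled[5], k=3)
```

Scaling known and unknown rows in one pass mirrors the "mixed identity feature data" in the text, so that both share the same value range before distances are computed.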
Step 6: the identity tags of family members are counted by comparing and analyzing the results of the multiple machine learning algorithms (K-nearest neighbor (KNN), logistic regression multi-classification, and GBDT+LR binary classification), and the individual tags and the combined tag (the combination of family members) are marked.
Example 3
Example 3 is a modification of example 2 and/or example 1
The invention provides a family portrait construction method based on OTT equipment, which, as shown in FIG. 4, comprises the following steps:
synchronizing tags: synchronizing the identity tags marked by experienced workers in the television industry;
synchronizing logs: synchronizing the original user logs uploaded to the Alibaba Cloud SLS platform when users trigger buried-point events;
synchronizing the film library: synchronizing the self-built film library, which is crawled and processed with Python;
tag processing and analysis: processing the data format of the identity tags marked by experienced workers in the television industry so that one family corresponds to multiple identity tags, joining information such as scenario, major category, region, and year, and calculating feature data such as viewing duration and number of viewings;
statistical analysis of viewing records: processing the viewing records into a unified family member viewing record format through database technology;
scenario type cluster analysis: performing cluster analysis on the scenario categories;
identity tag rule formulation: formulating a set of identity tag rules from the analysis results of the type field and the mapping between the type field and family members (as shown in FIG. 2), drawing on the experience of experts and industry workers;
family member identity tags: marking a portion of the family identity tags according to the rules formulated at the previous node;
identity feature statistics: based on the basic data from the statistical analysis of viewing records, counting each user's viewing duration, number of viewings, and the like for each single-word type field;
known identity features: combining the feature data of the manual identity tags marked by experienced industry workers and the feature data of the known identity tags marked according to the custom rules into one table;
unknown identity features: the data that carry no identity tag.
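The identity feature statistics and the split into known and unknown identity features listed above can be sketched as follows. The record layout `(family_id, scenario_word, seconds)`, the function names, and the example tag are assumptions for illustration; the patent performs these steps with database technology rather than Python.

```python
from collections import defaultdict


def identity_feature_stats(records):
    """Aggregate viewing seconds and viewing counts per (family, scenario word).

    `records` holds (family_id, scenario_word, seconds) tuples, a hypothetical
    layout for the unified viewing-record format.
    """
    stats = defaultdict(lambda: {"seconds": 0, "views": 0})
    for family_id, word, seconds in records:
        entry = stats[(family_id, word)]
        entry["seconds"] += seconds
        entry["views"] += 1
    return dict(stats)


def split_known_unknown(family_ids, tags):
    """Split families into those that already have an identity tag and the rest."""
    known = [f for f in family_ids if f in tags]
    unknown = [f for f in family_ids if f not in tags]
    return known, unknown


records = [("f1", "comedy", 1200), ("f1", "comedy", 900), ("f2", "love", 1500)]
stats = identity_feature_stats(records)
known, unknown = split_known_unknown(["f1", "f2"], {"f1": "middle-aged/female"})
```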
Standardizing known identity feature data: all feature data columns are unbounded numbers, and a standardization formula maps them into the small fixed interval [-1, 1];
KNN identity learning: the processed data are fed into a preset K-nearest neighbor node for calculation;
KNN multi-classification evaluation: all processed data are fed into the K-nearest neighbor computing node, and a multi-classification evaluation is then computed to produce an evaluation report, which is used to assess the accuracy of the K-nearest neighbor model;
Refining KNN data: the experimental results cannot be put into use directly, so they are refined with SQL statements: identity tags whose predicted values are below the average are removed, the prediction results are split, and the identity tags and predicted values are written to each family ID;
standardized mixed identity feature data: the feature data of known identities and of unknown identities are combined into one table, the identity tag column of families without an identity tag is set to zero, and all remaining numeric feature columns are standardized so that they fall into a specific interval;
logistic regression multiple classification: the standardized mixed identity feature data are fed into a logistic regression multi-classification model node for calculation. The calculation flow is shown in FIG. 4;
refining the logistic regression multi-classification data: the logistic regression multi-classification results cannot be applied to the business directly. The result data are first screened, and identity tags below the average predicted value are removed. Each possible identity tag in a family is then split out using database processing techniques and inserted into the identity tag column of the corresponding family.
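The refinement rule used for both the KNN and the logistic regression results (drop identity tags whose predicted value is below the average) can be sketched as below. The text does not say whether the average is taken per family or over all predictions; this sketch assumes a per-family mean, and the input layout and tag strings are hypothetical.

```python
def refine_predictions(predictions):
    """Keep, for each family, the identity tags scoring at or above that family's mean.

    `predictions` maps family_id -> {identity_tag: predicted value}.
    The per-family mean is an assumption; the source only says "average".
    """
    refined = {}
    for family_id, scores in predictions.items():
        if not scores:
            refined[family_id] = {}
            continue
        mean = sum(scores.values()) / len(scores)
        refined[family_id] = {
            tag: score for tag, score in scores.items() if score >= mean
        }
    return refined


refined = refine_predictions(
    {"f1": {"child/female": 0.6, "middle-aged/male": 0.3, "elderly/female": 0.1}}
)
```

In the example the mean is about 0.33, so only the tag with a predicted value of 0.6 survives for family `f1`.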
Identity splitting: the "standardized mixed identity feature data" contain all identity tags, including child/female, child/male, young/female, and so on. The data of each user group are placed in a separate table, for example one table for child/female and one for middle-aged/female, as shown in FIG. 5.
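One way to read the identity splitting step is as a one-vs-rest preparation for the binary classifiers that follow. The sketch below builds one binary-labeled table per identity tag; this interpretation, the row layout, the tag strings, and the choice to leave untagged families (tag 0) out of the training tables are all assumptions rather than details given in the text.

```python
def split_by_identity(rows, identities):
    """Build one binary-labeled table per identity tag (one-vs-rest).

    `rows` is a list of (features, identity_tag) pairs; a tag of 0 marks a
    family without an identity tag, which is excluded from these tables.
    """
    tables = {}
    for identity in identities:
        tables[identity] = [
            (features, 1 if tag == identity else 0)
            for features, tag in rows
            if tag != 0
        ]
    return tables


rows = [([1.0], "child/female"), ([0.2], "middle-aged/female"), ([0.5], 0)]
tables = split_by_identity(rows, ["child/female", "middle-aged/female"])
```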
GBDT+LR machine learning algorithm: the output of the GBDT model is fed, as features, into the LR model, which acts as a binary classifier. Each split identity data table is input into the GBDT+LR model separately; the specific implementation flow is shown in FIG. 6.
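The text does not spell out how the GBDT output becomes LR input. A common formulation, given here only as an assumed illustration, encodes for each sample the leaf it reaches in every tree and feeds the concatenated one-hot vectors into logistic regression. The symbols $T$, $L_t$, $\ell_t$, $w$, and $b$ are introduced for this sketch and do not appear in the source.

```latex
% GBDT with T trees; tree t has L_t leaves, and sample x lands in leaf \ell_t(x).
% e_j^{(t)} is the one-hot vector of length L_t with a 1 in position j.
\phi(x) = \bigl[\, e^{(1)}_{\ell_1(x)} ;\; e^{(2)}_{\ell_2(x)} ;\; \dots ;\; e^{(T)}_{\ell_T(x)} \,\bigr]
          \in \{0,1\}^{\sum_{t=1}^{T} L_t}

% Logistic regression on the GBDT-derived features gives the binary prediction
% for one identity group (for example, whether a family contains child/female).
\hat{p}(y = 1 \mid x) = \sigma\bigl(w^{\top}\phi(x) + b\bigr),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```

Under this reading, one such model is trained per split identity table, which matches the statement that each table is input into the GBDT+LR model separately.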
Refining GBDT+LR data: the results computed by the GBDT+LR model cannot be put into use directly. The data of all user groups are merged and placed in the identity tag column. Because the results are expressed as numbers, database techniques with added conditional logic are used to convert all numeric identity tags uniformly into string-type data.
Combining the results of the three machine learning algorithms: the identity tags produced by the three machine learning algorithms are combined into one table.
Comparison analysis of identity tags: the identity tags obtained by the three machine learning algorithms represent three sets of identity results. The three identity tag columns are compared, and a family member identity is marked as the final, true identity tag only when it appears in all three columns.
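The rule that a tag is final only when it appears in all three result columns amounts to a set intersection per family. A minimal sketch, with hypothetical input dictionaries mapping each family ID to the list of tags predicted by one algorithm:

```python
def consensus_tags(knn_tags, lr_tags, gbdt_lr_tags):
    """Return, per family, the identity tags present in all three result columns."""
    final = {}
    shared_families = knn_tags.keys() & lr_tags.keys() & gbdt_lr_tags.keys()
    for family_id in shared_families:
        final[family_id] = (
            set(knn_tags[family_id])
            & set(lr_tags[family_id])
            & set(gbdt_lr_tags[family_id])
        )
    return final


final = consensus_tags(
    {"f1": ["child/female", "middle-aged/male"]},
    {"f1": ["child/female", "elderly/male"]},
    {"f1": ["child/female", "middle-aged/male"]},
)
```

Requiring agreement from all three models trades coverage for precision: a family may end up with no final tag when the models disagree.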
Determining identity tag combinations of family members: the family member combination is determined by analyzing the final identity tags. For example, if a household contains only middle-aged/male, middle-aged/female, and juvenile users, it can be given personalized combination tags such as "family of three" and "two generations". If a household contains middle-aged/male, middle-aged/female, child, and elderly users, it can be given family tags such as "three generations". Tags such as the time preferences and interest preferences of family members can also be derived from the identity tags.
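The two combination examples above can be expressed as simple rules over the final identity tags. In this sketch the tag strings, the age-group prefixes, and the English tag names are illustrative renderings of the groups named in the text; a real rule set would cover many more household compositions.

```python
def combination_tags(identity_tags):
    """Derive family combination tags from the final identity tags.

    Tags are "age-group/gender" strings; the age-group spellings used here
    ("child", "juvenile", "middle-aged", "elderly") are hypothetical.
    """
    tags = set(identity_tags)
    ages = {tag.split("/")[0] for tag in tags}
    has_parents = {"middle-aged/male", "middle-aged/female"} <= tags
    has_young = bool(ages & {"child", "juvenile"})
    has_elderly = "elderly" in ages

    if has_parents and has_young and has_elderly:
        return ["three generations"]
    if has_parents and has_young and ages <= {"middle-aged", "child", "juvenile"}:
        return ["family of three", "two generations"]
    return []


two_gen = combination_tags(["middle-aged/male", "middle-aged/female", "juvenile/male"])
three_gen = combination_tags(
    ["middle-aged/male", "middle-aged/female", "child/female", "elderly/male"]
)
```

Identity tags alone do not reveal how many children live in the household, so "family of three" here follows the example in the text rather than an actual head count.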
Those skilled in the art will appreciate that, in addition to implementing the system, the apparatus, and their respective modules provided herein as pure computer-readable program code, the method steps can be logically programmed so that the system, the apparatus, and their respective modules achieve the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, the system, the apparatus, and their respective modules provided by the present application may be regarded as a hardware component, and the modules included therein for implementing various programs may also be regarded as structures within the hardware component; modules for implementing various functions may also be regarded as either software programs implementing the method or structures within the hardware component.
The foregoing describes specific embodiments of the present application. It is to be understood that the application is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the application. The embodiments of the application and the features of the embodiments may be combined with each other arbitrarily without conflict.

Claims (10)

1. A family portrait construction method based on OTT equipment is characterized by comprising the following steps:
step M1: establishing a scenario description tag of the video by utilizing cluster analysis according to the data of the video watched by the television family;
step M2: establishing an identity tag model based on preference characteristics required by a preset rule to obtain known identity tags and characteristic data of the known identity features;
step M3: according to a semi-supervised learning mode, a batch identity tag model of the family portrait is built according to the known identity tag, the feature data of the known identity feature, the unknown identity tag and the feature data of the unknown identity tag, and learning training is carried out on the batch identity tag model of the family portrait periodically by utilizing a plurality of machine learning algorithms;
step M4: counting the identity tags of family members according to the batch identity tag model calculation result of the trained family portraits, and marking out individual tags and combined tags of the family members, thereby obtaining corresponding family portraits;
the identity tag model is a rule-based tag model formulated according to the interest preference, the viewing time preference, the regional preference and the year of production preference of family members and the actual business requirements; preliminarily judging the identity label of the family member through an identity label model;
And the batch identity tag model of the family portrait learns the identity tags of unknown family members in batches according to the behavior characteristic data of the known identity tags.
2. The OTT device-based family portrait construction method according to claim 1, wherein said step M1 includes:
step M1.1: screening original family data according to preset requirements according to the data of video watching of television families, and extracting all scenario classification words meeting the user-defined standardized requirements from screened video watching records;
step M1.2: according to actual business requirements, the scenario classification words are split into a plurality of single words through a database segmentation technology and a K-means clustering algorithm, and scenario description tags of videos are built.
3. The OTT device-based family portrait construction method according to claim 1, wherein the step M2 includes:
step M2.1: processing the collected original family data into a unified family member film-viewing record format through a database technology;
step M2.2: based on the processed family member viewing records, classifying users into preset classes according to age and gender, and mapping scenario description tags of videos onto each family member;
Step M2.3: dividing the viewing time of the family members into preset sections according to actual demands, and establishing an identity tag model based on the mapping relation between the scenario description tags of the processed video and the family members and combining information comprising the divided viewing time, viewing areas, production years, actors and video large classifications to obtain identity tags of the known members and characteristic data of the known identity tags.
4. The OTT device-based family portrait construction method according to claim 1, wherein the plurality of machine learning algorithms in step M3 include: a K-nearest neighbor algorithm, a logistic regression multi-classification algorithm, and/or a GBDT+LR algorithm.
5. The OTT device-based family portrait construction method according to claim 1, wherein the step M3 includes:
step M3.1: the characteristic data of all known and unknown identity tags are standardized, K nearest neighbor algorithm calculation is carried out, and probability prediction of all family members is obtained;
step M3.2: feature data of all known and unknown identity tags are standardized, and all the standardized mixed identity feature data are put into a logistic regression multi-classification algorithm to carry out identity prediction;
step M3.3: the feature data of all known and unknown identity tags are standardized, all the standardized mixed identity feature data are split into preset groups, GBDT feature extraction is carried out on each group, and LR logistic regression two-class identity prediction is then carried out.
6. The OTT device-based family portrait construction method according to claim 5, wherein said step M4 includes: counting the identity tags of family members by comparing and analyzing the calculation results of the multiple machine learning algorithms, namely the K-nearest neighbor algorithm, the logistic regression multi-classification algorithm, and the GBDT+LR two-class algorithm, and marking out individual tags and combined tags of the family members.
7. An OTT device-based family portrait construction system, comprising:
module M1: establishing a scenario description tag of the video by utilizing cluster analysis according to the data of the video watched by the television family;
module M2: establishing an identity tag model based on preference characteristics required by a preset rule to obtain known identity tags and characteristic data of the known identity features;
module M3: according to a semi-supervised learning mode, a batch identity tag model of the family portrait is built according to the known identity tag, the feature data of the known identity feature, the unknown identity tag and the feature data of the unknown identity tag, and learning training is carried out on the batch identity tag model of the family portrait periodically by utilizing a plurality of machine learning algorithms;
module M4: counting the identity tags of family members according to the batch identity tag model calculation result of the trained family portraits, and marking out individual tags and combined tags of the family members, thereby obtaining corresponding family portraits;
The identity tag model is a rule-based tag model formulated according to the interest preference, the viewing time preference, the regional preference and the year of production preference of family members and the actual business requirements; the identity label of the family member can be primarily judged through the identity label model;
and the batch identity tag model of the family portrait learns the identity tags of unknown family members in batches according to the behavior characteristic data of the known identity tags.
8. The OTT device-based family portraits construction system of claim 7, wherein said module M1 comprises:
module M1.1: screening original family data according to preset requirements according to the data of video watching of television families, and extracting all scenario classification words meeting the user-defined standardized requirements from screened video watching records;
module M1.2: according to actual business requirements, the scenario classification words are split into a plurality of single words through a database segmentation technology and a K-means clustering algorithm, and scenario description tags of videos are built.
9. The OTT device-based family portraits construction system of claim 7, wherein said module M2 comprises:
Module M2.1: processing the collected original family data into a unified family member film-viewing record format through a database technology;
module M2.2: based on the processed family member viewing records, classifying users into preset classes according to age and gender, and mapping scenario description tags of videos onto each family member;
module M2.3: dividing the viewing time of the family members into preset sections according to actual demands, and establishing an identity tag model based on the mapping relation between the scenario description tags of the processed video and the family members and combining information comprising the divided viewing time, viewing areas, production years, actors and video large classifications to obtain identity tags of the known members and characteristic data of the known identity tags.
10. The OTT device-based family portraits construction system of claim 7, wherein the plurality of machine learning algorithms in the module M3 include: k neighbor algorithm, logistic regression multi-classification algorithm and/or GBDT+LR algorithm;
the module M3 includes:
module M3.1: the characteristic data of all known and unknown identity tags are standardized, K nearest neighbor algorithm calculation is carried out, and probability prediction of all family members is obtained;
Module M3.2: feature data of all known and unknown identity tags are standardized, and all the standardized mixed identity feature data are put into a logistic regression multi-classification algorithm to carry out identity prediction;
module M3.3: the feature data of all known and unknown identity tags are standardized, all the standardized mixed identity feature data are split into preset groups, GBDT feature extraction is carried out respectively, and then LR logistic regression two-class identity prediction is carried out;
the module M4 includes: counting the identity tags of family members by comparing and analyzing the calculation results of the multiple machine learning algorithms, namely the K-nearest neighbor algorithm, the logistic regression multi-classification algorithm, and the GBDT+LR two-class algorithm, and marking out individual tags and combined tags of the family members.
CN202010652016.6A 2020-07-08 2020-07-08 Family portrait construction method and system based on OTT equipment Active CN111861550B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010652016.6A CN111861550B (en) 2020-07-08 2020-07-08 Family portrait construction method and system based on OTT equipment

Publications (2)

Publication Number Publication Date
CN111861550A CN111861550A (en) 2020-10-30
CN111861550B true CN111861550B (en) 2023-09-08

Family

ID=73153106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010652016.6A Active CN111861550B (en) 2020-07-08 2020-07-08 Family portrait construction method and system based on OTT equipment

Country Status (1)

Country Link
CN (1) CN111861550B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112637684B (en) * 2020-12-25 2022-02-01 四川长虹电器股份有限公司 Method for detecting user portrait label at smart television terminal
CN114268838B (en) * 2021-12-15 2023-12-26 深圳市酷开网络科技股份有限公司 Family member portrait processing method and device based on OTT user portrait
CN114554296B (en) * 2022-01-26 2024-05-24 浙江原初数据科技有限公司 IPTV user family portrait extraction system and method
CN115134668A (en) * 2022-03-14 2022-09-30 深圳市酷开网络科技股份有限公司 OTT-based family member age group and family structure dividing method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2945113A1 (en) * 2014-05-14 2015-11-18 Cisco Technology, Inc. Audience segmentation using machine-learning
CN108154401A (en) * 2018-01-15 2018-06-12 网易无尾熊(杭州)科技有限公司 User's portrait depicting method, device, medium and computing device
CN109615432A (en) * 2018-12-14 2019-04-12 成都德迈安科技有限公司 Consumer behaviour portrait tool based on big data
CN109636481A (en) * 2018-12-19 2019-04-16 未来电视有限公司 User's portrait construction method and device towards domestic consumer
CN110674178A (en) * 2019-08-30 2020-01-10 阿里巴巴集团控股有限公司 Method and system for constructing user portrait label
CN110769457A (en) * 2019-10-09 2020-02-07 深圳市酷开网络科技有限公司 Family relation discovery method, server and computer readable storage medium
CN111199287A (en) * 2019-12-16 2020-05-26 北京淇瑀信息科技有限公司 Feature engineering real-time recommendation method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on precision marketing methods for home broadband users; 龚追飞; 金天骄; 邮电设计技术 (Issue 07); full text *

Also Published As

Publication number Publication date
CN111861550A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN111861550B (en) Family portrait construction method and system based on OTT equipment
CN111444428B (en) Information recommendation method and device based on artificial intelligence, electronic equipment and storage medium
CN107515873B (en) Junk information identification method and equipment
US8650198B2 (en) Systems and methods for facilitating the gathering of open source intelligence
CN103744928B (en) A kind of network video classification method based on history access record
Yang et al. Mining Chinese social media UGC: a big-data framework for analyzing Douban movie reviews
CN112131472B (en) Information recommendation method, device, electronic equipment and storage medium
CN107301199A (en) A kind of data label generation method and device
CN109963175B (en) Television product accurate recommendation method and system based on explicit and implicit potential factor model
CN106354860A (en) Method for automatically labelling and pushing information resource based on label sets
Özdağoğlu et al. A predictive filtering approach for clarifying bibliometric datasets: an example on the research articles related to industry 4.0
CN111611488A (en) Information recommendation method and device based on artificial intelligence and electronic equipment
CN102214227B (en) Automatic public opinion monitoring method based on internet hierarchical structure storage
CN111858972A (en) Movie recommendation method based on family knowledge graph
CN108810577B (en) User portrait construction method and device and electronic equipment
CN110958472A (en) Video click rate rating prediction method and device, electronic equipment and storage medium
CN114491149A (en) Information processing method and apparatus, electronic device, storage medium, and program product
US20230245144A1 (en) System for identifying and predicting trends
WO2023087933A1 (en) Content recommendation method and apparatus, device, storage medium, and program product
CN113254794B (en) Program data recommendation method and system based on modeling
Krishnamoorthy et al. TV shows popularity and performance prediction using CNN algorithm
CN116484085A (en) Information delivery method, device, equipment, storage medium and program product
CN114841155A (en) Intelligent theme content aggregation method and device, electronic equipment and storage medium
Ma et al. [Retracted] Data Analysis Method of Intelligent Analysis Platform for Big Data of Film and Television
Bai et al. A WeChat official account reading quantity prediction model based on text and image feature extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant