Disclosure of Invention
The invention aims to provide a system that delivers personalized, uninterrupted audio content to different users in a vehicle-mounted scene, and that uses big data and expert knowledge to solve the problem of sparse active behavior data from users in that scene.
In order to achieve the above purpose, the invention provides a recommendation system for streaming audio content listening in a vehicle-mounted scene, which is used in cooperation with a client, a server, a local file system and a storage system. The recommendation system comprises a real-time data collection subsystem, an offline model training subsystem and an online content delivery subsystem; the real-time data collection subsystem collects related information and inputs it into the storage system, the offline model training subsystem calculates offline model data from the raw data input into the storage system, and finally the online content delivery subsystem delivers content according to the offline model data; the related information comprises user behavior data, automobile information and scene information.
The recommendation system for streaming audio content listening in a vehicle-mounted scene comprises two links, candidate set generation and candidate set sorting, wherein the candidate set generation comprises user active behavior and offline model calculation, and the candidate set sorting calculates how much the user likes the items in the candidate set.
According to the recommendation system for streaming audio content listening in the vehicle-mounted scene, the user active behavior means that the user actively fills in favorite content tags through the corresponding product forms, which comprise custom playlists and interest selection; the custom playlist is presented on the product interface, where the user defines the playlist content based on content classifications, content tags and content keywords; interest selection means that the user selects content tags of interest on the interface shown at registration and activation.
The recommendation system for streaming audio content listening in a vehicle-mounted scene, wherein the offline model calculation means that the offline model training subsystem analyzes data through algorithms to obtain the content tags the user is likely to like, the data comprising user information, user behaviors, automobile information and scene information; the offline model calculation consists of four parts: chasing (following serialized programs), user portrait, user attribute recommendation and popular content.
The recommendation system for streaming audio content in a vehicle-mounted scene, wherein the chasing is computed by the offline model training subsystem from the user listening history stored in the storage system, and the process is as follows: first, the records are grouped by the unique identifier of each user and only the listening records of serialized programs are retained; next, a list of the programs listened to in the last three months is kept in reverse time order; then the next episode of each program is looked up; and finally the result is stored.
The recommendation system for streaming audio content in a vehicle-mounted scene, wherein the user portrait is obtained as follows: user behavior data and audio information are first acquired; the offline model training subsystem then associates the two types of data by the unique audio identifier, groups the result by user, and calculates the user portrait of each user, the weight of each tag being given by tag weight = behavior type weight × time decay × number of behaviors × TF-IDF. The user behavior data comprise audio listening duration, subscription, playlist click, search click, album on-demand, next track and negative feedback; the audio information includes the duration, the album to which the audio belongs, the tags of the album and the category to which the audio belongs. The formula of the user portrait tag weight is:
$\mathrm{norm}(W_{behavior} \times F_t \times C \times TF \times IDF)$
wherein the behavior type weight $W_{behavior}$ = {subscription: 5, playlist click: $1.4 \times R$, search: $1.3 \times R$, album on demand: $1.2 \times R$, next: $1 \times R$, negative feedback: 0.1}; the album completion rate $R = \sum playtime_{audio} / \sum Duration_{audio}$; the time decay $F_t = \max(1, 1 \times e^{-0.8 \times \max(0, (now - playtime)/(24 \times 3600))})$, where now is the current time and playtime is the time in ms when the behavior occurs; the number of behaviors C is counted per day and is the number of times the same behavior type occurs for the same album; and the tag importance is
$TF \times IDF = \frac{n_{t,u}}{\sum_{t'} n_{t',u}} \times \log\frac{|U|}{|U_t| + 1}$
wherein the numerator of TF, $n_{t,u}$, is the number of times tag t appears on user u and the denominator is the total number of tags of that user; the numerator of the argument of the logarithm in IDF, $|U|$, is the total number of users, and the denominator is the number of users having the tag, $|U_t|$, plus 1.
The recommendation system for streaming audio content listening in a vehicle-mounted scene is characterized in that the user attribute recommendation is based on the collected attributes and custom playlist information of seed users, together with operation experience; the offline model training subsystem calculates the preference degree of users with different attributes for playlist content by a formula that gives, for known user attributes $u_1, u_2, \dots, u_n$, the relative probability that the user likes tag l. N and n are respectively the total number of data records and the frequency with which tag l is "liked"; $N_i$ and $n_i$ are respectively the total number of data records under attribute i and the frequency with which tag l is "liked" under that attribute. Similar to tf-idf, the first term is a penalty term: the higher the tag heat, the lower the value (idf). The second term is a summation of conditional probabilities: the higher the probability that the tag occurs under the attribute, the higher the value (tf). (n-α) is the penalty term coefficient, where α defaults to 1 (no penalty) and the recommended interval is 0 ≤ α ≤ 1; β is the weight for weakening popular tags within each attribute, where the default is 1 (no weakening) and the recommended interval is 1 to 2. The larger the α value, the smaller the heat penalty and the more the scoring leans toward popular content; the smaller the α value, the larger the heat penalty and the more the scoring leans toward personalization. The larger the β value, the stronger the heat weakening and the more the scoring leans toward personalization; the smaller the β value, the weaker the heat weakening and the more the scoring leans toward popular content.
The recommendation system for streaming audio content in a vehicle-mounted scene, wherein the popular content is obtained by counting behavior data of users' album clicks, and the offline model training subsystem calculates the importance of each content classification in each hour by the following formula:
$TF \times IDF = \frac{n_{c,h}}{\sum_{c'} n_{c',h}} \times \log\frac{24}{H_c + 1}$
wherein the numerator of TF, $n_{c,h}$, is the number of times content classification c occurs in hour h, and the denominator is the total number of occurrences of all content classifications in that hour; the numerator of the argument of the logarithm in IDF is the total number of hours in a day, 24, and the denominator is the number of hours containing the content classification, $H_c$, plus 1.
According to the recommendation system for streaming audio content listening in the vehicle-mounted scene, the candidate set sorting is carried out by the offline model training subsystem: in the initial stage, when the user has few positive feedback behaviors, the content tag weights obtained from the user portrait are used as the basis of the overall sorting; in the later stage, as positive feedback data increase, a click-through rate estimation model can be used to learn the proportion of each candidate set and the final sorting automatically.
According to the recommendation system for streaming audio content listening in the vehicle-mounted scene, the online content delivery subsystem delivers online content according to the calculation results of the offline model training subsystem, and the delivery is divided into two links, recall and sorting; recall obtains from the storage system the candidate sets calculated by the offline models of the offline model training subsystem, and then calculates the proportion of each candidate set from the offline data statistics obtained; sorting acquires the related information of the current user and the intermediate data calculated offline, extracts features, uses a model to rank the content by how likely the user is to like it, and delivers the final result.
The recommendation system for streaming audio content in a vehicle-mounted scene has the following advantages:
1. The system adopts streaming listening, which reduces the driver's interactive operations while driving and thus reduces the risk of traffic accidents.
2. The system solves the problem of sparse active behavior data from users during streaming listening in the vehicle-mounted scene. On the product side, the user is guided to customize a playlist and to select favorite content tags at registration, and user behavior data are collected along multiple dimensions by combining subscription, click behavior, search, negative feedback and the like. On the algorithm side, user attribute recommendation is realized by collecting the user attributes and custom playlists of seed users, building a model, and calculating the preference degree between user attributes and content tags; popular recommendation is realized by calculating the importance of each content classification by hour.
3. In the early stage, when positive feedback data are sparse, the content tag weights of the user portrait can be used as the criterion for sorting the results. Once the volume of positive feedback generated by users on the content recommended by the system reaches a certain level (usually about 10 times the number of features), a supervised learning model, namely click-through rate estimation, can be used to optimize the sorting of the recommendation results.
4. In algorithm modeling, besides information related to users and contents, automobile and scene information are fused, so that recommended contents are more suitable for vehicle-mounted scenes.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings.
The invention provides a recommendation system for streaming audio content listening in a vehicle-mounted scene, which is used in cooperation with a client, a server, a local file system and a storage system, and comprises a real-time data collection subsystem, an offline model training subsystem and an online content delivery subsystem. The storage system comprises a distributed cache subsystem, an inverted index subsystem, a relational database subsystem and a distributed file subsystem. The recommendation system also depends on a middleware service system, which comprises an asynchronous communication subsystem based on the Actor model, a distributed real-time processing subsystem, a distributed computing subsystem and a real-time log collection subsystem. See fig. 1.
The real-time data collection subsystem collects related information through a client program and reports it to an HTTP web server, which records it to the local file system. The real-time log collection subsystem then supplements, splits and cleans the information and inputs it into the distributed file subsystem of the storage system. The offline model training subsystem calculates offline model data from the raw data input into the storage system, and finally the online content delivery subsystem delivers content according to the offline model data. The related information includes user behavior data, automobile information, scene information and the like.
The operation process of the offline model training subsystem comprises two important links, candidate set generation and candidate set sorting, wherein the candidate set generation comprises user active behavior and offline model calculation, and the candidate set sorting calculates how much the user likes the items in the candidate set.
The user active behavior means that the user actively fills in favorite content tags through the corresponding product forms, which comprise custom playlists and interest selection; the custom playlist is presented on the product interface, where the user defines the playlist content based on content classifications, content tags and content keywords; interest selection means that the user selects content tags of interest on the interface shown at registration and activation.
The offline model calculation means that the offline model training subsystem analyzes data through algorithms to obtain the content tags the user is likely to like, the data comprising user information, user behaviors, automobile information, scene information and the like; the offline model calculation consists of four parts: chasing (following serialized programs), user portrait, user attribute recommendation and popular content.
The chasing is computed with the distributed computing subsystem: the offline model training subsystem analyzes the user listening history stored in the storage system, and the process is as follows: first, the records are grouped by the unique identifier of each user and only the listening records of serialized programs (such as novels) are retained; next, a list of the programs listened to in the last three months is kept in reverse time order; then the next episode of each program is looked up; and finally the calculated result is stored in the inverted index subsystem.
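The chasing steps above can be sketched in a few lines of single-machine Python. This is only an illustration under stated assumptions, not the distributed implementation: the function name `next_episodes`, the record layout `(user_id, album_id, episode_index, listened_at, is_serial)` and the `episodes_by_album` lookup are hypothetical, and "the last three months" is approximated as 90 days.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def next_episodes(history, episodes_by_album, now, days=90):
    """Return, per user, the next episode of each serialized album, newest first.

    history: iterable of (user_id, album_id, episode_index, listened_at, is_serial).
    episodes_by_album: {album_id: [episode ids in play order]} (hypothetical lookup).
    """
    cutoff = now - timedelta(days=days)
    # user_id -> album_id -> (listened_at, episode_index) of the latest listen
    latest = defaultdict(dict)
    for user_id, album_id, episode_index, listened_at, is_serial in history:
        # Keep only serialized programs listened to within the window.
        if not is_serial or listened_at < cutoff:
            continue
        current = latest[user_id].get(album_id)
        if current is None or listened_at > current[0]:
            latest[user_id][album_id] = (listened_at, episode_index)

    result = {}
    for user_id, albums in latest.items():
        # Reverse time order: most recently listened album first.
        ordered = sorted(albums.items(), key=lambda kv: kv[1][0], reverse=True)
        queue = []
        for album_id, (_, episode_index) in ordered:
            episodes = episodes_by_album.get(album_id, [])
            if episode_index + 1 < len(episodes):  # skip finished albums
                queue.append((album_id, episodes[episode_index + 1]))
        result[user_id] = queue
    return result
```

In the system described here the resulting per-user list would be written to the inverted index subsystem; the sketch simply returns it.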
The user portrait is obtained as follows: user behavior data are first acquired from the distributed file subsystem and audio information is acquired from the relational database subsystem; the offline model training subsystem then associates the two types of data by the unique audio identifier, groups the result by user, and calculates the user portrait of each user, the weight of each tag being given by tag weight = behavior type weight × time decay × number of behaviors × TF-IDF. The user behavior data comprise audio listening duration, subscription, playlist click, search click, album on-demand, next track, negative feedback and the like; the audio information includes the duration, the album to which the audio belongs, the tags of the album, the category to which the audio belongs, and the like. The formula of the user portrait tag weight is:
$\mathrm{norm}(W_{behavior} \times F_t \times C \times TF \times IDF)$
wherein the behavior type weight $W_{behavior}$ = {subscription: 5, playlist click: $1.4 \times R$, search: $1.3 \times R$, album on demand: $1.2 \times R$, next: $1 \times R$, negative feedback: 0.1}; the album completion rate $R = \sum playtime_{audio} / \sum Duration_{audio}$; the time decay $F_t = \max(1, 1 \times e^{-0.8 \times \max(0, (now - playtime)/(24 \times 3600))})$, where now is the current time and playtime is the time in ms when the behavior occurs; the number of behaviors C is counted per day and is the number of times the same behavior type occurs for the same album; and the tag importance is
$TF \times IDF = \frac{n_{t,u}}{\sum_{t'} n_{t',u}} \times \log\frac{|U|}{|U_t| + 1}$
wherein the numerator of TF, $n_{t,u}$, is the number of times tag t appears on user u and the denominator is the total number of tags of that user; the numerator of the argument of the logarithm in IDF, $|U|$, is the total number of users, and the denominator is the number of users having the tag, $|U_t|$, plus 1.
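To make the tag weight formula concrete, the following Python sketch evaluates it for a single user. Several points are assumptions rather than part of the description: the function and field names are hypothetical, timestamps are taken in seconds so that dividing by 24*3600 yields days, contributions from different behaviors on the same tag are summed, the logarithm is natural, and norm is taken to mean dividing by the sum of the user's tag weights. The time decay is coded exactly as written above.

```python
import math
from collections import Counter, defaultdict

# Behavior type weights from the description; the four entries in
# RATE_SCALED are multiplied by the album completion rate R.
BASE_WEIGHT = {
    "subscribe": 5.0,
    "playlist_click": 1.4,
    "search": 1.3,
    "album_on_demand": 1.2,
    "next": 1.0,
    "negative_feedback": 0.1,
}
RATE_SCALED = {"playlist_click", "search", "album_on_demand", "next"}


def behavior_weight(behavior, completion_rate):
    weight = BASE_WEIGHT[behavior]
    return weight * completion_rate if behavior in RATE_SCALED else weight


def time_decay(now_s, played_s):
    age_days = max(0.0, (now_s - played_s) / (24 * 3600))
    # Coded as written: the outer max(1, .) keeps this factor at 1
    # for any non-negative age, because exp(-0.8 * age) <= 1.
    return max(1.0, math.exp(-0.8 * age_days))


def user_tag_weights(events, now_s, total_users, users_with_tag):
    """events: list of dicts with keys tags, behavior, completion_rate, played_s, count."""
    tag_counts = Counter(tag for e in events for tag in e["tags"])
    total_tags = sum(tag_counts.values())
    raw = defaultdict(float)
    for e in events:
        score = (behavior_weight(e["behavior"], e["completion_rate"])
                 * time_decay(now_s, e["played_s"]) * e["count"])
        for tag in e["tags"]:
            tf = tag_counts[tag] / total_tags
            idf = math.log(total_users / (users_with_tag[tag] + 1))
            raw[tag] += score * tf * idf
    norm = sum(raw.values())  # assumed normalization: weights sum to 1
    return {tag: w / norm for tag, w in raw.items()} if norm else {}
```

With one subscription to a "novel" album and one half-completed "next" on a "news" album, both tags equally common, the subscription dominates: the two raw weights stand in the ratio 5 : 0.5.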
The user attribute recommendation is based on the collected attributes and custom playlist information of seed users (collected, for example, through a WeChat applet), together with operation experience; the offline model training subsystem calculates the preference degree of users with different attributes for playlist content through a formula that, given the known user attributes $u_1, u_2, \dots, u_n$, yields the relative probability that the user likes tag l; the process is as follows:
Independence assumption
$P(u_1 u_2 \dots u_n \mid l=1) = P(u_1 \mid l=1)\,P(u_2 \mid l=1) \cdots P(u_n \mid l=1)$
Bayes formula
Deriving
Setting
$P(l=1) = p,\quad P(l=0) = 1-p,\quad P(l=1 \mid u_i) = q_i,\quad P(l=0 \mid u_i) = 1-q_i$
Finally obtain
N and n are respectively the total number of data records and the frequency with which tag l is "liked"; $N_i$ and $n_i$ are respectively the total number of data records under attribute i and the frequency with which tag l is "liked" under that attribute. Similar to tf-idf, the first term is a penalty term: the higher the tag heat, the lower the value (idf). The second term is a summation of conditional probabilities: the higher the probability that the tag occurs under the attribute, the higher the value (tf). (n-α) is the penalty term coefficient, where α defaults to 1 (no penalty) and the recommended interval is 0 ≤ α ≤ 1; β is the weight for weakening popular tags within each attribute, where the default is 1 (no weakening) and the recommended interval is 1 to 2. The larger the α value, the smaller the heat penalty and the more the scoring leans toward popular content; the smaller the α value, the larger the heat penalty and the more the scoring leans toward personalization. The larger the β value, the stronger the heat weakening and the more the scoring leans toward personalization; the smaller the β value, the weaker the heat weakening and the more the scoring leans toward popular content.
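For orientation, the default case α = 1, β = 1 can be worked out from the independence assumption and Bayes' formula alone. The derivation below is a sketch under stated assumptions, not a reproduction of the formula of the invention: it takes "relative probability" to mean the odds of l = 1 against l = 0, assumes the independence assumption also holds given l = 0, and leaves α and β out because their exact placement is not specified in the text above.

```latex
% Odds of "liked" given the attributes, using Bayes' formula and independence:
\frac{P(l=1 \mid u_1 \dots u_n)}{P(l=0 \mid u_1 \dots u_n)}
  = \frac{P(l=1)\,\prod_{i=1}^{n} P(u_i \mid l=1)}
         {P(l=0)\,\prod_{i=1}^{n} P(u_i \mid l=0)}

% Bayes' formula on each factor:
%   P(u_i | l=1) = q_i P(u_i) / p,   P(u_i | l=0) = (1 - q_i) P(u_i) / (1 - p)
  = \frac{p}{1-p}\,\prod_{i=1}^{n} \frac{q_i\,(1-p)}{(1-q_i)\,p}
  = \left(\frac{1-p}{p}\right)^{n-1} \prod_{i=1}^{n} \frac{q_i}{1-q_i}

% Taking logarithms gives a "penalty term plus summation" form:
\log \frac{P(l=1 \mid u_1 \dots u_n)}{P(l=0 \mid u_1 \dots u_n)}
  = (n-1)\,\log\frac{1-p}{p} + \sum_{i=1}^{n} \log\frac{q_i}{1-q_i}
```

Here the coefficient (n-1) corresponds to (n-α) with α = 1, the first term falls as the overall popularity p of the tag rises, and the second term rises with the per-attribute probabilities q_i, which matches the description of the two terms given above.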
The popular content is obtained by counting behavior data of users' album clicks, and the offline model training subsystem calculates the importance of each content classification in each hour through the following formula:
$TF \times IDF = \frac{n_{c,h}}{\sum_{c'} n_{c',h}} \times \log\frac{24}{H_c + 1}$
wherein the numerator of TF, $n_{c,h}$, is the number of times content classification c occurs in hour h, and the denominator is the total number of occurrences of all content classifications in that hour; the numerator of the argument of the logarithm in IDF is the total number of hours in a day, 24, and the denominator is the number of hours containing the content classification, $H_c$, plus 1.
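The hourly importance calculation can be illustrated with a short Python sketch. The function name `hourly_category_importance` and the input layout, one `(hour, category)` pair per album click, are hypothetical; the base-10 logarithm follows Example 1 below.

```python
import math
from collections import Counter, defaultdict


def hourly_category_importance(clicks):
    """clicks: iterable of (hour, category) pairs, one per album click.

    Returns {(hour, category): tf * idf}.
    """
    per_hour = defaultdict(Counter)
    for hour, category in clicks:
        per_hour[hour][category] += 1

    # Number of distinct hours in which each category occurs.
    hours_with = Counter()
    for counts in per_hour.values():
        hours_with.update(counts.keys())

    importance = {}
    for hour, counts in per_hour.items():
        total = sum(counts.values())  # all classified clicks in this hour
        for category, n in counts.items():
            tf = n / total
            idf = math.log10(24 / (hours_with[category] + 1))
            importance[(hour, category)] = tf * idf
    return importance
```

A category that is clicked in nearly every hour receives an idf close to zero, so the score favors categories that are characteristic of a particular time of day.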
The candidate set sorting is carried out by the offline model training subsystem: in the initial stage, when the user has few positive feedback behaviors, the content tag weights obtained from the user portrait are used as the basis of the overall sorting; in the later stage, as positive feedback data increase, a click-through rate estimation model can be used to learn the proportion of each candidate set and the final sorting automatically. The sorting process of the click-through rate estimation model is as follows: user behavior data and business content data are collected; the offline model training subsystem extracts features, including scene features, automobile features, user features, content features and the like, discretizes them, applies one-hot encoding and writes them into the storage system; at the same time, a logistic regression model is trained on the data, with behavior data on the recommendation results added, and the resulting model data are written into the storage system; the features and the model data are then read from the storage system, the click-through rate of each recommendation candidate is calculated in real time, and finally the recommendation results are sorted by click-through rate. See fig. 2.
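As a minimal illustration of this pipeline, the sketch below one-hot encodes discretized features, fits a logistic regression by plain stochastic gradient descent, and sorts candidates by predicted click-through rate. It is a toy under stated assumptions: all names, the feature layout and the hyperparameters are hypothetical, and a production system would train on the distributed computing subsystem rather than in a single process.

```python
import math


def build_vocabulary(samples):
    """Assign one index to every (feature name, discretized value) pair."""
    vocab = {}
    for sample in samples:
        for item in sample.items():
            vocab.setdefault(item, len(vocab))
    return vocab


def one_hot(sample, vocab):
    """Sparse one-hot encoding: the indices of the active features."""
    return [vocab[item] for item in sample.items() if item in vocab]


def predict(weights, bias, active):
    z = bias + sum(weights[i] for i in active)
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, labels, epochs=200, lr=0.5):
    """Logistic regression fitted with stochastic gradient descent."""
    vocab = build_vocabulary(samples)
    encoded = [one_hot(s, vocab) for s in samples]
    weights, bias = [0.0] * len(vocab), 0.0
    for _ in range(epochs):
        for active, y in zip(encoded, labels):
            err = predict(weights, bias, active) - y  # gradient of log loss
            bias -= lr * err
            for i in active:
                weights[i] -= lr * err
    return vocab, weights, bias


def rank(candidates, context, model):
    """Sort candidate ids by predicted click-through rate, highest first."""
    vocab, weights, bias = model
    scored = []
    for cand in candidates:
        features = {**context, **cand["features"]}
        scored.append((predict(weights, bias, one_hot(features, vocab)), cand["id"]))
    return [cand_id for _, cand_id in sorted(scored, reverse=True)]
```

For example, a model trained on samples in which "news" content was clicked and "novel" content was not will place a news album ahead of a novel album for the same scene.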
The online content delivery subsystem builds a high-performance, high-availability distributed application on the asynchronous communication subsystem based on the Actor model. It delivers online content according to the calculation results of the offline model training subsystem, and the whole process is divided into two links, recall and sorting. Recall obtains the candidate sets calculated by the offline models of the offline model training subsystem from the distributed cache subsystem, the inverted index subsystem and the relational database subsystem of the storage system, and then calculates the proportion of each candidate set from the offline data statistics obtained. The specific flow is as follows: first, the user's custom playlist is looked up; if it exists, the user's chasing content is fetched in reverse time order, limited to less than 50% of the results, and then merged with the custom playlist; if it does not exist, other strategies are used: the weights of the candidate sets for self-selected content tags, user chasing, user portrait, user attributes and the default playlist are determined, with initial weights of self-selected content tags 4, user chasing 2, user portrait 2, user attributes 1 and default playlist 1. The weights represent the proportional relation among the candidate sets, and the default order likewise follows self-selected content tags, user chasing, user portrait, user attributes and default playlist. All candidate set weights are set manually on the basis of overall consideration, and a change to the weights takes effect immediately. See fig. 3.
Sorting obtains the related information of the current user and the intermediate data calculated offline from the distributed real-time processing subsystem, the distributed cache subsystem, the inverted index subsystem and the relational database subsystem, extracts features, uses a model to rank the content by how likely the user is to like it, and delivers the final result.
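The recall flow can be sketched as follows. The sketch assumes a fixed number of result slots per request and uses largest-remainder rounding to turn the weights into slot counts; neither detail is stated in the description, and all names (`allocate_slots`, `build_candidates`, the dictionary keys) are hypothetical.

```python
# Initial candidate set weights from the description, in the default order.
DEFAULT_WEIGHTS = {
    "self_selected_tags": 4,
    "chasing": 2,
    "user_portrait": 2,
    "user_attributes": 1,
    "default_playlist": 1,
}


def allocate_slots(weights, total_slots):
    """Split total_slots in proportion to the weights (largest remainder)."""
    total_weight = sum(weights.values())
    exact = {name: total_slots * w / total_weight for name, w in weights.items()}
    slots = {name: int(value) for name, value in exact.items()}
    leftover = total_slots - sum(slots.values())
    by_remainder = sorted(weights, key=lambda name: exact[name] - slots[name],
                          reverse=True)
    for name in by_remainder[:leftover]:
        slots[name] += 1
    return slots


def build_candidates(custom_playlist, chasing, total_slots,
                     candidate_sets=None, weights=DEFAULT_WEIGHTS):
    """custom_playlist and chasing are lists of items, chasing newest first."""
    if custom_playlist:
        # Chasing content must stay strictly below 50% of the result.
        picked = chasing[: (total_slots - 1) // 2]
        return picked + custom_playlist[: total_slots - len(picked)]
    slots = allocate_slots(weights, total_slots)
    result = []
    for name in weights:  # dict order doubles as the default ordering
        result.extend((candidate_sets or {}).get(name, [])[: slots[name]])
    return result
```

With 20 slots and the initial weights, the five candidate sets receive 8, 4, 4, 2 and 2 slots respectively.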
The recommendation system for streaming listening to audio content in an in-vehicle scene provided by the invention is further described below with reference to an embodiment.
Example 1
A recommendation system for streaming audio content in a vehicle-mounted scene is used in combination with a client, a server, a local file system and a storage system. The recommendation system comprises a real-time data collection subsystem, an offline model training subsystem and an online content delivery subsystem.
1. Real-time data collection subsystem. The client collects audio playing behavior data and reports them to the Nginx web server. The log collection subsystem Flume collects and aggregates the logs, supplements them with album information, and stores them by time on the distributed file subsystem HDFS. Nginx (engine x) is a high-performance HTTP and reverse proxy web server that also provides IMAP/POP3/SMTP services. Flume is a highly available, highly reliable, distributed system for collecting, aggregating and transmitting massive logs, provided by Cloudera. HDFS (Hadoop Distributed File System) is a distributed file system designed to run on commodity hardware.
2. Offline model training subsystem.
(1) Chasing.
A distributed computing program is written with MapReduce. In task 1, the map reads the user listening records of the last three months, retains the records classified as novels, groups them by the unique identifier of the user and passes them to the reduce; the reduce sorts the data in time order and keeps the most recent listening record of each album. Task 2 reads the data of task 1 together with all audio information of each album; the map groups the data by the unique identifier of the album and passes them to the reduce; the reduce calculates the next episode after the one in the listening history. In task 3, the map reads the data of task 2 and groups them by the unique identifier of the user; the reduce stores the grouped data in the inverted index subsystem Elasticsearch. MapReduce is a programming model for parallel computation over large-scale datasets (greater than 1 TB). Elasticsearch is a Lucene-based search server.
(2) User portrait calculation.
First, the raw data are cleaned: the play end event data are associated with the audio information in the service database through the unique audio identifier; new data and historical data are merged by the unique identifier of the user; for the merged data, the decay weight of each audio play duration is calculated and accumulated by the unique identifier of the user and the unique identifier of the album; finally, the accumulated decay weights are arranged in descending order.
The user tags are then calculated: the tag blacklist and the album information (including content classification and tags) are obtained, the album tags are filtered against the blacklist, and albums containing blacklisted tags are removed. The data cleaned in the previous step are associated and merged by the unique identifier of the album. The decay weights are accumulated for each album tag under each user, and the final weights are then calculated by a normalization formula.
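The tag calculation step can be sketched as below, starting from per-album decay weights that have already been accumulated. The function name `label_weights`, the input layouts and the choice of normalization (dividing by the sum of the user's tag weights) are assumptions for illustration; the description only states that a normalization formula is applied.

```python
from collections import defaultdict


def label_weights(album_decay_weights, album_tags, blacklist):
    """Accumulate per-album decay weights onto tags and normalize per user.

    album_decay_weights: {user_id: {album_id: accumulated decay weight}}
    album_tags: {album_id: [tags]}
    blacklist: set of tags; an album carrying any of them is dropped.
    """
    result = {}
    for user_id, albums in album_decay_weights.items():
        totals = defaultdict(float)
        for album_id, weight in albums.items():
            tags = album_tags.get(album_id, [])
            if any(tag in blacklist for tag in tags):
                continue  # remove the whole album, not only the bad tag
            for tag in tags:
                totals[tag] += weight
        norm = sum(totals.values())  # assumed normalization: sum to 1
        result[user_id] = ({tag: w / norm for tag, w in totals.items()}
                           if norm else {})
    return result
```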
(3) Popular recommendations, i.e., user attribute recommendations and popular content.
Behavior data on users' album clicks are collected. For each hour, the number of album clicks in each classification is divided by the number of album clicks in all classifications in that hour, and the quotient is taken as tf. 24 is divided by the number of hours in which the classification occurs plus 1, and the base-10 logarithm of the quotient is taken as idf. tf multiplied by idf is the importance of a classification in a given hour. The classifications are then regrouped into four categories (entertainment, knowledge, life and information), and the importance of each category in each hour is calculated. The data are saved to the inverted index subsystem Elasticsearch. For each hour, the online delivery subsystem first recalls content from the category with the highest importance and then proportions the content according to the normalized classification importance, thereby improving the recall rate.
3. Online content delivery subsystem. The user requests the subsystem service interface, passing in the user's unique identifier uid, and the system obtains the custom playlist, the chasing content, the self-selected content tags, the user portrait and the user attributes according to the uid. If a custom playlist exists, the related albums are obtained from the inverted index subsystem Elasticsearch through the album tags stored in the playlist and are combined with the chasing content to form a candidate set. If there is no custom playlist, the album tags the user is likely to like are obtained from the user attributes through the user attribute recommendation model, the self-selected content tags, the user portrait and the popular tags are added, and the related albums are obtained from Elasticsearch to form candidate sets. The number of items taken from each candidate set is allocated according to its weight. Finally, the results are sorted by the tag weights of the user portrait and recommended.
The recommendation system for streaming listening to audio content in a vehicle-mounted scene provided by the invention is a system and a method for providing personalized, uninterrupted audio content to different users in the vehicle-mounted scene, and it solves the problem of sparse active behavior data from users in that scene by using big data and expert knowledge. A radio-station mode is adopted, which reduces the distraction of the driver. Automobile information and scene information are fused in, so that the recommended audio content suits the characteristics of in-vehicle use.
While the present invention has been described in detail through the foregoing description of the preferred embodiment, it should be understood that the foregoing description is not to be considered as limiting the invention. Many modifications and substitutions of the present invention will become apparent to those of ordinary skill in the art upon reading the foregoing. Accordingly, the scope of the invention should be limited only by the attached claims.