CN111309936A - Method for constructing portrait of movie user - Google Patents

Method for constructing portrait of movie user

Info

Publication number
CN111309936A
Authority
CN
China
Prior art keywords
movie
user
data
portrait
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911373310.7A
Other languages
Chinese (zh)
Inventor
胡亚娇
谢志峰
丁友东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Transpacific Technology Development Ltd
University of Shanghai for Science and Technology
Original Assignee
Beijing Transpacific Technology Development Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Transpacific Technology Development Ltd
Priority to CN201911373310.7A
Publication of CN111309936A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43 Querying
    • G06F16/435 Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for constructing a movie user portrait, comprising the following steps: step one, selecting users who have posted Chinese movie reviews on a movie community website, and collecting static data and dynamic data of those users; step two, constructing a three-layer label system for the movie user portrait from the collected multi-dimensional data of the sample movie users; step three, predicting the first-layer and second-layer labels of each movie user, from the bottom of the label hierarchy upward, according to the correspondence between the user's multi-dimensional data and the labels in the label system, so as to construct a relatively complete single-user portrait model; and step four, performing movie preference analysis on groups of movie users who share certain characteristics, so as to generate the third-layer labels of the movie user portrait and construct a group user portrait. By analyzing and mapping the users' raw data, the invention generates a label system, thereby labeling the attributes of movie users and constructing portrait models for groups of users who share the same attributes.

Description

Method for constructing portrait of movie user
Technical Field
The invention relates to a construction method of a movie user portrait, and belongs to the fields of big data, data mining, natural language processing and machine learning.
Background
Against the background of big data and social media, Internet platforms analyze the latent preferences revealed by user information and behavior, and use them to promote platform content in a personalized way. A user portrait, i.e. the tagging of user information, is a model of target users built from real data. It describes a user's social attributes, living habits, consumption behavior and so on in the form of labels. As a core component of recommendation systems, user portraits mine the individual characteristics of users, the differences between users and the characteristics of a platform's user groups, and are widely used in commercial fields such as e-commerce product recommendation and advertisement delivery. With user portraits, a platform can make personalized recommendations, users get a better experience, and the platform attracts more traffic.
At present, most research on and implementations of user portraits are based on personality-scale surveys of volunteers. Users are first surveyed with a questionnaire scale and the scores are used to determine their personality types; a model of the association between words in social data and personality types is then trained; finally, personality types are predicted from users' social data. This approach relies on surveying a subset of users, consumes considerable manpower and material resources, and limits the scope of the research; designing the scale is itself difficult, and its accuracy is unknown.
Dittman et al., in "Random forest A reliable tool for patient responsiveness", IEEE International Conference on Bioinformatics & Biomedicine Workshops, IEEE, 2011, apply random forests to predict patients' responses to drugs. In their experiments, high-dimensional data are classified with a random forest and five other classification learners, and the results show that the random forest performs best under every feature selection strategy.
Wangli et al. propose an AdaBoost algorithm for multi-label classification, MLR, which makes reasonable use of the correlation among the labels to be predicted and thereby improves the accuracy of multi-label classification.
Liu Sha Jian et al., in "Graph Based Keyphrase Extraction Using LDA Topic Model", Journal of the Chinese Society for Scientific and Technical Information, 2016, 35(6): 664-672, propose a keyword extraction model that combines LDA and TextRank. Experiments on the short and medium text data set Hulth2003 and the long text data set DUC2001 show the effectiveness of the method.
Fang Long et al., in "Structure-Function Recognition of Academic Text: Application in Automatic Keywords Extraction", Journal of the Chinese Society for Scientific and Technical Information, 2017, 36(6): 599-605, propose a method for recognizing the structural functions of academic text and a multi-feature extraction method that fuses the structural and functional features of academic text. Structural functions are recognized from the section titles of academic texts, and keywords are extracted from a collection of literature in the field of computer languages using SVM binary classification and the LambdaMART learning-to-rank algorithm respectively. The experimental results show that the multi-feature combination greatly improves keyword extraction compared with the baseline features.
Tengfei et al. describe the following approach in "Opinion Target Extraction in Chinese News documents", Proceedings of the 23rd International Conference on Computational Linguistics: Posters volume. Beijing: [s.n.], 2010: 782-. First, the NLP tool LTP is used to parse each sentence and determine whether it contains a subject, dividing sentences into implicit sentences (without a subject) and explicit sentences (with a subject). For explicit sentences, all nouns in the sentence are extracted and the candidate subjects are ranked by grammatical analysis. For implicit sentences, focus concepts are converted into Wikipedia concept vectors to measure sentence relatedness, candidate subjects are extracted from the context by ranking key concepts, and news topics and candidate subjects are ranked. Finally, the subject of the sentence is selected from the ranking and the context information with the help of centering theory.
Shiu-Li Huang et al., in Electronic Commerce Research and Applications, propose extracting opinion phrases from review sentences with customized POS templates and deriving opinion sentiment scores from those phrases. In their experiments, one set of POS templates is induced for movie reviews and another for car reviews in order to extract short opinions about movies and cars, and cross-domain POS templates are induced for comparison. Nouns, verbs, adjectives, degree adverbs and negation words are obtained from the short sentences and assigned vocabulary scores, and an algorithm is designed to compute the total score of each short sentence as its opinion score.
Movie reviews comment not only on movie elements such as the film as a whole, its content, actors and acting, shooting style, music and sound, and visuals and special effects, but also on the reviewer's personal emotions, circumstances and experiences, and sometimes even analyze the movie market as a whole or the social situation. The subject of a review sentence may therefore belong to the movie category or to other categories. The present method extracts the subjects of movie-category opinion sentences by constructing a rich movie lexicon, and extracts the subjects of other opinion sentences by combining centering theory with templates.
The invention combines Chinese NLP techniques such as opinion extraction and syntactic analysis, provides a method for extracting opinions from Chinese movie reviews, mines movie user portrait labels in depth, and constructs a complete movie user portrait model.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide an improved method for constructing a movie user portrait. A user portrait label system containing several layers of labels is built from the correspondence between raw user data and target user attributes; structured data and unstructured text are analyzed by different methods to generate labels; a complete movie user portrait that reflects the user's viewing characteristics is finally drawn; and movie preference analysis is further carried out on groups of movie users with certain characteristics to construct a group user portrait model.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for constructing a movie user portrait comprises the following steps:
step one, selecting users who have posted Chinese movie reviews on a movie community website, and collecting static data and dynamic data of those users;
step two, constructing a three-layer label system of the movie user portrait from the collected multi-dimensional data of the sample movie users;
step three, predicting each first-layer and second-layer label of the movie user from the bottom up, according to the correspondence between the user's multi-dimensional data and the labels in the label system, and constructing a relatively complete portrait model of the single movie user;
and step four, performing movie preference analysis on groups of movie users who share certain characteristics, so as to generate the third-layer labels of the movie user portrait and construct the group user portrait.
In step one, sample users are drawn from the reviews of an equal number of popular movies in each of twenty movie types: drama, comedy, action, romance, science fiction, animation, suspense, thriller, horror, crime, same-sex, music, musical (song and dance), biography, history, war, western, fantasy, adventure, disaster and martial arts. This ensures the diversity of movie types as well as the activity and diversity of the sampled movie users.
The static data and dynamic data of the users in step one comprise four types of information: basic user information, user movie review information, movie information and user label information. Four tables are established in a database to store these four types of information respectively.
The multi-dimensional data in step two comprise the basic data, movie review data, diary data and viewing data of the movie users, and different labels are constructed for the data of each dimension.
And constructing the movie user portrait according to the corresponding relation between the labels and the data, wherein the labels of the movie user portrait use a classification model, and the labels of the movie user portrait use a clustering model.
In the third step, the movie user portrait label system is divided into three layers on top of the raw user data, and is built by generating labels layer by layer, from the lowest layer upward, using statistics, machine learning classification algorithms and NLP (natural language processing) methods.
According to the static and dynamic data of the movie users, the data are divided into four domains: basic attributes, social attributes, viewing preferences and individual characteristics. The data of each domain correspond to the labels of that domain; each domain contains more than two movie user labels, each label has at least two possible values, and the set of all labels forms the label library of the movie user portrait.
The user social ability label among the social attributes measures the user's two-way social activity. It is derived from two one-way social counts: the number of other movie users that the user follows (the following count) and the number of users who follow the user (the follower count). Each count is divided into three grades: strong, medium and weak. Based on the maximum and minimum of each count over all users, two thresholds are set per count to separate the strong, medium and weak grades, which classifies users along each one-way social dimension. Combining the grades of the two counts divides user social ability into nine levels.
The labels related to viewing time characteristics are derived from the raw review timestamps. Each timestamp is split into the review date and the review time of day, which are used to predict the user's monthly review count and the user's active time respectively. The number of reviews a user will post in the current month is predicted from the user's historical review counts over the preceding one month, three months, one year, two years and three years, and the user's annual activity is predicted with an XGBoost model. The user's most likely future active time is predicted from the times of day at which the user has posted reviews.
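The aggregation behind these viewing-time labels can be sketched as follows. This is a minimal illustration using only the Python standard library: the function names are hypothetical, the look-back windows follow the paragraph above, and taking the three busiest hours as the active time follows the embodiment in the detailed description. The XGBoost regression itself is not shown; the sketch only prepares the counts that such a model would consume.

```python
from collections import Counter
from datetime import datetime

def monthly_review_counts(timestamps):
    """Count reviews per calendar month, keyed by (year, month)."""
    return Counter((t.year, t.month) for t in timestamps)

def history_features(counts, year, month, windows=(1, 3, 12, 24, 36)):
    """For each window w, sum the review counts of the w months before (year, month)."""
    features = []
    for w in windows:
        total = 0
        y, m = year, month
        for _ in range(w):
            m -= 1
            if m == 0:          # step back across a year boundary
                y, m = y - 1, 12
            total += counts.get((y, m), 0)
        features.append(total)
    return features

def active_hours(timestamps, top_n=3):
    """Return the top_n hours of the day ranked by review count (ties: earlier hour first)."""
    by_hour = Counter(t.hour for t in timestamps)
    ranked = sorted(by_hour.items(), key=lambda kv: (-kv[1], kv[0]))
    return [hour for hour, _ in ranked[:top_n]]
```

The feature list returned by `history_features` would serve as one input row for the regression model, with the current month's count as the target.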
One of the viewing preference features is the type label of the movies a user watches. Movie labels are classified into ten categories of movie features, following the sentence template "shot in (year) in (region/country), telling, against an (environmental background) and a (historical background), the (content) of a (role) in an (era), in a (form), (manner) and (style)":
(1) Shooting year (1900 to 2019, one segment per decade)
(2) Region/country (China, Japan, USA, Europe, India, other regions)
(3) Environmental background (e.g. highway, city, poverty, desert, palace, the West)
(4) Historical background (e.g. the Cultural Revolution, the Renaissance, anti-war)
(5) Form (opera, cartoon, drama, music drama, documentary)
(6) Manner (action, comedy, tragedy, thriller, stream of consciousness, suspense)
(7) Style (e.g. ensemble style, amalgamation style, co-occurrence style, painting style, TV style)
(8) Role (family, superhero, second generation, earth, country, common people, father, gangsters, mutants, etc.)
(9) Era (dynasties, the Republic of China, etc.)
(10) Content (love, disaster, cult, fantasy, myth, police and gangsters, adventure, biography, sex, history, etc.)
The movies in the user's viewing history are then categorized, with each movie matching at most one value in each category. The viewing types are matched for each movie user and attached as viewing labels.
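The matching rule above (at most one value per category for each movie) can be illustrated with a small sketch. The vocabularies below are hypothetical and heavily abbreviated, only four of the ten categories are listed explicitly, and taking the first matching value in vocabulary order is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter

# Hypothetical, abbreviated vocabularies for a few of the ten categories.
DOMAINS = {
    "region": ["China", "Japan", "USA", "Europe", "India"],
    "form": ["opera", "cartoon", "drama", "documentary"],
    "manner": ["action", "comedy", "tragedy", "thriller", "suspense"],
    "content": ["love", "disaster", "fantasy", "adventure", "history"],
}

def decade_label(year):
    """Map a shooting year to its decade segment, e.g. 1994 -> '1990s'."""
    return f"{year // 10 * 10}s"

def classify_movie(year, tags, domains=DOMAINS):
    """Return {category: value}, keeping at most one value per category."""
    labels = {"year": decade_label(year)}
    tag_set = set(tags)
    for domain, vocabulary in domains.items():
        for value in vocabulary:
            if value in tag_set:
                labels[domain] = value   # assumption: first match wins
                break
    return labels

def viewing_labels(history, domains=DOMAINS):
    """Count (category, value) pairs over a user's viewing history."""
    counts = Counter()
    for year, tags in history:
        counts.update(classify_movie(year, tags, domains).items())
    return counts
```

The resulting counter can be read as the user's viewing labels with their frequencies, for example how often the user watches movies from a given decade or region.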
The viewing preference is derived from the Chinese text of the users' movie reviews. The review sentence is the smallest unit from which a user's preferences are obtained. Reviews often contain long sentences made up of several clauses, and clauses may share an object or extend an object introduced earlier. The review text is therefore analyzed sentence by sentence to obtain the different opinions in each sentence, and the theme or opinion expressed by the review as a whole is also extracted.
Sometimes the target of a comment cannot be found in the review sentence itself. An implicit object is a comment object that does not appear in the current sentence, and such a sentence is called an implicit sentence; an explicit object is a comment object that appears in the current sentence, and such a sentence is called an explicit sentence. Implicit objects are quite common in movie reviews: in our data set, sentences with an implicit target account for nearly 30% of all review sentences. Because the objects referred to in review data are often not obvious and most of them concern the themes the movie expresses, the comment object of an implicit sentence is judged from four sources: the movie theme, the review title, the subject of the preceding clause and the subject of the following clause. For an explicit sentence, the movie type, the theme and the values conveyed are the most likely opinion targets, so the subject of the sentence is determined by ranking the movie theme, the object and other candidates. The search for implicit objects in movie reviews relies on the following method:
1) find all nouns, adjectives and verbs in the sentence and put them in the set $S = \{t_i\}$;
2) calculate the mutual information MI between each $t_i$ and $t_0$;
3) select the ten words with the highest MI and combine them with $t_0$ into a word vector $V$;
4) let $\langle k_{ij} \rangle$ be the inverted index entry of $t_i$, where $k_{ij}$ quantifies the strength of association between $t_i$ and the Wikipedia concept $c_j$; the vector $V$ is then interpreted as a vector over all Wikipedia concepts, in which each concept $c_j$ has the weight $w_j = \sum_{t_i \in V} v_i k_{ij}$;
5) select the N concepts with the highest weights.
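The five steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the patented implementation: mutual information is estimated here as pointwise mutual information from sentence-level co-occurrence, every vector component v_i is set to 1.0 because the text does not define it, and the inverted index mapping words to Wikipedia concepts is a hypothetical dictionary supplied by the caller.

```python
import math
from collections import Counter

def pmi_scores(sentences, t0):
    """Pointwise mutual information of each word with t0 (sentence-level co-occurrence)."""
    n = len(sentences)
    df, co = Counter(), Counter()
    for words in sentences:
        unique = set(words)
        df.update(unique)               # sentences containing each word
        if t0 in unique:
            co.update(unique - {t0})    # sentences containing the word together with t0
    return {t: math.log(joint * n / (df[t] * df[t0])) for t, joint in co.items()}

def build_vector(scores, t0, k=10):
    """Combine t0 with the k highest-MI words; all components are 1.0 (assumption)."""
    top = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return {t: 1.0 for t in [t0, *top]}

def top_concepts(vector, inverted_index, n_concepts):
    """Compute w_j = sum_i v_i * k_ij and return the n_concepts heaviest concepts."""
    weights = Counter()
    for term, v in vector.items():
        for concept, k in inverted_index.get(term, {}).items():
            weights[concept] += v * k
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [concept for concept, _ in ranked[:n_concepts]]
```

In practice the inverted index would be built from a Wikipedia dump; here it is just a nested dictionary of the form `{word: {concept: strength}}`.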
For the second-layer user age label in step three, user ages are divided into segments. The labels in the user viewing label library are first classified according to the ten categories of movie features; the classified viewing labels, the user social strength label and the user influence label are then fed as input features into a random forest classification model to predict the user's age group.
For the second-layer user personality label in step three, user personality is classified according to the "Big Five" personality traits of psychology. The viewing labels, the user social strength label and the user influence label are passed as input features into a random forest classification model to predict the user's personality.
For the second-layer user income label in step three, user income is divided into three categories. The viewing labels, the user social strength label and the user influence label are passed as input features into a random forest classification model to predict the user's income.
The second-layer user role labels in step three are calculated as follows: user roles are divided into three parts (gender, marriage and children), each of which is classified separately, and the user social situation, the labels related to viewing time characteristics and the viewing preference labels are passed as input features into an AdaBoost.
In step four, the group viewing preferences are obtained by using statistics to calculate the viewing preferences of users grouped by age, personality, income and role respectively, and a group movie user portrait model is constructed.
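A group-level statistic of this kind can be sketched as a simple group-by over user records. The record layout (`age_group`, `viewing_labels`) and the choice of label share as the statistic are assumptions made for illustration; the text only states that statistics are computed per group.

```python
from collections import Counter, defaultdict

def group_preferences(users, attribute):
    """For each value of `attribute`, return the share of each viewing label in that group."""
    totals = defaultdict(Counter)
    for user in users:
        totals[user[attribute]].update(user["viewing_labels"])
    shares = {}
    for group, counts in totals.items():
        n = sum(counts.values())
        shares[group] = {label: c / n for label, c in counts.items()}
    return shares
```

Calling the function with `attribute` set to an age, personality, income or role field yields one preference distribution per group, which is the kind of data the group portrait figures summarize.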
Compared with the prior art, the invention has the following prominent substantive characteristics and obvious advantages:
The method not only analyzes and maps the users' raw data comprehensively to generate a label system and labels users in terms of their viewing characteristics, but also, building on those viewing characteristics, constructs group user portrait models that capture how much groups with different characteristics like different types of movies. It is an unprecedented user research method in the field of movie user portraits and is of great significance for movies and personalized recommendation.
Drawings
FIG. 1 is a flow chart of a movie user representation construction method.
FIG. 2 is a block diagram of a movie user portrait labeling architecture.
Fig. 3 is a block diagram of a user movie review viewpoint extraction flow.
FIG. 4 is a user social situation calculation flow diagram.
FIG. 5 is a block diagram of a calculation process of the region situation to which the static attribute of the user belongs.
FIG. 6 is a block diagram of a user viewing time characteristic correlation label calculation process.
FIG. 7 is a block diagram of a user social attribute viewing preference tag computation flow.
FIG. 8 is a block diagram of a user static attribute age group prediction process.
FIG. 9 is a block diagram of a user static attribute personality prediction process.
FIG. 10 is a block diagram of a user static attribute revenue prediction process.
FIG. 11 is a block diagram of a user static attribute role prediction flow.
FIG. 12 is a graph of movie preference statistics for different age groups of the user population.
FIG. 13 is a pie chart of the degree of preference of different personalities in the user population for historical films.
FIG. 14 is a histogram of the preference of different income groups in the user population for classical dubbing.
Detailed Description
So that the manner in which the features and aspects of the embodiments of the present disclosure can be understood in detail, a more particular description of the embodiments, briefly summarized above, may be had by reference to the appended drawings.
As shown in fig. 1, a method for constructing a movie user portrait includes the following steps:
step one, selecting users who have posted Chinese movie reviews on a movie community website, and collecting static data and dynamic data of those users. The data collection method of this embodiment is as follows:
(1) An equal number of movie reviews are selected from the popular listings of twenty movie types: drama, comedy, action, romance, science fiction, animation, suspense, thriller, horror, crime, same-sex, music, musical (song and dance), biography, history, war, western, fantasy, adventure, disaster and martial arts.
(2) The movie reviews are collected, and the user ID (field name: user_id), the link to the user's social homepage (field name: user_url) and the user's nickname (field name: user_name) are parsed from the HTML text of each review. These three items are stored in a user table of the MongoDB database.
(3) The user_url field is read from the user table, the user's homepage is visited, and the user's basic information is collected: place of residence, registration time, personal signature and the user's follow count. This information is stored in a user basic information table of the MongoDB database.
(4) The user_url field is read from the user table, the link to the user's review list page is constructed, each review detail link is visited, and the review information is collected: movie ID, review title, review time, full review text, the user's rating of the movie, and the numbers of "useful" and "useless" votes on the review. This information is stored in a user movie review information table of the MongoDB database.
(5) The movie ID field is read from the user movie review information table, the movie link is constructed and visited, and the movie information is collected: movie rating, movie type, shooting region and movie duration. This information is stored in a movie information table of the MongoDB database.
(6) The user_url field is read from the user table, the link to the user's viewing label page is constructed, the page is read, and the user's viewing labels and the count of each label are collected. This information is stored in a user viewing label information table of the MongoDB database.
Step two, constructing a three-layer label system of the movie user portrait from the collected multi-dimensional data of the sample movie users, as shown in fig. 2. The movie user portrait label system is divided into three layers on top of the raw user data, and is built from the lowest layer upward by generating labels in turn with statistics, machine learning classification algorithms and NLP (natural language processing) methods. The label system of this embodiment is constructed as follows:
(1) Part of the raw data is used to predict the user's social situation, region, viewing time characteristics and viewing preferences, which form the first-level labels.
(2) Part of the first-level label data is used to predict the user's age, personality, income, family role and social role, which form the second-level labels.
(3) Part of the first-level and second-level label data is used to compute statistics on the movie preferences of user groups that share attributes: the viewing preferences of groups of different ages, of different personalities and of different incomes are analyzed statistically to form the third-level labels.
Step three, predicting each first-layer and second-layer label of the movie user from the bottom up, according to the correspondence between the user's multi-dimensional data and the labels in the label system, and constructing a relatively complete portrait model of the single movie user. The label data mining method of this embodiment is as follows:
(1) The user social situation is calculated as shown in fig. 4. The user's two raw one-way social counts (the number of users followed and the number of followers) are each divided into five levels: one count into (0,20), (20,50), (50,100), (100,300) and more than 300, and the other into (0,50), (50,200), (200,500), (500,1500) and more than 1500. Each user is thus assigned a level i for the first count and a level j for the second. The user's social situation on the platform is represented by two labels. The first is the social strength label, which has five grades from weak to strong (1,2,3,4,5) and takes the larger of the two levels i and j as its value. The second is the user influence label, obtained by dividing level i by level j, i.e. i/j: if the result is 1 the label value is "normal"; if it is less than 1 the label value is "weak", indicating weak user influence; and if it is greater than 1 the label value is "strong", indicating strong user influence.
(2) The user's region is calculated as shown in fig. 5. First, city names are extracted from the same-city activities in the user's raw data and used to fill null values of the user's place of residence; the users' region words are then added to a lexicon of provinces and cities; finally, each user is matched against the lexicon to obtain the user's province and city labels.
(3) The labels related to the user's viewing time characteristics are calculated as shown in fig. 6. The raw review timestamp of each user is split into two parts: the review date and the review time of day. For the monthly review count label, the user's reviews are aggregated by calendar month according to the review date, and the monthly review count is used as the output of the prediction model; the user's review counts in the one month, two months, three months and one year preceding the current calendar month are aggregated respectively and used as the feature inputs of the prediction model; finally, a machine learning XGBoost model is used for regression prediction. For the user active time label, the time of day of each review is extracted, all of the user's reviews are aggregated by hour, and the three hours with the highest review counts among the 24 hours of the day are taken as the user's active time.
(4) One of the viewing preference features is the type label of the movies a user watches. Movies in the user's viewing history are classified using ten categories of movie features, following the sentence template "shot in (year) in (region/country), telling, against an (environmental background) and a (historical background), the (content) of a (role) in an (era), in a (form), (manner) and (style)", and each movie matches at most one value in each category. The viewing types are matched for each movie user and attached as viewing labels.
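The social-situation calculation in item (1) above can be sketched as follows. The level boundaries come from the text; which of the two counts yields level i and which yields level j, and the treatment of a count that falls exactly on a boundary, are assumptions, and all names are hypothetical.

```python
from bisect import bisect_left

# Upper bounds of levels 1-4; anything above the last bound is level 5.
BOUNDS_I = (20, 50, 100, 300)
BOUNDS_J = (50, 200, 500, 1500)

def level(count, bounds):
    """Map a count to a level 1..5; a count equal to a bound stays in the lower level (assumption)."""
    return bisect_left(bounds, count) + 1

def social_labels(count_i, count_j):
    """Derive the social strength label and the influence label from the two counts."""
    i = level(count_i, BOUNDS_I)
    j = level(count_j, BOUNDS_J)
    if i == j:              # i / j == 1
        influence = "normal"
    elif i < j:             # i / j < 1
        influence = "weak"
    else:                   # i / j > 1
        influence = "strong"
    return {"social_strength": max(i, j), "influence": influence}
```

Comparing i and j directly is equivalent to testing the ratio i/j against 1 and avoids floating-point division.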
The viewing preference labels of the user are obtained as shown in fig. 7, in the following steps: (1) review data cleaning; (2) subject extraction; (3) review opinion extraction; (4) review sentiment extraction.
(1) Film comment data cleaning
First, the review data need to be cleaned. Since the movie review platform is a Chinese platform, collected English reviews are converted into Chinese. The Chinese vocabulary is extended with movie-specific terms covering six aspects (the film as a whole; content; actors and acting; shooting style; music and sound; visuals and special effects) and the ten categories of movie features, as well as Internet slang. The jieba word segmentation tool is then used to segment each review sentence and tag the part of speech of each word. Stop words are removed using a Chinese stop word list, leaving the meaningful Chinese review words and user sentiment words.
(2) Subject extraction
In an implicit sentence the object being commented on is not stated explicitly. Most such objects concern the theme the movie sets out to express, so the comment object of an implicit sentence is determined from the movie theme, nouns near adjectives, the subject of the preceding clause, the subject of the following clause, and the review title. For an explicit sentence, the nouns before and after an adjective are extracted as the subject of the sentence.
(3) Review opinion extraction
Negation words, adjectives and degree words are extracted from each sentence using POS templates with the nouns removed, and a sentiment score for the user's opinion of the quality of a given aspect of the movie is calculated from a sentiment dictionary.
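One common way to turn negation words, degree words and adjectives into a score is a simple lexicon rule, sketched below. This is an assumption about how such a score might be computed, not the scoring rule of the method itself; the three small dictionaries and their weights are hypothetical.

```python
# Hypothetical mini-lexicons; a real system would load a sentiment dictionary.
SENTIMENT = {"好": 1.0, "精彩": 2.0, "差": -1.0, "无聊": -2.0}
DEGREE = {"很": 1.5, "非常": 2.0, "有点": 0.5}
NEGATION = {"不", "没", "没有"}


def opinion_score(tokens):
    """Score a token list: degree words scale and negations flip the
    polarity of the next sentiment word; other tokens are ignored."""
    score = 0.0
    weight = 1.0
    sign = 1.0
    for tok in tokens:
        if tok in NEGATION:
            sign = -sign
        elif tok in DEGREE:
            weight *= DEGREE[tok]
        elif tok in SENTIMENT:
            score += sign * weight * SENTIMENT[tok]
            # Modifiers apply only to the sentiment word that follows them.
            weight, sign = 1.0, 1.0
    return score


plot_score = opinion_score(["剧情", "非常", "精彩"])
effects_score = opinion_score(["特效", "不", "好"])
```

Scoring each aspect (plot, effects, music and so on) separately keeps the result aligned with the subject extracted in the previous step.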
(4) Review sentiment extraction
Each user review has a theme and a sentiment tendency; the LDA topic model is used to extract the topic words of the review.
(5) For the user age label, as shown in fig. 8, user ages are divided into five age groups: under 18, 18-25, 25-35, 35-50, and over 50, labeled as categories (1, 2, 3, 4, 5). First, the labels in the user viewing label library are classified according to the ten categories of movie features; then the age groups of some users are labeled manually; finally, the classified viewing labels, the user's social strength labels and the user's influence labels are fed as input features into a random forest classification model to predict the user's age group.
(6) As shown in fig. 9, user personality labels follow the "Big Five" personality traits in psychology: openness, conscientiousness, extraversion, agreeableness and neuroticism, labeled as categories (1, 2, 3, 4, 5). The personalities of some users are labeled manually, and the viewing labels classified in step (5), the user's social strength labels and the user's influence labels are then fed as input features into a random forest classification model to predict the user's personality.
(7) As shown in fig. 10, the user income label is divided into three categories, including general and rich, labeled (1, 2, 3). The income categories of some users are labeled manually, and the viewing labels classified in step (5), the user's social strength labels and the user's influence labels are then fed as input features into a random forest classification model to predict the user's income.
(8) The calculation of the labels related to user roles is shown in fig. 11. User roles are divided into three parts: gender, marital status and children. Gender is labeled as category (1, 2); marital status (unmarried or married) is labeled as category (1, 2); and whether the user has children (yes or no) is labeled as category (1, 2). First, the categories of some users are labeled manually; then the user social labels from step (1), the viewing time labels from step (3) and the viewing preference labels from step (4) are passed as input features into an AdaBoost.
Step four: based on the user characteristics, viewing preference analysis is performed on groups of movie users who share certain characteristics, generating the third-layer labels of the movie user portrait and constructing the group user portrait.
Statistical methods are used to calculate the viewing preferences of users across different age groups, personalities, incomes and roles, and the group movie user portrait is constructed as follows:
(1) A bar graph is drawn as shown in fig. 12, with the horizontal axis representing the users' viewing preferences and the vertical axis representing the number of users in each of the 5 age groups, to analyze the relationship between user age and viewing preference.
(2) A pie chart is drawn as shown in fig. 13, in which the area of each sector represents the share of each personality type among the people who like a certain kind of movie, to analyze how strongly different groups prefer that kind of movie.
(3) A histogram is drawn as shown in fig. 14, with the horizontal axis representing user income and the vertical axis representing viewing preferences, to analyze the viewing preferences of different income groups.
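The counts behind these three charts amount to a cross-tabulation of a group attribute against viewing preference. The sketch below shows one way to compute it with the standard library; the record fields (`age_group`, `preference`) and function names are hypothetical, and the plotting itself is left out.

```python
from collections import Counter, defaultdict


def preference_by_group(users, group_key, pref_key="preference"):
    """Cross-tabulate a group attribute against viewing preference."""
    table = defaultdict(Counter)
    for user in users:
        table[user[group_key]][user[pref_key]] += 1
    return {group: dict(prefs) for group, prefs in table.items()}


def preference_shares(pref_counts):
    """Convert one group's counts into shares, e.g. for pie-chart sectors."""
    total = sum(pref_counts.values())
    return {pref: n / total for pref, n in pref_counts.items()}


# Made-up individual portraits reduced to two labels each.
users = [
    {"age_group": 2, "preference": "science fiction"},
    {"age_group": 2, "preference": "science fiction"},
    {"age_group": 2, "preference": "comedy"},
    {"age_group": 4, "preference": "history"},
]
by_age = preference_by_group(users, "age_group")
shares_group_4 = preference_shares(by_age[4])
```

The same function can be reused with a personality, income or role field as `group_key` to produce the counts for the other charts.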

Claims (17)

1. A method for constructing a movie user portrait, characterized by comprising the following steps:
step one, selecting users who have posted Chinese movie reviews on a movie community website, and collecting static data and dynamic data of the users;
step two, constructing a three-layer label system for the movie user portrait according to the collected multi-dimensional data of the sample movie users;
step three, predicting each first-layer and second-layer label of a movie user from bottom to top according to the correspondence between the multi-dimensional data of the movie user and the labels in the label system, and constructing a relatively complete portrait model of the individual movie user;
step four, according to the user characteristics, performing viewing preference analysis on groups of movie users who share certain characteristics, generating the third-layer labels of the movie user portrait, and constructing the group user portrait.
2. The method for constructing a movie user portrait according to claim 1, wherein in step one, users who have reviewed popular movies are selected for data collection to form the sample users, the movies being drawn from twenty movie types including drama, comedy, action, romance, science fiction, animation, suspense, thriller, horror, crime, same-sex, music, musical, biography, history, war, western, fantasy, adventure, disaster and martial arts, so as to ensure the diversity of movie types and the activity and diversity of characteristics of the movie users.
3. The method for constructing a movie user portrait according to claim 1, wherein the static data and dynamic data of the user in step one comprise four types of information: basic user information, movie review information, movie information and label information, and four tables are established in a database to store the four types of information.
4. The method for constructing a movie user portrait according to claim 1, wherein the multi-dimensional data in step two comprise basic data, comment data, diary data and viewing data of the movie user, and different labels are constructed from the data of each dimension.
5. The method for constructing a movie user portrait according to claim 1, wherein the model in step three is based on the correspondence between labels and data; the construction of the movie user portrait comprises an individual movie user portrait and a group movie user portrait; the labels of the individual portrait are obtained using machine learning classification models and natural language processing, and the labels of the group portrait are obtained by statistical analysis.
6. The method for constructing a movie user portrait according to claim 1, wherein in step three, each label of the movie user is predicted using one of: statistics, the machine learning random forest, the XGBoost classification algorithm, AdaBoost, the MLR multi-label classification algorithm, and syntactic analysis in natural language processing.
7. The method for constructing a movie user portrait according to claim 3, wherein the movie user data are divided into four domains (basic attributes, social attributes, viewing preferences and personality characteristics) according to the static data and dynamic data of the movie user, and the data of each domain correspond to the labels of that domain, wherein each domain comprises more than two movie user labels, each label has at least two label values, and the set of all labels is the label library of the movie user portrait.
8. The method for constructing a movie user portrait according to claim 7, wherein the user social ability label among the social attributes measures the two-way social degree of the movie user; the data underlying the label consist of the number of other users who follow the movie user (follower count) and the number of other users the movie user follows (following count), each of which is divided into three levels: strong, medium and weak; according to the maximum and minimum of the follower counts and following counts of all users, two thresholds are set for each count to separate the strong, medium and weak levels, giving a one-way social classification of the users; the user social ability is then divided into nine levels according to the follower count and the following count.
9. The method for constructing a movie user portrait according to claim 7, wherein one of the viewing preference features is the type label of the movies the user watches; movie label classification describes the movies in the user's viewing history with ten categories of movie features, following the template "shot in (year) in (region/country), against (environmental background) and (historical background), telling the (content) of (character) in (year), in (form), (manner) and (style)", wherein each movie matches at most one value in each category; a viewing type is matched for each movie user and a viewing label is attached.
10. The method for constructing a movie user portrait according to claim 7, wherein one of the viewing preference features is the user's review time, which is divided into the review date and the review time of day, used to predict the user's monthly review count and the user's active time respectively; the number of reviews of the movie user in the current month is predicted from the user's historical review counts over one month, three months, one year, two years and three years, and the annual activity of the movie user is predicted by the XGBoost model; the movie user's most likely future active time is predicted from the times of day of the user's reviews.
11. The method of claim 7, wherein the viewing preferences comprise the Chinese text data of the movie users' reviews; a review sentence is the minimum unit for obtaining a movie user's preferences; reviews often contain long sentences in which one sentence contains several clauses, and clauses may share objects or carry objects over from one another; the review text is analyzed sentence by sentence to obtain the different opinions in each sentence, and the theme or opinion expressed by the review as a whole is extracted.
12. The method for constructing a movie user portrait according to claim 11, wherein the comment target of a sentence sometimes cannot be found in the review sentence itself, a phenomenon called an implicit object: a comment object that does not appear in the current sentence, such a sentence being called an implicit sentence; an explicit object is a comment object that appears in the current sentence, such a sentence being called an explicit sentence; implicit objects are quite common in movie reviews, and in the data set of review sentences, sentences with implicit targets account for nearly 30% of the total; because the objects referred to in movie review data are often not obvious and most objects concern the theme the movie sets out to express, the comment object of an implicit sentence is determined from four aspects: the movie theme, the review title, the subject of the preceding clause and the subject of the following clause; for an explicit sentence, the movie type, theme and conveyed values are the most likely opinion targets, so the subject referred to by the explicit sentence is determined in the order of movie theme, object and so on; the search for the implicit object in a movie review relies on the following method:
1) find all nouns, adjectives and verbs in the same sentence and put them in the set $S = \{t_i\}$;
2) calculate the mutual information MI of each $t_i$ with $t_0$;
3) select the ten words with the highest MI and combine them with $t_0$ into a word vector;
4) let $\langle k_{ij} \rangle$ be the inverted index entry of $t_i$, where $k_{ij}$ quantifies the strength of association between $t_i$ and the Wikipedia concept $c_j$; the vector V is then interpreted as a vector constructed over all Wikipedia concepts, each concept $c_j$ having a weight
[formula image: Figure RE-RE-FDA0002486052830000031]
5) select the N concepts with the highest weights.
13. The method for constructing a movie user portrait according to claim 1, wherein the second-layer movie user age label in step three divides user ages into age groups; first, the labels in the user viewing label library are classified according to the ten categories of movie features; then the classified viewing labels, the user's social strength labels and the user's influence labels are fed as input features into a random forest classification model to predict the age group to which the user belongs.
14. The method for constructing a movie user portrait according to claim 1, wherein the second-layer movie user personality label in step three classifies user personality according to the Big Five personality traits in psychology, and the viewing labels, the user's social strength labels and the user's influence labels are fed as input features into a random forest classification model to predict the user's personality.
15. The method of claim 1, wherein the second-layer movie user income label in step three divides user income into three categories; the viewing labels, the user's social strength labels and the user's influence labels are fed as input features into a random forest classification model to predict the user's income.
16. The method for constructing a movie user portrait according to claim 1, wherein the calculation of the second-layer labels related to movie user roles in step three comprises: dividing user roles into three parts (gender, marriage and children), classifying each part separately, and passing the user social labels, the labels related to the user's viewing time features and the user's viewing preference labels as input features into an AdaBoost.
17. The method for constructing a movie user portrait according to claim 1, wherein the group viewing preferences in step four are obtained by using statistical methods to calculate the viewing preferences of users in different age, personality, income and role categories, and a group movie user portrait model is constructed.
CN201911373310.7A 2019-12-27 2019-12-27 Method for constructing portrait of movie user Pending CN111309936A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911373310.7A CN111309936A (en) 2019-12-27 2019-12-27 Method for constructing portrait of movie user

Publications (1)

Publication Number Publication Date
CN111309936A true CN111309936A (en) 2020-06-19

Family

ID=71156354

Country Status (1)

Country Link
CN (1) CN111309936A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200619)