CN114596031A - Express terminal user portrait model based on full life cycle data - Google Patents

Express terminal user portrait model based on full life cycle data

Info

Publication number
CN114596031A
Authority
CN
China
Prior art keywords
user
data
attribute
classification
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210230161.4A
Other languages
Chinese (zh)
Inventor
赵学健
赵可
孙知信
孙哲
汪胡青
宫婧
胡冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202210230161.4A priority Critical patent/CN114596031A/en
Publication of CN114596031A publication Critical patent/CN114596031A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/08 Logistics, e.g. warehousing, loading or distribution; Inventory or stock management
    • G06Q10/083 Shipping
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/01 Customer relationship services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Artificial Intelligence (AREA)
  • Development Economics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Educational Administration (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Game Theory and Decision Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An express terminal user portrait model based on full life cycle data. All data generated by a user while using the software or program is collected by a web crawler. To fully mine the information it contains, the data is classified into three attributes (basic, interest and feedback), and the characteristics of each attribute are analyzed in practice so that corresponding services can be recommended. Three attribute vectors are constructed: the basic and interest attribute data are processed directly into vectors, while the feedback attribute vector is built from scores computed with an assignment formula that takes into account the user's ratings and comments and the user's preferences. The vectors of the different attributes are combined into a multi-attribute vector model that comprehensively describes the user's characteristics, and this model can be mined and analyzed to evaluate how stable the user is. The method supports in-depth research on user portraits in the logistics and express industry and gives express enterprises a comprehensive and efficient user portrait model.

Description

Express terminal user portrait model based on full life cycle data
Technical Field
The invention belongs to the technical field of portrait construction, and particularly relates to an express terminal user portrait model based on full life cycle data.
Background
As society develops, enterprise services are becoming more user-friendly and are improving people's quality of life. At the same time, consumers are placing more demands on the service standards and modes of every industry. How enterprises should change to serve users better has become an urgent problem. The idea of intelligent service was proposed in response, and enterprises have begun to develop products that provide intelligent, personalized services. The key to personalization is understanding both the user and the product, so that product information can be filtered by user interest and recommended in a targeted way.
With the development of e-commerce, logistics enterprises are also gaining more and more customers. While the number of large-order customers has grown sharply, the mailing demand of the many small-volume individual customers has also begun to rise steadily. To serve these ordinary customers better and more conveniently, logistics enterprises have launched their own software so that customers can handle quick services themselves, such as sending parcels online and receiving pickup information. To deliver truly personalized service, a logistics enterprise needs to mine and understand the varied and complex needs of these individual users.
Most existing general-purpose user portrait techniques extract and build information from data within a limited period, or from a single type of data in the user life cycle. For example, a user portrait extracted only from feedback data suffers from data sparsity. In fact, all the data generated across the user life cycle, from the moment the user logs in to the software until the review is submitted after the order is completed, can be combined to build a complete user portrait and thereby refine user segmentation.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing an express terminal user portrait model based on full life cycle data, which comprehensively describes a user's behavior in the software through a multi-dimensional feature vector. User features are divided into primary and secondary attributes, then classified and generalized into a multi-vector model, and the user portrait is constructed by extracting from this model. Because the basic, feedback and interest attributes are all considered, an initial user model can be built even for a new user who has no usage data, which allows an initial user segmentation.
The invention provides an express terminal user portrait model based on full life cycle data, comprising the following steps:
S1, constructing a multi-attribute vector user portrait that integrates the full life cycle data: the acquired data is classified by attribute and formed into corresponding vectors;
S2, dividing each feature of the basic attributes and the interest attributes into categories, analyzing the characteristics of the users in each category, and obtaining the service recommendation direction suited to that category;
S3, processing the feedback information produced after the user completes an order, based on the user's rating of the transaction, the review content, and the number of transactions the user has made with the software: keywords are extracted and identified from the review text, sentiment analysis is performed, weights are assigned, and the review content is converted into a numeric value.
As a further technical scheme of the invention, the multi-attribute vector user portrait comprises three attribute categories: the basic attribute, the interest attribute and the feedback attribute;
the basic attribute is the user's basic information and comprises four feature tags (user ID, age, gender and location); the vector formed by the basic attributes is
B={ID,age,gender,identity,region},
the interest attribute is a data set compiled from the options the user selects in order transactions, the comment texts the user publishes on the platform, the survey data that is issued and collected, and the preferences the user ticks; it comprises four tags (receiving and sending time, transportation mode, product specification and own preference), which form the vector
P={AT,type,norm,preference},
the feedback attribute comprises two factors, the order rating value and the comment sentiment, together with indirect feedback data; the rating value and the comment sentiment are combined by a formula into a feedback value R_i for each order, forming the vector
F={R_1,R_2,…,R_n},
where R_i is the feedback index value of the user's i-th transaction;
combining the vectors formed by the three attributes gives the final model of the multi-attribute vector user portrait:
UP={B,P,F}
Figure BDA0003540204680000031
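As an illustration only, the three attribute vectors and the combined model UP={B,P,F} could be held in a small data structure such as the following Python sketch. The class names, field types and example values are assumptions made for this example and are not specified in the text; an empty F corresponds to a new user with no order feedback yet.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BasicAttributes:
    """B = {ID, age, gender, identity, region} (field types are assumed)."""
    user_id: str
    age: int
    gender: str
    identity: str
    region: str


@dataclass
class InterestAttributes:
    """P = {AT, type, norm, preference} (field types are assumed)."""
    at: str                  # AT: commonly used receiving and sending time
    transport_type: str      # type: commonly used transportation mode
    norm: str                # norm: specification of frequently sent articles
    preference: List[str]    # preference: preferred express enterprises


@dataclass
class UserPortrait:
    """UP = {B, P, F}; F holds one feedback value R_i per transaction."""
    B: BasicAttributes
    P: InterestAttributes
    F: List[float] = field(default_factory=list)


# Hypothetical new user: B and P are known, F is still empty.
up = UserPortrait(
    B=BasicAttributes("u001", 23, "female", "student", "Nanjing"),
    P=InterestAttributes("evening", "land", "small", ["ExampleExpress"]),
)
up.F.append(4.5)  # feedback value of the first completed order
print(up)
```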
Furthermore, the feedback value R_i of each order is calculated from the rating value and the comment sentiment as follows. The comment is first segmented with jieba, a Chinese word segmentation library for Python, to extract keywords; the keywords in the data set are then passed in a loop to the sentiments method of the SnowNLP library; and the sentiment expressed by each order comment is determined from the sentiment of its keywords and assigned a value. The comment sentiment is graded into levels 1 to 5, corresponding to 1 to 5 points, where a lower score means lower satisfaction. The comment sentiment index value obtained in this way is denoted e, and s denotes the order rating. The feedback evaluation index is calculated as:
R_i = α × s + β × e,
where α and β are specified coefficients: when the user has given only a rating and no comment for the transaction, α = 1; otherwise, α = β = 0.5.
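The calculation of R_i can be sketched in Python as below. This is a minimal sketch under stated assumptions: the sentiment input is taken to be a probability in [0, 1] (the kind of value a sentiment classifier such as SnowNLP returns), the mapping of that probability onto levels 1 to 5 by equal-width bins is an assumption, and β is assumed to be 0 when there is no comment. The function names are hypothetical.

```python
from typing import Optional


def sentiment_level(positive_prob: float) -> int:
    """Map a sentiment probability in [0, 1] to a level e in 1..5.

    Equal-width bins are an assumption; the text only states that the
    comment sentiment is graded into five levels.
    """
    if not 0.0 <= positive_prob <= 1.0:
        raise ValueError("positive_prob must lie in [0, 1]")
    return min(5, int(positive_prob * 5) + 1)


def feedback_value(score: float, positive_prob: Optional[float] = None) -> float:
    """Return R_i = alpha * s + beta * e for one order.

    score: the order rating s (1 to 5).
    positive_prob: sentiment probability of the comment, or None when the
        user left a rating but no comment.
    """
    if positive_prob is None:
        # Rating only: alpha = 1 (beta = 0 is assumed here).
        alpha, beta, e = 1.0, 0.0, 0
    else:
        alpha, beta, e = 0.5, 0.5, sentiment_level(positive_prob)
    return alpha * score + beta * e


# Feedback vector F for three hypothetical orders.
F = [feedback_value(5), feedback_value(4, 0.9), feedback_value(2, 0.1)]
print(F)  # [5.0, 4.5, 1.5]
```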
Further, in step S2, the vector model of the multi-attribute user portrait undergoes one classification followed by one clustering, in two stages, as follows:
in the first classification, an integrated classification algorithm combining the K-means algorithm with the support vector machine algorithm divides all users of the software into two main classes: stable users S and users about to churn L;
in the second clustering, density peak clustering is adopted; a coefficient control factor is added according to the user's preference, adjusting the formulas for the relative distance between data points and for the local density. The S and L clusters are each clustered in this way, yielding m and n subdivided clusters respectively. The group characteristics of each subdivided cluster represent all objects in that cluster, and the label characteristics of the cluster centers are summarized to obtain m + n group portraits;
in addition, when a new user joins, the Minkowski distance between the new user and each existing cluster center is calculated to judge the new user's similarity to the existing groups and assign a user type, and the cluster center is updated each time a new user is assigned.
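Assigning a new user to the nearest existing cluster center can be sketched as follows. The order p of the Minkowski distance and the incremental-mean rule for updating the center are assumptions; the text only states that the distance is computed and the center is updated after the assignment.

```python
from typing import List, Sequence


def minkowski(x: Sequence[float], y: Sequence[float], p: float = 2.0) -> float:
    """Minkowski distance of order p between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def assign_and_update(new_user: Sequence[float],
                      centers: List[List[float]],
                      counts: List[int],
                      p: float = 2.0) -> int:
    """Assign new_user to the nearest center, then move that center.

    centers and counts are modified in place; counts[j] is the number of
    users already in cluster j. Returns the index of the chosen cluster.
    """
    k = min(range(len(centers)),
            key=lambda j: minkowski(new_user, centers[j], p))
    counts[k] += 1
    # Incremental mean (assumed update rule): the center stays the mean
    # of all members assigned so far.
    centers[k] = [c + (v - c) / counts[k] for c, v in zip(centers[k], new_user)]
    return k


centers = [[0.0, 0.0], [10.0, 10.0]]
counts = [4, 4]
idx = assign_and_update([1.0, 1.0], centers, counts)
print(idx, centers[idx], counts)  # 0 [0.2, 0.2] [5, 4]
```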
Furthermore, the first classification uses an algorithm that integrates K-means and SVM. SVM is a supervised machine learning method and must be trained before it can classify, so the unsupervised result produced by K-means is used as the SVM's training data, and the separating hyperplane is adjusted repeatedly according to the SVM's classification result to obtain a better classification. Feature factors specific to express terminal users are added to adjust the algorithm during classification:
X={x_1,…,x_n} is the set of multi-dimensional vectors, containing the users' basic attribute, interest attribute and feedback attribute data;
A={a_1,a_2} are the centers of the two clusters specified in the first classification, representing stable users and users about to churn, respectively;
a binary variable [z_ij]_{n×2} ∈ {0,1} is used to mark each data point:
Figure BDA0003540204680000041
||x_i − a_j||_2 is the Euclidean distance between a data point and a cluster center;
the cluster centers are determined as follows:
Figure BDA0003540204680000042
a weight is introduced into the calculation of the objective function, W={ω_1,ω_2,…,ω_n}, giving the clustering objective function
Figure BDA0003540204680000043
The K-means algorithm is iterated to minimize the objective function, updating the cluster centers and the classification result until the cluster centers no longer change. This yields two clusters of labeled training data with which to train the SVM;
the SVM is then used to search the feature space for an optimal classification hyperplane w^T x + b = 0 that separates the data set into two classes;
the optimal hyperplane is found by solving the quadratic program:
Figure BDA0003540204680000051
subject to the constraint: z_ij [w^T · x_i + b] ≥ 1
Solving the quadratic program gives the optimal classification hyperplane, and the two clusters of labeled data obtained from K-means are adjusted according to it. A new classification hyperplane is then trained on the two adjusted clusters, and the clusters and the hyperplane are updated iteratively until the result no longer changes, giving the optimal first classification result.
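The K-means stage of the first classification can be sketched in plain Python as below. Because the formula images are not reproduced in the text, the weighted objective is assumed here to be the sum over i and j of ω_i · z_ij · ||x_i − a_j||², with each center taken as the weighted mean of its members; this is an assumption, not the formula from the figures. The SVM stage is not shown: the returned labels would serve as its training labels.

```python
from typing import List, Sequence, Tuple


def sq_dist(x: Sequence[float], a: Sequence[float]) -> float:
    """Squared Euclidean distance between a data point and a center."""
    return sum((xi - ai) ** 2 for xi, ai in zip(x, a))


def weighted_two_means(X: List[Sequence[float]],
                       w: Sequence[float],
                       centers: List[Sequence[float]],
                       max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Weighted K-means with k = 2 (assumed objective, see lead-in).

    labels[i] is 0 or 1 and plays the role of the indicator z_ij.
    Iteration stops when the assignment no longer changes.
    """
    centers = [list(c) for c in centers]
    labels = [-1] * len(X)
    for _ in range(max_iter):
        new_labels = [min((0, 1), key=lambda j: sq_dist(x, centers[j]))
                      for x in X]
        if new_labels == labels:
            break
        labels = new_labels
        for j in (0, 1):
            total = sum(wi for wi, lab in zip(w, labels) if lab == j)
            if total == 0:
                continue  # keep the old center if the cluster is empty
            centers[j] = [
                sum(wi * x[d] for x, wi, lab in zip(X, w, labels) if lab == j) / total
                for d in range(len(centers[j]))
            ]
    return labels, centers


# Hypothetical 2-D user vectors, weights and initial centers.
X = [[1.0, 1.0], [1.0, 3.0], [9.0, 9.0], [9.0, 11.0]]
w = [1.0, 3.0, 1.0, 1.0]
labels, centers = weighted_two_means(X, w, [[0.0, 0.0], [10.0, 10.0]])
print(labels, centers)  # [0, 0, 1, 1] [[1.0, 2.5], [9.0, 10.0]]
```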
Further, the two large clusters obtained by the first classification are each clustered again to subdivide the users, using an improved clustering algorithm based on density peaks;
density peak clustering requires two parameters, the relative distance and the local density:
because the dimensions of the input vectors are uncertain and not uniform, the relative distance d is defined by the following formula:
Figure BDA0003540204680000052
The distances are weighted according to the user's own preference:
if the user's preference contains only this enterprise, α = 3;
if the user's preference contains several enterprises including this enterprise, α = 2;
if the user's preference does not include this enterprise, α = 1.
The preference coefficient is added to the Gaussian kernel calculation of the local density, so that the local density ρ_i of each vector is calculated by the following formula:
Figure BDA0003540204680000053
where d_ij is the distance between two data points and d_c is a manually specified truncation distance.
The points with the largest relative distance and local density are taken as cluster centers, and the distance from each remaining point to the cluster centers is calculated, padding with 0 where the dimensions differ; each point is assigned to the nearest cluster, and the cluster centers are updated iteratively.
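The preference coefficient and the zero-padded distance can be sketched as follows. Since the formula images are not reproduced, the local density is assumed here to be ρ_i = α_i · Σ_{j≠i} exp(−(d_ij / d_c)²), that is, the usual Gaussian-kernel density scaled by the preference coefficient; the actual formula in the figure may differ. The company name is hypothetical.

```python
import math
from typing import Collection, List, Sequence


def padded_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance after padding the shorter vector with zeros."""
    n = max(len(x), len(y))
    xp = list(x) + [0.0] * (n - len(x))
    yp = list(y) + [0.0] * (n - len(y))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xp, yp)))


def preference_coefficient(preferred: Collection[str], company: str) -> int:
    """alpha = 3, 2 or 1 for the three preference cases in the text."""
    if company not in preferred:
        return 1
    return 3 if len(preferred) == 1 else 2


def local_density(points: List[Sequence[float]],
                  alphas: Sequence[float],
                  d_c: float) -> List[float]:
    """Assumed form: rho_i = alpha_i * sum_{j != i} exp(-(d_ij / d_c)^2)."""
    rho = []
    for i, (p, alpha) in enumerate(zip(points, alphas)):
        kernel_sum = sum(
            math.exp(-((padded_distance(p, q) / d_c) ** 2))
            for j, q in enumerate(points)
            if j != i
        )
        rho.append(alpha * kernel_sum)
    return rho


# Three hypothetical users with vectors of different dimensions.
points = [[0.0, 0.0], [1.0], [10.0, 10.0, 1.0]]
prefs = (["OurExpress"], ["OurExpress", "Other"], ["Other"])
alphas = [preference_coefficient(p, "OurExpress") for p in prefs]
print(alphas)  # [3, 2, 1]
print(local_density(points, alphas, d_c=2.0))
```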
The advantages of the method are as follows. All data generated by the user while using the software or program is obtained through a web crawler. To fully mine the information contained in the data, the data is classified into three attributes (basic, interest and feedback), and the characteristics of each attribute are analyzed in practice so that corresponding services can be recommended. Three attribute vectors are constructed: the basic and interest attribute data are processed directly into vectors, while the feedback attribute vector is built from scores computed with an assignment formula that takes into account both the user's ratings and comments and the user's preferences. The vectors of the different attributes are combined into a multi-attribute vector model that comprehensively describes the user's characteristics, and this model can be mined and analyzed to evaluate how stable the user is. The method supports in-depth research on user portraits in the logistics and express industry and provides express enterprises with a comprehensive and efficient way to build user portraits.
Drawings
FIG. 1 is a diagram of a multi-attribute vector user representation model framework of the present invention;
FIG. 2 is a flow chart of a user profile creation and data mining analysis process of the present invention.
Detailed Description
Referring to FIG. 1 and FIG. 2, this embodiment provides an express terminal user portrait model based on full life cycle data, including:
first, collecting the complete life cycle data generated by a user in an express software program and, based on this data, classifying, summarizing and describing features under the three categories of basic, interest and feedback attributes, so as to form a multi-dimensional vector user portrait;
for the basic attributes in the multi-dimensional vector, roughly describing and analyzing the user's personal image and characteristics from basic information such as age, gender, authenticated identity and region, so as to identify the service types suited to the user and recommend them appropriately;
for the interest attributes in the multi-dimensional vector, summarizing the user's preferred receiving and sending times, preferred mailing mode and usual article specifications from the information generated by the user's orders, together with the express companies the user commonly uses, which directly reflect the user's preference, and analyzing this information to recommend corresponding services to the user;
for the feedback attributes in the multi-dimensional vector, providing a method that extracts keywords from the user's review content, analyzes their sentiment and assigns scores, overcoming the large sentiment estimation error of conventional methods that consider only the user's rating, and combining the result with the number of times the user has used the software to form a data vector that describes the user's feedback more objectively;
for data mining and analysis after the multi-attribute user portrait model is built, providing a data processing procedure of one classification followed by one clustering, so that the data is processed twice and users are positioned and subdivided more accurately;
in the first classification, based on an integrated algorithm of K-means and a support vector machine, adding a binary variable for positioning users and rewriting the objective function, so that users are classified into two categories: stable users and users about to churn;
in the second clustering, based on the idea of the density peak clustering algorithm, adding a user preference influence factor and redefining how the local density and the truncation distance are calculated, so that users are clustered better by stability and the subdivision of users is completed;
recommending services of different grades to the user groups produced by the two rounds of data processing, matched to the different user types, so that the service is more targeted and personalized;
for the user churn that occurs during the ordering process in the software, using a funnel analysis model to analyze the churn and optimizing with the lowest total churn rate as the objective function, so that the applicability of the software is continuously improved and user stability is maintained.
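The funnel analysis mentioned in the last item can be illustrated with a short Python sketch that computes the churn rate between consecutive steps of the ordering process and the total churn rate. The step names and user counts below are invented for illustration, and the optimization of the total churn rate is not shown.

```python
from typing import List, Tuple


def funnel_loss(step_counts: List[Tuple[str, int]]) -> Tuple[List[Tuple[str, float]], float]:
    """Return (per-step churn rates, total churn rate).

    step_counts: ordered (step name, number of users reaching the step).
    """
    rates = []
    for (name_a, n_a), (name_b, n_b) in zip(step_counts, step_counts[1:]):
        rate = 1 - n_b / n_a if n_a else 0.0
        rates.append((f"{name_a} -> {name_b}", rate))
    first, last = step_counts[0][1], step_counts[-1][1]
    total = 1 - last / first if first else 0.0
    return rates, total


# Hypothetical ordering funnel in the express software.
steps = [("open app", 1000), ("fill order", 600), ("pay", 450), ("review", 90)]
rates, total = funnel_loss(steps)
for name, rate in rates:
    print(f"{name}: {rate:.0%}")
print(f"total churn rate: {total:.0%}")
```

The step with the highest churn rate (here, from payment to review) is the natural first candidate for optimization.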
The multidimensional vector user portrait model is built on the full life cycle data that a user generates in express terminal software or programs. User data from the express terminal's app or applet is therefore captured with a web crawler written in Python. The large volume of heterogeneous data is then preprocessed and classified, first being divided into three attributes according to the first-level index: the basic attribute, the interest attribute and the feedback attribute.
UP represents the multi-dimensional vector user portrait model;
B represents the user's basic attributes;
P represents the user's interest attributes;
F represents the user's feedback attributes.
UP={B,P,F}
The three attributes describe the user's characteristics from different aspects and serve different analytical purposes.
In the basic attribute class, the four feature labels (user ID, age, gender and region) are first classified according to manual experience, crowd characteristic analysis and the corresponding business recommendation direction, forming the following classification table:
Figure BDA0003540204680000081
The basic attributes of a user are the user's basic information, including ID, age, gender, identity and region:
ID represents the user's ID in the software;
age represents the user's age;
gender represents the user's gender;
identity represents the user's identity;
region represents the region where the user is located:
B={ID,age,gender,identity,region}
For the analysis of express delivery data, most differences in users' basic attributes have little influence. Nevertheless, a user's basic information can be refined to sketch the user's image and to work out which receiving and sending services are suitable to recommend.
The user's ID can be skipped because it carries too little information.
The age of a user can be roughly divided into four stages, in intervals of about 20 years. Users aged 0-20 are mostly students, so their receiving and sending times are concentrated around midday or in the evening outside class hours. Teenagers take to new things quickly and readily accept new mobile or computer services, but the group tends to be strongly opinionated, which makes promotion easy in some respects and hard in others, so the overall promotion difficulty is moderate. Users aged 21-40 are mostly young office workers whose receiving and sending times are concentrated at midday and in the evening; they accept new things quickly but are more opinionated, so promotion is harder than for the teenage group. Users aged 41-60 are mostly middle-aged workers whose time distribution is similar to that of the two younger groups; they are slower to take up new things, are more opinionated and more cautious, and may be unwilling to accept new things, so promotion is somewhat difficult. Users aged 61 and over are mostly retired with free time, so their receiving and sending times are more dispersed; although they are slower to adopt new methods, they may be less opinionated and less cautious, so promoting new services to them is less difficult.
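The four age stages described above translate directly into a small lookup. The function name and the returned labels are hypothetical; only the interval boundaries come from the text.

```python
def age_stage(age: int) -> str:
    """Return the age stage used for the basic-attribute classification."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 20:
        return "0-20"
    if age <= 40:
        return "21-40"
    if age <= 60:
        return "41-60"
    return "61+"


print([age_stage(a) for a in (18, 35, 60, 72)])  # ['0-20', '21-40', '41-60', '61+']
```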
The user's gender also has some influence on how difficult a service is to promote. Men are often less patient, less attentive to detail and short of free time when work is busy, so they may be unwilling to spend time on new services, and promotion is harder; women are often more patient and attentive and pay more attention to discount levels, subsidy schemes and the like, so they may be more interested in new money-saving services, and promotion is easier.
The user's identity also determines which services are suitable to recommend, mainly through its effect on the user's usual receiving and sending times and region. For example, students and office workers with stable jobs have highly concentrated receiving and sending times and locations, so they suit stable offers or new services, such as promotions for a fixed time period or a fixed receiving and sending area. People whose work changes frequently have receiving and sending times and locations that change often and are dispersed, so relatively fixed services are not suitable; instead, wide-coverage service activities can be promoted, such as uniform pricing for receiving and sending parcels across locations or all-day receiving and sending services, to keep these customers stable. People with free time and no occupation tend to have dispersed receiving and sending times but may be concentrated in one area; because they have ample time, they suit promotion of a wide range of service types, and services can be recommended according to their area, their usual receiving and sending preferences, and so on.
The region where the user is located largely determines how timely and convenient receiving and sending a parcel is, so different types of services suit different regions. For example, places with very large logistics transaction volumes, such as Jiangsu, Zhejiang and Shanghai or Beijing, Shanghai and Guangzhou, suit volume-based promotions such as cumulative discounts for repeated orders, which aim to increase the number of orders the user places with the express company and keep the user stable. Regions such as the northeast, northwest and southwest have relatively few logistics transactions, and these remote places are better suited to services with clear price discounts, which attract users to place orders more easily.
In the interest attribute class, the four tags (receiving and sending time, transportation mode, product specification and own preference) are first classified based on manual experience, crowd characteristic analysis and service recommendation direction, forming the following classification table:
Figure BDA0003540204680000101
Figure BDA0003540204680000111
The user's interest attributes are obtained by collating the preferred enterprises that the user fills in directly and the option data from order transactions, combined with preference information contained in comment texts the user publishes on the platform, survey data that is issued and collected, and so on, to form a data set. They include the usual receiving and sending time, the express transportation mode, the express item specification, and the user's own preference:
AT represents the receiving and sending time commonly used by the user;
type represents the transportation mode commonly used by the user;
norm represents the specification of the articles the user frequently ships;
preference represents the express enterprises the user prefers:
P = {AT, type, norm, preference}
The receiving and sending time a user commonly uses may depend on the user's occupation, identity, and so on, and reflects when the user is generally free. Recording the usual receiving and sending times of different users makes it easier to provide more convenient and appropriate service, for example sending the pickup code and service discount information shortly before the user's usual receiving and sending time, or phoning to arrange pickup at the door, which improves user satisfaction and helps retain users.
Different express transportation modes differ in transportation speed, transit time, and other factors, so the transportation mode a user prefers often reflects the user's own requirements for receiving and sending express items and indirectly reflects the types of articles the user frequently ships. Different types of articles suit different transportation modes: fresh food, flowers, and plants place higher demands on preservation time and transportation speed, so air transport is often used, whereas clothing and daily necessities are less time-sensitive and usually travel by ordinary land transport. A suitable transportation mode can therefore be recommended according to the types of articles in the user's orders, or special transportation modes can be promoted to particular users.
The specifications of commonly shipped articles include valuables, large items, small items, daily necessities, and so on. Different specifications lead to different subsequent packaging and sorting steps and different transportation modes. Different transportation modes and shipping packaging options can be recommended according to the specifications of the articles a user usually ships, and on this basis a series of activities such as service discounts for member customers can be built to form an enterprise advantage and thereby retain customers.
User preference directly reflects how much the user favors particular express enterprises. It is summarized from the various information the user generates, and the result falls into two cases: the preference includes the enterprise itself, or it does not. When the preference includes the enterprise itself, the user can generally be counted as a stable customer; because the user already depends on the enterprise to some degree, recommending recurring discount services, such as daily check-in rewards, or quantity-based discounts, such as larger discounts for more cumulative shipments, makes the user more stable. When the preference does not include the enterprise itself, the express companies the user commonly uses are competitors, and the enterprise needs to consider what advantage it can establish over them. A simple measure is to establish a price advantage or to create conveniences for the user, such as offering more receiving and sending services than other enterprises, in order to retain the user.
A user usually gives feedback after completing an order in the program or software. The rating of the transaction and the emotion expressed in the comment are the most direct feedback, but there are also indicators that reflect user satisfaction indirectly, such as the number of times the user has used the express company, which indirectly reflects the user's preference for the company. The two are therefore combined into a vector that measures the user's feedback on the express company's service:
R_i represents the value of the feedback index after the user's i-th transaction on the software;
n represents the number of times the user has used the express company:
F = {R_1, R_2, …, R_n}
where R_i, the feedback index value of the user's i-th transaction, is determined by the user's rating of and comment on each order. The rating value can be obtained directly by crawling the user's rating data, whereas the comment value is obtained by extracting words from the comment, assessing their sentiment, and assigning a value. The specific calculation is as follows:
To process a user comment, first segment it with jieba, a Chinese word segmentation library for Python, to extract the keyword data set from the comment text. Then loop over the keywords in the data set, calling the sentiments method of the SnowNLP library on each, and accumulate the sentiment of each keyword to obtain the sentiment expressed by each order comment and assign it a value. Comment sentiment is divided into five grades, 1 to 5, corresponding to scores of 1 to 5: the lower the grade, the lower the user's satisfaction with the order, and the higher the grade, the higher the satisfaction. This procedure yields the value e of the comment sentiment index.
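The grading step can be sketched as follows. This is a minimal illustration rather than the patented procedure: it assumes each keyword already has a sentiment score between 0 and 1 (for example, one produced by SnowNLP for that keyword), that the keyword scores are averaged, and that the range is split into five equal bands. The function name `sentiment_grade` and the band edges are hypothetical.

```python
def sentiment_grade(keyword_scores):
    """Map per-keyword sentiment scores in [0, 1] to an integer grade e from 1 to 5."""
    if not keyword_scores:
        raise ValueError("at least one keyword score is required")
    # Assumption: combine keyword sentiments by averaging them.
    mean = sum(keyword_scores) / len(keyword_scores)
    # Assumption: five equal-width bands over [0, 1]; a mean of 1.0 is capped at grade 5.
    return min(5, int(mean * 5) + 1)


# Hypothetical keyword scores for one order comment.
e = sentiment_grade([0.9, 0.95])
```

Averaging keeps the grade independent of comment length; a sum, as the word "accumulate" might suggest, would need a separate normalization step.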
e represents the sentiment value of the user's order comment;
s represents the user's rating of the order;
α and β represent the coefficients of the user rating and the user comment;
The feedback evaluation index is calculated as:
R_i = α × s + β × e
where α and β are the proposed coefficients. When the user has only rated the transaction and left no comment, α = 1, meaning the rating accounts for one hundred percent of the index; otherwise α = β = 0.5, meaning the rating and the comment sentiment value each account for half.
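A short sketch of the feedback index follows. It assumes that a missing comment is represented by `None` and that β is effectively 0 in that case (the text only states α = 1); the function names are illustrative.

```python
def feedback_index(score, sentiment=None):
    """Return R_i = alpha * s + beta * e for a single order."""
    if sentiment is None:
        # Rating only: alpha = 1, so the rating carries the whole index.
        alpha, beta, e = 1.0, 0.0, 0.0
    else:
        # Rating and comment: each contributes half.
        alpha, beta, e = 0.5, 0.5, float(sentiment)
    return alpha * score + beta * e


def feedback_vector(orders):
    """Build F = {R_1, ..., R_n} from (rating, sentiment-or-None) pairs."""
    return [feedback_index(s, e) for s, e in orders]


# Two hypothetical orders: one rated 5 without a comment, one rated 3 with sentiment grade 5.
F = feedback_vector([(5, None), (3, 5)])
```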
The index vector of the feedback attribute is calculated by the above method, and the index vectors of the basic attribute and the interest attribute are obtained through data preprocessing. Combining them gives the complete multi-attribute vector user portrait model:
UP = {B, P, F}
Figure BDA0003540204680000131
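One way to hold the three attribute vectors in code is sketched below. The field names follow B = {ID, age, gender, identity, region}, P = {AT, type, norm, preference}, and F = {R_1, …, R_n}; the class names, the field types, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BasicAttributes:          # B
    user_id: str
    age: int
    gender: str
    identity: str
    region: str


@dataclass
class InterestAttributes:       # P
    at: str                     # usual receiving and sending time
    transport_type: str         # usual transportation mode ("type" in the text)
    norm: str                   # usual article specification
    preference: List[str]       # preferred express enterprises


@dataclass
class UserPortrait:             # UP = {B, P, F}
    basic: BasicAttributes
    interest: InterestAttributes
    feedback: List[float] = field(default_factory=list)  # F, one R_i per order


# Hypothetical user.
b = BasicAttributes("u001", 21, "F", "student", "Nanjing")
p = InterestAttributes("18:00-20:00", "land", "small", ["OwnExpress"])
up = UserPortrait(b, p, [4.0, 3.5])
```

Because the feedback list grows with each order, its length differs between users, which is why the distance calculations later pad shorter vectors with 0.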
After the user portrait model is established, data mining and analysis are performed on users based on the model, so that users can be positioned and segmented and subsequent service recommendation can be supported. The data mining process comprises one classification and one clustering. The classification first divides users into two categories, stable users and users about to be lost; each category is then analyzed by clustering and subdivided again, giving a more objective and reasonable division of user groups.
In the first classification, an integrated algorithm based on the K-means algorithm and a support vector machine (SVM) is used. The support vector machine performs well among common classification algorithms, but it is a supervised machine learning method, so training data must be obtained to train it before classification. To improve the efficiency and performance of classification, the unsupervised classification result provided by the K-means algorithm is used as training data for the SVM and input into the SVM model, and the separating hyperplane is adjusted continually according to the classification result to obtain an optimized classification result.
In the first classification of the invention, because the data set is explicitly divided into two classes, the K-means algorithm is first modified from a general clustering algorithm into a classification algorithm that explicitly produces two classes:
Let X = {x_1, …, x_n} be the set of multidimensional vectors formed according to the multi-attribute vector user portrait model, containing the users' basic attribute, interest attribute, and feedback attribute data.
Let A = {a_1, a_2} be the centers of the two clusters obtained by classification, representing the express company's two categories of stable users and users about to be lost, respectively.
Introduce a binary variable [z_ij]_{n×2} ∈ {0, 1} to indicate which cluster a data point should be assigned to. The values of the binary variable are defined as follows:
Figure BDA0003540204680000141
where ‖x_i − a_j‖_2 is the Euclidean distance between the data point and the cluster center. Because the number of entries in the feedback attribute vector may differ between users, the distance calculation stipulates that if two vectors have different numbers of entries, the missing positions are filled with 0.
The cluster centers are determined as follows:
Figure BDA0003540204680000151
Because the different feature indexes in the sample vectors to be clustered influence the user classification result to different degrees, weights need to be introduced into the objective function to take the distribution characteristics of the data into account.
Let the weight vector be W = {ω_1, ω_2, …, ω_n}.
Among the basic, interest, and feedback attributes above, the user's own preferred enterprise and the user's feedback attribute values have a large influence on the classification result, so their weights are set to ω_i = 3; the other feature indexes reflect user stability only weakly, so their weights are set to ω_i = 1.
Taking both the binary variable [z_ij]_{n×2} ∈ {0, 1} and the weight vector W = {ω_1, ω_2, …, ω_n} into account, the objective function of the improved K-means is:
Figure BDA0003540204680000152
The algorithm is iterated with minimization of this function as the optimization target, continually updating the two cluster centers until the classification result no longer changes. The two resulting clusters of labeled data are used as the training set for the SVM algorithm.
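The two-cluster iteration can be sketched in plain Python as below. Since the exact formulas appear only as figures, the sketch makes several assumptions: each point goes to the center with the smaller feature-weighted squared distance, each center is the plain mean of its members, the weights apply per feature, and the initial centers are supplied by the caller. All names are illustrative.

```python
def pad(vec, dim):
    """Zero-fill a vector up to dim entries (missing values are replaced by 0)."""
    return list(vec) + [0.0] * (dim - len(vec))


def weighted_sq_dist(x, a, w):
    """Feature-weighted squared Euclidean distance between x and center a."""
    return sum(wk * (xk - ak) ** 2 for xk, ak, wk in zip(x, a, w))


def two_means(points, weights, centers, max_iter=100):
    """Weighted k-means with k fixed at 2; returns (labels, centers)."""
    dim = len(weights)  # assumed to be at least the length of the longest vector
    pts = [pad(p, dim) for p in points]
    centers = [pad(c, dim) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: z_ij = 1 for the nearer of the two centers.
        new_labels = [
            min((0, 1), key=lambda j: weighted_sq_dist(p, centers[j], weights))
            for p in pts
        ]
        if new_labels == labels:  # classification result no longer changes
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in (0, 1):
            members = [p for p, lab in zip(pts, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical data: the last user has one extra feedback entry.
points = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0, 1.0]]
labels, centers = two_means(points, [1, 1, 3], [points[0], points[2]])
```

The returned labels are what would serve as the training labels for the SVM stage.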
The labeled data are used as input to the support vector machine, and the SVM algorithm searches for an optimal hyperplane in the feature space:
w^T·x + b = 0
The optimal hyperplane is found by solving a quadratic program:
Figure BDA0003540204680000153
s.t. z_ij[w^T·x_i + b] ≥ 1
Because binary variables are already used to partition the data in the K-means calculation, they can directly replace the variables that distinguish the positive and negative classes in the conventional algorithm's quadratic programming constraint. Solving the quadratic program gives an optimal classification hyperplane, and the K-means classification is adjusted according to this hyperplane to obtain two clusters of labeled data. A new classification hyperplane is then trained on the adjusted clusters, and the two steps are repeated iteratively until the result no longer changes. This yields the optimal result of the first classification: users of the software or program are objectively and reasonably divided into two types, stable users and users about to be lost.
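For reference, a conventional hard-margin SVM is usually written with signed class labels y_i ∈ {−1, +1}. A hedged reading of the quadratic program above, assuming that the objective in the figure is the standard one and that the binary assignment is mapped to signed labels by y_i = 2z_{i1} − 1 (both are assumptions, not stated in the text), is:

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{s.t.}\quad
y_i\,\bigl(w^{T}x_i + b\bigr) \ge 1,\qquad i = 1,\dots,n,
\qquad y_i = 2z_{i1} - 1 .
```

Under the same assumptions, the adjustment step would relabel each x_i by the sign of w^T·x_i + b before the next round of training.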
Once the two user groups are obtained, the enterprise's recommended services can likewise be divided into two types. The first, aimed at users about to be lost, seeks to raise user satisfaction and stabilize these users, mainly by creating conveniences, showing special offers, and promoting personalized services better suited to each user, so that satisfaction with the enterprise improves efficiently and quickly. The second, aimed at stable users, seeks to increase their dependence on the enterprise's services, mainly by offering recurring or quantity-based discounts, such as larger discounts as the cumulative number of shipments grows, while ensuring service quality for stable users.
To position and segment users more finely, the two large clusters produced by the first classification can each be clustered to generate finer user groups. Density peak clustering is adopted, so two parameters, relative distance and local density, need to be defined.
Because the dimensions of the input data vectors may differ, the Minkowski distance is used to define the relative distance parameter:
d represents the relative distance between data points;
α represents the weighting coefficient determined by user preference;
n represents the dimension of the vector;
F_i represents an input data vector:
Figure BDA0003540204680000161
The relative distance is weighted according to the user's own preference:
if the user's preference contains only the enterprise itself, α = 3;
if the user's preference contains several enterprises, including the enterprise itself, α = 2;
if the user's preference does not contain the enterprise itself, α = 1.
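The three cases reduce to a small lookup, sketched here. The enterprise identifier `"OWN"` and the function name are placeholders introduced for the example.

```python
def preference_weight(preferred, own="OWN"):
    """Return alpha from the set of express enterprises a user prefers."""
    preferred = set(preferred)
    if own not in preferred:
        return 1          # the enterprise itself is not among the preferences
    if len(preferred) == 1:
        return 3          # only the enterprise itself is preferred
    return 2              # the enterprise itself plus at least one other


# Hypothetical users.
alphas = [preference_weight(["OWN"]),
          preference_weight(["OWN", "OtherExpress"]),
          preference_weight(["OtherExpress"])]
```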
The preference weighting coefficient is also added to the local density formula, which is computed with a Gaussian kernel:
ρ_i represents the local density of each data vector;
d_ij is the distance between two data points;
d_c is the manually specified truncation distance:
Figure BDA0003540204680000171
The relative distance and local density of each data point are calculated. Following density peak clustering, the points with both the largest relative distance and the largest local density are selected as cluster centers. The distance from each remaining data point to the cluster centers is then calculated, filling missing positions of a data vector with 0 if the dimensions differ, as in the method above. Each point is assigned to the nearest cluster according to the calculated values, and the cluster centers are iteratively updated until clustering is complete.
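The two density-peak quantities can be sketched as follows. The source gives both formulas only as figures, so this sketch falls back on the textbook forms: a zero-padded Minkowski distance, a Gaussian-kernel density ρ_i = Σ_j exp(−(d_ij / d_c)²), and a relative distance equal to the distance to the nearest point of higher density. As a simplification, the preference coefficient α is shown only as a multiplier on the distance and is left at 1 inside the density; how it actually enters either formula is an assumption.

```python
import math


def minkowski(u, v, p=2, alpha=1.0):
    """Minkowski distance of order p, scaled by alpha; the shorter vector is zero-padded."""
    n = max(len(u), len(v))
    u = list(u) + [0.0] * (n - len(u))
    v = list(v) + [0.0] * (n - len(v))
    return alpha * sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def local_density(points, d_c, p=2):
    """Gaussian-kernel local density rho_i for every point."""
    return [
        sum(math.exp(-(minkowski(x, y, p) / d_c) ** 2)
            for j, y in enumerate(points) if j != i)
        for i, x in enumerate(points)
    ]


def relative_distance(points, rho, p=2):
    """Distance to the nearest denser point; the densest point gets its largest distance."""
    delta = []
    for i, x in enumerate(points):
        higher = [minkowski(x, y, p)
                  for j, y in enumerate(points) if rho[j] > rho[i]]
        if higher:
            delta.append(min(higher))
        else:
            delta.append(max(minkowski(x, y, p) for y in points))
    return delta


# Hypothetical one-dimensional data: two nearby users and one isolated user.
pts = [[0.0], [1.0], [10.0]]
rho = local_density(pts, d_c=2.0)
delta = relative_distance(pts, rho)
```

Points that score high on both `rho` and `delta` are the candidates for cluster centers; a point with high `delta` but low `rho`, like the isolated one here, is more likely an outlier.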
After clustering, the two categories, stable users and users about to be lost, are subdivided into m and n groups respectively: m groups of stable users ranging from very stable to less stable, and n groups of users about to be lost ranging from a low to a high likelihood of loss. This gives m + n user groups in total, and different services can be recommended to each group according to its stability and risk of loss. More stable users can be recommended cumulative service discounts and quantity-based discounts, whereas users closer to being lost need clear conveniences or discounts; in this way recommendation is personalized, personalized user services are provided, and user satisfaction improves.
During ordering in the express terminal software or program, from logging in at the software interface until payment is finally completed, a user may at any step return to the previous level to check information or exit the ordering process directly, causing the transaction to be delayed or to fail. For this problem, the method provides a funnel analysis model applied to user loss detection:
Assume that on the program or software the whole ordering process comprises the following steps: registration and login, coupon collection, the place-order-now pop-up window, filling in order information, order submission, and the payment interface. For the behavior data generated while users place orders, a web crawler is used to obtain each user's specific ordering path, and the crawled data are aggregated to calculate the loss rate γ_i (i = 1, 2, …, 6) of each page.
The loss rates of different page jumps influence one another only slightly, so their mutual influence can be ignored.
O represents the total loss rate over the six pages;
w_i represents the weight of the influence of each page's loss rate on the total loss rate (i = 1, 2, …, 6).
The objective function is written with the lowest total loss rate as the optimization objective:
Figure BDA0003540204680000181
The loss rates on different pages may influence the total loss rate to different degrees. For example, losing a user on pages such as coupon collection, order submission, and order payment is more likely to cause the transaction to fail, whereas a return to the previous level from other pages may simply be due to an information error or a wrongly collected coupon and is less likely to end in a failed transaction. Different weights are therefore set so that the total loss rate is calculated with these different influences taken into account.
How users use the software or program is then obtained by comparing the loss rates on different pages, and the software is continually optimized and improved with the minimum total loss rate as the objective function.
After the user conversion rate on each page is counted, the conversion data can be organized into a funnel model, and the loss path of users at each step is examined with path analysis, so that the reasons for user loss can be analyzed in detail and appropriate improvements made.
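The funnel calculation can be illustrated with a short sketch. It assumes that the loss rate of a page is the share of users who reached it but did not continue to the next step, and that the total loss rate is the weighted sum O = Σ w_i γ_i; the objective function itself appears only as a figure, so this form, the function names, and the numbers are assumptions.

```python
def page_loss_rates(reached):
    """Loss rate per page; reached[i] is the number of users arriving at step i,
    and the final entry counts users who completed payment."""
    return [(a - b) / a for a, b in zip(reached, reached[1:])]


def total_loss_rate(rates, weights):
    """Weighted total loss rate O = sum_i w_i * gamma_i (assumed form)."""
    if len(rates) != len(weights):
        raise ValueError("one weight per page is required")
    return sum(w * g for w, g in zip(weights, rates))


# Hypothetical three-page funnel: 1000 users enter, 300 complete.
reached = [1000, 800, 600, 300]
gamma = page_loss_rates(reached)
O = total_loss_rate(gamma, [1, 2, 3])
```

Comparing the entries of `gamma` shows which page loses the most users, and tracking `O` across software releases gives a single figure to minimize.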
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are intended to further illustrate the principles of the invention, and that various changes and modifications may be made without departing from the spirit and scope of the invention, which is intended to be protected by the appended claims. The scope of the invention is defined by the claims and their equivalents.

Claims (6)

1. An express terminal user portrait model based on full life cycle data is characterized by comprising the following steps,
S1, constructing a multi-attribute vector user portrait that integrates full-life-cycle data, classifying the acquired data by attribute, and forming the corresponding vectors;
S2, dividing each feature attribute in the two classes of basic attributes and interest attributes into categories, analyzing the characteristics of users in different groups, and obtaining the service recommendation direction suitable for each group;
S3, processing the feedback information from the user's receiving and sending orders according to the user's rating of the transaction, the evaluation content, and the number of times the user has used the software, performing keyword extraction and recognition and sentiment analysis on the evaluation text, then setting the weight distribution and converting the evaluation content into numerical values.
2. The express terminal user portrait model based on full life cycle data of claim 1, wherein the multi-attribute vector user portrait comprises three broad attribute classes, namely a basic attribute, an interest attribute, and a feedback attribute;
the basic attribute is the user's basic information and comprises four feature tags: the user ID, age, sex and location; the vector formed by the basic attributes is
B = {ID, age, gender, identity, region},
the interest attribute is a data set formed by collating the option data from the user's order transactions, combined with comment texts the user publishes on the platform, survey data that is issued and collected, and the preferences the user selects, and comprises four tags: the receiving and sending time, the transportation mode, the product specification, and the user's own preference; the vector formed is
P = {AT, type, norm, preference},
the feedback attribute comprises two factors, the order rating value together with the comment sentiment, and indirect feedback data; the rating value and the comment sentiment are combined through a formula to form the feedback value R_i of each order, forming the vector
F = {R_1, R_2, …, R_n},
where R_i is the feedback index value of the user's i-th transaction,
and the vectors formed by the three attributes are combined to obtain the final model of the multi-attribute vector user portrait as
UP = {B, P, F}
Figure FDA0003540204670000011
3. The express terminal user portrait model based on full life cycle data as claimed in claim 2, wherein the rating value and the comment sentiment are combined through a formula to form the feedback value R_i of each order, specifically calculated as follows: first, word segmentation is performed with jieba, a Chinese word segmentation library for Python, to extract keywords; then the keywords in the data set are looped over, calling the sentiments method of the SnowNLP library, the sentiment expressed by each order comment is aggregated from the sentiment expressed by its keywords, and a value is assigned; comment sentiment is divided into grades 1 to 5, corresponding to scores of 1 to 5 respectively, where a lower score represents lower satisfaction; the comment sentiment index value e is obtained by this method, s represents the order rating, and the feedback evaluation index is calculated as:
R_i = α × s + β × e;
where α and β are the proposed coefficients; when the user has only rated the transaction and left no comment, α = 1; otherwise, α = β = 0.5.
4. The express terminal user portrait model based on full life cycle data as claimed in claim 1, wherein in step S2, classification and clustering are performed on the constructed multi-attribute vector user portrait model in two stages, specifically as follows:
step one, in the first classification, all users of the software are divided into two main classes by an integrated classification algorithm combining the K-means algorithm and the support vector machine algorithm: stable users S and users about to be lost L;
step two, in the second clustering, density peak clustering is adopted, a coefficient control factor is added according to user preference, and the calculation formulas for the relative distance and local density of data points are adjusted, so that the S and L clusters are each clustered, giving m and n subdivided clusters respectively; the group characteristics of each subdivided cluster represent the characteristics of all objects in the cluster, and the label characteristics of the cluster centers are summarized to obtain m + n group portraits;
step three, when new users join, the similarity between a new user and the existing groups is judged by calculating the Minkowski distance between the new user and the existing cluster centers so as to assign the user to a type, and the cluster centers are updated after the new user is assigned.
5. The express terminal user portrait model based on full life cycle data as claimed in claim 4, wherein in the first classification a classification algorithm integrating K-means and SVM is used; SVM is a supervised machine learning method and must be trained before classification, so the unsupervised classification result provided by K-means is used as training data for the SVM, and the separating hyperplane is adjusted continually according to the SVM classification result to obtain a better classification result; characteristic factors of express terminal users are added during classification to adjust the algorithm:
X = {x_1, …, x_n} is the multidimensional vector set containing the users' basic attribute, interest attribute, and feedback attribute data;
A = {a_1, a_2} are the centers of the two clusters specified in the first classification, representing stable users and users about to be lost, respectively;
a binary variable [z_ij]_{n×2} ∈ {0, 1} is used to check the data points;
Figure FDA0003540204670000031
‖x_i − a_j‖_2 is the Euclidean distance between the data point and the cluster center;
the cluster centers are determined as follows:
Figure FDA0003540204670000032
a weight vector is introduced in the calculation of the objective function: W = {ω_1, ω_2, …, ω_n}, giving the clustering objective function
Figure FDA0003540204670000033
The K-means algorithm is iterated with minimization of the objective function as the criterion, updating the cluster centers and the classification result until the cluster centers no longer change, which yields two clusters of labeled data as the training set for the SVM algorithm; the SVM is then used to search the feature space for an optimal classification hyperplane w^T·x + b = 0 that separates the data set into two classes,
the optimal hyperplane being found by solving a quadratic program:
Figure FDA0003540204670000034
the constraint is: z_ij[w^T·x_i + b] ≥ 1
The optimal classification hyperplane is obtained by solving the quadratic program, and the two clusters of labeled data obtained from K-means clustering are adjusted according to the hyperplane; a new classification hyperplane is then trained on the new two clusters of data, and the two clusters and the hyperplane are updated iteratively until the result no longer changes, giving the optimal first classification result.
6. The express terminal user portrait model based on full life cycle data as claimed in claim 4, wherein the two major clusters obtained by the first classification are clustered again to subdivide users, using an improved clustering algorithm based on density peaks;
the parameters required for density peak clustering are relative distance and local density:
because the dimensions of the input vectors are uncertain and non-uniform, the relative distance d is defined by the following formula
Figure FDA0003540204670000041
The distances are weighted according to the user's own preference:
if the user's preference includes only the enterprise itself, α = 3;
if the user's preference includes several enterprises, among them the enterprise itself, α = 2;
if the user's preference does not include the enterprise itself, α = 1.
The preference coefficient is added to the Gaussian kernel calculation of the local density, so that the local density ρ_i of each vector is calculated by the following formula
Figure FDA0003540204670000042
d_ij is the distance between two data points, and d_c is a manually specified truncation distance.
The points with the largest relative distance and local density are taken as cluster centers, the distances from the remaining points to the cluster centers are calculated, filling with 0 if the dimensions differ, each point is assigned to the nearest cluster, and the cluster centers are iteratively updated.
CN202210230161.4A 2022-03-10 2022-03-10 Express terminal user portrait model based on full life cycle data Pending CN114596031A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210230161.4A CN114596031A (en) 2022-03-10 2022-03-10 Express terminal user portrait model based on full life cycle data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210230161.4A CN114596031A (en) 2022-03-10 2022-03-10 Express terminal user portrait model based on full life cycle data

Publications (1)

Publication Number Publication Date
CN114596031A true CN114596031A (en) 2022-06-07

Family

ID=81818946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210230161.4A Pending CN114596031A (en) 2022-03-10 2022-03-10 Express terminal user portrait model based on full life cycle data

Country Status (1)

Country Link
CN (1) CN114596031A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115953166A (en) * 2022-12-27 2023-04-11 鑫恒绅企业服务(无锡)有限公司 Customer information management method and system based on big data intelligent matching
CN115953166B (en) * 2022-12-27 2024-04-02 鑫恒绅企业服务(无锡)有限公司 Customer information management method and system based on big data intelligent matching
CN115760200A (en) * 2023-01-06 2023-03-07 万链指数(青岛)信息科技有限公司 User portrait construction method based on financial transaction data
CN117113241A (en) * 2023-05-12 2023-11-24 中南大学 Intelligent leakage monitoring method based on edge learning

Similar Documents

Publication Publication Date Title
CN110837931B (en) Customer churn prediction method, device and storage medium
CN114596031A (en) Express terminal user portrait model based on full life cycle data
CN109189904A (en) Individuation search method and system
CN107391680A (en) Content recommendation method, device and equipment
CN113537796A (en) Enterprise risk assessment method, device and equipment
CN114880486A (en) Industry chain identification method and system based on NLP and knowledge graph
CN112184484B (en) Differentiated service method and system for power users
CN116308684B (en) Online shopping platform store information pushing method and system
CN112070543A (en) Method for detecting comment quality in E-commerce website
CN114266443A (en) Data evaluation method and device, electronic equipment and storage medium
CN112685635A (en) Item recommendation method, device, server and storage medium based on classification label
Babaiyan et al. Analyzing customers of South Khorasan telecommunication company with expansion of RFM to LRFM model
CN106997371B (en) Method for constructing single-user intelligent map
Verma et al. Data mining: next generation challenges and futureDirections
CN114693409A (en) Product matching method, device, computer equipment, storage medium and program product
Hasheminejad et al. Clustering of bank customers based on lifetime value using data mining methods
CN113537878A (en) Package delivery method, device, equipment and storage medium
CN112132396A (en) Customer relationship distribution method and system based on intelligent matching
CN115563176A (en) Electronic commerce data processing system and method
CN115619571A (en) Financing planning method, system and device
CN112506930B (en) Data insight system based on machine learning technology
CN114240553A (en) Recommendation method, device and equipment for vehicle insurance products and storage medium
CN113254775A (en) Credit card product recommendation method based on client browsing behavior sequence
AlAmoudi et al. Extracting attractive app aspects from app reviews using clustering techniques based on kano model
CN117035947B (en) Agricultural product data analysis method and cloud platform based on big data processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination